This is the 5th post in the “Behind the scenes of a cloud service” blog series. You can read the previous posts here:

  1. The big picture
  2. Data plane clusters
  3. Customer databases
  4. Request routing

We ended the previous post with concerns about fault tolerance and latency. When a customer uses Business Central in the cloud – whether using the UI at https://businesscentral.dynamics.com or using the APIs at https://api.businesscentral.dynamics.com – there are several so-called global services involved, in particular the Fixed Client Endpoint, the Fixed Web Service Endpoint, and the Global Tenant Manager. Since these global services are used in all customer interactions with Business Central in the cloud, we need to make them both fast and fault tolerant.

The concern

Let’s start by examining the concern, and let’s use the Fixed Web Service Endpoint (FWSE) to illustrate it. For this discussion, we will assume that the FWSE is hosted in an Azure data center in the US.

Consider this situation:

  • A customer in Asia signs up for Business Central, and the BC tenant gets created in an Azure region in Asia
  • The customer uses an application that runs on his/her laptop and calls Business Central web services. This could be a custom-developed application, or it could be a standard application such as Power BI Desktop.
  • Finally, imagine that the customer is on a business trip to Europe (this is not really required to illustrate the problem, but it makes the picture nicer)

When the customer runs his/her local application, it makes API calls to https://api.businesscentral.dynamics.com. As we covered in the previous blog post, all these API calls go to the FWSE, which then forwards the calls to the data plane cluster that hosts the customer’s tenant. The following picture illustrates the scenario:

By now, the two problems should be clear:

  1. The latency for API calls to Business Central can be very high. In this case, each call makes four intercontinental trips: Europe to the US, the US to Asia, and then the same two hops back!
  2. If the FWSE goes down for any reason, the customer cannot make API calls to Business Central. The FWSE is a single point of failure.
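
To put rough numbers on the first problem, here is a back-of-the-envelope sketch. The one-way latencies are illustrative assumptions, not measurements from our service, and the region names and helper functions are made up for this example.

```python
# Assumed one-way network latencies in milliseconds (illustrative only).
ONE_WAY_MS = {
    ("europe", "us"): 45,
    ("us", "asia"): 80,
    ("europe", "asia"): 90,
}


def one_way(a, b):
    """Look up the assumed latency between two regions, in either direction."""
    if (a, b) in ONE_WAY_MS:
        return ONE_WAY_MS[(a, b)]
    return ONE_WAY_MS[(b, a)]


def round_trip(path):
    """Latency for a request that follows `path` plus a response that returns the same way."""
    legs = zip(path, path[1:])
    return 2 * sum(one_way(a, b) for a, b in legs)


# Laptop in Europe, FWSE in the US, tenant in Asia.
via_us = round_trip(["europe", "us", "asia"])
# Same request if an FWSE instance close to the laptop forwards it straight to Asia
# (the short hop from the laptop to a nearby FWSE is ignored here).
via_europe = round_trip(["europe", "asia"])

print(via_us, via_europe)  # 250 180
```

Under these assumed numbers, the detour through the US adds 70 ms to every single call, and an application that makes many small calls pays that price each time.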

The solution

It doesn’t take a rocket scientist to conclude that we need multiple instances of FWSE in different parts of the world to solve these problems. So let’s start by adding a second instance of the FWSE in Europe:

If the customer is in Europe, and his/her application makes requests to https://api.businesscentral.dynamics.com, we would like the request to go to the instance of FWSE in Europe. But how do we make that happen?

Routing to nearest instance using the DNS system

As you probably know, when you enter “https://businesscentral.dynamics.com” in a browser – or make a web service call to https://api.businesscentral.dynamics.com from an application – the first thing that happens is that the domain gets translated to an IP address. At the end of the day, network traffic is sent from one IP address to another IP address, not between domains.

The translation from a domain to an IP address is handled by DNS name servers as part of the global DNS system. A traditional DNS name server is configured with a set of static entries:

The DNS lookup is done once, and the resulting IP address is cached for a period set by the record’s time to live (TTL). When the application subsequently sends HTTP requests, they go straight to the target IP address, never through the DNS name server.
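
As a rough illustration, a traditional name server behaves like a static lookup table, and the caller remembers each answer for a while. The sketch below is a toy model, not real DNS: the `STATIC_RECORDS` table, the 300-second time to live, the `CachingResolver` class, and the address 203.0.113.10 (a placeholder from the range reserved for documentation) are all assumptions made up for this example.

```python
import time

# Toy static zone data: one domain maps to exactly one address.
# 203.0.113.10 is a placeholder from the range reserved for documentation.
STATIC_RECORDS = {"api.businesscentral.dynamics.com": "203.0.113.10"}
TTL_SECONDS = 300  # assumed time to live for a cached answer


class CachingResolver:
    """Client-side view: ask the name server once, then reuse the cached answer."""

    def __init__(self, records, ttl=TTL_SECONDS, clock=time.monotonic):
        self._records = records
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # domain -> (ip, expiry time)
        self.lookups = 0  # how many times the "name server" was actually asked

    def resolve(self, domain):
        cached = self._cache.get(domain)
        if cached and cached[1] > self._clock():
            return cached[0]  # still fresh: no DNS traffic at all
        self.lookups += 1
        ip = self._records[domain]  # the static lookup itself
        self._cache[domain] = (ip, self._clock() + self._ttl)
        return ip


resolver = CachingResolver(STATIC_RECORDS)
for _ in range(3):
    address = resolver.resolve("api.businesscentral.dynamics.com")

print(address, resolver.lookups)  # 203.0.113.10 1
```

Three requests lead to a single lookup. The name server is only asked again once the cached entry has expired.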

Now, what if we had two instances of the Fixed Web Service Endpoint, hosted on two different clusters in different parts of the world? They would have two different IP addresses. So we need to map one domain to two IP addresses. But that’s not even enough: How do we choose the best one of them in each case?

In Business Central in the cloud, we solve this using the Azure Traffic Manager service. Azure Traffic Manager is essentially an intelligent DNS name server. It is more than a static list of mappings between domains and IP addresses: For a given domain such as “api.businesscentral.dynamics.com” it can have multiple IP addresses as well as rules for which one to return to callers:

We have configured our Traffic Manager to return the instance of FWSE that is closest to the caller.

If the customer is in Europe, his/her laptop gets a European IP address such as 192.66.100.141. When the customer’s application looks up “api.businesscentral.dynamics.com” by calling the DNS system, our Traffic Manager gets that call and can see that the call is coming from 192.66.100.141, i.e., from Europe. (Strictly speaking, Traffic Manager sees the address of the DNS resolver that performs the lookup on the laptop’s behalf, but that resolver is normally close to the caller.) And now it knows that it needs to return the IP address for the European instance of FWSE – 40.113.99.50.
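
Conceptually, the decision looks something like the sketch below. This is not how Azure Traffic Manager is implemented; it is a toy model under stated assumptions. The address-range-to-region table, the latency figures, and the US endpoint address (203.0.113.10, a documentation placeholder) are invented for illustration. Only the two addresses from the example above, 192.66.100.141 and 40.113.99.50, come from this post.

```python
import ipaddress

# Hypothetical mapping from caller address ranges to a coarse region.
CALLER_REGIONS = {
    ipaddress.ip_network("192.66.0.0/16"): "europe",
    ipaddress.ip_network("198.51.100.0/24"): "us",
}

# FWSE instances. The European address is from the example above;
# the US address is a placeholder from the documentation range.
ENDPOINTS = {"europe": "40.113.99.50", "us": "203.0.113.10"}

# Assumed latency in milliseconds from caller region to endpoint region.
LATENCY_MS = {
    "europe": {"europe": 15, "us": 90},
    "us": {"europe": 90, "us": 15},
}


def region_of(caller_ip):
    """Map a caller address to a region using the lookup table above."""
    address = ipaddress.ip_address(caller_ip)
    for network, region in CALLER_REGIONS.items():
        if address in network:
            return region
    raise LookupError(f"no region known for {caller_ip}")


def nearest_endpoint(caller_ip):
    """Return the address of the endpoint with the lowest assumed latency."""
    latencies = LATENCY_MS[region_of(caller_ip)]
    best_region = min(ENDPOINTS, key=lambda region: latencies[region])
    return ENDPOINTS[best_region]


print(nearest_endpoint("192.66.100.141"))  # 40.113.99.50
```

The point of the sketch is that the answer to a DNS query now depends on who is asking, which a static table of entries cannot express.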

With sufficiently many instances of the FWSE scattered around the world, we can ensure low latency for all customers. You can easily check which instance of FWSE you will get routed to. This is what I get:

Can you see in which Azure region “my” instance of FWSE is hosted?
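
One way to run the same check from code is the minimal Python sketch below. It relies only on the standard library call `socket.gethostbyname_ex`; the helper name `describe_routing` and its `lookup` parameter are made up for this example. The result depends on where you run it and which DNS resolver you use, so no output is shown.

```python
import socket


def describe_routing(hostname, lookup=socket.gethostbyname_ex):
    """Resolve `hostname` and return its canonical name, aliases, and IPv4 addresses.

    `lookup` is injectable so the function can be exercised without network access.
    """
    canonical_name, aliases, addresses = lookup(hostname)
    return {
        "canonical_name": canonical_name,
        "aliases": list(aliases),
        "addresses": list(addresses),
    }


# Example (needs network access):
#   print(describe_routing("api.businesscentral.dynamics.com"))
```

The canonical name and aliases in the result show the names that the domain was resolved through, which may hint at the region of the instance you were given.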

Fault tolerance

We have now covered how we minimize latency by having multiple instances of the FWSE deployed, but what happens if one of them goes down? If the Traffic Manager keeps routing requests to the bad instance of FWSE, our customers will not be able to connect.

Azure Traffic Manager solves this by periodically probing every instance: it sends HTTP requests to each one, as specified in our configuration. If the European instance of FWSE is down, Traffic Manager will soon discover it and stop directing clients to it. The Asian customer in our example, who is travelling in Europe, will then have his/her requests routed via the US, which is slower, but at least it works!
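
The probing-and-failover idea can be sketched as follows. Again, this is a toy model rather than Traffic Manager’s actual logic: the `Endpoint` class, the probe callable, the latency figures, and the US address (a documentation placeholder) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    region: str
    address: str
    latency_ms: int  # assumed latency from a caller in Europe
    healthy: bool = True


def run_probes(endpoints, probe):
    """Update each endpoint's health flag from a probe callable.

    `probe(address)` stands in for an HTTP health check and returns True or False.
    """
    for endpoint in endpoints:
        endpoint.healthy = probe(endpoint.address)


def pick_endpoint(endpoints):
    """Return the lowest-latency endpoint that passed its last probe."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        raise RuntimeError("no healthy endpoint available")
    return min(candidates, key=lambda e: e.latency_ms)


endpoints = [
    Endpoint("europe", "40.113.99.50", latency_ms=15),
    Endpoint("us", "203.0.113.10", latency_ms=90),  # placeholder address
]

# Simulate an outage of the European instance.
down = {"40.113.99.50"}
run_probes(endpoints, probe=lambda address: address not in down)

print(pick_endpoint(endpoints).region)  # us
```

Once the European instance passes its probes again, the next round of `run_probes` marks it healthy and callers in Europe are sent back to it.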

We use the same approach for other global services. This is only a part of our High Availability story, however, because we also have all the regional services. The solution for those is quite different from what we have described in this post, but we’ll save that for another post.