Cloud Agents are one of the most important components of ThousandEyes. Our customers use them to monitor their networks from all around the world. We recently surpassed 100 Cloud Agent locations, so I thought it would be a good time to explain how we deploy these monitoring points and how you can use similar techniques to test the performance of data centers around the world.
We are constantly increasing the number of locations for our Cloud Agents, including the latest locations in Jakarta, Indore and Albuquerque. We’re rigorous in our approach to acquiring new locations. We need to be confident that the network performs well and has the right peering. We use ThousandEyes to perform these tests, because it gives us good visibility into any network-related detail we need to know.
Scouting New Locations
Everything starts by looking for dedicated servers in a location that customers have requested or that we think might be interesting based on population or geography. As you can imagine, we have to deal with tons of server providers speaking different languages, and we also have to deal with government rules or restrictions in countries that limit DNS, IP addresses, etc. Once we contact a provider we explain our server requirements, which are normally easy to fulfill, and then the network requirements, which are not so easy. We want our agents’ connectivity to be consistent, with no packet loss and with stable peerings, and we verify that by testing.
Testing the Data Center Network
We ask the provider for an IP located in the data center where our server will be hosted; this IP must have a TCP port open, ideally 80. Testing ICMP alone is not enough to measure latencies; we want a deeper test. You might be thinking of a traceroute, mtr, etc., but no, we don’t use those; instead we check our Path Visualization and HTTP Server dashboards.
We create a network or HTTP test to the given IP and TCP port using other Cloud Agents, a bunch of them, covering most of the world. We include all the Cloud Agents that belong to the same country as the new one, so we can test local peerings.
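To give a feel for why an open TCP port matters, here is a minimal sketch of a TCP-based probe in Python. It is not how ThousandEyes tests work internally; it only times repeated TCP connects to a host and port and reports failures as loss. The function names, the number of attempts and the timeout are all assumptions made for illustration.

```python
import socket
import time

def tcp_connect_ms(host, port=80, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Refused, unreachable or timed out: count it as a lost probe.
        return None

def probe(host, port=80, attempts=10, timeout=2.0):
    """Run several connects and summarize failure rate and latency."""
    samples = [tcp_connect_ms(host, port, timeout) for _ in range(attempts)]
    ok = [s for s in samples if s is not None]
    return {
        "attempts": attempts,
        "loss_pct": 100.0 * (attempts - len(ok)) / attempts,
        "avg_ms": sum(ok) / len(ok) if ok else None,
    }
```

Running something like probe(candidate_ip, 80) from many vantage points gives a rough latency and failure picture, but it says nothing about the path taken, which is what the checks that follow depend on.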
We let the test run for several days and then we check the results. We discard the provider or go ahead depending on these checks:
Packet loss: This is the most obvious check. If there is packet loss, we discard the server. There have been cases where we detected packet loss, shared a snapshot with the service provider showing the issue, and they managed to fix the problem just by reviewing it.
Peerings and BGP routes: We check the peerings and routes to ensure they are following a normal and logical path. For example, if you send a packet from one location in a country to another location in the same country, and it is routed through a different country, then it fails this test. An example of a bad route to a server located in Barcelona, as tested by our Madrid agent, can be seen below.
Backhauling: Some hosting companies are global and operate their own network between their multiple data centers. That means that they route traffic within their own network instead of through transit and peering networks. This is not a realistic measurement for our customers, so we also discard these providers.
If a server passes these tests, we move ahead and get it.
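The three checks above can be sketched as a simple pass/fail function. This is an illustration, not our actual tooling: the Hop and AgentResult structures, their fields, and especially the backhauling heuristic (flagging a path when most of its hops belong to the provider's own network, with an arbitrary 80% threshold) are assumptions. The AS numbers in the usage note come from the private-use range.

```python
from dataclasses import dataclass, field

@dataclass
class Hop:
    country: str  # ISO country code of the hop (hypothetical field)
    asn: int      # network that owns the hop (hypothetical field)

@dataclass
class AgentResult:
    agent: str
    agent_country: str
    loss_pct: float
    path: list = field(default_factory=list)

def evaluate(results, target_country, provider_asn, max_provider_share=0.8):
    """Return a list of reasons to discard the candidate; empty means pass."""
    reasons = []
    for r in results:
        # Check 1: any packet loss disqualifies the server.
        if r.loss_pct > 0:
            reasons.append(f"{r.agent}: packet loss {r.loss_pct}%")
        # Check 2: a domestic test should not leave the country.
        if r.agent_country == target_country and any(
            h.country != target_country for h in r.path
        ):
            reasons.append(f"{r.agent}: domestic path leaves {target_country}")
        # Check 3 (assumed heuristic): path mostly inside the provider's network.
        if r.path:
            share = sum(h.asn == provider_asn for h in r.path) / len(r.path)
            if share > max_provider_share:
                reasons.append(f"{r.agent}: likely backhauled ({share:.0%} of hops)")
    return reasons
```

For example, a Madrid agent whose path to a Barcelona server is [Hop("ES", 64512), Hop("FR", 64513), Hop("ES", 64514)] produces one reason, because a domestic test left Spain.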
Testing the New Server
Besides some performance tests for the server itself, we also do Page Load tests, again using our own product.
This new location will be a real agent, but it will stay in testing mode for a few days, hidden from our customers. During this time it runs page load tests against multiple popular websites like Google, MSN, LinkedIn, Baidu, etc. We have set an average load time target, so if the new agent performs better than that average, it is good to go and we remove the testing mode so it can be officially used by our customers.
Given the rigorous set of tests, a large proportion of tested service providers don’t pass. For the last 36 locations we added, only 40% of the service providers passed the tests; the other 60% were discarded.
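As a rough illustration of that acceptance rule, the sketch below averages page load times per site, then across sites, and compares the result with a target. The function name, the input shape (a dict mapping site to a list of load times in milliseconds) and the choice to average per site first are assumptions; the actual target value is not given here.

```python
from statistics import mean

def ready_for_production(load_times_ms, target_ms):
    """Return (passed, overall_avg_ms) for an agent in testing mode."""
    # Average each site first so a heavily sampled site does not dominate.
    per_site = {
        site: mean(samples)
        for site, samples in load_times_ms.items()
        if samples
    }
    if not per_site:
        return False, None  # no data yet, keep the agent in testing mode
    overall = mean(per_site.values())
    # "Better than the average" is read here as strictly faster than the target.
    return overall < target_ms, overall
```

A call such as ready_for_production({"site-a": [1000, 2000], "site-b": [3000]}, target_ms=2500) passes, since the per-site averages are 1500 and 3000 and their mean is 2250.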
Once the agents reach production and customers start using them, we want to be sure that we don’t lose the quality we established with the previous tests. To do that, we set up an ongoing validation in which all our agents perform network tests among themselves. With these tests, we can detect any issue with an agent, such as higher latencies, peering changes, or higher packet loss.
Looking for hosting providers or co-location space in far-flung markets around the world is a time-consuming process. But having to move providers after the fact, because of poor performance, is much more disruptive. If you’d like to test new or existing hosting environments using the same tactics we do, sign up for ThousandEyes Lite and baseline performance from our Cloud Agents to the service provider.
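One way to picture this ongoing validation is as a comparison of each agent pair's latest measurements against a stored baseline. The sketch below is a simplified assumption of how such a comparison could look; the dictionary layout, the 1.5x latency factor and the one-point loss margin are made up for illustration.

```python
def detect_regressions(baseline, current, latency_factor=1.5, loss_margin_pct=1.0):
    """Compare per-pair measurements with a baseline and list what changed.

    Both arguments map (source_agent, target_agent) to a dict with
    "latency_ms", "loss_pct" and "as_path" keys (hypothetical layout).
    """
    alerts = []
    for pair, base in baseline.items():
        now = current.get(pair)
        if now is None:
            alerts.append((pair, "no data"))
            continue
        if now["latency_ms"] > base["latency_ms"] * latency_factor:
            alerts.append((pair, "higher latency"))
        if now["loss_pct"] > base["loss_pct"] + loss_margin_pct:
            alerts.append((pair, "higher packet loss"))
        if now["as_path"] != base["as_path"]:
            alerts.append((pair, "peering change"))
    return alerts
```

An agent that shows up in alerts for many pairs at once is the likely source of the problem, while an alert on a single pair points at the path between those two agents.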