Content Delivery Networks (CDNs) have become an architectural cornerstone of network design for enterprises looking to deliver fast and highly available digital experiences to their customers. When you rely on a third-party service, like a CDN, to deliver business-critical services, it is crucial to continuously monitor its performance and SLA commitments, because when any one of the cogs in your well-oiled digital experience machine fails, your business fails. In today’s blog post we will walk through an example of how a glitch in Fastly’s CDN affected crucial services like Grubhub, CNN and Reddit. As we step through the Fastly outage, we will articulate best practices for monitoring CDN performance. We will also discuss why it is critical to understand a CDN’s architectural dependencies on DNS and how to build a complete network monitoring strategy for CDN-hosted services.
Basics: How CDNs Work
The goal of a CDN is to optimize application delivery and improve end-user experience. CDNs achieve this by caching content on many distributed servers placed close to end-users. That way, when a user tries to access a CDN-fronted website, the HTTP request is sent to the cache server nearest to the user. These distributed cache servers, also known as edge servers, are selected based on the location of the user. CDN service providers adopt different approaches to picking the closest edge location and keeping latency to a minimum. For example, CDN vendors like Cloudflare, CacheFly and Edgecast use an anycast approach, which relies on BGP routing to find the closest edge node. Akamai, Limelight and Fastly rely on a DNS-based global server load balancing (GSLB) approach, which offers more granular, centralized control.
With DNS-based redirection and selection, edge servers are picked based on the source IP address of the DNS query. When a user accesses a website, the first step is DNS resolution. A DNS query is sent to an authoritative name server that performs the resolution. In the case of a CDN-fronted website, the authoritative name server returns an alias name, usually a CDN domain name, through a CNAME DNS record. This redirects the user (or the recursive DNS server) to the CDN provider’s authoritative server for the IP resolution. When the request is received by the CDN authoritative server, the most suitable edge server (based on the source IP address of the DNS request) is selected. With a DNS-based approach, multiple edge locations with different IP addresses will serve the same content, depending on where the user connects from. This is significantly different from how anycast-based selection works, which involves managing how BGP routes are advertised to the Internet. A DNS redirection-based CDN architecture is relatively easy to implement, which makes it a popular architecture.
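To make the resolution flow above concrete, here is a minimal Python sketch of DNS-based edge selection. It is a toy model, not how Fastly or any other CDN actually implements it: the hostnames, regions, IP addresses and the `resolve` helper are all hypothetical, and the lookup tables stand in for real authoritative name servers.

```python
# Toy model of DNS-based edge selection. Every name, region and address
# below is hypothetical (example domains and documentation IP ranges).

# The site's own authoritative zone: the hostname is an alias (CNAME)
# pointing at a hostname owned by the CDN.
SITE_CNAMES = {"www.shop.example": "shop.cdn-provider.example"}

# The CDN's authoritative server: one edge IP per region for each CDN hostname.
CDN_EDGES = {
    "shop.cdn-provider.example": {
        "us-west": "192.0.2.10",
        "us-east": "192.0.2.20",
        "asia": "198.51.100.30",
    }
}

# Assumed mapping from the DNS query's source IP to a region.
RESOLVER_REGIONS = {
    "203.0.113.5": "us-west",
    "203.0.113.77": "us-east",
    "203.0.113.200": "asia",
}


def resolve(hostname, query_source_ip, default_region="us-east", max_hops=5):
    """Follow the CNAME chain, then pick an edge from the query's source IP."""
    chain = [hostname]
    name = hostname
    for _ in range(max_hops):
        if name not in SITE_CNAMES:
            break
        name = SITE_CNAMES[name]
        chain.append(name)
    region = RESOLVER_REGIONS.get(query_source_ip, default_region)
    return chain, CDN_EDGES[name][region]


for source_ip in ("203.0.113.5", "203.0.113.200"):
    chain, edge_ip = resolve("www.shop.example", source_ip)
    print(source_ip, "->", " -> ".join(chain), "->", edge_ip)
```

The same hostname resolves to a different edge IP depending on where the query comes from, which is why a mistake in the selection step can send users to a distant edge even though nothing changed on the website's side.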
Fastly’s Outage Impacts Critical Services
Last year, Fastly suffered a short but heavily felt network outage that impacted popular websites like The New York Times, CNN, Reddit, Grubhub and Pinterest. Fastly later attributed the outage to an internal error that resulted in traffic degradation. But what exactly happened? Let’s go behind the scenes of the outage to find out.
Grubhub started seeing the first signs of trouble on June 28th, 2017 at 6:40am Pacific Time. The outage lasted only 30 minutes, but that was long enough to cause panic among customers. As seen in Figure 1 below, all our Cloud Agents within the US were unable to connect to grubhub.com. The connect and SSL errors indicate an underlying network issue. Check out the sharelink to see the full picture of how the story unfolds.
Packets Go for a Ride. Around the World.
As we dig into the network layer, it becomes obvious that this is indeed a network issue. High levels of packet loss correspond to the dip in HTTP server availability as shown in Figure 2.
As we hover over the network path to detect the source of the packet loss (Figure 3), we see a pattern emerge. First, Grubhub is front-ended by Fastly, as seen from the target IP address. Second, the packet loss is concentrated within service providers upstream of Fastly. The oddity is not so much the packet drops as where they are being dropped: within Japan! If the goal of a CDN is to optimize network path and latency and serve content from locations closest to the end-user, why is traffic being routed to Japan from Cloud Agents located within the United States? This definitely explains the high latency we noticed earlier in Figure 2.
Before and After Comparison
So why are packets being routed through Japan? Was this always the case or is it a manifestation of the outage? Let’s go back in time and look at steady state network behavior (Figure 4) to understand what might be happening.
We notice that the network before the outage is relatively clean: there are no packet drops and traffic is well distributed across the multiple CDN edge locations. Recalling our discussion earlier, Fastly uses a DNS-based edge location selection mechanism to serve content to end-users with minimum latency. This explains the different target IP addresses you see in Path Visualization in Figure 4. However, during the outage (Figure 3) we see a poorly load-balanced network, a sub-optimal network path and a different set of target IP addresses. If you follow the sharelink, you will also notice that during steady state, traffic does not go through Japan.
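The before-and-after comparison can also be automated. The following is a minimal Python sketch, under the assumption that you record the target IP each vantage point reaches in a steady-state window and again in the current window. The agent names, IP addresses and the `edge_drift` helper are made up for illustration and are not taken from the ThousandEyes data.

```python
from collections import Counter


def edge_drift(baseline_targets, current_targets):
    """Compare the target IPs seen per agent in two time windows.

    Both arguments map an agent name to the target IP it reached.
    """
    base = Counter(baseline_targets.values())
    cur = Counter(current_targets.values())
    base_set, cur_set = set(base), set(cur)
    union = base_set | cur_set
    # Jaccard overlap of the two IP sets: 1.0 means identical, 0.0 means disjoint.
    overlap = len(base_set & cur_set) / len(union) if union else 1.0
    # Share of agents landing on the single busiest target (1.0 = no balancing).
    top_share = max(cur.values()) / sum(cur.values()) if cur else 0.0
    return {
        "new_targets": sorted(cur_set - base_set),
        "overlap": overlap,
        "top_share": top_share,
    }


# Hypothetical observations: four US agents, spread across edges at steady
# state, all steered to one unfamiliar target during the incident.
steady_state = {
    "Seattle": "192.0.2.10",
    "Dallas": "192.0.2.20",
    "Chicago": "192.0.2.20",
    "New York": "192.0.2.30",
}
during_outage = {agent: "198.51.100.30" for agent in steady_state}

print(edge_drift(steady_state, during_outage))
```

A low overlap combined with a high top share is only a hint that edge selection changed. Target IPs can also shift for benign reasons, so any alert thresholds would need to be tuned against your own baseline.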
So what happened within Fastly’s network that caused packets to go halfway around the world, only to get dropped? It is possible that an internal change within Fastly’s network resulted in sub-optimal edge server selection. Traffic, irrespective of the user’s location, was being sent to Fastly edge servers located in Japan. The sudden influx of traffic (from multiple Fastly customers) was probably too much for the upstream service providers to handle, resulting in 100% packet loss. It could also be that the edge servers in Japan had not cached the content and were unable to reach the origin servers due to internal changes. Either way, it impacted businesses relying on Fastly.
CDN Monitoring Best Practices
- Monitor Your CDN Edge Locations: Irrespective of how your CDN provider selects edge server locations, monitor them from the vantage point of your customers. ThousandEyes Cloud Agents are available in 160 cities globally, so you can pick and choose the agents to best represent your customer distribution.
- Monitor Application Performance and Network Latency: While it is important to monitor application uptime and availability, the cross-correlation between application performance and network behavior is critical to identify the root cause.
- Don’t take DNS lightly!
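The cross-correlation between application and network metrics mentioned in the list above can be sketched in a few lines of Python. This is a simplified illustration, not ThousandEyes functionality: the sample values are invented, and the `pearson` and `network_likely_involved` helpers and the -0.8 threshold are assumptions chosen for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def network_likely_involved(availability, packet_loss, threshold=-0.8):
    """True when availability falls as packet loss rises (strong negative r)."""
    return pearson(availability, packet_loss) <= threshold


# Hypothetical per-interval samples, in percent, around an incident.
availability = [100, 100, 40, 0, 0, 100]
packet_loss = [0, 0, 60, 100, 100, 0]

print(round(pearson(availability, packet_loss), 3))
print(network_likely_involved(availability, packet_loss))
```

A strong negative correlation suggests, but does not prove, that the availability dip has a network cause. It is a prompt to look at the network path, not a replacement for doing so.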
For additional details on CDN monitoring best practices including what metrics to monitor, read our Intro to CDN Monitoring blog post.
The Fastly outage is yet another reminder of how reliance on third-party networks and service providers can affect the performance of your applications and the experience of your users. Want to keep your CDN service providers in check? Request a demo to learn how your business can benefit from ThousandEyes or sign up for a free trial to start getting actionable insights into the networks and services your business relies on.