Visualizing Network Ops @ Twitter

Posted on August 19, 2015

This is the last of our reports from ThousandEyes Connect, held in San Francisco in May. Read the previous talks about optimizing eBay’s CDN performance, monitoring Oracle’s cloud services and improving internet reliability at Bloomberg. Today we will move on to Twitter’s global media delivery. Matt Lee, Network Engineer at Twitter, spoke about how his team ensures that tweets, images and videos get delivered to users around the world. Or in his words, that “the Tweets keep flowing.”

Delivering Content at a Global Scale

Matt kicked off his talk with a visualization of the “geography of tweets.” Everything Twitter does is at a global scale: more than 500M tweets per day, with 80% of users on mobile and 77% outside of the United States. And Matt reminded us that it’s not just 140 characters, but also photos, videos (Vine) and live streaming (Periscope).

Figure 1: Map of tweets in Tokyo, showing activity along major transport networks.

Twitter relies heavily on Content Delivery Networks (CDNs) to get tweets, images and videos to users as fast as possible. Matt shared a bit about the Twitter architecture: two redundant CDNs, one from an external provider and one co-located, with round-robin DNS distributing requests between them. Within a CDN, IP Anycast directs each request to the closest POP.

In the case of the co-located CDN, Twitter has partnered with a third-party vendor and hosts the vendor’s cache servers in Twitter’s own data centers. As Matt describes, “It gives us a lot of control. We can see link errors going down to the cache server, running Varnish. We control the BGP session and policy. Our partner manages the server.” Most traffic flows to this co-located CDN edge, which has approximately 20 POPs around the world. Traffic back to the origin flows primarily through Twitter’s own backbone.

Bringing Visibility to the Edge

Matt spends his day working on network operations, CDN, load balancing and data center automation projects. And this means watching over a lot of performance data. According to Matt, Twitter has “really great data on service delivery inside the data center, whether servers are throwing 500s or link errors. We also get great data from Twitter-owned clients, either on mobile or web. The problem is it’s not real-time enough. So we needed something else to monitor the CDN edge.”

Matt posed a scenario: a user in Singapore reports problems or slowness. How do you troubleshoot that? How do you replicate and measure the experience of users when redundant, Anycast CDNs are serving up the content? In Matt’s view, “If you’re working with a provider, someone else managing your caches, you can’t just claim that the ‘Internet is not working.’ It’s not going to cut it. What you really need to find out is where the problem is. What is the solution? Which cache are users hitting that are creating this problem?”

You cannot simply reproduce the problem locally, because with Anycast, if you “run a CURL from San Francisco, you’re going to hit a different cache server.” Troubleshooting remotely by following up on user complaints, collecting screenshots and filing tickets with the service provider was too much of a pain. Visualizing paths helps Matt see whether traffic is reaching the closest cache or whether slowness is the result of it taking a route around the world.

So what do you need to do to solve CDN delivery issues? For Matt, “HTTP headers are the greatest thing. There’s lots of data in there that we get back from the caches that helps us understand what’s going on.” Matt uses ‘X-served-by’ headers on test objects to understand local issues: the hostname tells him which POP the cache is in, which router it is connected to and which server it is. He also uses MD5 hashes of the content to check the cache state, for example when someone reports receiving the wrong image. Comparing the origin against the edge shows whether there is a cache state issue and helps tune performance.
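The two checks Matt describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Twitter’s tooling: the talk does not give the real hostname layout, so the `pop-router-server` format and the names `CacheNode`, `parse_served_by` and `cache_is_consistent` are hypothetical, and the response bodies are assumed to have been fetched already from the origin and from the edge.

```python
import hashlib
from typing import NamedTuple, Optional


class CacheNode(NamedTuple):
    """Location details encoded in a cache hostname (hypothetical layout)."""
    pop: str
    router: str
    server: str


def parse_served_by(value: str) -> Optional[CacheNode]:
    """Split an assumed 'pop-router-server' header value into its parts.

    Returns None when the value does not match the assumed layout.
    """
    parts = value.strip().split("-")
    if len(parts) != 3 or not all(parts):
        return None
    return CacheNode(*parts)


def md5_hex(body: bytes) -> str:
    """Return the MD5 digest of a response body as a hex string."""
    return hashlib.md5(body).hexdigest()


def cache_is_consistent(origin_body: bytes, edge_body: bytes) -> bool:
    """True when the edge cache serves the same bytes as the origin."""
    return md5_hex(origin_body) == md5_hex(edge_body)


if __name__ == "__main__":
    # Made-up header value and bodies, used only to show the calls.
    node = parse_served_by("sin1-r2-cache07")
    print(node)
    print(cache_is_consistent(b"photo-bytes", b"photo-bytes"))
```

In practice the header value and both bodies would come from HTTP requests made from the affected region, since a request from elsewhere would land on a different cache.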

The Case of the Missing Selfies

So what’s normal? Figure 2 shows a normal routing pattern to Twitter’s photo endpoint. The average response time is about 7ms. According to Matt, “everyone’s pretty happy in this scenario.”

Figure 2: Performance to an Anycast photo endpoint, with 7ms average global response time.

Then Matt talked us through a real scenario. “Until one day, when this happened. A normal day; no maintenance or anything, and all of a sudden traffic just disappeared. What the hell?”

Figure 3: Traffic in and out of the photo service.

“There were some working theories. Maybe it was DNS? But that was working. Maybe it was a reporting black hole, but that wasn’t it. Maybe it was BGP? But no one had any reproducible data. Until we saw this. All of the traffic was going through a router in Tokyo, with 93% packet loss and response time spiked.” With this data in hand, Matt and his team fixed the routing configuration and made sure the photos were flowing again.

Figure 4: Global traffic routing through a single router in Tokyo.
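The reasoning behind that diagnosis, finding the one hop that every path shares and checking its loss, can be expressed as a small sketch. It assumes path data has already been collected from several vantage points as lists of intermediate hop names plus a per-hop loss ratio. The function names, the 50% threshold and all sample data below are invented for illustration, apart from the 93% loss figure quoted in the talk.

```python
from typing import Dict, List, Set


def shared_hops(paths: Dict[str, List[str]]) -> Set[str]:
    """Return the hops that appear in every vantage point's path."""
    hop_sets = [set(hops) for hops in paths.values()]
    if not hop_sets:
        return set()
    return set.intersection(*hop_sets)


def suspect_hops(
    paths: Dict[str, List[str]],
    loss_by_hop: Dict[str, float],
    loss_threshold: float = 0.5,
) -> List[str]:
    """Return shared hops whose loss ratio meets the threshold, sorted by name.

    Hops with no loss measurement are treated as having zero loss.
    """
    return sorted(
        hop
        for hop in shared_hops(paths)
        if loss_by_hop.get(hop, 0.0) >= loss_threshold
    )


if __name__ == "__main__":
    # Invented intermediate hops for three vantage points.
    paths = {
        "san-francisco": ["sfo-isp", "tokyo-router"],
        "singapore": ["sin-isp", "tokyo-router"],
        "london": ["lon-isp", "transit-a", "tokyo-router"],
    }
    loss_by_hop = {"tokyo-router": 0.93, "sfo-isp": 0.0}
    print(suspect_hops(paths, loss_by_hop))
```

A hop that every region traverses is unusual for Anycast traffic, which should normally stay regional, so a non-empty result is a strong hint about where to look.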

Matt also spoke with us on camera about his experience as a Network Engineer at Twitter. Check out the short video below:

So there’s a quick peek into how to run a massive, real-time media network. Stay tuned for details about the next ThousandEyes Connect, happening in New York this fall.