Watch on YouTube – The Internet Report – Ep. 15: July 13 – July 19, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On this week’s episode, we cover a couple of significant application-layer outages at GitHub and WhatsApp that occurred over the past week. Then, Archana and I do a deep-dive into a network-related outage at Cloudflare that affected the availability of its popular DNS service for approximately 30 minutes. We’ll share what we saw through our vantage points in the ThousandEyes platform, and you can read Cloudflare’s full explanation of the incident on their blog here.

Find us on:

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Listen on Transistor – The Internet Report – Ep. 15: July 13 – July 19, 2020
ThousandEyes T-shirt Offer

Follow Along with the Transcript

Angelique Medina:
Welcome to the Internet Report, where we uncover what’s breaking and what’s working on the Internet and why. Last week was a big week. We had three major outages, starting off on Monday with a big bang with GitHub and their outage, and then midweek we had WhatsApp go down for a period of time. Then finally we exited the week with the biggest outage, which was Cloudflare having a significant network issue that impacted their DNS services.

Archana Kesavan:
Right. Angelique, not to forget the outage that we were a part of too. If you guys signed in for State of the Internet, you would have noticed us drop off for about 10 minutes there, but we were back up. It was a small glitch in the backend system that we were relying on for the live stream. That was another live outage that you guys witnessed.

Angelique Medina:
It was a busy week.

Archana Kesavan:
It was. It was a busy week. The GitHub outage was definitely one of the smallest outages this week; it lasted for about an hour and 45 minutes. There was nothing essentially wrong with the network connectivity piece, getting to the data centers, but it looks like there was an application-level issue that disrupted connectivity to GitHub.

Angelique Medina:
Yeah. And it took place during a period of time in which it’s typical to do maintenance. It was around 2:15-2:30 Eastern time. And like you said, it lasted around 90 minutes and it wasn’t network related. It appears to have been something to do with the application; maybe they were making an update. They haven’t really come out with a full statement indicating what the issue was, so given that it’s been a week, we’re probably not likely to hear more about that. But it doesn’t seem to have been too impactful from a user standpoint.

Archana Kesavan:
I think the second outage was the WhatsApp one, which lasted for, I think, under an hour, and took place around 3:45 Eastern. So that was definitely disruptive in the sense that it was impacting users. And the WhatsApp outage was interesting in that connecting to WhatsApp’s CDN edge was not an issue. However, our tests running to WhatsApp’s connection servers, c.whatsapp.net, essentially servers hosted in AWS’s Ashburn region, started seeing packet drops to that particular service. And that service is actually really critical to initiate sending messages or texts and so on. So it was a drop in one of their most critical services. And I’m not sure, did WhatsApp come out with an explanation of what actually happened there, Angelique?

Angelique Medina:
I don’t believe I’ve seen a fuller explanation of the issue, but like you said, they did note that this was the result of an internal server upgrade. And the manifestation of it was that traffic couldn’t connect to the server and there was packet loss. Initially, in looking at this service, which we knew was crucial for WhatsApp to work, we saw the packet loss. We were unsure at first, because this was taking place in AWS’s network, whether this was something related to AWS, but it didn’t seem to be impacting any of their other customers. So it did start to look like it might have been due to some configuration change they made. And that was confirmed when they came out with their statement, which actually came out pretty soon after the incident was resolved.

Archana Kesavan:
Right. And the fact that the outage was global in nature, with all our agents testing to that particular critical service showing pretty significant packet drops, also suggests that it was something going on at the application or software layer that was causing this outage.

Angelique Medina:
Right. So two application-related outages. And then we had this outage on Friday, which initially, given the reports, looked like an application-related outage, like DNS. But some of the symptoms or characteristics of this outage were a little bit odd; they kind of looked like they were network-related as well. And they later confirmed that this was the result of a router configuration error that caused widespread systemic congestion within their backbone. So that was particularly interesting.

Archana Kesavan:
Right. Totally. And with that, we’re going to go under the hood, where we’re going to walk through what happened, what we saw, and some commentary about some things that don’t necessarily add up for us.

Angelique Medina:
Yeah, absolutely. So in their post-mortem, they noted this, which is really exactly what we saw: between 21:10 and 21:15, there was this configuration change. They said it was 21:12. And very shortly after that, we started seeing a near 50% drop in availability, which aligns with what they said in terms of the amount of traffic, or the infrastructure, that was impacted as a result of this change. And it lasted for about 30 minutes or so, until about 21:39, 40 or 41, when it was resolved. So during the incident, we could see here, for example, that these vantage points in these various locations were not able to reach the name server, Quad One.

Angelique Medina:
So we’re testing to Quad One (1.1.1.1), which is Cloudflare’s public DNS resolver, and we’re querying for an example.com A record. We’re not able to reach the server, and so that’s the issue there. Now, even though we’re just looking at Quad One right now, the managed DNS service that they have was affected in a similar way. And Cloudflare is also one of the folks that supports two of the root servers, F-root and E-root. We saw a similar dip in availability as a result of this particular incident. So it was simultaneously impacting these different parts of their DNS service.
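As a rough illustration of the kind of availability figure being described, here is a minimal sketch of how you might compute availability from per-vantage-point DNS query results. The locations and success/failure values are invented for the example; this is not ThousandEyes’ actual data or methodology.

```python
# Hypothetical per-vantage-point results for a DNS query to 1.1.1.1 during
# the incident window. True = the agent resolved the record; False = timeout.
results = {
    "London": False, "Tokyo": True, "Ashburn": False, "Sydney": True,
    "Frankfurt": False, "Sao Paulo": False, "Singapore": True, "Chicago": False,
}

def availability(results):
    """Percent of vantage points whose DNS query succeeded."""
    return 100.0 * sum(results.values()) / len(results)

print(f"Availability: {availability(results):.1f}%")  # 3 of 8 agents succeeded
```

With roughly half the agents failing, the computed availability drops toward 50%, which is the shape of the dip described above.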

Archana Kesavan:
Yeah. And on top of that, something we noticed is that a handful of services relying on Cloudflare’s CDN, the edge network, were also showing packet drops. In fact, we noticed one particular service that actually swapped off Cloudflare and tried to connect directly into AWS, which we’ll get to pretty quickly.

Angelique Medina:
Yeah, that’s particularly interesting. So in drilling down further, we saw a lot of people on social media complaining. They weren’t able to reach certain sites because they couldn’t resolve the domain name and reach the website. And what we saw was effectively packet loss, which is really unusual when there are so many different vantage points all connecting to a service, a service that is anycast. So Quad One is an anycast service, which means that it’s served from many locations.
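To make the anycast point concrete, here is a toy model of the idea: one service IP is announced from many PoPs, and routing steers each client to whichever PoP is “closest.” The PoP names and path costs are invented for illustration; real BGP path selection is far richer than a single cost number.

```python
# Toy anycast model: the same service IP is announced from every PoP, and
# each client's packets land at the PoP with the lowest path cost to it.
# PoPs and costs below are hypothetical.
POPS = {
    "LHR": {"client-eu": 5,  "client-us": 70, "client-ap": 90},
    "IAD": {"client-eu": 75, "client-us": 8,  "client-ap": 95},
    "SIN": {"client-eu": 95, "client-us": 85, "client-ap": 6},
}

def serving_pop(client):
    """Which PoP serves a given client: the one with the lowest path cost."""
    return min(POPS, key=lambda pop: POPS[pop][client])

print(serving_pop("client-eu"))  # LHR
print(serving_pop("client-ap"))  # SIN
```

This is why simultaneous packet loss across many vantage points is so unusual for an anycast service: each vantage point is reaching a different, independent PoP, so a localized failure should only affect the clients nearest to it.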

Archana Kesavan:
It’s omnipresent.

Angelique Medina:
Yeah. Exactly. So the idea that this would be a network issue… Okay. That would point to something pretty foundational, like a control plane layer, but in…

Archana Kesavan:
Also, just the fact that the packet loss we were seeing was across possibly all the places where the anycast service existed resulted, I think, in people speculating that this was a DDoS attack. And as we know, it’s not, and we’ll get into that. Once you start seeing the pattern of the loss through path visualization in just a little bit, it does have the symptoms of a DDoS, but it also has a very close resemblance to an outage that happened last year, which, Angelique, I think is the Google outage.

Angelique Medina:
Right. There was a Google network outage that took place in the early part of June last year, I think June 3rd. And I believe it was a Sunday, and it was a pretty significant outage. It lasted, I believe, about four hours, which we’ll touch on in just a second. But to your point, this somewhat resembles a DDoS attack in the sense that there’s a variation of packet loss: pretty extreme on one hand, and then others where you still see some level of packet loss. If you look at this from the standpoint of path visualization, to your point, this isn’t unlike similar scenarios we’ve seen with major DDoS attacks, whether that be against GitHub, which we brought up earlier; the path traces looked pretty similar.

Angelique Medina:
Or even when there was that route leak last year that was propagated by Verizon. There was a whole bunch of traffic getting funneled into Verizon’s network, and into a few other ISPs, because a particular small ISP was advertising itself as a path to Cloudflare. And because of that, far too much traffic was being funneled through these networks, and that was leading to pretty significant packet loss. So folks speculating that this could be a DDoS attack, that’s not out of the realm of possibility.

Archana Kesavan:
Definitely.

Angelique Medina:
But it also looks like Google’s outage, which was the result of their control plane being taken offline by themselves, because they accidentally deactivated it as part of a maintenance window. They effectively took the control plane for a big part of their network in the U.S. offline, which meant that even though their infrastructure was working, they didn’t have routes internally. So all the traffic headed towards Google’s network was just getting dropped at the edge, at the gateway to their network, because there were no internal routes to it.

Archana Kesavan:
Yeah. And that’s similar, you’re right. The drop you’re seeing is actually right at the edge of Cloudflare’s network. So if you hover over each of those nodes there, you’ll start seeing different ISPs that are all impacted. So again, it has the symptoms or the patterns of a DDoS attack, but it definitely was not, and Cloudflare confirmed that as well in their pretty detailed RCA.

Angelique Medina:
Yeah, that’s right. So everything they’ve said in their post-mortem is very much consistent with what we were seeing, which was that there was congestion in their backbone. It’s kind of like a traffic jam, where all of the traffic piles up, is not able to get through to its destination, and then just gets dropped as a result.

Archana Kesavan:
Yeah. And I think the nodes within their backbone, not necessarily all, but about 50%, started seeing this. I think they have a list of all the backbone PoPs that were affected. This congestion happened because of a BGP misconfiguration, an iBGP misconfiguration, in Atlanta, and it resulted in a bottleneck across their backbone. So while the router was in Atlanta, that’s where the issue originated and the congestion started, it then cascaded through their backbone as well.
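The traffic-jam analogy can be sketched very simply: when a misconfiguration attracts more traffic to a link than it can carry, the excess is dropped, and the drop rate is just the excess as a fraction of the offered load. The numbers below are invented round figures, not Cloudflare’s actual capacities.

```python
# Toy congestion model: a link drops whatever traffic exceeds its capacity.
def loss_pct(offered_gbps, capacity_gbps):
    """Percent of offered traffic dropped when load exceeds link capacity."""
    excess = max(0.0, offered_gbps - capacity_gbps)
    return 100.0 * excess / offered_gbps

# Normally Atlanta carries a manageable share of backbone traffic...
print(loss_pct(40.0, 50.0))   # 0.0: under capacity, nothing is dropped
# ...but after the iBGP misconfiguration it attracted far more than it
# could carry (hypothetical numbers).
print(loss_pct(100.0, 50.0))  # 50.0: half the traffic can't get through
```

That roughly matches the observed pattern: heavy but not total loss, varying by how much of each vantage point’s traffic was squeezed through the congested part of the backbone.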

Angelique Medina:
Well, they were still advertising their BGP routes to their service externally from each of those PoPs, but they weren’t serving from those PoPs. They were sending everything internally because of, to your point, the internal BGP configuration. But what was interesting was that as the incident started to get resolved, this was within just a couple of minutes of them implementing an internal change to effectively take the Atlanta router offline, which was what led to the service getting restored, they made some BGP changes. This was pretty unusual. And it was something we saw not only for the /24 prefix for the Quad One service, but also for prefixes related to their CDN service, as well as those related to F-root. So they made this same BGP path change right around the time the offending router was taken offline.

Archana Kesavan:
So, just to draw a distinction as Angelique is working through this: what you’re seeing here is a change in their external BGP announcements. The misconfiguration and root cause that Cloudflare pointed to was in their internal BGP announcements, in their backbone. They’re actually pretty distinct and separate. It’s just …

Angelique Medina:
Yeah. So it’s more of a question that we have: why would an announcement change like this be made just as they were taking the offending router offline and service was getting restored? Was it related? Was it unrelated? Again, it’s the same announcement they made across these other prefixes. And it seems like a pretty minor change. Basically, the only change they made was, rather than advertise a path directly through this particular ASN, AISG, they advertised a path through Cogent. So they made it connect indirectly through them.
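The kind of AS-path change being described can be illustrated by comparing before-and-after path snapshots per prefix. The Cloudflare ASN (13335) and Cogent ASN (174) are real; the observer ASN 64500 and the exact prefixes and paths below are invented for the sketch, since we don’t know the precise announcements involved.

```python
# Hypothetical AS-path snapshots for the affected prefixes, as seen from an
# imaginary observer (AS64500). 13335 = Cloudflare, 174 = Cogent.
before = {
    "1.1.1.0/24":    [64500, 13335],        # reached directly
    "104.16.0.0/20": [64500, 13335],
}
after = {
    "1.1.1.0/24":    [64500, 174, 13335],   # now steered through Cogent
    "104.16.0.0/20": [64500, 174, 13335],
}

def path_changed(prefix):
    """True if the observed AS path for a prefix differs between snapshots."""
    return before[prefix] != after[prefix]

changed = [p for p in before if path_changed(p)]
print(changed)  # both prefixes show the same one-hop-longer path
```

Spotting the same extra hop appearing simultaneously across unrelated prefixes, DNS, CDN, and root-server ranges, is exactly what made this change look coordinated rather than incidental.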

Archana Kesavan:
Right. And to add to that, apart from the fact that this was a pattern seen across not just their Quad One service but also some services relying on Cloudflare’s CDN, I think it was a /20 network, we noticed this pattern reverse itself nine hours later.

Angelique Medina:
Yeah.

Archana Kesavan:
So sometime later, we noticed that the peering here, which had been removed, as in, instead of going via Cogent to AISG, Cloudflare then started peering directly with AISG again, which was the case before this outage, exactly like you see here. So it’s curious: is this just coincidence, or was this something that was planned and also critical for the recovery? It would be good for us to understand that.

Angelique Medina:
Yeah. A couple of questions still remaining for us: one, what was the role of this BGP announcement change, if any, in relation to the outage? And two, when the misconfiguration occurred on that router in Atlanta, was it impacting just certain prefixes or many different prefixes? Those would be interesting things to get answers on.

Archana Kesavan:
The impact we saw with DNS, just the global impact, over 50%, was pretty massive.

Angelique Medina:
Yeah.

Archana Kesavan:
Cloudflare is also a really popular CDN provider, apart from providing DNS. And we didn’t notice that much disruption when it came to their CDN edge services. We did see one particular service that was impacted, so I’m not saying none were, but the blast radius of that impact was much smaller than for the DNS services. So when it comes to the question around prefixes: were the prefixes that were affected only the DNS prefixes, and then maybe a small set of the edge servers? We don’t necessarily know. So that would be good to get some clarification on too.

Angelique Medina:
Yeah. Well, also to your point, in terms of their CDN customers, some, or maybe even many, of their CDN customers are also using their DNS. So they could have been indirectly impacted even if the CDN services themselves weren’t, because DNS is just a foundational service.

Archana Kesavan:
The other thing I thought was interesting, and we were discussing earlier on, is this one particular service that was relying on Cloudflare’s CDN. Right as Cloudflare was recovering and coming back up, they seemed to switch directly to their origin. So instead of going through the CDN provider’s edge network, they started directly connecting users to AWS, which I assume is where their origin was hosted. And the interesting thing there was, this outage happened on Friday towards the evening, almost 5:00 Eastern, and as Cloudflare was recovering, this particular service switched over to AWS. Throughout the weekend, they had this pattern where they were switching back and forth: at some points in time, some locations were connecting to Cloudflare and some locations were connecting to AWS directly.

Archana Kesavan:
And around Monday morning, around 1:30 AM, they completely switched back over to Cloudflare. It’s interesting because in the RCA that Cloudflare published, they talked about implementing a change to cap the maximum number of prefixes on their BGP sessions. They said they were going to do that on July 20 as well, on the Monday. So I’m just wondering if there was some relationship there in terms of switching back safely to Cloudflare.

Angelique Medina:
Right. Well, what’s also interesting is that there could have been multiple things at play, but this particular service is very oriented towards enterprises. It’s not really a consumer service that they’re offering. So you’re not going to get a lot of people coming to your site on the weekends, because it’s really more for enterprises. So maybe they just decided, “Hey, we want to enjoy our weekend. Let’s have Cloudflare and AWS in the mix just to make sure nothing goes offline. And if it runs a little bit slower because it’s not being served from distributed CDN caching nodes, then so be it, because we’re going to have fewer visitors.”

Archana Kesavan:
And the impact that we saw, which was not necessarily that debilitating because, again, it was the weekend, was that the response time to that particular service, just looking at it from Amsterdam, but really the overall response times, increased. And that makes sense. It’s the whole point of having CDNs in the first place: you want to get the information and the data as close to the users as possible. So when you send them directly to the origin, which in this case was in AWS’s Ashburn region, it’s not surprising to see a little uptick in response time. Again, it did not really disrupt customer service or anything, because this particular service is an enterprise service, as Angelique was saying, and this only happened over the weekend.
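The response-time effect described here can be sketched as a simple comparison of round-trip times to a nearby CDN edge versus a distant origin. All the cities and millisecond values below are invented round numbers purely for illustration.

```python
# Hedged sketch: why response time rises when users bypass the CDN edge and
# go straight to the origin. Hypothetical round-trip times in milliseconds.
RTT_TO_EDGE = {"Amsterdam": 10, "Singapore": 12, "Chicago": 9}      # nearby PoP
RTT_TO_ORIGIN = {"Amsterdam": 90, "Singapore": 230, "Chicago": 25}  # Ashburn

def added_latency(city):
    """Extra round-trip time when the CDN edge is taken out of the path."""
    return RTT_TO_ORIGIN[city] - RTT_TO_EDGE[city]

for city in RTT_TO_EDGE:
    print(f"{city}: +{added_latency(city)} ms to reach the origin directly")
```

Note how the penalty grows with distance from the origin: a user near Ashburn barely notices, while a user in Singapore pays for the full trans-Pacific round trip on every request, which is exactly the gap a distributed CDN edge exists to close.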

Archana Kesavan:
And then come Monday morning, they’ve restored back to Cloudflare again.

Angelique Medina:
And from the standpoint of page load time, it seemed like that HTTP response time was kind of absorbed into it, and overall page load time didn’t really go up all that much. So in the end, it didn’t seem like it was overly impactful. But again, we don’t know where users are coming from and on what networks. And there are many reasons why you may want to have a CDN fronting your origin, not just for performance, but also for security and even just shielding the origin.

Archana Kesavan:
Absolutely. Yeah. It was definitely a busy weekend. And the thing I really appreciated about this Cloudflare outage is that the RCA was really, really fast. I thought we were fast.

Angelique Medina:
And not just fast. I mean, they’re very transparent. They want to provide as much information as possible, and you could tell that there was certainly some internal grief around the whole incident. They also put out an open invitation for people to ask questions. I think that would be very interesting; we certainly have questions, and if anybody in our audience does as well, we should just bring them to their door.

Archana Kesavan:
Absolutely. And a couple of questions, again, just to reiterate what we were talking about: why was there a change in the external BGP announcements around the same time as the recovery? Why was that prevalent across multiple prefixes: enterprise services, DNS services, as well as the root service? And then the third question is, in terms of the routes affected, how big was that prefix list? Was it DNS versus other services or not? So if you have answers to these questions and want to enlighten us, definitely write to us. We’d definitely be curious and interested to find out what was actually going on in there.

Angelique Medina:
Absolutely. So that concludes our show this week. This was really focused on Cloudflare, but as you can see, there was quite a lot to dig into. So as Archana mentioned, if you have questions or comments, or have ideas about what we talked about today, feel free to send a note to internetreport@thousandeyes.com, or you can reach out to us on Twitter as well.

Angelique Medina:
Also, we had the State of the Internet virtual summit last week, and the videos from some of the great sessions we had are going up on YouTube, so feel free to check them out. In particular, there’s Marcel Flores from Verizon, who had a great session on optimizing BGP for user performance, which was really interesting. And then Geoff Huston of APNIC is always just wildly entertaining. And there are so many more from Akamai, CenturyLink, and others; just so many great talks.

Archana Kesavan:
Yep. Definitely check that out. And again, if you have questions or ideas for later episodes, email us at internetreport@thousandeyes.com. That’s also how you obtain your free T-shirt: send us your address and T-shirt size and we’ll get that right over to you. Until then, have a great week.
