Watch on YouTube – The Internet Report – Ep. 17: July 27 – Aug 2, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On this week’s episode, Archana and I discuss a small number of outages that hit certain regions of the globe over the past week. This includes an outage that caused a midday disruption for people trying to connect to Reddit, a weekend DNS issue at Telstra, and a Cogent outage in EMEA and NA that had the signatures of a maintenance window. We also revisit Cloudflare’s root cause analysis concerning their recent DNS outage and answer some of the open-ended questions we had.

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Follow Along with the Transcript

Angelique Medina:
Hi everyone. This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet and why. This past week was pretty quiet. Not a lot was going on. We had a few issues. One of the things that we noticed was some user complaints around Reddit. This happened on the 29th, for a few hours, when there was an unusually high number of errors for people connecting.

Archana Kesavan:
Yeah, so it lasted for about three hours from 9:00 AM Pacific time to about 12:30. Reddit acknowledged the issue and resolved it pretty quickly. That was the first outage of the week. I think the more interesting one was the outage on Sunday and this was to Telstra’s DNS service.

Angelique Medina:
Yeah, so we still don’t have a full RCA or root cause analysis for this outage, and initially, Telstra was saying that this was some kind of massive DDoS attack on their DNS. And then they walked that back and said that it seemed to be some issue just with their DNS service. It’s not clear if it’s an internal issue or if it was some kind of flooding of traffic that maybe was non-malicious. Hopefully, we’ll get a little bit more information on that that we can share. But we do know that given that this was a DNS issue, if you had a secondary DNS set up in your system preferences or if you changed your DNS provider to say Google, like 8.8.8.8 or Cloudflare 1.1.1.1, then that would have fixed your issue. Wasn’t a sort of a catastrophic thing and it seemed to have resolved itself fairly quickly. Stay tuned on that one.
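The workaround described above (falling back to a different resolver when the ISP's DNS stops answering) can be sketched in a few lines. This is a minimal illustration, not anything Telstra or ThousandEyes published: the function names `build_query`, `udp_query`, and `first_working_resolver` are hypothetical, and the check only confirms that a resolver returns a response with no error code. It does not validate the response ID or parse the answer records.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is length-prefixed; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def udp_query(resolver_ip, hostname, timeout=2.0):
    """Return True if the resolver replies over UDP port 53 with RCODE 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (resolver_ip, 53))
        reply, _ = sock.recvfrom(512)
    if len(reply) < 4:
        return False
    flags = struct.unpack("!H", reply[2:4])[0]
    return (flags & 0x000F) == 0  # low four bits of the flags are the RCODE


def first_working_resolver(hostname, resolvers, query=udp_query):
    """Try each resolver in order and return the first one that answers."""
    for ip in resolvers:
        try:
            if query(ip, hostname):
                return ip
        except OSError:  # timeouts and unreachable networks land here
            continue
    return None
```

Calling `first_working_resolver("example.com", ["8.8.8.8", "1.1.1.1"])` would try Google's resolver and then Cloudflare's, the two public resolvers mentioned above. The `query` parameter exists so the fallback logic can be exercised without network access.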

Angelique Medina:
But we did have a bigger outage closer to home and also in Europe. This was something that impacted Cogent’s network. We saw some of the symptoms of this in San Francisco, but also more broadly in parts of Europe: Germany, the UK and the Netherlands. That showed up in some really interesting ways because it was pretty big.

Archana Kesavan:
Yeah. Right, exactly. Let’s go under the hood with this Cogent outage right now. And there you go. This comes up. This is when we see that it’s pretty late last night when we saw this outage.

Angelique Medina:
Yeah. Around 8:20 PM Pacific time, 11:20 PM Eastern time. And then this would have been around 4:20 AM in the UK.
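As a quick check of the local times quoted above, here is a small sketch that converts the outage start into each region's clock. The UTC offsets are fixed values assumed for summer 2020 (daylight time in the US, British Summer Time in the UK), and the calendar date is a placeholder, since the transcript only says "last night".

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets (an assumption; they differ outside daylight saving time).
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
BST = timezone(timedelta(hours=1), "BST")


def local_times(start, zones):
    """Map each zone name to (local HH:MM, day offset relative to the start date)."""
    result = {}
    for tz in zones:
        local = start.astimezone(tz)
        day_offset = (local.date() - start.date()).days
        result[tz.tzname(None)] = (local.strftime("%H:%M"), day_offset)
    return result


# Placeholder date: only the 8:20 PM Pacific clock time comes from the transcript.
outage_start = datetime(2020, 8, 2, 20, 20, tzinfo=PDT)
times = local_times(outage_start, [PDT, EDT, BST])
```

With these offsets, 8:20 PM Pacific maps to 11:20 PM Eastern on the same day and 4:20 AM in the UK on the following day, which is consistent with the early-morning hours providers often use for maintenance.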

Archana Kesavan:
UK, right.

Angelique Medina:
It kind of has all the hallmarks of a network change (especially given how many interfaces were affected and how distributed this was in their network): they probably made some kind of control plane change, and then availability was affected. It could have been a maintenance window, too.

Archana Kesavan:
Right. And this is interesting, Angelique, especially with your research report coming out tomorrow. There’s a lot that you talk about in terms of how to characterize the impact of an outage by understanding when it’s happening. And one of the things that we’ve been seeing over the last few months, post COVID, is that a lot of these outages fall in these interesting maintenance windows.

Angelique Medina:
Yeah, that’s right. We talk a little bit about the different characteristics of an outage and how those can point to what the underlying cause of the outage is. And that seemed to be something that we noticed over the last few months was that there seemed to be many more outages that were related to providers making changes to their network, which makes a lot of sense, given all of the accommodations they would have had to make for different types of traffic, where users were sitting, whether that be consumer network, for example. Lots of really interesting stuff to take away from that report. And that will be available tomorrow to download.

Archana Kesavan:
Definitely. I think the other outage that we wanted to talk about didn’t happen this week. I guess this was a couple of weeks ago, when we saw Cloudflare’s DNS service impacted. And Cloudflare came out in classic style: they were really transparent in terms of what happened. It turned out it was an internal BGP misconfiguration on their router, which impacted their DNS backbone. It took down their DNS servers as well and also impacted some of their edge services. We did a deep dive on what we saw in our episode a couple of weeks ago, but we wanted to come back and revisit one of the questions that we had during that episode.

Angelique Medina:
Right, yeah. We had kind of posed this question just kind of in an open-ended way because we noticed that there was a route change right around the time that the issue was being resolved, which seemed kind of odd to us. We thought in the midst of dealing with this major outage, why would there suddenly be an announcement change where they would suddenly start announcing through Cogent versus directly through this other provider?

Archana Kesavan:
Right. And the interesting thing was, it was kind of at the tail end. As the outage was recovering, we saw this BGP route change. And just to clarify, if you’re just tuning into this episode and this view is something that you’re not familiar with, this is an external BGP route advertisement. And the Cloudflare issue at that particular time two weeks ago was an internal BGP issue.

Angelique Medina:
Right, right. We were sort of wondering about this and then, after discussing it a little bit more, we thought this could very well be that this particular provider, AISG, made the actual change. And this is not uncommon, whether you’re an enterprise or a provider: if you find that traffic going through one of your peers to a particular service (it could be Cloudflare, or it could be a cloud provider like AWS or Azure) is not getting routed, or is getting dropped, or there’s some major systemic issue, you might then change how you’re routing traffic so that you can effectively go around the issue. It’s likely, especially given that this was across a variety of Cloudflare DNS services (not just the 1.1.1.1 resolver, but also the hosting service, as well as the root servers), that all of that got changed as well.

Archana Kesavan:
Right. Right. And given how we saw this manifest, we saw a severe packet loss right before entering Cloudflare’s network. It feels probable that this provider AISG thought there might be a better way to reach Cloudflare. And unfortunately, in this particular case, there wasn’t.

Angelique Medina:
Right. Right. Well, it’s interesting because it’s sometimes hard to tell because Cloudflare’s DNS service is an anycast service, sometimes if you’re just sending it through to another provider because of just how anycasting works. It could be that you would be able to get around the issue and get routed to a different PoP, which was interesting because we did see in some service providers around that time where I think it was Dallas or something like that, where using a particular service provider, they were routing folks to Houston. And because of that, those users in Dallas were able to get the Cloudflare service but others weren’t.
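The Dallas and Houston example above can be illustrated with a toy model of anycast. Everything in it is hypothetical: the provider names, the path lengths, and the set of impacted PoPs are made up to mirror the description, and real BGP path selection considers much more than path length.

```python
# Toy model: every PoP announces the same anycast address, and each provider
# forwards traffic to whichever PoP has the shortest path from its own network.

# Hypothetical: the PoP assumed to be affected by the outage.
IMPACTED_POPS = {"Dallas"}

# Hypothetical path lengths from two providers serving users in Dallas.
PATH_LENGTH = {
    "provider-a": {"Dallas": 1, "Houston": 2},
    "provider-b": {"Dallas": 3, "Houston": 1},
}


def selected_pop(provider):
    """Return the PoP this provider's traffic lands on (shortest path wins)."""
    paths = PATH_LENGTH[provider]
    return min(paths, key=paths.get)


def service_reachable(provider):
    """Return (PoP reached, whether that PoP is serving traffic)."""
    pop = selected_pop(provider)
    return pop, pop not in IMPACTED_POPS
```

In this sketch, users behind `provider-a` land on the impacted Dallas PoP and fail, while users in the same city behind `provider-b` are carried to Houston and succeed. That is the behavior described above: same destination address, different outcome depending on the upstream path.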

Archana Kesavan:
Also, this issue impacted 50% of their backbone, so not all their PoPs were impacted. They have a list of PoPs that were impacted. You’re right, because it’s anycast, if you hit another server, you might as well try your luck there. Makes sense in terms of withdrawing that route. And I think eight hours later, once Cloudflare’s RCA came out, this provider, AISG, basically went back to peering directly with Cloudflare.

Angelique Medina:
Exactly. Yeah.

Archana Kesavan:
Anyways. All right. I think that’s all we have for today’s show.

Angelique Medina:
Yeah. It was a short and sweet episode. We’ll be back in full force next week. And I think we have at least one great speaker lined up for next week. We’re excited about that. And don’t forget to download the Internet report, which is going to be released tomorrow, and subscribe.

Archana Kesavan:
Yep. Yeah, don’t forget to subscribe, especially if you want that t-shirt, InternetReport@thousandeyes.com. All right. Well then we’ll see you guys next week.
