The Internet Report, Episode 2 – Week of March 30–April 3, 2020

Disclaimer: Due to some technical difficulties, we lovingly name this episode “the one without the video.” Enjoy our first audio edition, flip through the slides, and of course read through the transcript below. We’ll be back on video next week.

It was yet another eventful week on the Internet, folks. In this week’s episode of The Internet Report, Archana and I discuss the latest figures around global Internet performance, noting that, despite an elevation in outages last month, the Internet is holding up well. ISP outages declined slightly in the U.S. and globally last week, but that wasn’t the case for UCaaS providers, who had a particularly rough time last week, especially in the United States. There was also a fairly large BGP route hijack on April 1 courtesy of Russian ISP, Rostelecom — the same ISP responsible for a route hijacking incident back in 2017. The prefixes involved belonged to Amazon, Cloudflare, and other services, and impacted the reachability of sites like Yelp.com.

Listen along in the podcast embedded above and feel free to read along with the transcript below. Don’t forget to subscribe to our blog and our YouTube Channel to be the first to get these episodes moving forward. And don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Follow Along with the Transcript

Angelique Medina:
Hello everyone and welcome to the Internet Report. This is a weekly show that we do where we break down all of the interesting events and trends that have been occurring on the Internet over the past week. The Internet is really important to everybody right now, and we just have to say that the Internet is doing fine, and we’re going to talk about that. And I’m joined by Archana Kesavan, I’m Angelique Medina. And so we’re going to start off by talking about some of the outage events that have occurred over the last week. So this is a pretty extreme event, and then we’re also going to talk about some interesting things that happened last week that were not outage related.

Figure 1: Global Network Outage Events

Angelique Medina:
So, just to go through some of the interesting numbers for last week, and we’re showing this in the context of going several weeks back all the way to the end of February. So, what we have seen is that overall through March outages were increasing, but then we saw them go down very slightly last week, which is good. So that’s a good sign. They finally dropped under 300, they had been holding steady over that for a couple of weeks. So that’s a good sign, but where are we seeing the numbers for this?

Archana Kesavan:
Hey Angelique. When we say outages here, that includes ISP providers, cloud providers, and some UCaaS providers that we are tracking as well. So it’s kind of a comprehensive view.

Angelique Medina:
That’s right. These are global numbers, and just to give you some ideas as Archana said, this includes ISPs, this includes public cloud providers, it includes UCaaS providers—so you can think of these as your video conferencing tools that you’re probably using a lot these days and we’re using now. And they also include edge services. These are services where most Internet users are probably not familiar with them, but they’re really foundational to how the Internet works. These are things like DNS, CDNs, other edge services, like security as a service, secure cloud gateways, DDoS mitigation solutions. So a lot of really important services, and we’re tracking outages that are occurring in the networks of those providers. And these are cumulatively what the numbers look like.

Figure 2: ISP Outages — Global

Angelique Medina:
Now, if we just look at ISPs, we also see this same downward trend, still in the 200’s, but it has gone down slightly from the previous week. And what we had seen just over the last part of March, there was a pretty significant spike in outages over previous periods.

Angelique Medina:
Now, one of the things that we had been talking about was, what does this mean? Is there an increase in outages because there’s more congestion in networks? Could that be the reason? Well, actually it doesn’t seem that that’s the case because we’re not seeing a whole lot of congestion as folks have expected. So why are we seeing these outages?

Angelique Medina:
So one of the interesting things about these outages is that, overall, we’re seeing an increase in duration and scope of outages. So, what that means is that for a particular provider, a particular ISP, what we’re seeing is that when they’re having an outage event, it’s impacting more parts of their network across different regions. And even in some cases, different continents. And these outages are long. So we’re talking about 30 minutes, which is a very long time for an outage to be occurring. So that suggests that it’s not so much congestion or traffic stress, and it’s probably due to some state change in the network. So whether that’s an infrastructure change or a configuration change or a peering change, either intentionally or not—we don’t know—it indicates that there’s more traffic engineering going on, right?

Archana Kesavan:
So they’re basically making sure that their network can handle the surge that they’re seeing right now. And that’s one of the reasons why the Internet is not really toppling, right? We’re still able to get our work done. We’ll see some glitches here and there, but to the point you were making, these providers are trying to improve their network—be that through some sort of traffic engineering, some sort of maybe even upgrades to their network, opening up more circuits or paths—so they are ready to handle that surge of traffic that we’ve been recently seeing because of this.

Angelique Medina:
Right, absolutely. Yeah. So that definitely speaks to behind the scenes there’s a lot of activity to either accommodate or compensate for increased traffic levels. And even simple things. They’re probably getting a lot of inbound requests from their peers and their customers to increase bandwidth, for example. And so because of that, every time you have to make those changes to your network, you also risk something potentially going wrong and so, some of these outages just could be evidence that there are a lot more changes being made by providers.

Archana Kesavan:
How it’s manifesting as an outage, it’s not necessarily some things breaking, but in the means of making things work, you’re running into these unfortunate situations, is what we’re interpreting this as.

Angelique Medina:
Yep. And then, it’s going to be interesting to see how this trend goes because we saw a decrease last week and it may be that, certainly the network operators had to mobilize to start to accommodate this new traffic watermark, if you will. But once they make all their changes and things are normalized, then we would expect that we would start to see it drop-off to a more normal state in terms of the number of outages that we see. And looking at the numbers going down last week, that seems to support that. So it’s going to be interesting to see where this number goes, if it continues to stay high or it goes up or goes down, because if it goes up then that indicates there might be something else going on.

Archana Kesavan:
So we should also talk about some things that we’ve seen in the news in terms of providers working with each other, or just establishing better relationships, so that they can together handle all the traffic. So any changes based on that, as they roll out, could possibly create an increase or a little bit of a spike in outages. We’re hoping that doesn’t happen, but it’s definitely possible. So it will be very interesting to see over the next two or three weeks if we see a general decline or if it just goes up and down.

Angelique Medina:
Yeah, absolutely. One other thing to note too, just on this whole subject of how the Internet is holding up, because yes, we see these outage events, which as we mentioned are not related to the congestion. There are isolated pockets of congestion when there’s just a lot of traffic that’s going to say a heavily used application or a site like an unemployment site or something like that. But, overall we’re not seeing a massive increase in latency or jitter or anything like that. But what’s interesting too is because folks keep mentioning Netflix, for example: “Oh, Netflix. Is Netflix going to break the Internet?” Because it’s assumed that we’re just sitting around all day watching Netflix. Actually, Netflix traffic doesn’t really use the Internet backbone. It doesn’t use transit providers. I mean, at least not very much, right? Because when you’re streaming a Netflix video, you’re actually connecting to a server that’s very close by. So probably less than a mile away from you. And all of the videos or films are cached at that location. Now, if you have very eccentric tastes, maybe that server might have to call home and get something to show that to you. But that’s not a common occurrence. So there’s not a lot of traffic flows normally across the Internet when you’re streaming Netflix.

Archana Kesavan:
To the point of the network latency that you mentioned earlier, it’s more a matter of applications getting overloaded and unable to handle it, rather than the actual transport that carries the traffic across. Right? The ISPs that you’re talking about here constitute that backbone of how traffic’s moving across “the Internet.” It doesn’t mean that you’re not seeing an increase in traffic. It doesn’t mean that some applications aren’t caving when they are hit. But that’s specifically an application type of overload situation, rather than an Internet or ISP overload situation.

Angelique Medina:
That’s right. That’s right. Yep. Yeah. I think it’s partly, this is because people are now paying a lot of attention to the Internet. But in normal conditions, the Internet, there are outages, there are issues. And we’re going to talk about some of these issues, too.

Archana Kesavan:
It’s amazing, right? How the infrastructure was created so many years ago, and it’s actually holding up pretty fine.

Angelique Medina:
Absolutely, yeah.

Archana Kesavan:
It’s not caving under all that pressure. So it’s actually pretty fantastic to see that.

Figure 3: Global vs. U.S. ISP Outages

Angelique Medina:
Yeah. Yeah, totally. Looking at the U.S. slice of the pie. Here, we’re again seeing, in the United States, a drop-off… not a drop-off, but a lower range in the number of outages. That’s good news.

Figure 4: Public Cloud Network Outages — Global

Angelique Medina:
Looking at the Cloud provider networks. Clearly, the Cloud providers are really important right now. I think a lot of folks are now realizing… Okay, the agility and just the size of networks and infrastructure that the Cloud providers can offer is very valuable in a situation like this because it allows you to very quickly stand up services and that sort of thing. So, very important service right now. How are they doing?

Angelique Medina:
Well, actually, the public Cloud providers have been doing pretty well, overall. I mean, there are some peaks and valleys here, you see. But as high as they look, these are actually fairly normal numbers from what we’ve seen before. And, if we go back, say eight months or so, we’ve seen periods where there are massive spikes in outages in Cloud provider networks, and we’re not seeing anything near that. So this is business as usual, I think.

Archana Kesavan:
And we discussed this last week. It’s not really surprising to us that these providers are not necessarily seeing a lot of outages on their backbone. I mean, they’ve invested so much in their backbone. A lot of providers are going down the path of monetizing their own backbones through some services that they offer. So it’s not surprising at all that they have the capability and the capacity to deal with this, that’s going on right now.

Angelique Medina:
Yeah.

Archana Kesavan:
I mean, it’s not just expanding services that are already hosted in the Cloud. For instance, if you’re a SaaS provider hosted in a Cloud, and you have to scale for the demand. That’s one piece of it. But there are even completely new services that are being spun out of Cloud providers. Like VPN gateways and any type of security instances. So it’s not just being able to expand your existing services, but also add to the services that you are using these providers for. They are pretty much seeing a lot of influx right now.

Angelique Medina:
Absolutely. And the good news about the public Cloud providers is that they are used to actively engineering their network. For a lot of enterprises, making changes to their network is a planned thing. That’s a scary thing. The public Cloud providers are continuously, on a daily basis, making changes to move traffic around to accommodate the massive numbers and the massive traffic volumes that they see on a regular basis. So they’re really good at this. Especially if they’re paying attention, they know a lot of folks are going to be using their network, they’re on top of stuff.

Figure 5: Global vs. U.S. Public Cloud Network Outages

Angelique Medina:
I think we’re in good shape as far as the public Cloud providers. Just looking at the U.S. again, they’re very, very low. Most of the numbers that we’re seeing in terms of issues are in the Asia-Pacific region. In the United States, they always look pretty low. We can see that the week before last, no outages. And then just one last week.

Figure 6: Collaboration App Network Outages — Global

Archana Kesavan:
I just want to get… How are the UCaaS providers doing?

Angelique Medina:
Ah. Yeah, yeah.

Archana Kesavan:
That’s the more interesting piece of this.

Figure 7: Global vs. U.S. Collaboration App Network Outages

Angelique Medina:
Right, right. This is interesting because unlike the public Cloud providers, they are not used to seeing the volumes that they’re seeing now. They probably have been hit the hardest just from a volume standpoint, and from a change from a previous period. Right? Because now, it’s not just… These are not business collaboration applications anymore. These are like, everyone collaboration. Or just even a…

Archana Kesavan:
… Lifeline applications!

Angelique Medina:
Exactly, there you go. Right? They’re really, really important. We almost never see outage events within these types of providers. And we’ve seen a steady increase; in particular, you can see a spike in the middle of March. And then they declined a little bit, but were still very high for typical… for what normal looks like. And then again last week, just a huge, huge number.

Angelique Medina:
Unfortunately, most of them are actually in the U.S., strangely enough. There definitely is, there are some growing pains here I think. What is interesting that we also talked about, because I think a lot of folks again have brought up this notion that maybe they’re being overrun from a network capacity standpoint, and that’s leading to packet loss. Could that be the reason for the outage? In looking a little bit deeper at this, that doesn’t appear to be the case.

Angelique Medina:
It does seem like that in trying to accommodate or just deal with this traffic increase, that they’re making some changes maybe to their infrastructure configurations that are causing these outage events, versus a network capacity thing. In some of the instances here with some of these same providers that are experiencing outages, the impact is hitting their application particularly hard. So it’s not so much that their network is getting overrun, it’s that their application is just not able to deal with, or is dealing with but maybe not as best as it could with the increase.

Archana Kesavan:
So when we say the application’s not able to hold it, if I can’t connect to a voice call, for instance, or it just takes longer for the application to dial my phone, that’s something I’ve noticed recently on a few of these platforms. It takes a little longer for the call to actually come through. Could those kinds of things be manifestations of them trying to get their network to scale better and then running into some kind of constraints while doing it?

Angelique Medina:
Possibly, I mean it could be that they’re also trying to make changes on the back end to the application to scale out and so they have to then make changes to their network to maybe send traffic in different directions and that could then be why we’re seeing one, outage events like this and two, just a slower response, slower pace of interacting with an application.

Archana Kesavan:
Also, a lot of times when we get a call on our phone, it doesn’t necessarily go through the Internet backbone. It goes through the PSTN backbone, which is a completely different infrastructure from where these cloud providers are running. So that could also mean that the PSTN, which is basically the telephone network, is congested too. We’ve heard that a lot of people trying to get calls through to hospitals and other places during this time are overwhelming that infrastructure too, and that’s completely different from what we are looking at right now.

Angelique Medina:
Yep. For sure. So that’s just a quick peek at some of the interesting numbers that we’ve seen over this last week. But so what is interesting also is that when we brought this up earlier, how is the Internet holding up? I mean, overall, it’s actually holding up incredibly well. That speaks to the resiliency in terms of how it was built. That’s very interesting because you would not expect that given this dramatic spike in usage, that things would be as performant as they are. What are your thoughts on that?

Archana Kesavan:
No, totally right. I think the Internet as a core infrastructure has been there for a while. I mean, this current situation has kind of put the spotlight on it, and it’s doing well. But what we should not forget about the Internet is that, by its very nature, it’s vulnerable, it’s fragile. So yes, these outages might seem a little enhanced because of the current situation, but as we’ve been looking at these outages over so many years, we’ve seen patterns to date, we’ve seen how weak the fabric of the Internet can be from, say, a security or a trust perspective, and that can cause some outages.

Angelique Medina:
Absolutely, absolutely. I mean, one of the things that we talk about a lot is that there’s no steady state in the cloud. There’s no steady state in the Internet and the Internet is a best-effort network. It’s effectively a collection of independently operated networks, whether you’re a provider or you’re an enterprise and collectively that builds the fabric of the Internet. The fact that it runs so well is pretty incredible. There’s some… it’s resilient, but it’s also at the same time, strangely enough, quite fragile and we’ve seen that it’s unpredictable. Things can happen as you mentioned from a security standpoint or performance standpoint. That’s normal.

Archana Kesavan:
That’s normal.

Angelique Medina:
There’s unpredictability, right? And evidence of this, there was a very interesting event that occurred last week that folks were talking about. But again, we have to remind ourselves these things happen, not all the time, but they happen, right?

Archana Kesavan:
They happen. I think we’re alluding to Rostelecom …

Angelique Medina:
So there was a route, I guess you can call it leak or maybe even hijacking. I think they were advertising themselves as Cloudflare …

Archana Kesavan:
… and AWS for that matter, so essentially this ISP, Rostelecom, inserted itself into the traffic path by basically claiming that it was either Cloudflare or AWS for a couple of really big services that we saw were impacted. And the way they did that was essentially by advertising, in BGP terms, a more specific route.

Angelique Medina:
Just to add to that, so they were advertising themselves, and what’s interesting is that, again, the Internet is effectively built on a chain of trust. So they were advertising themselves out. One of their peers is Level 3, a transit provider. When you get an advertised route from one of your peers, you can filter it out. You don’t necessarily have to accept it as a legitimate route. What happened in this case was that one of their peers (obviously, they have many peers), in this case Level 3, didn’t filter it out, and they propagated it on to their own peers. And so it just basically created this chain of events where this was able to take place.
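
The chain of events described here can be sketched as a small simulation. This is only an illustration, not a model of real BGP: the peering graph, the networks named "Peer A" through "Peer D", and the `propagate` function are all assumptions made up for the example. Only Rostelecom and Level 3 come from the discussion.

```python
from collections import deque

# Hypothetical peering graph: each network lists the peers it forwards routes to.
# Only "Rostelecom" and "Level 3" come from the discussion; the rest are invented.
PEERS = {
    "Rostelecom": ["Level 3", "Peer A"],
    "Level 3": ["Peer B", "Peer C"],
    "Peer A": ["Peer D"],
    "Peer B": [],
    "Peer C": [],
    "Peer D": [],
}


def propagate(origin, filtering):
    """Return the set of networks that accept a route announced by `origin`.

    `filtering` is the set of networks that reject the announcement instead
    of trusting it. A network that rejects the route does not pass it on.
    """
    accepted = set()
    queue = deque(PEERS[origin])
    while queue:
        network = queue.popleft()
        if network == origin or network in accepted or network in filtering:
            continue  # the announcer itself, already has the route, or filtered it
        accepted.add(network)
        queue.extend(PEERS[network])
    return accepted


# Assumed scenario: Peer A filters, Level 3 does not, so the route
# still spreads to everything downstream of Level 3.
no_filter_at_level3 = propagate("Rostelecom", filtering={"Peer A"})

# If Level 3 also filters, the announcement goes nowhere.
filter_everywhere = propagate("Rostelecom", filtering={"Peer A", "Level 3"})

print(sorted(no_filter_at_level3))
print(sorted(filter_everywhere))
```

In this toy graph, a single large network that accepts the route is enough for it to reach everything downstream, which is the "chain of trust" point being made.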

Archana Kesavan:
How BGP works is, okay, I advertise some things, somebody takes it and then they propagate it.

Angelique Medina:
Like a game of telephone.

Archana Kesavan:
A game of telephone, exactly. Now it’s surprising that it’s built on trust, you just assume there’s not going to be a bad actor in place claiming to be somebody else. The reason this has become more of a hijack is because in theoretical terms they hijacked a route that was supposed to go to a destination but changed destinations.

Angelique Medina:
We’re using that term, but it’s important to point out that this doesn’t appear to be malicious. Right?

Archana Kesavan:
Exactly, that’s what I was saying.

Angelique Medina:
This appears to be due to some error on their part and we can talk a little bit about some of the scenarios for that too. So, why don’t we walk through what we’re seeing here?

Archana Kesavan:
Right. Right. So what we’re seeing here is actually at the time when this hijack or leak happened, is that this Russian ISP, which is Rostelecom, basically inserted itself into the path for traffic going to Cloudflare, which was through this particular network prefix that we’re seeing here. And, to the point earlier, how this propagated to the broader Internet is that their upstream provider, in this particular case Level 3, did not have any filtering mechanisms in place. Now, we can get into that, and that’s a completely separate area in terms of how can you actually prevent this from happening. But, let’s just say that in this particular instance, there was really nothing in place to prevent this. As in, Level 3 basically took the route that was coming in from Rostelecom and said, “Hey, you want to reach Cloudflare, I’m getting instructions to send traffic to Rostelecom,” and that’s what you see here.

Archana Kesavan:
The dotted line here shows a path that existed before and doesn’t exist right now, at the time of this particular snapshot. In reality, on a day-to-day basis when everything looks good, this is how the network path should look: Level 3 talks directly to Cloudflare for that particular IP address.

Archana Kesavan:
Now, what ends up happening here is the situation that we saw where Cloudflare goes out of the picture, and then you have Rostelecom insert itself. And really, the way they do that is, when it comes to routing, the way it works is it picks a more specific or granular route that exists.

Archana Kesavan:
So on a good day, when everything’s working as expected, a /20, which is a broader prefix, is announced by Cloudflare and traffic’s flowing as intended. But when things go wrong, which is what happened here, a more specific prefix was introduced to the Internet, and it was introduced by Rostelecom, right?
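
The "more specific route wins" behavior can be sketched with Python’s standard `ipaddress` module. This is a minimal illustration under stated assumptions: the addresses are private-range placeholders rather than the prefixes involved in the incident, the /21 length of the more specific announcement is made up (the discussion only mentions the /20), and `best_route` is a hypothetical helper rather than how any particular router implements selection.

```python
import ipaddress

# Placeholder routing table: (prefix, who announced it).
# The addresses are private-range stand-ins, and the /21 length is an
# assumption; the discussion only says the legitimate route was a /20.
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/20"), "legitimate origin"),
    (ipaddress.ip_network("10.0.0.0/21"), "more specific announcement"),
]


def best_route(destination, routes):
    """Pick the matching route with the longest prefix, or None if none match."""
    address = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in routes if address in net]
    if not matches:
        return None
    return max(matches, key=lambda item: item[0].prefixlen)


# An address covered by both prefixes follows the more specific /21.
covered_by_both = best_route("10.0.1.10", ROUTES)

# An address only inside the /20 still follows the broader route.
covered_by_one = best_route("10.0.9.10", ROUTES)

print(covered_by_both)
print(covered_by_one)
```

Only destinations that fall inside the more specific prefix are pulled toward the new announcement; the rest of the broader prefix keeps following the original route.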

Angelique Medina:
Yeah. And for those unfamiliar with BGP routing, you can think about this as like when you’re getting a couple of different route options through Waze, right? You have your application and then suddenly it’s saying, “Hey, we’ve detected a quicker route to wherever you’re going.” And then, you can say, “Oh, okay, well, I want to get there faster or sooner so I’m going to pick that path.” It’s a little bit like that in a sense.

Archana Kesavan:
Exactly. Exactly. Now what stood out for me, and to the point we were making before, is that this was not the first time we saw something like this happen. Actually, three years ago, we saw Rostelecom come into the picture in a very similar way. Again, we’re not saying it was malicious, because of the timeframe and how long this lasted. The event last week was about 10 minutes, and the event three years ago was just seven minutes. Right?

Angelique Medina:
So they did it and then they corrected course, right? Because this was a mistake.

Archana Kesavan:
It’s a mistake, but this has happened in the past, right, which really speaks to this whole fabric of the Internet, which by default does not have security wrapped around it. It didn’t seem necessary when it was built decades ago, but right now in the environment we are in, I think it’s absolutely critical, because there are malicious cases. I think this was a couple of years ago, we actually saw this whole concept of a hijack being used in a malicious way: it was $150,000 in cryptocurrency that was taken out by hijacking AWS’s Route 53 DNS servers, right?

Angelique Medina:
That was a very interesting event because it was a combination of BGP hijacking, and then also DNS.

Archana Kesavan:
It was really smartly-executed.

Angelique Medina:
It was very well done.

Archana Kesavan:
It turned out really well and …

Angelique Medina:
Quite sophisticated. It was pretty, pretty stunning. Yeah.

Archana Kesavan:
It definitely was. So it’s just very important to remember that the Internet, as a whole, is sensitive to these types of things that happen in it, and yes, while the current situation is putting pressure on it, you see that it’s actually doing really well, better than what we all initially expected going into this. But when you see these occurrences, just remember that it’s not necessarily a byproduct of traffic or surge, but it’s just the very nature of this infrastructure we rely on.

Angelique Medina:
Absolutely. So that’s really important to keep in mind. Again, a lot of what’s happening, in terms of this chatter around the Internet, has to do with the fact that we’re just paying attention.

Archana Kesavan:
Yup.

Angelique Medina:
And these things happen. That’s why you can’t assume that traffic’s going to be routed the same way, that we’re not going to have disruptive events. The Internet’s pretty massive. Lots of folks are involved, in terms of keeping it running, because again, this is about a collection of different networks coming together.

Angelique Medina:
So things will go wrong and we’ll continue to keep an eye on it and bring you all of this fresh information about what’s happened week-over-week. And we hope that this has been valuable for you to get a sense of what’s happened and where we are from a performance standpoint.

Angelique Medina:
So why don’t you …

Archana Kesavan:
Yeah, so one of the things we do, when we see these events happen, is actually do a much deeper dive into them.

Angelique Medina:
Absolutely.

Archana Kesavan:
We break it down in great detail and we blog about it. So if these things are of interest to you, which they should be because we’re all relying on the Internet …

Angelique Medina:
That’s right.

Archana Kesavan:
So it’s important to know what’s going on. Subscribe to our blog: blog.thousandeyes.com and then we take that, we just break it down as we see it.

Angelique Medina:
And follow us on Twitter.

Archana Kesavan:
And follow us on Twitter as well. And then like Angelique was saying, we’re keeping our eyes open and every time we see something interesting that pops up, we’re making a note of it, educating the community, and then we’ll come back with it next week again to give you our insights.

Archana Kesavan:
All right, with that we’ll wrap today’s show. Thanks for listening guys.

Angelique Medina:
Thanks, folks.
