Watch on YouTube – The Internet Report – Ep.4: April 13-19, 2020

In this week’s episode, Archana and I discuss several significant incidents that unfolded over the past week. First, we saw a notable increase in the number of ISP outages occurring across the global Internet—a trend that had seemed to be reversing itself in the previous week.

After going through our usual health check of provider networks, we examined an interesting online banking outage that occurred on April 15th, as millions of Americans attempted to check on the status of their stimulus checks. This caused understandable angst, as many customers were unable to log in or even access banking site landing pages. Finally, we took a look at how popular streaming services like Netflix deliver content, and we weighed in on whether or not Netflix is “breaking the Internet” as some folks have speculated.

Catch up on past episodes of The Internet Report here.

Give this week’s episode a watch or a listen in the embeds provided, grab our slides on Slideshare, and as always, feel free to read along with the transcript below. We’re excited to share that we’re now officially available on iTunes (Apple Podcasts), Spotify, and Stitcher, so be sure to subscribe and leave us a review on your platform of choice. Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

ThousandEyes T-shirt Offer

Show Links:

Listen on Transistor – The Internet Report – Ep.4: April 13-19, 2020

Follow Along with the Transcript

Angelique Medina:
Welcome to The Internet Report. This is a weekly show where we break down all of the interesting incidents that occurred in the previous week. My name is Angelique Medina, and I’m joined by Archana Kesavan, my co-host. We’re going to start first with just giving you a quick rundown of what happened last week and also go through some statistics, which we do every week. Some things made the news last week: there was a Cloudflare issue with their dashboard, and there were some issues with users unable to access their banking information to check on stimulus checks. Then, we also had some pretty significant outages. We’ll run through all of that. But first, we’re going to take a look at some of the numbers that we saw last week. If you recall, last week we noted that outages were down overall and speculated that that was potentially due to some normalizing or stabilizing within the service provider networks or the ISP networks. This week, what we can see is that outages were up from the previous week.

Figure 1: Global vs. U.S. All Network Outages

Archana Kesavan:
So Angelique, to reiterate this, this particular graph is looking at network outages across ISPs, cloud providers, collaboration apps, and then edge infrastructure, as well.

Angelique Medina:
That’s right. Yeah. Something to note and that we mentioned each week is just what we’re actually covering here. It’s more than just ISPs. It’s a number of different provider types, and if we’d look at this from a daily standpoint, we can see that there was some dip in outages that corresponded with that lower number we saw a couple of weeks back. Then, starting on Monday, April 13, you see this jumps right back up to 49 outages and that was something very interesting. We’ll take a look at Monday specifically, even though it wasn’t the peak period during the week in terms of the number of outages. We saw more on the 15th and 16th, for example. Monday was actually much more interesting because it was very widespread and the outage incidents that were happening were very long-lasting—as opposed to the Monday, or sorry, the Tuesday, Wednesday in which they were fairly brief outage events that didn’t seem to have a significant impact.

Figure 2: Daily Outages, Past 3 Weeks

Archana Kesavan:
And also, right there in the weekly trend, the midweek where you see those numbers spike up just a little bit, then the average is on Monday, 51 to 53—they were independent of the banking issue that Angelique was talking about in the beginning where people couldn’t access their financial institutions to check about the stimulus check. This is because the outages that are being reported here are specifically from a network perspective and as you’re going to see later in the show, that’s independent of what was happening with the stimulus check issue.

Angelique Medina:
That’s right. And for those of you who aren’t in the U.S., that’s something that was expected to be delivered on the 15th, which is historically our U.S. tax day, even though that’s been pushed back a little bit this year. But April 15th was really that hotspot day.

Angelique Medina:
Now we can see that, overall, no surprise that the ISPs make up the bulk of outages. There are just so many providers in that category as opposed to cloud service providers or UCaaS providers. It was up for sure over the week of April 6. Now, does this mean that we’re back to where we were at the end of March? It’s not clear. I think what the Internet is showing us is that it’s unpredictable. Maybe this is an indication there’s still some adjustments or changes being made or it could just be that we’ve seen some anomalous activity that’s impacting the number overall.

Figure 3: Global vs. U.S. ISP Outages

Angelique Medina:
In terms of the cloud service providers, they look pretty steady and nothing really interesting to report there. They overall don’t experience a lot of outages in their networks and that was the same for last week, as well.

Figure 4: Global vs. U.S. Public Cloud Network Outages

Angelique Medina:
For the collaboration app providers, outages came down from their peak at the end of March and haven’t really gone up since then. So that’s good.

Figure 5: Global vs. U.S. Collaboration App Network Outages

Angelique Medina:
Now, this is something, I think, in terms of the outage events from a network standpoint, there was something that was really interesting that occurred on Monday that we’re going to look at a little bit more closely. We see here …

Archana Kesavan:
This is the 49 spike that we are going to talk about that we noticed specifically on Monday.

Angelique Medina:
That’s right. On Monday the 13th, we saw that there were a number of ISPs that had an issue. We can see here we have Level 3 Communications, we have NTT, we have Cogent and Shaw, AT&T, and Zayo and GTT, and actually more providers than this that were simultaneously experiencing outage events in their network.

Angelique Medina:
Now, if we look here, we can see that Telia is really the one that seems to have been most impacted by this event. If we look at some of the other markers of this outage, we can see that the duration of the outage is pretty significant, lasting over 30 minutes. It impacted quite a number of locations throughout the United States, including Chicago and Dallas.

Angelique Medina:
We also do see though that other providers had similarly long outages. If you look at Cogent, 38 minutes, 24 minutes for Level 3. This is not just notable from the standpoint of the scope, or the fact that this was occurring simultaneously. Just the sheer time that it took to resolve this issue was pretty significant.

Archana Kesavan:
The other thing that we saw at the same time is that one of these providers actually withdrew their routes from the most heavily impacted ISP here. That withdrawal of routes lasted for about nine hours.

Angelique Medina:
Took about nine hours, yeah. Around these times, there didn’t seem to be anything conclusive in terms of what the root cause was. It’s interesting because this was such a strange outage event that happened at the same time. We were looking at BGP routing to see if there was potentially anything noteworthy there. The only thing that we could see was one of the application providers withdrew their route from Telia. Then, as you mentioned, it didn’t restore it until about nine hours later. It could be that they had potentially made a change in their network, or maybe one of the other ISPs made a change in their network that then impacted their peers, so there was a collateral impact as a result of that.

Archana Kesavan:
But basically, they were just trying to avoid Telia in that particular case.

Angelique Medina:
That’s right. Now, why that is, whether that was coincidental or not, we just don’t know. But certainly, the fact that they were the most impacted by this outage on Monday is something to keep in mind.
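As a rough illustration of what a route withdrawal does to path selection, here is a minimal sketch. It assumes a toy routing table that prefers the shortest AS path, which is only one of several steps in real BGP best-path selection. The `RouteTable` class, the prefix, and the AS numbers are hypothetical; the values come from ranges reserved for documentation, not from the incident discussed here.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

# All prefixes and AS numbers used with this class are assumed to come from
# documentation ranges (RFC 5737, RFC 5398); nothing here is incident data.
Route = Tuple[int, ...]  # AS path, nearest AS first


class RouteTable:
    """Toy routing table: one candidate path per neighbor, shortest path wins."""

    def __init__(self) -> None:
        # prefix -> {neighbor AS: AS path learned from that neighbor}
        self._routes: Dict[str, Dict[int, Route]] = defaultdict(dict)

    def announce(self, prefix: str, neighbor: int, as_path: Route) -> None:
        """Record (or replace) the path a neighbor advertises for a prefix."""
        self._routes[prefix][neighbor] = as_path

    def withdraw(self, prefix: str, neighbor: int) -> None:
        """Drop the path learned from one neighbor; other paths are kept."""
        self._routes[prefix].pop(neighbor, None)

    def best_path(self, prefix: str) -> Optional[Route]:
        """Pick the shortest AS path; None means the prefix is unreachable."""
        candidates = list(self._routes[prefix].values())
        return min(candidates, key=len) if candidates else None
```

With two neighbors announcing the same prefix, withdrawing the route via the preferred neighbor shifts traffic onto the longer path through the other one, which is the "avoiding one provider" effect described above. The prefix only becomes unreachable if every path is withdrawn.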

Archana Kesavan:
Right. And just looking at when we say most impacted that includes not just the number of interfaces that were affected but also their expansion regionally, how widespread that particular outage was within Telia’s network.

Angelique Medina:
That’s right. So if you were to have, say an infrastructure issue, maybe if you had a router that up and died or something, that might affect one location. But the fact that this is simultaneously impacting five cities, East Coast, Texas …

Archana Kesavan:
Chicago …

Angelique Medina:
Yup, it’s interesting. It indicates again that this was likely due to some change that was made by the operator within their environment whether that was an intentional or unintentional change that then potentially also caused some of the issues in the neighboring ISPs that they’re connected to—their peers.

Archana Kesavan:
Possibly cascaded down to the ISP that they were connected with.

Angelique Medina:
That’s right. Yeah, so this was sort of the most notable event last week from a network standpoint. Now, the other thing that happened was this incident with the banks, and them not able to have their users log in and access their stimulus check information. Right?

Archana Kesavan:
Yeah, right. Essentially what you’re seeing here is that across the U.S., users were unable to get into or access their particular financial institution. Those red nodes are basically an indication that there is an issue. The interesting thing, however, is that when you see these issues come up in terms of accessing an application, they can pretty much be anywhere in the stack, right? You see that on the left-hand side here, where it’s split down by phases. What you see here is there is essentially a problem in receiving or logging into the page (this is the log-in page of a particular financial institution that we’re looking at). However, the interesting thing is there was not a problem in the actual network path to get to the edge location of these financial institutions.

Figure 6: Stimulus Check Payout Overwhelms Financial Institutions

Angelique Medina:
That’s right. I mean, if you look at just even the status here by phase, you can see there wasn’t a DNS issue, there wasn’t a connect issue… Typically, when you see that there is no connect issue that means that there isn’t a corresponding network issue, which as you mentioned, we’ll see that in a moment. Connecting to the webserver was not the problem, but getting a response was, and we can see that here. That looks like the network path is clean.
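The phase-by-phase reasoning above can be sketched in a few lines. This is a minimal illustration, assuming three phases (DNS, connect, response) checked in order; the `PhaseResult` type and the layer labels are hypothetical names chosen for this example, not the output of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional

# Phase order is an assumption modeled on the "status by phase" view
# described in the transcript: DNS, connect, then waiting for a response.
PHASES = ("dns", "connect", "response")


@dataclass
class PhaseResult:
    """Outcome of one test run; each field is True if that phase succeeded."""
    dns: bool
    connect: bool
    response: bool


def first_failed_phase(result: PhaseResult) -> Optional[str]:
    """Return the earliest phase that failed, or None if all succeeded."""
    for phase in PHASES:
        if not getattr(result, phase):
            return phase
    return None


def likely_layer(result: PhaseResult) -> str:
    """Map the first failed phase to a rough guess at where the fault sits."""
    failed = first_failed_phase(result)
    if failed is None:
        return "healthy"
    if failed in ("dns", "connect"):
        return "network or name resolution"
    return "application or origin"
```

A run where DNS and connect succeed but no response arrives maps to "application or origin", which matches the pattern described for the banking sites. It is only a first guess and would still need to be checked against the network path data.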

Figure 7: Reachability to CDN Edge NOT the problem

Archana Kesavan:
Right. It’s not surprising that these URLs or these websites are front-ended by a CDN. That’s precisely why CDNs exist. You should use them, and there was no problem in actually getting to that edge, as you see here in the actual network path.

Angelique Medina:
Yeah, so then the question is, was this an issue with the CDN provider maybe not being able, for whatever reason, to serve up the content and was this specific to a particular provider? What we saw was that wasn’t the case. This was fairly widespread, and it impacted a number of different banks that were using many different providers. In some cases, we have the same bank who was using several CDN providers, and all of them were experiencing the same issue simultaneously. This didn’t appear to be impacted or doesn’t appear to be caused by the CDN providers.

Figure 8: Not isolated to a single CDN provider
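One way to check whether failures are isolated to a single CDN provider is to group test results by provider and compare failure rates. The sketch below is a simplified illustration under stated assumptions: the observation format, the function names, and the 50% threshold are all hypothetical choices, not how the analysis in this episode was actually performed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each observation is (site, cdn_provider, succeeded). All names used with
# these helpers are placeholders, not the banks or providers in the episode.
Observation = Tuple[str, str, bool]


def failure_rate_by_provider(observations: Iterable[Observation]) -> Dict[str, float]:
    """Return the fraction of failed tests for each CDN provider."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for _site, provider, ok in observations:
        totals[provider] += 1
        if not ok:
            failures[provider] += 1
    return {provider: failures[provider] / totals[provider] for provider in totals}


def isolated_to_one_provider(observations: List[Observation],
                             threshold: float = 0.5) -> bool:
    """True only if exactly one of several providers is at or above the threshold."""
    rates = failure_rate_by_provider(observations)
    failing = [p for p, rate in rates.items() if rate >= threshold]
    return len(failing) == 1 and len(rates) > 1
```

If the same site fails through every provider it uses, the check returns False, which points away from any single CDN and toward something the providers share, such as the origin.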

Archana Kesavan:
I mean, at this particular point, it kind of overlapped with Cloudflare having an issue, which really impacted their API or the dashboard, not their CDN services. This is not to be confused with Cloudflare’s issue last week. They were completely independent here from what we saw.

Angelique Medina:
Which is, this is really interesting because we discussed this earlier. It’s not the CDN provider that’s the problem, but they’re not able to even serve up, in most instances that we saw, a log-in page. The reason for this, it appears, that they couldn’t even … it’s likely because they weren’t able to revalidate or re-fetch the index files for the banking sites, and because of that, they couldn’t serve up any additional components or web objects from that point on. Even if they were stored with the CDN provider because they couldn’t revalidate that.

Archana Kesavan:
They didn’t really have the intelligence that comes from that index file. It was unavailable. An index file is fetched from the origin, so they didn’t really have any idea on how to even load up the main page.

Angelique Medina:
Now, you could certainly store your index file with the CDN provider, which could potentially have at least prevented this issue with just the log-in page loading. But it appears that this was an issue with the banking origin being overloaded. Even if the log-in page had been able to be loaded by the CDN provider, it does seem like the origin itself was having issues and so may have been a problem either way.
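The revalidation behavior discussed here can be sketched with a toy edge cache. This is a hedged illustration, not a description of how any specific CDN or bank was configured: the `EdgeCache` class, its `serve_stale_on_error` flag, and the `OriginUnavailable` exception are hypothetical. The flag is loosely analogous to the `stale-if-error` cache-control extension defined in RFC 5861.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class OriginUnavailable(Exception):
    """Raised by the origin fetcher when the origin cannot answer."""


@dataclass
class CachedObject:
    body: str
    expires_at: float  # time after which the edge must go back to the origin


class EdgeCache:
    """Toy CDN edge: serves fresh copies locally, otherwise asks the origin."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl: float,
                 serve_stale_on_error: bool = False) -> None:
        self._fetch = fetch_origin
        self._ttl = ttl
        self._serve_stale = serve_stale_on_error
        self._store: Dict[str, CachedObject] = {}

    def get(self, path: str, now: float) -> Optional[str]:
        """Return the body, or None if the edge has nothing it may serve."""
        cached = self._store.get(path)
        if cached and now < cached.expires_at:
            return cached.body  # fresh: no origin contact needed
        try:
            body = self._fetch(path)  # cache miss or expired: go to origin
        except OriginUnavailable:
            if cached and self._serve_stale:
                return cached.body  # stale copy, served only by explicit policy
            return None
        self._store[path] = CachedObject(body, now + self._ttl)
        return body
```

With the default policy, an expired index file plus an overloaded origin leaves the edge with nothing to serve, even though the network path to the edge is healthy. Allowing stale copies would keep the log-in page loading, but as noted above, anything that still depends on the origin (such as the log-in itself) could fail regardless.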

Archana Kesavan:
In this particular case, you were stuck on the first page. If the index file had been local, you would have probably gotten stuck.

Angelique Medina:
You might have gotten further along.

Archana Kesavan:
You may have gotten in.

Angelique Medina:
But you may still run into issues.

Archana Kesavan:
Exactly, exactly, yeah.

Angelique Medina:
That was a pretty significant issue that occurred on the 15th, and it was something that occurred throughout the day at different periods. There were some instances in which the site was available and loaded and then, maybe 30 minutes later, it wouldn’t be. This was something we saw across, not just mid-size banks, we saw even very large banks had this same issue. It was a little surprising to us because large banks typically have pretty, they have a lot of …

Archana Kesavan:
… Robust and redundant architecture …

Angelique Medina:
… You’d think that they would have the scale to manage the inbounds on this. But at least …

Archana Kesavan:
I mean, we were talking about this just before this call, in terms of how much was the load, right?

Angelique Medina:
Right, right. I mean, if you consider the population of the United States, and it was reported that more than 93% of the U.S. population qualified to get at least some part of the stimulus.

Archana Kesavan:
On April 15th.

Angelique Medina:
That’s right. So if you just look at the sheer volume of what we’re looking at here in terms of the population, maybe it isn’t all that surprising. But it is surprising that it wouldn’t necessarily have been anticipated by the banks. I mean, if they were expecting that this was going to happen … why they weren’t prepared, we just don’t know. But this has all happened pretty quickly so maybe it’s not surprising that there is just no way to accommodate this.

Archana Kesavan:
But it wasn’t like different sets of people were logging in at different points in time to kind of spread that load. Customers were probably just …

Angelique Medina:
Waking up in the morning and …

Archana Kesavan:
Right. Then, it kind of unfortunately impacted a lot of the financial institutions that day.

Angelique Medina:
Then, so beyond that, actually one of the things that we were also talking about with reference to the CDN providers is one of the reasons why the CDN providers themselves are so valuable in distributing traffic locally. This came up because we were talking again about Netflix and why this keeps coming up in terms of them being responsible for, maybe, slowing down network performance because people are just streaming more. I guess that’s the idea. But we were talking about this earlier and also again last week. So the way that Netflix distributes their content, they basically run their own CDN infrastructure, and they are in many instances collocated within ISPs and so when you’re streaming …

Archana Kesavan:
They actually have this program called Open Connect, where it’s actually a physical device (and an ISP needs to partner with Netflix to actually do that). It’s a physical device that basically caches all the content that is regional, depending on analytics on what kind of shows that particular region is interested in. It caches all of it. Every time we press Play on Netflix, it doesn’t necessarily go to the origin to fetch all of that data. That content is pretty much really close, within a few miles of where we are located. Sometimes, it can happen that it would pick the most available edge location and can switch midway once it finds the best fit. So sometimes when you press Play and you see a little bit of blur, but then it immediately snaps back and you see that clarity, that’s when, you know, it’s found a better location to feed the stream from.

Angelique Medina:
Right, yeah, I mean, somebody, if your neighbor, even if it’s something that may not be incredibly popular, even if your neighbor were to stream the video and it has to be fetched for the first time from the origin, then it’s cached. Then, that copy is available locally.

Angelique Medina:
Also, I mean, one thing that can be done as well by content providers is they can anticipate when content is going to be popular for a particular demographic or region and they can pre-cache…

Archana Kesavan:
Preview that as well.

Angelique Medina:
Even before it’s requested. By doing that, they are preventing the requests being made to their origin. That keeps the load down across the Internet backbone and transit providers and then makes your experience that much better because basically, you’ll be connecting to a very, very close by caching server.
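The difference between filling a cache on first request and pre-caching ahead of demand can be shown with a small counter. This is a toy sketch under stated assumptions: `RegionalCache` and its methods are hypothetical names for illustration and do not describe how Netflix's caching appliances are actually implemented.

```python
from typing import Dict, Iterable


class RegionalCache:
    """Toy edge cache that counts viewer requests it could not serve locally."""

    def __init__(self) -> None:
        self._titles: Dict[str, bytes] = {}
        self.demand_misses = 0  # plays that had to wait on the origin

    def _fetch_from_origin(self, title: str) -> bytes:
        # Stand-in for a long-haul transfer from the origin.
        return f"video-bytes-for-{title}".encode()

    def prefill(self, titles: Iterable[str]) -> None:
        """Load titles ahead of demand, for example during off-peak hours."""
        for title in titles:
            if title not in self._titles:
                self._titles[title] = self._fetch_from_origin(title)

    def play(self, title: str) -> bytes:
        """Serve locally if present; otherwise fetch once and keep the copy."""
        if title not in self._titles:
            self.demand_misses += 1
            self._titles[title] = self._fetch_from_origin(title)
        return self._titles[title]
```

Without pre-caching, the first viewer in a region triggers one origin fetch and everyone after that is served locally. With `prefill`, even that first viewer is served from the local copy, and the origin transfer can be scheduled for a quieter time.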

Archana Kesavan:
The reason this comes up a lot of times is because Netflix did take some steps in terms of reducing the bandwidth across their video streams. They said they’re not going to compromise on the resolution. If you’ve subscribed for a 4K HD stream, you’re going to get it. But the bandwidth is going to get reduced by 25%.

Angelique Medina:
They said something like lowering the bit rate or something like that.

Archana Kesavan:
Yeah, it’s lowering the bit rate, and my guess is the way they’re actually doing that is lowering the frames per second for that particular resolution. Our eye doesn’t necessarily detect that 25% reduction in the frames per second, so you’re probably not seeing the actual impact of it. The only point to add to the whole conversation about the CDNs and the edge being able to stream this data is that even if it does help reduce the bandwidth, it’s really for the last mile or very short distance. It’s not necessarily the entire backbone of the Internet that would see the impact of this particular change.
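To put the 25% figure in concrete terms, here is a short worked calculation. The 16 Mbps starting bitrate is purely a hypothetical value chosen to make the arithmetic easy to follow; it is not a figure from the episode or from Netflix.

```python
def gigabytes_per_hour(bitrate_mbps: float) -> float:
    """Convert a stream bitrate in megabits/s to decimal gigabytes per hour."""
    megabits = bitrate_mbps * 3600  # seconds in an hour
    return megabits / 8 / 1000      # bits -> bytes, then MB -> GB


def reduced_bitrate(bitrate_mbps: float, reduction: float = 0.25) -> float:
    """Apply a fractional reduction (0.25 means 25% lower) to a bitrate."""
    return bitrate_mbps * (1 - reduction)


# Hypothetical example: a 16 Mbps stream before and after a 25% cut.
before = gigabytes_per_hour(16.0)                  # 7.2 GB per hour
after = gigabytes_per_hour(reduced_bitrate(16.0))  # 5.4 GB per hour
```

Under that assumed starting point, each hour of viewing moves about 1.8 GB less data. As noted above, when the stream is served from a nearby cache, that saving mostly shows up on the last mile rather than across the backbone.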

Angelique Medina:
From the standpoint of usage, I mean, a lot of the ISPs have even said that the peak periods of time in which Netflix usage is up, which is typically in the evenings, that hasn’t really changed. Yes, people are using the Internet more, but from the standpoint of streaming Netflix content or other video content, that’s still within the same window of time and the broadband providers have built out their networks to accommodate that volume of downstream network usage. This is coming locally down to them and so that really isn’t something that’s changed all that much with what happened.

Angelique Medina:
I think part of where this came up was the European commissioner, some official agency or something in Europe, was asking Netflix to make some changes in how they deliver content. What’s interesting about this is that this was not based on any measured or real problem.

Archana Kesavan:
They didn’t run into a problem.

Angelique Medina:
The commissioner was basically like, “We think you’re going to be a problem and so please do something.” Of course, Netflix is going to be like, “Well, yes, we’re going to do what we can, even though it wasn’t necessarily a legitimate concern.” So why they took it upon themselves to ask this, it’s not clear. They probably didn’t know how they work.

Archana Kesavan:
It’s also that, we’ve not seen this policy or this change implemented in the U.S., for instance. It’s been specific to EU, maybe I’m speculating here, is that their last-mile connection, for instance, might not be as up to date or as good as what you find in the U.S. There could be some parts of EU where …

Angelique Medina:
Possibly.

Archana Kesavan:
… that’s not tight connectivity. It’s not upgraded.

Angelique Medina:
Right. The European telco market seems to be more fragmented. There are many more smaller providers, or providers just doing business locally within a particular country. Because of that, there’s a lot of variation in quality. It may just be that they had brought this up just to preempt any issue that could happen.

Archana Kesavan:
It would be interesting, in a few more weeks, to see if there’s actually any data that shows what this reduced or what it did. We haven’t seen anything so far, and we don’t think it should have impacted things that much, but if there is anything out there, that would be interesting to take a look at.

Angelique Medina:
Yeah. And from the standpoint of the U.S., I think it’s not something in most areas of the country that should be a cause for concern.

Archana Kesavan:
Yeah, exactly.

Angelique Medina:
Cool.

Archana Kesavan:
That brings us to the …

Archana Kesavan:
The end of the show. A lot of things happened last week. The outage on Monday, April 13th, was not just extended from a time period perspective, but the intensity of it was also pretty deep or heavy compared to the other outages we’ve probably seen. April 15th was the stimulus check issue that we just went through. But yeah, stay tuned, and if you’d like to be aware of the outages that are happening and get a little bit deeper dive into them, follow us on our blog. Feel free to leave us a review. Rate our show in here. If you email that particular email address there, you’re actually going to get a free T-shirt as well. So don’t forget to do that. We have some pretty fun T-shirts. The one that you see there right now is our newest addition, which we released last week. Definitely feel free to email Eileen there to get your T-shirt and rate us.

Angelique Medina:
All right, that’s our show.

Archana Kesavan:
We’ll see you guys next week.
