Watch on YouTube – The Internet Report – Ep.3: April 6-12, 2020

In this week’s episode, Archana and I welcomed David Belson (@dbelson) of the Internet Society. We got to discuss some rather good news — overall outage events are down more than 40% globally, and more than 44% in the U.S. after a several-week-long spike in events. We very well may be looking at our ‘new normal’.

After going through our usual health check of ISPs, public cloud provider and UCaaS provider networks, we talked a bit about some of the network-related issues going on with some of the state unemployment sites, digging into several network disruptions impacting New York’s site specifically. Finally, David shared some interesting events he’s been tracking both at the Internet Society and for his personal Internet Disruptions (@InternetDsrptns) project.

Give this week’s episode a watch or a listen in the embeds above, grab our slides on Slideshare, and as always, feel free to read along with the transcript below. We’re excited to share that we’re officially on Apple Podcasts (iTunes), Spotify, and Stitcher now too, so be sure to subscribe and leave us a review on your platform of choice. Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.


Follow Along with the Transcript

Angelique Medina:
Welcome to the Internet Report. I’m Angelique Medina and as usual I’m joined by Archana Kesavan, and we also have a special guest today, David Belson. David Belson is the Senior Director of Internet Research and Analysis at the Internet Society, where he focuses on efforts around Internet measurement, Internet shutdowns and understanding market trends and how they impact the growth of the Internet around the globe. He has 25 plus years of experience in the Internet infrastructure space. He spent a lot of time at Akamai where he launched the State of the Internet Report Series, and he also publishes a blog called the Internet Disruption Report where he provides aggregated coverage of Internet disruptions taking place in countries around the world, including the causes of these disruptions.

Angelique Medina:
So we’re really excited to have him join us today. So as usual, we’re going to start by talking about some of the overall trends that we’re seeing in terms of network availability across ISPs and cloud providers and UCaaS providers. And then we’ll talk about some of the interesting events that have occurred over the last week, including some disruptions to some unemployment sites that we’ve seen occurring over the last few weeks. And then we’ll touch on some of the interesting outage events that David has seen in his own work and that he’s compiling. So with that, we’re going to start by looking at some of these trends.

Global Network Outage Events
Figure 1: Global Network Outages Down Significantly

Angelique Medina:
And here, we’ll just recap a little bit of what we saw the last few weeks. So overall good news, we saw towards the end of March, so around March 16th it started this slight increase and then over the last week of March into April, held pretty steady, but then last week we saw a pretty significant decline in the number of outages. So this is actually interesting because it’s in line with what we talked about last week Archana where we were speculating that the increase that we were seeing in outages didn’t appear to be related to congestion because they were longer than what we would typically see under congestion events, and they were simultaneously in many instances happening over more locations, so they were more widespread. And that seemed to indicate that they were related to maybe state changes that were being made. So either peering changes or configuration changes that were then leading to these incidents. So this is good news then, right?

Archana Kesavan:
Yeah, totally, and also it looks like it was indeed those optimizations that were being done internally within the provider networks, and the peak of traffic that’s been going up has probably stabilized too. At this point, everybody who was working from home is working from home, right? So the watermark has completely changed like you were talking about earlier Angelique and that’s one of the reasons. So you’ve seen the highest level of traffic, you’ve made all the changes that you need to from a provider perspective to accommodate for that. So that probably explains the drop off that we are seeing here.
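
The reasoning in this exchange (longer-lived outages seen across more locations look less like congestion and more like state changes) can be sketched as a simple heuristic. This is only a minimal illustration: the thresholds, the `OutageEvent` type, and the `likely_cause` function are hypothetical names and values, since the episode gives no specific numbers.

```python
from dataclasses import dataclass

# Hypothetical thresholds chosen only for illustration; the episode gives no numbers.
SHORT_OUTAGE_MINUTES = 10
FEW_LOCATIONS = 3


@dataclass
class OutageEvent:
    duration_minutes: float   # how long the outage lasted
    affected_locations: int   # how many locations saw it at the same time


def likely_cause(event: OutageEvent) -> str:
    """Rough guess: long, widespread outages look more like state changes
    (peering or configuration) than like congestion."""
    long_lived = event.duration_minutes > SHORT_OUTAGE_MINUTES
    widespread = event.affected_locations > FEW_LOCATIONS
    if long_lived and widespread:
        return "possible state change (peering or configuration)"
    if not long_lived and not widespread:
        return "possible congestion"
    return "inconclusive"


print(likely_cause(OutageEvent(duration_minutes=45, affected_locations=12)))
print(likely_cause(OutageEvent(duration_minutes=4, affected_locations=1)))
```

A real classification would need far more context than two numbers; the sketch only mirrors the rule of thumb described in the conversation.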

Angelique Medina:
I think David you had said that you were tracking some outage, sorry, traffic statistics from the IXPs and you had seen also this increase in traffic levels and then a leveling, if you will.

David Belson:
Right. Yeah. So looking at IXP graphs across a number of continents and some of the graphs that the big CDN providers published recently. And yeah, it’s very much that model where there’s the peak that comes right around the time that people are asked to stay at home, the isolation events, and then it generally tails off a little bit after that and has hit this new steady state that the providers are obviously able to handle, but it’s a little bit higher than it was in general, in February let’s say, a month before.

Angelique Medina:
Right. So this seems to support the notion that there were a few growing pains in getting to that point of supporting these increased traffic levels, but now it looks like we’re getting to a point where that’s the new normal as you put it, and it seems like the providers are generally keeping up with the level of traffic. So that’s good to see.

David Belson:
Right. And I think the application providers have been doing a lot of work scaling out in many cases accelerating their year long scaling plans into a week or two. I know that a lot of the IXPs have gone to their participants with offers that basically say like, “Hey, whatever you need, it’s low cost, it’s no cost, basically peer with whoever you can at whatever rates you can.” So there’s been a lot of work done there to help I think either minimize or limit those bottlenecks.

Angelique Medina:
Yeah, that’s a good point because this was something we touched on very briefly last week, which is the level of collaboration that’s been occurring across providers has been really great to see. I mean, there really has been a pulling together where everybody’s like, we’re going to do this, we’re going to get through it. So that’s been really interesting.

David Belson:
Absolutely.

Archana Kesavan:
What I’ve also heard is between just application providers and ISPs is a lot of cost sharing models that are in play as well as they’re enhancing their bandwidth. So definitely everybody is coming together to make sure there’s enough capacity in there.

Global US Network Outage Events
Figure 2: U.S. Network Outages Down, Too

Angelique Medina:
Yeah, so we see the same drop in outages in the US as well, that’s in line with what we see globally, so that’s good. And then in terms of the ISPs, again, same downward trend that we see dropped pretty substantially from March levels. So really very similar to what we saw in late February. So this is starting to look like what we see normally, and again there are outages that happen across the Internet on a normal basis, right? You’re never going to get to this zero level, that’s not what normal looks like, normal is there are things that happen, whether it’s routing related or infrastructure related or security related, like you might see with a DDoS attack. This stuff is pretty normal. So this is looking good overall.

ISP Outages Global
Figure 3: ISP Outages Down Globally
Global US ISP Outages
Figure 4: U.S. ISP Outages Down Nearly 50%

Angelique Medina:
And then in the US of course, again, pretty substantial decline in the number of outages. Cloud providers, we haven’t really seen a lot of impact on them over even the March period where we’re seeing a steady rise in overall outages across various providers. Cloud providers have seemed fairly immune to some of the recent issues. And one of the reasons we talked about is the fact that they run massive global networks, in many cases they have fairly extensive edges. In particular Microsoft and Google, they tend to have users enter their network very quickly and so they have a lot of capacity to absorb traffic surges, that’s what they built their network for. So they’ve fared very well during this period, and we continue to see that these are pretty normal levels overall, very low.

Global US Public Cloud Network Outages
Figure 5: Cloud Provider Network Outages Lessen Overall

Angelique Medina:
And then the collaboration app providers in terms of their network, so not the application itself, but in terms of their network and issues that we’re seeing there. We historically really have barely seen or hardly ever see outage events in the UCaaS provider networks, but that wasn’t the case in March, there were periods throughout March where there was a spike in outages, which is really unusual. And we had speculated a few weeks back that this was probably something where they were adjusting to just the… Like there’s no way that you can plan for the surge that they’ve experienced. I mean, it’s incredible. I mean, millions and millions of call minutes …

Collaboration App Network Outages Global
Figure 6: Collaboration App Vendor Networks Stabilize

David Belson:
200 million users all at once.

Angelique Medina:
Yeah, I mean, it’s pretty incredible. I think Zoom went from like 20… I don’t know the exact number, but it was like 22 …

David Belson:
Yeah, tens of millions to like hundreds of millions.

Angelique Medina:
Tens to hundreds of millions, so just an incredible traffic growth in a space of a couple of weeks, and so not surprising that you’d see some issues crop up, but again that seems to have gone down. So hopefully we’re reaching that stability point.

Global US Collaboration App Network Outages
Figure 7: But 100% of Outages Are in the U.S.

Archana Kesavan:
It responded pretty well from 29. I mean, that’s a big drop this week, so hopefully that stays or goes down.

Angelique Medina:
Yep, yep. So again, steady decline. Most of them, or all of them it looks like, were in the US last week, but it’s going in a good direction.

David Belson:
I wonder if that’s a function of most of those providers being based in the US?

Angelique Medina:
That’s probably what it is.

David Belson:
They may have global infrastructure but…

Angelique Medina:
Yeah, they have more. I would think they would have more points of presence in the US, and so you’re potentially going to be, depending on where you’re located, routed to a POP that’s in the United States.

Archana Kesavan:
Also, I’m just wondering if this might be related to last week being a shorter week because of the Easter holidays.

Angelique Medina:
Oh, yeah, interesting.

Archana Kesavan:
When we see a typical week, next Monday, we’ll be able to validate it better whether this is accurate or not, but definitely a lot of folks have been off Thursday, Friday.

Angelique Medina:
Interesting.

David Belson:
I know in places today, today’s a holiday as well.

Angelique Medina:
That’s right. That’s right, yeah. Yeah. I didn’t realize it was in the US. I know that in Europe and Australia, they had those days off, so yeah. Yeah, that’ll be interesting to see next week or this week, next week. So that’s effectively just a highlight of what happened last week and some trends that we saw. Another really interesting thing that we looked at, and also went back a few weeks looking at, was some issues that were cropping up with users trying to connect to the New York Labor Department’s unemployment site. And that was particularly interesting because it turned out that this did appear to be related to congestion within their network.

Archana Kesavan:
Their network, yeah. Yeah. So what you’re seeing here, and then this is again traffic patterns and the impact of traffic that we’ve been seeing all the way up from March to, it’s ongoing to last week. So what you see here specifically for the New York Labor site is that the availability had these trends, and these trends really mapped the time of day. So starting Monday to Friday you see some interruption with respect to just accessing the website itself. This was actually the website where you apply for unemployment, and over the weekend there’s stability. And then again March 29th you start seeing these time of day dips again, so definitely there’s a trend that we’re seeing. Last week, however, this came down a little bit compared to the previous two weeks just from an availability perspective. But still there’s definitely some kind of disruption going on, mapping again to time of day.

Archana Kesavan:
So during the week people are filing. More people are accessing the website to file these claims and then that’s resulting in this. But I think what was very interesting for us to look at is this really maps to an overall end-to-end packet loss, which coincided with that time of day pattern that we were seeing from Monday to Friday with respect to accessing the site. But when it came down to where this congestion was, or where this packet loss was, it was not necessarily on the Internet per se. It was actually within the location where New York’s unemployment website itself is hosted. So it’s within their own data center, where they are hosting this application, that we started seeing this increasing packet loss.

Archana Kesavan:
This was the week of last week to about April 1st, and then again over here, when we come down to April 6th, we start seeing very similar patterns as well. But these were all indications that the congestion was within the data center, or where the application itself was hosted. And if you looked at some of the news articles that were going around because of this disruption last week, there was mention of increasing capacity within the network. And this probably is capacity from an application perspective on the server, as well as capacity within the internal network, in the routers or switches that are involved there.
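
The weekday versus weekend pattern described here can be checked by bucketing packet loss samples by day type. The following is a minimal sketch with made-up sample values; the `loss_by_day_type` function and the data are assumptions for illustration, not measurements from the episode.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def loss_by_day_type(samples):
    """samples: iterable of (timestamp, loss_percent) pairs.
    Returns the mean packet loss for weekdays and for weekends."""
    buckets = defaultdict(list)
    for ts, loss in samples:
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        key = "weekend" if ts.weekday() >= 5 else "weekday"
        buckets[key].append(loss)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up samples, for illustration only.
samples = [
    (datetime(2020, 4, 6, 11), 12.0),   # Monday
    (datetime(2020, 4, 8, 11), 9.0),    # Wednesday
    (datetime(2020, 4, 11, 11), 0.5),   # Saturday
    (datetime(2020, 4, 12, 11), 0.0),   # Sunday
]

print(loss_by_day_type(samples))
```

A large gap between the two averages would be consistent with load-driven congestion, although it would not by itself show where along the path the loss occurs.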

Angelique Medina:
Yeah, what’s interesting, there’s a couple of things. I mean, from a pattern standpoint, it really does map to the advice that’s been given to have certain folks with last names having been given a designated day of the week to apply, and those days are Monday through Wednesday. And then Thursday, Friday and Saturday are sort of wildcard days. But we’re really seeing Monday, Wednesday, Thursday, and then a little bit on Friday is where you see those heavy peaks, and not so much on a Saturday, which maybe people are trying to follow the advice of just filing on those specific days, but there’s less traffic it appears or congestion on Saturday.

Archana Kesavan:
On the weekends.

Angelique Medina:
And over the weekend, yeah. They didn’t specify Sunday. I’m not sure why, but I assume you can also apply on a Sunday. The other thing that’s interesting is that they’re really just hosted out of a single data center and they’re not using any public cloud services or a CDN to front end their site, which could make it harder to very quickly absorb the level of traffic that they would have seen over the last few weeks.

Archana Kesavan:
For me, not having moved to a public cloud, that’s not surprising because these are government and federal agencies, so they’re maybe a little slower in terms of the actual, quote unquote, cloud migration. But the front ending without a CDN, that’s definitely surprising. And we noticed this pattern across a couple of other unemployment sites, as well.

Angelique Medina:
California and some… Yep.

Archana Kesavan:
Yeah.

David Belson:
My guess there also is that… I guess it’s twofold. One is that they are effectively in the October, early November period of Black Friday shopping where you don’t want to touch the site no matter what at this point, A. And then B, you’re dealing with state purchasing processes. So yeah, at this point they all probably would love to clamber on to Amazon Cloud or Google Cloud or Oracle Cloud, whatever, but again, A, they’re on a lockdown for the infrastructure, but B, they probably have to go through a multi-month procurement process that would get them there.

Angelique Medina:
Yup.

Archana Kesavan:
It will be interesting to see after we come out of Covid, how some of these sites do end up changing. So over time, if we can keep track of their connectivity architecture, maybe they will move to the cloud, maybe they will have CDNs.

David Belson:
It would be interesting to also be able to understand the backends on these as well. So I mean, in this case it looks like the traffic might not have even been getting to the backend because of the congestion on the front end. But New Jersey had put out that call for COBOL programmers, and I’m guessing they’re not the only one, they’re just the first ones to have raised their hand. And I’m kicking myself for taking Pascal in 10th grade instead of COBOL. I could have a side gig here.

Angelique Medina:
Yeah, that’s really interesting because it seems like… And we’ve seen this issue with other sites that have seen a pretty significant surge as well, which is that whether it’s a legacy network or hosting or application architecture, that’s where we’re seeing a lot more pain. In some instances, we’ve seen providers who are getting an increase in traffic, and maybe they’re handling the traffic okay …

Angelique Medina:
…but their application response time is really slow, and these are often companies that have been in business a lot longer and probably have more technical debt. And so this really just points to the real impact of your architectural choices, whether they be network or application-related. And cloud providers and other services like CDN providers are kind of the real winners here. And so, it’d be interesting to see if this perhaps triggers, as you mentioned, some kind of, not to use a marketing term, but digital transformation. Yes, I said it. But that they start to rethink their posture, if you will, realizing just how little agility they have to respond to changes in need. So, if there’s any good outcome from this, that would certainly be one of them.

Angelique Medina:
Right. And it’s definitely not that the Internet is congested, right? And this reiterates the fact that the Internet is actually able to handle it. We didn’t see any issues getting to that data center where the application’s hosted, but that congestion is in smaller parts, where the application’s hosted and operated.

Angelique Medina:
These have all been very much site or application specific, whether the issues have been related to the network or to the application itself. And that’s somewhat good news because that means that where we are seeing issues, they’re relatively contained versus something widespread or systemic. So that’s good. And I know that you’ve been tracking some interesting things that happened last week, David.

David Belson:
Yeah. And to the point you just made, that was… I published a blog post about Internet resilience for the Internet Society at the end of February, kind of before all of this really hit. And there was some discussion of will the Internet be able to handle all this? And one of the points I’d made in the post was that, “Yes, we expect the Internet to be able to handle all of this,” which I think you guys are showing, but that where we expected to see the problems was most likely going to be on the application side. And I think that’s bearing out, like you just said.

David Belson:
What I’m doing… So a few things. One is, as we talked about earlier, I’ve been looking at a lot of the articles and blog posts from the various providers and looking at data from IXPs and the network providers, trying to understand what changes they have seen as a result of the shift to working from home. Have they been able to handle it? And by and large, the answer has been yes. Again, as you’ve pointed out, there have been some hiccups, more on the platform sides and a little bit, I think, also as we expected on some of the first mile of connections. I think we’re all privileged enough to have high-speed Internet access, which is great and it’s probably less likely to become congested. But when you have a more restricted home Internet connection or you’re reliant on mobile tethering, it’s going to be hard to do the streaming plus the video conferencing plus whatever else you need to do.

David Belson:
So I’m compiling a lot of that data together to do a set of follow-up blog posts about Internet resilience. And then mostly what I’ve been looking at is part of the Internet Disruption Report. I look at things at a more aggregate level, generally, more of a country level. So we’re working on putting together the March report and there were a couple of major things that stood out there. One was seven separate power outages in Venezuela that impacted connectivity there. As I’ve found writing the report over the last year and doing similar work prior to that, there are a handful of common root causes and Venezuela’s power shortage seems to be particularly problematic. In past months, there’s been one or two power outages that have been significant enough to impact connectivity. In March though, there were seven. So we’ll be looking at that and what parts of the country it impacted.

David Belson:
And then submarine cables and problems with those cables have also been a common factor. So two that stood out in March were the SAT-3 cable, where problems impacted connectivity in Angola and Gabon. And then there was apparently what was called a dragon storm in Egypt, I guess, which was a pretty significant storm that created problems with the EASSy cable, E-A-S-S-Y, assuming I’m pronouncing that correctly, which had a measured impact on, I think, up to 10 countries, some generally more significant than others. And then month to date in April, over the last two weeks, the ACE cable, Africa Coast to Europe, had some problems earlier in the month and that’s caused connectivity problems in Mauritania and Liberia. And then the SEA-ME-WE 4 cable, or SMW 4, which has historically had a lot of problems as well, had more issues and impacted connectivity to about five or six other countries. So I think here you’ve gone into a lot more detail looking at more specific issues. What I’ve been tracking has been more aggregate, more higher level, and I think harder to… What’s the word? Not authenticate.

Angelique Medina:
Just sort of know what the actual cause is. Yeah, yeah.

David Belson:
Well, not so much the user impact, but the attribution. That’s the word I’m looking for. So we can see through measurement that, okay, there are issues in a particular set of countries and you have to start looking for the commonalities and say, “Oh, these are all connected to this given cable.” Sometimes an affected provider will say something and they’ll post something that says, “Oh, our upstream provider is having problems,” or “There’s been a cut on the SMW 4 cable or the ACE cable that’s created problems for our subscribers. We sincerely apologize,” and so on. But in many cases the consortiums that own the cables don’t say anything. In many cases the providers don’t say anything. So it becomes really challenging to definitively attribute the problems that we see.
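
The "look for the commonalities" step described here can be sketched as an overlap check between the set of affected countries and the countries associated with each cable. This is a minimal sketch under stated assumptions: the `CABLE_COUNTRIES` table is deliberately incomplete and lists only the associations mentioned in this episode, and `rank_candidate_cables` is a hypothetical helper, not a description of how the Internet Disruption Report is actually produced.

```python
# Illustrative and incomplete: only the country associations mentioned in this
# episode, not a full list of landing points for either cable.
CABLE_COUNTRIES = {
    "SAT-3": {"Angola", "Gabon"},
    "ACE": {"Mauritania", "Liberia"},
}


def rank_candidate_cables(affected, cable_countries=CABLE_COUNTRIES):
    """Rank cables by the fraction of their listed countries that are affected."""
    scores = []
    for cable, countries in cable_countries.items():
        overlap = affected & countries
        if overlap:
            scores.append((cable, len(overlap) / len(countries)))
    # Highest overlap first
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical observation: disruptions measured in these three countries.
print(rank_candidate_cables({"Mauritania", "Liberia", "Angola"}))
```

A high overlap only suggests a candidate. As the discussion notes, definitive attribution usually still depends on a statement from a provider or the cable's owners.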

Archana Kesavan:
The Venezuela outages that you were talking about in March, the substantially high number, do you see that related to the situation we are in? Or are they just independent occurrences?

David Belson:
They’re independent occurrences. I mean, Venezuela has had historically problematic power infrastructure and I think just in March had power outages that had a sufficient impact on Internet connectivity. If you look through the Twitter feed for… I can’t remember the name of the power company there, but they have a pretty active Twitter feed. And I mean pretty much like every day there’s something. “We’re dispatching crews here, dispatching crews there, to fix this issue, to fix that issue.” So I think they’re in a constant state of repair. Not all of the power problems there though have a widespread impact on Internet connectivity.

Archana Kesavan:
Got it. Got…

David Belson:
And one of the interesting things too is that we see oftentimes power outages are hard to detect through Internet monitoring because in many cases the end points that are being measured are within data centers and generally the data centers will have backup power, they’ll have better power. So it’s when you start seeing traffic loss from subscriber connections and when you start seeing measurements to end points within the subscriber ISPs failing, that’s generally when it starts to show up in these types of reports.

Archana Kesavan:
Got it. Okay.

Angelique Medina:
Interesting. Yeah. You’ve been doing this for a while in terms of compiling a lot of these incidents and kind of collecting them and documenting, which I think over time is really helpful because it helps people to understand what baseline is. That there are a variety of different issues that can happen. And to your point, in the case of Venezuela, this is fairly common and not necessarily attributable to any political climate or anything like that.

Angelique Medina:
And the Internet is so vast, that having different angles on what’s going on, whether it be at the infrastructure level or even just the cable level, the pipes and understanding that, or even the vantage that, for example, maybe the CDN providers or the cloud providers have. Because every one of them has a slightly different view or picture of what’s going on, and the same with all the IXPs, and no one of these entities can see the whole picture. It’s really just through some compositing, if that’s a word, of all of these data sets that a picture starts to emerge of what the overall state of the Internet is, if you will.

David Belson:
Yeah. No, absolutely. That’s absolutely true. One of the other projects I’m working on at the Internet Society is around … They’ve put forth eight project areas that they’re focusing on for 2020, so I’m involved with the Measuring the Internet project, surprise, surprise. And focusing specifically on a use case around Internet shutdowns. That is, one of the things we’re trying to do is work with partners to aggregate these different vantage points to really help understand, okay, this set of measurements saw a problem in a given country at a given time, but this set didn’t, so was it really a shutdown? Or was it something else? And then looking at some of these events, some of these structured events, they’re definitely more evident in certain types of measurements than others.

Angelique Medina:
Yeah, that’s interesting. Maybe, I think it’d be helpful to give us an overview of what the mandate is of the Internet Society. Because you mention this work around disruption. Is it because they want to track issues around Internet sovereignty or censorship? Or is there a broader agenda and what is sort of the mandate of the Internet Society overall?

David Belson:
Sure. Our aim is that the Internet is for everyone. Making sure that, no, I shouldn’t say making sure, but helping to drive towards making sure that everybody has Internet connectivity and usable Internet connectivity. And making sure that the Internet remains open, globally connected, secure and trustworthy. And those are, I think, four key areas that we focus the project areas on, that we focus the advocacy work and the policy work on. We do a lot of work around IXP deployment. We do a lot of work around community network deployment. Those unserved and underserved areas, trying to bring them onto the Internet. Doing a lot of work around encryption, time security. The project I’m involved with, measuring the Internet. We’ve broken it out into four use cases, but really ultimately, trying to ensure that the Internet is available and is usable for as many people as possible.

Angelique Medina:
Yeah. That’s great. Well then I guess, maybe the last thing we can cover here is your thoughts on how the recent events could potentially impact rulings around net neutrality going forward, given that the Internet has kind of been seen as very essential to everyone recently. I don’t know if you have any thoughts on that, if there’s thinking that, because I think in the last ruling that they basically said it’s not really a utility, but can the case be made now that it is actually quite essential?

David Belson:
I think the case can absolutely be made that it’s essential. I’m going to actually sidestep the question of net neutrality because that is a rat hole that I don’t want to go in.

Angelique Medina:
Fair, fair.

David Belson:
Having said that, I think that the current events hopefully will do more to get the policymakers around the world to help close the digital divide. Again, there are a lot of places even in the US where high speed broadband is either not affordable or not available. You have kids sitting in cars outside of school buildings or libraries trying to snag the wifi signal just to get their homework done. Internationally, in many countries, the Internet infrastructure, especially the last mile infrastructure, is just insufficient. And hopefully there can be a lot more investment made in really bringing that up to a usable level and in creating environments that are sufficiently competitive where the cost of getting online is much more reasonable.

Angelique Medina:
Well I think that’s a great place to end the show today. Archana, why don’t you take us out?

Archana Kesavan:
Sure. David, thank you so much for being on the show. Great hearing from you and your perspective of how the Internet is doing. And again, Angelique, it’ll be very interesting to see next week how the ISPs do and how the UCaaS providers do as well.

Angelique Medina:
Absolutely.

Archana Kesavan:
Hopefully we’ll have another outage and we can talk about it again. Every week there seems to be an interesting outage that we’re covering.

David Belson:
Cloud security.

Archana Kesavan:
Yeah. But I think the communities, these outages, happen with or without regulating.

Angelique Medina:
Absolutely.

Archana Kesavan:
That’s the key to remember. We’ve seen these outages before COVID. We are seeing them during and we will continue to see them after the situation’s passed. With that, if you’re interested in hearing more about how we understand outages, how we dissect that, feel free to just subscribe to our blog, blog.thousandeyes.com or follow us on Twitter. Until then, have a safe week and we’ll see you next week guys.
