Watch on YouTube – The Internet Report – Ep. 7: May 4 – May 10, 2020

On this week’s episode of The Internet Report, Archana and I cover some notable news and incidents from last week. We discuss a Facebook SDK outage that had ripple effects on other popular services that leverage its log-in functionality, including Spotify and TikTok. We also discuss a strongly worded blog post from AWS on the JEDI contract awarded to Microsoft, and share highlights from our quarterly update of the 2019-2020 Cloud Performance Benchmark, including performance changes to AWS’ Global Accelerator.

We’re also joined by Arash Molavi Kakhki, the lead Internet researcher here at ThousandEyes. Arash shares his insight into how we define and detect outages and how packet loss and network latency can impact end-user experience in various ways depending on the application. Finally, we take a look at the outage numbers from last week for ISP, public cloud, and collaboration app provider networks.

Give this week’s episode a watch or a listen in the embeds provided, grab our slides on Slideshare and, as always, feel free to read along with the transcript below. We’re also available on iTunes (Apple podcast), Castbox, Google Podcasts, Spotify, and Stitcher, so be sure to subscribe and leave us a review on your platform of choice. Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Listen on Transistor – The Internet Report – Ep. 7: May 4 – May 10, 2020

Follow Along with the Transcript

Angelique Medina:
Welcome to The Internet Report, where we unpack all of the interesting incidents and events that have taken place on the Internet the previous week. As usual, I’m joined by Archana Kesavan, my co-host.

Archana Kesavan:
Hey, guys.

Angelique Medina:
And I’m Angelique Medina. We have a really great guest this week. So we’re joined by Arash Molavi.

Arash Molavi:
Hi.

Angelique Medina:
And he is the lead researcher at ThousandEyes. So he came to us from Northeastern University, where he got his Ph.D. in computer science and also worked on transport protocol performance, as well as net neutrality. And he’s going to talk a little bit about some of the work that’s been done around outage detection. So really excited to have him on.

Angelique Medina:
Just before we get started working through some of the major events that happened last week and sharing some of the stats, I wanted to just make sure that everyone knew where to subscribe to the show. So we’re on YouTube, and you can also subscribe to us via Apple Podcasts, Stitcher, Google, Spotify. So really anywhere where you get your podcast coverage. So with that, also we’re really excited to announce that we’re going to be holding a live event in June. So June 18th, we’re going to be hosting an event on the state of the Internet. So if you have some ideas on maybe some talks or some particular individual or topic that you would like to see covered in that event, feel free to reach out to us at internetreport@thousandeyes.com.

Angelique Medina:
So with that, let’s just get into some of the interesting incidents that happened. And I think the one that most folks have probably heard about is this Facebook SDK issue that brought down many different applications, including Spotify and TikTok. And so basically what was happening was, a lot of these applications have… Some of their internal architecture relies on Facebook, and because Facebook was having this issue, it was basically causing their applications to effectively not work. And so this happened on May 6th. And it certainly got a lot of attention, particularly with TikTok and some of the others.

Angelique Medina:
And I think that you had also kind of mentioned something earlier, Arash, about the fact that a lot of these applications use Facebook to effectively sign into the application, but it may have been a broader issue.

Arash Molavi:
Yes. So basically, from the news, apparently, the part of the SDK that was failing was… A lot of these applications use this SDK to allow users to sign in through Facebook. And there was an issue with the latest update that caused the apps to crash. And one thing that I wanted to mention is, in a lot of the cases, even if the user is not explicitly using Facebook to log into TikTok, or Spotify, or those apps, just the fact that this SDK’s being used in the app, that could cause the app to crash. So you don’t have to be using Facebook to log in to necessarily be affected by this.

Angelique Medina:
Right, right. Yeah. So this was application related, wasn’t related to anything that was taking place on the network. And then once that got addressed by Facebook, that fixed the issue. But it’s interesting that a lot of the applications apparently have sort of this dependency on Facebook, to…

Arash Molavi:
Yes.

Angelique Medina:
That’s maybe not a single point of failure, but certainly, there’s a pretty broad impact if something happens on the Facebook side. So that was certainly interesting. It also kind of speaks to the fact that applications today are built around different services and have a whole ecosystem built around them, right? So if one of these external services, in this case, Facebook, isn’t available or is having an issue, that impacts you, right?

Archana Kesavan:
I think what was really interesting is a point Arash mentioned, that it doesn’t have to be a live interaction with the kit itself. Even if you had it and it’s embedded into your application, then you’re going to fail. That was kind of… The impact of a failure like that is kind of large.

Angelique Medina:
Yeah, absolutely. The other one that was… It wasn’t necessarily a news story, although it certainly triggered a news story, and that was this blog post that was put out by AWS around the JEDI contract, and the award to Microsoft. And clearly, they have some very strong opinions on the matter. And so they put out this fairly strongly-worded blog post about how they felt the process was being handled, and whether or not it was fair that it was handed off to Microsoft. So clearly there’s been some escalation in this war of words between the two. So we’ll see how that plays out.

Angelique Medina:
So those were some of the major news stories. A couple of other things that happened, we also saw that Microsoft announced several …

Archana Kesavan:
New regions for their data centers. I believe it was New Zealand, Italy, and Poland, followed by Oracle, which made a new announcement, too, in terms of some new cloud regions. And I think it was their second region in South Korea.

Angelique Medina:
Yeah.

Archana Kesavan:
It’d be interesting to see Oracle’s updates over the next few months, now that… You know, a couple of weeks ago we heard Zoom’s investing heavily in Oracle Cloud, as well, so we just keep track of how Oracle is expanding their network.

Angelique Medina:
So a lot of the news was cloud-related, which is good timing because we actually put out some updates to some of the research that’s been done around cloud networks.

Archana Kesavan:
Yep.

Angelique Medina:
And what about the blog post last week on this, which was really interesting?

Archana Kesavan:
Right. So one of the other research and data-driven projects that we’ve been working on is the performance comparison of cloud providers. So last week, we did one of our quarterly updates to cloud performance, which is comparing network connectivity and architectures of AWS, GCP, Azure, Alibaba, and IBM. And I think one of the… two, actually, really interesting points there that stood out for us was how these providers are constantly making updates to their network. We’ve spoken about how there is no steady state in the cloud, but what’s really interesting to see here is that AWS, Microsoft, GCP, whoever it might be, are making improvements to their architecture, to how they’re peering with other providers, introducing new regions, new regions that are compatible with certain types of services.

Archana Kesavan:
And one example is AWS’s Global Accelerator service, which improves the performance of, and access to, applications that are hosted in AWS. And we saw up to almost a 25% improvement over a six-month period, in terms of network latencies. And this was just one data point, but we updated the report to have multiple other data points, in terms of how AWS has made these improvements.

Archana Kesavan:
We actually saw the inter-region performance of Microsoft get better over time, as well. So it’s just really interesting, because we’ve been doing this research for about two years now, and every year we’ve seen these performance improvements across these providers. We’ve seen them bring out new services. They don’t stop with optimizing it and providing the best connectivity.

Archana Kesavan:
So that was interesting. So if you’re kind of interested in reading more about it, definitely jump onto the cloud report, that you can access at thousandeyes.com/research. Right. Back to you, Angelique.

Angelique Medina:
Yeah. So one of the things that I wanted to talk a little bit about today. We’ve covered this very briefly in some previous shows, talking a little bit about how we’re determining what an outage is, because we share a lot of outage stats on a week-by-week basis. And we certainly get a lot of questions about, like, well, what constitutes an outage? What do we mean by an outage? And how are we effectively deriving these numbers? So that’s where Arash comes in, because he was part of the team that did the research behind Internet Insights, which is the mechanism that we’re using to share these outage stats with everyone.

Angelique Medina:
So maybe if we first off… Arash, if you want to walk through one, what do we mean by an outage? And how does that maybe differ from some other performance degradation? And then, what are some of the scenarios that would trigger an outage?

Arash Molavi:
Yeah, sure. In the very broad definition of an outage, if you think of the Internet as a network of networks, which is what it really is, in an ideal world you expect any node in this network to be able to communicate with any other node in the network. But obviously we know that’s not always the case. And so basically, you can think of an outage as any disruption or any disconnection, if you wish, in the network that causes a part of the network not to be able to talk to another part. And obviously this is a very broad definition of an outage. This could be done intentionally or unintentionally. This could be a permanent outage, even. So for one example, if you think about the Great Firewall of China, that is basically censoring access to Facebook. This is, by definition, an Internet outage.

Arash Molavi:
And as I said, these could be intentional, like the example that I just gave, or in most cases, unintentional. There are examples we see in the news every now and then, where there was some construction and some fiber optic cable got cut, and that causes an outage. Or misconfigurations, or automation used in configuring networks, or someone basically “fat-fingers” and touches something wrong, and that can result in that disconnect and disruption in the network.

Arash Molavi:
So that’s the definition of the outage. Now, if we want to focus on the outages that you’ve been reporting throughout this podcast, and the outages that we’ve been detecting at ThousandEyes, these are basically network outages in the current infrastructure of the Internet. And so the keywords here are… like the biggest keyword here is network. Because, for example, you just started this show by talking about the TikTok and Spotify outages due to a Facebook SDK. That’s more of an application issue. It’s not really something that is happening at the network layer.

Arash Molavi:
So we’re detecting outages at the network layer. And I want to emphasize that this could be an issue actually at the network layer, or it could be another issue that’s manifesting itself in the network layer. And another key thing is, we’re focusing on outages where there’s, like, 100% disconnect. So if you think of it, again, as a network, you’re going from point A to point B. So you’re going to take some path. And if an outage happens, the outages that we consider, at some point in this path, things just get dropped. So you have 100% loss. And so in some network in some location, you’re experiencing 100% loss, and those are the outages that we detect.

Arash Molavi:
So yeah, so that’s the high-level definition of the outage and what we consider an outage in our reports. But obviously, I also want to mention that while we’re focusing on cases where there’s 100% loss, you don’t really need 100% loss to have bad performance. You can have… I don’t know… if you have 30, 40% loss, that’s still pretty bad. So your experience can be really bad. Or you might not even be able to connect to an application, even though there’s not 100% loss, or if your latency is too high, then your experience can be pretty bad, even though there’s not 100% loss there.
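
As a rough illustration of the distinction Arash draws here, the following sketch labels a single loss measurement as an outage only at 100% loss, and as degraded at lower but still significant loss. The function name `classify_loss` and the 30% cutoff are assumptions made for this example (the cutoff borrows the "30, 40%" figure mentioned above); it is not a description of how ThousandEyes actually detects outages.

```python
def classify_loss(loss_pct: float, degraded_threshold: float = 30.0) -> str:
    """Label one packet-loss measurement (0-100) for a network segment.

    Only complete loss counts as an outage, following the definition given
    in the transcript. The degraded_threshold is an illustrative assumption.
    """
    if not 0.0 <= loss_pct <= 100.0:
        raise ValueError("loss_pct must be between 0 and 100")
    if loss_pct == 100.0:
        return "outage"
    if loss_pct >= degraded_threshold:
        return "degraded"
    return "healthy"


for sample in (0.0, 35.0, 62.0, 100.0):
    print(f"{sample:5.1f}% loss -> {classify_loss(sample)}")
```

A "degraded" label in this sketch says nothing about whether a given application still works; as the discussion goes on to show, that depends on the application.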

Angelique Medina:
Yeah, absolutely. So as you mentioned, there’s a number of things, a number of performance indicators, if you will, that could potentially impact users. One of them is outages, which we cover a lot, and that, to your point, could be caused by a fiber cut, could be caused by a router just dying or blowing up. Or it could be that the router, whatever, is fully functioning, but there’s been a configuration issue. Or like maybe some automation or human error that’s caused the outage, or even a routing issue could cause the outage. For example, if there’s a hijack or leak, that could then lead to a particular site becoming unreachable, and all of the traffic just simply dropping within a particular network.

Arash Molavi:
Exactly. The routing example is a great example, where it’s not like a router just blew up or is malfunctioning.

Angelique Medina:
Right.

Arash Molavi:
So the latest example of this, the high-profile example, was Rostelecom a month ago, where they basically, by mistake, hijacked and leaked a bunch of prefixes. And what happened was, packets were taking a route that they were not supposed to take, and they arrived at a router, and the router didn’t know what to do with them. So it kept dropping them. So we saw 100% loss in the network, which was caused by, basically, a BGP misconfiguration.

Angelique Medina:
Right. But then to your point, there are other scenarios in which users can be impacted, a lot of it having to do with user behavior itself, like for example, network congestion. And we’ve heard, clearly, a lot about this recently, because of the increase in traffic usage. So one of the ones, or examples, that we looked at, and this was a few weeks ago, was the New York State unemployment site. And what we had shown was, in looking kind of at network performance, we could see that there were various points during the day, and days of the week, in which there was a really significant increase in packet loss. So we can see here for this particular location connected to the Verizon network, it’s like a 62% packet loss, which is really, really high, right? Not a 100% packet loss, 62%.

Angelique Medina:
But what’s interesting about that is that in this particular instance, even at 62% packet loss, we could still see that, in this instance, there was still a connection to the server that was able to be made, even though response time was really, really high, right? And that kind of speaks to, in some ways, the resiliency of the Internet and how protocols were designed, right? Because even when there’s packet loss, you’re going to effectively retransmit and attempt to connect, right?

Arash Molavi:
Exactly.
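
One simplified way to see why a connection can still succeed at 62% loss: if each transmission of a packet is dropped independently with probability p, the chance that at least one of n attempts gets through is 1 - p^n. The sketch below rests on that independence assumption, and `delivery_probability` is a hypothetical helper name. Real loss is often bursty and real retransmission timers back off between tries, so the numbers are for intuition only.

```python
def delivery_probability(loss_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` transmissions is delivered.

    Assumes each transmission is lost independently with probability
    `loss_rate` (0.0 to 1.0), which is a simplification.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - loss_rate ** attempts


for n in (1, 3, 6):
    print(f"62% loss, {n} attempt(s): {delivery_probability(0.62, n):.2f}")
```

Each retry improves the odds of getting through but also adds waiting time, which is consistent with the very high response times described above. At 100% loss, no number of retries helps.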

Arash Molavi:
So basically, we actually earlier talked about this in the example that I gave, is that we see these things happening all the time, but usually in a form of DDoS attacks, where a malicious entity wants to bring down a network or a webpage, and they basically use an army of bots to send a lot of requests, and basically go above the capacity that that network or the application is able to handle. And that causes a lot of loss and a lot of latency, and basically it renders the application useless. But in this case, obviously, it wasn’t a malicious entity doing the attack, it was so many people that had to apply for unemployment at the same time. And that system was just not designed to handle this much load.

Angelique Medina:
Right.

Archana Kesavan:
Also, when you have a terminal loss of 100%, right, that disrupts everything. You can’t even get to where you want to go. And when you see loss like 62%, in this particular case, it’s like, it depends on what packets are being lost in the sequence of an HTTP phase, right? Like, for example, in this particular case, you were able to get through your first connection, but then you’re just waiting for the rest of the packets to make it through. So I think just seeing that variation in terms of how packet loss can impact some locations or some users, and for other users, it could just manifest as the application being slow. For some users, in this particular example, Angelique, that you’re sharing, it manifested as the user not being able to connect to the website itself.

Angelique Medina:
Absolutely. So in this particular case, the fact that they were able to get to the application server was really just luck, because if we look at some of the other locations here …

Archana Kesavan:
I think it was the Comcast one from New York. Yeah.

Angelique Medina:
Yeah. So if we go back, we can see that there’s simply no connection that was able to be made. And if we look at kind of the network piece of it, again, really, really high issue here, right? So overall loss is really high.

Angelique Medina:
That’s kind of interesting, how, in some instances, the network loss could impact users reaching the application. And in some instances, they’re able to reach the application, but they just have a really bad experience. So packet loss, particularly as it increases, it gets higher, impacts experience. And latency could potentially impact experience, as well. For example, if there’s network congestion, that could increase latency in a particular network. But even if that’s the case, even if there is congestion, the amount of increase in latency may not rise to the level that would actually impact an application and your experience of it. It would have to be pretty high these days in order for you to start to notice because a lot of applications are just not that latency-sensitive.

Archana Kesavan:
Right. I mean, in terms of, I think, latency, and as you’re talking through latency, it just came to my mind is, one is congestion, for sure, taking a lot of time. Others, like given the context of COVID-19 and the increased usage of VPNs we are seeing right now, if you’re going through a VPN concentrator that’s not optimized for your location, you’re just traversing multiple hops to even get to the destination, you’re not necessarily like… it’s not a congestion issue, but it’s increased your latency. The chances are, to your point, like if the application’s really sensitive, you’re going to catch it. But if your application’s not sensitive as much, which is a lot of the websites and things like that might not get impacted. But if you’re, say, on a voice call, that’s going through VPN and you’re just routed to a really wrong place, then that’s when you start seeing impacts to user experience.

Angelique Medina:
Yeah.

Arash Molavi:
I just want to add to what Archana said. When it comes to latency, if you just think about CDNs, CDNs are a big part of the Internet today. The core reason that they exist is to reduce latency. You’re geographically distributing the content, so you’re closer to the end user. So latency is really important. There’s so much research that says, hey, if the latency increases by, I don’t know, one second, the revenue of the company, it’s just going to go down by 10%. I’m making these numbers up, but there’s so much precision that just like… The users are not tolerating latency. Or even cases, like right now we’re having a video conference. If our connections had high latency or high jitter, we couldn’t have a smooth conversation. So you’re right that in some cases, like if I’m watching Netflix, maybe latency’s not that important. My video is going to start two seconds late, but then I’m going to buffer the video, I’m fine. But some applications are really going to be useless, if your latency is high or variable.

Angelique Medina:
Yeah, for sure. And I think that what this kind of speaks to is, when you talk about network performance, it’s all about context, because you can’t just say like, oh, there’s packet loss. Full stop, that’s terrible. Or if there’s latency, that’s potentially going to impact the application. It may, it may not. It really depends on a lot of instances. Outages, yes, it’s very clear cut. But these other things, even when there’s some measure of loss on a network, you could still have a pretty reasonable, good experience. So again, it really just depends. The application context is important.

Angelique Medina:
And of course, you bring up CDNs. Also really interesting, really important point you brought up there. Not just latency though, because it also… What we’ve seen recently is in the resilience of the Internet and how people have been talking about that. The fact that CDNs basically reduce backbone usage of the Internet.

Arash Molavi:
Oh, exactly, yeah.

Angelique Medina:
That’s a huge, huge thing. So when we talk about Internet resilience, yes, a lot of it has to do with transport protocols and all kinds of different things that have to do with kind of the peer networking piece of it. But also application delivery has been optimized so much that that has really helped in this period, in which there’s been a lot more usage of the Internet, and websites, and other digital services.

Arash Molavi:
Absolutely. And I just wanted to add a small note here. Apart from the effect that latency and loss have on user experience, they’re also like… for people like myself and you that look at networks a lot, it could also be very valuable data points, as well, because latency and loss can tell you a lot about the network. Like one simple example is if you have two separate links and both have high loss and high latency, and then if you zoom back and look at the trend of loss and latency, maybe one of them is always lossy and high latency, maybe the other one follows the pattern of, basically, workdays, and on weekends, it dies down. And then, you know, okay, so this is a network that just gets congested at peak hours, or the other one is a link that’s just bad and it’s always congested. So you can learn a lot about the network, just looking at latency and loss, as well.
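
The two-link comparison Arash describes can be sketched as a comparison of average loss on workdays against average loss on weekends. Everything specific here is an assumption for illustration: the helper name `describe_link`, the 5% cutoff for "high" loss, and the sample values, which are made up rather than measured.

```python
from statistics import mean


def describe_link(weekday_loss, weekend_loss, high_loss_pct=5.0):
    """Compare average loss on workdays and weekends for one link.

    Inputs are lists of loss percentages. The 5% cutoff for "high" loss is
    an illustrative assumption, not a recommended value.
    """
    weekday_high = mean(weekday_loss) >= high_loss_pct
    weekend_high = mean(weekend_loss) >= high_loss_pct
    if weekday_high and weekend_high:
        return "persistently lossy"
    if weekday_high:
        return "congested at peak times"
    if weekend_high:
        return "lossy mainly on weekends"
    return "no clear problem"


# Made-up samples: link_a is lossy all week, link_b only on workdays.
link_a = describe_link([12.0, 15.0, 11.0, 14.0, 13.0], [12.5, 13.5])
link_b = describe_link([12.0, 15.0, 11.0, 14.0, 13.0], [0.5, 0.2])
print("link A:", link_a)
print("link B:", link_b)
```

The same comparison could be applied to latency instead of loss; the point is that the shape of the trend over time, not a single reading, separates a congested link from a consistently bad one.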

Angelique Medina:
Yeah, absolutely. So speaking of disrupted user experience, we’re just going to quickly go through some of the outage stats from last week to see what that’s telling us about kind of the overall health of the Internet and how things are progressing. So we can kind of see here that outages were down, overall, kind of in this 200 level. So it was 216 last week, the week before it was 282. So it has gone down globally, as well as in the US. These 200 numbers are kind of closer to what we were seeing in January, February, so they’re kind of starting to normalize to pre-March, which was kind of where we saw our spike, March numbers.

Figure 1: Network Outages – Week of May 4, 2020

Angelique Medina:
ISP outages are down, as well… Well, cloud service providers, we never really saw much of a peak. They just kind of tend to stay under 25-ish overall, globally. And last week was pretty similar to the week before that. So there were 13 outages overall. The week before that there were 12. So not really a lot.
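
To put the week-over-week figures quoted above in relative terms, here is a small worked calculation. The counts come straight from the transcript (282 then 216 overall, 12 then 13 for cloud providers); the helper name `percent_change` and the dictionary labels are just for this example.

```python
def percent_change(previous: int, current: int) -> float:
    """Week-over-week change as a percentage of the previous week's count."""
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return (current - previous) / previous * 100.0


# Counts quoted in the transcript: (week before, last week).
weekly_counts = {
    "overall, global": (282, 216),
    "cloud providers, global": (12, 13),
}

for label, (previous, current) in weekly_counts.items():
    print(f"{label}: {percent_change(previous, current):+.1f}%")
```

That works out to a drop of roughly 23% overall and a rise of roughly 8% for cloud providers, though with counts as small as 12 and 13 a change of one outage is not a meaningful trend.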

Archana Kesavan:
We’re almost seeing the pattern of these outages stabilizing, which we saw a few weeks ago, especially on the ISP side return back, it looks like, right?

Angelique Medina:
Yeah. So that’s good news. So with that-

Archana Kesavan:
And all of this data that we’re showing here, if you’re interested in taking a look at it, you can get to thousandeyes.com/outages right there, and the trend is up there.

Angelique Medina:
Yeah.

Archana Kesavan:
So definitely-

Angelique Medina:
It also shows, if you want to check out the outages for the UCaaS providers, they’ll also be listed up there on that site.

Archana Kesavan:
Yep. All right. With that, we are at the end of the show. Arash, thank you so much for being a part of this show, as always.

Arash Molavi:
Thank you for having me.

Archana Kesavan:
Of course. And if you’re interested in subscribing to the Internet Report, it’s available on your favorite podcast channel you’re looking at, or YouTube. And if you’re interested in our newly launched T-shirt, I don’t think it’s new anymore, it’s been going on for three weeks now, but working safely from home, email us at internetreport@thousandeyes.com, and we’ll send you over your T-shirt quickly. With that, we’ll close out here and see you guys next week.
