Watch on YouTube – The Internet Report – Ep. 6: April 27 – May 3, 2020

On this week’s episode of The Internet Report, Archana and I are thrilled to be joined by Martin Levy, Distinguished Engineer at Cloudflare, who previously worked on expanding Cloudflare’s global network footprint, and today is highly focused on BGP security. Check out this week’s episode to hear his thoughts on BGP optimizers, route filtering, and the rate of RPKI adoption.

We also discussed a series of Virgin Media outages in the UK last week that appeared to have prompted local users to check their bandwidth en masse via Speedtest.net. We then covered our usual availability check of ISP, public cloud, and collaboration app provider networks.

Give this week’s episode a watch or a listen in the embeds provided, grab our slides on Slideshare, and as always, feel free to read along with the transcript below. We’re also available on iTunes (Apple podcast), Spotify, and Stitcher, so be sure to subscribe and leave us a review on your platform of choice. Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.


Catch up on past episodes of The Internet Report here.

Listen on Transistor – The Internet Report – Ep. 6: April 27 – May 3, 2020

Follow Along with the Transcript

Angelique Medina:
Welcome to The Internet Report, where we discuss the last week on the Internet, outages, route security incidents and headlines from telcos, the public clouds, and more. Joining me to talk about what went down last week is my co-host, Archana Kesavan.

Archana Kesavan:
Hey everybody.

Angelique Medina:
And our guest who we’re really excited to have on, Martin Levy.

Martin Levy:
Nice to meet you.

Angelique Medina:
Martin is a distinguished engineer at Cloudflare. He joined Cloudflare about six years ago focusing on expanding Cloudflare’s global network footprint. He’s now highly focused on BGP route security, and he’s a long time advocate for IPv6 backbone deployment. He’s been involved in the Internet exchange community for some time, and he currently sits on the board of LONAP. So Martin, welcome again.

Angelique Medina:
So Martin, given your work advocating for BGP route security, specifically RPKI, it’s very relevant that you’re on the show today because a couple of weeks back we had a major route security incident with Rostelecom hijacking a huge number of routes, including routes to Cloudflare services. So that was something that we had covered on the show.

Angelique Medina:
What was interesting about that is that there’s really quite a number of factors that seem to have been in place. In this particular recent incident, it appeared based on the nature of the route leak, and the fact that it was a very specific /21 announcement, that it was due to a misconfigured BGP optimizer, which was the same cause for a really massive hijacking last year by DQE in which Cloudflare was also involved. So optimizer, but also what’s really surprising is that some major ISPs did not filter out that announcement and propagated it on to their peers, and that’s what seemed to really cause this cascading impact in terms of Internet disruption.

Figure 1: BGP Route Hijackings

Martin Levy:
Yes, it seems to be a repeat pattern with a different geography, but the part you mentioned that’s absolutely key to find out what’s going on here is the /21 route. The route from Cloudflare, if we focus just on Cloudflare, is a /20 and we don’t announce the /21. The /21 is created somewhere. In the DQE event last year, it was created by a route optimizer and took a while to get basically turned off and have the routes disappear. The Rostelecom one seems to be a lot quicker. There’s also a /21 and, therefore, one can assume the same has happened. But in both cases, whether it be one upstream transit or many, or many peers, you have a route that propagates and propagates fairly efficiently and causes the damage. Many, many backbones were affected. The route optimizers seem to generate sometimes tens of thousands of routes that don’t exist at the beginning of the… in the sort of stable correct Internet mode, and because of that and because of the distance that route can travel, you get some measure of the damage that it causes.
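To illustrate why an unexpected /21 can pull traffic away from a legitimately announced /20, here is a minimal Python sketch of longest-prefix matching. The prefixes come from the 198.18.0.0/15 benchmarking range and the labels are made up for illustration; they are not the prefixes involved in either incident, and real routers apply many more selection rules than this.

```python
import ipaddress

# Hypothetical routing table: illustrative prefixes and labels only.
ROUTES = {
    ipaddress.ip_network("198.18.0.0/20"): "legitimate origin",
    ipaddress.ip_network("198.18.0.0/21"): "leaked more-specific",
}

def best_route(destination, routes=ROUTES):
    """Return (prefix, label) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    # The most specific (longest) prefix wins, regardless of who announced it.
    winner = max(matches, key=lambda net: net.prefixlen)
    return winner, routes[winner]

print(best_route("198.18.1.10"))  # inside both prefixes, so the /21 is chosen
print(best_route("198.18.9.10"))  # inside the /20 only
```

In this sketch, any destination inside the leaked /21 follows the leaked route even though the /20 is still being announced correctly.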

Angelique Medina:
Yeah, for sure. I mean in the recent one. It only lasted for a few minutes, and in that short period of time, it was accepted by Level 3 and Hurricane Electric. It was propagated to their peers. Traffic started flowing towards Rostelecom and then was dropped at their network edge because there wasn’t anything to send it to internally. All of this happened within just a few minutes before Rostelecom quickly corrected and revoked the announcements. So this speaks to how interconnected the Internet is.

Angelique Medina:
The incident a couple of weeks ago, well at the beginning of the month, plus DQE, Rostelecom also did something very similar a few years ago. So, Cloudflare has had to weather a lot of these issues, which is why you guys have more recently, not recently I guess, for a few years now, have been very vocal in putting out the need for increased route security because it’s not just things like the BGP optimizer or misconfiguration. It really comes down to security hygiene between service providers and a need to route or filter out illegitimate announcements.

Angelique Medina:
So one of the things that you guys have done recently, and again you guys have been writing about this for a long time, not only the danger of optimizers, but also the need to adopt RPKI, which is just one mechanism, but it’s a step in the right direction for route security. And more recently you guys announced, this was just a few weeks ago, a tool called, “Is BGP Safe Yet?” And it basically allows users to check on whether or not their ISP has adopted RPKI.

Martin Levy:
Yeah, we did this announcement on the 17th of April. So nearly two weeks ago now, and the tool is built upon a couple of different capabilities. One is that all of Cloudflare’s routes are RPKI signed. In other words, the origin has an attestation within RPKI: it has a ROA, which is what you use. It specifies the particular route and the particular AS number, and we then have one route that we announce globally that explicitly is incorrectly signed, or signed with AS 0 in this case, which means that this route is invalid and should not be propagated. We announced that route onto every Internet exchange, every transit, every partner PoP that we have, and that route in fact now has a website on it and that website can be measured from the “Is BGP Safe Yet?” website. Therefore we can say, “Ah, you can get to us through a valid route, and you can also get to us through an invalid route.”
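The ROA check described here can be sketched roughly as follows. This is a simplified Python illustration of route origin validation in the spirit of RFC 6811, not Cloudflare’s implementation; the `ROA` class, the `validate` function, the prefixes, and the AS numbers (taken from the documentation range) are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class ROA:
    prefix: str      # address block the ROA covers
    max_length: int  # longest prefix length that may be announced
    asn: int         # authorized origin AS; 0 means no AS may originate it

def validate(route_prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this ROA says nothing about the route
        covered = True
        if roa.asn != 0 and roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched by none means invalid.
    return "invalid" if covered else "not-found"

# Hypothetical ROA set: one normally signed /20 and one AS 0 ROA.
ROAS = [ROA("198.18.0.0/20", 20, 64496), ROA("198.19.0.0/24", 24, 0)]

print(validate("198.18.0.0/20", 64496, ROAS))   # signed route from its origin
print(validate("198.18.0.0/21", 64511, ROAS))   # more-specific from another AS
print(validate("198.19.0.0/24", 64496, ROAS))   # covered only by the AS 0 ROA
print(validate("203.0.113.0/24", 64496, ROAS))  # no covering ROA at all
```

Under these assumptions, a more-specific /21 from an unauthorized AS comes out as invalid, and so does anything covered only by an AS 0 ROA, which is the property the deliberately invalid test route relies on.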

Martin Levy:
The moment that we see the invalid route responding, we know that ISP and, therefore, also its subsequent transit providers are in fact actually failing to do RPKI. So this is an active website. It gets updated, it gets a large amount of tests. But the graph on your screen to the left, which shows the amount of RPKI-signed routes in the world, took a jump and then took a jump higher as of the 17th. People responded to seeing this and responded to the tweets that were created that said that their provider was RPKI compliant or not RPKI compliant.
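The inference behind the test can be written down as a small decision function. This is only a sketch of the reasoning as described in the conversation, with hypothetical names; the real site’s checks are not shown in the source and may differ.

```python
def classify_isp(valid_site_reachable, invalid_site_reachable):
    """Infer RPKI filtering from two fetch results (hypothetical helper).

    valid_site_reachable:   a page served over a correctly signed route loaded
    invalid_site_reachable: a page served over the deliberately invalid route loaded
    """
    if not valid_site_reachable:
        # Without a working baseline nothing can be concluded.
        return "inconclusive"
    if invalid_site_reachable:
        # The invalid route was accepted somewhere along the path.
        return "not filtering"
    return "filtering"

print(classify_isp(True, False))
print(classify_isp(True, True))
```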

Figure 2: Cloudflare Works to Improve Route Security

Martin Levy:
Clearly, this has been taken to heart and the line has changed. The graph has changed since that date. We are seeing more ROAs being created on a day to day basis by the community, by providers. This is not work that Cloudflare does. This is work that other networks anywhere in the globe are doing; we saw a large jump in the Asia Pacific area, for example. So cause and effect in our mind is that this has woken people up. This is not the only thing that’s done that. The RPKI world and the BGP world have had various “deployathons,” like a “hackathon,” but more about the practical deployment. This has been done successfully in the Asia Pacific region, in the European region. It generates an uptick in the quality, the quantity, the signing of routes and therefore the ability for Internet exchanges or for backbones to do their filtering. And so hopefully this graph continues with an upward trend. And although we do still have a long way to go, I’m sure we can talk about, at least after the 17th I think that Cloudflare has helped in some way.

Angelique Medina:
Right. So it does seem like public shaming does actually work. Getting the consumers, the users involved. Because RPKI, to your point, this has been something that has been an effort within the community, and I think a lot of Internet users more broadly aren’t aware of this and don’t know anything about it. And I think the work that you guys have been doing has kind of gotten many Internet users who are in fact maybe more empowered and aware of this kind of stuff to activate and to hold their ISPs accountable. And that seems to have kind of moved the graph here as we see it. So that’s good. And it’s important too because it’s not just a question of Cloudflare or other applications or service providers, like web service providers, securing their prefixes or their routes through ROAs. The ISPs must participate in this. It sort of takes two to tango. If you do it, if you’re securing your routes, that doesn’t matter if the ISPs are effectively ignoring RPKI. So it’s got to be a collective effort. It has to be something where everyone agrees that they’re going to play by the same kind of security measures. And so this is nice to see that this is on the uptick.

Martin Levy:
There’s enough software out in both the open-source environment and others to make deploying RPKI filtering a practical thing for most backbones. And as we have seen with the likes of an AT&T or Telia and others, and we recently blogged with our list plus the website literally states this because it’s tested.

Martin Levy:
We are seeing large backbones that are now doing filtering both at the backbone level, at a peer level, and at the customer level. And this is pretty fundamental to keep the Internet going. The consumer may not notice this except for just, “Oh, the Internet isn’t working very well for certain sites at the moment.” And then that fixes itself.

Martin Levy:
However, of course, one needs to realize that BGP is underlying all of the geographies of where routes and where IP packets go, and therefore the security of the underlying infrastructure is fundamental and needs to be protected. And as the Internet has grown, as the number of operators have grown, this is not the little ARPANET that was developed 40 years ago. This is real.

Archana Kesavan:
Martin, a couple of follow-up questions on what you said, which is that this announcement or this tool has kind of created that uptick in the number of ROAs being signed. And Angelique, you mentioned ISPs have to work together to get this done. So what has traditionally been kind of the hindrance for ISPs to go this route?

Archana Kesavan:
Like you mentioned AT&T being one of the adopters, early adopters to go this route, but we know others haven’t. That’s one part of the question. The second is, if all the ISPs over the next year or two adopt this mechanism and all routes are signed, is the Internet, is BGP truly safe then?

Martin Levy:
Right. Great two questions. So the first one, adoption of any technology is hard. People have operational issues as to why they cannot just put the latest and greatest out there. The larger you are, the harder it is. You have a lot of other priorities. You have freezes for management reasons. At the moment we know that certain backbones have got freezes on just because we’re dealing with a pandemic and people are not able to work at a hundred percent within those backbones.

Martin Levy:
But it’s also a reality that specs, RFCs can come out of the IETF, and they can take a while before they get into software. Before they get to the point of being able to be used. So it’s natural that things take time. Small networks can implement very quickly, can learn and we can definitely find certain operational issues. The RPKI environment has required the regional Internet registries, the five regional Internet registries, to develop software and to deploy and to train their users and that has now been happening.

Martin Levy:
And of course, at the endpoint, we deal with an Internet that is a collection of a very large number of Internet routers from predominantly a few companies; very few companies, a handful of companies. But there are others. So we are definitely seeing some hardware out there that is just not capable, not ready yet to run RPKI. That’s normally the case in small providers. The cool part is at the big end, at this baker’s dozen-plus of tier-one backbones, you’re talking about real backbones with real hardware, with massive support contracts and a need for various reasons, RPKI is not one of them, but for many other reasons to keep up with the latest hardware, the latest software.

Martin Levy:
So the excuse, and calling people out for this one. The excuses are becoming few and far between as to why those backbones are not doing RPKI filtering. There’s another class here, Internet exchanges. Which have actually been a lot more responsive in this area. The Internet exchanges have built really good software for their route servers and that community talks to each other, has been upgrading. So Internet exchanges are actually fairly far down the road in regards to being able to do drop invalids for RPKI and provide localized route security in there.

Martin Levy:
The second question involved sort of, “Oh, when will we get to the end?” It’s essentially, well, when will we all be secure? And the answer is, “Sorry. We actually have got a long way to go.” And that’s not the end of the world, but we have a long way to go. And so for that, I think that we should take the most important networks, work outwards from the core, if one can believe that there’s a true absolute core of the Internet, and look at education and deployment of modern-day security capabilities, because this is the Internet now.

Martin Levy:
And by the way, during a pandemic, even more importantly, the Internet absolutely has to perform. So we can’t afford these route leak environments. This is not a toy network. It’s not a research network. And so just to plug, if I may, I mean Cloudflare put open-source software out in this phase. Other companies have as well, whether it be the code out of RIPE or NLnet Labs. You’ll see the other code and deployments out of major providers of hardware. All of this is making it easier and easier for networks to just go down this right route.

Angelique Medina:
For sure. And even if just the Tier 1s were to adopt, it would have a massive impact, to your point.

Martin Levy:
And they’re the ones that get the regular note saying, hey, how are you doing? And by the way, last week would have been fine, but not in a year’s time. That type of timeline does not work. Access networks, by the way, have just as much need for this as do cloud providers. And if I give another plug, the Internet Society’s MANRS program has now got three areas: core backbones, Internet exchanges, and now cloud providers. All of those need to be up to speed and fully RPKI compliant.

Angelique Medina:
Yeah, absolutely. And for those who own prefixes, securing their IP spaces is becoming much easier. As you said, there’s a lot of tools out there. It’s not as scary as it appears. Yes, there’s some idiosyncrasies in terms of how you configure your ROAs, but it can be done. And you can also validate once you’ve deployed that your sites are reachable in the way that you expect them to be. So there aren’t a whole lot of excuses to not go down the path of adopting. So …

Martin Levy:
The validation stage is an important one. You can run RPKI in a soft mode and at least understand that you have everything ready and that you have this ability to see the routes that you would drop. And then you get to drop. RPKI is a rather special situation. It’s very important to understand that. The RPKI data that let’s say Cloudflare filters on its network is, in fact, identical to the filter set that an AT&T or other backbones filter with.
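The difference between running RPKI in a soft mode and actually dropping can be sketched as a filter with an enforcement switch. This is a hypothetical Python illustration; the route tuples and the `apply_rpki_policy` name are assumptions, and real routers express this in their own policy languages.

```python
def apply_rpki_policy(routes, enforce):
    """Split routes into (accepted, flagged).

    routes:  iterable of (prefix, origin_asn, state), where state is
             'valid', 'invalid', or 'not-found'
    enforce: False for soft mode (only record what would be dropped),
             True for drop mode (invalid routes are not accepted)
    """
    accepted, flagged = [], []
    for route in routes:
        state = route[2]
        if state == "invalid":
            flagged.append(route)
            if enforce:
                continue  # drop mode: the invalid route is discarded
        accepted.append(route)
    return accepted, flagged

# Illustrative routes using benchmarking prefixes and documentation ASNs.
SAMPLE = [
    ("198.18.0.0/20", 64496, "valid"),
    ("198.18.0.0/21", 64511, "invalid"),
    ("203.0.113.0/24", 64500, "not-found"),
]

soft_accepted, soft_flagged = apply_rpki_policy(SAMPLE, enforce=False)
drop_accepted, drop_flagged = apply_rpki_policy(SAMPLE, enforce=True)
print(len(soft_accepted), len(soft_flagged))
print(len(drop_accepted), len(drop_flagged))
```

Soft mode keeps every route but reports the ones that would be dropped, which is what makes it possible to review the list before switching enforcement on.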

Martin Levy:
And so, if we see something wrong, let’s say something that is going to get dropped that shouldn’t be, or the other way around, we can contact that player or somebody else can contact that player. But it’s actually a common issue. So coming in now into an RPKI capable backbone is a much easier task than it was, let’s say, two years ago. Nowadays, there are so many people that are filtering that if somebody actually has a mistake inside their RPKI dataset, it will be seen by many and be hopefully fixed by many, well before it becomes an issue for any particular network.

Martin Levy:
So being late to the game is bad, but you can stand on other giants and simply go, right, well it works for everybody else. Let me do it for myself.

Angelique Medina:
Right, right.

Angelique Medina:
Yeah, absolutely. So on that, I mean we can talk on RPKI and route security all day. But there was a really interesting sort of event that happened last week, kind of a notable outage of the week and that was Virgin Media in the UK that broadband provider had a pretty significant outage that affected a lot of users, there was a lot of chatter on social media. And the nature of this outage was really unusual as well. We saw it in a number of ways, not only on Virgin Media’s infrastructure but more broadly on another network that’s owned by Liberty Global. So UPC. So Archana, do you want to walk through that?

Archana Kesavan:
Yeah, I can walk through that. So we caught this in ThousandEyes and then we’ll look at some trends that Cloudflare also saw and the similarity in those patterns. What we saw happened on Monday last week, around 9:15 Pacific, I guess around 5:00 PM in the UK. We started seeing LGI, which is Liberty Global; UPC is a part of Liberty Global and so is Virgin Media. And what we see here from an outage perspective is this interesting pattern that repeats itself. So it lasts for about 15 minutes, but then it repeats itself every hour for the first three hours. Takes a break for a little bit here but then shows up again. And if you double click on it, it’s multiple locations, not just within London but we saw Amsterdam get impacted too. But again …

Angelique Medina:
Ireland, as well. Yeah.

Archana Kesavan:
Yeah, Ireland as well. UPC is heavy in Ireland, so definitely there. But what was interesting is this trend, right? It lasted for a few hours overall, but the time each occurrence lasted for was very short, and then it kept repeating itself, which is very similar to that pattern that we also saw from the Cloudflare graph there. So …

Angelique Medina:
Yeah, it was interesting because I know that on your guys’ side, you had seen during this incident that traffic destined for prefixes owned by ASN 5089, which is Virgin Media, and that’s the network consumer users are connecting through, basically dropped. And that coincided with what we were seeing, which was that in this infrastructure that’s managed by Liberty Global as well as Virgin Media there was this outage event occurring around the same times, starting around 5:17 PM. And then again, both of us saw this kind of hourly pattern: the incident was short-lived, but just kept recurring. Which suggests that it could have been related to some automation issue. So why don’t you walk us through kind of what we’re looking at up top, Martin.

Martin Levy:
Yes. Cloudflare delivers traffic to nearly every AS in the world and definitely just about every IP address in the world. So this is sort of a collective graph of all the traffic that we’re delivering to Virgin Media in the UK. And the graph is normally a nice smooth graph. You can tell when people go to sleep, you can tell when people are awake, as you would expect. By the way, our graph is all in UTC time because we’re sort of a global company. We have to pick something. And so it’s not even UK time, it’s UTC. So we saw this drop in traffic. We saw the inability to deliver pages to Virgin Media and we also saw UPC, as well. If I had a different graph published it would show the two of them in lockstep. And those graphs dip at about 15 minutes past the hour … and yeah, three repeats, it misses an hour, one repeat, it misses an hour, two repeats.

Figure 3: Virgin Media Outage Pattern

Martin Levy:
And then finally, very early in the morning you could assume that the engineers finally get to go home or get to continue being at home and fix their issues.

Martin Levy:
So automation, yes. By the way, not RPKI-based in any way. We’ve changed subjects here. But basically, we haven’t seen them announce what actually happened. The fact that it happened over two different ASes that are essentially part of the same company leads the way to automation being the issue.
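The reasoning that a drop recurring at roughly the same minute past each hour points toward a scheduled or automated process can be checked with a few lines of Python. The timestamps below are made up to mimic the pattern described in the conversation (UTC, about a quarter past the hour, with two skipped hours); they are not measurements from ThousandEyes or Cloudflare, and the function names are hypothetical.

```python
from datetime import datetime

def minute_offset(a, b):
    """Circular distance between two minute-of-hour values (0 to 30)."""
    d = abs(a - b) % 60
    return min(d, 60 - d)

def recurs_on_the_hour(starts, tolerance_minutes=5):
    """True if every start falls within the tolerance of the first start's minute."""
    if len(starts) < 2:
        return False
    reference = starts[0].minute
    return all(minute_offset(t.minute, reference) <= tolerance_minutes for t in starts)

# Illustrative outage start times only, not real data.
OUTAGE_STARTS = [
    datetime(2020, 4, 27, 16, 17),
    datetime(2020, 4, 27, 17, 15),
    datetime(2020, 4, 27, 18, 16),
    datetime(2020, 4, 27, 20, 15),
    datetime(2020, 4, 27, 22, 16),
    datetime(2020, 4, 27, 23, 15),
]

print(recurs_on_the_hour(OUTAGE_STARTS))
```

A positive result is only a hint. It fits a cron-like job or periodic automation, but it does not identify the cause, which the provider had not announced at the time of recording.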

Martin Levy:
Interestingly, you saw it in interfaces, we just saw it as the end-user, but they both sum up to the same thing. Cloud, through infrastructure to the end-user. And the end-user has got restricted access to the Internet. Obviously this graph is much smoother if you look at different operators. So we can sort of clearly see it affecting, in this case, a couple of ASes and nothing else. This is a graph from a tweet pretty much the next day when we knew that things were quieting down. We can see that over 12-plus hours.

Angelique Medina:
Yeah. Yeah. We saw the last little kind of outage blip happened like around 1:00 AM, 2:00 AM or so. But the bottom graph, just I guess because we’re being somewhat California centric, this is in Pacific time. Could’ve changed this to UTC. That might’ve been a little bit less confusing. But nevertheless, you still see that pattern. What was also interesting that you had kind of brought up was a blog article by Ben Cartwright-Cox, who talked a lot about these same things. He again noticed this peculiar pattern, not only in your traffic stats but also in some probing that he had done. But he also saw that as this outage unfolded, as traffic would … or as the network basically came back online, that we would see these huge spikes in users basically testing their bandwidth.

Angelique Medina:
And then that could also have an impact on network congestion, as well. So it’s sort of there was an issue with the network and then users kind of would collectively all at once respond in a similar way. And that could create congestion issues and then you could get into this vicious cycle at some point with users kind of just continuously testing, but the testing itself actually causing an issue. Which was an interesting angle to kind of bring in, in terms of how human nature can also influence how the network is performing.

Martin Levy:
Yeah, it was a great observation on the part of Ben, and it makes perfect sense. Users have one tool, speed test or its equivalent. And they instinctively just go to that test. But what’s interesting is that if you assume that this outage was absolute, in other words, it wasn’t about congestion, it was about an actual outage, a lack of the ability to move bits between two places, this is the wrong test. And users don’t have … And by the way, network engineers are only slightly better in this case. Network engineers heavily use traceroute, ping, et cetera. But the reality is that they still sort of care about absolute throughput. And in an event like this, in fact, actually, that’s not the test you want.

Angelique Medina:
Right, right.

Martin Levy:
I really want to understand what’s my breadth of connectivity. Can I get successfully, which we don’t know, to the internals of the network versus the externals of the network? And so, I’m not going to take on the task of educating end-users, that’s far too complicated.

Angelique Medina:
Right. A speed test. But to your point, I mean, speed test is a pretty … It’s sort of like bringing a hatchet when you really just need a scalpel. Just a simple ping test to some external destination probably would have done the trick versus a bandwidth check.
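As a rough illustration of the scalpel rather than the hatchet, a reachability check needs only a single connection attempt, not a bandwidth measurement. The Python sketch below uses a TCP connect as a stand-in for ping, since ICMP usually needs elevated privileges; the `can_reach` name and its defaults are assumptions for the example.

```python
import socket

def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False

# Hypothetical usage: probe one destination inside the provider's network
# and one outside it, for example can_reach("example.com", 443).
```

A check like this exchanges only a handful of packets, so many users running it at once would add far less load than many users running bandwidth tests at once.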

Archana Kesavan:
Looks like a good feature request for the speed test app.

Angelique Medina:
Yeah. I mean, to your point, maybe there’s just a … The users need to understand that they have a variety of tools at their disposal. And this is kind of going back to the work that was done around RPKI, which is this, yeah, this is this very capability or mechanism that has really been kind of the province of the Internet community, and not really users at large. But apparently, users, consumers can be educated.

Martin Levy:
Users can absolutely be educated. And the easiest example to use is the green lock in the top of your browser.

Angelique Medina:
Yep, yep, yep.

Martin Levy:
Foibles aside, the fact that users now understand a green lock when they’re going to their bank is an important thing to recognize. So I know that’s a simple one, but let’s hold out hope for end users. But anyway, Ben’s blog is, which I hope you add the link to …

Angelique Medina:
Yeah, absolutely.

Martin Levy:
Has some great conjecture in it and I think is a wonderful thesis on things that happen. He’s got some obvious, some real measurements in there as well. So it’s worthy of discussion.

Angelique Medina:
Absolutely. I mean, we’ve definitely seen the influence of users on the Internet just over the last few months and kind of influencing how they’re using the network, whether that’s using more video conferencing apps or the fact that they might be consuming more upstream bandwidth versus downstream. And that’s been somewhat of a difference or that’s been different over the last month or so for sure. The user has to be factored in whenever you talk about Internet performance, as well.

Angelique Medina:
Which brings us to kind of our outage roll up from last week, just taking a look at some of the outages that we picked up. The Internet’s a pretty broad place, but some of the ones that we saw last week, so there was around 282 outages, which is a reduction from the previous week. And 98 of those were in the US, so that’s overall, that’s including ISPs, public cloud providers, video conferencing app networks, as well as DNS, CDN networks.

Angelique Medina:
And then on the top right, we see the ISP portion of that, which it’s a big bucket. Most of the outages we saw were in ISP networks, so 236, which is down from the previous week of 250. And then cloud service providers, generally speaking, we don’t see a lot of issues in their networks. And we really only saw a dozen globally and just one actually in the US, and that’s pretty significantly down from the previous week where we saw about 26, which is what we typically see. We haven’t seen any unusual spikes over the recent period with public cloud providers. So overall, looking pretty good just with the exception of this little … A bad day for Virgin Media last week. Things are looking good overall. So with that, any last questions, comments?

Figure 4: Network Outages: Week of April 27, 2020

Archana Kesavan:
No, I think I’m good. Martin, thanks again for being a part of the show.

Angelique Medina:
Yeah, it was great having you.

Archana Kesavan:
Yeah, great information there. As always, guys, leave us a review, follow us on any of those channels that you see there. We’re available in most of the podcasts out there. And again, if you’re interested in our newest T-shirt, which is working safely from home, email Internetreport@thousandeyes.com with your address and size and we’ll send that over to you. And again …

Martin Levy:
I need the v6 version of that T-shirt, by the way. Great T-shirt, but I need the v6 version. Even though I am wearing a v4 T-shirt today, but I have an exception on that one. I’m allowed to do that. I know that. Anyway, it’s been great. It’s been great, guys. Absolutely, absolutely superb. And hopefully, the Internet gets better the next week and the week after.

Angelique Medina:
Absolutely. And join us again.

Martin Levy:
All right. Thank you very much.

Angelique Medina:
Thanks everyone.

Archana Kesavan:
Bye.
