Watch on YouTube – The Internet Report – Ep. 21: Aug 24 – Aug 30, 2020

This is the Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. It was a rough week on the Internet last week, with outages and incidents across multiple services and providers, including Slack, Zoom, AWS, and Verizon. However, in today’s episode we’re going to focus exclusively on Sunday’s CenturyLink / Level 3 outage, which, according to Cloudflare, caused a significant 3.5% drop in global Internet traffic, making it one of the most significant Internet outages ever recorded. Don’t forget to follow along using the links below, and you can learn more in our accompanying blog post, CenturyLink / Level 3 Outage Analysis.

Show Links

  • Follow along and explore this week’s outage analysis within ThousandEyes — no login or subscription required!

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport. Want to ask Angelique or Archana a question? Find them on Twitter at @Bitprints and @Archana_K7.

Catch up on past episodes of The Internet Report here.

Listen on Transistor – The Internet Report – Ep. 21: Aug 23 – Aug 30, 2020
ThousandEyes T-shirt Offer

Follow Along with the Transcript

Angelique Medina:
This is The Internet Report, where we uncover what’s working and what’s breaking on the internet and why. We had a big week last week. There were multiple outages. We had Slack, we had Zoom … I think even AWS US-East had some issues. I think Verizon had some issues. But, we’re not going to talk about any of that. Today we’re really just going to focus on what everyone’s talking about, it seems, which is this really significant global network outage that happened in CenturyLink’s network, and that included Level 3, which is a major transit provider and is owned and operated by CenturyLink. So, that’s what we’re going to go deep on today.

Angelique Medina:
Before we do that and go under the hood, let’s maybe just take a look at the very basic root cause analysis that CenturyLink has put out around this incident. Hopefully, more data or information will emerge, but what they put out was interesting and also aligns with what we saw from the standpoint of our vantage points.

Archana Kesavan:
Right, right. And, just to quantify this outage a little bit … I mean, obviously, it had a massive impact because, of course, CenturyLink and Level 3 are a really big transit provider. But, also, how long this outage lasted. It started around 6:00 a.m. Eastern and lasted for almost five hours, as seen in their RCA statement as well.

Angelique Medina:
Which is pretty stunning in terms of the length of an outage. We’ll talk a little bit about why it took so long for them to remediate these issues. Just in terms, again, of the scope, I think that some folks had thrown out something like three and a half percent of internet traffic was impacted … something along those lines. I mean, just to give a sense, again, of the scope of the damage … I mean, they’re tier one, and they peer with very large providers. Also, they’re used by pretty much every service that you can imagine, from Google to Cloudflare. Again, these are highly trafficked destinations. Lots of enterprises use them. And then, even we do, which we’ll talk about a little bit as well.

Archana Kesavan:
Again, from the perspective of them … A lot of services use them as their direct upstream provider. But, they’re also such a big transit provider. Like, access to services that might not even directly be doing business with CenturyLink, or engage with CenturyLink, was also impacted, right?

Angelique Medina:
Absolutely. You didn’t need to have been a customer or a direct peer of theirs in order to have been impacted. In fact, there were many folks who were impacted even if … It could’ve been multiple degrees away, and-

Archana Kesavan:
Right. Right. And, I think the term impact also depends on where your users are, where they’re coming from, and what segment of those users was affected because a major transit provider had an outage. So, talking a little bit about the RCA that CenturyLink had put out … And, hopefully, over the next few days we get a more detailed report from them. But, as of now, the root cause of this issue, as you can see here, is an offending Flowspec announcement that prevented BGP from establishing correctly.

Angelique Medina:
Yeah. That’s right. And then, just for context, Flowspec is basically an extension to BGP, and it operates similarly to ACLs from the standpoint of it being a set of rules, like firewall rules. But, the difference is that Flowspec is dynamic, which is one of the reasons why it can be very powerful, but also very dangerous in terms of its implications if something goes wrong.
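To make the ACL analogy concrete, here is a minimal sketch in Python of what a Flowspec rule conceptually carries: ACL-like match criteria plus an action, distributed dynamically via BGP rather than configured box by box. The rule values here are invented for illustration and are not taken from this incident.

```python
# Hypothetical sketch of a Flowspec-style rule: match criteria + action,
# modeled in plain Python to illustrate the firewall-rule analogy.
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional

@dataclass
class FlowspecRule:
    dst_prefix: str          # match: destination prefix
    protocol: Optional[int]  # match: IP protocol number (17 = UDP), None = any
    dst_port: Optional[int]  # match: destination port, None = any
    action: str              # action: "discard", "rate-limit", etc.

    def matches(self, dst_ip: str, protocol: int, dst_port: int) -> bool:
        if ip_address(dst_ip) not in ip_network(self.dst_prefix):
            return False
        if self.protocol is not None and protocol != self.protocol:
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        return True

# A rule a provider might push network-wide to drop a UDP flood:
rule = FlowspecRule("203.0.113.0/24", protocol=17, dst_port=123, action="discard")
print(rule.matches("203.0.113.7", 17, 123))   # matching traffic gets discarded
print(rule.matches("198.51.100.1", 17, 123))  # non-matching traffic is untouched
```

The power, and the danger, is that a single announcement like this propagates to every router that accepts it, instead of being applied one device at a time.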

Archana Kesavan:
Right. We were talking to somebody from our ops team earlier today, and we were discussing how dangerous this extension could be just because of what it triggered. And, he was passionately supporting what a great extension Flowspec is. And, to understand that, it’s important to, I guess, think through why Flowspec came into being. Historically, Flowspec came about to be able to push updates quickly to multiple routers that you might own, and it has typically been used to make these changes when there was some kind of DDoS attack or other security event.

Angelique Medina:
So, now, fast forward 10-15 years, we have all sorts of automation and orchestration in place. So, it really raises the question of whether something like Flowspec is still useful or not. That’s not the main topic of this conversation. But, it turns out that this protocol that was formed and built years ago, to be able to push these rules broadly across routers, resulted in this really catastrophic event.

Archana Kesavan:
And, Angelique, one of the things that you mentioned is that it works as an ACL, right? I think, while the charm is that it works as an ACL and can filter traffic, the update is so dynamic. And, it’s carried in BGP, so if you get a BGP announcement, say, for instance, with a Flowspec rule in there, that can trigger the ACL, and we’ll see what that can do because that’s what happened in this incident.

Angelique Medina:
Right. They basically said that this Flowspec announcement basically shut down BGP. So, it would’ve terminated existing BGP sessions, and the implication of that, as others have pointed out, is that once that happens and BGP is shut down, when the routers attempt to reestablish connections, they would have been sending out a lot of BGP updates. And, we’ll take a look at what that meant in terms of users of their service.

Archana Kesavan:
One of the things … I think it was Cloudflare that called this out, as seen through RouteViews as well: around the beginning of the outage at 10:00 UTC, 6:00 a.m. Eastern, the volume of BGP updates seen from Level 3 was kind of … I don’t know exactly how big it was, but it was much bigger than what regular BGP announcements would look like. So, something to do with Flowspec and the number of updates that were going out across all their peers. I think it kind of-

Angelique Medina:
Well, so … Yes. The sessions were terminated because of Flowspec, but in trying to reestablish them, they’re basically redoing the announcements. And then, once their peers receive these announcements, they’re propagating them across their peers and so on and so forth. And so, it has this cascading impact. It’s not just Level 3’s updates. It’s all of the consequent or subsequent updates that need to be made.

Angelique Medina:
So, it really has this cascading impact, and the number of announcements was something like 10X what is typically seen. But what was interesting was that it kind of had a tail-off effect, which really did align with some of the de-peering that was done as the outage continued. So, after two, three, four hours, some of their major peers at that point just shut down their peering with them. And so, they were no longer propagating out routes, which was probably one of the reasons the volume went down, even during the incident.
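As a rough illustration of that cascade, here is a toy calculation. All the numbers are invented and the real propagation dynamics are far more complex, but it captures the shape of the effect: each reset session triggers a full-table re-announcement to every peer, those peers re-propagate onward, and de-peering shrinks both the fan-out and the total update volume.

```python
# Toy model (illustrative numbers only, not CenturyLink's real topology)
# of why session resets cascade into an update storm.
def update_volume(table_size: int, peers: int, fanout: int) -> int:
    direct = table_size * peers   # re-announcements sent to direct peers
    cascaded = direct * fanout    # peers re-propagating to their own peers
    return direct + cascaded

# During the storm: many peers, wide re-propagation fan-out.
before = update_volume(table_size=800_000, peers=40, fanout=10)
# After major peers de-peer: fewer sessions, narrower fan-out.
after = update_volume(table_size=800_000, peers=25, fanout=6)
print(before, after)  # the tail-off: volume drops as peers shut down sessions
```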

Archana Kesavan:
It did give them some time to stabilize because they were not trying to honor all those updates that were coming in, which were jamming CenturyLink’s control plane and network. All right. With that, I think we can just jump right into what ThousandEyes saw, in terms of how this outage unfolded. So, let me pull that up.

Angelique Medina:
So, this is the aggregate view across just the huge number of tests that are run from our vantage points. And so, precisely at the start of the outage … this was 10:00 a.m. UTC … we see a really significant increase in terminal interfaces. And so, this is basically where Level 3 is just completely dropping traffic at these particular points. What’s interesting is that even though this had a global impact, a lot of what we’re seeing seemed to initially start on the West Coast. 10:00 a.m. UTC is 3:00 a.m. Pacific Time-

Archana Kesavan:
Pacific. Yep.

Angelique Medina:
… and 6:00 a.m. Eastern time. And, again, this is Sunday, so it almost seems like this maybe potentially … And, we don’t know yet. We’re still waiting on details, but could this have been part of some maintenance event, where they’re making changes to their network? Could it have been because they were responding to a DDoS attack, which is one of the ways Flowspec has been used-

Archana Kesavan:
Flowspec. Exactly.

Angelique Medina:
… We just don’t know. But, the timing does align with when you might want to make changes to your network such that they didn’t impact users. It would be interesting to see if we get more details on this, if that is in fact the case, if they tried to make some benign updates and then just … it didn’t go as they had planned. Yeah.

Archana Kesavan:
Right. No, totally. Because if we actually look into another snapshot in here … As it’s loading up: one of the triggers, or the first set of signatures that we saw with respect to this outage, and what you’re looking at here, is BGP path changes, to be very specific, to our own application’s prefix. Which, again … a lot of services were impacted: Cloudflare, we saw some Google services get impacted, we saw a variety of consumer services, like OpenTable or GoToMeeting, and we could just keep naming names. It basically shut down half the internet yesterday. We were also a victim of that particular outage because we do rely on CenturyLink as, you know, one of our tier one ISPs.

Angelique Medina:
Yeah. Yeah.

Archana Kesavan:
Right. So going to that exact timeframe, just expanding this a little bit in here, around again, 10 o’clock UTC, we stopped seeing-

Angelique Medina:
Because, well, this is the first … this is the beginnings of it. So this was clearly a control plane issue, and from an external standpoint, this is the first manifestation of it. There was a lot of route flapping: they were sending out announcements, they were revoking announcements, and so it was going a little bit haywire from a BGP route perspective.

Archana Kesavan:
Right. Right. And per the RCA, Flowspec was hindering BGP sessions, or bringing down BGP sessions, and having those peers come back up and try to reestablish is probably what we’re seeing here, with a lot of route withdrawals or reestablishing of peering with other providers.

Angelique Medina:
And announcements going out. Yeah.

Archana Kesavan:
If we go across in here from a timeline perspective, just to see what these different updates look like, and to also give you a flavor of what enterprises can do to work around or bypass something like this, right? This is what we’re going to see next. Let me see if we’re bringing up the right timeline. So actually, this is interesting, right? Fifteen minutes after 10 o’clock, we see that there’s some kind of stability with respect to Level 3 and the peering, the routes that it’s established with its peers.

Angelique Medina:
And yet, during this same period of time, they’re dropping traffic. So they’re announcing routes and then they’re black-holing traffic, and this was the behavior across the board, across all of their customers and any peers that were impacted. So the announcements were one thing, but even once the announcements had stabilized, it wasn’t good, because they were dropping traffic on the routes they were announcing. And so we’ll talk a little bit about some peering changes that were made. What was so really, really unusual about this incident is that there was really nothing that their customers could do to route around the issue, because of the nature of this outage and the fact that Level 3 was effectively advertising zombie routes. They were like-

Archana Kesavan:
They were, I feel, unable to honor any updates that were coming in from their peers or from enterprises. Those were essentially trying to withdraw routes, but Level 3’s control plane … CenturyLink’s was probably so jammed because of what was going on that they were really unable to do anything there.

Angelique Medina:
Yeah.

Archana Kesavan:
So they continued to advertise the older routes, which didn’t really serve the purpose of trying to bypass or route around them.

Angelique Medina:
Right, right. Anytime you have an outage like this, you’re thinking, “Okay, I can maybe change my advertisements and that will fix it.” But to your point, their control plane was broken, and so there were a lot of folks who were revoking advertisements to Level 3, and Level 3 was, call it what you will, hijacking, whatever … they were continuing to advertise those routes independently, which we’ll see here.

Archana Kesavan:
There. Angelique, just to call attention to the timeline here, right? The first triggers that we saw were around 10 o’clock, in line with the RCA and with what everybody has seen.

Angelique Medina:
So the start of the incident, yeah.

Archana Kesavan:
Start of the incident. And the second peak that we see here is around 10:45, right? And there, to your point, there are a couple of things happening here.

Angelique Medina:
Yeah. So some of the path changes here are influenced by Level 3 just doing their own thing, and some of them are influenced by ourselves, in this case. So here, we have, at this point in time, revoked the announcement of our prefixes to Level 3. And at the same time, we have established a peering relationship and started advertising through Cogent as a secondary provider here.

Archana Kesavan:
Right. And even before the outage, what this looked like is we had two upstreams: one is the [inaudible 00:17:12] and the other one is Level 3. And then, because of the Level 3 issues, we kind of get dropped from Level 3 and then start peering with Cogent.

Angelique Medina:
Yeah. And two important things were done at this time, which I think are important to point out, and this is one of the takeaways here: there were certain things in this outage that were completely uncontrollable from the standpoint of their customers and enterprises. The fact that their routes were still being hijacked and weren’t valid is one thing, but consider asymmetric routing: even if you can’t control the advertisement of your prefixes, you can control who you’re sending traffic to. And if you know one of those providers is severely incapacitated, you don’t want to be sending them traffic, because it’s just going to get dropped. So in this particular instance, with this update, one, the prefix announcements to Level 3 were revoked, but also, we stopped accepting routes from CenturyLink. And that’s really important because we’re basically not sending them any more traffic, and so if something comes in through, say, Zayo or through Cogent, it will go out that way as well.

Archana Kesavan:
Right. Right. And the way it manifested, and sometimes what you see from a service point of view, is it might seem like, say for instance, another provider is dropping traffic, because you’re only seeing the path one way, but the traffic is actually being dropped at CenturyLink on the reverse route.

Angelique Medina:
Right.

Archana Kesavan:
So I think that was kind of good.

Angelique Medina:
Yeah. Just because it’s coming in through Zayo, for example, and you’re expecting them to work, doesn’t mean that it’s necessarily going out through the same provider. And so you have to understand that, in this instance, you could have controlled the outbound traffic. That’s always in your control. Inbound, you can influence, but you can’t 100% control. Outbound, you have complete control over, and so that’s something to keep in mind.
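That takeaway can be sketched as a simple egress preference list. The provider names and ordering here are illustrative assumptions, not ThousandEyes’ actual configuration:

```python
# Sketch of the outbound-control takeaway: the inbound path can only be
# influenced, but the outbound next-hop is entirely yours to pick.
def pick_egress(preference_order, impaired):
    """Return the first upstream not known to be black-holing traffic."""
    healthy = [p for p in preference_order if p not in impaired]
    if not healthy:
        raise RuntimeError("no healthy upstream available")
    return healthy[0]

upstreams = ["Level3", "Zayo", "Cogent"]
# Even if return traffic still arrives via the impaired provider, we can
# refuse to hand our own outbound traffic to it:
print(pick_egress(upstreams, impaired={"Level3"}))
```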

Archana Kesavan:
Yeah. I think the other interesting thing here, as we were walking through this earlier today, is to get another validation that Level 3 had not ceased sending updates, right? Because here we’re seeing 23 routes, for instance, in this example, where we are established with Cogent, and this is a completely new peering relationship that we triggered because of Level 3’s outage. However, those same 23 routes, we see them through Level 3’s network as well, which, if the withdrawals had gone as expected, we shouldn’t have seen. Right?

Angelique Medina:
We should have seen the same number of paths, or the same number of routes, before and after, but that’s not the case here. So during these periods in which they were doing their re-announcements, the number of routes would basically go up.

Archana Kesavan:
Right.

Angelique Medina:
And then it would kind of return to lower levels throughout this incident, so.

Archana Kesavan:
Yeah. So this was one type of looping situation we saw: Level 3 does this route withdrawal and reestablishment at 10 o’clock, then around 10:45, you see this. At 10:45, you also see a new connection established through Cogent from our end. As we walk through this timeline a little bit, we see some semblance of stability here, right? We see, again, Level 3 back to peering with the other ISPs. So it almost feels like there’s a pattern here: there’s some instability, and then in the next interval, Level 3 seems like it’s established peering, then again, in the other interval, it withdraws its BGP routes and then-

Angelique Medina:
It’s going through these cycles where it’s just announcing the routes, then it will revoke the announcements and then reestablish them, and it keeps going in these cycles. Now, this is just from a control plane standpoint, or a routing standpoint. Effectively, through the duration of the outage, they were just dropping traffic. So it was independent of any of the announcements and even the route flapping. It wasn’t like the route flapping was causing the traffic loss; even when they were not route flapping, they were still dropping traffic.

Archana Kesavan:
Right. Right. One of the interesting updates as the outage was happening was a tweet that we saw from Telia, one of Level 3’s peers, that almost four hours into the outage, around 10:00 AM Eastern, announced that they were asked to de-peer from CenturyLink. And when the outage cleared, which was around 11:00 AM Eastern, so that’s five hours into the outage, Telia sent out another message saying that they’d been asked to re-peer because the outage was basically done and things were stable. We saw that here from another provider, NTT, right?

Angelique Medina:
Yeah. [crosstalk 00:23:11] did it earlier too. I mean, this was three hours into the incident. It’s sort of interesting that Telia said that they had to be asked to de-peer, where one has to wonder, is that the right order? Should you need to wait to be asked, or should there have been some proactive de-peering in order to prevent some impact to their customers? So in the case of NTT here, we can see that about three hours into the incident, rather than peering with Level 3, they’ve begun to peer with Cogent.

Angelique Medina:
Yeah. So that may in fact have addressed some issues for their customers, anyway. The other thing that’s interesting from the BGP standpoint is a question around what CenturyLink was effectively doing by advertising routes that were revoked by their customers, like we had done here. You could almost think of that as route hijacking. Now, they didn’t set out to hijack routes, but effectively, that’s what was happening.

Angelique Medina:
In a route hijacking scenario, one of the tools at your disposal in order to mitigate the attacker, or even just a benign configuration mistake that might be causing this, is to advertise a more specific prefix. That would then be preferred as the more specific match to your service, and then you could basically steer traffic around the issue. That would require a few different things. You have to have that option available to you. Like, if you only have a /24, that’s not necessarily going to be a good idea, because splitting that into /25s might not be the best. There’s maybe some debate on that. But for example, if you’re Cloudflare and you advertise a /20, if you start advertising a /21, that could potentially address the issue, although we didn’t see them make that change. And one has to wonder, maybe because they’re an anycast service, if it was determined that doing that would have had unforeseen negative consequences in terms of how traffic was getting routed and then how much traffic was hitting each of their individual locations and resources.
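The reason a more specific prefix wins is longest-prefix match, which is easy to demonstrate. The prefixes below are documentation ranges chosen for illustration, not Cloudflare’s real announcements:

```python
# Longest-prefix match is why a more-specific announcement can steer
# traffic around a zombie or hijacked route.
from ipaddress import ip_address, ip_network

routes = {
    ip_network("198.51.96.0/20"): "stale path via impaired provider",
    ip_network("198.51.96.0/21"): "new, more-specific path via healthy provider",
}

def best_route(dst: str) -> str:
    # Collect every route covering the destination, then prefer the
    # one with the longest prefix (the most specific match).
    candidates = [net for net in routes if ip_address(dst) in net]
    return routes[max(candidates, key=lambda net: net.prefixlen)]

# Both prefixes cover this address; the /21 wins because it is more specific:
print(best_route("198.51.100.10"))
```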

Archana Kesavan:
Right. Right.

Angelique Medina:
So a lot to consider.

Archana Kesavan:
Right. And also, there was this other question: if they had just shut the interface to CenturyLink, like, physically terminated the interface, which was also done, would that have helped? And no, not in this particular case, because it was really beyond the enterprise’s control. There is no way you could twist CenturyLink’s arm in this particular case to stop advertising the route, right? Setting any type of BGP communities or trying to influence that-

Angelique Medina:
You mean like a no-export community? Their control plane was broken; they weren’t really honoring any new announcement changes. But that’s really just from an external announcement standpoint. Again, you can control who you’re sending traffic to.

Archana Kesavan:
Right. And we did see some cases where a few services were able to recover fast enough, but a few others did not. For instance, I think this was GoToMeeting. I’m trying to … Yeah, I think this was GoToMeeting, where we were able to-

Angelique Medina:
They fared better, it seemed, during the incident than, for comparative purposes, OpenTable. And there could be a few different reasons for that. We can talk a little bit about what happened with them, and the fact that, without doing anything really different, they were able to avoid some of the higher level of damage. So let’s maybe look at OpenTable first and see how they experienced it. And then we can go to GoToMeeting.

Archana Kesavan:
Yeah. That makes sense. So what we’re looking at here is … I’m trying to access opentable.com from a few global locations. And again, right at 10:00 AM UTC, we start seeing issues. Again, this is Level 3 showing significant packet drops in here, and this is across the board, right? All locations whose traffic is going through Level 3 are impacted. As you can see from the timeline here, this lasted pretty much till 14:25, which is actually when the service came back up.

Angelique Medina:
They basically were impacted for the duration of the incident.

Archana Kesavan:
Right. And the way they recovered is truly just by completely bypassing Level 3. And also, this is the point in time where any kind of route advertisements or BGP-related pushes that the enterprise had done were actually being-

Angelique Medina:
Honored.

Archana Kesavan:
… honored. So it makes sense…

Angelique Medina:
So they took some remediation action when they should have, but it’s just that because of how this outage unfolded and that CenturyLink wasn’t honoring those, they didn’t actually get implemented until the end of this incident.

Archana Kesavan:
Right. Right.

Angelique Medina:
Well, so this is now GoToMeeting and again, this is an enterprise that in many ways, like OpenTable, did all the right things that they should have done, but they had a very different experience of the outage than OpenTable.

Archana Kesavan:
Than OpenTable. Right. Right. And just from a service view, if you look at this here, we’re looking at packet loss in the network. We do start seeing the impact of the outage right around 10:00. So around 10:04, we’re starting to see packet drops across basically all networks. And the reason you’re seeing this across all networks is because the next hop, or the upstream from-

Angelique Medina:
The next connected interface is Level 3.

Archana Kesavan:
It’s the peering. So it’s kind of the penultimate node where you’re seeing the drops, and the reason is because GoToMeeting, which is a Citrix product, basically has just one upstream provider, which is Level 3. And so when Level 3 goes down, everybody else-

Angelique Medina:
One active. One active service provider, which is interesting because we usually see … The typical architecture is that you have two service providers and you use them in active-active mode, where you’re load balancing across them. And then if there’s an issue, the other one will just carry the load. Or even, in some instances, you might have an active-active config and then one passive as a backup. You could have a variety of things. I don’t see-
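A minimal sketch of the difference between the two upstream architectures described here. The provider names are examples, not Citrix’s actual setup:

```python
# Active-active load-balances across every healthy upstream; active-passive
# uses only the first healthy provider in preference order, failing over
# when the active one goes down.
def egress_paths(providers, mode, down=frozenset()):
    healthy = [p for p in providers if p not in down]
    if mode == "active-active":
        return healthy      # balance across all healthy upstreams
    return healthy[:1]      # active-passive: one carries the load

providers = ["Level3", "GTT"]
print(egress_paths(providers, "active-active"))                    # both in use
print(egress_paths(providers, "active-passive"))                   # Level3 only
print(egress_paths(providers, "active-passive", down={"Level3"}))  # failover to GTT
```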

Archana Kesavan:
That’s kind of what we saw in ours.

Angelique Medina:
Right.

Archana Kesavan:
We had active-active, and Level 3 was kind of the backup.

Angelique Medina:
Yeah, we don’t see active-passive too much, but that’s what we see here.

Archana Kesavan:
Right. So was this right here? Yeah. I think this is, what, fifteen minutes, around 15 minutes into the outage, we start seeing Citrix establish active peering with GTT, completely withdrawing from Level 3. And the impact of that is seen … If we look at it here, these packet drops really just lasted for, what, twenty minutes, and then the service is back up. Obviously, if you are connecting from CenturyLink itself, you’re still going to see some kind of issues, because your vantage point is within CenturyLink’s network, and you’re going to see drops there. That’s inevitable. But even so, they were able to course correct, and we started seeing availability of the service come back up pretty quickly here.

Angelique Medina:
Right. And that immediate spike after all of those announcement changes, now announcing through GTT, and then maybe Level 3 making additional zombie announcements, would have been almost like a little flapping, because the routes are changing. And so that could have also contributed to that immediate spike in packet loss as routes were converging. But what’s kind of interesting is, if you go back to the BGP route visualization, you can see that even after they’ve made this change, Level 3 never goes away, right? Because they’re still advertising, right? Even though the intent with bringing GTT online was to just use them as the service, Level 3 is like, “Nope, we’re not going away.” But yet we see that there’s not a lot of packet loss, and Level 3 is not really kind of-

Archana Kesavan:
Main decider.

Angelique Medina:
… in as many paths as they were prior to GTT’s introduction. And what’s really interesting about that is that the fact that GTT is so densely peered may have been the reason why they were simply preferred as a path to Citrix, even though Level 3 was still in the mix. Level 3 was still advertising, but GTT pops in, they have more peers, so providers at that point may have simply been preferring GTT, and there were just more paths available through that provider. And that’s more the luck of the draw, right? Because with OpenTable, they were using Level 3 and they were using Zayo, right? And Zayo is considered a lower-tier provider than Level 3. And so, because Level 3 was still advertising, a lot of folks may still have been preferring that path.

Archana Kesavan:
Right. It really just depends on who your backup is, how densely peered they are, and how well used they are as well. And again, to your point, if this had been, say, Zayo or some other smaller ISP in here, we might not have seen this level of recovery, right?

Angelique Medina:
Exactly. Exactly. And that’s almost just luck, because you may think, okay, well, I need a higher-tier backup. Well, I mean, it could’ve gone either way. You could have had GTT go down and be the problem, and then you were getting all of your traffic sent to GTT, and there was the issue.

Archana Kesavan:
Right.

Angelique Medina:
So that’s just how internet routing works in the wild.

Archana Kesavan:
So yeah. So a lot of things that happened in the course of five hours and I can’t even imagine what kind of day it was for the Level 3 Ops team.

Angelique Medina:
And their customers, too.

Archana Kesavan:
Yeah. Yeah, definitely the customers. And then, yeah, I think we were talking about this earlier: the only silver lining, truly, is that it happened on a Sunday. If it had happened during a weekday, the impact we would have seen would have been really, really catastrophic, more catastrophic than what it actually turned out to be.

Angelique Medina:
Absolutely. So, okay. So lessons learned. What can-

Archana Kesavan:
Yeah. We have some interesting … I mean, we’ve discussed it, but some good takeaways for enterprises and providers. Yeah, you want to kick that off?

Angelique Medina:
Yeah. I mean, I think one thing that’s interesting, that we went over at the very top of the show, is that the root cause analysis, or the statement that was put out by CenturyLink, is pretty basic. And in fact, during the course of the incident, I think it wasn’t until … I mean, we’re talking hours into the incident before there was any kind of public statement made by CenturyLink. And so if you’re one of their customers, you don’t know what’s going on, and you’re not getting any kind of quick communication. So it’s important to have more transparency, perhaps, from service providers, like we see a lot from cloud providers and from providers like Cloudflare. Being more open with customers as to what’s going on, and being quick to communicate, I think is something that enterprises can push more with their service providers. One way to do that, of course, is to keep them honest with independent information and visibility into how they’re performing.

Archana Kesavan:
And I think that one of the things that we were talking about earlier was also how the first line of support at these service providers, when you’re opening a ticket, for instance, is not necessarily aware of the exact details of what’s happening in the NOC. They’re aware of a problem, and they’re going to tell you, “Yeah, we are aware of a problem. We are working on it.” But there’s really no more detail that-

Angelique Medina:
They won't tell you scope, or if there's a really specific thing that you could be doing, it's just not going to come.

Archana Kesavan:
Right. Right. Yeah. So in terms of transparency and keeping providers honest... In a lot of cases, when you're dealing with ISPs and service providers, enterprises have to go through this need to prove innocence, to establish that, hey, there is something going on, right? I mean, we're not trying to push the ball on you and blame you, but there is seriously something going on. So I think genuine awareness on both sides is going to help. I look at this as everybody coming to the table trying to solve the issue together, rather than placing blame and making it somebody else's problem, right? It's really about how much we can collaborate to help you understand scope and also resolve the issue.

Angelique Medina:
Yeah, absolutely. And then I think another takeaway is really understanding that, at the end of the day, internet routing is somewhat uncontrolled, unpredictable, and also contextual. It depends on where your users are sitting and which path they're going to take to your service. You can't always control that; you can't control who peers with whom and how that changes at any given time. So just be aware of that, and understand, at least a few degrees away from you, what your dependencies are. And from an outbound standpoint, in terms of advertising routes to your service, that's something you can influence but can't 100% control. What you can control is where you send traffic.

Angelique Medina:
So making some of those changes locally could have helped mitigate some of the impact. And then the other thing is that understanding outages in context is really important, because, as we mentioned earlier, timing is everything. If this had happened during the weekday, I think the impact on services, and the financial impact for a lot of enterprises, would have been far greater. So thinking about how you evaluate your service providers from that standpoint is also something to keep in mind.
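To make the "changes locally" idea concrete: one common way an enterprise can shift outbound traffic away from a troubled transit provider is to lower the local preference on routes learned from that provider's BGP session, so an alternate provider's paths win. The sketch below is purely illustrative, in a hypothetical Cisco IOS-style syntax; the AS number 64500 and the neighbor address are invented for this example (3356 is Level 3's well-known ASN):

```
! Hypothetical sketch: deprefer routes learned from the affected
! transit session so outbound traffic prefers the backup provider.
route-map DEPREF-AFFECTED-TRANSIT permit 10
 set local-preference 50
!
router bgp 64500
 neighbor 192.0.2.1 remote-as 3356
 neighbor 192.0.2.1 route-map DEPREF-AFFECTED-TRANSIT in
```

Note that this only steers traffic you originate; as discussed above, inbound paths toward your service depend on how other networks propagate your advertisements, which you can influence but not fully control.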

Archana Kesavan:
Yeah, totally. It’s definitely been a learning experience for us. I mean, as we’ve been digging through these outages over the last few years, these really large outages actually do surprise us in terms of what else could go wrong, right?

Angelique Medina:
Right.

Archana Kesavan:
I mean, we’re like, “Yeah, I think we understand everything that could go wrong with BGP updates,” but not really. In this particular case, Flowspec as a feature, as a BGP extension, is actually a really useful and valid protocol that a lot of service providers still use, but it’s really unfortunate how it manifested in this case. Whether that was because of a misconfiguration, or whatever the reason might be, and how it actually came to happen, we still don’t know, but it’s yet another new thing that could go wrong with BGP.
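For context, BGP Flowspec (defined in RFC 5575, updated by RFC 8955) lets a router distribute traffic-filtering rules, for example to drop a DDoS flow, to other routers via BGP updates. A hypothetical Junos-style rule might look like the sketch below; the route name and destination address are invented for illustration:

```
/* Hypothetical sketch of a Flowspec rule: drop UDP port 53
   traffic aimed at a single host, distributed via BGP. */
routing-options {
    flow {
        route drop-dns-flood {
            match {
                destination 203.0.113.10/32;
                protocol udp;
                destination-port 53;
            }
            then discard;
        }
    }
}
```

The power of the mechanism, one announcement propagating filtering behavior across many routers, is also what made a bad Flowspec announcement so damaging in this incident.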

Angelique Medina:
Yes. So really interesting, and we’ll get into the nitty gritty of this in a little more depth than even the show in our blog, so be sure to check that out. And if you’re interested in how it might’ve impacted your services, because, again, it manifested differently for everybody, let us know, drop us a note. We’d be happy to chat about what you saw and what you experienced, and if you have any questions, we’re happy to take a look.

Archana Kesavan:
Yeah. And if you were somebody on the Ops team who was struggling to make this all go away, and there was something different that you tried, whether it worked for you or didn’t, let us know. It’d be great to even have you on the show and help educate the community. We’re all in this together. So that would be-

Angelique Medina:
Absolutely. Absolutely. So with that, we’re going to close out the show. As always, don’t forget to subscribe, and if you do, we will send you a cool working-from-home t-shirt. Just drop an email to theinternetreport@thousandeyes.com with your address and your t-shirt size, and we’ll get that right over to you. And until next time-

Archana Kesavan:
And have a great week.
