Identifying Root Cause with Routing Outage Detection

Posted on July 26th, 2016

In the second installment of our exploration of the new Internet Outage Detection features, we’ll delve into the dark art that is BGP routing. When routes go awry, prefixes often become unreachable, and impacts can reverberate throughout the network stack to cause issues in the network and application layers. When large-scale outages like this happen, it’s often difficult to correlate the network and routing layers and pinpoint the root cause of routing problems.

Routing Outage Detection detects routing outages based on prefix reachability and provides insights on both the scope of the outage and its likely root cause. In this post, we’ll discuss how this new feature works and explore some complex routing events that had impacts on major services. If you’d like to learn more about detecting network-layer outages with Traffic Outage Detection, see our companion post.

The Algorithms Behind Routing Outage Detection

Routing Outage Detection continuously analyzes the entire ThousandEyes data set of BGP routing tables collected from over 300 public route monitors. A routing outage is detected when at least 30 prefixes in the same country see reachability issues in the same time period, where reachability is the percentage of time that a given prefix can be reached via known routes.
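The per-country detection rule described above can be sketched in a few lines. This is an illustrative interpretation of the post, not ThousandEyes' actual implementation; the input shape and the idea that "reachability issues" means a dip below full reachability are assumptions.

```python
from collections import defaultdict

# Assumed thresholds, based on the description above.
REACHABILITY_THRESHOLD = 100.0  # % of time a prefix is reachable via known routes
MIN_AFFECTED_PREFIXES = 30      # minimum affected prefixes per country

def detect_routing_outages(samples):
    """samples: iterable of (prefix, country, reachability_pct) for one
    time window. Returns the set of countries with a detected outage."""
    affected = defaultdict(set)
    for prefix, country, reachability in samples:
        if reachability < REACHABILITY_THRESHOLD:
            affected[country].add(prefix)
    return {country for country, prefixes in affected.items()
            if len(prefixes) >= MIN_AFFECTED_PREFIXES}
```

The grouping by country is what lets the algorithm distinguish a broad regional event from scattered, unrelated reachability blips.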

When an outage is detected, an ‘Outage Detected’ dropdown appears with additional information about the scope of the outage, both in the aggregate and as it relates to your organization, as well as the network(s) most likely to contain the root cause. Let’s explore a few recent routing outages to see what Routing Outage Detection can do.

Unexpected Changes in Telx’s Upstream ISPs

Our first example is from a test to an IP address located in the network of Telx, a U.S. data center provider. On July 15, 2016, Telx’s /21 prefix saw a dip in reachability and a large spike in average AS path changes to 1.6, indicating that each route monitor observed 1.6 route changes on average. We see time periods when routing outages are detected shown in purple, in the same way as with traffic outages — feel free to explore the data at this share link.

Figure-1
Figure 1: On July 15, 2016, Telx’s /21 prefix saw a large spike in average AS path changes to 1.6.
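The average AS path changes metric can be read as a simple mean over route monitors, so a value of 1.6 means each monitor observed 1.6 route changes on average in the window. A minimal sketch of that reading (an assumption about how the number is derived, not the product's exact formula):

```python
def average_path_changes(changes_per_monitor):
    """changes_per_monitor: list with one entry per route monitor, the
    number of AS path changes that monitor observed in the time window.
    Returns the mean across monitors."""
    if not changes_per_monitor:
        return 0.0
    return sum(changes_per_monitor) / len(changes_per_monitor)
```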

The ‘Outage Detected’ dropdown presents a wealth of information about the global scope of the outage and its likely root cause. Below, we see that the outage is located in the U.S. and has affected 171 prefixes, a considerable number, across the routing data set.

The root cause analysis shows the networks most likely to have caused the routing issues. In this case, Hurricane Electric and Telx are singled out with the highest % Routes, where % Routes is defined as the percentage of affected routes traversing a given network. Because Hurricane Electric has the highest % Routes of 38%, it is inferred to be the AS most associated with the affected routes and thus most likely to contain the root cause.

Figure-2
Figure 2: % Routes is defined as the percentage of affected routes traversing a given network.
Because Hurricane Electric has the highest % Routes, it is inferred to be the AS most likely to contain the root cause.
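The % Routes metric itself is straightforward to compute from the AS paths of the affected routes: for each AS, count the share of affected routes whose path traverses it, then rank. A sketch under that interpretation (the ASNs in the test are illustrative, and this is not the exact ThousandEyes algorithm):

```python
from collections import Counter

def percent_routes(affected_paths):
    """affected_paths: list of AS paths (each a list of ASNs) for the
    affected routes. Returns {asn: % of affected routes traversing it}."""
    total = len(affected_paths)
    counts = Counter()
    for path in affected_paths:
        for asn in set(path):  # count each AS at most once per route
            counts[asn] += 1
    return {asn: 100.0 * n / total for asn, n in counts.items()}
```

The AS with the highest % Routes (`max(pct, key=pct.get)`) is then inferred to be the one most likely to contain the root cause, as in Figure 2.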

The BGP Route Visualization confirms the hypothesis of the root cause analysis. If we isolate only the route monitors seeing path changes during the outage, a clear pattern emerges. Before the outage occurred, this set of monitors traversed Hurricane Electric to reach the origin AS (AS 36086).

Figure-3
Figure 3: Before the outage occurred, this set of monitors traversed Hurricane Electric to reach the origin AS (AS 36086).

During the wave of route changes, the same route monitors observe a route flap: NTT appears as another upstream ISP for Telx, but routes to NTT are quickly withdrawn during the same time period, as shown by the dotted red lines to AS 2914. Those routes are then immediately taken back by the original upstream ISP, Hurricane Electric, shown by solid red lines to AS 6939.

Figure-4
Figure 4: The route monitors observe a route flap: NTT appears as another upstream ISP for Telx, but routes to NTT
are quickly withdrawn and reclaimed by Hurricane Electric during the same time period.
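A route flap of this kind — a prefix withdrawn and re-announced within a short window — can be flagged mechanically from a stream of BGP updates. A simplified sketch, assuming update tuples of (timestamp, prefix, action) and a hypothetical 300-second flap window:

```python
def find_route_flaps(updates, window=300):
    """updates: iterable of (timestamp, prefix, action), with action in
    {"announce", "withdraw"}. Flags prefixes that are withdrawn and then
    re-announced within `window` seconds, a simplified notion of a flap."""
    last_withdraw = {}
    flapping = set()
    for ts, prefix, action in sorted(updates):
        if action == "withdraw":
            last_withdraw[prefix] = ts
        elif action == "announce" and prefix in last_withdraw:
            if ts - last_withdraw[prefix] <= window:
                flapping.add(prefix)
    return flapping
```

Real flap-damping implementations on routers use decaying penalty counters rather than a fixed window, but the withdraw-then-re-announce pattern is the same signal.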

It’s quite possible that Hurricane Electric originated the route flap by inadvertently withdrawing and re-advertising the same routes within a short time period, causing route instability. It’s also possible that Telx, the origin AS, changed its route preferences and misconfigured its announcements to upstream ISPs. Either way, the changes visible in the BGP Route Visualization corroborate the calculations of the Routing Outage Detection algorithms.

So why should we care? Though the causes of routing and prefix reachability issues are often complex and difficult to analyze, their effects are plain to see, often disrupting traffic on the network layer and causing service availability problems. As a result of the Hurricane Electric route flap, Telx saw a significant spike in packet loss to 12.5%, triggering the detection of a traffic outage in Hurricane Electric as well, primarily in New York, NY. This indicates that the route flap set off layer 3 packet loss that likely stemmed from convergence issues: traffic entered Hurricane Electric just before the routes for Telx through Hurricane Electric were withdrawn, and those packets were dropped because they no longer had routes to reach Telx.

Figure-5
Figure 5: As a result of the Hurricane Electric route flap, Telx saw a significant spike in packet loss to 12.5%, triggering the detection of
a traffic outage in Hurricane Electric, primarily in New York, NY.

Solving the Mystery of JIRA’s Massive Outage

As promised, we’ll dive into the routing data from the July 10, 2016 outage that completely brought down JIRA, a major SaaS service, for about an hour. As we saw in our companion post on Traffic Outage Detection, JIRA experienced a traffic outage in NTT America, but as we dig into the routing layer you’ll see that the blame is not so clear-cut.

We last left off our analysis of JIRA’s outage scratching our heads at the traffic path changes observed: before and after the outage, traffic transits both Level 3 and NTT, but during the outage, traffic transits only NTT, ultimately terminating within NTT’s network. Let’s investigate the data from the routing layer to make sense of the issues — feel free to follow along at this share link.

Under normal circumstances before the outage began, routes to JIRA traversed one of two upstream ISPs: NTT and Level 3. This confirms what we saw in the network layer traffic paths taken before the outage.

Figure-6
Figure 6: Before the outage began, routes to JIRA traversed two upstream ISPs: NTT and Level 3.

But at the same time that network-layer traffic is terminating in NTT, a routing outage is also detected based on prefix reachability issues, with average reachability dipping to 1.9% and sudden increases in AS path changes indicating significant route instability.

Figure-7
Figure 7: A routing outage is also detected, with average reachability dipping to 1.9% and sudden increases
in AS path changes indicating significant route instability.

At the start of the routing outage, the total number of affected prefixes across the ThousandEyes user base is 35, which is relatively low and corroborates the narrow scope of the traffic outage. However, the root cause analysis places the blame on Level 3, whereas Traffic Outage Detection sees issues in NTT.

Figure-8
Figure 8: The root cause analysis of Routing Outage Detection places the blame on
Level 3, whereas Traffic Outage Detection sees issues in NTT.

So which algorithm is right? At the beginning of the routing outage, the BGP Route Visualization shows every single route to the /24 prefix for JIRA being withdrawn (dotted red lines), causing the prefix to be unreachable for all route monitors during the outage. Level 3 is called out as the most likely culprit because, as the primary upstream ISP, it is associated with the most withdrawn routes and likely played a part in the route instability along with the secondary upstream ISP, NTT.

Figure-9
Figure 9: Every route to the /24 prefix for JIRA is withdrawn (dotted red lines), causing the prefix
to be unreachable for all route monitors during the outage.

With the /24 prefix unreachable for about an hour, longest-prefix matching kicked in and routers fell back to forwarding traffic toward the less specific /16 prefix covering JIRA’s address space. Unfortunately, the /16 was configured to direct routes to NTT’s AS (AS 2914) rather than the correct origin AS (AS 133530).

Figure-10
Figure 10: Routers began forwarding traffic to a misconfigured /16 prefix that directed routes to NTT’s AS (AS 2914)
rather than the correct origin AS (AS 133530).
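This fallback is ordinary longest-prefix matching: when the /24 disappears, the most specific remaining prefix that still covers the destination (here, the /16) wins. A minimal sketch using Python's `ipaddress` module; the prefixes are hypothetical documentation ranges, since the post does not give JIRA's actual address blocks:

```python
import ipaddress

def longest_prefix_match(destination, routes):
    """routes: {prefix_str: next_hop_asn}. Returns the ASN for the most
    specific prefix covering the destination, or None if no route covers it."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix_str, asn in routes.items():
        net = ipaddress.ip_network(prefix_str)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None
```

With the /24 present, traffic reaches the correct origin AS; withdraw it, and the same lookup silently selects the covering /16 — exactly the failure mode in the JIRA outage, where the /16 pointed at AS 2914 instead of AS 133530.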

These routing changes are reflected in the Path Visualization, where traffic paths switch from transiting Level 3 and NTT (using the /24) to transiting only NTT (using the /16). The traffic ultimately terminates in NTT because the misconfigured /16 leads not to JIRA but to NTT, where the traffic cannot find its destination IP. JIRA’s issues are thus twofold: severe route instability for the /24 prefix, likely due to mistakes on the part of Level 3 and NTT, and a misconfigured backup /16 prefix, likely an oversight by the operators of JIRA’s network.

JIRA’s outage was ultimately resolved when NTT and Level 3 began advertising routes to the /24 prefix again, returning traffic and AS paths to their original state.

As it turns out, both Traffic and Routing Outage Detection were correct, but represent two different perspectives. Traffic Outage Detection showed actual failures in the network infrastructure, while Routing Outage Detection inferred the likely root cause of issues on the control plane, and of the outage as a whole.

You can run all of the above analyses with a free ThousandEyes account — sign up today to put Internet Outage Detection to the test.
