At NANOG 68 in Dallas, TX on October 18, 2016, ThousandEyes CEO Mohit Lad presented on recent outages that had large-scale effects on Internet infrastructure and availability.
During his talk, Mohit discussed three recent, impactful events: the June 2016 DNS root server DDoS, the May 2016 Sea-Me-We-4 cable cut and the April 2016 AWS route leak.
Internet Outages Happen All the Time
Mohit explained that the ThousandEyes platform observes outages constantly, but that not every outage affects a large number of networks and users.
Using our outage detection algorithms, we see that outages affect roughly 170 interfaces and about 1,600 prefixes per hour.
Mohit then focused on three outages from 2016: the DNS root DDoS, the Sea-Me-We-4 submarine cable fault and the AWS route leak.
DNS Root DDoS
Mohit first described the analysis of the June 2016 DDoS attack on the DNS root servers, which you can read about in our previous blog post on the attack. The event revealed an important correlation: the more anycast sites a given root server had, the smaller the impact it experienced during the DDoS attack. At the same time, the capacity at each site was also an important factor in how severely a root server’s performance was affected.
Mohit also listed several indicators that can help identify DDoS attacks based on his experience:
- Availability (DNS root query) below 90%
- Resolution time (DNS root query) more than one standard deviation higher than normal
- Multiple roots impacted (in the case of a DNS root DDoS)
- Multiple anycast POPs impacted
- Multiple upstreams impacted
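As a rough illustration, the indicators above can be turned into simple checks. The sketch below is not how the ThousandEyes platform works; the function name, its parameters and the use of a sample standard deviation over a baseline window are all assumptions made for the example. Only the 90% availability threshold and the one-standard-deviation rule come from the list above.

```python
from statistics import mean, stdev


def ddos_indicators(availability_pct, resolution_ms, baseline_ms,
                    roots_impacted, pops_impacted, upstreams_impacted):
    """Evaluate the DDoS indicators for one DNS root measurement.

    baseline_ms is a hypothetical window of "normal" resolution times
    (at least two samples) used to estimate the mean and standard deviation.
    """
    # "More than one standard deviation higher than normal"
    slow_threshold = mean(baseline_ms) + stdev(baseline_ms)
    return {
        "low_availability": availability_pct < 90.0,
        "slow_resolution": resolution_ms > slow_threshold,
        "multiple_roots": roots_impacted > 1,
        "multiple_pops": pops_impacted > 1,
        "multiple_upstreams": upstreams_impacted > 1,
    }


if __name__ == "__main__":
    baseline = [20, 22, 18, 21, 19]  # made-up resolution times in ms
    flags = ddos_indicators(82.0, 45.0, baseline, 3, 4, 2)
    for name, hit in flags.items():
        print(f"{name}: {hit}")
```

The more indicators that fire at once, the stronger the case for a DDoS; any single one on its own can have other explanations.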
Sea-Me-We-4 Submarine Cable Fault
Mohit also discussed the analysis of the Sea-Me-We-4 submarine cable cut that occurred in May 2016, which you can find in our related blog post on the outage. The cable fault had a ripple effect across the globe, from Europe and Asia to Latin America.
When monitoring your networks, look for these indicators of cable faults:
- Many path traces impacted in adjacent POPs on the same network
- Jitter can be an even more telling measure than packet loss
- Multiple networks impacted suggest a cable fault (elevated loss and jitter), an IXP failure (elevated loss on many interfaces in the same POP) or a peering failure (terminal loss, path changes)
- Dropped BGP sessions may occur when problems persist
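The third indicator above maps symptoms to likely causes, and that mapping can be sketched as a small rule table. This is only an illustration: the `Observation` fields and the `likely_causes` function are hypothetical names, and reducing each symptom to a boolean is a simplifying assumption rather than anything presented in the talk.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of what monitoring shows during an event."""
    networks_impacted: int
    elevated_loss: bool
    elevated_jitter: bool
    many_interfaces_same_pop: bool
    terminal_loss: bool
    path_changes: bool


def likely_causes(obs):
    """Return every candidate cause whose symptoms are all present."""
    # The heuristic only applies when more than one network is affected.
    if obs.networks_impacted < 2:
        return []
    causes = []
    if obs.elevated_loss and obs.elevated_jitter:
        causes.append("cable fault")
    if obs.elevated_loss and obs.many_interfaces_same_pop:
        causes.append("IXP failure")
    if obs.terminal_loss and obs.path_changes:
        causes.append("peering failure")
    return causes


if __name__ == "__main__":
    event = Observation(3, True, True, False, False, False)
    print(likely_causes(event))
```

The function returns all matching candidates instead of picking one, since the symptoms can overlap and a human still needs to confirm the cause.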
AWS Route Leak
The final event that Mohit discussed was the AWS route leak that occurred on April 22, 2016. First he showed what AWS routes look like on a normal day: the prefix 18.104.22.168/20 is advertised from Amazon.com’s AS (AS 16509), which is peering with the expected providers NTT, TI Sparkle, Telia, CenturyLink and Hurricane Electric.
As an example, traffic traveling from Portland, OR to AWS US East (the same AS above) normally transits Hurricane Electric in Chicago, which peers with AWS.
However, during the route leak, traffic from Portland instead went all the way to Zurich, Switzerland and then terminated there, and we saw the same pattern with a lot of other AWS traffic.
Looking at the routing layer, we noticed that two new, more specific /21 prefixes for Amazon were introduced and advertised from 10:10-12:10 PDT. These prefixes were advertised by Innofield (AS 200759) as belonging to a private AS (AS 65021). The prefix advertisements were then propagated through Hurricane Electric. As a result, traffic transiting Hurricane Electric began to be routed to a private AS rather than Amazon’s AS.
This route leak was particularly problematic because the leaked prefixes were more specific than Amazon’s own. Routers forward traffic along the longest matching prefix, so any network that accepted the leaked /21s preferred them over Amazon’s legitimate, less specific prefix. However, the impacts were not widespread because most ISPs did not accept the bogus routes.
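To illustrate why a more specific prefix wins, here is a minimal sketch of longest-prefix matching using Python's standard `ipaddress` module. The `best_route` function is a hypothetical helper, and the 10.0.0.0 prefixes are placeholders chosen for the example, not the prefixes involved in the incident; only the two AS numbers are taken from the event described above.

```python
import ipaddress


def best_route(destination, routes):
    """Pick the route with the longest prefix that covers the destination.

    routes maps prefix strings to origin AS numbers.
    Returns (network, origin_asn), or None if nothing matches.
    """
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin_asn in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, origin_asn))
    if not matches:
        return None
    # Longest prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)


if __name__ == "__main__":
    normal = {"10.0.0.0/20": 16509}            # placeholder legitimate route
    leaked = {**normal,
              "10.0.0.0/21": 65021,            # placeholder leaked routes
              "10.0.8.0/21": 65021}
    print(best_route("10.0.9.1", normal))
    print(best_route("10.0.9.1", leaked))
```

With only the /20 in the table, the destination resolves to the legitimate origin. Once the two /21s are present, every address in the /20 is covered by a longer match, so all of that traffic follows the leaked routes.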
A post-mortem from the Innofield team later indicated that the event was an accidental route leak, likely caused by a misconfigured route optimizer. This event is similar in nature to a July 2015 incident in which Enzu leaked dozens of prefixes in Los Angeles.
Some indicators of route leaks to look out for include:
- New prefix or new destination ASN
- Major BGP route changes (significant change in new path)
- Origin or next-hop ASNs that may be in geographic locations far from the expected destination
- High packet loss at one of the ASNs in the path or ASNs with a common next-hop ASN
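The first indicator above (a new prefix or a new origin ASN) is straightforward to check against a baseline of expected announcements. The sketch below is one possible way to do that, not a description of any product feature: `route_leak_flags` and the baseline dictionary are hypothetical, the prefixes are placeholders, and the private-ASN check is an extra signal suggested by this particular incident and not an item from the list. The private ranges used are the ones reserved by RFC 6996. The geographic and packet loss indicators need measurement data and are not covered here.

```python
import ipaddress

# Private-use AS number ranges reserved by RFC 6996 (16-bit and 32-bit).
PRIVATE_ASN_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private_asn(asn):
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def route_leak_flags(baseline, prefix, origin_asn):
    """Compare one observed announcement against expected {prefix: origin}."""
    net = ipaddress.ip_network(prefix)
    known = {ipaddress.ip_network(p): asn for p, asn in baseline.items()}
    flags = []
    if net not in known:
        flags.append("new prefix")
        # Known prefixes that contain the new one (same IP version only).
        covering = [k for k in known
                    if k.version == net.version and net.subnet_of(k)]
        if covering:
            flags.append("more specific of known prefix")
        if any(known[k] != origin_asn for k in covering):
            flags.append("new origin ASN")
    elif known[net] != origin_asn:
        flags.append("new origin ASN")
    if is_private_asn(origin_asn):
        flags.append("private origin ASN")
    return flags


if __name__ == "__main__":
    baseline = {"10.0.0.0/20": 16509}  # placeholder prefix, real origin AS
    print(route_leak_flags(baseline, "10.0.8.0/21", 65021))
    print(route_leak_flags(baseline, "10.0.0.0/20", 16509))
```

An announcement that raises several flags at once, as the leaked /21s would have, deserves a closer look at the path and at loss along it.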
For more details from Mohit’s talk, check out the video of the full presentation below.
Our CTO Ricardo Oliveira presented on a related topic at RIPE 73 in Madrid, Spain on October 25, 2016. He discussed a number of recent large-scale Internet outages and how they were detected by our new Internet Outage Detection feature. If you’re interested in Ricardo’s talk, watch the video of his presentation or take a look at his slides.