Watch on YouTube – The Internet Report – Ep. 25: Sep 21 – Oct 4, 2020

This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet, and why. On today’s episode, we take a look at the recent Azure AD disruption that significantly impacted access to Microsoft cloud services and apps (as well as third-party apps) for nearly three hours. We then go under the hood on a recent BGP hijacking in which Telstra began announcing routes to services that didn’t belong to it, such as Quad9. Catch this episode to hear our take on these incidents, and see below for some additional commentary on these outages and a sneak preview of next week’s episode.

A Lesson in Application Complexity

On September 28th, at approximately 2:25 PM PT (21:25 UTC), Microsoft experienced a global service incident that impacted the reachability of nearly all of its applications and services, as well as third-party apps and services that use Azure Active Directory (Azure AD). Users in North America and Australia were most impacted, with only a 17% success rate in North America and a 37% success rate in Australia during the incident, according to Microsoft. Users in Asia and Europe, and those who were already authenticated when the incident began, were less likely to have experienced issues. The disruption was resolved for most users by 5:23 PM PT (00:23 UTC, Sept. 29).

ThousandEyes observed the incident from vantage points around the globe, confirming not only that Microsoft’s frontend web servers were reachable and unimpeded by network-related issues, but also that status codes and error messages received from Microsoft’s servers indicated internal issues within its Azure AD service — a service that Microsoft later identified to be the source of the disruption.

Figure 1. Service unavailable errors (503 error code) received by users connecting to login.microsoftonline.com
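
The distinction drawn above, between a server that cannot be reached and a server that is reachable but returns errors, can be sketched in a few lines. This is a minimal illustration, not how ThousandEyes classifies test results; the function name `classify_probe` and its labels are assumptions made for this example.

```python
def classify_probe(tcp_connected, http_status):
    """Roughly attribute a failed probe to the network or the application.

    tcp_connected: True if a TCP connection to the server was established.
    http_status:   HTTP status code received, or None if no response arrived.
    The labels returned here are hypothetical, chosen for this sketch.
    """
    if not tcp_connected:
        # The server could not be reached at all: suspect the network path.
        return "network"
    if http_status is None:
        # Connected, but no HTTP response came back.
        return "no-response"
    if 500 <= http_status <= 599:
        # The server answered and reported an internal problem (e.g. 503).
        return "application"
    return "ok"


# A reachable server answering 503 points at the application, not the network.
print(classify_probe(True, 503))
```

During the Azure AD incident, probes fell into the "application" case: connections succeeded, but the responses carried 503 errors.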

While network and application outages are common and inevitable, disruptions to Azure AD are extremely rare. Microsoft’s Root Cause Analysis of this incident provides several clues as to why the service has a nearly flawless record of availability. From an architecture standpoint, the service is built to be highly resilient, with Microsoft describing its deployment as an “active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries.” When updates to the service are required, they are rolled out in concentric rings of increasing scope over the course of several days. The innermost ring is the test environment, followed by the parts of the service that affect internal users; finally, the production environment is updated in controlled stages.

Disruption to the Azure AD service occurred due to a previously unidentified issue that prevented its Safe Deployment Process (SDP) system from correctly interpreting deployment metadata. Rather than deploying only to its test environment, which is the normal process, the system deployed the change directly to all rings, including production, simultaneously, which caused service availability to degrade. The rollback had to be performed manually because the same issue within the SDP system had corrupted the deployment metadata.
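
To make the ring model concrete, here is a minimal sketch of a staged rollout that fails closed when its metadata cannot be interpreted. It is not Microsoft’s SDP implementation; the ring names, the `max_ring` metadata field, and the `deploy` and `is_healthy` callbacks are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical ring names, ordered from smallest to largest scope.
RINGS = ["test", "internal", "production-stage-1", "production-stage-2"]


class MetadataError(ValueError):
    """Raised when deployment metadata cannot be interpreted."""


def staged_rollout(metadata, deploy, is_healthy):
    """Deploy ring by ring up to metadata["max_ring"].

    Stops after the first ring that reports unhealthy and returns the
    list of rings that were deployed to. `deploy` and `is_healthy` are
    caller-supplied callbacks that take a ring name.
    """
    max_ring = metadata.get("max_ring") if isinstance(metadata, dict) else None
    if max_ring not in RINGS:
        # Fail closed: unreadable metadata must never widen the rollout.
        raise MetadataError(f"cannot interpret deployment metadata: {metadata!r}")

    deployed = []
    for ring in RINGS[: RINGS.index(max_ring) + 1]:
        deploy(ring)
        deployed.append(ring)
        if not is_healthy(ring):
            break  # Halt before the change reaches a wider ring.
    return deployed


# A change scoped to the test ring touches nothing else.
print(staged_rollout({"max_ring": "test"}, deploy=print, is_healthy=lambda r: True))
```

The design choice worth noting is the default: when the metadata is missing or malformed, the sketch refuses to deploy anywhere rather than falling back to every ring.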

It’s not common, but it happens

Google experienced a similar incident last year, when a routine maintenance update to its network management servers accidentally targeted all of the clusters controlling a part of its network in North America (again, due to a bug) rather than the smaller set of servers intended. Despite these incidents, both of these providers have enviable service records, with fairly infrequent service disruptions despite operating applications, services and networks on a massive scale.

Understand What You’re Working With

What the Azure AD incident does illustrate is the complex web of interactions that make up what users see and experience as a single monolithic app. The reality is that, under the hood, modern applications and services are no longer monolithic. They are modular, distributed, increasingly dependent on third-party apps and services, and often very, very chatty.

Given the complexity of most enterprises’ heterogeneous service ecosystems, it’s to be expected that things will go wrong. The important thing is to be prepared and have visibility. Your internally developed applications likely have more dependencies than you realize, not only from a delivery standpoint (e.g. CDN and DNS providers), but also on the backend, where third-party API services enable aspects of your application.

Understanding how the components of your business-critical applications work together (the critical elements, authentication paths, potential points of failure, and even the performance of each object and interaction) ensures that you can properly manage them: fixing and optimizing what you own, and staying informed about and alerted to what you don’t.
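
One simple way to start mapping those dependencies is to record them as a graph and walk it. The sketch below is illustrative only; the service names in `graph` are hypothetical placeholders, not a description of any real application.

```python
def transitive_dependencies(graph, service):
    """Return every service that `service` depends on, directly or indirectly.

    graph maps a service name to the list of services it calls.
    """
    seen = set()
    stack = list(graph.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            # Follow this dependency's own dependencies as well.
            stack.extend(graph.get(dep, ()))
    return seen


# Hypothetical dependency map for an internally developed web app.
graph = {
    "web-app": ["cdn", "dns", "payments-api"],
    "payments-api": ["identity-provider"],
}

# The identity provider shows up even though the web app never calls it directly.
print(sorted(transitive_dependencies(graph, "web-app")))
```

Indirect dependencies like the identity provider in this example are the ones most likely to be missing from an informal inventory.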

Telstra BGP Hijack: Yet Another Internet Routing Mishap

On September 29th, a flawed configuration rolled out to one of Telstra’s edge routers caused hundreds of prefixes to be erroneously announced to the Internet, leading the service provider to effectively hijack routes that belonged to other ASes.

ThousandEyes observed the hijacking of various services, including routes belonging to Quad9, a public DNS resolver service.

Figure 2. Telstra announced the same /24 as Quad9, leading to two origins during the hijacking

About 25% of Quad9’s traffic was directed through Telstra’s network and subsequently blackholed. Not all traffic was dropped because Telstra’s announcement was not a more specific prefix, but the same prefix that Quad9 announces, a /24. With two origins for the same /24, route selection fell to other BGP attributes rather than prefix length. Quad9’s routes were mostly preferred during the incident, but not entirely, which led to the partial blackholing.
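
The difference between a same-length announcement and a more-specific one can be shown with a small longest-prefix-match sketch. It uses a documentation prefix (192.0.2.0/24) rather than Quad9’s real address space, and the origin labels and the function name `best_routes` are assumptions made for this example. It models only prefix length, not full BGP best-path selection.

```python
import ipaddress


def best_routes(destination, announcements):
    """Return the announcements that win longest-prefix match.

    announcements: list of (prefix, origin_label) tuples.
    More than one result means prefix length alone cannot decide, so
    other BGP attributes (such as AS-path length) would break the tie.
    """
    addr = ipaddress.ip_address(destination)
    covering = [
        (ipaddress.ip_network(prefix), origin)
        for prefix, origin in announcements
        if addr in ipaddress.ip_network(prefix)
    ]
    if not covering:
        return []
    longest = max(net.prefixlen for net, _ in covering)
    return [(str(net), origin) for net, origin in covering if net.prefixlen == longest]


# Same-length hijack: both origins tie, so only some networks pick the hijacker.
same_length = [("192.0.2.0/24", "legitimate"), ("192.0.2.0/24", "hijacker")]
print(best_routes("192.0.2.9", same_length))

# More-specific hijack (hypothetical): the /25 wins wherever it is accepted.
more_specific = [("192.0.2.0/24", "legitimate"), ("192.0.2.0/25", "hijacker")]
print(best_routes("192.0.2.9", more_specific))
```

In the first case both origins remain candidates, which matches the partial impact seen here; in the second, the hijacker’s route would have attracted all traffic for the covered addresses in any network that accepted it.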

Figure 3. Quad9 service availability degraded during the hijacking incident

A Sneak Peek at Next Week’s Episode

Be sure to tune in next week as we review Slack’s October 5th service disruption.

Figure 4. Slack’s October 5th service disruption

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.
