AWS US-East-2 Outage and Monitoring Microservices

Posted on May 31st, 2018

On the morning of May 31st, 2018, shortly after midnight Pacific Time, users reported an outage in the Amazon Web Services (AWS) US-East-2 region. An Internet connectivity issue caused this data center to lose reachability to the Internet, affecting any application with external dependencies, such as API services and integrations with third-party SaaS applications. Although the outage was detected and resolved relatively quickly, it illustrates the complex web of inter-service communication that drives modern microservices applications. A failure anywhere along this chain can have cascading effects on digital experience, employee productivity, or both.

ThousandEyes maintains a global network of vantage points called Cloud Agents that can monitor network and application performance across any network. Our Cloud Agents in the US-East-2 region were able to detect and pinpoint this outage as pertaining to an API service.

Figure 1: 100% packet loss from AWS US-East-2 region.

Scope of the Outage

A more detailed look at the ThousandEyes outage detection data reveals that the outage was indeed limited to the Columbus, OH area, where Amazon’s US-East-2 data centers are located. Also notable in this view is the diversity of inter-service communication paths that the outage affected: CDN providers, payment gateways, CRM providers and private cloud providers all communicate with AWS over the Internet.

Figure 2: Correlated outage details showing impact to external services.
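
To make the idea of scoping an outage concrete, here is a minimal sketch of how probe results from several vantage points could be aggregated by region to see whether loss is confined to one location. This is not how ThousandEyes computes its outage views; the function names, the 50% threshold and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def loss_by_region(probes):
    """Return {region: packet loss percent} from (region, sent, received) tuples."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for region, n_sent, n_received in probes:
        sent[region] += n_sent
        received[region] += n_received
    # Skip regions with no packets sent to avoid dividing by zero.
    return {
        region: 100.0 * (sent[region] - received[region]) / sent[region]
        for region in sent
        if sent[region] > 0
    }


def affected_regions(probes, threshold=50.0):
    """Return the sorted regions whose aggregate loss is at or above the threshold."""
    losses = loss_by_region(probes)
    return sorted(region for region, loss in losses.items() if loss >= threshold)


# Hypothetical sample data: (vantage region, packets sent, packets received).
SAMPLE_PROBES = [
    ("us-east-2", 50, 0),
    ("us-east-2", 50, 0),
    ("us-east-1", 50, 50),
    ("us-west-2", 50, 49),
]
```

With this sample data, `affected_regions(SAMPLE_PROBES)` singles out `us-east-2`, the only region at 100% loss, while the small loss seen from `us-west-2` stays below the threshold.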

The possible digital business implications of this outage are worth considering:

  • Not being able to get data from a CRM to drive a digitally transformed customer service process. For example, an online service scheduling app or widget that needs to look up the physical address of a customer in order to finish the process of creating a service appointment.
  • Not being able to refresh a CDN cache, resulting in customer or prospect confusion when they read outdated content on a website.
  • Not being able to run a charge against a payment gateway API endpoint to complete an e-commerce transaction.

With enterprises depending so heavily on communications that cross infrastructures (networks and services that they don’t own or control), it’s more important than ever to understand these dependencies in order to manage the digital experiences delivered to customers and employees.

Figure 3: Complex web of inter-service communications.

ThousandEyes Network Intelligence for the Cloud

ThousandEyes makes it easy to monitor networks and services you don’t own or control, including all the major public cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. Even if you don’t use all of these cloud services, your customers and business partners surely do. It’s important to understand both their experience of your app when they connect from a different cloud provider and your app’s experience when it consumes API services hosted with another cloud provider. While high availability features offer some protection against outages, as noted in our AWS Direct Connect blog post, they are not a panacea. Network outages are a fact of life, and if you can’t control the networks that your business depends on, you should at least monitor them.
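
As a rough illustration of what the most basic form of such monitoring might look like, the sketch below tries a TCP connection to each external dependency and reports which ones are reachable. It is a simplified stand-in, not the ThousandEyes product or its API; the dependency names and `example.com` hostnames are hypothetical placeholders, and a successful TCP connect only shows that the network path is open, not that the service behind it is healthy.

```python
import socket


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def check_dependencies(dependencies, probe=is_reachable):
    """Map each dependency name to True (reachable) or False (unreachable)."""
    return {name: probe(host, port) for name, (host, port) in dependencies.items()}


# Hypothetical inventory; replace with the endpoints your application calls.
DEPENDENCIES = {
    "crm-api": ("crm.example.com", 443),
    "payment-gateway": ("payments.example.com", 443),
    "cdn-origin": ("cdn.example.com", 443),
}
```

Calling `check_dependencies(DEPENDENCIES)` from inside the cloud region that hosts your app gives a quick view of which outbound paths are open. The `probe` parameter is there so the check can be swapped for an HTTP request or stubbed out in tests.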

Request a demo to learn how ThousandEyes can give you deep visibility into cloud providers.
