On Tuesday, February 28th, 2017, Amazon Web Services' S3 service in the US-East-1 region suffered a complete outage from 9:40am to 12:36pm PST. AWS S3 (Simple Storage Service) is a cloud object storage service that many other services rely on to store and retrieve files from anywhere on the web. In addition, many other AWS services that depend on S3 (Elastic Load Balancing, the Redshift data warehouse, the Relational Database Service and others) also had limited to no functionality. Much like the AWS outage caused by a route leak in 2015, the S3 outage disrupted a large number of services that depend on AWS over the course of roughly three hours, including Quora, Coursera, Docker, Medium and Down Detector.
During the outage, the S3 service was completely down: our Cloud Agents observed 0% availability and 100% packet loss for its entire duration. Though S3 is typically used on the back end and is not readily apparent to end users, today's outage revealed many services' dependencies (in many cases, over-dependencies) on S3 and exposed a critical lack of redundancy in many services' cloud storage architectures.
Immediate and Complete Outage
Starting at 9:40am PST, the availability of the S3 service immediately dropped from normal levels to 0%. At the same time, packet loss jumped to 100%. Both metrics remained at those levels for the entirety of the outage. You can follow along with the data here: https://gokahptkc.share.thousandeyes.com.
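To make the availability measurement concrete, here is a minimal sketch of the kind of HTTP check such a monitoring agent performs. This is our own illustration, not the actual Cloud Agent logic, and the endpoint you pass in would be whatever S3 URL you care about:

```python
import socket
import urllib.error
import urllib.request

def check_availability(url, timeout=5):
    """Return True if the endpoint answers with any HTTP status, False if
    no connection can be made at all (the condition seen during the outage)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        # An HTTP error response still means the service is reachable
        # at the network layer; only the application is unhappy.
        return True
    except (urllib.error.URLError, socket.timeout, OSError):
        # DNS failure, connection refused, or timeout: unavailable.
        return False
```

Run periodically from many vantage points, a check like this is what produces an availability time series; during the outage every probe against S3 would have fallen into the final `except` branch.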
Looking at the Path Visualization, we can better understand exactly where packets were lost. The issue is clear: all network connections are terminating within the AWS US-East-1 infrastructure in Ashburn, Virginia.
The Impact on Online Services
We also saw evidence of many services' reliance on S3. For example, when Amazon S3 came back online at 12:36pm PST, Coursera's website recovered at exactly the same moment.
Interestingly, the S3 outage revealed how hidden some services’ dependencies can be. There are a number of ways that services can rely on a data storage solution like S3, and many of them are not readily apparent to end users or customers.
Services may rely on S3 in a variety of ways:
- A service may be directly hosted on S3, as in the case of static websites. Such a service's fate is tied to S3's; during the outage, it would have been completely unavailable.
- A service may have objects on its web pages that are hosted on S3. In this case, the service would not be completely unavailable, but certain objects served out of S3 may be unable to load, and page load times may be impacted.
- A service may have critical sub-services (such as user session management, customer records or media files) that depend on S3 or other impacted AWS services. This might manifest itself as a complete outage or reduced functionality.
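The second dependency type above can be surfaced with a simple audit of the URLs a page loads. The sketch below is our own illustration (the helper name and hostname heuristic are assumptions, covering common patterns like `bucket.s3.amazonaws.com` and `s3-<region>.amazonaws.com`, not an exhaustive list of S3 endpoints):

```python
import re
from urllib.parse import urlparse

def s3_hosted(resource_urls):
    """Given the URLs of objects a page loads, return those that appear
    to be served out of S3, based on a rough hostname heuristic."""
    hits = []
    for url in resource_urls:
        host = (urlparse(url).hostname or "").lower()
        # Matches bucket.s3.amazonaws.com, s3.amazonaws.com,
        # and legacy regional endpoints like s3-us-west-2.amazonaws.com.
        if host.endswith(".amazonaws.com") and re.search(r"(^|\.)s3[.-]", host):
            hits.append(url)
    return hits
```

Feeding in the resource list from a page's HAR capture, for instance, would show how many of its objects would have failed to load during the outage.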
As a result, the failure modes of the many affected services during the AWS S3 outage turned out to be telling indicators of how they rely on Amazon's services.
Was it a DDoS?
With the catastrophic DDoS attack on Dyn just barely in the rearview mirror, many can’t help but consider whether the AWS S3 outage was similarly caused by a DDoS attack. However, our data provides strong evidence to the contrary.
First, the pattern of performance impacts (consistent 0% availability and 100% packet loss) does not match what we've typically seen in outages caused by DDoS attacks. In past DDoS attacks, packet loss and availability issues generally do not peak immediately but rather build over time, and the performance impacts are extremely volatile and variable over the course of the attack.
There is additional evidence against a DDoS attack in the above Path Visualization (Figure 2). Notice that traffic is terminating within the AWS US-East infrastructure, rather than at peering connections with other networks; the latter would be more indicative of a DDoS attack. In addition, since AWS runs its own DDoS mitigation (AWS Shield), we wouldn't expect (and don't see) any notable route changes or the appearance of third-party DDoS mitigation vendors in the path.
A Data Center Outage?
AWS S3 is an object storage service that stores files such as music, images, documents and text. It spans multiple regions around the world, each composed of multiple data centers. Because the outage affected only one region, US-East in Northern Virginia, yet impacted the entire service across that region's multiple data centers, a facility-specific issue such as a power outage or device failure is unlikely to be the root cause.
The Likeliest Root Cause
Given the immediacy of the performance impacts, the root cause of the S3 outage looks much more likely to be an internal network issue: an internal misconfiguration or infrastructure failure whose symptoms manifest at the network layer.
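One way to see why these symptoms point at the network layer: during the outage, even a bare TCP handshake to S3 failed, whereas a purely application-level fault would typically still accept connections and return HTTP errors. A minimal sketch of that distinction (host and port values here are illustrative):

```python
import socket

def tcp_reachable(host, port=443, timeout=3):
    """Network-layer check: can we complete a TCP handshake at all?
    False here (with 100% packet loss) suggests an infrastructure or
    routing failure rather than a bug in the application behind it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Comparing the result of this check with an HTTP-level check helps separate "the network path is gone" from "the service is up but erroring," which is exactly the distinction our measurements captured.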
Today’s AWS S3 outage is yet another reminder of the importance of being aware of dependencies within your own service and other business-critical applications. Keep in mind that these dependencies can be as public as an ISP or content delivery network, or as concealed as a backend data storage solution. Think through every piece of how important applications are delivered, and fortify your infrastructure with redundancy that you’ll be thankful for on days like this.
The data presented in this post can be collected through HTTP Server Tests, available through a free ThousandEyes Lite account. Sign up today to ensure that critical services, hosted on AWS or elsewhere, are as resilient as you expect.
Update (March 3, 2017): The AWS S3 team published a post-mortem summarizing the cause of the outage, stating that they mistakenly took more servers offline than intended. The affected S3 subsystems then required a full restart, during which S3 was unable to service requests. Our analysis was in line with what Amazon reported: the problem indeed turned out to be an internal mistake whose symptoms manifested at the network layer (100% packet loss, as no TCP connections could be established) and at the application layer (0% availability).