Moving critical applications and services to the cloud brings unprecedented power to IT teams who no longer have to worry about building and maintaining infrastructure. At the same time, cloud computing introduces a solid dose of unpredictability due to the sheer complexity of the Internet and cloud connectivity. We get reminded of this reality on a reasonably regular basis when outages occur in cloud or other service providers. On June 2nd between 12pm and 12:15pm PDT, ThousandEyes detected a network outage in Google’s network that impacted services hosted in some of the US regions of Google Cloud Platform. The outage lasted for more than four hours and affected access to various services including YouTube, G Suite and Google Compute Engine. While Google has since provided an official incident report explaining the root cause and remediation of the outage, the following is a recap of what we saw in the ThousandEyes platform, plus some thoughts on how you should think about resiliency and visibility as you move into the cloud.

Early Indicators

ThousandEyes started to see both user-level impacts and macro network impacts from the Google Cloud outage starting shortly after 12:00pm PDT on June 2nd, right around the time that initial social media and posts on downdetector began. Figure 1 shows the disruption as seen from 249 global ThousandEyes Cloud Agents in 170 cities that were monitoring access to content hosted in a GCE instance in GCP us-east4-c, but served dynamically via a CDN edge network. Users accessing other GCP-hosted services would have started to feel the impact of the outage at this time.

Google Cloud Platform global outage
Figure 1: The outage started shortly after 12pm on June 2nd, impacting global users connecting to GCP us-east4-c.

Figure 2 shows the extent of packet loss — 100% — for users attempting to reach a site hosted in GCP us-west2-a from locations around the globe.

Google Cloud Platform significant packet loss
Figure 2: Packet loss was total between our global monitoring agents and content hosted in a GCE instance in GCP us-west2-a.

At a macro level, we also were also able to see connectivity and packet loss issues that impacted Google network locations in the eastern US, including Ashburn, Atlanta and Chicago. As an example, Figure 3 shows a topology mapping of Google network locations. The light blue nodes on the left show the number of source interfaces per location attempting to reach two Google-hosted domains located on the far right. Between those two points, 15 interfaces located in Google’s network are experiencing heavy packet loss that is preventing traffic from the source interfaces from reaching the services domains. As an interesting side note, the lower portion of the topology in Figure 3 shows evidence of network connectivity issues impacting the Google network in India.

Packet loss in the Google network
Figure 3: Packet loss on 15 interfaces within Google’s US network impacted eastern US Google source locations shown on the left as they tried to reach two GCP-hosted service domains shown on the right.

From a connectivity point of view, our Path Visualization showed traffic dropping at the edge of Google’s network, as seen in Figure 4 below. In fact, for 3.5 hours of the 4+ hour outage, we saw 100% packet loss for global monitoring locations attempting to connect to a service hosted in GCP us-west2-a. Similar losses were seen for sites hosted in several portions of GCP US East, including us-east4-c.

High packet loss at the Google network edge
Figure 4: High packet loss conditions at Google’s network edge.

However, the outage was not uniform across the regions impacted. For example, in Figure 5 below, a different service hosted in GCP us-west1 was fully available during the course of the outage. In addition, all European regions and Asian regions were also unaffected.

GCP-hosted service in us-west1 unaffected by the outage incident
Figure 5: GCP-hosted service in us-west1 is unaffected by the outage incident.

In fact, most GCP regions were unaffected by the outage, including regions in the US as well as in Europe and elsewhere. For example, Figure 6 shows that the europe-west4 region, which is located in the Netherlands, remained reachable during the outage period. The green status shows that global monitoring agents were encountering little-to-no network packet loss in accessing the region.

No packet loss for GCP Europe West4 region
Figure 6: Negligible packet loss for users connecting GCP Europe West4 region.

Google’s Analysis

Google was aware of the issues early on and announced service impacts on Google Compute Engine beginning at 12:25pm PDT, as seen in Figure 6. They subsequently posted an update at 12:53pm that the service impacts were tied to broader networking issues. By 1:36pm, Google had identified the issue as related to high levels of network congestion in the eastern US, which aligned with the elevated packet loss we started seeing before the announcement (see Figure 2 above).

Initial reports of network issues from Google
Figure 7: Google first reported the issues on its status page at 12:25pm PDT.

Service Restoration

Beginning at roughly 3:30pm PDT, we started to see packet loss conditions easing and reachability to Google services improving, as seen in Figure 8, which shows updated status for connectivity to a service in GCP us-west2-a. Reachability continued to improve over the next hour, which we continuously tracked over that period.

Google service reachability starts to improve
Figure 8: Service reachability started to improve from global locations to an affected service in us-west2-a around 3:30pm PDT.

Service finally appeared fully restored by 4:45pm PDT, as seen in Figure 9.

Google services back to normal
Figure 9: Availability of a GCP-hosted service in us-west2-a at 4:45pm PDT.

At 5:09pm, Google posted that the network congestion issue was resolved as of 4pm PDT for all impacted users, as seen in Figure 10. They also promised an internal investigation of the issue.

Google announced resolution of the network congestion issue
Figure 10: Google announced the time of resolution to be 4:00pm PDT.

Where do cloud users go from here?

One of the most important takeaways from cloud outages is that it’s vitally important to ensure your cloud architecture has sufficient resiliency measures, whether on a multi-region basis or even multi-cloud basis, to protect from future recurrence of outages. After all, it’s only reasonable to expect that IT infrastructure and services will sometimes have outages, even in the cloud.

The cloud and the Internet are inherently prone to unpredictability because they are massive, complex and endlessly interconnected. The cloud is arguably still the best way to do IT for most businesses today, but it carries risks that no team should be unprepared for. Given the increasing complexity and diversity of the infrastructure, software and networks you rely on to run your business, you need timely visibility so you can tell what’s going on and get resolution as fast as possible. The good news is you don’t need to have the global infrastructure and an engineering team the size of a FAANG company to see through the cloud and the Internet.

If you’d like to have the visibility illustrated in this post for all your cloud infrastructure, SaaS and third party API providers, then drop us a line. We’d be glad to help you understand how to get the insight you need to make the cloud and Internet as transparent and accountable to your business goals as you can make them.

If you want to learn more about this outage, watch our recent on-demand webinar, Analysis of a Major GCP Outage.

Subscribe to the
Network Intelligence Blog!
Subscribe
Back to ThousandEyes Blog