How Intuit Monitors Connectivity to AWS

Posted on July 31st, 2017

Our most recent ThousandEyes Connect in Santa Clara had a star-studded lineup of customer speakers from PayPal, Intuit and Netflix. In this post we summarize the presentation from Daniel Shanahan, Staff Network Performance Engineer at Intuit.

ThousandEyes Connect presentation by Daniel Shanahan
Figure 1: Daniel Shanahan presenting at ThousandEyes Connect Santa Clara.

Daniel is a 15-year veteran of network operations and an expert in network protocol analysis and application performance. He is responsible for the performance analytics of the infrastructure behind the TurboTax platform. His session focused on the importance of monitoring connectivity to applications hosted in AWS and the role of ThousandEyes in monitoring a hybrid cloud environment.

Intuit’s Hosting Journey to AWS

Daniel kicked off the session with some background on the evolution of the TurboTax platform. At its inception, TurboTax was desktop software installed on end-user desktops and laptops. The migration to TurboTax.com, an online portal, proved challenging because the monolithic application performed poorly at scale. To cope with the problem, Intuit redesigned TurboTax around a modular, service-based architecture. The new approach had many benefits: it made the application easier to develop and deploy while decoupling the software from the infrastructure. It also made it easier to migrate the application to an Infrastructure-as-a-Service (IaaS) provider like AWS.

Daniel said, “AWS makes a lot of sense for us. It is reliable, scalable and globally available. But one of the biggest benefits is that it’s elastic.” To illustrate his point, Daniel presented a snapshot of how throughput to an individual network component of TurboTax.com fluctuates over the year (Figure 2). As one would expect, the beginning of tax season sees a spike in throughput, but for the remainder of the season, utilization is fairly low. The elastic nature of AWS allows Intuit to spin up resources as necessary, instead of investing in maximum capacity that remains underutilized 70% of the time.

Network throughput chart
Figure 2: Elastic utilization (throughput) calls for an elastic workload.

Hybrid Micro-Services Architecture

With the modular architecture in place and with AWS as their IaaS provider, Intuit adopted a hybrid approach to deploying TurboTax.com. The multi-tiered services (Web, App and Database) are split across Intuit’s on-prem data centers and AWS. When a customer navigates to TurboTax.com, the call comes into TurboTax Web services in the on-prem data center, but may be redirected to other services hosted in the external AWS data center.

Micro-services architecture diagram
Figure 3: Intuit’s hybrid micro-services architecture.

Monitoring a Hybrid Cloud

As one would expect, with applications split across different locations and broken down into micro-services, monitoring and visibility become critical functions. Daniel spent a few minutes describing Intuit’s monitoring strategy. Intuit looks at monitoring from two perspectives: Application Monitoring and Infrastructure Monitoring. Daniel explained that because applications are independent of the infrastructure, monitoring an application is the same whether it runs in an on-prem data center or in a cloud data center. “However, the infrastructure side is a little more complex,” noted Daniel.

“While principles of SNMP, packet analysis and syslog work great for the on-prem data center, they do not apply to AWS as there is limited visibility and ability to instrument,” said Daniel. To work around that, Intuit uses a combination of AWS CloudWatch and collectd metrics. Even with this combination of monitoring tools, Intuit noticed a visibility gap.

The Visibility Gap

“What is missing is monitoring the interaction between these services,” said Daniel. He then walked us through an example to explain this further. When a user logs into TurboTax Web (located in the on-prem Intuit data center) but needs to access a previous year’s filing or W-2, traffic is redirected to the Customer Data Storage service located in AWS (Figure 3). When the Web team notices a poor customer experience manifested through errors and increased latency, they assume it is a problem within the Customer Data Storage service. However, application monitoring shows that there is nothing wrong with the data storage service or within the AWS infrastructure. When both services’ teams have eliminated the possibility of faults within their own domain, they default to the conclusion that Daniel’s team is responsible for the infrastructure issues. “We think the network is broken!” is a phrase Daniel has grown familiar with. He said, “This is where ThousandEyes fits in. We need to validate whether the network is indeed broken and identify if there is packet loss or increased latency between the hybrid infrastructures.”

ThousandEyes Deployment

Initially, Intuit started using ThousandEyes Cloud Agents to monitor its online-facing assets such as TurboTax and QuickBooks. Daniel’s team quickly noticed the value and adopted ThousandEyes to close the visibility gap he outlined earlier. Enterprise Agents, deployed within Intuit’s data centers, monitor the TurboTax services hosted within AWS. “We now have an understanding of the performance of the connectivity between us and AWS,” recounts Daniel.

ThousandEyes deployment diagram
Figure 4: ThousandEyes deployment within Intuit to monitor TurboTax services in AWS.

Enterprise Agents from the Intuit data center poll dummy EC2 and S3 instances in each Availability Zone across all AWS regions used by Intuit. To keep it simple and avoid false positives, the EC2 and S3 instances are standalone and have no other services hosted on them. Network tests from the Enterprise Agents target the EC2 and S3 instances to download a 1 MB data file. The data collected through these tests is used in several ways:

  • Build a baseline of performance metrics including latency, response time and packet loss.
  • Proactively identify issues in the infrastructure before they affect the application.
  • Diagnose and triage problems by pinpointing when and where they happen in the network.
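The baseline idea in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Intuit’s or ThousandEyes’ implementation: the function names, the three-standard-deviation threshold and the use of a plain timed HTTP download are all hypothetical choices made for the example.

```python
import statistics
import time
import urllib.request


def timed_download(url, timeout=10.0):
    """Fetch a test file and return the elapsed time in milliseconds.

    The URL is supplied by the caller; any small static file would do.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000.0


def build_baseline(samples_ms):
    """Summarize probe results; None marks a probe that failed outright."""
    ok = [s for s in samples_ms if s is not None]
    if len(ok) < 2:
        raise ValueError("need at least two successful samples")
    return {
        "mean_ms": statistics.mean(ok),
        "stdev_ms": statistics.stdev(ok),
        # Share of failed probes, not packet loss on the wire.
        "failure_rate": (len(samples_ms) - len(ok)) / len(samples_ms),
    }


def is_anomalous(baseline, sample_ms, k=3.0):
    """Flag a failed probe, or one slower than mean + k standard deviations."""
    if sample_ms is None:
        return True
    return sample_ms > baseline["mean_ms"] + k * baseline["stdev_ms"]
```

In use, one would collect `timed_download` results on a schedule, pass the history to `build_baseline`, and check each new sample with `is_anomalous`. A fixed multiple of the standard deviation is only one possible threshold and may need tuning for traffic with strong seasonal swings.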

Operationalizing ThousandEyes

“Collecting data is only one part of the solution, but how we use the data is extremely critical as well,” said Daniel. As part of a larger monitoring initiative, Intuit leverages Wavefront, a data aggregation and visualization tool. Wavefront allows for reporting and trend analysis by collecting data from multiple monitoring tools. “While the ThousandEyes dashboard and UI is fantastic, not every OPS team wants to learn a new tool. So we feed the metrics generated by ThousandEyes into Wavefront for a unified view,” highlighted Daniel.

ThousandEyes integration diagram
Figure 5: ThousandEyes integration with 3rd party tools and vendors.

Intuit leverages the ThousandEyes API to download data and stream it up to Wavefront, as shown in Figure 5. Wavefront provides the first indication of an error, but one has to dive into ThousandEyes to diagnose and identify what’s going on. Daniel said, “Wavefront will indicate an increased packet loss to a particular AWS zone; however, to understand what’s causing the packet loss, we rely on ThousandEyes Path Visualization.” He added that before ThousandEyes, finding the root cause of packet loss was an excruciating process that involved capturing packets, analyzing them and still being unable to localize the issue.
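To make the hand-off to Wavefront concrete, here is a rough Python sketch of turning one test result into a line of Wavefront’s plain-text data format (`<metric> <value> <timestamp> source=<source>` followed by optional `key="value"` point tags). The talk did not show Intuit’s actual code, so the record fields, the metric name and the helper function are hypothetical, and the step that fetches results from the ThousandEyes API is deliberately left out.

```python
def to_wavefront_line(metric, value, timestamp, source, tags=None):
    """Render one data point as a Wavefront data format line."""
    parts = [metric, repr(float(value)), str(int(timestamp)), f"source={source}"]
    for key in sorted(tags or {}):
        # Point tag values are double-quoted; escape any embedded quotes.
        escaped = str(tags[key]).replace('"', '\\"')
        parts.append(f'{key}="{escaped}"')
    return " ".join(parts)


# Hypothetical record: these field names are illustrative only and are
# not the ThousandEyes API schema.
record = {
    "agent": "dc-agent-1",
    "region": "us-west-2",
    "loss_pct": 2.5,
    "ts": 1501459200,
}

line = to_wavefront_line(
    "thousandeyes.network.loss",  # assumed metric name
    record["loss_pct"],
    record["ts"],
    record["agent"],
    {"region": record["region"]},
)
print(line)
# thousandeyes.network.loss 2.5 1501459200 source=dc-agent-1 region="us-west-2"
```

Lines like this would typically be written to a Wavefront proxy, which forwards them to the service. Tagging each point with the AWS region or zone is what would let a dashboard show packet loss per zone, as Daniel describes.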

In the future, Intuit plans to deploy Enterprise Agents in AWS to get richer contextual data on bidirectional network paths.

Interested in learning more about how our customers monitor their networks? Stay tuned for more posts summarizing talks from PayPal and Netflix.
