Cultivating Operational Analytics at Credit Suisse

Posted on November 3, 2015

In the first of our reports from ThousandEyes Connect New York, we’ll discuss the talk by Darrell Westbury, Director of Operational Analytics at Credit Suisse. Darrell is a 7-year veteran of Credit Suisse, having worked across a variety of IT Operations roles. He joined us to talk about “Operational Analytics: Data is the New Soil.” The talk’s title was inspired by the TED Talk “The Beauty of Data Visualization” by data journalist David McCandless.

Figure 1: Darrell Westbury presenting at ThousandEyes Connect New York.

As Darrell sees the IT Ops domain, “the wealth of data that you already have can be applied in new ways to lend new insights that you would have never seen before.” He has spent time thinking about how the Credit Suisse team can use data to have a greater impact.

What Is Operational Analytics?

Darrell is building an Operational Analytics (OA) team at Credit Suisse, a new function that acts as an overlay within IT Operations. It is composed of subject matter experts in a number of fields, from storage to Linux to Windows to networks, as well as data analytics and visualization specialists.

So what does Operational Analytics entail? According to Darrell, OA is the application of “big data principles and data analytics to the IT Operations realm. The focus is to discover trends and patterns in IT systems data, which is typically busy, complex and noisy, and to use that insight to project forward to identify the potential risks that have not happened yet.”

His team also focuses on reducing the mean time to resolve (MTTR) for the issues that do occur. “No matter how hard you try to avoid issues, it’s great to be able to say ‘We think the likeliest root cause is in this area and here is the appropriate team to respond to it.’”

Working with OA Data

Within his team, Darrell targets five categories of OA data:

  1. Machine: system logs, events, performance and capacity data
  2. Wire: decoded packet capture data
  3. Agent: intercepted system calls and application method invocations
  4. Synthetic: simulated transactions of customer experience
  5. Human-maintained: inventories, names, classifications

So what does the team actually do with this data? Darrell lays out three steps.

First up, data onboarding. This involves identifying golden data sources that may be common or very specific to a particular application or service. Darrell’s team wants both. “We want to be able to see the world from the service down, rather than the infrastructure up.” His team then works on ETL (Extract, Transform, Load) jobs to get data into their Hadoop environment. The relatively unglamorous part of this, according to Darrell, is data quality management, which uses reference data and business logic to determine and ensure quality. The OA team spends about 50% of its time in the data onboarding stage.
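To make the data quality step concrete, here is a minimal sketch of checking incoming records against reference data and a simple business rule before loading them. It is not the Credit Suisse pipeline: the `Record` type, the `REFERENCE_INVENTORY` mapping, the host names and the non-negative rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    host: str      # source system name (hypothetical field)
    metric: str    # e.g. "cpu_pct" (hypothetical field)
    value: float


# Hypothetical human-maintained reference data: host -> owning service.
REFERENCE_INVENTORY = {
    "web-01": "e-commerce",
    "db-01": "e-commerce",
}


def validate(records, inventory):
    """Split records into (clean, rejected) lists.

    A record is clean when its host exists in the reference inventory
    and its value passes an assumed business rule (non-negative).
    """
    clean, rejected = [], []
    for record in records:
        known_host = record.host in inventory
        sane_value = record.value >= 0
        if known_host and sane_value:
            clean.append(record)
        else:
            rejected.append(record)
    return clean, rejected
```

Rejected records are kept rather than dropped so that someone can review whether the reference data or the source is at fault.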

The second phase is data science, where the team spends 25% of their time. “This is where we look for the trends and patterns.” It’s also where the team does reconciliation and statistical analysis, looking for drivers, seasonality and other causal relationships. “This is the fun part where we get to find something new about an application or service.”

The third phase is data visualization, taking up the remaining 25% of team effort. “This is where the rubber hits the road. We build a narrative based on everything we’ve learned and put it into a compelling visual language.” According to Darrell, the significance of this step can’t be overstated. A picture can communicate “quickly and intuitively.”

ThousandEyes As A Source of OA Insights

The OA team uses ThousandEyes for a number of purposes to derive operational insights. Currently, this includes five areas:

  1. DNS testing of the resolution of Credit Suisse domains
  2. TCP Port Listeners to show network paths and health
  3. Data path testing with loss, latency and jitter across the Internet
  4. Page load testing and object analysis of key web services
  5. Synthetic transactions to understand site navigation and object performance

Pre-Production Canary

Darrell then jumps into several examples to show how ThousandEyes data can be used. The first example shows “the fundamental benefit of looking at the world from the outside-in.” It involves “an e-commerce environment that was, literally, days from going live.” Internal instrumentation showed a healthy, optimal internal network environment. But from the outside-in, the “ThousandEyes probes that we had set up gave us clear evidence that we were dropping between 50-100% of all communications that were coming in. It turned out that our load balancers were misconfigured. We were able to see this before we went live with this service and correct it.”

Figure 2: Availability issues in a pre-production environment caused by a load balancer misconfiguration.

Quantifying CDN Performance

A second example involves Content Delivery Networks (CDNs). The OA team ran an “accelerated versus non-accelerated A/B test. Prior to using ThousandEyes, we typically had subjective feedback of user experience and speed. Now we were able to put quantifiable evidence together.” After the test, Darrell’s team was able to put together a “relatively clear” visualization of a 33% improvement in overall performance, combined with more stable latencies. “We are making it more data-driven, not touchy-feely.”

Figure 3: Comparison of CDN-accelerated (below) and non-accelerated (above) fetch time.
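As a rough illustration of how an A/B comparison like this can be quantified, the sketch below computes the percentage improvement in median fetch time and the spread of each sample. The talk does not describe the actual calculation, so the function name, the use of the median and the use of population standard deviation are assumptions.

```python
from statistics import median, pstdev


def compare_fetch_times(baseline_ms, accelerated_ms):
    """Summarize an accelerated vs. non-accelerated A/B test.

    Both arguments are non-empty lists of fetch times in milliseconds.
    The baseline median must be greater than zero.
    """
    base_median = median(baseline_ms)
    accel_median = median(accelerated_ms)
    return {
        # Positive value means the accelerated path is faster.
        "improvement_pct": 100.0 * (base_median - accel_median) / base_median,
        # Lower spread means more stable fetch times.
        "baseline_spread_ms": pstdev(baseline_ms),
        "accelerated_spread_ms": pstdev(accelerated_ms),
    }
```

Using the median rather than the mean keeps a handful of slow outliers from dominating the comparison, and reporting the spread alongside it captures the stability side of the result.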

Uncovering ISP Subcontracting

In a third example, Darrell walks through a 15-minute incident that happened this past August. At first, everything is performing fine. Then, the OA team is alerted to a BGP route disruption, which causes chaos in traffic destined for some of Credit Suisse’s services hosted in New Jersey. “How is this possible? We are using multiple service providers. We have two different carriers and completely discrete and unique circuits. How is it that any one thing can cause that problem?” As Darrell found out, the ISPs were both subcontracting through a common service that had a single fiber break in New Jersey. Without this data, “we would have had no idea. The typical scenario would be on the phone with providers for hours, asking questions, trying to delve into issues, getting mixed responses. Here, we were able to take a snapshot and send it to them and say, ‘Hey guys. We can see exactly what’s happening.’”

The OA team had DR (Disaster Recovery) set up, and failed over in under 10 minutes. “But what might not be obvious,” Darrell explains, “is having the evidence in hand to prove that is really important in ops. Post-mortem discussion was much easier.”

Figure 4: Simultaneous issues in two different ISPs, caused by a common subcontractor.
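The underlying idea, that two supposedly independent paths can share a hidden dependency, can be sketched as a comparison of hop lists. This is only an illustration of the concept, not how ThousandEyes performs path analysis; the `shared_hops` function and the hop names in the usage note are hypothetical.

```python
def shared_hops(path_a, path_b, ignore=()):
    """Return hops that appear in both paths, in path_a order.

    Each path is a list of hop identifiers (for example, router names
    or network names). Hops listed in `ignore`, such as the common
    destination, are excluded from the result.
    """
    skipped = set(ignore)
    hops_in_b = set(path_b) - skipped
    return [hop for hop in path_a if hop in hops_in_b]
```

For example, calling `shared_hops(["isp-a-edge", "transit-x", "dest"], ["isp-b-edge", "transit-x", "dest"], ignore=["dest"])` flags `transit-x` as a hop both carriers depend on.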

As Darrell wraps up, he highlights how OA and ThousandEyes run multiple tests at various levels from around the world. “We can see exactly what our services look like from the client’s perspective. We receive alerts on significant variations from established baselines. And we are leveraging real, quantifiable data to take the guesswork and subjectivity off the table. And that’s powerful.”
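Alerting on variations from an established baseline can be approximated with a simple deviation check. The sketch below is one common approach and is not taken from the talk; the function name and the default threshold of three standard deviations are assumptions.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, latest, threshold=3.0):
    """Return True when `latest` is far from the baseline in `history`.

    `history` needs at least two past measurements. The baseline is the
    mean of the history, and distance is measured in sample standard
    deviations.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold
```

A fixed threshold like this ignores seasonality, so a real baseline would likely need to account for time-of-day or day-of-week patterns.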

Check out the video of the full talk below and stay tuned for updates on our speakers from Shutterstock and DigitalOcean.