Troubleshooting Cloud Services at Cisco

Posted on March 25th, 2016

We just wrapped up our second ThousandEyes Connect San Francisco, filled with intelligent conversations and excellent stories, including a talk by Andrea Di Lecce from Cisco. We were excited to invite Andrea to share how Cisco IT uses ThousandEyes. Andrea is responsible for managing large IT projects within Cisco and, as part of her role, manages the implementation and deployment of ThousandEyes within Cisco. In her talk, she gives insights into how her team uses ThousandEyes to monitor business-critical cloud and internal services, like Salesforce and Webex. She also shares lessons learned and a few success stories that underscore the need for an intelligent, proactive network intelligence platform.

Figure 1: Andrea Di Lecce presenting at ThousandEyes Connect San Francisco.

Having spent years in IT, from testing networks to now managing them, Andrea recognizes the changing landscape of the industry, a change brought about by more applications and services moving to the cloud and by businesses increasingly relying on it. However, she acknowledges that the “cloud” remains a mystery. She says, “outside the Cisco network, the cloud is basically a black box. Once the packets go outside our network we don’t know what happens and it’s beyond our control.” The cloud, with its influx of third-party service providers and varied network control domains, needs a different approach to managing and troubleshooting. And this is where she shakes up the status quo.

Move over, MTTR

If you are reading this blog, chances are that you are familiar with Mean Time To Repair/Restore. Commonly referred to as MTTR, it’s the average time taken to restore service after a fault: the time from the start of an incident to when it is resolved and service is back to normal. A critical measurement for SLAs, this metric can be used to quantify fault tolerance and the efficiency of your service or network. Every IT organization perpetually tries to optimize MTTR in an effort to provide uninterrupted, reliable service to its customers.

Andrea stresses that with the evolution of cloud networks, MTTR cannot be the only metric employed. Triaging a network issue and identifying the root cause is not the same as resolving the issue. Each network domain may be controlled by different vendors, and resolution time can be completely independent of triage or troubleshooting time. For example, knowing that your highly rated service provider has a cable cut is not enough to fix the issue yourself. Hence the time taken to resolve the issue is highly dependent on the time that your ISP takes to actually fix the issue.

Mean Time to Troubleshoot (MTTT)

Andrea redefines network monitoring metrics by introducing a new metric: Mean Time to Troubleshoot (MTTT). MTTT is the average time taken from the start of an incident to when the root cause or the source of the issue has been identified. In her words, “It’s the time period from when the issue starts to when it’s handed over to the external service provider or vendor.”

Figure 2: MTTT and MTTR Flow Chart.
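
To make the distinction concrete, here is a minimal Python sketch of how the two metrics could be computed from incident records. The `Incident` fields, the helper names and the timestamps are all hypothetical, invented for illustration; they are not taken from Andrea's talk or from any ThousandEyes data format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout, assumed for this sketch.
    started: datetime     # when the issue began
    handed_off: datetime  # root cause identified and handed to the vendor
    resolved: datetime    # service restored to normal


def mean_duration(durations):
    """Average a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mttt(incidents):
    """Mean Time to Troubleshoot: incident start to handoff."""
    return mean_duration([i.handed_off - i.started for i in incidents])


def mttr(incidents):
    """Mean Time to Repair: incident start to resolution."""
    return mean_duration([i.resolved - i.started for i in incidents])


# Made-up incidents: one with a long vendor fix, one with a short one.
incidents = [
    Incident(datetime(2016, 1, 4, 9, 0),
             datetime(2016, 1, 4, 10, 0),
             datetime(2016, 1, 4, 20, 0)),
    Incident(datetime(2016, 2, 1, 14, 0),
             datetime(2016, 2, 1, 14, 30),
             datetime(2016, 2, 1, 16, 30)),
]

print("MTTT:", mttt(incidents))  # MTTT: 0:45:00
print("MTTR:", mttr(incidents))  # MTTR: 6:45:00
```

In this made-up data set, the time between handoff and resolution dominates MTTR, so faster troubleshooting shows up much more clearly in MTTT than in MTTR.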

Andrea quotes some intriguing numbers during the discussion. With ThousandEyes, Cisco has been able to reduce MTTT by 43%, while the MTTR reduction has been a smaller, but still significant, 8%. Why the drastic difference, she asks the audience rhetorically. It was not because ThousandEyes didn’t help quickly pinpoint the issue, but rather because in most cases, resolution of the issue was beyond her team’s control. To illustrate, she describes a real-world example: when Cisco experienced issues with Salesforce logins and services, ThousandEyes automatically detected the root cause, which was a saturated Level 3 ISP link within the Salesforce network. The issue was identified and handed over to Salesforce within an hour, but it took an extra ten hours for Level 3 and Salesforce to resolve the issue.

Her perspective on this topic is interesting as it brings to the forefront the question: while Operations and IT teams are racking their brains for ways to optimize a single metric, MTTR, should we take a step back and also consider ways to optimize MTTT? Measuring metrics that you can actually control may garner more benefits and innovative solutions.

Reducing MTTT with ThousandEyes

In the second half of her talk, Andrea delves deeper into real-world examples of how her team has successfully reduced MTTT, both by deploying ThousandEyes Enterprise Agents internally and by leveraging Cloud Agents outside Cisco’s network.

Andrea led the build-out of 25 ThousandEyes Enterprise Agents deployed at strategic vantage points within Cisco, including call centers, high-priority sales sites and Internet PoPs, from where they monitor both internal and external-facing applications like Salesforce, Webex and TAC tool services served via Akamai. Cisco also leverages ThousandEyes Cloud Agents to get an outside-in view of the network, monitoring Salesforce and Webex and tracking BGP reachability.

As Andrea recounted, in what turned out to be a serious event affecting all of Cisco’s India sites, ThousandEyes came to their rescue by triangulating the bottleneck almost immediately. ThousandEyes agents installed in multiple sites in India detected packet loss on the corporate gateway, as shown in Figure 3 below.

Figure 3: The path visualization was able to triangulate the bottleneck affecting Cisco’s India sites almost immediately.

In less than an hour, Cisco was able to establish the root cause as an overloaded corporate gateway running at 100% CPU. Andrea said, “The issue was automatically detected by ThousandEyes, which also pinpointed that the packet loss was occurring on a specific device within the Cisco network. This allowed engineers to quickly address the problem.” Andrea joked that although it wasn’t an ideal situation, with senior management visiting India at the same time the network went down, it was a validation of the value of ThousandEyes.

What Does the Future Hold?

Andrea mentioned that while Enterprise Agents are currently deployed on Mac mini devices, she and her team are validating a new deployment model in their labs where Enterprise Agents are colocated with Cisco ISR WAN routers. Stay tuned for more on these developments. Cisco also plans to integrate the ThousandEyes API with its in-house network operations alerting system for better tracking and management.

For more details from Andrea’s presentation, watch the video of the entire talk below.

For more ThousandEyes Connect posts, check out Zendesk’s talk on performance in China and stay tuned for RichRelevance’s talk on using data to build trust with customers.
