Automating Performance Monitoring at Microsoft

Posted on April 27, 2017

Last week we took ThousandEyes Connect on the road and visited Seattle, bringing together network engineers, architects and site reliability teams to discuss network performance. Scott Hinckley, Sr. Site Reliability Engineer for Microsoft Dynamics 365, opened the morning presentations. Here are notes from his talk about how he automates web performance monitoring at scale.

Figure 1: Scott Hinckley presenting at ThousandEyes Connect Seattle.

Monitoring Dynamics 365 Web Performance

Microsoft Dynamics 365 is a suite of CRM and ERP products available both on-premises and in the cloud. The engineering team is responsible for, among other things, ensuring that the cloud-based online services are running smoothly and that customers can connect to services within expected performance levels.

Scott started off by describing a familiar experience: “At Microsoft we have a lot of tools that help us with analytics and reliability.” Yet, like many organizations, the Dynamics team was still looking for an understanding of user experience and connectivity. As Scott explained it, “ThousandEyes gives us a true, outside-in vision from the customer point of view.”

With that outside-in vision and the shift to more cloud-based deployment models at Microsoft, “the use of ThousandEyes was growing extremely fast.” So fast, in fact, that the team was running thousands of tests against various Dynamics 365 services and customer instances. Scott explained that while “ThousandEyes has a really nice GUI,” managing tests through a GUI at this scale simply took too much of his time. In addition, expanded use of ThousandEyes by the Dynamics team meant that Tests frequently needed to be edited. This spurred Scott to try out automation.

Driving Efficiency through Automation

Rather than “stopping a galloping camel with brute force,” as Scott had done in previous (real-life) adventures, he turned to automation to handle the collection, analysis and management of Dynamics 365 outside-in performance data. Scott has “always had a passion for automation,” dating back to his “first job tackling a million-plus lines of undocumented code,” and solving problems at scale has been his specialty ever since.

In the case of performance monitoring, Scott used the ThousandEyes v6 API to accomplish his automation challenge. As Scott describes it, the ThousandEyes developer API is “RESTful, supports either XML or JSON, secured with HTTPS and token-based authentication, well documented, and (nearly) comprehensive. Most anything you can do through the GUI, you can do through automation.”
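As a rough illustration of what a token-authenticated call to a RESTful JSON endpoint looks like, here is a minimal Python sketch (not Scott's Powershell script). The base URL, the `.json` suffix, the `aid` Account Group parameter, HTTP Basic authentication with an email and token, and the example credentials are all assumptions made for illustration; check the API documentation before relying on them. The request is built but not sent.

```python
import base64
import urllib.request

API_BASE = "https://api.thousandeyes.com/v6"  # assumed base URL for the v6 API


def build_request(endpoint, email, token, account_group_id=None):
    """Return an unsent GET request for a JSON endpoint such as "tests"."""
    url = f"{API_BASE}/{endpoint}.json"
    if account_group_id is not None:
        # "aid" is assumed to be the query parameter that selects an Account Group.
        url += f"?aid={account_group_id}"
    # Assumed scheme: HTTP Basic auth with the user's email and API token.
    credentials = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {credentials}")
    request.add_header("Accept", "application/json")
    return request


# Hypothetical automation user and token, for illustration only.
request = build_request("tests", "automation@example.com", "example-token",
                        account_group_id=1234)
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send the call over HTTPS.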

Making Automation Microsoft-Friendly with Powershell

Scott also joked that at Microsoft, “we’re a Windows shop, surprisingly.” Most of the ThousandEyes API examples use cURL, which was not natively available in Windows, so he turned to Powershell for Windows-based automation.

Powershell, as Scott explains, “is an extremely powerful automation tool that’s not just a command line. But it can still be quite simple, just like cURL.”

Key Considerations When Working with the ThousandEyes API

According to Scott, after his extensive automation project, “making API calls is pretty straightforward. But you realize there is a lot more work. I started with a loop command; I need 100 tests that are all the same. But at some point you need to get much more interactive.” That requires interfacing with databases and following different paths between different tests.
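The "loop command" starting point might look something like the following Python sketch, which builds 100 near-identical test definitions without sending them anywhere. This is not taken from Scott's script: the payload field names (`testName`, `url`, `interval`, `agents`, `agentId`), the instance names and URLs, and the Agent IDs are illustrative assumptions.

```python
import json


def build_test_payloads(instances, agent_ids, interval_seconds=300):
    """Build one test definition per (name, url) pair, identical otherwise."""
    payloads = []
    for name, url in instances:
        payloads.append({
            # Field names are assumed for illustration; verify against the API docs.
            "testName": f"Dynamics 365 - {name}",
            "url": url,
            "interval": interval_seconds,
            "agents": [{"agentId": agent_id} for agent_id in agent_ids],
        })
    return payloads


# Hypothetical customer instances and Agent IDs.
instances = [(f"instance-{n:03d}", f"https://instance-{n:03d}.example.com")
             for n in range(1, 101)]
payloads = build_test_payloads(instances, agent_ids=[101, 102])

print(len(payloads))
print(json.dumps(payloads[0], indent=2))
```

Each payload would then be posted to the test-creation endpoint; the more "interactive" work Scott describes starts when the list of instances comes from a database rather than a fixed range.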

Scott excitedly explained that “lucky for you, I’ve done some of the work you’d need to do this yourself. I’ve put it together into an uber-script with a simple text menu.” You can download his ThousandEyes_via_Powershell script here.

To get started, Scott recommends that “you create a user just for automation. Then you can assign that user whatever permissions you need to all of the Account Groups.” Next, you’ll want to get familiar with the following automation elements.

Scott built Powershell functions for many key automation elements:

  • Authorization: A token per user, scoped per Account Group
  • Iterating: Through Account Groups, Users, Alert Rules, Agents, Tests
  • Error handling: Gracefully manage API responses
  • Parsing: Capture return results, handle double-byte characters
  • Creating tests: Especially Web Transaction tests with variables
  • Managing usage: Alert on changes to projected monthly spend
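The error handling and parsing elements above can be sketched in Python as follows. This is an illustration of the idea rather than Scott's Powershell functions: the `ApiError` class and `parse_response` helper are hypothetical names, treating HTTP 429 as a rate-limit signal is an assumption, and the response shape in the example is made up.

```python
import json


class ApiError(Exception):
    """Raised when the API returns a non-success status (hypothetical helper)."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status


def parse_response(status, body):
    """Decode a raw response body as UTF-8 JSON, raising ApiError on failure."""
    # Decoding explicitly as UTF-8 keeps multi-byte characters intact.
    text = body.decode("utf-8")
    if status == 429:
        # Assumed: 429 means the rate limit was hit and the call can be retried.
        raise ApiError(status, "rate limit reached; retry later")
    if not 200 <= status < 300:
        raise ApiError(status, text or "no response body")
    return json.loads(text) if text else {}


# Made-up response containing a test name with multi-byte characters.
body = '{"test": [{"testName": "顧客ポータル"}]}'.encode("utf-8")
result = parse_response(200, body)
print(result["test"][0]["testName"])
```

Callers can catch `ApiError` and inspect `status` to decide whether to retry, skip the item, or stop the run.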

From there you can also consider:

  • Storing data: Connect performance data directly into databases
  • Alert generation: Feed data into alerting and ticketing systems
  • Auto-remediation: Tie together alerts and detailed Path Visualization data

Integrating with Site Reliability Processes

When you get to monitoring at scale, you also start to pay attention to usage and traffic loads. Scott demonstrated one of the uses of automation when “at one point there was concern that ThousandEyes might itself be generating a load that was causing noticeable impact on one of our customer accounts.” Scott used an API function for Agent IP addresses to pull all of the IP addresses for the Agents that were being used, and the networking team was able to rule out ThousandEyes as a significant contributor to the load on the customer instance.
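Pulling Agent IP addresses out of an API response could be done along these lines. The Python sketch below works on an already-parsed response; the `agents`, `agentId` and `ipAddresses` field names, the Agent names and the addresses (taken from documentation-only ranges) are assumptions for illustration, not output from the real API or from Scott's script.

```python
def collect_agent_ips(agents_response, agent_ids_in_use):
    """Return the sorted, de-duplicated IP addresses of the Agents in use."""
    ips = set()
    for agent in agents_response.get("agents", []):
        if agent.get("agentId") in agent_ids_in_use:
            ips.update(agent.get("ipAddresses", []))
    return sorted(ips)


# Made-up response; field names are assumed, addresses are documentation ranges.
agents_response = {
    "agents": [
        {"agentId": 101, "agentName": "Seattle", "ipAddresses": ["192.0.2.10", "192.0.2.11"]},
        {"agentId": 102, "agentName": "London", "ipAddresses": ["198.51.100.7"]},
        {"agentId": 103, "agentName": "Tokyo", "ipAddresses": ["203.0.113.5"]},
    ]
}

print(collect_agent_ips(agents_response, agent_ids_in_use={101, 102}))
```

The resulting list is the kind of thing a networking team could match against traffic logs for the affected customer instance.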

Automation also allows for chaining together business and operational processes. To manage Test creation, enabling and disabling (i.e., turning monitoring on and off), he tied the ThousandEyes API into “a process running on one of our ‘tools’ servers in our back-end environment. This process queries the deployment databases to find out if there are any new services, removed services or other changes. It pulls a list of our current Tests and edits those Tests as needed.”
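The reconciliation step Scott describes, comparing deployed services against current Tests, can be sketched as a pure function. This Python example is an assumption about how such a comparison might be structured, not a description of Microsoft's actual process; the service names and the `plan_test_changes` helper are hypothetical.

```python
def plan_test_changes(deployed_services, current_tests):
    """Compare deployed services with existing tests and plan the changes.

    deployed_services: iterable of service names from the deployment database.
    current_tests: dict mapping test/service name to True if enabled.
    """
    deployed = set(deployed_services)
    existing = set(current_tests)
    return {
        # New services with no test yet.
        "create": sorted(deployed - existing),
        # Services still deployed whose test is currently disabled.
        "enable": sorted(n for n in deployed & existing if not current_tests[n]),
        # Tests still enabled for services that are no longer deployed.
        "disable": sorted(n for n in existing - deployed if current_tests[n]),
    }


# Hypothetical data standing in for a database query and an API listing.
deployed = {"crm-east", "crm-west", "erp-north"}
current_tests = {"crm-east": True, "crm-west": False, "erp-retired": True}

plan = plan_test_changes(deployed, current_tests)
print(plan)
```

Keeping the comparison separate from the API calls makes it easy to review or log the plan before any Test is actually created, enabled or disabled.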

In addition to toggling monitoring on and off, Scott described how Microsoft has “fully integrated ThousandEyes with our ticketing system, so if we find a problem we can alert our help desk.” Looking forward, “the next step is looking at trending. For example, if we see our latencies or transaction times changing significantly we will pull our alert data and performance history.”
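One simple way to flag latencies or transaction times "changing significantly" is to compare a recent window against an earlier baseline. The Python sketch below is only one possible approach and is not something described in the talk: the window size, the 25% threshold, the `significant_change` helper and the sample values are all assumptions.

```python
from statistics import mean


def significant_change(history, recent_count=3, threshold=0.25):
    """Return True if the recent average deviates from the baseline average
    by more than `threshold` (as a fraction of the baseline)."""
    if len(history) <= recent_count:
        return False  # not enough data to form a baseline
    baseline = mean(history[:-recent_count])
    recent = mean(history[-recent_count:])
    if baseline == 0:
        return recent != 0
    return abs(recent - baseline) / baseline > threshold


# Made-up transaction times in milliseconds.
degraded = [120, 118, 122, 121, 119, 180, 175, 190]
steady = [120, 118, 122, 121, 119, 120, 121, 119]

print(significant_change(degraded))
print(significant_change(steady))
```

A positive result would be the trigger for pulling alert data and performance history for closer inspection.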

As Scott sums it up, the Dynamics 365 team wants to “react to a problem before the customer sees a problem. Or at least concurrently.” Automation, Powershell and the ThousandEyes API are making that possible.
