Monitoring Global Data Center Reachability at ServiceNow

Posted on January 10, 2017

In this post from ThousandEyes Connect San Francisco, we’ll discuss the presentation by Geoff Wade, Senior Network Engineer at ServiceNow, on monitoring the performance of their globally distributed data centers. Geoff is responsible for network design, deployment and operations at ServiceNow, and his talk covers the importance of monitoring a cloud-based service delivery infrastructure.

Figure 1: Geoff Wade presenting at ThousandEyes Connect.

ServiceNow provides a cloud-based, as-a-service platform for enterprise IT operations and business management software. Geoff kicks off the session with an overview of “The Enterprise Cloud Company” and the intricate, fault-tolerant data center architecture designed to simultaneously handle thousands of workflow activities, billions of database queries and hundreds of terabytes of data. ServiceNow data centers are physically mirrored at each location and designed for zero downtime in the event of equipment failure. Geoff says, “We are a cloud provider, which means our customers reach us in the cloud. And that means we must have full redundancy.” While the architecture is resilient to downtime, that doesn’t mean everything will automatically run smoothly. He stresses the need for a monitoring solution so that the team can isolate and fix issues before customers are affected. As Geoff puts it, “For most of our customers, we are the Internet, so we can’t get away with saying ‘the internet is broken, try again later!’”

Global Footprint Mandates a Global View

With globally located data centers and an expansive customer base, understanding if and why some customers cannot access ServiceNow is crucial. As Geoff succinctly puts it, “When you are worldwide, you need a world view.” A monitoring solution that could provide an outside-in view of the network from global vantage points was a key requirement when evaluating third-party monitoring solutions. That requirement, combined with the flexibility and variety of the tests and the unified alerting dashboard, led ServiceNow to pick ThousandEyes Cloud Agents as their monitoring solution four years ago.

One Platform to Find Them All

Geoff then moves on to highlight how, over the past four years, ThousandEyes has been leveraged by multiple teams within ServiceNow. The site reliability engineering (SRE), network engineering and systems engineering teams run ICMP and TCP reachability tests or HTTP page load tests to monitor different parts of the data center. Each of these teams also monitors alerting trends and patterns to distinguish between site-level, ISP and application controller issues. Email alerts are used in addition to GUI-based alerting, both as a backup and for trend analysis. Geoff calls attention to an email alerting trend that allowed ServiceNow to identify recurring congestion and packet loss in a particular ISP network at the exact same time every day.

Figure 2: Email alerting revealed a congestion trend within an ISP’s network.
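
The kind of trend described above, alerts recurring at the same time every day, can be surfaced by grouping alert timestamps by hour of day and counting how many distinct days each hour appears on. The following is only a minimal sketch of that idea, not ServiceNow's or ThousandEyes' actual tooling: the function name `recurring_alert_hours`, the `min_days` threshold and the sample timestamps are all hypothetical, and it assumes the alert emails have already been parsed into `datetime` values.

```python
from collections import defaultdict
from datetime import datetime


def recurring_alert_hours(alert_times, min_days=3):
    """Map each hour of day to the number of distinct days it alerted on,
    keeping only hours seen on at least `min_days` distinct days."""
    days_by_hour = defaultdict(set)
    for ts in alert_times:
        days_by_hour[ts.hour].add(ts.date())
    return {
        hour: len(days)
        for hour, days in sorted(days_by_hour.items())
        if len(days) >= min_days
    }


# Hypothetical alert timestamps, as might be extracted from alert emails.
alerts = [
    datetime(2017, 1, 2, 14, 5),
    datetime(2017, 1, 3, 14, 2),
    datetime(2017, 1, 3, 22, 40),
    datetime(2017, 1, 4, 14, 7),
    datetime(2017, 1, 5, 14, 1),
    datetime(2017, 1, 5, 3, 15),
]

print(recurring_alert_hours(alerts))  # {14: 4}
```

Here the 14:00 hour alerted on four separate days, while the other two alerts were one-offs, so only the 14:00 hour is reported as a candidate for a daily congestion pattern.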

BGP Love

Apart from monitoring applications and services, ServiceNow regularly uses the BGP Route Visualization feature within ThousandEyes. Quoting Geoff, “Network engineers love BGP prefix testing and route visualization.” It helps the team quickly visualize a problem between a part of the Internet and ServiceNow, which supports a faster, better-informed response to customers. The ability to see more than one AS hop from the originating prefix helps the team deduce issues in ISPs that are further upstream but still affect service availability.




Looking Ahead

Going forward, ServiceNow is evaluating Internet Outage Detection and using detailed metrics to benchmark and validate ISP performance. Interested in learning more about how our customers monitor their Internet-centric environments? Read the previous post from Connect on how Quantcast monitors its high-availability, low-latency pixel-serving architecture.
