Monitoring the Oracle Service Cloud at Scale

Posted on June 16th, 2015
January 21st, 2016

In our third report from ThousandEyes Connect, we review Alex Barerre’s presentation on rolling out monitoring to a diverse set of cloud-based applications. Alex, a Senior Systems Administrator at Oracle, joined us from Montana, where he works on the Global Cloud Operations team to ensure that everything is running smoothly. The team is responsible for operating various cloud services, from IaaS (Compute, Storage) to PaaS (Java, Database, Messaging) and SaaS (CX, HR, ERP, SCM, EPM). When you include customer production sites that are hosted by Oracle, this adds up to hundreds of critical services to monitor. So how do Alex and his team do it?

Operations at Cloud Scale

Until recently, Oracle used an in-house monitoring system to measure the availability of its services. That system was a single-threaded Python script that checked HTTP responses. As Alex puts it, “if we got a 500, we’d go, oh there’s an alert!”
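To make that description concrete, here is a minimal sketch of what a single-threaded HTTP availability check might look like. It is not Oracle's actual script, which was not shown in the talk; the function names, the timeout, and the rule that a missing response or any 5xx status counts as an alert are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def is_alert(status):
    """Assumed rule: a missing response or any 5xx status is an alert."""
    return status is None or status >= 500


def check_sites(urls, fetch=fetch_status):
    """Check each URL in turn (single-threaded) and return the failing ones."""
    alerts = []
    for url in urls:
        status = fetch(url)
        if is_alert(status):
            alerts.append((url, status))
    return alerts
```

Because the loop visits one URL at a time, a single slow or unreachable site delays every check behind it, and the only signal collected is a status code. That is one plausible reason a script like this becomes limiting once there are hundreds of services to watch.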

The team started using ThousandEyes, gathering performance data on its various cloud services from Enterprise and Cloud Agents. Speaking about the change, Alex says, “it was great. We were able to get a lot of really useful data to help us mitigate issues and troubleshoot any problems on customer sites, down to a very granular level. ThousandEyes became the de facto monitoring service that we deploy across all cloud properties.” Alex explains that “as we increased scale, and worked to monitor additional cloud properties, it began to make sense to deploy it ourselves.”

Alex is leading the build-out of 16 clusters with 25 Enterprise Agents each, simulating traffic and monitoring performance from a set of global data centers. These clusters of agents are designed to monitor the various Oracle cloud properties. The Enterprise Agents sit on Sun servers running Oracle Enterprise Linux; they don’t talk to the production management network on the back end because they need to simulate end-user traffic. They are deployed on the edge routers, in external VLANs, so their traffic immediately exits the Oracle network. And Alex is deploying them fast; as of the morning of the talk, Alex’s timeline for deploying these clusters was reduced to 6 weeks.

The clusters that are already deployed are helping the Oracle Cloud Operations team maintain high uptime, reduce MTTR, and keep their customers happy. They are also popular with the operations teams for each of the cloud services, and several of the clusters are already oversubscribed.

More from ThousandEyes Connect

For more ThousandEyes Connect presentations, check out the talks on Internet reliability at Bloomberg and CDN performance at eBay.