How Much Data is Enough? Tips on Selecting a Test Interval

Posted on January 10th, 2018

Given the huge amount of data that traverses a network, all network monitoring boils down to choosing which samples of data to pay attention to. Active monitoring, which means sending and receiving diagnostic traffic or device info, is no different. More samples can increase fidelity, but they also cost more in network overhead and analysis. So choosing the right sample size is important.

You Are What You Measure (and Sample)

With ThousandEyes you have two big levers to control what data you sample: how many vantage points you collect data from (agent locations) and how frequently they collect data (testing interval). ThousandEyes monitors services using Tests, which let you customize these two criteria based on the service you’re monitoring. You can adjust the two in combination to get the data samples you need while staying within your monitoring budget.

  • For agent locations, we recommend at least one agent per geography and/or network from which you want a vantage point. For example, one agent per branch office. Or two agents per primary ISP to your data center.
  • For testing intervals, we recommend matching how frequently you collect health data to how critical your service (an app or a network circuit) is. Is this your primary data center? It is probably critical and worthy of high-frequency testing. A non-critical marketing site? Maybe a service degradation that goes unnoticed for 15 minutes is acceptable.

Turning up the Testing Tempo

Test intervals are a primary lever you have to control how you collect and sample performance data. The test data you collect will impact everything you do, from how much data is available via the UI and API to aggregate reporting statistics and alerts. Tests generally allow a data collection interval anywhere from 1 minute (most frequent) to 1 hour (least frequent).

Increasing the frequency at which you collect monitoring data can provide a number of benefits:

  • Catching short-lived outages
  • Triggering alerts sooner
  • Increasing data granularity

Back in December we added 1-minute testing intervals for most of our test types to shorten test intervals even further. That makes it possible to get more actionable data faster than ever before.

Catch Short Outages

First and foremost, more frequently collected data can capture events that don’t last very long. A circuit goes down for a minute? Or a server is briefly overloaded? Collecting data frequently is the best way to minimize false negatives, meaning real outages that your tests never see.

We recommend that your test interval matches your tolerance for service disruption. Want to detect 1-minute outages? Use a 1-minute test.
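As a rough illustration of why the interval should match the outage length you care about, here is a minimal sketch of the chance that a test catches an outage. It rests on simplifying assumptions that are ours, not a description of how ThousandEyes schedules tests: each test is treated as an instantaneous probe, probes are evenly spaced, and the outage starts at a random moment relative to the schedule. The function name is hypothetical.

```python
def detection_probability(outage_seconds: float, interval_seconds: float) -> float:
    """Chance that at least one probe lands inside an outage window.

    Simplified model (an assumption, not a product behavior): probes are
    instantaneous, evenly spaced, and the outage begins at a random point
    between two probes.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if outage_seconds < 0:
        raise ValueError("outage_seconds must not be negative")
    # An outage at least as long as the interval always overlaps a probe.
    return min(1.0, outage_seconds / interval_seconds)


if __name__ == "__main__":
    # Compare several test intervals against a 60-second outage.
    for interval_minutes in (1, 5, 15, 60):
        p = detection_probability(60, interval_minutes * 60)
        print(f"{interval_minutes}-minute test: {p:.0%} chance of catching the outage")
```

Under this model, a 1-minute test always sees a 60-second outage, while a 5-minute test sees it only about one time in five.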

If you’re having intermittent performance problems that come and go quickly, then increasing test frequency can help you catch them, and get a handle on those tricky issues.

Figure 1: 1-minute tests can detect short but severe outages that might be missed with longer testing intervals.

Ready, Aim, Alert

Second, more frequent testing means that your alerts will trigger faster. Since alerts are often the primary way that an operations team consumes monitoring data, getting an alert notification out a minute or two earlier can shorten time to resolution by a similar amount. More data also means that you can fine-tune alert sensitivity to avoid false positives by, for instance, requiring more than one round of problematic metrics to trigger an alert.

For highly sensitive services, we recommend treating a single round of bad data across many locations as a trip wire. For less sensitive ones, require a second round of bad data before triggering an alert. You can easily customize this in alert rules.
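The trade-off between alert speed and false positives can be sketched in a few lines. This is an illustrative model only: the function names and the "consecutive failed rounds" rule are assumptions for the example, not the ThousandEyes alert rule syntax or API.

```python
from typing import Optional, Sequence


def first_alert_round(failed_rounds: Sequence[bool], rounds_required: int) -> Optional[int]:
    """Return the 1-based round in which an alert would fire, or None.

    Hypothetical rule: alert once `rounds_required` consecutive rounds fail.
    """
    if rounds_required < 1:
        raise ValueError("rounds_required must be at least 1")
    streak = 0
    for round_number, failed in enumerate(failed_rounds, start=1):
        # A healthy round resets the streak, filtering out one-off blips.
        streak = streak + 1 if failed else 0
        if streak >= rounds_required:
            return round_number
    return None


def worst_case_minutes_to_alert(interval_minutes: int, rounds_required: int) -> int:
    """Upper bound on the delay between the start of a sustained outage and the alert.

    Worst case: the outage begins just after a round completes, so the first
    failing round arrives one full interval later. Processing and
    notification delays are ignored.
    """
    return interval_minutes * rounds_required


if __name__ == "__main__":
    blips = [False, True, False, True, True]
    print(first_alert_round(blips, rounds_required=1))
    print(first_alert_round(blips, rounds_required=2))
    print(worst_case_minutes_to_alert(interval_minutes=5, rounds_required=2))
    print(worst_case_minutes_to_alert(interval_minutes=1, rounds_required=2))
```

Requiring two rounds ignores the isolated failure in the sample data, and with a 1-minute interval the two-round rule still fires within about two minutes of a sustained outage, versus about ten minutes at a 5-minute interval.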

If you’re looking to reduce MTTD/MTTR by moving on issues ahead of when users start noticing and reporting them, more frequent measurement is a good step to take.

Increase Precision

Figure 2: Higher testing frequencies will trigger and clear alerts faster, providing more accurate SLA measurements.

Third, and perhaps most obvious, collecting data more frequently will result in more granular data for troubleshooting and reporting. This includes application metrics, network metrics and path traces. Telling a clear story of what happened or having a precise understanding of performance can be a function of how much data you collect.

We recommend that you match the test interval to your data precision needs. A 1-hour test from 10 agent locations creates 7,200 data points per metric per month. A 1-minute test creates about 432,000. Your SLAs may depend on the precision of these data points, so monitor as much as you need to.
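The arithmetic behind these figures is easy to reproduce for your own interval and agent count. The sketch below assumes a 30-day month and one data point per metric per agent per test round; the function name is made up for the example.

```python
MINUTES_PER_DAY = 24 * 60


def data_points_per_month(interval_minutes: int, agents: int, days: int = 30) -> int:
    """Data points per metric per month for a given test interval.

    Assumes a 30-day month by default and one data point per metric
    per agent per test round.
    """
    if interval_minutes < 1 or agents < 0 or days < 0:
        raise ValueError("interval must be >= 1; agents and days must not be negative")
    rounds = (MINUTES_PER_DAY * days) // interval_minutes
    return rounds * agents


if __name__ == "__main__":
    print(data_points_per_month(interval_minutes=60, agents=10))  # 7200
    print(data_points_per_month(interval_minutes=1, agents=10))   # 432000
```

Moving from a 1-hour to a 1-minute interval multiplies the data behind each report by 60.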

If you’d like to tune reports to get an even clearer picture of performance over time, then measuring more frequently is a good move.

Figure 3: More data points make reports more precise.

If you’re already a ThousandEyes user, you can use the new 1-minute test frequency today to speed up problem detection and increase data precision. If not, get started with a free trial and monitor your critical services with high-fidelity performance monitoring.
