As members of the Professional Services team at ThousandEyes, we often work with our customers to make ThousandEyes test data part of their operational workflow. With the ongoing COVID-19 pandemic, we have seen a sharp rise in the number of remote workers, which has placed added stress on IT help desks that need to resolve end user issues for critical applications and infrastructure. Often, the issue may even be in the remote employee’s local area network. That is why it’s important to localize issues appropriately and quickly. To help in this regard, we wanted to share how you can use dashboards to localize end user issues.
In a previous blog post, we shared how ThousandEyes is extending our expertise in order to aid organizations that are affected by the novel coronavirus. Specifically, we are offering free usage of our user experience agents until July 31, 2020, for identifying network and app performance issues for remote workers connecting to critical apps and services. These Endpoint Agents provide visibility, from a monitoring perspective, into the last mile of an end user’s network.
In this blog post, we will go through a general best-practice approach to help you localize end user issues in a quick-and-easy way. This is something that has been tried and tested with successful outcomes across our customer deployments as well as through our Professional Services engagements.
Localizing End User Issues
Our Endpoint Agent provides data through three main streams:
- Scheduled Tests (tests running from a group of agents per configured frequency)
- Browser Session Data (data based on actual end user traffic for a configured domain/perimeter)
- Local Area Network Data (statistics pertaining to the LAN, provided every 5 minutes)
Visualizing this volume of data in a correlated fashion, in a “single pane of glass” if you will, makes troubleshooting much easier for an IT help desk. So that’s what we’ll walk you through now. Of course, our underlying assumption is that you have the appropriate scheduled tests configured, up and running from your Endpoint Agents.
The Basic Principles
Creating a report or dashboard using Endpoint Agent data is slightly different than creating ones for our Cloud or Enterprise Agent test data. This actually has nothing to do with the configuration, but more to do with how the data is grouped and interpreted.
Rendering data for all of your agents on a single dashboard may not necessarily generate data that is ready for consumption. So it is important to report on data coming from sources based on some of the following grouping examples:
- Agents that are in the same geographic location
- Agents that are running a similar group of tests
- Agents that are in the same private network block
You can group efficiently by using the Endpoint Agent labels.
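To make the grouping idea concrete, here is a minimal Python sketch. It is not ThousandEyes code: it simply buckets a hypothetical list of agent records by label, so that each bucket could feed its own dashboard. The `Agent` record, its field names, and the sample labels are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Agent:
    # Hypothetical record; field names are illustrative, not a ThousandEyes schema.
    name: str
    labels: list = field(default_factory=list)


def group_by_label(agents):
    """Map each label to the names of the agents that carry it."""
    groups = defaultdict(list)
    for agent in agents:
        for label in agent.labels:
            groups[label].append(agent.name)
    return dict(groups)


agents = [
    Agent("laptop-01", ["London", "VPN users"]),
    Agent("laptop-02", ["London"]),
    Agent("laptop-03", ["Austin", "VPN users"]),
]
groups = group_by_label(agents)
print(groups["London"])
```

An agent can carry more than one label, so the same agent may appear in a location-based group and a test-based group at the same time.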
Limiting Data Sources and Sorting
If you have a large number of agents matching your label criteria, it is important to ensure that you only render the agents whose metrics are within the problem range. This is achieved through the configuration of the dashboard widgets, which we will visit shortly. This, in turn, helps to display data for the most problematic agents first, and potentially leaves out the end users who are most likely not having issues.
The Final Result
As you can see, the dashboard above is composed of multiple widgets. Let’s examine one of them now.
The configuration options to note here are the following:
- Time Span (5 minutes): This metric is collected every 5 minutes, so with this setting, we will always be looking at the latest round of measurements.
- Sort Cards By (Value – Descending): Since high Gateway Loss is an indication of an issue, we are sorting the rendered values in descending order to display the Agents with the highest Gateway Loss first.
- Limit To (18): To keep the dashboard brief, we are rendering only the top 18 agents here (based on the descending sort order). So, at any time, we will be looking at the 18 users with the highest Gateway Loss. Note that this setting can be changed in an ad hoc fashion if you are looking to see more than 18 agents.
- Endpoint Agents: We are filtering on a specific set of agents via agent labels per location.
These settings are mirrored for other widgets, and if a low value of a certain metric indicates a problem, the values are arranged in ascending order.
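Taken together, these widget settings amount to a filter, a sort, and a cut-off. The Python sketch below mimics that logic on hypothetical in-memory samples. It is an illustration under stated assumptions, not how the dashboard is implemented: the `Sample` record, the `top_cards` function, and its parameters are invented names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    # Hypothetical measurement record, not a ThousandEyes data format.
    agent: str
    label: str
    value: float          # e.g. Gateway Loss in percent
    taken_at: datetime


def top_cards(samples, label, now, span=timedelta(minutes=5),
              limit=18, descending=True):
    """Keep samples for one label within the time span, sort, then cut off."""
    recent = [s for s in samples
              if s.label == label and now - span <= s.taken_at <= now]
    recent.sort(key=lambda s: s.value, reverse=descending)
    return recent[:limit]


now = datetime(2020, 4, 1, 12, 0)
samples = [
    Sample("laptop-01", "London", 12.0, now - timedelta(minutes=2)),
    Sample("laptop-02", "London", 40.0, now - timedelta(minutes=1)),
    Sample("laptop-03", "Austin", 90.0, now - timedelta(minutes=1)),   # other label
    Sample("laptop-04", "London", 75.0, now - timedelta(minutes=30)),  # outside time span
]
worst = [s.agent for s in top_cards(samples, "London", now, limit=2)]
print(worst)
```

For a metric where a low value signals trouble, such as Signal Quality, passing `descending=False` puts the weakest readings first.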
Based on the dashboard and its configuration, we are correlating Low Service Availability with High Gateway Loss, Poor Signal Quality, High Packet Loss and Latency.
The image below illustrates an example of this correlation at work.
As you can see, we have a user in question for whom we notice Gateway Loss as well as poor Signal Quality. We also notice that the same user has end-to-end Packet Loss for the service in addition to high Latency. While we do not see the user having issues with Service Availability, it is safe to assume that this user may have had a significant degradation in end user experience while accessing this service. In conclusion, we can say that the potential end user issues were most likely caused by poor Signal Quality and Packet Loss in the Local Area Network.
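The reasoning in this example can be sketched as a small rule check. In the Python below, the thresholds, metric names, and sample values are assumptions chosen for illustration, not ThousandEyes defaults or readings from the dashboard, and the rule is a simplification of what an engineer would do when reading the widgets side by side.

```python
# Hypothetical thresholds; tune them to your own environment.
THRESHOLDS = {
    "gateway_loss_pct": ("high", 1.0),
    "signal_quality_pct": ("low", 60.0),
    "packet_loss_pct": ("high", 1.0),
    "latency_ms": ("high", 150.0),
}
# Metrics that describe the user's local area network.
LAN_METRICS = {"gateway_loss_pct", "signal_quality_pct"}


def flagged_metrics(reading):
    """Return the metrics in the reading that fall in the problem range."""
    flagged = set()
    for metric, (bad_direction, limit) in THRESHOLDS.items():
        value = reading.get(metric)
        if value is None:
            continue
        too_high = bad_direction == "high" and value > limit
        too_low = bad_direction == "low" and value < limit
        if too_high or too_low:
            flagged.add(metric)
    return flagged


def localize(reading):
    """Suggest where to look first, based on which metrics are flagged."""
    flagged = flagged_metrics(reading)
    if not flagged:
        return "no issue flagged"
    if flagged & LAN_METRICS:
        return "check the local area network first"
    return "LAN looks healthy; look beyond the local network"


user = {"gateway_loss_pct": 8.0, "signal_quality_pct": 35.0,
        "packet_loss_pct": 6.0, "latency_ms": 240.0}
print(localize(user))
```

The last branch matters as much as the first: when only the end-to-end metrics are flagged, the sketch rules the local network out rather than pointing at a root cause.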
Using the right metrics, in the right form, arranged together can really make troubleshooting easy. It is not always about pinpointing exactly where the issue lies; being able to rule something out as the cause of an issue also makes a huge difference.