IP Telephony, or VoIP, is built upon two foundational blocks: the telephone and the Internet. It has taken over 40 years for the telecommunication and computing industries to mature to the point where VoIP service is practical. In recent times, VoIP has been brought to the forefront of communications by the freedom and flexibility of the Internet. While VoIP is cost effective, maintaining service assurance and troubleshooting a VoIP service is extremely challenging. In this blog post we will analyze how ThousandEyes can help you troubleshoot those transitory yet troublesome voice quality issues within your global enterprise network. We will specifically examine the impact of the network on VoIP RTP streams across a corporate WAN.
Why is VoIP Challenging?
In my role as Solutions Engineer, I talk to a lot of organisations about monitoring real-time applications like VoIP and video. Applications like VoIP have a low tolerance for network inconsistencies, so service degradation is immediately noticeable. In recent times, I’ve seen an increased reliance on transient, third-party infrastructure like the Internet to deliver these services. This has led to a common problem among our customers: ensuring application performance over networks they do not own, manage or maintain. This raises the question: how do you actually know the underlying cause of degraded call quality when you cannot see the infrastructure end-to-end?
How Voice Tests Work
ThousandEyes Voice Tests understand call flows within wide area networks that incorporate both internal and external (ISP) infrastructure. By replicating RTP flows with active tests that probe the network at regular intervals, we build a continual baseline of performance. This translates into a meaningful dataset and paints a picture of how the network impacts VoIP call quality at any point in time. Irrespective of how VoIP is deployed and consumed, you can proactively plan, assure and improve service delivery.
An enterprise environment has many call flows operational at any given time. Voice Tests run synthetic RTP streams between several source agents and one common target agent. Each source agent sends a predefined number of UDP packets carrying encapsulated RTP streams to the target and measures the Mean Opinion Score (MOS). MOS indicates call quality as a function of network performance and is derived from loss, discards, latency and packet delay variation (jitter). To ensure simulated VoIP traffic is treated the same way as production VoIP traffic, packets are marked with the appropriate DSCP (Differentiated Services Code Point) values in the IP header. DSCP values ensure preferential treatment of packets: because VoIP traffic is extremely sensitive to packet latency and delay, VoIP packets are prioritized with the appropriate DSCP markings to reduce the impact of delay. Voice Tests also specify which codecs to use, so that the simulated traffic can match the characteristics of real-world VoIP traffic.
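To build intuition for how those network metrics fold into a single MOS value, here is a simplified E-model calculation (based on the ITU-T G.107 R-factor) of the kind commonly used in monitoring tools. The thresholds and weights below are widely cited rules of thumb, not the exact formula ThousandEyes uses:

```python
def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Rough MOS estimate from network metrics via a simplified E-model.

    Illustrative only: constants are common rules of thumb, not the
    exact computation of any particular monitoring product.
    """
    r = 93.2  # baseline R-factor for an unimpaired G.711 call

    # Delay impairment: jitter is weighted double, plus ~10 ms codec delay.
    effective_latency = latency_ms + 2 * jitter_ms + 10
    if effective_latency < 160:
        r -= effective_latency / 40
    else:
        r -= (effective_latency - 120) / 10

    # Loss impairment: each percent of loss costs roughly 2.5 R-factor points.
    r -= loss_pct * 2.5

    # Map the R-factor onto the 1.0-4.5 MOS scale (ITU-T G.107 conversion).
    if r < 0:
        return 1.0
    return min(4.5, 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r))
```

A clean path (20 ms latency, negligible jitter, no loss) scores around 4.4, while 150 ms of latency with 30 ms jitter and 5% loss drags the estimate well below 4, which is why even brief network inconsistencies are immediately audible.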
The most commonly used codec in VoIP environments is G.711, as it incurs minimal transcoding overhead when breaking out to the Public Switched Telephone Network (PSTN). You might also notice G.729 at times, a low-bandwidth codec commonly used across the WAN where bandwidth is constrained. ThousandEyes also supports some of the more recently developed codecs, including SILK and G.722.
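To see why G.729 is favoured on constrained WAN links, it helps to work out the per-call IP bandwidth of each codec. The sketch below assumes the standard 20 ms packetization interval and a 40-byte IP/UDP/RTP header stack, ignoring Layer 2 framing:

```python
def rtp_bandwidth_kbps(codec_rate_kbps: float,
                       packetization_ms: int = 20,
                       header_bytes: int = 40) -> float:
    """Per-call IP bandwidth for an RTP stream.

    header_bytes covers IP (20) + UDP (8) + RTP (12); Layer 2 framing
    (Ethernet, MPLS, etc.) would add further overhead on top.
    """
    payload_bytes = codec_rate_kbps * 1000 / 8 * (packetization_ms / 1000)
    packets_per_second = 1000 / packetization_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000

print(rtp_bandwidth_kbps(64))  # G.711: 64 kbps payload -> 80.0 kbps on the wire
print(rtp_bandwidth_kbps(8))   # G.729:  8 kbps payload -> 24.0 kbps on the wire
```

The headers are a fixed cost per packet, so the lower the codec bitrate, the larger the relative overhead; even so, a G.729 call consumes less than a third of the IP bandwidth of a G.711 call.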
Getting The Environment Ready
Configuring voice tests in ThousandEyes is extremely simple. Once you have your tests in place, you can start gathering VoIP metrics very quickly. Figure 1 shows the configuration view of voice tests that allow you to select both source and target agents, along with the frequency of the tests.
The advanced settings help accurately replicate VoIP traffic by specifying a number of options, including the network port (although NAT traversal techniques can obviate the need to open ports), the codec for the RTP stream and DSCP values.
For VoIP, the DSCP value over the WAN is usually EF (Expedited Forwarding), which ensures the traffic is forwarded through the network the same way as real voice data. IP packets with the prioritized marking traverse a node’s Low Latency Queue (LLQ) when interface or device-level congestion is encountered.
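As a concrete illustration of what "marking traffic EF" means at the packet level, a probe can set the DSCP value on its own UDP socket. This minimal sketch works on Linux and macOS (Windows generally ignores `IP_TOS` set this way); the target address and payload are hypothetical placeholders:

```python
import socket

DSCP_EF = 46            # Expedited Forwarding codepoint (binary 101110)
TOS_EF = DSCP_EF << 2   # DSCP occupies the top 6 bits of the TOS byte -> 0xB8

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)

# Every datagram sent on this socket now carries DSCP 46 (EF) in its
# IP header, so routers configured with an LLQ policy will prioritize it.
# sock.sendto(rtp_payload, (target_ip, 49152))  # hypothetical target/payload
```

Note the two-bit shift: the DSCP field is the upper six bits of the former TOS byte, with the lower two bits reserved for ECN, which is why EF (46) appears on the wire as 0xB8.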
In the next section of the blog post, we will look at a real-world example of voice quality degradation and take you through a step-by-step post mortem of what happened.
Hello, Can You Hear Me?
On the morning of June 16th, one of our customers experienced a dip in the average MOS of voice calls going over their MPLS WAN between the US and Europe into APAC. The target node in China is represented in blue in Figure 3 below. Two of the sites, one in Europe (BEL002) and the other in the U.S. (USA001), were reporting an alarming difference in MOS. This drop in call quality was typical of an ongoing, yet unpredictable and intermittent, problem our customer had been experiencing over the past several months. As you can see in the lower left of Figure 3, network performance metrics like forwarding loss, discards, latency and jitter clearly indicate a problem.
The two agents at the top of the list (Figure 4) exhibit a low MOS along with high packet discards and jitter. However, what is also interesting is that both locations appear to be receiving traffic marked with the DSCP value of AF11, which does not match the configuration of the test. As seen in the advanced configuration settings in Figure 2, RTP traffic for this customer was configured with a DSCP value of EF.
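For reference when reading these codepoints: EF is DSCP 46, AF11 is DSCP 10 and Best Effort is DSCP 0. A small illustrative helper (covering only the standard codepoints) decodes a 6-bit value into its per-hop-behaviour name:

```python
def dscp_name(dscp: int) -> str:
    """Decode a 6-bit DSCP codepoint into its standard PHB name."""
    if dscp == 46:
        return "EF"                        # Expedited Forwarding
    if dscp == 0:
        return "BE"                        # Best Effort (default)
    cls, low = dscp >> 3, dscp & 0b111
    if low == 0:
        return f"CS{cls}"                  # Class Selector (CS1-CS7)
    if 1 <= cls <= 4 and low in (2, 4, 6):
        return f"AF{cls}{low >> 1}"        # Assured Forwarding, e.g. AF11 = 10
    return f"DSCP {dscp}"                  # non-standard codepoint
```

Here the configured value decodes as `dscp_name(46) == "EF"`, while the observed value decodes as `dscp_name(10) == "AF11"`, confirming that the traffic has been remarked somewhere in transit.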
Behind The Scenes: Troubleshooting Voice Quality
Path Visualization represents a hop-by-hop Layer 3 network topology between the source and target agents of the voice test. This level of data helps us understand the route taken by the voice calls and whether any specific device within the network is affecting call quality. Hovering over the source nodes indicates that the US and European sites appear to be having problems because of remarked QoS values, as shown in Figure 5, whereas the other locations, shown in green, have no call quality issues as the DSCP value of EF remains untouched. This is an indication that somewhere in the network VoIP traffic is being remarked, which is degrading application performance and negatively impacting the end user experience. But the question is: where?
The “Quick Selection” view quickly discovers three links in the topology where DSCP values are being changed (Figure 6).
Let’s take a closer look at how the VoIP traffic has been modified, starting with Europe.
At the first hop we notice that traffic was remarked to DSCP 0 (Best Effort). This could reflect a local configuration issue, one that should be simple to rectify.
However, that does not explain why we see another DSCP change three hops into the topology. If we move along the topology to the third hop from the source agent, we can clearly see that traffic from both European sites converges on the same WAN link; however, it is being treated very differently. Traffic is being split evenly between EF and Best Effort. This indicates that this particular node is honouring EF traffic but is configured to remark Best Effort traffic, which the service provider remarks to AF11. So essentially, highly sensitive VoIP traffic is being remarked twice on its journey, which is likely the reason we are seeing very high discards, jitter and latency.
The behaviour of traffic from the European sites differs from what the US site is experiencing. Unlike Europe, traffic from the US experiences a DSCP change only once along the path (Figure 6). Interestingly enough, the traffic is again being reclassified from EF, but this time directly into AF11, and this is again the likely reason for the very high discards, jitter and latency.
Influencing The Resolution
In Europe, the service provider behaviour was as expected: Best Effort traffic was reclassified as AF11. However, VoIP traffic should not have been remarked to Best Effort in the first place. If you recall, that first remarking happened at the very first hop, and it was addressed by a configuration change to treat EF traffic appropriately through the network. Traffic from the US agent, on the other hand, was remarked due to a device-specific configuration issue. Within a few minutes of identifying the issue, we were able to gather this intelligence, enabling our customer to share this information directly with their provider and influence the resolution.
How to Monitor Your VoIP Network
This is just one example of how ThousandEyes can provide value in the context of VoIP. You too can get this type of visibility into your network by quickly installing our software agents within your WAN. Roll up your sleeves, sign up for a free 15-day trial, and start monitoring VoIP in your network.