Prioritizing End User Experience at Scale at IG Group

In this video presentation, Steve Bamford, Senior Infrastructure Reliability Engineer at IG Group, explains how the company adjusted to a remote working environment during the COVID-19 pandemic. IG Group is a world leader in online trading, operating in 16 countries with 1,700 staff worldwide. Most of IG Group’s trades are made through the Internet, so visibility into the Internet beyond its own boundaries has become key to delivering its core services. To monitor the user experience of its employees and optimize online performance, IG Group deployed ThousandEyes Endpoint and Cloud Agents to monitor its web-based platform, Office 365, and its Zscaler cloud security platform.

Steve Bamford, Senior Infrastructure Reliability Engineer at IG Group, speaks at ThousandEyes Connect.

Follow Along with the Transcript

Steve Bamford
One thing just to cover, a little bit about IG. It’s been established for over 45 years. We had a net trading revenue of 467 million as of the 31st of May last year, and we operate in 16 countries with 1,700 staff worldwide. A little bit about myself: I have over 20 years of network experience, the last 8 of which I’ve been at IG. Before that, I worked for CSC. As already said, my focus is on maintaining stable infrastructure from which to serve our clients and colleagues. So, a little bit about our journey. We first started our engagement with ThousandEyes in 2018. As our technology moves to the cloud and most of our trades go through the Internet platform, the Internet has very much become part of our network, and greater visibility of the Internet beyond our own boundaries became key. As of 2019, we subscribed to Cloud and Enterprise Agents to monitor our dealing platform. Then COVID-19 happened, and very quickly we realized that we needed something to monitor our user experience and how our colleagues were working. So we managed to obtain 500 temporary licenses. Very quickly, it became clear that we actually needed to cover the entire estate, and we were able to secure those licenses from ThousandEyes. Since we deployed the Endpoint Agent, we have seen a reduction in escalated single-user calls, and we’ve managed to identify key issues around one of our remote-working solutions, as well as the lack of a ZPA Zen in South Africa and the impact that has.

Steve Bamford
So, our remote user connectivity requirements. Once we got the agents, we needed to identify what our remote user connectivity requirements are and what we needed to monitor. Very quickly, Zscaler was the first one. We use both ZIA and ZPA: ZIA for Internet access and ZPA for our internal applications. In addition, we have Office 365 and our own web-based platform, which some of our colleagues on the front line, on the trading services and dealing desks, use day in, day out. Based on this, we made the decision to monitor the key end-user tools. So we selected the Zscaler Zen for the country in question, the IG production data center for our RDWeb tools and platform, and Teams for collaboration. The other aim we had was to enable our first-line teams, our service desk, to carry out those initial diagnostics. In addition, we wanted to understand the key issues as they arise and their impact, and be able to communicate these effectively, not just to the leadership team, but also to our users and colleagues around the world. And also to support our Cloud tests as the need arises, to support what we’re monitoring on our platform.

Steve Bamford
So, troubleshooting the user experience. The one thing we wanted to do is make sure our first line, our service desk teams, could quickly identify an issue. Working with professional services, we came up with this dashboard, which enables us to see very quickly the top users who may be experiencing an issue. If we take this laptop here, laptop 167, as an example: if they had called up and said, “I have a problem connecting to services like Zscaler or to the production data center,” then the service desk guy can come onto the dashboard and very quickly identify that, yes, they have availability to Teams. Gateway loss is there. Their wireless signal, in the middle bar, is okay. But they have got latency. Sorry, loss to Teams and Zscaler.

Steve Bamford
So we’ll then be able to work with the user to try and resolve the home network issues that they are experiencing, again using the tool. If they were saying the problem was intermittent, you can drill down and see the local network for the user. You’re able to see how consistent or intermittent the problem is and be able to offer advice. So, I talked about supporting our Cloud Agent deployment. One of the things we noticed in South Africa from our Cloud Agents is that when one of the major undersea cables got cut back in March, we saw a latency increase from the Cape Town agent, pushing the latency above 200 milliseconds, which is unusual because we use Akamai, so it should be below that. When we investigated, we saw that the ISP Akamai used was only advertising this traffic from London. As you can see from the screenshot, it shows London Internet Exchange. We obviously raised the case with Akamai, but what we also did is switch one of the tests to be the website for IG, and we saw that it wasn’t just our own users and the Cloud Agent that were impacted: our clients and our users who used the website in Cape Town or in that region were also hitting that edge node and routing, in this case, via Cogent and the UK. So by using that, we were able to give Akamai more evidence. And once they resolved the issue, we could see very quickly the latency fall from over 200 milliseconds back down to a more reasonable 17 milliseconds, and the traffic remained in-country.
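
Editor’s note: the latency check described above can be sketched in a few lines of Python. This is only an illustration, not IG Group’s or ThousandEyes’ actual alerting logic. The 200 ms figure comes from the talk, but treating it as an alert threshold, comparing it against the median, the function name `flag_high_latency`, and the sample values are all assumptions made for the example.

```python
from statistics import median

# 200 ms is the figure quoted in the talk; using it as an alert
# threshold is an assumption made for this sketch.
THRESHOLD_MS = 200.0


def flag_high_latency(samples_ms, threshold_ms=THRESHOLD_MS):
    """Return True when the median latency of the samples exceeds the threshold.

    The median is used so that a single slow probe does not trigger a flag.
    An empty sample list is treated as "nothing to flag".
    """
    if not samples_ms:
        return False
    return median(samples_ms) > threshold_ms


# Hypothetical round-trip times in milliseconds, invented for illustration.
during_incident = [212.0, 208.5, 221.3, 205.9]
after_fix = [17.2, 16.8, 18.1, 17.0]

print(flag_high_latency(during_incident))
print(flag_high_latency(after_fix))
```

With these made-up samples, the first set is flagged and the second is not, mirroring the drop from over 200 milliseconds to around 17 milliseconds once the routing was corrected.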

Steve Bamford
So what I wanted to show you is… all of us in the U.K. are very aware of the Virgin Media outage that occurred on the 27th of April. We had a look at this, and what we are able to see … as my screen refreshes … when we looked at the impact to our users, we were able to see very quickly that the traffic wasn’t even leaving Virgin Media’s network. You’ll see here that this test is specifically for our London users, because we’ve had to split the U.K. up into regions due to the number of U.K.-based employees. So very quickly, we saw that the traffic was staying within Virgin Media’s network and not reaching our production data center. And we could see, by looking at our colleagues who are based in the North of England, that it wasn’t just London- or Southern-centric in the U.K.; it was actually impacting the whole of the U.K. When we came to the next morning, we started to get queries about connectivity to the production data center. As we are currently trialing Internet Insights, we were able to use that and combine it with the Endpoints, and we could see that the issue was actually more widespread than Virgin Media. When we delved in, we were able to see that UPC in general had a continuous period of intermittent outages. While it went on in the U.K. probably up to around midnight for Virgin, there were certainly on-the-hour, every-hour issues occurring on the wider European network for UPC, which we saw here. We were then able to look back at our production data center test, or it could have been our Zscaler test, pick out the UPC network in Poland, and again see packet loss affecting a number of our users from Poland. We could see that, again, this followed the wider UPC outage times pretty much to the letter, whereas normally, if we were to look at these users, the traffic is without an issue.

Steve Bamford
So that was one thing we were able to see very quickly with Endpoints. Previously, it would have taken, you know, a long time with traceroutes and remoting into people’s PCs, if we could. You wouldn’t be able to see this easily without the Endpoint Agents. When we were looking at this, obviously one thing we wanted to see was a reduction in the escalation of remote worker cases. We wanted to see, ideally, the number go down. And what I wanted to demonstrate here is, if you look between the 1st of March and before the lockdown, our average number of remote worker cases that were raised by our service desk was around 10. Then we went into lockdown, and very quickly that number tripled to over 30, largely because of the Zscaler Global ZPA issue that they had. But even when it fell down, it was still twice the number of incidents that were coming into our service desk for remote worker cases. As we deployed the Endpoint Agent, what we were able to see, if you look at the blue and red in the right-hand graph, is that the number of escalations to our desktop and networks teams in the last 2 weeks has actually fallen off. And the number of tickets being raised has fallen off, which is probably an indication of the service desk not needing to raise a ticket because we’ve been able to identify the problem. So, a bit of a whirlwind tour. But hopefully, it gives you an insight into how we’ve been able to use the Endpoint Agents to support our user experience with everybody working from home.
