Powerball Gambles on Risky Network Strategy

Posted on January 14th, 2016

On January 13th, Powerball announced the winners of the largest lottery jackpot in U.S. history. Nearly $1.5B was up for grabs, and with more than 635 million tickets sold, punters scrambled to check the winning numbers as they were announced at 8pm PT. There were three big winners. The Powerball website, however, turned out to be the loser for the night.

We’ll dig into the infrastructure behind the faltering Powerball website and how it performed through the night. You can follow along with all of this data using this share link: https://fuzee.share.thousandeyes.com

Mega Meltdown

The Powerball website that serves up jackpot estimates and winning numbers usually sees a page load time of 650ms. This is pretty fast, helped along by the sparse content on the page, as you’ll see in Figure 1.

Figure-1
Figure 1: The Powerball website with winning numbers from the January 13th lottery.

However, as the lottery drawing approached, page load times spiked to 5 seconds at 8pm and reached over 10 seconds within minutes after the drawing. Figure 2 shows the page load time for the Powerball website, with a dramatic increase around the 8pm drawing.

Figure-2
Figure 2: Page load failures and load times exceeding 10 seconds coincided with the 8pm PT lottery drawing on January 13th.
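Page load trends like those in Figure 2 come from distributed synthetic monitoring. As a rough sketch of the idea, using only the Python standard library, a single vantage point can time a plain HTTP fetch like this (note this captures DNS, connect, and transfer only, not the subresource fetching and rendering a real page-load metric includes):

```python
import time
import urllib.request

def fetch_time_ms(url: str, timeout: float = 15.0) -> float:
    """Time a single HTTP GET of a page, in milliseconds.

    This measures DNS resolution + connection + transfer only; a full
    page-load metric would also include parsing and rendering time.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # drain the body so transfer time is included
    return (time.monotonic() - start) * 1000.0
```

A monitoring agent would run a probe like this on a schedule from many cities and flag fetches that exceed a threshold, such as the 10-second loads seen minutes after the drawing.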

You’ll also see that the vast majority of users were not even able to fully load the Powerball website. Figure 3 shows web server availability, which cratered in all but 2 of the 28 cities that we tested.

Figure-3
Figure 3: Fewer than 10% of monitored locations were able to reach the Powerball website
when the winning numbers were announced.

Looking Behind the Curtain

So how did Powerball, which knew it had a record-breaking lottery on its hands, end up with such dismal performance? Let’s dig into the infrastructure and network behind Powerball. The website is hosted in a data center in Kansas City, run by the Multi-State Lottery Association (MUSL). Figure 4 shows the typical network paths into the data center via the ISP Cogent Communications.

Figure-4
Figure 4: At 6:50pm PT, all paths flow through upstream ISP Cogent to
the Multi-State Lottery Association (MUSL) data center in Kansas City.

At 7:05pm, MUSL turned on routes to Microsoft Azure, directing traffic from approximately half of the cities observed to Microsoft’s cloud data centers. Figure 5 shows traffic from 5 cities flowing through Microsoft’s network (green nodes).

Figure-5
Figure 5: Starting at 7:05pm, new routes to Microsoft Azure were also announced and approximately half of
the monitored locations began accessing the Powerball site hosted there.

But as the 8pm drawing approached, the network was under strain from all of the traffic. Packet loss increased, as can be seen in Figure 6, reaching over 90% at the time the winning numbers were released.

Figure-6
Figure 6: Packet loss to the Powerball website was at elevated levels for 90 minutes around the 8pm drawing.
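Loss figures like those in Figure 6 come from ThousandEyes agents; as a hypothetical stand-in, a comparable number can be pulled from the Unix `ping` utility. The sketch below assumes the common `-c` flag and the "% packet loss" summary line, whose exact wording varies slightly across platforms:

```python
import re
import subprocess

def parse_loss(ping_output: str) -> float:
    """Extract the packet-loss percentage from a ping summary line."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))

def measure_loss(host: str, count: int = 10) -> float:
    """Send `count` echo requests with the system ping and report loss."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    )
    return parse_loss(result.stdout)
```

Run from many locations at once, even this crude probe would have surfaced the 90%+ loss seen at draw time.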

At 8:05pm, MUSL again spread the love to another provider, this time Verizon’s Edgecast CDN. Figure 7 shows the network path topology just after the winning numbers were announced. Paths taken to Microsoft’s data center make up the cluster on top (168.61.218.73), Edgecast in the middle (72.21.91.39) and the MUSL data center on the bottom (104.219.253.10).

Figure-7
Figure 7: After 8pm, the Edgecast CDN was also added into the rotation alongside Microsoft Azure and the MUSL data center.
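The actual split across the three origins was done at the DNS and routing layer, not in application code, and the real traffic shares were never published. But the aggregate effect on incoming clients resembles a weighted random pick, sketched here with invented weights and the three origin IPs from Figure 7:

```python
import random

# Illustrative weights only; the real split happened via DNS/routing,
# and the actual traffic shares were not published.
ORIGINS = {
    "168.61.218.73": 40,    # Microsoft Azure
    "72.21.91.39": 40,      # Verizon Edgecast CDN
    "104.219.253.10": 20,   # MUSL data center, Kansas City
}

def pick_origin(origins: dict = ORIGINS, rng=random) -> str:
    """Weighted random pick -- the aggregate effect a DNS-level split
    across several origins produces for newly arriving clients."""
    names = list(origins)
    weights = [origins[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Shifting weight away from the overloaded Kansas City origin is exactly what the 7:05pm and 8:05pm changes accomplished, just one DNS cache expiry at a time.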

At 10:10pm, after the traffic finally died down and application and network metrics returned to normal levels, MUSL reverted to routing traffic to their own data center through upstream ISP Cogent.

Figure-8
Figure 8: Around 10pm, MUSL went back to routing traffic only to their own data center through ISP Cogent.

Scaling Lessons Learned

So what could MUSL have done better? As the winning numbers were announced, the Powerball website simply wasn’t equipped to handle the massive amounts of traffic it received.

For sites that have spiky, but predictable traffic, here are a few options:

  • Use a CDN to serve up traffic round-the-clock. This costs more but delivers the best customer experience.
  • Flip on a CDN service well before known traffic peaks. MUSL did this with Edgecast, but not until the drawing itself, at which point DNS changes can take a while to propagate.
  • Diversify with multiple data centers and upstream ISPs. MUSL had only one data center and one upstream ISP, Cogent Communications—if Cogent or their single data center goes down, MUSL’s service goes with it.
  • Within the data center, more load balanced network paths and web servers would also help to reduce performance impacts.
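The DNS-propagation point above can be made concrete with a simple model: if resolver caches refresh at uniformly random times, the share of resolvers still returning the old origin decays linearly over one TTL after the record changes. This is an idealized assumption — real resolvers also apply TTL floors and caps — but it shows why cutting over at draw time is too late:

```python
def stale_fraction(elapsed_s: float, ttl_s: float) -> float:
    """Fraction of resolver caches still serving the old DNS answer,
    `elapsed_s` seconds after a record change, under the idealized
    assumption that cache ages are uniformly distributed over one TTL."""
    if ttl_s <= 0:
        return 0.0
    return max(0.0, 1.0 - elapsed_s / ttl_s)

# With a 60-second TTL, half the caches have turned over after 30s
# and all of them after one full TTL:
assert stale_fraction(30, 60) == 0.5
assert stale_fraction(120, 60) == 0.0
```

With a short TTL the window is a minute or two; with a long TTL, flipping the record at draw time leaves many clients pinned to the overloaded origin long after the switch. Announcing the CDN well ahead of the peak, or pre-lowering the TTL, shrinks that window.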

The odds of this Powerball drawing were 1 in 292 million. Winning the lottery may be a shot in the dark, but when it comes to web performance, you can have a guaranteed return if you properly prepare for your network’s next big event.

  • Jason Maher

    Wow! A very cool write up and very accurate Nick! We were aware that we could have switched to the CDN a bit faster as we only have a 10 gig link out of our datacenter in KC. Splitting the traffic with Azure was something we came up with on the fly and were testing it to see how long it could hang before needing to switch to the CDN. Even with splitting our traffic between Azure and KC and flattening our webpage we still needed to switch to the CDN once draw time hit. DNS was set to a 1 min TTL so load times only got high for a few mins. Thanks again for this write-up!

    Jason A. Maher
    Network Engineer
    Multi-State Lottery