WPX Downtime August 2, 2021: What Happened & What We Are Doing About It

By Terry Kyle
Australian, Chief Dog Lover/Co-Founder WPX.net (With Georgi Petrov, CEO)
Arguably The World’s Fastest WP Host

OFFICIAL STATEMENT BY Georgi Petrov & Terry Kyle (founders/owners of WPX):

What Happened Exactly?

On August 2, 2021, all US services for WPX went offline for roughly 5 hours, including our own website, WPX.net.

Websites and emails were all affected. 

Even the mail servers that we would use to notify customers of the problem were cut off from the Web by this outage.

The fault was quickly traced back to the colocation and upstream provider we use in Chicago, Steadfast.net, whose entire network went down.

Here is their Facebook page post about the issue:

and their late official statement:

At no point could WPX speed up the work as it was being handled by a different company in the supply chain of our hosting services.

Context

On June 8, 2021, one of the world’s largest CDN providers, Fastly, suffered a service outage for about 1-10 hours, depending on location.

When Fastly went down, it took down Spotify, CNN, The New York Times, Reddit and many other high-traffic sites, for up to 10 hours.

On July 22, 2021, a major DNS provider went down and took many big sites with it, including UPS, FedEx, Airbnb, Fidelity, Steam, LastPass, PlayStation Network and many others.

Same deal for Amazon Web Services back in November 2020.

Google has had downtime issues (3 in 2020), the Meltdown and Spectre vulnerabilities discovered in Intel processors affected millions of computers built by other companies, and the Heartbleed bug affected tens of millions of computers.

The point is that the whole Web is interconnected.

No one organization or company owns the whole Internet.

Every company, including Google and Amazon, also relies on other service providers, just like WPX relies on Steadfast, which has generally been good so far.

We absolutely hate downtime but it can happen to any online service provider.

What About Redundancy Systems?

Some affected customers have asked why we didn’t have multiple redundancy systems in place to prevent this extended outage.

We do, and our contract agreement (SLA) with Steadfast is based on multiple redundancies in their system.

A forthcoming Incident Report should reveal why that failed.

During this outage, at no time were files (websites and emails) lost or in jeopardy (they are stored in 2 locations), but access to them was temporarily cut off.

Our failure here was not anticipating the possibility of a complete network outage on Steadfast’s side.

We have already started working on mitigating any future repeat of this situation by investigating backup upstream providers, apart from Steadfast.

Why Didn't WPX's CDN Serve Sites When Chicago Went Down?

Content Delivery Networks (CDNs) are generally not designed as a kind of ‘failswitch’ (failover) platform that keeps a complete copy of every website at every CDN endpoint location.

Instead, CDNs like ours are designed to deliver the most popular individual pages of websites at the highest possible speed from a location nearest to the website visitor. For example, a site hosted in Chicago can serve a popular page to a visitor in Paris from a server location in Paris, or nearby.

As a result of the Chicago outage yesterday, we are now considering how our own CDN can be adapted in future with a failswitch option on it.
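
To illustrate the difference between a normal CDN cache and a cache with a failswitch (failover) option, here is a minimal Python sketch. It is not WPX's actual CDN code: the names `Origin`, `EdgeCache`, `OriginDown` and the `serve_stale_on_error` flag are all hypothetical, and a real CDN is far more involved. The idea is only that an edge location keeps copies of pages visitors have requested, and a failover option could let it keep serving an expired copy when the origin cannot be reached.

```python
import time
from typing import Callable, Dict, Tuple


class OriginDown(Exception):
    """Raised when the origin server cannot be reached."""


class Origin:
    """Hypothetical stand-in for the origin server that holds the full site."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages
        self.up = True

    def fetch(self, path: str) -> str:
        if not self.up:
            raise OriginDown(path)
        return self.pages[path]


class EdgeCache:
    """One CDN edge location. It only holds pages that were requested through it."""

    def __init__(
        self,
        fetch_origin: Callable[[str], str],
        ttl_seconds: float,
        serve_stale_on_error: bool = False,  # assumed "failswitch" option
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_origin = fetch_origin
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale_on_error
        self._clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}  # path -> (body, stored_at)

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._store.get(path)

        # Fresh copy at the edge: serve it without contacting the origin.
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]

        try:
            body = self._fetch_origin(path)
        except OriginDown:
            # Failover path: serve the expired copy if we have one.
            if self._serve_stale and entry is not None:
                return entry[0]
            raise  # never cached, or failover disabled: the page is unavailable

        self._store[path] = (body, now)
        return body


if __name__ == "__main__":
    origin = Origin({"/": "home page", "/about": "about page"})
    now = [0.0]  # fake clock so the example does not need to sleep
    edge = EdgeCache(origin.fetch, ttl_seconds=60,
                     serve_stale_on_error=True, clock=lambda: now[0])

    print(edge.get("/"))      # fetched from the origin and cached at the edge
    origin.up = False         # simulate the origin network going down
    now[0] = 120.0            # the cached copy has now expired
    print(edge.get("/"))      # the expired copy is still served
    try:
        edge.get("/about")    # this page was never requested before the outage
    except OriginDown:
        print("/about was never cached, so it is unavailable")
```

Note that even with the failover flag on, a page nobody requested before the outage cannot be served, which is why a CDN cache alone does not behave like a full copy of every site.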

Why Wasn't There More Information During The Downtime?

During the outage, as we posted on Facebook and in all message replies during the downtime, the only information coming out of Steadfast was “we are working on the problem and expect to have it fixed soon”.

That’s all we had and that’s all we could share for 5 hours, however frustrating it was for WPX customers.

Clearly we need to improve that speed and level of communication in future.

Another WPX failure here was that our email infrastructure was located in the same Chicago platform as our Web hosting and when network access was lost, so was email capability.

We are immediately working to rectify that point of failure and to make better use of social media in future.

What Will WPX Do Moving Forward?

With the single-minded goal of never having WPX customers in this situation again, we are:

[1] immediately exploring backup upstream providers for fast implementation so that we are not reliant on Steadfast (and their theoretical multiple redundancies, according to our contract with them);

[2] reviewing locations and providers for a shift of our email infrastructure, separate from site hosting, so that no email communication is ever affected again (when our network access went offline, so did email notification for customers – unacceptable!); and,

[3] changing our customer communication practices to ensure that all WPX customers are notified of downtime much more quickly, rather than relying on social media.

We welcome any other WPX customer input into changes in our procedures and practices (it can be sent to support@wpx.net, thank you).

What About Compensation For Affected WPX Customers?

Affected customers should contact WPX Support directly to discuss compensation.

We deeply regret the inconvenience caused.

This outage revealed specific systemic areas where we need to improve.
