Wikimedia projects down due to power problem in primary data center

Starting at 0:10 UTC on July 5th, the Wikimedia Foundation suffered from
intermittent, partial power failures in the internal power network of
one of its main data centers in Tampa, Florida. Due to the temporary
unavailability of several critical systems and the large impact on the
available systems capacity, all Wikimedia projects went down. The power
situation stabilized at 1:12 UTC, and systems and services recovery has
been taking place since. We expect all projects to be back online and
editable around 4:00 UTC.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

9 Comments

Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.

Mike.lifeguard :
Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.

As someone who’s had this exact same problem inside of a professional colo, let me explain: Unlike your home, colocation facilities do not have batteries on every computer (Google is rumored to do this now, but that is neither here nor there). These facilities have giant batteries that run the entire facility. Though, generally the batteries are only big enough to keep the facility online for long enough to have the giant diesel generators…

Mike.lifeguard :
Are there not UPSes on the servers? Redundant power supplies? I’m surprised a professionally-managed datacentre like the one we use would have such problems.

And they should also have enough oil down there in Florida to run them *g*

Jon :
Any part of the power system in a colo _could_ fail, generally it is redundant and doesn’t fail, but even in the best of cases – things can go wrong. Heck, it is possible that a wire in the wall from the batteries to the servers melted down and failed.

Yes, the reason I asked was precisely to find out what went wrong, not imply that nothing should ever go wrong 🙂

Jon :
As for redundant power supplies (on the machines themselves), that only helps in the case of A) having 2 completely separate power sources (generally…

I was so glad to see support for offline editing in the budget from last week. That will support the dual use of allowing for mirroring peer-to-peer wikis hosting copies of projects (maybe including image bundles, if we’re lucky.) The server-room implications should be that redundancy will be The trick to making offline and peer-to-peer editing run smoothly is third party edit conflict resolution, which is remarkably similar in many ways to the pending changes extensions which recently went live, and also shares characteristics with WP:3O. And also, this outage seems to have been addressed very well. Kudos to the…

(“…The server-room implications should be that redundancy will be” easier, I meant to include.)

Having only ONE power supply for mission critical datacentres is generally NOT common.
Starting with TIER 4 (Uptime’s Classification) you will have multiple power and cooling distribution parts (see: http://www.bitkom.org/files/documents/reliable_data_centers_guideline.pdf)
– THAT’s STATE OF THE ART DATACENTRE DESIGN! #;-).

James Salsman :
(“…The server-room implications should be that redundancy will be” easier, I meant to include.)

I’m surprised a professionally-managed datacentre like the one we use would have such problems.

Is there no stand-by power? Curious.