Downtime on en.wikipedia.org resolved

We had 52 minutes of downtime on the English-language Wikipedia site today; only en.wikipedia.org was affected. Our master database server was thrown into a funky state in which hundreds of access threads were stuck in the “statistics” state — which seems to be MySQL’s way of saying “I’ve fallen and I can’t get up”.
It’s unclear exactly what set it off, but basically nothing works until you restart MySQL. After switching the site to an alternate master database, all has been well.
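A minimal sketch of how a pile-up like this could be spotted, assuming the rows of MySQL's `SHOW PROCESSLIST` have already been fetched as dictionaries. This is not the tooling actually used during the outage: the function names, the threshold of 100, and the sample rows are all hypothetical.

```python
from collections import Counter


def count_thread_states(rows):
    """Tally the State column of SHOW PROCESSLIST-style rows.

    Each row is a dict; a missing or NULL State is counted under "".
    """
    return Counter((row.get("State") or "").strip().lower() for row in rows)


def looks_stuck(rows, state="statistics", threshold=100):
    """Return True when at least `threshold` threads share the given state.

    The default threshold is an arbitrary, hypothetical cut-off.
    """
    return count_thread_states(rows)[state] >= threshold


# Hypothetical sample data: 300 threads in "statistics", 20 idle connections.
sample_rows = (
    [{"Id": i, "Command": "Query", "State": "statistics"} for i in range(300)]
    + [{"Id": 300 + i, "Command": "Sleep", "State": ""} for i in range(20)]
)

stuck = looks_stuck(sample_rows)
print(stuck)
```

A check like this only flags the symptom. As described above, the fix was still to move to another master and restart MySQL.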
At 52 minutes from the start of the event, this took us a bit longer to resolve than I’d like: we had to percolate through a couple of levels of alert calls before we finished diagnosing it and getting the DB switch pushed through. (Sorry to wake you up early, Tim!)
A similar event in the future should be fixable within a few minutes, thanks to Tim’s work on making the master-switch system more foolproof. We’re fixing up our internal documentation so all our site ops will know how to run the database master switch script next time!
— brion

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

4 Comments

I have a brilliant idea!
From now on, our downtime screen should say, “This Wikipedia is broken. We recommend looking up this subject in your local library; while you’re at it, kindly take down notes and add them to the Wikipedia article later.”

No donation link? A wiki was down; are donations up?

In this case the donations page would have worked fine… we don’t always want a link though, since some sitewide outages would leave that broken too. 🙂

Well, it is doing it again: only partial access, and it's including the Wikinews servers this time, with no access to the Wikinews page. I vote conspiracy theory. Is it safeguarded against malicious flooding? ~~~~