Post Mortem on last night’s 1.17 deployment attempts…

We’ve received many complaints about strange behavior on various wikis we host starting last night. These problems were directly related to an attempted deployment.
A bit of background about the 1.17 release:

  • In Oct 2010 we committed to more frequent releases in response to community requests.
  • Simultaneously, we committed to cutting through the backlog of code review requests from the community. As of this writing, the Code Review Team we formed has reduced the backlog of over 1400 un-reviewed core revisions down to zero in the 1.17 branch, as well as dispatching roughly 4000 other revisions in extensions (figuring out which ones we needed to review, and reviewing the important revisions there, too).
  • 1.17 was an omnibus collection of fixes, including a large number of patches which had been waiting for review for a long time. The Foundation’s big contribution to the release was the ResourceLoader, a piece of MediaWiki infrastructure that allows for on-demand loading of JavaScript. Many other incremental improvements were made in how MediaWiki parses and caches pages and page fragments.

As is our usual practice, we review all code before trying to deploy it. This practice has generally been good enough in the past that we have been able to quickly address anything we don’t catch in review within the first few minutes of deployment. The 1.17 release process has been longer than we would have liked, which has meant more code to review, and a higher likelihood of accumulating a critical mass of problems that would force us to abort a deployment.
Our preparation for deployment uncovered a few issues, including a schema change, an update to the latest version of the diff utility, and various other small issues discovered during the initial deployment to our test wiki. That test push turned out to have been hugely useful, and in future we will take it as a lesson learned that any large deployment must first deploy successfully to the test wiki at least 24 hours prior to general deployment.
When we finally deployed last night, our Apaches started complaining pretty much immediately. We rolled back to the previous version, worked on debugging and thought we had a suitable fix. We attempted deployment again but hit the same issue very quickly. What we discovered was that our cache miss rate went from roughly 22% with the old version of the software (1.16) to about 45% with 1.17. The higher miss rate increased the load on our Apaches to the point where they couldn’t keep up, at which point they started behaving unpredictably. This can cause cascading failures (for example, caching bad data served by overloaded Apaches), and can result in strange layout problems and other issues that many people witnessed today.
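To put those miss-rate numbers in perspective: a jump from a 22% to a 45% miss rate roughly doubles the number of requests that fall through the caches to the Apaches, even with no change in user traffic. A minimal back-of-the-envelope sketch (the 50,000 req/s figure is an assumed round number for illustration, not our actual traffic):

```python
# Back-of-the-envelope: how a higher cache miss rate multiplies
# backend (Apache) load. The request rate is illustrative only.
requests_per_sec = 50_000

old_miss_rate = 0.22  # observed with MediaWiki 1.16
new_miss_rate = 0.45  # observed with MediaWiki 1.17

old_backend_load = requests_per_sec * old_miss_rate  # 11,000 req/s
new_backend_load = requests_per_sec * new_miss_rate  # 22,500 req/s

multiplier = new_miss_rate / old_miss_rate
print(f"Backend load multiplier: {multiplier:.2f}x")
# A fleet provisioned for ~11k req/s of cache misses suddenly sees
# ~22.5k req/s -- more than double, with user traffic unchanged.
```

That more-than-2x multiplier is why the failure shows up abruptly: the backend fleet is sized for the steady-state miss rate, and anything much beyond it tips the Apaches into overload.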
By the way, whenever we do a large deployment, a number of WMF staff and community developers meet online to work through any issues that might arise. We schedule deployments late at night in the US to take advantage of lulls in request traffic, so everybody is working late. By the second failure, these people had been awake for many hours and we started to be concerned about their ability to work efficiently on little sleep, so I vetoed further attempts at deployment today.
We are currently combing the logs for further clues about how to mitigate the risk of a similar outcome when we next attempt to deploy 1.17, which most likely won’t happen until later this week (at the earliest). We are also closely investigating the check-ins related to parsing and caching, and evaluating our profiling data. We plan to regroup tomorrow, decide how confident we are in the fixes we have been able to implement in the past 24 hours, and make a decision as to when we should target to deploy.

Archive notice: This is an archived post from a predecessor blog, which operated under different editorial and content guidelines than Diff.


I think that should be rather than 😉

{sigh} thanks for the proofreading, TeMc. We type those words over and over… and they become interchangeable in the fingers :-D

Relevant commentary in today’s Startup Quote:
Thanks for the updates on the blog, they help a lot!

All problems with the deployment aside: Good job.
Okay, there were surprises in the code (hey, that’s normal). The techs worked fast and efficiently to get rid of the problems, and did so for quite a while. The second attempt failed as well, but rollback was (reasonably) fast again. No complaints there either. And postponing the next attempt is the “right thing to do”.
Now, good luck for the next attempt, may it be more successful. There’s a few API changes I’m waiting for, so “hurry up” 😉

Great post-mortem. Kudos to the team, who seemed to spend 24 hours at their keyboards and, throughout, were calm and friendly with the sometimes-frantic wikimedians.
Comment: Peak WMF rps seems to coincide with EU business day, not the US; consider moving deployment -3 to -5?

Amgine :
Comment: Peak WMF rps seems to coincide with EU business day, not the US; consider moving deployment -3 to -5?

Our rps peak is actually between 16:00 and 21:00 UTC, so that’s 5pm-10pm in Europe, 11am-4pm Eastern and 8am-1pm Pacific. The rps low is narrower with the lowest rps of the day around 06:00 UTC (7am Europe, 1am Eastern, 10pm Pacific; this is when we’re starting attempt #3 on Friday). See (green=Amsterdam, blue=Tampa)
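The UTC-to-local-time arithmetic above can be double-checked in a few lines of Python (the specific date is chosen for illustration, during winter time, so no daylight-saving offsets apply; zone names are standard IANA identifiers):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The low point of the daily rps curve, per the graph: 06:00 UTC.
low = datetime(2011, 2, 9, 6, 0, tzinfo=ZoneInfo("UTC"))

for zone in ("Europe/Amsterdam", "America/New_York", "America/Los_Angeles"):
    local = low.astimezone(ZoneInfo(zone))
    print(zone, local.strftime("%H:%M on day %d"))
# Europe/Amsterdam is 07:00 (7am), America/New_York is 01:00 (1am),
# and America/Los_Angeles is 22:00 the previous evening (10pm Pacific).
```

The same conversion applied to the 16:00–21:00 UTC peak gives the 5pm–10pm Europe / 11am–4pm Eastern / 8am–1pm Pacific window quoted above.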

Many, many years ago I got on a train late at night. It was very full and the only seat I could find turned out to be in a carriage where none of the lights worked. The train rattled along until coming to a stop seemingly in the middle of nowhere. Someone with a head torch got on and the train set off again. We could see him working in the corner of the carriage until the carriage lights came on. Everyone gave a great, friendly cheer. After a while the train stopped again and, as the guy got off into…
