Since last night, we’ve received many complaints about strange behavior on various wikis we host. These problems were directly related to an attempted deployment of the 1.17 release.
A bit of background about the 1.17 release:
- In Oct 2010 we committed to more frequent releases in response to community requests.
- Simultaneously, we committed to cutting through the backlog of code review requests from the community. As of this writing, the Code Review Team we formed has reduced the backlog of over 1400 un-reviewed core revisions down to zero in the 1.17 branch, as well as dispatching roughly 4000 other revisions in extensions (figuring out which ones we needed to review, and reviewing the important revisions there, too).
As is our usual practice, we review all code before trying to deploy it. This practice has generally been good enough in the past that we have been able to quickly address anything we don’t catch in review within the first few minutes of deployment. The 1.17 release process has been longer than we would have liked, which has meant more code to review and a greater likelihood of accumulating a critical mass of problems that would cause us to abort a deployment.
Our preparation for deployment uncovered a few issues, including a schema change, an update to the latest version of the diff utility and various other small issues which were discovered during the initial deployment to test.wikipedia.org. Pushing to test.wikipedia.org turns out to have been hugely useful, and in future we will take it as a lesson learned that any large deployment must successfully deploy to test.wikipedia.org at least 24 hours prior to general deployment.
When we finally deployed last night, our Apaches started complaining pretty much immediately. We rolled back to the previous version, worked on debugging and thought we had a suitable fix. We attempted deployment again but found the same issue very quickly. What we discovered was that our cache miss rate went from roughly 22% with the old version of the software (1.16) to about 45% with 1.17. The higher miss rate increased the load on our Apaches to the point where they couldn’t keep up, and they started behaving unpredictably. This can cause cascading failures (for example, caching bad data served by overloaded Apaches), and can result in the strange layout problems and other issues that many people witnessed today.
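To put those two miss rates in perspective, here is a rough back-of-the-envelope sketch. It assumes that every cache miss is forwarded to an Apache, that overall traffic is unchanged, and that a miss costs the same under both versions; none of these were measured here, and the function names and the request count are made up purely for illustration.

```python
def backend_load_factor(old_miss_rate: float, new_miss_rate: float) -> float:
    """Relative change in requests reaching the backend at constant traffic.

    Assumes every cache miss becomes exactly one backend request.
    """
    if not 0 < old_miss_rate <= 1 or not 0 <= new_miss_rate <= 1:
        raise ValueError("miss rates must be fractions between 0 and 1")
    return new_miss_rate / old_miss_rate


def backend_requests(total_requests: int, miss_rate: float) -> float:
    """Requests that fall through the cache for a given miss rate."""
    return total_requests * miss_rate


# Miss rates quoted above: roughly 22% on 1.16, about 45% on 1.17.
factor = backend_load_factor(0.22, 0.45)
print(f"Backend load multiplier: {factor:.2f}x")

# Hypothetical traffic figure, chosen only to make the numbers concrete.
sample_requests = 100_000
print(f"1.16: {backend_requests(sample_requests, 0.22):.0f} backend requests")
print(f"1.17: {backend_requests(sample_requests, 0.45):.0f} backend requests")
```

Under these assumptions, the jump in miss rate roughly doubles the number of requests the Apaches have to serve, which is consistent with them falling behind almost immediately.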
By the way, whenever we do a large deployment, a number of WMF staff and community developers meet online to work through any issues that might arise. We schedule deployments late at night in the US to take advantage of lulls in request traffic, so everybody is working late. By the second failure, these people had been awake for many hours and we started to be concerned about their ability to work efficiently on little sleep, so I vetoed further attempts at deployment today.
We are currently combing the logs for further clues about how to mitigate the risk of a similar outcome when we next attempt to deploy 1.17, which most likely won’t happen until later this week (at the earliest). We are also closely investigating the check-ins related to parsing and caching, and evaluating our profiling data. We plan to regroup tomorrow, decide how confident we are in the fixes we have been able to implement over the preceding 24 hours, and decide when to target the next deployment.