As covered on this blog this week, we had a few problems with our initial deployment of 1.17 to the Wikimedia cluster of servers. We’ve investigated the problems, and believe we have fixed many of them. Some of the unsolved issues are complicated enough that the only timely and reasonable way to investigate them is to deploy and react, so we’ve come up with a plan that lets us do so safely: deploying to just a few wikis at a time (rather than all at once, as we tried earlier).
We’re scheduling two deployment windows:
- First window – Friday, February 11, between 6:00 and 12:00 UTC (starting 10pm PST Thursday, February 10 in San Francisco). This first wave will go to a limited set of wikis (see below).
- Second window – Wednesday, February 16, between 6:00 and 12:00 UTC – full deployment (tentative)
To recap what is new in 1.17: there are many, many little fixes and improvements (see the draft release notes for an exhaustive list), as well as one larger improvement: Resource Loader. Read more in the previous 1.17 deployment announcement.
Update (2011-02-11, 8:00 UTC) – we’ve deployed to a few of the wikis now (see below for updates on which ones). We uncovered a couple of issues that we were able to fix, and plan to keep going.
Update (2011-02-11, 9:07 UTC) – we added he.wikisource.org to the list at a community member’s request, and so that we’d have a right-to-left language wiki in the mix. Thank you, he.wikisource community! We’ve now deployed 1.17 to meta.wikimedia.org and he.wikisource.org.
Update (2011-02-11, 10:26 UTC) – we deployed to our last six wikis, then rolled back nl.wikipedia.org and eo.wikipedia.org after we saw some issues with ParserFunctions. We’re investigating those, and will probably try again before this window closes.
Final update (2011-02-11, 12:28 UTC) – we found and fixed some localization problems that triggered ParserFunctions bugs on both nl.wikipedia.org and eo.wikipedia.org. However, the traffic from nl.wikipedia.org was enough to cause a very noticeable spike in CPU usage on the web servers, as well as timeout errors in our logs. We have profiling turned on for the wikis we’ve deployed to, and will use the time between now and our next deployment window to find and fix problems.
This first deployment window will be to a limited set of wikis:
- http://simple.wikipedia.org/ (simplewiki) (deployed 7:50 UTC)
- http://simple.wiktionary.org/ (simplewiktionary) (deployed 7:50 UTC)
- http://usability.wikimedia.org/ (usabilitywiki) (deployed 6:00 UTC)
- http://strategy.wikimedia.org/ (strategywiki) (deployed 7:50 UTC)
- http://meta.wikimedia.org/ (metawiki) (deployed 8:50 UTC)
- http://he.wikisource.org/ (hewikisource) (deployed 8:50 UTC)
- http://en.wikiquote.org/ (enwikiquote) (deployed 10:14 UTC)
- http://en.wikinews.org/ (enwikinews) (deployed 10:14 UTC)
- http://en.wikibooks.org/ (enwikibooks) (deployed 10:14 UTC)
- http://beta.wikiversity.org (betawikiversity) (deployed 10:14 UTC)
- http://eo.wikipedia.org/ (eowiki) (postponed)
- http://nl.wikipedia.org (nlwiki) (postponed)
Note that the point of switching over this first round of wikis is to observe the problem or problems without overloading the site and bringing it down. This deployment should be small enough in scope that even if there are moderate performance problems, no one should notice without watching our monitoring tools. We may not roll out to every wiki listed above during the first wave, but we plan to roll out to enough of them to gather the debugging information we need to make the second wave (full deployment) go smoothly.
We will continue to roll this out to the rest of the wikis during this window. Depending on our confidence level, we may deploy to the remaining wikis, or we may decide to deploy to a portion of the remaining wikis. If necessary, we will schedule another window to finish the deployment.
Here’s some more technical detail: one problem with the original Tuesday deploy was that the cache miss rate rose substantially. We believe the cause was a misconfiguration of the $wgCacheEpoch variable, which culled our cache more aggressively than the servers could handle. We have made adjustments, so this shouldn’t be a problem during our next deployment attempt.
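To illustrate why a cache epoch setting can drive up the miss rate, here is a minimal sketch in Python. It is not MediaWiki's code: the function names, the sample timestamps, and the exact comparison are assumptions made for illustration. The only idea taken from $wgCacheEpoch is that cached entries older than the epoch are treated as invalid.

```python
from datetime import datetime, timezone


def is_usable(cached_at: datetime, cache_epoch: datetime) -> bool:
    """A cached entry is usable only if it was stored at or after the epoch."""
    return cached_at >= cache_epoch


def miss_rate(entries: list[datetime], cache_epoch: datetime) -> float:
    """Fraction of cached entries that the epoch forces to be regenerated."""
    if not entries:
        return 0.0
    stale = sum(1 for stored in entries if not is_usable(stored, cache_epoch))
    return stale / len(entries)


# Hypothetical cache: one entry stored per day, February 1 through 10.
entries = [datetime(2011, 2, day, tzinfo=timezone.utc) for day in range(1, 11)]

# Hypothetical epochs: one older than every entry, one newer than most.
old_epoch = datetime(2011, 1, 1, tzinfo=timezone.utc)
new_epoch = datetime(2011, 2, 9, tzinfo=timezone.utc)

print(miss_rate(entries, old_epoch))  # 0.0
print(miss_rate(entries, new_epoch))  # 0.8
```

In this toy model, moving the epoch forward invalidates every older entry at once, so the servers must regenerate all of that content under live traffic.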
The $wgCacheEpoch problem explains some of the problems we had, but not all of them. Since we don’t have a clear explanation for all of the problems, we plan to change the way we deploy this software so that we aren’t rolling it out to every wiki simultaneously. As our software is currently built, this isn’t easy to do in a general way, but it turns out this release is suited to an incremental deployment. (Note: we also plan to develop a more general capacity to roll out incrementally for future releases.)
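As a rough illustration of an incremental deployment, the Python sketch below selects a software version per wiki from a set of upgraded wikis. This is not how the Wikimedia cluster is actually configured. The function names and the "previous" label are hypothetical, and only the wiki names come from the first-wave list in this post.

```python
PREVIOUS = "previous"  # placeholder label for the release 1.17 replaces
NEW = "1.17"

# First-wave wikis that stayed on 1.17, taken from the list in this post.
FIRST_WAVE = frozenset({
    "simplewiki", "simplewiktionary", "usabilitywiki", "strategywiki",
    "metawiki", "hewikisource", "enwikiquote", "enwikinews",
    "enwikibooks", "betawikiversity",
})


def version_for(wiki: str, upgraded: frozenset, full_deploy: bool = False) -> str:
    """Return the version a wiki should run at the current rollout stage."""
    if full_deploy or wiki in upgraded:
        return NEW
    return PREVIOUS


def roll_back(upgraded: frozenset, wiki: str) -> frozenset:
    """Return a new upgraded set with one wiki moved back to the old version."""
    return upgraded - {wiki}


# Add two more wikis, then back one of them out after seeing problems.
upgraded = FIRST_WAVE | {"nlwiki", "eowiki"}
upgraded = roll_back(upgraded, "nlwiki")

print(version_for("simplewiki", upgraded))  # 1.17
print(version_for("nlwiki", upgraded))      # previous
print(version_for("nlwiki", upgraded, full_deploy=True))  # 1.17
```

Keeping the upgraded set as plain data means that widening a wave or backing a single wiki out is a one-line change, which is the property an incremental rollout depends on.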
Thank you for your patience! We hope that this time around we can deploy this in a way that you won’t notice anything other than the improvements.