New two-part schedule for 1.17 deployment

As covered on this blog earlier this week, we had a few problems with our initial deployment of 1.17 to the Wikimedia cluster of servers.  We’ve investigated the problems, and believe we have fixed many of them.  Some of the unsolved issues are complicated enough that the only timely and reasonable way to investigate them is to deploy and react, so we’ve come up with a plan that lets us do that safely: deploying to just a few wikis at a time (as opposed to all at once, as we tried earlier).
We’re scheduling two deployment windows:

  • First window – Friday, February 11, between 6:00 UTC and 12:00 UTC (starting at 10pm PST on Thursday, February 10 in San Francisco) – deployment to a limited set of wikis (see below)
  • Second window – Wednesday, February 16, between 6:00 UTC and 12:00 UTC – full deployment (tentative)

Repeating what is new about 1.17:  There are many, many little fixes and improvements (see the draft release notes for an exhaustive list), as well as one larger improvement: Resource Loader.  Read more in the previous 1.17 deployment announcement.
Update (2011-02-11, 8:00 UTC) – we’ve deployed to a few of the wikis now (see below for updates on which ones).  We uncovered a couple of issues that we were able to fix, and plan to keep going.
Update (2011-02-11, 9:07 UTC) – we added he.wikisource.org to the list due to community member request, and so we’d have a right-to-left language wiki in the mix.  Thank you he.wikisource community!  We’ve now deployed 1.17 to meta.wikimedia.org and he.wikisource.org.
Update (2011-02-11, 10:26 UTC) – we deployed to our last six wikis, and then backed off of nl.wikipedia.org and eo.wikipedia.org once we saw some issues with ParserFunctions.  We’re investigating those, and will probably try again before this window is complete.
Final update (2011-02-11, 12:28 UTC) – we found and fixed some localization problems that triggered ParserFunctions bugs on both nl.wikipedia.org and eo.wikipedia.org.  However, the traffic from nl.wikipedia.org was enough to cause a very noticeable spike in CPU usage on the web servers, as well as timeout errors in our logs.  We have profiling turned on for the wikis we’ve deployed to, and will use the time between now and our next deployment window to find and fix problems.

First window

This first deployment window will be to a limited set of wikis:

Note that the point of switching over this first round of wikis is to be able to observe the problem or problems without overloading the site and bringing it down.  This deployment should be small enough in scope that even if there are moderate performance problems, no one should notice without watching our monitoring tools.  We may not roll out to every wiki listed above during the first wave, but we plan to roll out to enough of them to gather the debugging information we need to make the second wave (full deployment) go smoothly.

Second window

We will continue to roll this out to the rest of the wikis during this window.  Depending on our confidence level, we may deploy to all of the remaining wikis, or only to a portion of them.  If necessary, we will schedule another window to finish the deployment.

Technical details

Here’s some more technical detail: one problem with the original Tuesday deploy was that the cache miss rate went up quite substantially.  We believe the cause was the configuration of the $wgCacheEpoch variable, which led to more aggressive culling of our cache than the servers could handle.  We have made adjustments, so this shouldn’t be a problem during our next deployment attempt.
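To illustrate why a cache epoch setting can drive the miss rate up, here is a minimal Python sketch of the general idea: any entry stored before the epoch is treated as stale, so moving the epoch later culls more of the cache.  This is not MediaWiki’s code; the CacheEntry class, the is_fresh and miss_rate functions, and the timestamps are all hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Hypothetical cached item; not a MediaWiki data structure."""
    value: str
    cached_at: int  # timestamp at which the entry was stored


def is_fresh(entry: CacheEntry, cache_epoch: int) -> bool:
    """An entry survives only if it was stored at or after the epoch."""
    return entry.cached_at >= cache_epoch


def miss_rate(entries: list[CacheEntry], cache_epoch: int) -> float:
    """Fraction of entries that would be treated as stale (cache misses)."""
    if not entries:
        return 0.0
    stale = sum(1 for entry in entries if not is_fresh(entry, cache_epoch))
    return stale / len(entries)


# Made-up cache: ten pages stored at timestamps 100, 200, ..., 1000.
entries = [CacheEntry(f"page-{i}", cached_at=i * 100) for i in range(1, 11)]

# An early epoch invalidates only the two oldest entries.
conservative = miss_rate(entries, cache_epoch=300)

# A late epoch invalidates everything except the newest entry.
aggressive = miss_rate(entries, cache_epoch=950)
```

In this toy example the later epoch turns nine out of ten lookups into misses, each of which would have to be regenerated by the servers.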
The $wgCacheEpoch problem explains some of the problems we had, but not all of them.  Since we don’t have a clear explanation for all of the problems, we plan to change the way we deploy this software so that we aren’t rolling it out to every wiki simultaneously.  As our software is currently built, this isn’t easy to do in a general way, but it turns out this release is suited to an incremental deployment.  (Note: we also plan to develop a more general capacity to roll out incrementally for future releases.)
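The incremental approach can be pictured as a per-wiki switch.  The Python sketch below is a rough model only, not our actual deployment tooling: the function names and the "previous" version label are assumptions made for illustration, and it simply replays the deploy and back-off steps from the updates above.

```python
from __future__ import annotations

# Hypothetical labels; "previous" stands in for whatever ran before 1.17.
OLD_VERSION = "previous"
NEW_VERSION = "1.17"


def version_for(wiki: str, switched: set[str]) -> str:
    """Return the version a wiki runs, given the set of switched wikis."""
    return NEW_VERSION if wiki in switched else OLD_VERSION


def deploy(wiki: str, switched: set[str]) -> None:
    """Switch a single wiki to the new version."""
    switched.add(wiki)


def back_off(wiki: str, switched: set[str]) -> None:
    """Return a single wiki to the previous version."""
    switched.discard(wiki)


switched: set[str] = set()

# Deploy to a few wikis at a time instead of all at once.
for wiki in ["meta.wikimedia.org", "he.wikisource.org"]:
    deploy(wiki, switched)
for wiki in ["nl.wikipedia.org", "eo.wikipedia.org"]:
    deploy(wiki, switched)

# Back off only the wikis that show problems; the others stay on 1.17.
for wiki in ["nl.wikipedia.org", "eo.wikipedia.org"]:
    back_off(wiki, switched)

for wiki in ["meta.wikimedia.org", "he.wikisource.org", "nl.wikipedia.org"]:
    print(wiki, "->", version_for(wiki, switched))
```

The point of keeping the switch per wiki is that a problem on one wiki can be undone without touching the others.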
Thank you for your patience!  We hope that this time around we can deploy this in a way that you won’t notice anything other than the improvements.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

6 Comments

Seems like a great plan. Good luck :-). One question – will Test Wiki be a part of the first wave?

Nux: “Seems like a great plan. Good luck :-). One question – will Test Wiki be a part of the first wave?”

No, testwiki will be the very last wiki to be switched to 1.17. We’ll switch it only after we’ve switched all other wikis to 1.17 and are reasonably happy with the way they’re running. We have test2.wikipedia.org, though, which we now use to test our setup for selective deployments (i.e. switching some but not all wikis to 1.17). It’s running 1.17 now and will continue to run it at least until testwiki is switched over, at…

Well, all that’s left is wishing you good luck and lots of coffee 🙂

Need to test this version in that case..will see how it works.

Where to report bugs? In all the wikis with 1.17, I (still) don’t have any skin at all (Opera 11.01).

I am impressed by the new Resource Loader module; it will certainly be useful for many developers. I’ll try to write a tutorial for French developers so they can integrate this module and speed up their websites.
Thanks!