Migrating Wikimedia Labs to a new Data Center


As part of ongoing efforts to reduce our reliance on our Tampa, Florida data center, we have just moved Wikimedia Labs to EQIAD, the new data center in Ashburn, Virginia. This migration was a multi-month project and involved hard work on the part of dozens of technical volunteers. In addition to reducing our reliance on the Tampa data center, this move should provide quite a few benefits to the users and admins of Wikimedia Labs and Tool Labs.

Migration objectives

We had several objectives for the move:

  1. Upgrade our virtualization infrastructure to use OpenStack Havana;
  2. Minimize project downtime during the move;
  3. Stop relying on nova-network and start using Neutron;
  4. Convert the Labs data storage system from GlusterFS to NFS;
  5. Identify abandoned and disused Labs resources.

Upgrade and Minimize Downtime

Wikimedia Labs uses OpenStack to manage the virtualization back-end. The Tampa Labs install was running a slightly old version of OpenStack, ‘Folsom’. Folsom is more than a year old now, but OpenStack does not provide an in-place upgrade path that doesn’t require considerable downtime, so we’ve been living with Folsom to avoid disrupting existing Labs services.
Similarly, a raw migration of Labs from one set of servers to another would have required extensive downtime, as simply copying all of the data would be the work of days.
The solution to both (1) and (2) was provided by OpenStack’s multi-region support. We built an up-to-date OpenStack install (version ‘Havana’) in the Ashburn data center. To ease the move, Ryan Lane wrote an OpenStack tool that allowed users to authenticate in both data centers simultaneously, and updated the Labs web interface so that both data centers were visible and manageable at the same time.
At this point (roughly a month ago), we had two different clouds running: one full and one empty. Because of a shared LDAP back-end, the new cloud already knew about all of our projects and users.

Two clouds, before migration
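
For the technically curious, here is a rough sketch of what the two-cloud arrangement looks like from the API side: because both regions share one identity back-end, a single set of credentials can enumerate a project’s instances in Tampa and Ashburn alike. This is only an illustration against the python-novaclient library of that era; the region names and the helper function are placeholders, not our actual web interface code.

```python
# Illustrative sketch only -- not the actual Labs web interface code.
from novaclient.v1_1 import client as nova_client

REGIONS = ["pmtpa", "eqiad"]  # placeholder names for Tampa and Ashburn


def list_all_instances(username, password, project, auth_url):
    """List a project's instances in every region, using one login."""
    for region in REGIONS:
        nova = nova_client.Client(username, password, project, auth_url,
                                  region_name=region)
        for server in nova.servers.list():
            print("%s  %-24s %s" % (region, server.name, server.status))
```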

Then we called on volunteers and project admins for help. In some cases, volunteers built fresh new Labs instances in Ashburn. In other cases, instances were shut down in Tampa and duplicated using a simple copy script run by the Wikimedia Operations team. In either case, project functions were supported in both data centers at once so that services could be switched over quickly and at the convenience of project admins.

Two clouds, during migration
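
To give a flavour of what the “simple copy script” involves, here is a hedged sketch of duplicating a stopped instance by pushing its disk files from a Tampa compute host to an Ashburn one. It assumes KVM instances stored under /var/lib/nova/instances/<uuid>/ and is run on the source host; the hostname and UUID are made up, and the real Operations script may well do more.

```python
# Hedged sketch -- the actual copy script used by Operations may differ.
import subprocess


def push_instance_files(uuid, dest_host):
    """Run on the source compute host: rsync a stopped instance's disk and
    libvirt files to the same path on a destination compute host."""
    src = "/var/lib/nova/instances/%s/" % uuid
    dst = "%s:/var/lib/nova/instances/%s/" % (dest_host, uuid)
    # -a preserves ownership and permissions, -S keeps disk images sparse,
    # --partial lets an interrupted transfer resume instead of restarting.
    subprocess.check_call(["rsync", "-aS", "--partial", src, dst])


if __name__ == "__main__":
    push_instance_files("0f6e1234-made-up-uuid", "virt1001.example.wmnet")
```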

As of today, over 50 projects have been copied to or rebuilt in Ashburn. For those projects with uptime requirements, the outages were generally limited to a few minutes.

Switch to OpenStack Neutron

We currently rely on the ‘nova-network’ service to manage network access between Labs instances. Nova-network is working fine, but OpenStack has introduced a new network service, Neutron, which is intended to replace nova-network. We hoped to adopt Neutron in the Ashburn cloud (largely in order to avoid being stuck using unsupported software), but quickly ran into difficulties. Our use case (flat DHCP with floating IP addresses) is not currently supported in Neutron, and OpenStack designers seem to be wavering in their decision to deprecate nova-network.
After several days of experimentation, expedience won out and we opted to reproduce the same network setup in Ashburn that we were using in Tampa. We may or may not attempt an in-place switch to Neutron in the future, depending on whether or not nova-network continues to receive upstream support.
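
For readers unfamiliar with the model: “flat DHCP with floating IP addresses” means every instance automatically gets a private address, and a public address can be attached to it on demand. A minimal sketch of attaching one, using the python-novaclient calls of that era (the pool and server names are placeholders, and `nova` is a client object like the one in the earlier sketch):

```python
# Illustrative use of nova-network floating IPs; names are placeholders.
def publish_instance(nova, server_name, pool="public"):
    """Attach a public (floating) IP to an instance that already has a
    private DHCP address; nova-network NATs the public address to it."""
    server = nova.servers.find(name=server_name)
    floating = nova.floating_ips.create(pool=pool)  # allocate from the pool
    server.add_floating_ip(floating)                # associate with the VM
    return floating.ip
```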

Switch to NFS storage

Most Labs projects have a shared project-wide volume for storing files and transferring data between instances. In the original Labs setup, these shared volumes used GlusterFS. GlusterFS is easy to administer and designed for use cases similar to ours, but we’ve been plagued with reliability issues: in recent months, the lion’s share of Labs failures and downtime was the result of Gluster problems.
When setting up Tool Labs last year and facing our many issues with GlusterFS, Marc-André Pelletier opted to set up a new NFS system to manage shared volumes for the Tool Labs project. This work has paid off with much-improved stability, so we’ve adopted a similar system for all projects in Ashburn.
Again, we largely relied on volunteers and project admins to transfer files between the two systems. Most users were able to copy their data over as needed, scping or rsyncing between Tampa and Ashburn instances. As a hedge against accidental data loss, the old Gluster volumes were also copied over into backup directories in Ashburn using a simple script. The total volume of data copied was around 30 Terabytes; given the many-week migration period, network bandwidth between locations turned out not to be a problem.
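
For a sense of scale, a bit of back-of-the-envelope arithmetic (the four-week window is my own round number, not a measured figure): spreading 30 terabytes over about a month needs only on the order of 100 megabits per second of sustained transfer, which helps explain why bandwidth was never the bottleneck.

```python
# Back-of-the-envelope estimate; the four-week window is an assumption.
total_bytes = 30 * 10**12            # roughly 30 terabytes of project data
window_seconds = 4 * 7 * 24 * 3600   # about four weeks of wall-clock time
average_mbps = total_bytes * 8.0 / window_seconds / 10**6
print("average sustained rate: %.0f Mbit/s" % average_mbps)  # ~99 Mbit/s
```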

Identify and reclaim wasted space

Many Labs projects and instances are set up for temporary experiments, and have a short useful life. The majority of them are cleaned up and deleted after use, but Labs still has a tendency to leak resources as the odd instance is left running without purpose.
We’ve never had a very good system for tracking which projects are or aren’t in current use, so the migration was a good opportunity to clean house. For every project that was actively migrated by staff or volunteers, another project or two simply sat in Tampa, unmentioned and untouched. Some of these projects may yet be useful (or might have users but no administrators), so we need to be very careful about prematurely deleting them.
Projects that were not actively migrated (or noticed, or mentioned) during the migration period have been ‘mothballed’. That means that their storage and VMs were copied to Ashburn, but are left in a shutdown state. These instances will be preserved for several months, pending requests for their revival. Once it’s clear that they’re fully abandoned (in perhaps six months), they will be deleted and the space reused for future projects.
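
Mechanically, mothballing is little more than making sure every instance in a project is powered off once its storage has been copied. A minimal sketch, again with placeholder credentials and the old novaclient API rather than our actual scripts:

```python
# Sketch only -- not the actual mothballing tooling.
from novaclient.v1_1 import client as nova_client


def mothball_project(username, password, project, auth_url, region="eqiad"):
    """Power off every running instance in a project, leaving its disks
    intact so it can be revived later (or deleted after a grace period)."""
    nova = nova_client.Client(username, password, project, auth_url,
                              region_name=region)
    for server in nova.servers.list():
        if server.status == "ACTIVE":
            server.stop()
```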

Conclusions

In large part, this migration involved a return to older, more tested technology. I’m still hopeful that in the future Labs will be able to make use of more fundamentally cloud-designed technologies like distributed file shares, Neutron, and (in a perfect world) live instance migration. In the meantime, though, the simple approach of setting up parallel clouds and copying things across has gone quite well.
This migration relied quite heavily on volunteer assistance, and I’ve been quite charmed by how gracious the vast majority of volunteers were about this inconvenience. In many cases, project admins regarded the migration as a positive opportunity to build newer, cleaner projects in Ashburn, and many have expressed high hopes for stability in the new data center. With a bit of luck we’ll prove this optimism justified.
Andrew Bogott, DevOps Engineer

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.


Comments


Out of curiosity, does this mean that the labs are only hosted in one location? Wouldn’t it make sense to have them served by both data centers with live mirroring, so that if one data centre goes down the other takes over the load and the tools remain available?

Yes, Labs is only hosted in one location, as are essentially all WMF services at present. The Foundation is currently in the process of setting up a second data center, at which point the Labs cluster will be expanded to support instances in the new center as well.
Even then, Labs is unlikely to passively mirror VMs. It will of course be an option for individual projects to set up redundant services across multiple locations.

Thanks for the update. We from the Telugu Wikipedia community are using Wiki Trends (http://tools.wmflabs.org/wikitrends/telugu-uptrends-this-week.html ) linked live from the Telugu Wikipedia sidebar. We were affected by this migration, as the link was down for more than a couple of weeks. Even the error message only mentioned contacting the project admin and did not give the context or potential reasons for the failure. This points to an opportunity for better communication around any future migrations or changes.


Maybe we could have a cluster in Germany?

> Maybe we could have a cluster in Germany?
For a variety of legal reasons it is much simpler for the Foundation to maintain its hosting infrastructure in a single country. Managing an international tangle of IP and privacy laws is best left to companies with ten or 100 times our staffing 🙂