As part of ongoing efforts to reduce our reliance on our Tampa, Florida data center, we have just moved Wikimedia Labs to EQIAD, the new data center in Ashburn, Virginia. This migration was a multi-month project and involved hard work on the part of dozens of technical volunteers. In addition to reducing our reliance on the Tampa data center, this move should provide quite a few benefits to the users and admins of Wikimedia Labs and Tool Labs.
We had several objectives for the move:
- Upgrade our virtualization infrustructure to use OpenStack Havana;
- Minimize project downtime during the move;
- Stop relying on nova-network and start using Neutron;
- Convert the Labs data storage system from GlusterFS to NFS;
- Identify abandoned and disused Labs resources.
Upgrade and Minimize Downtime
Wikimedia Labs uses OpenStack to manage the virtualization back-end. The Tampa Labs install was running a slightly old version of OpenStack, ‘Folsom’. Folsom is more than a year old now, but OpenStack does not provide an in-place upgrade path that doesn’t require considerable downtime, so we’ve been living with Folsom to avoid disrupting existing Labs services.
Similarly, a raw migration of Labs from one set of servers to another would have required extensive downtime, as simply copying all of the data would be the work of days.
The solution to both 1) and 2) was provided by OpenStack’s multi-region support. We built an up-to-date OpenStack install (version ‘havana’) in the Ashburn center and then modified our Labs web interface to access both centers at once. In order to ease the move, Ryan Lane wrote an OpenStack tool that allowed users to simultaneously authenticate in both data centers, and updated the Labs web interface so that both data centers were visible at the same time.
At this point (roughly a month ago), we had two different clouds running: one full and one empty. Because of a shared LDAP back-end, the new cloud already knew about all of our projects and users.
Then we called on volunteers and project admins for help. In some cases, volunteers built fresh new Labs instances in Ashburn. In other cases, instances were shut down in Tampa and duplicated using a simple copy script run by the Wikimedia Operations team. In either case, project functions were supported in both data centers at once so that services could be switched over quickly and at the convenience of project admins.
As of today, over 50 projects have been copied to or rebuilt in Ashburn. For those projects with uptime requirements, the outages were generally limited to a few minutes.
Switch to OpenStack Neutron
We currently rely on the ‘nova-network’ service to manage network access between Labs instances. Nova-network is working fine, but OpenStack has introduced a new network service, Neutron, which is intended to replace nova-network. We hoped to adopt Neutron in the Ashburn cloud (largely in order to avoid being stuck using unsupported software), but quickly ran into difficulties. Our current use case (flat DHCP with floating IP addresses) is not currently supported in Neutron, and OpenStack designers seem to be wavering in their decision to deprecate nova-network.
After several days of experimentation, expedience won out and we opted to reproduce the same network setup in Ashburn that we were using in Tampa. We may or may not attempt an in-place switch to Neutron in the future, depending on whether or not nova-network continues to receive upstream support.
Switch to NFS storage
Most Labs projects have shared project-wide volume for storing files and transferring data between instances. In the original Labs setup, these shared volumes used GlusterFS. GlusterFS is easy to administer and designed for use cases similar to ours, but we’ve been plagued with reliability issues: in recent months, the lion’s share of Labs failures and downtime were the result of Gluster problems.
When setting up Tool Labs last year and facing our many issues with GlusterFS, Marc-Andre Pelletier opted to set up a new NFS system to manage shared volumes for the Tool Labs project. This work has paid off with much-improved stability, so we’ve adopted a similar system for all projects in Ashburn.
Again, we largely relied on volunteers and project admins to transfer files between the two systems. Most users were able to copy their data over as needed, scping or rsyncing between Tampa and Ashburn instances. As a hedge against accidental data loss, the old Gluster volumes were also copied over into backup directories in Ashburn using a simple script. The total volume of data copied was around 30 Terabytes; given the many-week migration period, network bandwidth between locations turned out not to be a problem.
Identify and reclaim wasted space
Many Labs projects and instances are set up for temporary experiments, and have a short useful life. The majority of them are cleaned up and deleted after use, but Labs still has a tendency to leak resources as the odd instance is left running without purpose.
We’ve never had a very good system for tracking which projects are or aren’t in current use, so the migration was a good opportunity to clean house. For every project that was actively migrated by staff or volunteers, another project or two simply sat in Tampa, unmentioned and untouched. Some of these projects may yet be useful (or might have users but no administrators), so we need to be very careful about prematurely deleting them.
Projects that were not actively migrated (or noticed, or mentioned) during the migration period have been ‘mothballed’. That means that their storage and VMS were copied to Ashburn, but are left in a shutdown state. These instances will be preserved for several months, pending requests for their revival. Once it’s clear that they’re fully abandoned (in perhaps six months), they will be deleted and the space reused for future projects.
In large part, this migration involved a return to older, more tested technology. I’m still hopeful that in the future Labs will be able to make use of more fundamentally cloud-designed technologies like distributed file shares, Neutron, and (in a perfect world) live instance migration. In the meantime, though, the simple approach of setting up parallel clouds and copying things across has gone quite well.
This migration relied quite heavily on volunteer assistance, and I’ve been quite charmed by how gracious the vast majority of volunteers were about this inconvenience. In many cases, project admins regarded the migration as a positive opportunity to build newer, cleaner projects in Ashburn, and many have expressed high hopes for stability in the new data center. With a bit of luck we’ll prove this optimism justified.
Andrew Bogott, DevOps Engineer