Last week, the Wikimedia Foundation flipped a historic switch: we transitioned our main technical services to a shiny new data center in Ashburn, Virginia. For the first time since 2004, Wikimedia sites are no longer primarily hosted in Tampa, Florida.
To help understand this grueling journey (and why it’s crucial), look through the eyes of Wikimedia Foundation engineer Peter Youngmeister. Peter joined the Wikimedia Foundation’s Technical Operations team (“Ops”) about two years ago, in March 2011. At the time, “the team” meant “about six engineers supporting the fifth-most visited site on the Web,” said Peter. The Foundation has now increased its Ops team to 14, and has several job openings.
“This also meant that out of the fast/cheap/well triangle, we’d gone with fast and cheap,” Peter recalled. We made quick-and-dirty solutions because problems had to be solved immediately. “With so few Ops engineers, you’re always playing catchup; long-term is hard.” He said that the digital infrastructure when he arrived was “kinda like many many layers of really artfully applied duct tape.”
And the biggest, most pressing flaw: Wikimedia only had one fully functional primary data center, in Tampa, Florida. If something catastrophic happened to Tampa, all the sites would go down until new servers could be brought online and data recovered from backup. So the Ops team chose a new data center location, in Ashburn, Virginia, and started preparing to integrate it into our infrastructure. But preparing the Ashburn site (known internally as EQIAD), which began in 2011, turned out to require much more work than the Operations and Platform engineering teams had foreseen.
We had never set up a data center of this complexity from scratch before. The systems in Tampa were “layers of duct tape that had been built up over years… Our first problem was that, for example, very little was in Puppet,” Peter said. To configure the Wikimedia servers, we use Puppet, a configuration management system, which lets us write code (Puppet “manifests”) that manages all of our servers like a single large application (and more easily track, troubleshoot, and revert changes).
Since the new data center would exactly mirror the old one, leveraging the power of Puppet to keep our configurations in sync would be crucial. But since our infrastructure included dozens of services that weren’t in Puppet yet, we had to examine each of their configurations to “puppetize” them. And in early 2011, Peter noted, “our whole search infrastructure existed outside of Puppet control. Our Puppet manifests for our databases were a file that just had a comment that said ‘domas is a slacker.’”
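To make the idea of keeping two data centers in sync from one description more concrete, here is a minimal Python sketch of declarative configuration management. It is not Puppet code and does not reflect Wikimedia's actual manifests: the `Host` class, the `apply_manifest` function, the file paths and the host names are all hypothetical, chosen only to illustrate the principle that one shared description can bring differently configured machines to the same state.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A toy stand-in for a server: a name plus its config files (hypothetical)."""
    name: str
    files: dict[str, str] = field(default_factory=dict)


def apply_manifest(host: Host, manifest: dict[str, str]) -> list[str]:
    """Bring `host` in line with `manifest` and return the paths that changed."""
    changed = []
    for path, desired in manifest.items():
        # Only touch a file when its current contents differ from the desired
        # contents, so applying the same manifest twice changes nothing.
        if host.files.get(path) != desired:
            host.files[path] = desired
            changed.append(path)
    return changed


# Hypothetical manifest: file path -> desired contents.
manifest = {
    "/etc/search/search.conf": "workers = 8\n",
    "/etc/ntp.conf": "server ntp.example.org\n",
}

# One host with years of hand-edited state, one freshly installed host.
tampa = Host("tampa-host", {"/etc/ntp.conf": "server old.example.org\n"})
ashburn = Host("ashburn-host")

first_tampa = apply_manifest(tampa, manifest)
first_ashburn = apply_manifest(ashburn, manifest)
second_tampa = apply_manifest(tampa, manifest)  # nothing left to change

print(first_tampa, first_ashburn, second_tampa)
```

After the first run both toy hosts hold identical configuration, and the second run reports no changes. A service that lives outside the manifest gets none of this, which is why each one had to be examined and puppetized by hand.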
In short, Wikimedia needed not only to replicate the functionality that had been incrementally added over ten years, but to refactor it into an automatable form so that a third or fourth replication would be far easier. So, in addition to the Ops team’s day-to-day responsibilities for site maintenance and crisis management, the Ops and Platform teams needed to find hundreds or thousands of staff-hours to refactor, automate and add monitoring to all the services they provided. We aren’t done yet with our “mass puppetization” investment, which we’ve been working on for at least two years.
The core application (MediaWiki) is only one of the myriad moving parts that needed attention; over the past two years, we’ve puppetized and strengthened databases, search, fundraising code, logging and analytics tools, caches, the Nagios monitoring software and dozens of other services. Take search as an example: several years ago, the Wikimedia Foundation used one search server to cover nearly all the wikis other than English Wikipedia, a dangerous single point of failure. When Peter arrived at the Foundation, none of the search infrastructure was puppetized. After he put substantial work into search, he noted in November 2012 that we had “two fully independent search setups, one in each data center. Fail-over takes a couple of minutes at most.”
Puppetizing the configuration files and using Gerrit to manage code review and approval also gave us better transparency and helped staff and volunteers collaborate on improvements, maintenance and troubleshooting. Anyone can see how our servers are configured, read the Puppet configuration “manifests,” propose new changes, and view and comment on pending proposals.
In contrast, “when I got here, everything was done on a local Subversion repository or our puppetmaster, and then pushed out from there, which kinda works if you have 6 or fewer people,” Peter said. (The Puppetmaster is the master repository that instructs all the other boxes in the cluster to update their manifests, and thus updates their packages and configurations.) To keep track of configuration changes, people simply used an IRC bot to log summaries of their actions to the server admin log, which made it hard to revert changes or help train new teammates. “But also, when the Ops team is only 6 people, and everyone has been around for years, everyone just knows all the parts,” he explained.
As they created the 700+ hostclasses currently defined in Puppet, Operations engineers moved towards treating our infrastructure as a codebase, and thus from pure systems administration towards a DevOps approach. As of November 2012, “we’re very nearly at a point where we can manage our whole infrastructure without needing to log into hosts, which is the whole goal,” Peter said with a smile. Logging into hosts is a bad thing “because it means that you’re doing things by hand and/or that what you’re doing isn’t going through code review. Moving to Gerrit for our Puppet repos is awesome: It means I can really easily see what my coworkers are doing. I can ask for review when needed. It’s a huge sign of maturation of our department.”
Their years of work have led to a nearly painless data center migration, and that work began paying off immediately with reduced downtime. You’ll read more about that in the second part of this story next week.
Sumana Harihareswara, Engineering Community Manager