As some of you may have noticed, our engineering team discovered yesterday that 16 of our Gerrit repositories were very badly broken. Their branches and tags all seemed to have vanished, along with their configuration (which is stored in a special branch in the repository itself). All of the repositories except one have been restored to their state as of about midnight UTC on Thursday, September 6. What follows is an in-depth analysis of what happened and how I fixed it, along with some commentary about what I learned along the way.
First, a bit of background. Over the past week or two, the number of people experiencing timeouts when cloning repositories had increased dramatically. After some initial investigation, I decided that we should do some garbage collection on the repositories. Native Git does this automatically, but the JGit that powers Gerrit does no such cleanup; this is a known flaw. After reading the documentation, asking upstream, and performing a small-scale test on Labs, we added a cron job to run git gc --quiet on each repo once a day. It ran at 02:00 UTC on Thursday.
Fast forward to when Roan asked me what was going on with the operations/mediawiki-config repository: all the branches were missing in Gerrit, including refs/meta/config (the special branch I mentioned that stores configuration, including the access control list).
We immediately disabled the cron. I was pretty sure that it hadn’t caused the problem, but I certainly didn’t want to exacerbate any problems we had. The next order of business was figuring out exactly how damaged the repositories were and why their refs were all missing. Sure enough, refs/heads/* was empty, as was refs/meta/*. I poked around in objects/*, where everything seemed to be intact. So we had objects that didn’t seem to be used, but were sticking around. Where did they belong? git-fsck had the answer for us:
So, we’ve got some dangling commits, but no dangling trees or, worse, blobs? Fantastic! If we had dangling trees, it would mean our commits were lost, so we would have to rebuild them based on the trees (possible, but time-consuming). If we had dangling blobs, that would be nearly disastrous: blobs don’t track filenames, so you’d have to rebuild the trees and then rebuild the commits (all by hand, which takes absolutely forever; better to just restore from someone else’s clone and lose the code review metadata).
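To make that triage concrete, here is a minimal sketch of how the check might be scripted. It assumes git fsck reports unreachable tips as lines of the form "dangling <type> <id>"; the function names and the severity labels are hypothetical, made up for illustration, and this is not the procedure I actually ran by hand.

```python
import subprocess


def parse_dangling(fsck_output):
    """Group object IDs from 'dangling <type> <id>' lines by object type."""
    found = {}
    for line in fsck_output.splitlines():
        parts = line.split()
        # Only three-word lines starting with "dangling" are of interest;
        # progress lines and notices are ignored.
        if len(parts) == 3 and parts[0] == "dangling":
            found.setdefault(parts[1], []).append(parts[2])
    return found


def triage(dangling):
    """Return the worst kind of dangling object present (hypothetical labels)."""
    if dangling.get("blob"):
        return "blobs"    # trees and commits would need rebuilding by hand
    if dangling.get("tree"):
        return "trees"    # commits would need rebuilding from the trees
    if dangling.get("commit"):
        return "commits"  # only the refs are missing; commits are intact
    return "clean"


def fsck_dangling(git_dir):
    """Run git fsck against git_dir (an assumed path to a bare repository)."""
    result = subprocess.run(
        ["git", f"--git-dir={git_dir}", "fsck"],
        capture_output=True, text=True, check=False,
    )
    return parse_dangling(result.stdout)
```

Calling triage(fsck_dangling(path)) on a repository in the state described above would be expected to land in the "commits" case, which is the cheap one to recover from.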
Now it was just a matter of sorting out which refs these commits belonged to. Playing around with git show makes it pretty easy to figure out where the commits belong, so this turned into a process of editing the appropriate refs/* files to add the commit hash (what seems so obvious now took several hours of hand-wringing, I assure you). Once I had sorted out the dangling commits, the repos ended up working again.
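I edited the refs/* files by hand, but the same step could probably be scripted. The sketch below swaps in git update-ref, which is a different technique from the manual edits described above; the mapping of ref names to commit hashes is something you would still have to work out yourself with git show. The function names and the example hash are hypothetical.

```python
import re
import subprocess

# A full SHA-1 object name: exactly 40 lowercase hex characters.
SHA1_RE = re.compile(r"[0-9a-f]{40}")


def build_update_ref_commands(git_dir, ref_to_commit):
    """Turn a {ref name: commit hash} mapping into git update-ref commands."""
    commands = []
    for ref, sha in sorted(ref_to_commit.items()):
        if not ref.startswith("refs/"):
            raise ValueError(f"not a full ref name: {ref!r}")
        if not SHA1_RE.fullmatch(sha):
            raise ValueError(f"not a full commit hash: {sha!r}")
        commands.append(
            ["git", f"--git-dir={git_dir}", "update-ref", ref, sha]
        )
    return commands


def restore_refs(git_dir, ref_to_commit):
    """Run the commands; stops at the first failure because check=True."""
    for command in build_update_ref_commands(git_dir, ref_to_commit):
        subprocess.run(command, check=True)
```

Building the command list separately from running it means the plan can be printed and reviewed before anything touches the repository, which seems prudent when the repository is already damaged.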
So, what did we learn? I’m a huge fan of lists, so I’ll bullet-point them:
- Git stores its data like this for a reason — If Git did not store its data as commit-agnostic objects, it would be impossible to recover from a situation like this. With Subversion (which we were using until earlier this year), you would just have to restore from a backup and hope it didn’t happen again. Git makes it really hard to actually destroy your data forever. Which brings me to my next point,
- If you think you’ve made a mistake, stop and think — While Git does make it very difficult to get rid of data forever, it is possible. For example, if we had let the cron run again, git gc would’ve merrily deleted all of the unreferenced objects since they were unused. If you get into a bad situation with Git where all seems lost–it’s probably not lost! Just take a little while to think, get your bearings, and you can most likely get back where you were to begin with. And finally,
- Replication is not a substitute for backups — We had been replicating the Git repositories from the main Gerrit server (manganese) to the old Subversion server (formey) to serve as a slave/backup. However, since the problem was just “refs don’t exist anymore”, Gerrit merrily replicated the reference removals on to the slave. We need proper snapshot backups of the Git data, which I plan to take care of next week.
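On that last point, here is a rough sketch of what a snapshot backup could look like: a timestamped archive of the whole repository directory, so that a later loss of refs cannot overwrite an earlier good copy. This is only an illustration and not the backup setup we will actually deploy; the function name and paths are hypothetical, and it assumes nothing is writing to the repository while the archive is being made.

```python
import tarfile
import time
from pathlib import Path


def snapshot_repo(repo_dir, backup_root, stamp=None):
    """Write <backup_root>/<repo name>-<stamp>.tar.gz and return its path."""
    repo = Path(repo_dir)
    # Default to a UTC timestamp so each snapshot gets its own file name.
    if stamp is None:
        stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest_dir = Path(backup_root)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{repo.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the repo name.
        tar.add(str(repo), arcname=repo.name)
    return archive
```

Unlike replication, each run leaves the previous archives untouched, so a bad state that appears today does not replace yesterday's copy.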
So, what caused this in the first place? I am still not sure what the underlying cause was. I continue to suspect the cron, although it seems an incredibly unlikely culprit. It’s also possible we hit some sort of Gerrit bug, but I don’t really suspect that, and nothing in the log files seems to indicate it either. It would also be a good idea (and I’ll do it over the weekend) to run git fsck on all of our repositories to make sure none of them are in a bad state as well. Since the operations/debs/mysqlatfacebook repository is still incredibly broken (it has no references and no dangling objects!) and some of the other repos have a few outstanding issues, there’s still some chance to investigate further and hopefully update this post with an actual cause.
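For the weekend sweep, something along these lines might do. It is a sketch under assumptions: that every repository is a bare repository somewhere under one root directory, that a bare repository can be recognised by its HEAD file plus objects and refs directories, and that git fsck exits non-zero when it finds errors. The function names are hypothetical.

```python
import os
import subprocess


def looks_like_bare_repo(path):
    """Heuristic: a bare repo has a HEAD file plus objects/ and refs/ dirs."""
    return (
        os.path.isfile(os.path.join(path, "HEAD"))
        and os.path.isdir(os.path.join(path, "objects"))
        and os.path.isdir(os.path.join(path, "refs"))
    )


def find_bare_repos(root):
    """Walk root and return the sorted paths of everything that looks bare."""
    repos = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if looks_like_bare_repo(dirpath):
            repos.append(dirpath)
            dirnames[:] = []  # do not descend into the repository itself
    return sorted(repos)


def has_any_ref(git_dir):
    """True if git for-each-ref lists at least one ref (loose or packed)."""
    result = subprocess.run(
        ["git", f"--git-dir={git_dir}", "for-each-ref", "--count=1"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def fsck_passes(git_dir):
    """True if git fsck exits with status 0 for this repository."""
    result = subprocess.run(
        ["git", f"--git-dir={git_dir}", "fsck"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0
```

A repository for which has_any_ref returns False shows the same symptom we hit here, even if fsck_passes returns True, so it seems worth reporting both results for each path from find_bare_repos.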