A reliable future for Programs & Events Dashboard

Over the last few months, I’ve been working on some major changes to how the Dashboard, Wiki Education’s wiki program management and tracking software, works. Keeping up with the ever-increasing usage of Programs & Events Dashboard — it’s been used by more than 3,800 program leaders to track nearly 9,000 events and more than 120,000 participants worldwide — has been a recurring motif in my work.

The global Dashboard has different usage patterns than Wiki Education’s version that we use for our Scholars & Scientists Program and Wikipedia Student Program, and often runs into different bottlenecks when it comes to importing data and serving traffic. Widespread use of the Dashboard for Wikidata programs — such as the Program for Cooperative Cataloging (PCC) Wikidata Pilot — can mean hundreds of thousands of edits from a relatively small set of editors, often with thousands of automated or semi-automated edits in a single day. Such intense bursts of editing, I learned, could run up against a memory limit for the Dashboard’s Toolforge tool that pulls revisions from the Wikimedia Cloud replica databases. The sheer number and size of highly active programs was also straining the capacity of the single server that was running Programs & Events Dashboard in early 2021. Updates for the largest programs were taking hours or even days, and the time between updates stretched to more than a week.

Even worse, I would not infrequently wake up to an inbox full of emails letting me know that Programs & Events Dashboard was down, with screenshots of Apache’s all-too-familiar “Internal Server Error” page. (I’m very grateful for how kind the Wikimedia community has been when these problems came up, as I know how frustrating it can be when websites are broken or unreliable.) I hadn’t been able to pin down the cause of this problem, and I usually had to turn to the unsatisfying solution of simply restarting the server and hoping it wouldn’t happen again.

To diagnose and help fix the Dashboard’s flailing performance, I started looking for help from a Ruby on Rails performance expert. Luckily, I found just the right person: Nate Berkopec, a well-known figure in the Ruby community who runs the consultancy Speedshop. In March, Nate started with a review of the situation, digging into performance profiling data to compile a long list of performance weak points and possible solutions. Once we enabled performance profiling on Programs & Events Dashboard, however, it quickly became clear that the biggest problem was simply that a single server couldn’t handle everything being thrown at it: running a database with hundreds of millions of records, serving a steady stream of web traffic, and continually pulling in new data for hundreds of active programs. We needed to shift from a single server to a distributed system.

Throughout March and much of April, Nate worked on building infrastructure-as-code scaffolding for the Dashboard, while I focused on the acute performance problems. I undertook a review of more than 100 of the slowest active programs, to see why they were taking an hour or more for each update. I rewrote a number of inefficient database queries, found ways to avoid duplicating calculations unnecessarily during the data import process, and moved several of the slowest tasks (like generating CSV exports for large campaigns) into background jobs that wouldn’t tie up the webserver for other Dashboard visitors. This had a dramatic impact on the update speed for most of the large, active programs, although a few were still running for multiple hours and intermittently failing.

A new major problem also emerged in mid-March: database corruption. We couldn’t find an immediate cause for it, but with advice from the Wikimedia Cloud Services team, I took a major step toward a distributed system by moving the database to its own dedicated virtual server. This change didn’t go very smoothly, with a day of downtime and a bevy of unexpected errors that surfaced once I brought the system back online. But once I’d worked through those problems, I had a sense that we’d turned a corner. A few weeks later I took the distributed system a step further, adding a third server dedicated solely to running data updates.

After coming through the worst of these problems, with a highly stable website and update latency back down to between 30 minutes (for most programs) and a few hours (for the larger, more resource-intense programs), I have a much clearer understanding of why the Dashboard was failing. One failure mode involved the webserver running out of resources to process new requests. Many program organizers, understandably impatient for updated statistics, were using the “manual update” feature to attempt to pull in new data for their programs. Unlike most data updates, these manual updates were not processed as background jobs; instead, they would tie up webserver capacity until the update completed. With enough of these running at once, the webserver couldn’t serve additional traffic. Another failure mode, which I believe was at the root of the database corruption problem and most of the unexplained errors, was the server running out of memory. In particular, the update process for very large programs could run for hours and progressively use more and more memory until the server had none left — resulting in the database, the webserver, or the update process crashing in the middle of operations, with unpredictable consequences.
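
As a rough illustration of the first failure mode, here is a minimal Python sketch of the difference between running an update inside the web request and handing it to a background worker. It is not the Dashboard’s actual Ruby on Rails code: the in-process queue and the names `run_update`, `request_manual_update`, and `start_worker` are all hypothetical stand-ins, and a real deployment would use a proper job system rather than a thread in the web process.

```python
import queue
import threading

# Hypothetical in-process job queue standing in for a real background job system.
update_queue = queue.Queue()
completed = []  # programs whose updates have finished


def run_update(program):
    """Stand-in for the slow work of importing new data for one program."""
    completed.append(program)


def worker():
    """Process queued updates one at a time, outside the request path."""
    while True:
        program = update_queue.get()
        try:
            if program is None:  # sentinel: shut the worker down
                return
            run_update(program)
        finally:
            update_queue.task_done()


def start_worker():
    """Start the background worker thread and return it."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def request_manual_update(program):
    """Web request handler: enqueue the update and return immediately."""
    update_queue.put(program)
    return "queued"
```

In this sketch the request handler only pays the cost of adding an item to the queue, so a burst of manual update requests lengthens the queue instead of occupying the webserver until each update completes.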

By May, Nate’s work was ready to put into action. We migrated Wiki Education Dashboard to a whole new architecture based on the “HashiStack”, a suite of open source tools for running distributed systems on cloud servers. With this new architecture, the Dashboard’s components are broken up into more than a dozen separate services that have their own dedicated resources — making failures in one service unlikely to break the rest of the Dashboard, and making it straightforward to scale up simply by adding additional servers. This infrastructure-as-code framework isn’t quite ready for use with Programs & Events Dashboard yet; we plan to work on making it fully compatible with Wikimedia Cloud in August.

This performance and stability work is just the start of our plans to improve Programs & Events Dashboard, which is the main focus of Wiki Education’s technology plans in the coming year. In the coming months, I’ll be learning more about how people are using the Dashboard globally, and what the most important areas for improvement are. Based on what I learn, I’ll publish a roadmap for the next year and beyond. If you have ideas or feedback, or if you’d like to get involved with developing it, I’d love to hear from you.

(This post was originally published on Wiki Education’s blog.)