Our Data Analytics team is responsible for building out our logging and data mining infrastructure, and for making Wikimedia-related statistics useful to other parts of the Foundation and the movement. Up until fairly recently, Erik Zachte has been the main analytics person for Wikimedia (with support from many generalists here), working first as a volunteer building stats.wikimedia.org, then on behalf of the Wikimedia Foundation starting in 2008. The site started off as a large number of detailed page view and editor statistics about all Wikimedia wikis, large and small, and has since been augmented to include various summary formats and visualizations. As the movement has grown, the site has played an increasingly important role in helping guide our investments.
Since 2009, Erik has also published the “Monthly Report Card”, which aggregates several vital statistics from various sources (including other places on stats.wikimedia.org, as well as comScore data and other useful bits), and presents them in a more digestible format. Assembling the report card has been a lot of work for Erik, as much of that system is still a manual process.
Behind the scenes, Erik is processing logs created through our Squid logging infrastructure. When you visit Wikipedia or another Wikimedia site, you are directed to one of many Squid caching servers (or, in the case of our mobile site, a Varnish caching server), which in turn either fulfills your request or proxies it to one of several Apache web servers. Since every request must go through the Squid caches, that’s where we log requests. By default, Squid caches will log requests to a file, but our servers are configured to fire a UDP packet off to our centralized logging infrastructure, using a standard Squid feature written by our very own Tim Starling. For a very long time, the aforementioned “logging infrastructure” was a single machine, but it is now two machines (“locke” and “emery”).
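To make the "one UDP packet per log line" idea concrete, here is a minimal Python sketch. It is not Squid's implementation and not our collector code; the function names and the example log line format are made up for illustration, and the collector simply binds to a free port on the local machine.

```python
import socket


def send_log_line(line: str, host: str, port: int) -> None:
    """Send one log line as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


def open_collector(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a UDP socket that plays the role of a log collector.

    Port 0 asks the operating system for any free port; a real
    deployment would use a fixed, agreed-upon port (assumption).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_log_line(sock: socket.socket, timeout: float = 2.0) -> str:
    """Block until one datagram arrives and return it as text."""
    sock.settimeout(timeout)
    data, _sender = sock.recvfrom(65535)
    return data.decode("utf-8")
```

Because UDP is connectionless, the sender never waits for the collector. That keeps logging cheap for the cache servers, at the cost that a packet can be lost without anyone noticing.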
The log collectors are custom infrastructure (also written by Tim) that route the UDP packets (each containing the text of a single log line) through a series of configurable filters. While it’s possible to write these filters in any programming language that can handle piped data from standard input, we’ve written most of these in C for performance reasons. We keep a complete log of one in every one thousand packets, and log select items at more granular levels. For example, volunteer Domas Mituzas implemented a counter for every article view request, storing only the fact that the page was viewed.
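As a rough illustration of what such filters do, the Python sketch below shows a one-in-N sampler and a page view counter. Several things here are assumptions: our real filters are mostly written in C, the sampler is assumed to use a simple counter rather than random selection, and the log line layout (whitespace-separated fields with the requested path in the second field) is hypothetical.

```python
import sys
from collections import Counter
from typing import Iterable, Iterator


def sample_every_nth(lines: Iterable[str], n: int = 1000) -> Iterator[str]:
    """Yield the n-th, 2n-th, 3n-th, ... line and drop the rest."""
    for index, line in enumerate(lines, start=1):
        if index % n == 0:
            yield line


def count_page_views(lines: Iterable[str], path_field: int = 1) -> Counter:
    """Count requests per path, keeping nothing else from each line.

    path_field is the index of the requested path in a
    whitespace-separated log line (hypothetical layout).
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > path_field:  # skip malformed lines
            counts[fields[path_field]] += 1
    return counts


def main() -> None:
    """Act as a pipe filter: read log lines on stdin, write the sample to stdout."""
    for line in sample_every_nth(sys.stdin, n=1000):
        sys.stdout.write(line)
```

Calling `main()` from a script turns the sampler into a filter that can sit at the end of a pipe, which is the only contract a filter has to satisfy.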
The sampled logs gathered by the log collectors are run through a series of Perl scripts written by Erik Zachte. Erik publishes the results to stats.wikimedia.org using a mix of manual and automated tools, and the output of those scripts is what you see on the site today.
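One consequence of working from a one-in-a-thousand sample is that raw counts have to be scaled up before they mean anything. The sketch below shows the basic arithmetic in Python. It is only an illustration of the idea, not a description of what Erik's Perl scripts do, and the function name is hypothetical.

```python
from typing import Dict, Mapping


def estimate_totals(sampled_counts: Mapping[str, int],
                    sampling_factor: int = 1000) -> Dict[str, int]:
    """Scale counts from a 1-in-sampling_factor sample to estimated totals."""
    return {key: count * sampling_factor
            for key, count in sampled_counts.items()}
```

The estimate is reasonable for busy pages, but a page that appears only a handful of times in the sample carries a large relative error, which is one reason some items are logged at a finer level than the general sample.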
One place Erik gets the data for stats.wikimedia.org is the XML dumps. Full XML dumps of the entire contents of Wikipedia and other Wikimedia sites have been provided since the early days of the project. The generation process for these dumps is now maintained and improved by Ariel Glenn in our Operations department.
In 2010, the Wikimedia Foundation began investigating new logging systems that support standard features such as session tracking. While our existing system worked for basic analytics, we found that it was difficult to add features important to constituencies such as our fundraising team. For that and other reasons, we experimented with Open Web Analytics (OWA), but later discovered that while OWA is a great general-purpose analytics tool, it wasn’t suited to our specific needs at that time. We have not yet chosen a new web analytics platform.
Earlier this summer, several staff members (including Nimish Gautam and Mani Pande from Global Development, as well as engineers Sam Reed and Erik Zachte) started to revamp the Monthly Report Card, using modern tools like jqPlot to display the information. An early prototype version is available.
Openings and the future
We are hiring for two full-time analytics positions right now, plus a contract opportunity. Please apply and tell your friends and networks about them:
- Systems Engineer – Data Analytics
- Software Developer Backend – Data Analytics
- Contractor RFP, Data Analytics
The opportunity here is huge for you and for Wikimedia. Relative to most websites our size, we have far more work ahead to build new tools than to deal with maintenance chores. A visionary engineer has a chance to build a great system, and know that it will be used on a top 10 website.
If you are someone who works with open source technology, but you are frustrated with your current employer’s inability to play nice with upstream providers, this may be your kind of place. We work with upstreams. Throughout the Platform Engineering group, we work with the communities surrounding the open source tools we use. For example, when Tim Starling modified our Squid proxy servers to log via UDP, he submitted the patch upstream to the Squid maintainers, who subsequently accepted it. We also worked with Chris Leonello (of jqPlot fame) to add confidence interval functionality to jqPlot. We generally intend to make sure everyone benefits from our work, not just our own projects.
If you have big ideas for promoting open source analytics, this is one of the largest platforms you could ever want. You’ll get the support you need to succeed and to make the project succeed, because the entire movement wants and needs this. And if you’re happy where you are, but you run an open source web analytics platform that you think we ought to use, please (a) talk to us and (b) suggest to your fellow hackers that they apply for these jobs. You’ll have a great chance to sway us. 🙂
Director of Platform Engineering