Meet the Analytics Team

Over the past few months, the Wikimedia Foundation has been gearing up a variety of new initiatives, and measuring success has been on our minds. It should come as no surprise that we’ve been building an Analytics Team at the same time. We are excited to finally introduce ourselves and talk about our plans.
The team is currently a pair of awesome engineers, David Schoonover and Andrew Otto; a veteran data analyst, Erik Zachte; and one humble product manager, Diederik van Liere. (We happen to be looking for a JavaScript engineer: if beautiful, data-driven client apps are your thing, or you know someone, drop us a line!)
We’ve got quite a few projects under way (and many more ideas), and we’d like to briefly go over them — expect more posts in the future with deeper details on each.
First up: a revamp of the Wikimedia Report Card. This dashboard gives an overview of key metrics representing the health and success of the movement: pageviews, unique visitors, number of active editors, and the like.
Illustration of the revamped Report Card
The new report card is powered by Limn, a pure JavaScript GUI visualization toolkit we wrote. We wanted non-technical community members to be able to interact with the data directly, visualizing and exploring it themselves, rather than relying on us or analysts to give them a porthole into the deep. As a drop-in component, we hope it will contribute to democratizing data analysis (though we plan to use it extensively across projects ourselves). So play around with the report card data, or fork the project on GitHub!

Kraken: A Data Services Platform

But we have bigger plans. Epic plans. Mythical plans. A generic computational cluster for data analytics, which we affectionately call Kraken: a unified platform to aggregate, store, analyze, and query all incoming data of interest to the community, built so as to keep pace with our movement’s ample motivation and energy.
How many Android users are there in India who visit more than ten times per month? Is there a significant difference in the popularity of mobile OSes between large cities and rural areas of India? Do Portuguese and Brazilian readers favour different content categories? How often are GLAM pictures displayed off-site, outside of Wikipedia (and where)?
As it stands, answering any of these questions is, at best, tedious and hard. Usually, it’s impossible. The sheer scale of the Wikimedia projects’ success is a double-edged sword: it makes even modest data analysis a significant task. This is something we aim to fix with Kraken.
More urgently, however, we don’t presently have infrastructure to do A/B testing, measure the impact of outreach projects, or give editors insight into the readers they reach with their contributions. From this view, the platform is a robust, unified toolkit for exploring these data streams, as well as a means of providing everyone with better information for evaluating the success of features large and small.
This points toward our overarching vision. Long-term, we aim to give the Wikimedia movement a true data services platform: a cluster capable of providing realtime insight into community activity and a new view of humanity’s knowledge to power applications, mash up into websites, and stream to devices.
Dream big!

Privacy: Counting not Tracking

The Kraken is a mythical Nordic monster with many tentacles, much like any analytics system: analytics touches everything — from instrumenting mobile apps to new user conversion analysis to counting parser cache lookups — and it needs a big gaping maw to keep up with all the data coming in. Unfortunately, history teaches us that mythical cephalopods aren’t terribly good at privacy. We aim to change that.
We’ve always had a strong commitment to privacy. Everything we store is covered by the Foundation’s privacy policy, and nothing we’re talking about here changes those promises. Kraken will be used to count stuff, not to track user behaviour. But in order to count, we need to store. We want you all to have a good idea of what we’re collecting and why, so we will be specific and transparent about it. We aim to be able to answer a multitude of questions using different data sources. Counts of visitors, page and image views, search queries, edits, and new user registrations are just a few of the data streams currently planned; each will be annotated with metadata to make it easier to query. To take a few examples: page views will be tagged to indicate which come from bots, and traffic from mobile phones will be tagged as mobile. By counting these different types of events and adding these kinds of meta tags, we will be able to better measure our progress towards the Strategic Plan.
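To make "counting, not tracking" a little more concrete, here is a minimal Python sketch of the tagging idea described above. It is an illustration only, not Kraken's actual implementation: the `PageView` schema, the `BOT_HINTS` and `MOBILE_HINTS` lists, and the `tag` and `count_views` functions are all hypothetical names invented for this example, and real bot and mobile detection would need to be far more careful than substring matching.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical substrings used to guess that a request came from a bot or a phone.
BOT_HINTS = ("bot", "crawler", "spider")
MOBILE_HINTS = ("android", "iphone", "mobile")


@dataclass(frozen=True)
class PageView:
    """One request, reduced to the fields needed for counting (assumed schema)."""
    page: str
    user_agent: str


def tag(view: PageView) -> tuple[str, str]:
    """Return (source, device) meta tags for a single page view."""
    agent = view.user_agent.lower()
    source = "bot" if any(hint in agent for hint in BOT_HINTS) else "human"
    device = "mobile" if any(hint in agent for hint in MOBILE_HINTS) else "desktop"
    return source, device


def count_views(views: list[PageView]) -> Counter:
    """Aggregate views per (page, source, device); only the totals are kept."""
    counts: Counter = Counter()
    for view in views:
        counts[(view.page, *tag(view))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        PageView("Kraken", "Mozilla/5.0 (Linux; Android 4.0)"),
        PageView("Kraken", "Googlebot/2.1"),
        PageView("Kraken", "Mozilla/5.0 (Windows NT 6.1)"),
    ]
    for key, total in sorted(count_views(sample).items()):
        print(key, total)
```

The point of the sketch is the shape of the output: the result is a table of totals keyed by page and tags, with no identifier that ties a count back to an individual reader.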
We’ll be talking a lot more about the technical details of the system we’re building, so check back if you’re interested. If you have feedback on how best to use the data to answer interesting questions while still preserving users’ privacy, reach out to us. This post only scratches the surface; we’ve got lots more to discuss.

Talk to Us!

Sound exciting? Have questions, ideas, or suggestions? Well then! Consider joining the Analytics mailing list or #wikimedia-analytics on Freenode (IRC). And of course you’re also very welcome to send me email directly.
Excited, and have engineering chops? Well then! We’re looking for a stellar engineer to help build a fast, intuitive, and beautiful toolkit for visualizing and understanding all this data. Check out the Javascript/UI Engineer job posting to learn more.
We’re definitely excited about where things are going, and we are looking forward to keeping you all up to speed on all our new developments.
Finally, we are hosting our first Analytics IRC office hours! Join us on July 30th, at 12pm PDT (3pm EDT / 9pm CEST) in #wikimedia-analytics to ask all your analytics- and statistics-related questions about Wikipedia and the other Wikimedia projects.
Best regards,
David Schoonover, Analytics Engineer
Andrew Otto, Analytics Engineer
Erik Zachte, Data Analyst
Diederik van Liere, Product Manager

Archive notice: This is an archived post from, which operated under different editorial and content guidelines than Diff.

/me misses functional names for things
Kraken vs ClickTracking (although a brief look suggests Kraken is a much broader project)
Flow vs LiquidThreads (Although I suppose liquid is far from functional)
But you get my point – Adjective-y names vs unrelated verby names (Echo, Flow) or mythology names (Kraken, Athena)
Otoh I suppose there are benefits to these types of names. They certainly sound cooler (Mono is a disease, I don’t want no disease in my books. Athena is a deity and deities are usually at the top of the social hierarchy)

Really interesting stuff, Diederik!

I’m not sure it is a good idea to name it Kraken. In German the word “Datenkraken” (“Daten” meaning data) is used for companies like Google, which “grab” your data and make profit out of it. To me, it has a distinctively negative connotation in this context.

This is really fantastic news. I am continually hoping for ways to address the kinds of questions you raise above, and am happy to see there is a clear path toward building the needed infrastructure. Looking forward to learning more!

Analytics and Data! My favourite! Thanks for sharing all this info and good luck to the new Wiki Analytics team.