As you probably know, we regularly publish backups of the various Wikimedia projects, containing their complete editing history. Over time, these backups grow larger and become increasingly hard to analyze. To help the community, researchers and other interested people, we have developed a number of analytic tools for working with these large datasets. Today, we want to tell you about these new tools, what they do and where you can find them. Please remember that they are all still in development:

Wikihadoop
DiffDB
WikiPride

Wikihadoop

Wikihadoop makes it possible to run MapReduce jobs with Hadoop on the compressed XML dump files. This means that processing our XML files becomes embarrassingly easy to parallelize, so we no longer have to wait days or weeks for a job to finish.
We used Wikihadoop to create the diffs for all edits from the English XML dump that was generated in April of this year.
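
To give a feel for the kind of per-revision work that gets parallelized, here is a minimal sketch in Python of a map step that turns two consecutive revisions of a page into a diff. It uses only the standard library's difflib. The function name diff_revisions and the sample texts are assumptions made for illustration; this is not Wikihadoop's actual interface or diff format.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return the lines added and removed between two revisions of a page.

    Hypothetical helper for illustration; not part of Wikihadoop.
    """
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    # Made-up revision texts, standing in for two revisions from a dump.
    old = "Hadoop is a framework.\nIt runs MapReduce jobs."
    new = "Hadoop is a framework.\nIt runs MapReduce jobs.\n{{stub}}"
    added, removed = diff_revisions(old, new)
    print("added:", added)
    print("removed:", removed)
```

Each pair of revisions can be diffed without looking at any other pair, which is what makes the job easy to spread over many machines.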

DiffDB

DiffIndexer and DiffSearcher are the two components of DiffDB. The DiffIndexer takes the diffs generated by Wikihadoop as raw input and creates a Lucene-based index. The DiffSearcher lets you query that index, so you can answer questions such as:

Who has added template X in the last month?
Who added more than 2000 characters to user talk pages in 2008?
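
As a rough illustration of the second question, the sketch below answers it over a small in-memory list of diff records instead of a Lucene index. The DiffRecord fields and the function editors_over_threshold are hypothetical, and the real DiffDB schema and query syntax may look quite different. It also assumes that "more than 2000 characters" means the total per editor over the year, not a single edit.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

USER_TALK_NAMESPACE = 3  # MediaWiki's namespace number for "User talk"


@dataclass
class DiffRecord:
    # Hypothetical fields; the real DiffDB index schema may differ.
    editor: str
    namespace: int
    timestamp: datetime
    chars_added: int


def editors_over_threshold(records, year, namespace, min_chars):
    """Sum characters added per editor in one namespace and year,
    and keep only the editors whose total exceeds min_chars."""
    totals = defaultdict(int)
    for rec in records:
        if rec.namespace == namespace and rec.timestamp.year == year:
            totals[rec.editor] += rec.chars_added
    return {editor: total for editor, total in totals.items() if total > min_chars}


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        DiffRecord("Alice", 3, datetime(2008, 3, 1), 1500),
        DiffRecord("Alice", 3, datetime(2008, 7, 9), 1000),
        DiffRecord("Bob", 3, datetime(2008, 5, 2), 400),
        DiffRecord("Carol", 0, datetime(2008, 5, 2), 9000),
    ]
    print(editors_over_threshold(sample, 2008, USER_TALK_NAMESPACE, 2000))
```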

WikiPride

Finally, WikiPride allows you to visualize the breakdown of a Wikipedia community by age of account and by the volume of contributed content. You need a Toolserver account to run it, and in return you will be able to generate cool charts.
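
The kind of breakdown described here can be sketched in a few lines of Python: group contributions by the cohort an account belongs to (here, its registration year) and by the month of activity, then sum the volume. This is only an illustration of the idea and not WikiPride's code; the function name cohort_breakdown and the tuple layout of the input are assumptions.

```python
from collections import defaultdict


def cohort_breakdown(edits):
    """Sum bytes contributed per activity month and registration-year cohort.

    `edits` is an iterable of (registration_year, activity_month, bytes_added)
    tuples. This layout is assumed for illustration only.
    """
    table = defaultdict(lambda: defaultdict(int))
    for registration_year, activity_month, bytes_added in edits:
        table[activity_month][registration_year] += bytes_added
    # Convert nested defaultdicts to plain dicts for easier printing and comparison.
    return {month: dict(cohorts) for month, cohorts in table.items()}


if __name__ == "__main__":
    # Made-up sample data: two cohorts active over two months.
    sample = [
        (2006, "2011-01", 100),
        (2006, "2011-01", 50),
        (2010, "2011-01", 30),
        (2010, "2011-02", 70),
    ]
    for month, cohorts in sorted(cohort_breakdown(sample).items()):
        print(month, cohorts)
```

Each row of the resulting table corresponds to one month, and each column to one cohort, which is the shape a stacked chart of community activity needs.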
If you are having trouble getting Wikihadoop to run, please contact me at dvanliere at wikimedia dot org and I will be happy to point you in the right direction! Let the data crunching begin!
Diederik van Liere, Analytics Team

2 Comments

Pedro

13 years ago

#3913

What I see is a substantial lack of documentation. I have an account on Toolserver, but I can’t even think how to start trying WikiPride when the only documentation is “A Wikipedia analytics framework in the works….” and a cryptic configuration example.

Diederik van Liere

#3914

Hi Pedro,
So that is why you can contact us and ask for specifics….
But step 1 would be to install a copy of WikiPride on Toolserver and to make changes to the config file.
Best,
Diederik


Do It Yourself Analytics with Wikipedia
