Over the past couple of months, the Wikimedia Foundation, Kaggle and ICDM organized a data competition. We asked data scientists around the world to use Wikipedia editor data and develop an algorithm that predicts the number of future edits, and in particular predicts correctly who will stop editing and who will continue to edit.
The response has been great! We had 96 teams compete, comprising in total 193 people who jointly submitted 1029 entries. You can have a look for yourself at the leaderboard.
We are very happy to announce that the brothers Ben and Fridolin Roth (team prognoZit) developed the winning algorithm. It is elegant, fast and accurate. Using Python and Octave they developed a linear regression algorithm. They used 13 features (2 are based on reverts and 11 are based on past editing behavior) to predict future editing activity. Both the source code and the wiki description of their algorithm are available. Congratulations to Ben and Fridolin!
Second place goes to Keith Herring. Submitting only 3 entries, he developed a highly accurate model, using random forests, and utilizing a total of 206 features. His model shows that a randomly selected Wikipedia editor who has been active in the past year has approximately an 85 percent probability of being inactive (no new edits) in the next 5 months. The most informative features captured both the edit timing and volume of an editor. Asked for his reasons to enter the challenge, Keith named his fascination for datasets and that
“I have a lot of respect for what Wikipedia has done for the accessibility of information. Any small contribution I can make to that cause is in my opinion time well spent.”
We also have two Honourable Mentions for participants who only used open source software. The first Honorable Mention is for Dell Zang (team zeditor) who used a machine learning technique called gradient boosting. His model mainly uses recent past editor activity.
The second Honourable Mention is for Roopesh Ranjan and Kalpit Desai (team Aardvarks). Using Python and R, they developed a random forest model as well. Their model used 113 features, mainly based on the number of reverts and past editor activity, see the wikipage describing their model.
All the documentation and source code has been made available, the main entry page is WikiChallenge on Meta.
What the four winning models have in common is that past activity and how often an editor is reverted are the strongest predictors for future editing behavior. This confirms our intuitions, but the fact that the three winning models are quite similar in terms of what data they used is a testament to the importance of these factors.
We want to congratulate all winners, as they have showed us in a quantitative way important factors in predicting editor retention. We also hope that people will continue to investigate the training dataset and keep refining their models so we get an even better understanding of the long-term dynamics of the Wikipedia community.
We are looking forward to use the algorithms of Ben & Fridolin and Keith in a production environment and particularly to see if we can forecast the cumulative number of edits.
Finally, we want to thank the Kaggle people for helping in organizing this competition and our anonymous donor who has generously donated the prizes.
Diederik van Liere
External Consultant, Wikimedia Foundation
Senior Product Manager, Wikimedia Foundation
2011-10-26: Edited to correct description of the winning algorithm
Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.