How we partnered with volunteers to clean up copy-paste plagiarism on Wikipedia


Photo by Arturo de Frias Marques, CC BY-SA 4.0.

Every year, the Wikimedia Foundation’s Community Tech team invites the most active Wikimedia contributors to participate in the community wishlist survey—proposing, discussing and voting on the features and improvements that they’d most like to see. When the votes are counted, Community Tech is responsible for addressing the top ten requests on the list.
The #9 wish this year was to improve the plagiarism detection bot, which was created by volunteer developer Eran, and has been running on English Wikipedia since January 2015. The bot is a clever solution to a tricky problem—identifying text which is copy-pasted from other websites. EranBot looks at every edit which adds a significant amount of text to a Wikipedia article, and compares it against a search database for potential matches. When the database finds a possible match, EranBot flags the edit for human review.
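To make the flow above concrete, here is a minimal sketch in Python. It is not EranBot's code: the `Edit` record, the 500-character cutoff for "a significant amount of text", and the word-shingle overlap used in place of the search database are all assumptions made for illustration. The post does not describe how matching is actually done.

```python
from dataclasses import dataclass

# Assumption: the post only says "a significant amount of text".
MIN_ADDED_CHARS = 500


@dataclass
class Edit:
    """Hypothetical record of one edit; not the bot's real data model."""
    page_title: str
    editor: str
    added_text: str


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def find_matches(added_text, sources, min_overlap=0.5):
    """Stand-in for the search database lookup.

    `sources` maps a URL to that page's text. A source counts as a possible
    match when at least `min_overlap` of the edit's shingles appear in it.
    """
    added = shingles(added_text)
    if not added:
        return []
    matches = []
    for url, source_text in sources.items():
        overlap = len(added & shingles(source_text)) / len(added)
        if overlap >= min_overlap:
            matches.append(url)
    return matches


def review_queue(edits, sources, min_added_chars=MIN_ADDED_CHARS):
    """Return (edit, matching URLs) pairs that should go to human review."""
    flagged = []
    for edit in edits:
        # Small additions are skipped; only large ones are checked.
        if len(edit.added_text) < min_added_chars:
            continue
        urls = find_matches(edit.added_text, sources)
        if urls:
            flagged.append((edit, urls))
    return flagged
```

The important design point is the last step: the code only builds a queue. As in the post, a possible match is flagged for a human to judge rather than reverted automatically.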
The original interface for EranBot’s reports was difficult to use. The reports were published on a wiki page using a series of complicated templates, and users had to click through to other sites to see the text comparisons. Worst of all, to start using the tool, a user had to add some code to their personal common.js page, an annoying hassle that prevented people from trying it out.
When the Community Tech team started to work on this wishlist item in April 2016, there was only one dedicated volunteer using the tool: Diannaa, a longtime admin and copy editor, who reviewed and resolved hundreds of copy-patrol cases every week. Even so, a backlog of several thousand unreviewed cases was still growing.
Working with Eran and Diannaa, the Community Tech team built CopyPatrol, a new interface that makes reviewing cases easier and faster. On CopyPatrol, a user can compare the Wikipedia article’s text with the suspected source text directly on the page by clicking the Compare button, which opens a side-by-side comparison. There are links for all of the information that the patroller needs to resolve the case—the editor’s name, talk page and contribution history, the suspected edit and the article’s history.
When there’s confirmation that the text was copy-pasted from another source without attribution, the patroller needs to revert the edit and leave a talk page message for the editor, explaining the wiki’s guidelines about plagiarism and copyright violation. Once that’s done, the patroller marks the case as “Page fixed”.
Sometimes, the bot finds a false positive, flagging text that was properly cited in the article, or matching text from a Wikipedia mirror site. In that case, the patroller marks it as “No action needed”.
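The two outcomes described above amount to a small decision rule, sketched here in Python. The two status labels come from the post; the `Resolution` enum, the `resolve` function and its three yes/no inputs are hypothetical names chosen for this illustration, and in practice the judgment is made by a human patroller, not by code.

```python
from enum import Enum


class Resolution(Enum):
    PAGE_FIXED = "Page fixed"
    NO_ACTION_NEEDED = "No action needed"


def resolve(copied_from_source: bool,
            properly_cited: bool,
            source_is_wikipedia_mirror: bool) -> Resolution:
    """Map a patroller's findings to one of the two case statuses."""
    # A real violation: copied text, no proper citation, and the "source"
    # is not just a site that mirrors Wikipedia itself.
    if copied_from_source and not properly_cited and not source_is_wikipedia_mirror:
        return Resolution.PAGE_FIXED
    # Everything else is treated as a false positive.
    return Resolution.NO_ACTION_NEEDED
```

Note that "Page fixed" is only recorded after the patroller has reverted the edit and left the talk page message; the sketch covers the classification, not those follow-up steps.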
The Community Tech team’s goal for this project was to build an interface that would attract and retain more patrollers, so that Diannaa didn’t have to work on this alone. She’s still the most active patroller, but now she’s backed up by a team of six regular patrollers.
As Diannaa says, this work “is something we really need to do in order to be taken seriously as a world-class website and resource. The job is actually two-fold: clearing out copyright violations, and educating people as to our copyright rules. Many people, accustomed to the ways of Facebook and LinkedIn, don’t even realise that we don’t accept copyright content. It’s great that the bot picks up on so many copy vios and we get the opportunity to do this teaching right away, before the user has made hundreds or thousands of copyright violations.”
In the four months since CopyPatrol was launched in July, more than 9,000 articles have been reviewed. Nearly 5,000 of them were found to be copyright violations and were fixed by patrollers. Thanks to all the new patrollers, there isn’t a growing backlog anymore, and new cases are reviewed within twenty-four hours. The Community Tech team is now working on expanding the tool to other languages, so that volunteers can review copy-paste cases on French Wikipedia and others.
CopyPatrol is a great example of contributors, volunteer developers and Foundation staff working together to improve the quality of the Wikimedia projects. The next Community Wishlist Survey opens today. Help us choose more projects to work on in 2017!
Danny Horn, Senior Product Manager, Community Tech
Wikimedia Foundation

Archive notice: This is an archived post from, which operated under different editorial and content guidelines than Diff.


Big thanks to iThenticate for providing the software backend for these comparison reports.

This is a huge success and a proof of the concept that if we work together we can achieve much more. Thank you to all who were involved in bringing it about 🙂 And to Diannaa for taking a lead on the follow up.

I was one of the copy/ paste users no doubt when using wiki for health condition information to help sort out my own health, where Drs can’t! I’m sorry if I violated your rule which I didn’t know about.