Find, Prioritize, and Recommend: An article recommendation system to fill knowledge gaps across Wikipedia

A map indicating how much you can learn about the world through Wikipedia if English is the only language you speak. There is little to no content available in the many dark areas in the world, especially in Central and South America, Africa, and Asia. Map by Markus Krötzsch, TU Dresden, public domain/CC0.

The French Wikipedia may have more than 20,000 articles on individual asteroids, but if you are one of 27 million people speaking Hausa as a first language, Wikipedia doesn’t yet have an entry on the universe. The English Wikipedia may have more than 5 million articles on topics as diverse as extreme sports or unusual causes of death, but if English is the only language you speak, there is still little to no content to learn from about vast regions of the world—as the map above suggests.
Each day, thousands of volunteer editors are filling knowledge gaps by creating new Wikipedia articles, translating existing ones, and identifying poorly covered topics in any given language. However, discovering and deciding what to edit can be a daunting task, both for editors who are new to Wikipedia and for more seasoned ones.
Understanding how to improve and accelerate content creation across languages, and how to provide guidance to volunteers, is what motivated us at Wikimedia Research to team up with computer science researchers from Stanford University. The team set out to design and test a system that would find, rank, and recommend missing articles to be created across different languages.
We designed personalized recommendations by taking into account editors’ interests (extracted from their public contribution history), their proficiency across languages, and the projected popularity of an article in the target language, if it were to be created. We ran a controlled test of these recommendations on the French-language Wikipedia, comparing personalized and non-personalized recommendations against a baseline. Our results show that recommendations tripled the rate at which editors create articles, while maintaining the same level of quality as articles created organically on French Wikipedia. The experimental design, algorithm implementation, and results are described in detail in a study recently presented at the 25th World Wide Web Conference (WWW 2016) in Montréal, Canada.[1]
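To make the three signals concrete, here is a minimal Python sketch of how they could be combined into one score per editor and candidate article. It is an illustration only, not the method from the study: the `cosine` and `score` functions, the topic-vector representation, the weights, and the choice to multiply by proficiency are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse topic vectors stored as dicts."""
    dot = sum(w * v.get(topic, 0.0) for topic, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score(editor_interests, article_topics, proficiency, popularity,
          w_interest=0.5, w_popularity=0.5):
    """Hypothetical score for one (editor, missing article) pair.

    proficiency and popularity are assumed to be normalized to [0, 1].
    Proficiency scales the whole score, so an editor who cannot read the
    source language gets a score of zero (an assumption of this sketch).
    """
    interest = cosine(editor_interests, article_topics)
    return proficiency * (w_interest * interest + w_popularity * popularity)


# Toy data: an editor whose contribution history is mostly about astronomy.
editor = {"astronomy": 0.9, "geography": 0.1}
asteroid = {"astronomy": 1.0}
sport = {"sports": 1.0}

asteroid_score = score(editor, asteroid, proficiency=0.8, popularity=0.4)
sport_score = score(editor, sport, proficiency=0.8, popularity=0.4)
print(asteroid_score > sport_score)
```

With equal popularity and proficiency, the article that matches the editor's interests ranks higher, which is the intuition behind personalizing the recommendations.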
Motivated by the results of the experiment, we were joined by software developers and designers to create a first prototype of an article recommendation tool that can recommend articles to be created or translated across any of the languages currently supported in Wikipedia. The tool uses a simplified version of the algorithm, based on the pageview, search, and Wikidata APIs, to identify articles that are trending in a given source language and missing in a target language. It also allows you to search for recommendations based on the specific topics you are interested in.
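The core idea of the simplified approach, keeping articles that are popular in the source language but absent in the target language, can be sketched in a few lines of Python. This is not the tool's code: the `Article` class, the `recommend_missing` function, and the sample numbers are hypothetical, and the in-memory list stands in for data that the real tool would obtain from the pageview, search, and Wikidata APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """A source-language article with toy metadata (all values hypothetical)."""
    title: str
    pageviews: int  # recent views in the source language
    languages: set = field(default_factory=set)  # languages that have the article


def recommend_missing(articles, target_lang, limit=3):
    """Return titles of the most-viewed articles that are missing in target_lang."""
    missing = [a for a in articles if target_lang not in a.languages]
    missing.sort(key=lambda a: a.pageviews, reverse=True)
    return [a.title for a in missing[:limit]]


sample = [
    Article("Universe", 9000, {"en", "fr"}),
    Article("Asteroid", 4000, {"en", "fr", "ha"}),
    Article("Extreme sport", 6500, {"en"}),
    Article("Bogotá", 3000, {"en", "es"}),
]

# Candidates for Hausa ("ha"), most viewed first.
print(recommend_missing(sample, "ha"))
```

Sorting by views is the simplest possible stand-in for "trending"; a fuller version would presumably also filter by topic, as the tool's search feature does.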
Screenshot of the Wikipedia GapFinder tool by Dario Taraborelli, public domain/CC0.
The tool also comes with an API, currently integrated into the Content Translation tool—a product designed by the Wikimedia Language team to create new articles by translating from one language into another. Specifically, the API powers the Suggestions feature of the tool, providing recommendations to volunteers based on articles they previously translated. Tool developers have also started integrating the API in third-party applications, like Dexbot’s tools. Both the article recommendation tool and its API are open source: anyone can access, use, and build on this technology to design or improve new applications.
Over the coming months, we will be monitoring the tool closely to learn more about how it’s being used by editors and how it can be further improved. If you try out the article recommendation tool, you can provide us with feedback on our discussion page. We are particularly interested in seeing how the tool can be used by larger groups participating in edit-a-thons, meetups, or other outreach events, as a handy solution to generate lists of missing articles. If you would like a demonstration of the tool for your local edit-a-thon, let us know!
Leila Zia, Research Scientist
Dario Taraborelli, Director, Head of Research
Wikimedia Foundation

Notes

[1] Ellery Wulczyn, Robert West, Leila Zia, and Jure Leskovec. 2016. Growing Wikipedia Across Languages via Recommendation. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). Geneva, Switzerland, 975–985. DOI:10.1145/2872427.2883077 arXiv:1604.03235

This study was nominated for best paper at WWW ’16. You can read more about it in a Stanford University press release.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

19 Comments

I’d love to help create content about famous people in, say, The Congo, or about neighborhoods in, say, Bogota, or about animals native to, say, the Mato Grosso region of Brazil. Finding reliable published sources about these subjects can be quite difficult. Wikipedia’s emphasis on verifiable content creates a bias toward topics which not only are covered by reliable, neutral, published sources but also which are easily accessible by most Wikipedians. This creates barriers to contributing articles about (for example) politicians who may be notable and even well-known in a developing nation but whose lives and activities are not documented…

Tim, nobody forces you to stick to English sources. We created thousands of articles on Poland in English Wikipedia using mostly sources that were never translated to English yet meet the verifiability criteria. See the two bright spots on the map above? That’s right, England and Poland 🙂

I never said anyone “forces” me to use English-language sources. But let’s be honest with ourselves: Most English-speaking contributors feel comfortable in their own language. Some very committed contributors will be able to use non-English sources, but the vast majority won’t. And we need to find ways to get the vast majority working on “dark area” articles (that sounds so conspiratorial, doesn’t it?). If we don’t, we’ll just worsen Wikipedia’s ongoing problems with high- and moderate-activity contributors as well as its issues with a lack of contributors overall. The proposed tool is a great step forward in helping to identify…

[…] used by editors and how it can be further improved,” said the Wikimedia Foundation, in a blog post. “We are particularly interested in seeing how the tool can be used by larger groups […]

Speaking of the map, what is that bright area to the left of India? Isn’t that Iran? How is it so bright and Iraq so incredibly dark?

[…] of languages on the Wikipedia database, the company's research department has launched an initiative to encourage more of the volunteer editors to translate articles. In order to do this, they have […]

[…] with computer scientists from Stanford University to hone in on the language gaps of the website. Wikipedia noted their work in a blog post on April […]

You know, one thing that the volunteer community is very effective at is identifying issues for improvement. We are buried under a mountain of backlogs of these issues; some still contain issues identified a decade ago, and they are still growing. We really don’t need much help with identifying yet more issues and filling yet more backlogs. We need help to reduce the backlogs that have already been identified. It is easy to see from the Wikidata inter-language links in the left sidebar of every article which languages already cover the topic and, by extension, which don’t.

The map in the article is used in a biased way, because it assumes that everything must be described in English. If Wikipedia focuses first on reliable sources, those sources probably contribute better in their own language, so the problem is not that articles do not exist in English for topics on Africa, but the lack of sources and articles in ANY language for that region, and only in a second step, the lack of resources (and interest) to translate them into English or other languages. There’s also a strategy problem on English Wikipedia: they want sources only in English, and…

Also, the map is biased by the fact that it requires “geolocated” articles. But in many countries (notably large parts of Africa and Central Asia), we still don’t have very precise maps to help geolocate articles precisely. The geolocating templates are also not available on all Wikipedias. Given that the map was created from Wikidata entries that have geolocations (largely borrowed from Wikipedia by bots that scan just a few Wikipedias, not all of them), it’s normal that you cannot see a lot of locations on this map (even if there is actually content on Wikimedia about them). We could…

Also (still about the effect of censorship and massive surveillance), Wikimedia should look at the very visible effect that has occurred on Wikipedia for some topics: when you know you are being monitored, you stop contributing on some topics even if what you do is legal in your country of residence or nationality. This happened after the Snowden revelations about the NSA. And the consequences are exploding everywhere: in addition to the national official censorship, people are censoring themselves. This effect is now measurable and proven to be extremely significant. This explains why China is so “black” on the map…

[…] ↑ Zia, Leila; Taraborelli, Dario (2016-04-27). “Find, Prioritize, and Recommend: An article recommendation system to fill knowledge gaps acro….  […]