Wikimedia Research Newsletter, January 2015

Vol: 5 • Issue: 1 • January 2015

Bot writes about theatre plays; “Renaissance editors” create better content

With contributions by: Kim Osman, Federico Leva, Tilman Bayer and Maximilian Klein.

Bot detects theatre play scripts on the web and writes Wikipedia articles about them

US playwright Alice Gerstenberg. A bot-generated article about her 1920 comedy Fourteen was accepted with minimal changes.

A paper[1] presented at the International Conference on Pattern Recognition last year (earlier poster) describes an automated method to improve Wikipedia’s coverage of theatre plays (“only about 10% of the plays in our dataset have corresponding Wikipedia pages”). It searches the web for playscripts and related documents and extracts key information from them, including the play’s main characters and relevant sentences from online synopses of the play. It also looks for mentions in Google Books and the Google News archive, in an attempt to ensure that the play satisfies Wikipedia’s notability criteria. It then compiles this information into an automatically generated Wikipedia article. Two of the 15 articles submitted as a result of this method were accepted by Wikipedia editors. For the first, Chitra by Rabindranath Tagore, the initial bot-created submission underwent significant changes by other editors (“the final page reflects some of the improvements we can incorporate in our bot”). The second one, Fourteen by Alice Gerstenberg, “was moved into Wikipedia mainspace with minimal changes. All the references, quotes and paragraphs were retained”.

“Renaissance Editors” create better Wikipedia content

A study of the German Wikipedia[2] examines how editors spread their contributions across the 8 “main categories” and finds a relationship between editor diversity and article quality. The authors start by defining an editor’s “interest profile”: the proportion of bytes the editor contributed to each category. They then propose an entropy measure that scores a profile higher the more evenly it is distributed across more categories, that is, the more polymath-like the editor’s style.
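To make the idea of an entropy-based diversity score concrete, here is a minimal sketch. It assumes plain Shannon entropy over the interest profile; the paper's exact formula is not given in this summary, and the category labels and byte counts below are hypothetical placeholders.

```python
import math

# Hypothetical labels: this summary does not name the 8 main categories.
CATEGORIES = [f"category_{i}" for i in range(1, 9)]


def interest_profile(bytes_by_category):
    """Turn raw byte counts per category into proportions that sum to 1."""
    total = sum(bytes_by_category.values())
    if total <= 0:
        raise ValueError("editor has no contributed bytes")
    return {cat: n / total for cat, n in bytes_by_category.items()}


def diversity(profile):
    """Shannon entropy (in bits) of a profile; zero shares contribute nothing.

    Assumption: the paper's measure may be normalised or defined differently.
    """
    return sum(-p * math.log2(p) for p in profile.values() if p > 0)


# An editor who only ever edits one category (toy numbers).
specialist = interest_profile(
    {cat: (1000 if cat == "category_1" else 0) for cat in CATEGORIES}
)
# An editor who contributes equally to all 8 categories (toy numbers).
polymath = interest_profile({cat: 1000 for cat in CATEGORIES})

print(diversity(specialist))  # lowest possible score
print(diversity(polymath))    # highest possible score with 8 categories
```

Under this assumption the score ranges from 0 bits (all bytes in one category) to log2(8) = 3 bits (bytes spread evenly over all 8 categories), so a "Renaissance editor" sits near the top of the scale.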

Leonardo da Vinci is a famous example of a “Renaissance man” or “polymath”.

The study shows a correlation between the average diversity of contributors and the quality level of the articles they have contributed to. Article quality is determined by whether the article is a “Good Article”, a “Featured Article”, or neither. Total productivity, measured by bytes contributed, also appears to be linked to diversity, although this link narrowly misses statistical significance. Finally, a logistic regression shows that diversity, more than productivity, is a significant predictor of article quality.

Despite too many simplifications (e.g. a single language, naive article quality ratings, overly broad categories), the methods used by the researchers are well-defined, clear, and convincing within a limited scope, and they put a finger on the notion that our most lauded editors tend to run all over Wikipedia.

Briefly

In-depth examination of the history of three featured articles on the Swedish Wikipedia, and their main editors

This paper[3] looks at collaboration on the Swedish Wikipedia via a qualitative analysis of three Featured Articles. Editors pull information into the articles from a variety of sources, including other-language Wikipedias, and curate it. The study found that the articles grew along similar trajectories and that both content-oriented and process-oriented editors contributed to them, in what the author calls a process of ‘intercreation’.

“Contropedia” tool identifies controversial issues within articles

This paper[4] presents a new method for identifying and examining controversial issues within Wikipedia articles. It outlines the development of an algorithm that identifies the most contested topics by analysing the edits surrounding wikilinks. The resulting Contropedia tool (already presented at WikiSym 2014[5]) provides an excellent visual presentation of hot-button issues in a given article. The authors note that the tool could be useful to researchers interested in studying how controversial issues in an article evolve over time, and could give Wikipedians insight into potential sites of controversy.
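The general idea of scoring wikilinks by the edits around them can be illustrated with a toy sketch. This is not the Contropedia algorithm, whose details are not given in this summary. It is a simplified stand-in that assumes each revision is available as plain wikitext and counts, per wikilink, how many revisions changed a sentence containing that link. The function names and sample revisions are hypothetical.

```python
import re
from collections import Counter

# Captures the target of a wikilink such as [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return {s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()}


def contested_links(revisions):
    """Rank wikilinks by how many revisions touched a sentence containing them."""
    counts = Counter()
    for old, new in zip(revisions, revisions[1:]):
        # Sentences present in exactly one of the two revisions were edited.
        changed = sentences(old) ^ sentences(new)
        touched = set()
        for sentence in changed:
            touched.update(t.strip() for t in WIKILINK.findall(sentence))
        counts.update(touched)  # count each link at most once per revision
    return counts.most_common()


# Toy revision history (made up for illustration only).
revisions = [
    "The [[Moon]] orbits Earth. The [[treaty]] was fair.",
    "The [[Moon]] orbits Earth. The [[treaty]] was unfair.",
    "The [[Moon]] orbits Earth. The [[treaty]] was disputed.",
]
print(contested_links(revisions))
```

In this toy history only the sentence around the "treaty" link keeps changing, so that link ranks first while the stable "Moon" link never appears. A real analysis would also need to handle reverts, moved text, and edits by the same user, which this sketch ignores.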

“Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach”

Several chapters of this volume on natural-language processing, published in 2013, are about wikis[6][7][8][9], confirming their value for NLP research. Some of the results are still useful.

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

  • “The dynamic nature of conflict in Wikipedia”[10] From the abstract: “With a small number of simple ingredients, our model mimics several interesting features of real human behaviour, namely in the context of edit wars. We show that the level of conflict is determined by a tolerance parameter, which measures the editors’ capability to accept different opinions and to change their own opinion.”
  • “Comprehensive Wikipedia Monitoring for Global and Realtime Natural Disaster Detection”[11] (slides)
  • “Digital doorway: Gaining library users through Wikipedia”[12] (about Template:Library resources box)
  • “Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History”[13] From the abstract: “Hedera exploits Map-Reduce paradigm to achieve rapid extraction, it is able to handle one entire Wikipedia articles’ revision history within a day in a medium-scale cluster, and supports flexible data structures for various kinds of semantic web study.”
  • “Learning to Identify Historical Figures for Timeline Creation from Wikipedia Articles”[14]
  • “WiiCluster: A Platform for Wikipedia Infobox Generation”[15]
  • “Proceed With Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies”[16]
  • “Wikipedia: helping to promote the art and science of civil engineering”[17]

References

  1. Banerjee, Siddhartha; Cornelia Caragea, Prasenjit Mitra (2014). “Playscript Classification and Automatic Wikipedia Play Articles Generation”. 2014 22nd International Conference on Pattern Recognition (ICPR). pp. 3630–3635. DOI:10.1109/ICPR.2014.624.  Closed access, preprint, dataset
  2. “Does a ‘Renaissance Man’ Create Good Wikipedia Articles?”. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2014). doi:10.5220/0005155804250430. Retrieved on 28 January 2015.
  3. Mattus, Maria. “The Anyone-Can-Edit Syndrome – Intercreation Stories of Three Featured Articles on Wikipedia”. Nordicom Review (35) 2014: pp. 189–203. Retrieved on 28 January 2015.
  4. Borra, Erik et al. “Societal Controversies in Wikipedia Articles”. Proceedings of CHI 15, April 18–23, 2015, Seoul, Republic of Korea. ACM. doi:10.1145/2702123.2702436. Retrieved on 28 January 2015.
  5. Erik Borra, Esther Weltevrede, Paolo Ciuccarelli, Andreas Kaltenbrunner, David Laniado, Giovanni Magni, Michele Mauri, Richard Rogers, Tommaso Venturini: Contropedia – the analysis and visualization of controversies in Wikipedia articles PDF
  6. Wasala, Asanka; Schäler, Reinhard; Buckley, Jim; Weerasinghe, Ruvan (21 Feb 2013). Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 69–99. ISBN 978-3-642-35085-6. http://link.springer.com/chapter/10.1007/978-3-642-35085-6_3. Retrieved 26 January 2015. 
  7. Alegria, Iñaki; Cabezon, Unai; Betoño, Unai Fernandez de; Labaka, Gorka (21 Feb 2013). Reciprocal Enrichment Between Basque Wikipedia and Machine Translation. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 101–118. ISBN 978-3-642-35085-6. http://link.springer.com/chapter/10.1007/978-3-642-35085-6_4. Retrieved 26 January 2015. 
  8. Ferschke, Oliver; Daxenberger, Johannes; Gurevych, Iryna (21 Feb 2013). A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 121–160. ISBN 978-3-642-35085-6. http://link.springer.com/chapter/10.1007/978-3-642-35085-6_5. Retrieved 26 January 2015. 
  9. Oltramari, Alessandro; Vetere, Guido; Chiari, Isabella; Jezek, Elisabetta (2013). Senso Comune: A Collaborative Knowledge Resource for Italian. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 45–67. ISBN 978-3-642-35085-6. http://art.torvergata.it/handle/2108/98513. Retrieved 26 January 2015. 
  10. Gandica, Y.; F. Sampaio dos Aidos, J. Carvalho (2014-08-19). “The dynamic nature of conflict in Wikipedia”. arXiv:1408.4362 [physics].
  11. Thomas Steiner: Comprehensive Wikipedia Monitoring for Global and Realtime Natural Disaster Detection. ISWC 2014 Developers Workshop PDF
  12. A Spencer, B Krige, S Nair: Digital doorway: Gaining library users through Wikipedia PDF
  13. Tuan Tran and Tu Ngoc Nguyen: Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History PDF
  14. Sandro Bauer, Stephen Clark, Thore Graepel: Learning to Identify Historical Figures for Timeline Creation from Wikipedia Articles. PDF
  15. Zhang, Kezun; Yanghua Xiao, Hanghang Tong, Haixun Wang, Wei Wang (2014). “WiiCluster: A Platform for Wikipedia Infobox Generation”. CIKM ’14. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. New York, NY, USA: ACM. pp. 2033–2035. DOI:10.1145/2661829.2661840. ISBN 978-1-4503-2598-1. http://doi.acm.org/10.1145/2661829.2661840.  Closed access
  16. Wilson, Jodi L. (2014). “Proceed With Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies”. JETLaw: Vanderbilt Journal of Entertainment & Technology Law 16 (4): 857. 
  17. Armstrong, Richard (2014-08-01). “Wikipedia: helping to promote the art and science of civil engineering”. Proceedings of the ICE – Civil Engineering 167 (3): 101–101. doi:10.1680/cien.2014.167.3.101. ISSN 0965-089X.  Closed access

This newsletter is brought to you by the Wikimedia Research Committee and The Signpost

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

1 Comment

Frankly, the idea that Banerjee, Caragea and Mitra’s bot is “writing Wikipedia articles” is a huge exaggeration. Their aptly named Theatremania bot appears to have created about 10 articles, almost all of which were rejected, some on multiple occasions. Looking at their most successful example, Fourteen, even though this was accepted by some editor, it still contained large amounts of plagiarized text from other websites, and no more value than would have been derived from a web search for the terms “Fourteen” and “Alice Gerstenberg”. I took the time just now to turn this article into a barely passable…