Synchronising Wikidata and Wikipedia: an Outreachy project

Wikidata is Wikimedia’s structured data repository. It is connected to Wikipedias and the other Wikimedia projects, and holds structured data about a huge number of concepts. This includes every topic covered by a Wikipedia article, as well as scientific papers, astronomical objects, and many other topics.

Wikidata has been growing rapidly since it started in 2012, but is still a long way behind Wikipedia articles in terms of the amount of data it stores. In particular, the English Wikipedia has a lot of semi-structured content in the form of infoboxes, lists, and introduction sections. Unlike on other language Wikipedias, there hasn’t yet been a significant mass migration of this content to Wikidata. While this information is kept only in the English Wikipedia, rather than in Wikidata’s multilingual structured data, it can’t easily be used by other Wikipedias with more Wikidata integration, or by external websites that use Wikidata information.

While this is slowly being tackled by volunteers importing data manually or with the help of scripts, there is no deadline in the Wikimedia projects, so this process can take a long time. Automating the imports into Wikidata also makes a great student project. Enter Outreachy, an internship program with a focus on diversity, which the Wikimedia Foundation regularly participates in, and which comes with a scholarship of US$6,000. This presented a great opportunity to get a motivated student focused on this work for a fixed period of time, while they also learnt about Wikimedia and coding, and at the same time to help increase the diversity of the Wikimedia movement and the tech community in general.

Project Description

The project focused on using Python, and in particular the ‘pywikibot’ package, to automatically read Wikipedia articles and write structured data to Wikidata. The work ranged from relatively simple things like external website identifiers to much more complex things like lists of historical buildings. Mike Peel was the project mentor; he has written various Pywikibot scripts that run on a Raspberry Pi as User:Pi bot.

The initial period of Outreachy is a set of contributor tasks, which both introduce students to the community and the work to be done during the internship, and also serve as the evaluation process for the students – unlike traditional internships, there are no CVs to look at within the Outreachy process! Students started understanding how Wikidata works by looking through Wikipedia articles and matching up how pieces of information are stored in Wikipedia articles and Wikidata, going on to write initial scripts to access that information in Python, and to write information into Wikidata.

Over 20 students submitted a starter task during the initial contribution period, which lasted a month. After that, Mike had to make the really difficult decision of selecting a student. Fortunately, in this case two students were able to do the internship, between May and August 2021: Ammar and Nirali. Ammar is from Nigeria and had some prior experience editing Wikipedia and Wikidata. Nirali is from India. Both were new to Pywikibot and bot editing.

Steps Involved

The import of data begins with the extraction of information. This involves selecting a topic area of Wikipedia articles, and then selecting a focus area within each article from the chosen topic, for example infoboxes or external links. Topic areas can be defined by categories, by lists, or by the templates that are used in the Wikipedia articles.

The actual extraction and import of information is then done by Python scripts developed using the pywikibot module. The pywikibot module is an extensive tool that lets developers interact with pages in different Wikimedia projects, like Wikipedia, Wikidata and Commons, in a simplified way. When interacting with wiki pages, pywikibot offers a variety of functions to get the raw wikitext of the page, the title, the canonical URL, the permanent link, and the text of a specified old revision. It also provides functions for retrieving pages from different Wikimedia projects, and for accessing and changing the contents of Wikidata items.
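As a rough illustration of the page functions just described, here is a minimal sketch. The helper names `summarise_page` and `load_page` are made up for this post; the calls on the page object (`title()`, `full_url()`, `permalink()`, `text`, `latest_revision_id`, `getOldVersion()`) and `ItemPage.fromPage()` are pywikibot's own. Fetching a live page assumes a working pywikibot installation and configuration, so the import is kept inside `load_page`.

```python
def summarise_page(page, old_revision_id=None):
    """Collect the basic details of a pywikibot Page-like object in a dict."""
    info = {
        "title": page.title(),
        "url": page.full_url(),
        "permalink": page.permalink(),
        "wikitext": page.text,
        "latest_revision": page.latest_revision_id,
    }
    if old_revision_id is not None:
        # Text of the page as it was at the given old revision.
        info["old_wikitext"] = page.getOldVersion(old_revision_id)
    return info


def load_page(title, lang="en", family="wikipedia"):
    """Fetch a live page and its linked Wikidata item (hypothetical helper)."""
    import pywikibot  # imported here so summarise_page works without pywikibot

    site = pywikibot.Site(lang, family)
    page = pywikibot.Page(site, title)
    # fromPage raises an error if the article has no linked Wikidata item.
    item = pywikibot.ItemPage.fromPage(page)
    return page, item
```

With a configured bot, `page, item = load_page("Douglas Adams")` followed by `summarise_page(page)` would return the details for that article.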

While the scripts vary depending on the topic area chosen and the information to be imported, they tend to follow a common pattern:

  1. Retrieving the contents of the Wikipedia article’s page (when the focus is only on a single article) or iterating through the articles in a given page and retrieving each article’s page (when a list or category of articles is targeted)
  2. Extracting the information to be imported to Wikidata from the Wikipedia article. This is usually achieved with two main approaches:
    1. Using regular expressions if extracting from the article body text or working on the entire raw text.
    2. Expanding templates and iterating through their key-value pairs, if looking for certain information which is known to be in a particular template.
  3. Validating the data. In some cases validation might be necessary, for instance to prevent an error when saving to Wikidata or to avoid exporting a malformed or invalid value.
  4. Importing the extracted information to Wikidata by assigning the appropriate property IDs to it.
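The two extraction approaches and the validation step can be sketched with the standard library alone. Everything named here is illustrative: the `www.example.org` URL pattern, the helper names, and the digits-only ID format are assumptions, not the rules used by the project's bots. The template parser is deliberately simple and does not handle nested templates; in a real script, pywikibot's own `page.templatesWithParams()` is a more robust starting point.

```python
import re

# Approach 1: a regular expression over the raw wikitext (hypothetical site).
ID_IN_URL = re.compile(r"https?://www\.example\.org/player/(\d+)")

# Approach 2: "| key = value" lines inside one template.
PARAM_LINE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE
)


def extract_id_from_text(wikitext):
    """Return the first identifier found in a matching URL, or None."""
    match = ID_IN_URL.search(wikitext)
    return match.group(1) if match else None


def extract_template_params(wikitext, template_name):
    """Return {key: value} for the first use of the named template."""
    start = re.search(
        r"\{\{\s*" + re.escape(template_name) + r"\b", wikitext, re.IGNORECASE
    )
    if not start:
        return {}
    end = wikitext.find("\n}}", start.end())
    body = wikitext[start.end(): end if end != -1 else len(wikitext)]
    # Parameters left empty in the article are skipped.
    return {key: value for key, value in PARAM_LINE.findall(body) if value}


def is_valid_id(value, pattern=r"\d{1,9}"):
    """Validation step: accept only values that fully match the expected format."""
    return bool(value) and re.fullmatch(pattern, value) is not None
```

A value would only be passed on to the import step if `is_valid_id` accepts it, so a stray `12a45` in an article is skipped instead of being written to Wikidata.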

Additional steps might include task-specific steps and the addition of import sources, i.e., the revision of the Wikipedia article that was current at the time of import and the language wiki from which the information is being imported; adding these is considered good practice. Most of the steps above can easily be handled by the pywikibot module. Often, Wikidata is also queried using SPARQL in order to avoid adding a value that should ideally be present on only one item (for example, Soccerway IDs for players).
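The duplicate check and the import sources might look something like the sketch below. The helper names are made up for this post. P143 ("imported from Wikimedia project"), P4656 ("Wikimedia import URL") and Q328 (English Wikipedia) are the usual Wikidata identifiers for this purpose; the property for the identifier itself is passed in as a parameter, and the P2369 used in the usage note is believed to be the Soccerway player ID but should be checked before use. The two functions that talk to Wikidata assume a configured pywikibot installation.

```python
def build_duplicate_check_query(property_id, value):
    """Return a SPARQL query listing items that already hold this value."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item wdt:{property_id} "{escaped}" . '
        "} LIMIT 5"
    )


def items_already_using(property_id, value):
    """Run the query against the Wikidata Query Service (needs pywikibot)."""
    from pywikibot.data import sparql  # imported here to keep the builder standalone

    query = build_duplicate_check_query(property_id, value)
    rows = sparql.SparqlQuery().select(query)
    return [row["item"] for row in rows or []]


def build_import_reference(repo, permalink, source_wiki_qid="Q328"):
    """Build the two source claims recording where a value was imported from."""
    import pywikibot

    imported_from = pywikibot.Claim(repo, "P143")
    imported_from.setTarget(pywikibot.ItemPage(repo, source_wiki_qid))
    import_url = pywikibot.Claim(repo, "P4656")
    import_url.setTarget(permalink)  # expected to be a full https:// URL
    return [imported_from, import_url]
```

A script would typically skip the import when `items_already_using("P2369", "12345")` returns anything, and otherwise attach the result of `build_import_reference(...)` to the new claim with `claim.addSources(...)`.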

The scripts are vetted through bot requests. A bot request submits a script for community discussion, where other community members and admins review it and check for errors in how it works. The request describes the purpose of the script and the tasks it performs. Once the usefulness of the task is confirmed, the bot operator is asked to make test edits on a small number of articles to demonstrate the accuracy of the imports. Once the test edits are approved, the bot is given permission to make live edits, i.e., to import data from all articles in the topic area. This can be done in batches, which makes it easier to catch errors in the scripts, or in a single run. It can be a one-off run, or something that runs regularly (daily, weekly or monthly).
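One simple way to support both a small test run and batched live runs is a helper like the one below. This is only a sketch; `in_batches`, `batch_size` and `limit` are made-up names, and the right numbers depend on what the bot request asks for.

```python
from itertools import islice


def in_batches(pages, batch_size, limit=None):
    """Yield lists of up to batch_size pages, stopping after limit pages if set."""
    iterator = islice(pages, limit)  # limit=None means no cap
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For a test run, `in_batches(pages, 10, limit=50)` would stop after 50 articles. For a live run the limit is dropped, and the operator can review the bot's edits between batches.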

As well as working on the project, Outreachy asks the selected interns to write bi-weekly blog posts on various themes provided by the Outreachy organizers. These themes range from the interns’ initial experience in the Outreachy program to what they plan to do once the internship comes to an end. Though the themes are provided by the organizers, the interns are free to write on any theme or topic they want. Nirali’s blog can be found on WordPress and Ammar’s blog can be found on Hashnode. Outreachy also organizes bi-weekly chats with other interns, mentors and the organizers, and sets assignments for both mentors and interns, in order to maintain an interactive environment alongside continuous evaluation with periodic feedback on the work done.

The Hurdles

The import of data from Wikipedia to Wikidata can be quite complex, and can have more conflicts or issues than just syntactic errors preventing the script from running.

  • Property mapping: Proper knowledge of the topic area is essential to avoid importing bad data through the incorrect assignment of properties, especially for properties that have similar meanings but are used in different contexts. For example, both P276 (location) and P131 (located in the administrative territorial entity) can be used to mark the location of a monument or landmark.
  • Varying structures of articles: Articles can use different structures for the same details, even if they belong to the same topic area. This makes the development of scripts challenging, since they need to cover as many different structures as possible. For example, the headquarters of an organization can be mentioned as “headquarters” or “garrison” in the Wikipedia article.
  • Missing properties: Certain properties that might be important have still not been added to the list of Wikidata properties. An example is a “leader” property for organizations, especially non-profit ones. This leads to some data being left out while other information from the article is imported into Wikidata.
  • Incorrect values: Some values in Wikipedia can be plain wrong (and exporting them to Wikidata worsens the problem). This happens, for instance, with identifiers like the Netflix ID, where in some cases articles were linked to the wrong identifier on the Netflix website due to similarities in the name (or even the exact same title, for movies produced in different years).
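The first three hurdles can be partly handled by keeping an explicit, human-reviewed table of infobox parameter names and the property each one maps to. The sketch below is illustrative only: the table and the `map_parameters` helper are made up for this post. P159 is Wikidata's "headquarters location" property, and anything without a mapping is reported instead of guessed.

```python
# Different parameter names that mean the same thing share one property.
PARAMETER_ALIASES = {
    "headquarters": "P159",
    "garrison": "P159",
}


def map_parameters(params):
    """Split template parameters into mapped values and unmapped keys."""
    mapped, unmapped = {}, []
    for key, value in params.items():
        prop = PARAMETER_ALIASES.get(key.strip().lower())
        if prop is None:
            # No suitable property known (e.g. "leader"): leave it out.
            unmapped.append(key)
        elif prop not in mapped:
            # Keep the first value seen when two aliases are both filled in.
            mapped[prop] = value
    return mapped, unmapped
```

For example, `map_parameters({"Garrison": "London", "leader": "Jane Doe"})` maps the garrison to P159 and reports `leader` as unmapped, so it can be reviewed by hand.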

Of the hurdles mentioned, the most challenging and exciting task was developing scripts that accommodate the various structures in Wikipedia articles. It made us think about the best ways to consider all possible test cases and to implement them in the scripts, while keeping an eye on the time spent writing and running the import scripts.

Apart from these, perhaps the most difficult hurdle was identifying what to import. Given the huge number of topic areas and articles in Wikipedia, selecting one out of the thousands of others was the most time-consuming task, especially when initial searches were in the wrong place. While it is true that only around 20-25% of Wikipedia articles have been synchronized with Wikidata, even this is a huge number of articles to avoid while searching for articles whose information has not yet been imported to Wikidata.

However, despite these few hurdles, the project made us familiar not only with the use of Python scripts but also with the structure of Wikipedia articles and Wikidata items. Often we, as general users, tend to neglect what goes on behind the scenes in managing the articles. Getting involved in the project made us realize the intricacies involved in the management of data and its structure in both Wikipedia and Wikidata.

What next?

Nirali presented at Wikimania 2021 as part of the ‘Integrating Wikidata into the Wikimedia projects’ panel. Both Nirali and Ammar are continuing to operate their bots on Wikidata to import information from the English Wikipedia regularly, now as volunteers rather than interns. AmmarBot now has over 8,000 edits, and NiraliBot has over 3,000 edits.

The Wikimedia Foundation continues to participate in Outreachy. If you’re interested in doing an internship with Outreachy and Wikimedia, have a look at Outreachy.org.