3,000 medical images uploaded to Wikimedia Commons. In this blog post, we will explain how the first larger project supported by the Helpdesk of the Content Partnerships Hub was technically carried out – in collaboration with Netha Hussain, who requested the support.
Early in 2023, the Content Partnerships Hub initiative, through its Helpdesk function, supported the Wikimedia community by uploading a couple of thousand medical illustrations shared by Les Laboratoires Servier. We have previously written about the material and its potential for the Wikimedia platforms. In this post, we delve into the technical side of how the project was done.
The first step was confirming that the images were compatible with the Wikimedia platforms copyright-wise. This is something that you should always research before you even start thinking about uploading something. If you need support on how to understand free licenses, or how to request them, we are happy to support you.
Fortunately, Les Laboratoires Servier has marked the files very clearly with a Creative Commons Attribution 3.0 Unported License, leaving no doubt that they are indeed free for us to use.
The images on the Smart Servier website are organized in a number of categories, such as Glands, Nucleic acids and Infectiology. This logical structure was easy to replicate on Wikimedia Commons. What was less convenient was the limited information available about each individual file and what it depicts. The files have short names such as muscle fiber or embryo, but there’s no additional text information. If you’re an expert, you can of course look at a file and know in what context it is appropriate to use, for example which section of the Wikipedia article on muscles it can illustrate. For non-experts, the lack of descriptions makes that much harder.
The Smart Servier website does not have an API through which the images can be accessed. The best way of downloading the files was to simply crawl and scrape the HTML of all the gallery pages.
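The sketch below is not the script used in the project; it is a minimal illustration of the scraping step, written with Python's standard library only. It assumes that a gallery page exposes its images as ordinary img tags. The sample markup and the example.org URLs are hypothetical, and the real pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the src attribute of every <img> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the URL of the gallery page.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return absolute URLs for all images found in the given HTML."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical gallery markup, for illustration only.
sample_html = """
<div class="gallery">
  <img src="/images/muscle-fiber.png" alt="muscle fiber">
  <img src="https://example.org/images/embryo.png" alt="embryo">
  <img alt="placeholder without a source">
</div>
"""

urls = extract_image_urls(sample_html, "https://example.org/gallery/muscles")
print(urls)
```

In a real crawl, the HTML of each gallery page would be fetched first and passed to extract_image_urls, and the resulting URLs downloaded one by one, ideally with a pause between requests so as not to burden the website.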
Once a local copy of the image catalog was saved, the files were ready to be uploaded to Wikimedia Commons. The first step was choosing a suitable tool. When you want to work with a large collection of files, you have several options. Pattypan, which has been around for years, is a very popular and useful upload tool, but recently more and more people have been using OpenRefine (OR). This open-source data cleaning tool, already popular among prolific Wikidata editors, has supported Wikimedia Commons – both uploading files and editing their wikitext and structured data – for about a year. As you can see in the category Uploaded with OpenRefine, it has already been used to upload over 85,000 files.
We have been using OpenRefine for working with Wikimedia Commons ever since the new functionalities were implemented, and chose it for this project as well. Two factors weighed in its favor. Firstly, OR gives you access to both Commons and Wikidata – for example, when you have a word like “kidney” in the filename, you can automatically look it up on Wikidata and add a link to the item in the file description or as structured data. Secondly, OR makes it possible not only to upload files but also to edit existing ones, regardless of whether you uploaded them yourself. This is practical for operations like adding categories to a large batch of files. While there are several batch-editing tools available, like Cat-a-lot or VisualFileChange, it’s incredibly convenient to be able to do everything in one tool.
When you upload a large batch of files, you want to make them easy to find and understand. Apart from informative file names and descriptions, you can also add Structured Data on Commons – links to the Wikidata items of the objects depicted in the images. Structured metadata has several advantages over plain text descriptions. For one, by plugging into Wikidata it uses its multilingual potential. A plain text description in English doesn’t help users who want to search Wikimedia Commons in other languages. On the other hand, if the corresponding Wikidata item has labels in several languages, the file can be found by searching in any of those languages. Moreover, the structured data is machine-readable, making it more reusable, for example for external applications.
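To make "machine-readable" a little more concrete, here is a rough sketch of what a single depicts statement looks like in the JSON format used by Wikibase, the software behind both Wikidata and Structured Data on Commons. OpenRefine builds and sends this kind of data for you, so you never have to write it by hand. The helper function is our own illustration, and the item ID used in the example is the Wikidata Sandbox item, standing in for a real subject.

```python
import json

DEPICTS = "P180"  # the Wikidata property "depicts"


def depicts_statement(item_id):
    """Build a Wikibase-style statement saying that a file depicts item_id."""
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {item_id!r}")
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": DEPICTS,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": item_id},
            },
        },
    }


# Q4115189 is the Wikidata Sandbox item, used here only as a placeholder.
statement = depicts_statement("Q4115189")
print(json.dumps(statement, indent=2))
```

Because the statement points to an item ID rather than to an English word, any label that the item has, in any language, can lead a searcher to the file.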
For this reason, we chose to add as many structured depicts statements to the files as possible. We used the reconciliation feature of OpenRefine to try to identify the correct Wikidata items based on the file names. In many cases this was not possible to do automatically – human input was necessary to ensure that the matches were correct. As the Helpdesk team does not have medical terminology expertise, we got a lot of help from Netha Hussain, who requested the support, and who did a magnificent job looking up the terms on Wikidata. Thank you!
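The idea of matching file names against Wikidata can be sketched in a few lines of Python. This is not how OpenRefine's reconciliation works internally; it is a simplified stand-in that turns a file name into a search term and builds a request URL for the wbsearchentities module of the Wikidata API. The function names and the file name cleaning rules are our own assumptions for the sake of illustration.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_term_from_filename(filename):
    """Turn a file name such as 'Muscle_fiber-2.png' into 'muscle fiber'."""
    stem = PurePosixPath(filename).stem
    # Treat underscores and hyphens as spaces.
    term = re.sub(r"[_\-]+", " ", stem)
    # Drop a trailing counter such as " 2" (an assumed naming pattern).
    term = re.sub(r"\s+\d+$", "", term)
    return term.strip().lower()


def wikidata_search_url(term, language="en", limit=5):
    """Build a wbsearchentities URL that lists candidate items for a term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


term = search_term_from_filename("Muscle_fiber-2.png")
print(term)
print(wikidata_search_url(term))
```

A search like this only returns candidates. A short term such as "embryo" can match several items, which is exactly why each suggestion still had to be checked by someone who knows the subject.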
This was one of the first Commons projects we did using OpenRefine, and we got a very positive impression. We have since continued to use OR to work on large batches of media files, both for uploading and editing. Its built-in support for structured data is an important feature for us. However, we must admit we did have a head start, as we had been active users of OR for all things Wikidata, and as such we were already familiar with its interface and workflow.
Complete newcomers will have a learning curve. If you have a spare hour, make sure to watch Sandra Fauconnier’s tutorial on uploading files with OpenRefine.