Adding DOIs to Chinese scientific articles on Wikidata

The headquarters of CNKI (the building on the right side of the photo) is located on the campus of Tsinghua University. Photo by そらみみ under CC BY-SA 4.0.

In recent years, China has become the world’s largest producer of scientific articles. However, bibliographic data for Chinese scientific articles are still very limited on Wikidata, because the bibliographic databases most commonly used on Wikidata, such as Crossref, do not include articles published in most Chinese academic journals. There have been some efforts to import Chinese articles from China National Knowledge Infrastructure (CNKI), the most comprehensive database of Chinese articles. But CNKI provides no DOI information for most articles (the exception being articles whose DOIs resolve to pages hosted by CNKI itself). The DOI is a key component for building a database of open citations and linked bibliographic data, so it would be very beneficial to collect this information and add it to Wikidata. No existing database contains comprehensive DOIs for Chinese articles, and many DOIs can only be found on journals’ official websites. A WikiCite e-scholarship was granted to develop a tool to collect these data scattered across different journal websites. During development, more than 20,000 DOIs have already been added to Wikidata.

The most challenging part of developing such a tool is that data need to be scraped from websites with varying structures. Fortunately, many Chinese journal websites are built from similar templates, so one scraping script can be reused across different websites with minor modifications. After manually checking dozens of journal websites, three main website types were identified, and a scraper was developed for each of them. Since websites within each type still have minor differences, additional effort went into handling those differences.
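The type-based design described above can be sketched as a simple dispatch: the user supplies a website type, and the tool routes the URL to the corresponding scraper. This is an illustrative sketch, not the actual repository code; the function names and the empty parser bodies are hypothetical placeholders.

```python
# Illustrative sketch (not the repository's actual code): dispatch to a
# type-specific scraper based on the website type supplied by the user.

def scrape_type_1(base_url):
    """Placeholder parser for type-1 journal sites (hypothetical).
    Real code would fetch the volume-list page and walk each issue."""
    return []

def scrape_type_2(base_url):
    """Placeholder parser for type-2 journal sites (hypothetical)."""
    return []

def scrape_type_3(base_url):
    """Placeholder parser for type-3 journal sites (hypothetical)."""
    return []

SCRAPERS = {1: scrape_type_1, 2: scrape_type_2, 3: scrape_type_3}

def scrape_journal(base_url, site_type):
    """Return a list of article records (title, year, issue, DOI)."""
    try:
        scraper = SCRAPERS[site_type]
    except KeyError:
        raise ValueError(f"Unsupported website type: {site_type}")
    return scraper(base_url)
```

Minor per-site differences can then live inside each type's parser, while the dispatch layer stays unchanged when a new type is added.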

The tool is written in Python, and its source code can be found on GitHub. It consists of four main parts:

  1. Scraping: Based on the journal website URL and website type provided by the user, the script scrapes the website and saves the titles, publication years, issues, and DOIs of all articles on it. For example, the URL listing all volumes of Acta Aeronautica et Astronautica Sinica is http://hkxb.buaa.edu.cn/CN/article/showOldVolumn.do, and the website type assigned to it in this project is 1 (other type-1 journal websites include Chinese Journal of Scientific and Technical Periodicals and Chinese Journal of Theoretical and Applied Mechanics, and you can easily notice the similarities in their website structures).
  2. Validation: DOIs displayed on the websites may sometimes be invalid for various reasons. To remove invalid DOIs, the validation script tries to resolve every scraped DOI link and filters out any DOI that returns HTTP status code 404.
  3. Matching: Using the journal QID provided by the user (for example, the QID for Acta Aeronautica et Astronautica Sinica is Q98518725), the matching script collects all Wikidata items for articles published in that journal and matches them against the scraped articles based on title, publication year, and issue number.
  4. Generating QuickStatements commands: Finally, the QuickStatements script can generate a CSV file that can be used to add DOIs to matched articles using QuickStatements.
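The validation step (item 2 above) can be sketched as follows: ask doi.org to resolve each DOI and treat a 404 response as evidence that the DOI is invalid. This is a minimal illustration of the idea, not the repository's actual script; the function names are my own.

```python
# Minimal sketch of DOI validation: a DOI is dropped only when doi.org
# answers 404 for it. Other failures (timeouts, 5xx) are not treated as
# proof of invalidity.
import urllib.request
import urllib.error

def is_valid_doi(doi, timeout=10):
    """Return False only when doi.org reports 404 for this DOI."""
    req = urllib.request.Request("https://doi.org/" + doi, method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=timeout)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        # Other HTTP errors are inconclusive; keep the DOI.
    except urllib.error.URLError:
        # Network problems are not evidence of an invalid DOI.
        pass
    return True

def filter_valid(dois):
    """Keep only DOIs that do not resolve to a 404."""
    return [d for d in dois if is_valid_doi(d)]
```

Using a HEAD request avoids downloading the landing page; only the status code matters here.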
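The matching step (item 3 above) could be sketched like this: query the Wikidata SPARQL endpoint for all items whose "published in" (P1433) statement points to the journal, together with their titles (P1476) and publication dates (P577), then match on a normalized title plus year. The property IDs are the standard Wikidata ones, but the query and the matching logic here are a simplified assumption, not necessarily what the actual script does.

```python
# Sketch of matching scraped articles to Wikidata items by
# (normalized title, publication year). Simplified illustration.
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def fetch_journal_articles(journal_qid):
    """Fetch (qid, title, year) for all articles published in the journal."""
    query = """
    SELECT ?item ?title ?year WHERE {
      ?item wdt:P1433 wd:%s ;
            wdt:P1476 ?title ;
            wdt:P577 ?date .
      BIND(YEAR(?date) AS ?year)
    }""" % journal_qid
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [
        {"qid": b["item"]["value"].rsplit("/", 1)[-1],
         "title": b["title"]["value"],
         "year": int(b["year"]["value"])}
        for b in data["results"]["bindings"]
    ]

def normalize(title):
    """Strip whitespace and case so near-identical titles compare equal."""
    return "".join(title.split()).lower()

def match(scraped, wikidata_rows):
    """Map each scraped article's DOI to a matching QID (or None)."""
    index = {(normalize(r["title"]), r["year"]): r["qid"]
             for r in wikidata_rows}
    return {a["doi"]: index.get((normalize(a["title"]), a["year"]))
            for a in scraped}
```

In practice the script also uses the issue number as described above; including it would just mean adding one more field to the lookup key.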

The four parts can be run together or separately. For instance, after adding the journal information to the file journals.csv, you can use the command python main.py -j Q98518725 to run only the scraping part, or python main.py -a Q98518725 to run all four parts together. You can also use python main.py -h to see many more options. Moreover, a running script may sometimes be interrupted by network or other issues. In such cases, partially completed tasks are saved, and the script can be restarted from the point of interruption instead of starting over from scratch.

The scripts were successfully tested on a dozen Chinese journal websites, and more than 20,000 DOIs from those journals have already been uploaded to Wikidata. Currently, nearly 1 million articles on Wikidata have been imported from CNKI, and more than 70% of them lack DOI information, so they could potentially benefit from this project. As more Chinese journal articles are imported into Wikidata, more DOIs can be added using the scripts. A user who would like to run the scripts on a new journal only needs to provide the journal website URL, the website type, and the journal's QID. For the reasons stated above, small fixes sometimes need to be applied for websites whose structure differs slightly from the tested ones. In addition, although all surveyed journal websites fall into one of the three types, other website types may exist, and the scripts can be easily modified to handle them.

Any issues, advice, or suggestions regarding the project are welcome.