The Biodiversity Heritage Library is Round Tripping Persistent Identifiers with the Wikidata Query Service


In the Spring of 2022, the BHL Cataloging and Metadata Committee investigated the possibility of harvesting persistent identifiers (PIDs) from Wikidata as part of the group’s longstanding project to disambiguate and deduplicate author records in the BHL database. The motivation behind this one-time experimental data harvest was to see if BHL could:

  1. Enhance BHL author records with additional PID data points;
  2. Improve the committee’s ability to disambiguate author names in the BHL database; and
  3. Respond to an outstanding user request from two of Wikimedia’s superstar editors, Siobhan Leachman and Andy Mabbett, to expose BHL’s author data on BHL and include hyperlinks to other authoritative knowledge bases on the web.

How do PIDs work?
Persistent identifiers can become resolvable links that connect two knowledge bases (like BHL and Wikidata) together.

In particular, Wikimedians wanted to see the Wikidata Q identifier exposed, providing a link to the corresponding creator item record in Wikidata.
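As a rough illustration of how an identifier becomes a resolvable link, the sketch below substitutes an identifier into a URL pattern. The scheme names, the `resolve_pid` helper, and the URL patterns are assumptions made for this example and do not describe BHL's actual configuration; Q42 is used only as a sample Wikidata identifier.

```python
# Hypothetical URL patterns; "$1" marks where the identifier is inserted.
# These are assumptions for illustration, not BHL's real settings.
URL_TEMPLATES = {
    "wikidata": "https://www.wikidata.org/wiki/$1",
    "bhl_creator": "https://www.biodiversitylibrary.org/creator/$1",
    "viaf": "https://viaf.org/viaf/$1",
}


def resolve_pid(scheme: str, identifier: str) -> str:
    """Return a resolvable URL for an identifier in a known scheme."""
    template = URL_TEMPLATES[scheme]  # raises KeyError for an unknown scheme
    return template.replace("$1", identifier.strip())


print(resolve_pid("wikidata", "Q42"))  # https://www.wikidata.org/wiki/Q42
```

Keeping the patterns in one table means a new identifier scheme can be supported by adding a single entry rather than new code.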

There are multiple motivations for undertaking this work. By adding the BHL Creator ID to the corresponding Wikidata item, Wikidata editors help link BHL to the richer biographical data about that person held in Wikidata. The Wikidata item for a person may contain links to their Wikipedia page or to images of the person held in the image repository Wikimedia Commons. Wikidata items also act as identifier hubs and contain links to other databases and identifiers.

Maria Sibylla Merian in Wikidata, rendered with the Reasonator Tool; note the External sources section and BHL’s Creator ID entry.

By adding the BHL Creator ID to this list of identifiers, the Wikidata editor is linking the content held in BHL to the content held in multiple other datasets and repositories. 

These extra author data points provide Wikimedians and BHL catalogers with crucial clues that aid in name disambiguation. In particular, hyperlinks to other knowledge bases are incredibly valuable because they lead to new knowledge pathways that help confirm a person’s identity in a complex game of “Who’s Who?”

Workflow Overview

The diagram below illustrates the experimental data pipeline from BHL to Wikidata and back.

The BHL-Wikidata Round Trip Overview.

The BHL Wikidata Round Trip – Steps

  1. BHL Creator ID and record are created when digital content is harvested into BHL and/or articles are defined in BHL.
  2. Wikimedians record BHL Creator IDs as a statement in corresponding Wikidata items via various community workflows. (See the Mix’n’match deep dive below, or read about QuickStatements, for two ways volunteers are populating Wikidata with BHL’s Creator IDs.)
  3. BHL subsequently harvests additional metadata from Wikidata for any author item where a BHL Creator ID statement has been added. 
  4. Data outputs are analyzed for quality and breadth.
  5. SPARQL queries and data outputs are iteratively refined to match BHL’s requirements until a quality dataset can be generated for import.
  6. Clean data is imported into the BHL database.
  7. The new author record sidebar displays data on BHL.
  8. PIDs are converted into resolvable URIs, opening up new research pathways for BHL’s users.

Quick note: In Wikidata, BHL author records are represented by the BHL Creator ID property; the presence of this property on a Wikidata item provides a powerful connection point that BHL can use later to ingest more information about any named entity. A popular author matching tool in Wikidata is Mix’n’match.

Getting BHL Authors in Wikidata with Mix’n’match

Mix’n’match is a Wikidata tool that empowers editors to match Wikidata items to entries in other databases. One of the datasets in Mix’n’match is the BHL Creator ID dataset.

BHL Creator Dashboard
A summary of the completeness of matching the BHL Creator ID dataset to Wikidata as of January 2023

When the BHL Creator ID dataset is uploaded into Mix’n’match, an algorithm undertakes fuzzy name matching. The tool then suggests a preliminary match to a Wikidata item. Wikidata editors working on the dataset have the choice of whether to confirm the suggested match or to remove it.
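The post does not describe how Mix’n’match scores names, so the following is only a minimal sketch of fuzzy name matching in general, using Python's standard `difflib` module as a stand-in technique. The `normalize`, `similarity`, and `suggest_match` helpers and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort name parts so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Score two names between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_match(creator_name, candidates, threshold=0.85):
    """Return the best-scoring candidate label, or None if none clears the threshold."""
    best = max(candidates, key=lambda c: similarity(creator_name, c), default=None)
    if best is not None and similarity(creator_name, best) >= threshold:
        return best
    return None


# A catalog-style "Surname, Forenames" string against two Wikidata-style labels.
print(suggest_match("Merian, Maria Sibylla", ["Matthäus Merian", "Maria Sibylla Merian"]))
```

A score like this only produces a suggestion. As the workflow above makes clear, a human editor still confirms or rejects each proposed match.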

MixNMatch - Match Screen
Examples of possible matches suggested by the Mix’n’match algorithm with the BHL creator in green and the Wikidata item in blue.

If the Wikidata editor confirms the match, the BHL Creator ID is automatically added to the Wikidata item for the creator. If the editor rejects the match, the editor can add a correct Wikidata item ID, create a new Wikidata item if the creator is not yet in Wikidata or, in some cases, decide that the BHL Creator ID isn’t applicable to Wikidata. If the editor simply rejects the match without further action, the BHL Creator ID will be added to the unmatched portion of the dataset.

MixNMatch - Match Screen - Unmatched entries
Results after the Wikidata editor Ambrosia10 has confirmed or removed the Mix’n’match algorithm suggestions.

The act of linking the BHL Creator ID to the Wikidata item also helps to disambiguate the creator. It removes much of the uncertainty about who the creator is. Names are not unique but by linking identifiers to biographical data, editors can ensure that similarly named people or people whose names have changed over time are all associated with the correct Wikidata item. This addresses the age-old problem of how to assign correct attribution to the right person.

This work also assists BHL. By ensuring BHL Creator IDs are matched to the Wikidata item, the Wikidata editor can assist BHL in weeding out the duplicate entries for creators in BHL’s database. After matching has taken place, BHL is also able to ingest any of the other identifiers listed on the creator’s Wikidata item, thus enriching the metadata held in BHL’s database.

Currently there are over 230,000 entries in the BHL Creator ID dataset. Of these, just over 41,000 have been matched to Wikidata items. There is a long way to go before this dataset is complete. In the spirit of “many hands make light work,” BHL encourages Wikidata editors to work alongside BHL staff who were recently trained and certified in Wikidata Advanced Concepts by Wiki Education. Our collaborative work will help increase the number of Creator IDs in Wikidata.
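The duplicate weeding mentioned above rests on a simple observation: when two or more BHL Creator IDs are matched to the same Wikidata item, they probably describe the same person and are candidates for merging. The sketch below shows that grouping step. The `find_duplicate_creators` helper and the sample pairs are made up for illustration and are not BHL's actual process or data.

```python
from collections import defaultdict


def find_duplicate_creators(matches):
    """Group BHL Creator IDs by Wikidata item; keep items with more than one ID."""
    by_item = defaultdict(list)
    for creator_id, qid in matches:
        by_item[qid].append(creator_id)
    return {qid: ids for qid, ids in by_item.items() if len(ids) > 1}


# Made-up (creator ID, Wikidata item) pairs, for illustration only.
matches = [("101", "Q-A"), ("102", "Q-B"), ("103", "Q-A")]
print(find_duplicate_creators(matches))  # {'Q-A': ['101', '103']}
```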

MixNMatch - Statistics
BHL Creator ID dataset completeness statistics as of January 2023.

Once BHL Creator ID statements had been recorded on Wikidata items, whether through manual edits or through workflows like Mix’n’match and QuickStatements, BHL wrote custom SPARQL queries for the Wikidata Query Service to output the data of interest.

Wikidata Query Service SPARQL
The SPARQL query language is used to query and return data from semantic databases like Wikidata.
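The Committee's actual queries are not reproduced in this post, so the following is only a sketch of what such a harvest could look like. It assumes that P4081 is the Wikidata property for the BHL Creator ID and P214 the property for the VIAF ID. The `run_query` and `flatten` helpers, the User-Agent string, and the `LIMIT` are illustrative choices.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed property IDs: P4081 (BHL Creator ID) and P214 (VIAF ID).
QUERY = """
SELECT ?item ?bhlCreatorId ?viaf WHERE {
  ?item wdt:P4081 ?bhlCreatorId .
  OPTIONAL { ?item wdt:P214 ?viaf . }
}
LIMIT 100
"""


def run_query(query: str) -> dict:
    """Send a SPARQL query to the Wikidata Query Service and return the parsed JSON."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "pid-roundtrip-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def flatten(results: dict) -> list:
    """Turn SPARQL JSON results into a list of plain {variable: value} rows."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


# Usage (requires network access):
# for row in flatten(run_query(QUERY)):
#     print(row["bhlCreatorId"], row["item"], row.get("viaf"))
```

Because the VIAF pattern is wrapped in `OPTIONAL`, authors without a VIAF ID still come back, and those rows simply lack the `viaf` key.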

Once the data was brought back, the BHL Cataloging and Metadata Committee discussed and reviewed the records with the aim of bringing key data points back into the BHL database.

BHL Raw Data of Interest
Snippet of raw data brought back by the BHL Cataloging and Metadata Committee’s SPARQL query.

In total, the BHL Cataloging and Metadata Committee was able to “round trip” 88,507 persistent identifiers (PIDs) associated with BHL Creator records from the following authoritative knowledge bases:

There are still more PIDs out there to gather but practicality calls for moderation – moderation of the quantity that the Wikidata Query Service can bring back and the amount feasible to curate in BHL. After many discussions, the above list was chosen by narrowing down the PIDs that seem most appropriate in the BHL context. A comprehensive policy on Uniform Resource Identifiers (URIs) in the BHL ecosystem is now being drafted by BHL committees.

Additionally, in post-work analysis, the Committee did find that the data modeling of corporate names in Wikidata differs from the library world, and pulling in identifiers for corporate names was not worth the effort (at least at this time). 

Below is an example of Académie des sciences (France), which receives multiple item records in library databases like VIAF for each name change. However, in Wikidata these name changes are collapsed and will appear on one item record; this divergence means that one-to-one matches are not possible without a lot of manual clean-up from BHL Staff.

Divergent data models: Wikidata corporate names collapse name changes.

It’s important to note that data modeling in Wikidata is in its nascent stages and librarians should have a vested interest in shaping the best practices of the community so all of humanity can model the world’s knowledge together. As we have seen with this one-time experiment, there is little to lose, and so much to gain!

The Result: A New Author Record Sidebar for BHL

Thanks to the feedback from the Wikimedia community, BHL’s Lead Developer created an Author Record sidebar. Click on any BHL author name on the BHL website and the author details will appear on the right-hand side. Below is an example for the Australian paleontologist and ornithologist Patricia Vickers-Rich in BHL.

A goal of this new feature was to expose existing and recently harvested identifiers as resolvable links that could open up critical knowledge pathways for BHL’s users on their information journeys. Other key data points are provided such as preferred name forms, entity type, and alternative name forms that give additional context for disambiguation research.

New Knowledge Pathways
A New Wikimedian Driven Feature: The BHL Author Sidebar!
Adding these identifiers to the Author sidebar makes my life so much easier. At a glance, I can quickly confirm the identity of creators. The author sidebar or lack thereof also highlights whether further work is needed in Wikidata to link these BHL creators to their identifiers.

Next Steps for the BHL’s Committees?

Please let us know in the comments what you think of this experiment. Should BHL pursue similar persistent identifier harvests for other entity types? Or perhaps, a recurring harvest for BHL Authors?

Additional BHL entities:

  • Titles 
  • Scientific Names
  • Subjects
  • Media (illustrations, scientific plates, photos)

Let us know! Your feedback is crucial to BHL’s evolution as a biodiversity knowledge base. 

If you are new to Wikidata, Mix’n’match is a great place to start. There are many tutorials and resources available on the web including this great YouTube introduction on how to use the tool. 

Additionally, a workflow for BHL articles has been piloted and is currently underway thanks to BHL’s Persistent Identifier Working Group. For more details on how to roundtrip BHL Article Q identifiers, please refer to the group’s documentation at: Wikidata:WikiProject_BHL/Projects

BHL committees and working groups are actively scoping many projects. Sign up here to get involved!

Related Posts

For more information on the assignment of DOIs, a very important type of persistent identifier specific to scholarly publications, check out Nicole Kearney’s blog post What Is BHL’s New Persistent Identifier Working Group DOI’ng?

For a list of all of the new features and data quality improvements BHL made in 2022, check out the post BHL Technical Development: Year in Review

This post was originally published on the Biodiversity Heritage Library blog to highlight International Love Data Week.
