In the spring of 2022, the BHL Cataloging and Metadata Committee investigated the possibility of harvesting persistent identifiers (PIDs) from Wikidata as part of the group’s longstanding project to disambiguate and deduplicate author records in the BHL database. The motivation behind this one-time experimental data harvest was to see if BHL could:
- Enhance BHL author records with additional PID data points;
- Improve the committee’s ability to disambiguate author names in the BHL database; and
- Respond to an outstanding user request from two of Wikimedia’s superstar editors, Siobhan Leachman and Andy Mabbett, to expose BHL’s author data on BHL and include hyperlinks to other authoritative knowledge bases on the web.
In particular, Wikimedians wanted to see the Wikidata Q identifier exposed, providing a link to the corresponding creator item record in Wikidata.
There are multiple motivations for undertaking this work. By adding the BHL Creator ID to the corresponding Wikidata item, Wikidata editors help link BHL to the richer biographical data about that person held in Wikidata. The Wikidata item for a person may contain links to their Wikipedia page or to images of the person held in the image repository Wikimedia Commons. Wikidata items also act as identifier hubs and contain links to other databases and identifiers.
By adding the BHL Creator ID to this list of identifiers, the Wikidata editor is linking the content held in BHL to the content held in multiple other datasets and repositories.
These extra author data points provide Wikimedians and BHL catalogers with crucial clues that aid in name disambiguation. In particular, hyperlinks to other knowledge bases are incredibly valuable because they lead to new knowledge pathways that help confirm a person’s identity in a complex game of “Who’s Who?”
Workflow Overview
The diagram below illustrates the experimental data pipeline from BHL to Wikidata and back.
The BHL Wikidata Round Trip – Steps
- BHL Creator ID and record are created when digital content is harvested into BHL and/or articles are defined in BHL.
- Wikimedians record BHL Creator IDs as a statement in corresponding Wikidata items via various community workflows. (See: Mix’n’match deep dive below and/or read about QuickStatements for two ways volunteers are populating Wikidata with BHL’s Creator IDs).
- BHL subsequently harvests additional metadata from Wikidata for any author item where a BHL Creator ID statement has been added.
- Data outputs are analyzed for quality and breadth.
- SPARQL queries and data outputs are iteratively refined to match BHL’s requirements until a quality dataset can be generated for import.
- Clean data is imported into the BHL database.
- The new author record sidebar displays data on BHL.
- PIDs are converted into resolvable URIs, opening up new research pathways for BHL’s users.
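Step 2 above mentions QuickStatements as one way volunteers add BHL Creator IDs to Wikidata. The following is a minimal Python sketch of preparing such a batch, not BHL's or any volunteer's actual tooling. It assumes that the BHL Creator ID property is P4081 (worth verifying on Wikidata before use), that Creator IDs are numeric, and that the batch uses QuickStatements' tab-separated V1 command syntax. The helper name `build_quickstatements` is hypothetical.

```python
# Hypothetical helper: builds QuickStatements V1 commands (tab-separated)
# that add a BHL Creator ID statement to already-matched Wikidata items.
BHL_CREATOR_ID_PROPERTY = "P4081"  # assumption: verify on Wikidata before use


def build_quickstatements(matches):
    """matches: iterable of (wikidata_qid, bhl_creator_id) pairs."""
    lines = []
    for qid, creator_id in matches:
        qid = str(qid).strip()
        creator_id = str(creator_id).strip()
        if not (qid.startswith("Q") and qid[1:].isdigit()):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
        # Assumption: BHL Creator IDs are plain numbers.
        if not creator_id.isdigit():
            raise ValueError(f"not a numeric BHL Creator ID: {creator_id!r}")
        # External identifiers are string values, written in double quotes.
        lines.append(f'{qid}\t{BHL_CREATOR_ID_PROPERTY}\t"{creator_id}"')
    return "\n".join(lines)
```

The returned text is intended to be pasted into QuickStatements only after a person has reviewed each match, since a wrong pairing would attach an author record to the wrong person.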
Quick note: In Wikidata, BHL author records are represented by the BHL Creator ID property; the presence of this property on a Wikidata item provides a powerful connection point that BHL can use later to ingest more information about any named entity. A popular tool for matching authors to Wikidata items is Mix’n’match.
Getting BHL Authors in Wikidata with Mix’n’match
Mix’n’match is a Wikidata tool that empowers editors to match Wikidata items to entries in other databases. One of the datasets in Mix’n’match is the BHL Creator ID dataset.
When the BHL Creator ID dataset is uploaded into Mix’n’match, an algorithm undertakes fuzzy name matching. The tool then suggests a preliminary match to a Wikidata item. Wikidata editors working on the dataset have the choice of whether to confirm the suggested match or to remove it.
If the Wikidata editor confirms the match, the BHL Creator ID is automatically added to the Wikidata item for the creator. If the editor rejects the match, the editor can add a correct Wikidata item ID, create a new Wikidata item if the creator is not yet in Wikidata, or, in some cases, decide that the BHL Creator ID isn’t applicable to Wikidata. If the editor simply rejects the match without further action, the BHL Creator ID will be added to the unmatched portion of the dataset.
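The post does not describe how Mix’n’match performs its fuzzy name matching, so the following Python sketch is only an illustration of the general idea, using the standard library's `difflib`. The function names, the normalization rules, and the 0.85 threshold are all assumptions made for this example.

```python
import difflib
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and sort name parts so that
    'Vickers-Rich, Patricia' and 'Patricia Vickers-Rich' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def suggest_match(bhl_name, candidates, threshold=0.85):
    """Return (label, score) for the best candidate label at or above the
    threshold, or None. A human still confirms or rejects the suggestion."""
    target = normalize_name(bhl_name)
    best = None
    for label in candidates:
        matcher = difflib.SequenceMatcher(None, target, normalize_name(label))
        score = matcher.ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (label, score)
    return best
```

A suggestion produced this way is only preliminary: two different people can share an identical name, which is why the confirm-or-reject step by an editor matters.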
The act of linking the BHL Creator ID to the Wikidata item also helps to disambiguate the creator. It removes much of the uncertainty about who the creator is. Names are not unique but by linking identifiers to biographical data, editors can ensure that similarly named people or people whose names have changed over time are all associated with the correct Wikidata item. This addresses the age-old problem of how to assign correct attribution to the right person.
This work also assists BHL. By ensuring BHL Creator IDs are matched to the Wikidata item, the Wikidata editor can assist BHL in weeding out the duplicate entries for creators in BHL’s database. After matching has taken place, BHL is also able to ingest any of the other identifiers listed on the creator’s Wikidata item, thus enriching the metadata held in BHL’s database.
Currently there are over 230,000 entries in the BHL Creator ID dataset. Of these, just over 41,000 have been matched to Wikidata items. There is a long way to go before this dataset is complete. In the spirit of “many hands make light work,” BHL encourages Wikidata editors to work alongside BHL staff who were recently trained and certified in Wikidata Advanced Concepts by Wiki Education. Our collaborative work will help increase the number of Creator IDs in Wikidata.
Once BHL Creator ID statements were recorded on Wikidata items, either through manual edits or through workflows like Mix’n’match or QuickStatements, BHL created custom SPARQL queries using the Wikidata Query Service to output data of interest.
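BHL's actual queries are not reproduced in this post. As a rough sketch of what such a harvest might look like, the Python below builds a SPARQL query for items that carry a BHL Creator ID and flattens the JSON results. The property IDs (P4081 for BHL Creator ID, P214 for VIAF, P496 for ORCID, P244 for Library of Congress, P3430 for SNAC, P2038 for ResearchGate) are assumptions to verify on Wikidata, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata property IDs; verify each one before relying on it.
BHL_CREATOR_ID = "P4081"
PID_PROPERTIES = {
    "viaf": "P214",
    "orcid": "P496",
    "lc": "P244",
    "snac": "P3430",
    "researchgate": "P2038",
}


def build_harvest_query(limit=1000, offset=0):
    """Select every item with a BHL Creator ID plus any optional PIDs."""
    optionals = "\n".join(
        f"  OPTIONAL {{ ?item wdt:{pid} ?{name} . }}"
        for name, pid in PID_PROPERTIES.items()
    )
    variables = " ".join(f"?{name}" for name in PID_PROPERTIES)
    return (
        f"SELECT ?item ?bhl {variables} WHERE {{\n"
        f"  ?item wdt:{BHL_CREATOR_ID} ?bhl .\n"
        f"{optionals}\n"
        f"}}\nLIMIT {int(limit)} OFFSET {int(offset)}"
    )


def parse_bindings(results):
    """Flatten SPARQL JSON results into one dict per row; absent
    OPTIONAL values are simply missing from the row."""
    rows = []
    for binding in results["results"]["bindings"]:
        row = {name: cell["value"] for name, cell in binding.items()}
        # The item comes back as an entity URI; keep only the Q identifier.
        row["item"] = row["item"].rsplit("/", 1)[-1]
        rows.append(row)
    return rows


def run_query(query, user_agent):
    """Send the query to the Wikidata Query Service (needs network access)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params, headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

An item with several values for one identifier appears as several rows, so the output still needs review before import. Harvesting in pages through `limit` and `offset` is one way to stay within what the query service can return in a single request.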
Once the data was brought back, the BHL Cataloging and Metadata Committee discussed and reviewed the records with the aim of bringing key data points back into the BHL database.
In total, the BHL Cataloging and Metadata Committee was able to “round trip” 88,507 persistent identifiers (PIDs) associated with BHL Creator records from the following authoritative knowledge bases:
- Virtual International Authority File (VIAF)
- Wikidata
- Social Networks and Archival Context (SNAC)
- ResearchGate
- ORCID
- Library of Congress Authorities
There are still more PIDs out there to gather but practicality calls for moderation – moderation of the quantity that the Wikidata Query Service can bring back and the amount feasible to curate in BHL. After many discussions, the above list was chosen by narrowing down the PIDs that seem most appropriate in the BHL context. A comprehensive policy on Uniform Resource Identifiers (URIs) in the BHL ecosystem is now being drafted by BHL committees.
Additionally, in post-work analysis, the Committee found that the data modeling of corporate names in Wikidata differs from that of the library world, and that pulling in identifiers for corporate names was not worth the effort (at least at this time).
Below is an example of Académie des sciences (France), which receives multiple item records in library databases like VIAF for each name change. However, in Wikidata these name changes are collapsed and will appear on one item record; this divergence means that one-to-one matches are not possible without a lot of manual clean-up from BHL Staff.
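One way to see this divergence in harvested data is to look for Wikidata items that carry more than one value for the same library identifier. The Python sketch below is an illustration under assumptions, not the Committee's method: it expects rows shaped as dictionaries with an `item` key and an identifier key such as `viaf` (a hypothetical shape), and the function name is made up for this example.

```python
from collections import defaultdict


def split_by_cardinality(rows, id_field="viaf"):
    """Separate items with exactly one identifier value from items with
    several, which cannot be matched one-to-one without manual review."""
    values_by_item = defaultdict(set)
    for row in rows:
        if row.get(id_field):
            values_by_item[row["item"]].add(row[id_field])

    one_to_one = {}
    needs_review = {}
    for item, values in values_by_item.items():
        if len(values) == 1:
            one_to_one[item] = next(iter(values))
        else:
            needs_review[item] = sorted(values)
    return one_to_one, needs_review
```

Items that land in `needs_review` are the ones where a single Wikidata item corresponds to several library records, as with an organization whose successive names each received their own authority record.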
It’s important to note that data modeling in Wikidata is in its nascent stages and librarians should have a vested interest in shaping the best practices of the community so all of humanity can model the world’s knowledge together. As we have seen with this one-time experiment, there is little to lose, and so much to gain!
The Result: A New Author Record Sidebar for BHL
Thanks to the feedback from the Wikimedia community, BHL’s Lead Developer created an Author Record sidebar. Click on any BHL author name on the BHL website and the author details will appear on the right-hand side. Below is an example: the BHL record for Australian paleontologist and ornithologist Patricia Vickers-Rich.
A goal of this new feature was to expose existing and recently harvested identifiers as resolvable links that could open up critical knowledge pathways for BHL’s users on their information journeys. Other key data points are provided such as preferred name forms, entity type, and alternative name forms that give additional context for disambiguation research.
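Turning a stored identifier into a resolvable link usually comes down to filling the identifier into a URL pattern. The Python sketch below shows the idea; it is not BHL's implementation. The URL patterns are assumptions based on each service's commonly used public addresses and should be checked before use, and the Library of Congress pattern assumes a name authority record.

```python
from urllib.parse import quote

# Assumed URL patterns; verify each against the service before relying on it.
URI_TEMPLATES = {
    "wikidata": "https://www.wikidata.org/wiki/{id}",
    "viaf": "https://viaf.org/viaf/{id}",
    "orcid": "https://orcid.org/{id}",
    "lc": "https://id.loc.gov/authorities/names/{id}",  # name authorities only
    "snac": "https://snaccooperative.org/ark:/99166/{id}",
    "researchgate": "https://www.researchgate.net/profile/{id}",
}


def to_uri(scheme, identifier):
    """Return a link for a known identifier scheme, or None when the
    scheme is unknown or the identifier is empty."""
    template = URI_TEMPLATES.get(scheme)
    identifier = str(identifier).strip()
    if template is None or not identifier:
        return None
    # Percent-encode the identifier so stray characters cannot alter the URL.
    return template.format(id=quote(identifier, safe=""))
```

Keeping the patterns in one table means a change on the provider's side requires editing a single entry, while the stored identifiers themselves stay untouched.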
Next Steps for BHL’s Committees?
Please let us know in the comments what you think of this experiment. Should BHL pursue similar persistent identifier harvests for other entity types? Or perhaps a recurring harvest for BHL Authors?
Additional BHL entities:
- Titles
- Scientific Names
- Subjects
- Media (illustrations, scientific plates, photos)
Let us know! Your feedback is crucial to BHL’s evolution as a biodiversity knowledge base.
If you are new to Wikidata, Mix’n’match is a great place to start. There are many tutorials and resources available on the web including this great YouTube introduction on how to use the tool.
Additionally, a workflow for BHL articles has been piloted and is currently underway thanks to BHL’s Persistent Identifier Working Group. For more details on how to roundtrip BHL Article Q identifiers, please refer to the group’s documentation at: Wikidata:WikiProject_BHL/Projects
BHL committees and working groups are actively scoping many projects. Sign up here to get involved!
Related Posts
For more information on the assignment of DOIs, a very important type of persistent identifier specific to scholarly publications, check out Nicole Kearney’s blog post What Is BHL’s New Persistent Identifier Working Group DOI’ng?
For a list of all of the new features and data quality improvements BHL made in 2022, check out the post BHL Technical Development: Year in Review
This post was originally published on the Biodiversity Heritage Library blog to highlight International Love Data Week.