Enabling Handwritten Text Recognition on Wikisource using Transkribus OCR Engine

Wikisource is home to a wide variety of historical documents and manuscripts that span different languages, writing styles, scripts, and eras, and that vary in image quality. The Optical Character Recognition (OCR) technology built into the platform greatly simplifies the task of transcribing and preserving these documents in a digital format. In Wikisource terms, it speeds up the process by which community contributors extract text from scanned images and correct it before sharing it on the platform.

Launched on the 24th International Mother Language Day, the Wikisource Loves Manuscripts project is focused on bringing more manuscripts to Wikimedia projects through scanning and digitization efforts as well as partnerships with allied institutions. The project is also focused on improving and integrating technology to support the transcription of handwritten manuscripts on Wikisource. The latter was identified as a major challenge by the Balinese community, which manually transcribed thousands of manuscripts on Balinese Wikisource.

Existing OCR engines available on Wikisource did not support Balinese, Javanese, or other languages in the region. It was to enable the transcription of handwritten documents in these under-resourced languages that Transkribus was integrated as a third text recognition engine.

Transkribus

Transkribus is an AI-powered platform that eases the process of working with handwritten or printed manuscripts. It offers many models, each suited to a particular writing script, historical period, or other characteristic of the document. The Transkribus B2022 English Model M4, Transkribus German Handwriting, and the Devanagari Mixed M1A are some of the models that show high accuracy in transcribing texts.

Following tests on multiple types of documents and models using the Wikimedia OCR tool, the Transkribus engine has been made available as an option alongside Google and Tesseract. As of 14 July 2023, Transkribus has been enabled on the 13 Wikisources listed on this page.
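For readers who like to script things, here is a minimal sketch of how a request to the Wikimedia OCR tool with a chosen engine might be assembled. It only builds a URL and makes no network call. The endpoint address, the parameter names (`engine`, `image`, `langs[]`), the engine identifiers, and the `ban` language value are all assumptions made for illustration and should be checked against the Wikimedia OCR documentation before use.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names are illustrative, not confirmed
# by this post. Verify them against the Wikimedia OCR tool's documentation.
OCR_ENDPOINT = "https://ocr.wmcloud.org/api.php"

# The three engines named in this post, as hypothetical identifiers.
ENGINES = ("google", "tesseract", "transkribus")


def build_ocr_url(image_url, engine="transkribus", langs=("ban",)):
    """Return a query URL asking the OCR tool to transcribe one image."""
    if engine not in ENGINES:
        raise ValueError(f"Unknown engine: {engine!r}")
    # A list of pairs lets the same key ("langs[]") appear more than once.
    params = [("engine", engine), ("image", image_url)]
    params += [("langs[]", lang) for lang in langs]
    return f"{OCR_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical image address, used only to show the call.
    print(build_ocr_url("https://example.org/manuscript-page.jpg"))
```

Switching engines is then a matter of passing a different `engine` value, which mirrors the choice between Google, Tesseract, and Transkribus offered in the tool.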

READ-COOP, the organization behind Transkribus, has generously provided over 60,000 free Transkribus credits to the Wikimedia community. Wikisource Technical Fellows have been working on improving the Balinese model, initially created by the team at IIIT Hyderabad led by Dr. Ravi Kiran, and on creating a new Javanese model with Transkribus. The data used to train the models, known as ground truth, was provided by both communities.

What’s next?

The Transkribus engine is a promising addition to Wikimedia OCR as it allows users to work with a wide variety of models for transcribing text, with a focus on handwritten manuscripts. Even though the functionality on Wikimedia OCR does not take into account the full range of features offered by Transkribus, there are more activities planned in the near future.

  • As mentioned earlier, Transkribus offers multiple models for any particular language. Keeping this in mind, we plan to have a model selection widget that allows the user to choose and switch between different models while proofreading on Wikisource.
  • Wikisource communities will be able to train models on their choice of documents and eventually publish these models for general use. This page details the procedure involved in the creation and training of models using Transkribus.
  • We will be available to provide support to communities that are planning to implement Wikisource Loves Manuscripts projects for their literary heritage. Eight communities have already signed up as learning partners on the project. Sign up now if you and your community are also interested in working with manuscripts and handwriting recognition. Stay tuned for more information on the upcoming Wikisource Loves Manuscripts Learning Partners Network!

The integration of a next-generation AI-powered text recognition engine into Wikimedia OCR opens up exciting new possibilities in terms of transcription accuracy and efficiency. We cannot wait for all contributors, from across the global Wikisource community, to start using Transkribus on Wikisource!

Parthiv Menon is a Wikisource Technical Fellow who worked on integrating Transkribus into Wikimedia OCR along with Kolawole Lawal.
