Building a 50,000 pronunciation data repository in the Odia language

Last year, I started a small pilot under the OpenSpeaks project for building voice data as a foundational layer for speech synthesis research and application development. To test some of the learning in this field, I started building a wordlist by collecting words from multiple sources including Odia Wikipedia and Odia Wiktionary, and started recording pronunciations using Lingua Libre. Recently, the pilot hit a 55,000 pronunciation milestone. The repository also includes pronunciations of 5,600 words in Baleswaria, the northern dialect of Odia. All the recordings were released into the Public Domain (Creative Commons CC0 1.0) on Wikimedia Commons. These recordings make up the largest repository of Public Domain voice data in Odia, and add to another 4,000+ recordings of sentences in Odia on Mozilla Common Voice.

The “Odia pronunciation” category on Wikimedia Commons that houses these recordings also includes further contributions from many other volunteer contributors that are available under CC-BY and CC-BY-SA licenses. The Baleswaria dialect has been scarcely studied before, and a comprehensive corpus of its spoken words was never made available. As a bonus of this pilot recording project, two word corpora are being built and expanded, one in Odia and one in Baleswaria.


Recording a word using Lingua Libre. A detailed tutorial in Odia explains how to record words from the Odia Wiktionary. User:Psubhashish CC-BY-SA-4.0.

Collecting words for building wordlists

To record pronunciations of words, a list of unique words is often used. Such lists exist for many languages, as Natural Language Processing (NLP) research generally requires such corpora. In the case of Odia, many of us in the Wikimedia community had created a list of 500K unique words back in 2017, primarily using the content that we ourselves had created on Odia Wikipedia at that time. There also exist a few other lists such as PanLex and UniLex, as pointed out by Bengali Wikimedian Mahir Morshed in a conversation, which use data crawled from multiple sources. Nikhil Mohan Patnaik, a scientist and co-founder of the nonprofit Srujanika, which is involved in digitization and scientific resource-building in Odia, once shared the frustration that most wordlists in Odia did not have a decent diversity of topics, especially words that find contemporary use or use in science and technology.

Through a series of interconnected steps, it was possible to create a wordlist and clean it further. Cleaning up a wordlist is a must, as collected textual data often includes content that has not been edited for spelling and other mistakes, such as the use of alien characters in a particular writing system (say, Cyrillic characters in Korean text). Data dumps from Odia Wikipedia, Odia Wikisource, and Odia Wiktionary (which itself contains over 100K words and lexical definitions taken from the 1931-1941 lexicon “Purnnachandra Ordiya Bhashakosha”) were the primary sources.

At the beginning of this pilot, I started expanding the existing wordlist on a few fronts. Odia Wikipedia by this time not only has a few thousand more articles, but the quality of the articles has also been consciously improved by community effort. More importantly, individual editors who lead projects, such as articles on politicians and films by Sangram Keshari Senapati, medicine-related articles by Subas Chandra Rout, and articles on species and more from both the plant and animal kingdoms by Sraban Kumar Mishra, have created a rich body of diverse articles. Many annual editing sprints focusing on disciplines such as feminism and social sciences are also slowly helping diversify Odia Wikipedia. It was useful to add words by downloading and scraping the Wikipedia data dump, and also to look beyond it.

Despite these efforts, a small community of volunteer editors still translates into some spelling mistakes. Within three days of running a bot, I could fix over 10,000 errors just inside Odia Wikipedia. There are many more, and some would need manual reading, discovery, and editing. Wikisource tends to retain the spelling mistakes and writing styles of a book's original author, which can drift from the standard writing style. While recording words in a dialect helps expand the speech diversity, plain typing and other errors only inflate the total word count. It was possible to notice many mistyped words manually during the recording process and to correct them on the wordlist side.
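The "alien characters" check described above can be sketched in a few lines of Python. This is only an illustration, not the bot or script used in the pilot: the function names are hypothetical, and it assumes that a valid word consists solely of characters from the Odia Unicode block (U+0B00 to U+0B7F), plus the zero-width joiner and non-joiner that can appear inside Indic words. Anything else is set aside for manual review rather than deleted.

```python
import unicodedata

# Odia (Oriya) Unicode block: U+0B00 to U+0B7F.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F
# Zero-width non-joiner and joiner; assumed acceptable inside a word.
ALLOWED_EXTRA = {"\u200c", "\u200d"}


def is_clean_odia_word(word: str) -> bool:
    """Return True if the word is non-empty and every character is in the
    Odia block or is an allowed joiner."""
    if not word:
        return False
    return all(
        ODIA_START <= ord(ch) <= ODIA_END or ch in ALLOWED_EXTRA
        for ch in word
    )


def split_wordlist(words):
    """Split words into (clean, suspect) lists, preserving input order."""
    clean, suspect = [], []
    for raw in words:
        # Trim whitespace and normalize so equivalent sequences compare equal.
        word = unicodedata.normalize("NFC", raw.strip())
        (clean if is_clean_odia_word(word) else suspect).append(word)
    return clean, suspect


# Illustrative input: one Odia word, one Odia word with a stray Latin "a",
# and one Cyrillic word.
clean, suspect = split_wordlist(["ଶବ୍ଦ", "ଭାଷାa", "кот"])
print("clean:", clean)
print("suspect:", suspect)
```

A check like this only catches characters from the wrong script. Misspellings typed entirely in Odia characters still need the bot-assisted and manual review described above.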

Odia newspapers that have a text archive and news sites became the primary sources for contemporary topics, and the Bigyan Diganta magazine was key for science- and technology-related words. Web-scraping and crowdsourcing such sources, converting publicly available text typed using legacy encoding systems (such as ASCII) into Unicode text with encoding converters, and cleaning up the data with a mixed approach (a semi-automated process using a script written by Wikimedian T. Shrinivasan, followed by a manual cleanup) helped add more words to the wordlist. A tutorial by Wikimedian Tito Dutta was also useful for the wordlist creation before this script was available.
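As a rough illustration of the word-extraction step, here is a minimal Python sketch that pulls a unique, sorted list of Odia words out of Unicode text. It is not T. Shrinivasan's script. The function name and the regular expressions are assumptions made for this example: a "word" is taken to be any run of characters from the Odia Unicode block (optionally joined by a zero-width joiner or non-joiner), and tokens made only of Odia digits are dropped.

```python
import re

# A run of characters from the Odia block (U+0B00-U+0B7F), optionally
# continued across a zero-width non-joiner or joiner.
ODIA_WORD = re.compile(r"[\u0B00-\u0B7F]+(?:[\u200C\u200D][\u0B00-\u0B7F]+)*")
# Tokens made only of Odia digits (U+0B66-U+0B6F) are not words.
ODIA_DIGITS = re.compile(r"[\u0B66-\u0B6F]+")


def unique_words(text: str) -> list[str]:
    """Return the unique Odia words found in text, sorted by code point."""
    found = set(ODIA_WORD.findall(text))
    return sorted(w for w in found if not ODIA_DIGITS.fullmatch(w))


# Illustrative input mixing Odia words, punctuation, Latin text and digits.
sample = "ଭାଷା ଶବ୍ଦ। ଭାଷା, hello ୨୦"
print(unique_words(sample))
```

Because the danda (।) and Latin punctuation fall outside the Odia block, they act as word separators here without any extra handling. Text in a legacy encoding would have to be converted to Unicode before this step.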

These steps also helped expand the diversity in terms of topics, as Wikipedia content has gaps at many levels. Because of the existing contributor diversity, there is a significantly high number of articles of local relevance in terms of geography, religion, and ethnicity, a much higher level of coverage of medicine-related articles, and a staggeringly low level of coverage of topics such as feminism, human-computer interaction, or social justice. While the publicly available text includes many known gaps, collating all available words is still required from a wordlist standpoint.

Lexicographers generally segregate words into headwords (or lemmas) and forms. From a pronunciation library standpoint, forms are important as they carry phonological features such as intonation. For over 10,000 words of the 500K+ list, I also created a spreadsheet with rules to produce more than 100 forms from a single lemma. Later, continued interactions with Wikimedians Bodhisattwa Mondal and Mahir helped me understand that there is a place for lemmas and forms in the creation of Lexemes on Wikidata, and in the work towards Abstract Wikipedia, an ambitious project that would eventually help translate definitions and descriptions used on Wikipedia into multiple languages. Abstract Wikipedia can potentially be the kind of automation that might reduce the workload of many Wikipedians, as volunteer labor is an extreme privilege in many parts of the world. The demand for way too much manual work across Wikimedia projects can not only lead to contributor burnout but can also create a huge entry-level barrier for many who have little or no time, or lack the other means needed, to volunteer.

The wordlist creation also helped in listing, for the first time, headwords/lemmas that did not make it into available dictionaries, either because of a lack of wider coverage or a lack of subject experts. The emergence of neologisms because of COVID is an example of how unexpected situations add new words to a language. The wordlist has some such words. What was more useful was to create forms from headwords using concatenation formulas (e.g. “COVID-led” and “COVID-related” are two forms of the lemma “COVID”). Recording forms is important, as speech incorporates intonation and other variations even though a lemma and its corresponding forms repeat the same lemma. The audio library does contain some such cases. Building the wordlist in a speech-led process also helped create lemmas along with definitions and forms in some cases, and that prepared the ground for importing them into a Wikidata lexeme structure.
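The concatenation idea can be shown with a small Python sketch. It reuses the English example from the text and is not the spreadsheet used in the pilot: the function name and the suffix list are hypothetical, and it assumes forms are built by plainly appending a suffix to the lemma. Real Odia morphology may need rules that change the lemma itself, which this sketch does not attempt.

```python
def generate_forms(lemma: str, suffixes: list[str]) -> list[str]:
    """Append each suffix to the lemma, keeping order and dropping duplicates."""
    seen = set()
    forms = []
    for suffix in suffixes:
        form = lemma + suffix
        if form not in seen:
            seen.add(form)
            forms.append(form)
    return forms


# Hypothetical rule set mirroring the example in the text.
SUFFIX_RULES = ["-led", "-related"]
print(generate_forms("COVID", SUFFIX_RULES))  # ['COVID-led', 'COVID-related']
```

Applying one shared rule list to every lemma is what makes it practical to produce many forms per headword, though each generated form still needs a human check before it is recorded.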

The recording process

Lingua Libre recording setup for batch recording of word pronunciations. Package cushioning pads are used for frugal soundproofing. User:Psubhashish CC-BY-SA 4.0

The recording was done with a professional desktop studio microphone directly into Lingua Libre. As Lingua Libre is a web application that works well on the Firefox browser, it was possible to change the system settings on the computer and make the external mic (instead of the computer’s built-in mic, which has low-quality reception) the default mic. The mic also has rotating knobs to adjust the gain during recording, which matters because Lingua Libre has no effective means of cleaning up audio after recording. Some amount of soundproofing was done by adding four package cushioning pads, while the microphone and the computer were placed over a cotton mat that covered the wooden table underneath. All these steps were frugal measures for acoustic soundproofing. Lastly, Lingua Libre’s two levels of sound monitoring and reviewing were key to ensuring that the audio quality is right before uploading. I have created a tutorial on the Odia Wiktionary with the step-by-step process for using Lingua Libre for Odia entries from Wiktionary.

Platforms and hardware used

This pilot uses platforms that are open source and otherwise adhere to the philosophy of Openness. OpenSpeaks started as a project on Wikiversity for helping archivists who are creating multimedia archival documentation of low- and medium-resource languages. It includes a range of Open Educational Resources (OER), open strategies, and even templates for specific needs in different linguistic and other demographic environments. Lingua Libre is a web platform for recording pronunciations of a list of words in a language or dialect. The project supports virtually all languages that can be written using a writing system with a Unicode-compliant encoding. Common Voice is a web platform by Mozilla that encourages contributors to record pronunciations of sentences and review recordings made by others. At the moment, it only accepts sentences available under Public Domain, and the recordings are also released into the Public Domain after they are anonymized. The script created by T. Shrinivasan for creating a unique list of words from any text file (with .txt or even .xml or .json extensions) is available on GitHub and can be tailored to accommodate the needs of different languages typed using varied writing systems.

Things on the horizon

The recorded words are now uploaded on Wikimedia Commons, paving the way for them to be used across Wikimedia projects, starting with lexemes and Wiktionary, and also for other NLP research and application development. As a native speaker of Baleswaria who constantly code-switches between the dialect and standard Odia, I also found it important to document the variations in intonation and accent, which are essential attributes for speech synthesis. The repository severely lacks phonological diversity, as all the 55,000 recordings are made by a single male speaker. The next step is to document the ethnolinguistic strategies that can be useful for other low- and medium-resource languages as a part of the OpenSpeaks project, which focuses on helping archivists with strategies and tools for multimedia archival of languages with low and limited resources. In the meantime, the wordlists are made available on GitHub.
