Mozilla Common Voice Meets Wikidata (How the Dagbanli Dictionary Got Audio Usage Examples)

Wikidata provides pronunciation audio for words; Mozilla Common Voice provides spoken example sentences.

Mozilla Common Voice has thousands of Dagbanli sentences with native‑speaker audio. We built a pipeline to match these sentences to dictionary words, creating audio‑rich usage examples.

Introduction

Wikidata gave us the bones for the Dagbanli dictionary: Lexemes, Senses, Forms, and even the pronunciation audio of words. But a word without context is just… a word. Beyond listening to how a word sounds, users need to see how it is used in real life. Hearing it spoken by a real person in an everyday setting is even better.

Wikidata itself already supports usage examples through the property P5831. Those are valuable, but they are text only. Mozilla Common Voice provides something different: thousands of spoken sentences, each with an audio file recorded by a native speaker. We built a pipeline to connect these audio‑rich sentences to the right dictionary entries. This post explains how we did it, the challenges we faced, and why it matters for Dagbanli learners and speakers.

1. What Is Mozilla Common Voice?

Mozilla Common Voice is a global initiative to create open‑source speech datasets for any language. Anyone can record sentences from a prompt, and others listen and validate the recordings. The data is then released under CC0, meaning it can be used freely for any purpose.

Dagbanli joined Mozilla Common Voice through community-led contributions and the documentation work of the Dagbanli datasheet authors: Emmanuel Ngue Um, Osman Mohammed Nindow, and Hadjia Natuusamata Abubakari. As of early 2026, the Dagbanli dataset contains over forty thousand validated sentences:

  • Every entry has the written sentence in Dagbanli
  • Nearly a quarter have metadata such as the speaker’s gender, age, and dialect
  • Over twenty thousand have an audio recording of a native speaker reading that sentence

For our dictionary, these sentences are a goldmine. They provide authentic, spoken examples of how words are used in everyday speech, not just in isolated definitions.

2. From Tar File to R2: Handling the Audio

Common Voice does not provide an API to stream individual audio files. Instead, the entire dataset is distributed as a large .tar file containing all audio recordings plus a metadata spreadsheet. We downloaded this .tar file, extracted it, and uploaded each audio file to our Cloudflare R2 bucket. R2 gives each file a unique public URL.

The metadata spreadsheet (TSV) contains the sentence texts and the corresponding audio filenames. We wrote a script to combine this information into a single CSV, adding a new column with the R2 URL for each sentence. This CSV became the source for our matching pipeline.
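
As a rough illustration of this step, here is a minimal sketch rather than our production script. It assumes the TSV has `path` and `sentence` columns; the base URL, the output column names, and the function names are hypothetical.

```javascript
// Hypothetical public base URL for the R2 bucket; replace with your own.
const R2_BASE_URL = "https://audio.example.org/cv-dag";

// Quote a CSV field when it contains a comma, a quote, or a line break.
function csvField(value) {
  const text = String(value ?? "");
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn TSV text into CSV text with an added audio_url column.
// Assumes the TSV header contains "path" and "sentence" columns.
function tsvToCsvWithUrls(tsvText, baseUrl = R2_BASE_URL) {
  const lines = tsvText.split(/\r?\n/).filter((line) => line.length > 0);
  const header = lines[0].split("\t");
  const pathCol = header.indexOf("path");
  const sentenceCol = header.indexOf("sentence");
  if (pathCol === -1 || sentenceCol === -1) {
    throw new Error("TSV must contain 'path' and 'sentence' columns");
  }

  const rows = [["sentence", "path", "audio_url"]];
  for (const line of lines.slice(1)) {
    const cells = line.split("\t");
    const path = cells[pathCol];
    const sentence = cells[sentenceCol];
    if (!path || !sentence) continue; // skip rows without audio or text
    rows.push([sentence, path, `${baseUrl}/${encodeURIComponent(path)}`]);
  }
  return rows.map((row) => row.map(csvField).join(",")).join("\n");
}
```

Quoting only the fields that need it keeps the CSV readable while still surviving sentences that contain commas.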

We also considered using other metadata fields like speaker gender and dialect. However, the current coverage for Dagbanli is too sparse to be reliable. In the future, if enough recordings carry this metadata, we plan to add filters for dialect or speaker attributes.

3. From CSV to Dictionary Matches

Our script, build‑examples‑from‑csv.mjs, processes each sentence in the CSV as follows:

  1. Tokenize the sentence into words (handling punctuation and case).
  2. Normalize each word by removing diacritics and converting to lowercase.
  3. Look up the normalized word in our master Lexeme index (built from the harvested Wikidata Lexemes).
  4. If a match is found, link the sentence to the corresponding Lexeme ID and store the R2 audio URL.
  5. If multiple words in the sentence match different Lexemes, the sentence is attached to all of them.
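
Steps 1 to 3 can be sketched as follows. This is an illustrative sketch, not the code of our script: the function names, the `lexemeIndex` map, and the placeholder Lexeme ID used in the usage note are assumptions.

```javascript
// Step 2: strip combining diacritics and lowercase a single word.
// Letters such as ɛ, ɔ, and ŋ are base letters, not diacritics, so they survive.
function normalizeWord(word) {
  return word
    .normalize("NFD")          // split letters from their combining marks
    .replace(/\p{M}/gu, "")    // drop the combining marks
    .toLowerCase();
}

// Step 1: split on anything that is not a letter or combining mark,
// then normalize every resulting token.
function tokenize(sentence) {
  return sentence
    .split(/[^\p{L}\p{M}]+/u)
    .filter((token) => token.length > 0)
    .map(normalizeWord);
}

// Step 3: look up each token in an index that maps a normalized lemma
// to a Lexeme ID. Returns the distinct IDs matched by the sentence.
function matchSentence(sentence, lexemeIndex) {
  const ids = new Set();
  for (const token of tokenize(sentence)) {
    const id = lexemeIndex.get(token);
    if (id !== undefined) ids.add(id);
  }
  return [...ids];
}
```

For example, with `new Map([["pie", "L-EXAMPLE-1"]])` as the index, the input string `"Pie, PIEYA!"` tokenizes to `pie` and `pieya`, and only the first token matches under this exact lookup.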

The output is a JSON file where each Lexeme ID points to an array of matching example sentences, complete with audio URLs and source metadata. This file is uploaded to our R2 bucket as cv‑tokenized.json and eventually synced to users’ devices.
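
To make the shape of that output concrete, here is a hedged sketch of the grouping step. The field names (`text`, `audioUrl`, `source`) and the function name are illustrative assumptions, not the exact schema of cv‑tokenized.json.

```javascript
// Group matched sentences under every Lexeme ID they were linked to.
// Each input row is assumed to look like:
//   { sentence: string, audioUrl: string, lexemeIds: string[] }
function groupByLexeme(rows) {
  const byLexeme = {};
  for (const row of rows) {
    if (!row.audioUrl) continue; // sentences without a recording are left out
    for (const id of row.lexemeIds) {
      if (!byLexeme[id]) byLexeme[id] = [];
      byLexeme[id].push({
        text: row.sentence,
        audioUrl: row.audioUrl,
        source: "Mozilla Common Voice",
      });
    }
  }
  return byLexeme;
}

// Serialize the grouped examples as the JSON text that would be uploaded.
function toJson(rows) {
  return JSON.stringify(groupByLexeme(rows), null, 2);
}
```

Keying the object by Lexeme ID means the app can fetch all examples for a word with a single property lookup, at the cost of storing a sentence once per Lexeme it matches.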

We include only Mozilla Common Voice sentences that have audio. If a sentence lacks a recording, it is not used.

4. The Matching Challenge: Agglutinative Morphology

Dagbanli is an agglutinative language, meaning words are formed by adding suffixes to a base stem. For example, the verb “pie” (to milk or to line up in a row) can take on different forms: pieya, piemi, piela, piema, piemiya, etc. A simple exact match would miss “pieya” because it does not equal “pie”. We needed a more sophisticated approach.

Our solution is a 3‑tier matching strategy that we will detail in the next post. In short:

  • Tier 1 (exact match): The normalized token equals the Lexeme’s lemma.
  • Tier 2 (suffix stripping): Common Dagbanli suffixes are removed, and the remaining stem is checked.
  • Tier 3 (prefix progression): Characters are trimmed from the end until a match is found (fallback).
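
The three tiers can be sketched roughly as below. This is a simplified illustration and not our actual engine: the suffix list is taken only from the “pie” forms mentioned above, and the minimum stem length and function name are assumptions.

```javascript
// Illustrative suffixes drawn from the "pie" forms above, longest first.
// A real list would be larger and curated by speakers and linguists.
const SUFFIXES = ["miya", "ya", "mi", "la", "ma"];

// Assumed lower bound so that very short stems do not match everything.
const MIN_STEM_LENGTH = 2;

// Try to match a normalized token against a Set of normalized lemmas.
// Returns { lemma, tier } for the first tier that succeeds, or null.
function matchToken(token, lemmas) {
  // Tier 1: exact match.
  if (lemmas.has(token)) return { lemma: token, tier: 1 };

  // Tier 2: strip a known suffix and check the remaining stem.
  for (const suffix of SUFFIXES) {
    if (!token.endsWith(suffix)) continue;
    const stem = token.slice(0, -suffix.length);
    if (stem.length >= MIN_STEM_LENGTH && lemmas.has(stem)) {
      return { lemma: stem, tier: 2 };
    }
  }

  // Tier 3: trim characters from the end until a lemma is found.
  for (let end = token.length - 1; end >= MIN_STEM_LENGTH; end--) {
    const prefix = token.slice(0, end);
    if (lemmas.has(prefix)) return { lemma: prefix, tier: 3 };
  }
  return null;
}
```

Returning the tier alongside the lemma makes it easy to review the riskier Tier 3 matches separately, since trimming characters is the step most likely to link a token to the wrong word.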

We must be honest: this approach is not perfect. Dagbanli, like any natural language, is full of exceptions and irregular forms that do not follow clean rules. Our algorithm catches many cases, but it also misses some or makes incorrect matches. We are improving it one exception at a time, by manually reviewing problematic matches and adding custom rules. This is ongoing work, and we welcome contributions from linguists and native speakers.

For the purpose of this post, it is enough to know that the pipeline works well enough to surface thousands of useful examples, and we are committed to making it better over time.

5. Audio Integration and UI Placement

When the dictionary app displays a word, the main card shows Wikidata‑provided usage examples (text only) at the top. Below that, in a separate section titled “Mozilla Common Voice Audio Examples”, we show the matched sentences from Common Voice. Each example includes a speaker icon. Tapping it plays the audio through our existing audio player (see Post 3).

6. The Update Challenge

Since we launched the dictionary three months ago, new Dagbanli sentences have been recorded on Common Voice. However, Common Voice does not provide a live API. Instead, it releases new versions of the full dataset periodically. The current version is v24.0 (as of early 2026). To get the new recordings, we would need to download the next major release, re‑extract, re‑upload to R2, and rebuild our JSON.

This is a significant hurdle. It means our dictionary cannot instantly reflect new Common Voice contributions. We are exploring options to reduce this latency, such as setting up a notification system for new releases and automating the update process. In the long term, if Common Voice ever offers a direct streaming API, we would adopt it eagerly.
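
One small piece of that automation is deciding whether a newly announced release is actually newer than the one we last processed. The sketch below covers only that decision, under the assumption that release labels look like “v24.0”; the function names are hypothetical, and how the latest label is obtained is left open.

```javascript
// Parse a release label such as "v24.0" into [major, minor] numbers.
function parseVersion(label) {
  const match = /^v?(\d+)(?:\.(\d+))?$/.exec(label.trim());
  if (!match) throw new Error(`Unrecognized version label: ${label}`);
  return [Number(match[1]), Number(match[2] ?? 0)];
}

// True when `candidate` is a later release than `current`,
// meaning the download, upload, and rebuild steps should run again.
function isNewerRelease(candidate, current) {
  const [candMajor, candMinor] = parseVersion(candidate);
  const [curMajor, curMinor] = parseVersion(current);
  return candMajor > curMajor || (candMajor === curMajor && candMinor > curMinor);
}
```

A scheduled job could call `isNewerRelease(latestLabel, "v24.0")` and trigger the rebuild only when it returns true.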

7. Community Impact

Despite the update challenge, the integration of Common Voice has already created a positive feedback loop:

  • Contributors record sentences on Common Voice, knowing they are helping build an open speech dataset.
  • The dictionary uses those sentences as rich, real‑world examples, giving contributors visible feedback, once the dataset is updated, that their recordings are being used.
  • Learners hear authentic speech, improving their own pronunciation and understanding.
  • More contributors are motivated to record additional sentences, expanding the dataset and improving the dictionary.

We have already seen this cycle in action. Several Dagbanli speakers have recorded new sentences on Common Voice, and those sentences will appear in the dictionary after the next dataset release. It is a powerful example of open collaboration between a Wikimedia project and a community‑driven language tool.

Conclusion

Mozilla Common Voice gave our dictionary something new: spoken usage examples that bring words to life in everyday sentences. Each sentence and audio file adds depth, context, and authenticity. The pipeline we built to match sentences to Lexemes is not perfect, but it is effective enough to surface thousands of examples.

Common Voice and Wikidata are not our only sources of rich media. The University of Ghana’s Department of Computer Science Human-Computer Interaction (DCS HCI) Lab contributed a complementary dataset that includes visual descriptions as structured audio sentences. In the next post, we will describe how we merged these external sources and built the 3‑tier matching engine that powers our example system.
