We had the data model down. We had the Lexemes. But when a language writes “gballi” with a two‑letter “gb”, Unicode doesn’t care. It sees two separate characters. Dagbanli’s unique alphabet meant we couldn’t just sort alphabetically and call it done. We had to build something that understood the language on its own terms.
Introduction
If you’ve used Wikipedia, you’ve probably seen Wikidata in action. In many language versions of Wikipedia, the infoboxes on the right side of articles pull their data from Wikidata. But since 2018, Wikidata has offered something far more powerful for language work: a full lexicographical model for representing words. This means we can store not just that a word exists, but its meanings, its grammatical Forms, its pronunciation, and even example sentences, all in a structured, machine-readable way that anyone can reuse.
For the Dagbanli Dictionary, this model is the foundation. Every word you search, every Form you see, and every audio icon that appears originates from Wikidata Lexemes. In the first post, we explained why Dagbanli needs a digital dictionary and why we chose Wikidata as its backbone. Now let’s look under the hood at how we actually structure a language on Wikidata.
1. The Lexeme Data Model
What is a Lexeme?
In Wikidata, a Lexeme (L‑entity) represents a lexical unit: a word in a specific language. Each Lexeme has:
- Lemma: The word Form used as the headword (e.g., “kuli”)
- Language: Dagbanli is identified by the Wikidata Item Q32238
- Lexical category: noun (Q1084), verb (Q24905), etc.
Attached to each Lexeme are Senses (meanings) and Forms (grammatical variants).
Senses: Meaning in Context
A Sense (S‑entity) captures a specific meaning of the word. Each Sense can have:
- Glosses: short definitions in multiple languages (Dagbanli, Hausa, French, etc.)
- Statements: just like regular Wikidata Items, Senses can have claims. For example, an image (P18) illustrating the Sense, or a usage example (P5831) showing the word in a real sentence.
For example, the Lexeme for “kuli” (L307875) has:
- Sense 1: “hoe” (with glosses in Dagbanli, and other languages)
- Sense 2: “place name in Ghana” (with a usage example sentence)
- Sense 3: “funeral” (with an illustrative image)
Forms: The Building Blocks of Grammar
Forms (F-entities) represent the surface variants a word can take, that is, its morphological shapes. For a noun like “kuli”, Forms might include:
- Singular: “kuli”
- Plural: “kuya”
Each Form has:
- Representation: the actual string (e.g., “kuli”)
- Grammatical features: a set of Items describing the Form’s grammatical role, e.g., plural (Q146786), singular (Q110786)
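To make the shape of this model concrete, here is a minimal sketch in TypeScript. The interface names, the helper `formsWithFeature`, and the Sense and Form ID suffixes are assumptions for illustration; the JSON that Wikidata actually returns is more deeply nested.

```typescript
// Hypothetical, simplified types; not the dictionary's actual definitions.
interface Form {
  id: string;                    // e.g. "L307875-F2" (suffix numbering assumed)
  representation: string;        // the surface string
  grammaticalFeatures: string[]; // Item IDs such as Q146786 (plural)
}

interface Sense {
  id: string;
  glosses: Record<string, string>; // language code -> short definition
}

interface Lexeme {
  id: string;
  lemma: string;
  language: string;        // Item ID of the language
  lexicalCategory: string; // Item ID of the category
  senses: Sense[];
  forms: Form[];
}

// The "kuli" entry described above, reduced to the fields in this sketch.
const kuli: Lexeme = {
  id: "L307875",
  lemma: "kuli",
  language: "Q32238",       // Dagbanli
  lexicalCategory: "Q1084", // noun
  senses: [
    { id: "L307875-S1", glosses: { en: "hoe" } },
    { id: "L307875-S2", glosses: { en: "place name in Ghana" } },
    { id: "L307875-S3", glosses: { en: "funeral" } },
  ],
  forms: [
    { id: "L307875-F1", representation: "kuli", grammaticalFeatures: ["Q110786"] },
    { id: "L307875-F2", representation: "kuya", grammaticalFeatures: ["Q146786"] },
  ],
};

// Return the representations of all Forms that carry a given grammatical feature.
function formsWithFeature(lexeme: Lexeme, featureId: string): string[] {
  return lexeme.forms
    .filter((form) => form.grammaticalFeatures.indexOf(featureId) !== -1)
    .map((form) => form.representation);
}
```

Calling `formsWithFeature(kuli, "Q146786")` picks out the plural Form, which is how a card could label “kuya” as plural.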
How This Maps to a Dictionary Entry
When you look up a word on dagbanli.info, here’s what happens behind the scenes:
- The app searches for the matching Lexeme. It queries our pre‑harvested data (or falls back to a live Wikidata query) to find the Lexeme that contains the Form you typed. For example, searching for “kuya” finds the plural Form, which points back to its parent Lexeme L307875 (the dictionary entry for “kuli”).
- It retrieves the full Lexeme record. Once the Lexeme is identified, the app loads all of its associated data: the headword (lemma), lexical category (noun, verb, etc.), all Senses (meanings), and all grammatical Forms.
- Senses are grouped under the headword. Each Sense appears with its glosses (Dagbanli first, followed by glosses in any other languages recorded on the Sense), its synonyms (P5973) and antonyms (P5974), and any attached images (P18). The Item for this sense (P5137) is used to retrieve related videos, fill in missing images, and supply labels where a Dagbanli gloss is missing. If a Sense has multiple glosses, they’re displayed together.
- Forms are listed with their grammatical labels. All variants (singular, plural, etc.) appear below the Senses, each labeled with its grammatical function (e.g., “plural”, “singular”). If a Form has pronunciation audio (P443), a speaker icon appears next to it, along with the speaker’s dialect (P5237).
- The result is rendered in a clean card layout. The app organizes everything into a structured, readable format so you can quickly navigate between the headword, its meanings, and its Forms.
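The first step above, going from a typed Form to its parent Lexeme, can be sketched as a simple lookup table. This is an illustration under assumptions, not the app’s actual code: `HarvestedLexeme`, `buildFormIndex`, and `lookupLexemeIds` are hypothetical names.

```typescript
// Hypothetical minimal record; the harvested data holds far more per Lexeme.
interface HarvestedLexeme {
  id: string;
  lemma: string;
  forms: string[]; // Form representations
}

// Map every lemma and Form representation (lowercased) to the Lexeme IDs that contain it.
function buildFormIndex(lexemes: HarvestedLexeme[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const lexeme of lexemes) {
    const keys = [lexeme.lemma].concat(lexeme.forms);
    for (const raw of keys) {
      const key = raw.toLowerCase();
      const ids = index.get(key) ?? [];
      // A lemma is usually also a Form, so avoid listing the same Lexeme twice.
      if (ids.indexOf(lexeme.id) === -1) ids.push(lexeme.id);
      index.set(key, ids);
    }
  }
  return index;
}

// Resolve a user query to the Lexemes that contain a matching Form.
function lookupLexemeIds(index: Map<string, string[]>, query: string): string[] {
  return index.get(query.trim().toLowerCase()) ?? [];
}

const formIndex = buildFormIndex([
  { id: "L307875", lemma: "kuli", forms: ["kuli", "kuya"] },
]);
```

With this index, both `lookupLexemeIds(formIndex, "kuli")` and `lookupLexemeIds(formIndex, "kuya")` resolve to the same Lexeme, L307875. The index returns a list because several Lexemes can share a Form.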
In short, looking up “kuli” or “kuya” (plural of “kuli”) takes you directly to the Lexeme L307875. You’ll see:
- Headword: kuli
- Category: noun
- Senses: hoe, funeral, place name
- Forms: kuli (singular), kuya (plural), with audio icons and IPA transcriptions where available
This structure means the dictionary is never just a flat wordlist. It’s a rich, interconnected graph of linguistic data, all powered by Wikidata.
2. Properties We Use
Wikidata’s Property system lets us attach rich information to Lexemes, Senses, and Forms. Here are the key properties powering the dictionary:
| Property | ID | What it does | Where it appears |
|---|---|---|---|
| Pronunciation audio | P443 | Links a form to an audio file on Wikimedia Commons | Audio icon on forms |
| Usage example | P5831 | Provides an example sentence at the Lexeme level | “Usage examples” section on Sense |
| Image | P18 | Attaches an image directly to a Sense | Under Sense in word card |
| Item for this sense | P5137 | Links a Sense to a Wikidata Item (e.g., a concept, place, or thing) | Supplies labels in place of missing glosses, and images or video where none is attached to the Lexeme |
| Synonyms | P5973 | Links to another Lexeme with similar meaning | Under Sense section |
| Antonyms | P5974 | Links to another Lexeme with opposite meaning | Under Sense section |
| Pronunciation variety | P5237 | Indicates dialectal variation (e.g., Tomosili, Nayahili, Nanuni) | Shown near audio icon |
| IPA transcription | P898 | Provides International Phonetic Alphabet representation | Shown next to Forms for precise pronunciation guidance |
The Harvest Query
Every six hours, an automated cron job (built into our Cloudflare Worker) triggers a refresh of the dictionary data. This process runs a SPARQL query that fetches all Dagbanli Lexemes in one go. The query (simplified) looks like this.
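As a rough sketch only, a query of this kind could be assembled as follows. The function names are hypothetical and the selected variables are an assumption; the predicates (`dct:language`, `wikibase:lemma`, `wikibase:lexicalCategory`, `ontolex:lexicalForm`, `ontolex:representation`) are the ones the Wikidata Query Service uses for Lexemes, and their prefixes are predefined there.

```typescript
// Hypothetical helper: builds a simplified harvest query for one language.
function buildHarvestQuery(languageQid: string): string {
  return `
SELECT ?lexeme ?lemma ?category ?form ?representation WHERE {
  ?lexeme dct:language wd:${languageQid} ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory ?category .
  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
  }
}`.trim();
}

// Hypothetical helper: turns the query into a GET URL for the public endpoint.
function buildQueryUrl(query: string): string {
  return (
    "https://query.wikidata.org/sparql?format=json&query=" +
    encodeURIComponent(query)
  );
}

// Q32238 is the Wikidata Item for Dagbanli.
const harvestQuery = buildHarvestQuery("Q32238");
const harvestUrl = buildQueryUrl(harvestQuery);
```

Because each Lexeme has several Forms, a query shaped like this yields one row per Lexeme and Form pair.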
This returns thousands of rows. To stay within Cloudflare’s CPU limits, the actual harvest is chunked: each cron invocation processes a batch of Lexemes, and the job resumes where it left off in the next run until all Lexemes are fully hydrated. The final nested JSON is stored in R2 and served via the /data endpoint. The dictionary app (with the live toggle off) always reads this cached, pre‑processed data, ensuring fast and reliable lookups.
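The resume-where-it-left-off behaviour can be sketched with a small cursor. This is a minimal illustration, assuming the list of Lexeme IDs is known up front and that the cursor is persisted somewhere between cron runs; `HarvestState` and `nextBatch` are hypothetical names, and how the real Worker stores its progress is not shown here.

```typescript
// Hypothetical progress record that would be saved between cron invocations.
interface HarvestState {
  cursor: number; // index of the next Lexeme ID to process
  done: boolean;  // true once every ID has been handed out
}

// Take the next slice of IDs and return it together with the updated state.
function nextBatch(
  allIds: string[],
  state: HarvestState,
  batchSize: number
): { batch: string[]; state: HarvestState } {
  const batch = allIds.slice(state.cursor, state.cursor + batchSize);
  const cursor = state.cursor + batch.length;
  return { batch, state: { cursor, done: cursor >= allIds.length } };
}

// Simulate three cron runs over five Lexemes with a batch size of two.
const ids = ["L1", "L2", "L3", "L4", "L5"];
const run1 = nextBatch(ids, { cursor: 0, done: false }, 2);
const run2 = nextBatch(ids, run1.state, 2);
const run3 = nextBatch(ids, run2.state, 2);
```

Each run does a bounded amount of work, which is the property that keeps a single invocation inside a CPU budget.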
The live toggle is available for those rare moments when you need to see edits made within the last few hours, but for everyday use, the cached data is more than sufficient and provides a much smoother experience.
3. The Digraph Challenge
Dagbanli’s alphabet is not a simple subset of the Latin alphabet. It includes several digraphs (two characters that together represent a single letter), as well as a glottal stop mark that counts as a letter in its own right:
| Letter | Example word |
|---|---|
| GB | gballi (mat of woven grass) |
| KP | kpɛŋ (strength) |
| NY | nyuli (envy) |
| CH | chɔɣu (place name) |
| SH | shinkaafa (rice) |
| ŊM | ŋmani (look-like) |
| ʼ (glottal stop) | do’biɛɣu (ugly male) |
The Sorting Problem
In standard Unicode sorting, “gballi” would be treated as “g” + “b” + “a”. It would appear under G, not under GB as a distinct letter. This is wrong for Dagbanli. A native speaker expects words beginning with “gb” to be grouped together, not scattered among other g‑words.
Our Solution: A Custom Alphabet
We defined a constant array in the codebase that lists Dagbanli’s letters in their correct order:
```typescript
// Dagbanli letters in dictionary order; digraphs are letters in their own right.
export const DAGBANLI_ALPHABET = [
  'A', 'B', 'CH', 'D', 'E', 'Ɛ', 'F', 'G', 'GB', 'Ɣ', 'H', 'I',
  'J', 'K', 'KP', 'L', 'M', 'N', 'NY', 'Ŋ', 'ŊM', 'O', 'Ɔ', 'P',
  'R', 'S', 'SH', 'T', 'U', 'V', 'W', 'Y', 'Z', 'Ʒ', 'ʼ'
];
```
When we need to:
- Find a word’s first letter: we check digraphs before single characters. “gballi” –> check “GB” first (match), not “G”.
- Sort Lexemes: we use this array as the ordering reference, not Unicode code points.
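Both operations can be sketched in a few lines. This is an illustration of the digraph-first idea, not a copy of the repository’s code: `toLetters`, `firstLetter`, and `compareDagbanli` are hypothetical names, and splitting the whole word into letters before comparing is an assumption about how the ordering is applied.

```typescript
const DAGBANLI_ALPHABET: string[] = [
  'A', 'B', 'CH', 'D', 'E', 'Ɛ', 'F', 'G', 'GB', 'Ɣ', 'H', 'I',
  'J', 'K', 'KP', 'L', 'M', 'N', 'NY', 'Ŋ', 'ŊM', 'O', 'Ɔ', 'P',
  'R', 'S', 'SH', 'T', 'U', 'V', 'W', 'Y', 'Z', 'Ʒ', 'ʼ'
];

// Try longer letters first so "GB" wins over "G" at the same position.
const LETTERS_LONGEST_FIRST: string[] = DAGBANLI_ALPHABET.slice().sort(
  (a, b) => b.length - a.length
);

// Split a word into Dagbanli letters, skipping characters outside the alphabet.
function toLetters(word: string): string[] {
  const upper = word.toUpperCase();
  const letters: string[] = [];
  let i = 0;
  while (i < upper.length) {
    let matched = "";
    for (const letter of LETTERS_LONGEST_FIRST) {
      if (upper.slice(i, i + letter.length) === letter) {
        matched = letter;
        break;
      }
    }
    if (matched === "") {
      i += 1; // unknown character (space, hyphen, ...): ignore it
      continue;
    }
    letters.push(matched);
    i += matched.length;
  }
  return letters;
}

// The letter a word is filed under, or null if it has no recognizable letter.
function firstLetter(word: string): string | null {
  return toLetters(word)[0] ?? null;
}

// Compare two words letter by letter using alphabet positions, not code points.
function compareDagbanli(a: string, b: string): number {
  const ra = toLetters(a).map((l) => DAGBANLI_ALPHABET.indexOf(l));
  const rb = toLetters(b).map((l) => DAGBANLI_ALPHABET.indexOf(l));
  const shared = Math.min(ra.length, rb.length);
  for (let i = 0; i < shared; i++) {
    if (ra[i] !== rb[i]) return ra[i] - rb[i];
  }
  return ra.length - rb.length; // a prefix sorts before the longer word
}

const sorted = ["kpɛŋ", "ŋmani", "kuli", "nyuli", "gballi"].sort(compareDagbanli);
// sorted: ["gballi", "kuli", "kpɛŋ", "nyuli", "ŋmani"]
```

Note that “kuli” sorts before “kpɛŋ” here because K precedes KP in the alphabet, whereas a code point sort would put “kpɛŋ” first. One limitation of this greedy approach is that it always reads a sequence such as “ny” as the digraph, even where the two characters might belong to separate letters.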
The Gballi Browser
This is why the Gballi Browser (the “Browse” feature) works correctly. When you type “GB”, you see words like “gballi”, “gbana”, “gbina”, all grouped under their proper letter. To our knowledge, no other Dagbanli dictionary does this; most assume that Latin alphabet sorting is sufficient.
Figure: Gballi Browser showing the full alphabet grid with digraph buttons
4. Special Characters: ɛ, ɔ, ɣ, ŋ, ʒ
Figure: Dagbanli special characters and digraphs
Beyond digraphs, Dagbanli uses several special characters not found on standard keyboards:
- Ɛ ɛ (open e)
- Ɔ ɔ (open o)
- Ɣ ɣ (voiced velar fricative)
- Ŋ ŋ (velar nasal)
- Ʒ ʒ (ezh, voiced postalveolar fricative)
- ʼ (glottal stop)
Typing Made Easy
If you’re a speaker trying to look up “Ʒiɛɣu”, how do you type the “Ʒ” on a phone keyboard? Most users can’t.
Our solution: a special character bar that appears whenever the search input is focused. It shows buttons for each special character; tapping one inserts it at the cursor position.
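The insertion step itself is a small string operation. Here is a minimal sketch, assuming the handler reads the input’s `selectionStart` and `selectionEnd` and passes them in; `insertAtCursor` is a hypothetical name, not necessarily what the app uses.

```typescript
// Replace the current selection (or the empty selection at the cursor) with a character.
// Returns the new text and the position where the cursor should be placed afterwards.
function insertAtCursor(
  value: string,
  selectionStart: number,
  selectionEnd: number,
  character: string
): { value: string; cursor: number } {
  const before = value.slice(0, selectionStart);
  const after = value.slice(selectionEnd);
  return {
    value: before + character + after,
    cursor: selectionStart + character.length,
  };
}

// The user has typed "paa" and placed the cursor after "pa", then taps "ɣ".
const result = insertAtCursor("paa", 2, 2, "ɣ");
// result.value is "paɣa", result.cursor is 3
```

In a browser, the handler would then write `result.value` back to the input and restore the cursor with `setSelectionRange(result.cursor, result.cursor)`, so the user can keep typing where they left off.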
Figure: Special character bar (ɛ, ɣ, ŋ, ɔ, ʒ, ʼ) below the search input
Normalization Without Loss
When users search, we need to be flexible, but not too flexible. We lowercase the input and strip punctuation, but we never collapse “ɛ” to “e” or “ɔ” to “o” in our internal data representation, because these distinctions are essential for linguistic accuracy.
However, we also understand that users may type ‘e’ when they mean ‘ɛ’, or ‘o’ when they mean ‘ɔ’, because of keyboard limitations or uncertainty. To handle this, we use a fuzzy search that also returns matches where a typed character could stand in for a similar special character. For example, if a user searches for “paga” with a standard ‘g’, the results include “paɣa” (woman), because the search treats ‘g’ as a likely substitute for ‘ɣ’. The special character bar remains available for users who want to type precisely, while the fuzzy search catches those who don’t. This way the dictionary stays accessible to all users regardless of their keyboard setup.
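One simple way to get this behaviour is to compare words on a derived search key while leaving the stored words untouched. The sketch below is an assumption about how such matching could work, not the dictionary’s actual algorithm; `FOLD`, `foldKey`, and `fuzzySearch` are hypothetical names, and the table covers only the three substitutions mentioned above.

```typescript
// Plain-keyboard stand-ins for special characters (only the pairs named in the text).
const FOLD: Record<string, string> = { "ɛ": "e", "ɔ": "o", "ɣ": "g" };

// Derive a comparison key; this is never written back to the stored data.
function foldKey(text: string): string {
  return text
    .toLowerCase()
    .split("")
    .map((ch) => FOLD[ch] ?? ch)
    .join("");
}

// Exact matches come first, then words that match only after folding.
function fuzzySearch(query: string, words: string[]): string[] {
  const q = query.trim().toLowerCase();
  const key = foldKey(q);
  const exact = words.filter((w) => w.toLowerCase() === q);
  const loose = words.filter(
    (w) => w.toLowerCase() !== q && foldKey(w) === key
  );
  return exact.concat(loose);
}

const words = ["kuli", "paɣa", "gballi"];
const hits = fuzzySearch("paga", words);
// hits: ["paɣa"]
```

Because the key is computed on the fly, the word list still contains “paɣa” with its ‘ɣ’ intact, which keeps the linguistic distinction in the data while being forgiving at search time.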
5. How We Query It All
The dictionary’s backend (our Cloudflare Worker) exposes endpoints like `/search?q=...`. But the heavy lifting happens in the harvest process:
- SPARQL query (as above) fetches all Dagbanli Lexemes, Senses, Forms, and their claims.
- Hydration: we make additional `wbgetentities` calls to pull full data for each Lexeme (Wikidata’s API returns more detail than SPARQL).
- Indexing: we build in‑memory lookup tables: lemma –> Lexeme ID, Form representation –> Lexeme ID, etc.
- Caching: the final nested JSON is stored and served from R2.
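The hydration step can be sketched as building batched `wbgetentities` requests. This is an illustration under assumptions: `chunk` and `buildHydrationUrls` are hypothetical helpers, and the batch size of 50 reflects the API’s per-request ID limit for regular clients rather than a value taken from the dictionary’s code.

```typescript
const WIKIDATA_API = "https://www.wikidata.org/w/api.php";

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// One request URL per group of Lexeme IDs; IDs are joined with "|" as the API expects.
function buildHydrationUrls(lexemeIds: string[], batchSize: number = 50): string[] {
  return chunk(lexemeIds, batchSize).map(
    (group) =>
      WIKIDATA_API +
      "?action=wbgetentities&format=json&ids=" +
      encodeURIComponent(group.join("|"))
  );
}

// Example: 120 made-up IDs (L1..L120) produce three requests of 50, 50, and 20 IDs.
const exampleIds: string[] = [];
for (let n = 1; n <= 120; n++) exampleIds.push("L" + n);
const hydrationUrls = buildHydrationUrls(exampleIds);
```

Each response would then be merged into the nested JSON before the lookup tables are built.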
From day one, we made a deliberate choice: by default, the app never queries Wikidata live. Instead, on first visit, users are invited to download a small language pack (~2 MB) for offline use. Once the pack is cached locally in IndexedDB, every lookup is served from the device, so it is fast and keeps working without a network connection. The initial page load stays under 100 KB.
Conclusion
Wikidata gives us a powerful, community‑maintained backbone for the dictionary. But making that data work for Dagbanli meant respecting the language’s own writing system: digraphs as first‑class letters, special characters preserved, sorting by Dagbanli rules, not Unicode’s. The digraph‑first letter detection algorithm is just a few lines of code but took days to get right. It’s open source. Check `src/lib/constants.ts` and `src/lib/gballi.ts` in the repository if you’re curious. The result is a dictionary that feels native to its users, not a Latin‑alphabet tool awkwardly repurposed for an African language.
In the next post, we’ll dive into the audio pipeline: how we extract pronunciation recordings from Wikimedia Commons, handle iOS Safari’s OGG incompatibility, and build an audio index for fast filtering.



