What is lexicographical data?

Group photo of Wikidata Trainings For Turkic Wikimedians conference, Istanbul 2022

In late November 2022, Üsküdar University in Istanbul hosted the “Wikidata Training for Turkic Wikimedians” conference, organized by Wikimedia UG Turkey & Turkic Languages UG. The main aim of the workshop was to improve Wikimedia projects, with a particular focus on Wikidata, while building the skills of participants, most of whom came from Turkic-speaking regions. On the second day of the conference, Asaf Bartov introduced a previously unfamiliar topic: lexicographical data, or simply Lexemes.

Wikidata, launched in 2012, initially focused on concepts: Q-items describe things and ideas rather than the words that name them. Since 2018, Wikidata has also stored a new type of data: words, phrases, and sentences in many languages, each described in many languages. This linguistic information is stored in new types of entities known as Lexemes (L), Forms (F), and Senses (S).
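To make the three entity types more concrete, here is a minimal Python sketch of how a lexeme, its forms, and its senses relate to each other. The class names, field names, and the sample entry (including the ID L12345 and its gloss) are invented for illustration and are not Wikidata's own code. The only convention taken from Wikidata is the ID pattern, where a form of a lexeme looks like L12345-F1 and a sense looks like L12345-S1.

```python
import re
from dataclasses import dataclass, field

# IDs look like "L<n>" (lexeme), "L<n>-F<m>" (form) or "L<n>-S<m>" (sense).
_ID_PATTERN = re.compile(r"^(L\d+)(?:-([FS])(\d+))?$")


@dataclass
class Form:
    """One inflected form of a word, e.g. "closes"."""
    id: str
    representation: str
    grammatical_features: list = field(default_factory=list)


@dataclass
class Sense:
    """One meaning of a word, described by a short gloss."""
    id: str
    gloss: str


@dataclass
class Lexeme:
    """A word in one language, with its forms and senses."""
    id: str
    lemma: str
    language: str
    lexical_category: str
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)


def classify_id(entity_id: str) -> tuple:
    """Return (kind, parent lexeme ID) for a lexeme, form, or sense ID."""
    match = _ID_PATTERN.match(entity_id)
    if match is None:
        raise ValueError(f"not a lexeme, form, or sense ID: {entity_id!r}")
    lexeme_id, kind, _index = match.groups()
    kinds = {None: "lexeme", "F": "form", "S": "sense"}
    return kinds[kind], lexeme_id


# Made-up sample entry; the ID and the gloss are placeholders, not real data.
EXAMPLE = Lexeme(
    id="L12345",
    lemma="close",
    language="English",
    lexical_category="verb",
    forms=[
        Form("L12345-F1", "close", ["present tense"]),
        Form("L12345-F2", "closes", ["present tense", "third person singular"]),
        Form("L12345-F3", "closed", ["past tense"]),
    ],
    senses=[Sense("L12345-S1", "to shut")],
)

if __name__ == "__main__":
    for form in EXAMPLE.forms:
        print(classify_id(form.id), form.representation)
```

The point of the sketch is the nesting: a lexeme owns its forms and senses, and the ID of each form or sense already tells you which lexeme it belongs to.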

The goal is to describe words in all languages precisely and in a structured way, and to link them directly to the concepts they express. Like the rest of Wikidata, this structured data is reusable, and the community can draw on it for a wide range of tools and queries. Lexicographical data can be especially helpful for Wiktionary.

The logo of the lexicographical data project in Wikidata (CC0 1.0 Universal Public Domain Dedication)

Within this project, homographs, that is, words that share a spelling but differ in pronunciation or meaning (e.g., live /lɪv/ and live /laɪv/, or close, meaning near, and close, meaning to shut), are linked to one another. The project also identifies words that play more than one grammatical role. An essential part of the work is providing examples of the morphological and semantic variations of each word, each accompanied by a usage example. Words and their meanings are tagged with information about register and usage. For instance, “man” is typically used in official and academic texts, whereas “guy” is colloquial.

With Wikidata lexemes, one can identify the roots of words, even in compound words, and each component can have its own entry. The project also makes it possible to record the meanings of a word in various languages and its exact equivalents in those languages. It offers further options as well, such as adding images, recordings of human voices, and International Phonetic Alphabet (IPA) transcriptions for vocabulary. A particularly useful feature is that words can be linked to their corresponding Wikidata items, where users can find in-depth explanations that go beyond the abstract meaning of the word.

This project opens up many possibilities, and I will outline the most important ones:

  • Machine Translation: Unlike existing translation systems that rely on word-to-word matching, this data could support translation software that takes the meaning of the text into account, because the meanings of words are recorded in various languages.
  • Text-to-Speech: Where words carry pronunciation and IPA data, it becomes feasible to build software that reads texts aloud for visually impaired people.
  • Grammar and Spelling Checking Tools: The data generated by this project can be used to build tools that detect and correct grammar and spelling errors.
  • Flashcards: Flashcards are highly effective in language teaching, and many learners use them to pick up new languages. The data from this project is a valuable resource for creating flashcards.
  • Grammar Practice: Because the project records the grammatical attributes of words, the data can be used for grammar practice. Software can be built to help learners correct their grammatical mistakes.
  • Pronunciation Practice: Where a word has a pronunciation audio file, learners can compare their own pronunciation with that of native speakers.
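As a small illustration of the flashcard idea, the sketch below turns one lexeme record into front and back pairs. The function name make_flashcards and the sample record are made up for this post; the dictionary loosely follows the layout Wikidata uses for lexemes (lemmas and glosses keyed by language code), but please treat that layout as an assumption and check it against real data before relying on it.

```python
def make_flashcards(lexeme: dict, lang: str) -> list:
    """Build (front, back) pairs: the lemma on the front, one gloss on the back."""
    lemma_entry = lexeme.get("lemmas", {}).get(lang)
    if not lemma_entry:
        # No lemma in the requested language, so there is nothing to show.
        return []
    cards = []
    for sense in lexeme.get("senses", []):
        gloss = sense.get("glosses", {}).get(lang)
        if gloss:
            cards.append((lemma_entry["value"], gloss["value"]))
    return cards


# Invented sample record; the ID and glosses are placeholders, not real data.
SAMPLE_LEXEME = {
    "id": "L12345",
    "lemmas": {"en": {"language": "en", "value": "bank"}},
    "senses": [
        {"id": "L12345-S1",
         "glosses": {"en": {"language": "en", "value": "financial institution"}}},
        {"id": "L12345-S2",
         "glosses": {"en": {"language": "en", "value": "land alongside a river"}}},
    ],
}

if __name__ == "__main__":
    for front, back in make_flashcards(SAMPLE_LEXEME, "en"):
        print(f"{front} -> {back}")
```

Each sense becomes its own card, so a word with several meanings is practised one meaning at a time.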

You now have the opportunity to become the first contributor to this pioneering project in your country. Because the project is new and participation is still limited, its user community is welcoming and committed to helping newcomers. If you have questions, please use the Telegram group at the following address: Telegram Group Link.
