How I Created 33,000 Arabic Lexemes in Wikidata

Lexemes: Wikidata Days

In a world driven by semantic search, artificial intelligence, and structured data, language is no longer merely a means of communication; it has become a data layer upon which modern systems are built.

Arabic, despite its rich morphology and derivational system, has remained poorly represented in Wikidata. Before this project began, there were no more than 2,500 Arabic lexemes, and many of them lacked essential morphological and derivational data. Roots were often unlinked, patterns were incomplete, verbs had no forms, and lexemes lacked a clear ontological framework.

To address this gap, I worked on the “Arabic Lexeme Enrichment” project from August to December 2025. I approached it not as a partial improvement effort, but as a practical attempt to build a coherent Arabic linguistic layer within the platform that Wiktionary and other systems can later rely on.

From a Clear Gap to a Structured Approach

I started by developing a dashboard to understand the actual state of Arabic lexemes. As I analyzed the data, it became clear that the gap was much larger than I had expected, but also that closing it was achievable with the right tools and methodology.
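
To give a sense of the kind of question such a dashboard answers, here is a minimal sketch that builds a SPARQL query counting lexemes per lexical category. It is not the dashboard's actual code. The function name is hypothetical, and it assumes that Q13955 is the Wikidata item for Arabic and that the query is run on the Wikidata Query Service, where the `wd:`, `dct:`, and `wikibase:` prefixes are predefined.

```python
# Assumption: Q13955 is the Wikidata item for the Arabic language.
ARABIC_QID = "Q13955"


def count_lexemes_by_category_query(language_qid: str = ARABIC_QID) -> str:
    """Return a SPARQL query that counts lexemes per lexical category.

    Hypothetical helper for illustration; it only builds the query text
    and does not contact any server.
    """
    return f"""
SELECT ?category (COUNT(?lexeme) AS ?count) WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory ?category .
}}
GROUP BY ?category
ORDER BY DESC(?count)
""".strip()


if __name__ == "__main__":
    # Paste the printed query into the Wikidata Query Service to run it.
    print(count_lexemes_by_category_query())
```

Running the same query at intervals is one simple way to track how coverage of verbs, nouns, and adjectives changes over time.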

I relied on a combination of manual work and supporting tools to create lexemes, link them to Arabic roots, and add verb conjugations and derived nominal forms.
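
As an illustration of what "creating a lexeme with forms" involves, the sketch below assembles a lexeme as a plain data structure: a lemma, a language, a lexical category, and a list of forms tagged with grammatical features. The layout follows the JSON that Wikidata's `wbeditentity` action accepts for new lexemes as I understand it, so check it against the current WikibaseLexeme documentation before relying on it. The helper names are hypothetical, every item ID is an assumption to verify on Wikidata, and the code only builds the payload without sending anything.

```python
import json

# Assumed Wikidata item IDs; verify each one on Wikidata before use.
ARABIC_QID = "Q13955"          # Arabic (language)
VERB_QID = "Q24905"            # verb (lexical category)
THIRD_PERSON_QID = "Q51929074"
MASCULINE_QID = "Q499327"
SINGULAR_QID = "Q110786"


def build_form(representation, feature_qids):
    """Describe one inflected form and its grammatical features."""
    return {
        "add": "",  # marks the form as new in the edit payload
        "representations": {"ar": {"language": "ar", "value": representation}},
        "grammaticalFeatures": list(feature_qids),
        "claims": [],
    }


def build_lexeme(lemma, category_qid, forms=()):
    """Assemble the data for a new Arabic lexeme (hypothetical helper)."""
    return {
        "lemmas": {"ar": {"language": "ar", "value": lemma}},
        "language": ARABIC_QID,
        "lexicalCategory": category_qid,
        "forms": [build_form(rep, feats) for rep, feats in forms],
    }


if __name__ == "__main__":
    payload = build_lexeme(
        "كَتَبَ",
        VERB_QID,
        forms=[("كَتَبَ", [THIRD_PERSON_QID, MASCULINE_QID, SINGULAR_QID])],
    )
    print(json.dumps(payload, ensure_ascii=False, indent=2))
```

A full verb paradigm would simply add one entry per form to the `forms` list. Links to roots would be expressed as statements on the lexeme, which this sketch leaves out.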

When Numbers Become Structure

During this project, I was able to create more than 33,000 Arabic lexemes in Wikidata, covering verbs, nouns, adjectives, and roots. This was not merely quantitative expansion, but the result of a structured approach that included:

  • Organizing and linking patterns such as active participles, passive participles, and verbal nouns to their related lexemes.
  • Linking around 4,000 Arabic roots to historical dictionaries such as the Doha Historical Dictionary of Arabic and the Sharjah Historical Dictionary.
  • Building morphological conjugations for verbs, averaging around 120 forms per verb.
  • Developing an interactive dashboard to analyze Arabic lexemes, track data quality, detect duplication, and improve overall consistency.
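
The duplicate detection mentioned in the last point can be approximated with a simple grouping step. The sketch below is an assumption about one reasonable approach, not the dashboard's actual logic: it strips diacritics and the tatweel from each lemma, then groups lexemes that share the same bare spelling and lexical category. The record layout and the lexeme IDs in the sample are hypothetical.

```python
import unicodedata
from collections import defaultdict

TATWEEL = "\u0640"  # Arabic elongation character


def strip_diacritics(text):
    """Remove Arabic vowel marks (non-spacing marks) and the tatweel.

    Precomposed letters such as hamza-carrying alif are kept as they are,
    because the text is not decomposed first.
    """
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def find_duplicate_candidates(records):
    """Group lexeme IDs by (bare lemma, category); keep groups of 2 or more.

    `records` is a list of dicts with hypothetical keys
    "id", "lemma", and "category".
    """
    groups = defaultdict(list)
    for record in records:
        key = (strip_diacritics(record["lemma"]), record["category"])
        groups[key].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical sample data; the IDs are placeholders, not real lexemes.
    sample = [
        {"id": "L-A", "lemma": "كَتَبَ", "category": "verb"},
        {"id": "L-B", "lemma": "كتب", "category": "verb"},
        {"id": "L-C", "lemma": "كُتُب", "category": "noun"},
    ]
    print(find_duplicate_candidates(sample))
```

Stripping diacritics makes the check deliberately over-inclusive: two distinct verbs that differ only in vocalization or gemination would land in the same group. The output is therefore a list of candidates for human review, not a list of confirmed duplicates.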

Free Knowledge Is Not Built Individually

Although the work began as an individual effort, its most important turning point came when the community began to engage with it. This was reflected in receiving an invitation from the Arabic Wikidata community to deliver a training session as part of Wikidata Days. During this session, I presented the core outcomes of the project in a one-hour training, focusing on lexeme construction, morphological linking, and how to model Arabic within the platform. I also prepared a presentation consisting of 47 slides and published it as a PDF on Wikimedia Commons to serve as an open reference for anyone interested.

For me, this moment confirmed that the project was no longer just an individual effort, but had become part of a broader learning and community-driven process.

What Does This Mean for the Future?

Building a structured Arabic lexical layer within Wikidata is not an end in itself, but a foundational step toward:

  • Improving machine translation
  • Supporting semantic search
  • Enabling Arabic language processing tools
  • Developing applications that rely on structured linguistic data

All of these areas depend directly on the quality and depth of lexical data.

Conclusion

What began as an attempt to understand the data evolved into an effort to build it. What started as a simple tool became an entry point for reshaping how Arabic is represented within one of the most important open knowledge platforms. Creating more than 33,000 Arabic lexemes does not mark the end of the journey; rather, it lays the foundation for a structure that the community can build upon and expand. What has been achieved so far is only a first step toward strengthening the presence of Arabic in the semantic web and enriching its knowledge ecosystem.

In conclusion, I would like to thank the Million Wiki Project for their financial sponsorship of this project.

By: Mr. Ibrahem
