Language identification, commonly referred to as LID, plays a pivotal role in many natural language processing (NLP) systems. Consider the user interfaces we interact with daily. They usually have an option allowing users to specify the language of the content they’re dealing with. However, imagine if this manual selection step was bypassed and the system could predict the language on its own! This advancement would certainly elevate the user experience.

For instance, consider the need to view translated messages on platforms like Wikipedia Talk pages. If users are given translations without having to pinpoint the source language, their interaction with the platform becomes simpler. Another example is a machine translation system, where the user provides the source text and selects the target language, and the system automatically determines the source language through language identification.

While numerous LID tools exist, none can detect all of the 300+ languages in which Wikipedia is available. For perspective, the Compact Language Detector 2 library can identify 83 languages, whereas FastText’s lid model can identify 176 languages. A notable challenge is that many of these models don’t make their training data public.

This is where the project “An Open Dataset and Model for Language Identification”, led by researchers from the University of Edinburgh, steps in. Their efforts have culminated in a dataset and a model that can detect 201 languages. This potentially makes it the most capable LID system available.

In light of this development, the Language team, in collaboration with the Machine Learning team, is introducing a new API designed to predict the language of any given text. The API is hosted on LiftWing, Wikimedia’s scalable machine learning model-serving infrastructure.

An animated image showing text in various languages identified

Using the API

Please refer to the API documentation on the Wikimedia API Portal.

An example using curl:

$ curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Some sample text in any language that we want to identify"}' -H "Content-type: application/json"
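
For readers who prefer to call the API from a script, the curl request above could be sketched in Python roughly as follows. This is a minimal sketch using only the standard library, not an official client. The function names `build_request` and `identify_language` are hypothetical, the User-Agent value is a placeholder you should replace with your own contact details, and the shape of the response is not assumed here: the parsed JSON is simply printed. Please check the API documentation for the actual response fields.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict"

# Placeholder value (assumption): Wikimedia asks API clients to identify
# themselves, so replace this with your own tool name and contact address.
USER_AGENT = "langid-example/0.1 (your-email@example.org)"


def build_request(text: str) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "User-Agent": USER_AGENT,
        },
        method="POST",
    )


def identify_language(text: str, timeout: float = 10.0):
    """Send the text to the API and return the parsed JSON response as-is."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints whatever JSON the service returns; no field names are assumed.
    print(identify_language("Some sample text in any language that we want to identify"))
```

Keeping the request construction separate from the network call makes it possible to inspect the payload and headers without contacting the service.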

For potential usage, ethical considerations, caveats, and recommendations, please see the model card.

Thanks

We thank Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield of the University of Edinburgh for their research and the model that made this API possible.
