Wikipedia seeks to speak your language

Photo by Colin, CC BY-SA 4.0.

Wikipedia readers speak many languages, so it’s no surprise that they sometimes search for phrases that are not in the language of the wiki they’re currently reading. Unfortunately, this can lead to poor search results. A recent survey we completed on English Wikipedia identified searches done in 40 different languages! The Wikimedia Discovery department wants to help people easily find what they are looking for. To do this, the Discovery Search team is rolling out new language identification software to the Wikipedia search engine.
This new software will detect when a search is unsuccessful and the query appears to be in a different language. When this happens, the search results page will include results from the Wikipedia of the automatically detected language. These new cross-wiki results will be displayed along with the local-wiki results, if there are any. We’ve recently enabled language identification and cross-wiki search results on the English, French, German, Italian, and Spanish-language Wikipedias.
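To make the flow above concrete, here is a minimal Python sketch of the fallback logic. It is not the Discovery team's implementation: the function names, the `min_local_results` threshold used to decide that a search was "unsuccessful", and the toy index and detector are all assumptions made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Set


def search_with_language_fallback(
    query: str,
    local_wiki: str,
    search: Callable[[str, str], List[str]],
    detect_language: Callable[[str], Optional[str]],
    enabled_wikis: Set[str],
    min_local_results: int = 1,  # assumption: "unsuccessful" means fewer hits than this
) -> Dict[str, List[str]]:
    """Return results keyed by wiki code, adding cross-wiki hits when useful."""
    results = {local_wiki: search(local_wiki, query)}
    if len(results[local_wiki]) >= min_local_results:
        return results  # the local search succeeded, nothing more to do

    detected = detect_language(query)
    if detected and detected != local_wiki and detected in enabled_wikis:
        cross_wiki_hits = search(detected, query)
        if cross_wiki_hits:
            # Shown along with the local results, if there are any.
            results[detected] = cross_wiki_hits
    return results


# Hypothetical stand-ins for a real search backend and language detector.
FAKE_INDEX = {
    "en": {"moscow": ["Moscow"]},
    "ru": {"москва": ["Москва"]},
}


def fake_search(wiki: str, query: str) -> List[str]:
    return FAKE_INDEX.get(wiki, {}).get(query.lower(), [])


def fake_detect(query: str) -> Optional[str]:
    # Crude toy rule: any basic Cyrillic letter means "Russian".
    return "ru" if any("а" <= ch <= "я" for ch in query.lower()) else None


print(search_with_language_fallback("Москва", "en", fake_search, fake_detect, {"ru"}))
```

Passing the search and detection functions in as parameters keeps the sketch independent of any particular search engine or detector.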
Language detection can be difficult, especially for short snippets of text such as many Wikipedia searches. That’s what makes trying to get it to work so much fun, but sometimes it is nearly impossible to tell which language was intended. The word cubo, for example, might look Spanish, unless you know Portuguese, Italian, Galician, or even Latin!
The shorter the search, the more likely it is to be ambiguous, and the harder it is to detect the intended language.

Old search results on the English Wikipedia for a search in Russian. Screenshot by Deborah Tankersley, CC BY-SA 3.0.

New results on the English Wikipedia for a search in Russian, after the addition of language detection. Screenshot by Deborah Tankersley, CC BY-SA 3.0.

The next group of Wikipedias to have language detection enabled will include Indonesian, Japanese, Portuguese, and Russian. We are investigating ways to bring language detection to more Wikipedias and to other Wikimedia projects.
The Search team has other language detection ideas in the works. We’re thinking about ways to improve language detection with smarter measures of confidence. We are also exploring how to detect searches typed with a keyboard set to the wrong layout, so that the query comes out in one character set when another was intended. Early experiments with English and Russian are promising!
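As a rough illustration of the wrong-keyboard idea, the Python sketch below remaps text typed on a US QWERTY layout to the standard Russian ЙЦУКЕН layout and back. The post does not describe how the Search team's experiments work, so the function names and the simple key-by-key remapping are assumptions, not their method.

```python
# Keys of a US QWERTY layout and the letters on the same physical keys
# of the standard Russian layout, listed in the same order.
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`"
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбюё"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def en_to_ru(text: str) -> str:
    """Reinterpret Latin keystrokes as if the Russian layout had been active."""
    return text.lower().translate(EN_TO_RU)


def ru_to_en(text: str) -> str:
    """Reinterpret Cyrillic keystrokes as if the US layout had been active."""
    return text.lower().translate(RU_TO_EN)


print(en_to_ru("vjcrdf"))     # keystrokes for "москва" typed on a US layout
print(ru_to_en("цшлшзувшф"))  # keystrokes for "wikipedia" typed on a Russian layout
```

A search engine could retry an unsuccessful query with the remapped text and keep whichever version finds results. Deciding when such a retry is worthwhile is the harder problem, and this sketch does not address it.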
You can find technical details about our new language detection module (TextCat) on MediaWiki.org. PHP and updated Perl libraries are also available, with language models for dozens of languages. You can also test the language detection using our online demo, which lets you try all the different language models on your own text. It also includes tutorials and lots of additional information about TextCat’s internal workings.
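For readers curious about the general idea, TextCat-style identifiers are usually built on ranked character n-gram profiles (the Cavnar and Trenkle approach). The Python sketch below is a toy version of that technique, not the PHP or Perl library's API: the function names, the three tiny training sentences, and the parameter values are assumptions chosen for illustration, and real language models are trained on far more text.

```python
from collections import Counter
from typing import Dict, List, Tuple

TOP_K = 300  # profile size; also used as the penalty for an unseen n-gram


def ngram_profile(text: str, max_n: int = 3, top_k: int = TOP_K) -> Dict[str, int]:
    """Map each of the most frequent character n-grams to its rank (0 = most frequent)."""
    counts: Counter = Counter()
    for word in text.lower().split():
        token = f"_{word}_"  # pad so word starts and ends become n-grams too
        for n in range(1, max_n + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(query: Dict[str, int], model: Dict[str, int]) -> int:
    """Sum of rank differences; lower means the query looks more like the model."""
    return sum(
        abs(rank - model[gram]) if gram in model else TOP_K
        for gram, rank in query.items()
    )


def rank_languages(text: str, models: Dict[str, Dict[str, int]]) -> List[Tuple[str, int]]:
    """Return (language, distance) pairs, best match first."""
    query = ngram_profile(text)
    scores = {lang: out_of_place(query, model) for lang, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Toy training text, made up for this example.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the search "
          "engine returns the results that people are looking for",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el "
          "buscador devuelve los resultados que la gente está buscando",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку а затем "
          "поисковая система возвращает результаты которые ищут люди",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(rank_languages("где библиотека", MODELS))
print(rank_languages("cubo", MODELS))
```

A one-word query such as cubo contributes only a handful of n-grams, so there is little evidence to separate languages that share an alphabet. Comparing the gap between the best and second-best distances is one simple way to flag such low-confidence guesses.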
Let’s get searching—now with language detection and better results!
Deborah Tankersley, Product Manager, Discovery
Wikimedia Foundation

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

6 Comments

That’s SO COOL Ms. Tankersley! Thank you!

Awesome! This will help a lot of people.

wow, its amazing. Wiki Rocks.

Waiting for Telugu language integration. This will help Te-wiki to get more readers from English wiki also.

Hi,
There is actually a need for a reverse use case. If someone searches in English in Tamil Wikipedia, it should detect the correct Tamil article. This can be implemented using Wikidata links. Many non-roman script (especially, indic language users) language Wikipedia users search in English inside the search box. They actually hope to see an article in their own language. Just that they could not type in their own language. There are some Wikipedias which actually create 1000s of redirects with English titles to solve this problem.

[…] have happened, things like: adding ascii-folding and stemming, detecting when a visitor might be typing in a language that is different than the Wikipedia that they are on, switching from tf-idf to BM25, dropping trailing question […]