Wikipedia readers speak many languages, so it’s no surprise that they sometimes search in a language other than that of the wiki they’re currently reading. This, unfortunately, can lead to poor search results. A recent survey we completed on English Wikipedia identified searches in 40 different languages! The Wikimedia Discovery department wants to help people easily find what they are looking for. To do this, the Discovery Search team is rolling out new language identification software to the Wikipedia search engine.
This new software will detect when a search is unsuccessful and the query appears to be in a different language. When this happens, the search results page will include results from the Wikipedia of the automatically detected language. These new cross-wiki results will be displayed along with the local-wiki results, if there are any. We’ve recently enabled language identification and cross-wiki search results on the English, French, German, Italian, and Spanish-language Wikipedias.
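The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual search code: the `search` and `detect_language` callables, the `min_results` threshold for what counts as "unsuccessful", and the toy index are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional


def search_with_fallback(
    query: str,
    local_lang: str,
    search: Callable[[str, str], List[str]],
    detect_language: Callable[[str], Optional[str]],
    min_results: int = 1,
) -> Dict[str, List[str]]:
    """Return results keyed by wiki language code, local wiki first."""
    results = {local_lang: search(local_lang, query)}
    # Assumption: "unsuccessful" means fewer than min_results local hits.
    if len(results[local_lang]) >= min_results:
        return results
    detected = detect_language(query)
    # Only add cross-wiki results when a different language was detected.
    if detected is not None and detected != local_lang:
        cross_wiki = search(detected, query)
        if cross_wiki:
            results[detected] = cross_wiki
    return results


# Toy stand-ins so the sketch runs end to end.
TOY_INDEX = {"en": {"cube": ["Cube"]}, "ru": {"куб": ["Куб"]}}


def toy_search(lang: str, query: str) -> List[str]:
    return TOY_INDEX.get(lang, {}).get(query.lower(), [])


def toy_detect(query: str) -> Optional[str]:
    # Hypothetical detector: treat any basic Cyrillic letter as Russian.
    return "ru" if any("а" <= ch <= "я" for ch in query.lower()) else None


print(search_with_fallback("куб", "en", toy_search, toy_detect))
```

Keeping the local-wiki entry in the result even when it is empty mirrors the idea that cross-wiki results are shown alongside local ones rather than replacing them.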
Language detection can be difficult, especially for short snippets of text like many Wikipedia searches. That’s part of what makes the task so much fun, but sometimes it is nearly impossible to tell which language was intended. The word cubo, for example, might look Spanish, unless you know Portuguese, Italian, Galician, or even Latin!
The shorter the search, the more likely it is to be ambiguous, and the harder it is to detect the intended language.
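To see why short queries are hard, it helps to look at how little evidence they carry. TextCat-style identifiers compare ranked character n-gram profiles of the text against per-language models. The sketch below is a simplified illustration of that idea, not the TextCat implementation: the function names, the padding character, and the tiny one-sentence "models" are assumptions made for the example, and real models are trained on far more text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngram_profile(text: str, max_n: int = 3, top_k: int = 300) -> List[str]:
    """Character n-grams (n = 1..max_n), most frequent first."""
    padded = f"_{text.lower()}_"  # mark the start and end of the text
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))[:top_k]


def out_of_place(query_profile: List[str], model_profile: List[str]) -> int:
    """Sum of rank differences; n-grams unknown to the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(query_profile)
    )


def rank_languages(query: str, models: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """Return (distance, language) pairs, closest model first."""
    profile = ngram_profile(query)
    return sorted((out_of_place(profile, m), lang) for lang, m in models.items())


# Toy models built from a single sentence each; purely illustrative.
TOY_MODELS = {
    "es": ngram_profile("el cubo de agua está en la cocina"),
    "pt": ngram_profile("o cubo de gelo está na cozinha"),
    "it": ngram_profile("il cubo di ghiaccio è nella cucina"),
}

print(len(ngram_profile("cubo")))  # number of distinct n-grams in a short query
print(rank_languages("cubo", TOY_MODELS))
```

A four-letter query yields only a handful of distinct n-grams, and most of them are plausible in several related languages, so the distances to the candidate models tend to sit close together.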
The next group of Wikipedias to have language detection enabled will include Indonesian, Japanese, Portuguese, and Russian. We are investigating ways to bring language detection to more Wikipedias and to other Wikimedia projects.
The Search team has other language detection ideas and plans in the works. We’re thinking about ways to improve language detection with smarter measures of confidence. We are also exploring ways to detect searches typed with the wrong keyboard layout, where the text comes out in one character set although another was intended. Early experiments with English and Russian are promising!
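As an illustration of the wrong-keyboard case, the sketch below remaps text between the standard US QWERTY and Russian ЙЦУКЕН layouts key by key. It is only one possible first step and says nothing about how the Search team approaches the problem; the function names are hypothetical, and a real system would still need to decide whether the remapped text is more plausible than the original.

```python
# Keys in the same physical positions on the two layouts (lowercase only).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"

LATIN_TO_CYRILLIC = str.maketrans(QWERTY, JCUKEN)
CYRILLIC_TO_LATIN = str.maketrans(JCUKEN, QWERTY)


def to_cyrillic(text: str) -> str:
    """Reinterpret Latin keystrokes as if the Russian layout had been active."""
    return text.lower().translate(LATIN_TO_CYRILLIC)


def to_latin(text: str) -> str:
    """Reinterpret Cyrillic keystrokes as if the US layout had been active."""
    return text.lower().translate(CYRILLIC_TO_LATIN)


print(to_cyrillic("ghbdtn"))  # привет
print(to_latin("руддщ"))      # hello
```

Characters that are not on the mapped keys, such as spaces and digits, pass through unchanged.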
You can find technical details about our new language detection module (TextCat) on MediaWiki.org. PHP and updated Perl libraries are also available, and they include language models for dozens of languages. You can also test the language detection using our online demo. The demo lets you try all the different language models on your own text. It also includes tutorials and lots of additional information about TextCat’s internal workings.
Let’s get searching—now with language detection and better results!
Deborah Tankersley, Product Manager, Discovery