Search is an important part of any web app like a wiki, but search is harder than it looks — especially in a multilingual environment. MediaWiki has to support not just your standard Western languages like English and Spanish, but many more with special requirements:

Some can be written in multiple scripts (such as Serbian in Cyrillic or Latin), and searches should match text written either way.
Some languages, like Chinese and Japanese, don’t put spaces between words. To let the search index know where word boundaries are, we have to internally insert spaces between some characters:

维基百科 -> 维 基 百 科
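
To make the space-insertion step concrete, here is a rough Python sketch. It is not MediaWiki's actual code (which is PHP); the function name `segment_cjk` is made up for illustration, and it assumes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) needs splitting.

```python
import re

# Assumption: only characters in the basic CJK Unified Ideographs block
# (U+4E00..U+9FFF) are treated as one-character "words".
_CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def segment_cjk(text: str) -> str:
    """Put spaces around each CJK ideograph, then collapse extra whitespace."""
    spaced = _CJK_CHAR.sub(r" \1 ", text)
    return " ".join(spaced.split())


print(segment_cjk("维基百科"))        # 维 基 百 科
print(segment_cjk("MediaWiki 维基"))  # MediaWiki 维 基
```

Latin-script words pass through untouched, so mixed-language text still indexes sensibly.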

Then to add insult to injury, we need to fudge the Unicode characters to ensure things work reliably with older and newer versions of MySQL:

维基百科 -> u8e7bbb4 u8e59fba u8e799be u8e7a791
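
The tokens above are just `u8` followed by the hex of each character's UTF-8 bytes, which keeps the index ASCII-only. A small Python sketch of that mapping follows; the helper names are hypothetical, and leaving ASCII tokens unchanged is an assumption of this sketch rather than a statement about what MediaWiki does with them.

```python
def encode_token(token: str) -> str:
    """Encode a non-ASCII token as 'u8' + lowercase hex of its UTF-8 bytes."""
    if token.isascii():
        # Assumption: plain ASCII tokens are left alone in this sketch.
        return token
    return "u8" + token.encode("utf-8").hex()


def encode_for_index(segmented: str) -> str:
    """Encode every whitespace-separated token of already-segmented text."""
    return " ".join(encode_token(t) for t in segmented.split())


print(encode_for_index("维 基 百 科"))
# u8e7bbb4 u8e59fba u8e799be u8e7a791
```

For example, 维 is U+7EF4, which is the bytes E7 BB B4 in UTF-8, hence `u8e7bbb4`.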

For a long time, this word segmentation wasn’t being handled correctly for Chinese in our default MySQL search backend, so searching for a multi-character word often gave false matches where the characters were all present, but not together.
This is now fixed in MediaWiki 1.16; the intermediate query representation passed to the search backend now treats your multi-character Chinese input as a phrase, which will only match characters that are actually adjacent:

维基百科 -> +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
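
In MySQL's boolean full-text syntax, a leading `+` marks a term as required and double quotes mark a phrase, so the query above demands all four encoded characters in that order. Here is a hedged Python sketch of building such a term; `build_phrase_term` is a hypothetical helper, and it assumes the input word consists only of characters that are indexed one per token.

```python
def build_phrase_term(word: str) -> str:
    """Build a required boolean-mode term for a word indexed one character per token."""
    # Each character becomes 'u8' + hex of its UTF-8 bytes, as in the index.
    tokens = ["u8" + ch.encode("utf-8").hex() for ch in word]
    if len(tokens) == 1:
        # A single character needs no phrase quoting.
        return "+" + tokens[0]
    # Quoting turns the tokens into a phrase, so they must appear adjacent
    # and in order instead of merely somewhere in the page.
    return '+"' + " ".join(tokens) + '"'


print(build_phrase_term("维基百科"))
# +"u8e7bbb4 u8e59fba u8e799be u8e7a791"
```

Without the quotes, the backend would see four independent terms, which is exactly the old behavior that produced false matches.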

Note that Wikimedia’s sites such as Wikipedia run on a fancier, but more demanding, search backend with a separate Java-based engine built around Apache Lucene. Sometimes we have to remind ourselves that third-party users will mostly be using the MySQL-based default, and oh boy it still needs some lovin’! 🙂
