Have you ever heard of Bashkortostan?
It’s a region of Russia, about 1,000 miles from Moscow. Few people outside Russia have heard of it, but inside Russia it’s quite well known for its traditional honey and kumis industries, and for the many rivers, forests, and mountains that draw tourists.
The region’s name comes from the Bashkirs, a distinct ethnic group that lives there and speaks its own language, which belongs to the Turkic family. Twelve years ago, the first article was written in the Wikipedia in that language. Today the community of editors around it is among the most active Wikipedia communities in the languages of Russia.
That community recently asked the MediaWiki software developers to solve a technical problem for them: category collation for the Bashkir alphabet. Put simply, “collation” is the process of sorting words according to the alphabet. It’s not as simple as it may sound, and it works slightly differently in every language.
Bashkir is written in the Cyrillic alphabet, like Russian, but with several additional letters for special Bashkir sounds. These letters have their places all along the alphabet, but MediaWiki sorted all of them incorrectly. For example, in the “Capitals of republics of Russia” category, the entry for Ufa (Өфө), Bashkortostan’s capital, appeared at the end of the list, even though it was supposed to be in the middle.
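To make the problem concrete, here is a minimal Python sketch of why a name like Өфө can land at the end of a list. It is not MediaWiki’s code; it only assumes that plain sorting compares Unicode code points, where the extra Bashkir letters sit after the basic Cyrillic block. The `bashkir_key` function is a hypothetical helper, the alphabet string is the Bashkir alphabet as it is commonly listed, and the entries other than Өфө are illustrative placeholders rather than the actual Bashkir spellings.

```python
# Bashkir alphabet in its commonly listed order (42 letters, uppercase forms).
BASHKIR_ALPHABET = "АБВГҒДҘЕЁЖЗИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭӘЮЯ"

# Map each letter to its position in the alphabet.
RANK = {letter: index for index, letter in enumerate(BASHKIR_ALPHABET)}


def bashkir_key(word):
    """Hypothetical sort key: alphabet positions instead of code points."""
    # Characters outside the alphabet (digits, Latin letters, spaces) sort
    # after all Bashkir letters, in code-point order among themselves.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.upper()]


# Illustrative category entries; only "Өфө" is taken from the text above.
entries = ["Якутск", "Өфө", "Нальчик", "Петрозаводск"]

by_code_point = sorted(entries)                 # "Өфө" ends up last
by_alphabet = sorted(entries, key=bashkir_key)  # "Өфө" sits between Н and П

print(by_code_point)
print(by_alphabet)
```

This is a single-level simplification. Real collation also has to deal with case, accents, and other tie-breaking rules, which is why it is usually delegated to a dedicated library.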
MediaWiki relies on an external library called ICU (International Components for Unicode) to apply collation to different languages. ICU has collation information for many languages, but not all, and Bashkir is not one of them.
I submitted a request to get this language into ICU, but the process of getting a new language into it can take many months, if not years. We could have just waited for that to happen, but then our colleague Brian Wolff wrote some brilliant code that resolves this issue inside MediaWiki’s code, making it unnecessary to wait for ICU to be updated.
When the fix was ready, I got it deployed and tested on the Bashkir Wikipedia. And when this started working, the Bashkir Wikipedians were so happy about it that the biggest Bashkir newspaper, simply called Bashkortostan, got interested, and published a story about it.
And yes, it mentions Brian Wolff. Search for “Брайан Вулфф”. (Bashkir is not supported by Google Translate, but it is supported by Yandex.Translate. Machine translation is never perfect, but if you’re curious, you can try using it to get an idea of what the article says.)
Bashkir is the first language for which complete collation is implemented inside MediaWiki. I am already starting to hear requests to do something like this for other languages, and thanks to Brian’s work it will now be much easier. The fact that Bashkir was the first one shows how an active editing community that cares about its language can get things to happen.
We are doing amazing things that affect the world in ways we don’t even imagine!
Amir Aharoni, Wikimedian
Editor’s note: While Amir is an employee of the Wikimedia Foundation, this post is written in a volunteer capacity.