In this post, we'll dip into examples of several multi-script languages, with a deeper dive into Serbian and Chinese, which have interestingly different needs. We'll try to get a better sense of the complications that arise from supporting readers, editors, and searchers in multi-script languages, and briefly get to know some of the tools that help make it all possible. While the subject can be complicated and the tools are undoubtedly complex, handling multi-script languages well is an essential part of providing information to people in a form that they can readily use.
Making software engineers cry
Depending on your definition of "language" and your definition of "support", the Wikimedia Foundation supports a bit shy of 300 languages across more than 800 projects. That's a lot of languages, and the variation across those languages (and the complexity of supporting them) can be staggering.[1] It's enough to strike fear in the hearts of software engineers everywhere.
Human languages, however, don't care one iota how hard they make software engineers' lives, so in addition to the baffling variability between languages, there is often considerable variability within a language. English dialects seem able to proliferate without end, and there are many differences in words and phrases[2] (elevator vs lift) and spelling[3] (color vs colour) just between the standard American and British varieties. But at least we all use the same writing system.
Not so in other languages! Let's take a look…
Serbian: Cyrillic and Latin
Serbian is one of the standard forms of the Bosnian-Croatian-Montenegrin-Serbian language. It can be written in either the Cyrillic or Latin alphabets. While having two scripts complicates matters, the correspondence between the Cyrillic and Latin alphabets is mercifully exact, which makes converting between the two relatively straightforward.
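To make "mercifully exact" a little more concrete, here is a minimal sketch of Cyrillic-to-Latin conversion as a plain lookup table. It is not the language converter's actual code; the names `CYR_TO_LAT` and `to_latin` are made up for this illustration, and the capitalization rule (an uppercase Cyrillic letter becomes a title-cased Latin letter or digraph) is a simplifying assumption.

```python
# Illustrative sketch only: one table entry per letter of the Serbian
# Cyrillic alphabet (30 letters). Three letters map to Latin digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Convert Serbian Cyrillic letters to Latin; leave everything else alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # digits, punctuation, Latin text, etc.
            continue
        latin = CYR_TO_LAT[lower]
        # Assumption: an uppercase letter becomes a title-cased result ("Љ" -> "Lj").
        out.append(latin.capitalize() if ch != lower else latin)
    return "".join(out)


print(to_latin("Ћирилица"))
```

Going the other way needs a little more care, because a Latin digraph like "lj" has to be recognized as a unit before it can be mapped back to a single Cyrillic letter.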
---
Enter language converter!
On the Serbian Wikipedia, most articles are written in Cyrillic, though some are written in Latin. If you haven't set any language preferences on the Serbian Wikipedia, then where English Wikipedia has "Main Page | Talk" near the upper left of the page, Serbian Wikipedia has "Главна страна | Разговор | Ћир./lat." Under "Ћир./lat." you have three options: "Ћир./lat.", "Ћирилица", and "Latinica". That's our language converter in action!
Not too surprisingly, "Latinica" converts the page to Latin text, "Ћирилица" (which in Latin script is "Ćirilica") gives you Cyrillic, and the default, "Ћир./lat.", gives you the text however it was originally written. Logged-in users can set a preference so that they normally see their preferred script.
Even with the relatively straightforward transliteration, there are complications. For example, in the article about Serbian-American actress Sasha Alexander on Serbian Wikipedia, her stage name is provided in English. It wouldn't be helpful to have the specifically English version of her name converted along with the general text on the page when transliterating to Cyrillic. In this case, the language converter is smart enough to know not to transliterate inside language-specific templates. There's also -{special markup}- available that can block the conversion for any bit of text. It often gets used for standard abbreviations like units such as km (kilometer) or mm (millimeter), and cardinal directions in coordinates (e.g., the N and E in "44°48′N 20°28′E"). There's also a magic word[4] to block title conversion: __NOTC__ (also available in Cyrillic as __БЕЗКН__), which gets used for domain names, abbreviations and initialisms, scientific and technical terms, etc.
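As a rough illustration of how protected spans can work, the sketch below converts everything outside the -{ }- markers and passes the inside through untouched. It is an assumption-laden toy, not the real language converter: the function name `convert_with_protection` is invented here, the real markup has more features than this handles, and the Latin-to-Cyrillic table covers only a handful of letters.

```python
import re
from typing import Callable

# Matches -{ ... }- and captures the protected text (non-greedy).
PROTECTED = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def convert_with_protection(text: str, convert: Callable[[str], str]) -> str:
    """Apply `convert` outside -{...}- spans; keep protected text verbatim."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(text):
        out.append(convert(text[pos:match.start()]))  # ordinary text
        out.append(match.group(1))                    # protected, markers dropped
        pos = match.end()
    out.append(convert(text[pos:]))                   # trailing ordinary text
    return "".join(out)


# Toy, deliberately partial Latin -> Cyrillic table (hypothetical, for the demo).
TOY_TABLE = str.maketrans({"a": "а", "e": "е", "k": "к", "m": "м", "o": "о", "t": "т"})


def toy_to_cyrillic(text: str) -> str:
    return text.translate(TOY_TABLE)


print(convert_with_protection("oko 5 -{km}-", toy_to_cyrillic))
print(convert_with_protection("oko 5 km", toy_to_cyrillic))
```

In the first call the unit stays in Latin script; in the second, without the markers, it gets converted along with everything else.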
Complicated is as complicated does: editing and search
So you're cruising around the Serbian Wikipedia, with your preference for Cyrillic set, reading about your favorite TV show with a controversial ending, Изгубљени (English: Lost), and you decide to add a little detail or correct a minor typo. When you get to the edit page, you discover that the article is actually titled Izgubljeni and it's written in Latin script, which you aren't so comfortable reading or writing. Bummer.
Help is coming! It's a very complicated issue, though. Do you convert the entirety of every article, from Cyrillic to Latin and back, every time someone wants to edit it in a different script? Or do you try to identify just what they changed in one script and convert it to the majority script of the article? What about cases unlike Serbian (oooo, foreshadowing!) where the conversion is good, but far from perfect? Fortunately for me, that's not my problem. Whew! But the WMF Parsing team has plans.[5]
Similarly, searching in a given script generally only finds matches in the same script. So on Serbian Wikipedia, searching for Izgubljeni gives a few dozen results, while searching for Изгубљени returns several hundred. This one is my problem, as I'm part of the WMF Search Platform team. I'm working on a plugin for our search engine that will not only merge the Cyrillic and Latin scripts in the search index, but will also do some basic stemming, which lets searches for one form of a word return related forms, like hope, hoped, and hoping in English. Speaking of hope, I also hope in the future to be able to bridge part of the mixed-script gap for other languages and projects where the conversion is, like that for Serbian, relatively straightforward.
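The general idea of merging scripts in an index can be sketched in a few lines: fold both the indexed text and the query into a single script, so it no longer matters which one either was typed in. This is only a toy under stated assumptions, not the plugin itself; `FOLD`, `fold`, `build_index`, and `search` are hypothetical names, the index is a plain dictionary, queries are single words, and there is no stemming at all.

```python
from collections import defaultdict

# Serbian Cyrillic letters and their Latin counterparts, in the same order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
FOLD = dict(zip(CYR, LAT))


def fold(text: str) -> str:
    """Lowercase the text and map Cyrillic letters to Latin."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())


def build_index(docs: dict) -> dict:
    """Toy inverted index: folded token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in fold(text).split():
            index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Look up a single-word query after folding it the same way."""
    return index.get(fold(query), set())


# Two made-up "documents", one per script.
docs = {1: "Изгубљени", 2: "Izgubljeni"}
index = build_index(docs)
print(search(index, "Izgubljeni"))
print(search(index, "Изгубљени"))
```

Because both spellings collapse to the same folded token, either query finds both documents.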
Less straightforward transliteration: Uzbek, Kazakh, and Crimean Tatar
Not all transliteration systems are as straightforward as Serbian. To varying degrees, Uzbek (Wikipedia), Kazakh (Wikipedia),[6] and Crimean Tatar (Wikipedia), all Turkic languages, need more complicated support for their Cyrillic/Latin transliteration, up to and including regular expressions and lists of exceptions that just can't be handled in any straightforward way.
For these languages, the difficulties of reading, editing, and searching are greater than for Serbian, because any automatic conversion has to be significantly more clever, which also makes it more likely to make mistakes.
More straightforward transliteration: Inuktitut and Shilha
Language converter, of course, supports scripts other than Cyrillic and Latin. The transliteration for Kazakh includes Arabic as well. Other scripts, including some you may be unfamiliar with, are supported, too! For example…
Inuktitut (Wikipedia) is an Inuit language spoken in Canada and written in both Latin and Inuktitut syllabics. Fortunately the mapping between the two is straightforward, like Serbian.
Shilha is a Berber language spoken in Morocco and written in Arabic, Latin, and Tifinagh. Its Wikipedia is small and still in the incubator, but it already uses language converter to support Latin/Tifinagh transliteration, in part because the mapping is straightforward.
---
Confounded, confused, and confuzzled: Chinese characters
The Chinese[7] Wikipedia (language code zh) uses the language converter to transform its text into several varieties, including those of mainland China (zh-cn), Hong Kong (zh-hk), Macau (zh-mo), Singapore (zh-sg), and Taiwan (zh-tw). These varieties can have small differences in punctuation and in a few particular words, but the largest split is between whether they use Traditional or Simplified Chinese characters. Chinese Wikipedia also supports the language codes zh-hant and zh-hans, which are generic Traditional and Simplified Han (Chinese) characters, respectively; we'll use those codes for our next example.
For those who don't read Chinese, the difference between Traditional and Simplified characters can be subtle, but it's easier to see when you can compare the exact same text in the two systems. Open the article for "Wikipedia" rendered in Traditional (維基百科) and Simplified (维基百科) characters in adjacent tabs in your browser. Flip back and forth between them and notice that the Traditional variants of characters are often a bit more complex and have a few more strokes, and look a bit darker on the screen as a result. Simplified characters are, well, a bit simpler looking. Punctuation, like periods, commas, and quotes, is also a bit different.
The Chinese language and Chinese Wikipedia together have a number of additional complexities that make the situation even more challenging:
- The mapping between Traditional and Simplified is not one-to-one, in multiple ways! Some words that are a single Traditional character are written with two Simplified characters. A Traditional character that's part of a multi-character word or phrase might get converted to a different Simplified character as part of that phrase than it would if it were on its own.
- Chinese is written without spaces, making it hard for a computer to break it into words (so that it can make decisions about how to transliterate). As a simple analogy in English written without spaces, the string "MARGARITA" could be "Margarita" or "Marga Rita". Context provides clues: "IWANTTODRINKAMARGARITA." vs "HERFIRSTANDMIDDLENAMESAREMARGARITA." But computers are terrible at context.
- Chinese Wikipedia, unlike Serbian, often has a mix of Traditional and Simplified characters in a given article. Not just in the same article or sentence, but in the same name!
An example I found of the last problem: "UEFA Champions League Final" appears on Chinese Wikipedia in Traditional characters (歐洲冠軍聯賽決賽), in Simplified characters (欧洲冠军联赛决赛), and in a mix of the two (欧洲冠军联赛決賽). In that last instance, the last two characters are Traditional, the rest are Simplified. It strikes me as very odd because the last and third-from-last characters are the same! So two different versions of the same character are used in the name of a soccer/football[8] league.
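Going back to the first point in the list above, phrase-dependent conversion is often handled by checking for known multi-character phrases before falling back to single characters. The sketch below shows that longest-match-first idea with a deliberately tiny, hand-picked table: 乾 normally becomes 干 in Simplified, but stays as 乾 in the word 乾坤. The tables and the function name `to_simplified` are assumptions for illustration only, not the language converter's real data or code.

```python
# Tiny illustrative tables; a real converter needs thousands of entries.
PHRASES = {"乾坤": "乾坤"}  # in this word, 乾 is kept as-is in Simplified
CHARS = {"乾": "干"}        # on its own (or in other words), 乾 becomes 干


def to_simplified(text: str, phrases=PHRASES, chars=CHARS) -> str:
    """Convert Traditional to Simplified, trying the longest phrase match first."""
    max_len = max(map(len, phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try multi-character phrases starting at position i, longest first.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += size
                break
        else:
            # No phrase matched: fall back to the single-character table.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_simplified("乾燥"))  # the word for "dry"
print(to_simplified("乾坤"))  # phrase-level exception
```

Even this approach leans on knowing where words begin and end, which is exactly what the lack of spaces makes hard.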
In the Bad Old Days, searching for any of these variants would only find articles that contained those specific characters. As of spring 2017, the situation has improved considerably because we convert all text on Chinese Wikipedia to Simplified characters[9] before indexing them for search.
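As a small illustration of why normalizing before indexing helps, the sketch below maps the three spellings of "UEFA Champions League Final" mentioned earlier onto a single Simplified form, so that they would all land on the same index key. The character-by-character table `T2S` is a simplification I am assuming for the demo; as noted above, real Traditional-to-Simplified conversion is not one-to-one, and this is not the code that runs on Chinese Wikipedia.

```python
# Only the five characters that differ in this one example (illustrative).
T2S = str.maketrans("歐軍聯賽決", "欧军联赛决")


def normalize(text: str) -> str:
    """Map the known Traditional characters to their Simplified forms."""
    return text.translate(T2S)


variants = [
    "歐洲冠軍聯賽決賽",  # all Traditional
    "欧洲冠军联赛决赛",  # all Simplified
    "欧洲冠军联赛決賽",  # mixed
]

# After normalization all three collapse to one key.
keys = {normalize(v) for v in variants}
print(keys)
```

If the same normalization is applied to queries, a search in any of the three forms can find articles written in any of the others.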
Editing Chinese Wikipedia is still pretty complicated. Unlike Serbian's Cyrillic/Latin situation (where you could presumably study really hard for a few weeks and become passingly familiar with the few dozen characters in the script you don't already know), Traditional and Simplified Chinese have thousands of different characters to learn.
In conclusion…
…there is no conclusion! Well, this blog post is about to end, but the road to fully supporting all the languages of Wikipedia and its sister projects (for reading, editing, and searching) is probably never-ending.
But that shouldn't be disheartening: every day we come a little bit closer to a world in which every single human being can freely share in the sum of all knowledge. It may never be perfect, but it's always getting better.
Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation
Footnotes
1. For a very brief but very entertaining review of just some of that complexity, see Roan Kattouw's lightning talk, given at linux.conf.au 2017 ("Human language wats"): YouTube video, slides on Wikimedia Commons.
2. And of course there is nearly endless variety to be found in English around the world: Appalachian English has sigogglin; Australian English has fair dinkum; Canadian English has namaycush; East African English has boda-boda; Hawaiian English has makai; Indian English has burra-khana; Indonesian English has gotong-royong; Irish English has knawvshawl; Maryland English has moonack; Namibian English has oukie; Philippine English has kilig; Quebec English has sugar pie; Singapore English has taxi uncle; and Texas English has whomperjawed.
3. This is only exacerbated by the fact that English spelling is horrible. The relevant technical term is "orthographic depth", which is a rough sense of how much a spelling system is WYSIWYG. In the English Wikipedia article on orthographic depth, English is the only example in the category "irregular". The French Wikipedia article specifically calls out English -ough. It's a travesty.
4. That's a rather technical term. Follow the link!
5. See C. Scott Ananian's slides for his Wikimania 2017 presentation on multi-script editing. The slides include his speaker notes, so while it's not as good as being there, it's still quite good and full of useful information.
6. Kazakhstan is currently planning to officially shift from the Cyrillic to the Latin alphabet by 2025, and the language converter will need to adapt to that change. An earlier proposal, from October 2017, involved a lot of apostrophes, and was widely criticized. (See an article from The New York Times.) Very recently, a new version was announced that favors acute accents and digraphs. (See an article from Kazinform.) Tables showing the Oct 2017 and Feb 2018 transliterations are on Commons.
7. The label "Chinese" is itself complicated because it can refer to many languages, which can differ as much from each other as the Romance languages do, meaning they are often mutually unintelligible. The Chinese Wikipedia is written in Modern Written Chinese, which is based on the varieties of Chinese spoken throughout China. See also the article on Chinese Wikipedia, on English Wikipedia.
8. To-may-to, to-mah-to. I already said that English is a mess.
9. The software libraries available to handle segmenting Chinese text into words operate on Simplified characters, so converting everything to Simplified first allowed us to also index articles by actual words, rather than by character n-grams; n-grams are much better than nothing, but not great. For more info on the process, you can read my write up for that project.