A galloping overview
Let’s first get a bird’s-eye view of the parts of the search process: text comes in and gets processed and stored in a database (called an index); a user submits a query; documents that match the query are retrieved from the index, ranked based on how well they match the query, and are then presented to the user. That sounds easy enough, but each step hides a wealth of detail. Today we’ll focus on another part of the step where “text gets processed”—and look at normalization.[1]
Also keep in mind that humans and computers have very different strengths, and what is easy for one can be incredibly hard for the other.
A foolish consistency
The simplest kind of normalization that readers of Latin, Greek, Cyrillic, Armenian and many other scripts often don’t even notice is case—that is, uppercase vs. lowercase. For general text, we want Wikipedia, wikipedia, WIKIpedia, wikiPEDIA, and WiKiPeDiA to all be treated the same. The usual method is to convert everything to lowercase.
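As a minimal sketch of that idea in Python (the function name `normalize_case` is made up for illustration, and this is not the code that runs on the wikis), lowercasing collapses all five spellings into one index term:

```python
def normalize_case(token: str) -> str:
    """Lowercase a token so that case differences are ignored at search time."""
    return token.lower()


variants = ["Wikipedia", "wikipedia", "WIKIpedia", "wikiPEDIA", "WiKiPeDiA"]

# A set keeps only the distinct normalized forms.
normalized = {normalize_case(v) for v in variants}
print(normalized)
```

The same function is applied both to the text being indexed and to the user's query, so the two always meet in the middle.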
There are exceptions—called capitonyms—where the capitalized form means something different. In English, we have March/march, May/may, August/august—so many months!—Polish/polish, Hamlet/hamlet, and others. In German, where nouns are always capitalized, there are also words that differ only by capitalization, such as Laut (“sound”) and laut (“loud”). Their conflation through lowercasing is often something we just have to live with.
As with everything else when dealing with the diversity of the world’s languages, there isn’t just one “right” way to do things. A speaker of English, for example, will tell you that the lowercase version of I is i, while a Turkish speaker would have to disagree, because Turkish has a dotted uppercase İ and dotless lowercase ı, and the corresponding pairs are İ/i and I/ı. As a result, we lowercase I differently on Turkish wikis than we do on other wikis in other languages. Cyrillic Г (“ge”) has different lowercase forms in Russian, Bulgarian, and Serbian, and then different italic lowercase forms in those languages as well.
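To make the Turkish case concrete, here is a small Python sketch. Python's built-in `str.lower()` is not locale-aware, so the helper `lower_turkish` below is a hypothetical stand-in that handles the two special pairs by hand before falling back to the default rules; it illustrates the idea rather than showing how the Turkish wikis are actually configured.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def lower_turkish(text: str) -> str:
    """Lowercase text using the Turkish pairs İ/i and I/ı."""
    # Handle the two Turkish-specific pairs first, then lowercase the rest.
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


default_result = "ISPARTA".lower()          # default rules: I becomes i
turkish_result = lower_turkish("ISPARTA")   # Turkish rules: I becomes ı
print(default_result, turkish_result)
```

The two results differ only in the first letter, which is exactly the difference that matters to a Turkish reader.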
Other complications include German ß, which, depending on who and when you ask, might or might not have an uppercase form. It can be capitalized as SS,[2] or with a dedicated uppercase version (which is not well supported by typical fonts and was only accepted by the Council for German Orthography in 2017).
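Python's standard string methods happen to show both sides of this. The sketch below uses only built-in `str` methods; the word chosen is just an example. Note that `casefold()` is a more aggressive relative of `lower()` that is intended for caseless matching.

```python
SHARP_S = "\u00df"          # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ, the dedicated uppercase form

word = "stra" + SHARP_S + "e"   # straße

# Uppercasing uses the traditional SS spelling, not the capital sharp s.
upper_word = word.upper()

# Case folding maps both spellings to the same string for matching.
folded_word = word.casefold()
folded_upper = upper_word.casefold()

# The capital sharp s lowercases to the ordinary ß.
lowered_capital = CAPITAL_SHARP_S.lower()

print(upper_word, folded_word, folded_upper, lowered_capital)
```

Plain lowercasing would leave straße and STRASSE as different strings, which is why case folding, or an explicit ß-to-ss mapping, is the more useful operation for search.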
There are also uppercase vs. lowercase complications with digraphs used in some languages, like Dutch ij or Serbian dž, lj, or nj—which are treated as single letters in the alphabet. The Serbian letters have three case variants: ALL CAPS DŽ / LJ / NJ, Title Case Dž / Lj / Nj, and lowercase dž / lj / nj. The Dutch letter, in contrast, only comes in two variants, UPPERCASE IJ and lowercase ij. Though they are usually typed as two letters, there are distinct Unicode characters for all three variants of the Serbian letters (e.g., Ǆ, ǅ, and ǆ), but only two for the Dutch letter (Ĳ and ĳ).
The calculus of variations
Another common form of normalization is to replace “variant” forms of a character with the more “typical” version. For example, we might replace the aforementioned single Serbian character ǆ with two separate characters: d and ž; or Dutch ĳ with i and j. Unicode also has precomposed single-character roman numerals—like ⅲ and Ⅷ—and breaking them up into iii and VIII makes them much easier to search for.
So-called “stylistic ligatures” are also relatively common. For example, in some typefaces, the letter f tends to not sit well with a following letter, particularly i and l; either there’s too much space between the letters, or the top of the f is too close to the following letter. To solve this problem, ligatures—a single character made by combining multiple characters—are used. The most common in English are ﬀ, ﬁ, ﬂ, ﬃ, and ﬄ. Open almost any book in English (in a serif font and published by a large publisher) and you’ll find these ligatures on almost every page. The most obvious is often ﬁ, which is usually missing the dot on the i (in a serif font—not necessary in a sans-serif font).[3] Some stylistic ligatures, like the st ligature (ﬆ), are more about looking fancy.
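The digraph, roman-numeral, and stylistic-ligature replacements described above all correspond to what Unicode calls compatibility decomposition. A short Python sketch using the standard `unicodedata` module follows; the wrapper name `expand_compat` is made up for illustration. One caveat: NFKC also rewrites many other characters (superscripts, full-width forms, and so on), so a real search pipeline may prefer a narrower, hand-picked mapping.

```python
import unicodedata


def expand_compat(text: str) -> str:
    """Replace compatibility characters with their ordinary equivalents."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "\u01c6": "Serbian digraph dz with caron (single character)",
    "\u0133": "Dutch ij (single character)",
    "\u2172": "small roman numeral three",
    "\u2167": "roman numeral eight",
    "\ufb01": "fi ligature",
    "\ufb06": "st ligature",
}

for char, description in samples.items():
    expanded = expand_compat(char)
    print(f"{description}: {len(char)} code point -> {len(expanded)} code points")
```

After this step, a query typed with ordinary letters can match text that was typeset with the single-character forms.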
Other ligatures, like æ, œ, and ß (see footnote 2) can—depending on the language—be full-fledged independent letters, or just stylistic/posh/pretentious ways of writing the two letters. Separating them can make matching words like encyclopaedia and encyclopædia easier.
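Unlike the stylistic ligatures, æ and œ have no Unicode decomposition, so standard normalization leaves them alone and any splitting has to be done with an explicit mapping. Here is a minimal Python sketch; the table and the function name `split_letter_ligatures` are illustrative assumptions, and whether to apply such a mapping at all depends on the language being indexed.

```python
import unicodedata

# Hand-written mapping; Unicode normalization does not split these letters.
LETTER_LIGATURES = str.maketrans({
    "\u00e6": "ae",  # æ
    "\u00c6": "AE",  # Æ
    "\u0153": "oe",  # œ
    "\u0152": "OE",  # Œ
})


def split_letter_ligatures(text: str) -> str:
    """Split ae/oe ligatures into two letters each."""
    return text.translate(LETTER_LIGATURES)


word = "encyclop\u00e6dia"                        # encyclopædia
after_nfkc = unicodedata.normalize("NFKC", word)  # unchanged by NFKC
after_split = split_letter_ligatures(word)
print(after_nfkc == word, after_split)
```

For a language where æ is a letter in its own right, the mapping would simply be left out.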
Non-Latin variants abound! Greek sigma (Σ/σ) has a word-final variant, ς, which is probably best indexed as “regular” σ. Many Arabic letters have multiple forms—as many as four: initial, medial, final, and stand-alone variants. For example, bāʾ has four forms: ب, ـب, ـبـ, بـ.
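For the sigma case, a Python sketch might look like the following. The helper `fold_final_sigma` is hypothetical; the built-in `str.casefold()` is shown alongside it because it performs the same mapping as part of general case folding.

```python
FINAL_SIGMA = "\u03c2"    # ς, used at the end of a word
REGULAR_SIGMA = "\u03c3"  # σ


def fold_final_sigma(text: str) -> str:
    """Index word-final sigma as regular sigma."""
    return text.replace(FINAL_SIGMA, REGULAR_SIGMA)


word = "\u03bf\u03b4\u03cc\u03c2"   # οδός, which ends in a final sigma
folded = fold_final_sigma(word)
print(word, folded, FINAL_SIGMA.casefold() == REGULAR_SIGMA)
```

Plain lowercasing is not enough here, because ς is already a lowercase letter.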
For other letters, their status as “variants” is language dependent. In English, we often don’t care much about diacritics. The names Zoë and Zoe are—with apologies to people with those names—more or less equivalent, and you have to remember who uses the diaeresis[4] and who doesn’t. Similarly, while résumé and resumé always refer to your CV, resume often does, too. In Russian, the Cyrillic letter Ё/ё is treated as essentially the same as Е/е, and many people don’t bother to type the diaeresis—except in dictionaries and encyclopedias. So, of course, we have to merge them on Russian-language wikis. In other languages, such as Belarusian and Rusyn, the letters are treated as distinct. And, whether you want to keep the diacritics or not, you may still need to normalize them. For example, you can use a single Unicode character,[5] é, or a regular e combined with a “combining diacritic”, é, which is two characters, not one.[6] Similarly, in Cyrillic, ё and й are one character each, while ё and й are two characters each.
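Both steps, unifying the one-character and two-character spellings and optionally dropping the diacritics, can be sketched with Python's standard `unicodedata` module. The function names `strip_diacritics` and `fold_russian_yo` are made up for illustration. Note how blanket stripping also turns Cyrillic й into и, which would be wrong for Russian; that is why the Russian helper merges only Ё/ё.

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two spellings look alike but compare unequal until normalized.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed


def strip_diacritics(text: str) -> str:
    """Decompose, drop combining marks, and recompose what is left."""
    parts = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in parts if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def fold_russian_yo(text: str) -> str:
    """Merge only the Russian letter yo into ye, leaving other letters alone."""
    return text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")


print(same_before, same_after, strip_diacritics("Zo\u00eb r\u00e9sum\u00e9"))
```

Which of the two helpers is appropriate depends on the language of the wiki, not on the characters alone.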
Some characters are difficult to tell apart, while others are hard to identify—and even if you could identify them, could you type them? Normalizing them to their unaccented counterpart makes a lot of sense in many cases. My general rule of thumb is that if a letter is a separate letter in a language’s alphabet, then it needs to be a separate letter when you normalize that language for search. Russian Ё/ё is an exception, but the rule is a good place to start.
Below is a collection of letters related to A/a, presented in image form just in case you don’t have the fonts to support them all. All of them except one (in grey) are separate Unicode characters.[7] For some reason, there is a “latin small letter a with right half ring”[8] but no version with a capital A. Good thing we don’t need to convert it to uppercase for searching!
Now for our last character-level variation to consider: in some applications, it makes sense to normalize characters across writing systems. For example, numbers probably do represent the same thing across writing systems, and where multiple versions are common, it makes sense to normalize them to one common form. Thus, on Arabic-language wikis, Eastern Arabic numerals are normalized to Western Arabic numerals so that searching for ١٩٨٤ will find 1984, and vice versa. For multi-script languages, where the conversion is possible to do at search time, it makes sense to normalize to one writing system for searching. On Serbian-language wikis, Cyrillic is normalized to Latin; on Chinese-language wikis, Traditional characters are normalized to Simplified characters.[9]
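The numeral case is simple enough to sketch in a few lines of Python. The table and the function name `normalize_digits` are assumptions for illustration, not the actual configuration of the Arabic-language wikis.

```python
# Eastern Arabic digits occupy U+0660 through U+0669, in order from 0 to 9.
EASTERN_TO_WESTERN = {0x0660 + value: str(value) for value in range(10)}


def normalize_digits(text: str) -> str:
    """Map Eastern Arabic digits to Western Arabic digits."""
    return text.translate(EASTERN_TO_WESTERN)


eastern_year = "\u0661\u0669\u0668\u0664"   # 1984 written in Eastern Arabic numerals
print(normalize_digits(eastern_year))
```

Because the same mapping is applied to both the indexed text and the query, a search in either numeral system finds matches written in the other.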
Further reading / Homework
You can read more about the complexities of supporting Traditional and Simplified Chinese characters and Cyrillic and Latin versions of Serbian (and other multi-script languages) in one of my earlier blog posts, “Confound it!” Wikipedia has lots more information on the surprisingly complex topic of letter case, and the examples of ligatures in a wide variety of languages.
If you can’t wait for next time, I put together a poorly edited and mediocrely presented video in January of 2018, available on Wikimedia Commons, that covers the Bare-Bones Basics of Full-Text Search. It starts with no prerequisites, and covers tokenization and stemming, inverted indexes, basic boolean and proximity retrieval operations, TF/IDF and the vector space model of similarity, field-level indexing, using multiple indexes, and then touches on some of the elements of scoring.
Up next
In my next blog post, we will almost certainly actually look at stemming—which involves reducing a word to its base form, or a reasonable facsimile thereof—as well as stop words and thesauri.
Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation
———
Footnotes
1. Last time I said we’d talk about stemming and other normalization, but character-level normalization kind of took over this post, so we’ll put off stemming and related topics until next time.
2. Why does a symbol that stands for an “s” sound look like a B (or Greek β)? Well, you see, long ago there was a written letter form in common use called long s—which looks like this (in whatever font your computer is willing and able to show it to you): ſ. It was historically written or printed in a way that looks like an integral sign or an esh: ʃ, or in a chopped-off version that looks like an f without the crossbar (or, maddeningly, with only the left half of the crossbar). If you take the long s and the regular s—ſs—and squish them together, and make the top of the long s reach over to the top of the regular s, you get an ß—whose name in German, Eszett, reflects that even earlier it was a long s and a tailed z: ſʒ.
A fun side note: optical character recognition (OCR) software generally isn’t trained on either form of long s, and often interprets it as an “f”. As a result, the Google Books Ngram Viewer will (incorrectly) tell you that fleek was wildly popular in the very early 1800s. In reality, it’s usually “sleek” written with a long s (for example: full height or chopped), or some other unusual character, like a ligature of long s and t in Scots/Scottish English “steek”.
Typography is fun!
3. A small miscellany: Turkish, which distinguishes dotted i and dotless ı, doesn’t use the ﬁ ligature since it often removes the dot. The spacing between letters is called kerning, and once you start paying attention to it, you can find poor kerning everywhere. Finally, another character that most people don’t use in their everyday writing—but which shows up a lot in printed books—is the em dash (i.e., —); I personally love it—obviously.
4. The technical name for the double-dot diacritic ( ¨ ) is “diaeresis” or “trema”, though many English speakers call it an “umlaut”, because one of its common uses is to mark umlaut in German—which is a kind of sound change. In English, the diaeresis is usually used to mark that two adjacent vowels are separate syllables—e.g., Chloë and Zoë rhyme with Joey, not Joe, and naïve is not pronounced like knave. For extra pretentiousness points, you can use it in words like coöperate and reënter, too—or you can just use a hyphen: co-operate, re-enter—though doing so may mess with your tokenization!
5. The term “Unicode character” can be more than a little ambiguous. It can refer to a code point, which is the numerical representation of the Unicode entity, which is what the computer deals with. It can refer to a grapheme, which is an atomic unit of a writing system, which is usually what humans are thinking about. You can also talk about a glyph, which is the specific shape of a grapheme—for example in a particular font. There are invisible characters used for formatting that have a code point but no glyph, surrogate code points that can pair up in many ways to represent a Chinese character as a single glyph, special “non-characters,” and lots of other weird corner cases. It’s complicated, so people often just use “character” and sort out the details as needed.
6. This kind of normalization is relatively common, and there’s a reasonable chance that between me writing this and it getting published on the Wikimedia blog, some software along the way will convert my two-character é to the one-character é. Not all letter+diacritic combinations have precomposed equivalents, though.
7. You may have noticed that in the last row of lowercase a’s, the third from the right has a different shape. Different fonts can and do use either of the letter shapes as their base form, but in more traditional typography, the a with the hook on top is the “regular” or “roman” form, and the rounder one is the “italic” form. Lowercase g can have a similar difference in form, and there is also a Unicode character for the specific “single-story” g—“latin small letter script g” (see footnote 8). Some Cyrillic letters also have very different italic forms—see the grey highlighted examples below.
For the typography nerds, traditional italic versions of a font have distinct forms. When they are just slanted versions of the roman forms, they are technically “oblique”, rather than italic. Below is the same pangram in the font Garamond, set in roman, italic, and oblique type.
8. Unicode descriptions of characters are always in ALL CAPITAL LETTERS / small caps. Why? Because we like it that way! Seriously, though, I don’t really know why. Hmmm.
9. Chinese is a case where real life gets a bit messier than our idealized abstraction. Sometimes you have to do some normalization before tokenization. Because Chinese doesn’t use spaces, tokenization is much more difficult than it is for English and other European languages. The software we use that does tokenization only works on Simplified characters, so we have to normalize Traditional characters to Simplified before tokenization.