Hello, my name is ________: Searching for names is not always straightforward

Photo by Travis Wise, CC BY 2.0.

What’s in a name? That which we call a rose
By any other name would smell as sweet;

—Juliet, Act II, Scene II of Shakespeare’s Romeo and Juliet, more or less

I really like names. There’s so much variation in the way people use their own names—formally and informally, at home, work, or online. And there’s even more variation in names across cultures. In this blog post I’m going to touch on some of my favorite kinds of name variation and how such variation can make it bafflingly hard to search for people “by name”, on Wikipedia or elsewhere.

———

Some “simple” variation

Let us consider a hypothetical American male named “Robert John Smith, Jr.” Our hypothetical pal might go or have gone by any of the following given names at some point in his life: Bobby, Bob, Robbie, Rob, Robert, Robin, or even Bert. Unless, of course, as a “Jr.” he decided to go by his middle name because his father was already “Rob”—in which case he might go by any of John, Johnny, or Jack.
Depending on which given name he uses, and the formality of the context, he might write out his “full” name—or have it written out for him—as any of the following:

  • Robert John Smith, Jr.
  • Robert J. Smith, Jr.
  • Robert J. Smith
  • Robert Smith
  • Bob Smith
  • R. John Smith
  • John Smith
  • Jack Smith
  • R. J. Smith, Jr.
  • R. J. Smith
  • R.J. Smith
  • RJ Smith

… and many others.
Mr. Smith might also go by “Junior”, especially with his family, since he is a “Jr.”, or his friends may have nicknamed him “Smitty”, after his last name. Or maybe like a certain Mr. Lund he’s 6’5” (1.95 m) and 270 pounds (120 kg), so his friends ironically nicknamed him “Tiny”.
A lot of this variation is understandable: nicknames tend to be shortened, and in English -y or -ie is a common diminutive suffix. But have you ever wondered how Bob came from Robert, Bill from William, or Peggy from Margaret? Rhyming nicknames have been popular in English-speaking countries at various times (see Bob for more details)—so normal shortened forms of some names picked up rhyming variants, and some of those have more-or-less permanently embedded themselves in the culture. So rhyming Rob to Bob, Will to Bill, and Meg to Peg, plus a diminutive -y or -ie, can explain a lot.
Our fictitious friend might even end up affecting his father’s name! Dad might add a “Sr.” to his name to distinguish him from his son, or go by “Big Rob”, especially within the family. “Rob Junior” might tire of “Jr.” and try to class things up a bit by changing his suffix from “Jr.” to “II”.
If there were ever to be a Robert John Smith, III, the little guy might pick up the nickname “Trey”, which is one that I’m personally fond of, and not just for its interesting etymology. It’s also worth noting that—since naming patterns and customs evolve over time—while many Treys are secretly Somebody S. Something, III, Trey can also be a shortened nickname for Tremaine, and has sometimes been used as a regular given name.
And, of course, once little Bobby gets his law degree, medical degree, and doctorate, he might add one or more of “Esq.”, “Esqr.”, “MD”, “M.D.”, “PhD” or “Ph.D.” to the end of his name, or “Dr.” to the front.
Let us now consider Dr. Smith’s sister, who is also Dr. Smith, M.D., Ph.D., Esq.—though she might also go by Miss Smith or Ms. Smith—whose full name is “Caitlin Roberta Smith”. She will of course be used to people badly misspelling her first name. The English Wikipedia disambiguation page lists Caitlin, Caitlín, Caitlyn, Catelyn, Catelynn, Kaitlyn, Kaitlin, Katelyn, Katelynn, and Katlyn—though many more are attested, including KVIIIlyn—I’ll wait while you figure that one out. Of course, to avoid all the fuss, Cait/Kate/Cate might decide to just go by a nickname based on her middle name, say, Bobbie.
Some women—and some men—decide to change their names when they get married, which creates another unpredictable—but in this case very official—variant of their names.

———

Why it stinks for search: In a search context, all of this variation can become quite bewildering. Computational approaches to resolving names are called entity linking (and that’s after you’ve figured out what’s actually a name—which is named-entity recognition). Using patterns for initials, titles, and suffixes, dictionaries of nicknames, and sometimes context can help, but it’s very easy to miss unusual variants or get false positives.
Alas, nothing other than real-world knowledge will help with people who have changed their names or are known primarily by an uncomputable nickname like Tiny or Squeak. Among the more practical solutions—if there are tireless hordes of dedicated WikiGnomes out there doing the work—are disambiguation pages and redirects! High-quality redirects in particular can be very useful in search because they get treated much like an alternate title for the page. Thanks, WikiGnomes!
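To make the pattern-plus-dictionary idea above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny NICKNAMES, TITLES, and SUFFIXES tables and the normalize and may_match helpers are hypothetical, and this is not how any actual Wikipedia search component works.

```python
import re

# Hypothetical, deliberately tiny lookup tables; real ones would be far larger.
NICKNAMES = {
    "bob": "robert", "bobby": "robert", "rob": "robert", "robbie": "robert",
    "bert": "robert", "jack": "john", "johnny": "john",
    "bill": "william", "will": "william", "peggy": "margaret", "meg": "margaret",
}
TITLES = {"dr", "mr", "mrs", "ms", "miss"}
SUFFIXES = {"jr", "sr", "ii", "iii", "esq", "esqr", "md", "phd"}


def normalize(name: str) -> list[str]:
    """Reduce a written name to a list of comparable lowercase tokens."""
    tokens = []
    for raw in re.split(r"[\s,]+", name.strip()):
        bare = raw.replace(".", "").lower()
        if not bare or bare in TITLES or bare in SUFFIXES:
            continue  # drop "Dr.", "Jr.", "M.D.", "Ph.D.", and so on
        if re.fullmatch(r"(?:[A-Za-z]\.){2,}", raw):
            tokens.extend(bare)  # glued initials: "R.J." -> "r", "j"
        else:
            tokens.append(NICKNAMES.get(bare, bare))  # "Bob" -> "robert"
    return tokens


def compatible(a: str, b: str) -> bool:
    """True if two given-name tokens could be the same name."""
    if len(a) == 1 or len(b) == 1:  # an initial matches any name it starts
        return a[0] == b[0]
    return a == b


def may_match(name1: str, name2: str) -> bool:
    """Guess whether two written names could belong to the same person."""
    n1, n2 = normalize(name1), normalize(name2)
    if not n1 or not n2 or n1[-1] != n2[-1]:  # last token is treated as the surname
        return False
    given1, given2 = n1[:-1], n2[:-1]
    if not given1 or not given2:  # surname only, e.g. "Dr. Smith"
        return True
    return compatible(given1[0], given2[0])  # only the first given name is compared
```

Even this toy version shows both failure modes: may_match("Jack Smith", "Robert John Smith, Jr.") is False because only the first given name is compared (a miss), while may_match("R. Smith", "Rachel Smith") is True (a false positive).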

———

A tale of two (or more) writing systems

To shift gears a bit, I want to tell you about one of my all-time favorite things related to searching for names—it has to do with the transliteration of Russian names!
First, a few preliminaries: in case you’ve never noticed, the sound represented by English “ch” is really just a “sh” following a “t”. Yep, “tsh” is the same as “ch”. (Some additional semi-mind-blowing facts: many English speakers pronounce word-initial “tr” as “chr”, because of an epenthetic “sh” that pops up between the “t” and the “r”, so many people say “tree” as “chree” and “Trey” as “Chray”. Also, similarly to t + sh = ch, d + zh = j. Really.)
Also, since the sounds represented by English “sh” and “ch” are not as common across languages as, say, p, b, t, d, k, and g, they have much less consistent spelling in the Latin alphabet. For example, in French, English “sh” is spelled “ch”, and “ch” is spelled “tch”. German has “sch” and “tsch”. Polish uses “z” a bit like English uses “h”, and so has “sz” and “cz”. Several Slavic languages have nice special-purpose letters: “š” and “č”.
Back to Russian and Russian names: the Cyrillic character Щ/щ is called shcha in English, and in some languages it is pronounced more-or-less like English “shch”. In Russian, it no longer has that sound (though it still does in Ukrainian and Rusyn), but following older tradition, Russian names with Щ are transliterated as “shch” in English.
For example, there is a Russian composer named Родион Щедрин, whose name is transliterated into English as “Rodion Shchedrin”. His first name is fairly consistently spelled “Rodion”, but his last name is all over the place when transliterated into the Latin alphabet through other languages, each trying to capture “shch” in their own way. In Czech it’s efficiently rendered as Ščedrin, while German has the much, much longer Schtschedrin, French Chtchedrine, and Polish Szczedrin. Other variants include Catalan Sxedrín, Danish Sjtjedrin, Hungarian Scsedrin, Dutch Sjtsjedrin, Romanian Șcedrin, and Finnish Štšedrin.
Now you can figure out why the composer Tchaikovsky’s name is spelled with an apparent silent T. In Russian it starts with the letter Ч, which in many languages sounds like English “ch” and is generally romanized as such. In this case the name came into the Latin alphabet through French as “tch”, and that spelling became standardized in English, too.
This kind of transliteration-based variation isn’t limited to Russian or Cyrillic, of course.
For decades, there was a mixture of confusion and a running gag over the many ways to spell Libyan leader Gaddafi/Khadafy/Qadhafi’s name. In addition to inconsistent romanization, Arabic script doesn’t normally spell out all the vowels, and the pronunciation of the unwritten vowels varies by dialect, giving many layers of inconsistency. Thus, native speakers of different varieties of Arabic could pronounce a name with significant differences, and then transliterate their pronunciations according to different transliteration schemes, made even more divergent in languages with different spellings of the same sounds (like “sh” and “ch” above).
An article in The Straight Dope on the variation of “Gaddafi” appeared as early as 1986, and as late as 2009, ABC News listed 112 variations (see the relevant footnote on the Wikipedia article). For (a peculiar kind of) fun, I worked up a regular expression that matches them all: ([KG]h?|Qu?)[aue]([dtz][h']?)+[aā]f+[iīy]. A regex to match all variants of his given name is left as an exercise for the reader.
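For anyone who wants to poke at that pattern, here is a minimal sketch in Python. The is_gaddafi_variant helper is a hypothetical name, and the sample list is only the three spellings mentioned above plus two other commonly seen ones and a non-match; it is not the ABC News list.

```python
import re

# The pattern from the text; fullmatch() requires it to cover the whole surname.
GADDAFI = re.compile(r"([KG]h?|Qu?)[aue]([dtz][h']?)+[aā]f+[iīy]")


def is_gaddafi_variant(surname: str) -> bool:
    """True if the surname is matched in full by the pattern (case-sensitive)."""
    return GADDAFI.fullmatch(surname) is not None


samples = ["Gaddafi", "Khadafy", "Qadhafi", "Qaddafi", "Gadhafi", "Smith"]
for surname in samples:
    print(f"{surname}: {is_gaddafi_variant(surname)}")
```

The first five print True and Smith prints False. Note that the pattern is case-sensitive and knows nothing about prefixes such as “al-” or “el-”, so a real search would need to handle those separately.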
Of course, the only inarguably correct spellings of Muammar and Rodion’s surnames are القذافي and Щедрин, respectively!

———

Why it stinks for search: Again, disambiguation pages and redirects are a practical and accurate approach in various Wikipedias, though the level of effort required to create and maintain them—have you thanked a WikiGnome today?—is untenable for many search scenarios. Phonetic algorithms can help, but they invariably suffer from false positives, false negatives, or significant complexity—or all three at once! There are sometimes useful trade-offs to be made, like limiting the kind of names the phonetic matching has to accommodate, but a general solution is very difficult.
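As a small illustration of those trade-offs, here is a sketch of American Soundex, one of the oldest phonetic algorithms, in Python. Soundex is my choice of example only; nothing here implies that it is what Wikipedia search uses, and this version simply ignores any non-ASCII letters.

```python
# Map each coded consonant to its Soundex digit; vowels, h, w, and y get no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]  # simplification: ASCII only
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:  # collapse runs of letters with the same code
            result += code
        if c not in "hw":  # h and w do not separate identical codes; vowels do
            prev = code
    return (result + "000")[:4]  # pad with zeros, keep four characters


for name in ["Robert", "Rupert", "Shchedrin", "Szczedrin", "Schtschedrin",
             "Chtchedrine", "Gaddafi", "Khadafy"]:
    print(name, soundex(name))
```

Robert and Rupert both come out as R163 (a false positive if you only wanted Roberts). Shchedrin and Polish Szczedrin agree on S365, but German Schtschedrin is S323 and French Chtchedrine is C323, and Gaddafi (G310) never meets Khadafy (K310) because Soundex keeps the first letter as-is: false negatives all around.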

———

Surnames: it’s complicated

Surnames as family names are a relatively recent invention in many cultures. From the English Wikipedia article on surnames:
Many cultures have used and continue to use additional descriptive terms in identifying individuals. These terms may indicate personal attributes, location of origin, occupation, parentage, patronage, adoption, or clan affiliation. These descriptors often developed into fixed clan identifications that in turn became family names as we know them today.
Some English occupational names include Baker, Carpenter, Farmer, Miller, Potter, Weaver, and Smith. (Blacksmiths were very important in many cultures, and as a result, the words for smith or blacksmith in many languages have become surnames: Demirci, Fabbro, Haddad, Herrera, Kajiya, Kalējs, Kalvaitis, Kovács, Kovář, Lefèvre, Lohar, McGowan, Nallbani, Schmitt, Seppä, Sideras, Smed, Zargar and many others.)
Patronymics (names based on the name of a male ancestor) are another source of surnames, and they are good candidates to fossilize into family names. Patronymic markers include Arabic Ibn- and Bin-, Aramaic Bar-, Celtic Mc-, Norman Fitz-, Hebrew Ben-, Persian -pur, Scandinavian -sen, and others. Matronymics are rarer, but also occur. Many English surnames that follow the pattern of “male name + -son or -s” come originally from English or Welsh patronymics: Johnson, Robertson, Williams, Adams, Edwards, and Jones.
Some cultures don’t have surnames, or use surnames in a different way from most Western European names. Javanese people in Indonesia sometimes have only one name, or mononym. Other variations, including multiple names without a family name, also occur; see the Wikipedia page for examples. Icelandic names typically use a patronymic (or matronymic) as a surname instead of a family name, using the parent’s name, plus -son or -dóttir. Most people know that many East Asian names are ordered with family name first, then given name, though this is also the traditional order in Hungary. When transliterated into languages that use the traditional Western name order, East Asian names are sometimes re-ordered, sometimes not, which leads to confusion.

———

Why it stinks for search: For the purposes of search, mononyms, patronymics, and name element re-ordering often don’t matter much, unless you are dealing with highly structured data. If you know what elements to search for, you should be able to find them as a simple bag of words—that is, not paying attention to the order the words are in. Other naming traditions can lead to more confusion, though.
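A bag-of-words match is simple enough to sketch in a few lines of Python. The bag and bag_match helpers below are hypothetical names for illustration, and real search engines layer a great deal on top of this.

```python
import re
from collections import Counter


def bag(text: str) -> Counter:
    """Lowercase the text and count its words, ignoring order and punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))


def bag_match(query: str, title: str) -> bool:
    """True if every query word occurs in the title at least as often, in any order."""
    return not (bag(query) - bag(title))  # Counter subtraction keeps only unmet words


print(bag_match("Einstein, Albert", "Albert Einstein"))
print(bag_match("Smith Robert", "Robert John Smith, Jr."))
print(bag_match("Bob Smith", "Robert John Smith, Jr."))
```

The first two calls print True, since word order and extra name elements don’t matter. The last prints False: ignoring order does nothing for nicknames.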

———

Traditional Spanish names include two surnames, with the first coming from the father (and before that, the father’s father), and the second coming from the mother (and before that, the mother’s father). The first surname is in some sense the “main” surname, in that José Antonio Gómez Iglesias would be referred to as Señor Gómez or José Gómez, rather than as Señor Iglesias or José Iglesias, as those unfamiliar with the system might suppose. Ongoing cultural shifts have resulted in more flexibility in naming in Spain, and the system has further evolved in Latin America and the U.S., where some Hispanic people have adopted the single family name model. Searching for the wrong shortened version of a name (e.g., José Iglesias based on the full name José Antonio Gómez Iglesias) is a good way to not find what you are looking for.
Traditional Arabic names contain many interesting parts, including a variable number of patronymics and possibly a paedonymic (a name based on the name of a child), religious elements, and elements indicating place of origin, tribal affiliation, or ancestry—all depending on context and level of formality. Improperly trying to fit elements of the name into a Western name schema can lead, as with Hispanic names, to considering the wrong name elements as the ones primarily used to refer to someone. To further complicate matters, some of the elements—particularly patronymics (bin Laden) and elements based on ancestry (Al Saud)—have become surnames.

———

Why it stinks for search: Once again, WikiGnomes often save our collective bacon in these situations with redirects and disambiguation pages that help you figure out what you are looking for or help the search engine find it for you.

———

An onomastic miscellany

Here are some random additional name-related fun facts that didn’t make it into the discussion above:

  • Onomastic is a nerdy word that means “related to names.”
  • A lot of given names come from surnames. There are lists for male and female names, and you can find more by searching Wiktionary for the phrase “transferred from the surname”.
  • “Daisy” is a nickname for “Margaret” because the French version of the name, “Marguerite”, is also the French name for a kind of daisy.
  • Russian Wikipedia uses the traditional “Surname, GivenName” order for titles, so Albert Einstein is listed as “Эйнштейн, Альберт” (“Einstein, Albert”). This does make sorting easier.
    • English Wikipedia and others use DEFAULTSORT to help handle the complexity of figuring out where a given name ends and a surname begins for sorting: {{DEFAULTSORT:Einstein, Albert}} and {{DEFAULTSORT:King, Martin Luther Jr.}}.
  • In systems that require name elements that a person’s name doesn’t have, you will sometimes see the abbreviations NFN, NMN, or NLN, for “no first name”, “no middle name”, or “no last name”.
  • Another aspect of naming we didn’t touch on is online identities; you can get to know someone by an online moniker without ever knowing their “real” name.
    • Debates over whether online users should use their legal names have been dubbed “nymwars”.
  • The Korean name Park is spelled that way in English because non-rhotic varieties of British English use “r” as a mark of vowel length, so it was the obvious way to spell what sounded more-or-less like “pahk”. Other transliterations include “Bak” and “Pak”. The only unambiguously correct spelling is 박.
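
The DEFAULTSORT idea from the list above boils down to sorting by a hand-written key instead of the displayed title. Here is a minimal sketch in Python; the SORT_KEYS table and sort_key helper are hypothetical, and only the two Einstein and King keys come from the examples above.

```python
# Hypothetical table of page titles and their hand-written sort keys.
SORT_KEYS = {
    "Albert Einstein": "Einstein, Albert",
    "Martin Luther King Jr.": "King, Martin Luther Jr.",
    "Caitlin Smith": "Smith, Caitlin",  # made-up page for the fictional sister
}


def sort_key(title: str) -> str:
    """Use the explicit sort key if one exists, otherwise fall back to the title."""
    return SORT_KEYS.get(title, title).casefold()


titles = ["Martin Luther King Jr.", "Albert Einstein", "Rodion Shchedrin", "Caitlin Smith"]
by_key = sorted(titles, key=sort_key)
print(by_key)
```

With the keys in place, Caitlin Smith sorts under S rather than C. Rodion Shchedrin has no key in this toy table, so he is filed under R, which is exactly the kind of gap a human has to notice and fill in.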

Winding down and wrapping up

There is a lot more to names—see “Further reading” below—but we’ve covered the general classes of problems we are likely to encounter when searching for people by name: unexpected variation in the preferred form of a name, unpredictable nicknames, cross-cultural confusion, transliteration trouble, spelling struggles, and overall orthographic anarchy. Many of these concerns also apply when searching for places and other proper nouns besides people. All this variation in names sometimes stinks!—but the Search Platform team is always working to improve search for Wikipedia and its sister wiki projects.

Further reading

Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

3 Comments

For a more general look at some of the incorrect assumptions software developers make about names, check out the blog post “Falsehoods Programmers Believe About Names”—which despite being almost 8 years old, is still a relevant list of common name-related mistakes: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
Wikipedia et al.’s approach of titles, redirects, and default sort looks smarter and smarter all the time!

Lovely overview! And +1 on thanking the redirect wikignomes. And of course, everyone can add a missing redirect when a search query unexpectedly doesn’t return a certain article due to variants.

Precision is not always the key with names. Since there are so many languages and cultures on the web, the best option is to localize the database of names by country, age, and city, which would be more ideal for searching.