So -happy to meet you: Advanced searching techniques on Wikimedia sites


Image by Camdiluv, modified with color inversion, CC BY-SA 2.0.

This summer, MediaWiki user This, that and the other (a.k.a. TTO) created a ticket in Phabricator to report that our search results seem to be random when a query begins with a hyphen.
That is in fact a reasonable interpretation of what happens, say, on English Wiktionary when you are searching for a suffix like -happy (as in trigger-happy) or -minded (as in fair-minded): you get about 5.3 million results, and you may get them in the same order for different suffixes, or maybe you get a different order for the same suffix a few seconds later. For some suffixes, like -in-law (as in sister-in-law), you get the entry for that suffix, followed by the similarly (or maybe differently!) ordered 5.3 million results. This doesn’t happen with prefixes, like pre-, un-, Ægypto-, oö-, or polydeoxyribo- or words with internal hyphens, like trigger-happy, fair-minded, or sister-in-law. What gives?
The answer could be the plot to a heroic fantasy adventure film: Clash of the Syntaxes! Okay, maybe not, but I still want to hear Liam Neeson (YouTube) say, “Release the klikken!”

———

In a previous post, I talked about how people asking questions on Wikipedia inadvertently fell afoul of the single-character wildcard ? (now \? to prevent the problem). Something similar is happening here because that hyphen, used to indicate a suffix, is also used to negate a search term. So searching for -happy on Wiktionary returns all 5.3 million entries that do not have happy in them—which is most of them. Similarly, most entries do not have minded in them, so searching for -minded gives a similarly huge, mostly useless list.
The order appears random because there’s no meaningful basis for ranking them (they all lack happy or minded to the same degree), and whatever arbitrary criterion is used to sort one batch of about 5.3 million results is used to sort the other batch. When the order of the results changes, that’s probably because the search got routed to a different server, which sorts everything in a slightly different but equally arbitrary order.[1] Over time, the results will change on any given server as well, as the search index is updated and optimized through normal use.
In the case of -in-law, there is a second search pathway activated—an exact title match—which puts the entry for -in-law at the top of the big pile of 5.3 million results that do not contain in-law.
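To make the two pathways concrete, here is a minimal sketch in Python. It is a toy model and not the actual search code: the names toy_search, contains_phrase, and pages are made up for illustration, and the four sample entries are invented. It only assumes what is described above, namely that a leading hyphen negates the term and that an exact title match is placed first.

```python
import re

def words(text):
    """Lowercase a string and split it into word tokens; hyphens are dropped."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(text, phrase):
    """True if the tokens of `phrase` appear consecutively in `text`."""
    have, want = words(text), words(phrase)
    return any(have[i:i + len(want)] == want
               for i in range(len(have) - len(want) + 1))

def toy_search(query, pages):
    """Hypothetical search over a dict of title -> text."""
    results = []
    # Pathway 1: an exact title match goes to the top of the list.
    if query in pages:
        results.append(query)
    # Pathway 2: full-text search, where a leading hyphen negates the term.
    negated = query.startswith("-")
    term = query[1:] if negated else query
    for title, text in pages.items():
        # Keep pages that contain the term, or that lack it when negated.
        if contains_phrase(text, term) != negated and title not in results:
            results.append(title)
    return results

# Invented sample entries, standing in for 5.3 million real ones.
pages = {
    "trigger-happy": "trigger-happy: too ready to shoot",
    "-in-law": "-in-law: a suffix for relatives by marriage",
    "sister-in-law": "sister-in-law: the sister of one's spouse",
    "cat": "cat: a small domesticated feline",
}
```

With these sample entries, toy_search("-happy", pages) returns every page except trigger-happy, in whatever order the pages happen to be stored, while toy_search("-in-law", pages) returns the -in-law entry first, followed by the pages that lack in-law.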

Shriek, shriek, bang, bang

To make matters more complicated, there’s another way to negate a search: with an exclamation point (!), also called bang, shriek, pling, or, long ago, ecphoneme. No, really!
The exclamation point is also a letter in some African languages, where it usually stands for an alveolar click. Words can start with such clicks, as in the language names !Kung and ǃXóõ, both of which have entries in English Wiktionary. And of course there are many articles in English Wikipedia with titles or redirects that start with an exclamation point—like the title of the article on the punk band !Action Pact!, or easy-to-type redirects to titles that start with an inverted exclamation point such as !Uno!, which redirects to ¡Uno! (As with internal hyphens, internal exclamation points—as in P!nk, The Amaz!ng Meeting, or L!VE TV—don’t cause any problems.)
As a result, searching for -happy or !happy gives the same results, as does searching for -minded or !minded. However, !in-law gets one fewer result than -in-law because there is no exact match. Conversely, !Kung gets one more exact-match result than -Kung.
Now we know what’s happening, and why, but what can we do about it?

-mind-reader

It’s often hard to divine users’ intent from the scant evidence provided in a query. In the case of TTO’s query -happy, we know that a good result would have been either an entry for the suffix -happy, or at least a list of entries that contained -happy. On the other hand, I like to search English Wikipedia for -the to see how many articles don’t contain the—there are over 71 thousand! (I like to see how far I can get down the list of most frequent words in English before running out of Wikipedia articles. There are dozens of articles without any of the top 50 most frequent words in them—though, boringly, they tend to be sports rosters. It’s not a great hobby, but it keeps me off the streets.)
My colleague David Causse pointed out some of these use cases in the Phabricator ticket—including my odd hobby and another use case I hadn’t thought of: on small, growing wikis, editors may meaningfully look for all pages that don’t contain some particular text. It can be very difficult to know what people are trying to do when you have fewer than ten characters on which to base a determination.
The best bet in this case seems to be to provide lexicographically sophisticated users who want to search for suffixes with some advice to turn them into sophisticated searchers as well.

Search nerds, level up!

An obvious approach that doesn’t actually work is to put quotes around the hyphenated search term, such as “-happy” or “-minded”. The text inside the quotes is treated somewhat literally—so searching for “hoping” will not match hope, hoped, and hopes, the way that searching for quoteless hoping will. However, hyphens are still ignored in quoted searches, so that, for example, “well-known” and “well known” match each other.
The quotes do at least block the negating powers of the hyphen. For “-minded” this works out reasonably well on Wiktionary because minded isn’t used much except in the kind of compounds you might be looking for: closed-minded, open-minded, simple-minded, etc. Similarly, “!Kung” ignores the exclamation point. For “-happy”, however, there are too many other uses cluttering up the results.
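The behavior of quoted searches can be sketched roughly as follows. This is a toy illustration under simplifying assumptions, not how the real analysis chain works: crude_stem is a deliberately naive, made-up stand-in for real stemming, matches is a hypothetical helper, straight quotes are used instead of curly ones, and negation is not modeled at all.

```python
import re

def tokens(text):
    """Lowercase and split into word tokens; hyphens and '!' are dropped."""
    return re.findall(r"\w+", text.lower())

def crude_stem(word):
    """Naive stand-in for a stemmer: strip one common English ending."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def matches(query, text):
    """Hypothetical matcher: quoted queries skip stemming, unquoted ones stem."""
    quoted = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    if quoted:
        # Exact word forms, in order, but the hyphen is already gone.
        wanted, have = tokens(query[1:-1]), tokens(text)
        n = len(wanted)
        return any(have[i:i + n] == wanted for i in range(len(have) - n + 1))
    # Unquoted: compare stems, so hoping, hope, hoped, and hopes all agree.
    wanted = {crude_stem(t) for t in tokens(query)}
    have = {crude_stem(t) for t in tokens(text)}
    return wanted <= have
```

In this sketch, matches('hoping', 'she hoped for rain') is true but matches('"hoping"', 'she hoped for rain') is false, and matches('"-minded"', 'she minded the shop') is true because the hyphen never makes it into the tokens.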
Among the lesser-known search keywords is insource:. It searches the raw wikitext and it allows regular expressions (also regexes or regexps), which can be both powerful and costly. Regexes, which are marked with /slashes/, in addition to allowing complex pattern searching, are also very literal. By default, even upper and lower case versions of the same letter do not match! Adding i after the closing slash tells the regex to be case insensitive.
Regexes also allow us to search for that hyphen or exclamation point, with a query like insource:/-minded/i or insource:/-happy/i—however, we probably shouldn’t search like that.

Marcia! Marymarcia! Matamarcianos!

Regexes are very expensive to search with because they don’t use an index. A search index knows, for example, that the word happy appears as word #137, #492, and #517 in document #5943, etc., etc.—making it easy to find all articles with happy in them. When using regexes, each document must be scanned letter by letter to see if there’s a match. With over five million articles on the English Wikipedia and over five million entries on the English Wiktionary, such scans usually take much too long and users get a warning that their search did not complete, along with their incomplete results.
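The difference between an index lookup and a letter-by-letter scan can be sketched in a few lines of Python. This is only an illustration of the general idea of a positional inverted index, not the real search index; build_index, indexed_lookup, regex_scan, and the three sample documents are all invented for the example.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to {document id: [positions of that word]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(position)
    return index

def indexed_lookup(index, word):
    """One dictionary lookup, no matter how many documents there are."""
    return sorted(index.get(word.lower(), {}))

def regex_scan(docs, pattern):
    """Every document's text has to be read to look for the pattern."""
    compiled = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items()
                  if compiled.search(text))

# Invented sample documents.
docs = {
    5943: "I am so happy, happy, happy today",
    5944: "a trigger-happy sheriff",
    5945: "nothing to see here",
}
index = build_index(docs)
```

Here index["happy"][5943] is [3, 4, 5], and indexed_lookup(index, "happy") finds documents 5943 and 5944 without reading any text. regex_scan(docs, r"-happy") finds only 5944, but it gets there by scanning all three documents, which is the part that does not scale to millions of pages.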
Also because of that letter-by-letter scanning, regexes can make unintended matches. The pattern /marcia/ actually won’t match the name Marcia (YouTube) because regexes are so literal, and without the i at the end m is not the same as M. Worse, that regex does match Marymarcia, matamarcianos, and artemarcialistas. Crafting exactly the right regex is as much art as science and beyond the scope of this blog post, but we will be getting to an alternative that works much better than showing 5.3 million maximally irrelevant results.
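The Marcia example is easy to try out with Python's re module. Note the assumption: Python's regex dialect is not the same as the one behind insource:, so this only illustrates the literal, case-sensitive, match-anywhere behavior described above. The helper regex_hits is made up for the example.

```python
import re

names = ["Marcia", "Marymarcia", "matamarcianos", "artemarcialistas"]

def regex_hits(pattern, candidates, ignore_case=False):
    """Return the candidates in which the pattern matches anywhere."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [c for c in candidates if compiled.search(c)]

# Case-sensitive: "Marcia" is skipped, but the longer words all match.
case_sensitive = regex_hits("marcia", names)

# Case-insensitive, like adding i after the closing slash: all four match.
case_insensitive = regex_hits("marcia", names, ignore_case=True)
```

After running this, case_sensitive holds Marymarcia, matamarcianos, and artemarcialistas, and case_insensitive holds all four names.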

Turning the tables

Surprisingly, part of our problem can become part of our solution. In the case of suffixes like -happy, we know that the suffix, when it occurs, will be indexed as plain happy. We can use that to our advantage. While most Wiktionary entries that contain happy do not contain -happy, (a) all entries that contain -happy do in fact contain happy, and (b) there are a lot fewer than 5.3 million of them.
We can split our final search into two parts:

  • "happy" which limits the collection of entries the regex needs to scan to fewer than 3,000—much better than the full 5.3 million!—and,
  • insource:/-happy/i which further restricts results to those that contain the hyphen right before happy, with i at the end because case doesn’t matter.
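
The two parts can be sketched as a two-stage filter in Python. This is a toy model with invented sample entries, not the real implementation: two_stage_search is a hypothetical name, the first stage uses a plain list comprehension where the real search would use its index, and it handles only single-word terms like happy.

```python
import re

def two_stage_search(entries, word, pattern):
    """Cheap word filter first, then the expensive regex on the survivors."""
    # Stage 1: keep entries that contain the plain word.
    candidates = [title for title, text in entries.items()
                  if word in re.findall(r"\w+", text.lower())]
    # Stage 2: run the case-insensitive regex only on those candidates.
    compiled = re.compile(pattern, re.IGNORECASE)
    hits = [title for title in candidates if compiled.search(entries[title])]
    return candidates, hits

# Invented sample entries.
entries = {
    "trigger-happy": "trigger-happy: too ready to fire a gun",
    "happy": "happy: feeling joy",
    "slap-happy": "Slap-Happy: dazed or carefree",
    "cat": "cat: a small feline",
}
candidates, hits = two_stage_search(entries, "happy", r"-happy")
```

With these entries, the first stage narrows four entries down to three candidates, and the regex then keeps trigger-happy and slap-happy while dropping plain happy. The regex never has to look at cat at all, which is the whole point.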

Our final queries look like this:

  • "happy" insource:/-happy/i
  • "minded" insource:/-minded/i
  • "in-law" insource:/-in-law/i

The quotes around the first term aren’t strictly required, but they do filter out some additional results since plain minded, for example, will also match mindedness and mindedly, which we may not be interested in.
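If you would rather run such queries from a script, the same search string can be sent to the MediaWiki Action API using list=search and its srsearch parameter. The sketch below only builds the URL and does not send any request. The function name search_url and its defaults are my own choices for illustration, and it assumes the suffix contains no characters that would need escaping inside a regex.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def search_url(suffix, host="en.wiktionary.org", limit=20):
    """Build an Action API URL for the combined quoted-word and regex query."""
    word = suffix.lstrip("-")  # "-happy" becomes "happy"
    # Assumption: suffix has no regex metacharacters or slashes to escape.
    query = f'"{word}" insource:/{suffix}/i'
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{host}/w/api.php?" + urlencode(params)

url = search_url("-happy")
# Decode the query string again to check what would be sent.
sent = parse_qs(urlsplit(url).query)["srsearch"][0]
```

Here sent holds the same text you would type into the search box: "happy" insource:/-happy/i.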

Read The Fantastic Manual

While this is a lot of added complexity, it is also a lot of added power and precision—and there are many more options and methods for bending search to your will. A good way to learn more is to peruse[2] the documentation and then just try things out for yourself!

Footnotes

1. Once this post is more than, oh, about 37 seconds old, it’s quite possible that some of the details could be out of date. Eventually someone may create an entry for -minded or -happy—but the general idea will still be the same.

2. Either of the conflicting senses will do.

Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.
