Helping you find that needle in the haystack: Building Wikipedia's search functions


“The Gleaners,” by Jean-François Millet, public domain/CC0.

On a daily basis, millions of terms are entered into the Wikipedia search engine. What comes back when people search for those terms is largely due to the work of the Discovery team, which aims to “make the wealth of knowledge and content in the Wikimedia projects easily discoverable.”
The Discovery team is responsible for ensuring that visitors searching for terms in different languages wind up on the correct results page, and for continually improving the ways in which search results are displayed.
Dan Garry leads the Backend Search team which maintains and enhances search features and APIs and improves search result relevance for Wikimedia wikis. He and his team have a public dashboard where they can monitor and analyze the impact of their efforts. Yet, they do much of their work without knowing who is searching for what—Wikipedia collects very little information about users, and doesn’t connect search data to other data like page views or browsing habits.
Dan and I talked about how the search team improves search without knowing this information, and how different groups of people on Wikipedia use search differently. An edited version of our conversation is below.


Mel: You mentioned in an earlier conversation we had that power editors use Wikipedia’s search in a completely different way than readers. What are some of the ways that power editors use search?
Dan: Power users use search as a workflow management tool. For example—they might see a typo that annoys them or a word in an article that is misused a lot, or be looking for old bits of code that need to be changed, and then search for that to see if corrections can be made. In that case, unlike your average user, they’re actually hoping for zero results from their query, because it means the typo isn’t present anywhere.
Another way that power users might use search is to look for their usernames because they might want to find places where they’ve been mentioned in discussion—and they want to “sort pages by recency” so that they can see the most recent times they’ve been mentioned.
That represents a divergence from someone who simply wants to find an article. Our power users aren’t always trying to find an article—they’re trying to find pages that meet certain criteria so they can perform an action on those pages. They’re interested in the whole results set, rather than 1-2 results.


Mel: It sounds like power editors don’t always want or need relevancy. (Although I’m sure sometimes they do.)
Dan: That’s right. It’s something we’d like to study more in-depth. We prioritize relevancy for readers but editors and even some kinds of readers might need something completely different.

Dan Garry. Photo by Myleen Hollero/Wikimedia Foundation, CC BY-SA 3.0.


Mel: There are a lot of ways to search Wikipedia. Off the top of my head, I can think of searching through external search engines, through an individual article page, and then on the mobile apps. Do you notice differences between all of these different pathways into the site?
Dan: Occasionally we do. I used to be a product manager for mobile and I was focusing a lot on search. I was interested in search as an entry point for the mobile app.
But we found that a lot of people were having trouble with things like finding the search tool. We had made an assumption that keeping a search query in the search bar would be useful for the end user, but people thought that was the title of the page, and they were really confused.
When we realized that this could be an issue, we did a lot of qualitative user studies with people, and asked staff who weren’t on the product team what they thought. It was helpful to get perspectives on this feature from outside the dev team, and from actual users.
We decided to change the way that search appeared in the app once a page loaded. When people navigated to that page, we deleted their search phrase from the search box which helped people know where to look to start searching again.
We’ve also thought quite a bit about images and their relationship to search. We thought about adding images in search results, and we found that adding images to the search results changed user behavior quite a bit. Instead of clicking on the first link, which may or may not have been the most relevant result, they would almost always prefer articles with pictures, even if the articles were further down the search results page. We asked why, and people said that they felt that the result was more comprehensive or complete.
It’s funny how changing something small can immediately have a huge effect. When we made the picture change, we also saw a small drop in people clicking through to the articles. This alarmed us because we thought we were enhancing things for the end user, and we were worried that by adding the pictures, we may have inadvertently caused them to not get the information they needed. But we did some digging, and found it was the opposite: for some queries, the answer to the search query was given in the search results, so they didn’t need to go to the article. We were meeting users’ needs earlier in the search process, which was fantastic.
You really need both quantitative and qualitative data to truly understand all the ways users use your product. Having either only one or the other can paint an unclear picture.


Mel: What kinds of things do you think about when thinking about relevancy?
Dan: This is a tricky topic. The fundamental approach assumes that you can break down relevance into an equation that aggregates different factors, and then produces results that are “the most relevant.” That’s clearly not always going to be the case. If I search for ‘Kennedy,’ I could be looking for the airport, or the President, or I might be looking for John Jr. or Ted. There is no single correct “most relevant result” for that query.
There’s a multitude of different factors—we used to use something called tf-idf to figure out what to surface in what order. tf-idf stands for ‘term frequency—inverse document frequency’, which combines how often a term appears in a given article with how rare that term is across the whole site.
So if I were to search for “Sochi Olympics”, the word “Sochi” is relatively rare, but the word “Olympics” is much more common, so the engine knows that the “Sochi” part of the query is probably the more important one, and that’s how it finds the 2014 Winter Olympics article as opposed to other articles about the Olympics.
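The tf-idf idea Dan describes can be sketched in a few lines. This is a toy illustration with made-up documents, not the actual CirrusSearch implementation:

```python
import math

# Toy corpus: each "document" is just a list of words.
docs = {
    "2014 Winter Olympics": "sochi olympics winter games olympics".split(),
    "Summer Olympics": "olympics summer games olympics".split(),
    "Olympic Games": "olympics games history olympics".split(),
}

def tf(term, doc):
    # Term frequency: how often the term appears in this document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms found in fewer documents score higher.
    containing = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / containing) if containing else 0.0

def score(query, doc, docs):
    return sum(tf(t, doc) * idf(t, docs) for t in query.split())

ranked = sorted(docs, key=lambda name: score("sochi olympics", docs[name], docs),
                reverse=True)
print(ranked[0])  # the Sochi article ranks first, because "sochi" is rare
```

Because “olympics” appears in every document, its idf is zero and it contributes nothing; the rare word “sochi” decides the ranking, just as in Dan’s example.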

Melody Kramer. Photo by Zachary McCune/Wikimedia Foundation, CC BY-SA 3.0.


Mel: It sounds like that would be challenging for words that have multiple meanings.
Dan: That’s true and something we think about a lot. If you go to Wikidata, and you search for life on the search page, you get search results like: Life Sciences, the Encyclopedia of Life, IUBMB Life, Cellular and Molecular Life Sciences, the phrase slice of life, the video game Half-Life… but you don’t get the item on the concept of something being living.
And that’s because of the term frequency and inverse document frequency. A lot of the pages I just mentioned have the term life in them. And, by coincidence, the item about life itself doesn’t actually have the word life in it very often. Which means the actual result for life is far down, because it doesn’t seem as important as the others, even though it is!


Mel: I imagine there must be ways to mitigate that.
Dan: We’ve switched to an algorithm Okapi BM25 instead of tf-idf – it’s a newer algorithm. (BM stands for Best Match.) Basically, what BM25 says is that there isn’t a huge difference between a term being mentioned 1000000 times and a term being mentioned 10000 times.  Using the new algorithm and switching to a more precise way of storing data about articles helped with the Kennedy problem a lot, because it’s paying less attention to how frequently the word Kennedy appears in other pages since it’s used a lot in this page. Before John Fitzgerald Kennedy was on the second page of result, and now he’s about 7th or 8th in terms of results.


Mel: Does the site use BM25 everywhere?
Dan: We use BM25 on every Wikipedia that is not in Chinese, Thai, Japanese, or other languages where words in a sentence don’t have spaces in between them. We tested BM25 and it caused a massive spike in the zero results rate on the spaceless languages due to a bug in the way words are broken up, or tokenized. When we learned the algorithm wasn’t working on those languages, we deployed it everywhere else. We’re hopeful that we can fix that problem for spaceless languages in the future.
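The tokenization problem is easy to see with a naive whitespace split. This is only an illustration; CirrusSearch uses language-specific analyzers, not Python’s split:

```python
english = "2014 winter olympics in sochi"
japanese = "ソチオリンピック"  # "Sochi Olympics" written without spaces

# Whitespace tokenization works for English...
print(english.split())   # ['2014', 'winter', 'olympics', 'in', 'sochi']
# ...but treats a spaceless Japanese phrase as one giant "token",
# so queries for the individual words inside it match nothing.
print(japanese.split())  # ['ソチオリンピック']
```

A tokenizer that can’t split such text leaves the index with unmatchable mega-tokens, which is why a tokenization bug shows up directly as a higher zero results rate.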


Mel: What has been the most unexpected thing you’ve learned through search?
Dan: There is a surprisingly long tail when it comes to the frequency of searches.
One of the first things we were asked by our community members is “Why don’t you make a list of the most popular queries that give zero search results so editors can make redirects or find articles that need to be written?”
The data is not that useful, as it turns out. In our analysis of the problem, some of the most popular zero result searches were “{searchTerms}” and “search_suggest_query” which we think are bugs in certain browsers or automated search systems.
We also found that a lot of people were searching for DOIs, which are digital object identifiers used by academic researchers. Most of the searches for those got zero results. We had to ask ourselves “What are people doing?” And we found there was a tool that let researchers put a DOI into it to see whether their paper was cited in Wikipedia. Of course, most papers that people are searching for aren’t in Wikipedia, so it’s actually correct to give them zero results!
When I started in search, we believed that users should never get zero results when searching. But it turns out that a lot of people were searching for things we don’t have and it’s correct to give them zero results.


Mel: I know that Wikipedia has a very strict privacy policy and tracks hardly anything. What do we collect?
Dan: We do track some info. We have event logging that says ‘This user with this IP clicked on the 4th result, it took us this long to give them results’, and so on. But it’s Wikimedia’s policy to delete all personally identifying information after 90 days. That is a very intentional decision we made to protect user privacy.
If you don’t want information about users to be revealed, the only thing you can do is not record it. If we get subpoenas, we are legally required to comply. But if we don’t have that information, we obviously can’t give it out! So it’s the safest way to keep users’ privacy protected. We can figure out some things by language, but not geography.
But it’s tricky sometimes. A good example of that within the Latin alphabet is the search term “paris”. What language is that in? Is it English? French? If I search for “cologne”, it’s a city in Germany but also a perfume in English. And that’s an example of relevance. Is a user who searches for “cologne” searching for a fragrance or a city? These things make delivering good search results really hard, but we keep on trying, and keep making them a little better every day.
Melody Kramer, Senior Audience Development Manager, Communications
Dan Garry, Lead Product Manager, Discovery Product and Analysis
Wikimedia Foundation

Archive notice: This is an archived post from a previous blog, which operated under different editorial and content guidelines than Diff.

