Historically, image search on Wikimedia Commons has been poor. The text-based search we use on the Wikipedias is very good at finding relevant documents in a corpus of text, but not so good at finding relevant images in a collection of sparsely labelled media files. As a result, many users preferred to use Google to search Commons rather than the on-wiki search.
The Structured Data on Commons (SDoC) project allowed users to add multilingual captions and annotations to a file on Commons. For example, a Commons user can upload a picture of a black dog, then:
- add the caption “black dog” in English
- add the caption “madra dubh” in Irish
- add the annotation `depicts: dog` – stored under the hood as `P180=Q144`, where `P180` is the Wikidata property for “depicts” and `Q144` is the Wikidata item for “dog”
As part of SDoC we added `haswbstatement:` keywords to search, so that a user could search for a specific caption or annotation (for example, `haswbstatement:P180=Q144` finds files annotated as depicting a dog). Regular search, however, was still entirely text-based – if a user searched for “dog”, the underlying search query on Commons was the same as it would be on a Wikipedia. In principle this isn’t entirely unreasonable – a search for “dog” doesn’t just query the text on a page, but also other fields like the page title and the names of categories the page is in – but in practice many of the images returned from a search were irrelevant to the search term.
Using the new data from SDoC in search
When we began work on Structured Data Across Wikimedia (SDAW) we started to consider how to make use of the new structured data to improve Commons image search. Our first proof of concept involved sending a set of queries to the search engine (Elasticsearch) and then combining the results. We experimented with various sets of queries on different fields, like the new captions field, and also introduced a step that first queried Wikidata to find items corresponding to the search terms (like `Q144` for “dog”), and then searched for images associated with those Wikidata items via `depicts`. Searching the new structured data fields gave us improvements right away, but there was no obvious “right” way to combine results from those with results from the old search in code, so we integrated the queries into one Elasticsearch query and let Elasticsearch look after the combination.
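To make this concrete, a combined query might look something like the sketch below. The field names, boosts, and the statement-keyword format are assumptions for illustration, not the production Commons mapping:

```python
# Sketch: fold several per-field queries into one Elasticsearch bool
# query, so Elasticsearch handles combining the scores itself.
# Field names and boost values are invented for illustration.

def build_combined_query(term, wikidata_ids):
    """Build one bool query over text fields plus structured-data fields."""
    should = [
        {"match": {"title": {"query": term, "boost": 3.0}}},
        {"match": {"category": {"query": term, "boost": 2.0}}},
        {"match": {"text": {"query": term, "boost": 1.0}}},
        # multilingual captions added by SDoC
        {"match": {"descriptions.en": {"query": term, "boost": 2.5}}},
    ]
    # boost files whose statements depict a matching Wikidata item,
    # e.g. P180=Q144 when the search term was "dog"
    for qid in wikidata_ids:
        should.append(
            {"match": {"statement_keywords": {"query": f"P180={qid}", "boost": 4.0}}}
        )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}

query = build_combined_query("dog", ["Q144"])
```

Letting Elasticsearch score all the clauses in one query avoids having to merge separately ranked result lists in application code.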
Adding the new fields to the search isn’t enough on its own – we need to sort the results by how good a match an image is for the search term. When Elasticsearch runs a query, it calculates a score (based on BM25) for how well the search term matches each field. It combines the per-field scores (by multiplying each field’s score by a weight and adding them up) into an overall score for each individual search result, and sorts the result set by that score.
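That combination step can be sketched minimally as follows – the fields, weights, and scores here are invented for illustration:

```python
# Combine per-field BM25 scores into one document score: multiply each
# field's score by its weight and sum. Fields and weights are invented.
FIELD_WEIGHTS = {"title": 3.0, "category": 2.0, "text": 1.0, "caption": 2.5}

def combined_score(field_scores):
    """Weighted sum of the BM25 scores a document got per field."""
    return sum(FIELD_WEIGHTS[field] * score for field, score in field_scores.items())

results = [
    {"page": "Dog_park.jpg", "scores": {"text": 6.0}},
    {"page": "Dog.jpg", "scores": {"title": 4.0, "text": 1.0}},
]
# sort the result set by overall score, best match first
ranked = sorted(results, key=lambda r: combined_score(r["scores"]), reverse=True)
```

Here a weaker text-only match (6.0 × 1.0) loses to a title match (4.0 × 3.0 + 1.0), which is exactly the effect the weights are meant to produce.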
There’s no obvious way to decide the weights for each field – is a match in the page title more likely to indicate a “good” image than a match in the `depicts` annotations? Is a match in the category name plus a match in the caption better than a match in the title plus the wikitext? 🤷 We ran a bunch of test queries to try and get some idea, and did a lot of debating and weight-adjusting, but it was hard to tell whether we were improving things overall or just improving things for our favourite test queries.
Tuning search using data
It quickly became clear that we needed a more objective way to figure out which fields were most useful. We began by collecting the 1000 most common search terms plus 1000 other random search terms, running searches for them on Commons, and storing the scores from the individual search fields.
We then built a tool on Toolforge to allow users to classify whether individual images were good or bad matches for a search term, and gathered >10k search-term/result pairs with a rating of “good” or “bad”.
Relative importance of search signals
This data allowed us to examine the relationship between the likelihood of an image being good and the score for individual fields. We were able to see, for example, that images where the page title matched the search term were much more likely to be good than images where the parsed wikitext matched the search term – and so it made sense to give a higher weight to the page title field. Some fields turned out to have very little correlation with the likelihood of an image being good, so we were able to drop them from our queries entirely, and to drastically reduce the weights for others.
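The per-field analysis can be sketched like this – the scores and labels below are invented stand-ins for the rated search-term/result pairs:

```python
# For each field, look at the rate of "good" labels among results where
# that field matched the search term. A field whose matches are much
# more often good deserves a higher weight; a field near the base rate
# can be dropped from the query. All numbers here are illustrative.

def good_rate(pairs):
    """Fraction of (score, label) pairs labelled good ('g')."""
    return sum(1 for _score, label in pairs if label == "g") / len(pairs)

title_matches = [(8.1, "g"), (6.4, "g"), (5.0, "g"), (3.2, "b")]
wikitext_matches = [(4.0, "b"), (2.2, "g"), (1.9, "b"), (1.0, "b")]

# title matches are good 75% of the time vs 25% for wikitext matches,
# so the title field earns the higher weight
```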
The labelled data also made it possible to compare how two different queries performed. We made another custom script that would run searches on Commons using a particular set of weights (or differently-structured Elasticsearch queries) for all our stored search terms, count how many of the results were good or bad, and then calculate various metrics (precision, recall, F-score) that indicate search performance. This meant that as we iterated through different ideas we were able to say objectively whether the latest set of changes was an improvement or not.
We chose precision@25 (the proportion of images in the top 25 results that are labelled as “good”) as the metric we would try and optimise, seeing as it best reflects the user’s experience on the first page of the MediaSearch interface (on desktop).
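The metric itself is simple to compute – a minimal sketch, where `labels` is the ranked list of human ratings for one search term:

```python
def precision_at_k(labels, k=25):
    """Proportion of the top-k ranked results labelled good ('g')."""
    top = labels[:k]
    return sum(1 for label in top if label == "g") / len(top) if top else 0.0

# 20 good and 5 bad images in the top 25 -> precision@25 of 0.8
ranked_labels = ["g"] * 20 + ["b"] * 5
```

Because only the top 25 results count, the metric rewards exactly what a desktop MediaSearch user sees on their first page, rather than quality deep in the result list.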
Moving past manual search tuning
We realised at this stage that we could think of the scores from individual search fields as signals that could be combined to give us an estimate of the probability of an image being a good result for a particular search term. The research team suggested that this could be modelled using logistic regression, so we wrote a Python script to train a logistic regression model, optimising for precision@25. Our first iteration of the model got us from a precision@25 of 0.73 to 0.79 – going from roughly 1 in 4 bad images in the top 25 results to roughly 1 in 5.
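In outline, such a model looks something like this scikit-learn sketch – the feature columns and the synthetic labels are invented, and the real pipeline trained on the manually rated search-term/result pairs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# one row per rated search result; columns are per-field scores
# (e.g. title, caption, depicts-statement) - names are illustrative
X = rng.random((n, 3))
# synthetic labels standing in for the human good/bad ratings
y = (X @ np.array([2.0, 1.0, 3.0]) + rng.normal(0.0, 0.3, n) > 3.0).astype(int)

model = LogisticRegression().fit(X, y)
# estimated probability that each image is a good result, usable
# directly as a ranking score
probs = model.predict_proba(X)[:, 1]
```

The learned coefficients play the same role as the hand-tuned field weights, except that they are fitted to the labelled data instead of debated.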
Learning to rank
Elasticsearch has a “learning to rank” plugin, which provides a way of training an XGBoost model and using that to rank search results, and is used on the Wikipedias. We spent a while working on implementing this on Commons, but in the end it gave us a worse precision@25 (0.66) than our logistic regression model, so we abandoned this approach.
Do we need more data?
At this stage we had made a lot of progress using manually-evaluated training data, but were unsure if we had enough data to constitute a representative sample of all Commons images. To evaluate this we took randomised sub-samples of the existing data and trained the models on those, varying the size of the sub-sampled dataset and plotting it against precision@25. In general we found that precision@25 increased quickly with dataset size up to around 1000 data points, and very slowly thereafter. Seeing as we already had >10k data points, adding more was unlikely to improve search performance significantly.
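The sub-sampling experiment can be sketched as follows – `train_and_evaluate` is a stand-in for the real train-then-measure pipeline, and the saturation curve it returns is only illustrative:

```python
import random

random.seed(42)
# stand-in for the >10k manually rated search-term/result tuples
labelled_data = [(f"term{i}", f"File{i}.jpg", random.choice("gb"))
                 for i in range(10000)]

def train_and_evaluate(sample):
    """Placeholder: the real version retrains the model on `sample` and
    returns precision@25 over the stored search terms. This stand-in
    just mimics fast gains up to ~1000 points and a plateau after."""
    return 0.89 * len(sample) / (len(sample) + 150)

# train on randomised sub-samples of increasing size and record the
# precision@25 each one achieves
curve = [(size, train_and_evaluate(random.sample(labelled_data, size)))
         for size in (100, 300, 1000, 3000, 10000)]
```

Plotting `curve` shows the diminishing returns: the jump from 100 to 1000 data points dwarfs the jump from 1000 to 10000.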
Improving non-English searches
Image search in English was pretty good by now. SDoC `depicts` annotations are language-agnostic, and captions are multilingual, so non-English image search had improved too. However, an image “Bat.jpg” that has no `depicts` annotation or localised caption won’t be found in a search for “ialtóg” (the Irish for bat). English is by far the most common language for labelling images, so we needed to find a way to improve non-English search recall. To do this we introduced another step – looking up aliases of the search term in the user’s language via Wikidata, and then adding the best matches from those aliases back into the text-search part of our query. This gave us a large increase in recall, especially in non-English languages – for example, without synonyms a search for “ialtóg” yielded 331 results; with synonyms the same search yields ~3800 results.
It also gave us another minor bump in precision@25, we think because the synonyms can now further boost existing results that also mention them. For example, a search for “new york” now expands behind the scenes to also incorporate e.g. “nyc” and “big apple”, and any additional mention of any of those words pushes a result higher, which in turn reduces the odds of poorer results – ones with only a passing, irrelevant mention of the search term – being considered among the best matches.
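A sketch of the expansion step – the alias table here is hard-coded for illustration, whereas the real implementation looks the aliases up on Wikidata:

```python
# Expand a search term with aliases of the matching Wikidata item in
# the user's language. The table below is an invented stand-in for the
# live Wikidata lookup.
ALIASES = {
    "ialtóg": ["bat"],                 # Irish for bat
    "new york": ["nyc", "big apple"],
}

def expand_search_terms(term, max_aliases=3):
    """Return the original term plus the best-matching aliases; the
    expanded list is fed back into the text part of the search query."""
    return [term] + ALIASES.get(term.lower(), [])[:max_aliases]
```

An unknown term simply passes through unexpanded, so the step can never reduce recall.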
New search signals from Wikidata and the Wikipedias
The most recent improvement we made came about during work on the image-suggestions data pipeline. Building on work done by the research and search teams, the pipeline finds Wikidata entity ids connected to images in various ways – for example if an image is used as the main image for a Wikidata item, is in a Commons category associated with a Wikidata item, or is the lead image on a Wikipedia page associated with a Wikidata item. These Wikidata ids are imported back into the search index and used as new search signals, so if I search for “bat” the lead image from the English Wikipedia article on bats will get boosted and appear near the top of the search results. Re-training the logistic regression model with this data got us to a precision@25 of 0.89.
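The aggregation of those connections can be sketched as follows – the data structures and the example item ids are illustrative, not the pipeline’s actual schema:

```python
from collections import Counter

def entity_signals(signals):
    """signals: (wikidata_id, source) pairs found for one image.
    Ids connected to the image via more routes get a stronger boost
    when written back into the search index."""
    return Counter(qid for qid, _source in signals)

# example connections the pipeline might find for one file
bat_jpg_signals = [
    ("Q28425", "wikidata_main_image"),   # main image of the "bat" item
    ("Q28425", "wikipedia_lead_image"),  # lead image of the enwiki article
    ("Q729", "commons_category"),        # category linked to "animal"
]
boosts = entity_signals(bat_jpg_signals)
```

Counting distinct routes gives a crude but useful confidence signal: an image linked to an item both on Wikidata and as a Wikipedia lead image is a stronger match than one linked only via a category.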
The future of media search
Obviously there is potential for further improvement – improving recall by making use of links between Wikidata items, re-trying Elasticsearch’s XGBoost model with a larger training dataset, expanding the improved media search to non-Commons wikis – but for now, we have managed to improve precision up to a point where only 1 in 10 of the top images may not be relevant, while also making a substantial improvement in exposing image content to non-English languages.