Wikimedia moving to Elasticsearch


We’re in the process of rolling out new search infrastructure to all Wikimedia wikis, so it’s a good time to explain what’s coming in the very near future, why we’re changing it, and how you can get involved.

Screenshot of the new search box
The new search engine is coming soon to all Wikimedia wikis, and may already be on your favorite wiki

First, a bit of background. All Wikimedia sites have been using a home-grown search system based on Apache Lucene since 2005 or 2006. It was written primarily by volunteer Robert Stojnić and is called lucene-search-2. This is a fantastic search engine that has powered the sites for years and has scaled very well for the past 8 years or so. Early in 2013, however, it became a source of significant operational problems. In the short term we were able to patch some of the most glaring issues in lucene-search-2, but it became increasingly apparent that a replacement was needed: Robert is no longer around and the system is showing its age.

We’re very happy with Lucene, but we wanted to get out of the business of maintaining a special-purpose open-source search system when there are two very good general-purpose open-source search systems available: Solr and Elasticsearch. Both are based on Lucene and scale horizontally with data and query volume. After experimenting with both and implementing basic MediaWiki integration, we chose Elasticsearch for the following reasons:

  • Elasticsearch’s reference manual and contribution documentation promised an easy start and a pleasant time getting changes upstream when we need to.
  • Elasticsearch’s very expressive search API lets us search any way we need to and gives us confidence that we can expand on it. We can also easily write expressive ad-hoc queries when we need to.
  • Elasticsearch’s index maintenance API lets us maintain the index right from our MediaWiki extension, so it’s easier for us to deploy and test, and should be easier for MediaWiki users outside Wikimedia to use. At the time of the choice, Solr’s schema API was read-only.
  • Rack awareness, automatic shard rebalancing, statistics exposed over HTTP, a preference for JSON and YAML over XML, and first-party Debian packages were also nice.
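
To make the points about the search API and the index maintenance API more concrete, here is a minimal sketch of the kind of JSON request bodies involved, built as plain Python dictionaries. The field names ("text", "title") and the shard and replica counts are assumptions chosen for illustration; they are not the mapping or settings CirrusSearch actually uses, and nothing here talks to a real cluster.

```python
import json


def build_search_body(text, exclude_title=None, size=10):
    """Build an Elasticsearch bool query as a plain dict.

    Matches `text` against a hypothetical "text" field and, optionally,
    excludes pages whose hypothetical "title" field matches `exclude_title`.
    """
    bool_query = {"must": [{"match": {"text": text}}]}
    if exclude_title:
        bool_query["must_not"] = [{"match": {"title": exclude_title}}]
    return {"query": {"bool": bool_query}, "size": size}


def build_index_settings(shards, replicas):
    """Build the settings part of an index-creation request body."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    # Print the bodies that would be sent to a search or create-index endpoint.
    print(json.dumps(build_search_body("lucene", exclude_title="solr"), indent=2))
    print(json.dumps(build_index_settings(shards=4, replicas=2), indent=2))
```

Because both requests are ordinary JSON, an extension can build and send them itself, which is roughly what maintaining the index "right from our MediaWiki extension" amounts to.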

To provide the integration to MediaWiki, we’ve written a new extension called CirrusSearch that we’ve designed to be mostly backwards-compatible with the current search with the following exceptions:

  • Templates are expanded before indexing, so text that comes from templates will be searchable, but the raw wikitext inside template calls no longer will be.
  • Page updates are reflected in search results pretty quickly after they are made, usually within seconds for single page edits.
  • Wiki communities can mark some pages as higher or lower quality and it will be reflected in the search results.
  • A few new “expert” options have been added (intitle: can now be negated, prefer-recent: is new, etc.).
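
As a rough illustration of the “expert” options in the last item, the sketch below splits a query string into plain terms, intitle: filters, negated intitle: filters, and a prefer-recent: flag. It assumes negation is written with a leading "-" and ignores quoting and any arguments to prefer-recent:. The function name and the returned structure are hypothetical; this is not CirrusSearch’s actual parser.

```python
def parse_query(query):
    """Split a search string into plain terms and a few expert options.

    Illustrative only: tokens are split on whitespace, so quoted phrases
    and arguments to prefer-recent: are not handled.
    """
    parsed = {
        "terms": [],
        "intitle": [],
        "not_intitle": [],
        "prefer_recent": False,
    }
    for token in query.split():
        if token.startswith("-intitle:"):
            value = token[len("-intitle:"):]
            if value:
                parsed["not_intitle"].append(value)
        elif token.startswith("intitle:"):
            value = token[len("intitle:"):]
            if value:
                parsed["intitle"].append(value)
        elif token.startswith("prefer-recent:"):
            parsed["prefer_recent"] = True
        else:
            parsed["terms"].append(token)
    return parsed


if __name__ == "__main__":
    print(parse_query("prefer-recent: elasticsearch intitle:search -intitle:solr"))
```

The parsed pieces could then be turned into the positive and negative clauses of a search request.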

We’ve documented all of these features and more on mediawiki.org, and the page is in the public domain, so people can feel free to copy it to their wikis as a basis for documentation.

We plan for this replacement search to be available as a Beta Feature on all wikis by the end of February and to become the primary search in March or April. See our ever-evolving timeline for ever-evolving specifics.

We’ve got a lot of exciting things on the horizon now that we’ve got a modern and stable search for Wikimedia. We’re talking Wikidata, Commons metadata, faceting, real cross-wiki searching, etc. Please get involved by filing bugs, talking to us on the project page, or finding us on IRC and pinging us there. On IRC, you can find us as ^d and manybubbles.

Chad Horohoe and Nik Everett, Wikimedia Foundation

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.


11 Comments

[…] today announced it is replacing its search feature with one provided by enterprise data search and analytics […]

[…] an announcement, Wikimedia says it no longer considers it worthwhile to maintain its own search engine, […]

[…] a short beta-testing phase should begin, expected to conclude by February 2014; once testing is complete, Wikimedia’s websites will be able to […]

[…] Wikipedia, right? On January 6, Wikimedia issued a statement that its search system will switch entirely to Elasticsearch. A trial version will open in February, with 3, 4 […]

It’s really nice to know that Elasticsearch is going to be used.
Recently, I have also been working on Elasticsearch indexing for the app that we are developing.

You might want to consider Amisa Server as well. Faster than elastic search and more powerful queries using SQL predicates

[…] More related informations can be found on the official Wikimedia Blog. […]

[…] by Wikipedia. Wikimedia, the organisation behind Wikipedia and its sister projects, decided some time ago to use Elasticsearch as a search engine and this search engine is designed for such tasks. […]