Last year, we released the EventStreams service. This service allows anyone to subscribe to recent changes to Wikimedia data. At the time, we only had one stream of data available: RecentChanges. RecentChanges is a stream of Wikimedia change events (e.g. recent edits to pages in Japanese Wikipedia). External developers can consume this stream to create tools or pages similar to the Special:RecentChanges page.
The EventStreams service was built to replace some aging backend technologies that served a feed of RecentChanges, so it made sense that the initial release of EventStreams included all data in the legacy RecentChanges feed.
But EventStreams was built with the intention of exposing more data than just RecentChanges. In the last few years, Wikimedia engineers have been using other, better-defined and more structured streams of events to build production features. Much of the data in these events overlaps with RecentChanges, but the new streams contain more types of events in a more predictably structured format.
In addition to more streams, the EventStreams API has gotten a few new features too! EventStreams now supports stream composition and subscription based on a historical timestamp.
We’ll first describe the new API features and how to use them, and then highlight some of the new event streams below.
Historical timestamp subscription
EventStreams is backed by Apache Kafka, and as such it has always had a historical subscription ability. This ability is used transparently by EventSource/SSE (Server Sent Events) clients to resume from a position in the stream when they reconnect. This allows connected clients to not lose events during a period of network flakiness or service maintenance.
Recent versions of Apache Kafka have added a timestamp-to-stream-position index. EventStreams now leverages this index to support stream subscription starting at a specific timestamp in the past via the since query parameter.
If since is set to a relatively recent timestamp, EventStreams will look up the positions in the requested streams that correspond to that timestamp. There is no guarantee that events exist at the exact given timestamp, but Kafka guarantees that you will only receive events for times after the since timestamp.
I did say relatively recent. Kafka does not keep all data around forever. The EventStreams data is small enough that we have the capacity to extend our usual one-week retention time to 31 days. (This retention configuration is stream specific and might not apply to future streams, but all currently exposed streams should have a 31-day history available.)
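As a rough sketch of how a client might build a since URL while respecting the retention window, consider the helper below. The function name buildSinceUrl and the constants are hypothetical and not part of EventStreams. The 31-day check is only a client-side convenience based on the retention described above; how the service itself responds to older timestamps is not covered here.

```javascript
// Hypothetical helper, not part of the EventStreams service or any library.
const BASE_URL = 'https://stream.wikimedia.org/v2/stream/';

// Assumption: 31 days of history, as described for the currently exposed streams.
const RETENTION_DAYS = 31;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Build a subscription URL for `stream` that starts at the Date `since`.
// `now` is injectable so the retention check can be tested deterministically.
function buildSinceUrl(stream, since, now = new Date()) {
  const ageMs = now.getTime() - since.getTime();
  if (Number.isNaN(ageMs)) {
    throw new TypeError('since and now must be valid Date objects');
  }
  if (ageMs > RETENTION_DAYS * MS_PER_DAY) {
    throw new RangeError(`since is older than ${RETENTION_DAYS} days`);
  }
  // Pass the timestamp as an ISO 8601 string, URL-encoded for safety.
  return `${BASE_URL}${stream}?since=${encodeURIComponent(since.toISOString())}`;
}
```

For instance, calling buildSinceUrl('recentchange', someDate) yields a URL that can be handed to any SSE/EventSource client.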
Timestamp support also allows us to replace the offset based auto-resume with a timestamp based one. Now, instead of the EventSource/SSE Last-Event-ID containing the latest offset, it will contain the latest timestamp. By avoiding the use of the Kafka cluster specific offsets, we are able to run the EventStreams service in multiple datacenters for higher availability.
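Browser EventSource clients send Last-Event-ID automatically when they reconnect, so most users never need to handle it. For a custom client that wants to persist its position itself, a minimal sketch might look like the following. The createResumeTracker helper is hypothetical, and it treats the event ID as an opaque string, since its exact format is not described here.

```javascript
// Hypothetical helper: remembers the most recent SSE event ID so that a
// custom client could send it back in a Last-Event-ID header when resuming.
function createResumeTracker() {
  let lastEventId = null;
  return {
    // Call this with each received message; SSE message events expose
    // the most recently seen ID as `lastEventId`.
    record(message) {
      if (message.lastEventId) {
        lastEventId = message.lastEventId;
      }
    },
    // Headers to attach to the next connection attempt, if any ID was seen.
    resumeHeaders() {
      return lastEventId === null ? {} : { 'Last-Event-ID': lastEventId };
    },
  };
}
```

A client would call record() from its onmessage handler and pass resumeHeaders() to whatever HTTP layer it uses to reconnect.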
Composite streams
Previously, EventStreams only allowed you to subscribe to a single stream in a single HTTP request. This meant that if you wanted to build a client that subscribed to all page-related change events, you’d have to initialize multiple SSE and HTTP connections, and somehow merge the results together. But no longer!
EventStreams now supports subscription to a comma separated list of streams. You can request multiple streams at once, and get them returned to you as SSE events in the same response. E.g. to subscribe to all of the available page events, you would connect to this: https://stream.wikimedia.org/v2/stream/page-create,page-delete,page-undelete,page-move,page-properties-change.
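To illustrate, a composite URL like the one above can be assembled from a list of stream names. This is only a sketch: buildCompositeUrl is a hypothetical helper, and de-duplicating names is a design choice made here, not something the service requires.

```javascript
// Hypothetical helper, not part of EventStreams.
const STREAM_BASE_URL = 'https://stream.wikimedia.org/v2/stream/';

// Join an array of stream names into a single composite-stream URL.
function buildCompositeUrl(streams) {
  if (!Array.isArray(streams) || streams.length === 0) {
    throw new TypeError('streams must be a non-empty array of stream names');
  }
  // Trim whitespace and drop duplicates, preserving the original order.
  const unique = [...new Set(streams.map((name) => name.trim()))];
  return STREAM_BASE_URL + unique.join(',');
}
```

For example, buildCompositeUrl(['page-create', 'page-delete', 'page-move']) produces a single URL that one SSE connection can consume.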
New streams
The new streams are all described in the EventStreams documentation page. These streams include change events of various types for pages and revisions. The page-create stream allows you to subscribe to all article creation events. page-delete contains all page deletion events. revision-create contains events for every edit to any article or Wikidata item, etc. etc. Many of these streams overlap with what is already in the recentchange stream. However, these are the events that WMF uses for production features, including expiring rendered content and caches. They were designed to be more consistent, more granular, and more backwards compatible. Some include information about the prior state of the changed resource to make it easier to understand what has changed, instead of only providing the current state.
We also plan to expose an exciting new stream: revision-score. By subscribing to this stream, you’ll get events for every revision that the ORES service scores. This would allow you to filter in real time for ‘damaging’ or ‘wp10’ quality edits. There are some technical complexities around the schema of these events that we hope to resolve soon. When we do, we will make the revision-score stream public and announce it.
Example
You can use the EventStreams service with any SSE/EventSource client. For this demo we’ll use the built in browser EventSource in JavaScript. Navigate to http://wikimedia.org in your browser and open the development console. Then paste the following:
// We’ll subscribe to page-create, page-delete
// and page-move (rename) events, starting 1 day ago.

// Calculate the timestamp of 1 day ago.
var dt = new Date();
dt.setDate(dt.getDate() - 1);
var oneDayAgo = dt.toISOString();

var url = `https://stream.wikimedia.org/v2/stream/page-create,page-delete,page-move?since=${oneDayAgo}`;

// Use EventSource (available in most browsers, or as an
// npm module: https://www.npmjs.com/package/eventsource)
// to subscribe to the stream.
var recentChangeStream = new EventSource(url);

// Print each event to the console
recentChangeStream.onmessage = function(message) {
    // Parse the message.data string as JSON.
    var event = JSON.parse(message.data);
    console.log(event);
};
You should see the page change events printed in your console.
Check out the EventStreams documentation for more information and examples in other languages.
If you build something, please tell us, or add yourself to the Powered By EventStreams wiki page.
Andrew Otto, Senior Operations Engineer, Analytics
Wikimedia Foundation