Every large website, including Wikipedia, deals with malicious bots, a problem that is getting worse, not better, on today's Internet. Wikipedia needs stronger tools to defend itself from malicious automated (including AI-driven) activity. As the Wikimedia Foundation, we wrote in April about protecting our infrastructure from scrapers that excessively use Wikimedia content for training data. In this post, we describe a new way to protect Wikimedia from malicious bots that carry out activities generally intended for humans, such as creating accounts and editing.
To do this, we are trying out a new bot detection service on Wikipedia. We’ll begin by applying it to account creation, and may extend it later to protect editing or other sensitive actions.
Our goal is to lay the groundwork to better defend against automated vandalism like the mass word-swapping changes the English Wikipedia saw in July, and the kind of automated account-takeover attempt we responded to in March. We also want to be better prepared for automated sockpuppets that might change content or affect the volunteer community’s processes for guarding integrity and establishing consensus.
This new bot detection service will replace our current CAPTCHA, a basic “type in the word” visual puzzle generated by software that dates back to the ’00s. Simply put, that system hails from an earlier era of the web and is not equipped to defend against modern AI-backed attackers. We’ve also received ample feedback that the current CAPTCHA is too difficult for human users.
The service we will be trying out is hCaptcha, a third-party service specializing in bot detection. hCaptcha has a particular focus on privacy-sensitive customers, including Signal and many other internet services, which makes it a good fit for Wikipedia.
In this trial, we’ll be looking at how well hCaptcha stops or slows bot-driven activity, and whether it makes Wikipedia easier for real humans to use.
We want to be upfront that this trial will involve integrating wikis directly with a third-party proprietary service. This is new for Wikimedia, and something we, as the Foundation, don’t take lightly. However, it’s not feasible for us to build a service ourselves that can keep the projects safe in this era. Organizations dedicated to running bot detection services have dramatically more expertise and resources to offer than we do, especially for the ongoing cat-and-mouse game of bot detection and evasion as it changes each year.
We’ve always operated Wikipedia in the most privacy-sensitive way we can, which has helped us avoid the kind of casual information sharing and online tracking that has become so common on the modern web. To maintain this commitment, we’ve set things up so that hCaptcha cannot see visitors’ raw IP addresses, what specific actions are being taken, or what URLs are being accessed. Any information about visiting devices that does get collected as part of bot detection will be discarded by hCaptcha within 10 days.
Altogether, this is an opportunity to improve the accessibility and security of the wikis at the same time, while carefully limiting the impact on user privacy. Here are some more technical details on how:
- Unlike our current CAPTCHA, the new service works mostly invisibly. Most visitors (around 99.9%) will never see a puzzle at all.
- Visitors who do see a puzzle will need to complete it to create an account. These are visual puzzles, but for users with sight issues or other accessibility needs, a text-based puzzle is available that can be completed using only a keyboard.
- The service will send back a “risk score”: its confidence that the account was created by an inauthentic user. This risk score will not be public, but will be saved privately to enable analysis of, and responses to, potentially bot-driven activity by WMF and volunteer investigators.
- Visitor IP addresses will not be sent to the service. All requests to the service go through a proxy we host ourselves, which drops raw IPs and substitutes hashed versions.
- The code we load from the service will be sandboxed so that it cannot see or interfere with the page context of the user session, and so that the service can’t see the specific URL of the page.
- See our project page for more technical details.
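To make the hashed-IP idea above concrete, here is a minimal, hypothetical sketch in Python of what a pseudonymizing proxy step could look like. This is purely illustrative: the function name, the use of HMAC-SHA-256, and the key-handling shown here are assumptions for the example, not Wikimedia's actual implementation.

```python
# Hypothetical sketch: the proxy never forwards the raw client address,
# only a keyed hash of it. The key stays server-side, so the downstream
# service can correlate repeated requests from one client without ever
# learning the underlying IP address.
import hashlib
import hmac

def pseudonymize_ip(raw_ip: str, secret_key: bytes) -> str:
    """Return a keyed hash (hex string) of the client IP.

    Using HMAC rather than a plain hash means the token cannot be
    reversed by brute-forcing the IPv4 address space without the key.
    """
    digest = hmac.new(secret_key, raw_ip.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Example: the same address always maps to the same token, so repeat
# visitors are still recognizable to the bot-detection service...
key = b"rotating-server-side-secret"  # illustrative placeholder
token_a = pseudonymize_ip("198.51.100.7", key)
token_b = pseudonymize_ip("198.51.100.7", key)
assert token_a == token_b
# ...but a different address yields an unrelated token, and neither can
# be reversed to recover the original IP without the secret key.
assert token_a != pseudonymize_ip("203.0.113.9", key)
```

In a design like this, rotating the secret key periodically would also limit how long any one pseudonym remains linkable over time.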
We are also planning to incorporate the bot detection data we get from this into the tools we provide to our trusted volunteer investigators to respond to sockpuppeting and other inauthentic activity. This is part of our larger effort in safety and security for this year to build more anti-abuse signals and tools into the wikis, and you’ll see some of these ideas in our near-term public plans.
Starting in the coming weeks, and over the course of several months, we’ll be analyzing how bots are engaging with the wikis, making sure hCaptcha isn’t making it unexpectedly harder to use Wikipedia, and identifying any further privacy and security measures we can take. We will review this analysis, and will engage publicly with the communities about how the trial went, before we make decisions on expanding the use of hCaptcha to replace our current CAPTCHA.
We’ll be engaged with the communities throughout this process. Thank you to the volunteers who’ve provided direct feedback so far, which has helped to shape our privacy model and technical implementation. We will share updates as this work progresses – please share your thoughts on our project page and subscribe to our team newsletter to stay in touch.