Lightening the load of robotic scraping: Insights for a ‘green’ cache from the Software Heritage archive

A brightly coloured squid (brown, blue, and gold) swims in a blurry aquatic environment.
A homage to the ‘Kraken’ French national supercomputer deployed for this research.

Since generative AI took off, open archives like Wikimedia Commons have faced a contradiction: on one hand, our human-created digital content is more valuable than ever for training open AI models, particularly Large Language Models (LLMs); on the other, automated bulk access, termed scraping, risks clogging infrastructure for traditional human users and driving up financial and environmental costs to keep content accessible to everyone.

Engineers from the Wikimedia Foundation reported in a Diff post last year that, in 2025, 65% of the most expensive traffic on Wikimedia’s central servers came from bots. This “bulk” traffic limits the capacity to absorb legitimate peaks from human users and consumes precious resources.

But what if, instead of restricting access, as imposed by the robotic user access policy and envisioned in the Wikimedia Foundation’s Annual Plan for 2025–2026, we could make data retrieval and storage engines more efficient to mitigate robotic costs? I addressed this research question with the team at Software Heritage, a UNESCO-backed initiative based at Inria (Paris) that shares much the same vision and challenges as Wikimedia regarding content sharing and reuse for equitable AI and Open Research. In our work, we have been dealing with archiving code repositories. Yet, I think the approach could be adapted to compress text-based data (such as SVG images, books, or wikitext) in the Wikimedia ecosystem. Joining forces across open-source archives and communities is the right way to navigate the challenges of the LLM era together: unity is strength!

The concept: An adaptive, dynamic, compressed cache

The idea is simple: instead of serving every file directly from the main archive (called Winery, a Ceph-based storage system that is relatively slow to query), we tested, together with the SWH team, a stream-parallel caching system based on a key-value store.

A simplified diagram of the multi-tier architecture. Frequent requests (e.g., from bots) are served by a fast, compressed cache, while the main archive (Winery) is only queried in the event of a “miss”.
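As a toy illustration of the read path in the diagram above, here is a minimal Python sketch of a read-through compressed cache. The class name, the in-memory dictionary standing in for the fast key-value store, and the use of zlib are all illustrative assumptions; the real system is built on RocksDB in front of the Ceph-backed Winery archive.

```python
import zlib

class CompressedCache:
    """Minimal read-through cache sketch: hot objects are kept
    compressed in a fast key-value store; misses fall back to the
    (slower) main archive."""

    def __init__(self, archive):
        self.archive = archive  # slow backing store (stand-in for Winery)
        self.store = {}         # stand-in for a fast key-value store

    def get(self, key):
        blob = self.store.get(key)
        if blob is not None:                    # cache hit: decompress and serve
            return zlib.decompress(blob)
        data = self.archive[key]                # cache miss: query the archive
        self.store[key] = zlib.compress(data)   # keep a compressed copy
        return data

# Redundant content (here, a repeated snippet) compresses well in the cache.
archive = {"sha1:abc": b"print('hello')\n" * 100}
cache = CompressedCache(archive)
first = cache.get("sha1:abc")   # miss: served from the archive
second = cache.get("sha1:abc")  # hit: served from the compressed cache
```

The second lookup never touches the archive, and the cached copy occupies a fraction of the original size, which is exactly the trade-off the diagram describes.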

This cache utilises a data-compression pattern and parameter optimisations to pack similar data together effectively, reducing the storage footprint and, consequently, the energy required to store and retrieve data.

In particular, we designed an indexing key that reflects the “Permute, Partition, Compress” (PPC) algorithmic paradigm to implement a dynamic, large-scale archive using RocksDB, an open-source database that Wikimedia already adopts within Ceph for storing metadata and that was included in Wikimedia technical evaluations related to data streaming (e.g., to store local state within Kafka).
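To give a flavour of the idea (this is not the paper’s exact key design): in a key-ordered store like RocksDB, a key whose prefix captures content similarity places redundant objects into the same compression blocks. A hypothetical sketch, using the first content bytes as a crude locality prefix where a real design would use a proper similarity fingerprint:

```python
def ppc_key(content: bytes, object_id: str) -> bytes:
    """Build a 'Permute, Partition, Compress'-style indexing key:
    a locality prefix clusters similar objects next to each other
    in the key-ordered store, so block compression sees redundant
    neighbours. The first-bytes prefix is a crude stand-in for a
    real similarity fingerprint."""
    prefix = content[:8].ljust(8, b"\0")  # hypothetical locality prefix
    return prefix + object_id.encode()

# Two near-identical source files get neighbouring keys...
k1 = ppc_key(b"import os\nprint(1)\n", "id-a")
k2 = ppc_key(b"import os\nprint(2)\n", "id-b")
# ...while an unrelated file lands elsewhere in key order.
k3 = ppc_key(b"<?xml version='1.0'?>", "id-c")

order = sorted([k1, k2, k3])
assert abs(order.index(k1) - order.index(k2)) == 1  # similar files adjacent
```

Because RocksDB stores and compresses entries in key order, this “permute” step alone can improve the compression ratio of each block without any change to the compression algorithm itself.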

The results: Gigabytes per second at a tenth (or less) of the energy

The logotype of Software Heritage depicts angle markup brackets.

The tests, conducted on computing nodes of the French national “Kraken” HPC cluster operated by Grenoble Alpes University/GRICAD, used a terabyte-scale source-code dataset and yielded some insightful results:

  • Speed: The cache retrieves data at up to gigabytes per second, even with compression that reduces storage space by 15–20%. This is several orders of magnitude faster than a previous static compressed solution.
  • Scalability: The system effectively leverages parallelism (up to 16–32 threads), yet the results also show that the time–energy correlation is not always linear, a relevant insight for efficient large-scale data management.
  • Energy: The most efficient configurations reduce energy consumption per operation by one to two orders of magnitude compared to sequential ones.
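As an informal illustration of the thread-parallel read pattern behind these numbers (the actual benchmarks ran against RocksDB on the Kraken cluster, with energy measured separately), here is a toy Python measurement of decompression throughput across a pool of reader threads; the store contents and sizes are invented:

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor

# A toy compressed store: many redundant "source files" packed with zlib.
BLOBS = {f"obj-{i}": zlib.compress(b"def f():\n    return 42\n" * 50)
         for i in range(2000)}

def fetch(key: str) -> bytes:
    return zlib.decompress(BLOBS[key])  # decompress on every read

def throughput(n_threads: int) -> float:
    """Bytes served per second with n_threads parallel readers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        total = sum(len(b) for b in pool.map(fetch, BLOBS))
    return total / (time.perf_counter() - start)

# Compare a sequential run with a parallel one. CPython's zlib releases
# the GIL during (de)compression, so threads can genuinely overlap work.
sequential = throughput(1)
parallel = throughput(8)
```

Sweeping `n_threads` in a loop is the simplest way to see where throughput stops scaling, which is also where, as the results above note, the time–energy relationship tends to stop being linear.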

What does this mean for Wikimedia?

No solution is a silver bullet, and this research does not solve the issues of attribution or indiscriminate scraping. Yet, it offers further insights and a concrete tool for one specific side of the problem: infrastructural load.

I believe applying similar compression techniques to caching layers for Wikimedia Commons, Wikisource, or other projects, where content such as SVG images, books, or wikitext is often highly redundant and repeatedly accessed, could mean:

  • Reducing the cost of serving bulk bot traffic, freeing up bandwidth for human users.
  • Lowering the energy footprint of open-content storage: a key sustainability goal.
  • Better resilience against “legitimate” traffic spikes, such as those seen during global events.

Towards a more responsible infrastructure use

I believe this research aligns with Wikimedia’s efforts outlined in the “WE5: Responsible Use of Infrastructure” plan. It’s not just about setting use boundaries, but about exploring new, greener solutions to become more space-, time- and energy-efficient. These questions resonate with conversations in the Wikimedia Futures Lab (Frankfurt), where volunteers and experts are examining how global trends (including AI, digital sustainability, and infrastructure resilience) shape the future of free knowledge.

I believe making open data access more technically sustainable is one way to protect the model and vision of free knowledge in the age of AI. Our research is to be presented at the Greenvolve Workshop (IEEE SANER 2026 conference) in Limassol, Cyprus. The technical paper is available as a preprint on arXiv.

Can you help us translate this article?

In order for this article to reach as many people as possible, we would like your help. Can you translate this article to get the message out?