XML dumps resumed

Folks who use XML dumps of our projects will know that the dumps process has been stalled while we investigated bug 23264. We have been running individual project dumps manually and asking people to inspect them carefully. We have now restarted the automated dumps, and various code fixes should be checked in shortly. Thanks to all for your assistance and your patience.
If you are working with the XML dumps of the English language Wikipedia containing all page revisions (pages-meta-history), please note the following issues with the two completed runs.
The January 30 run is missing the text for a large number of old revisions of articles, primarily revisions created between January 1, 2005 and May 14, 2005. This was due to bug 20757, which has since been fixed. If you are doing analysis using the text data, you can retrieve the missing text by extracting it from an earlier file; see the archives.
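For those who want to backfill the missing text themselves, the general approach is to stream both dumps and copy text across for matching revision IDs. Below is a minimal sketch, not an official tool: it assumes the standard MediaWiki XML export layout, and it demonstrates the idea on tiny made-up in-memory dumps, whereas real pages-meta-history files are tens of gigabytes compressed and would need to be streamed from disk with a disk-backed index.

```python
# Backfill sketch: copy revision text from an earlier dump into a newer one
# by matching revision IDs. Illustrative only -- real enwiki history dumps
# are far too large for the in-memory index used here. Element names follow
# the MediaWiki XML export schema; the tiny dumps at the bottom are made up.
import io
import xml.etree.ElementTree as ET

def revision_texts(xml_file):
    """Stream a dump and map revision id -> text for revisions that have text."""
    texts = {}
    for _, elem in ET.iterparse(xml_file):  # default: "end" events
        # endswith() tolerates namespaced tags like "{...}revision"
        if elem.tag.endswith("revision"):
            rev_id = elem.findtext("id")
            text = elem.findtext("text")
            if rev_id is not None and text:
                texts[rev_id] = text
            elem.clear()  # keep memory bounded while streaming
    return texts

def backfill(new_xml, old_xml):
    """Map revision id -> text for the new dump, filling gaps from the old one."""
    old = revision_texts(old_xml)
    merged = {}
    for _, elem in ET.iterparse(new_xml):
        if elem.tag.endswith("revision"):
            rev_id = elem.findtext("id")
            text = elem.findtext("text")
            merged[rev_id] = text if text else old.get(rev_id, "")
            elem.clear()
    return merged

# Tiny in-memory demo; revision 1 lost its text in the "new" dump:
_old = io.StringIO(
    "<mediawiki><page><revision><id>1</id>"
    "<text>hello</text></revision></page></mediawiki>")
_new = io.StringIO(
    "<mediawiki><page><revision><id>1</id><text></text></revision>"
    "<revision><id>2</id><text>world</text></revision></page></mediawiki>")
merged = backfill(_new, _old)
```

Since the published dumps are bzip2- or 7z-compressed, in practice you would open them with something like `bz2.open(path, "rt")` rather than a plain file object.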
The March 12 run is incomplete; it is missing about the last third of the revisions, due to early termination during the compression step.
The stubs files and the current page dumps appear to be fine, so statistical or other analyses that use only these files should not be affected. The MySQL table dumps are also unaffected.
We apologize for the inconvenience and are working on getting out a set of complete full history dumps with all revision text intact.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

4 Comments

Will you please make the dumps and the image bundles available in a series of small, say ~50MB split(1) files? I have read that many people have problems with connections being reset after most of a lengthy file has been downloaded. I have no objection to downloads of such files served on a bandwidth-limited host, if they become popular, but the fact that no mirrors have arisen is disturbing. It would also be nice to have the Foundation projects’ extensions and gadgets available in executable installer files for Linux, Mac OS X, and Microsoft Windows, for those who wish to establish…

50MB split files might be a bit unwieldy, since we are talking about 32GB compressed files as it is. However, we are certainly thinking about ways to provide the English Wikipedia dumps in somewhat smaller chunks.

I’ve been working with the March 12 dump, and it seems that discussion and user pages were included in pages-articles.xml. Was this a mistake in the dump, or should the page table contain rows with odd page_namespace values?
An older enwiki from 2009 did not contain those values.

pages-articles.xml should only contain pages from the main namespace. I’ll need to look at that more closely. Thank you for the heads up.
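For anyone who wants to verify this on their own copy, here is a quick sketch (not an official script) that tallies pages per namespace. It assumes each `<page>` carries a `<ns>` element, as in newer MediaWiki export schema versions; older dumps would need title-prefix matching instead. Odd namespace numbers are the talk namespaces the commenter did not expect to see.

```python
# Namespace tally sketch (not an official script): count pages per namespace
# in a pages-articles dump. Assumes a per-page <ns> element, present in newer
# MediaWiki export schema versions. Odd namespace numbers are talk namespaces.
import io
import xml.etree.ElementTree as ET
from collections import Counter

def namespace_counts(xml_file):
    """Stream a dump and count pages per numeric namespace."""
    counts = Counter()
    for _, elem in ET.iterparse(xml_file):
        # endswith() tolerates namespaced tags like "{...}page"
        if elem.tag.endswith("page"):
            ns = elem.findtext("ns")
            if ns is not None:
                counts[int(ns)] += 1
            elem.clear()  # free each <page> as we go
    return counts

def talk_namespaces(counts):
    """Keep only odd (talk) namespaces, which pages-articles should not contain."""
    return {ns: n for ns, n in counts.items() if ns % 2 == 1}

# Made-up two-page dump: one article, plus a talk page that should not be there.
_dump = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title><ns>0</ns></page>"
    "<page><title>Talk:A</title><ns>1</ns></page>"
    "</mediawiki>")
counts = namespace_counts(_dump)
talk = talk_namespaces(counts)
```

If `talk_namespaces` returns anything non-empty for a pages-articles dump, that run has the problem the commenter describes.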