XML dumps resumed

Folks who use XML dumps of our projects will know that the dumps process has been stalled while we investigated bug 23264. We have been running individual project dumps manually and asking people to inspect them carefully. We have now restarted the automated dumps, and various code fixes should be checked in shortly. Thanks to all for your assistance and your patience.
If you are working with the XML dumps of the English language Wikipedia containing all page revisions (pages-meta-history), please note the following issues with the two completed runs.
The January 30 run is missing the text for a large number of old revisions of articles, primarily revisions created between January 1, 2005 and May 14, 2005. This was due to bug 20757, which has since been fixed. If you are doing analysis using the text data, you can retrieve the missing text by extracting it from an earlier file; see the archives.
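For those who want to backfill the missing text themselves, the general approach is to stream both dumps and copy text across for matching revision IDs. Below is a minimal sketch, not an official tool: it assumes the standard MediaWiki XML export layout, and it demonstrates the idea on tiny made-up in-memory dumps, whereas real pages-meta-history files are tens of gigabytes compressed and would need to be streamed from disk with a disk-backed index.

```python
# Backfill sketch: copy revision text from an earlier dump into a newer one
# by matching revision IDs. Illustrative only -- real enwiki history dumps
# are far too large for the in-memory index used here. Element names follow
# the MediaWiki XML export schema; the tiny dumps at the bottom are made up.
import io
import xml.etree.ElementTree as ET

def revision_texts(xml_file):
    """Stream a dump and map revision id -> text for revisions that have text."""
    texts = {}
    for _, elem in ET.iterparse(xml_file):  # default: "end" events
        # endswith() tolerates namespaced tags like "{...}revision"
        if elem.tag.endswith("revision"):
            rev_id = elem.findtext("id")
            text = elem.findtext("text")
            if rev_id is not None and text:
                texts[rev_id] = text
            elem.clear()  # keep memory bounded while streaming
    return texts

def backfill(new_xml, old_xml):
    """Map revision id -> text for the new dump, filling gaps from the old one."""
    old = revision_texts(old_xml)
    merged = {}
    for _, elem in ET.iterparse(new_xml):
        if elem.tag.endswith("revision"):
            rev_id = elem.findtext("id")
            text = elem.findtext("text")
            merged[rev_id] = text if text else old.get(rev_id, "")
            elem.clear()
    return merged

# Tiny in-memory demo; revision 1 lost its text in the "new" dump:
_old = io.StringIO(
    "<mediawiki><page><revision><id>1</id>"
    "<text>hello</text></revision></page></mediawiki>")
_new = io.StringIO(
    "<mediawiki><page><revision><id>1</id><text></text></revision>"
    "<revision><id>2</id><text>world</text></revision></page></mediawiki>")
merged = backfill(_new, _old)
```

Since the published dumps are bzip2- or 7z-compressed, in practice you would open them with something like `bz2.open(path, "rt")` rather than a plain file object.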
The March 12 run is incomplete; it is missing about the last third of the revisions, due to early termination during the compression step.
The stubs files and the current page dumps appear to be fine, so statistical or other analyses that use only these files should not be affected. The MySQL table dumps are also unaffected.
We apologize for the inconvenience and are working on getting out a set of complete full history dumps with all revision text intact.

Archive notice: This is an archived post from blog.wikimedia.org, which operated under different editorial and content guidelines than Diff.

4 Comments

Will you please make the dumps and the image bundles available in a series of small, say ~50MB split(1) files? I have read that many people have problems with connections being reset after most of a lengthy file has been downloaded. I have no objection to downloads of such files served on a bandwidth-limited host, if they become popular, but the fact that no mirrors have arisen is disturbing. It would also be nice to have the Foundation projects’ extensions and gadgets available in executable installer files for Linux, Mac OS X, and Microsoft Windows, for those who wish to establish…

50MB split files might be a bit unwieldy, since we are talking about 32GB compressed files as it is. However, we are certainly thinking about ways to provide the English Wikipedia dumps in somewhat smaller chunks.

I’ve been working with the March 12 dump, and it seems that discussion and user pages were included in pages-articles.xml. Was this a mistake in the dump, or should the page table contain rows with odd page_namespace values?
An older enwiki from 2009 did not contain those values.

pages-articles.xml should only contain pages from the main namespace. I’ll need to look at that more closely. Thank you for the heads up.
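For anyone who wants to verify this on their own copy, here is a quick sketch (not an official script) that tallies pages per namespace. It assumes each `<page>` carries a `<ns>` element, as in newer MediaWiki export schema versions; older dumps would need title-prefix matching instead. Odd namespace numbers are the talk namespaces the commenter did not expect to see.

```python
# Namespace tally sketch (not an official script): count pages per namespace
# in a pages-articles dump. Assumes a per-page <ns> element, present in newer
# MediaWiki export schema versions. Odd namespace numbers are talk namespaces.
import io
import xml.etree.ElementTree as ET
from collections import Counter

def namespace_counts(xml_file):
    """Stream a dump and count pages per numeric namespace."""
    counts = Counter()
    for _, elem in ET.iterparse(xml_file):
        # endswith() tolerates namespaced tags like "{...}page"
        if elem.tag.endswith("page"):
            ns = elem.findtext("ns")
            if ns is not None:
                counts[int(ns)] += 1
            elem.clear()  # free each <page> as we go
    return counts

def talk_namespaces(counts):
    """Keep only odd (talk) namespaces, which pages-articles should not contain."""
    return {ns: n for ns, n in counts.items() if ns % 2 == 1}

# Made-up two-page dump: one article, plus a talk page that should not be there.
_dump = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title><ns>0</ns></page>"
    "<page><title>Talk:A</title><ns>1</ns></page>"
    "</mediawiki>")
counts = namespace_counts(_dump)
talk = talk_namespaces(counts)
```

If `talk_namespaces` returns anything non-empty for a pages-articles dump, that run has the problem the commenter describes.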