If you ever look into using Wikipedia content for analysis — and wow, there’s tons of cool stuff to do with it — you’ll want a local copy. You can’t scale an analysis that queries the public server, and it’s pretty rude to even try. Setting up a local copy takes some time, but at the end you’ll have a goodly portion of accumulated human knowledge, and metadata relating it, right there on your local drive.
Quick caveat: there’s a difference between a Wikipedia mirror — which is a faithful duplicate of all the content on Wikipedia — and what I’m calling a Wikipedia local copy: page text and some of the metadata. A cache, basically. Other guides describe how to set up a mirror, and it allegedly takes twelve days. That was far beyond my budget.
Another caveat: even a not-perfect-fidelity Wikipedia local copy occupies a lot of disk space. The page contents themselves are about 50GB. Depending on how much metadata you import (see below), usage can go up quite a bit from there. My local copy is just over 200GB, for reference.
Here are the steps.
Download the “pages-articles” dump
Navigate to the Wikimedia database dump list. Choose any project you wish; for this post I’ll assume English Wikipedia (“enwiki”), which happens to be the largest Wikimedia project. Here’s the latest enwiki dump (20160820) as of the time of this writing.
Search that page for a file named like “enwiki-20160820-pages-articles.xml.bz2” and start downloading it. This is an archive of the contents of all the project’s pages. For example, the 20160820 pages-articles dump is 12.3GB, so it’ll take a while.
Keep this tab open: you might want to download more files from here later.
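If you prefer to script the download, here’s a sketch. The date (20160820) is just the example used throughout this post; substitute the latest dump date from the list page. The helper function is hypothetical, but the URL layout matches how dumps.wikimedia.org serves the files:

```shell
# Compose the dump URL from a project name and dump date. The layout is
# https://dumps.wikimedia.org/<project>/<date>/<project>-<date>-pages-articles.xml.bz2
dump_url() {
  echo "https://dumps.wikimedia.org/$1/$2/$1-$2-pages-articles.xml.bz2"
}

# "-C -" lets curl resume a partially-completed download after an
# interruption, which is handy for a 12GB file:
# curl -C - -O "$(dump_url enwiki 20160820)"
```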
Install MediaWiki software
On OS X, this is quite painless: install the Bitnami MediaWiki Stack and follow its configuration instructions. When you set a username and password, make sure to remember the password (or write it down): you’ll need it below. Make sure to start the services.
Bitnami has installers for other platforms, and cloud VMs ready to launch out of the box. There are also plenty of other setup guides out there, particularly for MediaWiki on Linux.
After setup, you should be able to load http://localhost:8080/mediawiki/Main_Page and see … well nothing really, just an empty MediaWiki main page. We’ll import the real stuff below. Don’t create any content here: we’re going to wipe the DB.
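If you’d rather sanity-check the install from a shell than a browser, you can hit the main page with curl. A sketch, assuming the default Bitnami port of 8080:

```shell
# Print the HTTP status code for a page on the local wiki; expect 200
# once the Bitnami services are running.
wiki_status() {
  curl -s -o /dev/null -w '%{http_code}' "http://localhost:8080/mediawiki/$1"
}

# wiki_status Main_Page
```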
Build the pages-articles import tool
Unfortunately, this is a bit of a pain. First you need Java >= 1.7. Download the JDK here and follow the instructions for the installer. (On Linux, you may want to use your distribution’s Java installation method.) On OS X, add this line to your .profile or .bashrc:
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
(On Linux, you may need to update your preferred Java version.) Next you need to install git and maven. On OS X, homebrew is very convenient for this:
brew install git maven
(And likewise on Linux.) Now we can build the pages-articles importer.
mkdir ~/wikipedia
cd ~/wikipedia
git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper
cd mwdumper
mvn package
OK! If you see a compiler error about an unsupported Java version, ensure that “javac -version” says something like “javac 1.8.0_102”; otherwise make sure you followed the steps above to set your preferred Java.
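If you want to automate that version check, here’s a hypothetical helper that tests the “Java >= 1.7” requirement against a version string of the old “1.x” form shown above:

```shell
# Return success if a "1.x"-style Java version string is 1.7 or newer,
# e.g. java_ok "1.8.0_102" succeeds, java_ok "1.6.0_65" fails.
java_ok() {
  minor=$(echo "$1" | cut -d. -f2)   # "1.8.0_102" -> "8"
  [ "$minor" -ge 7 ]
}
```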
Has your pages-articles download finished yet? No? OK, come back here when it does.
For simplicity, let’s say that you’ve chosen the 20160820 enwiki dump, and you saved the pages-articles archive to “~/Downloads”. These instructions assume a Bitnami install on OS X, but will work on Linux with small tweaks to the tool paths and DB names etc.
First we need to wipe any existing content from the MediaWiki DBs. [Ed: you can probably skip the DB wipe, but I did it anyway.]
export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
mysql -u root -p bitnami_mediawiki
Enter password:
Here you’ll need to provide the password you chose during configuration — you remembered it, right? The username in the command (“-u root”), however, is always “root” here, no matter what username you picked above.
mysql> DELETE FROM page; DELETE FROM text; DELETE FROM revision;
mysql> quit
cd /Applications/mediawiki-1.26.3-1/apps/mediawiki/htdocs/maintenance/
php rebuildall.php
Now we can import pages-articles:
cd ~/wikipedia
mkdir enwiki-20160820
cd enwiki-20160820
mv ~/Downloads/enwiki-20160820-pages-articles.xml.bz2 ./
java -jar ../mwdumper/target/mwdumper-1.25.jar \
  --format=mysql:1.25 enwiki-20160820-pages-articles.xml.bz2 \
  | mysql -u root -p bitnami_mediawiki
This will take several hours at least; you may want to run it overnight. See you later!
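If you’d like to gauge progress from a second terminal while the import runs, counting rows in the growing tables works well. A sketch — the helper is hypothetical, and the mysql invocation is left commented since it prompts for your password:

```shell
# Build a row-count query for a given table; page, text, and revision
# all grow steadily as mwdumper feeds the import.
count_query() {
  echo "SELECT COUNT(*) FROM $1;"
}

# From another terminal:
# mysql -u root -p bitnami_mediawiki -e "$(count_query page)"
```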
[WARNING: when I imported a dump several months older than 20160820, the import tool died with a parse error just before the import finished. This seems to have resulted in some articles not being imported, but didn’t noticeably impact my project at the time. YMMV. I don’t know if this is still an issue.]
After the import finishes, you should be able to load an arbitrary article, for example http://localhost:8080/mediawiki/War_hammer (from enwiki). Of course, the displayed page will look considerably different from how it does on the public server, because our local copy only has the page text.
Import metadata (optional)
Depending on your use case, you may want more than just the article text. Details about the available metadata archives are beyond the scope of this article, but they’re hosted on the same dump page from which you downloaded the pages-articles archive.
Let’s say you want to import the redirect list metadata. Search the dump page for a file called something like “enwiki-20160820-redirect.sql.gz” and download it. Then import it with
export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
cd ~/wikipedia/enwiki-20160820
mv ~/Downloads/enwiki-20160820-redirect.sql.gz ./
gunzip -c enwiki-20160820-redirect.sql.gz | mysql -u root -p bitnami_mediawiki
To import another metadata archive, follow the steps above, replacing “redirect” with the name of the metadata you want. Be aware that some of the larger metadata archives take hours to import.
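If you’re importing several metadata archives, a small loop saves repetition. A sketch — the table names below are just examples of archives that appear on the dump page; use whichever ones you actually downloaded, and uncomment the mysql line when ready:

```shell
# Import a batch of metadata archives into the Bitnami DB.
DUMP=enwiki-20160820
for table in redirect page_props pagelinks; do
  f="${DUMP}-${table}.sql.gz"
  echo "importing ${f}"
  # gunzip -c "${f}" | mysql -u root -p bitnami_mediawiki
done
```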
That’s it! Now you have a local copy of Wikipedia accessible offline, and — more interestingly to me — query-able through the MediaWiki API with minimal latency and no throttling. (And the raw DB tables are available to power users.)
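As a taste of that API access, here’s a sketch of a query against the local server. The api.php location assumes the Bitnami layout used throughout this post (the endpoint sits alongside index.php; your setup may differ), and the helper function is hypothetical:

```shell
# Build a MediaWiki API query URL for basic info about a page title.
api_url() {
  echo "http://localhost:8080/mediawiki/api.php?action=query&prop=info&titles=$1&format=json"
}

# curl -s "$(api_url War_hammer)"
```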
Protip: I found it quite handy to design my queries using the Wikipedia API Sandbox on the public server — which of course has all the metadata and secondary content — and then try them locally. Having everything available on the public server makes queries easier to debug, and comparing public vs. local results shows where you may need to import more metadata archives. Keep in mind, though, that the public server’s content is constantly changing, so you won’t always get exactly the same results locally even if all dependencies have been imported properly.