A lovely Wikipedia, just for me

If you ever look into using Wikipedia content for analysis — and wow, there’s tons of cool stuff to do with it — you’ll want a local copy.  You can’t scale an analysis that queries the public server, and it’s pretty rude to even try.  Setting up a local copy takes some time, but at the end you’ll have a goodly portion of accumulated human knowledge, and metadata relating it, right there on your local drive.


Quick caveat: there’s a difference between a Wikipedia mirror — which is a faithful duplicate of all the content on Wikipedia — and what I’m calling a Wikipedia local copy: page text and some of the metadata.  A cache, basically.  Other guides describe how to set up a mirror, and it allegedly takes twelve days.  That was far beyond my budget.

Another caveat: even a not-perfect-fidelity Wikipedia local copy occupies a lot of disk space.  The page contents themselves are about 50GB.  Depending on how much metadata you import (see below), usage can go up quite a bit from there.  My local copy is just over 200GB, for reference.
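Before downloading anything, it’s worth checking that the target drive actually has room. Here’s a minimal sketch of such a check; the 250GB threshold is my own made-up margin on top of the ~200GB figure above, and free_gb is just a helper name I picked.

```shell
# Report free space, in whole GB, on the filesystem that holds a directory.
# "df -Pk" prints POSIX-format output in 1024-byte blocks; column 4 is
# the available space, and 1048576 KB make one GB.
free_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'
}

# Hypothetical threshold: a little above the ~200GB mentioned above.
need_gb=250
have_gb="$(free_gb "$HOME")"

if [ "$have_gb" -lt "$need_gb" ]; then
  echo "Only ${have_gb}GB free; a full local copy may need ${need_gb}GB or more."
else
  echo "${have_gb}GB free; that should be enough."
fi
```

If you plan to keep the copy on an external drive, pass that drive’s mount point to free_gb instead of $HOME.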

Here are the steps.

Download the “pages-articles” dump

Navigate to the Wikimedia database dump list.  Choose any project you wish; for this post I’ll assume English Wikipedia (“enwiki”), which happens to be the largest Wikimedia project.  Here’s the latest enwiki dump (20160820) as of the time of this writing.

Search that page for a file named like “enwiki-20160820-pages-articles.xml.bz2” and start downloading it.  This is an archive of the contents of all the project pages.  The 20160820 pages-articles dump is 12.3GB, for example, so it’ll take a while.

Keep this tab open: you might want to download more files from here later.
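If you’d rather download from a terminal (handy for resuming a 12GB transfer), something like the sketch below may work. It assumes the archive lives at dumps.wikimedia.org under a project/date directory, which is how the dump pages are laid out as far as I can tell; dump_file and dump_url are helper names I made up. Older dumps may no longer be hosted, so substitute a current date.

```shell
# Build the file name of a pages-articles archive from project and date.
dump_file() {
  echo "$1-$2-pages-articles.xml.bz2"
}

# Build the download URL.
# Assumption: dumps are served from https://dumps.wikimedia.org/<project>/<date>/.
dump_url() {
  echo "https://dumps.wikimedia.org/$1/$2/$(dump_file "$1" "$2")"
}

# Usage (not run automatically, because the download is large):
#   cd ~/Downloads
#   curl -L -O -C - "$(dump_url enwiki 20160820)"
# "-L" follows redirects, "-O" keeps the remote file name, and
# "-C -" resumes a partial download if the connection dropped.

# Print the URL so you can check it in a browser first.
dump_url enwiki 20160820
```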

Install MediaWiki software

On OS X, this is quite painless: install the Bitnami MediaWiki Stack and follow its configuration instructions.  When you set a username and password, make sure to remember the password (or write it down): you’ll need it below.  Make sure to start the services.

Bitnami has installers for other platforms, and cloud VMs ready to launch out of the box.  There are also plenty of other setup guides out there, particularly for MediaWiki on Linux.

After setup, you should be able to load http://localhost:8080/mediawiki/Main_Page and see … well nothing really, just an empty MediaWiki main page.  We’ll import the real stuff below.  Don’t create any content here: we’re going to wipe the DB.

Build the pages-articles import tool

Unfortunately, this is a bit of a pain.  First you need Java >= 1.7.  Download the JDK here and follow the instructions for the installer.  (On Linux, you may want to use your distribution’s Java installation method.)  On OS X, add this line to your .profile or .bashrc:

export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"

(On Linux, you may need to update your preferred Java version.)  Next you need to install git and maven.  On OS X, homebrew is very convenient for this:

brew install git maven

(And likewise on Linux.)  Now we can build the pages-articles importer.

mkdir ~/wikipedia
cd ~/wikipedia
git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper
cd mwdumper
mvn package

OK!  If you see a compiler error about an unsupported Java version, ensure that “javac -version” says something like “javac 1.8.0_102”; otherwise make sure you followed the steps above to set your preferred Java.
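If you want to script that version check, here’s a rough sketch. The java_major helper is my own invention; it just pulls the major version out of a “javac -version” string, handling both the old “1.8.0_102” style and the newer “11.0.2” style.

```shell
# Extract the Java major version from a "javac -version" string.
#   "javac 1.8.0_102" -> 8   (old scheme: the major version follows "1.")
#   "javac 11.0.2"    -> 11  (new scheme: the major version comes first)
java_major() {
  v="${1#javac }"            # drop the leading "javac "
  case "$v" in
    1.*) v="${v#1.}" ;;      # old scheme: drop the leading "1."
  esac
  echo "${v%%[!0-9]*}"       # keep only the leading digits
}

# Usage (javac 8 and older print the version on stderr, hence 2>&1):
#   major="$(java_major "$(javac -version 2>&1)")"
#   [ "$major" -ge 7 ] || echo "Java too old for the build: $major"

java_major "javac 1.8.0_102"   # prints 8
```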

Import pages-articles

Has your pages-articles download finished yet?  No?  OK, come back here when it does.

For simplicity, let’s say that you’ve chosen the 20160820 enwiki dump, and you saved the pages-articles archive to “~/Downloads”.  These instructions assume a Bitnami install on OS X, but will work on Linux with small tweaks to the tool paths and DB names etc.

First we need to wipe any existing content from the MediaWiki DBs.  [Ed: you can probably skip the DB wipe, but I did it anyway.]

export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
mysql -u root -p bitnami_mediawiki
Enter password:

Here you’ll need to provide the password you chose during configuration (you remembered it, right?).  Ignore the username part of the command (“-u root”): it’s always “root” here, no matter what username you chose above.

mysql> DELETE FROM page; DELETE FROM text; DELETE FROM revision;
mysql> quit
cd /Applications/mediawiki-1.26.3-1/apps/mediawiki/htdocs/maintenance/
php rebuildall.php

Now we can import pages-articles:

cd ~/wikipedia
mkdir enwiki-20160820
cd enwiki-20160820
mv ~/Downloads/enwiki-20160820-pages-articles.xml.bz2 ./
java -jar ../mwdumper/target/mwdumper-1.25.jar \
  --format=mysql:1.25 enwiki-20160820-pages-articles.xml.bz2 \
  | mysql -u root -p bitnami_mediawiki

This will take several hours at least; you may want to run it overnight.  See you later!

[WARNING: when I imported a dump several months older than 20160820, the import tool died with a parse error just before the import finished.  This seems to have resulted in some articles not being imported, but didn’t noticeably impact my project at the time.  YMMV.  I don’t know if this is still an issue.]

After the import finishes, you should be able to load an arbitrary article, for example http://localhost:8080/mediawiki/War_hammer (from enwiki).  Of course, the displayed page will look considerably different than it does on the public server, because our local copy only has the page text.

Import metadata (optional)

Depending on your use case, you may want more than just the article text.  Details about the available metadata archives are beyond the scope of this article, but they’re hosted on the same dump page from which you downloaded the pages-articles archive.

Let’s say you want to import the redirect list metadata.  Search the dump page for a file called something like “enwiki-20160820-redirect.sql.gz” and download it.  Then import it with

export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
cd ~/wikipedia/enwiki-20160820
mv ~/Downloads/enwiki-20160820-redirect.sql.gz ./
gunzip -c enwiki-20160820-redirect.sql.gz | mysql -u root -p bitnami_mediawiki

To import another metadata archive, follow the steps above, replacing “redirect” with the name of the metadata you want to import.  Be aware that some of the larger metadata archives take hours to import.
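If you end up importing several archives, the “replace redirect with something else” step can be wrapped in a small function. This is only a sketch: metadata_file and import_metadata are names I made up, and it assumes mysql is already on your PATH and the DB is named bitnami_mediawiki as above. It decompresses with “gunzip -c” because the metadata archives are gzip files.

```shell
# Build the archive name for one metadata table, e.g. "redirect".
metadata_file() {
  echo "$1-$2-$3.sql.gz"
}

# Import one metadata archive from the current directory.
# Assumes mysql is on PATH and the DB is named bitnami_mediawiki.
import_metadata() {
  f="$(metadata_file "$1" "$2" "$3")"
  if [ ! -f "$f" ]; then
    echo "missing archive: $f" >&2
    return 1
  fi
  # The archive is gzip-compressed SQL; stream it straight into mysql.
  gunzip -c "$f" | mysql -u root -p bitnami_mediawiki
}

# Usage, from ~/wikipedia/enwiki-20160820 with the archives moved there:
#   import_metadata enwiki 20160820 redirect
#   import_metadata enwiki 20160820 pagelinks
```

Each call prompts for the DB password, so this isn’t something you can leave running unattended across several archives without further tweaks.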

Have fun

That’s it!  Now you have a local copy of Wikipedia accessible offline, and — more interestingly to me — query-able through the MediaWiki API with minimal latency and no throttling.  (And the raw DB tables are available to power users.)

Protip: I found it quite handy to design my queries using the Wikipedia API Sandbox on the public server, which of course has all the metadata and secondary content, and then to try the queries locally.  Having everything available on the public server makes queries easier to debug, and comparing results public vs. local lets you see where you may need to import more metadata archives.  But keep in mind that the public server content is constantly changing, so you won’t always get exactly the same results locally even if all dependencies have been imported properly.
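For the public-vs-local comparison, a sketch like the following might help. The local endpoint is an assumption on my part (api.php sitting next to the wiki pages in the Bitnami layout), query_url is a made-up helper, and “prop=info” is just a simple query to start from.

```shell
# Assumed endpoints: api.php alongside the local wiki pages, and the
# public English Wikipedia API.
LOCAL_API="http://localhost:8080/mediawiki/api.php"
PUBLIC_API="https://en.wikipedia.org/w/api.php"

# Build a basic "page info" query URL for one title.
# Titles with spaces or punctuation need URL-encoding first;
# this sketch only handles simple titles like War_hammer.
query_url() {
  echo "$1?action=query&prop=info&format=json&titles=$2"
}

# Usage (needs curl, and the local services running):
#   curl -s "$(query_url "$LOCAL_API" War_hammer)"  > local.json
#   curl -s "$(query_url "$PUBLIC_API" War_hammer)" > public.json
#   diff local.json public.json

# Print the local query URL so you can paste it into a browser.
query_url "$LOCAL_API" War_hammer
```

Expect some differences even when everything is imported, since page IDs and timestamps on the public server keep moving.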
