A lovely Wikipedia, just for me

If you ever look into using Wikipedia content for analysis — and wow, there’s tons of cool stuff to do with it — you’ll want a local copy.  You can’t scale an analysis that queries the public server, and it’s pretty rude to even try.  Setting up a local copy takes some time, but at the end you’ll have a goodly portion of accumulated human knowledge, and metadata relating it, right there on your local drive.

Quick caveat: there’s a difference between a Wikipedia mirror — which is a faithful duplicate of all the content on Wikipedia — and what I’m calling a Wikipedia local copy: page text and some of the metadata.  A cache, basically.  Other guides describe how to set up a mirror, and it allegedly takes twelve days.  That was far beyond my budget.

Another caveat: even a not-perfect-fidelity Wikipedia local copy occupies a lot of disk space.  The page contents themselves are about 50GB.  Depending on how much metadata you import (see below), usage can go up quite a bit from there.  My local copy is just over 200GB, for reference.

Here are the steps.

Download the “pages-articles” dump

Navigate to the Wikimedia database dump list.  Choose any project you wish; for this post I’ll assume English Wikipedia (“enwiki”), which happens to be the largest Wikimedia project.  Here’s the latest enwiki dump (20160820) as of the time of this writing.

Search that page for the file named like “enwiki-20160820-pages-articles.xml.bz2” and start downloading it.  This is an archive of the contents of all the project pages.  The 20160820 pages-articles dump, for example, is 12.3GB, so it’ll take a while.

Keep this tab open: you might want to download more files from here later.
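
If you’d rather grab the archive from the terminal, something like this should work (the URL below is my assumption based on the usual dumps.wikimedia.org layout; copy the exact link from the dump page to be safe):

# Download the pages-articles archive, resuming if the transfer is interrupted.
curl -L -C - -o ~/Downloads/enwiki-20160820-pages-articles.xml.bz2 \
  "https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-pages-articles.xml.bz2"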

Install MediaWiki software

On OS X, this is quite painless: install the Bitnami MediaWiki Stack and follow its configuration instructions.  When you set a username and password, make sure to remember the password (or write it down): you’ll need it below.  Make sure to start the services.

Bitnami has installers for other platforms, and cloud VMs ready to launch out of the box.  There are also plenty of other setup guides out there, particularly for MediaWiki on Linux.

After setup, you should be able to load http://localhost:8080/mediawiki/Main_Page and see … well nothing really, just an empty MediaWiki main page.  We’ll import the real stuff below.  Don’t create any content here: we’re going to wipe the DB.
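
A quick way to check this from the terminal (assuming the default Bitnami port of 8080 used throughout this post):

# Should print the <title> of the (empty) main page if MediaWiki is serving.
curl -sL http://localhost:8080/mediawiki/Main_Page | grep -i "<title>"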

Build the pages-articles import tool

Unfortunately, this is a bit of a pain.  First you need Java >= 1.7.  Download the JDK here and follow the instructions for the installer.  (On Linux, you may want to use your distribution’s Java installation method.)  On OS X, add this line to your .profile or .bashrc:

export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"

(On Linux, you may need to update your preferred Java version.)  Next you need to install git and maven.  On OS X, homebrew is very convenient for this:

brew install git maven

(And likewise on Linux.)  Now we can build the pages-articles importer.

mkdir ~/wikipedia
cd ~/wikipedia
git clone https://gerrit.wikimedia.org/r/mediawiki/tools/mwdumper
cd mwdumper
mvn package

OK!  If you see a compiler error about an unsupported Java version, check that “javac -version” reports something like “javac 1.8.0_102”; if it doesn’t, make sure you followed the steps above to set your preferred Java.

Import pages-articles

Has your pages-articles download finished yet?  No?  OK, come back here when it does.

For simplicity, let’s say that you’ve chosen the 20160820 enwiki dump, and you saved the pages-articles archive to “~/Downloads”.  These instructions assume a Bitnami install on OS X, but will work on Linux with small tweaks to the tool paths and DB names etc.

First we need to wipe any existing content from the MediaWiki DBs.  [Ed: you can probably skip the DB wipe, but I did it anyway.]

export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
mysql -u root -p bitnami_mediawiki
Enter password:

Here you’ll need to provide the password you chose during configuration — you remembered it, right? — but note that the username in the command (“-u root”) is always “root”, regardless of the username you chose above.

mysql> DELETE FROM page; DELETE FROM text; DELETE FROM revision;
mysql> quit
cd /Applications/mediawiki-1.26.3-1/apps/mediawiki/htdocs/maintenance/
php rebuildall.php

Now we can import pages-articles:

cd ~/wikipedia
mkdir enwiki-20160820
cd enwiki-20160820
mv ~/Downloads/enwiki-20160820-pages-articles.xml.bz2 ./
java -jar ../mwdumper/target/mwdumper-1.25.jar \
  --format=mysql:1.25 enwiki-20160820-pages-articles.xml.bz2 \
  | mysql -u root -p bitnami_mediawiki

This will take several hours at least; you may want to run it overnight.  See you later!

[WARNING: when I imported a dump several months older than 20160820, the import tool died with a parse error just before the import finished.  This seems to have resulted in some articles not being imported, but didn’t noticeably impact my project at the time.  YMMV.  I don’t know if this is still an issue.]

After the import finishes, you should be able to load an arbitrary article, for example http://localhost:8080/mediawiki/War_hammer (from enwiki).  Of course, the displayed page will look considerably different than it does on the public server, because our local copy only has the page text.
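
You can also get a rough sense of whether the import succeeded by counting rows in the tables we wiped earlier; for enwiki the counts should be in the millions:

export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
# Rough sanity check: row counts for the imported pages and revisions.
mysql -u root -p bitnami_mediawiki -e "SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision;"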

Import metadata (optional)

Depending on your use case, you may want more than just the article text.  Details about the available metadata archives are beyond the scope of this article, but they’re hosted on the same dump page from which you downloaded the pages-articles archive.

Let’s say you want to import the redirect list metadata.  Search the dump page for a file called something like “enwiki-20160820-redirect.sql.gz” and download it.  Then import it with

export PATH="/Applications/mediawiki-1.26.3-1/mysql/bin/:$PATH"
cd ~/wikipedia/enwiki-20160820
mv ~/Downloads/enwiki-20160820-redirect.sql.gz ./
gunzip -c enwiki-20160820-redirect.sql.gz | mysql -u root -p bitnami_mediawiki

To import another metadata archive, follow the same steps, replacing “redirect” with the name of the archive you want.  Be aware that some of the larger metadata archives take hours to import.

Have fun

That’s it!  Now you have a local copy of Wikipedia accessible offline, and — more interestingly to me — query-able through the MediaWiki API with minimal latency and no throttling.  (And the raw DB tables are available to power users.)

Protip: I found it quite handy to design my queries using the Wikipedia API Sandbox on the public server and then try them locally.  The public server has all the metadata and secondary content, which makes queries easier to debug, and comparing public results against local ones shows you where you may need to import more metadata archives.  Keep in mind, though, that the public server content is constantly changing, so you won’t always get exactly the same results locally even if all dependencies have been imported properly.
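
As a concrete starting point, here’s a minimal query against the local API that fetches an article’s wikitext (the api.php path is my assumption based on the Bitnami layout above; adjust it if your install differs):

# Fetch the current wikitext of one article as JSON via the MediaWiki API.
curl -s "http://localhost:8080/mediawiki/api.php?action=query&format=json&prop=revisions&rvprop=content&titles=War_hammer"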

Node 6 woes

I’m working on a project that involves processing a fairly big data set — I’ll have more to say about the project later — and it was convenient to write my analysis using node.js.  That turned out to not be a good idea.

After my long-running analysis hit around 1.5GB of heap usage (on a machine with 16GB RAM), my program mysteriously crashed: about 1.5GB is V8’s default old-space limit on 64-bit builds.  As you may know, you have to explicitly override that default with:

node --max_old_space_size=X myscript.js

where X is the size in MB.  Mildly irritating, but whatever.
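
One way to confirm the new limit actually took effect is to ask V8 itself; this is a quick sketch using node’s built-in v8 module, with 8192 as an example value (the reported limit is roughly the old-space setting plus some overhead):

# Prints the effective heap size limit in MB.
node --max_old_space_size=8192 -e \
  "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024)"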

As my analysis marched on to around 4GB of heap usage, I saw a new mysterious crash that looked like:

#
# Fatal error in ../deps/v8/src/heap/spaces.h, line 1516
# Check failed: size_ >= 0.
#

==== C stack trace ===============================

1: V8_Fatal
2: 0xc56387
3: v8::internal::MarkCompactCollector::SweepSpaces()
4: v8::internal::MarkCompactCollector::CollectGarbage()
5: v8::internal::Heap::MarkCompact()
6: v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags)
7: v8::internal::Heap::CollectGarbage(v8::internal::GarbageCollector, char const*, char const*, v8::GCCallbackFlags)
8: v8::internal::Heap::HandleGCRequest()
9: v8::internal::StackGuard::HandleInterrupts()
10: v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*)
11: 0xf767a506338
Illegal instruction

Some searching turned up this v8 bug, in which a variable used to track heap space during sweeps regressed to an “int” (I think?) and so overflows on heaps larger than 4GB.  This regression is rather more annoying, and a bit of a head-scratcher.

The workaround is to downgrade to node 5.11.  You can check your system with the excellent test case and instructions here.

As a side effect, after the downgrade the performance on my workload increased by 15–30%.  Also a bit of a head-scratcher.

So, on the whole, 🤔.  You might want to think twice about using node.js for programs that are very resource-intensive[1].

However, there were some rather pleasant parts of my experience with the latest-and-greatest node tools that I may write up later.


[1] Yes, I could have rewritten my program to use multiple processes, or a different analysis backend or computational model or …  I’m just recording my experience here as a relatively naive user.

Shot put: the doping-est event in history?

The women’s shot put final at the 2016 Olympics was quite entertaining.  There was a whiff of David-and-Goliath as Michelle Carter (USA) knocked off the favorite, Valerie Adams (New Zealand), with her last put.  That last put (20.63m, 67′ 8.2″) was a personal best and American record for Ms. Carter.

Valerie Adams was almost born to put shot.  She won the World Youth and World Junior Championships, and since 2007 has dominated the World Championships and Olympics.  Between 2010 and 2014, she had a 54-meet win streak.  The Adams family is noted in New Zealand for their sports genes; Valerie is 6’4″ tall and weighs in at 260lbs.  Her brother Steven plays in the NBA, standing at 7’0″, weighing 255lbs.  A couple other brothers played professional basketball in New Zealand.

To say that the shot-put final was a David-and-Goliath contest is stretching it, though.  Michelle Carter’s father Michael won the silver medal in shot put at the 1984 Olympics and went on to be a Pro-Bowl and Super-Bowl-champion nose tackle in the NFL.  Michelle won the 2004 World Juniors and was 5th in the 2012 Olympics.  She’s 5’9″ and weighs 300lbs, while Dad was listed at 6’2″ and 285lbs.

All this is to say that two badasses fought it out in the 2016 final.  The world record isn’t always set in the Olympics, of course, but given these two I idly wondered how far off they were from it.  I expected that Valerie Adams had set it a few years back, or something like that.  Nope!  They weren’t even close, this year’s winning put being exactly 2m short of the world record (that’s a lot).

What the heck is going on?

If you look at the history of shot put, some strange trends emerge.

[Chart: shot put progressions]
History of shot put performance: men’s and women’s Olympic winners and world records from 1896 to 2016

(The “dots” in the chart are the actual winning or record puts for the given year; the lines merely connect adjacent dots.  Only one world record value (the largest) is recorded in any year in which it was broken; for example, the women’s world record was broken 5 times in 1969, but only the last value (20.43m) is recorded for 1969.)

There’s a lot going on here, let’s break it down:

  • No points are cut off on the graph: women’s records only started to be kept after men’s, and shot put wasn’t an Olympic event for women until 1948.
  • The gap in records between 1936 and 1948 is real, and for the obvious reason (WWII).
  • Both men’s and women’s performance increases quite a bit starting around the mid-to-late 50’s.  It’s hard to say how anomalous the increase is.
  • From 1976 until 1987, the gap between men’s and women’s world records closed, and for 8 years in that span the women’s record was farther than the men’s.
  • The women’s Olympic winning put in 1984 is a precipitous drop from preceding and succeeding years.  (The Soviet Union boycotted the 1984 Olympics.)
  • Performance plateaued in the late 80’s and then fell off a cliff; neither men nor women have come within shouting distance of the world records since.
  • Men’s and women’s performances have diverged significantly again since the late 80’s.

Veeeeery interesting.  Of course, if you know anything about Olympic history, you’re jumping up and down yelling “doping” or “steroids” by now.  And you’re definitely right.  In the mid 50’s, anabolic steroids started to become available and be used in some sporting events.  In the mid-70’s, the International Olympic Committee banned steroids and started testing for them.  And in the late 80’s and early 90’s, steroid testing finally started to become somewhat effective.[1]  You don’t have to squint too hard to match that timeline up with the results plotted above.[2]

Steroids explain the discrepancy between recent performance and the peak in the late 80’s, but what explains women’s records closing the gap on men’s and then falling back significantly below, post-’roids?  First of all, men put a 7.26kg shot, while women put a 4kg shot, so the absolute distances aren’t directly comparable.  But looking at the relative trend, it seems that the juicing the elite women were doing improved their performance more than it did the men’s (and the men were presumably on the same juicing programs).

So now, it’s hard to guess at shot-put performance trends.  Will men and/or women ever get back near the results from the ‘roid era?  Seems an open question.

Concretely, if you look closely at the graph, there’s a data point missing for the 2016 men’s Olympics — I write this early in the week of August 15th 2016, and the men’s final is later this week.  We can pretty confidently predict that the winning put won’t be anywhere near the world record.  But there’s a good chance that someone will cross the 22m barrier this year, legitimately!

UPDATE: the men’s results from the 2016 Olympics are in … and the 22m barrier was smashed!  Ryan Crouser (USA) threw 22.52m, breaking the ‘roid-era Olympic Record in the process.

I don’t know of any other sports that have been thrown as far out of whack by steroids as shot put — others I’m familiar with, like the 100m dash, have had cheaters, but the long-term progressions have been pretty stable — but admittedly I haven’t done any systematic analysis.  Do you know of any examples more egregious than this?


[1] Watch the excellent documentary 9.79* to learn about this era.

[2] I didn’t try to break down doping by country; pretty much all countries cheated.  It seems that the Soviet Union may have started systematic doping earlier, and evidence is that doping was more prevalent among Soviet women.  But the current men’s world record, for example, is held by an American who’s both an admitted and caught steroid user.


UPDATE: edited to note that men and women put shots of different mass.