Making an English Wikipedia server

Revision as of 22:57, 4 December 2020 by Sam (Struck through obsolete MWDumper; the PHP import method is now the best way and will be covered at some point)

The Wikimedia Foundation publishes dumps of the English Wikipedia about once a month. As the content is freely licensed, you can use these dumps to make your own server with the English Wikipedia content in it. As the English Wikipedia is the largest Wikipedia, it does take a while to import the dumps, but it is by no means impossible. This guide will help you along the way.

Prerequisites

Here are a few things you need to know before you get started. All of them are covered in the instructions below.

What you need to know

You will need the following to get started with a dump:

  • Apache Web Server
  • PHP
  • MySQL/MariaDB
  • A dump of your Wikipedia of choice
  • MWDumper
  • The latest Java JRE and Java JDK from the Oracle Java website

Before you start

  • Remember that some of the Wikipedia dumps are huge. You will need a lot of disk space: around 200GB as a minimum and around 500GB to do this comfortably. Remember that Wikipedia is always growing.
  • Prepare MySQL/MariaDB for the large transactions coming up. One suggestion is to look in /usr/share/mysql/my-huge.cnf and consider using the values under the [mysqld] header, at least while you are importing the database. The most important value, which MUST be changed or the import will fail, is max_allowed_packet = 1M. Change it to max_allowed_packet = 128M. Some of Wikipedia's articles are very large, and MySQL/MariaDB rejects any record bigger than max_allowed_packet, so with the value left at 1M those records are refused. 128M is more than enough during the import and can safely be changed back afterwards.
  • The tables must be cleared as per the instructions below before attempting the import, or it will fail.
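
The two checks above (free disk space and max_allowed_packet) can be scripted before starting. The following is only a minimal sketch in Python, not part of the original procedure. It assumes the server-side setting lives under the [mysqld] section of a my.cnf-style file; the function names, the sample configuration text and the "/" path are all hypothetical placeholders to adapt to your own system.

```python
import re
import shutil

MIN_FREE_GB = 200          # minimum suggested in this guide
COMFORTABLE_FREE_GB = 500  # comfortable figure suggested in this guide
REQUIRED_PACKET_BYTES = 128 * 1024 ** 2  # 128M, the import value suggested in this guide

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a my.cnf size such as '1M' or '134217728' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def packet_size_in_section(cnf_text, section="mysqld"):
    """Return max_allowed_packet (in bytes) set in [section], or None if absent."""
    current = None
    found = None
    for raw in cnf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop '#' comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current == section and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key.replace("-", "_") == "max_allowed_packet":
                found = parse_size(value)  # a later setting overrides an earlier one
    return found


def free_space_gb(path="/"):
    """Free space, in GB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024 ** 3


if __name__ == "__main__":
    # "/" is a placeholder: point this at the filesystem that will hold the database.
    free = free_space_gb("/")
    print(f"free space: {free:.0f} GB "
          f"(guide suggests {MIN_FREE_GB} minimum, {COMFORTABLE_FREE_GB} comfortable)")

    # Hypothetical configuration text; in practice read your own my.cnf instead.
    sample_cnf = "[mysqld]\nmax_allowed_packet = 1M\n"
    size = packet_size_in_section(sample_cnf)
    print("max_allowed_packet large enough:",
          size is not None and size >= REQUIRED_PACKET_BYTES)
```

The parser is deliberately simple: it ignores include directives and only understands the K, M and G suffixes, so treat its answer as a hint and confirm the live value on the server itself.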

Importing the dump

Downloading the dumps

  1. The dumps for English Wikipedia are available from here. Select the latest date.
  2. From that date's listing, download the following:
    • enwiki-<date>-pages-articles.xml.bz2 (This is the latest revision of every Wikipedia page, article and template - the basis you need to get going)
    • enwiki-<date>-redirect.sql.gz (This will make redirects function correctly)
    • enwiki-<date>-templatelinks.sql.gz (This will make the template links function correctly)
    • enwiki-<date>-site_stats.sql.gz (This will fill in article counts and the like, so MediaWiki does not have to count them itself)
  3. Put all the files in a dedicated folder so they are all available in one place for later.
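
Before moving on, it can help to confirm that all four files for the same dump date really are in the folder. The sketch below is one possible check, written in Python as an illustration only. The function names are hypothetical, the date is assumed to be written as YYYYMMDD, and files are matched by name prefix so the compression extension does not matter.

```python
from pathlib import Path

# The four dump files this guide asks for, without their compression extension.
REQUIRED_PARTS = (
    "pages-articles.xml",
    "redirect.sql",
    "templatelinks.sql",
    "site_stats.sql",
)


def missing_dump_files(present_names, date, wiki="enwiki"):
    """Return the expected file-name prefixes that no present file starts with."""
    missing = []
    for part in REQUIRED_PARTS:
        prefix = f"{wiki}-{date}-{part}"
        if not any(name.startswith(prefix) for name in present_names):
            missing.append(prefix)
    return missing


def names_in_folder(folder):
    """List the names of the regular files directly inside folder."""
    return [p.name for p in Path(folder).iterdir() if p.is_file()]


if __name__ == "__main__":
    # Hypothetical folder and date; replace both with your own.
    folder = Path("dumps")
    if folder.is_dir():
        for prefix in missing_dump_files(names_in_folder(folder), "20201201"):
            print("missing:", prefix)
    else:
        print("folder not found:", folder)
```

This only checks that the files exist; it does not verify that a download completed or that an archive is intact.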

Downloading MWDumper

  1. MWDumper is available from many places around the Internet, both as source code and as pre-built Java packages. Download a copy from Jenkins, where it is pre-built by MediaWiki. MWDumper 1.16 (26/06/2013) was the latest at the time of writing.
  2. You will need to remove any versions of OpenJDK already installed (remove libreoffice-calc-extensions and libreoffice-writer-extensions before OpenJDK so that it doesn't try to install another version of Java).
  3. You will then need to install the latest Oracle Java JRE and JDK packages (64-bit packages are safe and better for this, as the web plugin is not needed).
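
To see which Java the system will actually run after steps 2 and 3, the text printed by java -version can be inspected. The sketch below is an illustration in Python, not part of the original procedure. It assumes that OpenJDK builds mention "OpenJDK" in that text and that Oracle builds mention "Java(TM)"; the function names are hypothetical.

```python
import shutil
import subprocess


def classify_java(version_output):
    """Classify the text printed by `java -version` as openjdk, oracle or unknown."""
    text = version_output.lower()
    if "openjdk" in text:
        return "openjdk"
    if "java(tm)" in text:
        return "oracle"
    return "unknown"


def installed_java():
    """Classify the `java` found on PATH, or return None if there is none."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` normally writes to stderr, so both streams are captured.
    result = subprocess.run([java, "-version"], capture_output=True,
                            text=True, check=False)
    return classify_java(result.stderr + result.stdout)


if __name__ == "__main__":
    print("java on PATH:", installed_java())
```

If this still reports openjdk after the Oracle packages are installed, an OpenJDK package is probably still present and earlier on the PATH.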

Downloading and installing MediaWiki