Making an English Wikipedia server

Wikimedia makes dumps of the English Wikipedia about once a month. As it is free and open source content, you can use these dumps to make your own [https://www.mediawiki.org MediaWiki] server with the English Wikipedia content in it. As the English Wikipedia is the largest Wikipedia, it does take a while to import the dumps, but it is by no means impossible. This guide will help you along the way.

This guide is based entirely on the English Wikipedia, but should be quite relevant for other languages too.

== Prerequisites ==

Here is some information on a few things you need to know before you get started. It will all be covered in the instructions below.

=== What you need to know ===

You will need the following to get started (a package-installation sketch follows the list):

* Apache Web Server
* PHP
* MySQL/MariaDB
* A dump of your Wikipedia of choice
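
If you are starting from a bare server, here is a minimal sketch of installing these, assuming a Debian/Ubuntu system (the package names are an assumption for that family of distros - adjust for your own):

 # Debian/Ubuntu package names - adjust for your distribution
 sudo apt update
 sudo apt install apache2 mariadb-server php php-mysql php-xml php-mbstring php-intl php-apcu
 # Set a root password and tidy up the fresh MariaDB install
 sudo mysql_secure_installation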


=== Before you start ===

* Remember that some of the Wikipedia dumps are huge. You will need a lot of disk space (around 100GB as a minimum for the download, and around 500GB to do this comfortably once the database grows - remember that Wikipedia is always growing...)
* Prepare MySQL/MariaDB and PHP for the incoming large transactions. In particular, the MySQL/MariaDB setting <code>max_allowed_packet</code> defaults to 1M and '''must''' be raised, or articles larger than that will be rejected during the import. 128M is more than enough and can safely be turned back down afterwards; see the sketch after this list.
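
A minimal sketch of raising the packet limit, assuming a systemd service named <code>mariadb</code> and a config include directory of <code>/etc/my.cnf.d/</code> (both vary by distribution):

 # Config path is an assumption - use whichever directory your my.cnf includes
 printf '[mysqld]\nmax_allowed_packet = 128M\n' | sudo tee /etc/my.cnf.d/99-enwiki-import.cnf
 sudo systemctl restart mariadb
 # Confirm the new value is active (reported in bytes)
 mysql -u root -p -e "SHOW VARIABLES LIKE 'max_allowed_packet';"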


== Preparation ==


=== Downloading the dumps ===

# The dumps for English Wikipedia are available from [https://dumps.wikimedia.org/enwiki/ Wikimedia Dumps]. Once there, you'll want to select the latest date.
# You'll then need to download the following:
#* <code>enwiki-(date)-pages-articles-multistream.xml.bz2</code> (This is the latest revision of every Wikipedia page, article and template - the basics you need to get going)
# This is compressed with Bzip2 - if you have the space, extract it once downloaded to speed up importing (see the sketch after this list).
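
A sketch of the download and extraction, using a made-up dump date of <code>20240801</code> as a placeholder for whichever date you selected, and a dedicated working directory:

 mkdir -p ~/enwiki-dump && cd ~/enwiki-dump
 # 20240801 is a placeholder - substitute the dump date you picked
 wget https://dumps.wikimedia.org/enwiki/20240801/enwiki-20240801-pages-articles-multistream.xml.bz2
 # -k keeps the .bz2; the extracted XML is several times larger than the download
 bunzip2 -k enwiki-20240801-pages-articles-multistream.xml.bz2
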
=== Downloading and installing MediaWiki ===

* Download the latest version of MediaWiki from the [https://www.mediawiki.org MediaWiki] website. As we're using Linux, it's better to download the .tar.gz version.
* Extract the archive
* Clear out anything not needed:
** Timeless and MonoBook skins
** Text files in the root, install.sh, docker...
* Copy the folder to the webroot (see the sketch after this list for these steps)
* Place a file in the .../resources/assets folder if using a picture for the site logo/favicon
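
A sketch of the download, extraction and copy, assuming MediaWiki 1.41.1 and <code>/var/www/enwiki</code> as the webroot (the version, paths and web server user are placeholders - use the current release and your own setup):

 # Version, paths and the web server user are assumptions - adjust to your setup
 wget https://releases.wikimedia.org/mediawiki/1.41/mediawiki-1.41.1.tar.gz
 tar -xzf mediawiki-1.41.1.tar.gz
 # Optional tidy-up of skins you won't use
 rm -rf mediawiki-1.41.1/skins/Timeless mediawiki-1.41.1/skins/MonoBook
 # Copy to the webroot and hand ownership to the web server user (www-data on Debian/Ubuntu)
 sudo cp -r mediawiki-1.41.1 /var/www/enwiki
 sudo chown -R www-data:www-data /var/www/enwiki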


=== Preparing the database ===


* Log in to MariaDB as root
* Create the database:
CREATE DATABASE enwiki;
* Create a user for the database:
CREATE USER 'enwiki'@'localhost' IDENTIFIED BY 'database_password';
A password can be generated at [https://passwordsgenerator.net/ Password Generator Plus]. Make it as long as possible; it doesn't need to be remembered past this configuration.
* Grant privileges for the user on this database:
GRANT ALL PRIVILEGES ON enwiki.* TO 'enwiki'@'localhost' WITH GRANT OPTION;
* Exit MariaDB
* Restart the server
systemctl restart mariadb
 
This can be tweaked with different database and user names as required; the sketch below runs the same steps from the shell.
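
A non-interactive version of the same statements, assuming root connects via unix_socket authentication (the default on recent Debian/Ubuntu MariaDB packages - otherwise add <code>-u root -p</code>), with <code>database_password</code> standing in for your generated password:

 # Runs the same statements as above in one go, then restarts the server
 sudo mysql -e "CREATE DATABASE enwiki;"
 sudo mysql -e "CREATE USER 'enwiki'@'localhost' IDENTIFIED BY 'database_password';"
 sudo mysql -e "GRANT ALL PRIVILEGES ON enwiki.* TO 'enwiki'@'localhost' WITH GRANT OPTION;"
 sudo systemctl restart mariadb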
 
=== Moving the database to a different hard drive ===
 
Due to the sheer size of the database, you may choose to move the MariaDB data to a different drive. MariaDB stores each database in a separate folder by default, making this easy.
 
* Stop MariaDB
systemctl stop mariadb
* Navigate to <code>/var/lib/mysql</code>
* Move the <code>enwiki</code> folder (or whatever your database is named) to where you want the database to be stored
* Back in the <code>/var/lib/mysql</code> folder, create a symlink to where you moved the folder, using the same name for the symlink
* Chown the database folder in its new location to the mysql user:
chown -R mysql:root /path/to/folder/enwiki
* Restart MariaDB and check it starts with no errors (see the sketch after this list)
systemctl start mariadb
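
A sketch of the move end to end, assuming the database is called <code>enwiki</code> and using <code>/mnt/bigdisk</code> as a placeholder for wherever your larger drive is mounted:

 sudo systemctl stop mariadb
 # /mnt/bigdisk is a placeholder - use your own mount point
 sudo mv /var/lib/mysql/enwiki /mnt/bigdisk/enwiki
 sudo ln -s /mnt/bigdisk/enwiki /var/lib/mysql/enwiki
 sudo chown -R mysql:root /mnt/bigdisk/enwiki
 sudo systemctl start mariadb
 # Check the service came up cleanly
 systemctl status mariadb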
 
=== Install MediaWiki ===
 
* Navigate to where your instance is installed: for example, https://enwiki.freddythechick.net/. You will be greeted by the MediaWiki installer. (A command-line alternative is sketched below.)
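
If you'd rather not click through the web installer, MediaWiki also ships a command-line installer; a rough sketch, where the wiki name, admin account, password and server URL are all placeholder values:

 # All names, passwords and URLs here are placeholders
 cd /var/www/enwiki
 php maintenance/install.php --dbname=enwiki --dbuser=enwiki --dbpass=database_password \
   --server="https://enwiki.example.com" --scriptpath="" --lang=en \
   --pass=AdminPasswordHere "English Wikipedia Mirror" "Admin"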
 
== Importing the dump ==
