Introduction

MediaWiki is a widely used collaborative web content management system that powers websites like Wikipedia. When automating content extraction from a MediaWiki website, there are several options. One of them (using the built-in export together with the Xill XML tools) is explained here, including examples that anyone with Xill IDE can run.

Prerequisites

In order to successfully run the MediaWiki connector scripts, the following requirements must be met:

  • You have an administrator account on the source website and have downloaded an export from [your website url]/wiki/Special:Export. Alternatively, you can try our example XML export.
  • The script code is opened in Xill IDE 3.0.
  • A MongoDB instance is running locally on the standard port (127.0.0.1:27017).
  • When using the scripts for your own MediaWiki source instead of our example, you will probably need to adjust the extraction method. The functions splitContent, splitFields and addOtherFields are the most relevant in this respect.

Unified Data Model Setup

The project this connector is derived from was one of the first in which Xill IDE 3 and UDM were used. At the time, the standardization of UDM decorators had only just begun, so here is a small disclaimer. Although there is a separation between special and common fields, the standard decorators that have since been defined are not used. Instead, the decorators have been named after the content types, and the fields after the source field names. Here is an example:

var decorators = {
        "Web" : {
            "description" : {
                "type" : "STRING",
                "required" : false
            },
            "files" : {
                "type" : "STRING",
                "required" : false
            },
            "id" : {
                "type" : "STRING",
                "required" : true
            },
            "title" : {
                "type" : "STRING",
                "required" : true
            },
            "contributor" : {
                "type" : "STRING",
                "required" : false
            },
            "timestamp" : {
                "type" : "DATE",
                "required" : false
            },
            "keywords" : {
                "type" : "STRING",
                "required" : false
            },
            "external-links" : {
                "type" : "STRING",
                "required" : false
            }
        },
        "Publication" : {
            "publisher" : {
                "type" : "STRING",
                "required" : false
            },
            "records" : {
                "type" : "STRING",
                "required" : false
            },
            "publication-date" : {
                "type" : "DATE",
                "required" : false
            },
            "author" : {
                "type" : "STRING",
                "required" : false
            }
        }
};

What you see here are two of the decorators as they are declared in Contenttypes.xill. As you can also read in the comments there, the Web decorator is used for common fields (fields that all or most web pages have), and the Publication decorator is used for all other fields of the source content type Publication. The field names are copied directly from those in the export XML, as you can see when you run DetermineContentTypes.xill with the example XML.
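
To make the grouping concrete, here is a rough sketch (in plain Python, with invented values) of how the extracted fields of one hypothetical Publication page end up bucketed per decorator; the actual document envelope that Xill writes to MongoDB may look different.

# Hypothetical example: the extracted fields of one Publication page, grouped
# per decorator as declared above. All values are invented for illustration.
page = {
    "contenttype": "Publication",
    "decorators": {
        "Web": {                                  # common fields most pages have
            "id": "1024",
            "title": "Annual report 2014",
            "contributor": "jdoe",
            "timestamp": "2015-03-02T10:15:00Z",
        },
        "Publication": {                          # fields specific to this content type
            "publisher": "Example Corp",
            "publication-date": "2015-02-27",
            "author": "J. Doe",
        },
    },
}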

Solution

1. Analyze content types

The robot DetermineContentTypes.xill makes a simple inventory of the fields that each content type has. It collects all distinct field names across every page of every content type, so that no filled-in field is missed and unused fields are discarded. It then prints the result to the console as a simple listing.
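
To illustrate the idea (the actual robot is written in Xill), the sketch below does roughly the same thing in Python: it walks the export XML and collects the distinct field names per content type. It assumes the content type is the name of the first template on a page and that fields appear as |name = value template parameters; the file name export.xml is a placeholder, and your wiki may encode content types differently.

import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def local(tag):
    # MediaWiki export files carry an XML namespace; keep only the local name.
    return tag.rsplit("}", 1)[-1]

def fields_per_contenttype(export_path):
    fields = defaultdict(set)
    for _, page in ET.iterparse(export_path):
        if local(page.tag) != "page":
            continue
        # Concatenate the revision text of this page.
        text = "".join(t.text or "" for t in page.iter() if local(t.tag) == "text")
        match = re.search(r"\{\{\s*([^|{}\n]+)", text)            # assumed: first template = content type
        if match:
            contenttype = match.group(1).strip()
            for name in re.findall(r"\|\s*([\w-]+)\s*=", text):   # assumed: |field = value parameters
                fields[contenttype].add(name)
        page.clear()
    return fields

# Print a simple listing per content type, similar to the robot's console output.
for contenttype, names in fields_per_contenttype("export.xml").items():
    print(contenttype + ": " + ", ".join(sorted(names)))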

2. Create content types in UDM

With the information from the previous step, it is possible to engineer the UDM setup. For this particular project, the choice was made to use all field names from the source export directly in the UDM. This led to the robot Contenttypes.xill, which contains the list of decorators and the function that stores the content types/decorators in MongoDB (which is necessary for Xill to be able to save the documents in this form).
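
As a rough illustration only, the following Python/pymongo sketch shows the kind of registration Contenttypes.xill performs. The database and collection names (udm, decorators, contenttypes) and the document layout are assumptions made for this example; the real UDM setup that Xill expects may store this differently.

from pymongo import MongoClient

# Two decorators, abbreviated from the declaration in Contenttypes.xill.
decorators = {
    "Web": {
        "id":    {"type": "STRING", "required": True},
        "title": {"type": "STRING", "required": True},
    },
    "Publication": {
        "author":           {"type": "STRING", "required": False},
        "publication-date": {"type": "DATE",   "required": False},
    },
}

# A content type simply lists the decorators it is composed of.
contenttypes = {"Publication": ["Web", "Publication"]}

client = MongoClient("mongodb://127.0.0.1:27017")     # local instance, standard port
db = client["udm"]                                    # assumed database name

for name, fields in decorators.items():
    db["decorators"].replace_one({"_id": name}, {"_id": name, "fields": fields}, upsert=True)
for name, used in contenttypes.items():
    db["contenttypes"].replace_one({"_id": name}, {"_id": name, "decorators": used}, upsert=True)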

3. Scrape pages

The robot ScrapePages.xill loops through all pages in the export XML, extracts all necessary fields and saves them in MongoDB.

Contenttypes.xill is reused here to create the database object that will be stored, because that script knows (in the function getDecorators) which field belongs in which decorator. This approach was useful for the project, but it is also what greatly limited the possible use of standard decorators. Also note that the regexes used to separate the fields might not work for other MediaWiki installations, because those can use different character patterns.
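
The sketch below shows the gist of that loop in Python: read each page from the export XML, split the wikitext into |name = value pairs, assign each field to a decorator, and store the result in MongoDB. The regexes, the WEB_FIELDS lookup and the collection name are simplified stand-ins for what getDecorators and the robot's own source-specific patterns do.

import re
import xml.etree.ElementTree as ET
from pymongo import MongoClient

WEB_FIELDS = {"id", "title", "contributor", "timestamp", "keywords",
              "description", "files", "external-links"}          # fields of the Web decorator

def local(tag):
    return tag.rsplit("}", 1)[-1]

def split_fields(wikitext):
    # Very rough field splitter: captures |name = value pairs up to the next pipe or brace.
    pairs = re.findall(r"\|\s*([\w-]+)\s*=\s*([^|}]*)", wikitext)
    return {name: value.strip() for name, value in pairs}

client = MongoClient("mongodb://127.0.0.1:27017")
pages = client["udm"]["documents"]                                # assumed collection name

for _, elem in ET.iterparse("export.xml"):
    if local(elem.tag) != "page":
        continue
    title = next((t.text or "" for t in elem.iter() if local(t.tag) == "title"), "")
    text = "".join(t.text or "" for t in elem.iter() if local(t.tag) == "text")
    document = {"title": title, "decorators": {"Web": {}, "Publication": {}}}
    for name, value in split_fields(text).items():
        bucket = "Web" if name in WEB_FIELDS else "Publication"
        document["decorators"][bucket][name] = value
    pages.insert_one(document)
    elem.clear()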

4. Scrape files

The robot ScrapeFiles.xill uses the Web package to open the file list on the source website and, for each file, open its page, download the file to local disk, and extract all necessary fields from the HTML. The file decorators are hardcoded in this robot; they are filled and also stored in MongoDB.
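
A rough Python sketch of that flow is shown below: fetch the file list, open every file page, download the binary and keep some metadata for MongoDB. The base URL, the regexes over the HTML and the collection name are assumptions for illustration; another wiki (or skin) will need different selectors, and the real robot drives this through the Xill Web package.

import os
import re
import requests
from pymongo import MongoClient

BASE = "http://example.org"                       # hypothetical wiki base URL
client = MongoClient("mongodb://127.0.0.1:27017")
files = client["udm"]["files"]                    # assumed collection name

listing = requests.get(BASE + "/wiki/Special:ListFiles").text
for file_page in set(re.findall(r'href="(/wiki/File:[^"]+)"', listing)):
    html = requests.get(BASE + file_page).text
    # Assumption: the full-size media link points into the wiki's /images/ folder.
    media = re.search(r'href="(/images/[^"]+)"', html)
    if not media:
        continue
    media_url = BASE + media.group(1)
    local_name = os.path.basename(media_url)
    with open(local_name, "wb") as fh:            # download to the working directory
        fh.write(requests.get(media_url).content)
    files.insert_one({"page": file_page, "url": media_url, "localPath": local_name})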

File names are unique, just like page titles, which helps when resolving all links afterwards, although there can also be (hidden) redirect pages that are tricky to handle.
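
As a small illustration of why unique titles help, the hypothetical helper below resolves internal links ([[Title]] or [[Title|label]]) against a dictionary of stored pages and follows one level of #REDIRECT pages; pages_by_title is an assumed structure, not something the connector provides.

import re

def resolve_links(wikitext, pages_by_title):
    # Map every internal link in the text to a stored page, following one redirect hop.
    resolved = {}
    for target in re.findall(r"\[\[([^\]|#]+)", wikitext):
        title = target.strip()
        page = pages_by_title.get(title)
        if page:
            # Redirect pages start their wikitext with "#REDIRECT [[Target]]".
            redirect = re.match(r"#REDIRECT\s*\[\[([^\]|]+)", page.get("text", ""), re.I)
            if redirect:
                page = pages_by_title.get(redirect.group(1).strip(), page)
        resolved[title] = page
    return resolved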

5. Transformation

Transformation steps will be necessary to prepare the content for the target system, but that is beyond the scope of this article.

Closing Remarks

Another MediaWiki website might use different field separators than assumed, but with the example source included in this connector you can at least see how it is supposed to work. You can compare that example with your own project and reuse whatever parts of the connector suit you. This connector might not fully live up to the standards that have since been established, but it should help you find your own approach to extracting content from a MediaWiki website.