Introduction

This page describes a way to export page data and file metadata from Hippo CMS 7 to MySQL, UDM (MongoDB) and some Excel reports. In addition to the robots, the attached zip file contains example XML, so anyone can run the connector as a complete demonstration.

Prerequisites

To use this connector, you will need:
  • Installed software:
    • Xill IDE 3.1 and the MySQL plugin
    • MySQL server and some program to view the contents of your databases
    • MongoDB and a program like RoboMongo
    • MS Excel or another program that can view .xlsx spreadsheets
  • To use the connector on your project's Hippo installation, you will need access to its console
  • Knowledge/experience:
    • Xill
    • XML and XPath
    • JSON

Unified Data Model (UDM) setup

Content types and custom decorators are defined in the robot /transform/ContentTypes.xill; the StandardDecorators robot is in the com folder. You can look up the details there.

Standard decorators

All content types use the standard decorator parent. That way, we are able to save the complete node hierarchy. However, parent.id is currently filled with the parent's Hippo ID, whereas the newest standard decorator specification requires it to be the parent's UDM ID from MongoDB.
All page types also use the standard decorators document and revision. NOTE: 'revision' has since been replaced by the new standard decorators 'created' and 'modified'.

Custom decorators

A hippo decorator has been defined to store the id, name and path of every item in the source. This id is also the value that children store in their parent.id field. You should be able to use the hippo decorator for any Hippo source system.
Most page types also have a custom decorator for fields that are specific to that type. These decorators have the same name as their corresponding content type: agenda, news, leadingPage, webPage, banner, externalLink, image. They are project-specific, so you will need to make your own if you want to export the content types from another Hippo installation.
Then there is the web decorator. It contains general fields that most web pages of the demo project (can) have. Most of these fields probably apply to your CMS as well. The web decorator contains the website name, introduction, related links and files, and also the page body. This body is stored in the content field. Hippo CMS builds the page bodies from different types of components called content blocks, which is why the content field is of the LIST type. Scroll down to the Solution chapter to see how this content field is filled.

Solution

This demo was originally made for an analysis project. The biggest shortcoming might be that the actual binaries (files like PDFs and images) are not downloaded by these robots, because constructing the correct frontend URLs proved to be quite difficult and the file metadata was already sufficient.
The extraction consists largely of these steps:
  1. Export XML from Hippo console to local export folder
  2. Split the XML into individual page/file XML files and store basic information and structure in MySQL (after this, some reports can already be made)
  3. Transform from the individual XML files, with some help from the MySQL tables, to the universal data model
To see the whole demonstration working: add the attached connector to your Xill IDE, set your project path correctly in Main.xill and run that robot. It should create a MySQL database, a MongoDB database and three Excel reports (in the Reports folder).

Export XML from Hippo console

The first step is to use the XML Export function in the Hippo console. This has one caveat: there is a 100MB limit per XML file, and Hippo cannot tell you how big the XML file will be before you download it. Downloading the whole content node at once is almost certainly impossible, but how deep must you go? Once you choose a node, the system starts building the XML file and, with a little luck, soon presents you with a 'save file' option. If it hits the limit, though, you will get an error page and will probably have to wait a bit before you can try again.

Export XML from Hippo

Since you probably cannot extract the assets, documents and gallery nodes at once, Xillio's export connector lets you run the Hippo exporter from any deeper level, as long as you structure your export files exactly the way the nodes are structured in Hippo. This means that if you extract the node 'ecer' (see picture), you have to save the XML file at the local path [project folder]/Export/assets/ecer.xml.

This way, the connector will be able to determine the correct parent-child relationships. This is also explained in the text file in the Export folder of the connector zip file.
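
The idea of mirroring the Hippo tree in the Export folder can be illustrated with a small Python sketch. It is not the connector's Xill code: the function names are hypothetical, and it assumes that the folders below Export correspond one-to-one to Hippo nodes and that the file name (without .xml) is the name of the exported node.

```python
from pathlib import PurePosixPath


def hippo_node_path(export_root: str, xml_file: str) -> str:
    """Derive the node path (relative to the exported tree) from the file location."""
    relative = PurePosixPath(xml_file).relative_to(PurePosixPath(export_root))
    # Drop the .xml suffix: the file name is assumed to be the node name.
    return "/" + "/".join(relative.with_suffix("").parts)


def hippo_parent_path(export_root: str, xml_file: str) -> str:
    """Derive the path of the parent node from the folder that holds the file."""
    relative = PurePosixPath(xml_file).relative_to(PurePosixPath(export_root))
    # The folders between the Export root and the file mirror the Hippo tree.
    return "/" + "/".join(relative.parent.parts)


if __name__ == "__main__":
    print(hippo_node_path("project/Export", "project/Export/assets/ecer.xml"))
    print(hippo_parent_path("project/Export", "project/Export/assets/ecer.xml"))
```

With a file saved in the wrong folder, the derived parent path would be wrong too, which is why the folder structure has to match Hippo exactly.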

Split XML

The robot export/ExtractPages.xill does this for pages (from the Export/documents folder), and export/ExtractBinaries.xill does it for files (images from Export/gallery and other files from Export/assets). The export XML is split at Hippo handles, and the functions savePage and saveBinary save each piece to the folder whose name is specified by the variable splitFilesFolderPath. All items can now be opened individually from the local file system, which is easier on both the working memory of Java software and the human mind. Opening an item whose id you know is now straightforward and fast, but you still need an easy way to see which item is where and how the items are related. That is why we also use a relational database.
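
As an illustration of splitting at handles, here is a minimal Python sketch (not the Xill code of the robots). It assumes the export is in the JCR system view format, where every node is an sv:node element and its type is stored in a jcr:primaryType property; the sample XML and the function names are made up for the example. Writing each piece to a file is left out.

```python
import xml.etree.ElementTree as ET

SV = "http://www.jcp.org/jcr/sv/1.0"  # JCR system view namespace (assumed export format)
NODE = f"{{{SV}}}node"
PROPERTY = f"{{{SV}}}property"
VALUE = f"{{{SV}}}value"
NAME = f"{{{SV}}}name"

SAMPLE = """<sv:node xmlns:sv="http://www.jcp.org/jcr/sv/1.0" sv:name="documents">
  <sv:property sv:name="jcr:primaryType" sv:type="Name"><sv:value>hippostd:folder</sv:value></sv:property>
  <sv:node sv:name="news-item">
    <sv:property sv:name="jcr:primaryType" sv:type="Name"><sv:value>hippo:handle</sv:value></sv:property>
  </sv:node>
</sv:node>"""


def primary_type(node):
    """Return the jcr:primaryType of an sv:node element, or None if it has none."""
    for prop in node.findall(PROPERTY):
        if prop.get(NAME) == "jcr:primaryType":
            return prop.findtext(VALUE)
    return None


def split_by_handles(node, parent_path=""):
    """Yield (path, xml_string) for every hippo:handle found below the given node."""
    path = f"{parent_path}/{node.get(NAME)}"
    if primary_type(node) == "hippo:handle":
        # A handle is one page or file: emit it whole and do not descend further.
        yield path, ET.tostring(node, encoding="unicode")
        return
    for child in node.findall(NODE):
        yield from split_by_handles(child, path)


if __name__ == "__main__":
    for path, xml in split_by_handles(ET.fromstring(SAMPLE)):
        print(path, len(xml))
```

Because the function is a generator, a large export is processed one handle at a time instead of collecting all pieces in memory first.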

Store overview in MySQL

In the same two robots as the previous step, we also store information about the original node structure, sites, the content types and more in a MySQL database. Two of the report robots get all their information from this database, so they don't need UDM. This relational database is also used as the basis for the transformation (which you could call the second automatic extraction phase) to the UDM.
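
The following Python sketch shows the kind of overview table and report query this step makes possible. It uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without a server; the table name, the column names and the sample rows are hypothetical and are not taken from the connector.

```python
import sqlite3


def build_overview(items):
    """Create an in-memory overview table and fill it with (id, name, path, type, parent) rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute(
        """CREATE TABLE items (
               hippo_id     TEXT PRIMARY KEY,
               name         TEXT NOT NULL,
               path         TEXT NOT NULL,
               content_type TEXT NOT NULL,
               parent_id    TEXT REFERENCES items(hippo_id)
           )"""
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)", items)
    conn.commit()
    return conn


def count_per_type(conn):
    """Report query: number of items per content type."""
    rows = conn.execute("SELECT content_type, COUNT(*) FROM items GROUP BY content_type")
    return dict(rows)


if __name__ == "__main__":
    sample = [
        ("id-1", "news", "/documents/news", "folder", None),
        ("id-2", "item-a", "/documents/news/item-a", "news", "id-1"),
        ("id-3", "item-b", "/documents/news/item-b", "news", "id-1"),
    ]
    print(count_per_type(build_overview(sample)))
```

The parent_id column is what keeps the original node structure queryable after the XML has been split into separate files.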

Transformation to UDM

This transformation is done with JSON templates for better re-usability. The responsible robot can be found in the transformation folder and is called XMLtoUDM.xill. The function udmTemplate() in particular is built to be re-used in later extractions from different CMSs. This is possible because it takes any XML node and a matching JSON template. The JSON template contains the decorator/field structure exactly as it should be entered in MongoDB, including instructions on where each field should get its content. Usually, this instruction is an XPath that udmTemplate() can execute on the XML, but it can also refer to a 'special' field whose value is passed to the function as a parameter. Finally, there is the sub field support that we need for Hippo's content blocks.
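
To make the template idea concrete, here is a minimal Python sketch of a template-driven fill. It is not the Xill implementation of udmTemplate(): the function name fill_template, the "special" instruction key, the simplified sample XML and the template contents are all assumptions for illustration. It also relies on the limited XPath subset of Python's ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: decorator -> field -> instruction.
TEMPLATE = {
    "hippo": {
        "id": {"special": "id"},      # value is handed in by the caller
        "name": {"xpath": "name"},    # value is read from the XML
    },
    "web": {
        "introduction": {"xpath": "fields/introduction"},
    },
}

# Simplified sample XML, not the real Hippo export format.
SAMPLE = "<page><name>home</name><fields><introduction>Welcome</introduction></fields></page>"


def fill_template(xml_node, template, specials=None):
    """Return a decorator/field structure filled according to the template."""
    specials = specials or {}
    result = {}
    for decorator, fields in template.items():
        result[decorator] = {}
        for field, instruction in fields.items():
            if "special" in instruction:
                # 'Special' fields do not come from the XML but from a parameter.
                result[decorator][field] = specials.get(instruction["special"])
            else:
                # Default case: run the XPath on the node and store the text.
                result[decorator][field] = xml_node.findtext(instruction["xpath"])
    return result


if __name__ == "__main__":
    print(fill_template(ET.fromstring(SAMPLE), TEMPLATE, {"id": "abc-123"}))
```

Since the function only knows about templates and XML nodes, supporting another source system would mainly mean writing a new template rather than new code.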

Web.content and content lists

The page body, which Hippo stores in content blocks of different types, is extracted to the content field of the web decorator. Since this is probably the most complex UDM transformation step, and since it can easily be modified for other Hippo installations and even completely different CMSs, it is given its own sub chapter here. The walkthrough below covers a small part of the JSON template that is included in the zip file in the folder bots/JSON templates.

Let's go through its lines, step by step.

  • 'content' is the field name inside the web decorator
    • 'xpath' means that this XPath should be executed (by udmTemplate()) to get the value of this field. By default, this value is stored in the field as is, unless there are other (sibling) instructions for further processing. In this case, the result could for instance be a list of three content block nodes.
    • '_contentList' means that the value of the previous step is an intermediate result, which should be broken down into sub fields. A list follows of all the possible blocks of sub fields. All subsequent XPaths will be executed on each result of the XPath above.
      • 'type' holds the value of the content element type, which is in this case "text"
      • '_condition' contains a mechanism to check whether a content element is of this specific type
        • 'xpath' contains the XPath to execute for the check
        • 'result' contains the string that should be compared with the result of the XPath. If they are the same, then the current content element is indeed of this ("text") type
      • 'fields' contains all sub fields. They are processed like normal, elementary fields
        • 'title' is one of the sub field names, and under that its XPath.
        • 'body' is the other sub field relevant for the "text" content block, again with the XPath to get the content.
      • The next content block type ("image") follows the same pattern. It has a different _condition result and different fields, but it works the same way. That is why you should be able to add your own content blocks too, even if the fields are very different.
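
The instructions described in the list above can be sketched in Python as follows. The template is a hypothetical reconstruction based only on that description, written as a Python dictionary; the XPaths, the sample XML, the "caption" sub field and the flat output shape are assumptions, and the real template in the zip file will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical 'content' instruction, following the keys described above.
CONTENT = {
    "xpath": "block",
    "_contentList": [
        {
            "type": "text",
            "_condition": {"xpath": "kind", "result": "text"},
            "fields": {"title": {"xpath": "title"}, "body": {"xpath": "body"}},
        },
        {
            "type": "image",
            "_condition": {"xpath": "kind", "result": "image"},
            "fields": {"caption": {"xpath": "caption"}},
        },
    ],
}

# Simplified sample XML, not the real Hippo export format.
SAMPLE = """<page>
  <block><kind>text</kind><title>Intro</title><body>Hello</body></block>
  <block><kind>image</kind><caption>Logo</caption></block>
</page>"""


def fill_content_list(xml_node, instruction):
    """Turn every content block matched by the XPath into a typed dictionary."""
    elements = []
    for block in xml_node.findall(instruction["xpath"]):
        for block_type in instruction["_contentList"]:
            condition = block_type["_condition"]
            # The block is of this type if the condition XPath yields the expected string.
            if block.findtext(condition["xpath"]) != condition["result"]:
                continue
            element = {"type": block_type["type"]}
            for name, sub_field in block_type["fields"].items():
                element[name] = block.findtext(sub_field["xpath"])
            elements.append(element)
            break  # a block matches at most one type
    return elements


if __name__ == "__main__":
    print(fill_content_list(ET.fromstring(SAMPLE), CONTENT))
```

Blocks that match none of the listed types are skipped in this sketch; whether the connector skips or reports them is not covered here.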
Please let us know if you were able to adapt these templates to your source system.

Build reports

This is optional, but likely helpful. Three examples are included, which you can inspect yourself.

Closing remarks

Hopefully, this article helps in setting up your own Hippo export. Feedback is appreciated.