When directly dumping an entire database is not possible, a so-called crawler is needed. These robots are given a starting point and 'crawl around' the source objects until every bit of content is accessed. This article shows how to create a basic crawler for disk drives or network locations. The code includes resume functionality, which ensures that if the program terminates for whatever reason, it can pick up crawling where it left off. Of course, the general methodology of crawling used here can be applied to other content systems that store assets based on a parent-child relationship.

Introduction

A fileshare can be seen as a tree, where every subfolder is a branch and every file is a leaf (in this analogy, empty folders are leafless branches). Indexing all branches and leaves means keeping track of exactly where you are in the entire tree.
Two orders in which to visit all items are shown in the figures below. For an object in a folder tree that resides n levels deep, keeping track of progress in a depth-first search requires saving information about the n connections leading to the object. A breadth-first search, on the other hand, requires keeping a queue of all m(n) items at level n! A depth-first search is often the best choice, because most folder structures are wider than they are deep. Typical numbers for a production environment of 200 GB would be a depth of 10 with around 200,000 items at the widest level.
[Image: Breadth-first-tree.svg — order of visiting folders in breadth-first search]
[Image: Depth-first-tree.svg — order of visiting folders in depth-first search]
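To make the bookkeeping difference concrete, here is a small sketch of the two traversal orders on a toy tree (in Python rather than Xill, purely for illustration; the folder names are made up). The depth-first stack never holds more than roughly one root-to-leaf path, while the breadth-first queue has to hold entire levels at once:

```python
from collections import deque

# A toy folder tree as a dict: key = folder name, value = list of subfolders.
# (Hypothetical names; any parent-child structure works the same way.)
tree = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def depth_first(tree, start):
    """Visit folders depth-first: memory use grows with tree DEPTH,
    since the stack holds at most one path from root to current node
    plus pending siblings."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # push children in reverse so they are visited left-to-right
        stack.extend(reversed(tree[node]))
    return order

def breadth_first(tree, start):
    """Visit folders level by level: memory use grows with tree WIDTH,
    since the queue must hold whole levels at a time."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

print(depth_first(tree, "root"))    # ['root', 'a', 'a1', 'a2', 'b', 'b1']
print(breadth_first(tree, "root"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
```

On a structure that is 200,000 items wide but only 10 levels deep, the difference between these two memory footprints is exactly why the article picks depth-first.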

Implementing a depth-first crawler in Xill

The heart of the depth-first crawler is the crawlFolders function, shown below in a stripped-down form. Given a root folder, crawlFolders() operates as follows:

  1. Is the given folder path readable?
  2. Are there any subfolders within this folder?
  3. If so: for every subfolder, pass the subfolder to another instance of crawlFolders(), returning to step 1 (this happens for points 1-3 in the depth-first figure).
  4. If not, the bottom of the folder hierarchy is reached (this happens for point 4 in the figure):
     - Save information about the current folder.
     - Save information about all files in the current folder.
     - Processing this folder is finished, so go back up one level and resume the rest of step 3 (for example, return to point 3 and start processing point 5).
function crawlFolders (folder) {
     if (File.canRead(folder.path)) {
          foreach (subfolder in File.iterateFolders(folder.path, false)) {
               crawlFolders (subfolder);
          }
          processFolder (folder);
          foreach (file in File.iterateFiles(folder.path, false)) {
               processFile (file);
          }
     }
}

In this way, the crawler goes as deep as possible into all subfolders, storing properties of all files and folders it encounters and calling itself to enter even deeper-lying folders.

Implementing resume functionality

What if your robot crashes at some point, before completing its crawl? It would be a shame to let all that effort go to waste, right? The following two functions keep track of which folders have already been crawled. A global variable completionList maintains key/value pairs of all folders that have been completely crawled, together with their parents. Whenever a folder has been crawled, it is added to the list. Upon adding a folder to this list, all subfolders belonging to it (its children) are removed from the list to keep it at a manageable size. Remember that in our depth-first search, we go as deep as possible into the tree structure and then work our way back up; by the time a folder is marked complete, all of its children are guaranteed to have been processed, so their entries are redundant.

By saving the completionList object as a text file, the crawler robot can survive a crash. The function markBranchCompletion() should be called after returning from a crawlFolders() call. The function completedBefore() should be called to check whether a folder has already been crawled before processing it; if it returns true, the tested folder path has already been processed, as can be seen in the full robot code at the bottom of this page.

function markBranchCompletion (folder) {
     // record this folder as completely crawled, keyed by its path
     completionList[folder.path] = folder.parent;
     // iterate over a copy, since we remove entries while looping
     var iterable = Collection.duplicate(completionList);
     foreach (key, value in iterable) {
          // drop children of the completed folder; they are implied by it
          if (value == folder.path) {
               Collection.remove(completionList, key);
          }
     }
     // persist the list so a crashed crawl can resume
     File.write(System.toJSON(completionList, true), "resume.json");
}

function completedBefore (folderpath) {
     // empty list: nothing has been crawled yet
     if (completionList == {}) { return (false); }
     foreach (key, value in completionList) {
          if (key == folderpath) { return (true); }
     }
     return (false);
}


How to go from here?

At this point, the basic crawling functionality is in place. The requirements of your project will dictate what additional actions are performed upon encountering a file or a folder, and what (if any) further analysis is needed upon completing the crawl. Ideas for expansion include:

- Store results in a DB, such as Mongo using the UDM

- Store extended file information, such as mimetype or MD5 hash of a binary

- Crawl multiple roots with one robot

- Run analysis scripts on results

Robot code:  

use System, File, Collection, String;

// SET THIS TO FALSE IF YOU WANT TO RESTART CRAWLING FROM THE BEGINNING
var resume = true;

// global vars
var completionList = {}; 
if (File.exists("resume.json") && resume) {
     completionList = System.parseJSON(File.getText("resume.json"));
}

// fileshare root to crawl
var root = "C:\\Program Files";

// enter main loop
System.print("=== start crawling ===\nresume status: "::resume);
processFolders({"path": root, "parent" : "_ROOT_RECORD_"});
System.print("=== finished crawling ===");

///////////////////
///  functions  ///
///////////////////

function processFolders (folder) {
     if (!completedBefore(folder.path)) {
          if (File.canRead(folder.path)) {
               foreach (subfolder in File.iterateFolders(folder.path, false)) {
                    processFolders (subfolder);
               }
               map <processFiles> (File.iterateFiles(folder.path, false));
          }
          // do something with the info of this folder object
          System.print("FOLDER OBJECT FOUND: "::folder.path);
     }
     // mark this folder as completed
     if (resume) { markBranchCompletion(folder); }
}

function processFiles (document) {
     // do something with the info of this document object
     System.print("   DOCUMENT OBJECT FOUND: "::document, "debug");
}

function markBranchCompletion (folder) {
     completionList[folder.path] = folder.parent;
     var iterable = Collection.duplicate(completionList);
     foreach (key, value in iterable) {
          if (value == folder.path) {
               Collection.remove(completionList, key);
          }    
     }
     File.write(System.toJSON(completionList, true), "resume.json");
}

function completedBefore (folderpath) {
     if (completionList == {}) { return (false); }
     foreach (key, value in completionList) {
          if (key == folderpath) { return (true); }   
     }
     return (false);
}