Import and Export Tools

Import and export tools are included in the Cordra distribution. The export tool enables you to extract the digital objects from a given Cordra setup (single instance or a distributed system) as files into the environment from which the tool is run. The import tool enables you to ingest the output of an export tool process as digital objects into any Cordra setup. The rest of this section describes the specifics of these tools.

These tools do not use the Cordra API, but rather use the Cordra storage module to interact with the underlying storage system. As a result, they are able to copy objects in and out of Cordra in their entirety, without modifying their contents in any way.

The import and export tools can be found in the /WEB-INF/tools directory after unzipping cordra.war. Make sure the scripts are executable before running them; if necessary, change the file access permissions (e.g., by using a command like “chmod +x” on *nix systems).

Warning

Since these tools talk directly to storage, they bypass all of the usual validation checks that Cordra makes. For example, during import, objects that do not match schemas can be inserted. Likewise, type objects or user objects with duplicate names can be inserted. Doing so could result in unexpected and/or unwanted behaviour. Cordra should be shut down before you attempt an import or an export in order to curtail any parallel administrative activity.

An export produces files that represent Cordra objects, one file per object. Each file holds the metadata and schema-driven information of its Cordra object as JSON, together with a JSON map of any payloads, in which the payloads are encoded as base64 strings. If wholesale changes to digital objects are required, it is easy to edit the representative files while in the export format, and subsequently import them into Cordra. However, because payloads are encoded as base64 strings, editing payloads while in the export format is not straightforward.
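As a rough illustration of why payload edits are less convenient, the following Python sketch decodes a base64 payload, changes it, and encodes it again. The JSON fragment and its field names ("payloads", "notes.txt") are hypothetical and chosen only for the example; inspect a real exported file for the actual layout before adapting this.

```python
import base64
import json


def decode_payload(encoded: str) -> bytes:
    """Turn a base64 string from an exported file back into raw bytes."""
    return base64.b64decode(encoded)


def encode_payload(raw: bytes) -> str:
    """Encode raw bytes as a base64 string suitable for a JSON value."""
    return base64.b64encode(raw).decode("ascii")


# Hypothetical fragment of an exported object; real field names may differ.
exported = json.loads('{"payloads": {"notes.txt": "aGVsbG8gd29ybGQ="}}')

# Decode, edit the text, and store the re-encoded result in the same slot.
text = decode_payload(exported["payloads"]["notes.txt"]).decode("utf-8")
exported["payloads"]["notes.txt"] = encode_payload(text.upper().encode("utf-8"))
```

Binary payloads would need a format-aware edit between the decode and encode steps instead of the simple text change shown here.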

Once you have imported the objects, Cordra should be reindexed to function properly. How you reindex depends on the type of indexer you are using. In general, deleting the existing index and restarting Cordra will trigger a reindex. For example, if you are using the default index that comes with Cordra, you can simply remove the data/cordraIndex directory. For related details, see Reindexing.

The specific commands to import and export are described next.

Export

An example that exports digital objects from a Cordra setup backed by the local file system:

./export-tool -c path/to/Cordra/config.json -d path/to/Cordra/data/folder/ -o path/to/folder/of/Cordra/Objects/ --tree --number-of-threads 24 --progress

The -c option is required, as config.json includes the details of the storage being used. If the storage specified in config.json uses the filesystem, the -d option is also required. (It can be omitted for MongoDB and Amazon S3 storage, for example.)

Either the -o option (to export to files) or the -s option (to export to newline-delimited JSON on stdout) is required. If the -o option is provided, an additional --tree option may be provided to arrange the exported Cordra objects in a directory tree based on the hashes of their IDs.

The --number-of-threads option sets the number of threads used to export Cordra objects. The default is 1; export to a file system is generally I/O bound.

With the --progress option, the tool reports a running count of exported Cordra objects each time the export of an object completes.

An example that exports from a backend system such as MongoDB or Amazon S3 (with their coordinates in config.json):

./export-tool -c config.json -o objects
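The option rules for the export tool can be summarized in a small Python sketch that assembles an argument list. The function name and its parameters are hypothetical; only the flags themselves come from the description above. Treating -o and -s as mutually exclusive is an assumption made here, not something the tool is known to enforce.

```python
from typing import List, Optional


def build_export_args(
    config: str,
    data_dir: Optional[str] = None,
    output_dir: Optional[str] = None,
    to_stdout: bool = False,
    tree: bool = False,
    threads: Optional[int] = None,
    progress: bool = False,
) -> List[str]:
    """Assemble an export-tool argument list from the documented options."""
    # Assumption: exactly one of -o / -s is given.
    if (output_dir is not None) == to_stdout:
        raise ValueError("specify exactly one of output_dir (-o) or to_stdout (-s)")
    if tree and output_dir is None:
        raise ValueError("--tree only applies when exporting to files with -o")
    if threads is not None and threads < 1:
        raise ValueError("threads must be at least 1")

    args = ["./export-tool", "-c", config]
    if data_dir is not None:
        args += ["-d", data_dir]  # required when config.json uses filesystem storage
    if output_dir is not None:
        args += ["-o", output_dir]
    else:
        args.append("-s")
    if tree:
        args.append("--tree")
    if threads is not None:
        args += ["--number-of-threads", str(threads)]
    if progress:
        args.append("--progress")
    return args
```

The resulting list could be handed to subprocess.run from the directory that contains the tool; the caller is still responsible for supplying -d when the storage is the filesystem.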

Import

An example that imports digital objects into a Cordra setup backed by the local file system:

./import-tool -c path/to/Cordra/config.json -d path/to/Cordra/data/folder -i path/to/folder/of/Cordra/Objects/ --number-of-threads 32 --delete-design-first --delete-all-first

Either the -i option (to import from files in a directory) or the -s option (to import from newline-delimited JSON on stdin) is required. If -s is used, input must consist of JSON objects, one on each line, where each JSON object is free of newline characters and carriage return characters.
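The newline-delimited JSON constraint can be sketched in Python as follows. The function names are hypothetical and the sample object is made up; the sketch only shows how to emit one JSON object per line and how to sanity-check such input, not how the import tool itself parses it.

```python
import json
from typing import Iterable, Iterator


def to_ndjson_lines(objects: Iterable[dict]) -> Iterator[str]:
    """Yield one compact JSON document per object, with no raw line breaks."""
    for obj in objects:
        # json.dumps without indent writes a single line and escapes any
        # newline or carriage return inside string values as \n or \r.
        line = json.dumps(obj)
        if "\n" in line or "\r" in line:
            raise ValueError("serialized object contains a raw line break")
        yield line


def check_ndjson(text: str) -> int:
    """Validate newline-delimited JSON and return the number of objects."""
    count = 0
    for line in text.split("\n"):
        if not line:
            continue  # tolerate a trailing newline
        if "\r" in line:
            raise ValueError("carriage return found in line")
        if not isinstance(json.loads(line), dict):
            raise ValueError("each line must be a JSON object")
        count += 1
    return count
```

Joining the yielded lines with "\n" gives text in the shape described for the -s option.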

If the --delete-design-first option is used, the design object will be deleted before import.

If the --delete-all-first option is used, all objects except the design object will be deleted before import.

An example to import digital objects into a backend system such as MongoDB or Amazon S3 (with their coordinates in config.json):

./import-tool -c config.json -i objects

Warning

As explained above, a reindexing step is necessary after an import.

Hashed directory output

With large numbers of objects, you may exceed the maximum number of files that your file system allows in a single directory. In such a case you can use the -t or --tree option.

./export-tool -c cordra/data/config.json -d cordra/data/ -o objects -t

This will hash each object ID and break the resulting hash into segments that are used to create a directory tree. You do not need to tell the import tool whether it is importing from a hashed directory; it will work either way.
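The general idea of a hashed directory tree can be sketched in Python. This is an illustration only: the hash function (SHA-256), the number and width of segments, and the file naming are all assumptions made for the example and are not claimed to match the layout that the export tool actually produces.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(object_id: str, segments: int = 3, width: int = 2) -> PurePosixPath:
    """Map an object ID to a nested path built from segments of its hash."""
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    # The first few hex characters become nested directory names, so files
    # spread across many small directories instead of one large one.
    parts = [digest[i * width:(i + 1) * width] for i in range(segments)]
    # Illustrative choice: use the full digest as the file name.
    return PurePosixPath(*parts, digest)
```

Because the path is derived only from the ID, the same ID always maps to the same location, and a tool can locate or enumerate objects without any extra bookkeeping.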

Limit the objects that are exported by id

./export-tool -c cordra/data/config.json -d cordra/data/ -o objects -i 123/abc -i 123/xyz

Multiple -i arguments can be passed to the tool to specify which objects to export.

If you want to export a large number of specific objects, you can list their ids in a newline-separated file and use the -l option.

./export-tool -c cordra/data/config.json -d cordra/data/ -o objects -l ids.txt
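An ids file of this kind can be produced with a few lines of Python. The helper name is hypothetical, and the sketch assumes the file is plain UTF-8 text with one id per line and no other syntax.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_ids_file(ids: Iterable[str], path: Path) -> int:
    """Write one object id per line and return how many ids were written."""
    cleaned = []
    for object_id in ids:
        object_id = object_id.strip()
        if not object_id:
            continue  # skip blank entries
        if "\n" in object_id or "\r" in object_id:
            raise ValueError("an id must not contain line breaks")
        cleaned.append(object_id)
    path.write_text("".join(f"{object_id}\n" for object_id in cleaned), encoding="utf-8")
    return len(cleaned)


# Example run against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ids_path = Path(tmp) / "ids.txt"
    written = write_ids_file(["123/abc", "123/xyz", "  "], ids_path)
    contents = ids_path.read_text(encoding="utf-8")
```

The resulting file can then be passed to the export tool with the -l option.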

An additional tool called ids-by-query can be used to generate an ids file by running a query. Unlike export-tool and import-tool, it needs to access a running Cordra. This tool comes with the Cordra distribution and can be found in the bin directory.

./bin/ids-by-query -b http://localhost:8080 -u <username> -p <password> -o ids.txt -q <query>

Piping export to import

Instead of writing objects to files in a directory, the export tool can write objects as newline-delimited JSON to standard output using the -s option. This output can be piped to the import tool in a *nix environment.

./export-tool -c cordra/data/config.json -s | ./import-tool -c cordra2/data/config.json -s