Usage
DataLad Catalog can be used from the command line or with its Python API. You can access detailed usage information in the Command Line Reference or the Python Module Reference respectively.
The overall catalog generation process actually starts several steps before datalad-catalog becomes involved. Typical steps include:
curating data into datasets (a group of files in a hierarchical tree)
adding metadata to datasets and files (the process for this and the resulting metadata formats and content vary widely depending on domain, file types, data availability, and more)
extracting the metadata using an automated tool to output metadata items into a standardized and queryable set
translating the metadata into the catalog schema
These steps can follow any procedures and use any tools suited to the job. Once they are completed, the datalad-catalog tool can be used for catalog generation and customization.
Create a catalog
To create a new catalog, start by running datalad catalog-create
datalad catalog-create --catalog /tmp/my-cat
This will create a catalog with the following structure:
artwork: images that make the catalog pretty
assets: mainly the JavaScript and CSS code that underlie the user interface of the catalog
metadata: where metadata content for any datasets and files rendered by the catalog will be contained
schema: which contains JSON documents laying out the schema that the specific catalog complies with
templates: HTML template documents for rendering specific components
config.json: the configuration file with rules for rendering and updating the catalog
index.html: the main HTML content rendered in the browser
Note
The --config-file
argument allows the catalog to be created with a custom
configuration. If not specified, a default configuration is applied at the catalog
level.
Add metadata
To add metadata to an existing catalog, run datalad catalog-add
, specifying
the (location of the) metadata to be added. DataLad Catalog accepts metadata in the
form of:
a path to a file containing JSON lines
JSON lines from STDIN
a JSON serialized string
where each line or record is a single, correctly formatted, JSON object. The correct format for the added metadata is specified by the Catalog Schema.
datalad catalog-add --catalog /tmp/my-cat --metadata path/to/metadata.jsonl
The metadata
directory is now populated.
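As an illustration, a single line of such a JSON Lines file could look like the record below. The field names and values shown here are a hypothetical sketch; the authoritative set of required fields is defined by the Catalog Schema:

```json
{"type": "dataset", "dataset_id": "abcd", "dataset_version": "1234", "name": "My example dataset"}
```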
Note
The --config-file
argument allows the specific dataset in the catalog
to be created with a custom configuration. If not specified, the configuration
at the dataset level will be inferred from the catalog level.
Metadata validation
To check whether metadata is valid before adding it to a catalog, run datalad catalog-validate, which verifies that the metadata conforms to the Catalog Schema.
datalad catalog-validate --catalog /tmp/my-cat --metadata path/to/metadata.jsonl
The metadata will then be validated against the schema version of the supplied
catalog. If the --catalog
argument is not provided, validation happens against
the schema version contained in the installed datalad-catalog
package.
Note
The validator runs internally whenever datalad catalog-add
is called,
so there is no need to run validation explicitly unless desired.
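Independent of catalog-validate, a quick local syntax check can catch malformed lines in a metadata file before it is handed to DataLad Catalog. The sketch below (plain Python, no DataLad required) only verifies that every non-empty line is a well-formed JSON object; it does not check conformance to the Catalog Schema:

```python
import json

def check_jsonl(path):
    """Return the line numbers in a .jsonl file that are not valid JSON objects.

    A lightweight local sanity check before running `datalad catalog-validate`;
    it verifies JSON syntax only, not Catalog Schema conformance.
    """
    bad = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(record, dict):
                # each line must be a single JSON object, not e.g. an array
                bad.append(lineno)
    return bad
```

An empty return value means every line parsed as a JSON object; schema-level validation is still left to catalog-validate.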
Set catalog properties
Properties of the catalog can be set via the datalad catalog-set
command. For
example, setting a "main" dataset is necessary in order to indicate which dataset
will be shown on the catalog homepage. To set this homepage, run
datalad catalog-set home
, specifying the dataset_id
and dataset_version
:
datalad catalog-set --catalog /tmp/my-cat --dataset_id abcd --dataset_version 1234 home
Tip
It could be a good idea to populate the catalog with datasets that are all linked as subdatasets from the main dataset displayed on the home page, since this would allow users to navigate to all other datasets from the main page. This linkage is done implicitly if the catalog home page is a DataLad superdataset with nested subdatasets.
View the catalog
To serve the content of a catalog via a local HTTP server for viewing or
testing, run datalad catalog-serve
.
datalad catalog-serve --catalog /tmp/my-cat
Once the content is served, the catalog can be viewed by visiting the localhost URL.
Update
Catalog content can be updated using the add
or remove
commands. To add
content, simply re-run datalad catalog-add
, providing the path to the new
metadata.
datalad catalog-add --catalog /tmp/my-cat --metadata path/to/new/metadata.jsonl
If a dataset, or a version of a dataset, was added to the catalog incorrectly, datalad catalog-remove can be used to get rid of the incorrect addition.
datalad catalog-remove --catalog /tmp/my-cat --dataset_id abcd --dataset_version 1234 --reckless
Note
A standard catalog-remove
call without the --reckless
flag will provide
a warning and do nothing else, for safety. Remember to add the flag in order
to remove the metadata.
Configure
A useful feature of DataLad Catalog is the ability to configure certain properties of a catalog according to your preferences. This is done with the help of a config file (in either JSON or YAML format) and the -F/--config-file flag.
A config file can be passed during catalog creation in order to set the config
on the catalog level:
datalad catalog-create --catalog /tmp/my-custom-cat --config-file path/to/custom_config.json
A config file can also be passed when adding metadata in order to set the config on the dataset-level:
datalad catalog-add --catalog /tmp/my-custom-cat --metadata path/to/metadata.jsonl --config-file path/to/custom_dataset_config.json
In the latter case, the config will be set for all new dataset entries corresponding
to metadata source objects in the metadata provided to the catalog-add
operation.
If no config file is specified on the catalog level, a default config file is used.
The catalog-level config also serves as the default config on the dataset level,
which is used if no config file is specified via the catalog-add
command.
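For illustration, a minimal custom config file passed at catalog creation might look like the fragment below. The keys shown here are illustrative assumptions, not a complete or authoritative list; the actual supported options are laid out in the Catalog Configuration documentation:

```json
{
  "catalog_name": "My Custom Catalog",
  "link_color": "#2a6c8e"
}
```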
Note
For detailed information on how to structure and use config files, please refer to the dedicated documentation in Catalog Configuration.
Get catalog properties
Properties of the catalog can be retrieved via the datalad catalog-get
command. For
example, the specifics of the catalog home page can be retrieved as follows:
datalad catalog-get --catalog /tmp/my-cat home
Or the metadata of a specific dataset contained in the catalog can be retrieved as follows:
datalad catalog-get --catalog /tmp/my-cat --dataset_id abcd --dataset_version 1234 metadata
Translate
datalad-catalog
can translate a metadata item originating from a particular
source structure, and extracted using datalad-metalad
, into the catalog schema.
Before translation from a specific source can work, an extractor-specific translator
should be provided and exposed as an entry point (via a DataLad extension) as part of the
datalad.metadata.translators
group. Then, translate metadata as follows:
datalad catalog-translate --metadata path/to/extracted/metadata.jsonl
This command will output the translated objects as JSON lines to stdout
, which can
be saved to disk and later used, for example, for catalog entry generation.
Workflows
Several subprocesses need to be run in order to create a new catalog with multiple entries, or in order to update an existing catalog with new entries. These processes can include:
tracking datasets that are intended to be entries in a catalog as subdatasets of a DataLad super-dataset
extracting (and temporarily storing) metadata from the super- and subdatasets
translating extracted metadata (and temporarily storing it)
creating a catalog
adding translated metadata to the catalog
updating the catalog's superdataset (i.e. homepage) if the DataLad superdataset version changed
These steps can become cumbersome and even resource-intensive when run at scale. Therefore, in order to streamline these processes, to automate them as much as possible,
and to shift the effort away from the user, datalad-catalog
can run workflows for catalog
generation and updates. It builds on top of the following functionality:
DataLad datasets and nesting for maintaining a super-/subdataset hierarchy
datalad-metalad's metadata extraction functionality
datalad-catalog's metadata translation functionality
datalad-catalog for maintaining a catalog
workflow-new
To run a workflow from scratch on a dataset and all of its subdatasets:
datalad catalog-workflow --type new --catalog /tmp/my-cat --dataset path/to/superdataset --extractor metalad_core
This workflow will:
Clone the super-dataset and all its first-level subdatasets
Create the catalog if it does not yet exist
Run dataset-level metadata extraction on the super- and subdatasets
Translate all extracted metadata to the catalog schema
Add the translated metadata as entries to the catalog
Set the catalog's home page to the id and version of the DataLad super-dataset.
workflow-update
To run a workflow for updating an existing catalog after registering a new subdataset to the superdataset which the catalog represents:
datalad catalog-workflow --type update --catalog /tmp/my-cat --dataset path/to/superdataset --subdataset path/to/subdataset --extractor metalad_core
This workflow assumes:
The subdataset has already been added as a submodule to the parent dataset
The parent dataset already contains the subdataset commit
This workflow will:
Clone the super-dataset and new subdataset
Run dataset-level metadata extraction on the super-dataset and new subdataset
Translate all extracted metadata to the catalog schema
Add the translated metadata as entries to the catalog
Reset the catalog's home page to the latest id and version of the DataLad super-dataset.