Metadata and datalad-catalog

The catalog is rendered from structured metadata generated by datalad-catalog. This section describes the nature of metadata (in general and in relation to the catalog) and the states that metadata generally pass through before ending up as part of a catalog.

What is metadata?

Metadata describe the files in your dataset, as well as its overall content. Implicit metadata include basic descriptions of the data itself (such as the names, types, sizes, and relative locations of all files in your dataset), while explicit metadata items (such as a description of your dataset, its contributors and project specifications) can be added to your dataset as you see fit. MetaLad provides functionality for adding metadata items of arbitrary size, format, and amount, and does not impose restrictions in this regard.

Many standards exist for specifying and structuring metadata. Some examples include:

  • DataCite: The DataCite Metadata Schema is a list of core metadata properties chosen for an accurate and consistent identification of a resource for citation and retrieval purposes, along with recommended use instructions.

  • XMP: The Extensible Metadata Platform is an ISO standard for the creation, processing and interchange of standardized and custom metadata for digital documents and data sets. It also provides guidelines for embedding XMP information into popular image, video and document file formats, such as JPEG and PDF.

  • Frictionless Data: The Frictionless Data Package is a container format for describing a coherent collection of data in a single 'package', providing the basis for convenient delivery, installation and management of datasets.

Standardized file formats may also contain format-specific information (such as bit rate and duration for audio files, or resolution and color mode for image files), while domain-standard files (such as Digital Imaging and Communications in Medicine, i.e. DICOM) also supply embedded or sidecar metadata.

Note

In order to create a user-friendly catalog, DataLad Catalog should receive structured metadata adhering to a specified Catalog Schema as input. This means that structured metadata first has to be sourced and then translated into the schema.

Metadata handling with MetaLad

Since datalad-catalog provides its own schema in a standard vocabulary, any metadata that conforms to this schema can be submitted to the tool in order to generate a catalog and its entries. Metadata items do not necessarily have to be derived from DataLad datasets, and the metadata extraction does not have to be conducted via datalad-metalad. However, datalad-metalad provides highly applicable functionality that simplifies the process of metadata handling for the purpose of generating structured outputs that could be used for catalog generation.

datalad-metalad has functionality to:

  • add metadata of an arbitrary format to a DataLad dataset

  • dump metadata that was previously added to a DataLad dataset

  • extract metadata from files or datasets using format-specific extractors

as well as to run batch jobs with these and other methods. Find out more about MetaLad and its capabilities in the dedicated DataLad Handbook Chapter.

The benefit of using datalad-metalad in a catalog-generation workflow comes from its extraction interface and custom extractors. An extractor is a component that understands a specific schema (or data structure) and can extract information from a file or dataset that adheres to that schema. For example, the metalad_core extractor that ships with datalad-metalad can extract implicit metadata, such as author information, dataset identifier/version, bytesize (for files), and more, from a DataLad dataset and its files. The metalad_studyminimeta extractor extracts information from a DataLad dataset containing a .studyminimeta.yaml file in its root directory. MetaLad ships with a variety of dataset- and file-level extractors, and so do a number of DataLad extensions, including datalad-catalog. If an extractor for a specific metadata format is not available, a custom extractor can be created and provided via a DataLad extension. If this sounds like something you need, please refer to the documentation on writing your own extractor.

An extractor will output its metadata, which has a structure specified by the dedicated extractor, inside a wrapper object provided by datalad-metalad. This means the top-level structure of all metadata extracted by datalad-metalad will be the same, while that of the property containing the actual extracted metadata will differ based on the extractor.
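The shared-wrapper idea can be sketched in a few lines of Python. This is an illustration only: the wrap() helper, the key names, the placeholder identifiers, and both payloads are assumptions made up for the example and should not be read as the authoritative MetaLad record format.

```python
# Sketch only: key names and values below are illustrative assumptions,
# not the authoritative MetaLad record format.

def wrap(extractor_name, payload):
    """Return a record with a fixed top-level layout around `payload`."""
    return {
        "type": "dataset",
        "dataset_id": "00000000-0000-0000-0000-000000000000",  # placeholder
        "dataset_version": "0" * 40,  # placeholder
        "extractor_name": extractor_name,
        "extracted_metadata": payload,  # extractor-specific structure
    }


# Two hypothetical payloads with different internal structures.
core_record = wrap(
    "metalad_core",
    {"@graph": [{"@type": "Dataset", "name": "example"}]},
)
study_record = wrap(
    "metalad_studyminimeta",
    {"study": {"name": "Example study"}},
)

# The wrapper keys are identical; only the payload layout differs.
same_wrapper = set(core_record) == set(study_record)
same_payload_shape = (
    set(core_record["extracted_metadata"])
    == set(study_record["extracted_metadata"])
)
print(same_wrapper, same_payload_shape)
```

A consumer can therefore read the top-level keys of any record in the same way, and only needs extractor-specific knowledge once it looks inside the payload property.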

Metadata translation

As mentioned above, datalad-catalog provides its own schema in a standard vocabulary, and incoming metadata need to validate successfully against this schema. Since extracted metadata will typically be structured according to whatever schema was specified by the extractor, the information in such a schema first has to be translated to the catalog schema before catalog entry generation can continue.

datalad-catalog provides a catalog-translate command through which custom translators can be created and used to translate MetaLad-extracted metadata into the catalog schema. The catalog ships with several translators (including ones for metalad_core and metalad_studyminimeta) and provides a base class that makes it straightforward to implement custom translators. Before translation from a specific source will work, the extractor-specific translator should be provided and exposed as an entry point (via a DataLad extension) as part of the datalad.metadata.translators group.

datalad-catalog will then be able to find the correct translator automatically, based on unique properties in a MetaLad-extracted metadata object. This is done by applying the matching criteria specified by each translator and running that translator's translate() method if the match is successful.
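The match-then-translate pattern can be sketched as follows. The class names, the match() method, the translate_record() helper, and the record layout are hypothetical stand-ins written for this example; they are not the base class or lookup mechanism that datalad-catalog actually provides, and the entry-point discovery step is replaced by a plain list.

```python
# Sketch only: these classes are hypothetical and do not use datalad-catalog.

class Translator:
    """Hypothetical base: subclasses name the extractor they can handle."""

    extractor_name = None

    def match(self, record):
        # Matching criterion assumed here: the extractor name in the record.
        return record.get("extractor_name") == self.extractor_name

    def translate(self, record):
        raise NotImplementedError


class CoreTranslator(Translator):
    extractor_name = "metalad_core"

    def translate(self, record):
        payload = record["extracted_metadata"]
        # Map the extractor-specific payload to a flat, catalog-like entry.
        return {
            "type": record["type"],
            "dataset_id": record["dataset_id"],
            "name": payload["name"],
        }


def translate_record(record, translators):
    """Run the first translator whose match() accepts the record."""
    for translator in translators:
        if translator.match(record):
            return translator.translate(record)
    raise LookupError(f"no translator for {record.get('extractor_name')!r}")


record = {
    "type": "dataset",
    "dataset_id": "placeholder-id",
    "extractor_name": "metalad_core",
    "extracted_metadata": {"name": "example"},
}
entry = translate_record(record, [CoreTranslator()])
print(entry)
```

Keeping the matching criteria on the translator itself means that supporting a new extractor only requires registering one more translator, without changing the lookup code.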

The Catalog Schema

Finally, the result of the metadata extraction and translation workflow will be metadata that conforms to the catalog's own schema, which uses the vocabulary defined by JSON Schema (specifically draft 2020-12). Find out more about the Catalog Schema.
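To give a feel for what conforming to a schema means in practice, here is a minimal sketch in Python. The schema fragment, its property names, and the check() helper are assumptions made up for this example; they do not reproduce the Catalog Schema, and the helper only understands the "required" keyword and string-typed properties. Real validation would rely on a complete JSON Schema implementation.

```python
# Sketch only: a hand-rolled check for two JSON Schema keywords,
# applied to a hypothetical schema fragment (not the Catalog Schema).

SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["type", "dataset_id", "name"],
    "properties": {
        "type": {"type": "string"},
        "dataset_id": {"type": "string"},
        "name": {"type": "string"},
    },
}


def check(record, schema):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        is_string_rule = rule.get("type") == "string"
        if key in record and is_string_rule and not isinstance(record[key], str):
            problems.append(f"property {key} must be a string")
    return problems


good = {"type": "dataset", "dataset_id": "placeholder-id", "name": "example"}
bad = {"type": "dataset", "dataset_id": 42}

print(check(good, SCHEMA))
print(check(bad, SCHEMA))
```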