DataLad extension for semantic metadata handling

This software is a DataLad extension that equips DataLad with an alternative command suite for metadata handling (extraction, aggregation, reporting). It is backward-compatible with the metadata storage format in DataLad proper, while being substantially more performant, especially on large dataset hierarchies. It also provides new metadata extractors, as well as improved variants of DataLad’s own extractors that are tuned for better performance and richer, JSON-LD compliant metadata reports.


High-level API commands

These commands provide an improved and extended equivalent to the metadata and aggregate_metadata commands (and the primitive extract-metadata plugin) that ship with the DataLad core package.

meta_extract(extractorname, path, dataset, …) Run a metadata extractor on a dataset or file.
meta_aggregate([dataset, path]) Aggregate metadata of one or more sub-datasets for later reporting.
meta_dump([dataset, path, recursive]) Dump a dataset’s aggregated dataset-level and file-level metadata.
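
As a rough illustration of how the commands listed above might be combined, the following sketch aggregates metadata and then dumps it. It assumes the commands are reachable as attributes of an `api` object (presumably `datalad.api` once this extension is installed) and uses only the keyword names shown in the signatures above; `aggregate_and_dump` is a hypothetical helper, and the exact arguments may differ between releases.

```python
def aggregate_and_dump(api, dataset, recursive=True):
    """Aggregate metadata for `dataset`, then return whatever meta_dump yields.

    `api` is any object exposing `meta_aggregate` and `meta_dump`;
    it is assumed (not verified here) to be `datalad.api`.
    """
    # Keyword names are taken from the signatures listed in this section.
    api.meta_aggregate(dataset=dataset)
    return api.meta_dump(dataset=dataset, recursive=recursive)
```

With DataLad and this extension installed, a call could look like `aggregate_and_dump(datalad.api, "path/to/dataset")`, where the dataset path is a placeholder.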

Metadata extractors

To use any of the contained extractors, their names need to be prefixed with metalad_, such that the runprov extractor is effectively named metalad_runprov.

core Metadata extractor for DataLad’s own core storage
annex Metadata extractor for Git-annex metadata
custom Metadata extractor for custom (JSON-LD) metadata contained in a dataset
runprov Metadata extractor for provenance information in DataLad’s run records
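
The naming rule can be sketched in a few lines of Python. The helper `qualified_extractor_name` is hypothetical and not part of the extension; it only illustrates how the short names listed above map to the prefixed names that have to be passed to the commands.

```python
# Short extractor names as listed in this section.
SHORT_NAMES = ("core", "annex", "custom", "runprov")
PREFIX = "metalad_"


def qualified_extractor_name(short_name: str) -> str:
    """Return the prefixed extractor name (illustrative helper only)."""
    if short_name not in SHORT_NAMES:
        raise ValueError(f"unknown extractor: {short_name!r}")
    return PREFIX + short_name


# Print the prefixed form of every listed extractor.
print([qualified_extractor_name(name) for name in SHORT_NAMES])
```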


DataLad development is being performed as part of a US-German collaboration in computational neuroscience (CRCNS) project “DataGit: converging catalogues, warehouses, and deployment logistics into a federated ‘data distribution’” (Halchenko/Hanke), co-funded by the US National Science Foundation (NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411). Additional support is provided by the German federal state of Saxony-Anhalt and the European Regional Development Fund (ERDF), Project: Center for Behavioral Brain Sciences, Imaging Platform.

DataLad is built atop the git-annex software that is being developed and maintained by Joey Hess.
