DataLad has built-in, modular, and extensible support for metadata in various formats. Metadata is extracted from a dataset and its content by one or more extractors that have to be enabled in a dataset’s configuration. Extractors yield metadata in a JSON-LD-like structure that can be arbitrarily complex and deeply nested. Metadata from each extractor is kept unmodified, unmangled, and separate from metadata of other extractors. This design enables tailored applications using particular metadata that can use Datalad as a content-agnostic aggregation and transport layer without being limited or impacted by other metadata sources and schemas.
Extracted metadata is stored in a dataset in (compressed) files using a JSON stream format, separately for metadata describing a dataset as a whole, and metadata describing individual files in a dataset. This limits the amount of metadata that has to be obtained and processed for applications that do not require all available metadata.
DataLad provides a content-agnostic metadata aggregation mechanism that stores metadata of sub-datasets (with arbitrary nesting levels) in a superdataset, where it can then be queried without having the subdatasets locally present.
Lastly, DataLad comes with a search command that enable metadata queries via a flexible query language. However, alternative applications for metadata queries (e.g. graph-based queries) can be built on DataLad, by requesting a complete or partial dump of aggregated metadata available in a dataset.
Supported metadata formats¶
This following sections provide an overview of supported metadata formats.
This is a custom metadata format, inspired by the standard used for Debian
software packages that is particularly suited for manual entry. This format is
a good choice when metadata describing a dataset as a whole cannot be obtained
from some other structured format. The syntax is RFC 822-compliant. In other
words: this is a text-based format that uses the syntax of email headers.
Metadata must be placed in
DATASETROOT/.datalad/meta.rfc822 for this format.
Here is an example:
Name: myamazingdataset Version: 1.0.0-rc3 Description: Basic summary A text with arbitrary length and content that can span multiple . paragraphs (this is a new one) License: CC0 The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. . You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. Homepage: http://example.com Funding: Grandma's and Grandpa's support Issue-Tracker: https://github.com/datalad/datalad/issues Cite-As: Mike Author (2016). We made it. The breakthrough journal of unlikely events. 1, 23-453. DOI: 10.0000/nothere.48421
The following fields are supported:
- A description of the target audience of the dataset.
- A comma-delimited list of authors of the dataset, preferably in the format.
Firstname Lastname <Email Adress>
- Instructions on how to cite the dataset, or a structured citation.
- Description of the dataset as a whole. The first line should represent a compact short description with no more than 6-8 words.
- A digital object identifier for the dataset.
- Information on potential funding for the creation of the dataset and/or its content. This field can also be used to acknowledge non-monetary support.
- A URL to a project website for the dataset.
- A URL to an issue tracker where known problems are documented and/or new reports can be submitted.
- Can be used in addition and analog to
Author, when authors (creators of the data) need to be distinguished from maintainers of the dataset.
- A short name for the dataset. It may be beneficial to avoid special characters, umlauts, spaces, etc. to enable widespread use of this name for URL, catalog keys, etc. in unmodified form.
- A version for the dataset. This should be in a format that is alphanumerically sortable and lead to a “greater” version for an update of a dataset.
Brain Imaging Data Structure (BIDS)¶
DataLad has basic support for extraction of metadata from the BIDS