DataLad has built-in, modular, and extensible support for metadata in various formats. Metadata is extracted from a dataset and its content by one or more extractors that have to be enabled in a dataset’s configuration. Extractors yield metadata in a JSON-LD-like structure that can be arbitrarily complex and deeply nested. Metadata from each extractor is kept unmodified, unmangled, and separate from metadata of other extractors. This design enables tailored applications using particular metadata that can use DataLad as a content-agnostic aggregation and transport layer without being limited or impacted by other metadata sources and schemas.
Extracted metadata is stored in a dataset in (compressed) files using a JSON stream format, separately for metadata describing a dataset as a whole, and metadata describing individual files in a dataset. This limits the amount of metadata that has to be obtained and processed for applications that do not require all available metadata.
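The stream format itself is simple to work with: one JSON object per line, each parseable independently. A minimal sketch in Python (the record fields shown are illustrative placeholders, not DataLad's actual metadata schema):

```python
import json

# A JSON stream: one JSON object per line. The fields below are
# illustrative placeholders, not DataLad's actual metadata schema.
stream = "\n".join([
    json.dumps({"path": "sub-01/anat.nii.gz", "type": "file"}),
    json.dumps({"path": "sub-02/anat.nii.gz", "type": "file"}),
])

# Each line can be parsed on its own, so an application can process
# file-level metadata incrementally instead of loading one huge document.
records = [json.loads(line) for line in stream.splitlines() if line]
print([r["path"] for r in records])
```

Because each line is self-contained, applications that only need dataset-level metadata never have to touch the (potentially much larger) file-level stream.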
DataLad provides a content-agnostic metadata aggregation mechanism that stores metadata of sub-datasets (with arbitrary nesting levels) in a superdataset, where it can then be queried without having the subdatasets locally present.
Lastly, DataLad comes with a search command that enables metadata queries via a flexible query language. However, alternative applications for metadata queries (e.g. graph-based queries) can be built on DataLad, by requesting a complete or partial dump of aggregated metadata available in a dataset.
Supported metadata sources¶
The following sections provide an overview of the included metadata extractors
for particular types of data structures and file formats.
Only the annex and datalad_core
extractors are enabled by default. Any additional metadata extractor has to be
enabled by adding its name to the datalad.metadata.nativetype configuration
variable, either via the git config command or by editing .datalad/config
directly. For example,
git config -f .datalad/config --add datalad.metadata.nativetype audio
would add the audio metadata extractor to the list.
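After enabling extractors this way, the corresponding section of .datalad/config would contain entries along these lines (a sketch; audio and exif are merely example values):

```ini
[datalad "metadata"]
	nativetype = audio
	nativetype = exif
```

Note that nativetype is a multi-valued key: each enabled extractor gets its own entry.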
Annex metadata (annex)¶
Content tracked by git-annex can have associated metadata records. From DataLad’s perspective, git-annex metadata is just another source of metadata that can be extracted and aggregated.
You can use the git-annex metadata command to assign git-annex metadata. If you have a table of records that contain data sources and metadata, you can use datalad addurls to quickly populate a dataset with files and associated git-annex metadata. (///labs/openneurolab/metasearch is an example of such a dataset.)
Pros of git-annex level metadata¶
- Many git-annex commands, such as git-annex get and git-annex copy, can use metadata to decide which files (keys) to operate on, making it possible to automate file (re)distribution based on their metadata annotation
- Assigned metadata is available for use by git-annex right away without requiring any additional “aggregation” step
- git-annex view can be used to quickly generate completely new layouts of the repository solely based on the metadata fields associated with the files
Cons of git-annex level metadata¶
- Metadata fields are actually stored per git-annex key rather than per file. If multiple files contain the same content, metadata will be shared among them.
- Files whose content is tracked directly by git cannot have git-annex metadata assigned.
- No per repository/directory metadata, and no mechanism to use/aggregate metadata from sub-datasets
- Field names cannot contain some symbols, such as ‘:’
- Metadata is stored within the git-annex branch, so it is distributed across all clones of the dataset, making it hard to scale for large metadata sizes or to work with sensitive metadata (not intended to be redistributed)
- It is a generic storage with no prescribed vocabulary, making it very flexible but also requiring consistency and harmonization to make the stored metadata useful for search
Example uses of git-annex metadata¶
Annotating files for different purposes¶
The FreeSurfer project uses git-annex to manage its source code and data within a single git/git-annex repository. Files necessary for different scenarios (deployment, testing) are annotated and can be fetched selectively for the scenario at hand.
Automating “non-distribution” of sensitive files¶
In the ReproIn framework for automated conversion of BIDS datasets and in some manually prepared datasets (such as ///labs/gobbini/famface/data and ///labs/haxby/raiders), we annotated materials that must not be publicly shared with a git-annex metadata field distribution-restrictions. We used the following set of values to describe why a particular file (content) should not be redistributed:
- sensitive - files which potentially contain participant sensitive information, such as non-defaced anatomicals
- proprietary - files which contain proprietary data, which we have no permissions to share (e.g., movie video files)
Having annotated files this way, we could instruct git-annex to publish all but those restricted files to our server: git annex wanted datalad-public "not metadata=distribution-restrictions=*".
Flexible directory layout¶
If you are maintaining a collection of music files or PDFs for the lab, you may want to display the files in an alternative or filtered hierarchy. git-annex view could be of help. Example:
datalad install ///labs/openneurolab/metasearch
cd metasearch
git annex view sex=* handedness=ambidextrous
would give you two directories (Male, Female) with only the files belonging to ambidextrous subjects.
Various audio file formats (audio)¶
This extractor uses the mutagen package to extract essential metadata from a range of audio file formats. For the most common metadata properties, a constrained vocabulary based on the Music Ontology is employed.
datacite.org compliant datasets (datacite)¶
This extractor can handle dataset-level metadata following the datacite.org specification. No constrained vocabulary is identified at the moment.
DataLad’s internal metadata storage (datalad_core)¶
This extractor can express DataLad’s internal metadata representation, such as the relationship of a super- and a subdataset. It uses DataLad’s own constrained vocabulary.
RFC822-compliant metadata (datalad_rfc822)¶
This is a custom metadata format, inspired by the standard used for Debian
software packages, which is particularly suited for manual entry. This format is
a good choice when metadata describing a dataset as a whole cannot be obtained
from some other structured format. The syntax is RFC 822-compliant. In other
words: this is a text-based format that uses the syntax of email headers.
Metadata must be placed in
DATASETROOT/.datalad/meta.rfc822 for this format.
Here is an example:
Name: myamazingdataset
Version: 1.0.0-rc3
Description: Basic summary
 A text with arbitrary length and content that can span multiple
 .
 paragraphs (this is a new one)
License: CC0
 The person who associated a work with this deed has dedicated the work to the
 public domain by waiving all of his or her rights to the work worldwide under
 copyright law, including all related and neighboring rights, to the extent
 allowed by law.
 .
 You can copy, modify, distribute and perform the work, even for commercial
 purposes, all without asking permission.
Homepage: http://example.com
Funding: Grandma's and Grandpa's support
Issue-Tracker: https://github.com/datalad/datalad/issues
Cite-As: Mike Author (2016). We made it. The breakthrough journal of unlikely
 events. 1, 23-453.
DOI: 10.0000/nothere.48421
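Because the syntax is RFC 822-compliant, such a file can be read with standard tooling. A minimal sketch using Python's stdlib email parser (the field values below are illustrative, not taken from a real dataset):

```python
import email

# An RFC 822 header block like the example above; continuation lines
# start with whitespace, and a lone "." separates paragraphs.
raw = (
    "Name: myamazingdataset\n"
    "Version: 1.0.0-rc3\n"
    "Description: Basic summary\n"
    " A longer text that continues the field value.\n"
    "License: CC0\n"
)

msg = email.message_from_string(raw)
print(msg["Name"], msg["Version"], msg["License"])
```

Field values with continuation lines (such as Description above) are returned with the folding preserved, so an application can re-flow paragraphs as needed.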
The following fields are supported:
- Audience: A description of the target audience of the dataset.
- Author: A comma-delimited list of authors of the dataset, preferably in the format: Firstname Lastname <Email Address>
- Cite-as: Instructions on how to cite the dataset, or a structured citation.
- Description: Description of the dataset as a whole. The first line should represent a compact short description with no more than 6-8 words.
- DOI: A digital object identifier for the dataset.
- Funding: Information on potential funding for the creation of the dataset and/or its content. This field can also be used to acknowledge non-monetary support.
- Homepage: A URL to a project website for the dataset.
- Issue-tracker: A URL to an issue tracker where known problems are documented and/or new reports can be submitted.
- Maintainer: Can be used in addition and analog to Author, when authors (creators of the data) need to be distinguished from maintainers of the dataset.
- Name: A short name for the dataset. It may be beneficial to avoid special characters, umlauts, spaces, etc. to enable widespread use of this name for URLs, catalog keys, etc. in unmodified form.
- Version: A version for the dataset. This should be in a format that is alphanumerically sortable and lead to a “greater” version for an update of a dataset.
Metadata keys used by this extractor are defined in DataLad’s own constrained vocabulary.
Frictionless data packages (frictionless_datapackage)¶
DataLad has basic support for extraction of essential dataset-level metadata
from frictionless data packages, i.e. from a datapackage.json
file. Metadata keys are constrained to DataLad’s own vocabulary.
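For reference, a minimal datapackage.json might look like the following (a sketch with placeholder values, showing only the kind of essential descriptive fields such a descriptor carries):

```json
{
  "name": "my-data-package",
  "title": "An example data package",
  "description": "Placeholder description of the dataset.",
  "licenses": [{"name": "CC0-1.0"}]
}
```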
Exchangeable Image File Format (exif)¶
The extractor yields EXIF metadata from any compatible file. It uses the W3C EXIF vocabulary (http://www.w3.org/2003/12/exif/ns/).
Various image/photo formats (image)¶
Standard image metadata is extracted using the Pillow package. Core metadata is available using an ad-hoc vocabulary defined by the extractor.
Extensible Metadata Platform (xmp)¶
This extractor yields any XMP-compliant metadata from any supported file (e.g. PDFs, photos). XMP metadata uses fully qualified terms from standard vocabularies that are simply passed through by the extractor. At the moment metadata extraction from side-car files is not supported, but would be easy to add.
Metadata aggregation and query¶
Metadata aggregation can be performed with the aggregate-metadata command. Aggregation is done for two interrelated but distinct reasons:
- Fast uniform metadata access, independent of local data availability
- Comprehensive data discovery without access to or knowledge of individual datasets
In an individual dataset, metadata aggregation engages any number of enabled metadata extractors to build a JSON-LD based metadata representation that is separate from the original data files. These metadata objects are added to the dataset and are tracked with the same mechanisms that are used for any other dataset content. Based on this metadata, DataLad can provide fast and uniform access to metadata for any dataset component (individual files, subdatasets, the whole dataset itself), based on the relative path of a component within a dataset (available via the metadata command). This extracted metadata can be kept or made available locally for any such query, even when it is impossible or undesirable to keep the associated data files around (e.g. due to size constraints).
For any superdataset (a dataset that contains subdatasets as components), aggregation can go one step further. In this case, aggregation imports extracted metadata from subdatasets into the superdataset to offer the just described query feature for any aggregated subdataset too. This works across any number of levels of nesting. For example, a subdataset that contains the aggregated metadata for eight other datasets (that might have never been available locally) can be aggregated into a local superdataset with all its metadata. In that superdataset, a DataLad user is then able to query information on any content of any subdataset, regardless of their actual availability. This principle also allows any user to install the superdataset from http://datasets.datalad.org and perform local and offline queries about any dataset available online from this server.
Besides full access to all aggregated metadata by path (via the metadata command), DataLad also comes with a search command that provides different search modes to query the entirety of the locally available metadata. Its capabilities include simple keyword searches as well as more complex queries using date ranges or logical conjunctions.
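As noted earlier, alternative query applications can be built on a dump of the aggregated metadata. A minimal sketch of a keyword filter over such a dump (the record structure below is an illustrative placeholder, not DataLad's actual output format):

```python
# Hypothetical dump of aggregated metadata records; in practice such a
# dump would be obtained from DataLad's metadata facilities.
records = [
    {"path": "ds1", "tag": ["fmri", "audio"]},
    {"path": "ds2", "tag": ["eeg"]},
]

def keyword_search(records, keyword):
    """Return the paths of records whose tags contain the keyword."""
    return [r["path"] for r in records if keyword in r.get("tag", [])]

print(keyword_search(records, "eeg"))
```

More elaborate applications (e.g. graph-based queries) would follow the same pattern: obtain the dump once, then query it with whatever engine suits the use case, entirely offline.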