datalad.api.meta_aggregate(path=None, dataset=None, recursive=False, recursion_limit=None, into='top', force=None)

Aggregate metadata of one or more (sub)datasets for later reporting.

Metadata aggregation refers to a procedure that extracts metadata present in a dataset into a portable representation that is stored in a standardized (internal) format. Moreover, metadata aggregation can also extract metadata in this format from one dataset and store it in another (super)dataset. Based on such collections of aggregated metadata it is then possible to discover particular (sub)datasets and individual files in them, without having to obtain the actual dataset repositories first (see the DataLad ‘meta-report’ command).

To enable aggregation of metadata that are contained in files of a dataset, one has to enable one or more metadata extractors for that dataset. DataLad supports a number of common metadata standards, such as the Exchangeable Image File Format (EXIF), Adobe’s Extensible Metadata Platform (XMP), and various audio file metadata systems like ID3. DataLad extension packages can provide metadata extractors for additional metadata sources. The list of metadata extractors available to a particular DataLad installation is reported by the ‘wtf’ command (‘datalad wtf’).

Enabling a metadata extractor for a dataset is done by adding its name to the ‘datalad.metadata.nativetype’ configuration variable in the dataset’s configuration file (.datalad/config), e.g.:

[datalad "metadata"]
  nativetype = exif
  nativetype = xmp
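
The configuration fragment above can also be produced programmatically. The following is a minimal sketch and not part of DataLad: the helper name render_nativetype_section is hypothetical, and it only builds the text of the section, with one nativetype line per extractor as in the example. Writing the result into .datalad/config is left to the caller.

```python
def render_nativetype_section(extractors):
    """Return a git-config style section enabling the given extractors.

    Hypothetical helper for illustration; not a DataLad function.
    """
    lines = ['[datalad "metadata"]']
    # The variable is multi-valued: one 'nativetype' line per extractor.
    lines.extend(f"  nativetype = {name}" for name in extractors)
    return "\n".join(lines) + "\n"


section = render_nativetype_section(["exif", "xmp"])
print(section, end="")
```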

If an enabled metadata extractor is not available in a particular DataLad installation, metadata extraction will not succeed in order to avoid inconsistent aggregation results.

Enabling multiple extractors is supported. In this case, metadata are extracted by each extractor individually, and stored alongside each other. Metadata aggregation will also extract DataLad’s internal metadata (‘metalad_core’), and git-annex file metadata (‘metalad_annex’).

Metadata aggregation can be performed recursively, in order to aggregate all metadata from all subdatasets. By default, re-aggregation of metadata inspects modifications of datasets and metadata extractor parameterization with respect to the last aggregated state. For performance reasons, re-aggregation is automatically skipped if no relevant change is detected. This default behavior can be altered via the force argument.
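
As an illustration of the recursion and force options, the following sketch assembles keyword arguments for a call and checks them against the values documented for this command. The helper build_aggregate_kwargs and the dataset path are hypothetical. The actual call is shown only as a comment, since it requires a DataLad installation that provides meta_aggregate.

```python
# Values of 'force' and 'into' documented for meta_aggregate.
FORCE_CHOICES = {None, "extraction", "fromscratch", "ignoreextractorchange"}
INTO_CHOICES = {"top", "all"}


def build_aggregate_kwargs(dataset, recursive=False, force=None, into="top"):
    """Assemble keyword arguments for meta_aggregate (hypothetical helper)."""
    if force not in FORCE_CHOICES:
        raise ValueError(f"unsupported force mode: {force!r}")
    if into not in INTO_CHOICES:
        raise ValueError(f"unsupported into mode: {into!r}")
    return {"dataset": dataset, "recursive": recursive, "force": force, "into": into}


# Engage all enabled extractors in every subdataset, even if no change is detected.
kwargs = build_aggregate_kwargs(
    "/path/to/superdataset",  # placeholder path
    recursive=True,
    force="extraction",
)
print(kwargs)

# With DataLad and the extension providing this command installed, one could then run:
# from datalad.api import Dataset, meta_aggregate
# kwargs["dataset"] = Dataset(kwargs["dataset"])
# meta_aggregate(**kwargs)
```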

Depending on the versatility of the present metadata and the number of datasets or files, aggregated metadata can grow prohibitively large or take a long time to process. See the documentation of the extract-metadata command for a number of configuration settings that can be used to tailor this process on a per-dataset basis.

  • path (sequence of str or None, optional) – path to (sub)datasets whose metadata shall be aggregated. When a given path is pointing into a dataset (instead of to its root), the metadata of the containing dataset will be aggregated. [Default: None]
  • dataset (Dataset or None, optional) – topmost dataset metadata will be aggregated into. If no dataset is specified, a dataset will be discovered based on the current working directory. [Default: None]
  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]
  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]
  • into ({'top', 'all'}, optional) – which datasets shall receive the aggregated metadata: all datasets from any leaf dataset to the top-level target dataset including all intermediate datasets (all), or just the top-level dataset (top). [Default: ‘top’]
  • force ({'extraction', 'fromscratch', 'ignoreextractorchange', None}, optional) – Disable specific optimizations: ‘extraction’ overrides change detection and engages all enabled extractors regardless of whether an actual change in a dataset’s state is detected with respect to any existing metadata aggregate; ‘fromscratch’ wipes out any existing metadata aggregates first, including aggregates for unavailable datasets (implies ‘extraction’); ‘ignoreextractorchange’ disables comparison of current and recorded extractor parameterization and avoids re-extraction due to extractor changes alone. [Default: None]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’ any failure is reported, but does not cause an exception; ‘continue’ if any failure occurs an exception will be raised at the end, but processing other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. Raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value evaluates to True. A result is dropped if the return value evaluates to False or if the callable raises a ValueError exception. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’ a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
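
To make the result-handling parameters more concrete, here is a small sketch of a callable that could be passed as result_filter. It assumes that result dictionaries carry a status key, as the on_failure description implies. The sample records and their path values are made up for illustration and are not real command output.

```python
def failures_only(res, **kwargs):
    """Keep only results that count as failures per the on_failure description."""
    return res.get("status") in ("impossible", "error")


# Hypothetical result records, for illustration only.
sample_results = [
    {"status": "ok", "path": "/ds"},
    {"status": "error", "path": "/ds/sub1"},
    {"status": "impossible", "path": "/ds/sub2"},
]

# Apply the filter by hand to see which records it would keep.
kept = [r for r in sample_results if failures_only(r)]
print([r["path"] for r in kept])

# Possible use, assuming a DataLad installation that provides this command:
# from datalad.api import meta_aggregate
# meta_aggregate(recursive=True, result_filter=failures_only, on_failure="ignore")
```

Accepting **kwargs in the callable means it can also receive the keyword arguments of the original API call, as described for result_filter.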