MetaLad development history and backward compatibility

Functionality related to metadata has been a part of the DataLad ecosystem from the very start. However, it has gone through several revisions, and this extension is its most recent state. If you were an early adopter of the metadata functionality of DataLad or MetaLad, this section provides an overview of past systems and notable changes, so that you can assess upgrades and backward compatibility with legacy metadata.

First-generation metadata

The first generation of metadata commands was implemented in the main datalad Python package, but barely saw the light of day. Very early users of DataLad might have caught a glimpse of it.

In the 1st-gen metadata implementation, metadata of a dataset had two levels. The first one contained the metadata about the actual content of a dataset (generated by DataLad or other processes), the second one was metadata about the dataset itself (generated by DataLad). The metadata was represented in RDF.

Second-generation metadata

The second generation of metadata commands came to life when the main datalad package was a few years old already. It brought the concept of dedicated _extractors_, including the legacy extractors that are supported to this day. It also provided a range of dedicated metadata subcommands of a datalad metadata command such as aggregate and extract, as well as a dedicated datalad search command. Extracted metadata was stored in a dataset in (compressed) files using a JSON stream format, separately for metadata describing a dataset as a whole, and metadata describing individual files in a dataset.

The 2nd-gen metadata implementation was moved into the datalad-deprecated extension in 2022.

Third-generation metadata

The third generation of metadata commands was developed as the DataLad extension MetaLad. Initially, up to version 0.2.1, it continued the development of the 2nd-gen metadata functionality. Afterwards, beginning with the 0.3.x series, the metadata model and command set were revised once more into the current 3rd-gen metadata implementation. This implementation came with an entirely new metadata model.

Gen 2 versus gen 3 metadata

This section is important if you have used datalad-metalad prior to the 0.3.0 release.

Overview of changes

The new system in 0.3.0 differs from the previous release in the following ways:

  1. Leaner commands with Unix-style behavior, i.e. one command per operation. Commands are chainable: the results of one command can be used as input for another, e.g. meta-extract|meta-add.
  2. Modifications to metadata records do not alter the state of the datalad dataset. In previous releases, changes to metadata altered the version (commit-hash) of the repository even though the primary data did not change. This is not the case in the new system. The new system does provide information about the primary data version, i.e. the commit-hash, from which the individual metadata elements were created.
  3. The ability to support a wide range of metadata storage backends in the future. This is facilitated by the [datalad-metadata-model](https://github.com/datalad/metadata-model), which is developed alongside metalad and separates the logical metadata model used in metalad from the storage backends by abstracting the storage backend. Currently, git-repository storage is supported.
  4. The ability to transport metadata independently of the data in the dataset. The new system introduces the concept of a metadata-store, which is usually the git-repository of the datalad dataset that is described by the metadata. This is not a mandatory configuration, however: metadata can be stored in almost any git-repository.
  5. The ability to report a subset of metadata from a remote metadata store without downloading the complete remote metadata. Only the minimal necessary information is transported from the remote metadata store. This ability is available to all metadata-based operations, including, for example, filtering.
  6. A new, simplified extractor model that distinguishes between two extractor types: dataset-level extractors and file-level extractors. The former are executed with a view on a dataset, the latter are executed with specific information about a single file-path in the dataset. The previous extractors (datalad, and datalad-metalad<=0.2.1) are still supported.
  7. A built-in pipeline mechanism that allows parallel execution of metadata operations such as metadata extraction and metadata filtering. (This is still at an early stage.)
  8. A new set of commands for operations that map metadata to metadata. These operations are called filtering and are implemented by MetadataFilter-classes. Filters are dynamically loaded and custom filters are supported, much like extractors. (This is still at an early stage.)
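
To make a few of these points more concrete, the following is a minimal conceptual sketch in plain Python. It is not MetaLad code and does not use the MetaLad API: every function name, record field, and value in it is a hypothetical stand-in chosen for illustration. It mimics a chained extract-then-add workflow (item 1), keeps metadata in a store separate from the dataset while recording the dataset version each record was created from (items 2 and 4), and distinguishes a dataset-level from a file-level extractor (item 6).

```python
import json
from itertools import chain
from typing import Dict, Iterable, Iterator, List

# NOTE: all names and record fields below are hypothetical, not MetaLad APIs.


def extract_dataset_level(dataset_id: str, version: str) -> Iterator[str]:
    """Toy dataset-level extractor: one JSON line describing the whole dataset."""
    yield json.dumps({
        "type": "dataset",
        "dataset_id": dataset_id,
        "dataset_version": version,
        "extracted_metadata": {"title": "example dataset"},
    })


def extract_file_level(dataset_id: str, version: str,
                       paths: Iterable[str]) -> Iterator[str]:
    """Toy file-level extractor: one JSON line per file path."""
    for path in paths:
        yield json.dumps({
            "type": "file",
            "dataset_id": dataset_id,
            "dataset_version": version,
            "path": path,
            "extracted_metadata": {"suffix": path.rsplit(".", 1)[-1]},
        })


def add(lines: Iterable[str], store: Dict[str, List[dict]]) -> int:
    """Consume JSON lines and file each record under its dataset version.

    Only the separate `store` is modified; nothing about the dataset changes.
    """
    count = 0
    for line in lines:
        record = json.loads(line)
        store.setdefault(record["dataset_version"], []).append(record)
        count += 1
    return count


# The "pipe": output of the extractors is the input of the add step.
store: Dict[str, List[dict]] = {}
dataset_version = "abc123"  # stands in for a commit-hash
added = add(
    chain(
        extract_dataset_level("ds-1", dataset_version),
        extract_file_level("ds-1", dataset_version, ["a.txt", "b.csv"]),
    ),
    store,
)
print(added, "records stored for version", dataset_version)
```

Because the extractors only emit serialized records and the add step only consumes them, either side can be swapped out or run on a different machine, which is the property the chainable command design aims for.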

Backward-compatibility

Metadata stored by certain versions of MetaLad is, for the time being, incompatible with other versions.

Note

Incompatibility of 0.3.0 and 0.2.x

Please note that the metadata storage format introduced in release 0.3.0 is incompatible with the metadata storage format in previous versions, i.e. 0.2.x, and those in datalad-deprecated. Both storage formats can coexist in storage, but version 0.3.0 of MetaLad will not be able to read metadata that was stored by the previous version and vice versa. Eventually there will be an importer that will pull old-version metadata into the new metadata storage.