datalad.api.ebrains_clone

datalad.api.ebrains_clone(source, path=None, *, dataset=None, depth=None)

Export a dataset from the EBRAINS Knowledge Graph as a DataLad dataset

This command performs a series of queries against the EBRAINS Knowledge Graph (KG) to retrieve essential metadata on (versions of) an EBRAINS dataset. These metadata are used to build a Git-based representation of the dataset's evolution. This includes:

  • Any known DatasetVersion represented as a Git commit, i.e., a DataLad dataset version. The release date of the particular KG dataset version is recorded as the respective commit's date.

  • Any file that is part of a DatasetVersion, registered as an annexed file in the respective DataLad dataset version, matching the directory/file name in the corresponding EBRAINS FileRepository. For each file, a URL suitable for file retrieval and a checksum suitable for git-annex-based content verification are recorded.

  • Each dataset version is (Git) tagged with the respective VersionIdentifier recorded in the EBRAINS KG.

  • Each dataset version carries the VersionInnovation recorded in the EBRAINS KG as its commit message.
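The resulting representation can be inspected with standard Git and git-annex tooling. A minimal sketch in Python, assuming a dataset was already cloned into a (hypothetical) directory named julich-brain: the KG VersionIdentifiers appear as Git tags, and git-annex reports the recorded retrieval URLs for the annexed files.

import subprocess

# one Git tag per DatasetVersion (the KG VersionIdentifier)
tags = subprocess.run(
    ["git", "-C", "julich-brain", "tag", "--list"],
    capture_output=True, text=True, check=True,
)
print(tags.stdout)

# per-file content availability, including the recorded retrieval URLs
whereis = subprocess.run(
    ["git", "-C", "julich-brain", "annex", "whereis"],
    capture_output=True, text=True, check=True,
)
print(whereis.stdout)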

Authentication

This command requires authentication with an EBRAINS user account. An access token has to be obtained and provided via the KG_AUTH_TOKEN environment variable. Please see the ebrains-authenticate command for instructions on obtaining an access token.
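For the Python API, the token can be placed in the environment of the running process before the call. A minimal sketch, assuming the datalad-ebrains extension is installed and a valid token was obtained (token value is a placeholder; the UUID is taken from the examples below):

import os
import datalad.api as dl

# token obtained via the ebrains-authenticate command (placeholder value)
os.environ["KG_AUTH_TOKEN"] = "paste-your-access-token-here"

# subsequent ebrains_clone() calls in this process will use the token
dl.ebrains_clone("5a16d948-8d1c-400c-b797-8a7ad29944b2")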

Performance notes

For each considered DatasetVersion, two principal queries are performed: the first obtains essential metadata on that DatasetVersion; the second retrieves the list of files registered for that DatasetVersion. The latter query is slow and takes ~30s per DatasetVersion, regardless of the actual number of files registered. Consequently, cloning a dataset with any significant number of versions in the KG will take a considerable amount of time. This issue is known and tracked at https://github.com/HumanBrainProject/fairgraph/issues/57
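One way to limit this cost is a shallow clone via the depth parameter, so that only the most recent version(s) incur the per-version file query. A sketch using the Python API (UUID taken from the examples below; the target path is arbitrary):

import datalad.api as dl

# only the latest DatasetVersion is queried and converted into a commit
dl.ebrains_clone(
    "5a16d948-8d1c-400c-b797-8a7ad29944b2",
    path="julich-brain-latest",
    depth=1,
)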

Metadata validity

Metadata is always taken "as-is" from the EBRAINS KG. This can lead to unexpected results in case the metadata are faulty. For example, it may happen that a newer dataset version has an assigned commit date that is older than that of its preceding version.

Moreover, the EBRAINS KG does not provide all essential metadata required for annotating a Git commit. For example, the agent identity associated with a DatasetVersion release is not available. For this reason, this command unconditionally uses DataLad-EBRAINS exporter <ebrains@datalad.org> as author and committer identity.
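The commit dates and author identities actually recorded in a generated dataset can be checked with a plain Git log. A sketch, again assuming a clone in a (hypothetical) directory named julich-brain:

import subprocess

# one line per recorded DatasetVersion: commit date, author, tags, message
log = subprocess.run(
    ["git", "-C", "julich-brain", "log",
     "--date=iso", "--pretty=format:%ad | %an <%ae> |%d %s"],
    capture_output=True, text=True, check=True,
)
print(log.stdout)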

Reproducible dataset generation

Because no metadata modifications are performed and no local identity information is considered for generating a DataLad dataset, dataset cloning will yield reproducible results. In other words, running equivalent ebrains-clone commands on different machines, at different times, by different users will yield the exact same DataLad datasets -- unless the metadata retrieved from the EBRAINS KG changes. Such changes can happen when metadata issues are corrected, or when the metadata available to the requesting user identity differs.
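A sketch of verifying this property under the stated assumptions (unchanged KG metadata, datalad-ebrains installed): two independent clones of the same dataset should end up at the exact same commit.

import subprocess
import datalad.api as dl

uuid = "5a16d948-8d1c-400c-b797-8a7ad29944b2"
dl.ebrains_clone(uuid, path="clone-a")
dl.ebrains_clone(uuid, path="clone-b")

heads = [
    subprocess.run(
        ["git", "-C", path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    for path in ("clone-a", "clone-b")
]
# with identical KG metadata, both clones yield the same commit hash
assert heads[0] == heads[1]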

Examples

Clone the Julich-Brain Cytoarchitectonic Atlas at version 2.4 from the EBRAINS Knowledge Graph (the URL is taken directly from the EBRAINS data search web interface):

datalad ebrains-clone https://search.kg.ebrains.eu/instances/5249afa7-5e04-4ffd-8039-c3a9231f717c

Clone the latest version of the Julich-Brain Cytoarchitectonic Atlas, including all prior versions recorded in the EBRAINS Knowledge Graph. Instead of a URL, only the respective UUID is given here; it is the same UUID that identifies the dataset's "version overview" in the EBRAINS web interface at https://search.kg.ebrains.eu/instances/5a16d948-8d1c-400c-b797-8a7ad29944b2:

datalad ebrains-clone 5a16d948-8d1c-400c-b797-8a7ad29944b2

UUIDs and URLs can be used interchangeably as arguments. In both cases, a UUID is extracted from the given argument.
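The same operations are available via the Python API, assuming the datalad-ebrains extension is installed so that ebrains_clone is bound to datalad.api (target paths are arbitrary):

import datalad.api as dl

# clone by full KG URL (version 2.4 of the Julich-Brain Cytoarchitectonic Atlas)
dl.ebrains_clone(
    "https://search.kg.ebrains.eu/instances/5249afa7-5e04-4ffd-8039-c3a9231f717c",
    path="julich-brain-v2.4",
)

# clone by bare UUID (latest version, including all prior versions)
dl.ebrains_clone("5a16d948-8d1c-400c-b797-8a7ad29944b2")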

Parameters:
  • source -- URL including an ID of a dataset, or dataset version, in the EBRAINS knowledge graph. (Such UUIDs can be found in the trailing part of the URL when looking at a dataset on https://search.kg.ebrains.eu). When an identifier/URL of a particular dataset version is provided, all dataset versions preceding this version are included in the generated dataset (including the identified version). If the URL/ID of a version-less dataset is given, all known versions for that dataset are included.

  • path -- path to clone into. If no path is provided the destination will be the current working directory. [Default: None]

  • dataset -- Dataset to create. [Default: None]

  • depth -- Create a shallow clone with a history truncated to the specified number of versions recorded in the knowledge graph. [Default: None]

  • on_failure ({'ignore', 'continue', 'stop'}, optional) -- behavior to perform on failure: 'ignore': any failure is reported, but does not cause an exception; 'continue': if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; 'stop': processing will stop on first failure and an exception is raised. A failure is any result with status 'impossible' or 'error'. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: 'continue']

  • result_filter (callable or None, optional) -- if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable's return value does not evaluate to False and no ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]

  • result_renderer -- select rendering mode for command results. 'tailored' enables a command-specific rendering style that is typically tailored to human consumption, if there is one for a specific command, and otherwise falls back on the 'generic' result renderer; 'generic' renders each result in one line with key info like action, status, path, and an optional message; 'json' is a complete JSON line serialization of the full result record; 'json_pp' is like 'json', but pretty-printed spanning multiple lines; 'disabled' turns off result rendering entirely; '<template>' reports any value(s) of any result properties in any format indicated by the template (e.g. '{path}', compare with JSON output for all key-value choices). The template syntax follows the Python "format() language". It is possible to report individual dictionary values, e.g. '{metadata[name]}'. If a 2nd-level key contains a colon, e.g. 'music:Genre', ':' must be substituted by '#' in the template, like so: '{metadata[music#Genre]}'. [Default: 'tailored']

  • result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) -- if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]

  • return_type ({'generator', 'list', 'item-or-list'}, optional) -- return value behavior switch. If 'item-or-list', a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: 'list']