datalad ebrains-clone

Synopsis

datalad ebrains-clone [-h] [-d DATASET] [--depth DEPTH] [--version] URL [PATH]

Description

Export a dataset from the EBRAINS Knowledge Graph as a DataLad dataset

This command performs a series of queries against the EBRAINS Knowledge Graph (KG) to retrieve essential metadata on (versions of) an EBRAINS dataset. These metadata are used to build a Git-based representation of the dataset's evolution. This includes:

Any known DatasetVersion represented as a Git commit, i.e., a DataLad dataset version. The release date of the particular KG dataset version is recorded as the respective commit's date.

Any file that is part of a DatasetVersion registered as an annexed file in the respective DataLad dataset version, matching the directory/file name in the corresponding EBRAINS FileRepository, with a URL suitable for file retrieval, and a checksum suitable for git-annex based content verification recorded for each file.

Each dataset version is (Git) tagged with the respective VersionIdentifier recorded in the EBRAINS KG.

Each dataset version carries the VersionInnovation recorded in the EBRAINS KG as its commit message.

Authentication

This command requires authentication with an EBRAINS user account. An access token has to be obtained and provided via the KG_AUTH_TOKEN environment variable. Please see the ebrains-authenticate command for instructions on obtaining an access token.
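As a minimal sketch of the token requirement, a small shell wrapper can refuse to start a clone when KG_AUTH_TOKEN is empty. The function name ebrains_clone_checked is hypothetical and not part of DataLad; it only forwards its arguments to datalad ebrains-clone.

```shell
# Hypothetical wrapper (not part of DataLad): fail early if no token is set.
ebrains_clone_checked() {
    if [ -z "${KG_AUTH_TOKEN:-}" ]; then
        echo "KG_AUTH_TOKEN is not set; obtain an access token first" >&2
        return 1
    fi
    # Forward all arguments unchanged to the actual command.
    datalad ebrains-clone "$@"
}

# Usage, with a placeholder instead of a real token:
#   export KG_AUTH_TOKEN=<token>
#   ebrains_clone_checked URL [PATH]
```

Checking the variable up front avoids starting a long-running series of KG queries that cannot succeed without a token.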

Performance notes

For each considered DatasetVersion, two principal queries are performed. The version query obtains essential metadata on that DatasetVersion; the second query retrieves a list of files registered for that DatasetVersion. The latter query is slow and takes ~30s per DatasetVersion, regardless of the actual number of files registered. Consequently, cloning a dataset with any significant number of versions in the KG will take a considerable amount of time. This issue is known and tracked at https://github.com/HumanBrainProject/fairgraph/issues/57
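To get a feeling for the expected duration, the ~30s figure can be turned into a rough lower bound. The helper name estimate_clone_seconds is hypothetical, and the estimate covers only the file-listing query, not the version query or any other work.

```shell
# Rough lower bound on clone time, assuming ~30 s per DatasetVersion
# for the file-listing query (the figure quoted in the note above).
estimate_clone_seconds() {
    versions=$1
    seconds_per_version=30
    echo $(( versions * seconds_per_version ))
}

estimate_clone_seconds 12   # prints 360, i.e. about six minutes
```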

Metadata validity

Metadata is always taken "as-is" from the EBRAINS KG. This can lead to unexpected results if the metadata is faulty. For example, a newer dataset version may have an assigned commit date that is older than that of its preceding version.

Moreover, the EBRAINS KG does not provide all essential metadata required for annotating a Git commit. For example, the agent identity associated with a DatasetVersion release is not available. This command unconditionally uses DataLad-EBRAINS exporter <ebrains@datalad.org> as author and committer identity for this reason.

Reproducible dataset generation

Because no metadata modifications are performed and no local identity information is considered for generating a DataLad dataset, dataset cloning will yield reproducible results. In other words, running equivalent ebrains-clone commands, on different machines, at different times, by different users will yield the exact same DataLad datasets -- unless the metadata retrieved from the EBRAINS KG changes. Such changes can happen when metadata issues are corrected, or metadata available to a requesting user identity differs.
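One way to check this property, sketched under the assumption that two clones of the same dataset already exist locally, is to compare the commit each clone points at. The helper name same_head is hypothetical; it relies only on git rev-parse.

```shell
# Hypothetical helper: succeed if two local clones point at the same commit.
# Returns 2 if either path is not a usable Git repository.
same_head() {
    a=$(git -C "$1" rev-parse HEAD) || return 2
    b=$(git -C "$2" rev-parse HEAD) || return 2
    [ "$a" = "$b" ]
}

# Usage, with placeholder paths for two independently created clones:
#   same_head /path/to/clone-a /path/to/clone-b && echo "identical history"
```

Because a Git commit hash covers the tree content, commit message, dates, identities, and parent commits, matching hashes imply matching version histories.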

Examples

Clone the Julich-Brain Cytoarchitectonic Atlas at version 2.4 from the EBRAINS Knowledge Graph (the URL is taken directly from the EBRAINS data search web interface):

datalad ebrains-clone https://search.kg.ebrains.eu/instances/5249afa7-5e04-4ffd-8039-c3a9231f717c

Clone the latest version of the Julich-Brain Cytoarchitectonic Atlas, including all prior versions recorded in the EBRAINS Knowledge Graph. Instead of a URL, only the respective UUID is given here. It is the UUID of the "version overview" available via the EBRAINS web interface at https://search.kg.ebrains.eu/instances/5a16d948-8d1c-400c-b797-8a7ad29944b2:

datalad ebrains-clone 5a16d948-8d1c-400c-b797-8a7ad29944b2

UUIDs or URLs can be used interchangeably as an argument. In both cases, a UUID is extracted from the given argument.
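The following sketch illustrates why both forms are equivalent. It is not the command's actual implementation; the helper name extract_uuid is hypothetical, and it assumes a grep that supports the -o option (GNU and BSD grep do).

```shell
# Illustrative only: pull the last UUID-shaped token out of an argument.
extract_uuid() {
    printf '%s\n' "$1" |
        grep -oiE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        tail -n 1
}

# Both calls print 5a16d948-8d1c-400c-b797-8a7ad29944b2
extract_uuid https://search.kg.ebrains.eu/instances/5a16d948-8d1c-400c-b797-8a7ad29944b2
extract_uuid 5a16d948-8d1c-400c-b797-8a7ad29944b2
```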

Options

URL

URL including an ID of a dataset or dataset version in the EBRAINS Knowledge Graph. (Such UUIDs can be found in the trailing part of the URL when looking at a dataset on https://search.kg.ebrains.eu.) When an identifier/URL of a particular dataset version is provided, that version and all dataset versions preceding it are included in the generated dataset. If the URL/ID of a version-less dataset is given, all known versions of that dataset are included.

PATH

path to clone into. If no PATH is provided, the destination will be the current working directory.

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Dataset to create.

--depth DEPTH

Create a shallow clone with a history truncated to the specified number of versions recorded in the knowledge graph.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.