datalad ebrains-clone
Synopsis
datalad ebrains-clone [-h] [-d DATASET] [--depth DEPTH] [--version] URL [PATH]
Description
Export a dataset from the EBRAINS Knowledge Graph as a DataLad dataset
This command performs a series of queries against the EBRAINS Knowledge Graph (KG) to retrieve essential metadata on (versions of) an EBRAINS dataset. These metadata are used to build a Git-based representation of the dataset's evolution. This includes:
Any known
DatasetVersion
represented as a Git commit, i.e., a DataLad dataset version. The release date of the particular KG dataset version is recorded as the respective commit's date.Any file that is part of a
DatasetVersion
registered as an annexed file in the respective DataLad dataset version, matching the directory/file name in the corresponding EBRAINSFileRepository
, with a URL suitable for file retrieval, and a checksum suitable for git-annex based content verification recorded for each file.Each dataset version is (Git) tagged with the respective
VersionIdentifier
recorded in the EBRAINS KG.Each dataset version carries the
VersionInnovation
recorded in the EBRAINS KG as its commit message.
Authentication
This command requires authentication with an EBRAINS user account.
An access token has to be obtained and provided via the KG_AUTH_TOKEN
environment variable. Please see the ebrain-authenticate command
for instructions on obtaining an access token.
Performance notes
For each considered DatasetVersion
two principal queries are performed.
The version query obtains essential metadata on that DatasetVersion
,
the second query retrieve a list of files registered for that
DatasetVersion
. The later query is slow, and takes ~30s per
DatasetVersion
, regardless of the actual number of files registered.
Consequently, cloning a dataset with any significant number of versions
in the KG will take a considerable amount of time.
This issue is known and tracked at
https://github.com/HumanBrainProject/fairgraph/issues/57
Metadata validity
Metadata is always taken "as-is" from the EBRAINS KG. This can lead to unexpected results, in case metadata is are faulty. For example, it may happen that a newer dataset version has an assigned commit data that is older than its preceeding version.
Moreover, the EBRAINS KG does not provide all essential metadata required
for annotating a Git commit. For example, the agent identity associated
with a DatasetVersion
release is not available. This command
unconditionally uses DataLad-EBRAINS exporter <ebrains@datalad.org>
as author and committer identity for this reason.
Reproducible dataset generation
Because no metadata modifications are performed and no local identity
information is considered for generating a DataLad dataset, dataset
cloning will yield reproducible results. In other words, running
equivalent ebrains-clone
commands, on different machines, at
different times, by different users will yield the exact same DataLad
datasets -- unless the metadata retrieved from the EBRAINS KG changes.
Such changes can happen when metadata issues are corrected, or metadata
available to a requesting user identity differs.
Examples
Clone the Julich-Brain Cytoarchitectonic Atlas at version 2.4 from the EBRAINS Knowledge Graph (the URL is taken directly from the EBRAINS data search web interface):
datalad ebrains-clone https://search.kg.ebrains.eu/instances/5249afa7-5e04-4ffd-8039-c3a9231f717c
Clone the latest version of the Julich-Brain Cytoarchitectonic Atlas, including all prior versions recorded in the EBRAINS Knowledge Graph. Instead of a URL, here we only query for the respective UUID, which is identical to the "version overview" available via the EBRAINS web interface at https://search.kg.ebrains.eu/instances/5a16d948-8d1c-400c-b797-8a7ad29944b2:
datalad ebrains-clone 5a16d948-8d1c-400c-b797-8a7ad29944b2
UUIDs or URL can be used interchangably as an argument. In both cases, a UUID is extracted from the given argument.
Options
URL
URL including an ID of a dataset, or dataset version in the EBRAINS knowledge graph. (Such UUIDs can be found in the trailing part of the URL when looking at a dataset on https://search.kg.ebrains.eu). When an identifier/URL of a particular dataset version is provided all dataset versions preceeding this version are included in the generated dataset (including the identified version). If the URL/ID of a version-less dataset is given, all known versions for that dataset are included.
PATH
path to clone into. If no PATH is provided the destination will be the current working directory.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
"Dataset to create.
--depth DEPTH
"Create a shallow clone with a history truncated to the specified number of version recorded in the knowledge graph.
--version
show the module and its version which provides the command