Basic principles

DataLad is designed to be used both as a command-line tool and as a Python module. The sections Command line reference and Python module reference provide a detailed description of the commands and functions of the two interfaces. This section presents common concepts. Although examples will frequently use the command-line interface, all functionality is available through identically named functions and options in the Python API as well.
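
For example, the shell command datalad create demo, which appears in the demo later in this section, corresponds to the following Python call (a minimal sketch, assuming the datalad Python package is importable):

import datalad.api as dl

dl.create(path='demo')   # same operation as `datalad create demo` in the shell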

Datasets

A DataLad dataset is a Git repository that may or may not have a data annex for managing the data referenced in the dataset. In practice, most DataLad datasets come with an annex.
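
When no annex is needed (e.g., for a dataset holding only code or text kept directly in Git), a dataset can be created without one. A minimal sketch via the Python API, assuming the no_annex parameter mirrors the --no-annex option of datalad create shown later in this section:

import datalad.api as dl

# create a dataset that is a plain Git repository, without a data annex
dl.create(path='code-only', no_annex=True)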

Types of IDs used in datasets

Four types of unique identifiers are used by DataLad to enable identification of different aspects of datasets and their components.

Dataset ID

A UUID that identifies a dataset as a whole across its entire history and flavors. This ID is stored in a dataset’s own configuration file (<dataset root>/.datalad/config) under the configuration key datalad.dataset.id. As this configuration is stored in a file that is part of the Git history of a dataset, this ID is identical for all “clones” of a dataset and across all its versions. If the purpose or scope of a dataset changes enough to warrant a new dataset ID, it can be changed by altering the dataset configuration setting.

Annex ID

A UUID assigned to an annex of each individual clone of a dataset repository. Git-annex uses this UUID to track file content availability information. The UUID is available under the configuration key annex.uuid and is stored in the configuration file of a local clone (<dataset root>/.git/config). A single dataset instance (i.e. clone) can only have a single annex UUID, but a dataset with multiple clones will have multiple annex UUIDs.

Commit ID

A Git hexsha or tag that identifies a version of a dataset. This ID uniquely identifies the content and history of a dataset up to its present state. As the dataset history also includes the dataset ID, a commit ID of a DataLad dataset is unique to a particular dataset.

Content ID

Git-annex key (typically a checksum) assigned to the content of a file in a dataset’s annex. The checksum reflects the content of a file, not its name. Hence multiple identical files within a single dataset (or across datasets) will have the same content ID. Content IDs are managed by Git-annex in a dedicated annex branch of the dataset’s Git repository.
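
For illustration, here is a minimal sketch of inspecting these IDs from Python, assuming a dataset exists at the path demo (as created in the demo below); the attribute and method names are those of DataLad’s Dataset class and its repository abstraction:

from datalad.api import Dataset

ds = Dataset('demo')                       # a local dataset clone
print(ds.id)                               # dataset ID (datalad.dataset.id)
print(ds.repo.config.get('annex.uuid'))    # annex ID of this particular clone
print(ds.repo.get_hexsha())                # commit ID of the current version
# a content ID can be shown on the command line with: git annex lookupkey <path>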

Dataset nesting

Datasets can contain other datasets (subdatasets), which can in turn contain subdatasets, and so on. There is no limit to the depth of nesting datasets. Each dataset in such a hierarchy has its own annex and its own history. The parent or superdataset only tracks the specific state of a subdataset, and information on where it can be obtained. This is a powerful yet lightweight mechanism for combining multiple individual datasets for a specific purpose, such as the combination of source code repositories with other resources for a tailored application. In many cases DataLad can work with a hierarchy of datasets just as if it were a single dataset. Here is a demo:

~ % datalad create demo
[INFO   ] Creating a new annex repo at /demo/demo
create(ok): /demo/demo (dataset)
~ % cd demo

A DataLad dataset is just a Git repo with some initial configuration

~/demo % git log --oneline
472e34b (HEAD -> master) [DATALAD] new dataset
f968257 [DATALAD] Set default backend for all files to be MD5E

We can generate nested datasets by telling DataLad to register a new dataset in a parent dataset

~/demo % datalad create -d . sub1
[INFO   ] Creating a new annex repo at /demo/demo/sub1
add(ok): sub1 (dataset) [added new subdataset]
add(notneeded): sub1 (dataset) [nothing to add from /demo/demo/sub1]
add(notneeded): .gitmodules (file) [already included in the dataset]
save(ok): /demo/demo (dataset)
create(ok): sub1 (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  create (ok: 1)
  save (ok: 1)

A subdataset is nothing more than a regular Git submodule

~/demo % git submodule
 5f0cddf2026e3fb4864139f27e7415fd72c7d4d0 sub1 (heads/master)

Of course subdatasets can be nested

~/demo % datalad create -d . sub1/justadir/sub2
[INFO   ] Creating a new annex repo at /demo/demo/sub1/justadir/sub2
add(ok): sub1/justadir/sub2 (dataset) [added new subdataset]
add(notneeded): sub1/justadir/sub2 (dataset) [nothing to add from /demo/demo/sub1/justadir/sub2]
add(notneeded): sub1/.gitmodules (file) [already included in the dataset]
add(notneeded): sub1 (dataset) [already known subdataset]
save(ok): /demo/demo/sub1 (dataset)
save(ok): /demo/demo (dataset)
create(ok): sub1/justadir/sub2 (dataset)
action summary:
  add (notneeded: 3, ok: 1)
  create (ok: 1)
  save (ok: 2)

Unlike Git, DataLad automatically takes care of committing all changes associated with the added subdataset up to the given parent dataset

~/demo % git status
On branch master
nothing to commit, working tree clean

Let’s create some content in the deepest subdataset

~/demo % mkdir sub1/justadir/sub2/anotherdir
~/demo % touch sub1/justadir/sub2/anotherdir/afile

Git can only tell us that something underneath the top-most subdataset was modified

~/demo % git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

     modified:   sub1 (untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

DataLad saves us from further investigation

~/demo % datalad diff -r
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
untracked(directory): sub1/justadir/sub2/anotherdir

Like Git, DataLad can report individual untracked files, but it can do so across repository boundaries

~/demo % datalad diff -r --report-untracked all
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
     untracked(file): sub1/justadir/sub2/anotherdir/afile

Adding this new content with Git or git-annex alone would be an exercise in first identifying the correct repository to operate on

~/demo % git add sub1/justadir/sub2/anotherdir/afile
fatal: Pathspec 'sub1/justadir/sub2/anotherdir/afile' is in submodule 'sub1'

DataLad does not require users to determine the correct repository in the tree

~/demo % datalad add -d . sub1/justadir/sub2/anotherdir/afile
add(ok): sub1/justadir/sub2/anotherdir/afile (file)
save(ok): /demo/demo/sub1/justadir/sub2 (dataset)
save(ok): /demo/demo/sub1 (dataset)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 3)

Again, all associated changes in the entire dataset tree, up to the given parent dataset, were committed

~/demo % git status
On branch master
nothing to commit, working tree clean

DataLad’s ‘diff’ is able to report the changes from these related commits throughout the repository tree

~/demo % datalad diff --revision @~1 -r
   modified(dataset): sub1
   modified(dataset): sub1/justadir/sub2
         added(file): sub1/justadir/sub2/anotherdir/afile
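
The same nesting operations are available through the Python API. A rough sketch of the demo above (parameter names mirror the CLI options, e.g. -d corresponds to dataset; the file is assumed to have been created as in the demo):

import os
import datalad.api as dl

dl.create(path='demo')
os.chdir('demo')
dl.create(path='sub1', dataset='.')                              # datalad create -d . sub1
dl.create(path='sub1/justadir/sub2', dataset='.')                # nested subdataset
dl.add(path='sub1/justadir/sub2/anotherdir/afile', dataset='.')  # datalad add -d .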

Dataset collections

A superdataset can also be seen as a curated collection of datasets, for example, for a certain data modality, a field of science, a certain author, or from one project (maybe the resource for a movie production). This lightweight coupling between super and subdatasets enables scenarios where individual datasets are maintained by a disjoint set of people, and the dataset collection itself can be curated by a completely independent entity. Any individual dataset can be part of any number of such collections.

Benefiting from Git’s support for workflows based on decentralized “clones” of a repository, DataLad’s datasets can be (re-)published to a new location without losing the connection between the “original” and the new “copy”. This is extremely useful for collaborative work, but also in more mundane scenarios such as data backup, or the temporary deployment of a dataset on a compute cluster or in the cloud. Using git-annex, data can also be synchronized across different locations of a dataset (siblings in DataLad terminology). Using metadata tags, it is even possible to configure different levels of desired data redundancy across the network of datasets, or to prevent publication of sensitive data to publicly accessible repositories. Individual datasets in a hierarchy of (sub)datasets need not be stored at the same location. Continuing with an earlier example, it is possible to publish a curated collection of datasets, as a superdataset, on GitHub, while the actual datasets live on different servers all around the world.

Basic command line usage

All of DataLad’s functionality is available through a single command: datalad

Running the datalad command without any arguments gives a summary of basic options and a list of available sub-commands.

~ % datalad
usage: datalad [-h] [-l LEVEL] [-C PATH] [--version]
               [--dbg] [--idbg] [-c KEY=VALUE]
               [-f {default,json,json_pp,tailored,'<template>'}]
               [--report-status {success,failure,ok,notneeded,impossible,error}]
               [--report-type {dataset,file}]
               [--on-failure {ignore,continue,stop}] [--cmd]
               {create,install,get,publish,uninstall,drop,remove,update,create-sibling,create-sibling-github,unlock,save,search,metadata,aggregate-metadata,test,ls,clean,add-archive-content,download-url,run,rerun,addurls,export-archive,extract-metadata,export-to-figshare,no-annex,wtf,add-readme,annotate-paths,clone,create-test-dataset,diff,siblings,sshrun,subdatasets}
               ...
[ERROR  ] Please specify the command
~ % #

More comprehensive information is available via the --help long option (we will truncate the output here)

~ % datalad --help | head -n20
Usage: datalad [global-opts] command [command-opts]

DataLad provides a unified data distribution with the convenience of git-annex
repositories as a backend.  DataLad command line tools allow to manipulate
(obtain, create, update, publish, etc.) datasets and their collections.

*Commands for dataset operations*

  create
      Create a new dataset from scratch
  install
      Install a dataset from a (remote) source
  get
      Get any dataset content (files/directories/subdatasets)
  publish
      Publish a dataset to a known sibling
  uninstall
      Uninstall subdatasets

Getting information on any of the available sub-commands works in the same way: just pass --help AFTER the sub-command (output again truncated)

~ % datalad create --help | head -n20
Usage: datalad create [-h] [-f] [-D DESCRIPTION] [-d PATH] [--no-annex]
                      [--nosave] [--annex-version ANNEX_VERSION]
                      [--annex-backend ANNEX_BACKEND]
                      [--native-metadata-type LABEL] [--shared-access MODE]
                      [--git-opts STRING] [--annex-opts STRING]
                      [--annex-init-opts STRING] [--text-no-annex]
                      [PATH]

Create a new dataset from scratch.

This command initializes a new dataset at a given location, or the
current directory. The new dataset can optionally be registered in an
existing superdataset (the new dataset's path needs to be located
within the superdataset for that, and the superdataset needs to be given
explicitly). It is recommended to provide a brief description to label
the dataset's nature *and* location, e.g. "Michael's music on black
laptop". This helps humans to identify data locations in distributed
scenarios.  By default an identifier comprised of user and machine name,
plus path will be generated.
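
The Python API exposes the same documentation: each command’s docstring can be inspected with Python’s built-in help (a quick sketch):

import datalad.api as dl

help(dl.create)   # prints the documentation of the create command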

API principles

You can use DataLad’s install command to download datasets. The command accepts URLs of different protocols (http, ssh) as an argument. However, the easiest way to obtain a first dataset is to download the default superdataset from https://datasets.datalad.org/ using a shortcut.

Downloading DataLad’s default superdataset

https://datasets.datalad.org provides a super-dataset consisting of datasets from various portals and sites. Many of them were crawled, and are periodically updated, using the datalad-crawler extension. The argument /// can be used as a shortcut that points to the superdataset located at https://datasets.datalad.org/. Here are three common examples in command line notation:

datalad install ///

installs this superdataset (metadata only, without subdatasets) into a datasets.datalad.org/ subdirectory of the current directory

datalad install -r ///openfmri

installs the openfmri superdataset into an openfmri/ subdirectory. Additionally, the -r flag recursively installs the metadata of all datasets available from http://openfmri.org as subdatasets within the openfmri/ subdirectory

datalad install -g -J3 -r ///labs/haxby

installs the superdataset of datasets released by the lab of Dr. James V. Haxby, including the metadata of all subdatasets. The -g flag requests the actual data, too, which is downloaded using three parallel processes (-J3 flag).
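
For reference, a sketch of the same three shortcuts via the Python API; the parameter names are assumed to mirror the CLI options shown above:

import datalad.api as dl

dl.install(path='datasets.datalad.org', source='///')
dl.install(path='openfmri', source='///openfmri', recursive=True)
dl.install(source='///labs/haxby', recursive=True, get_data=True, jobs=3)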

Downloading datasets via http

In most places where DataLad accepts URLs as arguments, these URLs can be regular http or https protocol URLs. For example:

datalad install https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

Downloading datasets via ssh

DataLad also supports SSH URLs, such as ssh://me@localhost/path.

datalad install ssh://me@localhost/path

Finally, DataLad supports SSH login style resource identifiers, such as me@localhost:/path.

datalad install me@localhost:/path

Commands install vs get

The install and get commands might seem confusingly similar at first. Both can be used to install any number of subdatasets and to fetch the content of data files. The differences lie primarily in their default behavior and output, and thus in their intended use:

  • install primarily operates and reports at the level of datasets. It returns the dataset(s) that either were just installed or had already been installed at the specified locations, so the result should be the same if the same install command is run twice on the same datasets. It does not fetch data files by default.

  • get primarily operates at the level of paths (datasets, directories, and/or files). It returns only what was actually installed (datasets) or fetched (files), so rerunning the same get command should report that nothing new was installed or fetched. It fetches data files by default.

With respect to how both commands operate on provided paths, it could be said that install == get -n, and install -g == get. However, install can also install new datasets from remote locations, given their URLs (e.g., https://datasets.datalad.org/ for our super-dataset) or SSH targets (e.g., [login@]host:path), provided either as a positional argument or explicitly via the --source option. If datalad install --source URL PATH is used, the dataset from URL gets installed under PATH. With a plain datalad install URL invocation, PATH is derived from the last component of the URL, similar to how git clone does it. While the former form allows only a single URL and PATH at a time, the latter can take multiple remote locations from which datasets should be installed.
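
Expressed in the Python API, the equivalences above look roughly like this (a sketch, assuming an already-registered subdataset at path sub1):

import datalad.api as dl

dl.install(path='sub1')                  # ~ dl.get('sub1', get_data=False)
dl.install(path='sub1', get_data=True)   # ~ dl.get('sub1')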

So, as a rule of thumb: use install if you want to install a dataset from an external URL, or to obtain a sub-dataset without downloading the data files stored under its annex. In the Python API, install is also the command to use when you want to receive the corresponding Dataset object as output, so you can operate on it even when rerunning a script. In all other cases, use get.
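
A sketch of that last point: because install also reports datasets that are already present, a script can obtain a usable Dataset object on the first run and on every rerun (using the studyforrest URL from the example above):

import datalad.api as dl

ds = dl.install(
    path='studyforrest-data-phase2',
    source='https://github.com/psychoinformatics-de/studyforrest-data-phase2.git')
# rerunning the script yields the same Dataset object; fetch content when needed
ds.get('.')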