Use case 1: Publishing and cloning datasets

Problem statement

Imagine you have been building a reproducible workflow with DataLad from the get-go. Everything is finished now: code, data, and paper are ready. The last thing to do is to publish your data, code, results, and workflows, ideally all together, easily accessible, and fast.

The solution: Publish the complete dataset to the OSF and let others clone the project to get access to data, code, version history, and workflows. Because both the file content and the version history should be published, you decide on the annex sibling mode.

Creating the OSF sibling

Once OSF credentials are set, we can create a sibling in annex mode. This is the default mode, so the command below does not pass a --mode option. We will also make the project public (--public) and attach some metadata (--category, --tag) to it.

The code below will create a new public OSF project called best-study-ever, a dataset sibling called osf-annex, and a readily configured storage sibling osf-annex-storage. The project on the OSF will have a description with details on how to clone it, plus the metadata given on the command line.

# inside of the tutorial DataLad dataset
$ datalad create-sibling-osf --title best-study-ever \
  -s osf-annex \
  --category data \
  --tag reproducibility \
  --public

create-sibling-osf(ok): https://osf.io/<id>/
[INFO   ] Configure additional publication dependency on "osf-annex-storage"
configure-sibling(ok): /tmp/collab_osf (sibling)
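
To confirm that both siblings were registered, one option is to list the siblings of the dataset. The following is a minimal sketch, assuming it is run inside the tutorial dataset; the variables sibling and storage are illustrative helpers, and the guard only ensures that the snippet does nothing when DataLad is not installed.

```shell
# Names taken from the create-sibling-osf call above.
sibling="osf-annex"
storage="${sibling}-storage"

if command -v datalad >/dev/null 2>&1; then
  # Lists all siblings of the current dataset; both names should appear
  # if the previous step succeeded.
  datalad siblings || echo "could not list siblings (not inside a dataset?)" >&2
else
  echo "datalad not found on PATH" >&2
fi

echo "expecting siblings: ${sibling}, ${storage}"
```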

Publishing the dataset

Afterwards, all that’s left to do is a datalad push to publish the dataset to the OSF.

$ datalad push --to osf-annex

The published dataset contains all data and its Git history, but on the OSF it is not as human-readable as on a local computer.

Cloning the dataset

The dataset can be cloned with an osf://<id> URL, where <id> is the project ID assigned at project creation:

$ datalad clone osf://n6bgd/ best-study-ever
install(ok): /tmp/best-study-ever (dataset)
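
The project ID is the last path component of the project URL that create-sibling-osf reports (https://osf.io/<id>/). As a rough sketch, the clone URL could be derived from that URL with plain shell parameter expansion; the variable names are illustrative, and n6bgd is simply the example ID used above.

```shell
# Example project URL, in the form printed by create-sibling-osf.
project_url="https://osf.io/n6bgd/"

# Drop the trailing slash, then everything up to the last remaining slash.
trimmed="${project_url%/}"
project_id="${trimmed##*/}"

# Assemble the URL in the form that datalad clone expects.
clone_url="osf://${project_id}/"
echo "$clone_url"   # prints osf://n6bgd/
```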

All data can subsequently be obtained using datalad get.
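
A minimal sketch of that last step, assuming the clone from above exists as ./best-study-ever in the current directory (the dataset variable is an illustrative helper):

```shell
# Directory created by the datalad clone call above.
dataset="best-study-ever"

if [ -d "$dataset" ] && command -v datalad >/dev/null 2>&1; then
  # Retrieve the content of all annexed files in the dataset.
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$dataset" && datalad get .)
else
  echo "skipping: ${dataset} not found or datalad not installed" >&2
fi
```

Passing a single file or directory path instead of . retrieves only that part of the dataset.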