Tutorial

The datalad-dataverse extension provides a single command, add-sibling-dataverse. This tutorial shows you how it can be used for interactions and publications to Dataverse. For a high-level overview of what's possible with this extension, see also What can I use this extension for?.

A full Dataverse interaction requires 5 steps:

1. Create a Dataverse dataset
2. Create a DataLad dataset
3. Add a Dataverse sibling to your dataset
4. Make your dataset public
5. Clone a dataset from Dataverse

1. Create a Dataverse dataset

If you want to publish a dataset to Dataverse, you will need a dedicated location on Dataverse that we will publish our dataset to. For this, we will use a Dataverse dataset.

Go to your favourite Dataverse instance, log in or create an account, and create a new draft Dataverse dataset via the Add Data header:

The Add Dataset button takes you to a configurator for your Dataverse dataset. Provide all relevant details and metadata entries in the form. Importantly, don't upload any of your data files - this will be done by DataLad once we datalad push later.

Once you have clicked Save Dataset, you'll have a draft Dataverse dataset. It already has a DOI, and you can find it under the Metadata tab as "Persistent identifier":

Finally, make a note of the URL of your dataverse instance (e.g., https://demo.dataverse.org), and the DOI of your draft dataset. You will need this information for step 3.

2. Create a DataLad dataset

Next, you'll need a DataLad dataset to push to your Dataverse dataset. If you already have one, skip this step. If not, use datalad create <dataset-name-of-your-choice> to create a new dataset to populate, or transform an existing directory into a DataLad dataset using

$ datalad create -d <path-to-directory> --force

In both cases, any files you add into the dataset can be saved using datalad save. If you have never done this before, its a good idea to give the first pages of the DataLad handbook a quick read first.

Here's a toy example dataset with a single saved file:

$ datalad create my-test-dataset
create(ok): /tmp/my-test-dataset (dataset)
$ cd my-test-dataset
$ echo 12345 > my-file
$ datalad save -m "Saving my first file"
add(ok): my-file (file)
save(ok): . (dataset)
action summary:
   add (ok: 1)
   save (ok: 1)

3. Add a Dataverse sibling to your dataset

Now that you have a draft Dataverse dataset on Dataverse and a local DataLad dataset, let them get to know each other using the datalad add-sibling-dataverse command. This command registers the remote Dataverse Dataset as a known remote location to your Dataset and will allow you to publish the entire Dataset (Git history and annexed data) or parts of it to Dataverse.

If you run this command for the first time, you will need to provide an API Token to authenticate against the chosen Dataverse instance in an interactive prompt. This is how this would look:

$ datalad add-sibling-dataverse https://demo.dataverse.org doi:10.70122/FK2/NQPP6A
A dataverse API token is required for access.
Find it at https://demo.dataverse.org by clicking on your name at the top right corner and then clicking on API Token
token:

You'll find this token if you follow the instructions in the prompt under your user account on your Dataverse instance, and you can copy-paste it into the command line:

If authentication with the token was successful, it will be saved into your system's keyring. If you have accounts on several different dataverse instances or multiple users with different tokens, you can use and store several tokens with the --credential parameter of the command. For example:

datalad add-sibling-dataverse \
  --credential demo-dataverse \
  https://demo.dataverse.org doi:10.70122/FK2/NQPP6A

will search for a previously used credential demo-dataverse, or prompt for a token if it can't find one and save it after success.

The datalad add-sibling-dataverse command needs at least two pieces of information: The URL of your Dataverse instance, and a persistent identifier of the draft Dataverse dataset created in step 1. Depending on what you want to transfer to Dataverse, you also need to configure the command with the correct --mode. Two popular choices are annex and filetree. The former, which is also the default, will prepare the Dataverse dataset to contain both the Git revision history of your dataset as well as its annexed contents (if your Dataverse instance supports this, and your data doesn't exceed file size limits). The latter will publish a single snapshot of your dataset ("as it currently is", without version history). Let's illustrate the differences in detail:

annex mode

--mode annex is the command's default, and will be used when you don't explicitly provide the --mode parameter. It will create a non-human readable representation of your DataLad dataset on Dataverse that includes Git history and annexed data:

$ datalad add-sibling-dataverse \
  https://demo.dataverse.org \
  doi:10.70122/FK2/NQPP6A \
  --mode annex
add_sibling_dataverse.storage(ok): . [dataverse-storage: https://demo.dataverse.org (DOI: doi:10.70122/FK2/NQPP6A)]
[INFO   ] Configure additional publication dependency on "dataverse-storage"
add_sibling_dataverse(ok): . [dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/NQPP6A (DOI: doi:10.70122/FK2/NQPP6A)]

As soon as you've created the sibling, you can push:

$ datalad push --to dataverse
copy(ok): my-file (file) [to dataverse-storage...]
publish(ok): . (dataset) [refs/heads/master->dataverse:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]]

action summary:
   copy (ok: 1)
   publish (ok: 2)

And this is the result on Dataverse:

filetree mode

--mode filetree is an export mode, i.e., it will mirror a snapshot of the current state of your DataLad dataset to Dataverse. This is more human readable on Dataverse, but wouldn't include historical versions of your annexed files. The Git history of your dataset is included in this mode, too.

$ datalad add-sibling-dataverse \
  https://demo.dataverse.org \
  doi:10.70122/FK2/ZS0YL3 \
  --mode filetree
add_sibling_dataverse.storage(ok): . [dataverse-storage: https://demo.dataverse.org (DOI: doi:10.70122/FK2/ZS0YL3)]
[INFO   ] Configure additional publication dependency on "dataverse-storage"
add_sibling_dataverse(ok): . [dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (DOI: doi:10.70122/FK2/ZS0YL3)]

Now, you can push:

$ datalad push --to dataverse
copy(ok): .datalad/.gitattributes (dataset)
copy(ok): .datalad/config (dataset)
copy(ok): .gitattributes (dataset)
copy(ok): my-file (dataset)
publish(ok): . (dataset) [refs/heads/master->dataverse:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]]
action summary:
  copy (ok: 4)
  publish (ok: 2)

And this is the result on Dataverse:

Note that Dataverse has a number of file name requirements that restrict which characters can be used in file or directory names. DataLad works around this by encoding file or directory names to comply to the allowed character set. Therefore, your Dataverse dataset might display files with slightly different names from what your local DataLad dataset displays. These names will be changed into their original form when the dataset is cloned, though.

4. Make your dataset public

Your dataset on Dataverse will be in draft mode after you've pushed content into it. Use the webinterface to make it public and share it.

5. Clone a dataset from Dataverse

Finally, you or others can clone your datasets from Dataverse. They'll need a special type of URL and the datalad clone command for this.

The URL required for cloning starts with datalad-annex::? and is provided to you by the datalad add-dataverse-sibling command. Alternatively, you can also copy-paste it from the configuration of your remotes:

$ git remote -v
dataverse       datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (fetch)
dataverse       datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=yes&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/ZS0YL3 (push)

Once you have this URL, anyone with an account on the Dataverse instance and the correct permissions for the dataset can clone it:

$ datalad clone \
 'datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org&doi=doi:10.70122/FK2/NQPP6A' \
  my-clone
[INFO   ] Remote origin uses a protocol not supported by git-annex; setting annex-ignore
[INFO   ] access to 1 dataset sibling dataverse-storage not auto-enabled, enable with:
|            datalad siblings -d "/tmp/my-clone" enable -s dataverse-storage
install(ok): /tmp/tmp/my-clone-of-annex-mode (dataset)

Note

As a convenience it is also possible to use the URL of a Dataverse dataset landing page directly for clone:

datalad clone \
  https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPNGCY

Internally, it will be converted into the format shown above. This is intended to be used for one-off interactive cloning. For scripted access it is advisable to use the detailed format above, because changes in dataset layout and API are less likely to have an impact on this method, whereas the behavior of the convenience method may change in the future.

Afterwards, enable the special remote in the clone with the provided command, and retrieve file content using datalad get:

$ cd my-clone
$ datalad siblings -d "/tmp/my-clone" enable -s dataverse-storage
 .: dataverse-storage(?) [git]
$ datalad get my-file
get(ok): my-file (file) [from dataverse-storage...]