Walk-through: Dataset hosting on GIN

In this walkthrough, we will use DataLad Gooey to create a dataset, save its contents, and publish it to GIN (G-Node Infrastructure).

Prerequisites

In order to use GIN for hosting and sharing your datasets, you need to:

  • register a GIN account;

  • add a personal access token to your GIN account (for creation of repositories with DataLad);

  • add an SSH key to your GIN account (for uploading annexed contents).

Follow the instructions on GIN to do so.

If you need to generate an SSH key pair and want to stay in the world of graphical interfaces, we recommend using PuTTYgen for this purpose. Your private key needs to be placed in the .ssh folder in your home directory for it to be picked up correctly.
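To double-check that a key pair actually sits where it is expected, a small script can help. The following is only a minimal sketch: the helper name find_key_pairs is hypothetical, and it assumes the common OpenSSH convention that a private key named X is accompanied by a public key named X.pub in the same folder.

```python
from pathlib import Path


def find_key_pairs(ssh_dir: Path) -> list[str]:
    """Return names of private key files in ssh_dir that have a matching .pub file."""
    if not ssh_dir.is_dir():
        return []
    return sorted(
        pub.stem                      # "id_ed25519.pub" -> "id_ed25519"
        for pub in ssh_dir.glob("*.pub")
        if (ssh_dir / pub.stem).is_file()
    )
```

Calling find_key_pairs(Path.home() / ".ssh") lists the key pairs found in the .ssh folder of your home directory; an empty list suggests that the private key is missing or named differently.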

Create a dataset

Let's assume that we are starting with an existing folder which already has some content, but is not yet a DataLad dataset. Open DataLad Gooey and set the base directory to our folder or its parent directory.

Our first operation is to create a DataLad dataset. For this, right-click your folder and select Directory commands → Create a dataset. This will populate the Command tab on the right with options for the selected command. The first value (Create at) is already populated, since we used right-click to issue the command. We leave Dataset with file annex checked (default), and Register in superdataset not set (default). In this example we want to configure our dataset to annex binary files, but not text files. To do so, select text2git from the list of Configuration procedure(s) and click Add. Finally, check OK if target directory not empty to allow dataset creation in a non-empty folder. With the options selected, click OK.

_images/created.png
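For readers who would rather script this step, the same choices can be sketched with DataLad's Python API. This is an illustration, not part of the Gooey workflow: the helper names create_options and create_dataset are hypothetical, and the mapping of form fields to keyword arguments in the comments is our reading of the form.

```python
def create_options(path: str) -> dict:
    """Keyword arguments mirroring the choices made in the Create a dataset form."""
    return {
        "path": path,            # "Create at"
        "annex": True,           # "Dataset with file annex" left checked
        "cfg_proc": "text2git",  # configuration procedure: keep text files in Git
        "force": True,           # "OK if target directory not empty"
    }


def create_dataset(path: str):
    # Imported lazily so that create_options() works without DataLad installed.
    import datalad.api as dl
    return dl.create(**create_options(path))
```

Register in superdataset has no counterpart here because it was left unset.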

Save the contents

Right-click the newly created dataset, and select Dataset commands → Save the state in a dataset. The parameters of the Save command should appear in the Command tab. Fill in the Description of change (this is the commit message associated with the save). Leave all other fields at their defaults (note: Do not put files in annex is greyed out rather than unchecked, i.e. it has no value). Here, we are saving all files at once, but we could also limit the save operation to selected files, or trigger it by right-clicking a specific file. Once ready, click OK.

Note that after this operation, the state of the previously "untracked" files changed to "clean". Unlike the files, the dataset itself is still reported as "untracked", because it is not registered in any superdataset. Because we used the text2git configuration, only the PNG file changed its type to "annexed-file" in the screenshot below.

_images/saved.png
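A scripted counterpart of the save step might look as follows. It is a minimal sketch using DataLad's Python API; save_options and save_dataset are hypothetical helper names. The to_git argument, which appears to correspond to Do not put files in annex, is deliberately left out so that the text2git configuration decides where each file goes.

```python
def save_options(dataset: str, message: str) -> dict:
    """Keyword arguments for saving everything in the dataset at once."""
    return {
        "dataset": dataset,  # path of the dataset to save
        "message": message,  # "Description of change", i.e. the commit message
    }


def save_dataset(dataset: str, message: str):
    # Imported lazily so that save_options() works without DataLad installed.
    import datalad.api as dl
    return dl.save(**save_options(dataset, message))
```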

Create a GIN sibling

Creating a GIN sibling will create a new repository on GIN and configure your dataset with its address. To perform this action, right-click your dataset, and select Dataset commands → Create a GIN sibling. Fill in the New repository name on Gin (and, optionally, check Make GIN repository private). You can leave all other options at their defaults.

In the Name of the credential to be used field, you can pick previously used credentials. If no value is given and no previous credentials exist, the credentials will be saved under the website name (gin.g-node.org) by default.

Click OK.

At this point, a pop-up window will appear and ask you for a token. Paste the access token generated on the GIN website, and click OK.

_images/created-sibling.png
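Outside the graphical interface, sibling creation could be sketched with the create_sibling_gin command of DataLad's Python API. The helper names below are hypothetical. We assume that, without the pop-up window, DataLad asks for the token in the terminal when no stored credential is found.

```python
def gin_sibling_options(dataset: str, private: bool = False, credential=None) -> dict:
    """Keyword arguments mirroring the Create a GIN sibling form."""
    opts = {
        "dataset": dataset,  # local dataset to configure
        "private": private,  # "Make GIN repository private"
    }
    if credential is not None:
        # "Name of the credential to be used"; omitted means the default is used
        opts["credential"] = credential
    return opts


def create_gin_sibling(dataset: str, reponame: str, **kwargs):
    # Imported lazily so that gin_sibling_options() works without DataLad installed.
    import datalad.api as dl
    # reponame corresponds to "New repository name on Gin"
    return dl.create_sibling_gin(reponame, **gin_sibling_options(dataset, **kwargs))
```

Unless a different name is passed, the Python API registers the sibling under the name gin.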

Push to the GIN sibling

Right-click your dataset and select Dataset commands → Push data/updates to a sibling. The only thing you need to select is the value of To dataset sibling; this will be the sibling name from the step above. Leave the other options at their defaults, and click OK.

_images/pushed.png
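The push can be scripted in the same spirit. The sketch below uses the push command of DataLad's Python API; push_options and push_dataset are hypothetical helper names, and the default sibling name gin is an assumption that holds only if the sibling was created without a custom name.

```python
def push_options(dataset: str, sibling: str = "gin") -> dict:
    """Keyword arguments mirroring the Push data/updates to a sibling form."""
    return {
        "dataset": dataset,  # dataset to publish
        "to": sibling,       # "To dataset sibling"
    }


def push_dataset(dataset: str, sibling: str = "gin"):
    # Imported lazily so that push_options() works without DataLad installed.
    import datalad.api as dl
    return dl.push(**push_options(dataset, sibling))
```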

Retrieve the data from GIN

Finally, we can confirm that our dataset can be obtained from GIN (possibly by other users who have access). Let's try making a clone in the same parent directory as our dataset, but under a different name. To do this, select DataLad → Clone a dataset from the menu. In the Clone from field, enter the dataset URL taken from GIN. Here, you can use either the HTTPS (for public repositories) or SSH (for private repositories) URL. Note that when using HTTPS, you need to remove the trailing .git from the URL for proper interaction with GIN. Next, click the directory selection icon to the right of the Clone into field, and use the directory picker to create and select a new directory named, for example, cloned-dataset. Afterwards, click OK.
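The URL adjustment described above is easy to get wrong by hand, so here is a minimal sketch of it together with a scripted clone. The function names are hypothetical, and any URL you pass in (for example one ending in user/repo.git) stands in for the address shown on your repository page.

```python
def gin_clone_url(url: str) -> str:
    """Drop a trailing .git from HTTPS URLs; leave SSH URLs untouched."""
    if url.startswith("https://") and url.endswith(".git"):
        return url[: -len(".git")]
    return url


def clone_dataset(url: str, target: str = "cloned-dataset"):
    # Imported lazily so that gin_clone_url() works without DataLad installed.
    import datalad.api as dl
    # source corresponds to "Clone from", path to "Clone into"
    return dl.clone(source=gin_clone_url(url), path=target)
```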

To obtain the annexed contents in the cloned dataset, right-click it in the file tree, select Dataset commands → Get content, and click OK. With the other options kept at their defaults, this will download all annexed content in the dataset; to obtain contents selectively, use the Limit to option. Alternatively, you can right-click individual files and use File commands → Get content.
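Content retrieval can also be sketched with the get command of DataLad's Python API. The helper names are hypothetical, and the optional paths argument plays the role of Limit to; passing absolute paths avoids ambiguity about what relative paths refer to.

```python
def get_options(dataset: str, paths=None) -> dict:
    """Keyword arguments for fetching annexed content in a dataset."""
    opts = {"dataset": dataset}
    if paths:
        # restrict retrieval to selected files or directories ("Limit to")
        opts["path"] = list(paths)
    return opts


def get_content(dataset: str, paths=None):
    # Imported lazily so that get_options() works without DataLad installed.
    import datalad.api as dl
    # Without a path, the request applies to the dataset as a whole.
    return dl.get(**get_options(dataset, paths))
```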