# A typical collaborative data management workflow

In this demo we will look at how DataLad can be used in a rather common data management workflow: a 3rd-party dataset is obtained to serve as input for an analysis. The data processing is then performed collaboratively by two colleagues. Upon completion, the results are published alongside the original data for further consumption.

## Build atop 3rd-party data

Now, meet Bob. Bob has just started in the lab and has never used the version control system Git before. The first thing he does is configure his identity, as it will be used to track changes in the datasets he will be working with. This step only needs to be done once, on his first day in the lab.

    # enter Bob's home directory
    HOME="$BOBS_HOME"
    cd ~
    git config --global --add user.name Bob
    git config --global --add user.email bob@example.com

After this initial setup, Bob is ready to go and can create his first dataset.

    datalad create myanalysis --description "my phd in a day"
    cd myanalysis

A DataLad dataset can contain other datasets. As any content of a dataset is tracked and its precise state is recorded, this is a powerful method to specify and later resolve data dependencies. In this example, Bob wants to work with structural MRI data from the studyforrest project, a public brain imaging data resource. These data are made available through GitHub, so Bob can simply install the relevant dataset from this service into his own dataset:

    datalad install -d . --source https://github.com/psychoinformatics-de/studyforrest-data-structural.git src/forrest_structural

He can then see that forrest_structural was registered as a Git submodule, which makes it a subdataset of his myanalysis dataset, but that no data was fetched (datalad ls -L provides a size_installed/total_size column):

    # mostly for a test
    grep src/forrest_structural .gitmodules
    # to demonstrate ls
    datalad ls -r -L .

Bob has decided to collect all data inputs for his project in a subdirectory src/, to make it obvious which parts of his analysis steps and code require 3rd-party data. Upon completion of the install command, Bob has access to the entire dataset content, and the precise current version of that dataset is linked to his myanalysis dataset. However, no data has actually been downloaded yet. DataLad datasets primarily contain information on a dataset’s content and where to obtain it, hence the installation above was done rather quickly, and will still be relatively lean even for a dataset that contains several hundred GBs of data.

For his first steps Bob just needs a single file of the dataset. In order to make it available locally, Bob can use the get command, and DataLad will obtain the requested data files from a remote data provider.

    datalad get src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz
    # just test data for now, could be
    #datalad get src/forrest_structural/sub-*/anat/sub-*_T1w.nii.gz

Although we originally installed the dataset from GitHub, the actual data is hosted elsewhere. DataLad supports multiple redundant data providers for each file in a dataset, and will transparently attempt to obtain data from an alternative location if a particular data provider is not available.

Bob wants his analysis to be easily reproducible, and therefore manages his analysis scripts in the same dataset repository as the input data. Managing input data, analysis code, and results in the same version control system creates a precise record of which version of code and input data was used to create which particular results. DataLad datasets are regular Git repositories and therefore provide the same powerful source code management features as any other Git repository, and make them available for data too.

Bob decided to adopt the convention of collecting all of his analysis code in a subdirectory code/ in the root of his dataset. His first “analysis” script is rather simple:

    mkdir code
    echo "file src/forrest_structural/sub-01/anat/sub-*_T1w.nii.gz > result.txt" > code/run_analysis.sh

In order to definitively document which data file his analysis needs at this point, Bob creates a second script that can (re-)obtain the required files:

    echo "datalad get src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz" > code/get_required_data.sh

In the future, this won’t be necessary anymore as datalad itself will be able to record this information upon request.

At this point Bob is satisfied with his initial progress. He wants to record this precise state. In order to do that, Bob needs to make his just created scripts a part of his dataset. The add command is used for this purpose. However, Bob doesn’t just want datalad to track these files and facilitate future downloads.
He wants all Git features for working with them, so he adds them directly to the Git repository underlying his dataset.

    # add all content in the code/ directory directly to git
    datalad add --to-git code

At this point, datalad is aware of all changes that were made to the dataset, and all the changes Bob made were automatically recorded, as you can easily check with the git log command. As Bob’s analysis is completely scripted, he can now run it in full:

    bash code/get_required_data.sh
    bash code/run_analysis.sh

and add the generated results to the dataset, providing a custom message to better describe the accomplished work:

    datalad add -m "First analysis results" result.txt

You could also use the --nosave option with add, and invoke datalad save later on to group multiple changes into a single commit.

    # git log

## Local collaboration

Some time later, Bob needs help with his analysis. He turns to his colleague Alice for help. Alice and Bob both work on the same computing server. Alice initially went through a similar configuration procedure of her Git identity as Bob.

    HOME="$ALICES_HOME"
    cd
    git config --global --add user.name Alice
    git config --global --add user.email alice@example.com
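The identity setup used by Bob and Alice relies on Git keeping its global configuration under the directory that `HOME` points to. The following is a minimal sketch of storing an identity and reading it back, assuming `git` is installed. The temporary directory is a hypothetical stand-in for a user's home directory and is not part of the demo.

```shell
#!/bin/sh
# Use a throwaway directory as HOME so the real global Git
# configuration is left untouched (hypothetical stand-in for $ALICES_HOME).
demo_home="$(mktemp -d)"

# --global stores the setting in the user-level configuration found via HOME,
# so overriding HOME for the command redirects where it is written.
HOME="$demo_home" git config --global user.name Alice
HOME="$demo_home" git config --global user.email alice@example.com

# Read the values back through the same HOME override.
name="$(HOME="$demo_home" git config --global --get user.name)"
email="$(HOME="$demo_home" git config --global --get user.email)"
echo "$name <$email>"
```

Unlike the commands in the demo, the sketch sets the values without `--add`, so running it again replaces the existing entry instead of appending a second one.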


Bob has told Alice in which directory he keeps his analysis dataset. The colleagues’ directories are configured with permissions that allow read access for all lab members, so Alice can obtain Bob’s work directly from his home directory, including the studyforrest-structural subdataset he had installed:

    # TODO: needs to get --description to avoid confusion

Back in his own dataset, Bob registers Alice’s copy of the dataset as a sibling named alice:

    cd ~/myanalysis
    datalad siblings add -s alice --url "$ALICES_HOME/bobs_analysis"

Once registered, Bob can update his dataset based on Alice’s version, and merge her changes with his own.

    datalad update -s alice --merge

He can, once again, use the get command to obtain the latest version of the data files, and so get access to the data contributed by Alice.

    datalad get result.txt

## Going public

Lastly, let’s assume that Bob has completed his analysis and is ready to share the results with the world, or with a remote collaborator. One way to make datasets available is to upload them to a webserver via SSH. DataLad supports this by creating a sibling for the dataset on the server, to which the dataset can be published (repeatedly).

    # this generates a sibling for the dataset and all subdatasets
    datalad create-sibling --recursive -s public "$SERVER_URL"


Once the remote sibling is created and registered under the name “public”, Bob can publish his version to it.

    datalad publish -r --to public .


This command can be repeated as often as desired. DataLad checks the state of both the local and the remote sibling and transmits the changes.
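Because DataLad datasets are regular Git repositories and a sibling is registered as a Git remote, the register, merge, and publish cycle can be pictured with plain Git. The following is a rough analogy only, assuming `git` is installed. All directory names, file contents, and commit messages are hypothetical, and the sketch does not cover the transfer of data file content that `datalad get` and `datalad publish` handle in addition.

```shell
#!/bin/sh
set -e
work="$(mktemp -d)"
cd "$work"

# Helper: commit with an explicit identity so no global config is required.
commit() {
    git -c user.name="$1" -c user.email="$1@example.com" commit -q -m "$2"
}

# Bob's repository with one tracked result file.
git init -q bob
cd bob
echo "first" > result.txt
git add result.txt
commit bob "First analysis results"
cd ..

# Alice clones Bob's repository and commits a change of her own.
git clone -q bob alice
cd alice
echo "second" >> result.txt
git add result.txt
commit alice "Update from Alice"
cd ..

# Bob registers Alice's copy as a remote, fetches it, and merges her branch
# (the plain-Git counterpart of adding a sibling and updating with a merge).
cd bob
git remote add alice ../alice
git fetch -q alice
branch="$(git symbolic-ref --short HEAD)"
git -c user.name=bob -c user.email=bob@example.com merge -q "alice/$branch"

# Publishing: push to a bare repository registered as "public".
# The second push finds nothing new to transmit and still succeeds.
git init -q --bare ../public.git
git remote add public ../public.git
git push -q public "$branch"
git push -q public "$branch"

# Show the subject of the latest commit that reached the public copy.
published="$(git --git-dir=../public.git log -1 --format=%s "$branch")"
echo "$published"
```

The repeated push mirrors the point made above: publishing again compares both sides and only has work to do when the local state has moved on.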