DataLad extension for working with the UK Biobank¶
Overview¶
Introduction¶
This software is a DataLad extension that equips DataLad with a set of commands to obtain, monitor, and restructure imaging data releases of the UK Biobank. It is designed to download MRI bulk data, track additions/redactions/fixes from the UK Biobank, and (optionally) restructure into BIDS layout.
What is the UK Biobank?¶
The UK Biobank is a national and international health resource with unparalleled research opportunities, open to all bona fide health researchers. The UK Biobank aims to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses — including cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression, and forms of dementia. It is following the health and well-being of 500,000 volunteer participants, and aims to collect imaging data for 100,000 of the participants. It provides health information, which does not identify participants, to approved researchers in the UK and overseas, from academia and industry.
Requirements¶
In order to download data from the UK Biobank with datalad-ukbiobank, you will need the following:
- Approved access to download the UK Biobank data. This can be gained as the Principal Investigator (PI) on an approved application, or as a collaborator with “delegate” status.
- A keyfile (file containing a 64-character password that is provided by the UK Biobank after a successful application)
- A bulk data file (requires the download of the main dataset and conversion to a bulk file; see the UK Biobank accessing data guide)
- The ukbfetch commandline tool
Installation¶
The easiest way to install the latest version of datalad-ukbiobank is from PyPI. It is recommended to use a dedicated virtual environment:
# create and enter a new virtual environment (optional)
python3 -m venv ~/.venvs/datalad
source ~/.venvs/datalad/bin/activate
# install from PyPI
pip install datalad_ukbiobank
Concepts & Terms¶
The extension operates on a single UK Biobank subject, and acts as a wrapper around the ukbfetch tool to retrieve, ingest, restructure, and update data.
The first command, ukb-init, initializes a dataset for a given UK Biobank participant and data field(s). The second command, ukb-update, updates a dataset for the initialized subject/data fields.
Dataset Structure¶
datalad-ukbiobank allows for only one subject per dataset. Tailored or comprehensive superdatasets can then be created to link the desired subject datasets as subdatasets. This structure keeps each dataset lightweight and promotes parallel downloads.
Branches¶
Data can be viewed in different layouts by checking out layout-specific branches:
- incoming: the unextracted archives, as downloaded from the UK Biobank (e.g. zip files)
- incoming-native: the extracted files in the original layout provided by the UK Biobank
- incoming-bids: if enabled, the extracted files converted to a BIDS(-like) layout
Bulk File¶
The required bulk file lists all participant IDs and data field IDs that are available for download for an approved application. These participant IDs and data field IDs are then used as input for the ukb-init command.
To generate a bulk file, follow the UK Biobank accessing data guide to first download the main dataset and then generate a bulk file. Section 3.2.2 of that guide explains how to create modality-specific bulk files (e.g. participant IDs for all those with T1 structural brain images).
Once a bulk file is created, it can be parsed to extract the desired participant and data field IDs for download with datalad-ukbiobank.
Snippet of a bulk file:
1002532 20227_2_0
1002532 20227_3_0
1002532 20249_2_0
1002532 20249_3_0
1002532 20250_2_0
1002532 20250_3_0
1003339 20251_2_0
1003339 20251_3_0
1003339 20252_2_0
1003339 20252_3_0
1003339 20253_2_0
1003339 20253_3_0
- Participant ID: unique to each application/project (e.g. 1002532).
- Data field ID: indicates the data type (e.g. 20227 = NIFTI functional rest image), instance index (e.g. 2 = first imaging visit), and array index (e.g. 0). The instance index distinguishes data that were gathered at different times (sessions). The array index indicates whether multiple pieces of data were gathered at the same time. These fields are explained in more detail in section 2.8 of the UK Biobank accessing data guide.
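The bulk file format described above can be parsed with a few lines of Python. This is a minimal sketch, not part of datalad-ukbiobank: the names `DataRecord`, `parse_record_id`, and `parse_bulk_lines` are hypothetical, and the code assumes every non-empty line holds exactly one participant ID and one data field ID separated by whitespace, as in the snippet.

```python
from collections import defaultdict
from typing import NamedTuple


class DataRecord(NamedTuple):
    """Components of a data field ID (hypothetical helper type)."""
    field: int     # data type, e.g. 20227
    instance: int  # visit/session index, e.g. 2
    array: int     # index among items gathered at the same time


def parse_record_id(record_id: str) -> DataRecord:
    """Split an ID such as '20227_2_0' into its three components."""
    field, instance, array = record_id.split("_")
    return DataRecord(int(field), int(instance), int(array))


def parse_bulk_lines(lines):
    """Group data field IDs by participant ID, keeping the file order."""
    records = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        participant, record_id = line.split()
        records[participant].append(record_id)
    return dict(records)


bulk = """\
1002532 20227_2_0
1002532 20227_3_0
1003339 20251_2_0
""".splitlines()

by_participant = parse_bulk_lines(bulk)
```

The resulting dictionary maps each participant ID to the list of record IDs that can be passed to ukb-init for that participant.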
ukbfetch¶
ukbfetch is a tool provided by the UK Biobank. It downloads specified bulk data, and requires authentication with a keyfile. See the ukbfetch documentation for specifics.
datalad-ukbiobank downloads data with the ukbfetch tool (which must be available in PATH).
The UK Biobank allows multiple downloads in parallel, but limits each application to 10 concurrent downloads.
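One way to respect the limit of 10 concurrent downloads when processing many subject datasets is to cap the size of a worker pool. The sketch below is an illustration under assumptions: `run_limited` is a hypothetical helper, and `worker` stands for any caller-supplied function that handles one participant (for example, by running ukb-update in that subject's dataset).

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DOWNLOADS = 10  # per-application limit stated above


def run_limited(worker, participants, max_workers=MAX_CONCURRENT_DOWNLOADS):
    """Call worker(participant) for each participant, at most
    max_workers at a time. Results keep the order of `participants`."""
    if not 1 <= max_workers <= MAX_CONCURRENT_DOWNLOADS:
        raise ValueError("max_workers must be between 1 and 10")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, participants))
```

Threads are sufficient here because each worker would mostly wait on an external download process rather than compute in Python.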
Note
If you already have UK Biobank archives downloaded, and want to use datalad-ukbiobank without re-downloading everything, you can simply replace ukbfetch with a script to obtain the relevant files from where they are located.
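The core of such a replacement script could look like the following sketch. It rests on assumptions that are not confirmed by this documentation: that each batch line has the form `<participant> <record>`, and that local files are named `<participant>_<record>.<ext>`. The function name `fetch_local` is hypothetical, and wiring it up as an executable named ukbfetch that accepts the arguments the extension passes is not shown.

```python
import shutil
from pathlib import Path


def fetch_local(batch_file, source_dir, dest_dir="."):
    """Copy already-downloaded files listed in a batch file into dest_dir.

    Assumes '<participant> <record>' batch lines and local files named
    '<participant>_<record>.<ext>'. Returns the list of copied paths.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    copied = []
    for line in Path(batch_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        participant, record = line.split()
        matches = sorted(source_dir.glob(f"{participant}_{record}.*"))
        if not matches:
            raise FileNotFoundError(f"no local file for {participant} {record}")
        for src in matches:
            copied.append(Path(shutil.copy2(src, dest_dir / src.name)))
    return copied
```

Raising on a missing file, rather than skipping it silently, keeps an incomplete local archive from being mistaken for a complete download.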
Quick Start¶
Download Data¶
To download UK Biobank data for a subject, start by creating and initializing a new dataset. In this example, two data records with two instances (sessions) each are selected.
datalad create sub-1002532
cd sub-1002532
datalad ukb-init 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0
After initialization, run ukb-update to download data from the UK Biobank.
datalad ukb-update --keyfile <path_to_keyfile> --merge
This will create two branches:
- incoming: the pristine archives downloaded from UK Biobank
- incoming-native: the extracted files in the original layout provided by the UK Biobank
With ukb-update --merge, content is merged from incoming-native into the active branch automatically.
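When many subjects need to be set up, the three commands above can be generated programmatically. The following sketch only assembles the argument lists and uses the flags shown in this section; `download_steps` and the keyfile name `ukb.key` are hypothetical.

```python
import shlex


def download_steps(participant, records, keyfile, bids=False):
    """Return (argv, cwd) pairs for create, ukb-init, and ukb-update.

    cwd is None for the first step and the dataset directory afterwards,
    mirroring the `cd` in the example above.
    """
    dataset = f"sub-{participant}"
    init = ["datalad", "ukb-init"]
    if bids:
        init.append("--bids")
    init += [participant, *records]
    return [
        (["datalad", "create", dataset], None),
        (init, dataset),
        (["datalad", "ukb-update", "--keyfile", keyfile, "--merge"], dataset),
    ]


# Print the commands instead of running them.
for argv, cwd in download_steps("1002532", ["20227_2_0", "20227_3_0"], "ukb.key"):
    print(f"[{cwd or '.'}] {shlex.join(argv)}")
```

Each pair can be executed with `subprocess.run(argv, cwd=cwd, check=True)` once datalad, the extension, and ukbfetch are installed.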
Get Updates¶
To update a single subject’s dataset, simply re-run ukb-update to re-download the data and register any potential changes. Running ukb-update will always re-download the data, regardless of whether there are upstream changes. Again, the --merge option will merge any updates into the active branch.
datalad ukb-update --keyfile <path_to_keyfile> --merge
Add or Remove Data Types¶
To add/remove data types, first re-initialize the dataset (with --force) to select the new data types. In this example, another data record with two instances (sessions) is added to the list of selected data records.
datalad ukb-init --force 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0 20250_2_0 20250_3_0
After re-initialization, run ukb-update to download the data.
datalad ukb-update --keyfile <path_to_keyfile> --merge
Structure in BIDS¶
To enable a BIDS(-like) layout of the data, re-initialize the dataset with the --bids option. This option can also be used when first initializing the dataset.
datalad ukb-init --force --bids 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0 20250_2_0 20250_3_0
After re-initialization, run ukb-update to create an additional incoming-bids branch containing a BIDS(-like) conversion of the extracted downloads. If the --merge option is specified, it will merge the incoming-bids branch into the active branch.
datalad ukb-update --keyfile <path_to_keyfile> --merge --force
The BIDS conversion only happens if the re-downloaded data differs from the previously downloaded data. If there are no changes to the content on re-download, but you want to initiate the BIDS conversion, the --force option can be used.
Save Space¶
The --drop option can be used to avoid storing multiple copies of the same data. In this example, the downloaded archives are kept and the extracted files are dropped.
datalad ukb-update --keyfile <path_to_keyfile> --merge --force --drop extracted
It is also possible to keep the extracted content and drop the archives using --drop archives.
Command Line Reference¶
datalad ukb-init¶
Synopsis¶
datalad ukb-init [-h] [-f] [--bids] [-d DATASET] [--version] PARTICPANT-ID DATARECORD-ID [DATARECORD-ID ...]
Description¶
Initialize an existing dataset to track a UKBiobank participant
A batch file for the ‘ukbfetch’ tool will be generated and placed into the dataset. By selecting the relevant data records, raw and/or preprocessed data will be tracked.
After initialization the dataset will contain at least three branches:
- ‘incoming’: to track the pristine ZIP files downloaded from UKB
- ‘incoming-native’: to track individual files (some extracted from ZIP files)
- ‘incoming-bids’: to track individual files in a layout where file names conform to BIDS conventions
- main branch: based off of incoming-native or incoming-bids (if enabled) with potential manual modifications applied
Examples
Initialize a dataset in the current directory:
% datalad ukb-init 5874415 20227_2_0 20249_2_0
Initialize a dataset in the current directory in BIDS layout:
% datalad ukb-init --bids 5874415 20227_2_0
Options¶
PARTICPANT-ID¶
UKBiobank participant ID to use for this dataset (note: these encoded IDs are unique to each application/project). Constraints: value must be a string
DATARECORD-ID¶
One or more data record identifiers. Constraints: value must be a string
-h, --help, --help-np¶
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-f, --force¶
force (re-)initialization.
--bids¶
additionally maintain an incoming-bids branch with a BIDS-like organization.
-d DATASET, --dataset DATASET¶
specify the dataset to perform the initialization on. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--version¶
show the module and its version which provides the command
Authors¶
datalad is developed by Michael Hanke <michael.hanke@gmail.com>.
datalad ukb-update¶
Synopsis¶
datalad ukb-update [-h] [-k PATH] [--merge] [-f] [--drop {extracted|archives}] [-d DATASET] [--version]
Description¶
Update an existing dataset of a UKbiobank participant
This command expects a DataLad dataset that was initialized with ukb-init. The dataset may or may not have any downloaded content already.
Downloads are performed with the UKBFETCH tool, which is expected to be available and executable.
Options¶
-h, --help, --help-np¶
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-k PATH, --keyfile PATH¶
path to a file with an authentication key (ukbfetch -a …). If none is given, the configuration datalad.ukbiobank.keyfile is consulted. Constraints: value must be a string or value must be NONE
--merge¶
merge any updates into the active branch. If a BIDS layout is maintained in the dataset (incoming-bids branch) it will be merged into the active branch, the incoming-native branch otherwise.
-f, --force¶
update the incoming branch(es), even if (re-)download did not yield changed content (can be useful when restructuring setup has changed).
--drop {extracted|archives}¶
Drop file content to avoid storage duplication. ‘extracted’: drop all content of files extracted from downloaded archives to yield the most compact storage at the cost of partial re-extraction when accessing archive content; ‘archives’: keep extracted content, but drop archives instead. By default no content is dropped, duplicating archive content in extracted form. Constraints: value must be one of (‘extracted’, ‘archives’)
-d DATASET, --dataset DATASET¶
specify the dataset to perform the initialization on. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE
--version¶
show the module and its version which provides the command
Authors¶
datalad is developed by Michael Hanke <michael.hanke@gmail.com>.
Python API¶
- ukb_init(participant, records[, force, …]): Initialize an existing dataset to track a UKBiobank participant
- ukb_update([keyfile, merge, force, drop, …]): Update an existing dataset of a UKbiobank participant
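The two functions can be combined into a single helper, sketched below. Several points are assumptions rather than documented facts: that the functions are reachable through the `datalad.api` module once the extension is installed, and that the `bids` and `dataset` keyword names mirror the --bids and --dataset command line options. The helper name `init_and_update` is hypothetical. The API module is passed in as `dl` so the sketch stays importable without DataLad.

```python
def init_and_update(dl, path, participant, records, keyfile, bids=False):
    """Create a dataset at `path`, then initialize and update it.

    `dl` is expected to be `datalad.api` with datalad-ukbiobank installed.
    The keyword names `bids` and `dataset` are assumptions based on the
    command line options of the same name.
    """
    dl.create(path=path)
    dl.ukb_init(participant, records, bids=bids, dataset=path)
    return dl.ukb_update(keyfile=keyfile, merge=True, dataset=path)


# Intended use (requires DataLad, this extension, and ukbfetch):
#   import datalad.api as dl
#   init_and_update(dl, "sub-1002532", "1002532", ["20227_2_0"], "<path_to_keyfile>")
```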
Acknowledgements¶
This development was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement VirtualBrainCloud (H2020-EU.3.1.5.3, grant no. 826421).