DataLad extension for working with the UK Biobank

Overview

Introduction

This software is a DataLad extension that equips DataLad with a set of commands to obtain, monitor, and restructure imaging data releases of the UK Biobank. It is designed to download MRI bulk data, track additions/redactions/fixes from the UK Biobank, and (optionally) restructure into BIDS layout.

What is the UK Biobank?

The UK Biobank is a national and international health resource with unparalleled research opportunities, open to all bona fide health researchers. The UK Biobank aims to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses — including cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression, and forms of dementia. It is following the health and well-being of 500,000 volunteer participants, and aims to collect imaging data for 100,000 of the participants. It provides health information, which does not identify participants, to approved researchers in the UK and overseas, from academia and industry.

Requirements

In order to download data from the UK Biobank with datalad-ukbiobank, you will need the following:

  • Approved access to download the UK Biobank data. This can be gained as the Principal Investigator (PI) on an approved application, or as a collaborator with “delegate” status.
  • A keyfile (file containing a 64-character password that is provided by the UK Biobank after a successful application)
  • A bulk data file (requires the download of the main dataset and conversion to a bulk file; see the UK Biobank accessing data guide)
  • The ukbfetch commandline tool

Installation

The easiest way to install the latest version of datalad-ukbiobank is from PyPI. It is recommended to use a dedicated virtual environment:

# create and enter a new virtual environment (optional)
python3 -m venv ~/.venvs/datalad
source ~/.venvs/datalad/bin/activate

# install from PyPI
pip install datalad_ukbiobank

Concepts & Terms

The extension operates on a single UK Biobank subject, and acts as a wrapper around the ukbfetch tool to retrieve, ingest, restructure, and update data. The first command, ukb-init, initializes a dataset for a given UK Biobank participant and data field(s). The second command, ukb-update, updates a dataset for the initialized subject and data fields.

Dataset Structure

datalad-ukbiobank allows for only one subject per dataset. Tailored or comprehensive superdatasets can then be created to link the desired subject datasets as subdatasets. This structure keeps each dataset lightweight and promotes parallel downloads.
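
As a rough illustration of the one-subject-per-dataset layout, the sketch below only builds the `datalad create` argument lists for a superdataset and its per-subject subdatasets; it does not run them. The helper name `superdataset_commands` and the superdataset name `ukb-super` are hypothetical, while the `sub-<participant>` naming follows the Quick Start example.

```python
def superdataset_commands(superdataset, participants):
    """Return argument lists that create a superdataset and one subdataset per participant."""
    commands = [["datalad", "create", superdataset]]
    for participant in participants:
        # `-d .` registers the new dataset as a subdataset of the current one,
        # so every command after the first is meant to run inside `superdataset`.
        commands.append(["datalad", "create", "-d", ".", f"sub-{participant}"])
    return commands


# "ukb-super" is a made-up name used only for this example.
commands = superdataset_commands("ukb-super", ["1002532", "1003339"])
```

Each list could be passed to `subprocess.run(cmd, check=True, cwd=...)`, with `cwd` set to the superdataset directory for every command after the first.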

Branches

Data can be viewed in different layouts by checking out layout-specific branches:

incoming
the unextracted archives, as downloaded from the UK Biobank (e.g. zip files)
incoming-native
the extracted files in the original layout provided by the UK Biobank
incoming-bids
if enabled, the extracted files converted to a BIDS(-like) layout

Bulk File

The required bulk file lists all participant IDs and data field IDs that are available for download for an approved application. These participant IDs and data field IDs are then used as input for the ukb-init command.

To generate a bulk file, follow the UK Biobank accessing data guide to first download the main dataset and then generate a bulk file. Section 3.2.2 of this document explains how to create modality-specific bulk files (e.g. participant IDs for all those with T1 structural brain images).

Once a bulk file is created, it can be parsed to extract the desired participant and data field IDs for download with datalad-ukbiobank.

Snippet of a bulk file:

1002532 20227_2_0
1002532 20227_3_0
1002532 20249_2_0
1002532 20249_3_0
1002532 20250_2_0
1002532 20250_3_0
1003339 20251_2_0
1003339 20251_3_0
1003339 20252_2_0
1003339 20252_3_0
1003339 20253_2_0
1003339 20253_3_0

Participant ID
These are unique to each application/project (e.g. 1002532).
Data field IDs
Indicate the data type (e.g. 20227 = NIFTI functional rest image), instance index (e.g. 2 = first imaging visit), and array index (e.g. 0). The instance index distinguishes data that were gathered at different times (sessions). The array index indicates if multiple pieces of data were gathered at the same time. These fields are explained in more detail in section 2.8 of the UK Biobank accessing data guide.
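
The two-column format shown in the snippet is simple to parse. The following is a minimal sketch, assuming every line holds exactly one participant ID and one data field ID separated by whitespace; the names `DataRecord`, `parse_record_id`, and `group_bulk_lines` are illustrative and not part of datalad-ukbiobank.

```python
from collections import defaultdict
from typing import NamedTuple


class DataRecord(NamedTuple):
    field: int     # data type, e.g. 20227
    instance: int  # visit/session index, e.g. 2
    array: int     # index among items gathered at the same time


def parse_record_id(record_id):
    """Split an ID such as '20227_2_0' into its three components."""
    field, instance, array = record_id.split("_")
    return DataRecord(int(field), int(instance), int(array))


def group_bulk_lines(lines):
    """Map each participant ID to its data field IDs, in file order."""
    records = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        participant, record_id = parts
        records[participant].append(record_id)
    return dict(records)


bulk_snippet = """\
1002532 20227_2_0
1002532 20227_3_0
1003339 20251_2_0
"""
by_participant = group_bulk_lines(bulk_snippet.splitlines())
first_record = parse_record_id(by_participant["1002532"][0])
```

The per-participant lists in `by_participant` have the same shape as the arguments that ukb-init expects: one participant ID followed by one or more data field IDs.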

ukbfetch

ukbfetch is a tool provided by the UK Biobank. It downloads specified bulk data, and requires authentication with a keyfile. See the ukbfetch documentation for specifics.

datalad-ukbiobank downloads data with the ukbfetch tool (which must be available in PATH).

The UK Biobank allows multiple downloads in parallel, but limits each application to 10 concurrent downloads.
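
When updating many subject datasets at once, one way to stay within this limit is to cap the number of workers. The sketch below is only an illustration: `update_all` and its `update_one` callback are hypothetical names, and the callback stands in for whatever runs ukb-update in a single subject dataset.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DOWNLOADS = 10  # per-application limit stated above


def update_all(subject_dirs, update_one, max_workers=MAX_CONCURRENT_DOWNLOADS):
    """Call update_one(path) for every subject, never exceeding the download limit."""
    # Clamp the requested worker count so a larger value cannot exceed the limit.
    workers = min(max_workers, MAX_CONCURRENT_DOWNLOADS)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(update_one, subject_dirs))
```

Threads are sufficient here because each worker would mostly wait on an external download process rather than compute in Python.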

Note

If you already have UK Biobank archives downloaded, and want to use datalad-ukbiobank without re-downloading everything, you can simply replace ukbfetch with a script to obtain the relevant files from where they are located.
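
The core of such a replacement script could look like the sketch below. It is a hedged illustration only: it assumes the existing files are named `<participant>_<record>.<extension>` (for example `1002532_20227_2_0.zip`), and the function name `fetch_from_local` is made up. How the script is invoked in place of ukbfetch, and which arguments it receives, is not covered here.

```python
import shutil
import tempfile
from pathlib import Path


def fetch_from_local(batch_lines, source_dir, dest_dir):
    """Copy already-downloaded files for each 'participant record' line.

    Assumption: files are named '<participant>_<record>.<extension>'.
    """
    source, dest = Path(source_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for line in batch_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        participant, record = parts
        for match in sorted(source.glob(f"{participant}_{record}.*")):
            shutil.copy2(match, dest / match.name)
            copied.append(match.name)
    return copied


# Small demonstration with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    archive.mkdir()
    (archive / "1002532_20227_2_0.zip").write_bytes(b"placeholder")
    copied = fetch_from_local(["1002532 20227_2_0"], archive, Path(tmp) / "out")
```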

Quick Start

Download Data

To download UK Biobank data for a subject, start by creating and initializing a new dataset. In this example, two data records with two instances (sessions) each are selected.

datalad create sub-1002532
cd sub-1002532
datalad ukb-init 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0

After initialization, run ukb-update to download data from the UK Biobank.

datalad ukb-update --keyfile <path_to_keyfile> --merge

This will create two branches:

  • incoming: the pristine archives downloaded from UK Biobank
  • incoming-native: the extracted files in the original layout provided by the UK Biobank

With ukb-update --merge, content is merged from incoming-native into the active branch automatically.

Get Updates

To update a single subject’s dataset, simply re-run ukb-update to re-download the data and register any potential changes. Running ukb-update will always re-download the data, regardless of whether there are upstream changes. Again, the --merge option will merge any updates into the active branch.

datalad ukb-update --keyfile <path_to_keyfile> --merge

Add or Remove Data Types

To add/remove data types, first re-initialize the dataset (with --force) to select the new data types. In this example, another data record with two instances (sessions) is added to the list of selected data records.

datalad ukb-init --force 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0 20250_2_0 20250_3_0
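
Because re-initialization takes the complete record list rather than a difference, it can help to assemble that list programmatically. A minimal sketch, with the hypothetical helper `reinit_command` building (but not running) the command shown above:

```python
def reinit_command(participant, current, add=(), remove=()):
    """Build the `ukb-init --force` argument list for an updated record selection."""
    removed = set(remove)
    # Keep the existing records that are not being removed, in their original order.
    records = [record for record in current if record not in removed]
    for record in add:
        if record not in records:  # avoid listing a record twice
            records.append(record)
    return ["datalad", "ukb-init", "--force", participant, *records]


command = reinit_command(
    "1002532",
    current=["20227_2_0", "20227_3_0", "20249_2_0", "20249_3_0"],
    add=["20250_2_0", "20250_3_0"],
)
```

Joining `command` with spaces reproduces the ukb-init call from this example.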

After re-initialization, run ukb-update to download the data.

datalad ukb-update --keyfile <path_to_keyfile> --merge

Structure in BIDS

To enable a BIDS(-like) layout of the data, re-initialize the dataset with the --bids option. This option can also be used when first initializing the dataset.

datalad ukb-init --force --bids 1002532 20227_2_0 20227_3_0 20249_2_0 20249_3_0 20250_2_0 20250_3_0

After re-initialization, run ukb-update to create an additional incoming-bids branch containing a BIDS(-like) conversion of the extracted downloads. If the --merge option is specified, it will merge the incoming-bids branch into the active branch.

datalad ukb-update --keyfile <path_to_keyfile> --merge --force

The BIDS conversion only happens if the re-downloaded data is different from the previously downloaded data. If there are no changes to the content on re-download, but you want to initiate the BIDS conversion, the --force option can be used.

Save Space

The --drop option can be used to avoid storing multiple copies of the same data. In this example, the downloaded archives are kept and the extracted files are dropped.

datalad ukb-update --keyfile <path_to_keyfile> --merge --force --drop extracted

It is also possible to keep the extracted content and drop the archives using --drop archives.

Command Line Reference

datalad ukb-init

Synopsis

datalad ukb-init [-h] [-f] [--bids] [-d DATASET] [--version] PARTICPANT-ID DATARECORD-ID [DATARECORD-ID ...]

Description

Initialize an existing dataset to track a UKBiobank participant

A batch file for the ‘ukbfetch’ tool will be generated and placed into the dataset. By selecting the relevant data records, raw and/or preprocessed data will be tracked.

After initialization the dataset will contain at least three branches:

  • ‘incoming’: to track the pristine ZIP files downloaded from UKB
  • ‘incoming-native’: to track individual files (some extracted from ZIP files)
  • ‘incoming-bids’ (only if --bids is enabled): to track individual files in a layout where file names conform to BIDS conventions
  • main branch: based off of incoming-native or incoming-bids (if enabled) with potential manual modifications applied

Examples

Initialize a dataset in the current directory:

% datalad ukb-init 5874415 20227_2_0 20249_2_0

Initialize a dataset in the current directory in BIDS layout:

% datalad ukb-init --bids 5874415 20227_2_0

Options

PARTICPANT-ID

UKBiobank participant ID to use for this dataset (note: these encoded IDs are unique to each application/project). Constraints: value must be a string

DATARECORD-ID

One or more data record identifiers. Constraints: value must be a string

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-f, --force

force (re-)initialization.

--bids

additionally maintain an incoming-bids branch with a BIDS-like organization.

-d DATASET, --dataset DATASET

specify the dataset to perform the initialization on. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by Michael Hanke <michael.hanke@gmail.com>.

datalad ukb-update

Synopsis

datalad ukb-update [-h] [-k PATH] [--merge] [-f] [--drop {extracted|archives}] [-d DATASET] [--version]

Description

Update an existing dataset of a UKbiobank participant

This command expects an ukb-init initialized DataLad dataset. The dataset may or may not have any downloaded content already.

Downloads are performed with the ukbfetch tool, which is expected to be available and executable.

Options

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-k PATH, --keyfile PATH

path to a file with an authentication key (ukbfetch -a …). If none is given, the configuration datalad.ukbiobank.keyfile is consulted. Constraints: value must be a string or value must be NONE

--merge

merge any updates into the active branch. If a BIDS layout is maintained in the dataset (incoming-bids branch) it will be merged into the active branch, the incoming-native branch otherwise.

-f, --force

update the incoming branch(es), even if (re-)download did not yield changed content (can be useful when restructuring setup has changed).

--drop {extracted|archives}

Drop file content to avoid storage duplication. ‘extracted’: drop all content of files extracted from downloaded archives to yield the most compact storage at the cost of partial re-extraction when accessing archive content; ‘archives’: keep extracted content, but drop archives instead. By default no content is dropped, duplicating archive content in extracted form. Constraints: value must be one of (‘extracted’, ‘archives’)

-d DATASET, --dataset DATASET

specify the dataset to perform the update on. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by Michael Hanke <michael.hanke@gmail.com>.

Python API

ukb_init(participant, records[, force, …]) Initialize an existing dataset to track a UKBiobank participant
ukb_update([keyfile, merge, force, drop, …]) Update an existing dataset of a UKbiobank participant
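
The following sketch shows how the two functions might be combined from Python to mirror the Quick Start. It rests on assumptions: the keyword names `bids` and `dataset` are assumed to mirror the command line options of the same name, and the wrapper `download_subject` with its `api` parameter is hypothetical, added only so the call sequence can be tried without DataLad installed.

```python
def download_subject(path, participant, records, keyfile, bids=False, api=None):
    """Create a dataset, initialize it for one participant, and download its data."""
    if api is None:
        # Requires datalad and datalad-ukbiobank to be installed.
        import datalad.api as api
    api.create(path=path)
    # `bids` and `dataset` are assumed keyword names, mirroring --bids and -d.
    api.ukb_init(participant=participant, records=records, bids=bids, dataset=path)
    return api.ukb_update(keyfile=keyfile, merge=True, dataset=path)
```

A call such as `download_subject("sub-1002532", "1002532", ["20227_2_0", "20227_3_0"], "<path_to_keyfile>")` would then correspond to the create, ukb-init, and ukb-update steps of the Quick Start.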

Acknowledgements

This development was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement VirtualBrainCloud (H2020-EU.3.1.5.3, grant no. 826421).