Concepts & Terms

The extension operates on a single UK Biobank subject, and acts as a wrapper around the ukbfetch tool to retrieve, ingest, restructure, and update data. The first command, ukb-init, initializes a dataset for a given UK Biobank participant and data field(s). The second command, ukb-update updates a dataset for the initialized subject/data fields.

Dataset Structure

datalad-ukbiobank allows for only one subject per dataset. Tailored or comprehensive superdatasets can then be created to link the desired subject datasets as subdatasets. This structure keeps each dataset lightweight and promotes parallel downloads.

Branches

Data can be viewed in different layouts by checking out layout-specific branches:

incoming
the unextracted archives, as downloaded from the UK Biobank (e.g. zip files)
incoming-native
the extracted files in the original layout provided by the UK Biobank
incoming-bids
if enabled, the extracted files converted to a BIDS(-like) layout

Bulk File

The required bulk file lists all participant IDs and data field IDs that are available for download for an approved application. These participant IDs and data field IDs are then used as input for the ukb-init command.

To generate a bulk file, follow the UK Biobank accessing data guide to first download the main dataset and then generate a bulk file. Section 3.2.2 of this document explains how to create modality specific bulk files (e.g. participant IDs for all those with T1 structural brain images).

Once a bulk file is created, it can be parsed to extract the desired participant and data field IDs for download with datalad-ukbiobank.

Snippet of a bulk file:

1002532 20227_2_0
1002532 20227_3_0
1002532 20249_2_0
1002532 20249_3_0
1002532 20250_2_0
1002532 20250_3_0
1003339 20251_2_0
1003339 20251_3_0
1003339 20252_2_0
1003339 20252_3_0
1003339 20253_2_0
1003339 20253_3_0
Participant ID
These are unique to each application/project (e.g. 1002532).
Data field IDs
Indicates the data type (e.g. 20227 = NIFTI functional rest image), instance index (e.g. 2 = first imaging visit), and array index (e.g. 0). The instance index distinguishes data that were gathered at different times (sessions). The array index indicates if multiple pieces of data were gathered at the same time. These fields are explained in more detail in section 2.8 of the UK Biobank accessing data guide

ukbfetch

ukbfetch is a tool provided by the UK Biobank. It downloads specified bulk data, and requires authentication with a keyfile. See the ukbfetch documentation for specifics.

datalad-ukbionbank downloads data with the ukbfetch tool (which must be available in PATH).

The UK Biobank allows multiple downloads in parallel, but limits each application to 10 concurrent downloads.

Note

If you already have UK Biobank archives downloaded, and want to use datalad-ukbiobank without re-downloading everything, you can simply replace ukbfetch with a script to obtain the relevant files from where they are located.