Configuration

DataLad uses the same configuration mechanism and syntax as Git itself. Consequently, datalad can be configured using the git config command. Both a global user configuration (typically at ~/.gitconfig) and a local, repository-specific configuration (.git/config) are inspected.

In addition, datalad supports a persistent dataset-specific configuration. This configuration is stored at .datalad/config in any dataset. As it is part of a dataset, settings stored there will also be in effect for any consumer of such a dataset. Both global and local settings on a particular machine always override configuration shipped with a dataset.
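
Because the mechanism is shared with Git, the ordinary git config command can read and write datalad settings at every scope. The following is a minimal sketch using only Python's standard library; the dataset path /tmp/mydataset is hypothetical, and datalad.log.level (listed further below) merely serves as an example variable.

    # Minimal sketch: reading datalad settings through `git config`.
    # Assumes `git` is on PATH; /tmp/mydataset is a hypothetical dataset path.
    import subprocess

    def git_config_get(name, cwd=None, config_file=None):
        """Return the value of a configuration variable, or None if unset."""
        cmd = ["git", "config"]
        if config_file:
            cmd += ["--file", config_file]  # e.g. a dataset's .datalad/config
        cmd += ["--get", name]
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        return result.stdout.strip() if result.returncode == 0 else None

    # Global user configuration (~/.gitconfig), when run outside a repository
    print(git_config_get("datalad.log.level"))

    # Inside a dataset: local repository configuration (.git/config) and global
    print(git_config_get("datalad.log.level", cwd="/tmp/mydataset"))

    # Configuration shipped with the dataset (.datalad/config)
    print(git_config_get("datalad.log.level",
                         config_file="/tmp/mydataset/.datalad/config"))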

All datalad-specific configuration variables carry the prefix datalad. (for example, datalad.log.level).

It is possible to override or amend the configuration using environment variables. Any variable whose name starts with DATALAD_ is made available as the corresponding datalad. configuration variable: every _ in the name is replaced with a dot, and all letters are converted to lower case (e.g., DATALAD_LOG_LEVEL becomes datalad.log.level). Values from environment variables take precedence over configuration file settings.
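
The naming rule can be illustrated with a few lines of Python; this sketch only demonstrates the documented mapping and precedence, it is not DataLad's internal implementation.

    # Sketch of the documented name mapping: DATALAD_LOG_LEVEL -> datalad.log.level.
    # Illustrative only; not DataLad's actual implementation.
    import os

    def env_to_config_name(env_name):
        """Translate a DATALAD_* environment variable name to its config name."""
        assert env_name.startswith("DATALAD_")
        return env_name.replace("_", ".").lower()

    def effective_value(env_name, file_value=None):
        """Environment variables take precedence over configuration file settings."""
        return os.environ.get(env_name, file_value)

    print(env_to_config_name("DATALAD_LOG_LEVEL"))          # datalad.log.level
    print(effective_value("DATALAD_LOG_LEVEL", "warning"))  # env value wins if set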

The following sections provide a (non-exhaustive) list of settings honored by datalad. They are categorized according to the scope they are typically associated with.

Global user configuration

datalad.crawl.init_direct
Default annex repository mode: Should the dataset be initialized in direct mode?
datalad.crawl.pipeline.housekeeping

Crawler pipeline housekeeping: Should the crawler tidy up datasets (git gc, repack, clean)?

[value must be convertible to type bool]

datalad.externals.nda.dbserver
NDA database server: Hostname of the database server
datalad.locations.cache
Cache directory: Where should datalad cache files? Default: ~/.cache/datalad
datalad.locations.system-plugins
System plugin directory: Where should datalad search for system plugins? Default: /etc/xdg/datalad/plugins
datalad.locations.user-plugins
User plugin directory: Where should datalad search for user plugins? Default: ~/.config/datalad/plugins

Local repository configuration

datalad.crawl.cache

Crawler download caching: Should the crawler cache downloaded files?

[value must be convertible to type bool]

datalad.crawl.dryrun

Crawler dry-run: Should the crawler perform a dry run, i.e., report planned actions without actually downloading or modifying anything?

[value must be convertible to type bool]
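
Several settings in this and the following sections are constrained to values "convertible to type bool". The helper below sketches such a conversion, assuming the Git-style spellings true/false, yes/no, on/off, and 1/0 are accepted; it is not DataLad's own constraint code.

    # Illustrative sketch of a "convertible to type bool" check, assuming the
    # Git-style spellings true/false, yes/no, on/off and 1/0. Not DataLad's
    # own constraint implementation.
    _TRUE = {"1", "yes", "true", "on"}
    _FALSE = {"0", "no", "false", "off"}

    def as_bool(value):
        value = str(value).strip().lower()
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        raise ValueError("value must be convertible to type bool: %r" % value)

    print(as_bool("yes"))  # True
    print(as_bool("0"))    # False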

Sticky dataset configuration

datalad.crawl.default_backend
Default annex backend: Content hashing method to be used by git-annex
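
Since sticky settings live in .datalad/config and travel with the dataset, they are typically written with git config --file and then committed. A minimal sketch, assuming a hypothetical dataset at /tmp/mydataset and using MD5E only as an example backend name:

    # Minimal sketch: make a sticky setting travel with the dataset by writing
    # it to .datalad/config and committing the change. Paths and the MD5E
    # backend name are examples only.
    import subprocess

    dataset = "/tmp/mydataset"

    # Write the setting into the dataset's .datalad/config
    subprocess.run(["git", "config", "--file", ".datalad/config",
                    "datalad.crawl.default_backend", "MD5E"],
                   cwd=dataset, check=True)

    # Commit it so every consumer of the dataset gets the same setting
    subprocess.run(["git", "add", ".datalad/config"], cwd=dataset, check=True)
    subprocess.run(["git", "commit", "-m", "Set default annex backend"],
                   cwd=dataset, check=True)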

Miscellaneous configuration

datalad.cmd.protocol
Specifies the protocol used by the Runner to note shell command or Python function call times, and allows for dry runs: “externals-time” for ExecutionTimeExternalsProtocol, “time” for ExecutionTimeProtocol, and “null” for NullProtocol. Any new DATALAD_CMD_PROTOCOL has to implement datalad.support.protocol.ProtocolInterface.
datalad.cmd.protocol.prefix
Sets a prefix to add before the command call times noted by DATALAD_CMD_PROTOCOL.
datalad.exc.str.tblimit
This flag is used by the datalad extract_tb function, which extracts and formats stack traces. It caps the number of pre-processed traceback entries to the value of DATALAD_EXC_STR_TBLIMIT.
datalad.log.level
Controls the verbosity of logs printed to stdout while running or debugging datalad commands.
datalad.log.name
Include the name of the log target in the log line.
datalad.log.names
Comma-separated list of names for which to print log lines.
datalad.log.namesre
Regular expression matching the names for which to print log lines.
datalad.log.outputs
Controls whether stdout and stderr of external command execution are logged in detail (at DEBUG level).
datalad.log.timestamp

Used to add a timestamp to datalad log lines. Default: False

[value must be convertible to type bool]

datalad.log.traceback
If this flag is set to “collide”, runs the TraceBack function with collide set to True, which replaces any common prefix between the current traceback log and the previous invocation with “...”.
datalad.metadata.create-aggregate-annex-limit
Annexing limit for aggregated metadata in a new dataset: git-annex largefiles expression (see https://git-annex.branchable.com/tips/largefiles; the given expression will be wrapped in parentheses). Default: largerthan=20kb
datalad.metadata.maxfieldsize

Maximum metadata field size: Metadata fields exceeding this size (in bytes/chars) are excluded from metadata extraction. Default: 100000

[value must be convertible to type ‘int’]

datalad.metadata.nativetype
Native dataset metadata scheme: Set this label to engage a particular metadata extraction parser
datalad.metadata.searchindex-documenttype

Type of search index documents: Labels of document types to include in a search index. Default: all

[value must be one of (‘all’, ‘datasets’, ‘files’)]

datalad.metadata.store-aggregate-content

Aggregated content metadata storage: If this flag is enabled, content metadata is aggregated into the superdataset to allow for discovery of individual files. If disabled, unique content metadata values are still aggregated to enable dataset discovery. Default: True

[value must be convertible to type bool]

datalad.repo.direct

Direct Mode for git-annex repositories: Set this flag to create annex repositories in direct mode by default. Default: False

[value must be convertible to type bool]

datalad.repo.version

git-annex repository version: Specifies the repository version for git-annex to be used by default. Default: 5

[value must be convertible to type ‘int’]

datalad.runtime.raiseonerror

Error behavior: Set this flag to cause DataLad to raise an exception on errors that would otherwise just have been logged. Default: False

[value must be convertible to type bool]

datalad.search.indexercachesize

Maximum cache size for the search index (per process), in MB: Actual memory consumption can be twice as high as this value (one process per CPU is used). Default: 256

[value must be convertible to type ‘int’]

datalad.tests.dataladremote

Binary flag to specify whether each annex repository created for tests should get the datalad special remote.

[value must be convertible to type bool]

datalad.tests.knownfailures.probe

Probes tests that are known to fail, to check whether they are actually still failing. Default: False

[value must be convertible to type bool]

datalad.tests.knownfailures.skip

Skips tests that are known to currently fail. Default: True

[value must be convertible to type bool]

datalad.tests.nonetwork

Skips network tests completely if this flag is set. Examples include tests for s3, git_repositories, openfmri, etc.

[value must be convertible to type bool]

datalad.tests.nonlo
Specifies network interfaces to bring down/up for testing. Currently used by Travis.
datalad.tests.noteardown

If this flag is set, teardown_package, which cleans up temporary files and directories created by tests, is not executed.

[value must be convertible to type bool]

datalad.tests.protocolremote

Binary flag to specify whether to test protocol interactions of the custom remote with annex.

[value must be convertible to type bool]

datalad.tests.runcmdline

Binary flag to specify whether shell testing using shunit2 is to be carried out.

[value must be convertible to type bool]

datalad.tests.ssh

Skips SSH tests if this flag is not set.

[value must be convertible to type bool]

datalad.tests.temp.dir
Create a temporary directory at the location specified by this flag. It is used by tests to create a temporary git directory while testing git-annex archives, etc.
datalad.tests.temp.fs
Specify the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation.
datalad.tests.temp.fssize
Specify the size of the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation.
datalad.tests.temp.keep

If this flag is set, the rmtemp function will not remove temporary files/directories created for testing.

[value must be convertible to type bool]

datalad.tests.ui.backend
Tests UI backend: Which UI backend to use. Default: tests-noninteractive
datalad.tests.usecassette
Specifies the location of the file used to record network transactions by the VCR module. Currently used when testing custom special remotes.