Configuration

DataLad uses the same configuration mechanism and syntax as Git itself. Consequently, datalad can be configured using the git config command. Both a global user configuration (typically at ~/.gitconfig) and a local, repository-specific configuration (.git/config) are inspected.

In addition, datalad supports a persistent dataset-specific configuration. This configuration is stored at .datalad/config in any dataset. As it is part of a dataset, settings stored there will also be in effect for any consumer of such a dataset. Both global and local settings on a particular machine always override configuration shipped with a dataset.
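
Because the mechanism is shared with Git, the ordinary git config command can read and write datalad settings at every scope. The following is a minimal sketch using only Python's standard library; the dataset path /tmp/mydataset is hypothetical, and datalad.log.level (listed further below) merely serves as an example variable.

    # Minimal sketch: reading datalad settings through `git config`.
    # Assumes `git` is on PATH; /tmp/mydataset is a hypothetical dataset path.
    import subprocess

    def git_config_get(name, cwd=None, config_file=None):
        """Return the value of a configuration variable, or None if unset."""
        cmd = ["git", "config"]
        if config_file:
            cmd += ["--file", config_file]  # e.g. a dataset's .datalad/config
        cmd += ["--get", name]
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        return result.stdout.strip() if result.returncode == 0 else None

    # Global user configuration (~/.gitconfig), when run outside a repository
    print(git_config_get("datalad.log.level"))

    # Inside a dataset: local repository configuration (.git/config) and global
    print(git_config_get("datalad.log.level", cwd="/tmp/mydataset"))

    # Configuration shipped with the dataset (.datalad/config)
    print(git_config_get("datalad.log.level",
                         config_file="/tmp/mydataset/.datalad/config"))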

All datalad-specific configuration variables carry the prefix datalad. (for example, datalad.log.level).

It is possible to override or amend the configuration using environment variables. Any variable whose name starts with DATALAD_ is made available as the corresponding datalad. configuration variable: every _ in the name is replaced with a dot, and all letters are converted to lower case (e.g., DATALAD_LOG_LEVEL becomes datalad.log.level). Values from environment variables take precedence over configuration file settings.
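
The naming rule can be illustrated with a few lines of Python; this sketch only demonstrates the documented mapping and precedence, it is not DataLad's internal implementation.

    # Sketch of the documented name mapping: DATALAD_LOG_LEVEL -> datalad.log.level.
    # Illustrative only; not DataLad's actual implementation.
    import os

    def env_to_config_name(env_name):
        """Translate a DATALAD_* environment variable name to its config name."""
        assert env_name.startswith("DATALAD_")
        return env_name.replace("_", ".").lower()

    def effective_value(env_name, file_value=None):
        """Environment variables take precedence over configuration file settings."""
        return os.environ.get(env_name, file_value)

    print(env_to_config_name("DATALAD_LOG_LEVEL"))          # datalad.log.level
    print(effective_value("DATALAD_LOG_LEVEL", "warning"))  # env value wins if set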

The following sections provide a (non-exhaustive) list of settings honored by datalad. They are categorized according to the scope they are typically associated with.

Global user configuration

datalad.crawl.init_direct
Default annex repository mode: Should the dataset be initialized in direct mode?
datalad.crawl.pipeline.housekeeping

Crawler pipeline housekeeping: Should the crawler tidy up datasets (git gc, repack, clean)?

[value must be convertible to type bool]

datalad.externals.nda.dbserver
NDA database server: Hostname of the database server
datalad.locations.cache
Cache directory: Where should datalad cache files? Default: ~/.cache/datalad
datalad.locations.system-plugins
System plugin directory: Where should datalad search for system plugins? Default: /etc/xdg/datalad/plugins
datalad.locations.user-plugins
User plugin directory: Where should datalad search for user plugins? Default: ~/.config/datalad/plugins

Local repository configuration

datalad.crawl.cache

Crawler download caching: Should the crawler cache downloaded files?

[value must be convertible to type bool]

datalad.crawl.dryrun

Crawler dry-run: Should the crawler perform a dry run, i.e., report planned actions without actually downloading or modifying anything?

[value must be convertible to type bool]
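
Several settings in this and the following sections are constrained to values "convertible to type bool". The helper below sketches such a conversion, assuming the Git-style spellings true/false, yes/no, on/off, and 1/0 are accepted; it is not DataLad's own constraint code.

    # Illustrative sketch of a "convertible to type bool" check, assuming the
    # Git-style spellings true/false, yes/no, on/off and 1/0. Not DataLad's
    # own constraint implementation.
    _TRUE = {"1", "yes", "true", "on"}
    _FALSE = {"0", "no", "false", "off"}

    def as_bool(value):
        value = str(value).strip().lower()
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        raise ValueError("value must be convertible to type bool: %r" % value)

    print(as_bool("yes"))  # True
    print(as_bool("0"))    # False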

Sticky dataset configuration

datalad.crawl.default_backend
Default annex backend: Content hashing method to be used by git-annex
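
Since sticky settings live in .datalad/config and travel with the dataset, they are typically written with git config --file and then committed. A minimal sketch, assuming a hypothetical dataset at /tmp/mydataset and using MD5E only as an example backend name:

    # Minimal sketch: make a sticky setting travel with the dataset by writing
    # it to .datalad/config and committing the change. Paths and the MD5E
    # backend name are examples only.
    import subprocess

    dataset = "/tmp/mydataset"

    # Write the setting into the dataset's .datalad/config
    subprocess.run(["git", "config", "--file", ".datalad/config",
                    "datalad.crawl.default_backend", "MD5E"],
                   cwd=dataset, check=True)

    # Commit it so every consumer of the dataset gets the same setting
    subprocess.run(["git", "add", ".datalad/config"], cwd=dataset, check=True)
    subprocess.run(["git", "commit", "-m", "Set default annex backend"],
                   cwd=dataset, check=True)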

Miscellaneous configuration

datalad.cmd.protocol
Specifies the protocol used by the Runner to note shell command or Python function call times, and allows for dry runs: “externals-time” for ExecutionTimeExternalsProtocol, “time” for ExecutionTimeProtocol, and “null” for NullProtocol. Any new DATALAD_CMD_PROTOCOL has to implement datalad.support.protocol.ProtocolInterface.
datalad.cmd.protocol.prefix
Sets a prefix to add before the command call times noted by DATALAD_CMD_PROTOCOL.
datalad.exc.str.tblimit
This flag is used by the datalad extract_tb function, which extracts and formats stack traces. It caps the number of pre-processed traceback entries to the value of DATALAD_EXC_STR_TBLIMIT.
datalad.log.level
Controls the verbosity of logs printed to stdout while running or debugging datalad commands.
datalad.log.name
Include the name of the log target in the log line.
datalad.log.names
Comma-separated list of names for which to print log lines.
datalad.log.namesre
Regular expression matching the names for which to print log lines.
datalad.log.outputs
Controls whether stdout and stderr of external command execution are logged in detail (at DEBUG level).
datalad.log.timestamp

Used to add a timestamp to datalad log lines. Default: False

[value must be convertible to type bool]

datalad.log.traceback
If this flag is set to “collide”, runs the TraceBack function with collide set to True, which replaces any common prefix between the current traceback log and the previous invocation with “...”.
datalad.metadata.create-aggregate-annex-limit
Annexing limit for aggregated metadata in a new dataset: git-annex largefiles expression (see https://git-annex.branchable.com/tips/largefiles; the given expression will be wrapped in parentheses). Default: largerthan=20kb
datalad.metadata.maxfieldsize

Maximum metadata field size: Metadata fields exceeding this size (in bytes/chars) are excluded from metadata extraction. Default: 100000

[value must be convertible to type ‘int’]

datalad.metadata.nativetype
Native dataset metadata scheme: Set this label to engage a particular metadata extraction parser
datalad.metadata.searchindex-documenttype

Type of search index documents: Labels of document types to include in a search index. Default: all

[value must be one of (‘all’, ‘datasets’, ‘files’)]

datalad.metadata.store-aggregate-content

Aggregated content metadata storage: If this flag is enabled, content metadata is aggregated into the superdataset to allow for discovery of individual files. If disabled, unique content metadata values are still aggregated to enable dataset discovery. Default: True

[value must be convertible to type bool]

datalad.repo.direct

Direct Mode for git-annex repositories: Set this flag to create annex repositories in direct mode by default. Default: False

[value must be convertible to type bool]

datalad.repo.version

git-annex repository version: Specifies the repository version for git-annex to be used by default. Default: 5

[value must be convertible to type ‘int’]

datalad.runtime.raiseonerror

Error behavior: Set this flag to cause DataLad to raise an exception on errors that would otherwise just have been logged. Default: False

[value must be convertible to type bool]

datalad.search.indexercachesize

Maximum cache size for the search index (per process), in MB: Actual memory consumption can be twice as high as this value (one process per CPU is used). Default: 256

[value must be convertible to type ‘int’]

datalad.tests.dataladremote

Binary flag to specify whether each annex repository created for tests should get the datalad special remote.

[value must be convertible to type bool]

datalad.tests.knownfailures.probe

Probes tests that are known to fail, to check whether they are actually still failing. Default: False

[value must be convertible to type bool]

datalad.tests.knownfailures.skip

Skips tests that are known to currently fail. Default: True

[value must be convertible to type bool]

datalad.tests.nonetwork

Skips network tests completely if this flag is set. Examples include tests for s3, git_repositories, openfmri, etc.

[value must be convertible to type bool]

datalad.tests.nonlo
Specifies network interfaces to bring down/up for testing. Currently used by Travis.
datalad.tests.noteardown

If this flag is set, teardown_package, which cleans up temporary files and directories created by tests, is not executed.

[value must be convertible to type bool]

datalad.tests.protocolremote

Binary flag to specify whether to test protocol interactions of the custom remote with annex.

[value must be convertible to type bool]

datalad.tests.runcmdline

Binary flag to specify whether shell testing using shunit2 is to be carried out.

[value must be convertible to type bool]

datalad.tests.ssh

Skips SSH tests if this flag is not set.

[value must be convertible to type bool]

datalad.tests.temp.dir
Create a temporary directory at the location specified by this flag. It is used by tests to create a temporary git directory while testing git-annex archives, etc.
datalad.tests.temp.fs
Specify the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation.
datalad.tests.temp.fssize
Specify the size of the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation.
datalad.tests.temp.keep

If this flag is set, the rmtemp function will not remove temporary files/directories created for testing.

[value must be convertible to type bool]

datalad.tests.ui.backend
Tests UI backend: Which UI backend to use. Default: tests-noninteractive
datalad.tests.usecassette
Specifies the location of the file used to record network transactions by the VCR module. Currently used when testing custom special remotes.