DataLad NEXT extension

This DataLad extension can be thought of as a staging area for additional functionality, or for improved performance and user experience. Unlike other topical or more experimental extensions, the focus here is on functionality with broad applicability. This extension is a suitable dependency for other software packages that intend to build on this improved set of functionality.

Installation and usage

Install from PyPI or GitHub like any other Python package:

# create and enter a new virtual environment (optional)
$ virtualenv --python=python3 ~/env/dl-next
$ . ~/env/dl-next/bin/activate
# install from PyPI
$ python -m pip install datalad-next

Once installed, additional commands provided by this extension are immediately available. However, in order to fully benefit from all improvements, the extension has to be enabled for auto-loading by executing:

git config --global --add datalad.extensions.load next

Doing so will enable the extension to also alter the behavior of the core DataLad package and its commands.
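
Once installed, the additional commands are also available from DataLad's Python API. A minimal sketch (assuming both datalad and datalad-next are installed; command names match the API overview below):

from datalad.api import (
    create_sibling_webdav,
    credentials,
    download,
    ls_file_collection,
    next_status,
    tree,
)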

Functionality provided by DataLad NEXT

The following table of contents offers entry points to the main components provided by this extension. The project README offers a more detailed summary in a different format.

High-level API commands

create_sibling_webdav(url, *[, dataset, ...])

Create a sibling(-tandem) on a WebDAV server

credentials([action, spec, name, prompt, ...])

Credential management and query

download(spec, *[, dataset, force, ...])

Download from URLs

ls_file_collection(type, collection, *[, hash])

Report information on files in a collection

next_status(*[, dataset, untracked, ...])

Report on the (modification) status of a dataset

tree([path, depth, recursive, ...])

Visualize directory and dataset hierarchies

Command line reference

datalad create-sibling-webdav

Synopsis
datalad create-sibling-webdav [-h] [-d DATASET] [-s NAME] [--storage-name NAME] [--mode MODE] [--credential NAME] [--existing EXISTING] [-r] [-R LEVELS] [--version] URL
Description

Create a sibling(-tandem) on a WebDAV server

WebDAV is a standard HTTP protocol extension for placing files on a server that is supported by a number of commercial storage services (e.g. 4shared.com, box.com), but also instances of cloud-storage solutions like Nextcloud or ownCloud. These software packages are also the basis for some institutional or public cloud storage solutions, such as EUDAT B2DROP.

For basic usage, only the URL with the desired dataset location on a WebDAV server needs to be specified for creating a sibling. However, the sibling setup can be flexibly customized (no storage sibling, or only a storage sibling, multi-version storage, or human-browsable single-version storage).

This command does not check for conflicting content on the WebDAV server!

When creating siblings recursively for a dataset hierarchy, subdataset exports are placed at their corresponding relative paths underneath the root location on the WebDAV server.

Collaboration on WebDAV siblings

The primary use case for WebDAV siblings is dataset deposition, where only one site is uploading dataset and file content updates. For collaborative workflows with multiple contributors, please make sure to consult the documentation on the underlying datalad-annex:: Git remote helper for advice on appropriate setups: http://docs.datalad.org/projects/next/

Git-annex implementation details

Storage siblings are presently configured to NOT be enabled automatically on cloning a dataset. Due to a limitation of git-annex, this would initially fail (missing credentials). Instead, an explicit datalad siblings enable --name <storage-sibling-name> command must be executed after cloning. If necessary, it will prompt for credentials.

This command does not (and likely will not) support embedding credentials in the repository (see embedcreds option of the git-annex webdav special remote; https://git-annex.branchable.com/special_remotes/webdav), because such credential copies would need to be updated, whenever they change or expire. Instead, credentials are retrieved from DataLad's credential system. In many cases, credentials are determined automatically, based on the HTTP authentication realm identified by a WebDAV server.

This command does not (yet) support setting up encrypted remotes, neither for the storage sibling nor for the regular Git remote. However, adding support for it is primarily a matter of extending the API of this command, and passing the respective options on to the underlying git-annex setup.

This command does not support setting up chunking for webdav storage siblings (https://git-annex.branchable.com/chunking).

Examples

Create a WebDAV sibling tandem for storage of a dataset's file content and revision history. A user will be prompted for any required credentials, if they are not yet known:

% datalad create-sibling-webdav "https://webdav.example.com/myds"

Such a dataset can be cloned by DataLad via a specially crafted URL. Again, credentials are automatically determined, or a user is prompted to enter them:

% datalad clone "datalad-annex::?type=webdav&encryption=none&url=https://webdav.example.com/myds"

A sibling can also be created with a human-readable file tree, suitable for data exchange with non-DataLad users, but only able to host a single version of each file:

% datalad create-sibling-webdav --mode filetree "https://example.com/browsable"

Cloning such dataset siblings is possible via a convenience URL:

% datalad clone "webdavs://example.com/browsable"

In all cases, the storage sibling needs to be explicitly enabled prior to file content retrieval:

% datalad siblings enable --name example.com-storage
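
The same operations are available from the Python API. A hedged sketch (keyword names inferred from the CLI options below; the URL is a placeholder):

from datalad.api import create_sibling_webdav

# create the default sibling tandem ('annex' mode) for the dataset
# in the current directory
create_sibling_webdav(
    'https://webdav.example.com/myds',
    name='example.com',
    mode='annex',
)
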
Options
URL

URL identifying the sibling root on the target WebDAV server.

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

specify the dataset to process. If no dataset is given, an attempt is made to identify the dataset based on the current working directory.

-s NAME, --name NAME

name of the sibling. If none is given, the hostname-part of the WebDAV URL will be used. With RECURSIVE, the same name will be used to label all the subdatasets' siblings.

--storage-name NAME

name of the storage sibling (git-annex special remote). Must not be identical to the sibling name. If not specified, defaults to the sibling name plus '-storage' suffix. If only a storage sibling is created, this setting is ignored, and the primary sibling name is used.

--mode MODE

Siblings can be created in various modes: full-featured sibling tandem, one for a dataset's Git history and one storage sibling to host any number of file versions ('annex'). A single sibling for the Git history only ('git-only'). A single annex sibling for multi-version file storage only ('annex-only'). As an alternative to the standard (annex) storage sibling setup that is capable of storing any number of historical file versions using a content hash layout ('annex'|'annex-only'), the 'filetree' mode can be used. This mode offers a human-readable data organization on the WebDAV remote that matches the file tree of a dataset (branch). However, it can, consequently, only store a single version of each file in the file tree. This mode is useful for depositing a single dataset snapshot for consumption without DataLad. The 'filetree' mode nevertheless allows for cloning such a single-version dataset, because the full dataset history can still be pushed to the WebDAV server. Git history hosting can also be turned off for this setup ('filetree-only'). When both a storage sibling and a regular sibling are created together, a publication dependency on the storage sibling is configured for the regular sibling in the local dataset clone. [Default: 'annex']

--credential NAME

name of the credential providing a user/password credential to be used for authorization. The credential can be supplied via configuration setting 'datalad.credential.<name>.user|secret', or environment variable DATALAD_CREDENTIAL_<NAME>_USER|SECRET, or will be queried from the active credential store using the provided name. If none is provided, the last-used credential for the authentication realm associated with the WebDAV URL will be used. Only if a credential name was given will it be encoded in the URL of the created WebDAV Git remote; otherwise, credential auto-discovery will be performed on each remote access.

--existing EXISTING

action to perform, if a (storage) sibling is already configured under the given name. In this case, sibling creation can be skipped ('skip'), the sibling can be (re-)configured ('reconfigure') in the dataset, or the command can be instructed to fail ('error'). [Default: 'error']

-r, --recursive

if set, recurse into potential subdatasets.

-R LEVELS, --recursion-limit LEVELS

limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type 'int' or value must be NONE

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad credentials

Synopsis
datalad credentials [-h] [--prompt PROMPT] [-d DATASET] [--version] [action] [[name] [:]property[=value] ...]
Description

Credential management and query

This command enables inspection and manipulation of credentials used throughout DataLad.

The command provides four basic actions:

QUERY

When executed without any property specification, all known credentials with all their properties will be yielded. Please note that this may not include credentials that comprise only a secret and no other properties, or legacy credentials for which no trace in the configuration can be found. Therefore, the query results are not guaranteed to contain all credentials ever configured by DataLad.

When additional property/value pairs are specified, only credentials that have matching values for all given properties will be reported. This can be used, for example, to discover all suitable credentials for a specific "realm", if credentials were annotated with such information.

SET

This is the companion to 'get', and can be used to store the properties and secret of a credential. Importantly, and in contrast to a 'get' operation, given properties with no values indicate a removal request. Any matching properties on record will be removed. If a credential is to be stored for which no secret is on record yet, an interactive session will prompt a user for a manual secret entry.

Only changed properties will be contained in the result record.

The appearance of the interactive secret entry can be configured with the two settings datalad.credentials.repeat-secret-entry and datalad.credentials.hidden-secret-entry.

REMOVE

This action will remove any secret and properties associated with a credential identified by its name.

GET (plumbing operation)

This is a read-only action that will never store (updates of) credential properties or secrets. Given properties will amend/overwrite those already on record. When properties with no value are given, and also no value for the respective properties is on record yet, their value will be requested interactively, if a --prompt text was provided too. This can be used to ensure a complete credential record, comprising any number of properties.

Details on credentials

A credential comprises any number of properties, plus exactly one secret. There are no constraints on the format of property values or the secret, as long as they are encoded as a string.

Credential properties are normally stored as configuration settings in a user's configuration ('global' scope) using the naming scheme:

datalad.credential.<name>.<property>

Therefore, both credential name and credential property name must be syntax-compliant with Git configuration items. For property names this means only alphanumeric characters and dashes. For credential names virtually no naming restrictions exist (only null-byte and newline are forbidden). However, when naming credentials it is recommended to use simple names in order to enable convenient one-off credential overrides by specifying DataLad configuration items via their environment variable counterparts (see the documentation of the configuration command for details). In short, avoid underscores and special characters other than '.' and '-'.
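
A hedged illustration of this naming scheme and its environment-variable counterpart (the credential name 'mycred' and all values are hypothetical):

import os

# the equivalent configuration items would be:
#   datalad.credential.mycred.user = alice
#   datalad.credential.mycred.realm = https://webdav.example.com/
# a one-off override for a single process via environment variables:
os.environ['DATALAD_CREDENTIAL_MYCRED_USER'] = 'alice'
os.environ['DATALAD_CREDENTIAL_MYCRED_SECRET'] = 'not-a-real-secret'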

While there are no constraints on the number and nature of credential properties, a few particular properties are recognized and used for particular purposes:

  • 'secret': always refers to the single secret of a credential

  • 'type': identifies the type of a credential. With each standard type, a list of mandatory properties is associated (see below)

  • 'last-used': is an ISO 8601 format time stamp that indicates the last (successful) usage of a credential

Standard credential types and properties

The following standard credential types are recognized, and their mandatory fields with their standard names will be automatically included in a 'get' report.

  • 'user_password': with properties 'user', and the password as secret

  • 'token': only comprising the token as secret

  • 'aws-s3': with properties 'key-id', 'session', 'expiration', and the secret_id as the credential secret

Legacy support

DataLad credentials not configured via this command may not be fully discoverable (i.e., including all their properties). Discovery of such legacy credentials can be assisted by specifying a dedicated 'type' property.

Examples

Report all discoverable credentials:

% datalad credentials

Set a new credential mycred & input its secret interactively:

% datalad credentials set mycred

Remove a credential's type property:

% datalad credentials set mycred :type

Get all information on a specific credential in a structured record:

% datalad -f json credentials get mycred

Upgrade a legacy credential by annotating it with a 'type' property:

% datalad credentials set legacycred type=user_password

Set a new credential of type user_password, with a given user property, and input its secret interactively:

% datalad credentials set mycred type=user_password user=<username>

Obtain a (possibly yet undefined) credential with a minimum set of properties. All missing properties and secret will be prompted for, no information will be stored! This is mostly useful for ensuring availability of an appropriate credential in an application context:

% datalad credentials --prompt 'can I haz info plz?' get newcred :newproperty
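
Equivalent operations are available from the Python API. A hedged sketch (parameter names per the signature in the API overview; 'mycred' is a placeholder):

from datalad.api import credentials

# report all discoverable credentials
credentials('query')

# obtain a specific credential as a structured record
credentials('get', name='mycred')
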
Options
action

which action to perform. [Default: 'query']

[name] [:]property[=value]

specification of a credential name and credential properties. Properties are either given as name/value pairs or as a property name prefixed by a colon. Properties prefixed with a colon indicate a property to be deleted (action 'set'), or a property to be entered interactively, when no value is set yet, and a prompt text is given (action 'get'). All property names are case-insensitive, must start with a letter or a digit, and may only contain '-' apart from these characters.

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

--prompt PROMPT

message to display when entry of missing credential properties is required for action 'get'. This can be used to present information on the nature of a credential and for instructions on how to obtain a credential.

-d DATASET, --dataset DATASET

specify a dataset whose configuration to inspect rather than the global (user) settings.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad download

Synopsis
datalad download [-h] [-d DATASET] [--force {overwrite-existing}] [--credential NAME] [--hash ALGORITHM] [--version] <path>|<url>|<url-path-pair> [<path>|<url>|<url-path-pair> ...]
Description

Download from URLs

This command is the front-end to an extensible framework for performing downloads from a variety of URL schemes. Built-in support for the schemes 'http', 'https', 'file', and 'ssh' is provided. Extension packages may add additional support.

In contrast to other downloader tools, this command integrates with the DataLad credential management and is able to auto-discover credentials. If no credential is available, it automatically prompts for them, and offers to store them for reuse after a successful authentication.

Simultaneous hashing (checksumming) of downloaded content is supported with user-specified algorithms.

The command can process any number of downloads (serially). It can read download specifications from (command line) arguments, files, or STDIN. It can deposit downloads to individual files, or stream to STDOUT.

Implementation and extensibility

Each URL scheme is processed by a dedicated handler. Additional schemes can be supported by sub-classing datalad_next.url_operations.UrlOperations and implementing the download() method. Extension packages can register new handlers by patching them into the datalad_next.download._urlscheme_handlers registry dict.
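
A rough sketch of such a handler is shown below. The scheme name and class are hypothetical, and the download() signature is abbreviated; consult datalad_next.url_operations.UrlOperations for the actual interface:

from datalad_next.url_operations import UrlOperations

class MyProtoUrlOperations(UrlOperations):
    """Handler for a hypothetical 'myproto://' URL scheme"""

    def download(self, from_url, to_path, **kwargs):
        # retrieve the content identified by from_url and deposit it
        # at to_path (or stream it when no dedicated path is given)
        raise NotImplementedError

# an extension package could then register the handler for its scheme
# by patching it into the registry dict named above, e.g.:
#   _urlscheme_handlers['myproto'] = MyProtoUrlOperations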

Examples

Download webpage to "myfile.txt":

% datalad download "http://example.com myfile.txt"

Read download specification from STDIN (e.g. JSON-lines):

% datalad download -

Simultaneously hash download, hexdigest reported in result record:

% datalad download --hash sha256 http://example.com/data.xml

Download from SSH server:

% datalad download "ssh://example.com/home/user/data.xml"

Stream a download to STDOUT:

% datalad -f disabled download "http://example.com -"
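
The Python API accepts the same kinds of specifications. A hedged sketch, assuming that a dict-based URL-to-path mapping and a list of hash algorithm names are accepted (URLs and paths are placeholders):

from datalad.api import download

# download a single URL to a dedicated target path
download({'https://example.com/data.xml': 'data.xml'})

# compute a hash while downloading; the hexdigest is reported
# in the result record
download('https://example.com/data.xml', hash=['sha256'])
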
Options
<path>|<url>|<url-path-pair>

Download sources and targets can be given in a variety of formats: as a URL, or as a URL-path-pair that is mapping a source URL to a dedicated download target path. Any number of URLs or URL-path-pairs can be provided, either as an argument list, or read from a file (one item per line). Such a specification input file can be given as a path to an existing file (as a single value, not as part of a URL-path-pair). When the special path identifier '-' is used, the download is written to STDOUT. A specification can also be read in JSON-lines encoding (each line being a string with a URL or an object mapping a URL-string to a path-string).

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Dataset to be used as a configuration source. Beyond reading configuration items, this command does not interact with the dataset.

--force {overwrite-existing}

By default, a target path for a download must not exist yet. 'overwrite-existing' disables this check.

--credential NAME

name of a credential to be used for authorization. If no credential is identified, the last-used credential for the authentication realm associated with the download target will be used. If there is no credential available yet, it will be prompted for. Once used successfully, a prompt to save such a new credential will be presented.

--hash ALGORITHM

Name of a hashing algorithm supported by the Python 'hashlib' module, e.g. 'md5' or 'sha256'. This option can be given more than once.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad ls-file-collection

Synopsis
datalad ls-file-collection [-h] [--hash ALGORITHM] [--version] {directory,tarfile,zipfile,gittree,gitworktree,annexworktree} ID/LOCATION
Description

Report information on files in a collection

This is a utility that can be used to query information on files in different file collections. The type of information reported varies across collection types. However, each result at minimum contains some kind of identifier for the collection ('collection' property), and an identifier for the respective collection item ('item' property). Each result also contains a type property that indicates the particular type of item that is being reported on. In most cases this will be 'file', but other categories like 'symlink' or 'directory' are recognized too.

If a collection type provides file-access, this command can compute one or more hashes (checksums) for any file in a collection.

Supported file collection types are:

directory

Reports on the content of a given directory (non-recursively). The collection identifier is the path of the directory. Item identifiers are the names of items within that directory. Standard properties like size, mtime, or link_target are included in the report.

gittree

Reports on the content of a Git "tree-ish". The collection identifier is that tree-ish. The command must be executed inside a Git repository. If the working directory for the command is not the repository root (in case of a non-bare repository), the report is constrained to items underneath the working directory. Item identifiers are the relative paths of items within that working directory. Reported properties include gitsha and gittype; note that the gitsha is not equivalent to a SHA1 hash of a file's content, but is the SHA-type blob identifier as reported and used by Git. Reporting of content hashes beyond the gitsha is presently not supported.

gitworktree

Reports on all tracked and untracked content of a Git repository's work tree. The collection identifier is a path of a directory in a Git repository (which can, but needs not be, its root). Item identifiers are the relative paths of items within that directory. Reported properties include gitsha and gittype; note that the gitsha is not equivalent to a SHA1 hash of a file's content, but is the SHA-type blob identifier as reported and used by Git.

tarfile

Reports on members of a TAR archive. The collection identifier is the path of the TAR file. Item identifiers are the relative paths of archive members within the archive. Reported properties are similar to the directory collection type.

Examples

Report on the content of a directory:

% datalad -f json ls-file-collection directory /tmp

Report on the content of a TAR archive with MD5 and SHA1 file hashes:

% datalad -f json ls-file-collection --hash md5 --hash sha1 tarfile myarchive.tar.gz

Register URLs for files in a directory that is also reachable via HTTP. This uses ls-file-collection for listing files and computing MD5 hashes, then uses jq to filter and transform the output (keeping only file records, in a JSON array), and passes them to addurls, which generates annex keys/files and assigns URLs. When the command finishes, the dataset contains no data, but can retrieve the files after confirming their availability (i.e., via git annex fsck):

% datalad -f json ls-file-collection directory wwwdir --hash md5 \
  | jq '. | select(.type == "file")' \
  | jq --slurp . \
  | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'https://example.com/{item}'
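
From Python, the same report is available as structured records. A hedged sketch (parameter names per the signature in the API overview; result property names as described above):

from datalad.api import ls_file_collection

# report on a directory, computing an MD5 hash for each file
for item in ls_file_collection('directory', '/tmp', hash=['md5']):
    print(item['item'], item['type'], item.get('hash-md5'))
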
Options
{directory,tarfile,zipfile,gittree,gitworktree,annexworktree}

Name of the type of file collection to report on.

ID/LOCATION

identifier or location of the file collection to report on. Depending on the type of collection to process, the specific nature of this parameter can be different. A common identifier for a file collection is a path (to a directory, to an archive), but might also be a URL. See the documentation for details on supported collection types.

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

--hash ALGORITHM

One or more names of algorithms to be used for reporting file hashes. They must be supported by the Python 'hashlib' module, e.g. 'md5' or 'sha256'. Reporting file hashes typically implies retrieving/reading file content. This processing may also enable reporting of additional properties that may otherwise not be readily available. This option can be given more than once.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad next-status

Synopsis
datalad next-status [-h] [-d DATASET] [--untracked {no,whole-dir,no-empty-dir,normal,all}] [-r [{no,repository,datasets,mono}]] [-e {no,commit,full}] [--version]
Description

Report on the (modification) status of a dataset

NOTE

This is a preview of a command implementation aiming to replace the DataLad status command.

For now, expect anything here to change again.

This command provides a report that is roughly identical to that of git status. Running with default parameters yields a report that should look familiar to Git and DataLad users alike, and contain the same information as offered by git status.

The main differences to git status are:

  • Support for recursion into submodules. git status does that too, but the report is limited to the global state of an entire submodule, whereas this command can issue detailed reports on changes inside a submodule (at any nesting depth).

  • Support for directory-constrained reporting. Much like git status limits its report to a single repository, this command can optionally limit its report to a single directory and its direct children. In this report, subdirectories are considered containers (much like submodules), and a change summary is provided for them.

  • Support for a "mono" (monolithic repository) report. Unlike a standard recursion into submodules that checks each of them for changes with respect to the HEAD commit of its worktree, this report compares a submodule with respect to the state recorded in its parent repository. This provides an equally comprehensive status report from the point of view of a queried repository, but does not include a dedicated item on the global state of a submodule. This makes a nested hierarchy of repositories appear like a single (mono) repository (see the sketch after this list).

  • Support for "adjusted mode" git-annex repositories. These utilize a managed branch that is repeatedly rewritten, hence is not suitable for tracking within a parent repository. Instead, the underlying "corresponding branch" is used, which persistently contains the equivalent content in an un-adjusted form. This command detects this condition and automatically checks a repository's state against the corresponding branch state.
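
As a hedged illustration of these reporting modes via the Python API (parameter names per the signature in the API overview; accepted values per the option reference below):

from datalad.api import next_status

# compare submodules against the state recorded in their parent,
# treating the nested hierarchy like a single (mono) repository
for res in next_status(recursive='mono', untracked='normal'):
    print(res)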

Presently missing/planned features

  • There is no support for specifying paths (or pathspecs) for constraining the operation to specific dataset parts. This will be added in the future.

  • There is no reporting of git-annex properties, such as tracked file size. It is undetermined whether this will be added in the future. However, even without a dedicated switch, this command has support for datasets (and their submodules) in git-annex's "adjusted mode".

Differences to the ``status`` command implementation prior to DataLad v2

  • Like git status this implementation reports on dataset modification, whereas the previous status also provided a listing of unchanged dataset content. This is no longer done. Equivalent functionality for listing dataset content is provided by the ls_file_collection command.

  • The implementation is substantially faster. Depending on the context the speed-up is typically somewhere between 2x and 100x.

  • The implementation does not suffer from the limitation regarding type-change detection.

  • Python and CLI API of the command use uniform parameter validation.

Examples

Options
-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-d DATASET, --dataset DATASET

Dataset to be used as a configuration source. Beyond reading configuration items, this command does not interact with the dataset.

--untracked {no,whole-dir,no-empty-dir,normal,all}

Determine how untracked content is considered and reported when comparing a revision to the state of the working tree. 'no': no untracked content is considered as a change; 'normal': untracked files and entire untracked directories are reported as such; 'all': report individual files even in fully untracked directories. In addition to these git-status modes, 'whole-dir' (like normal, but include empty directories), and 'no-empty-dir' (alias for 'normal') are understood. [Default: 'normal']

-r [{no,repository,datasets,mono}], --recursive [{no,repository,datasets,mono}]

Mode of recursion for status reporting. With 'no', the report is restricted to a single directory and its direct children. With 'repository', the report comprises all repository content underneath the current working directory or the root of a given dataset, but is limited to items directly contained in that repository. With 'datasets', the report also comprises any content in any subdatasets. Each subdataset is evaluated against its respective HEAD commit. With 'mono', a report similar to 'datasets' is generated, but any subdataset is evaluated with respect to the state recorded in its parent repository. In contrast to the 'datasets' mode, no report items on a joint submodule are generated. If no particular value is given with this option, the 'datasets' mode is selected. [Default: 'repository']

-e {no,commit,full}, --eval-subdataset-state {no,commit,full}

Evaluation of subdataset state (modified or untracked content) can be expensive for deep dataset hierarchies, as subdatasets have to be tested recursively for uncommitted modifications. Setting this option to 'no' or 'commit' can substantially boost performance by limiting what is being tested. With 'no', no state is evaluated and subdatasets are not investigated for modifications. With 'commit', only a discrepancy between the HEAD commit gitsha of a subdataset and the gitsha recorded in the superdataset's record is evaluated. With 'full', any other modifications are considered too. [Default: 'full']

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

datalad tree

Synopsis
datalad tree [-h] [-L DEPTH] [-r] [-R LEVELS] [--include-files] [--include-hidden] [--version] [path]
Description

Visualize directory and dataset hierarchies

This command mimics the UNIX/MS-DOS 'tree' utility to generate and display a directory tree, with DataLad-specific enhancements.

It can serve the following purposes:

  1. Glorified 'tree' command

  2. Dataset discovery

  3. Programmatic directory traversal

Glorified 'tree' command

The rendered command output uses 'tree'-style visualization:

/tmp/mydir
├── [DS~0] ds_A/
│   └── [DS~1] subds_A/
└── [DS~0] ds_B/
    ├── dir_B/
    │   ├── file.txt
    │   ├── subdir_B/
    │   └── [DS~1] subds_B0/
    └── [DS~1] (not installed) subds_B1/

5 datasets, 2 directories, 1 file

Dataset paths are prefixed by a marker indicating subdataset hierarchy level, like [DS~1]. This is the absolute subdataset level, meaning it may also take into account superdatasets located above the tree root and thus not included in the output. If a subdataset is registered but not installed (such as after a non-recursive datalad clone), it will be prefixed by (not installed). Only DataLad datasets are considered, not pure git/git-annex repositories.

The 'report line' at the bottom of the output shows the count of displayed datasets, in addition to the count of directories and files. In this context, datasets and directories are mutually exclusive categories.

By default, only directories (no files) are included in the tree, and hidden directories are skipped. Both behaviours can be changed using command options.

Symbolic links are always followed. This means that a symlink pointing to a directory is traversed and counted as a directory (unless it potentially creates a loop in the tree).

Dataset discovery

Using the --recursive or --recursion-limit option, this command generates the layout of dataset hierarchies based on subdataset nesting level, regardless of their location in the filesystem.

In this case, tree depth is determined by subdataset depth. This mode is thus suited for discovering available datasets when their location is not known in advance.

By default, only datasets are listed, without their contents. If --depth is specified additionally, the contents of each dataset will be included up to --depth directory levels (excluding subdirectories that are themselves datasets).

Tree filtering options such as --include-hidden only affect which directories are reported as dataset contents, not which directories are traversed to find datasets.

Performance note: since no assumption is made on the location of datasets, running this command with the --recursive or --recursion-limit option does a full scan of the whole directory tree. As such, it can be significantly slower than a call with an equivalent output that uses --depth to limit the tree instead.

Programmatic directory traversal

The command yields a result record for each tree node (dataset, directory or file). The following properties are reported, where available:

"path"

Absolute path of the tree node

"type"

Type of tree node: "dataset", "directory" or "file"

"depth"

Directory depth of node relative to the tree root

"exhausted_levels"

Depth levels for which no nodes are left to be generated (the respective subtrees have been 'exhausted')

"count"

Dict with cumulative counts of datasets, directories and files in the tree up until the current node. File count is only included if the command is run with the --include-files option.

"dataset_depth"

Subdataset depth level relative to the tree root. Only included for node type "dataset".

"dataset_abs_depth"

Absolute subdataset depth level. Only included for node type "dataset".

"dataset_is_installed"

Whether the registered subdataset is installed. Only included for node type "dataset".

"symlink_target"

If the tree node is a symlink, the path to the link target

"is_broken_symlink"

If the tree node is a symlink, whether it is a broken symlink
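
A hedged sketch of consuming these records from Python (the 'depth' parameter per the signature in the API overview; return_type and result_renderer are generic DataLad Python API arguments):

from datalad.api import tree

# walk a directory tree and report datasets only
for node in tree('/tmp/mydir', depth=2,
                 return_type='generator',
                 result_renderer='disabled'):
    if node['type'] == 'dataset':
        print(node['path'], node['dataset_depth'], node['dataset_is_installed'])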

Examples

Show up to 3 levels of subdirectories below the current directory, including files and hidden contents:

% datalad tree -L 3 --include-files --include-hidden

Find all top-level datasets located anywhere under /tmp:

% datalad tree /tmp -R 0

Report all subdatasets recursively and their directory contents, up to 1 subdirectory deep within each dataset:

% datalad tree -r -L 1
Options
path

path to directory from which to generate the tree. Defaults to the current directory. [Default: '.']

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-L DEPTH, --depth DEPTH

limit the tree to maximum level of subdirectories. If not specified, will generate the full tree with no depth constraint. If paired with --recursive or --recursion-limit, refers to the maximum directory level to output below each dataset.

-r, --recursive

produce a dataset tree of the full hierarchy of nested subdatasets. Note: may have slow performance on large directory trees.

-R LEVELS, --recursion-limit LEVELS

limit the dataset tree to maximum level of nested subdatasets. 0 means include only top-level datasets, 1 means top-level datasets and their immediate subdatasets, etc. Note: may have slow performance on large directory trees.

--include-files

include files in the tree.

--include-hidden

include hidden files/directories in the tree. This option does not affect which directories will be searched for datasets when specifying --recursive or --recursion-limit. For example, datasets located underneath the hidden folder .datalad will be reported even if --include-hidden is omitted.

--version

show the module and its version which provides the command

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

Python tooling

datalad-next comprises a number of more-or-less self-contained mini-packages providing particular functionality. These implementations are candidates for a migration into the DataLad core package, and are provided here for immediate use. If and when components are migrated, transition modules will be kept to prevent API breakage in dependent packages.

archive_operations

Handler for operations on various archive types

commands

Essential tooling for implementing DataLad commands

config

Configuration query and manipulation

constraints

Data validation, coercion, and parameter documentation

consts

Common constants

credman

Credential management

datasets

Representations of DataLad datasets built on git/git-annex repositories

exceptions

Special purpose exceptions

iterable_subprocess

Context manager to communicate with a subprocess using iterables

itertools

Various iterators, e.g., for subprocess pipelining and output processing

iter_collections

Iterators for particular types of collections

repo_utils

Common repository operations

runners

Execution of subprocesses

shell

A persistent shell connection

tests

Tooling for test implementations

tests.fixtures

Collection of fixtures for facilitating test implementations

types

Custom types and dataclasses

uis

UI abstractions for user communication

url_operations

Handlers for operations on various URL types and protocols

utils

Assorted utility functions

Git-remote helpers

datalad_annex

git-remote-datalad-annex to fetch/push via any git-annex special remote

Git-annex backends

base

Interface and essential utilities to implement external git-annex backends

xdlra

git-annex external backend XDLRA for git-remote-datalad-annex

Git-annex special remotes

SpecialRemote(annex)

Base class of all datalad-next git-annex special remotes

archivist

git-annex special remote archivist for obtaining files from archives

uncurl

uncurl git-annex external special remote

DataLad patches

Patches that are automatically applied to DataLad when loading the datalad-next extension package.

annexrepo

Credential support for AnnexRepo.enable_remote() and siblings enable

cli_configoverrides

Post DataLad config overrides CLI/ENV as GIT_CONFIG items in process ENV

commanderror

Improve CommandError rendering and add returncode alias for code

common_cfg

Change the default of datalad.annex.retry to 1

configuration

Enable configuration() to query global scope without a dataset

create_sibling_ghlike

Improved credential handling for create_sibling_<github-like>()

create_sibling_gitlab

Streamline user experience

customremotes_main

Connect log_progress-style progress reporting to git-annex

distribution_dataset

DatasetParameter support for resolve_path()

interface_utils

Uniform pre-execution parameter validation for commands

push_optimize

Make push avoid refspec handling for special remote push targets

push_to_export_remote

Add support for export to WebDAV remotes to push()

run

Enhance run() placeholder substitutions to honor configuration defaults

siblings

Auto-deploy credentials when enabling special remotes

test_keyring

Recognize DATALAD_TESTS_TMP_KEYRING_PATH to set alternative secret storage

update

Robustify update() target detection for adjusted mode datasets

Developing with DataLad NEXT

This extension package moves fast in comparison to the DataLad core package. Nevertheless, attention is paid to API stability, adequate semantic versioning, and informative changelogs.

Besides the DataLad commands shipped with this extension package, a number of Python utilities are provided that facilitate the implementation of workflows and additional functionality. An overview is available in the reference manual.

Public vs internal Python API

Anything that can be imported directly from any of the top-level sub-packages in datalad_next is considered to be part of the public API. Changes to this API determine the versioning, and development is done with the aim to keep this API as stable as possible. This includes signatures and return value behavior.

As an example:

from datalad_next.runners import iter_git_subproc

imports a part of the public API, but:

from datalad_next.runners.git import iter_git_subproc

does not.

Use of the internal API

Developers can obviously use parts of the non-public API. However, this should only be done with the understanding that these components may change from one release to another, with no guarantee of transition periods, deprecation warnings, etc.

Developers are advised to never reuse any components with names starting with _ (underscore). Their use should be limited to their individual sub-package.

Contributor information

Developer Guide

This guide sheds light on new and reusable subsystems developed in datalad-next. The target audience is developers who intend to build on or use functionality provided by this extension.

datalad-next's Constraint System

datalad_next.constraints implements a system to perform data validation, coercion, and parameter documentation for commands via a flexible set of "Constraints". You can find an overview of available Constraints in the respective module overview of the Python tooling.

Adding parameter validation to a command

In order to equip an existing or new command with the constraint system, the following steps are required:

  • Set the command's base class to ValidatedInterface:

from datalad_next.commands import (
    ValidatedInterface,
    build_doc,
)

@build_doc
class MyCommand(ValidatedInterface):
    """Download from URLs"""
  • Declare a _validator_ class member:

from datalad_next.commands import (
    EnsureCommandParameterization,
    ValidatedInterface,
    build_doc,
)

@build_doc
class MyCommand(ValidatedInterface):
    """Download from URLs"""

    _validator_ = EnsureCommandParameterization(dict(
        [...]
    ))
  • Determine for each parameter of the command whether it has constraints, and what those constraints are. If you're transitioning an existing command, remove any constraints= declaration in the _parameter_ class member.

  • Add a fitting Constraint declaration for each parameter into the _validator_ as a key-value pair where the key is the parameter and its value is a Constraint. There does not need to be a Constraint per parameter; only add entries for parameters that need validation.

from datalad_next.commands import (
    EnsureCommandParameterization,
    ValidatedInterface,
    build_doc,
)
from datalad_next.constraints import (
    EnsureChoice,
    EnsureDataset,
)

@build_doc
class Download(ValidatedInterface):
    """Download from URLs"""

    _validator_ = EnsureCommandParameterization(dict(
        dataset=EnsureDataset(installed=True),
        force=EnsureChoice('yes', 'no', 'maybe'),
    ))

Combining constraints

Constraints can be combined in different ways. The |, &, and () operators allow OR, AND, and grouping of Constraints. The following example from the download command defines a chain of possible Constraints:

spec_item_constraint = url2path_constraint | (
     (
         EnsureJSON() | EnsureURLFilenamePairFromURL()
     ) & url2path_constraint)

Constraints can also be combined using AnyOf or AllOf MultiConstraints, which correspond almost entirely to | and &. Here's another example from the download command:

spec_constraint = AnyOf(
    spec_item_constraint,
    EnsureListOf(spec_item_constraint),
    EnsureGeneratorFromFileLike(
        spec_item_constraint,
        exc_mode='yield',
    ),
)

One can combine an arbitrary number of Constraints. They are evaluated in the order in which they were specified. Logical OR constraints will return the value from the first constraint that does not raise an exception, and logical AND constraints pass the return values of each constraint into the next.
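
A minimal sketch of this chaining behavior, assuming EnsureInt and EnsureRange are available from datalad_next.constraints:

from datalad_next.constraints import (
    EnsureInt,
    EnsureRange,
)

# AND: EnsureInt() coerces '3' to the integer 3, and its return value
# is passed on to EnsureRange(min=0), which validates it
nonneg_int = EnsureInt() & EnsureRange(min=0)
assert nonneg_int('3') == 3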

Implementing additional constraints

TODO

Parameter Documentation

TODO

Contributing to datalad-next

We're happy about contributions of any kind to this project - thanks for considering making one!

Please take a look at CONTRIBUTING.md for an overview of development principles and common questions, and get in touch in case of questions or to discuss features, bugs, or other issues.
