DataLad NEXT extension
This DataLad extension can be thought of as a staging area for additional functionality, or for improved performance and user experience. Unlike other topical or more experimental extensions, the focus here is on functionality with broad applicability. This extension is a suitable dependency for other software packages that intend to build on this improved set of functionality.
Installation and usage
Install from PyPI or GitHub like any other Python package:
# create and enter a new virtual environment (optional)
$ virtualenv --python=python3 ~/env/dl-next
$ . ~/env/dl-next/bin/activate
# install from PyPI
$ python -m pip install datalad-next
Once installed, additional commands provided by this extension are immediately available. However, in order to fully benefit from all improvements, the extension has to be enabled for auto-loading by executing:
git config --global --add datalad.extensions.load next
Doing so will enable the extension to also alter the behavior of the core DataLad package and its commands.
Functionality provided by DataLad NEXT
The following table of contents offers entry points to the main components provided by this extension. The project README offers a more detailed summary in a different format.
High-level API commands
create-sibling-webdav: Create a sibling(-tandem) on a WebDAV server
credentials: Credential management and query
download: Download from URLs
ls-file-collection: Report information on files in a collection
next-status: Report on the (modification) status of a dataset
tree: Visualize directory and dataset hierarchies
Command line reference
datalad create-sibling-webdav
Synopsis
datalad create-sibling-webdav [-h] [-d DATASET] [-s NAME] [--storage-name NAME] [--mode MODE] [--credential NAME] [--existing EXISTING] [-r] [-R LEVELS] [--version] URL
Description
Create a sibling(-tandem) on a WebDAV server
WebDAV is a standard HTTP protocol extension for placing files on a server that is supported by a number of commercial storage services (e.g. 4shared.com, box.com), but also instances of cloud-storage solutions like Nextcloud or ownCloud. These software packages are also the basis for some institutional or public cloud storage solutions, such as EUDAT B2DROP.
For basic usage, only the URL with the desired dataset location on a WebDAV server needs to be specified for creating a sibling. However, the sibling setup can be flexibly customized (no storage sibling, or only a storage sibling, multi-version storage, or human-browsable single-version storage).
This command does not check for conflicting content on the WebDAV server!
When creating siblings recursively for a dataset hierarchy, subdataset exports are placed at their corresponding relative paths underneath the root location on the WebDAV server.
Collaboration on WebDAV siblings
The primary use case for WebDAV siblings is dataset deposition, where only one site is uploading dataset and file content updates. For collaborative workflows with multiple contributors, please make sure to consult the documentation on the underlying datalad-annex:: Git remote helper for advice on appropriate setups: http://docs.datalad.org/projects/next/
Git-annex implementation details
Storage siblings are presently configured to NOT be enabled automatically on cloning a dataset. Due to a limitation of git-annex, this would initially fail (missing credentials). Instead, an explicit datalad siblings enable --name <storage-sibling-name> command must be executed after cloning. If necessary, it will prompt for credentials.
This command does not (and likely will not) support embedding credentials in the repository (see the embedcreds option of the git-annex webdav special remote; https://git-annex.branchable.com/special_remotes/webdav), because such credential copies would need to be updated whenever they change or expire. Instead, credentials are retrieved from DataLad's credential system. In many cases, credentials are determined automatically, based on the HTTP authentication realm identified by a WebDAV server.
This command does not support setting up encrypted remotes (yet). Neither for the storage sibling, nor for the regular Git-remote. However, adding support for it is primarily a matter of extending the API of this command, and passing the respective options on to the underlying git-annex setup.
This command does not support setting up chunking for webdav storage siblings (https://git-annex.branchable.com/chunking).
Examples
Create a WebDAV sibling tandem for storage of a dataset's file content and revision history. A user will be prompted for any required credentials, if they are not yet known:
% datalad create-sibling-webdav "https://webdav.example.com/myds"
Such a dataset can be cloned by DataLad via a specially crafted URL. Again, credentials are automatically determined, or a user is prompted to enter them:
% datalad clone "datalad-annex::?type=webdav&encryption=none&url=https://webdav.example.com/myds"
A sibling can also be created with a human-readable file tree, suitable for data exchange with non-DataLad users, but only able to host a single version of each file:
% datalad create-sibling-webdav --mode filetree "https://example.com/browsable"
Cloning such dataset siblings is possible via a convenience URL:
% datalad clone "webdavs://example.com/browsable"
In all cases, the storage sibling needs to be explicitly enabled prior to file content retrieval:
% datalad siblings enable --name example.com-storage
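The same setup can be scripted from Python. The following is only an illustrative sketch: it assumes that, with the extension installed, create_sibling_webdav is available as a dataset method alongside the core commands, and the URL and sibling name are placeholders:
# illustrative sketch (not verbatim documentation): create a WebDAV sibling
# tandem from Python and publish to it; datalad-next is assumed to register
# create_sibling_webdav with DataLad's Python API
from datalad.api import Dataset

ds = Dataset('.')  # dataset in the current directory
ds.create_sibling_webdav(
    'https://webdav.example.com/myds',  # placeholder URL
    name='example-webdav',              # placeholder sibling name
    mode='annex',                       # default mode, shown for clarity
)
# the configured publication dependency also triggers the storage sibling
ds.push(to='example-webdav')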
Options
URL
URL identifying the sibling root on the target WebDAV server.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
specify the dataset to process. If no dataset is given, an attempt is made to identify the dataset based on the current working directory.
-s NAME, --name NAME
name of the sibling. If none is given, the hostname-part of the WebDAV URL will be used. With RECURSIVE, the same name will be used to label all the subdatasets' siblings.
--storage-name NAME
name of the storage sibling (git-annex special remote). Must not be identical to the sibling name. If not specified, defaults to the sibling name plus '-storage' suffix. If only a storage sibling is created, this setting is ignored, and the primary sibling name is used.
--mode MODE
Siblings can be created in various modes: full-featured sibling tandem, one for a dataset's Git history and one storage sibling to host any number of file versions ('annex'). A single sibling for the Git history only ('git-only'). A single annex sibling for multi-version file storage only ('annex-only'). As an alternative to the standard (annex) storage sibling setup that is capable of storing any number of historical file versions using a content hash layout ('annex'|'annex-only'), the 'filetree' mode can be used. This mode offers a human-readable data organization on the WebDAV remote that matches the file tree of a dataset (branch). However, it can, consequently, only store a single version of each file in the file tree. This mode is useful for depositing a single dataset snapshot for consumption without DataLad. The 'filetree' mode nevertheless allows for cloning such a single-version dataset, because the full dataset history can still be pushed to the WebDAV server. Git history hosting can also be turned off for this setup ('filetree-only'). When both a storage sibling and a regular sibling are created together, a publication dependency on the storage sibling is configured for the regular sibling in the local dataset clone. [Default: 'annex']
--credential NAME
name of the credential providing a user/password credential to be used for authorization. The credential can be supplied via configuration setting 'datalad.credential.<name>.user|secret', or environment variable DATALAD_CREDENTIAL_<NAME>_USER|SECRET, or will be queried from the active credential store using the provided name. If none is provided, the last-used credential for the authentication realm associated with the WebDAV URL will be used. Only if a credential name was given will it be encoded in the URL of the created WebDAV Git remote; otherwise credential auto-discovery will be performed on each remote access.
--existing EXISTING
action to perform, if a (storage) sibling is already configured under the given name. In this case, sibling creation can be skipped ('skip') or the sibling (re-)configured ('reconfigure') in the dataset, or the command be instructed to fail ('error'). [Default: 'error']
-r, --recursive
if set, recurse into potential subdatasets.
-R LEVELS, --recursion-limit LEVELS
limit recursion into subdatasets to the given number of levels. Constraints: value must be convertible to type 'int' or value must be NONE
--version
show the module and its version which provides the command
datalad credentials
Synopsis
datalad credentials [-h] [--prompt PROMPT] [-d DATASET] [--version] [action] [[name] [:]property[=value] ...]
Description
Credential management and query
This command enables inspection and manipulation of credentials used throughout DataLad.
The command provides four basic actions:
QUERY
When executed without any property specification, all known credentials with all their properties will be yielded. Please note that this may not include credentials that only comprise a secret and no other properties, or legacy credentials for which no trace in the configuration can be found. Therefore, the query results are not guaranteed to contain all credentials ever configured by DataLad.
When additional property/value pairs are specified, only credentials that have matching values for all given properties will be reported. This can be used, for example, to discover all suitable credentials for a specific "realm", if credentials were annotated with such information.
SET
This is the companion to 'get', and can be used to store properties and the secret of a credential. Importantly, and in contrast to a 'get' operation, given properties with no values indicate a removal request. Any matching properties on record will be removed. If a credential is to be stored for which no secret is on record yet, an interactive session will prompt a user for a manual secret entry.
Only changed properties will be contained in the result record.
The appearance of the interactive secret entry can be configured with the two settings datalad.credentials.repeat-secret-entry and datalad.credentials.hidden-secret-entry.
REMOVE
This action will remove any secret and properties associated with a credential identified by its name.
GET (plumbing operation)
This is a read-only action that will never store (updates of) credential properties or secrets. Given properties will amend/overwrite those already on record. When properties with no value are given, and also no value for the respective properties is on record yet, their value will be requested interactively, if a --prompt text was provided too. This can be used to ensure a complete credential record, comprising any number of properties.
Details on credentials
A credential comprises any number of properties, plus exactly one secret. There are no constraints on the format of property values or the secret, as long as they are encoded as a string.
Credential properties are normally stored as configuration settings in a user's configuration ('global' scope) using the naming scheme:
datalad.credential.<name>.<property>
Therefore both credential name and credential property name must be syntax-compliant with Git configuration items. For property names this means only alphanumeric characters and dashes. For credential names virtually no naming restrictions exist (only null-byte and newline are forbidden). However, when naming credentials it is recommended to use simple names, in order to enable convenient one-off credential overrides by specifying DataLad configuration items via their environment variable counterparts (see the documentation of the configuration command for details). In short, avoid underscores and special characters other than '.' and '-'.
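As a concrete illustration of such a one-off override (a sketch only; the credential name 'mycred' and its values are placeholders), a credential can be injected for a single process via the environment variable counterparts of the configuration items named above:
# illustrative sketch: provide credential 'mycred' to a child process via
# DATALAD_CREDENTIAL_<NAME>_USER|SECRET environment variables
import os
import subprocess

env = dict(os.environ)
env['DATALAD_CREDENTIAL_MYCRED_USER'] = 'bob'       # placeholder user name
env['DATALAD_CREDENTIAL_MYCRED_SECRET'] = 's3cr3t'  # placeholder secret

# any DataLad invocation in this environment can discover 'mycred'
subprocess.run(['datalad', 'credentials', 'get', 'mycred'], env=env, check=True)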
While there are no constraints on the number and nature of credential properties, a few particular properties are recognized and used for particular purposes:
'secret': always refers to the single secret of a credential
'type': identifies the type of a credential. With each standard type, a list of mandatory properties is associated (see below)
'last-used': is an ISO 8601 format time stamp that indicates the last (successful) usage of a credential
Standard credential types and properties
The following standard credential types are recognized, and their mandatory fields with their standard names will be automatically included in a 'get' report.
'user_password': with properties 'user', and the password as secret
'token': only comprising the token as secret
'aws-s3': with properties 'key-id', 'session', 'expiration', and the secret_id as the credential secret
Legacy support
DataLad credentials not configured via this command may not be fully discoverable (i.e., including all their properties). Discovery of such legacy credentials can be assisted by specifying a dedicated 'type' property.
Examples
Report all discoverable credentials:
% datalad credentials
Set a new credential mycred & input its secret interactively:
% datalad credentials set mycred
Remove a credential's type property:
% datalad credentials set mycred :type
Get all information on a specific credential in a structured record:
% datalad -f json credentials get mycred
Upgrade a legacy credential by annotating it with a 'type' property:
% datalad credentials set legacycred type=user_password
Set a new credential of type user_password, with a given user property, and input its secret interactively:
% datalad credentials set mycred type=user_password user=<username>
Obtain a (possibly yet undefined) credential with a minimum set of properties. All missing properties and secret will be prompted for, no information will be stored! This is mostly useful for ensuring availability of an appropriate credential in an application context:
% datalad credentials --prompt 'can I haz info plz?' get newcred :newproperty
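The same reports are also accessible from Python. A minimal sketch, assuming that installing the extension registers the credentials command with DataLad's Python API:
# illustrative sketch: query all discoverable credentials programmatically
from datalad.api import credentials  # assumed to be provided by datalad-next

for rec in credentials(return_type='generator', result_renderer='disabled'):
    # each record is a structured result, as shown by 'datalad -f json credentials'
    print(rec)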
Options
action
which action to perform. [Default: 'query']
[name] [:]property[=value]
specification of a credential name and credential properties. Properties are either given as name/value pairs or as a property name prefixed by a colon. Properties prefixed with a colon indicate a property to be deleted (action 'set'), or a property to be entered interactively, when no value is set yet, and a prompt text is given (action 'get'). All property names are case-insensitive, must start with a letter or a digit, and may only contain '-' apart from these characters.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
--prompt PROMPT
message to display when entry of missing credential properties is required for action 'get'. This can be used to present information on the nature of a credential and for instructions on how to obtain a credential.
-d DATASET, --dataset DATASET
specify a dataset whose configuration to inspect rather than the global (user) settings.
--version
show the module and its version which provides the command
datalad download
Synopsis
datalad download [-h] [-d DATASET] [--force {overwrite-existing}] [--credential NAME] [--hash ALGORITHM] [--version] <path>|<url>|<url-path-pair> [<path>|<url>|<url-path-pair> ...]
Description
Download from URLs
This command is the front-end to an extensible framework for performing downloads from a variety of URL schemes. Built-in support for the schemes 'http', 'https', 'file', and 'ssh' is provided. Extension packages may add additional support.
In contrast to other downloader tools, this command integrates with the DataLad credential management and is able to auto-discover credentials. If no credential is available, it automatically prompts for them, and offers to store them for reuse after a successful authentication.
Simultaneous hashing (checksumming) of downloaded content is supported with user-specified algorithms.
The command can process any number of downloads (serially). It can read download specifications from (command line) arguments, files, or STDIN. It can deposit downloads to individual files, or stream to STDOUT.
Implementation and extensibility
Each URL scheme is processed by a dedicated handler. Additional schemes can be supported by sub-classing datalad_next.url_operations.UrlOperations and implementing the download() method. Extension packages can register new handlers, by patching them into the datalad_next.download._urlscheme_handlers registry dict.
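The following sketch illustrates the overall shape of such an extension. It is not a verbatim template: the 'demo://' scheme and the handler internals are invented for illustration, and the exact constructor and download() signatures should be checked against the datalad_next.url_operations sources:
# illustrative sketch of a custom URL scheme handler; details are assumptions
from datalad_next.url_operations import UrlOperations


class DemoUrlOperations(UrlOperations):
    """Toy handler for a hypothetical 'demo://' URL scheme"""

    def download(self, from_url, to_path, **kwargs):
        # a real implementation would fetch from_url, write to to_path,
        # report progress, and return a mapping of result properties
        with open(to_path, 'wb') as dst:
            dst.write(f'demo content for {from_url}\n'.encode())
        return {}


# an extension package would then register the handler for its scheme, e.g.:
# from datalad_next.download import _urlscheme_handlers
# _urlscheme_handlers['demo'] = DemoUrlOperations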
Examples
Download webpage to "myfile.txt":
% datalad download "http://example.com myfile.txt"
Read download specification from STDIN (e.g. JSON-lines):
% datalad download -
Simultaneously hash download, hexdigest reported in result record:
% datalad download --hash sha256 http://example.com/data.xml
Download from SSH server:
% datalad download "ssh://example.com/home/user/data.xml"
Stream a download to STDOUT:
% datalad -f disabled download "http://example.com -"
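A JSON-lines download specification, as accepted on STDIN, can also be generated programmatically. A minimal sketch; all URLs and target paths are placeholders:
# illustrative sketch: emit a JSON-lines download specification on stdout,
# suitable for piping into 'datalad download -'
import json

spec = [
    'https://example.com/plain.txt',                         # plain URL string
    {'https://example.com/data.xml': 'downloads/data.xml'},  # URL-to-path mapping
]
for item in spec:
    print(json.dumps(item))
Piping the output of such a script into datalad download - processes all items serially.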
Options
<path>|<url>|<url-path-pair>
Download sources and targets can be given in a variety of formats: as a URL, or as a URL-path-pair that is mapping a source URL to a dedicated download target path. Any number of URLs or URL-path-pairs can be provided, either as an argument list, or read from a file (one item per line). Such a specification input file can be given as a path to an existing file (as a single value, not as part of a URL-path-pair). When the special path identifier '-' is used, the download is written to STDOUT. A specification can also be read in JSON-lines encoding (each line being a string with a URL or an object mapping a URL-string to a path-string).
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
Dataset to be used as a configuration source. Beyond reading configuration items, this command does not interact with the dataset.
--force {overwrite-existing}
By default, a target path for a download must not exist yet. 'overwrite-existing' disables this check.
--credential NAME
name of a credential to be used for authorization. If no credential is identified, the last-used credential for the authentication realm associated with the download target will be used. If there is no credential available yet, it will be prompted for. Once used successfully, a prompt to save such a new credential will be presented.
--hash ALGORITHM
Name of a hashing algorithm supported by the Python 'hashlib' module, e.g. 'md5' or 'sha256'. This option can be given more than once.
--version
show the module and its version which provides the command
datalad ls-file-collection
Synopsis
datalad ls-file-collection [-h] [--hash ALGORITHM] [--version] {directory,tarfile,zipfile,gittree,gitworktree,annexworktree} ID/LOCATION
Description
Report information on files in a collection
This is a utility that can be used to query information on files in different file collections. The type of information reported varies across collection types. However, each result at minimum contains some kind of identifier for the collection ('collection' property), and an identifier for the respective collection item ('item' property). Each result also contains a type property that indicates the particular type of file that is being reported on. In most cases this will be file, but other categories like symlink or directory are recognized too.
If a collection type provides file-access, this command can compute one or more hashes (checksums) for any file in a collection.
Supported file collection types are:
directory
Reports on the content of a given directory (non-recursively). The collection identifier is the path of the directory. Item identifiers are the names of items within that directory. Standard properties like size, mtime, or link_target are included in the report.
gittree
Reports on the content of a Git "tree-ish". The collection identifier is that tree-ish. The command must be executed inside a Git repository. If the working directory for the command is not the repository root (in case of a non-bare repository), the report is constrained to items underneath the working directory. Item identifiers are the relative paths of items within that working directory. Reported properties include gitsha and gittype; note that the gitsha is not equivalent to a SHA1 hash of a file's content, but is the SHA-type blob identifier as reported and used by Git. Reporting of content hashes beyond the gitsha is presently not supported.
gitworktree
Reports on all tracked and untracked content of a Git repository's work tree. The collection identifier is a path of a directory in a Git repository (which can, but needs not be, its root). Item identifiers are the relative paths of items within that directory. Reported properties include gitsha and gittype; note that the gitsha is not equivalent to a SHA1 hash of a file's content, but is the SHA-type blob identifier as reported and used by Git.
annexworktree
Like gitworktree, but amends the reported items with git-annex information, such as annexkey, annexsize, and annexobjpath.
tarfile
Reports on members of a TAR archive. The collection identifier is the path of the TAR file. Item identifiers are the relative paths of archive members within the archive. Reported properties are similar to the directory collection type.
zipfile
Like tarfile, for reporting on ZIP archives.
Examples
Report on the content of a directory:
% datalad -f json ls-file-collection directory /tmp
Report on the content of a TAR archive with MD5 and SHA1 file hashes:
% datalad -f json ls-file-collection --hash md5 --hash sha1 tarfile myarchive.tar.gz
Register URLs for files in a directory that is also reachable via HTTP. This uses ls-file-collection for listing files and computing MD5 hashes, then uses jq to filter and transform the output (just file records, and in a JSON array), and passes them to addurls, which generates annex keys/files and assigns URLs. When the command finishes, the dataset contains no data, but can retrieve the files after confirming their availability (i.e., via git annex fsck):
% datalad -f json ls-file-collection directory wwwdir --hash md5 \
| jq '. | select(.type == "file")' \
| jq --slurp . \
| datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'https://example.com/{item}'
List annex keys of all files in the working tree of a dataset:
% datalad -f json ls-file-collection annexworktree . \
| jq '. | select(.annexkey) | .annexkey'
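A Python counterpart of the annex-key listing above could look like the following sketch; it assumes that installing the extension makes ls_file_collection available via DataLad's Python API:
# illustrative sketch: list annex keys of all annexed files in the worktree
from datalad.api import ls_file_collection  # assumed to be provided by datalad-next

for rec in ls_file_collection(
        'annexworktree', '.',
        return_type='generator', result_renderer='disabled'):
    key = rec.get('annexkey')
    if key:
        print(key)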
Options
{directory,tarfile,zipfile,gittree,gitworktree,annexworktree}
Name of the type of file collection to report on.
ID/LOCATION
identifier or location of the file collection to report on. Depending on the type of collection to process, the specific nature of this parameter can be different. A common identifier for a file collection is a path (to a directory, to an archive), but might also be a URL. See the documentation for details on supported collection types.
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
--hash ALGORITHM
One or more names of algorithms to be used for reporting file hashes. They must be supported by the Python 'hashlib' module, e.g. 'md5' or 'sha256'. Reporting file hashes typically implies retrieving/reading file content. This processing may also enable reporting of additional properties that may otherwise not be readily available. This option can be given more than once.
--version
show the module and its version which provides the command
datalad next-status
Synopsis
datalad next-status [-h] [-d DATASET] [--untracked {no,whole-dir,no-empty-dir,normal,all}] [-r [{no,repository,datasets,mono}]] [-e {no,commit,full}] [--version]
Description
Report on the (modification) status of a dataset
NOTE
This is a preview of a command implementation aiming to replace the DataLad status command. For now, expect anything here to change again.
This command provides a report that is roughly identical to that of git status. Running with default parameters yields a report that should look familiar to Git and DataLad users alike, and contain the same information as offered by git status.
The main differences to git status are:
Support for recursion into submodules. git status does that too, but the report is limited to the global state of an entire submodule, whereas this command can issue detailed reports on changes inside a submodule (any nesting depth).
Support for directory-constrained reporting. Much like git status limits its report to a single repository, this command can optionally limit its report to a single directory and its direct children. In this report subdirectories are considered containers (much like submodules), and a change summary is provided for them.
Support for a "mono" (monolithic repository) report. Unlike a standard recursion into submodules, and checking each of them for changes with respect to the HEAD commit of the worktree, this report compares a submodule with respect to the state recorded in its parent repository. This provides an equally comprehensive status report from the point of view of a queried repository, but does not include a dedicated item on the global state of a submodule. This makes a nested hierarchy of repositories appear like a single (mono) repository.
Support for "adjusted mode" git-annex repositories. These utilize a managed branch that is repeatedly rewritten, hence is not suitable for tracking within a parent repository. Instead, the underlying "corresponding branch" is used, which contains the equivalent content in an un-adjusted form, persistently. This command detects this condition and automatically check a repositories state against the corresponding branch state.
Presently missing/planned features
There is no support for specifying paths (or pathspecs) for constraining the operation to specific dataset parts. This will be added in the future.
There is no reporting of git-annex properties, such as tracked file size. It is undetermined whether this will be added in the future. However, even without a dedicated switch, this command has support for datasets (and their submodules) in git-annex's "adjusted mode".
Differences to the ``status`` command implementation prior to DataLad v2
Like git status, this implementation reports on dataset modification, whereas the previous status also provided a listing of unchanged dataset content. This is no longer done. Equivalent functionality for listing dataset content is provided by the ls_file_collection command.
The implementation is substantially faster. Depending on the context, the speed-up is typically somewhere between 2x and 100x.
The implementation does not suffer from the limitation regarding type change detection.
Python and CLI API of the command use uniform parameter validation.
Examples
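As an illustrative sketch (assuming that installing the extension exposes the command as next_status in DataLad's Python API, with parameters mirroring the CLI options below), a monolithic status report across a dataset hierarchy could be obtained like this:
# illustrative sketch: report modifications across a nested dataset hierarchy
from datalad.api import next_status  # assumed to be provided by datalad-next

for rec in next_status(recursive='mono',
                       return_type='generator', result_renderer='disabled'):
    # each record describes one modified or untracked item
    print(rec)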
Options
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-d DATASET, --dataset DATASET
Dataset to be used as a configuration source. Beyond reading configuration items, this command does not interact with the dataset.
--untracked {no,whole-dir,no-empty-dir,normal,all}
Determine how untracked content is considered and reported when comparing a revision to the state of the working tree. 'no': no untracked content is considered as a change; 'normal': untracked files and entire untracked directories are reported as such; 'all': report individual files even in fully untracked directories. In addition to these git-status modes, 'whole-dir' (like normal, but include empty directories), and 'no-empty-dir' (alias for 'normal') are understood. [Default: 'normal']
-r [{no,repository,datasets,mono}], --recursive [{no,repository,datasets,mono}]
Mode of recursion for status reporting. With 'no' the report is restricted to a single directory and its direct children. With 'repository', the report comprises all repository content underneath the current working directory or root of a given dataset, but is limited to items directly contained in that repository. With 'datasets', the report also comprises any content in any subdatasets. Each subdataset is evaluated against its respective HEAD commit. With 'mono', a report similar to 'datasets' is generated, but any subdataset is evaluated with respect to the state recorded in its parent repository. In contrast to the 'datasets' mode, no report items on a joint submodule are generated. If no particular value is given with this option, the 'datasets' mode is selected. [Default: 'repository']
-e {no,commit,full}, --eval-subdataset-state {no,commit,full}
Evaluation of subdataset state (modified or untracked content) can be expensive for deep dataset hierarchies, as subdatasets have to be tested recursively for uncommitted modifications. Setting this option to 'no' or 'commit' can substantially boost performance by limiting what is being tested. With 'no', no state is evaluated and subdatasets are not investigated for modifications. With 'commit', only a discrepancy between the HEAD commit gitsha of a subdataset and the gitsha recorded in the superdataset's record is evaluated. With 'full', any other modifications are considered too. [Default: 'full']
--version
show the module and its version which provides the command
datalad tree
Synopsis
datalad tree [-h] [-L DEPTH] [-r] [-R LEVELS] [--include-files] [--include-hidden] [--version] [path]
Description
Visualize directory and dataset hierarchies
This command mimics the UNIX/MS-DOS 'tree' utility to generate and display a directory tree, with DataLad-specific enhancements.
It can serve the following purposes:
Glorified 'tree' command
Dataset discovery
Programmatic directory traversal
Glorified 'tree' command
The rendered command output uses 'tree'-style visualization:
/tmp/mydir
├── [DS~0] ds_A/
│ └── [DS~1] subds_A/
└── [DS~0] ds_B/
├── dir_B/
│ ├── file.txt
│ ├── subdir_B/
│ └── [DS~1] subds_B0/
└── [DS~1] (not installed) subds_B1/
5 datasets, 2 directories, 1 file
Dataset paths are prefixed by a marker indicating subdataset hierarchy level, like [DS~1]. This is the absolute subdataset level, meaning it may also take into account superdatasets located above the tree root and thus not included in the output.
If a subdataset is registered but not installed (such as after a non-recursive datalad clone), it will be prefixed by (not installed). Only DataLad datasets are considered, not pure git/git-annex repositories.
The 'report line' at the bottom of the output shows the count of displayed datasets, in addition to the count of directories and files. In this context, datasets and directories are mutually exclusive categories.
By default, only directories (no files) are included in the tree, and hidden directories are skipped. Both behaviours can be changed using command options.
Symbolic links are always followed. This means that a symlink pointing to a directory is traversed and counted as a directory (unless it potentially creates a loop in the tree).
Dataset discovery
Using the --recursive or --recursion-limit option, this command generates the layout of dataset hierarchies based on subdataset nesting level, regardless of their location in the filesystem.
In this case, tree depth is determined by subdataset depth. This mode is thus suited for discovering available datasets when their location is not known in advance.
By default, only datasets are listed, without their contents. If --depth is specified additionally, the contents of each dataset will be included up to --depth directory levels (excluding subdirectories that are themselves datasets).
Tree filtering options such as --include-hidden only affect which directories are reported as dataset contents, not which directories are traversed to find datasets.
Performance note: since no assumption is made on the location of datasets, running this command with the --recursive or --recursion-limit option does a full scan of the whole directory tree. As such, it can be significantly slower than a call with an equivalent output that uses --depth to limit the tree instead.
Programmatic directory traversal
The command yields a result record for each tree node (dataset, directory or file). The following properties are reported, where available:
- "path"
Absolute path of the tree node
- "type"
Type of tree node: "dataset", "directory" or "file"
- "depth"
Directory depth of node relative to the tree root
- "exhausted_levels"
Depth levels for which no nodes are left to be generated (the respective subtrees have been 'exhausted')
- "count"
Dict with cumulative counts of datasets, directories and files in the tree up until the current node. File count is only included if the command is run with the
--include-files
option.- "dataset_depth"
Subdataset depth level relative to the tree root. Only included for node type "dataset".
- "dataset_abs_depth"
Absolute subdataset depth level. Only included for node type "dataset".
- "dataset_is_installed"
Whether the registered subdataset is installed. Only included for node type "dataset".
- "symlink_target"
If the tree node is a symlink, the path to the link target
- "is_broken_symlink"
If the tree node is a symlink, whether it is a broken symlink
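A minimal consumer of these result records could look like the following sketch; it assumes that installing the extension makes the command available as tree in DataLad's Python API, with parameters mirroring the CLI options:
# illustrative sketch: collect the paths of all datasets found under /tmp,
# up to 2 levels of subdataset nesting
from datalad.api import tree  # assumed to be provided by datalad-next

dataset_paths = [
    rec['path']
    for rec in tree('/tmp', recursion_limit=2,
                    return_type='generator', result_renderer='disabled')
    if rec.get('type') == 'dataset'
]
print(dataset_paths)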
Examples
Show up to 3 levels of subdirectories below the current directory, including files and hidden contents:
% datalad tree -L 3 --include-files --include-hidden
Find all top-level datasets located anywhere under /tmp:
% datalad tree /tmp -R 0
Report all subdatasets recursively and their directory contents, up to 1 subdirectory deep within each dataset:
% datalad tree -r -L 1
Options
path
path to directory from which to generate the tree. Defaults to the current directory. [Default: '.']
-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for displaying the help message
-L DEPTH, --depth DEPTH
limit the tree to maximum level of subdirectories. If not specified, will generate the full tree with no depth constraint. If paired with --recursive or --recursion-limit, refers to the maximum directory level to output below each dataset.
-r, --recursive
produce a dataset tree of the full hierarchy of nested subdatasets. Note: may have slow performance on large directory trees.
-R LEVELS, --recursion-limit LEVELS
limit the dataset tree to maximum level of nested subdatasets. 0 means include only top-level datasets, 1 means top-level datasets and their immediate subdatasets, etc. Note: may have slow performance on large directory trees.
--include-files
include files in the tree.
--version
show the module and its version which provides the command
Python tooling
datalad-next comprises a number of more-or-less self-contained mini-packages providing particular functionality. These implementations are candidates for a migration into the DataLad core package, and are provided here for immediate use. If and when components are migrated, transition modules will be kept to prevent API breakage in dependent packages.
Handler for operations on various archive types
Essential tooling for implementing DataLad commands
Configuration query and manipulation
Data validation, coercion, and parameter documentation
Common constants
Credential management
Representations of DataLad datasets built on git/git-annex repositories
Special purpose exceptions
Context manager to communicate with a subprocess using iterables
Various iterators, e.g., for subprocess pipelining and output processing
Iterators for particular types of collections
Common repository operations
Execution of subprocesses
A persistent shell connection
Tooling for test implementations
Collection of fixtures for facilitating test implementations
Custom types and dataclasses
UI abstractions for user communication
Handlers for operations on various URL types and protocols
Assorted utility functions
Git-remote helpers
git-remote-datalad-annex to fetch/push via any git-annex special remote
Git-annex backends
Interface and essential utilities to implement external git-annex backends
git-annex external backend XDLRA for git-remote-datalad-annex
Git-annex special remotes
Base class of all datalad-next git-annex special remotes
git-annex special remote archivist for obtaining files from archives
uncurl git-annex external special remote
DataLad patches
Patches that are automatically applied to DataLad when loading the datalad-next extension package.
Credential support for
Post DataLad config overrides CLI/ENV as GIT_CONFIG items in process ENV
Improve
Change the default of
Enable
Improved credential handling for
Streamline user experience
Connect
Uniform pre-execution parameter validation for commands
Make push avoid refspec handling for special remote push targets
Add support for export to WebDAV remotes to
Enhance
Auto-deploy credentials when enabling special remotes
Recognize DATALAD_TESTS_TMP_KEYRING_PATH to set alternative secret storage
Robustify
Developing with DataLad NEXT
This extension package moves fast in comparison to the DataLad core package. Nevertheless, attention is paid to API stability, adequate semantic versioning, and informative changelogs.
Besides the DataLad commands shipped with this extension package, a number of Python utilities are provided that facilitate the implementation of workflows and additional functionality. An overview is available in the reference manual.
Public vs internal Python API
Anything that can be imported directly from any of the top-level sub-packages in datalad_next is considered to be part of the public API. Changes to this API determine the versioning, and development is done with the aim to keep this API as stable as possible. This includes signatures and return value behavior.
As an example:
from datalad_next.runners import iter_git_subproc
imports a part of the public API, but:
from datalad_next.runners.git import iter_git_subproc
does not.
Use of the internal API
Developers can obviously use parts of the non-public API. However, this should only be done with the understanding that these components may change from one release to another, with no guarantee of transition periods, deprecation warnings, etc.
Developers are advised to never reuse any components with names starting with _ (underscore). Their use should be limited to their individual sub-package.
Contributor information
Developer Guide
This guide sheds light on new and reusable subsystems developed in datalad-next.
The target audience is developers who intend to build on or use functionality provided by this extension.
datalad-next's Constraint System
datalad_next.constraints implements a system to perform data validation, coercion, and parameter documentation for commands via a flexible set of "Constraints".
You can find an overview of available Constraints in the respective module overview of the Python tooling.
Adding parameter validation to a command
In order to equip an existing or new command with the constraint system, the following steps are required:
Set the command's base class to ValidatedInterface:
from datalad_next.commands import ValidatedInterface

@build_doc
class MyCommand(ValidatedInterface):
    """Download from URLs"""
Declare a _validator_ class member:
from datalad_next.commands import (
    EnsureCommandParameterization,
    ValidatedInterface,
)

@build_doc
class MyCommand(ValidatedInterface):
    """Download from URLs"""
    _validator_ = EnsureCommandParameterization(dict(
        [...]
    ))
Determine for each parameter of the command whether it has constraints, and what those constraints are. If you're transitioning an existing command, remove any constraints= declaration in the _parameter_ class member.
Add a fitting Constraint declaration for each parameter into the _validator_ as a key-value pair, where the key is the parameter and its value is a Constraint. There does not need to be a Constraint per parameter; only add entries for parameters that need validation.
from datalad_next.commands import (
    EnsureCommandParameterization,
    ValidatedInterface,
)
from datalad_next.constraints import EnsureChoice
from datalad_next.constraints import EnsureDataset

@build_doc
class Download(ValidatedInterface):
    """Download from URLs"""
    _validator_ = EnsureCommandParameterization(dict(
        dataset=EnsureDataset(installed=True),
        force=EnsureChoice('yes', 'no', 'maybe'),
    ))
Combining constraints
Constraints can be combined in different ways.
The |, &, and () operators allow OR, AND, and grouping of Constraints. The following example from the download command defines a chain of possible Constraints:
spec_item_constraint = url2path_constraint | (
    (
        EnsureJSON() | EnsureURLFilenamePairFromURL()
    ) & url2path_constraint)
Constraints can also be combined using AnyOf or AllOf MultiConstraints, which correspond almost entirely to | and &. Here's another example from the download command:
spec_constraint = AnyOf(
    spec_item_constraint,
    EnsureListOf(spec_item_constraint),
    EnsureGeneratorFromFileLike(
        spec_item_constraint,
        exc_mode='yield',
    ),
)
One can combine an arbitrary number of Constraints. They are evaluated in the order in which they were specified. Logical OR constraints will return the value from the first constraint that does not raise an exception, and logical AND constraints pass the return values of each constraint into the next.
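A small self-contained illustration of these semantics (EnsureInt is assumed to be available from datalad_next.constraints alongside the constraints shown above):
# illustrative sketch of OR/AND combination semantics
from datalad_next.constraints import EnsureChoice, EnsureInt

# OR: the value from the first constraint that does not raise is returned
level = EnsureChoice('low', 'high') | EnsureInt()
print(level('low'))  # 'low' (the first constraint succeeds)
print(level('5'))    # 5 (EnsureChoice fails, EnsureInt coerces the string)

# AND: each constraint's return value is passed into the next
bounded = EnsureInt() & EnsureChoice(1, 2, 3)
print(bounded('2'))  # 2 (coerced to int, then checked against the choices)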
Implementing additional constraints
TODO
Parameter Documentation
TODO
Contributing to datalad-next
We're happy about contributions of any kind to this project - thanks for considering making one!
Please take a look at CONTRIBUTING.md for an overview of development principles and common questions, and get in touch in case of questions or to discuss features, bugs, or other issues.