datalad_revolution.dataset

Amendment of the DataLad Dataset base class

class datalad_revolution.dataset.EnsureDataset

Bases: datalad.distribution.dataset.EnsureDataset

class datalad_revolution.dataset.RevolutionDataset(path)

Bases: datalad.distribution.dataset.Dataset

get_subdatasets(**kwargs)
pathobj

pathlib path object for the dataset's location

repo

Get an instance of the version control system/repo for this dataset, or None if there is none yet.

If creating an instance of GitRepo is guaranteed to be really cheap this could also serve as a test whether a repo is present.

Return type: GitRepo
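
A minimal sketch using the documented properties (the dataset location is hypothetical):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/some_dataset')  # hypothetical location
if ds.repo is None:
    print('no repository at', ds.pathobj)
else:
    print('repo flavor:', type(ds.repo).__name__)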
rev_create(path=None, initopts=None, force=False, description=None, dataset=None, no_annex=False, fake_dates=False)

Create a new dataset from scratch.

This command initializes a new dataset at a given location, or the current directory. The new dataset can optionally be registered in an existing superdataset (the new dataset’s path needs to be located within the superdataset for that, and the superdataset needs to be given explicitly via dataset). It is recommended to provide a brief description to label the dataset’s nature and location, e.g. “Michael’s music on black laptop”. This helps humans to identify data locations in distributed scenarios. By default, an identifier composed of user and machine name, plus path, will be generated.

This command only creates a new dataset, it does not add existing content to it, even if the target directory already contains additional files or directories.

Plain Git repositories can be created via the no_annex flag. However, the result will not be a full dataset, and, consequently, not all features are supported (e.g. a description).

To create a local version of a remote dataset, use the install() command instead.

Note

Power-user info: This command uses git init and git annex init to prepare the new dataset. Registering to a superdataset is performed via a git submodule add operation in the discovered superdataset.
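
For illustration, a minimal Python API sketch (the target location is hypothetical and assumed to be empty; DataLad and the datalad_revolution extension must be installed):

from datalad_revolution.dataset import RevolutionDataset

# initialize a new dataset at the location of the Dataset instance
ds = RevolutionDataset('/tmp/new_dataset')  # hypothetical location
ds.rev_create(description="example dataset on this machine")
# create another dataset underneath it; the bound method passes 'ds'
# as the dataset argument, so the result is registered as a
# subdataset of 'ds'
ds.rev_create('sub')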

Parameters:
  • path (str or Dataset or None, optional) – path where the dataset shall be created, directories will be created as necessary. If no location is provided, a dataset will be created in the current working directory. Either way the command will error if the target directory is not empty. Use force to create a dataset in a non-empty directory. [Default: None]
  • initopts – options to pass to git init. Options can be given as a list of command line arguments or as a GitPython-style option dictionary. Note that not all options will lead to viable results. For example ‘--bare’ will not yield a repository where DataLad can adjust files in its worktree. [Default: None]
  • force (bool, optional) – enforce creation of a dataset in a non-empty directory. [Default: False]
  • description (str or None, optional) – short description to use for a dataset location. Its primary purpose is to help humans to identify a dataset copy (e.g., “mike’s dataset on lab server”). Note that when a dataset is published, this information becomes available on the remote side. [Default: None]
  • dataset (Dataset or None, optional) – specify the dataset to perform the create operation on. If a dataset is given, a new subdataset will be created in it. [Default: None]
  • no_annex (bool, optional) – if set, a plain Git repository will be created without any annex. [Default: False]
  • fake_dates (bool, optional) – Configure the repository to use fake dates. The date for a new commit will be set to one second later than the latest commit in the repository. This can be used to anonymize dates. [Default: False]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'paths', 'relpaths', 'successdatasets-or-none', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given (see the sketch after this parameter list). [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’, a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
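
As referenced in the parameter list above, a sketch of the shared result-handling keywords (common to all commands in this section; the location is hypothetical):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/another_dataset')  # hypothetical location
# obtain the created Dataset instance instead of result dictionaries
new_ds = ds.rev_create(result_xfm='datasets', return_type='item-or-list')
# alternatively, collect raw result records and tolerate failures:
# results = ds.rev_create(on_failure='ignore')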
rev_diff(fr='HEAD', to=None, path=None, dataset=None, annex=None, untracked='normal', recursive=False, recursion_limit=None)

Report differences between two states of a dataset (hierarchy)

The two to-be-compared states are given via the fr and to options. These state identifiers are evaluated in the context of the (specified or detected) dataset. In case of a recursive report on a dataset hierarchy, corresponding state pairs for any subdataset are determined from the subdataset record in the respective superdataset. Only changes recorded in a subdataset between these two states are reported, and so on.

Any paths given as additional arguments will be used to constrain the difference report. As with Git’s diff, it will not result in an error when a path is specified that does not exist on the filesystem.

Reports are very similar to those of the rev-status command, with the distinguished content types and states being identical.
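
A minimal sketch of a worktree query via the Python API (the dataset location is hypothetical; result properties are assumed to match those of rev-status reports):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/new_dataset')  # hypothetical location
# fr defaults to 'HEAD'; with to=None the worktree state is compared
for res in ds.rev_diff(return_type='generator'):
    print(res.get('state'), res.get('path'))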

Parameters:
  • fr (1-item sequence of str, optional) – original state to compare to, as given by any identifier that Git understands. [Default: ‘HEAD’]
  • to (1-item sequence of str or None, optional) – state to compare against the original state, as given by any identifier that Git understands. If none is specified, the state of the worktree will be compared. [Default: None]
  • path (sequence of str or None, optional) – path to constrain the report to. [Default: None]
  • dataset (Dataset or None, optional) – specify the dataset to query. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. [Default: None]
  • annex ({None, 'basic', 'availability', 'all'}, optional) – Switch whether to include information on the annex content of individual files in the status report, such as recorded file size. By default no annex information is reported (faster). Three report modes are available: basic information like file size and key name (‘basic’); additionally test whether file content is present in the local annex (‘availability’; requires one or two additional file system stat calls, but does not call git-annex), this will add the result properties ‘has_content’ (boolean flag) and ‘objloc’ (absolute path to an existing annex object file); or ‘all’ which will report all available information (presently identical to ‘availability’). [Default: None]
  • untracked ({'no', 'normal', 'all'}, optional) – If and how untracked content is reported when comparing a revision to the state of the work tree. ‘no’: no untracked content is reported; ‘normal’: untracked files and entire untracked directories are reported as such; ‘all’: report individual files even in fully untracked directories. [Default: ‘normal’]
  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]
  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'paths', 'relpaths', 'successdatasets-or-none', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’, a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
rev_run(cmd=None, dataset=None, inputs=None, outputs=None, expand=None, explicit=False, message=None, sidecar=None)

Run an arbitrary shell command and record its impact on a dataset.

It is recommended to craft the command such that it can run in the root directory of the dataset that the command will be recorded in. However, as long as the command is executed somewhere underneath the dataset root, the exact location will be recorded relative to the dataset root.

If the executed command did not alter the dataset in any way, no record of the command execution is made.

If the given command errors, a CommandError exception with the same exit code will be raised, and no modifications will be saved.
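
For illustration, a minimal Python API sketch of a recorded execution (the dataset location is hypothetical):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/new_dataset')  # hypothetical location
# run in the dataset root and record resulting modifications;
# if the command changes nothing, no record is made
ds.rev_run('echo 42 > answer.txt', message="compute the answer")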

Command format

A few placeholders are supported in the command via Python format specification. “{pwd}” will be replaced with the full path of the current working directory. “{dspath}” will be replaced with the full path of the dataset that run is invoked on. “{inputs}” and “{outputs}” represent the values specified by inputs and outputs. If multiple values are specified, the values will be joined by a space. The order of the values will match the order given on the command line, with any globs expanded in alphabetical order (like bash). Individual values can be accessed with an integer index (e.g., “{inputs[0]}”).

Note that the representation of the inputs or outputs in the formatted command string depends on whether the command is given as a list of arguments or as a string. The concatenated list of inputs or outputs will be surrounded by quotes when the command is given as a list but not when it is given as a string. This means that the string form is required if you need to pass each input as a separate argument to a preceding script (i.e., write the command as “./script {inputs}”, quotes included). The string form should also be used if the input or output paths contain spaces or other characters that need to be escaped.

To escape a brace character, double it (i.e., “{{” or “}}”).
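
Continuing the sketch above, an example of placeholder use with declared inputs and outputs (all paths are hypothetical):

# input content is retrieved and outputs are unlocked before execution;
# '{inputs}' and '{outputs}' expand to the declared values
ds.rev_run(
    'cp {inputs} {outputs}',
    inputs=['raw/data.csv'],
    outputs=['derived/data.csv'],
)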

Custom placeholders can be added as configuration variables under “datalad.run.substitutions”. As an example:

Add a placeholder “name” with the value “joe”:

% git config --file=.datalad/config datalad.run.substitutions.name joe
% datalad add -m "Configure name placeholder" .datalad/config

Access the new placeholder in a command:

% datalad run "echo my name is {name} >me"
Parameters:
  • cmd – command for execution. [Default: None]
  • dataset (Dataset or None, optional) – specify the dataset to record the command results in. An attempt is made to identify the dataset based on the current working directory. If a dataset is given, the command will be executed in the root directory of this dataset. [Default: None]
  • inputs – A dependency for the run. Before running the command, the content of this file will be retrieved. A value of “.” means “run datalad get .”. The value can also be a glob. [Default: None]
  • outputs – Prepare this file to be an output file of the command. A value of “.” means “run datalad unlock .” (and will fail if some content isn’t present). For any other value, if the content of this file is present, unlock the file. Otherwise, remove it. The value can also be a glob. [Default: None]
  • expand (None or {'inputs', 'outputs', 'both'}, optional) – Expand globs when storing inputs and/or outputs in the commit message. [Default: None]
  • explicit (bool, optional) – Consider the specification of inputs and outputs to be explicit. Don’t warn if the repository is dirty, and only save modifications to the listed outputs. [Default: False]
  • message (str or None, optional) – a description of the state or the changes made to a dataset. [Default: None]
  • sidecar (None or bool, optional) – By default, the configuration variable ‘datalad.run.record-sidecar’ determines whether a record with information on a command’s execution is placed into a separate record file instead of the commit message (default: off). This option can be used to override the configured behavior on a case-by-case basis. Sidecar files are placed into the dataset’s ‘.datalad/runinfo’ directory (customizable via the ‘datalad.run.record-directory’ configuration variable). [Default: None]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'paths', 'relpaths', 'successdatasets-or-none', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’, a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
rev_save(path=None, message=None, dataset=None, version_tag=None, recursive=False, recursion_limit=None, updated=False, message_file=None, to_git=None)

Save the current state of a dataset

Saving the state of a dataset records changes that have been made to it. This change record is annotated with a user-provided description. Optionally, an additional tag, such as a version, can be assigned to the saved state. Such a tag enables straightforward retrieval of past versions at a later point in time.

Examples

Save any content underneath the current directory, without altering any potential subdataset (use --recursive for that):

% datalad save .

Save any modification of known dataset content, but leave untracked files (e.g. temporary files) untouched:

% datalad save -d <path_to_dataset>

Tag the most recent saved state of a dataset:

% datalad save -d <path_to_dataset> --version-tag bestyet
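
The corresponding Python API calls, as a minimal sketch (the dataset location and file name are hypothetical):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/new_dataset')  # hypothetical location
# save all modifications, with a message and a version tag
ds.rev_save(message="my first save", version_tag="bestyet")
# save a single file, committing its content directly to Git
ds.rev_save('notes.txt', to_git=True)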
Parameters:
  • path (sequence of str or None, optional) – path/name of the dataset component to save. If given, only changes made to those components are recorded in the new state. [Default: None]
  • message (str or None, optional) – a description of the state or the changes made to a dataset. [Default: None]
  • dataset (Dataset or None, optional) – specify the dataset to save. [Default: None]
  • version_tag (str or None, optional) – an additional marker for that state. Every dataset that is touched will receive the tag. [Default: None]
  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]
  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]
  • updated (bool, optional) – if given, only saves previously tracked paths. [Default: False]
  • message_file (str or None, optional) – take the commit message from this file. This flag is mutually exclusive with -m. [Default: None]
  • to_git (bool, optional) – flag whether to add data directly to Git, instead of tracking data identity only. Usually this is not desired, as it inflates dataset sizes and impacts flexibility of data transport. If not specified, it is up to git-annex to decide, possibly based on .gitattributes options. Use this flag with a simultaneous selection of paths to save. In general, it is better to pre-configure a dataset to track particular paths, file types, or file sizes with either Git or git-annex. See https://git-annex.branchable.com/tips/largefiles/. [Default: None]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'paths', 'relpaths', 'successdatasets-or-none', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’, a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
rev_status(path=None, dataset=None, annex=None, untracked='normal', recursive=False, recursion_limit=None)

Report on the state of dataset content.

This is an analog to git status that is simultaneously crippled and more powerful. It is crippled, because it only supports a fraction of the functionality of its counterpart and only distinguishes a subset of the states that Git knows about. But it is also more powerful, as it can handle status reports for a whole hierarchy of datasets, with the ability to report on a subset of the content (selection of paths) across any number of datasets in the hierarchy.

Path conventions

All reports are guaranteed to use absolute paths that are underneath the given or detected reference dataset, regardless of whether query paths are given as absolute or relative paths (with respect to the working directory, or to the reference dataset, when such a dataset is given explicitly). Moreover, so-called “explicit relative paths” (i.e. paths that start with ‘.’ or ‘..’) are also supported, and are interpreted as relative paths with respect to the current working directory regardless of whether a reference dataset was specified.

When it is necessary to address a subdataset record in a superdataset without causing a status query for the state _within_ the subdataset itself, this can be achieved by explicitly providing a reference dataset and the path to the root of the subdataset like so:

datalad rev-status --dataset . subdspath

In contrast, when the state of the subdataset within the superdataset is not relevant, a status query for the content of the subdataset can be obtained by adding a trailing path separator to the query path (rsync-like syntax):

datalad rev-status --dataset . subdspath/

When both aspects are relevant (the state of the subdataset content and the state of the subdataset within the superdataset), both queries can be combined:

datalad rev-status --dataset . subdspath subdspath/

When performing a recursive status query, both status aspects of subdatasets are always included in the report.
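
The same conventions apply to the Python API. A minimal sketch, assuming a subdataset named 'sub' registered in a hypothetical superdataset:

import os.path as op
from datalad_revolution.dataset import RevolutionDataset

superds = RevolutionDataset('/tmp/super')  # hypothetical location
# state of the subdataset record in the superdataset only
superds.rev_status(path='sub')
# state of the content within the subdataset only (trailing separator)
superds.rev_status(path='sub' + op.sep)
# both aspects at once
superds.rev_status(path=['sub', 'sub' + op.sep])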

Content types

The following content types are distinguished:

  • ‘dataset’ – any top-level dataset, or any subdataset that is properly registered in a superdataset
  • ‘directory’ – any directory that does not qualify for type ‘dataset’
  • ‘file’ – any file, or any symlink that is a placeholder for an annexed file
  • ‘symlink’ – any symlink that is not used as a placeholder for an annexed file

Content states

The following content states are distinguished (see the sketch after this list):

  • ‘clean’
  • ‘added’
  • ‘modified’
  • ‘deleted’
  • ‘untracked’
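
As referenced above, a sketch that filters a recursive status report by these states (assuming each result dictionary carries ‘type’, ‘state’, and ‘path’ properties, as used throughout this report format):

from datalad_revolution.dataset import RevolutionDataset

ds = RevolutionDataset('/tmp/super')  # hypothetical location
for res in ds.rev_status(recursive=True, return_type='generator'):
    # only print content that is not in 'clean' state
    if res.get('state') in ('modified', 'untracked'):
        print(res.get('type'), res.get('state'), res.get('path'))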
Parameters:
  • path (sequence of str or None, optional) – path to be evaluated. [Default: None]
  • dataset (Dataset or None, optional) – specify the dataset to query. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. [Default: None]
  • annex ({None, 'basic', 'availability', 'all'}, optional) – Switch whether to include information on the annex content of individual files in the status report, such as recorded file size. By default no annex information is reported (faster). Three report modes are available: basic information like file size and key name (‘basic’); additionally test whether file content is present in the local annex (‘availability’; requires one or two additional file system stat calls, but does not call git-annex), this will add the result properties ‘has_content’ (boolean flag) and ‘objloc’ (absolute path to an existing annex object file); or ‘all’ which will report all available information (presently identical to ‘availability’). [Default: None]
  • untracked ({'no', 'normal', 'all'}, optional) – If and how untracked content is reported when comparing a revision to the state of the work tree. ‘no’: no untracked content is reported; ‘normal’: untracked files and entire untracked directories are reported as such; ‘all’: report individual files even in fully untracked directories. [Default: ‘normal’]
  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]
  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]
  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing of other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]
  • proc_post – Like proc_pre, but procedures are executed after the main command has finished. [Default: None]
  • proc_pre – DataLad procedure to run prior to the main command. The argument is a list of lists with procedure names and optional arguments. Procedures are called in the order they are given in this list. It is important to provide the respective target dataset to run a procedure on as the dataset argument of the main command. [Default: None]
  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]
  • result_renderer ({'default', 'json', 'json_pp', 'tailored'} or None, optional) – format of return value rendering on stdout. [Default: None]
  • result_xfm ({'datasets', 'paths', 'relpaths', 'successdatasets-or-none', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]
  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’, a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
datalad_revolution.dataset.get_dataset_root(path)

Return the root of an existent dataset containing a given path

The root path is returned in the same absolute or relative form as the input argument. If no associated dataset exists, or the input path doesn’t exist, None is returned.

If path is a symlink or something other than a directory, the root dataset containing its parent directory will be reported. If none can be found, and a symlink at path points to a dataset, path itself will be reported as the root.
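
A minimal sketch (the query path is hypothetical):

from datalad_revolution.dataset import get_dataset_root

root = get_dataset_root('/tmp/new_dataset/code/script.py')
print('not in a dataset' if root is None else 'dataset root: %s' % root)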

datalad_revolution.dataset.path_under_dataset(ds, path)
datalad_revolution.dataset.require_dataset(dataset, check_installed=True, purpose=None)
datalad_revolution.dataset.resolve_path(path, ds=None)

Resolve a path specification (against a Dataset location)

Any explicit path (absolute or relative) is returned as an absolute path. In case of an explicit relative path (e.g. “./some”, or “.\some” on Windows), the current working directory is used as reference. Any non-explicit relative path is resolved against a dataset location, i.e. considered relative to the location of the dataset. If no dataset is provided, the current working directory is used.

Note however, that this function is not able to resolve arbitrarily obfuscated path specifications. All operations are purely lexical, and no actual path resolution against the filesystem content is performed. Consequently, common relative path arguments like ‘../something’ (relative to PWD) can be handled properly, but things like ‘down/../under’ cannot, as resolving this path properly depends on the actual target of any (potential) symlink leading up to ‘..’.
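
A sketch of the two resolution modes (the dataset location is hypothetical):

from datalad_revolution.dataset import RevolutionDataset, resolve_path

ds = RevolutionDataset('/tmp/new_dataset')  # hypothetical location
# explicit relative path: resolved against the current working directory
p1 = resolve_path('./some', ds)
# non-explicit relative path: resolved against the dataset location
p2 = resolve_path('some', ds)  # -> Path('/tmp/new_dataset/some')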

Parameters:
  • path (str or PathLike) – Platform-specific path specification.
  • ds (Dataset or None) – Dataset instance to resolve non-explicit relative paths against.
Return type: pathlib.Path object