datalad.support.annexrepo

Interface to git-annex by Joey Hess.

For further information on git-annex see https://git-annex.branchable.com/.

class datalad.support.annexrepo.AnnexRepo(path, url=None, runner=None, direct=None, backend=None, always_commit=True, create=True, init=False, batch_size=None, version=None, description=None, git_opts=None, annex_opts=None, annex_init_opts=None, repo=None, fake_dates=False)[source]

Bases: datalad.support.gitrepo.GitRepo, datalad.support.repo.RepoInterface

Representation of a git-annex repository.

Relative paths given to any of the class methods are interpreted as relative to PWD if PWD is currently beneath AnnexRepo’s base dir (self.path). If PWD is outside of the repository, relative paths are interpreted as relative to self.path. Absolute paths are accepted either way.
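
The path rule above can be sketched in a few lines. This is only an illustration, not the class's actual code: the function name resolve_repo_path and its explicit repo_path and pwd arguments are hypothetical, and treating "PWD equals the base dir" as being inside the repository is an assumption. posixpath is used so that the example behaves the same on every platform.

```python
import posixpath


def resolve_repo_path(path, repo_path, pwd):
    """Return an absolute path following the rule described above (sketch)."""
    if posixpath.isabs(path):
        # Absolute paths are accepted either way.
        return posixpath.normpath(path)
    repo_path = posixpath.normpath(repo_path)
    pwd = posixpath.normpath(pwd)
    # Assumption: PWD counts as "beneath" the repository if it is the
    # base dir itself or any directory below it.
    inside = pwd == repo_path or pwd.startswith(repo_path + "/")
    base = pwd if inside else repo_path
    return posixpath.normpath(posixpath.join(base, path))


print(resolve_repo_path("data/a.txt", "/repo", "/repo/sub"))
print(resolve_repo_path("data/a.txt", "/repo", "/elsewhere"))
```

The first call resolves against PWD because PWD lies inside the repository; the second falls back to the repository's base dir.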

GIT_ANNEX_MIN_VERSION = '6.20180913'
WEB_UUID = '00000000-0000-0000-0000-000000000001'
add(files, *args, **kwargs)[source]

Add file(s) to the repository.

Parameters:
  • files (list of str) – list of paths to add to the annex
  • git (bool) – if True, add to git instead of annex.
  • backend
  • options
  • update (bool) –
    --update option for git-add. From git’s manpage:
    Update the index just where it already has an entry matching <pathspec>. This removes as well as modifies index entries to match the working tree, but adds no new files.

    If no <pathspec> is given when --update option is used, all tracked files in the entire working tree are updated (old versions of Git used to limit the update to the current directory and its subdirectories).

    Note: Used only if git-add is called instead of git-annex-add.

Returns:

Return type:

list of dict

add_(files, git=None, backend=None, options=None, jobs=None, git_options=None, annex_options=None, update=False)[source]

Like add, but returns a generator

add_remote(name, url, options=None)[source]

Overrides method from GitRepo in order to set remote.<name>.annex-ssh-options in case of an SSH remote.

add_url_to_file(file_, *args, **kwargs)[source]

Add file from url to the annex.

Downloads the file from url and adds it to the annex. If annex already knows the file, records that it can be downloaded from url.

Note: Consider using the higher-level download_url instead.

Parameters:
  • file (str) –
  • url (str) –
  • options (list) – options to the annex command
  • batch (bool, optional) – initiate or continue with a batched run of annex addurl, instead of just calling a single git annex addurl command
  • unlink_existing (bool, optional) – By default, fails if the file already exists and is under git. If True, the existing file is removed first.
Returns:

Currently only in batch mode: a dict representation of the JSON output returned by annex

Return type:

dict

add_urls(urls, options=None, backend=None, cwd=None, jobs=None, git_options=None, annex_options=None)[source]

Downloads each url to its own file, which is added to the annex.

Parameters:
  • urls (list of str) –
  • options (list, optional) – options to the annex command
  • cwd (string, optional) – working directory from within which to invoke git-annex
adjust(options=None)[source]

Enter an adjusted branch

This command is only available in a v6+ git-annex repository.

Parameters:options (list of str) – currently requires ‘--unlock’ or ‘--fix’; default: --unlock
commit(msg=None, options=None, _datalad_msg=False, careless=True, files=None, proxy=False)[source]

Commit changes to git.

Parameters:
  • msg (str, optional) – commit-message
  • options (list of str, optional) – cmdline options for git-commit
  • _datalad_msg (bool, optional) – Signals that the commit is an automated commit by datalad, so that it carries the [DATALAD] prefix
  • careless (bool, optional) – if False, raise when there’s nothing actually committed; if True, don’t care
  • files (list of str, optional) – path(s) to commit
  • date (str, optional) – Date in one of the formats git understands
  • index_file (str, optional) – An alternative index to use
copy_to(files, *args, **kwargs)[source]

Copy the actual content of files to remote

Parameters:
  • files (str or list of str) – path(s) to copy
  • remote (str) – name of remote to copy files to
Returns:

files successfully copied

Return type:

list of str

default_backends
drop(files, *args, **kwargs)[source]

Drops the content of annexed files from this repository.

Drops only if possible with respect to required minimal number of available copies.

Parameters:
  • files (list of str) – paths to drop
  • options (list of str, optional) – commandline options for the git annex drop command
  • jobs (int, optional) – how many jobs to run in parallel (passed to git-annex call)
Returns:

‘success’ item in each object indicates failure/success per file path.

Return type:

list(JSON objects)
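
As a sketch of how the per-file ‘success’ item might be consumed, the helper below partitions result records into succeeded and failed paths. The helper name split_by_success is hypothetical, the use of a ‘file’ item for the path is an assumption (only ‘success’ is documented above), and the sample records are made up for illustration.

```python
def split_by_success(results):
    """Partition per-file result records into (succeeded, failed) path lists."""
    succeeded, failed = [], []
    for record in results:
        # Assumption: each record names its path under the 'file' item.
        path = record.get("file")
        if record.get("success"):
            succeeded.append(path)
        else:
            failed.append(path)
    return succeeded, failed


# Made-up records standing in for the documented return value.
example_results = [
    {"file": "a.dat", "success": True},
    {"file": "b.dat", "success": False},
]
ok_paths, failed_paths = split_by_success(example_results)
print(ok_paths, failed_paths)
```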

drop_key(keys, options=None, batch=False)[source]

Drops the content of annexed files from this repository referenced by keys

Dangerous: it drops without checking for required minimal number of available copies.

Parameters:
  • keys (list of str, str) –
  • batch (bool, optional) – initiate or continue with a batched run of annex dropkey, instead of just calling a single git annex dropkey command
enable_remote(name, env=None)[source]

Enables use of an existing special remote

Parameters:name (str) – name, the special remote was created with
file_has_content(files, *args, **kwargs)[source]

Check whether files have their content present under annex.

Parameters:
  • files (list of str) – file(s) to check for being actually present.
  • allow_quick (bool, optional) – allow quick check, based on having a symlink into .git/annex/objects. Works only in non-direct mode (TODO: thin mode)
Returns:

For each input file states whether file has content locally

Return type:

list of bool

find(files, *args, **kwargs)[source]

Run git annex find on file(s).

Parameters:
  • files (list of str) – files to find under annex
  • batch (bool, optional) – initiate or continue with a batched run of annex find, instead of just calling a single git annex find command. If any items in files are directories, this value is treated as False.
Returns:

A dictionary that maps each item in files to its git annex find result. Items without a successful result will be an empty string, and multi-item results (which can occur if files includes a directory) will be returned as a list.

fsck()[source]
get(files, *args, **kwargs)[source]

Get the actual content of files

Parameters:
  • files (list of str) – paths to get
  • remote (str, optional) – from which remote to fetch content
  • options (list of str, optional) – commandline options for the git annex get command
  • jobs (int or None, optional) – how many jobs to run in parallel (passed to git-annex call). If not specified (None), then
  • key (bool, optional) – If provided file value is actually a key
Returns:

files

Return type:

list of dict

get_annexed_files(with_content_only=False, patterns=None)[source]

Get a list of files in annex

Parameters:
  • with_content_only (bool, optional) – Only list files whose content is present.
  • patterns (list, optional) – Globs to pass to annex’s --include=. Files that match any of these will be returned (i.e., they’ll be separated by --or).
Returns:

Return type:

A list of file names

get_contentlocation(key, batch=False)[source]

Get location of the key content

Normally under .git/annex/objects in indirect mode and within the file tree in direct mode.

Unfortunately, there is no (easy) way to distinguish a key that is simply incorrect (not known to annex) from one whose content is not currently present: in both cases annex just silently exits with -1

Parameters:
  • key (str) – key
  • batch (bool, optional) – initiate or continue with a batched run of annex contentlocation
Returns:

path relative to the top directory of the repository. If no content is present, empty string is returned

Return type:

str

get_corresponding_branch(branch=None)[source]

In case of a managed branch, get the corresponding one.

If branch is not a managed branch, return that branch without any changes.

Note: Since default for branch is the active branch, get_corresponding_branch() is equivalent to get_active_branch() if the active branch is not a managed branch.

Parameters:branch (str) – name of the branch; defaults to active branch
Returns:name of the corresponding branch if there is any, name of the queried branch otherwise.
Return type:str
get_description(uuid=None)[source]

Get annex repository description

Parameters:uuid (str, optional) – UUID of the remote whose description to report
Returns:None if not found
Return type:str or None
get_file_backend(files, *args, **kwargs)[source]

Get the backend currently used for file(s).

Parameters:files (list of str) –
Returns:For each file in input list indicates the used backend by a str like “SHA256E” or “MD5”.
Return type:list of str
get_file_key(files, *args, **kwargs)[source]

Get key of an annexed file.

Parameters:
  • files (str or list) – file(s) to look up
  • batch (None or bool, optional) – If True, lookupkey --batch process will be used, which would not crash even if provided file is not under annex (but directly under git), but rather just return an empty string. If False, invokes without --batch. If None, use batch mode if more than a single file is provided.
Returns:

keys used by git-annex for each of the files; in case of a list an empty string is returned if there was no key for that file

Return type:

str or list

Raises:
  • FileInGitError – If running in non-batch mode and a file is under git, not annex
  • FileNotInAnnexError – If running in non-batch mode and a file is not under git at all
get_file_size(file_, *args, **kwargs)[source]
get_groupwanted(name)[source]

Get groupwanted expression for a group name

Parameters:name (str) – Name of the groupwanted group
classmethod get_key_backend(key)[source]

Get the backend from a given key
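
A minimal sketch of the idea, assuming the usual git-annex key layout in which the backend name is the text before the first hyphen (for example SHA256E-s31390--<hash>.mp3). The function name key_backend is hypothetical and this is not necessarily how the classmethod is implemented.

```python
def key_backend(key):
    """Return the backend name encoded at the start of a git-annex key."""
    backend, separator, _rest = key.partition("-")
    if not separator or not backend:
        # Assumption: a key without a leading "BACKEND-" part is malformed.
        raise ValueError("not a valid annex key: %r" % (key,))
    return backend


print(key_backend("SHA256E-s31390--0123abcd.mp3"))
print(key_backend("MD5-s10--0123abcd"))
```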

get_metadata(files, timestamps=False)[source]

Query git-annex file metadata

Parameters:
  • files (str or list(str)) – One or more paths for which metadata is to be queried.
  • timestamps (bool, optional) – If True, the output contains a ‘<metadatakey>-lastchanged’ key for every metadata item, reflecting the modification time, as well as a ‘lastchanged’ key with the most recent modification time of any metadata item.
Returns:

One tuple per file (could be more items than input arguments when directories are given). First tuple item is the filename, second item is a dictionary with metadata key/value pairs. Note that annex metadata tags are stored under the key ‘tag’, which is a regular metadata item that can be manipulated like any other.

Return type:

generator
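
The (filename, metadata dict) tuples described above could be consumed as in the following sketch. The helper collect_tags and the stub class are hypothetical; the stub only imitates the documented return shape so that the example runs without a repository. That metadata values are lists of strings is an assumption.

```python
def collect_tags(repo, files):
    """Map each reported filename to the values stored under the 'tag' key."""
    tags = {}
    for filename, metadata in repo.get_metadata(files):
        # Assumption: metadata values are lists; missing 'tag' means no tags.
        tags[filename] = list(metadata.get("tag", []))
    return tags


class _StubRepo:
    """Stand-in that yields made-up tuples in the documented shape."""

    def get_metadata(self, files, timestamps=False):
        yield "a.dat", {"tag": ["raw"], "subject": ["01"]}
        yield "b.dat", {}


tags_by_file = collect_tags(_StubRepo(), ["a.dat", "b.dat"])
print(tags_by_file)
```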

get_preferred_content(property, remote=None)[source]

Get preferred content configuration of a repository or remote

Parameters:
  • property ({'wanted', 'required', 'group'}) – Type of property to query
  • remote (str, optional) – If not specified (None), returns the property for the local repository.
Returns:

The setting, or an empty string if there is none.

Return type:

str

Raises:
  • ValueError – If an unknown property label is given.
  • CommandError – If the annex call errors.
get_remotes(with_urls_only=False, exclude_special_remotes=False)[source]

Get known (special-) remotes of the repository

Parameters:
  • exclude_special_remotes (bool, optional) – if True, don’t return annex special remotes
  • with_urls_only (bool, optional) – return only remotes which have urls
Returns:

remotes – List of names of the remotes

Return type:

list of str

static get_size_from_key(key)[source]

A little helper to obtain size encoded in a key
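
A sketch of how a size can be read out of a key, assuming the usual git-annex key layout in which optional fields such as -s<bytes> sit between the backend name and the "--" that precedes the key name. The function name size_from_key is hypothetical, and returning None for keys without a size field is an assumption about the desired behavior.

```python
import re


def size_from_key(key):
    """Return the size in bytes encoded in a git-annex key, or None."""
    # Only inspect the fields before the "--" that introduces the key name.
    fields, separator, _name = key.partition("--")
    if not separator:
        return None
    # The size field is a lowercase "s" followed by digits.
    match = re.search(r"-s(\d+)(?:-|$)", fields)
    return int(match.group(1)) if match else None


print(size_from_key("SHA256E-s31390--0123abcd.mp3"))
print(size_from_key("URL--http://example.com/file"))
```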

get_special_remotes()[source]

Get info about all known (not just enabled) special remotes.

Returns:Keys are special remote UUIDs, values are dicts with arguments for git-annex enableremote. This includes at least the ‘type’ and ‘name’ of a special remote. Each type of special remote may require additional arguments that will be available in the respective dictionary.
Return type:dict
get_status(untracked=True, deleted=True, modified=True, added=True, type_changed=True, submodules=True, path=None)[source]

Return various aspects of the status of the annex repository

Note: Under certain circumstances newly added submodules might be reported as ‘modified’ rather than ‘added’. See AnnexRepo._submodules_dirty_direct_mode for details.

Parameters:
  • untracked
  • deleted
  • modified
  • added
  • type_changed
  • submodules
  • path
classmethod get_toppath(path, follow_up=True, git_options=None)[source]

Return top-level of a repository given the path.

Parameters:
  • follow_up (bool) – If path contains symlinks, they get resolved by git. If follow_up is True, we will follow the original path up until we hit the same resolved path. If no such path is found, the resolved one is returned.
  • git_options (list of str) – options to be passed to the git rev-parse call
Returns:None if no parent directory contains a git repository.
get_tracking_branch(branch=None, corresponding=True)[source]

Get the tracking branch for branch if there is any.

By default returns the tracking branch of the corresponding branch if branch is a managed branch.

Parameters:
  • branch (str) – local branch to look up. If none is given, active branch is used.
  • corresponding (bool) – If True actually look up the corresponding branch of branch (also if branch isn’t explicitly given)
Returns:

(remote or None, refspec or None) of the tracking branch

Return type:

tuple

get_urls(file_, *args, **kwargs)[source]

Get URLs for a file/key

Parameters:
  • file (str) –
  • key (bool, optional) – Whether provided files are actually annex keys
Returns:

Return type:

A list of URLs

git_annex_version = None
info(files, *args, **kwargs)[source]

Provide annex info for file(s).

Parameters:files (list of str) – files to look for
Returns:Info for each file
Return type:dict
init_remote(name, options)[source]

Creates a new special remote

Parameters:name (str) – name of the special remote
is_available(files, *args, **kwargs)[source]

Check if file or key is available (from a remote)

If key or remote is misspecified, this does not fail but just keeps returning False, although possibly also complaining out loud ;)

Parameters:
  • file (str) – Filename or a key
  • remote (str, optional) – Remote to check. If None, possibly multiple remotes are checked before a positive result is reported
  • key (bool, optional) – Whether provided files are actually annex keys
  • batch (bool, optional) – Initiate or continue with a batched run of annex checkpresentkey
Returns:

with True indicating that file/key is available from (the) remote

Return type:

bool

is_crippled_fs()[source]

Return True if git-annex considers current filesystem ‘crippled’.

Returns:
Return type:True if on crippled filesystem, False otherwise
is_direct_mode()[source]

Return True if annex is in direct mode

Returns:
Return type:True if in direct mode, False otherwise.
is_dirty(index=True, working_tree=False, untracked_files=True, submodules=True, path=None)[source]

Returns true if the repo is considered to be dirty

Parameters:
  • index (bool) – if True, consider changes to the index
  • working_tree (bool) – if True, consider changes to the working tree
  • untracked_files (bool) – if True, consider untracked files
  • submodules (bool) – if True, consider submodules
  • path (str or list of str) – path(s) to consider only
Returns:

Return type:

bool

is_managed_branch(branch=None)[source]

Whether branch is managed by git-annex.

ATM this returns True in direct mode (branch ‘annex/direct/my_branch’) and if on an adjusted branch (annex v6+ repository: either ‘adjusted/my_branch(unlocked)’ or ‘adjusted/my_branch(fixed)’).

Note: The term ‘managed branch’ is used to make clear it’s meant to be more general than the v6+ ‘adjusted branch’.

Parameters:branch (str) – name of the branch; default: active branch
Returns:True if on a managed branch, False otherwise
Return type:bool
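
The branch naming patterns mentioned above can be recognized with simple string handling. The sketch below is based only on the three example names given in this section; the function names are hypothetical and the real methods may apply different or additional rules.

```python
import re

# Patterns taken from the examples above:
#   annex/direct/my_branch
#   adjusted/my_branch(unlocked), adjusted/my_branch(fixed)
_DIRECT_PREFIX = "annex/direct/"
_ADJUSTED = re.compile(r"^adjusted/(?P<name>.+)\((?:unlocked|fixed)\)$")


def corresponding_branch_name(branch):
    """Return the branch a managed branch name refers to, else the name itself."""
    if branch.startswith(_DIRECT_PREFIX):
        return branch[len(_DIRECT_PREFIX):]
    match = _ADJUSTED.match(branch)
    if match:
        return match.group("name")
    return branch


def looks_managed(branch):
    """Return True if the name follows one of the managed-branch patterns."""
    return corresponding_branch_name(branch) != branch


print(corresponding_branch_name("adjusted/master(unlocked)"))
print(looks_managed("master"))
```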
is_remote_annex_ignored(remote)[source]

Return True if remote is explicitly ignored

is_special_annex_remote(remote, check_if_known=True)[source]

Return whether remote is a special annex remote

Decides based on the presence of diagnostic annex- options for the remote

is_under_annex(files, *args, **kwargs)[source]

Check whether files are under annex control

Parameters:
  • files (list of str) – file(s) to check for being under annex
  • allow_quick (bool, optional) – allow quick check, based on having a symlink into .git/annex/objects. Works only in non-direct mode (TODO: thin mode)
Returns:

For each input file states whether file is under annex

Return type:

list of bool

classmethod is_valid_repo(path, allow_noninitialized=False)[source]

Return True if given path points to an annex repository

lock(files, *args, **kwargs)[source]

undo unlock

Use this to undo an unlock command if you don’t want to modify the files any longer, or have made modifications you want to discard.

Parameters:
  • files (list of str) –
  • options (list of str) –
merge_annex(remote=None)[source]

Merge git-annex branch

Merely calls sync with the appropriate arguments.

Parameters:remote (str, optional) – Name of a remote to be “merged”.
migrate_backend(files, *args, **kwargs)[source]

Changes the backend used for file.

The backend used for the key-value of files. Only files currently present are migrated. Note: There will be no notification if migrating fails due to the absence of a file’s content!

Parameters:
  • files (list) – files to migrate.
  • backend (str) – specify the backend to migrate to. If none is given, the default backend of this instance will be used.
precommit()[source]

Perform pre-commit maintenance tasks, such as closing all batched annexes since they might still need to flush their changes into index

proxy(git_cmd, **kwargs)[source]

Use git-annex as a proxy to git

This is needed in case we are in direct mode, since there is no git working tree that git can handle.

Parameters:
  • git_cmd (list of str) – the actual git command
  • **kwargs (dict, optional) – passed to _run_annex_command
Returns:

output of the command call

Return type:

(stdout, stderr)

remove(files, *args, **kwargs)[source]

Remove files from git/annex (works in direct mode as well)

Parameters:
  • files
  • force (bool, optional) –
repo_info(fast=False)[source]

Provide annex info for the entire repository.

Returns:Info for the repository, with keys matching the ones returned by annex
Return type:dict
rm_url(file_, *args, **kwargs)[source]

Record that the file is no longer available at the url.

Parameters:
  • file (str) –
  • url (str) –
set_default_backend(backend, persistent=True, commit=True)[source]

Set default backend

Parameters:
  • backend (str) –
  • persistent (bool, optional) – If True, adds/commits the setting to .gitattributes. If False, sets it within .git/config
set_direct_mode(enable_direct_mode=True)[source]

Switch to direct or indirect mode

Parameters:enable_direct_mode (bool) – True means switch to direct mode, False switches to indirect mode
Raises:CommandNotAvailableError – in case you try to switch to indirect mode on a crippled filesystem
set_groupwanted(name, expr)[source]

Set expr for the name groupwanted

set_metadata(files, reset=None, add=None, init=None, remove=None, purge=None, recursive=False)[source]

Manipulate git-annex file-metadata

Parameters:
  • files (str or list(str)) – One or more paths for which metadata is to be manipulated. The changes applied to each file item are uniform. However, the result may not be uniform across files, depending on the actual operation.
  • reset (dict, optional) – Metadata items matching keys in the given dict are (re)set to the respective values.
  • add (dict, optional) – The values of matching keys in the given dict are appended to any existing values. The metadata keys need not exist beforehand.
  • init (dict, optional) – Metadata items for the keys in the given dict are set to the respective values, if the key is not yet present in a file’s metadata.
  • remove (dict, optional) – Values in the given dict are removed from the metadata items matching the respective key, if they exist in a file’s metadata. Non-existing values or keys do not lead to failure.
  • purge (list, optional) – Any metadata item with a key matching an entry in the given list is removed from the metadata.
  • recursive (bool, optional) – If False, fail (with CommandError) when directory paths are given as files.
Returns:

JSON obj per modified file

Return type:

generator
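
To make the difference between reset, add, init, remove and purge concrete, the following sketch models them on a plain dict that maps metadata keys to sets of values. It is an in-memory illustration only: the function apply_metadata_ops is hypothetical, values are assumed to be given as lists, and the order in which combined operations are applied here is arbitrary, so the example uses one operation per call.

```python
def apply_metadata_ops(current, reset=None, add=None, init=None,
                       remove=None, purge=None):
    """Return a new metadata dict (key -> set of values) after the operations."""
    result = {key: set(values) for key, values in current.items()}
    for key in purge or []:
        # purge: drop the whole item, ignoring unknown keys
        result.pop(key, None)
    for key, values in (reset or {}).items():
        # reset: replace whatever was stored under the key
        result[key] = set(values)
    for key, values in (add or {}).items():
        # add: append to existing values; the key need not exist yet
        result.setdefault(key, set()).update(values)
    for key, values in (init or {}).items():
        # init: only set keys that were not present in the original metadata
        if key not in current:
            result[key] = set(values)
    for key, values in (remove or {}).items():
        # remove: missing keys or values are silently ignored
        if key in result:
            result[key] -= set(values)
    return result


current = {"tag": {"raw"}, "subject": {"01"}}
print(apply_metadata_ops(current, add={"tag": ["qc"]}))
print(apply_metadata_ops(current, init={"tag": ["x"], "site": ["A"]}))
```

In the second call the existing ‘tag’ item is left alone while the new ‘site’ item is created, which is the distinguishing behavior of init compared to reset.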

set_preferred_content(property, expr, remote=None)[source]

Set preferred content configuration of a repository or remote

Parameters:
  • property ({'wanted', 'required', 'group'}) – Type of property to set
  • expr (str) – Any expression or label supported by git-annex for the given property.
  • remote (str, optional) – If not specified (None), sets the property for the local repository.
Returns:

Raw git-annex output in response to the set command.

Return type:

str

Raises:
  • ValueError – If an unknown property label is given.
  • CommandError – If the annex call errors.
set_remote_dead(name)[source]

Announce to annex that remote is “dead”

set_remote_url(name, url, push=False)[source]

Set the URL a remote is pointing to

Sets the URL of the remote name. Requires the remote to already exist.

Parameters:
  • name (str) – name of the remote
  • url (str) –
  • push (bool) – if True, set the push URL, otherwise the fetch URL; if True, additionally set annexurl to url, to make sure annex uses it to talk to the remote, since access via fetch URL might be restricted.
supports_unlocked_pointers

Return True if repository version supports unlocked pointers.

sync(remotes=None, push=True, pull=True, commit=True, content=False, all=False, fast=False)[source]

Synchronize local repository with remotes

Use this command when you want to synchronize the local repository with one or more of its remotes. You can specify the remotes (or remote groups) to sync with by name; the default if none are specified is to sync with all remotes.

Parameters:
  • remotes (str, list(str), optional) – Name of one or more remotes to be sync’ed.
  • push (bool) – By default, git pushes to remotes.
  • pull (bool) – By default, git pulls from remotes
  • commit (bool) – A commit is done by default. Disable to avoid committing local changes.
  • content (bool) – Normally, syncing does not transfer the contents of annexed files. This option causes the content of files in the work tree to also be uploaded and downloaded as necessary.
  • all (bool) – This option, when combined with content, makes all available versions of all files be synced, when preferred content settings allow
  • fast (bool) – Only sync with the remotes with the lowest annex-cost value configured
unannex(files, *args, **kwargs)[source]

undo accidental add command

Use this to undo an accidental git annex add command. Note that for safety, the content of the file remains in the annex, until you use git annex unused and git annex dropunused.

Parameters:
  • files (list of str) –
  • options (list of str) –
Returns:

successfully unannexed files

Return type:

list of str

unlock(files, *args, **kwargs)[source]

unlock files for modification

Parameters:
  • files (list of str) –
  • options (list of str) –
Returns:

successfully unlocked files

Return type:

list of str

untracked_files

Get a list of untracked files

uuid

Annex UUID

Returns:The annex UUID, if there is any, or None otherwise.
Return type:str
whereis(files, *args, **kwargs)[source]

Lists repositories that have actual content of file(s).

Parameters:
  • files (list of str) – files to look for
  • output ({'descriptions', 'uuids', 'full'}, optional) – If ‘descriptions’, a list of remote descriptions is returned for each file. If ‘full’, for each file a dictionary of all fields is returned, as reported by annex
  • key (bool, optional) – Whether provided files are actually annex keys
  • options (list, optional) – Options to pass into git-annex call
Returns:

If output == ‘descriptions’, contains for each input file a list of descriptions of the remotes found by git-annex whereis, like:

u'me@mycomputer:~/where/my/repo/is [origin]' or
u'web' or
u'me@mycomputer:~/some/other/clone'

If output == ‘uuids’, returns a list of uuids. If output == ‘full’, returns a dictionary with filenames as keys and a detailed record as values, e.g.:

{'00000000-0000-0000-0000-000000000001': {
  'description': 'web',
  'here': False,
  'urls': ['http://127.0.0.1:43442/about.txt', 'http://example.com/someurl']
}}

Return type:

list of list of unicode or dict
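
A detailed record of the kind shown above for output == ‘full’ can be inspected with ordinary dict access. The helpers below are hypothetical and operate on a single per-file record; the sample record simply repeats the example from this section.

```python
WEB_UUID = "00000000-0000-0000-0000-000000000001"


def web_urls(record):
    """Return the URLs registered for the web remote in one per-file record."""
    return list(record.get(WEB_UUID, {}).get("urls", []))


def local_copy_present(record):
    """Return True if any location in the record is flagged as 'here'."""
    return any(info.get("here") for info in record.values())


sample_record = {
    "00000000-0000-0000-0000-000000000001": {
        "description": "web",
        "here": False,
        "urls": ["http://127.0.0.1:43442/about.txt", "http://example.com/someurl"],
    }
}
print(web_urls(sample_record))
print(local_copy_present(sample_record))
```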

class datalad.support.annexrepo.BatchedAnnex(annex_cmd, git_options=None, annex_options=None, path=None, json=False, output_proc=None)[source]

Bases: object

Container for an annex process which would allow for persistent communication

close(return_stderr=False)[source]

Close communication and wait for process to terminate

Returns:stderr output if return_stderr and stderr file was there. None otherwise
Return type:str
class datalad.support.annexrepo.BatchedAnnexes(batch_size=0, git_options=None)[source]

Bases: dict

Class to contain the registry of active batch’ed instances of annex for a repository

clear()[source]

Override just to make sure we don’t rely on __del__ to close all the pipes

close()[source]

Close communication to all the batched annexes

It does not remove them from the dictionary though
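
The split between close() and clear() described above can be illustrated with a small dict subclass. This is a sketch of the pattern only, with hypothetical class names; it is not the actual BatchedAnnexes code, and the fake process merely records whether it was closed.

```python
class ClosingRegistry(dict):
    """Dict of process-like objects that closes them explicitly."""

    def close(self):
        # Close every registered process but keep the entries in place.
        for process in self.values():
            process.close()

    def clear(self):
        # Close explicitly before dropping the entries, instead of
        # relying on __del__ of the stored objects.
        self.close()
        super().clear()


class _FakeProcess:
    """Stand-in for a batched process; only tracks its closed state."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


registry = ClosingRegistry()
process = _FakeProcess()
registry["find"] = process
registry.close()
still_registered = "find" in registry
registry.clear()
print(process.closed, still_registered, len(registry))
```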

get(k[, d]) → D[k] if k in D, else d. d defaults to None.[source]
class datalad.support.annexrepo.ProcessAnnexProgressIndicators(expected=None)[source]

Bases: object

‘Filter’ for annex --json output to react to progress indicators

An instance of this class should be passed as the log_stdout option of the runner for git-annex commands

finish()[source]
start()[source]
datalad.support.annexrepo.readline_json(stdout)[source]
datalad.support.annexrepo.readline_rstripped(stdout)[source]
datalad.support.annexrepo.readlines_until_ok_or_failed(stdout, maxlines=100)[source]

Read stdout until line ends with ok or failed
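
A minimal sketch of such a reader, assuming stdout is a text file-like object with readline(). The function name read_until_ok_or_failed is hypothetical, and several details are assumptions rather than documented behavior: trailing whitespace is stripped, the collected lines are returned joined by newlines, and reading simply stops at end of input or after maxlines lines. The sample input lines are made up.

```python
import io


def read_until_ok_or_failed(stream, maxlines=100):
    """Collect lines from stream until one ends with 'ok' or 'failed'."""
    lines = []
    for _ in range(maxlines):
        line = stream.readline()
        if not line:
            # End of input before a terminating line was seen.
            break
        line = line.rstrip()
        lines.append(line)
        if line.endswith("ok") or line.endswith("failed"):
            break
    return "\n".join(lines)


stream = io.StringIO("step one\nstep two ok\nleftover\n")
collected = read_until_ok_or_failed(stream)
remaining = stream.readline()
print(repr(collected), repr(remaining))
```

Reading stops right after the terminating line, so anything that follows stays available in the stream for the next read.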