datalad_next.gitremotes.datalad_annex

git-remote-datalad-annex to fetch/push via any git-annex special remote

In essence, this Git remote helper bootstraps a utility repository in order to push/fetch the state of a repository to any location accessible by any git-annex special remote implementation. All information necessary for this bootstrapping is taken from the remote URL specification. The internal utility repository is removed again after every invocation. Therefore, changes to the remote access configuration can be made at any time by simply modifying the configured remote URL.

When installed, this remote helper is invoked for any "URLs" that start with the prefix datalad-annex::. Following this prefix, two types of specifications are supported.

  1. Plain parameters list:

    datalad-annex::?type=<special-remote-type>&[...][exporttree=yes]
    

    In this case the prefix is followed by a URL query string that comprises all necessary (and optional) parameters that would be normally given to the git annex initremote command. It is required to specify the special remote type, and it is possible to request "export" mode for any special remote that supports it. Depending on the chosen special remote additional parameters may be required or supported. Please consult the git-annex documentation at https://git-annex.branchable.com/special_remotes/

  2. URL:

    datalad-annex::<url>[?...]
    

    Alternatively, an actual URL can be given after the prefix. In this case, the URL query string is optional, but can still be used to specify arbitrary parameters for special remote initialization. In addition, the query string can use Python-format-style placeholders to reference particular URL components as parameter values, in order to avoid double specification.

    The list of supported placeholders is scheme, netloc, path, fragment, username, password, hostname, port, corresponding to the respective URL components. In addition, a noquery placeholder is supported, which resolves to the entire URL except any query string. An example of such a URL specification is:

    datalad-annex::file:///tmp/example?type=directory&directory={path}&encryption=none
    

    which would initialize a type=directory special remote pointing at /tmp/example.
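The placeholder resolution described above can be illustrated with a minimal sketch based on Python's urllib.parse. This is not the helper's actual implementation; the function name expand_placeholders is hypothetical, and the assumption is that each placeholder maps onto the attribute of the same name on the parsed URL.

```python
from urllib.parse import urlparse


def expand_placeholders(spec):
    """Return the query parameters of a 'datalad-annex::<url>?...'
    specification with {placeholder} references resolved.

    Illustrative sketch only, not the remote helper's own code.
    """
    prefix = "datalad-annex::"
    if not spec.startswith(prefix):
        raise ValueError("not a datalad-annex:: specification")
    parsed = urlparse(spec[len(prefix):])
    # Assumption: placeholders correspond to the equally named URL components
    values = {
        name: getattr(parsed, name)
        for name in ("scheme", "netloc", "path", "fragment",
                     "username", "password", "hostname", "port")
    }
    # 'noquery' is the entire URL except any query string
    values["noquery"] = parsed._replace(query="").geturl()
    params = {}
    for item in parsed.query.split("&"):
        if not item:
            continue
        key, _, value = item.partition("=")
        params[key] = value.format(**values)
    return params
```

Applied to the example specification above, the directory parameter resolves to the path component of the URL, /tmp/example.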

Caution with collaborative workflows

There is no protection against simultaneous, conflicting repository state uploads from two different locations! Similar to git-annex's "export" feature, this feature is most appropriately used as a dataset deposition mechanism, where uploads are conducted from a single site only -- deposited for consumption by any number of parties.

If this Git remote helper is to be used for multi-way collaboration, with two or more parties contributing updates, it is advisable to employ a separate datalad-annex:: target site for each contributor, such that only one site is pushing to any given location. Updates are exchanged by the remaining contributors adding the respective other datalad-annex:: sites as additional Git remotes, analogous to forks of a repository.

Special remote type support

In addition to the regular list of special remotes, plain http(s) access via URLs is also supported via the 'web' special remote. For such cases, only the base URL and the 'type=web' parameter need to be given, e.g.:

git clone 'datalad-annex::https://example.com?type=web&url={noquery}'

When a plain URL is given, with no parameter specification in a query string, the parameters type=web and exporttree=yes are added automatically by default. This means that this remote helper can clone from any remote deposit accessible via http(s) that matches the layout depicted in the next section.
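The defaulting rule for plain URLs can be sketched as follows. The function name params_with_defaults is hypothetical, and the sketch only covers the two parameters named above; it makes no claim about any further parameters the helper may derive internally.

```python
from urllib.parse import urlparse


def params_with_defaults(spec):
    """Return the list of 'key=value' parameters for a datalad-annex::
    specification, adding the documented defaults for a plain URL.

    Illustrative sketch only, not the remote helper's own code.
    """
    prefix = "datalad-annex::"
    url = spec[len(prefix):] if spec.startswith(prefix) else spec
    query = urlparse(url).query
    if query:
        # explicit parameters were given: use them as they are
        return query.split("&")
    # plain URL without a query string: fall back to the defaults
    return ["type=web", "exporttree=yes"]
```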

Remote layout

The representation of a repository at a remote depends on the chosen type of special remote. In general, two files will be deposited: one text file containing a list of Git refs contained in the deposit, and one ZIP file with a (compressed) archive of a bare Git repository. Besides the idiosyncrasies of particular special remotes, two major modes determine the layout of a remote deposit. In "normal" mode, two annex keys (XDLRA--refs, XDLRA--repo-export) will be deposited. In "export" mode, a directory tree is created that is designed to blend with arbitrary repository content, such that a git remote and a git-annex export can be pushed to the same location without conflicting with each other. The aforementioned files will be represented like this:

.datalad
└── dotgit  # named to not be confused with an actual Git repository
    ├── refs
    └── repo.zip

The default LZMA-compression of the ZIP file (in both export and normal mode) can be turned off with the dladotgit=uncompressed URL parameter.
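To make the "export" mode layout concrete, the following sketch writes the two files with Python's zipfile module. It is an illustration under assumptions, not the helper's code: the function name write_export_layout is hypothetical, the content of the refs file is passed in as an opaque string because its exact format is not specified here, and storing archive members relative to the bare repository root is an assumption.

```python
import zipfile
from pathlib import Path


def write_export_layout(deposit_root, bare_repo, refs_text, compressed=True):
    """Create .datalad/dotgit/{refs,repo.zip} underneath deposit_root.

    Illustrative sketch only, not the remote helper's own code.
    """
    dotgit = Path(deposit_root) / ".datalad" / "dotgit"
    dotgit.mkdir(parents=True, exist_ok=True)
    # refs_text stands in for the list of Git refs (format not specified here)
    (dotgit / "refs").write_text(refs_text)
    # LZMA by default; ZIP_STORED mirrors the dladotgit=uncompressed setting
    compression = zipfile.ZIP_LZMA if compressed else zipfile.ZIP_STORED
    bare_repo = Path(bare_repo)
    with zipfile.ZipFile(dotgit / "repo.zip", "w",
                         compression=compression) as zf:
        for path in sorted(bare_repo.rglob("*")):
            if path.is_file():
                # assumption: members are stored relative to the repo root
                zf.write(path, path.relative_to(bare_repo).as_posix())
    return dotgit
```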

Credential handling

Some git-annex special remotes require the specification of credentials via environment variables. With the URL parameter dlacredential=<name> it is possible to query DataLad for a user/password credential to be used for this purpose. This convenience functionality is supported for the special remotes glacier, s3, and webdav.

When a credential of the given name does not exist, or no credential name was specified, an attempt is made to determine a suitable credential based on, for example, a detected HTTP authentication realm. If no matching credential could be found, the user will be prompted to enter a credential. After having successfully established access, the entered credential will be saved in the local credential store.

DataLad-based credentials are only utilized when the native git-annex credential setup via environment variables is not in use (see the documentation of a particular special remote implementation for more information).
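The precedence rule can be sketched as a small function that only proposes environment variables when none of the native ones are already set. The variable names are those documented by git-annex for the webdav, S3, and glacier special remotes; the function name credential_env and the interpretation of "in use" as "any of the variables is defined" are assumptions of this sketch.

```python
import os

# Environment variables read by git-annex for the respective special remotes
CREDENTIAL_ENV = {
    "webdav": ("WEBDAV_USERNAME", "WEBDAV_PASSWORD"),
    "s3": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "glacier": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
}


def credential_env(remote_type, user, secret, environ=None):
    """Return the environment variables to set for a user/password
    credential, or an empty dict if nothing should be set.

    Illustrative sketch only, not the remote helper's own code.
    """
    environ = os.environ if environ is None else environ
    names = CREDENTIAL_ENV.get(remote_type)
    if names is None:
        # special remote type without this convenience functionality
        return {}
    if any(name in environ for name in names):
        # assumption: any defined variable means the native setup is in use
        return {}
    return dict(zip(names, (user, secret)))
```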

Implementation details

This Git remote implementation uses two extra repositories, besides the repository (R) it is used with, to do its work:

  A. A tiny repository that is entirely bootstrapped from the remote URL, and is used to retrieve/deposit a complete state of the actual repo on a remote site, via a git-annex special remote setup.

  B. A local, fully functional mirror repo of the remotely stored repository state.

On fetch/push, the existence of both additional repositories is ensured. The remote state is retrieved via repo (A), and unpacked to repo (B). The actual fetch/push Git operations are performed locally between the repo (R) and repo (B). On push, repo (B) is then packed up again, and deposited on the remote site via git-annex transfer in repo (A).
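The order of these steps can be summarized in a short sketch in which every step is passed in as a callable. All names are hypothetical and only serve to show the sequence; the real helper implements these steps internally.

```python
def fetch_via_mirror(retrieve, unpack, git_fetch):
    """Fetch: download via repo (A), unpack to repo (B), then fetch locally."""
    mirror = unpack(retrieve())   # remote state -> repo (A) -> repo (B)
    git_fetch(mirror)             # plain Git fetch from repo (B) into repo (R)


def push_via_mirror(retrieve, unpack, git_push, pack, deposit):
    """Push: refresh the mirror, push locally, then pack and upload."""
    archive = retrieve()          # current remote state via repo (A)
    mirror = unpack(archive)      # bring repo (B) up to date
    git_push(mirror)              # plain Git push from repo (R) to repo (B)
    deposit(pack(mirror))         # pack repo (B), upload via repo (A)
```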

Due to a limitation of this implementation, it is possible that when the last upload step fails, Git nevertheless advances the pushed refs, making it appear as if the push was completely successful. That being said, Git will still issue a message (error: failed to push some refs to..) and the git-push process will also exit with a non-zero status. In addition, all of the remote's refs will be annotated with an additional ref named refs/dlra-upload-failed/<remote-name>/<ref-name> to indicate the upload failure. These markers will be automatically removed after the next successful upload.
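A script that wants to detect such a failed upload could look for the marker refs. The sketch below groups marker refs by remote, given a list of ref names (for example, as printed by git for-each-ref --format='%(refname)'). The function name failed_uploads is hypothetical, and the sketch assumes that the remote name itself contains no slash.

```python
MARKER_PREFIX = "refs/dlra-upload-failed/"


def failed_uploads(ref_names):
    """Map remote name -> list of ref names marked as not uploaded.

    Illustrative sketch only; assumes remote names without '/'.
    """
    failed = {}
    for ref in ref_names:
        if not ref.startswith(MARKER_PREFIX):
            continue
        # split '<remote-name>/<ref-name>' at the first slash
        remote, _, name = ref[len(MARKER_PREFIX):].partition("/")
        if remote and name:
            failed.setdefault(remote, []).append(name)
    return failed
```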

Note

Confirmed to work with git-annex version 8.20211123 onwards.

Todo

  • At the moment, only one format for repository deposition is supported (a ZIP archive of a working bare repository). However, this is not a good format for the purpose of long-term archiving, because it requires a functional Git installation to work with. It would be fairly doable to make the deposited format configurable, and support additional formats. An interesting one would be a fast-export stream, basically a plain text serialization of an entire repository.

  • recognize that a different repo is being pushed over an existing one at the remote

  • think about adding additional information to the header of refs; maybe give it some kind of stamp that also makes it easier to validate by the XDLRA backend

  • think about preventing duplication between the repo and its local mirror: could they safely share git objects? If so, in which direction?