datalad_next.gitremotes.datalad_annex
git-remote-datalad-annex to fetch/push via any git-annex special remote
In essence, this Git remote helper bootstraps a utility repository in order to push/fetch the state of a repository to any location accessible by any git-annex special remote implementation. All information necessary for this bootstrapping is taken from the remote URL specification. The internal utility repository is removed again after every invocation. Therefore changes to the remote access configuration can be made any time by simply modifying the configured remote URL.
When installed, this remote helper is invoked for any "URLs" that start with the prefix datalad-annex::. Following this prefix, two types of specifications are supported.
Plain parameters list:
datalad-annex::?type=<special-remote-type>&[...][exporttree=yes]
In this case the prefix is followed by a URL query string that comprises all necessary (and optional) parameters that would normally be given to the git annex initremote command. It is required to specify the special remote type, and it is possible to request "export" mode for any special remote that supports it. Depending on the chosen special remote, additional parameters may be required or supported. Please consult the git-annex documentation at https://git-annex.branchable.com/special_remotes/
URL:
datalad-annex::<url>[?...]
Alternatively, an actual URL can be given after the prefix. In this case, the now optional URL query string can still be used to specify arbitrary parameters for special remote initialization. In addition, the query string specification can use Python-format-style placeholders to reference particular URL components as parameter values, in order to avoid double-specification.
The list of supported placeholders is scheme, netloc, path, fragment, username, password, hostname, port, corresponding to the respective URL components. In addition, a noquery placeholder is supported, which resolves to the entire URL except any query string. An example of such a URL specification is:
datalad-annex::file:///tmp/example?type=directory&directory={path}&encryption=none
which would initialize a type=directory special remote pointing at /tmp/example.
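The placeholder substitution described above can be illustrated with a short Python sketch. It is not the helper's actual implementation: the function name expand_params is hypothetical, and it assumes that the placeholders map directly onto the attributes of the result of urllib.parse.urlparse.

```python
from urllib.parse import parse_qsl, urlparse

PREFIX = "datalad-annex::"


def expand_params(remote_url: str) -> dict:
    """Return special remote parameters with URL placeholders filled in.

    Hypothetical helper for illustration only.
    """
    if not remote_url.startswith(PREFIX):
        raise ValueError(f"not a {PREFIX} URL: {remote_url!r}")
    parts = urlparse(remote_url[len(PREFIX):])
    # Values made available to the placeholders named in the text.
    fields = {
        name: getattr(parts, name)
        for name in ("scheme", "netloc", "path", "fragment",
                     "username", "password", "hostname", "port")
    }
    # noquery: the entire URL except any query string.
    fields["noquery"] = parts._replace(query="").geturl()
    # Each query parameter value may reference the fields above.
    return {
        key: value.format(**fields)
        for key, value in parse_qsl(parts.query)
    }


params = expand_params(
    "datalad-annex::file:///tmp/example"
    "?type=directory&directory={path}&encryption=none"
)
print(params)
```

In this sketch the directory parameter resolves to /tmp/example, so the path has to be written only once in the URL.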
Caution with collaborative workflows
There is no protection against simultaneous, conflicting repository state uploads from two different locations! Similar to git-annex's "export" feature, this feature is most appropriately used as a dataset deposition mechanism, where uploads are conducted from a single site only -- deposited for consumption by any number of parties.
If this Git remote helper is to be used for multi-way collaboration, with two or more parties contributing updates, it is advisable to employ a separate datalad-annex:: target site for each contributor, such that only one site is pushing to any given location. Updates are exchanged by the remaining contributors adding the respective other datalad-annex:: sites as additional Git remotes, analogous to forks of a repository.
Special remote type support
In addition to the regular list of special remotes, plain http(s) access via URLs is also supported via the 'web' special remote. For such cases, only the base URL and the 'type=web' parameter need to be given, e.g.:
git clone 'datalad-annex::https://example.com?type=web&url={noquery}'
When a plain URL is given, with no parameter specification in a query string, the parameters type=web and exporttree=yes are added automatically by default. This means that this remote helper can clone from any remote deposit accessible via http(s) that matches the layout depicted in the next section.
Remote layout
The representation of a repository at a remote depends on the chosen type of special remote. In general, two files will be deposited: one text file containing a list of the Git refs contained in the deposit, and one ZIP file with a (compressed) archive of a bare Git repository. Besides the idiosyncrasies of particular special remotes, two major modes determine the layout of a remote deposit. In "normal" mode, two annex keys (XDLRA--refs, XDLRA--repo-export) will be deposited. In "export" mode, a directory tree is created that is designed to blend with arbitrary repository content, such that a git remote and a git-annex export can be pushed to the same location without conflicting with each other. The aforementioned files will be represented like this:
.datalad
└── dotgit  # named to not be confused with an actual Git repository
    ├── refs
    └── repo.zip
The default LZMA compression of the ZIP file (in both export and normal mode) can be turned off with the dladotgit=uncompressed URL parameter.
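To make the difference between the compressed and uncompressed deposit concrete, the following Python sketch packs a directory into a ZIP file with either LZMA compression or no compression. It only illustrates the archive format using the standard zipfile module. The function name pack_repo and its arguments are hypothetical, and the helper's own packing code may differ (for example in how it treats empty directories).

```python
import zipfile
from pathlib import Path


def pack_repo(repo_dir: Path, archive: Path, compressed: bool = True) -> None:
    """Pack all files below repo_dir into a ZIP archive.

    Hypothetical helper for illustration only. The archive must be
    located outside of repo_dir. Empty directories are not recorded.
    """
    # LZMA by default; ZIP_STORED writes the files without compression.
    method = zipfile.ZIP_LZMA if compressed else zipfile.ZIP_STORED
    with zipfile.ZipFile(archive, "w", compression=method) as zf:
        for item in sorted(repo_dir.rglob("*")):
            if item.is_file():
                # Store paths relative to the repository root.
                zf.write(item, item.relative_to(repo_dir).as_posix())
```

Calling pack_repo(Path("repo.git"), Path("repo.zip"), compressed=False) on a hypothetical bare repository would correspond to the uncompressed variant.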
Credential handling
Some git-annex special remotes require the specification of credentials via environment variables. With the URL parameter dlacredential=<name> it is possible to query DataLad for a user/password credential to be used for this purpose. This convenience functionality is supported for the special remotes glacier, s3, and webdav.
When a credential of the given name does not exist, or no credential name was specified, an attempt is made to determine a suitable credential based on, for example, a detected HTTP authentication realm. If no matching credential could be found, the user will be prompted to enter a credential. After having successfully established access, the entered credential will be saved in the local credential store.
DataLad-based credentials are only utilized when the native git-annex credential setup via environment variables is not in use (see the documentation of a particular special remote implementation for more information).
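The precedence rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the helper's code: the function credential_env is hypothetical, and the variable names are the ones documented by git-annex for the s3, glacier, and webdav special remotes.

```python
import os

# Environment variables read by the respective git-annex special remotes.
CREDENTIAL_ENV = {
    "s3": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "glacier": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "webdav": ("WEBDAV_USERNAME", "WEBDAV_PASSWORD"),
}


def credential_env(remote_type, user, secret, environ=None):
    """Return the variables to set for a user/password credential.

    Hypothetical helper for illustration only. An empty dict means that
    nothing should be set, either because the remote type is unsupported
    or because the native environment-based setup is already in use.
    """
    environ = os.environ if environ is None else environ
    names = CREDENTIAL_ENV.get(remote_type)
    if names is None:
        return {}
    if any(name in environ for name in names):
        # The native git-annex credential setup takes precedence.
        return {}
    return dict(zip(names, (user, secret)))
```

For example, credential_env("webdav", "alice", "s3cret", environ={}) yields values for WEBDAV_USERNAME and WEBDAV_PASSWORD, while the same call with WEBDAV_USERNAME already defined yields nothing.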
Implementation details
This Git remote implementation uses two extra repositories, besides the repository (R) it is used with, to do its work:
(A) A tiny repository that is entirely bootstrapped from the remote URL, and is used to retrieve/deposit a complete state of the actual repo on a remote site, via a git-annex special remote setup.
(B) A local, fully functional mirror repo of the remotely stored repository state.
On fetch/push, the existence of both additional repositories is ensured. The remote state is retrieved via repo (A), and unpacked to repo (B). The actual fetch/push Git operations are performed locally between repo (R) and repo (B). On push, repo (B) is then packed up again, and deposited on the remote site via git-annex transfer in repo (A).
Due to a limitation of this implementation, it is possible that when the last upload step fails, Git nevertheless advances the pushed refs, making it appear as if the push was completely successful. That being said, Git will still issue a message (error: failed to push some refs to..) and the git-push process will also exit with a non-zero status. In addition, all of the remote's refs will be annotated with an additional ref named refs/dlra-upload-failed/<remote-name>/<ref-name> to indicate the upload failure. These markers will be automatically removed after the next successful upload.
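A script can check for such failure markers by listing the refs below refs/dlra-upload-failed/. The following Python sketch does this with git for-each-ref. The function names are hypothetical, and it assumes that the remote name contains no slash, so that everything after the first path component is treated as the ref name.

```python
import subprocess

MARKER_PREFIX = "refs/dlra-upload-failed/"


def parse_failed_uploads(ref_lines: str) -> dict:
    """Group marker refs by remote name, one ref name per input line.

    Hypothetical helper; assumes remote names without a slash.
    """
    failed = {}
    for line in ref_lines.splitlines():
        line = line.strip()
        if not line.startswith(MARKER_PREFIX):
            continue
        remote, _, ref = line[len(MARKER_PREFIX):].partition("/")
        failed.setdefault(remote, []).append(ref)
    return failed


def failed_uploads(repo_path: str = ".") -> dict:
    """Query a local repository for upload failure markers."""
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(refname)", MARKER_PREFIX],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_failed_uploads(out)
```

An empty result from failed_uploads() indicates that no marker is present in the given repository.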
Note
Confirmed to work with git-annex version 8.20211123 onwards.
Todo
At the moment, only one format for repository deposition is supported (a ZIP archive of a working bare repository). However, this is not a good format for the purpose of long-term archiving, because it requires a functional Git installation to work with. It would be fairly doable to make the deposited format configurable, and support additional formats. An interesting one would be a fast-export stream, basically a plain text serialization of an entire repository.
recognize that a different repo is being pushed over an existing one at the remote
think about adding additional information into the header of refs; maybe give it some kind of stamp that also makes it easier to validate by the XDLRA backend
think about preventing duplication between the repo and its local mirror; could they safely share git objects? If so, in which direction?