datalad_next.annexremotes.uncurl
uncurl git-annex external special remote
This implementation is a git-annex accessible interface to datalad-next's URL operations framework. It serves two main purposes:
Combine git-annex's capabilities of registering and accessing file content via URLs with DataLad's access credential management and (additional or alternative) transport protocol implementations.
Minimize the maintenance effort for datasets (primarily) composed of content that is remotely accessible via URLs from systems other than DataLad or git-annex in the event of an infrastructure transition (e.g. moving to a different technical system or a different data organization on a storage system).
Requirements
This special remote implementation requires git-annex version 8.20210127 (or later) to be available.
Download helper
The simplest way to use this remote is to initialize it without any particular configuration:
$ git annex initremote uncurl type=external externaltype=uncurl encryption=none
initremote uncurl ok
(recording state in git...)
Once initialized, or later enabled in a clone, git-annex addurl will check
with the uncurl remote whether it can handle a particular URL, and will let
the remote perform the download in case of positive response. By default, the
remote will claim any URLs with a scheme that the local datalad-next
installation supports. This always includes file://, http://, and https://,
but is extensible, and a particular installation may also support ssh://
(by default when openssh is installed), or other schemes.
This additional URL support is also available for other commands. Here is an
example of how datalad addurls can be given any uncurl-supported URLs
(here an SSH-URL) directly, provided that the uncurl remote was initialized
for a dataset (as shown above):
$ echo '[{"url":"ssh://my.server.org/home/me/file", "file":"dummy"}]' \
| datalad addurls - '{url}' '{file}'
This makes legacy commands (e.g., datalad download-url) unnecessary, and
facilitates the use of more advanced datalad addurls features (e.g.,
automatic creation of subdatasets) that are not provided by lower-level
commands like git annex addurl.
Download helper with credential management support
With this setup, download requests also use DataLad's credential system for authentication. DataLad will automatically look up matching credentials, prompt for manual entry if none are found, and, once the entered credentials have been used successfully, offer to store them securely for later use:
$ git annex addurl http://httpbin.org/basic-auth/myuser/mypassword
Credential needed for access to http://httpbin.org/basic-auth/myuser/mypassword
user: myuser
password:
password (repeat):
Enter a name to save the credential
(for accessing http://httpbin.org/basic-auth/myuser/mypassword) securely for future
reuse, or 'skip' to not save the credential
name: httpbin-dummy
addurl http://httpbin.org/basic-auth/myuser/mypassword (from uncurl) (to ...)
ok
(recording state in git...)
By adding files via downloads from URLs in this fashion, datasets can be built that track information across a range of locations/services, using a possibly heterogeneous set of access methods.
This feature is very similar to the datalad special remote implementation
included in the core DataLad package. The difference here is that alternative
implementations of downloaders are employed and the datalad-next credential
system is used instead of the "providers" mechanism from DataLad's core
package.
Transforming recorded URLs
The main benefit of using uncurl is, however, only revealed when the original snapshot of where data used to be accessible becomes invalid, maybe because data were moved to a different storage system, or simply a different host.
This would typically require an update of each, now broken, access URL. For datasets with thousands or even millions of files this can be an expensive operation. For data portal operators providing a large number of datasets it is even more tedious.
uncurl enables programmatic, on-access URL rewriting. This is similar, in
spirit, to Git's url.<base>.insteadOf URL modification feature. However,
modification possibilities reach substantially beyond replacing a base URL.
This feature is based on two customizable settings: 1) a URL template; and 2) a set of match expressions that extract additional identifiers from any recorded access URL for an annex key.
Here is an example: Let's say a file in a dataset has a recorded access URL of:
https://data.example.org/c542/s7612_figure1.pdf
We can let uncurl know that c542 is actually an identifier for a
particular collection of items in this data store. Likewise s7612 is an
identifier of a particular item in that collection, and figure1.pdf is the
name of a component in that collection item. The following Python regular
expression can be used to "decompose" the above URL into these semantic
components:
(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/(?P<item>s[^/]+)_(?P<component>.*)$
This expression is not the most readable, but it basically chunks the URL
into segments of (?P<name>...), so-called named groups.
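To make the decomposition concrete, here is a minimal sketch that applies the expression above to the example URL using Python's standard re module. It only illustrates what the named groups capture; it is not the code path uncurl itself runs.

```python
import re

# Match expression from the text above, unchanged (split over two lines).
MATCH = re.compile(
    r"(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/"
    r"(?P<item>s[^/]+)_(?P<component>.*)$"
)

url = "https://data.example.org/c542/s7612_figure1.pdf"

# groupdict() maps each named group to the URL segment it captured.
groups = MATCH.match(url).groupdict()
for name, value in groups.items():
    print(f"{name} = {value}")
# site = https://data.example.org
# collection = c542
# item = s7612
# component = figure1.pdf
```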
This expression, and additional ones like it, can be set as a configuration
parameter of an uncurl remote setup. Extending the configuration established
by the initremote call above:
$ git annex enableremote uncurl \
'match=(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/(?P<item>s[^/]+)_(?P<component>.*)$'
The last argument is quoted to prevent it from being processed by the shell.
With the match expression configured, URL rewriting can be enabled by declaring
a URL template as another configuration item. The URL template uses the Python
Format String Syntax. If the new URL for the file above is now
http://newsite.net/ex-archive/c542_s7612_figure1.pdf, we can declare
the following URL template to have uncurl go to the new site:
http://newsite.net/ex-archive/{collection}_{item}_{component}
This template references the identifiers of the named groups we defined in the
match expression. Again, the URL template can be set via git annex
enableremote:
$ git annex enableremote uncurl \
'url=http://newsite.net/ex-archive/{collection}_{item}_{component}'
There is no need to separate the enableremote calls. Both configuration items
can be given at the same time. In fact, they can also be given to initremote
immediately.
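The interplay of match expression and URL template can be sketched in a few lines of plain Python. The rewrite() helper below is hypothetical and only mirrors the mechanism described above (named groups fed into a format string); returning the input unchanged when nothing matches is an assumption of this sketch, not a statement about uncurl's behavior.

```python
import re

# Match expression and URL template taken from the examples above.
MATCH = (
    r"(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/"
    r"(?P<item>s[^/]+)_(?P<component>.*)$"
)
TEMPLATE = "http://newsite.net/ex-archive/{collection}_{item}_{component}"


def rewrite(url, match=MATCH, template=TEMPLATE):
    """Hypothetical helper: instantiate the template from a recorded URL."""
    m = re.match(match, url)
    if m is None:
        # Assumption of this sketch: leave non-matching URLs untouched.
        return url
    # Unused groups (here: site) are simply ignored by str.format().
    return template.format(**m.groupdict())


rewritten = rewrite("https://data.example.org/c542/s7612_figure1.pdf")
print(rewritten)
# http://newsite.net/ex-archive/c542_s7612_figure1.pdf
```

Note that the site group is captured but not referenced by the template, which is how the old host name gets dropped from the rewritten URL.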
The four identifiers site, collection, item, and component are
actually a custom addition to a standard set of identifiers that are available
for composing URLs via a template:
datalad_dsid - the DataLad dataset ID (UUID)
annex_dirhash - "mixed" variant of the two level hash for a particular key (uses POSIX directory separators, and includes a trailing separator)
annex_dirhash_lower - "lower case" variant of the two level hash for a particular key (uses POSIX directory separators, and includes a trailing separator)
annex_key - git-annex key name for a request
annex_remoteuuid - UUID of the special remote (location) used by git-annex
git_remotename - Name of the Git remote for the uncurl special remote
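As an illustration of how these standard identifiers might be combined, the sketch below fills a template using plain str.format(). All values, as well as the file:///tmp/store location, are made-up placeholders (real values are supplied by git-annex and DataLad at access time). The point to notice is that the dirhash identifiers already end in a separator, so the template puts no extra slash before {annex_key}.

```python
# Placeholder values for illustration only; none of them are real.
identifiers = {
    "datalad_dsid": "00000000-0000-0000-0000-000000000000",
    "annex_dirhash_lower": "a1b/2c3/",  # note the trailing separator
    "annex_key": "MD5E-s1024--0123456789abcdef0123456789abcdef.pdf",
}

# Hypothetical template: no "/" between the dirhash and the key.
template = "file:///tmp/store/{datalad_dsid}/{annex_dirhash_lower}{annex_key}"

url = template.format(**identifiers)
print(url)
```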
Note
The URL template must "resolve" to a complete and valid URL. This cannot be verified at configuration time, because even the URL scheme could be a dynamic setting.
Uploading content
The uncurl special remote can upload file content or store annex keys
via supported URL schemes whenever a URL template is defined. At minimum,
storing at file:// and ssh:// URLs is supported. But other URL
scheme handlers with upload support may be available at a local DataLad
installation.
Deleting content
As for uploading, deleting content is only permitted with a configured URL template. Moreover, it also depends on the delete operation being supported for a particular URL scheme.
Configuration overrides
Both match expressions and the URL template can also be configured in a dataset's configuration (committed branch configuration), or in any Git configuration scope (local, global, system), using the following configuration item names:
remote.<remotename>.uncurl-url
remote.<remotename>.uncurl-match
where <remotename> is the name of the special remote in the dataset.
A URL template provided via configuration overrides one defined in the special
remote setup via init/enableremote.
Match expressions defined as configuration items extend the set of match
expressions that may be included in the special remote setup via
init/enableremote. The remote.<remotename>.uncurl-match configuration
item can be set as often as necessary (with one match expression each).
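The precedence rules just described can be modeled in a few lines. This is only a sketch of the stated semantics (configuration overrides the template, configuration extends the match expressions); effective_settings() and the dictionary layout are hypothetical, and the ordering of the combined expressions is an assumption, not something taken from the datalad-next implementation.

```python
def effective_settings(remote_setup, git_config):
    """Hypothetical model of how the two configuration sources combine.

    Both arguments are dicts with an optional "url" string and an
    optional "match" list of regular expressions.
    """
    # A URL template from the configuration overrides the remote setup.
    url = git_config.get("url") or remote_setup.get("url")
    # Match expressions from the configuration extend those of the setup.
    # Assumption: setup expressions come first; the source does not say.
    matches = list(remote_setup.get("match", [])) + list(git_config.get("match", []))
    return url, matches


remote_setup = {
    "url": "http://newsite.net/ex-archive/{collection}_{item}_{component}",
    "match": [r"(?P<collection>c[^/]+)/(?P<item>s[^/]+)_(?P<component>.*)$"],
}
git_config = {
    "url": "file:///mnt/mirror/{collection}/{item}/{component}",
    "match": [r"(?P<site>https://[^/]+)/"],
}

url, matches = effective_settings(remote_setup, git_config)
print(url)           # the template from the configuration wins
print(len(matches))  # 2: both sources contribute match expressions
```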
Tips
When multiple match expressions are defined, it is recommended to use unique names for each match-group to avoid collisions.
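The following sketch shows why shared group names can be a problem. It assumes, purely for illustration, that the groups captured by all matching expressions are merged into a single dictionary before the template is filled; the merge_groups() helper is hypothetical and the actual merging strategy of uncurl may differ.

```python
import re


def merge_groups(url, expressions):
    """Hypothetical helper: merge named groups from all matching expressions."""
    merged = {}
    for expr in expressions:
        m = re.match(expr, url)
        if m:
            # With a plain dict update, a later group of the same name
            # silently replaces an earlier one.
            merged.update(m.groupdict())
    return merged


url = "https://data.example.org/c542/s7612_figure1.pdf"

# Both expressions name their group "id": one value is lost.
colliding = merge_groups(url, [
    r"https://[^/]+/(?P<id>c[^/]+)/",
    r".*/(?P<id>s[^/_]+)_",
])
print(colliding)  # {'id': 's7612'}

# Unique names keep both values available to the URL template.
unique = merge_groups(url, [
    r"https://[^/]+/(?P<collection>c[^/]+)/",
    r".*/(?P<item>s[^/_]+)_",
])
print(unique)  # {'collection': 'c542', 'item': 's7612'}
```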