datalad_next.annexremotes.uncurl

uncurl git-annex external special remote

This implementation is a git-annex accessible interface to datalad-next's URL operations framework. It serves two main purposes:

  1. Combine git-annex's capabilities of registering and accessing file content via URLs with DataLad's access credential management and (additional or alternative) transport protocol implementations.

  2. Minimize the maintenance effort for datasets (primarily) composed from content that is remotely accessible via URLs from systems other than DataLad or git-annex in the event of an infrastructure transition (e.g. moving to a different technical system or a different data organization on a storage system).

Requirements

This special remote implementation requires git-annex version 8.20210127 (or later) to be available.

Download helper

The simplest way to use this remote is to initialize it without any particular configuration:

$ git annex initremote uncurl type=external externaltype=uncurl encryption=none
initremote uncurl ok
(recording state in git...)

Once initialized, or later enabled in a clone, git-annex addurl will check with the uncurl remote whether it can handle a particular URL, and will let the remote perform the download in case of a positive response. By default, the remote will claim any URL with a scheme that the local datalad-next installation supports. This always includes file://, http://, and https://, but is extensible, and a particular installation may also support ssh:// (by default when openssh is installed), or other schemes.

This additional URL support is also available for other commands. Here is an example of how datalad addurls can be given any uncurl-supported URLs (here an SSH URL) directly, provided that the uncurl remote was initialized for a dataset (as shown above):

$ echo '[{"url":"ssh://my.server.org/home/me/file", "file":"dummy"}]' \
    | datalad addurls - '{url}' '{file}'

This makes legacy commands (e.g., datalad download-url) unnecessary, and facilitates the use of more advanced datalad addurls features (e.g., automatic creation of subdatasets) that are not provided by lower-level commands like git annex addurl.

Download helper with credential management support

With this setup, download requests now also use DataLad's credential system for authentication. DataLad will automatically look up matching credentials, prompt for manual entry if none are found, and offer to store them securely for later use after having used them successfully:

$ git annex addurl http://httpbin.org/basic-auth/myuser/mypassword
Credential needed for access to http://httpbin.org/basic-auth/myuser/mypassword
user: myuser
password:
password (repeat):
Enter a name to save the credential
(for accessing http://httpbin.org/basic-auth/myuser/mypassword) securely for future
reuse, or 'skip' to not save the credential
name: httpbin-dummy

addurl http://httpbin.org/basic-auth/myuser/mypassword (from uncurl) (to ...)
ok
(recording state in git...)

By adding files via downloads from URLs in this fashion, datasets can be built that track information across a range of locations/services, using a possibly heterogeneous set of access methods.

This feature is very similar to the datalad special remote implementation included in the core DataLad package. The difference here is that alternative implementations of downloaders are employed and the datalad-next credential system is used instead of the "providers" mechanism from DataLad's core package.

Transforming recorded URLs

The main benefit of using uncurl is, however, only revealed when the original snapshot of where data used to be accessible becomes invalid, maybe because data were moved to a different storage system, or simply a different host.

This would typically require an update of each, now broken, access URL. For datasets with thousands or even millions of files this can be an expensive operation. For data portal operators providing a large number of datasets it is even more tedious.

uncurl enables programmatic, on-access URL rewriting. This is similar, in spirit, to Git's url.<base>.insteadOf URL modification feature. However, modification possibilities reach substantially beyond replacing a base URL.

This feature is based on two customizable settings: 1) a URL template; and 2) a set of match expressions that extract additional identifiers from any recorded access URL for an annex key.

Here is an example: Let's say a file in a dataset has a recorded access URL of:

https://data.example.org/c542/s7612_figure1.pdf

We can let uncurl know that c542 is actually an identifier for a particular collection of items in this data store. Likewise s7612 is an identifier of a particular item in that collection, and figure1.pdf is the name of a component in that collection item. The following Python regular expression can be used to "decompose" the above URL into these semantic components:

(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/(?P<item>s[^/]+)_(?P<component>.*)$

This expression is not the most readable, but it basically chunks the URL into segments of (?P<name>...), so-called named groups.
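As a quick check, the expression can be tried with Python's re module against the example URL. This is only an illustration of how the named groups capture the URL components; the variable names are made up for this sketch and are not part of uncurl.

```python
import re

# Match expression from the text above, split over two lines for readability
MATCH_EXPR = (
    r'(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/'
    r'(?P<item>s[^/]+)_(?P<component>.*)$'
)

url = 'https://data.example.org/c542/s7612_figure1.pdf'
match = re.match(MATCH_EXPR, url)

# groupdict() maps each named group to the text it captured
props = match.groupdict()
for name, value in props.items():
    print(f'{name}: {value}')
```

Each key of the resulting dictionary is an identifier that can later be referenced from a URL template.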

This expression, and additional ones like it, can be set as a configuration parameter of an uncurl remote setup. Extending the configuration established by the initremote call above:

$ git annex enableremote uncurl \
    'match=(?P<site>https://[^/]+)/(?P<collection>c[^/]+)/(?P<item>s[^/]+)_(?P<component>.*)$'

The last argument is quoted to prevent it from being processed by the shell.

With the match expression configured, URL rewriting can be enabled by declaring a URL template as another configuration item. The URL template uses the Python Format String Syntax. If the new URL for the file above is now http://newsite.net/ex-archive/c542_s7612_figure1.pdf, we can declare the following URL template to have uncurl go to the new site:

http://newsite.net/ex-archive/{collection}_{item}_{component}

This template references the identifiers of the named groups we defined in the match expression. Again, the URL template can be set via git annex enableremote:

$ git annex enableremote uncurl \
    'url=http://newsite.net/ex-archive/{collection}_{item}_{component}'

There is no need to separate the enableremote calls. Both configuration items can be given at the same time. In fact, they can also be given to initremote directly.
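The effect of the template can be reproduced with plain Python string formatting, since the template follows the Format String Syntax. The following is a minimal sketch, assuming the identifiers have already been extracted from the old URL; it does not call any uncurl code.

```python
# URL template from the text above
URL_TEMPLATE = 'http://newsite.net/ex-archive/{collection}_{item}_{component}'

# Identifiers as the match expression would extract them from the old URL
props = {
    'site': 'https://data.example.org',
    'collection': 'c542',
    'item': 's7612',
    'component': 'figure1.pdf',
}

# str.format() ignores unused keyword arguments, so 'site' may stay in props
new_url = URL_TEMPLATE.format(**props)
print(new_url)
```

Because unused identifiers are ignored, a template only needs to reference the groups that matter for the new location.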

The four identifiers site, collection, item, and component are custom additions to a standard set of identifiers that are available for composing URLs via a template:

  • datalad_dsid - the DataLad dataset ID (UUID)

  • annex_dirhash - "mixed" variant of the two level hash for a particular key (uses POSIX directory separators, and includes a trailing separator)

  • annex_dirhash_lower - "lower case" variant of the two level hash for a particular key (uses POSIX directory separators, and includes a trailing separator)

  • annex_key - git-annex key name for a request

  • annex_remoteuuid - UUID of the special remote (location) used by git-annex

  • git_remotename - Name of the Git remote for the uncurl special remote

Note

The URL template must "resolve" to a complete and valid URL. This cannot be verified at configuration time, because even the URL scheme could be a dynamic setting.
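A template may also be composed from the standard identifiers alone. The sketch below uses a made-up host and placeholder values (not a real hash or key) to show that, because the dirhash identifiers end with a separator, no extra "/" is needed before {annex_key}. It also checks that the result parses as a URL with a scheme, which is the kind of validation that can only happen once all values are known.

```python
from urllib.parse import urlsplit

# Hypothetical template; host and path are invented for this sketch
URL_TEMPLATE = 'https://store.example.org/annex/{annex_dirhash_lower}{annex_key}'

# Placeholder values; in real use they are provided for each request
props = {
    'annex_dirhash_lower': 'abc/def/',  # note the trailing separator
    'annex_key': 'MD5E-s4--0123456789abcdef0123456789abcdef.txt',
}

url = URL_TEMPLATE.format(**props)
parts = urlsplit(url)
print(url)
print('scheme:', parts.scheme, 'host:', parts.netloc)
```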

Uploading content

The uncurl special remote can upload file content or store annex keys via supported URL schemes whenever a URL template is defined. At minimum, storing at file:// and ssh:// URLs is supported, but other URL scheme handlers with upload support may be available in a local DataLad installation.

Deleting content

As for uploading, deleting content is only permitted with a configured URL template. Moreover, it also depends on the delete operation being supported for a particular URL scheme.

Configuration overrides

Both match expressions and the URL template can also be configured in a dataset's configuration (committed branch configuration, or any Git configuration scope: local, global, or system) using the following configuration item names:

  • remote.<remotename>.uncurl-url

  • remote.<remotename>.uncurl-match

where <remotename> is the name of the special remote in the dataset.

A URL template provided via configuration overrides one defined in the special remote setup via init/enableremote.

Match expressions defined as configuration items extend the set of match expressions that may be included in the special remote setup via init/enableremote. The remote.<remotename>.uncurl-match configuration item can be set as often as necessary (with one match expression each).
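The two rules (override for the URL template, extension for match expressions) can be summarized in a few lines of Python. This is a sketch of the behavior described here, not the actual implementation; the function name, its arguments, and the ordering of the combined list are assumptions.

```python
def effective_settings(remote_url, remote_matches, config_url, config_matches):
    """Combine special remote setup with configuration items.

    remote_*: values given via init/enableremote
    config_*: values from remote.<remotename>.uncurl-url / uncurl-match
    """
    # A configured URL template replaces the one from the remote setup
    url = config_url if config_url is not None else remote_url
    # Configured match expressions are added to those from the remote setup
    matches = list(remote_matches) + list(config_matches)
    return url, matches


url, matches = effective_settings(
    'http://old.example.org/{item}', ['expr-a'],
    'http://new.example.org/{item}', ['expr-b', 'expr-c'],
)
print(url)
print(matches)
```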

Tips

When multiple match expressions are defined, it is recommended to use unique names for each match-group to avoid collisions.
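To see why unique names matter, consider the following sketch. It assumes, for illustration only, that groups captured by several expressions end up in one shared dictionary; the expressions, URLs, and merging loop are hypothetical and may differ from what uncurl does internally.

```python
import re

# Two hypothetical expressions that both use the group name 'id'
exprs = [
    r'https://a\.example\.org/(?P<id>[^/]+)$',
    r'https://b\.example\.org/files/(?P<id>[^/]+)$',
]
# Two recorded URLs for the same key, each matched by one expression
urls = ['https://a.example.org/x1', 'https://b.example.org/files/y2']

props = {}
for url in urls:
    for expr in exprs:
        m = re.match(expr, url)
        if m:
            # Same group name: a later match replaces the earlier value
            props.update(m.groupdict())

print(props)
```

Renaming the groups, for example to a_id and b_id, would keep both values available to a template.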

class datalad_next.annexremotes.uncurl.UncurlRemote(annex: Master)

Bases: SpecialRemote

checkpresent(key: str) -> bool

Requests the remote to check if a key is present in it.

Parameters:

key (str) --

Returns:

True if the key is present in the remote. False if the key is not present.

Return type:

bool

Raises:

RemoteError -- If the presence of the key couldn't be determined, e.g., in case of a connection error.

checkurl(url: str) -> bool

When running git-annex addurl, this is called after CLAIMURL indicated that we could handle a URL. It can return information on the URL target (e.g., size of the download, a target filename, or a sequence thereof with additional URLs pointing to individual components that would jointly make up the full download from the given URL). However, all of that is optional, and simply returning True is sufficient to make git-annex call TRANSFER RETRIEVE.

claimurl(url: str) -> bool

Checks whether this remote wants to handle a given URL.

If match expressions are configured, matches the URL against all known URL expressions, and returns True if there is any match, or False otherwise.

If no match expressions are configured, returns True if the URL scheme is supported, or False otherwise.
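The decision described for claimurl can be sketched as follows. This is a simplified model under stated assumptions: the scheme set only lists the schemes named as always available, matching is done with re.match, and would_claim is a hypothetical helper rather than a method of UncurlRemote.

```python
import re
from urllib.parse import urlsplit

# Assumption: only the always-supported schemes; installations may offer more
SUPPORTED_SCHEMES = {'file', 'http', 'https'}


def would_claim(url, match_exprs=()):
    """Return True if a URL would be claimed under the rules above."""
    if match_exprs:
        # With match expressions configured, any match is sufficient
        return any(re.match(expr, url) for expr in match_exprs)
    # Without match expressions, decide by URL scheme alone
    return urlsplit(url).scheme in SUPPORTED_SCHEMES


exprs = [r'https://data\.example\.org/']
print(would_claim('https://data.example.org/c542/s7612_figure1.pdf', exprs))
print(would_claim('https://other.example.org/file', exprs))
print(would_claim('https://other.example.org/file'))
```

Note that, in this model, configuring match expressions narrows the set of claimed URLs: a URL with a supported scheme is no longer claimed unless one of the expressions matches it.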

extract_tmpl_props(tmpl: str, *, urls: list[str] | None = None, key: str | None = None) -> dict[str, str]

get_key_urls(key: str) -> list[str]

get_mangled_url(fallback_url: str | None, tmpl: str, tmpl_props: dict[str, str]) -> str | None

initremote() -> None

Gets called when git annex initremote or git annex enableremote is run. This is where any one-time setup tasks can be done, for example creating the remote folder. Note: This may be run repeatedly over time, as a remote is initialized in different repositories, or as the configuration of a remote is changed. So any one-time setup tasks should be done idempotently.

Raises:

RemoteError -- If the remote could not be initialized.

is_recognized_url(url: str) -> bool

prepare() -> None

Tells the remote that it's time to prepare itself to be used. Gets called whenever git annex is about to access any of the below methods, so it shouldn't be too expensive. Otherwise it will slow down operations like git annex whereis or git annex info.

Internet connection can be established here, though it's recommended to defer this until it's actually needed.

Raises:

RemoteError -- If the remote could not be prepared.

remove(key: str) -> None

Requests the remote to remove a key's contents.

Parameters:

key (str) --

Raises:

RemoteError -- If the key couldn't be deleted from the remote.

transfer_retrieve(key: str, filename: str) -> None

Get the file identified by key from the remote and store it in filename.

While the transfer is running, the remote can repeatedly call annex.progress(size) to indicate the number of bytes already stored. This will influence the progress shown to the user.

Parameters:
  • key (str) -- The Key to get from the remote.

  • filename (str) -- Path where to store the file. Note that in some cases, filename may contain whitespace.

Raises:

RemoteError -- If the file could not be received from the remote.

transfer_store(key: str, filename: str) -> None

Store the file in filename to a unique location derived from key.

It's important that, while a Key is being stored, checkpresent(key) not indicate it's present until all the data has been transferred. While the transfer is running, the remote can repeatedly call annex.progress(size) to indicate the number of bytes already stored. This will influence the progress shown to the user.

Parameters:
  • key (str) -- The Key to be stored in the remote. In most cases, this is going to be the remote file name. It should at least be unambiguously derived from it.

  • filename (str) -- Path to the file to upload. Note that in some cases, filename may contain whitespace. Note that filename should not influence the filename used on the remote.

Raises:

RemoteError -- If the file could not be stored to the remote.
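One common way to satisfy the requirement that a key must not appear present before the transfer is complete is to write to a temporary file and rename it at the end. The sketch below shows this for a local directory target. It is not how uncurl stores content; store_atomically and its parameters are hypothetical, and the progress callback merely stands in for annex.progress.

```python
import os
import tempfile


def store_atomically(src_path, dest_path, progress=None, chunk_size=65536):
    """Copy src_path to dest_path so that dest_path only appears when complete."""
    dest_dir = os.path.dirname(dest_path) or '.'
    os.makedirs(dest_dir, exist_ok=True)
    # Temporary file in the same directory, so the final rename is atomic
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix='.tmp-')
    stored = 0
    try:
        with os.fdopen(fd, 'wb') as dst, open(src_path, 'rb') as src:
            while chunk := src.read(chunk_size):
                dst.write(chunk)
                stored += len(chunk)
                if progress is not None:
                    # Report the total number of bytes stored so far
                    progress(stored)
        # Only now does the key become visible under its final name
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Do not leave partial content behind on failure
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A presence check that only tests for the final path would then report False for the whole duration of the copy.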

datalad_next.annexremotes.uncurl.main()

Command-line entry point.