datalad_next.annexremotes.archivist

git-annex special remote archivist for obtaining files from archives

class datalad_next.annexremotes.archivist.ArchivistRemote(annex)

Bases: SpecialRemote

git-annex special remote archivist for obtaining files from archives

Successor of the datalad-archives special remote. It claims and acts on particular archive locator "URLs", registered for individual annex keys (see datalad_next.types.archivist.ArchivistLocator). These locators identify another annex key that represents an archive (e.g., a tarball or a zip file) that contains the respective annex key as a member. This special remote triggers the extraction of such members from any candidate archive when retrieval of a key is requested.

This special remote cannot store or remove content. The desired usage is to register a locator "URL" for any relevant key via git annex addurl|registerurl or datalad addurls.
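As a minimal sketch of this workflow, the following Python snippet assembles a locator string and the matching git annex registerurl command line without executing it. The dl+archive:<archive-key>#path=<member>&size=<bytes> layout is an assumption about the locator syntax (datalad_next.types.archivist.ArchivistLocator holds the authoritative definition). The two keys and the helper names make_locator and registerurl_cmd are hypothetical.

```python
def make_locator(archive_key: str, member_path: str, size: int) -> str:
    """Compose a locator string for one archive member.

    Assumption: the archive key follows the scheme prefix, and the member
    path and size are carried as properties in the URL fragment.
    """
    return f"dl+archive:{archive_key}#path={member_path}&size={size}"


def registerurl_cmd(member_key: str, locator: str) -> list:
    """Build the argument list for `git annex registerurl <key> <url>`."""
    return ["git", "annex", "registerurl", member_key, locator]


# Hypothetical annex keys, for illustration only
archive_key = "MD5E-s1024--0123456789abcdef0123456789abcdef.tar"
member_key = "MD5E-s12--fedcba9876543210fedcba9876543210.txt"

locator = make_locator(archive_key, "docs/readme.txt", 12)
cmd = registerurl_cmd(member_key, locator)
print(" ".join(cmd))
```

The argument list could be handed to subprocess.run() inside a dataset. It is only printed here so that the sketch has no side effects.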

Configuration

The behavior of this special remote can be tuned via a number of configuration settings.

datalad.archivist.legacy-mode=yes|[no]

If enabled, all special remote operations fall back to the legacy datalad-archives special remote implementation. This mode is only provided for backward compatibility. The legacy implementation unconditionally downloads archive files completely, and keeps an internal cache of the full extracted archive around. The implied 200% (or more) storage cost overhead for obtaining a complete dataset can be prohibitive for datasets tracking large amounts of data (in archive files).

Implementation details

CHECKPRESENT

When performing a non-download test for the (continued) presence of an annex key (as triggered via git annex fsck --fast or git annex checkpresentkey), the underlying archive containing a key will NOT be inspected. Instead, only the continued availability of the annex key for the containing archive will be tested. In other words: this implementation trusts the archive member annotation to be correct/valid, and it also trusts the archive content to be unchanged. The latter will generally be the case, but may not be with URL-style keys.

Not implementing such a trust-approach would have a number of consequences. Depending on where the archive is located (local/remote) and what format it is (fsspec-inspectable or not), we would need to download it completely in order to verify a matching archive member. Moreover, an archive might also reference another archive as a source, leading to a multiplication of transfer demands.
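The trust-based check described above can be sketched in a few lines of Python. This is an illustration, not the actual implementation: the function name checkpresent_sketch, the archive_key_available callback, and the example keys are all hypothetical. The point is that only the availability of the archive keys is queried, and no archive is ever opened.

```python
from typing import Callable, Iterable


def checkpresent_sketch(
    archive_keys: Iterable[str],
    archive_key_available: Callable[[str], bool],
) -> bool:
    """Report presence of a member key based on its archives alone.

    `archive_keys` are the archive keys named by the locators registered
    for the member key. `archive_key_available` is a hypothetical callback
    that answers whether an archive key can still be obtained.
    """
    # The member counts as present if at least one referenced archive is
    # still available; archive content is never inspected.
    return any(archive_key_available(key) for key in archive_keys)


# Hypothetical availability information for illustration
available = {"ARCHIVE-KEY-A"}
present = checkpresent_sketch(
    ["ARCHIVE-KEY-B", "ARCHIVE-KEY-A"], available.__contains__
)
print(present)
```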

__getattribute__(name: str)

Redirect top-level API calls to legacy implementation, if needed
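The general redirection pattern can be sketched as follows. This is a simplified illustration under stated assumptions, not the class's real code: the class names LegacyBackend and RedirectingRemote, the _legacy attribute, and the set of redirected method names are all hypothetical.

```python
class LegacyBackend:
    """Stand-in for a legacy implementation (hypothetical)."""

    def checkpresent(self, key: str) -> str:
        return f"legacy:{key}"


class RedirectingRemote:
    # Hypothetical selection of top-level API calls that get redirected
    _redirected = frozenset({"checkpresent"})
    # No legacy backend by default
    _legacy = None

    def __init__(self, legacy=None):
        self._legacy = legacy

    def __getattribute__(self, name: str):
        # Internal lookups go through object.__getattribute__ to avoid
        # infinite recursion into this method.
        get = object.__getattribute__
        legacy = get(self, "_legacy")
        if legacy is not None and name in get(self, "_redirected"):
            # Hand out the legacy implementation's attribute instead
            return getattr(legacy, name)
        return get(self, name)

    def checkpresent(self, key: str) -> str:
        return f"new:{key}"


print(RedirectingRemote().checkpresent("k"))
print(RedirectingRemote(LegacyBackend()).checkpresent("k"))
```

Intercepting attribute access in one place keeps the individual methods free of mode checks, at the cost of a small lookup overhead on every attribute access.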

checkpresent(key: str) -> bool

Verifies continued availability of the archive referenced by the key

No content verification of the archive, or of the particular archive member, is performed. See "Implementation details" of this class for the rationale.

Returns:

True if the referenced archive key is present on any remote. False if not.

Return type:

bool

checkurl(url: str) -> bool

Parses ArchivistLocator-style URLs

Returns True for any syntactically correct URL with all required properties.

The implementation is identical to claimurl().

claimurl(url: str) -> bool

Returns True for ArchivistLocator-style URLs

Only a lexical check is performed. For any other URL, False is returned.
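To illustrate what a purely lexical check might look like, here is a small sketch that inspects the shape of a URL without touching any archive. The dl+archive: prefix, the path and size properties, and the function name looks_like_locator are assumptions made for this example. The authoritative syntax is defined by datalad_next.types.archivist.ArchivistLocator.

```python
import re

# Assumed layout: scheme prefix, archive key, then '#'-separated properties
_LOCATOR = re.compile(r"^dl\+archive:(?P<key>[^#]+)#(?P<props>.+)$")


def looks_like_locator(url: str) -> bool:
    """Return True if `url` has the assumed locator shape.

    Only the string is examined; nothing is downloaded or opened.
    """
    match = _LOCATOR.match(url)
    if match is None:
        return False
    # Split 'name=value' pairs; items without '=' are ignored
    props = dict(
        item.split("=", 1)
        for item in match.group("props").split("&")
        if "=" in item
    )
    # Assumption: a member path and a numeric size are required
    return "path" in props and props.get("size", "").isdigit()


print(looks_like_locator("dl+archive:SOME-KEY#path=dir/file.txt&size=12"))
print(looks_like_locator("https://example.com/archive.tar"))
```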

initremote()

This method does nothing, because the special remote requires no particular setup.

prepare()

Prepare the special remote for requests by git-annex

If the special remote is instructed to run in "legacy mode", all subsequent operations will be processed by the datalad-archives special remote implementation!

remove(key: str)

Raises UnsupportedRequest. This operation is not supported.

transfer_retrieve(key: str, localfilename: str)

Retrieve an archive member from a (remote) archive

All registered locators for a requested key will be sorted by availability and size of the referenced archives. For each archive, the most suitable handler will be initialized, and extraction of the identified member will be attempted. If that fails, the next handler is tried until all candidate handlers are exhausted. Depending on the archive availability and type, archives may need to be retrieved from remote sources.
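The candidate ordering and fallback loop can be sketched like this. The concrete sort order (locally available archives first, then smaller archives) is an assumption, since the text only states that availability and size are considered. The Candidate class, order_candidates, retrieve_sketch, and the extract callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    archive_key: str
    local: bool  # is the archive already available locally?
    size: int    # archive size in bytes


def order_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    # Assumed priority: local archives first, then the smallest archive
    return sorted(candidates, key=lambda c: (not c.local, c.size))


def retrieve_sketch(
    candidates: Iterable[Candidate],
    extract: Callable[[Candidate], bool],
) -> Optional[str]:
    """Try each candidate in order and return the key of the first archive
    from which extraction succeeded, or None if all candidates failed."""
    for cand in order_candidates(candidates):
        try:
            if extract(cand):
                return cand.archive_key
        except Exception:
            # A failing handler must not prevent trying the next candidate
            continue
    return None


cands = [
    Candidate("REMOTE-SMALL", local=False, size=10),
    Candidate("LOCAL-BIG", local=True, size=1000),
    Candidate("LOCAL-SMALL", local=True, size=100),
]
# Hypothetical extractor that only succeeds for one archive
winner = retrieve_sketch(cands, lambda c: c.archive_key == "LOCAL-BIG")
print([c.archive_key for c in order_candidates(cands)], winner)
```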

transfer_store(key: str, filename: str)

Raises UnsupportedRequest. This operation is not supported.

datalad_next.annexremotes.archivist.main()

CLI entry point installed as git-annex-remote-archivist