datalad.api.create_sibling_ria

datalad.api.create_sibling_ria(url, name, *, dataset=None, storage_name=None, alias=None, post_update_hook=False, shared=None, group=None, storage_sibling=True, existing='error', new_store_ok=False, trust_level=None, recursive=False, recursion_limit=None, disable_storage__=None, push_url=None)

Creates a sibling to a dataset in a RIA store.

Communication with a dataset in a RIA store is implemented via two siblings: a regular Git remote (repository sibling) and a git-annex special remote for data transfer (storage sibling). The former has a publication dependency on the latter. By default, the name of the storage sibling is derived from the repository sibling’s name by appending “-storage”.

The store’s base path is expected to not exist, be an empty directory, or a valid RIA store.

Notes

RIA URL format

Interactions with new or existing RIA stores require RIA URLs to identify the store or specific datasets inside of it.

The general structure of a RIA URL pointing to a store takes the form ria+[scheme]://<storelocation> (e.g., ria+ssh://[user@]hostname:/absolute/path/to/ria-store, or ria+file:///absolute/path/to/ria-store).

The general structure of a RIA URL pointing to a dataset in a store (for example for cloning) takes a similar form, but appends either the dataset’s UUID or a “~” symbol followed by the dataset’s alias name: ria+[scheme]://<storelocation>#<dataset-UUID> or ria+[scheme]://<storelocation>#~<aliasname>. In addition, specific version identifiers can be appended to the URL with an additional “@” symbol: ria+[scheme]://<storelocation>#<dataset-UUID>@<dataset-version>, where dataset-version refers to a branch or tag.
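The URL forms above can be assembled mechanically. The following is a minimal sketch, not part of DataLad: the helper name ria_url and its parameters are hypothetical, and combining an alias with a version identifier is an assumption, since only the UUID form with “@” is shown above.

```python
from typing import Optional


def ria_url(scheme: str, store_location: str,
            dataset_id: Optional[str] = None,
            alias: Optional[str] = None,
            version: Optional[str] = None) -> str:
    """Assemble a RIA URL from its parts (illustrative helper, not a DataLad API)."""
    if dataset_id and alias:
        raise ValueError("give either dataset_id or alias, not both")
    # Store URL: ria+[scheme]://<storelocation>
    url = f"ria+{scheme}://{store_location}"
    if dataset_id:
        # Dataset addressed by its UUID
        url += f"#{dataset_id}"
    elif alias:
        # Dataset addressed by its alias, marked with "~"
        url += f"#~{alias}"
    if version:
        if not (dataset_id or alias):
            raise ValueError("a version identifier requires a dataset identifier")
        # Branch or tag; use with an alias is an assumption of this sketch
        url += f"@{version}"
    return url


print(ria_url("file", "/absolute/path/to/ria-store"))
print(ria_url("ssh", "user@hostname:/absolute/path/to/ria-store", alias="myds"))
```

For a file-based store the store location is the absolute path, so the resulting URL carries three slashes after the scheme.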

RIA store layout

A RIA store is a directory tree with a dedicated subdirectory for each dataset in the store. The subdirectory name is constructed from the DataLad dataset ID, e.g. 124/68afe-59ec-11ea-93d7-f0d5bf7b5561, where the first three characters of the ID are used for an intermediate subdirectory in order to mitigate file system limitations for stores containing a large number of datasets.
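The mapping from dataset ID to subdirectory can be sketched in a few lines. The function name dataset_subdir is hypothetical; the ID used is the one implied by the example path above.

```python
from pathlib import PurePosixPath


def dataset_subdir(dataset_id: str) -> PurePosixPath:
    """Return the store-relative directory for a dataset ID (illustrative helper)."""
    if len(dataset_id) <= 3:
        raise ValueError("dataset ID too short to split")
    # First three characters form the intermediate directory,
    # the remainder names the dataset directory itself.
    return PurePosixPath(dataset_id[:3]) / dataset_id[3:]


print(dataset_subdir("12468afe-59ec-11ea-93d7-f0d5bf7b5561"))
# 124/68afe-59ec-11ea-93d7-f0d5bf7b5561
```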

By default, a dataset in a RIA store consists of two components: A Git repository (for all dataset contents stored in Git) and a storage sibling (for dataset content stored in git-annex).

It is possible to selectively disable either component via the storage_sibling parameter. If neither component is disabled, a dataset’s subdirectory layout in a RIA store contains a standard bare Git repository and an annex/ subdirectory inside of it. The latter holds a git-annex object store and comprises the storage sibling. Disabling the Git repository (storage_sibling='only') results in no bare Git repository being present; disabling the storage sibling (storage_sibling='off') results in no annex/ subdirectory being present.
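As a compact summary of the modes, the sketch below maps a storage_sibling value to the components one would expect in the dataset’s store directory. The function store_components and its dictionary keys are hypothetical; None is rejected here only because its meaning is not spelled out in this text.

```python
def store_components(storage_sibling=True):
    """Map a storage_sibling value to the expected store components (illustrative)."""
    if storage_sibling in (True, "on"):
        # Default: bare Git repository plus annex/ object store
        return {"git_repository": True, "annex_subdir": True}
    if storage_sibling in (False, "off"):
        # Storage sibling disabled: no annex/ subdirectory
        return {"git_repository": True, "annex_subdir": False}
    if storage_sibling == "only":
        # Storage sibling only: no bare Git repository
        return {"git_repository": False, "annex_subdir": True}
    raise ValueError(f"unsupported storage_sibling value: {storage_sibling!r}")


for mode in (True, "off", "only"):
    print(mode, store_components(mode))
```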

Optionally, there can be a further subdirectory archives with (compressed) 7z archives of annex objects. The storage remote is able to pull annex objects from these archives if it cannot find them in the regular annex object store. This feature can be useful for storing large collections of rarely changing data on systems that limit the number of files that can be stored.

Each dataset directory also contains a ria-layout-version file that identifies the data organization (as, for example, described above).

Lastly, there is a global ria-layout-version file at the store’s base path that identifies where dataset subdirectories themselves are located. At present, this file must contain a single line stating the version (currently “1”). This line MUST end with a newline character.

It is possible to define an alias for an individual dataset in a store by placing a symlink to the dataset location into an alias/ directory in the root of the store. This enables dataset access via URLs of format: ria+<protocol>://<storelocation>#~<aliasname>.
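To make the store-level pieces concrete, the following sketch builds an empty skeleton by hand: the global ria-layout-version file, one dataset directory, and an alias symlink. It only illustrates the layout; in practice the command creates these (see new_store_ok and alias). The function name make_store_skeleton is hypothetical, and using a relative symlink target is an assumption of this sketch.

```python
import os
import tempfile
from pathlib import Path


def make_store_skeleton(base: Path, dataset_id: str, alias: str) -> Path:
    """Create store-level files for illustration; return the alias symlink path."""
    base.mkdir(parents=True, exist_ok=True)
    # Global layout version: a single line that must end with a newline.
    (base / "ria-layout-version").write_text("1\n")
    # Dataset directory: first three ID characters form the intermediate level.
    ds_dir = base / dataset_id[:3] / dataset_id[3:]
    ds_dir.mkdir(parents=True, exist_ok=True)
    # Alias: a symlink inside alias/ that points at the dataset directory.
    alias_dir = base / "alias"
    alias_dir.mkdir(exist_ok=True)
    link = alias_dir / alias
    # Relative target (assumption), so the store stays relocatable.
    link.symlink_to(os.path.relpath(ds_dir, alias_dir))
    return link


with tempfile.TemporaryDirectory() as tmp:
    link = make_store_skeleton(
        Path(tmp) / "ria-store", "12468afe-59ec-11ea-93d7-f0d5bf7b5561", "myds"
    )
    print(link.name, "->", os.readlink(link))
```

With such an alias in place, the dataset would be addressable as ria+file://<storelocation>#~myds.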

Compared to standard git-annex object stores, the annex/ subdirectories used as storage siblings follow a different layout naming scheme (‘dirhashmixed’ instead of ‘dirhashlower’). This is mostly a technical detail, but it also serves as a reminder to git-annex power users not to run git-annex commands directly in-store, as this can cause severe damage due to the layout difference. Interactions should be handled via the ORA special remote instead.

Error logging

To enable error logging at the remote end, append a pipe symbol and an “l” to the version number in ria-layout-version (like so: 1|l\n).

Error logging will create files in an “error_log” directory whenever the git-annex special remote (storage sibling) raises an exception, storing the Python traceback of it. The logfiles are named according to the scheme <dataset id>.<annex uuid of the remote>.log showing “who” ran into this issue with which dataset. Because logging can potentially leak personal data (like local file paths for example), it can be disabled client-side by setting the configuration variable annex.ora-remote.<storage-sibling-name>.ignore-remote-config.
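The conventions in this section can be expressed as three small helpers. This is a sketch under stated assumptions: the function names are hypothetical, and DataLad’s own parsing of ria-layout-version may differ in detail.

```python
def parse_layout_version(content: str):
    """Split ria-layout-version content into (version, error_logging_enabled)."""
    line = content.rstrip("\n")
    # Everything after the pipe symbol is treated as flags; "l" enables logging.
    version, _, flags = line.partition("|")
    return version, "l" in flags


def error_log_name(dataset_id: str, annex_uuid: str) -> str:
    """Log file name: <dataset id>.<annex uuid of the remote>.log"""
    return f"{dataset_id}.{annex_uuid}.log"


def ignore_remote_config_key(storage_sibling_name: str) -> str:
    """Name of the client-side configuration variable mentioned above."""
    return f"annex.ora-remote.{storage_sibling_name}.ignore-remote-config"


print(parse_layout_version("1|l\n"))
print(parse_layout_version("1\n"))
print(ignore_remote_config_key("ria-backup-storage"))
```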

Parameters:
  • url (str or None) – URL identifying the target RIA store and access protocol. If push_url is given in addition, this is used for read access only. Otherwise it will be used for write access too and to create the repository sibling in the RIA store. Note that HTTP(S) is currently valid for consumption only, so push_url must be provided in that case.

  • name (str or None) – Name of the sibling. With recursive, the same name will be used to label all the subdatasets’ siblings.

  • dataset (Dataset or None, optional) – specify the dataset to process. If no dataset is given, an attempt is made to identify the dataset based on the current working directory. [Default: None]

  • storage_name (str or None, optional) – Name of the storage sibling (git-annex special remote). Must not be identical to the sibling name. If not specified, defaults to the sibling name plus ‘-storage’ suffix. If only a storage sibling is created, this setting is ignored, and the primary sibling name is used. [Default: None]

  • alias (str or None, optional) – Alias for the dataset in the RIA store. Add the necessary symlink so that this dataset can be cloned from the RIA store using the given ALIAS instead of its ID. With recursive=True, only the top dataset will be aliased. [Default: None]

  • post_update_hook (bool, optional) – Enable Git’s default post-update-hook for the created sibling. This is useful when the sibling is made accessible via a “dumb server” that requires running ‘git update-server-info’ to let Git interact properly with it. [Default: False]

  • shared (str or bool or None, optional) – If given, configures the permissions in the RIA store for multi-user access. Possible values for this option are identical to those of git init --shared and are described in its documentation. [Default: None]

  • group (str or None, optional) – Filesystem group for the repository. Specifying the group is crucial when shared='group'. [Default: None]

  • storage_sibling ({'only'} or bool or None, optional) – By default, an ORA storage sibling and a Git repository sibling are created (True|’on’). Alternatively, creation of the storage sibling can be disabled (False|’off’), or a storage sibling created only and no Git sibling (‘only’). In the latter mode, no Git installation is required on the target host. [Default: True]

  • existing ({'skip', 'error', 'reconfigure'}, optional) – Action to perform, if a (storage) sibling is already configured under the given name and/or a target already exists. In this case, a dataset can be skipped (‘skip’), an existing target repository be forcefully re-initialized, and the sibling (re-)configured (‘reconfigure’), or the command be instructed to fail (‘error’). [Default: ‘error’]

  • new_store_ok (bool, optional) – When set, a new store will be created, if necessary. Otherwise, a sibling will only be created if the url points to an existing RIA store. [Default: False]

  • trust_level ({'trust', 'semitrust', 'untrust', None}, optional) – specify a trust level for the storage sibling. If not specified, the default git-annex trust level is used. ‘trust’ should be used with care (see the git-annex-trust man page). [Default: None]

  • recursive (bool, optional) – if set, recurse into potential subdatasets. [Default: False]

  • recursion_limit (int or None, optional) – limit recursion into subdatasets to the given number of levels. [Default: None]

  • disable_storage (bool, optional) – This option is deprecated. Use storage_sibling='off' (‘--storage-sibling off’ on the command line) instead. [Default: None]

  • push_url (str or None, optional) – URL identifying the target RIA store and access protocol for write access to the storage sibling. If given this will also be used for creation of the repository sibling in the RIA store. [Default: None]

  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’: any failure is reported, but does not cause an exception; ‘continue’: if any failure occurs, an exception will be raised at the end, but processing other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. The raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]

  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]

  • result_renderer – select rendering mode for command results. ‘tailored’ enables a command-specific rendering style that is typically tailored to human consumption, if there is one for a specific command, or otherwise falls back on the ‘generic’ result renderer; ‘generic’ renders each result in one line with key info like action, status, path, and an optional message; ‘json’ a complete JSON line serialization of the full result record; ‘json_pp’ like ‘json’, but pretty-printed spanning multiple lines; ‘disabled’ turns off result rendering entirely; ‘<template>’ reports any value(s) of any result properties in any format indicated by the template (e.g. ‘{path}’, compare with JSON output for all key-value choices). The template syntax follows the Python “format() language”. It is possible to report individual dictionary values, e.g. ‘{metadata[name]}’. If a 2nd-level key contains a colon, e.g. ‘music:Genre’, ‘:’ must be substituted by ‘#’ in the template, like so: ‘{metadata[music#Genre]}’. [Default: ‘tailored’]

  • result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top-level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]

  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’ a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is returned in case of an empty list. [Default: ‘list’]
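A typical call combining several of the parameters above might look like the following sketch. The wrapper publish_to_new_store, the sibling name "ria-backup", and the choice of a local file-based store are all hypothetical; only the keyword arguments themselves are taken from the parameter list. DataLad is imported inside the function so the sketch can be loaded without it being installed.

```python
def publish_to_new_store(dataset_path, store_path, sibling_name="ria-backup", alias=None):
    """Create a RIA sibling for an existing dataset; return the failed results, if any."""
    # Lazy import: requires DataLad only when the function is actually called.
    import datalad.api as dl

    results = dl.create_sibling_ria(
        # store_path is assumed to be absolute, giving ria+file:///...
        f"ria+file://{store_path}",
        sibling_name,
        dataset=dl.Dataset(dataset_path),
        new_store_ok=True,           # create the store if it does not exist yet
        alias=alias,                 # optional alias for cloning by name
        existing="error",            # fail if the sibling or target already exists
        on_failure="ignore",         # report failures as results instead of raising
        result_renderer="disabled",  # no console output; inspect results instead
        return_type="list",
    )
    # A failure is any result with status 'impossible' or 'error'.
    return [r for r in results if r.get("status") in ("impossible", "error")]
```

With the default storage_sibling=True, this would configure two siblings on the dataset: "ria-backup" and, by the default naming rule, "ria-backup-storage".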