datalad.api.addurls

datalad.api.addurls(urlfile, urlformat, filenameformat, *, dataset=None, input_type='ext', exclude_autometa=None, meta=None, key=None, message=None, dry_run=False, fast=False, ifexists=None, missing_value=None, save=True, version_urls=False, cfg_proc=None, jobs=None, drop_after=False, on_collision='error')

Create and update a dataset from a list of URLs.

Format specification

Several arguments take format strings. These are similar to normal Python format strings where the names from URL-FILE (column names for a comma- or tab-separated file or properties for JSON) are available as placeholders. If URL-FILE is a CSV or TSV file, a positional index can also be used (i.e., “{0}” for the first column). Note that a placeholder cannot contain a ‘:’ or ‘!’.

In addition, the FILENAME-FORMAT arguments has a few special placeholders.

  • _repindex

    The constructed file names must be unique across all fields rows. To avoid collisions, the special placeholder “_repindex” can be added to the formatter. Its value will start at 0 and increment every time a file name repeats.

  • _url_hostname, _urlN, _url_basename*

    Various parts of the formatted URL are available. Take “http://datalad.org/asciicast/seamless_nested_repos.sh” as an example.

    “datalad.org” is stored as “_url_hostname”. Components of the URL’s path can be referenced as “_urlN”. “_url0” and “_url1” would map to “asciicast” and “seamless_nested_repos.sh”, respectively. The final part of the path is also available as “_url_basename”.

    This name is broken down further. “_url_basename_root” and “_url_basename_ext” provide access to the root name and extension. These values are similar to the result of os.path.splitext, but, in the case of multiple periods, the extension is identified using the same length heuristic that git-annex uses. As a result, the extension of “file.tar.gz” would be “.tar.gz”, not “.gz”. In addition, the fields “_url_basename_root_py” and “_url_basename_ext_py” provide access to the result of os.path.splitext.

  • _url_filename*

    These are similar to _url_basename* fields, but they are obtained with a server request. This is useful if the file name is set in the Content-Disposition header.

Examples

Consider a file “avatars.csv” that contains:

who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200

To download each link into a file name composed of the ‘who’ and ‘ext’ fields, we could run:

$ datalad addurls -d avatar_ds avatars.csv '{link}' '{who}.{ext}'

The -d avatar_ds is used to create a new dataset in “$PWD/avatar_ds”.

If we were already in a dataset and wanted to create a new subdataset in an “avatars” subdirectory, we could use “//” in the FILENAME-FORMAT argument:

$ datalad addurls avatars.csv '{link}' 'avatars//{who}.{ext}'

If the information is represented as JSON lines instead of comma separated values or a JSON array, you can use a utility like jq to transform the JSON lines into an array that addurls accepts:

$ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'

Note

For users familiar with ‘git annex addurl’: A large part of this plugin’s functionality can be viewed as transforming data from URL-FILE into a “url filename” format that fed to ‘git annex addurl –batch –with-files’.

Parameters:
  • urlfile – A file that contains URLs or information that can be used to construct URLs. Depending on the value of –input-type, this should be a comma- or tab-separated file (with a header as the first row) or a JSON file (structured as a list of objects with string values). If ‘-’, read from standard input, taking the content as JSON when –input-type is at its default value of ‘ext’. Alternatively, an iterable of dicts can be given.

  • urlformat – A format string that specifies the URL for each entry. See the ‘Format Specification’ section above.

  • filenameformat – Like URL-FORMAT, but this format string specifies the file to which the URL’s content will be downloaded. The name should be a relative path and will be taken as relative to the top-level dataset, regardless of whether it is specified via dataset or inferred. The file name may contain directories. The separator “//” can be used to indicate that the left-side directory should be created as a new subdataset. See the ‘Format Specification’ section above.

  • dataset (Dataset or None, optional) – Add the URLs to this dataset (or possibly subdatasets of this dataset). An empty or non-existent directory is passed to create a new dataset. New subdatasets can be specified with FILENAME- FORMAT. [Default: None]

  • input_type ({'ext', 'csv', 'tsv', 'json'}, optional) – Whether URL-FILE should be considered a CSV file, TSV file, or JSON file. The default value, “ext”, means to consider URL-FILE as a JSON file if it ends with “.json” or a TSV file if it ends with “.tsv”. Otherwise, treat it as a CSV file. [Default: ‘ext’]

  • exclude_autometa – By default, metadata field=value pairs are constructed with each column in URL-FILE, excluding any single column that is specified via URL-FORMAT. This argument can be used to exclude columns that match a regular expression. If set to ‘*’ or an empty string, automatic metadata extraction is disabled completely. This argument does not affect metadata set explicitly with –meta. [Default: None]

  • meta – A format string that specifies metadata. It should be structured as “<field>=<value>”. As an example, “location={3}” would mean that the value for the “location” metadata field should be set the value of the fourth column. This option can be given multiple times. [Default: None]

  • key – A format string that specifies an annex key for the file content. In this case, the file is not downloaded; instead the key is used to create the file without content. The value should be structured as “[et:]<input backend>[-s<bytes>]–<hash>”. The optional “et:” prefix, which requires git-annex 8.20201116 or later, signals to toggle extension state of the input backend (i.e., MD5 vs MD5E). As an example, “et:MD5-s{size}–{md5sum}” would use the ‘md5sum’ and ‘size’ columns to construct the key, migrating the key from MD5 to MD5E, with an extension based on the file name. Note: If the input backend itself is an annex extension backend (i.e., a backend with a trailing “E”), the key’s extension will not be updated to match the extension of the corresponding file name. Thus, unless the input keys and file names are generated from git-annex, it is recommended to avoid using extension backends as input. If an extension is desired, use the plain variant as input and prepend “et:” so that git-annex will migrate from the plain backend to the extension variant. [Default: None]

  • message (None or str, optional) – Use this message when committing the URL additions. [Default: None]

  • dry_run (bool, optional) – Report which URLs would be downloaded to which files and then exit. [Default: False]

  • fast (bool, optional) – If True, add the URLs, but don’t download their content. WARNING: ONLY USE THIS OPTION IF YOU UNDERSTAND THE CONSEQUENCES. If the content of the URLs is not downloaded, then datalad will refuse to retrieve the contents with datalad get <file> by default because the content of the URLs is not verified. Add annex.security.allow-unverified-downloads = ACKTHPPT to your git config to bypass the safety check. Underneath, this passes the –fast flag to git annex addurl. [Default: False]

  • ifexists ({None, 'overwrite', 'skip'}, optional) – What to do if a constructed file name already exists. The default behavior is to proceed with the git annex addurl, which will fail if the file size has changed. If set to ‘overwrite’, remove the old file before adding the new one. If set to ‘skip’, do not add the new file. [Default: None]

  • missing_value (None or str, optional) – When an empty string is encountered, use this value instead. [Default: None]

  • save (bool, optional) – by default all modifications to a dataset are immediately saved. Giving this option will disable this behavior. [Default: True]

  • version_urls (bool, optional) – Try to add a version ID to the URL. This currently only has an effect on HTTP URLs for AWS S3 buckets. s3:// URL versioning is not yet supported, but any URL that already contains a “versionId=” parameter will be used as is. [Default: False]

  • cfg_proc – Pass this cfg_proc value when calling create to make datasets. [Default: None]

  • jobs (int or None or {'auto'}, optional) – how many parallel jobs (where possible) to use. “auto” corresponds to the number defined by ‘datalad.runtime.max-annex-jobs’ configuration item NOTE: This option can only parallelize input retrieval (get) and output recording (save). DataLad does NOT parallelize your scripts for you. [Default: None]

  • drop_after (bool, optional) – drop files after adding to annex. [Default: False]

  • on_collision ({'error', 'error-if-different', 'take-first', 'take-last'}, optional) – What to do when more than one row produces the same file name. By default an error is triggered. “error-if-different” suppresses that error if rows for a given file name collision have the same URL and metadata. “take-first” or “take-last” indicate to instead take the first row or last row from each set of colliding rows. [Default: ‘error’]

  • on_failure ({'ignore', 'continue', 'stop'}, optional) – behavior to perform on failure: ‘ignore’ any failure is reported, but does not cause an exception; ‘continue’ if any failure occurs an exception will be raised at the end, but processing other actions will continue for as long as possible; ‘stop’: processing will stop on first failure and an exception is raised. A failure is any result with status ‘impossible’ or ‘error’. Raised exception is an IncompleteResultsError that carries the result dictionaries of the failures in its failed attribute. [Default: ‘continue’]

  • result_filter (callable or None, optional) – if given, each to-be-returned status dictionary is passed to this callable, and is only returned if the callable’s return value does not evaluate to False or a ValueError exception is raised. If the given callable supports **kwargs it will additionally be passed the keyword arguments of the original API call. [Default: None]

  • result_renderer – select rendering mode command results. ‘tailored’ enables a command- specific rendering style that is typically tailored to human consumption, if there is one for a specific command, or otherwise falls back on the the ‘generic’ result renderer; ‘generic’ renders each result in one line with key info like action, status, path, and an optional message); ‘json’ a complete JSON line serialization of the full result record; ‘json_pp’ like ‘json’, but pretty-printed spanning multiple lines; ‘disabled’ turns off result rendering entirely; ‘<template>’ reports any value(s) of any result properties in any format indicated by the template (e.g. ‘{path}’, compare with JSON output for all key-value choices). The template syntax follows the Python “format() language”. It is possible to report individual dictionary values, e.g. ‘{metadata[name]}’. If a 2nd-level key contains a colon, e.g. ‘music:Genre’, ‘:’ must be substituted by ‘#’ in the template, like so: ‘{metadata[music#Genre]}’. [Default: ‘tailored’]

  • result_xfm ({'datasets', 'successdatasets-or-none', 'paths', 'relpaths', 'metadata'} or callable or None, optional) – if given, each to-be-returned result status dictionary is passed to this callable, and its return value becomes the result instead. This is different from result_filter, as it can perform arbitrary transformation of the result value. This is mostly useful for top- level command invocations that need to provide the results in a particular format. Instead of a callable, a label for a pre-crafted result transformation can be given. [Default: None]

  • return_type ({'generator', 'list', 'item-or-list'}, optional) – return value behavior switch. If ‘item-or-list’ a single value is returned instead of a one-item return value list, or a list in case of multiple return values. None is return in case of an empty list. [Default: ‘list’]