Create and update a dataset from a list of URLs.

class datalad.plugin.addurls.Addurls

Bases: datalad.interface.base.Interface

Create and update a dataset from a list of URLs.

Format specification

Several arguments take format strings. These are similar to normal Python format strings where the names from URL-FILE (column names for a comma- or tab-separated file or properties for JSON) are available as placeholders. If URL-FILE is a CSV or TSV file, a positional index can also be used (i.e., “{0}” for the first column). Note that a placeholder cannot contain a ‘:’ or ‘!’.

In addition, the FILENAME-FORMAT argument has a few special placeholders.

  • _repindex

    The constructed file names must be unique across all rows. To avoid collisions, the special placeholder “_repindex” can be added to the formatter. Its value starts at 0 and increments every time a file name repeats.

  • _url_hostname, _urlN, _url_basename*

    Various parts of the formatted URL are available. Take “” as an example.

    “” is stored as “_url_hostname”. Components of the URL’s path can be referenced as “_urlN”. “_url0” and “_url1” would map to “asciicast” and “”, respectively. The final part of the path is also available as “_url_basename”.

    This name is broken down further. “_url_basename_root” and “_url_basename_ext” provide access to the root name and extension. These values are similar to the result of os.path.splitext, but, in the case of multiple periods, the extension is identified using the same length heuristic that git-annex uses. As a result, the extension of “file.tar.gz” would be “.tar.gz”, not “.gz”. In addition, the fields “_url_basename_root_py” and “_url_basename_ext_py” provide access to the result of os.path.splitext.

  • _url_filename*

    These are similar to _url_basename* fields, but they are obtained with a server request. This is useful if the file name is set in the Content-Disposition header.
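The “_repindex” behavior described above can be illustrated with a small sketch. This is not the plugin's code: the helper name `format_with_repindex` and the sample rows are hypothetical, and the sketch assumes that a name counts as repeated when its rendering with `_repindex=0` has been seen before.

```python
def format_with_repindex(filename_format, rows):
    """Format one file name per row, bumping _repindex on repeats.

    Illustrative only; the helper name and the bookkeeping strategy are
    assumptions, not taken from datalad.
    """
    repeats = {}  # name rendered with _repindex=0 -> highest index used
    names = []
    for row in rows:
        first_try = filename_format.format(_repindex=0, **row)
        if first_try in repeats:
            # The name was seen before: use the next free index.
            repeats[first_try] += 1
            names.append(
                filename_format.format(_repindex=repeats[first_try], **row))
        else:
            repeats[first_try] = 0
            names.append(first_try)
    return names


# Hypothetical rows in which the "who" value repeats.
rows = [{"who": "a"}, {"who": "b"}, {"who": "a"}, {"who": "a"}]
names = format_with_repindex("{who}_{_repindex}.png", rows)
print(names)  # ['a_0.png', 'b_0.png', 'a_1.png', 'a_2.png']
```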


Consider a file “avatars.csv” that contains:


To download each link into a file name composed of the ‘who’ and ‘ext’ fields, we could run:

$ datalad addurls -d avatar_ds --fast avatars.csv '{link}' '{who}.{ext}'

The -d avatar_ds is used to create a new dataset in “$PWD/avatar_ds”.
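To make the expansion of the two format strings concrete, here is a minimal sketch that applies them to each row of a CSV file using plain Python string formatting. The file contents and the helper `expand_rows` are hypothetical; only the column names “who”, “ext”, and “link” come from the command above. The sketch formats strings and does not download anything.

```python
import csv
import io

# Hypothetical file contents; only the column names are taken from the
# command shown above.
AVATARS_CSV = """\
who,ext,link
alice,png,https://example.com/avatars/1
bob,jpg,https://example.com/avatars/2
"""


def expand_rows(text, url_format, filename_format):
    """Return one (url, filename) pair per CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (url_format.format(**row), filename_format.format(**row))
        for row in reader
    ]


pairs = expand_rows(AVATARS_CSV, "{link}", "{who}.{ext}")
for url, filename in pairs:
    print(url, filename)
```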

If we were already in a dataset and wanted to create a new subdataset in an “avatars” subdirectory, we could use “//” in the FILENAME-FORMAT argument:

$ datalad addurls --fast avatars.csv '{link}' 'avatars//{who}.{ext}'

If the information is represented as JSON lines instead of comma separated values or a JSON array, you can use a utility like jq to transform the JSON lines into an array that addurls accepts:

$ ... | jq --slurp . | datalad addurls - '{link}' '{who}.{ext}'
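If jq is not available, the same transformation can be sketched in a few lines of Python. The helper name `json_lines_to_array` and the sample records are hypothetical, and the sketch assumes exactly one JSON object per non-empty line.

```python
import json


def json_lines_to_array(text):
    """Parse one JSON value per non-empty line and return them as a list."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical JSON lines input.
lines = '{"who": "alice", "ext": "png"}\n{"who": "bob", "ext": "jpg"}\n'
records = json_lines_to_array(lines)

# Serialized as a single JSON array, the form that addurls accepts.
print(json.dumps(records))
```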


For users familiar with ‘git annex addurl’: A large part of this plugin’s functionality can be viewed as transforming data from URL-FILE into a “url filename” format that is fed to ‘git annex addurl --batch --with-files’.

class EnsureChoice(*values)


Ensure an input is an element of a set of possible values

class EnsureDataset


Despite its name, this constraint does not actually ensure that the argument is a valid dataset, because for procedural reasons this would typically duplicate subsequent checks and processing. However, it can be used to achieve uniform documentation of dataset arguments.

class EnsureNone


Ensure an input is of value None

class EnsureStr(min_len=0)


Ensure an input is a string.

No automatic conversion is attempted.

class Parameter(constraints=None, doc=None, args=None, **kwargs)

Bases: object

This class shall serve as a representation of a parameter.

get_autodoc(name, indent=' ', width=70, default=None, has_default=False)

Docstring for the parameter to be used in lists of parameters

Return type: string or list of strings (if indent is None)

datasetmethod(name=None, dataset_argname='dataset')

Decorator for return value evaluation of datalad commands.

Note, this decorator is only compatible with commands that return status dict sequences!

Two basic modes of operation are supported: 1) “generator mode” that yields individual results, and 2) “list mode” that returns a sequence of results. The behavior can be selected via the kwarg return_type. Default is “list mode”.

This decorator implements common functionality for result rendering/output, error detection/handling, and logging.

Result rendering/output can be triggered via the datalad.api.result-renderer configuration variable, or the result_renderer keyword argument of each decorated command. Supported modes are: ‘default’ (one line per result with action, status, path, and an optional message), ‘json’ (one object per result, like git-annex), ‘json_pp’ (like ‘json’, but pretty-printed spanning multiple lines), and ‘tailored’ (custom output formatting provided by each command class, if any).

Error detection works by inspecting the status item of all result dictionaries. Any occurrence of a status other than ‘ok’ or ‘notneeded’ will cause an IncompleteResultsError exception to be raised that carries the failed actions’ status dictionaries in its failed attribute.

Status messages are logged automatically. By default, the following association of result status and log channel is used: ‘ok’ (debug), ‘notneeded’ (debug), ‘impossible’ (warning), ‘error’ (error). Logger instances included in the results are used to capture the origin of a status report.
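The error detection and logging rules above can be sketched for “list mode” as follows. This is an illustration, not the decorator's implementation: `collect_results`, `STATUS_TO_LEVEL`, and `IncompleteResultsSketch` (a stand-in for IncompleteResultsError) are hypothetical names, and treating unknown statuses as errors is an assumption.

```python
import logging

lgr = logging.getLogger("results_sketch")

# Default association of result status and log channel, as described above.
STATUS_TO_LEVEL = {
    "ok": logging.DEBUG,
    "notneeded": logging.DEBUG,
    "impossible": logging.WARNING,
    "error": logging.ERROR,
}


class IncompleteResultsSketch(RuntimeError):
    """Hypothetical stand-in for IncompleteResultsError."""

    def __init__(self, failed):
        super().__init__("%d result(s) failed" % len(failed))
        self.failed = failed  # status dicts of the failed actions


def collect_results(results):
    """Log every result, return them as a list, raise if any failed."""
    collected, failed = [], []
    for res in results:
        # Assumption: a status missing from the table is logged as an error.
        level = STATUS_TO_LEVEL.get(res["status"], logging.ERROR)
        lgr.log(level, "%s: %s", res.get("action"), res["status"])
        collected.append(res)
        if res["status"] not in ("ok", "notneeded"):
            failed.append(res)
    if failed:
        raise IncompleteResultsSketch(failed)
    return collected
```

A “generator mode” variant would yield each result as it arrives instead of building the list first.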

Parameters: func (function) – __call__ method of a subclass of Interface, i.e., a datalad command definition

class datalad.plugin.addurls.AnnexKeyParser(format_fn, format_string)

Bases: object

Parse a full annex key into subparts.

The key may have an “et:” prefix, which signals that the backend’s extension state should be toggled.

See <>.

Parameters:
  • format_fn (callable) – Function that takes a format string and a row and returns the full key.
  • format_string (str) – Format string for the full key.

Format the key with the fields in row and parse it.

Returns: A dictionary with the following keys that match their counterparts in the output of `git annex examinekey --json`: “key” (the full annex key), “backend”, and “keyname”. If the key had an “et:” prefix, there is also a “target_backend” key.
Raises: ValueError if the formatted value doesn’t look like a valid key
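As a rough illustration of the result described above, the following sketch splits an already formatted key into its parts. It is not the class's implementation. The helper `parse_key_sketch` is hypothetical, and it assumes that the backend is the text before the first “-”, that the key name follows the first “--”, and that toggling the extension state means adding or removing a trailing “E” on the backend name.

```python
def parse_key_sketch(key):
    """Split a formatted annex key into the parts listed above.

    Illustrative only; the parsing rules are simplifying assumptions.
    """
    toggle = key.startswith("et:")
    if toggle:
        key = key[len("et:"):]
    fields, sep, keyname = key.partition("--")
    backend = fields.split("-", 1)[0]
    if not sep or not keyname or not backend.isalnum():
        raise ValueError("Value does not look like a key: {!r}".format(key))
    parsed = {"key": key, "backend": backend, "keyname": keyname}
    if toggle:
        # Assumption: the extension state is signaled by a trailing "E".
        if backend.endswith("E"):
            parsed["target_backend"] = backend[:-1]
        else:
            parsed["target_backend"] = backend + "E"
    return parsed


# Hypothetical key values.
print(parse_key_sketch("MD5E-s1234--abc.png"))
print(parse_key_sketch("et:MD5-s1234--abc"))
```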
class datalad.plugin.addurls.BatchedRegisterUrl(ds, repo=None)

Bases: datalad.plugin.addurls.RegisterUrl

Like RegisterUrl, but use batched commands underneath.

examinekey(parsed_key, filename, migrate=False)
fromkey(key, filename)
registerurl(key, url)

class datalad.plugin.addurls.Formatter(idx_to_name=None, missing_value=None)

Bases: string.Formatter

Formatter that gives precedence to custom keys.

The first positional argument to the format call should be a mapping whose keys are exposed as placeholders (e.g., “{key1}.py”).

Parameters:
  • idx_to_name (dict) – A mapping from a positional index to a key. If not provided, “{N}” elements are not supported.
  • missing_value (str, optional) – When column lookup results in an empty string, use this value in its place.

convert_field(value, conversion)
format(format_string, *args, **kwargs)
get_value(key, args, kwargs)

Look for key’s value in args[0] mapping first.
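The lookup order described for get_value can be sketched with a small string.Formatter subclass. The class name `MappingFirstFormatter` is hypothetical and the code is a simplified illustration under the stated parameter descriptions, not the plugin's Formatter.

```python
import string


class MappingFirstFormatter(string.Formatter):
    """Resolve placeholders from the mapping passed as the first argument."""

    def __init__(self, idx_to_name=None, missing_value=None):
        self.idx_to_name = idx_to_name or {}
        self.missing_value = missing_value

    def get_value(self, key, args, kwargs):
        data = args[0]
        # string.Formatter passes "{0}"-style placeholders as integers;
        # translate them to a column name first.
        name = self.idx_to_name[key] if isinstance(key, int) else key
        value = data[name]
        if value == "" and self.missing_value is not None:
            return self.missing_value
        return value


# Hypothetical column mapping and rows.
fmt = MappingFirstFormatter(idx_to_name={0: "who", 1: "ext"},
                            missing_value="unknown")
print(fmt.format("{0}.{ext}", {"who": "datalad", "ext": "png"}))
print(fmt.format("{who}.{ext}", {"who": "", "ext": "png"}))
```

In this sketch, a positional placeholder without an entry in idx_to_name raises KeyError, which loosely mirrors the note that “{N}” elements are unsupported when no mapping is given.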

class datalad.plugin.addurls.RegisterUrl(ds, repo=None)

Bases: object

Create files (without content) from user-supplied keys and register URLs.

examinekey(parsed_key, filename, migrate=False)
fromkey(key, filename)
registerurl(key, url)

class datalad.plugin.addurls.RepFormatter(*args, **kwargs)

Bases: datalad.plugin.addurls.Formatter

Extend Formatter to support a {_repindex} placeholder.

format(*args, **kwargs)
get_value(key, args, kwargs)

Look for key’s value in args[0] mapping first.

datalad.plugin.addurls.add_extra_filename_values(filename_format, rows, urls, dry_run)

Extend rows with values for special formatting fields.


Process metadata arguments.

Parameters: args (iterable of str) – Formatted metadata arguments for ‘git-annex metadata --set’.
Return type: A dict mapping field names to values.

datalad.plugin.addurls.extract(rows, colidx_to_name=None, url_format='{0}', filename_format='{1}', exclude_autometa=None, meta=None, key=None, dry_run=False, missing_value=None)

Extract and format information from rows.

Parameters:
  • rows (list of dict) –
  • colidx_to_name (dict, optional) – Mapping from a position index to a column name.
  • All other parameters match those described in Addurls.

Returns: A tuple where the first item is a list with a dict of extracted information for each row in rows and the second item is a list of subdataset paths, sorted breadth-first.

Remove illegal names from fields.

Note: This is like filter(is_legal_metafield, fields) but the dropped values are logged.

datalad.plugin.addurls.fmt_to_name(format_string, num_to_name)

Try to map a format string to a single name.

Parameters:
  • format_string (string) –
  • num_to_name (dict) – A dictionary that maps from an integer to a column name. This allows a format string that consists of a positional index to be mapped to a column name.

Returns: A placeholder name if format_string consists of a single placeholder and no other text. Otherwise, None is returned.
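One way to implement the behavior described above is with `string.Formatter().parse`, which splits a format string into literal text and placeholders. The helper `fmt_to_name_sketch` below is hypothetical. Returning the raw placeholder when an index has no entry in num_to_name is an assumption.

```python
import string


def fmt_to_name_sketch(format_string, num_to_name):
    """Return the single placeholder name in format_string, else None."""
    parsed = list(string.Formatter().parse(format_string))
    if len(parsed) != 1:
        return None  # several placeholders, or trailing literal text
    literal_text, field_name, _spec, _conversion = parsed[0]
    if literal_text or not field_name:
        return None  # leading literal text, or no named placeholder
    try:
        # "{0}"-style placeholders are translated to a column name.
        return num_to_name[int(field_name)]
    except (KeyError, ValueError):
        # Assumption: fall back to the placeholder text itself.
        return field_name


print(fmt_to_name_sketch("{0}", {0: "link"}))
print(fmt_to_name_sketch("{who}", {}))
print(fmt_to_name_sketch("{who}.{ext}", {}))
```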

datalad.plugin.addurls.get_file_parts(filename, prefix='name')

Assign a name to various parts of a file.

Parameters:
  • filename (str) – A file name (no leading path is permitted).
  • prefix (str) – Prefix to prepend to the key names.

Return type: A dict mapping each part to a value.
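The os.path.splitext-based parts of such a mapping can be sketched as follows. The helper `file_parts_py` is hypothetical, and its key names are inferred from the “_url_basename_root_py” and “_url_basename_ext_py” fields mentioned earlier. It does not reproduce the git-annex length heuristic used for the non-“_py” fields.

```python
import os.path


def file_parts_py(filename, prefix="name"):
    """Map the plain os.path.splitext parts of filename to prefixed keys."""
    root, ext = os.path.splitext(filename)
    return {
        prefix: filename,
        prefix + "_root_py": root,
        prefix + "_ext_py": ext,
    }


parts = file_parts_py("file.tar.gz")
print(parts)
# {'name': 'file.tar.gz', 'name_root_py': 'file.tar', 'name_ext_py': '.gz'}
```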


Yield field names in format_string.


datalad.plugin.addurls.get_subpaths(filename)

Convert “//” marker in filename to a list of subpaths.

>>> from datalad.plugin.addurls import get_subpaths
>>> get_subpaths("p1/p2//p3/p4//file")
('p1/p2/p3/p4/file', ['p1/p2', 'p1/p2/p3/p4'])

Note: With Python 3, the subpaths could be generated with

itertools.accumulate(filename.split("//")[:-1], os.path.join)

Parameters: filename (str) – File name with “//” marking subpaths.
Returns: A tuple of the filename with any “//” collapsed to a single separator and a list of subpaths (str).
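The itertools.accumulate approach mentioned in the note can be written out as a runnable sketch. The helper name `subpaths_sketch` is hypothetical, and posixpath is used instead of os.path so that the output uses “/” on every platform.

```python
import posixpath
from itertools import accumulate


def subpaths_sketch(filename):
    """Return (collapsed filename, list of subdataset paths)."""
    parts = filename.split("//")
    collapsed = posixpath.join(*parts)
    # Each prefix that ends at a "//" marker is a subdataset path.
    subpaths = list(accumulate(parts[:-1], posixpath.join))
    return collapsed, subpaths


result = subpaths_sketch("p1/p2//p3/p4//file")
print(result)  # ('p1/p2/p3/p4/file', ['p1/p2', 'p1/p2/p3/p4'])
```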

Assign a name to various parts of the URL.

Parameters: url (str) –
Returns: A dict with keys _url_hostname and, for a path with N+1 parts, ‘_url0’ through ‘_urlN’. There is also a _url_basename key for the rightmost part of the path.
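A mapping of this shape can be built with urllib.parse, as in the sketch below. The helper `url_parts_sketch` and the example URL are hypothetical. Taking the host name from `netloc` and using an empty basename for a URL without a path are assumptions.

```python
from urllib.parse import urlparse


def url_parts_sketch(url):
    """Name the host and the path components of url."""
    parsed = urlparse(url)
    segments = [seg for seg in parsed.path.split("/") if seg]
    parts = {"_url_hostname": parsed.netloc}
    for idx, seg in enumerate(segments):
        parts["_url{}".format(idx)] = seg
    # Assumption: an empty path yields an empty basename.
    parts["_url_basename"] = segments[-1] if segments else ""
    return parts


url_parts = url_parts_sketch("https://example.com/a/b/c.txt")
print(url_parts)
```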

Test whether name is a valid metadata field.

The set of permitted characters is taken from git-annex’s MetaData.hs:legalField.


Sort paths by directory level and then alphabetically.

Parameters: paths (iterable of str) –
Return type: Generator of sorted paths.
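The ordering described above amounts to sorting on a (depth, name) key. The following sketch assumes “/” as the path separator, and the helper name `sort_paths_sketch` is hypothetical.

```python
def sort_paths_sketch(paths):
    """Yield paths ordered by directory level, then alphabetically."""
    # Assumption: "/" separates path components.
    yield from sorted(paths, key=lambda path: (path.count("/"), path))


ordered = list(sort_paths_sketch(["b/c", "a", "b", "a/z/y"]))
print(ordered)  # ['a', 'b', 'b/c', 'a/z/y']
```

Sorting by level first places every parent path before its children, which matches the breadth-first ordering of subdataset paths mentioned for extract.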