datalad addurls

Synopsis

datalad addurls [-h] [-d DATASET] [-t TYPE] [-x REGEXP] [-m FORMAT] [--message MESSAGE] [-n] [--fast] [--ifexists ACTION] [--missing-value VALUE] [--nosave] [--version-urls] URL-FILE URL-FORMAT FILENAME-FORMAT

Description

Create and update a dataset from a list of URLs.

Format specification

Several arguments take format strings. These are similar to normal Python format strings where the names from URL-FILE (column names for a CSV or properties for JSON) are available as placeholders. If URL-FILE is a CSV file, a positional index can also be used (i.e., “{0}” for the first column). Note that a placeholder cannot contain a ‘:’ or ‘!’.
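The placeholder behavior described above can be approximated with Python's built-in str.format. This is a minimal sketch for illustration, not DataLad's own formatter; the helper name fill() is hypothetical, and the row is borrowed from the “avatars.csv” example in the Examples section.

```python
# Illustration only: plain str.format, not DataLad's internal formatter.
header = ["who", "ext", "link"]
values = ["neurodebian", "png", "https://avatars3.githubusercontent.com/u/260793"]
row = dict(zip(header, values))


def fill(fmt, values, row):
    """Fill a format string from positional values and named columns."""
    # Positional arguments serve "{0}", "{1}", ...; keyword arguments
    # serve "{who}", "{ext}", "{link}".
    return fmt.format(*values, **row)


by_name = fill("{who}.{ext}", values, row)
by_index = fill("{0}.{1}", values, row)
print(by_name)
print(by_index)
```

Both calls produce the same file name, because “{0}” and “{who}” refer to the same column.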

In addition, the FILENAME-FORMAT argument has a few special placeholders.

  • _repindex

    The constructed file names must be unique across all rows. To avoid collisions, the special placeholder “_repindex” can be added to the formatter. Its value will start at 0 and increment every time a file name repeats.

  • _url_hostname, _urlN, _url_basename*

    Various parts of the formatted URL are available. Take “http://datalad.org/asciicast/seamless_nested_repos.sh” as an example.

    “datalad.org” is stored as “_url_hostname”. Components of the URL’s path can be referenced as “_urlN”. “_url0” and “_url1” would map to “asciicast” and “seamless_nested_repos.sh”, respectively. The final part of the path is also available as “_url_basename”.

    This name is broken down further. “_url_basename_root” and “_url_basename_ext” provide access to the root name and extension. These values are similar to the result of os.path.splitext, but, in the case of multiple periods, the extension is identified using the same length heuristic that git-annex uses. As a result, the extension of “file.tar.gz” would be “.tar.gz”, not “.gz”. In addition, the fields “_url_basename_root_py” and “_url_basename_ext_py” provide access to the result of os.path.splitext.

  • _url_filename*

    These are similar to _url_basename* fields, but they are obtained with a server request. This is useful if the file name is set in the Content-Disposition header.
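The special placeholders above can be sketched in Python as follows. This is an approximation under stated assumptions, not DataLad's implementation: the function names url_fields() and with_repindex() are hypothetical, the hostname is taken from the URL's network location, and only the os.path.splitext variants (“_py”) are computed because the git-annex extension heuristic is not reproduced here. The “_url_filename*” fields are omitted since they require a server request.

```python
import os.path
from urllib.parse import urlparse


def url_fields(url):
    """Return a subset of the URL-derived placeholders as a dict (sketch)."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    fields = {"_url_hostname": parsed.netloc}  # assumption: netloc as hostname
    for i, part in enumerate(parts):
        fields["_url{}".format(i)] = part
    basename = parts[-1] if parts else ""
    root_py, ext_py = os.path.splitext(basename)
    fields["_url_basename"] = basename
    fields["_url_basename_root_py"] = root_py
    fields["_url_basename_ext_py"] = ext_py
    return fields


def with_repindex(filename_format, rows):
    """Format file names, numbering repeats with _repindex starting at 0."""
    counts = {}
    names = []
    for row in rows:
        # Identify repeats by the name with the index left blank.
        key = filename_format.format(_repindex="", **row)
        index = counts.get(key, 0)
        counts[key] = index + 1
        names.append(filename_format.format(_repindex=index, **row))
    return names


fields = url_fields("http://datalad.org/asciicast/seamless_nested_repos.sh")
names = with_repindex(
    "{who}-{_repindex}.png",
    [{"who": "a"}, {"who": "a"}, {"who": "b"}],
)
print(fields)
print(names)
```

In this sketch the second row named “a” receives index 1, so the three constructed names stay distinct.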

Examples

Consider a file “avatars.csv” that contains:

who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200

To download each link into a file name composed of the ‘who’ and ‘ext’ fields, we could run:

$ datalad addurls -d avatar_ds --fast avatars.csv '{link}' '{who}.{ext}'

The -d avatar_ds is used to create a new dataset in “$PWD/avatar_ds”.

If we were already in a dataset and wanted to create a new subdataset in an “avatars” subdirectory, we could use “//” in the FILENAME-FORMAT argument:

$ datalad addurls --fast avatars.csv '{link}' 'avatars//{who}.{ext}'
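To make the role of “//” concrete, the sketch below splits a formatted file name into the directories that would become subdatasets and the resulting relative path. The helper split_subdatasets() is hypothetical, and its handling of more than one “//” (nested subdatasets) is an assumption rather than documented behavior.

```python
def split_subdatasets(filename):
    """Split 'sub//file' into (subdataset directories, plain relative path)."""
    parts = filename.split("//")
    # Every prefix that ends at a "//" marks a subdataset directory.
    subdatasets = ["/".join(parts[:i]) for i in range(1, len(parts))]
    return subdatasets, "/".join(parts)


subdatasets, path = split_subdatasets("avatars//neurodebian.png")
print(subdatasets, path)
```

For the example above, the only subdataset directory is “avatars”, and the file lands at “avatars/neurodebian.png”.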

NOTE

For users familiar with ‘git annex addurl’: A large part of this plugin’s functionality can be viewed as transforming data from URL-FILE into a “url filename” format that is fed to ‘git annex addurl --batch --with-files’.
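The transformation mentioned in this note can be sketched as follows, using the “avatars.csv” content from the Examples section. This is an illustration of the idea only; the function url_filename_lines() is hypothetical and leaves out everything else the command does (metadata, subdatasets, saving).

```python
import csv
import io

CSV_TEXT = """\
who,ext,link
neurodebian,png,https://avatars3.githubusercontent.com/u/260793
datalad,png,https://avatars1.githubusercontent.com/u/8927200
"""


def url_filename_lines(csv_text, url_format, filename_format):
    """Turn CSV rows into 'url filename' lines (sketch)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        url = url_format.format(**row)
        filename = filename_format.format(**row)
        lines.append("{} {}".format(url, filename))
    return lines


lines = url_filename_lines(CSV_TEXT, "{link}", "{who}.{ext}")
for line in lines:
    print(line)
```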

Options

URL-FILE

A file that contains URLs or information that can be used to construct URLs. Depending on the value of --input-type, this should be a CSV file (with a header as the first row) or a JSON file (structured as a list of objects with string values).
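As an illustration of the JSON layout described above, the sketch below expresses the “avatars.csv” rows from the Examples section as a list of objects with string values and formats a file name from each. The variable names are hypothetical.

```python
import json

# Assumed JSON equivalent of the avatars.csv example: a list of objects
# whose values are all strings.
JSON_TEXT = """
[
  {"who": "neurodebian", "ext": "png",
   "link": "https://avatars3.githubusercontent.com/u/260793"},
  {"who": "datalad", "ext": "png",
   "link": "https://avatars1.githubusercontent.com/u/8927200"}
]
"""

rows = json.loads(JSON_TEXT)
json_names = ["{who}.{ext}".format(**row) for row in rows]
print(json_names)
```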

URL-FORMAT

A format string that specifies the URL for each entry. See the ‘Format specification’ section above.

FILENAME-FORMAT

Like URL-FORMAT, but this format string specifies the file to which the URL’s content will be downloaded. The file name may contain directories. The separator “//” can be used to indicate that the left-side directory should be created as a new subdataset. See the ‘Format specification’ section above.

-h, --help, --help-np

Show this help message. --help-np forcibly disables the use of a pager for displaying the help message.

-d DATASET, --dataset DATASET

Add the URLs to this dataset (or possibly subdatasets of this dataset). To create a new dataset, pass an empty or non-existent directory. New subdatasets can be specified with FILENAME-FORMAT. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path)

-t TYPE, --input-type TYPE

Whether URL-FILE should be considered a CSV file or a JSON file. The default value, “ext”, means to consider URL-FILE as a JSON file if it ends with “.json”. Otherwise, treat it as a CSV file. Constraints: value must be one of (‘ext’, ‘csv’, ‘json’) [Default: ‘ext’]
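The “ext” rule described above amounts to a check on the file name. A minimal sketch, with the hypothetical helper resolve_input_type():

```python
def resolve_input_type(url_file, input_type="ext"):
    """Decide whether URL-FILE is read as CSV or JSON (sketch)."""
    if input_type not in ("ext", "csv", "json"):
        raise ValueError("input type must be one of 'ext', 'csv', 'json'")
    if input_type == "ext":
        # Only a ".json" suffix selects JSON; everything else is CSV.
        return "json" if url_file.endswith(".json") else "csv"
    return input_type


print(resolve_input_type("avatars.csv"))
print(resolve_input_type("avatars.json"))
print(resolve_input_type("avatars.txt", input_type="json"))
```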

-x REGEXP, --exclude-autometa REGEXP

By default, metadata field=value pairs are constructed with each column in URL-FILE, excluding any single column that is specified via URL-FORMAT. This argument can be used to exclude columns that match a regular expression. If set to ‘*’ or an empty string, automatic metadata extraction is disabled completely. This argument does not affect metadata set explicitly with --meta. [Default: None]
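The automatic metadata rule can be sketched as below. This is a reading of the description above, not DataLad's code: the helper names are hypothetical, matching with re.search is an assumption, and positional placeholders such as “{0}” are not mapped back to column names.

```python
import re
import string


def format_fields(fmt):
    """Return the placeholder names used in a format string."""
    return [name for _, name, _, _ in string.Formatter().parse(fmt) if name]


def auto_metadata(row, url_format, exclude=None):
    """Build field=value pairs from a row, following the rule above (sketch)."""
    if exclude in ("*", ""):
        return []  # automatic metadata disabled completely
    url_columns = format_fields(url_format)
    # A column is skipped only when it is the single one used by URL-FORMAT.
    skipped = set(url_columns) if len(url_columns) == 1 else set()
    pairs = []
    for column, value in row.items():
        if column in skipped:
            continue
        if exclude is not None and re.search(exclude, column):
            continue
        pairs.append("{}={}".format(column, value))
    return pairs


row = {
    "who": "datalad",
    "ext": "png",
    "link": "https://avatars1.githubusercontent.com/u/8927200",
}
print(auto_metadata(row, "{link}"))
print(auto_metadata(row, "{link}", exclude="^ext$"))
```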

-m FORMAT, --meta FORMAT

A format string that specifies metadata. It should be structured as “<field>=<value>”. As an example, “location={3}” would mean that the value for the “location” metadata field should be set to the value of the fourth column. This option can be given multiple times. [Default: None]
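The “location={3}” example can be illustrated with the sketch below. The helper format_meta() and the sample row are hypothetical; the point is only that the part after “=” is formatted like any other format string.

```python
def format_meta(meta_format, values):
    """Split '<field>=<value>' and fill the value from a row (sketch)."""
    field, _, value_format = meta_format.partition("=")
    return field, value_format.format(*values)


# Hypothetical four-column row; "{3}" picks the fourth column.
sample_values = ["a.png", "http://example.com/a.png", "2019", "Hanover"]
meta_field, meta_value = format_meta("location={3}", sample_values)
print(meta_field, meta_value)
```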

--message MESSAGE

Use this message when committing the URL additions. Constraints: value must be NONE, or value must be a string [Default: None]

-n, --dry-run

Report which URLs would be downloaded to which files and then exit. [Default: False]

--fast

Add the URLs, but don’t download their content. Underneath, this passes the --fast flag to git annex addurl. [Default: False]

--ifexists ACTION

What to do if a constructed file name already exists. The default behavior is to proceed with the git annex addurl, which will fail if the file size has changed. If set to ‘overwrite’, remove the old file before adding the new one. If set to ‘skip’, do not add the new file. Constraints: value must be NONE, or value must be one of (‘overwrite’, ‘skip’) [Default: None]

--missing-value VALUE

When an empty string is encountered, use this value instead. Constraints: value must be NONE, or value must be a string [Default: None]
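The substitution described above can be pictured with the following sketch. The helper fill_missing() is hypothetical, and leaving rows unchanged when no value is given is an assumption made for this illustration, not a statement about how the command treats empty strings by default.

```python
def fill_missing(row, missing_value=None):
    """Replace empty-string values in a row with missing_value (sketch)."""
    if missing_value is None:
        return dict(row)  # assumption: nothing is substituted
    return {
        key: (missing_value if value == "" else value)
        for key, value in row.items()
    }


filled = fill_missing({"who": "datalad", "ext": ""}, missing_value="unknown")
print(filled)
```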

--nosave

By default, all modifications to a dataset are immediately saved. Giving this option will disable this behavior. [Default: True]

--version-urls

Try to add a version ID to the URL. This currently only has an effect on URLs for AWS S3 buckets. [Default: False]

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.