datalad meta-conduct

Synopsis

datalad meta-conduct [-h] [-m MAX_WORKERS] [-p {process|thread|sequential}] [--pipeline-help] [--version] CONFIGURATION [ARGUMENTS [ARGUMENTS ...]]

Description

Conduct the execution of a processing pipeline

A processing pipeline is a metalad-specific application of the Unix shell philosophy: a number of small programs, each doing one thing, and doing that one thing very well.

Processing pipelines consist of:

  • A provider, which provides the data that should be processed
  • A list of processors. A processor reads data, either from the previous processor or from the provider, performs computations on the data, and returns a result that is processed by the next processor. The computation may have side effects, e.g. storing metadata.

The provider is usually executed in the current process’s main thread. Processors are usually executed in concurrent processes, i.e. workers. The maximum number of workers is given by the parameter MAX_WORKERS.
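Conceptually, the execution model resembles the following sketch (a simplified illustration in Python, not metalad’s actual implementation; provide_items and process_item are hypothetical stand-ins for a provider and a processor):

   from concurrent.futures import ThreadPoolExecutor

   # Hypothetical stand-ins for a provider and a processor;
   # metalad's real classes and interfaces differ.
   def provide_items():
       yield from ["item-1", "item-2", "item-3"]

   def process_item(item):
       return f"processed({item})"

   MAX_WORKERS = 4

   # The provider runs in the main thread; each item it yields is
   # handed to one of at most MAX_WORKERS concurrent workers
   # (corresponding to the "thread" processing mode).
   with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
       futures = [pool.submit(process_item, item) for item in provide_items()]
       for future in futures:
           print(future.result())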

Which provider and which processors are used is defined in a “configuration”, which is given as a JSON-serialized dictionary.
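For orientation only, the general shape of such a dictionary might resemble the sketch below, written here as a Python dictionary that is serialized to a file. The authoritative schema is defined by metalad and may differ; all module paths and class names below are hypothetical placeholders, and only the element names (“traverser”, “extractor”) correspond to names used in the examples below:

   import json

   # Sketch of the general shape only; the authoritative schema is
   # defined by metalad. Module paths and class names below are
   # hypothetical placeholders.
   pipeline = {
       "provider": {
           "name": "traverser",
           "module": "some.provider.module",   # hypothetical
           "class": "SomeProviderClass",       # hypothetical
           "arguments": {},
       },
       "processors": [
           {
               "name": "extractor",
               "module": "some.processor.module",  # hypothetical
               "class": "SomeProcessorClass",      # hypothetical
               "arguments": {},
           },
       ],
   }

   with open("pipeline.json", "w") as f:
       json.dump(pipeline, f, indent=2)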

Examples

Run the ‘metalad_example_dataset’ extractor on the top-level dataset and all subdatasets. Add the resulting metadata in aggregated mode. This command uses the provided pipeline definition ‘extract_metadata’:

% datalad meta-conduct extract_metadata traverser.top_level_dir=<dataset path> traverser.item_type=dataset traverser.traverse_sub_datasets=True extractor.extractor_type=dataset extractor.extractor_name=metalad_example_dataset adder.aggregate=True

Run the ‘metalad_example_file’ extractor on all files of the root dataset and its subdatasets. Automatically get file content if it is not present, and drop content that was automatically fetched after its metadata has been added. This command uses the provided pipeline definition ‘extract_metadata_autoget_autodrop’:

% datalad meta-conduct extract_metadata_autoget_autodrop traverser.top_level_dir=<dataset path> traverser.item_type=file traverser.traverse_sub_datasets=True extractor.extractor_type=file extractor.extractor_name=metalad_example_file adder.aggregate=True

Options

CONFIGURATION

Path to a file that contains the pipeline configuration as a JSON-serialized object. If the path is “-”, the configuration is read from standard input.
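For example, a configuration stored in a file “pipeline.json” could be passed directly, or piped in via standard input:

% datalad meta-conduct pipeline.json traverser.top_level_dir=<dataset path>

% datalad meta-conduct - traverser.top_level_dir=<dataset path> < pipeline.json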

ARGUMENTS

Constructor arguments for pipeline elements, i.e. provider, processors, and consumer. Each argument has to be prefixed with the name of the pipeline element, followed by “.”, the key name, “=”, and the value, i.e. pipeline element arguments match the pattern “<name>.<key>=<value>”.
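A minimal sketch of this parsing logic in Python (for illustration only; metalad’s actual argument handling may differ):

   def parse_element_argument(argument):
       # Split "<name>.<key>=<value>" on the first "=" and the
       # first "." -- a sketch, not metalad's actual code.
       name_and_key, _, value = argument.partition("=")
       name, _, key = name_and_key.partition(".")
       return name, key, value

   # Prints: ('traverser', 'top_level_dir', '<dataset path>')
   print(parse_element_argument("traverser.top_level_dir=<dataset path>"))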

-h, --help, --help-np

show this help message. --help-np forcefully disables the use of a pager for displaying the help message

-m MAX_WORKERS, --max-workers MAX_WORKERS

maximum number of workers. Constraints: value must be convertible to type ‘int’ or value must be NONE

-p {process|thread|sequential}, --processing-mode {process|thread|sequential}

Specify how elements are executed: in subprocesses (“process”), in threads (“thread”), or sequentially in the main thread (“sequential”). Constraints: value must be one of (‘process’, ‘thread’, ‘sequential’) [Default: ‘process’]
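For example, running all elements sequentially in the main thread can make failures easier to trace during pipeline development:

% datalad meta-conduct -p sequential extract_metadata traverser.top_level_dir=<dataset path> ...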

--pipeline-help

Show documentation for the elements in the pipeline and exit.

--version

show the module that provides the command, together with its version

Authors

datalad is developed by The DataLad Team and Contributors <team@datalad.org>.