Conduct

Specification scope and status

This specification provides an overview of meta-conduct concepts and their current implementation.

Purpose and Design

Meta-conduct executes a number of metadata operations in a pipeline. A pipeline is composed of:

  • one Provider
  • a number of Processors
  • and an optional Consumer

The task of the provider is to supply the elements that the processors should operate on, for example, all files in a dataset. Processors perform certain operations on the provided elements, for example, extracting metadata. The consumer consumes the data generated by the processors. A consumer could, for example, add metadata to the metadata store of a dataset.

The provider, the processors, and the consumer are described by a Pipeline Definition that is provided to conduct during invocation. The pipeline definition is a JSON-encoded description of the provider, the processors, and the optional consumer that should be used.
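
To make the shape of such a definition concrete, the following is a minimal sketch of a JSON pipeline definition and a loader that checks its overall structure. The key names ("provider", "processors", "consumer", "name", "arguments") and the element names are hypothetical and chosen for illustration only; they are not taken from the schema that conduct actually expects.

```python
import json

# Hypothetical pipeline definition. Key names and element names are
# illustrative assumptions, not the schema conduct actually uses.
PIPELINE_DEFINITION = """
{
  "provider": {"name": "file_provider", "arguments": {"path": "."}},
  "processors": [
    {"name": "extractor", "arguments": {}},
    {"name": "annotator", "arguments": {}}
  ],
  "consumer": {"name": "metadata_adder", "arguments": {}}
}
"""


def load_pipeline_definition(text):
    """Parse a JSON pipeline definition and check its overall shape."""
    definition = json.loads(text)
    if not isinstance(definition.get("provider"), dict):
        raise ValueError("a pipeline needs exactly one provider")
    if not isinstance(definition.get("processors"), list):
        raise ValueError("processors must be given as a list")
    # The consumer is optional; normalize a missing one to None.
    definition.setdefault("consumer", None)
    return definition


definition = load_pipeline_definition(PIPELINE_DEFINITION)
print([processor["name"] for processor in definition["processors"]])
```

The loader only enforces the structure described above: one provider, a list of processors, and at most one consumer.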

Conduct can parallelize the execution of multiple processor pipelines, and does so by default. More precisely, it can execute processors in concurrent processes or threads (using Python’s concurrent.futures module). For this to work, the provider must yield elements that can be processed independently. On each element yielded by the provider, conduct executes all processors in the order in which they are defined in the pipeline definition. These per-element processor pipelines are the units that conduct runs concurrently. Conduct then hands the results to the consumer, of which at most one exists. That means the consumer aggregates all results that were generated by the concurrently executed processor pipelines.
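
The execution model can be sketched with the standard library's concurrent.futures module. This is an illustration under stated assumptions, not conduct's implementation: the functions provider, to_upper, add_suffix, consumer, and conduct are hypothetical stand-ins, and the processors simply transform strings.

```python
from concurrent.futures import ThreadPoolExecutor


def provider():
    """Hypothetical provider: yields independently processable elements."""
    yield from ["a.txt", "b.txt", "c.txt"]


def to_upper(data):
    """Hypothetical first processor."""
    return data.upper()


def add_suffix(data):
    """Hypothetical second processor."""
    return data + "!"


# Processors run in the order in which they are listed here.
PROCESSORS = [to_upper, add_suffix]


def run_processor_pipeline(element):
    """Apply all processors, in definition order, to a single element."""
    data = element
    for processor in PROCESSORS:
        data = processor(data)
    return data


def consumer(results):
    """Hypothetical single consumer: aggregates all pipeline results."""
    return sorted(results)


def conduct(max_workers=4):
    """Run one processor pipeline per element concurrently, then consume."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(run_processor_pipeline, provider()))
    return consumer(results)


print(conduct())
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor would run the per-element pipelines in separate processes instead of threads; in that case the elements and results have to be picklable.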

(Note: you don’t have to use a consumer to process results. An alternative would be to use a processor that finalizes the data processing, for example, by storing metadata in metadata stores.)

Data Handling

All results that are generated by the pipeline elements are collected in a Pipeline Result, indexed by the name of the pipeline element that created them. Downstream elements are responsible for selecting the results they need by using the correct names.
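
A minimal sketch of this idea, assuming the pipeline result behaves like a dictionary that maps element names to lists of results. The helper record, the element names "file_provider" and "extractor", and the result fields are hypothetical and only serve the illustration.

```python
def record(pipeline_result, element_name, result):
    """Store a result under the name of the element that created it."""
    pipeline_result.setdefault(element_name, []).append(result)


def extractor(pipeline_result):
    """Hypothetical processor that reads the provider's result by name."""
    # Selecting a name that no upstream element used raises KeyError,
    # so choosing the correct name is the downstream element's job.
    provided = pipeline_result["file_provider"][-1]
    record(
        pipeline_result,
        "extractor",
        {"path": provided["path"], "length": len(provided["path"])},
    )


pipeline_result = {}
record(pipeline_result, "file_provider", {"path": "a.txt"})
extractor(pipeline_result)
print(sorted(pipeline_result))
```

Using a plain dictionary rather than one with default values makes a wrongly chosen name fail immediately instead of silently yielding an empty result.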