datalad_metalad.extractors.custom

MetadataRecord extractor for custom (JSON-LD) metadata contained in a dataset

One or more source files with metadata can be specified via the ‘datalad.metadata.custom-dataset-source’ configuration variable. Each file must contain a single JSON object. The metadata dictionary is built by updating it with these objects in the order in which the files are given, so a key in a later file overrides the same key from an earlier file.

By default a single file is read: ‘.metadata/dataset.json’
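The merge order described above can be illustrated with a minimal sketch. It assumes a plain, shallow dictionary update; the helper name `merge_metadata_sources` and the file names are hypothetical and not part of the extractor's API, and the actual implementation may differ in detail.

```python
import json
import tempfile
from pathlib import Path


def merge_metadata_sources(paths):
    """Merge JSON objects from `paths` in order (hypothetical helper).

    Assumption: a shallow dict update, so a key from a later file
    replaces the same key from an earlier file.
    """
    merged = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            content = json.load(f)
        if not isinstance(content, dict):
            raise ValueError(f"{path} does not contain a JSON object")
        merged.update(content)
    return merged


# Demonstration with two throwaway source files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "dataset.json"
    second = Path(tmp) / "override.json"
    first.write_text(json.dumps({"name": "demo", "license": "CC0"}), encoding="utf-8")
    second.write_text(json.dumps({"license": "CC-BY-4.0"}), encoding="utf-8")
    merged = merge_metadata_sources([first, second])

print(merged)
```

Here the `license` value from the second file replaces the one from the first, while `name` is carried over unchanged.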

class datalad_metalad.extractors.custom.CustomMetadataExtractor

Bases: datalad_metalad.extractors.base.MetadataExtractor

get_required_content(dataset, process_type, status)

Report records for dataset content that must be available locally

Any implementation can yield those records from the given status that correspond to dataset content that must be available locally for the extractor to perform its work. It is acceptable to omit individual records, or to yield no records at all. In such a case, the extractor is expected to handle non-available content in some sensible way internally.

The parameters are identical to those of MetadataExtractor.__call__().

Any content corresponding to a yielded record will be obtained automatically before metadata extraction is initiated. Hence any extractor reporting accurately can expect all relevant content to be present locally.

Instead of a status record, it is also possible to yield a custom dictionary. Such a dictionary must contain a ‘path’ key holding the absolute path to the required file within the given dataset.

Example implementation:

for s in status:
    if s['path'].endswith('.pdf'):
        yield s
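As a further illustration, the sketch below combines both kinds of records: status records passed through unchanged, and a custom dictionary carrying only a ‘path’ key. It is a standalone generator, not a method of the real class. The function name `required_content`, the plain string standing in for the dataset location, and the example paths are all assumptions made for this example.

```python
import os.path


def required_content(dataset_path, status):
    """Yield records for content that must be present locally (sketch).

    Assumptions: `dataset_path` is an absolute path string standing in
    for the dataset, and each status record is a dict with a 'path' key.
    """
    # Pass through status records for every PDF file.
    for s in status:
        if s["path"].endswith(".pdf"):
            yield s
    # Additionally request one file via a custom dictionary.
    yield {"path": os.path.join(dataset_path, ".metadata", "dataset.json")}


status = [
    {"path": "/data/ds/paper.pdf"},
    {"path": "/data/ds/notes.txt"},
]
required = list(required_content("/data/ds", status))
for record in required:
    print(record["path"])
```

Only the PDF and the explicitly requested metadata file are reported; the text file is left out, so the extractor would have to cope with it being absent.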

get_state(dataset)

Report on extractor-related state and configuration

Extractors can reimplement this method to report arbitrary information in a dictionary. This information will be included in the metadata aggregate catalog in each dataset. Consequently, it should be brief and limited to the essential facts that fully determine the extractor's behavior. Only plain key-value items with simple values, such as string, int, float, or lists thereof, are supported.

Any change in the reported state in comparison to a recorded state for an existing metadata aggregate will cause a re-extraction of metadata. The nature of the state change does not matter, as the entire dictionary will be compared. Primarily, this is useful for reporting per-extractor version information (such as a version for the extractor output format, or critical version information on external software components employed by the extractor), and potential configuration settings that determine the behavior of an extractor.

State information can be dataset-specific. The respective Dataset object instance is passed via the method’s dataset argument.
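A minimal sketch of what such a state report and its comparison might look like follows. It is not the actual implementation: the functions `example_state` and `needs_reextraction`, the `version` key, and the plain dictionary standing in for the dataset's configuration are all assumptions made for illustration.

```python
DEFAULT_SOURCE = ".metadata/dataset.json"


def example_state(config):
    """Build a compact state dictionary (hypothetical stand-in for get_state).

    Assumption: `config` is a plain dict standing in for the dataset's
    configuration, with the source files given as a list of strings.
    """
    sources = config.get("datalad.metadata.custom-dataset-source", [DEFAULT_SOURCE])
    return {
        "version": 1,              # assumed output-format version
        "sources": list(sources),  # only simple values and lists thereof
    }


def needs_reextraction(recorded_state, current_state):
    """The whole dictionary is compared; any difference counts."""
    return recorded_state != current_state


recorded = example_state({})
unchanged = example_state({})
changed = example_state(
    {"datalad.metadata.custom-dataset-source": ["meta/a.json", "meta/b.json"]}
)

print(needs_reextraction(recorded, unchanged))
print(needs_reextraction(recorded, changed))
```

Keeping the dictionary this small means that only a genuine change, such as a different list of source files or a bumped version, makes the two states differ.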