Writing custom extractors
MetaLad supports automated metadata extraction through a single common interface that executes metadata extractors.
An extractor, in the MetaLad sense, is a class derived from one of the base extractor classes defined in datalad_metalad.extractors.base (DatasetMetadataExtractor or FileMetadataExtractor). It needs to implement several required methods, most notably extract().
Example extractors can be found in the MetaLad source code.
Base classes
There are two primary types of extractors: dataset-level extractors and file-level extractors.
Dataset-level extractors, by inheriting from the DatasetMetadataExtractor class, can access the dataset on which they operate as self.dataset. Extractor methods may use this object to call any DataLad dataset method. They can perform whatever operations they deem necessary to extract metadata from the dataset; for example, they could count the files in the dataset, or look for a file named CITATION.cff in the root directory of the dataset and return its content.
File-level extractors, by inheriting from FileMetadataExtractor, contain a Dataset object in the property self.dataset and a FileInfo object in the property self.file_info. FileInfo is a dataclass with the fields type, git_sha_sum, byte_size, state, path, and intra_dataset_path. File-level extractors should return metadata that describes the file referenced by FileInfo.
Required methods
Any extractor is expected to implement a number of abstract methods that define the interface through which MetaLad communicates with extractors.
get_id()
This function should return a UUID object containing a UUID that is not used by any other metalad extractor or filter. Since it is meant to uniquely identify the extractor and the type of data it returns, its return value should be hardcoded, or generated in a deterministic way.
A UUID can be generated in several ways, e.g. with Python (import uuid; print(uuid.uuid4())) or with online generators (e.g. https://www.uuidgenerator.net/, or by searching 'UUID' in DuckDuckGo). While version 4 (random) UUIDs are probably sufficient, version 5 UUIDs (generated from a given namespace and name) can also be used. For example, extractors in packages developed under the datalad organisation could use <extractor_name>.datalad.org. Example generation with Python: uuid.uuid5(uuid.NAMESPACE_DNS, "bids_dataset.extractors.datalad.org").
Example:

    def get_id(self) -> UUID:
        return UUID("0c26f218-537e-5db8-86e5-bc12b95a679e")
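Since version-5 UUIDs are derived deterministically from a namespace and a name, the extractor ID can be regenerated at any time from the same inputs. A minimal demonstration, using the example name from above:

```python
import uuid

# Version-5 UUIDs are a hash of namespace + name, so the same inputs
# always produce the same UUID - suitable as a stable extractor ID.
name = "bids_dataset.extractors.datalad.org"
extractor_id = uuid.uuid5(uuid.NAMESPACE_DNS, name)

# Regenerating with identical inputs yields the identical UUID.
assert extractor_id == uuid.uuid5(uuid.NAMESPACE_DNS, name)
print(extractor_id)
```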
get_version()
This function should return a version string representing the extractor’s version.
The extractor version is meant to be included with the extracted metadata. It might be a version number or a reference to a particular schema. While there are no requirements on the versioning scheme, it should be chosen carefully, and rules for version updates should be considered. For example, when using a semantic-versioning style, one might guarantee that attribute types and names never change within a major release, while new attributes may be added in a minor release.
Example:

    def get_version(self) -> str:
        return "0.0.1"
get_data_output_category()
This function should return a DataOutputCategory object, which tells MetaLad what kind of (meta)data it is dealing with. The following output categories are available in the enumeration datalad_metalad.extractors.base.DataOutputCategory:
IMMEDIATE
FILE
DIRECTORY
Output categories declare how the extractor delivers its results. If the output category is IMMEDIATE, the result is returned by the extract() call. In the case of FILE, the extractor deposits the result in a file; this is especially useful for extractors that are external programs. The category DIRECTORY is not yet supported. If you write an extractor in Python, you will usually use the output category IMMEDIATE and return extractor results from the extract() call.
Example:

    def get_data_output_category(self) -> DataOutputCategory:
        return DataOutputCategory.IMMEDIATE
get_required_content()
This function is used in dataset-level extractors only. It will be called by MetaLad prior to metadata extraction. Its purpose is to allow the extractor to ensure that content required for metadata extraction is present (relevant, for example, if some of the files to be inspected may be annexed).
The function should either return a boolean value (True | False) or return a Generator yielding DataLad result records. In the case of a boolean value, the function should return True if it has obtained the required content or confirmed its presence. If it returns False, metadata extraction will not proceed. Alternatively, yielding result records allows extractors to signal more expressive messages or errors. If a result record is yielded with a failure status (i.e. with status equal to impossible or error), metadata extraction will not proceed.
This function is a natural place to call dataset.get(). It is advisable to disable result rendering (result_renderer="disabled"), because during metadata extraction, users will typically want to redirect standard output to a file or another command.
Example 1:

    def get_required_content(self) -> bool:
        result = self.dataset.get("CITATION.cff", result_renderer="disabled")
        return result[0]["status"] in ("ok", "notneeded")

Example 2:

    from typing import Generator

    def get_required_content(self) -> Generator:
        # get() returns a list of result records; yield them individually
        yield from self.dataset.get("CITATION.cff", result_renderer="disabled")
Example 3:

    from typing import Generator

    def get_required_content(self) -> Generator:
        result = self.dataset.get('CITATION.cff', result_renderer='disabled')
        failure_count = 0
        result_dict = dict(
            path=self.dataset.path,
            type='dataset',
        )
        for r in result:
            if r['status'] in ['error', 'impossible']:
                failure_count += 1
        if failure_count > 0:
            result_dict.update({
                'status': 'error',
                'message': 'could not retrieve required content',
            })
        else:
            result_dict.update({
                'status': 'ok',
                'message': 'required content retrieved',
            })
        yield result_dict
is_content_required()
This function is used in file-level extractors only. It is the file-level counterpart to get_required_content(). Its purpose is to tell MetaLad whether the file content is required or not (relevant for annexed files: extraction may depend on the file content, or require only the annex key). If the function returns True, MetaLad will get the file content. If it returns False, the get operation will not be performed.
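As an illustration, an extractor whose metadata comes from the file content itself must request the content, while one that only records, say, the byte size can decline it. A minimal sketch, using a simplified stand-in for FileInfo and a made-up extractor class (the ".cff" rule is purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class FileInfo:  # simplified stand-in for MetaLad's FileInfo dataclass
    intra_dataset_path: str

class CffFileExtractor:
    """Hypothetical file-level extractor for citation files."""

    def __init__(self, file_info: FileInfo):
        self.file_info = file_info

    def is_content_required(self) -> bool:
        # The metadata is the parsed file content, so the (possibly
        # annexed) content must be fetched before extract() runs.
        return self.file_info.intra_dataset_path.endswith(".cff")

print(CffFileExtractor(FileInfo("CITATION.cff")).is_content_required())  # True
```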
extract()
This function performs the actual metadata extraction. It has one parameter called output_location. If the output category of the extractor is DataOutputCategory.IMMEDIATE, this parameter will be None. If the output category is DataOutputCategory.FILE, this parameter will contain either a file name or a file object into which the extractor can write its output.
The function should return a datalad_metalad.extractors.base.ExtractorResult object. ExtractorResult is a dataclass containing the following fields:
extractor_version: a version string representing the extractor's version.
extraction_parameter: a dictionary containing parameters passed to the extractor by the calling command; can be obtained with self.parameter or {}.
extraction_success: either True or False.
datalad_result_dict: a dictionary with entries added to the DataLad result record produced by the calling MetaLad command. Result records are used by DataLad to inform generic error handling and decisions on how to proceed with subsequent operations. MetaLad commands always set the mandatory result record fields action and path; the minimally useful set of fields that should be set by the extractor is "status" (one of: "ok", "notneeded", "impossible", "error") and "type" ("dataset" or "file").
immediate_data (a dictionary, optional): if the output category of the extractor is IMMEDIATE, then the immediate_data field should contain the result of the extraction process as a dictionary with freely chosen keys. The contents of this dictionary should be JSON-serializable, because datalad meta-extract will print the JSON-serialized extractor result to standard output.
Example (import yaml and Path at module level):

    import yaml
    from pathlib import Path

    def extract(self, _=None) -> ExtractorResult:
        # Return the citation file content as metadata, altering only dates.
        # Load the file; it is guaranteed to be present at this point.
        with open(Path(self.dataset.path) / "CITATION.cff") as f:
            yaml_content = yaml.safe_load(f)
        # ISO-format dates (non-exhaustive - publications have them too)
        if "date-released" in yaml_content:
            yaml_content["date-released"] = yaml_content["date-released"].isoformat()
        return ExtractorResult(
            extractor_version=self.get_version(),
            extraction_parameter=self.parameter or {},
            extraction_success=True,
            datalad_result_dict={
                "type": "dataset",
                "status": "ok",
            },
            immediate_data=yaml_content,
        )
Passing runtime parameters to extractors
When an extractor is executed via meta-extract, you can pass runtime parameters to it. The runtime parameters are given as key-value pairs after the EXTRACTOR_NAME parameter in dataset-level extraction commands, or after the FILE parameter in file-level extraction commands. Each key-value pair consists of two arguments: first the key, followed by the value.
The parameters are provided to dataset-level or file-level extractors in the extractor property self.parameter. The property contains a dictionary that holds the given key-value pairs.
For example, the following call:

    datalad meta-extract -d . metalad_example_file README.md key1 value1 key2 value2

will place the following dictionary in the parameter property of the extractor instance:

    {'key1': 'value1', 'key2': 'value2'}
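The pairing scheme can be sketched as follows (pairs_to_parameter_dict is a hypothetical helper, not part of MetaLad; it only mirrors how the trailing command-line arguments end up in self.parameter):

```python
def pairs_to_parameter_dict(args: list) -> dict:
    # Consecutive arguments are consumed pairwise: key, value, key, value, ...
    if len(args) % 2 != 0:
        raise ValueError("runtime parameters must come in key-value pairs")
    return dict(zip(args[0::2], args[1::2]))

print(pairs_to_parameter_dict(["key1", "value1", "key2", "value2"]))
# {'key1': 'value1', 'key2': 'value2'}
```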
Please note: if dataset-level extraction should be performed and you want to provide extractor parameters, you have to supply the --force-dataset-level parameter to ensure dataset-level extraction, i.e. to prevent meta-extract from interpreting the key of the first extractor argument as a file name for a file-level extraction.
Please note also that only extractors derived from the classes FileMetadataExtractor or DatasetMetadataExtractor have a parameter property and are able to read the parameters provided on the command line.
Use external programs for metadata extraction
Consider the situation where you have an external program that is able to extract metadata from a dataset or a file. There might be many reasons why you cannot create an equivalent extractor in Python: for example, the algorithm is unknown and you only have a binary version, a Python version might be too slow, or you cannot afford the effort.
MetaLad provides specific extractors that invoke external programs to perform the extraction, i.e. metalad_external_file and metalad_external_dataset. These extractors interact with external programs via standard input and standard output in order to query them, for example, for their UUID and their output category. The external programs are expected to support being invoked with each of the following parameters:
--get-uuid
--get-version
--get-data-output-category
In addition, external file-level extractor programs must support:
--is-content-required
--extract <dataset-path> <dataset-ref-commit> <file-path> <dataset-relative-file-path>
and the external dataset-level extractor programs must support:
--get-required
--extract <dataset-path> <dataset-ref-commit>
Usually the external extractor has to be wrapped in a thin layer that provides the interface outlined above.
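Such a wrapper for a dataset-level extractor could dispatch on the flags listed above roughly like this. This is only a sketch with a placeholder UUID, made-up version, and dummy metadata; the exact response format expected by metalad_external_dataset is not specified here:

```python
#!/usr/bin/env python3
"""Sketch of a thin wrapper around an external dataset-level extractor."""
import json
import sys

def handle(argv: list) -> str:
    # Each invocation carries exactly one of the supported query flags.
    if argv[:1] == ["--get-uuid"]:
        # placeholder: replace with your extractor's own UUID
        return "00000000-0000-0000-0000-000000000000"
    if argv[:1] == ["--get-version"]:
        return "0.0.1"
    if argv[:1] == ["--get-data-output-category"]:
        return "IMMEDIATE"
    if argv[:1] == ["--get-required"]:
        return "True"
    if argv[:1] == ["--extract"]:
        dataset_path, ref_commit = argv[1], argv[2]
        # Here the real external tool would be invoked (e.g. via
        # subprocess) and its output converted to JSON metadata.
        return json.dumps({"dataset_path": dataset_path,
                           "dataset_ref_commit": ref_commit})
    raise SystemExit(f"unsupported argument: {argv}")

if __name__ == "__main__":
    print(handle(sys.argv[1:]))
```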
Making extractors discoverable
To be discovered by meta-extract, an extractor should be part of a DataLad extension. In addition, to make it discoverable, you need to declare an entry point in the extension's setup.cfg file. You can define the entry point name and specify which extractor class it should point to. It is recommended to give the extractor name a prefix to reduce the risk of name collisions.
Example:

    [options.entry_points]
    # (...)
    datalad.metadata.extractors =
        hello_cff = datalad_helloworld.extractors.basic_dataset:CffExtractor
Tips
Using git methods to discover contents efficiently
Dataset-level extractors may need to inspect specific files or list the files in a dataset. If files need to be listed, it can be more efficient to call git-ls-files or git-ls-tree instead of using pathlib methods: this limits the listing to files tracked by the dataset and helps avoid costly traversal of the .git directory. For example, a list of files with a given extension (including those in subfolders) can be created with:

    files = list(self.dataset.repo.call_git_items_(["ls-files", "*.xyz"]))
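Outside of an extractor class, the same listing can be reproduced with a plain git call, which is handy for prototyping the pattern before wiring it into call_git_items_ (a sketch; assumes git is available on the PATH):

```python
import subprocess

def list_tracked_files(repo_path: str, pattern: str) -> list:
    # `git ls-files <pattern>` lists only index-tracked files matching
    # the pattern, without walking untracked content or .git internals.
    completed = subprocess.run(
        ["git", "-C", repo_path, "ls-files", pattern],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.splitlines()
```

Note that a pattern such as "*.xyz" also matches files in subdirectories, because git applies the wildcard to full repository-relative paths.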