datalad_metalad.extractors.runprov

Metadata extractor for provenance information in DataLad’s run records

Concept

  • Find all commits that have a run record encoded in them.
  • The commit SHA provides the @id for the “activity”.
  • Pull out the author and date information for annotation purposes.
  • Pull out the run record, at the very least to report it as is; further analysis of the input/output specifications in the context of the repository state at that point is possible.
  • Pull out the diff, which gives the filenames and shasums of everything touched by the “activity”. This information can then be used to look up which file was created by which activity and to report that in the content metadata.
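
The first and fourth steps can be sketched as follows. This is a minimal illustration, not the extractor's actual code: it assumes that the run record is a JSON object placed in the commit message between the two "Do not change lines" marker lines, and the names RUNRECORD_RE, parse_run_record and select_run_commits are hypothetical helpers introduced only for this example.

```python
import json
import re

# Assumption: the run record sits between these two marker lines
# in the commit message body.
RUNRECORD_RE = re.compile(
    r"^=== Do not change lines below ===\n"
    r"(?P<record>.*)\n"
    r"\^\^\^ Do not change lines above \^\^\^$",
    re.MULTILINE | re.DOTALL,
)


def parse_run_record(message):
    """Return the run record found in a commit message, or None."""
    match = RUNRECORD_RE.search(message)
    if match is None:
        return None
    try:
        return json.loads(match.group("record"))
    except json.JSONDecodeError:
        # Markers present, but the content in between is not valid JSON.
        return None


def select_run_commits(commits):
    """Yield (sha, run_record) for every commit that carries a run record.

    `commits` is any iterable of (sha, message) pairs, for example
    collected from `git log` output.
    """
    for sha, message in commits:
        record = parse_run_record(message)
        if record is not None:
            yield sha, record
```

Feeding the helper (sha, message) pairs keeps the parsing step independent of how the commit history is read, so it can be tried on plain strings without a repository.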

Here is a sketch of the reported metadata structure:

{
  "@context": "http://openprovenance.org/prov.jsonld",
  "@graph": [
    # agents
    {
      "@id": "Name_Surname<email@example.com>",
      "@type": "agent"
    },
    ...
    # activities
    {
      "@id": "<GITSHA_of_run_record>",
      "@type": "activity",
      "atTime": "2019-05-01T12:10:55+02:00",
      "rdfs:comment": "[DATALAD RUNCMD] rm test.png",
      "prov:wasAssociatedWith": {
        "@id": "Name_Surname<email@example.com>"
      }
    },
    ...
    # entities
    {
      "@id": "SOMEKEY",
      "@type": "entity",
      "prov:wasGeneratedBy": {"@id": "<GITSHA_of_run_record>"}
    },
    ...
  ]
}
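
A document of this shape could be assembled roughly as shown below. This is a sketch under stated assumptions: the input dictionaries and their keys (sha, author_name, author_email, date, subject, generated) are hypothetical, as are the function names agent_id and build_prov_document; only the output layout follows the structure sketched above.

```python
PROV_CONTEXT = "http://openprovenance.org/prov.jsonld"


def agent_id(name, email):
    """Build an agent @id in the 'Name_Surname<email>' form used above."""
    return "{}<{}>".format(name.replace(" ", "_"), email)


def build_prov_document(activities):
    """Assemble agents, activities and entities into one JSON-LD document.

    `activities` is a list of dicts with the hypothetical keys
    sha, author_name, author_email, date, subject and generated
    (a list of entity keys produced by that activity).
    """
    agents = {}
    activity_nodes = []
    entity_nodes = []
    for act in activities:
        aid = agent_id(act["author_name"], act["author_email"])
        # Report each agent only once, however many activities it has.
        agents.setdefault(aid, {"@id": aid, "@type": "agent"})
        activity_nodes.append({
            "@id": act["sha"],
            "@type": "activity",
            "atTime": act["date"],
            "rdfs:comment": act["subject"],
            "prov:wasAssociatedWith": {"@id": aid},
        })
        for key in act.get("generated", ()):
            entity_nodes.append({
                "@id": key,
                "@type": "entity",
                "prov:wasGeneratedBy": {"@id": act["sha"]},
            })
    return {
        "@context": PROV_CONTEXT,
        "@graph": list(agents.values()) + activity_nodes + entity_nodes,
    }
```

Keying the agents by their @id avoids duplicate agent nodes when the same author made several run commits.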

class datalad_metalad.extractors.runprov.RunProvenanceExtractor

Bases: datalad_metalad.extractors.base.MetadataExtractor

datalad_metalad.extractors.runprov.yield_run_records(ds)