Basic provenance tracking

It is often helpful to keep track of the origin of data files. When generating data from other data, it is also useful to know what process led to these new data and what inputs were used.

DataLad can be used to keep such a record…

We start with a dataset

~ % datalad create demo
[INFO   ] Creating a new annex repo at /demo/demo
create(ok): /demo/demo (dataset)
~ % cd demo

Let’s say we take a mosaic image of flowers from Wikimedia. We want to extract some of the flowers into individual files – maybe to use them in an art project later.

We can use git-annex to obtain this image straight from the web

~/demo % git annex addurl https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg --file sources/flowers.jpg
addurl sources/flowers.jpg (downloading https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg ...)
/demo/demo/.git/ann 100%[===================>]   4.28M  5.19MB/s    in 0.8s
2018-03-15 15:47:37 URL:https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg [4487679/4487679] -> "/demo/demo/.git/annex/tmp/URL-s4487679--https&c%%upload.wikimedia.org%wi-f0864ab780277edffb909382d1d1bb88" [1]
ok
(recording state in git...)

We save it in the dataset

~/demo % datalad save -m 'Added flower mosaic from wikimedia'
save(ok): /demo/demo (dataset)

Now we can use DataLad’s ‘run’ command to process this image and extract one of the mosaic tiles into its own JPEG file. Let’s extract the St. Bernard’s Lily from the upper left corner.

~/demo % datalad run convert -extract 1522x1522+0+0 sources/flowers.jpg st-bernard.jpg
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): st-bernard.jpg (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

All we have to do is prefix ANY command with ‘datalad run’. DataLad will inspect the dataset after the command has finished and save all modifications.

In order to reliably detect modifications, a dataset must not contain unsaved modifications prior to running a command. For example, if we try to extract the Scarlet Pimpernel image with unsaved changes…

~/demo % touch dirt
~/demo % datalad run convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg
run(impossible): /demo/demo (dataset) [unsaved modifications present, cannot detect changes by command]

The dataset has to be clean before the command is run

~/demo % rm dirt
~/demo % datalad run convert -extract 1522x1522+1470+1470 sources/flowers.jpg pimpernel.jpg
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): pimpernel.jpg (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
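
The clean-state requirement can be checked up front before calling ‘datalad run’. The following is a minimal Python sketch, not DataLad's own check: it assumes that empty output from `git status --porcelain` is a reasonable stand-in for "no unsaved modifications", and the helper names `has_unsaved_changes` and `dataset_is_clean` are hypothetical.

```python
import subprocess


def has_unsaved_changes(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` output lists any path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def dataset_is_clean(path: str = ".") -> bool:
    """Run `git status --porcelain` in `path` and report whether it is empty.

    Assumption: an empty porcelain listing approximates DataLad's notion of
    a clean dataset; DataLad itself may apply additional checks.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=path,
        capture_output=True,
        text=True,
        check=True,
    )
    return not has_unsaved_changes(result.stdout)


# An untracked file such as 'dirt' shows up as '?? dirt' in porcelain output.
print(has_unsaved_changes("?? dirt\n"))
print(has_unsaved_changes(""))
```

Keeping the parsing separate from the `git` call makes the decision logic easy to exercise without a repository.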

Every processing step is saved in the dataset, including the exact command and the content that was changed.

~/demo % git show --stat
commit 73832b3af2a24d7cdaea964b934f1ede23a69c69 (HEAD -> master)
Author: DataLad Demo <demo@datalad.org>
Date:   Thu Mar 15 15:48:34 2018 +0100

    [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...

    === Do not change lines below ===
    {
     "pwd": ".",
     "cmd": [
      "convert",
      "-extract",
      "1522x1522+1470+1470",
      "sources/flowers.jpg",
      "pimpernel.jpg"
     ],
     "exit": 0,
     "chain": []
    }
    ^^^ Do not change lines above ^^^

 pimpernel.jpg | 1 +
 1 file changed, 1 insertion(+)
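
Because the run record is plain JSON between two marker lines, it can be read back programmatically. Below is a minimal Python sketch that assumes the commit message layout printed above; `extract_run_record` is a hypothetical helper and not part of DataLad's API, and the marker strings are simply copied from this output (other DataLad versions may format the record differently).

```python
import json
import textwrap

# Marker lines copied from the commit message shown above.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(message: str) -> dict:
    """Return the JSON run record embedded in a commit message.

    Hypothetical helper: raises ValueError if either marker line is missing.
    """
    lines = [line.strip() for line in message.splitlines()]
    begin = lines.index(START) + 1
    end = lines.index(END)
    return json.loads("\n".join(lines[begin:end]))


# Commit message as printed by `git show` above.
message = textwrap.dedent("""\
    [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...

    === Do not change lines below ===
    {
     "pwd": ".",
     "cmd": [
      "convert",
      "-extract",
      "1522x1522+1470+1470",
      "sources/flowers.jpg",
      "pimpernel.jpg"
     ],
     "exit": 0,
     "chain": []
    }
    ^^^ Do not change lines above ^^^
    """)

record = extract_run_record(message)
# Reassemble the full command line, which the commit subject truncates.
print(" ".join(record["cmd"]))
```

In a real dataset the message could come from `git log -1 --format=%B` instead of a string literal.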

On top of that, the origin of any dataset content obtained from elsewhere is on record too

~/demo % git annex whereis sources/flowers.jpg
whereis sources/flowers.jpg (2 copies)
     00000000-0000-0000-0000-000000000001 -- web
     3b96f81f-2e68-4848-a30c-4bd31c555cb3 -- mih@meiner:~/demo [here]

  web: https://upload.wikimedia.org/wikipedia/commons/a/a5/Flower_poster_2.jpg
ok

Based on this information, we can always reconstruct how any data file came to be – across the entire lifetime of a project

~/demo % git log --oneline @~3..@
73832b3 (HEAD -> master) [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
cce0c79 [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
8a21b21 Added flower mosaic from wikimedia
~/demo % datalad diff --revision @~3..@
         added(file): pimpernel.jpg
         added(file): sources/flowers.jpg
         added(file): st-bernard.jpg

We can also rerun any previous command with ‘datalad rerun’. Without any arguments, the command from the last commit will be executed.

~/demo % datalad rerun
unlock(ok): pimpernel.jpg (file)
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): pimpernel.jpg (file)
save(notneeded): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (notneeded: 1)
  unlock (ok: 1)
~/demo % git log --oneline --graph --name-only @~3..@
* 73832b3 (HEAD -> master) [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
| pimpernel.jpg
* cce0c79 [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
| st-bernard.jpg
* 8a21b21 Added flower mosaic from wikimedia
  sources/flowers.jpg

In this case, a new commit isn’t created because the output file didn’t change. But let’s say we add a step that displaces the Lily’s pixels by a random amount.

~/demo % datalad run convert -spread 10 st-bernard.jpg st-bernard-displaced.jpg
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): st-bernard-displaced.jpg (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Now, if we rerun the previous command, a new commit is created because the output’s content changed.

~/demo % datalad rerun
unlock(ok): st-bernard-displaced.jpg (file)
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): st-bernard-displaced.jpg (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
  unlock (ok: 1)
~/demo % git log --graph --oneline --name-only @~2..
* 3b8c46b (HEAD -> master) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
| st-bernard-displaced.jpg
* 40b2c50 [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
  st-bernard-displaced.jpg
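
The difference between the two reruns comes down to whether the output's content changed. That idea can be sketched in Python by comparing content digests before and after re-executing a step. This is only an illustration under stated assumptions: SHA-256 is an arbitrary choice here, `extract` and `spread` are toy stand-ins for the two `convert` calls, all function names are hypothetical, and none of this is DataLad's actual implementation.

```python
import hashlib
import random


def content_digest(data: bytes) -> str:
    """Hex digest of the content (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def rerun_changes_output(command, source: bytes, previous_output: bytes) -> bool:
    """Re-execute `command` and report whether its output differs from before."""
    return content_digest(command(source)) != content_digest(previous_output)


def extract(source: bytes) -> bytes:
    """Deterministic stand-in for `convert -extract`: always the same slice."""
    return source[:8]


def spread(source: bytes) -> bytes:
    """Non-deterministic stand-in for `convert -spread`: shuffles the bytes."""
    shuffled = list(source)
    random.shuffle(shuffled)
    return bytes(shuffled)


source = bytes(range(64))
# Deterministic step: the rerun reproduces the same content, nothing to commit.
print(rerun_changes_output(extract, source, extract(source)))
# Random step: the rerun almost certainly yields different content.
print(rerun_changes_output(spread, source, spread(source)))
```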

(We don’t actually want the repeated ‘spread’ command, so let’s reset to the parent commit.)

~/demo % git reset --hard @^
HEAD is now at 40b2c50 [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...

We can also rerun multiple commits (with ‘--since’) and choose where HEAD is when we start rerunning (with ‘--onto’). When both arguments are set to empty strings, it means ‘rerun all commands with HEAD at the parent of the first commit with a command’.

In other words, you can ‘replay’ the commands.

~/demo % datalad rerun --since= --onto= --branch=verify
unlock(notneeded): st-bernard.jpg (file) [not controlled by annex, nothing to unlock]
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): st-bernard.jpg (file)
save(ok): /demo/demo (dataset)
unlock(notneeded): pimpernel.jpg (file) [not controlled by annex, nothing to unlock]
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): pimpernel.jpg (file)
save(ok): /demo/demo (dataset)
unlock(notneeded): st-bernard-displaced.jpg (file) [not controlled by annex, nothing to unlock]
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): st-bernard-displaced.jpg (file)
save(ok): /demo/demo (dataset)
action summary:
  add (ok: 3)
  save (ok: 3)
  unlock (notneeded: 3)

Now we’re on a new branch, ‘verify’, that contains the replayed history.

~/demo % git log --oneline --graph master verify
* e58b078 (HEAD -> verify) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
* 35623fd [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
* 9f8f54d [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
| * 40b2c50 (master) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
| * 73832b3 [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
| * cce0c79 [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
|/
* 8a21b21 Added flower mosaic from wikimedia
* 14f64a7 [DATALAD] new dataset
* 3b50eb8 [DATALAD] Set default backend for all files to be MD5E

Let’s compare the two branches.

~/demo % datalad diff --revision master..verify
      modified(file): st-bernard-displaced.jpg

We can see that the step that involved a random component produced different results.

And these are just two branches, so you can compare them using normal Git operations. The next command, for example, marks which commits are ‘patch-equivalent’.

~/demo % git log --oneline --left-right --cherry-mark master...verify
> e58b078 (HEAD -> verify) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
= 35623fd [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
= 9f8f54d [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
< 40b2c50 (master) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
= 73832b3 [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
= cce0c79 [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...

Notice that all commits are marked as equivalent (=) except the ‘random spread’ ones.
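
The markers in this listing are easy to process further: with ‘--left-right --cherry-mark’, Git prefixes patch-equivalent commits with ‘=’, and commits found only on the left or right side with ‘<’ or ‘>’. Here is a minimal Python sketch that groups the commits by marker; `classify_commits` is a hypothetical helper, and the sample text is the output shown above.

```python
def classify_commits(log_output: str) -> dict:
    """Group abbreviated commit hashes by their cherry-mark prefix."""
    groups = {"=": [], "<": [], ">": []}
    for line in log_output.splitlines():
        parts = line.split(maxsplit=2)
        # Skip blank lines and anything that does not start with a marker.
        if len(parts) < 2 or parts[0] not in groups:
            continue
        groups[parts[0]].append(parts[1])
    return groups


# Output of `git log --oneline --left-right --cherry-mark master...verify`.
log_output = """\
> e58b078 (HEAD -> verify) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
= 35623fd [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
= 9f8f54d [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
< 40b2c50 (master) [DATALAD RUNCMD] convert -spread 10 st-bernard.jpg st-ber...
= 73832b3 [DATALAD RUNCMD] convert -extract 1522x1522+1470+1470 sou...
= cce0c79 [DATALAD RUNCMD] convert -extract 1522x1522+0+0 sources/f...
"""

groups = classify_commits(log_output)
print("only on master:", groups["<"])
print("only on verify:", groups[">"])
print("patch-equivalent:", groups["="])
```

Any commit that lands in the ‘<’ or ‘>’ group is a step whose rerun did not reproduce the original result.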