Track data from a webpage

With a few commands, DataLad can be set up to track data posted on a website and to obtain changes made in the future.

The website http://www.fmri-data-analysis.org/code provides code and data files for the examples in a textbook.

We will set up a dataset that DataLad uses to track the content linked from this webpage.

Let’s create the dataset, and configure it to track any text file directly in Git. This will make it very convenient to see how source code changed over time.

~ % datalad create --text-no-annex demo
[INFO   ] Creating a new annex repo at /demo/demo
create(ok): /demo/demo (dataset)
~ % cd demo

DataLad’s crawler functionality is used to monitor the webpage. Its configuration is stored in the dataset itself.

The crawler comes with a number of configuration templates. Here we use one that extracts all URLs matching a particular pattern and obtains the linked data. On this webpage, all URLs of interest seem to have a ‘d=1’ suffix.
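
To see what such a pattern selects, here is a minimal sketch in plain Python. It assumes the pattern is applied as a regular expression to each link target, which is what the `a_href_match(query='.*d=1$')` entry in the crawl log suggests. The helper `select_links` is hypothetical and is not DataLad's own code. The first two URLs appear in the crawl log of this demo; the third is made up to show a non-match.

```python
import re

# The pattern used for a_href_match_ in this demo.
PATTERN = re.compile(r".*d=1$")

def select_links(hrefs):
    """Hypothetical helper: keep only the link targets that the pattern matches."""
    return [href for href in hrefs if PATTERN.match(href)]

hrefs = [
    "http://www.fmri-data-analysis.org/code/figure_2_12.R?attredirects=0&d=1",
    "http://www.fmri-data-analysis.org/code/figure_4_7.tgz?attredirects=0&d=1",
    "http://www.fmri-data-analysis.org/code/about.html",  # made-up URL without the suffix
]

for href in select_links(hrefs):
    print(href)
```

The trailing `$` matters: a link ending in `d=10` would not be selected.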

~/demo % datalad crawl-init --save --template=simple_with_archives url=http://www.fmri-data-analysis.org/code 'a_href_match_=.*d=1$'
[INFO   ] Creating a pipeline to crawl data files from http://www.fmri-data-analysis.org/code
[INFO   ] Initiating special remote datalad-archives
[INFO   ] Not adding annex.largefiles=exclude=README* and exclude=LICENSE* to git annex calls because already defined to be (not(mimetype=text/*))
~/demo % datalad diff --revision @~1
         added(file): .datalad/crawl/crawl.cfg
~/demo % cat .datalad/crawl/crawl.cfg
[crawl:pipeline]
template = simple_with_archives
_url = http://www.fmri-data-analysis.org/code
_a_href_match_ = .*d=1$
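
The stored configuration is a small INI-style file, so it is easy to inspect programmatically. The sketch below reads the same content with Python's standard `configparser`. It assumes that keys with a leading underscore are the arguments handed to the template; DataLad's own loader may treat the file differently.

```python
import configparser

# Content of .datalad/crawl/crawl.cfg as shown above.
CFG_TEXT = """\
[crawl:pipeline]
template = simple_with_archives
_url = http://www.fmri-data-analysis.org/code
_a_href_match_ = .*d=1$
"""

# interpolation=None keeps special characters in the values untouched.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CFG_TEXT)

section = parser["crawl:pipeline"]
template = section["template"]
# Assumption: underscore-prefixed keys are the template's arguments.
template_args = {key[1:]: value for key, value in section.items() if key.startswith("_")}

print(template)
print(template_args)
```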

With this configuration in place, we can ask DataLad to crawl the webpage.

~/demo % datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO   ] Creating a pipeline to crawl data files from http://www.fmri-data-analysis.org/code
[INFO   ] Not adding annex.largefiles=exclude=README* and exclude=LICENSE* to git annex calls because already defined to be (not(mimetype=text/*))
[INFO   ] Running pipeline [<function switch_branch at 0x7f9147061488>, [[<datalad.crawler.nodes.crawl_url.crawl_url object at 0x7f9135c6ad50>, a_href_match(query='.*d=1$'), <function fix_url at 0x7f914b7a9cf8>, <datalad.crawler.nodes.annex.Annexificator object at 0x7f9135c4b810>]], <function switch_branch at 0x7f9135c51de8>, [<function merge_branch at 0x7f9135c51050>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function _add_archive_content at 0x7f9135c51e60>]], <function switch_branch at 0x7f9135c51ed8>, <function merge_branch at 0x7f9135c51f50>, <function _finalize at 0x7f9135c74050>]
[INFO   ] Found branch non-dirty -- nothing was committed
[INFO   ] Checking out master into a new branch incoming
[INFO   ] Fetching 'http://www.fmri-data-analysis.org/code'
[INFO   ] Need to download 950 Bytes from http://www.fmri-data-analysis.org/code/figure_2_12.R?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 656 Bytes from http://www.fmri-data-analysis.org/code/figure_2_14.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 4.3 kB from http://www.fmri-data-analysis.org/code/figure_2_3.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 453.1 kB from http://www.fmri-data-analysis.org/code/figure_3_14.tgz?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 486 Bytes from http://www.fmri-data-analysis.org/code/figure_3_8.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 255 Bytes from http://www.fmri-data-analysis.org/code/figure_3_9.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 321.6 kB from http://www.fmri-data-analysis.org/code/figure_4_7.tgz?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 2.1 kB from http://www.fmri-data-analysis.org/code/figure_5_10.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 1.1 kB from http://www.fmri-data-analysis.org/code/figure_5_11.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 2.5 kB from http://www.fmri-data-analysis.org/code/figure_5_12.zip?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 1.5 kB from http://www.fmri-data-analysis.org/code/figure_5_3.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 110.4 kB from http://www.fmri-data-analysis.org/code/figure_8_11.tgz?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 1.7 kB from http://www.fmri-data-analysis.org/code/figure_8_2.m?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 3.2 kB from http://www.fmri-data-analysis.org/code/figure_9_1.R?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 9.8 kB from http://www.fmri-data-analysis.org/code/figure_9_2.R?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Need to download 9.8 kB from http://www.fmri-data-analysis.org/code/figure_9_3.R?attredirects=0&d=1. No progress indication will be reported
[INFO   ] Repository found dirty -- adding and committing
[INFO   ] Checking out a new detached branch incoming-processed
[INFO   ] Initiating 1 merge of incoming using strategy theirs
[INFO   ] Adding content of the archive ./figure_4_7.tgz into annex <AnnexRepo path=/demo/demo (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO   ] Finished adding ./figure_4_7.tgz: Files processed: 4, +git: 1, +annex: 3
[INFO   ] Adding content of the archive ./figure_8_11.tgz into annex <AnnexRepo path=/demo/demo (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO   ] Finished adding ./figure_8_11.tgz: Files processed: 7, renamed: 7, +git: 4, +annex: 3
[INFO   ] Adding content of the archive ./figure_5_12.zip into annex <AnnexRepo path=/demo/demo (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO   ] Finished adding ./figure_5_12.zip: Files processed: 3, skipped: 1, renamed: 2, +git: 2
[INFO   ] Adding content of the archive ./figure_3_14.tgz into annex <AnnexRepo path=/demo/demo (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO   ] Finished adding ./figure_3_14.tgz: Files processed: 6, renamed: 6, +annex: 6
[INFO   ] Repository found dirty -- adding and committing
[INFO   ] Checking out an existing branch master
[INFO   ] Initiating 1 merge of incoming-processed using strategy None
[INFO   ] Found branch non-dirty -- nothing was committed
[INFO   ] House keeping: gc, repack and clean
[INFO   ] Finished running pipeline: URLs processed: 16, downloaded: 16, size: 923.4 kB,  Files processed: 40, skipped: 1, renamed: 15, +git: 19, +annex: 16,  Branches merged: incoming->incoming-processed
[INFO   ] Total stats: URLs processed: 16, downloaded: 16, size: 923.4 kB,  Files processed: 40, skipped: 1, renamed: 15, +git: 19, +annex: 16,  Branches merged: incoming->incoming-processed,  Datasets crawled: 1
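
The pipeline printed in the log contains a `find_files(...)` node whose regular expression decides which downloaded files are treated as archives and extracted. The following sketch applies that expression to a few file names from this demo. It assumes the expression is searched for within each file name; the helper `is_archive` is hypothetical and only illustrates the selection.

```python
import re

# Regular expression copied from the find_files(...) node in the log above
# (the log shows doubled backslashes because it prints a Python repr).
ARCHIVE_RE = re.compile(r"\.(zip|tgz|tar(\..+)?)$")

def is_archive(filename):
    """Hypothetical helper: True if the name ends like an archive to extract."""
    return ARCHIVE_RE.search(filename) is not None

downloaded = ["figure_2_3.m", "figure_4_7.tgz", "figure_5_12.zip", "figure_9_1.R"]
archives = [name for name in downloaded if is_archive(name)]
print(archives)
```

This is consistent with the log: only the `.tgz` and `.zip` downloads are reported as "Adding content of the archive", while the `.m` and `.R` files are kept as they are.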

All files have been obtained and are ready to use. Here is what DataLad recorded for this update:

~/demo % git show @ -s
commit 3a8033d45cf7a96b523d927e02cf9d6a79f8e30e (HEAD -> master, incoming-processed)
Author: DataLad Demo <demo@datalad.org>
Date:   Fri Mar 16 08:41:22 2018 +0100

    [DATALAD] Added files from extracted archives

    Files processed: 24
     skipped: 1
     renamed: 15
     +git: 7
     +annex: 12
    Branches merged: incoming->incoming-processed

Every file from the webpage is now available locally.

~/demo % ls
all_rois.txt       figure_4_7.sh  figure_9_3.R
dat.txt            figure_5_10.m  flirt_thresh_zstat1.nii.gz
fair_abbrevs.txt   figure_5_11.m  fnirt_thresh_zstat1.nii.gz
fair_networks.txt  figure_5_12.m  mean_func.nii.gz
figure_2_12.R      figure_5_3.m   zstat1_0mm.nii.gz
figure_2_14.m      figure_8_11.R  zstat1_16mm.nii.gz
figure_2_3.m       figure_8_2.m   zstat1_32mm.nii.gz
figure_3_8.m       figure_9_1.R   zstat1_4mm.nii.gz
figure_3_9.m       figure_9_2.R   zstat1_8mm.nii.gz

The webpage can be queried for potential updates at any time by re-running the ‘crawl’ command.

~/demo % datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO   ] Creating a pipeline to crawl data files from http://www.fmri-data-analysis.org/code
[INFO   ] Not adding annex.largefiles=exclude=README* and exclude=LICENSE* to git annex calls because already defined to be (not(mimetype=text/*))
[INFO   ] Running pipeline [<function switch_branch at 0x7f47abaf3488>, [[<datalad.crawler.nodes.crawl_url.crawl_url object at 0x7f479a6fcd50>, a_href_match(query='.*d=1$'), <function fix_url at 0x7f47b023acf8>, <datalad.crawler.nodes.annex.Annexificator object at 0x7f479a6dd810>]], <function switch_branch at 0x7f479a6e3de8>, [<function merge_branch at 0x7f479a6e3050>, [find_files(dirs=False, fail_if_none=True, regex='\\.(zip|tgz|tar(\\..+)?)$', topdir='.'), <function _add_archive_content at 0x7f479a6e3e60>]], <function switch_branch at 0x7f479a6e3ed8>, <function merge_branch at 0x7f479a6e3f50>, <function _finalize at 0x7f479a706050>]
[INFO   ] Found branch non-dirty -- nothing was committed
[INFO   ] Checking out an existing branch incoming
[INFO   ] Fetching 'http://www.fmri-data-analysis.org/code'
[INFO   ] Found branch non-dirty -- nothing was committed
[INFO   ] Checking out an existing branch incoming-processed
[INFO   ] Found branch non-dirty -- nothing was committed
[INFO   ] Checking out an existing branch master
[INFO   ] Finished running pipeline: URLs processed: 16,  Files processed: 16, skipped: 16
[INFO   ] Total stats: URLs processed: 16,  Files processed: 16, skipped: 16,  Datasets crawled: 1
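
Comparing the two "Total stats" summaries shows the difference between a first crawl and a re-run with no changes: the second one has no "downloaded" counter and every file is skipped. The sketch below extracts the integer counters from such a summary. The helpers `parse_stats` and `has_new_downloads` are hypothetical, and the log wording is not a stable interface, so treat this as an illustration of how to read the summary rather than a supported way to detect updates.

```python
def parse_stats(summary):
    """Hypothetical helper: collect the 'name: integer' counters of a summary."""
    counters = {}
    for part in summary.split(","):
        name, sep, value = part.rpartition(":")
        value = value.strip()
        # Entries such as the download size or merged branches are not integers.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters

def has_new_downloads(counters):
    """Assumption: a missing or zero 'downloaded' counter means nothing new."""
    return counters.get("downloaded", 0) > 0

# Summaries copied from the two crawl runs above (log prefix removed).
first_run = parse_stats(
    "URLs processed: 16, downloaded: 16, size: 923.4 kB,  Files processed: 40, "
    "skipped: 1, renamed: 15, +git: 19, +annex: 16,  "
    "Branches merged: incoming->incoming-processed,  Datasets crawled: 1"
)
second_run = parse_stats(
    "URLs processed: 16,  Files processed: 16, skipped: 16,  Datasets crawled: 1"
)

print(has_new_downloads(first_run), has_new_downloads(second_run))
```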

Files can be added to or removed from this dataset without impairing the ability to get updates from the webpage. DataLad keeps the necessary information in dedicated Git branches.

~/demo % git branch
  git-annex
  incoming
  incoming-processed
* master