Collection-of-files dataset (v1, tby-ds1)
This convention defines the essential building blocks for describing a collection of files as a dataset. With few exceptions, the convention is built on the https://schema.org vocabulary.
Here is an example of a fairly minimal, yet sensible, description of a dataset. The dataset has a few key properties (e.g., a license), an author, and comprises two files. This information is expressed in three TSV files:
dataset@tby-ds1.tsv
name	demo
title	My demo dataset
description	This is a fictitious dataset.
license	CC-PDDC
homepage	http://docs.datalad.org/projects/tabby/en/latest
last-updated	2023-07-27
authors@tby-ds1.tsv
name	email
Jane Doe	jd@example.com
files@tby-ds1.tsv
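path[POSIX]	size[bytes]	checksum[md5]	url
LICENSE	1300	529ff606a38b37a2e5478c1abfeca231	https://raw.githubusercontent.com/psychoinformatics-de/datalad-tabby/2738d8a12fb138d3fe107c6bee443c13c9f4f6ea/LICENSE
docs/README.md	1755	ef2979a70a8d95a24cd1402bd68e1c4a	https://raw.githubusercontent.com/psychoinformatics-de/datalad-tabby/2738d8a12fb138d3fe107c6bee443c13c9f4f6ea/docs/README.md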
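To make the two table layouts concrete, here is a minimal sketch of how such sheets could be read with the Python standard library. It assumes that the dataset sheet holds one key per row, with the values in the following cells, and that the authors and files sheets have a header row followed by one row per item. The function names `read_single_sheet` and `read_many_sheet` are hypothetical and are not part of any tabby tooling.

```python
import csv
import io

# Inline excerpts of two sheets from the example (tab-separated).
DATASET_TSV = (
    "name\tdemo\n"
    "title\tMy demo dataset\n"
    "license\tCC-PDDC\n"
    "last-updated\t2023-07-27\n"
)
FILES_TSV = (
    "path[POSIX]\tsize[bytes]\tchecksum[md5]\n"
    "LICENSE\t1300\t529ff606a38b37a2e5478c1abfeca231\n"
    "docs/README.md\t1755\tef2979a70a8d95a24cd1402bd68e1c4a\n"
)


def read_single_sheet(text):
    """Read a key/value sheet: the first cell is the key, the rest are values."""
    record = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0]:
            continue  # skip empty lines
        values = [cell for cell in row[1:] if cell]
        if values:
            # a single value is stored as a scalar, several as a list
            record[row[0]] = values[0] if len(values) == 1 else values
    return record


def read_many_sheet(text):
    """Read a header-plus-rows sheet into one dict per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


dataset = read_single_sheet(DATASET_TSV)
files = read_many_sheet(FILES_TSV)
print(dataset["license"])       # CC-PDDC
print(files[1]["path[POSIX]"])  # docs/README.md
```

All cell values stay strings at this stage; typing (e.g., the file size as an integer) is left to the JSON-LD context.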
Using the following, minimal JSON-LD context for compaction...
{
  "afo": "http://purl.allotrope.org/ontologies/result#",
  "dcterms": "https://purl.org/dc/terms/",
  "nfo": "https://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#",
  "obo": "https://purl.obolibrary.org/obo/",
  "schema": "https://schema.org/",
  "xsd": "http://www.w3.org/2001/XMLSchema#"
}
... the information in the TSV tables is transformed into a single, fully annotated JSON-LD document describing the dataset.
{
  "@context": {
    "afo": "http://purl.allotrope.org/ontologies/result#",
    "dcterms": "https://purl.org/dc/terms/",
    "nfo": "https://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#",
    "obo": "https://purl.obolibrary.org/obo/",
    "schema": "https://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "schema:Dataset",
  "dcterms:hasPart": [
    {
      "@type": "schema:DigitalDocument",
      "obo:NCIT_C171276": "529ff606a38b37a2e5478c1abfeca231",
      "schema:contentUrl": "https://raw.githubusercontent.com/psychoinformatics-de/datalad-tabby/2738d8a12fb138d3fe107c6bee443c13c9f4f6ea/LICENSE",
      "schema:name": {
        "@type": "afo:AFR_0001928",
        "@value": "LICENSE"
      },
      "nfo:fileSize": {
        "@type": "xsd:integer",
        "@value": "1300"
      }
    },
    {
      "@type": "schema:DigitalDocument",
      "obo:NCIT_C171276": "ef2979a70a8d95a24cd1402bd68e1c4a",
      "schema:contentUrl": "https://raw.githubusercontent.com/psychoinformatics-de/datalad-tabby/2738d8a12fb138d3fe107c6bee443c13c9f4f6ea/docs/README.md",
      "schema:name": {
        "@type": "afo:AFR_0001928",
        "@value": "docs/README.md"
      },
      "nfo:fileSize": {
        "@type": "xsd:integer",
        "@value": "1755"
      }
    }
  ],
  "schema:author": {
    "@type": "schema:Person",
    "schema:email": "jd@example.com",
    "schema:name": "Jane Doe"
  },
  "schema:dateModified": "2023-07-27",
  "schema:description": "This is a fictitious dataset.",
  "schema:license": {
    "@id": "https://spdx.org/licenses/CC-PDDC"
  },
  "schema:mainEntityOfPage": "http://docs.datalad.org/projects/tabby/en/latest",
  "schema:name": "demo",
  "schema:title": "My demo dataset"
}
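The compacted document uses prefixed names such as `schema:name`, which stand for full IRIs built from the context. The following sketch illustrates that prefix mechanism only. It is not a full JSON-LD expansion algorithm (a JSON-LD processor such as PyLD would be used for that), and the helper names `expand_term` and `expand_keys` are hypothetical.

```python
CONTEXT = {
    "afo": "http://purl.allotrope.org/ontologies/result#",
    "dcterms": "https://purl.org/dc/terms/",
    "nfo": "https://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#",
    "obo": "https://purl.obolibrary.org/obo/",
    "schema": "https://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand_term(term, context):
    """Expand 'prefix:suffix' to a full IRI if the prefix is in the context."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return term  # keywords such as '@type' and absolute IRIs pass through


def expand_keys(node, context):
    """Recursively expand property names and @type values of a document."""
    if isinstance(node, list):
        return [expand_keys(item, context) for item in node]
    if not isinstance(node, dict):
        return node  # plain values are kept as they are
    expanded = {}
    for key, value in node.items():
        if key == "@context":
            continue  # not needed once all terms are spelled out
        if key == "@type" and isinstance(value, str):
            expanded[key] = expand_term(value, context)
        else:
            expanded[expand_term(key, context)] = expand_keys(value, context)
    return expanded


# A shortened excerpt of the document above.
doc = {
    "@type": "schema:Dataset",
    "schema:name": "demo",
    "dcterms:hasPart": [
        {
            "@type": "schema:DigitalDocument",
            "nfo:fileSize": {"@type": "xsd:integer", "@value": "1300"},
        }
    ],
}
full = expand_keys(doc, CONTEXT)
print(full["@type"])  # https://schema.org/Dataset
```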
Sheet types
Sheet dataset
Context
Licenses are declared using the identifiers given at https://spdx.org/licenses as a standard vocabulary.
{
  "dcterms": "https://purl.org/dc/terms/",
  "schema": "https://schema.org/",
  "author": "schema:author",
  "description": "schema:description",
  "hasPart": "dcterms:hasPart",
  "homepage": "schema:mainEntityOfPage",
  "identifier": "schema:identifier",
  "keywords": "schema:keywords",
  "last-updated": "schema:dateModified",
  "license": {
    "@id": "schema:license",
    "@type": "@vocab",
    "@context": {
      "@vocab": "https://spdx.org/licenses/"
    }
  },
  "name": "schema:name",
  "title": "schema:title",
  "version": "schema:version"
}
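As an illustration of what these term definitions do, the sketch below renames the keys of a dataset record and resolves the license identifier against the SPDX vocabulary. It hand-codes only the two cases that occur in this context (plain term aliases and the `@vocab`-typed `license` term) and is an assumption-laden simplification, not a JSON-LD processor. The function name `apply_terms` is hypothetical.

```python
# Term definitions copied from the context above (prefix entries omitted).
TERMS = {
    "description": "schema:description",
    "homepage": "schema:mainEntityOfPage",
    "last-updated": "schema:dateModified",
    "license": {
        "@id": "schema:license",
        "@type": "@vocab",
        "@context": {"@vocab": "https://spdx.org/licenses/"},
    },
    "name": "schema:name",
    "title": "schema:title",
}


def apply_terms(record, terms):
    """Rename sheet keys to their properties; resolve @vocab-typed values."""
    out = {}
    for key, value in record.items():
        term = terms.get(key)
        if term is None:
            continue  # keys without a term definition are dropped here
        if isinstance(term, str):
            out[term] = value  # plain alias, value is kept unchanged
        elif term.get("@type") == "@vocab":
            # the value is an identifier relative to the term's vocabulary
            vocab = term["@context"]["@vocab"]
            out[term["@id"]] = {"@id": vocab + value}
    return out


record = {"name": "demo", "license": "CC-PDDC", "last-updated": "2023-07-27"}
annotated = apply_terms(record, TERMS)
print(annotated["schema:license"])  # {'@id': 'https://spdx.org/licenses/CC-PDDC'}
```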
Default (JSON) data
Information on authors and files is included if the corresponding sheets exist.
{
  "author": "@tabby-optional-many-authors@tby-ds1",
  "hasPart": "@tabby-optional-many-files@tby-ds1"
}
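The default values above are import statements rather than literal data. The following sketch shows one way such statements could be interpreted. The grammar (`@tabby-`, an optional `optional-` marker, `single` or `many`, then the sheet name) is inferred from the two values shown here and is an assumption, as are the names `parse_import` and `resolve_defaults`. An actual implementation may behave differently.

```python
import re

# Assumed grammar: @tabby-[optional-](single|many)-<sheet name>
IMPORT_PATTERN = re.compile(
    r"^@tabby-(?P<optional>optional-)?(?P<mode>single|many)-(?P<sheet>.+)$"
)

DEFAULTS = {
    "author": "@tabby-optional-many-authors@tby-ds1",
    "hasPart": "@tabby-optional-many-files@tby-ds1",
}


def parse_import(value):
    """Split an import statement into (optional, mode, sheet name)."""
    match = IMPORT_PATTERN.match(value)
    if match is None:
        return None  # an ordinary value, not an import statement
    return (match["optional"] is not None, match["mode"], match["sheet"])


def resolve_defaults(defaults, available_sheets):
    """Replace import statements with the content of the sheets they name."""
    resolved = {}
    for key, value in defaults.items():
        parsed = parse_import(value)
        if parsed is None:
            resolved[key] = value
            continue
        optional, _mode, sheet = parsed
        if sheet in available_sheets:
            resolved[key] = available_sheets[sheet]
        elif not optional:
            raise FileNotFoundError(f"required sheet {sheet!r} is missing")
        # optional and missing: the property is simply left out
    return resolved


# Only the files sheet is available in this example.
sheets = {"files@tby-ds1": [{"path[POSIX]": "LICENSE"}]}
print(resolve_defaults(DEFAULTS, sheets))
```

With only the files sheet present, the result contains `hasPart` but no `author` entry, which mirrors the "included if the sheets exist" behavior described above.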
Sheet files
Context
A file path is annotated as the name of the described entity, and its value type records the path convention used (e.g., POSIX).
{
  "afo": "http://purl.allotrope.org/ontologies/result#",
  "nfo": "https://www.semanticdesktop.org/ontologies/2007/03/22/nfo/#",
  "obo": "https://purl.obolibrary.org/obo/",
  "schema": "https://schema.org/",
  "xsd": "http://www.w3.org/2001/XMLSchema#",
  "size[bytes]": {
    "@id": "nfo:fileSize",
    "@type": "xsd:integer"
  },
  "checksum[md5]": "obo:NCIT_C171276",
  "path[POSIX]": {
    "@id": "schema:name",
    "@type": "afo:AFR_0001928"
  },
  "url": "schema:contentUrl"
}
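To show the effect of the typed term definitions, the sketch below converts one row of the files sheet into the kind of node that appears under `dcterms:hasPart` in the example document. Terms with an `@type` produce a typed value object, while plain aliases keep the cell value as a string. This is a simplified, hand-written stand-in for what a JSON-LD processor does with this context, and the function name `annotate_row` is hypothetical.

```python
# Term definitions copied from the context above (prefix entries omitted).
FILES_TERMS = {
    "size[bytes]": {"@id": "nfo:fileSize", "@type": "xsd:integer"},
    "checksum[md5]": "obo:NCIT_C171276",
    "path[POSIX]": {"@id": "schema:name", "@type": "afo:AFR_0001928"},
    "url": "schema:contentUrl",
}


def annotate_row(row, terms):
    """Map the columns of one files-sheet row to annotated properties."""
    node = {}
    for column, value in row.items():
        term = terms.get(column)
        if term is None or value in (None, ""):
            continue  # skip undefined columns and empty cells
        if isinstance(term, str):
            node[term] = value  # plain alias, value stays a string
        else:
            # typed term: wrap the value together with its declared type
            node[term["@id"]] = {"@type": term["@type"], "@value": value}
    return node


row = {
    "path[POSIX]": "LICENSE",
    "size[bytes]": "1300",
    "checksum[md5]": "529ff606a38b37a2e5478c1abfeca231",
    "url": "",
}
node = annotate_row(row, FILES_TERMS)
print(node["nfo:fileSize"])  # {'@type': 'xsd:integer', '@value': '1300'}
```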
Overrides
Every entity described in this sheet is declared to be of type https://schema.org/DigitalDocument. A given md5sum is used as a node identifier.
{
  "@type": "schema:DigitalDocument"
}