The tabby format
tabby is a metadata format that can be used to describe arbitrary things. It does not prescribe a specific terminology.
A tabby metadata record comprises linked components stored in individual files that are assembled and interpreted to form a single JSON or JSON-LD document. The different types of components are:
- sheet
The main building block of a tabby record. A sheet contains key-value pairs in a table(-like) format. See Sheet types for details.
- context
A record-global, or sheet-specific JSON-LD context that defines the terminology and semantics of information in sheets. This is an optional component. See Defining context for details.
- override
A JSON-format specification of data transformations to be applied to the information in sheets, when assembling them into a complete JSON(-LD) document This is an optional component. See Metadata enrichment (overrides) for details.
File naming
A tabby metadata record comprises one or more files that share a common, arbitrary name prefix. In addition to this prefix, each file path contains the name of a sheet. More generally, the name of all files comprising a tabby metadata record follow the scheme:
<record-id>{_,/}<sheet-name>.<extension>
where
<record-id>
is a character string that is common to all files comprising a single tabby metadata record. This ID is an arbitrary, case-sensitive string.For cross-platform compatibility reasons it is recommended to limit the identifier to lower-case ASCII values, and possibly alphanumeric characters.
The
<record-id>
is specified as a_
-delimited file name prefix. Alternatively, the<record-id>
can be identified via the name of the parent directory. In this case, no tabby record file component in this directory can include a_
character in its file name, and only a single tabby record can be represented in that directory.<sheet-name>
is a unique name for a particular sheet component of atabby
metadata record. The name must be lower-case and only consist of the character set[@a-z0-9-]
(@
, letter, digits, and dash/hyphen).<extension>
uniquely identifies the nature of the record component:tsv
orjson
for a sheetctx.jsonld
for a contextoverride.json
for an override specification
Sheet convention declaration
Sheet names support a dedicated syntax for declaring that a sheet conforms to a
particular set of conventions (see Conventions for sheet formats and semantics). The convention
label is appended to the sheet name as a suffix, delimited by the @
character. For example, declaring that the dataset
sheet implements
mystandard
conventions, use dataset@mystandard
as the full sheet name.
Note that for all practical purposes the conventions suffix is a part part of the whole sheet name and has to be included anywhere a sheet name is used.
Sheet types
Tabular metadata is represented in plain-text files, typically in tab-separated-value (TSV) format (but see the layout specifications for a JSON alternative). tabby defines how this tabular information is converted to structured data that can be represented in formats like JSON.
Two different sheet layouts are distinguished and are described below. The particular layout cannot necessarily be inferred from the file content. Instead, the specific semantic is declared when Connecting sheets.
No data type conversion of any kind is performed when reading data from sheets in TSV-format. Any item is represented as a plain-text string.
The single
layout
This layout represents a key-value mapping, equivalent to a JSON object.
Each table row corresponds to a particular key, which is given as the value of
the first column. Empty rows, or rows where the value in the first column is
missing or starts with #
are skipped. When a particular key is repeated in a
subsequent row, the information in the latter row replaces the information in
the former.
The value corresponding to a key is given in the second column. If there are
values in subsequent columns, all values starting from column 2 until the last
column with a value are gathered into a list (JSON-array equivalent). Any missing
values in this list are represented as None
(JSON's null
). If no value
is declared for a table row, the entire row (i.e., the key) is skipped)
Single-item list values are compacted by removing the containing list and assigning the only items directly as the value.
When a JSON-file is present (either in addition to a TSV-table or standalone), it must contain a JSON-object. Keys and values in this object are updated with the information read from a corresponding TSV-file, if one exists. Data read from JSON or TSV files are interpreted in the exact same fashion. The only difference is that JSON-native data types are preserved.
The many
layout
This layout represents a series of homogeneous key-value mappings, equivalent to JSON-array comprising homogeneously structured JSON-objects.
Any table row that is empty, or where the value in the first column starts with
#
is skipped.
The first non-skipped row defines a common set of keys. Every subsequent row represents an entity that is described with values corresponding to these common keys.
When two columns are associated with the exact same key, their associated values are gathered into a list (JSON-array equivalent), with missing values being skipped.
When a row contains columns with values beyond the column corresponding to the last defined key, all these values are gathered into a list (JSON-array equivalent) that include the values from the column of the last defined key, and is assigned as the value corresponding to that last key.
Single-item list values are compacted by removing the containing list and assigning the only items directly as the value.
When a JSON-file is present (either in addition to a TSV-table or standalone),
it must contain ether a JSON-object or a JSON-array. If it contains a
JSON-object, this object is used as the template for any object read from a
corresponding TSV-file's rows (in the same fashion as TSV-based information
updates a JSON-object in the single
layout). If it contains an array, all
elements in the array are interpreted as items in the many
table, which
precede any items read from rows in a corresponding TSV-file, if one exists.
Otherwise, data read from JSON or TSV files are interpreted in the exact same
fashion. The only difference is that JSON-native data types are preserved.
Metadata record entry point (root)
Each tabby metadata record comprises, at minimum, one sheet
in single
layout. A system using tabby records should establish a
convention to identify this root record via a particular name, such as
dataset
.
A minimal metadata record on a dataset about "penguins" can be represented in a
single file, such as penguins_dataset.tsv
, or penguins_dataset.json
.
Connecting sheets
Information from individual sheets can be nested to create more complex data structures than what the two basic sheet layouts can represent individually. This is supported by two dedicated import statements:
@tabby-single-<sheetname>
@tabby-many-<sheetname>
where <sheetname>
is the name of a tabby metadata record component, with
which the corresponding file name can be constructed. For example, using
@tabby-many-authors
in the TSV file penguins_dataset.tsv
, links the
information in the file penguins_authors.tsv
located in the same directory.
The difference between the @tabby-single-...
and the @tabby-many-...
statements is how the linked sheets are being interpreted, and correspond to
the two basic sheet layouts.
These import statements can be used in any value field in any of the two sheet layouts. This includes value list (array) items.
Imports are not file-format specific, hence the sheet name must not include a file extension. An imported sheet can always be in TSV-format, JSON-format, or a combination of both formats.
Two additional import statements are supported:
@tabby-optional-single-<sheetname>
@tabby-optional-many-<sheetname>
They work exactly as their non-optional counterparts. The only difference is that when a sheet imported in this way is not present, the import is skipped rather than causing an error.
Defining context
Typically, the sheets of a tabby metadata record use simple terms like
license
for keys and equally simple values like 1.5
for values. While
this simplicity is useful for assembling a metadata record (possibly manually),
it is insufficient for yielding precise, machine-readable records with
comprehensively defined semantics. For that, each and every term, like
license
, must have a proper definition, and quantitative values, like
1.5
, must come with information on the underlying concepts and possibly
associated units.
Providing the necessary context is possible by amending a metadata record with
JSON-LD @context
records that can be supplied, for each sheet separately,
via side-car files. Such a side-car files share the file name of the annotated
sheet without the extension, and a .ctx.jsonld
suffix. For example, a
context for penguins_authors.tsv
would be read from
penguins_authors.ctx.jsonld
in the same directory.
In addition, a tabby record may include a record-global context specification
at <prefix>.ctx.jsonld
or <prefix>/ctx.jsonld
. This defines a default
context for any sheet. Sheet-specific context definitions amend/override this
record-global default for a given sheet.
The content of any context file must be a valid JSON-LD context.
Metadata enrichment (overrides)
When the tabular components of a tabby metadata record are not detailed enough or precise enough, it is possible to enrich the record with additional information, without having to edit the sheets. This is done via an overrides specification in a JSON side-car file.
The type of metadata enrichment described here is based on purely lexical operations that manipulate (string) values. For other types of metadata enrichment see Defining context or consider JSON-LD framing.
The override side-car file has the file name of the annotated sheet without
the extension, plus a .override.json
suffix. For example, overrides for
penguins_authors.tsv
would be read from penguins_authors.override.json
in the same directory.
An override specification comprises of a single JSON object (key-value mapping), where a key indicates the target for injection or replacement, and the value is either a JSON literal, a format-string, or a JSON array (list) of these two types.
Any string value is assumed to be a format-string, compliant with the Python
Format String Syntax, and will be interpolated using the key-value mapping
for the respective object read from sheet. Therefore the brace characters
{}
need to be quote in case a particular string is to be treated as a
literal value.
The full override record is built before it is applied, at once, to the respective object read from a sheet.
When declaring an override for a many
sheet, the override is applied
individually to each object (row) defined in that sheet.
For uniformity, any metadata value is represented as a multi-value list
(array) at the point of interpolation override specifications. A single item
value for the key name
therefore has to be referenced as {name[0]}
, not
just {name}
. See Metadata enrichment (overrides) for examples.