Provenance index#
swh-graph can be used to generate the “Provenance Index”, which is a set of Parquet tables that allow efficiently computing the set of revisions any given content (ie. file) is in.
Primer on Parquet#
Parquet is a column-oriented file format that allows efficient batch requests. Column-oriented means that instead of storing rows sequentially, rows are grouped by batches of about 1M rows (by default), and a Parquet file stores all values of column 1 of these rows, followed by all values of column 2 of these rows, etc. This allows more efficient compression, and avoids decoding columns we are not interested in.
While it does not provide row-level indexes like “regular databases” (aka. OLTP) like PostgreSQL, it provides two mechanism to filter out row groups before decoding them:
statistics, which store the maximum and minimum value of a column within a row group or a page (1MB), useful for sorted/sequential data
bloom filters, which allow probabilistically filtering out most row groups when looking for a specific set of values, useful for high-cardinality non-sequential data.
Frontier directories#
The naive implementation of the provenance index would be to store, for each content, the set of all revisions and releases that contain it, possibly sorted by date.
However, this would take unrealistic amounts of spaces, so we use an intermediate layer to keep the combinatorial explosion under control. This layer is a set of directories called “frontier directories”, which are computed through a heuristic aimed at finding directories that are both in many revisions and contain many contents.
Currently, the heuristic is defined as: a directory D
is frontier for a
revision/release R
if and only iff it follows all three of these conditions:
there is at least one content
C
directly inD
for all content
C
inD
, the date of first occurrence ofC
predates the authorship date ofR
D
is not the root directory ofR
Tables#
Node id - SWHID map#
The nodes
table allows mapping the ids (space-efficient but unstable)
used by other tables to SWHIDs (stable but 21-bytes long) contains three columns:
id
, a 64-bits unsigned integer (with DELTA_BINARY_PACKED encoding)type
, a string amongcnt
,dir
,rev
, orrel
sha1_git
, a 20-bytes binary string
type
and sha1_git
together represent a SWHID (v1)
Within each file of the table, rows are ordered in such a way that consecutive rows
have id
in a mostly monotonic order, allowing fast access (50ms) thanks to the page
index.
The sha1_git
column has Bloom filters, allowing decent access performance (600ms,
as opposed to 20s without).
content-in-directory#
For each content, this table stores the list of directories that contain it. Columns:
cnt
, a 64-bits id of the contentdir
, a 64-bits id of the directorypath
, a byte string of the filesystem path from the directory to the content (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)
Within each file of the table, rows are ordered in such a way that consecutive rows
have dir
in a mostly monotonic order.
directory-in-revision#
For each directory that is frontier for any revision, this table stores the list of all revisions the directory is in. Columns:
dir
, a 64-bits id of the directorydir_max_author_date
, the date of the newest content in the directoryrevrel
, a 64-bits id of the revisionrevrel_author_date
, the authorship date of the revisionpath
, a byte string of the filesystem path from the revision to the directory (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)
Within each file of the table, rows are ordered in such a way that consecutive rows
have revrel
in a mostly monotonic order.
content-in-revision#
Contains the list of revisions that each content is in, that would not be covered by the other two tables (ie. the content is contained from a revision, without any path going through a frontier directory).
Columns:
cnt
, a 64-bits id of the directoryrevrel
, a 64-bits id of the revisionrevrel_author_date
, the authorship date of the revisionpath
, a byte string of the filesystem path from the revision to the directory (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)
Within each file of the table, rows are ordered in such a way that consecutive rows
have revrel
in a mostly monotonic order.
To simplify index generation, this table contains some (content, revision)
relationships where the content
is reachable from the revision
through
a frontier directory. This happens when there are multiple paths from the revision
to the content
and only some of them go through a frontier directory.