Provenance index#

swh-graph can be used to generate the “Provenance Index”, which is a set of Parquet tables that allow efficiently computing the set of revisions any given content (ie. file) is in.

Primer on Parquet#

Parquet is a column-oriented file format that allows efficient batch requests. Column-oriented means that instead of storing rows sequentially, rows are grouped by batches of about 1M rows (by default), and a Parquet file stores all values of column 1 of these rows, followed by all values of column 2 of these rows, etc. This allows more efficient compression, and avoids decoding columns we are not interested in.

While it does not provide row-level indexes like “regular databases” (aka. OLTP) like PostgreSQL, it provides two mechanism to filter out row groups before decoding them:

  • statistics, which store the maximum and minimum value of a column within a row group or a page (1MB), useful for sorted/sequential data

  • bloom filters, which allow probabilistically filtering out most row groups when looking for a specific set of values, useful for high-cardinality non-sequential data.

Frontier directories#

The naive implementation of the provenance index would be to store, for each content, the set of all revisions and releases that contain it, possibly sorted by date.

However, this would take unrealistic amounts of spaces, so we use an intermediate layer to keep the combinatorial explosion under control. This layer is a set of directories called “frontier directories”, which are computed through a heuristic aimed at finding directories that are both in many revisions and contain many contents.

Currently, the heuristic is defined as: a directory D is frontier for a revision/release R if and only iff it follows all three of these conditions:

  • there is at least one content C directly in D

  • for all content C in D, the date of first occurrence of C predates the authorship date of R

  • D is not the root directory of R

Tables#

Node id - SWHID map#

The nodes table allows mapping the ids (space-efficient but unstable) used by other tables to SWHIDs (stable but 21-bytes long) contains three columns:

  • id, a 64-bits unsigned integer (with DELTA_BINARY_PACKED encoding)

  • type, a string among cnt, dir, rev, or rel

  • sha1_git, a 20-bytes binary string

type and sha1_git together represent a SWHID (v1)

Within each file of the table, rows are ordered in such a way that consecutive rows have id in a mostly monotonic order, allowing fast access (50ms) thanks to the page index.

The sha1_git column has Bloom filters, allowing decent access performance (600ms, as opposed to 20s without).

content-in-directory#

For each content, this table stores the list of directories that contain it. Columns:

  • cnt, a 64-bits id of the content

  • dir, a 64-bits id of the directory

  • path, a byte string of the filesystem path from the directory to the content (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)

Within each file of the table, rows are ordered in such a way that consecutive rows have dir in a mostly monotonic order.

directory-in-revision#

For each directory that is frontier for any revision, this table stores the list of all revisions the directory is in. Columns:

  • dir, a 64-bits id of the directory

  • dir_max_author_date, the date of the newest content in the directory

  • revrel, a 64-bits id of the revision

  • revrel_author_date, the authorship date of the revision

  • path, a byte string of the filesystem path from the revision to the directory (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)

Within each file of the table, rows are ordered in such a way that consecutive rows have revrel in a mostly monotonic order.

content-in-revision#

Contains the list of revisions that each content is in, that would not be covered by the other two tables (ie. the content is contained from a revision, without any path going through a frontier directory).

Columns:

  • cnt, a 64-bits id of the directory

  • revrel, a 64-bits id of the revision

  • revrel_author_date, the authorship date of the revision

  • path, a byte string of the filesystem path from the revision to the directory (with DELTA_BYTE_ARRAY encoding; TODO: add/replace with zstd compression?)

Within each file of the table, rows are ordered in such a way that consecutive rows have revrel in a mostly monotonic order.

To simplify index generation, this table contains some (content, revision) relationships where the content is reachable from the revision through a frontier directory. This happens when there are multiple paths from the revision to the content and only some of them go through a frontier directory.