Topological order#

The swh_graph_topology crate supports topologically sorting nodes, and writing both a topological order and the generation number of each node (aka. depth of the sub-DAG of nodes reachable from that node). Both can be either forward (ori->snp->rel->rev->dir->cnt) or backward (cnt->dir->rev->rel->snp->ori).

Software Heritage provides precomputed versions of this dataset.

Definitions#

In a forward topological order, each node is present after any of its ancestors and before any of its descendants. For example, a snapshot node is after any origin that contains it, and before any revision it contains; and a revision is after all of its child revisions (which are its predecessors/ancestors in the forward graph) and before all of its parent revisions (which are its successors/descendants in the forward graph).

Forward generation numbers, or forward depths, are defined as the maximum length of a path from a root (typically an origin) to a particular node. An origin’s forward depth is always 0, a snapshot’s is always 1, a release’s is almost always 2 (except for releases pointed by other releases), and a revision’s is 3 if it does not have any child revision (which are predecessors/ancestors in the forward graph). The largest generation number in a complete graph (or subgraph with no dangling edges) is always a content object (though it can technically be the empty directory too).

Symmetrically, in a backward topological order, each node is present after any of its descendants and before any of its ancestors. And backward generation numbers, or backward depths, are the maximum length of a path from a particular node to a leaf (typically a content in the full graph, or a root revision in the history-and-hosting graph). A content’s backward generation number is always 0, and so is the empty directory’s. The largest generation number in a complete graph is almost always an origin, though it can also be an orphan node, ie. a node present in no origin.

File formats#

Depths are a plain array of u32, which can be used as a map from node ids (consecutive integers from 0 to graph.num_nodes()) to each node’s depth.

Generations are written to a compact format using two bitstreams: one for node ids and one for offsets. Integers in the bitstreams are gamma-encoded. They are written using this algorithm:

  • Write 0 to the offsets bitstream, gamma-encoded (for backward compatibility)

  • for each generation:

    • Initialize previous_node and num_written_bits to 0

    • For each node in the generation:

      • Write node - previous_node to the nodes bitstream, gamma-encoded

      • Increment num_written_bits by the length of the gamma-encoding of node - previous_node

    • Write num_written_bits to the offset bitstream, gamma-encoded

  • Write 0, gamma-encoded, to mark the end of the file

Computing the cumulative-sum of the offset bitstream allows computing the start and end of an arbitrary generation in the nodes bitstream.

Given the start and end of a generation in the nodes bitstream, the generation is the computed cumulative sum of each gamma-encoded difference in the generation.

Note: We could actually write node - previous_node - 1 instead of node - previous_node in the nodes bitstream to save bits (eg. if previous_node + 1 = node, the gamma-encoding would take only 1 bit instead of 4), but don’t for backward compatibility.

Implementation#

This is a BFS-like traversal implemented in topology/src/bin/generations.rs. Generations are written to a bitstream while traversing, and an array with the depths is populated, then written at the end.

Writing and reading are implemented in topology/src/generations.rs and available in the public API