Topological order#
The swh_graph_topology crate supports topologically sorting nodes, and writing both a topological order and the generation number of each node (aka. depth of the sub-DAG of nodes reachable from that node). Both can be either forward (ori->snp->rel->rev->dir->cnt) or backward (cnt->dir->rev->rel->snp->ori).
Software Heritage provides precomputed versions of this dataset.
Definitions#
In a forward topological order, each node is present after any of its ancestors and before any of its descendants. For example, a snapshot node is after any origin that contains it, and before any revision it contains; and a revision is after all of its child revisions (which are its predecessors/ancestors in the forward graph) and before all of its parent revisions (which are its successors/descendants in the forward graph).
Forward generation numbers, or forward depths, are defined as the maximum length of a path from a root (typically an origin) to a particular node. An origin’s forward depth is always 0, a snapshot’s is always 1, a release’s is almost always 2 (except for releases pointed by other releases), and a revision’s is 3 if it does not have any child revision (which are predecessors/ancestors in the forward graph). The largest generation number in a complete graph (or subgraph with no dangling edges) is always a content object (though it can technically be the empty directory too).
Symmetrically, in a backward topological order, each node is present after any of its descendants and before any of its ancestors. And backward generation numbers, or backward depths, are the maximum length of a path from a particular node to a leaf (typically a content in the full graph, or a root revision in the history-and-hosting graph). A content’s backward generation number is always 0, and so is the empty directory’s. The largest generation number in a complete graph is almost always an origin, though it can also be an orphan node, ie. a node present in no origin.
File formats#
Depths are a plain array of u32, which can be used as a map from node ids (consecutive integers
from 0 to graph.num_nodes()) to each node’s depth.
Generations are written to a compact format using two bitstreams: one for node ids and one for offsets. Integers in the bitstreams are gamma-encoded. They are written using this algorithm:
Write
0to the offsets bitstream, gamma-encoded (for backward compatibility)for each generation:
Initialize
previous_nodeandnum_written_bitsto 0For each
nodein the generation:Write
node - previous_nodeto the nodes bitstream, gamma-encodedIncrement
num_written_bitsby the length of the gamma-encoding ofnode - previous_node
Write
num_written_bitsto the offset bitstream, gamma-encoded
Write
0, gamma-encoded, to mark the end of the file
Computing the cumulative-sum of the offset bitstream allows computing the start and end of an arbitrary generation in the nodes bitstream.
Given the start and end of a generation in the nodes bitstream, the generation is the computed cumulative sum of each gamma-encoded difference in the generation.
Note: We could actually write node - previous_node - 1 instead of node - previous_node
in the nodes bitstream to save bits (eg. if previous_node + 1 = node, the gamma-encoding
would take only 1 bit instead of 4), but don’t for backward compatibility.
Implementation#
This is a BFS-like traversal implemented in topology/src/bin/generations.rs.
Generations are written to a bitstream while traversing, and an array with the depths is populated,
then written at the end.
Writing and reading are implemented in topology/src/generations.rs and available in
the public API