Descendant and path counts#

The swh_graph_topology crate supports precomputing estimates of the size of the sub-DAG of nodes reachable from any node.

The currently implemented counts are:

  • all-paths count: the number of path from a particular node to any other node; which is also the number of nodes if the DAG was expanded as a tree without deduplicating

  • paths-to-leaves count: the number of path from a particular node to any leaf; which is also the number of leaves if the DAG was expanded as a tree without deduplicating

  • descendant counts: an approximation of the number of nodes in the sub-DAG of nodes reachable from a particular node

They each come in two flavor: forward and backward. Forward counts are computed by following only arcs from a node to its successor: origin -> snapshot, snapshot -> revision, child revision -> parent revision (not a typo), revision -> directory, parent directory to child directory, directory -> content, etc. And backward is the opposite.

File formats#

We support three different formats. In order of preference:

  • Bitfieldvec, for a compact representation with fast random access. Not available for path counts.

  • plain array of floating-point numbers, for fast random access in non-Rust languages

  • Parquet, for use in batch processing software like Spark or Datafusion. Parquet files contain at least three columns: swhid (the SWHID of each node), node (its id as u64), and at least one count.

Implementation#

All counts computations are based on the swh_graph_stdlib::labeling::MapReduce pattern, which propagates counts recursively through the graph, in topological order.

For path counts:

  • For each node in topological order:

    • if the node is a root:

      • map: set paths_from_roots[node] = all_paths[node] = 1

    • else:

      • map: initialize paths_from_roots[node] = 0 and all_paths[node] = 1

      • reduce: for each successor succ of node:

        • increment paths_from_roots[node] by paths_from_roots[succ]

        • increment all_paths[node] by all_paths[succ]

Then write the resulting arrays in the various formats.

For descendants, the naive implementation would be similar but use sets of nodes (with union instead of addition); but this does not fit in memory on typical machines. Instead, we use HyperLogLog counters (with merging instead of addition). HyperLogLog counters are probabilistic, so they do not provide an exact result, only approximate it with a known relative standard-deviation.

For the 2025-10-08 full graph, we are able to precompute counts with 20% relative standard deviation, as computing them fits on a machine with 4TB of RAM. Better standard deviations would require over 6TB of RAM, which is not tractable.

We are also planning to provide precomputed counts with significantly better standard deviation for the history-hosting graph and the history-hosting layer of the full graph.