.. _swh-graph-libs-counts: Descendant and path counts ========================== The `swh_graph_topology crate `_ supports precomputing estimates of the size of the sub-DAG of nodes reachable from any node. The currently implemented counts are: * all-paths count: the number of path from a particular node to any other node; which is also the number of nodes if the DAG was expanded as a tree without deduplicating * paths-to-leaves count: the number of path from a particular node to any leaf; which is also the number of leaves if the DAG was expanded as a tree without deduplicating * descendant counts: an **approximation** of the number of nodes in the sub-DAG of nodes reachable from a particular node They each come in two flavor: forward and backward. Forward counts are computed by following only arcs from a node to its successor: origin -> snapshot, snapshot -> revision, child revision -> parent revision (not a typo), revision -> directory, parent directory to child directory, directory -> content, etc. And backward is the opposite. File formats ------------ We support three different formats. In order of preference: * `Bitfieldvec `_, for a compact representation with fast random access. Not available for path counts. * plain array of floating-point numbers, for fast random access in non-Rust languages * Parquet, for use in batch processing software like Spark or Datafusion. Parquet files contain at least three columns: ``swhid`` (the :ref:`SWHID ` of each node), ``node`` (its id as ``u64``), and at least one count. Implementation -------------- All counts computations are based on the `swh_graph_stdlib::labeling::MapReduce `_ pattern, which propagates counts recursively through the graph, in topological order. For path counts: * For each node in topological order: * if the node is a root: * map: set ``paths_from_roots[node] = all_paths[node] = 1`` * else: * map: initialize ``paths_from_roots[node] = 0`` and ``all_paths[node] = 1`` * reduce: for each successor ``succ`` of ``node``: * increment ``paths_from_roots[node]`` by ``paths_from_roots[succ]`` * increment ``all_paths[node]`` by ``all_paths[succ]`` Then write the resulting arrays in the various formats. For descendants, the naive implementation would be similar but use sets of nodes (with union instead of addition); but this does not fit in memory on typical machines. Instead, we use `HyperLogLog `_ counters (with merging instead of addition). HyperLogLog counters are probabilistic, so they do not provide an exact result, only approximate it with a known relative standard-deviation. For the 2025-10-08 full graph, we are able to precompute counts with 20% relative standard deviation, as computing them fits on a machine with 4TB of RAM. Better standard deviations would require over 6TB of RAM, which is not tractable. We are also `planning `__ to provide precomputed counts with significantly better standard deviation for the :ref:`history-hosting graph ` and the :ref:`history-hosting layer ` of the full graph.