swh.dataset.exporters.edges module

We cannot rely solely on the object IDs read from the journal, because some nodes that appear as destinations in the edge files might not be present in the archive (e.g. a rev_entry referring to a revision that has not been crawled yet).

The most efficient way of getting all the nodes mentioned in the edge files is therefore to run sort(1) on these very large files to extract the unique node IDs, using the disk as a temporary buffer.
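To make the idea of sorting with the disk as a temporary buffer concrete, here is a minimal sketch of an external merge sort with deduplication. It is not the module's implementation, which delegates this work to sort(1); the function name `unique_sorted_lines` and the `chunk_size` parameter are hypothetical and exist only for illustration.

```python
import heapq
import os
import tempfile
from itertools import groupby, islice


def unique_sorted_lines(lines, chunk_size=100_000):
    """Yield the distinct lines of `lines` in sorted order.

    Hypothetical helper: at most `chunk_size` lines are held in memory
    at once; each sorted chunk ("run") is spilled to a temporary file.
    Lines are assumed not to contain newline characters.
    """
    run_paths = []
    source = iter(lines)
    try:
        # Phase 1: sort fixed-size chunks in memory and write each to disk.
        while True:
            chunk = sorted(set(islice(source, chunk_size)))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(line + "\n" for line in chunk)
            run_paths.append(path)

        # Phase 2: lazily merge the sorted runs and drop adjacent duplicates.
        files = [open(path, encoding="utf-8") for path in run_paths]
        try:
            streams = [(line.rstrip("\n") for line in f) for f in files]
            for line, _ in groupby(heapq.merge(*streams)):
                yield line
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary runs once the merge is finished or abandoned.
        for path in run_paths:
            os.remove(path)
```

Memory use is bounded by the chunk size rather than by the input size, which is what makes this kind of approach practical for inputs that do not fit in RAM.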

The pipeline performs the following steps, in order:

concatenate all the compressed edge files and write the result to graph.edges.csv.zst (this relies on the fact that concatenating zstd-compressed files yields a valid compressed file whose decompressed content is the concatenation of the inputs);

decompress the edges;

count the number of edges and write it to graph.edges.count.txt;

count the number of occurrences of each edge type and write them to graph.edges.stats.txt;

concatenate all the (decompressed) nodes from the export with the destination nodes of the edges, and sort the output to get the list of unique graph nodes;

count the number of unique graph nodes and write it to graph.nodes.count.txt;

count the number of occurrences of each node type and write them to graph.nodes.stats.txt;

compress the resulting nodes and write them to graph.nodes.csv.zst.
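The counting and deduplication steps above can be illustrated on a tiny in-memory dataset. This is only a sketch under stated assumptions, not the module's code: it assumes each edge line has the form "<source SWHID> <destination SWHID>", that identifiers follow the "swh:1:<type>:<hash>" layout, and that an edge type is written as "<source type>:<destination type>". The function names are hypothetical, and the real pipeline streams compressed files through external tools instead of holding everything in memory.

```python
from collections import Counter


def swhid_type(swhid):
    # Assumption: identifiers look like "swh:1:<type>:<hash>",
    # so the object type is the third colon-separated field.
    return swhid.split(":")[2]


def summarize_graph(edge_lines, exported_nodes):
    """Hypothetical in-memory equivalent of the counting steps."""
    # Assumption: each non-empty edge line starts with "<src> <dst>".
    edges = [tuple(line.split()[:2]) for line in edge_lines if line.strip()]

    edge_count = len(edges)
    edge_stats = Counter(
        f"{swhid_type(src)}:{swhid_type(dst)}" for src, dst in edges
    )

    # Union of the exported nodes and every edge destination, so that
    # destinations absent from the export are still listed as nodes.
    nodes = sorted(set(exported_nodes) | {dst for _, dst in edges})
    node_stats = Counter(swhid_type(node) for node in nodes)

    return edge_count, edge_stats, nodes, node_stats


# Shortened placeholder hashes, for readability only.
edges = [
    "swh:1:rev:aa swh:1:dir:bb",
    "swh:1:rev:aa swh:1:rev:cc",  # swh:1:rev:cc is not in the export
    "swh:1:dir:bb swh:1:cnt:dd",
]
exported = ["swh:1:rev:aa", "swh:1:dir:bb", "swh:1:cnt:dd"]

edge_count, edge_stats, nodes, node_stats = summarize_graph(edges, exported)
```

In this example the revision `swh:1:rev:cc` appears only as an edge destination, yet it ends up in the node list, which is the reason the destinations are merged with the exported nodes.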
