swh.graph.webgraph module

WebGraph driver

exception swh.graph.webgraph.CompressionSubprocessError(message: str, log_path: Path)

Bases: Exception

class swh.graph.webgraph.CompressionStep(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

EXTRACT_NODES = -20
EXTRACT_LABELS = -10
NODE_STATS = 0
EDGE_STATS = 3
LABEL_STATS = 6
MPH = 10
CONVERT_MPH = 20
BV = 30
BV_OFFSETS = 40
BV_EF = 50
BFS_ROOTS = 55
BFS = 60
PERMUTE_AND_SIMPLIFY_BFS = 70
BFS_EF = 80
BFS_DCF = 90
LLP = 100
COMPOSE_ORDERS = 110
PERMUTE_LLP = 120
OFFSETS = 130
EF = 135
OBL = 140
STATS = 150
TRANSPOSE = 160
TRANSPOSE_OFFSETS = 165
TRANSPOSE_OBL = 170
TRANSPOSE_EF = 175
MAPS = 180
EXTRACT_PERSONS = 190
PERSONS_STATS = 195
MPH_PERSONS = 200
CONVERT_MPH_PERSONS = 205
NODE_PROPERTIES = 210
MPH_LABELS = 220
PTHASH_LABELS = 223
PTHASH_LABELS_ORDER = 226
FCL_LABELS = 230
EDGE_LABELS = 240
EDGE_LABELS_TRANSPOSE = 245
EDGE_LABELS_OBL = 250
EDGE_LABELS_TRANSPOSE_OBL = 260
EDGE_LABELS_EF = 270
EDGE_LABELS_TRANSPOSE_EF = 280
CLEAN_TMP = 300
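
The member values above are spaced so that steps can be compared numerically. As a minimal sketch, assuming the numeric values reflect the intended pipeline order (the listing suggests this but does not state it), a set of steps can be sorted by value. `StepSubset` and `in_numeric_order` are hypothetical stand-ins that copy a few members from the listing so the example runs without swh.graph installed; they are not part of the module.

```python
from enum import Enum
from typing import Iterable, List


class StepSubset(Enum):
    """Hypothetical stand-in copying a few CompressionStep members and values."""

    EXTRACT_NODES = -20
    MPH = 10
    BV = 30
    LLP = 100
    CLEAN_TMP = 300


def in_numeric_order(steps: Iterable[Enum]) -> List[Enum]:
    """Return the given steps sorted by their numeric enum value."""
    return sorted(steps, key=lambda step: step.value)


# Sets are unordered, so sort explicitly before iterating.
selected = {StepSubset.CLEAN_TMP, StepSubset.BV, StepSubset.EXTRACT_NODES}
ordered_names = [step.name for step in in_numeric_order(selected)]
print(ordered_names)
```

The same key function should work on the real `CompressionStep` members, since they are plain `Enum` values with integer payloads.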
swh.graph.webgraph.do_step(step, conf) → List[RunResult]
swh.graph.webgraph.compress(graph_name: str, in_dir: Path, out_dir: Path, steps: Set[CompressionStep] = {CompressionStep.BFS, CompressionStep.BFS_DCF, CompressionStep.BFS_EF, CompressionStep.BFS_ROOTS, CompressionStep.BV, CompressionStep.BV_EF, CompressionStep.BV_OFFSETS, CompressionStep.CLEAN_TMP, CompressionStep.COMPOSE_ORDERS, CompressionStep.CONVERT_MPH, CompressionStep.CONVERT_MPH_PERSONS, CompressionStep.EDGE_LABELS, CompressionStep.EDGE_LABELS_EF, CompressionStep.EDGE_LABELS_OBL, CompressionStep.EDGE_LABELS_TRANSPOSE, CompressionStep.EDGE_LABELS_TRANSPOSE_EF, CompressionStep.EDGE_LABELS_TRANSPOSE_OBL, CompressionStep.EDGE_STATS, CompressionStep.EF, CompressionStep.EXTRACT_LABELS, CompressionStep.EXTRACT_NODES, CompressionStep.EXTRACT_PERSONS, CompressionStep.FCL_LABELS, CompressionStep.LABEL_STATS, CompressionStep.LLP, CompressionStep.MAPS, CompressionStep.MPH, CompressionStep.MPH_LABELS, CompressionStep.MPH_PERSONS, CompressionStep.NODE_PROPERTIES, CompressionStep.NODE_STATS, CompressionStep.OBL, CompressionStep.OFFSETS, CompressionStep.PERMUTE_AND_SIMPLIFY_BFS, CompressionStep.PERMUTE_LLP, CompressionStep.PERSONS_STATS, CompressionStep.PTHASH_LABELS, CompressionStep.PTHASH_LABELS_ORDER, CompressionStep.STATS, CompressionStep.TRANSPOSE, CompressionStep.TRANSPOSE_EF, CompressionStep.TRANSPOSE_OBL, CompressionStep.TRANSPOSE_OFFSETS}, conf: Dict[str, str] = {}, progress_cb: Callable[[int, CompressionStep], None] = <function <lambda>>)

Graph compression pipeline driver: turns the nodes/edges files of an uncompressed graph into a compressed on-disk representation.

Parameters:
  • graph_name – graph base name, relative to in_dir

  • in_dir – input directory, where the uncompressed graph can be found

  • out_dir – output directory, where the compressed graph will be stored

  • steps – compression steps to run (default: all steps)

  • conf

    compression configuration, supporting the following keys (all are optional, so an empty configuration is fine and is the default)

    • batch_size: batch size for WebGraph transformations; defaults to 1 billion

    • classpath: Java classpath; defaults to the swh-graph JAR only

    • java: command used to run the Java VM; defaults to “java”

    • java_tool_options: value for JAVA_TOOL_OPTIONS environment variable; defaults to various settings for high memory machines

    • logback: path to a logback.xml configuration file; if not provided a temporary one will be created and used

    • max_ram: maximum RAM to use for compression; defaults to available virtual memory

    • tmp_dir: temporary directory, defaults to the “tmp” subdir of out_dir

    • object_types: comma-separated list of object types to extract (e.g. ori,snp,rel,rev); defaults to *

  • progress_cb – a callable taking a percentage and a step as arguments, called every time a step starts.
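
The parameters above can be combined as in the following sketch, which is illustrative rather than authoritative. `make_progress_recorder` and `run_compression` are hypothetical helpers, not part of swh.graph. The `conf` keys come from the list above, and their values are passed as strings because the signature declares `Dict[str, str]`. The import of swh.graph is deferred into `run_compression` so the callback can be exercised on its own.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

ProgressEvent = Tuple[int, str]


def make_progress_recorder() -> Tuple[List[ProgressEvent], Callable[[int, Any], None]]:
    """Build a progress_cb that prints and records (percentage, step name) pairs."""
    events: List[ProgressEvent] = []

    def progress_cb(percentage: int, step: Any) -> None:
        # Enum members expose .name; fall back to str() for anything else.
        name = getattr(step, "name", str(step))
        events.append((percentage, name))
        print(f"[{percentage:3d}%] starting {name}")

    return events, progress_cb


def run_compression(graph_name: str, in_dir: Path, out_dir: Path) -> List[ProgressEvent]:
    """Hypothetical wrapper: run all steps with a small explicit configuration."""
    # Deferred import: requires swh.graph and its Java dependencies to be installed.
    from swh.graph.webgraph import compress

    conf: Dict[str, str] = {
        "object_types": "ori,snp,rel,rev",  # example value from the parameter list
        "tmp_dir": str(out_dir / "tmp"),    # same location as the documented default
    }
    events, progress_cb = make_progress_recorder()
    # steps is omitted, so the documented default (all steps) applies.
    compress(graph_name, in_dir, out_dir, conf=conf, progress_cb=progress_cb)
    return events
```

A call such as `run_compression("graph", Path("in"), Path("out"))` would then return the list of steps in the order they were started; the graph name and paths here are placeholders.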