swh.graph.webgraph module

WebGraph driver

class swh.graph.webgraph.CompressionStep(value)

Bases: enum.Enum

Enumeration of the compression steps that can be selected through the steps parameter of compress().

EXTRACT_NODES = 1
MPH = 2
BV = 3
BFS = 4
PERMUTE_BFS = 5
TRANSPOSE_BFS = 6
SIMPLIFY = 7
LLP = 8
PERMUTE_LLP = 9
OBL = 10
COMPOSE_ORDERS = 11
STATS = 12
TRANSPOSE = 13
TRANSPOSE_OBL = 14
MAPS = 15
EXTRACT_PERSONS = 16
MPH_PERSONS = 17
NODE_PROPERTIES = 18
MPH_LABELS = 19
FCL_LABELS = 20
EDGE_LABELS = 21
EDGE_LABELS_OBL = 22
EDGE_LABELS_TRANSPOSE_OBL = 23
CLEAN_TMP = 24
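
Because the members are plain enum.Enum values, they can be looked up by name or number and sorted by value. The sketch below uses a local stand-in enum that mirrors the names and values listed above, so that it runs without swh.graph installed; parse_steps and in_declaration_order are hypothetical helpers, not part of the module. Whether the pipeline expects steps in increasing numeric order is an assumption here.

```python
from enum import Enum

# Local stand-in mirroring the documented members and values (auto-numbered
# from 1). In real code, import CompressionStep from swh.graph.webgraph.
CompressionStep = Enum(
    "CompressionStep",
    "EXTRACT_NODES MPH BV BFS PERMUTE_BFS TRANSPOSE_BFS SIMPLIFY LLP "
    "PERMUTE_LLP OBL COMPOSE_ORDERS STATS TRANSPOSE TRANSPOSE_OBL MAPS "
    "EXTRACT_PERSONS MPH_PERSONS NODE_PROPERTIES MPH_LABELS FCL_LABELS "
    "EDGE_LABELS EDGE_LABELS_OBL EDGE_LABELS_TRANSPOSE_OBL CLEAN_TMP",
)


def parse_steps(names):
    """Turn step names (case-insensitive) into a set of members (hypothetical helper)."""
    return {CompressionStep[name.strip().upper()] for name in names}


def in_declaration_order(steps):
    """Sort steps by numeric value, i.e. the order in which they are listed."""
    return sorted(steps, key=lambda step: step.value)


selected = parse_steps(["bv", "mph", "extract_nodes"])
ordered = in_declaration_order(selected)
print([step.name for step in ordered])
```

A set built this way has the same type as the steps parameter of compress(), which takes a Set[CompressionStep].
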
swh.graph.webgraph.do_step(step, conf)
swh.graph.webgraph.compress(graph_name: str, in_dir: pathlib.Path, out_dir: pathlib.Path, steps: typing.Set[swh.graph.webgraph.CompressionStep] = {<CompressionStep.EDGE_LABELS: 21>, <CompressionStep.EDGE_LABELS_TRANSPOSE_OBL: 23>, <CompressionStep.EDGE_LABELS_OBL: 22>, <CompressionStep.OBL: 10>, <CompressionStep.BFS: 4>, <CompressionStep.LLP: 8>, <CompressionStep.SIMPLIFY: 7>, <CompressionStep.TRANSPOSE_BFS: 6>, <CompressionStep.TRANSPOSE_OBL: 14>, <CompressionStep.BV: 3>, <CompressionStep.NODE_PROPERTIES: 18>, <CompressionStep.MPH_LABELS: 19>, <CompressionStep.EXTRACT_NODES: 1>, <CompressionStep.MPH_PERSONS: 17>, <CompressionStep.FCL_LABELS: 20>, <CompressionStep.EXTRACT_PERSONS: 16>, <CompressionStep.PERMUTE_LLP: 9>, <CompressionStep.PERMUTE_BFS: 5>, <CompressionStep.MAPS: 15>, <CompressionStep.CLEAN_TMP: 24>, <CompressionStep.TRANSPOSE: 13>, <CompressionStep.COMPOSE_ORDERS: 11>, <CompressionStep.STATS: 12>, <CompressionStep.MPH: 2>}, conf: typing.Dict[str, str] = {})

Graph compression pipeline driver: takes the nodes/edges files of a graph and produces its compressed on-disk representation.

Parameters
  • graph_name – graph base name, relative to in_dir

  • in_dir – input directory, where the uncompressed graph can be found

  • out_dir – output directory, where the compressed graph will be stored

  • steps – compression steps to run (default: all steps)

  • conf – compression configuration, supporting the following keys (all are optional, so an empty configuration is fine and is the default):

    • batch_size: batch size for WebGraph transformations; defaults to 1 billion

    • classpath: Java classpath; defaults to the swh-graph JAR only

    • java: command used to run the Java VM; defaults to “java”

    • java_tool_options: value for JAVA_TOOL_OPTIONS environment variable; defaults to various settings for high memory machines

    • logback: path to a logback.xml configuration file; if not provided a temporary one will be created and used

    • max_ram: maximum RAM to use for compression; defaults to available virtual memory

    • tmp_dir: temporary directory, defaults to the “tmp” subdir of out_dir
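
The parameters above can be put together as in the following minimal sketch. It assumes that swh.graph is installed when run_compression is called, that running only the first three steps on their own is meaningful, and that numeric settings such as batch_size are passed as strings (as the Dict[str, str] annotation suggests). The paths, the graph name, and the helpers make_conf and run_compression are hypothetical.

```python
from pathlib import Path
from typing import Dict


def make_conf(out_dir: Path, batch_size: int = 1_000_000) -> Dict[str, str]:
    """Build a conf dict using only keys documented above (hypothetical helper)."""
    return {
        "batch_size": str(batch_size),        # assumed to be given as a string
        "java": "java",                       # the documented default, made explicit
        "tmp_dir": str(out_dir / "scratch"),  # overrides the default "tmp" subdir
    }


def run_compression(graph_name: str, in_dir: Path, out_dir: Path) -> None:
    """Run a subset of the pipeline (hypothetical wrapper around compress())."""
    # Imported lazily so this sketch can be loaded without swh.graph installed.
    from swh.graph.webgraph import CompressionStep, compress

    steps = {CompressionStep.EXTRACT_NODES, CompressionStep.MPH, CompressionStep.BV}
    compress(graph_name, in_dir, out_dir, steps=steps, conf=make_conf(out_dir))


# Only the configuration is built here; the directory is a placeholder.
conf = make_conf(Path("/srv/graph/compressed"))
print(conf)
```

Omitting steps runs every step, and omitting conf falls back to the defaults listed above, so a call such as run_compression("example", Path("/srv/graph/in"), Path("/srv/graph/compressed")) is only needed when one of them has to be narrowed or overridden.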