swh.graph.webgraph module

WebGraph driver

class swh.graph.webgraph.CompressionStep(value)

Bases: enum.Enum

Enumeration of the steps of the graph compression pipeline.

MPH = 1
BV = 2
BV_OBL = 3
BFS = 4
PERMUTE = 5
PERMUTE_OBL = 6
STATS = 7
TRANSPOSE = 8
TRANSPOSE_OBL = 9
MAPS = 10
CLEAN_TMP = 11

class swh.graph.webgraph.StepOption

Bases: click.types.ParamType

Click parameter type for specifying compression steps on the CLI.

Parses either individual steps, specified as step names or integers, or step ranges.

name = 'compression step'
convert(value, param, ctx) → Set[swh.graph.webgraph.CompressionStep]

Converts the value. This is not invoked for values that are None (the missing value).
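
The parsing behaviour described for StepOption can be illustrated with a minimal, standalone sketch. It is not the swh.graph implementation: the helper names parse_one and parse_steps are hypothetical, and the comma-separated list and START-END range syntax are assumptions, since this page does not state the exact syntax accepted on the CLI. The enumeration is copied locally so that the sketch runs without swh.graph installed.

```python
from enum import Enum
from typing import Set


class CompressionStep(Enum):
    # Local copy of the documented enumeration, so the sketch is self-contained.
    MPH = 1
    BV = 2
    BV_OBL = 3
    BFS = 4
    PERMUTE = 5
    PERMUTE_OBL = 6
    STATS = 7
    TRANSPOSE = 8
    TRANSPOSE_OBL = 9
    MAPS = 10
    CLEAN_TMP = 11


def parse_one(token: str) -> CompressionStep:
    """Resolve one step given as a name or an integer (hypothetical helper)."""
    token = token.strip()
    if token.isdigit():
        # Lookup by value, e.g. "3" -> CompressionStep.BV_OBL
        return CompressionStep(int(token))
    # Lookup by name, case-insensitive, e.g. "mph" -> CompressionStep.MPH
    return CompressionStep[token.upper()]


def parse_steps(spec: str) -> Set[CompressionStep]:
    """Parse comma-separated steps and START-END ranges (assumed syntax)."""
    steps: Set[CompressionStep] = set()
    for item in spec.split(","):
        if "-" in item:
            # A range includes both of its bounds.
            first, last = (parse_one(t) for t in item.split("-", 1))
            steps.update(
                s for s in CompressionStep if first.value <= s.value <= last.value
            )
        else:
            steps.add(parse_one(item))
    return steps
```

Under these assumptions, parse_steps("mph,3") yields the MPH and BV_OBL steps, and parse_steps("BFS-STATS") yields the four steps numbered 4 to 7. An unknown name raises KeyError and an out-of-range integer raises ValueError; a real click type would instead report such errors through click's own failure mechanism.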

swh.graph.webgraph.do_step(step, conf)

swh.graph.webgraph.compress(graph_name: str, in_dir: pathlib.Path, out_dir: pathlib.Path, steps: Set[swh.graph.webgraph.CompressionStep] = {<CompressionStep.PERMUTE: 5>, <CompressionStep.PERMUTE_OBL: 6>, <CompressionStep.CLEAN_TMP: 11>, <CompressionStep.BV_OBL: 3>, <CompressionStep.BV: 2>, <CompressionStep.MPH: 1>, <CompressionStep.BFS: 4>, <CompressionStep.STATS: 7>, <CompressionStep.TRANSPOSE: 8>, <CompressionStep.TRANSPOSE_OBL: 9>, <CompressionStep.MAPS: 10>}, conf: Dict[str, str] = {})

Graph compression pipeline driver, going from nodes/edges files to a compressed on-disk representation.

Parameters
  • graph_name – graph base name, relative to in_dir

  • in_dir – input directory, where the uncompressed graph can be found

  • out_dir – output directory, where the compressed graph will be stored

  • steps – compression steps to run (default: all steps)

  • conf

    compression configuration, supporting the following keys (all are optional, so an empty configuration is fine and is the default)

    • batch_size: batch size for WebGraph transformations; defaults to 1 billion

    • classpath: Java classpath; defaults to the swh-graph JAR only

    • java: command used to run the Java VM; defaults to “java”

    • java_tool_options: value for JAVA_TOOL_OPTIONS environment variable; defaults to various settings for high memory machines

    • logback: path to a logback.xml configuration file; if not provided a temporary one will be created and used

    • max_ram: maximum RAM to use for compression; defaults to available virtual memory

    • tmp_dir: temporary directory; defaults to the “tmp” subdir of out_dir
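
A call to compress() with a partial configuration might look like the sketch below. The check_conf and run_compression helpers are hypothetical and not part of swh.graph; check_conf only guards against misspelled keys, using the key names listed above. The graph base name "graph", the chosen steps and the configuration values are placeholders picked for illustration. The import of swh.graph.webgraph is deferred into run_compression, so the rest of the sketch can be loaded without the package installed.

```python
from pathlib import Path
from typing import Dict

# Keys listed in the documentation of the ``conf`` parameter.
DOCUMENTED_CONF_KEYS = frozenset(
    {
        "batch_size",
        "classpath",
        "java",
        "java_tool_options",
        "logback",
        "max_ram",
        "tmp_dir",
    }
)


def check_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Reject keys that are not documented (hypothetical helper, typo guard)."""
    unknown = sorted(set(conf) - DOCUMENTED_CONF_KEYS)
    if unknown:
        raise ValueError(f"unknown configuration keys: {', '.join(unknown)}")
    return conf


def run_compression(in_dir: Path, out_dir: Path) -> None:
    """Run two arbitrarily chosen steps; needs swh.graph and a Java VM."""
    # Deferred import: only required when compression is actually run.
    from swh.graph.webgraph import CompressionStep, compress

    # All keys are optional; anything left out falls back to its default.
    conf = check_conf({"java": "java", "tmp_dir": str(out_dir / "tmp")})
    compress(
        "graph",  # placeholder base name, resolved relative to in_dir
        in_dir,
        out_dir,
        {CompressionStep.MPH, CompressionStep.BV},
        conf,
    )
```

Omitting the steps argument runs every step, and passing an empty dictionary as conf (the default) uses the documented defaults for all keys.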