swh.graph.webgraph module
WebGraph driver
- class swh.graph.webgraph.CompressionStep(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
- EXTRACT_NODES = 1
- MPH = 2
- BV = 3
- BFS = 4
- PERMUTE_BFS = 5
- TRANSPOSE_BFS = 6
- SIMPLIFY = 7
- LLP = 8
- PERMUTE_LLP = 9
- OBL = 10
- COMPOSE_ORDERS = 11
- STATS = 12
- TRANSPOSE = 13
- TRANSPOSE_OBL = 14
- MAPS = 15
- EXTRACT_PERSONS = 16
- MPH_PERSONS = 17
- NODE_PROPERTIES = 18
- MPH_LABELS = 19
- FCL_LABELS = 20
- EDGE_LABELS = 21
- EDGE_LABELS_OBL = 22
- EDGE_LABELS_TRANSPOSE_OBL = 23
- CLEAN_TMP = 24
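As an illustration of working with these members, the following minimal sketch selects steps by name. It uses a local stand-in enum that mirrors the names and values listed above, so it runs without swh.graph installed; in real code you would import CompressionStep from swh.graph.webgraph instead. The parse_steps helper is hypothetical and not part of the module.

```python
from enum import Enum

# Stand-in mirroring the documented members and values (1..24); in real code,
# import CompressionStep from swh.graph.webgraph instead of redefining it.
CompressionStep = Enum(
    "CompressionStep",
    "EXTRACT_NODES MPH BV BFS PERMUTE_BFS TRANSPOSE_BFS SIMPLIFY LLP "
    "PERMUTE_LLP OBL COMPOSE_ORDERS STATS TRANSPOSE TRANSPOSE_OBL MAPS "
    "EXTRACT_PERSONS MPH_PERSONS NODE_PROPERTIES MPH_LABELS FCL_LABELS "
    "EDGE_LABELS EDGE_LABELS_OBL EDGE_LABELS_TRANSPOSE_OBL CLEAN_TMP",
)


def parse_steps(spec):
    """Hypothetical helper: turn "mph, bv" into a set of CompressionStep."""
    names = [name.strip().upper() for name in spec.split(",") if name.strip()]
    # Enum lookup by name raises KeyError for an unknown step name.
    return {CompressionStep[name] for name in names}


selected = parse_steps("extract_nodes, mph, bv")
# Print the selection ordered by numeric value.
for step in sorted(selected, key=lambda s: s.value):
    print(step.value, step.name)
```

The resulting set has the shape expected by the steps parameter of compress().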
- swh.graph.webgraph.compress(graph_name: str, in_dir: pathlib.Path, out_dir: pathlib.Path, steps: typing.Set[swh.graph.webgraph.CompressionStep] = {CompressionStep.BFS, CompressionStep.BV, CompressionStep.CLEAN_TMP, CompressionStep.COMPOSE_ORDERS, CompressionStep.EDGE_LABELS, CompressionStep.EDGE_LABELS_OBL, CompressionStep.EDGE_LABELS_TRANSPOSE_OBL, CompressionStep.EXTRACT_NODES, CompressionStep.EXTRACT_PERSONS, CompressionStep.FCL_LABELS, CompressionStep.LLP, CompressionStep.MAPS, CompressionStep.MPH, CompressionStep.MPH_LABELS, CompressionStep.MPH_PERSONS, CompressionStep.NODE_PROPERTIES, CompressionStep.OBL, CompressionStep.PERMUTE_BFS, CompressionStep.PERMUTE_LLP, CompressionStep.SIMPLIFY, CompressionStep.STATS, CompressionStep.TRANSPOSE, CompressionStep.TRANSPOSE_BFS, CompressionStep.TRANSPOSE_OBL}, conf: typing.Dict[str, str] = {}, progress_cb: typing.Callable[[int, swh.graph.webgraph.CompressionStep], None] = <function <lambda>>)
Graph compression pipeline driver: takes nodes/edges files as input and produces a compressed on-disk representation.
- Parameters:
graph_name – graph base name, relative to in_dir
in_dir – input directory, where the uncompressed graph can be found
out_dir – output directory, where the compressed graph will be stored
steps – compression steps to run (default: all steps)
conf – compression configuration, supporting the following keys (all are optional, so an empty configuration is fine and is the default):
batch_size: batch size for WebGraph transformations; defaults to 1 billion
classpath: Java classpath, defaults to swh-graph JAR only
java: command to run the Java VM, defaults to “java”
java_tool_options: value for JAVA_TOOL_OPTIONS environment variable; defaults to various settings for high memory machines
logback: path to a logback.xml configuration file; if not provided a temporary one will be created and used
max_ram: maximum RAM to use for compression; defaults to available virtual memory
tmp_dir: temporary directory, defaults to the “tmp” subdir of out_dir
object_types: comma-separated list of object types to extract (e.g. ori,snp,rel,rev). Defaults to *.
progress_cb – a callable taking a percentage and a step as arguments, called every time a step starts.
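The parameters above can be put together roughly as follows. This is a sketch under stated assumptions, not a reference invocation: build_conf, make_progress_cb and run_compression are hypothetical helper names, the conf values are illustrative, and whether a given subset of steps can run on its own depends on which earlier outputs already exist, which this page does not specify. Only the compress() signature and the conf keys come from the documentation above.

```python
from pathlib import Path


def build_conf(out_dir):
    """Return a conf dict using only documented keys; all values are strings."""
    return {
        "java": "java",
        "tmp_dir": str(Path(out_dir) / "tmp"),
        "object_types": "ori,snp,rel,rev",
    }


def make_progress_cb(log):
    """Build a callback with the documented (percentage, step) signature."""

    def progress_cb(percentage, step):
        # Accept either an enum member or a plain string for the step.
        name = getattr(step, "name", str(step))
        log.append((percentage, name))
        print(f"[{percentage:3d}%] starting {name}")

    return progress_cb


def run_compression(graph_name, in_dir, out_dir):
    """Call compress(); needs swh.graph and a Java runtime to be installed."""
    # Imported lazily so the helpers above work without swh.graph.
    from swh.graph.webgraph import CompressionStep, compress

    log = []
    compress(
        graph_name,
        Path(in_dir),
        Path(out_dir),
        # Illustrative subset; omit `steps` to run every step (the default).
        steps={CompressionStep.EXTRACT_NODES, CompressionStep.MPH},
        conf=build_conf(out_dir),
        progress_cb=make_progress_cb(log),
    )
    return log
```

Keeping the callback factory separate from the call makes it possible to exercise the logging logic without launching the Java pipeline.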