Command-line interface#

swh graph#

Software Heritage graph tools.

swh graph [OPTIONS] COMMAND [ARGS]...

Options

-C, --config-file <config_file>#

YAML configuration file
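
As the usage line above indicates, -C is an option of the swh graph group itself, so it goes before the subcommand name. The following is a minimal sketch that assumes hypothetical file and directory paths; it only prints the command rather than running it.

```shell
# All paths below are hypothetical placeholders, not defaults of the tool.
config_file="/etc/softwareheritage/graph.yml"
input_dataset="/srv/swh/2022-04-25/orc"
output_directory="/srv/swh/2022-04-25/compressed"

# Group-level options (-C) come before the subcommand (compress);
# subcommand options (-i, -o) come after it.
cmd="swh graph -C $config_file compress -i $input_dataset -o $output_directory"

# Dry run: print the command. Run it with: eval "$cmd"
echo "$cmd"
```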

compress#

Compress a graph using WebGraph

Input: a directory containing a graph dataset in ORC format

Output: a directory containing a WebGraph compressed graph

Compression steps are: (1) extract_nodes, (2) mph, (3) bv, (4) bfs, (5) permute_bfs, (6) transpose_bfs, (7) simplify, (8) llp, (9) permute_llp, (10) obl, (11) compose_orders, (12) stats, (13) transpose, (14) transpose_obl, (15) maps, (16) extract_persons, (17) mph_persons, (18) node_properties, (19) mph_labels, (20) fcl_labels, (21) edge_labels, (22) edge_labels_obl, (23) edge_labels_transpose_obl, (24) clean_tmp. Compression steps can be selected by name or number using --steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are also supported.

swh graph compress [OPTIONS]

Options

-i, --input-dataset <input_dataset>#

Required graph dataset directory, in ORC format

-o, --output-directory <output_directory>#

Required directory where to store compressed graph

-g, --graph-name <NAME>#

name of the output graph (default: ‘graph’)

-s, --steps <STEPS>#

run only these compression steps (default: all steps)
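
To illustrate the options above, here is a sketch of a partial compression run that selects a step range. The directory paths are hypothetical, and the command is only printed, not executed.

```shell
# Hypothetical locations; replace them with your own dataset and output paths.
input_dataset="/srv/swh/2022-04-25/orc"
output_directory="/srv/swh/2022-04-25/compressed"

# Range syntax: run steps 3 through 9, that is, bv through permute_llp.
steps="3-9"

cmd="swh graph compress --input-dataset $input_dataset --output-directory $output_directory --graph-name graph --steps $steps"

# Dry run: print the command. Run it with: eval "$cmd"
echo "$cmd"
```

Steps can also be given by name, for example --steps extract_nodes,mph for the first two steps.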

grpc-serve#

Start the graph gRPC service.

This command uses execve to execute the Java gRPC service.

swh graph grpc-serve [OPTIONS]

Options

-p, --port <PORT>#

port to bind the server on. Defaults to 50091. The host is not configurable for now and is always 0.0.0.0.

-j, --java-home <JAVA_HOME>#

absolute path to the Java Runtime Environment (JRE)

-g, --graph <GRAPH>#

Required compressed graph basename
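
A minimal sketch of starting the gRPC service follows. It assumes that the graph basename is the compress output directory followed by the graph name; that path is hypothetical. The command is only printed here.

```shell
# Assumption: the basename is <output directory>/<graph name> from a
# previous "swh graph compress" run. This path is a hypothetical placeholder.
graph_basename="/srv/swh/2022-04-25/compressed/graph"

# 50091 is the documented default, spelled out here for clarity.
port=50091

cmd="swh graph grpc-serve --graph $graph_basename --port $port"

# Dry run: print the command. Run it with: eval "$cmd"
echo "$cmd"
```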

luigi#

Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.

The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:

swh graph luigi \
        --base-directory ~/tmp/ \
        --dataset-name 2022-12-05_test ListOriginContributors \
        -- \
        RunAll \
        --local-scheduler

This passes RunAll --local-scheduler to Luigi as params.

Or, to compute a derived dataset:

swh graph luigi \
        --graph-base-directory /dev/shm/swh-graph/default/ \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --athena-prefix swh \
        --dataset-name 2022-04-25 \
        --s3-athena-output-location s3://some-bucket/tmp/athena \
        -- \
        --log-level INFO \
        FindEarliestRevisions \
        --scheduler-url http://localhost:50092/ \
        --blob-filter citation

swh graph luigi [OPTIONS] [LUIGI_PARAM]...

Options

--base-directory <base_directory>#

Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.

--base-sensitive-directory <base_sensitive_directory>#

The base directory for any data that should not be publicly available (e.g., because it contains people’s names). For example: /poolswh/softwareheritage/.

--athena-prefix <athena_prefix>#

A prefix for the Athena Database that will be created and/or used. For example: swh.

--s3-prefix <s3_prefix>#

The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.

--max-ram <max_ram>#

Value to pass to -Xmx for Java processes

--batch-size <batch_size>#

Default value for compression tasks handling objects in batch

--grpc-api <grpc_api>#

Default value for the <hostname>:<port> of the gRPC server

--s3-athena-output-location <s3_athena_output_location>#

The S3 location where Athena query output is written. For example: s3://some-bucket/tmp/athena.

--graph-base-directory <graph_base_directory>#

Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/{compressed}/. For example: /dev/shm/swh-graph/default/.

--export-base-directory <export_base_directory>#

Overrides the path of the export to use. Defaults to the value of --base-directory.

--dataset-name <dataset_name>#

Required Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.

--export-name <export_name>#

Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.

--previous-dataset-name <previous_dataset_name>#

When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.

--luigi-config <luigi_config>#

Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.

--retry-luigi-delay <retry_luigi_delay>#

Time to wait before re-running Luigi, if some tasks are pending but stuck.

Arguments

LUIGI_PARAM#

Optional argument(s)

rpc-serve#

run the graph RPC service

swh graph rpc-serve [OPTIONS]

Options

-h, --host <IP>#

host IP address to bind the server on

Default:

0.0.0.0

-p, --port <PORT>#

port to bind the server on

Default:

5009

-g, --graph <GRAPH>#

Required compressed graph basename
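
As an illustration of the options above, this sketch binds the RPC service to the loopback interface only, instead of the default 0.0.0.0. The graph basename is a hypothetical path, and the command is only printed.

```shell
# Hypothetical basename of a compressed graph; replace it with your own.
graph_basename="/srv/swh/2022-04-25/compressed/graph"

# Listen on localhost only; 5009 is the documented default port.
host="127.0.0.1"
port=5009

cmd="swh graph rpc-serve --host $host --port $port --graph $graph_basename"

# Dry run: print the command. Run it with: eval "$cmd"
echo "$cmd"
```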