Command-line interface#
swh graph#
Software Heritage graph tools.
swh graph [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>#
YAML configuration file
compress#
Compress a graph using WebGraph
Input: a directory containing a graph dataset in ORC format
Output: a directory containing a WebGraph compressed graph
Compression steps are: (1) extract_nodes, (2) mph, (3) bv, (4) bfs, (5) permute_bfs, (6) transpose_bfs, (7) simplify, (8) llp, (9) permute_llp, (10) obl, (11) compose_orders, (12) stats, (13) transpose, (14) transpose_obl, (15) maps, (16) extract_persons, (17) mph_persons, (18) node_properties, (19) mph_labels, (20) fcl_labels, (21) edge_labels, (22) edge_labels_obl, (23) edge_labels_transpose_obl, (24) clean_tmp.
Compression steps can be selected by name or number using --steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are also supported.
swh graph compress [OPTIONS]
Options
- -i, --input-dataset <input_dataset>#
Required graph dataset directory, in ORC format
- -o, --output-directory <output_directory>#
Required directory where to store compressed graph
- -g, --graph-name <NAME>#
name of the output graph (default: ‘graph’)
- -s, --steps <STEPS>#
run only these compression steps (default: all steps)
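The step selection syntax described for --steps (names, numbers, comma-separated lists and ranges such as 3-9 or 6-) can be illustrated with a small Python sketch. This is not the parser used by swh graph: the function name parse_steps is hypothetical, and the handling of open-ended ranges, duplicates and ordering is an assumption made for the example.

```python
# Step names in the order listed in the description of `swh graph compress`.
STEPS = [
    "extract_nodes", "mph", "bv", "bfs", "permute_bfs", "transpose_bfs",
    "simplify", "llp", "permute_llp", "obl", "compose_orders", "stats",
    "transpose", "transpose_obl", "maps", "extract_persons", "mph_persons",
    "node_properties", "mph_labels", "fcl_labels", "edge_labels",
    "edge_labels_obl", "edge_labels_transpose_obl", "clean_tmp",
]


def parse_steps(spec: str) -> list[str]:
    """Expand a --steps value such as "3-9", "6-" or "mph,bv" into step names.

    Hypothetical helper: the real CLI may treat edge cases differently.
    """
    selected = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item in STEPS:
            # Selection by name; step numbers are 1-based.
            selected.add(STEPS.index(item) + 1)
        elif "-" in item:
            # Range; a missing bound means "from the first" or "to the last"
            # step (an assumption of this sketch).
            start, _, end = item.partition("-")
            first = int(start) if start else 1
            last = int(end) if end else len(STEPS)
            if not 1 <= first <= last <= len(STEPS):
                raise ValueError(f"invalid step range: {item!r}")
            selected.update(range(first, last + 1))
        else:
            # Selection by number.
            number = int(item)
            if not 1 <= number <= len(STEPS):
                raise ValueError(f"invalid step number: {item!r}")
            selected.add(number)
    # Return the steps in pipeline order, without duplicates.
    return [STEPS[n - 1] for n in sorted(selected)]


print(parse_steps("3-5"))
print(parse_steps("mph,23-"))
```

Under these assumptions, "3-5" selects bv, bfs and permute_bfs, and "mph,23-" selects mph followed by the last two steps.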
download#
Downloads a compressed SWH graph to the given target directory
swh graph download [OPTIONS] TARGET
Options
- --s3-url <s3_url>#
S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’
- --s3-prefix <s3_prefix>#
Base directory of Software Heritage’s graphs on S3
- --name <name>#
Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
- -j, --parallelism <parallelism>#
Number of threads used to download/decompress files.
Arguments
- TARGET#
Required argument
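The default of --s3-url is given above as the template ‘{s3_prefix}/{name}/compressed/’. The following Python sketch shows how such a template could be filled in. The helper name default_s3_url is hypothetical, the prefix is the example value shown for --s3-prefix in the luigi command, the dataset name is illustrative, and stripping a trailing slash from the prefix is an assumption of this sketch rather than documented behavior.

```python
def default_s3_url(s3_prefix: str, name: str) -> str:
    """Fill in the documented template '{s3_prefix}/{name}/compressed/'."""
    # Assumption: a trailing slash on the prefix is dropped so the result
    # does not contain a doubled slash.
    return f"{s3_prefix.rstrip('/')}/{name}/compressed/"


# Illustrative values only.
url = default_s3_url("s3://softwareheritage/graph/", "2022-04-25")
print(url)
```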
grpc-serve#
Start the graph gRPC service.
This command uses execve to execute the Rust gRPC service.
swh graph grpc-serve [OPTIONS]
Options
- -p, --port <PORT>#
port to bind the server on (note: host is not configurable for now and will be 0.0.0.0). Defaults to 50091
- -g, --graph <GRAPH>#
Required compressed graph basename
luigi#
Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.
The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:
swh graph luigi \
--base-directory ~/tmp/ \
--dataset-name 2022-12-05_test ListOriginContributors \
-- \
RunAll \
--local-scheduler
This passes RunAll --local-scheduler to Luigi as its params.
Or, to compute a derived dataset:
swh graph luigi \
--graph-base-directory /dev/shm/swh-graph/default/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--athena-prefix swh \
--dataset-name 2022-04-25 \
--s3-athena-output-location s3://some-bucket/tmp/athena \
-- \
--log-level INFO \
FindEarliestRevisions \
--scheduler-url http://localhost:50092/ \
--blob-filter citation
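The role of the -- separator in the commands above can be sketched in Python: everything before it is meant for the swh CLI, and everything after it is handed to Luigi. This only illustrates the convention; it is not the argument parsing code of swh graph, and split_luigi_params is a hypothetical name.

```python
def split_luigi_params(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split an argument list at the first "--" separator.

    Returns (arguments for the swh CLI, params passed on to Luigi).
    """
    if "--" not in argv:
        # No separator: nothing is explicitly reserved for Luigi.
        return list(argv), []
    index = argv.index("--")
    return argv[:index], argv[index + 1:]


# Arguments taken from the derived-dataset example above (shortened).
swh_args, luigi_params = split_luigi_params([
    "--base-directory", "/poolswh/softwareheritage/vlorentz/",
    "--dataset-name", "2022-04-25",
    "--",
    "--log-level", "INFO",
    "FindEarliestRevisions",
    "--blob-filter", "citation",
])
print("swh options:", swh_args)
print("Luigi params:", luigi_params)
```

Without the separator, an option such as --log-level would be read as an option of swh graph luigi itself.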
swh graph luigi [OPTIONS] [LUIGI_PARAM]...
Options
- --base-directory <base_directory>#
Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/
- --base-sensitive-directory <base_sensitive_directory>#
The base directory for any data that should not be publicly available (e.g., because it contains people’s names). For example: /poolswh/softwareheritage/
- --athena-prefix <athena_prefix>#
A prefix for the Athena database that will be created and/or used. For example: swh
- --s3-prefix <s3_prefix>#
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/
- --max-ram <max_ram>#
Maximum RAM that some scripts will try not to exceed
- --batch-size <batch_size>#
Default value for compression tasks handling objects in batch
- --grpc-api <grpc_api>#
Default value for the <hostname>:<port> of the gRPC server
- --s3-athena-output-location <s3_athena_output_location>#
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/
- --graph-base-directory <graph_base_directory>#
Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/{compressed}/. For example: /dev/shm/swh-graph/default/
- --export-base-directory <export_base_directory>#
Overrides the path of the export to use. Defaults to the value of --base-directory.
- --dataset-name <dataset_name>#
Required Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.
- --export-name <export_name>#
Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.
- --previous-dataset-name <previous_dataset_name>#
When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.
- --luigi-config <luigi_config>#
Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.
- --retry-luigi-delay <retry_luigi_delay>#
Time to wait before re-running Luigi, if some tasks are pending but stuck.
Arguments
- LUIGI_PARAM#
Optional argument(s)
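Several of the options above default to values derived from other options: --graph-base-directory from {base_directory}/{dataset_name}/{compressed}/, --export-base-directory from --base-directory, and --export-name from --dataset-name. A minimal Python sketch of that fallback logic follows. The function name resolve_defaults is hypothetical, and treating the {compressed} placeholder as the literal directory name "compressed" is an assumption, not something this reference states.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_defaults(
    base_directory: str,
    dataset_name: str,
    graph_base_directory: Optional[str] = None,
    export_base_directory: Optional[str] = None,
    export_name: Optional[str] = None,
    compressed: str = "compressed",  # assumed value of the {compressed} placeholder
) -> dict[str, str]:
    """Apply the documented fallbacks for options that were not given."""
    default_graph_dir = PurePosixPath(base_directory) / dataset_name / compressed
    return {
        # Defaults to {base_directory}/{dataset_name}/{compressed}/
        "graph_base_directory": graph_base_directory or f"{default_graph_dir}/",
        # Defaults to the value of --base-directory
        "export_base_directory": export_base_directory or base_directory,
        # Defaults to the value of --dataset-name
        "export_name": export_name or dataset_name,
    }


# Values taken from the examples in the option descriptions.
paths = resolve_defaults("/poolswh/softwareheritage/", "2022-04-25")
for option, value in paths.items():
    print(f"{option}: {value}")
```

Passing graph_base_directory="/dev/shm/swh-graph/default/" replaces only the graph path and leaves the two export values at their fallbacks.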
reindex#
Reindex a SWH GRAPH to the latest graph format.
GRAPH should be composed of the graph folder followed by the graph prefix (by default “graph”), e.g. “graph_folder/graph”.
swh graph reindex [OPTIONS] GRAPH
Options
- --force#
Regenerate files even if they already exist. Implies --ef
- --ef#
Regenerate .ef files even if they already exist
- --debug#
Use debug executables instead of release executables
Arguments
- GRAPH#
Required argument
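The GRAPH argument is the graph folder joined with the graph prefix. The Python sketch below assembles it together with the documented flags into an argument list. The helper names graph_argument and reindex_command are hypothetical; the sketch only builds the command and does not run it.

```python
from pathlib import PurePosixPath


def graph_argument(graph_folder: str, prefix: str = "graph") -> str:
    """Join the graph folder and the graph prefix, e.g. "graph_folder/graph"."""
    return str(PurePosixPath(graph_folder) / prefix)


def reindex_command(
    graph_folder: str,
    prefix: str = "graph",
    force: bool = False,
    ef: bool = False,
    debug: bool = False,
) -> list[str]:
    """Build the argument list for `swh graph reindex` (illustrative only)."""
    command = ["swh", "graph", "reindex"]
    if force:
        # --force implies --ef, so --ef is not added separately.
        command.append("--force")
    elif ef:
        command.append("--ef")
    if debug:
        command.append("--debug")
    command.append(graph_argument(graph_folder, prefix))
    return command


print(reindex_command("graph_folder"))
print(reindex_command("graph_folder", force=True, ef=True))
```

On a machine where the swh CLI is installed, the resulting list could be passed to subprocess.run(command, check=True).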
rpc-serve#
Run the graph RPC service.
swh graph rpc-serve [OPTIONS]
Options
- -h, --host <IP>#
host IP address to bind the server on
- Default: '0.0.0.0'
- -p, --port <PORT>#
port to bind the server on
- Default: 5009
- -g, --graph <GRAPH>#
compressed graph basename