Command-line interface#
swh graph#
Software Heritage graph tools.
swh graph [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>#
YAML configuration file
compress#
Compress a graph using WebGraph
Input: a directory containing a graph dataset in ORC format
Output: a directory containing a WebGraph compressed graph
Compression steps are: (1) extract_nodes, (2) mph, (3) bv, (4) bfs, (5) permute_bfs, (6) transpose_bfs, (7) simplify, (8) llp, (9) permute_llp, (10) obl, (11) compose_orders, (12) stats, (13) transpose, (14) transpose_obl, (15) maps, (16) extract_persons, (17) mph_persons, (18) node_properties, (19) mph_labels, (20) fcl_labels, (21) edge_labels, (22) edge_labels_obl, (23) edge_labels_transpose_obl, (24) clean_tmp. Compression steps can be selected by name or number using --steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are also supported.
swh graph compress [OPTIONS]
Options
- -i, --input-dataset <input_dataset>#
Required graph dataset directory, in ORC format
- -o, --output-directory <output_directory>#
Required directory where to store compressed graph
- -g, --graph-name <NAME>#
name of the output graph (default: ‘graph’)
- -s, --steps <STEPS>#
run only these compression steps (default: all steps)
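The --steps selection syntax described above (names, numbers, and ranges mixed freely) can be sketched as follows; this is an illustrative parser of the documented semantics, not the actual swh-graph implementation:

```python
# Illustrative parser for the --steps syntax described above; a sketch
# of the documented semantics, not the actual swh-graph code.
STEPS = [
    "extract_nodes", "mph", "bv", "bfs", "permute_bfs", "transpose_bfs",
    "simplify", "llp", "permute_llp", "obl", "compose_orders", "stats",
    "transpose", "transpose_obl", "maps", "extract_persons", "mph_persons",
    "node_properties", "mph_labels", "fcl_labels", "edge_labels",
    "edge_labels_obl", "edge_labels_transpose_obl", "clean_tmp",
]

def select_steps(spec):
    """Expand a comma-separated mix of step names, step numbers, and
    number ranges (e.g. '3-9', '6-') into an ordered list of step names."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if part in STEPS:                      # step given by name
            selected.add(STEPS.index(part) + 1)
        elif "-" in part:                      # range like '3-9' or '6-'
            start, _, end = part.partition("-")
            first = int(start) if start else 1
            last = int(end) if end else len(STEPS)
            selected.update(range(first, last + 1))
        else:                                  # single step number
            selected.add(int(part))
    return [STEPS[i - 1] for i in sorted(selected)]
```

For instance, selecting "mph,22-" runs the mph step plus everything from edge_labels_obl to clean_tmp.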
download#
Downloads a compressed SWH graph to the given target directory
swh graph download [OPTIONS] TARGET
Options
- --s3-url <s3_url>#
S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’
- --s3-prefix <s3_prefix>#
Base directory of Software Heritage’s graphs on S3
- --name <name>#
Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
- -j, --parallelism <parallelism>#
Number of threads used to download/decompress files.
Arguments
- TARGET#
Required argument
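How the default --s3-url is derived from --s3-prefix and --name, per the documented template ‘{s3_prefix}/{name}/compressed/’, can be sketched as (the prefix and dataset name below are illustrative values, not guaranteed defaults):

```python
# Sketch of the documented --s3-url default '{s3_prefix}/{name}/compressed/'.
# Illustrative only; not the actual swh-graph download code.
def default_s3_url(s3_prefix, name):
    return f"{s3_prefix.rstrip('/')}/{name}/compressed/"
```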
find-context#
Utility to get the fully qualified SWHID for a given core SWHID. Uses the graph traversal to find the shortest path to an origin, and retains the first seen revision or release as anchor for cnt and dir types.
swh graph find-context [OPTIONS]
Options
- -g, --graph-grpc-server <GRAPH_GRPC_SERVER>#
Graph gRPC server address, as host:port
- Default:
'localhost:50091'
- -c, --content-swhid <CNTSWHID>#
SWHID of the content
- Default:
'swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870'
- -f, --filename <FILENAME>#
Name of file to search for
- Default:
''
- -o, --origin-url <ORIGINURL>#
URL of the origin where we look for a content
- Default:
''
- --all-origins, --no-all-origins#
Compute fqswhid for all origins
- --fqswhid, --no-fqswhid#
Compute fqswhid. If disabled, print only the origins.
- --trace, --no-trace#
Print nodes examined while building fully qualified SWHID.
- --random-origin, --no-random-origin#
Compute fqswhid for a random origin
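As an illustration, the following invocation uses the documented defaults for the server address and content SWHID; it assumes a graph gRPC server is already running at that address:

```shell
# Server address and SWHID are the documented defaults; a gRPC server
# must already be listening on localhost:50091.
swh graph find-context \
    --graph-grpc-server localhost:50091 \
    --content-swhid swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870 \
    --fqswhid
```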
grpc-serve#
Start the graph gRPC service.
This command uses execve to execute the Rust GRPC service.
swh graph grpc-serve [OPTIONS]
Options
- -p, --port <PORT>#
port to bind the server on (note: host is not configurable for now and will be 0.0.0.0). Defaults to 50091
- -g, --graph <GRAPH>#
Required compressed graph basename
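For example, assuming a compressed graph with the hypothetical basename /srv/graph/compressed/graph:

```shell
# /srv/graph/compressed/graph is a hypothetical basename; adjust to
# wherever your compressed graph lives. 50091 is the documented default port.
swh graph grpc-serve -g /srv/graph/compressed/graph -p 50091
```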
list-datasets#
List graph datasets available for download.
Print the names of the Software Heritage graph datasets that can be downloaded with the following command:
$ swh graph download --name <dataset_name> <target_directory>
swh graph list-datasets [OPTIONS]
Options
- --s3-bucket <s3_bucket>#
S3 bucket name containing Software Heritage graph datasets. Defaults to ‘softwareheritage’
luigi#
Calls Luigi with the given task and params, and automatically configures paths based on –base-directory and –dataset-name.
The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:
swh graph luigi \
--base-directory ~/tmp/ \
--dataset-name 2022-12-05_test \
-- \
RunAll \
--local-scheduler
to pass RunAll --local-scheduler as Luigi params.
Or, to compute a derived dataset:
swh graph luigi \
--graph-base-directory /dev/shm/swh-graph/default/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--athena-prefix swh \
--dataset-name 2022-04-25 \
--s3-athena-output-location s3://some-bucket/tmp/athena \
-- \
--log-level INFO \
FindEarliestRevisions \
--scheduler-url http://localhost:50092/ \
--blob-filter citation
swh graph luigi [OPTIONS] [LUIGI_PARAM]...
Options
- --base-directory <base_directory>#
Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.
- --base-sensitive-directory <base_sensitive_directory>#
The base directory for any data that should not be publicly available (e.g. because it contains people’s names). For example: /poolswh/softwareheritage/.
- --athena-prefix <athena_prefix>#
A prefix for the Athena Database that will be created and/or used. For example: swh.
- --s3-prefix <s3_prefix>#
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.
- --max-ram <max_ram>#
Maximum RAM that some scripts will try not to exceed
- --batch-size <batch_size>#
Default value for compression tasks handling objects in batch
- --grpc-api <grpc_api>#
Default value for the <hostname>:<port> of the gRPC server
- --s3-athena-output-location <s3_athena_output_location>#
The S3 “directory” where Athena will write its query results. For example: s3://some-bucket/tmp/athena.
- --graph-base-directory <graph_base_directory>#
Overrides the path of the graph to use. Defaults to {base_directory}/{dataset_name}/compressed/. For example: /dev/shm/swh-graph/default/.
- --export-base-directory <export_base_directory>#
Overrides the path of the export to use. Defaults to the value of --base-directory.
- --dataset-name <dataset_name>#
Required Should be a date and optionally a flavor, which will be used as the directory name. For example: 2022-04-25 or 2022-11-12_staging.
- --parent-dataset-name <parent_dataset_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of the full export (e.g. 2024-08-23) the subdataset should be built from.
- --export-name <export_name>#
Should be a date and optionally a flavor, which will be used as the directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.
- --parent-export-name <parent_export_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of the full export (e.g. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.
- --previous-dataset-name <previous_dataset_name>#
When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.
- --luigi-config <luigi_config>#
Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.
- --retry-luigi-delay <retry_luigi_delay>#
Time to wait before re-running Luigi, if some tasks are pending but stuck.
Arguments
- LUIGI_PARAM#
Optional argument(s)
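The default path layout implied by the options above (the compressed graph living under {base_directory}/{dataset_name}/compressed/) can be sketched as follows; this is illustrative only, not the actual luigi wrapper code:

```python
# Sketch of the default graph-path derivation described in the
# --graph-base-directory help above; illustrative only.
def default_graph_directory(base_directory, dataset_name):
    return f"{base_directory.rstrip('/')}/{dataset_name}/compressed/"
```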
reindex#
Reindex a SWH graph to the latest graph format.
GRAPH should be the graph folder followed by the graph prefix (“graph” by default), e.g. “graph_folder/graph”.
swh graph reindex [OPTIONS] GRAPH
Options
- --force#
Regenerate files even if they already exist. Implies --ef
- --ef#
Regenerate .ef files even if they already exist
- --debug#
Use debug executables instead of release executables
Arguments
- GRAPH#
Required argument
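For example, to regenerate the .ef files of a graph stored in a directory named graph_folder with the default “graph” prefix (the path is illustrative):

```shell
# GRAPH = graph folder + graph prefix, as described above.
swh graph reindex --ef graph_folder/graph
```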
rpc-serve#
Run the graph RPC service.
swh graph rpc-serve [OPTIONS]
Options
- -h, --host <IP>#
host IP address to bind the server on
- Default:
'0.0.0.0'
- -p, --port <PORT>#
port to bind the server on
- Default:
5009
- -g, --graph <GRAPH>#
compressed graph basename
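For example, using a hypothetical graph basename and the documented default host and port:

```shell
# Serve the graph RPC API; /srv/graph/compressed/graph is a
# hypothetical basename, 0.0.0.0:5009 the documented defaults.
swh graph rpc-serve -g /srv/graph/compressed/graph -h 0.0.0.0 -p 5009
```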