Command-line interface#
swh graph#
Software Heritage graph tools.
swh graph [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>#
YAML configuration file
compress#
Compress a graph using WebGraph
Input: a directory containing a graph dataset in ORC format
Output: a directory containing a WebGraph compressed graph
Compression steps are: (1) extract_nodes, (2) mph, (3) bv, (4) bfs, (5) permute_bfs, (6) transpose_bfs, (7) simplify, (8) llp, (9) permute_llp, (10) obl, (11) compose_orders, (12) stats, (13) transpose, (14) transpose_obl, (15) maps, (16) extract_persons, (17) mph_persons, (18) node_properties, (19) mph_labels, (20) fcl_labels, (21) edge_labels, (22) edge_labels_obl, (23) edge_labels_transpose_obl, (24) clean_tmp. Compression steps can be selected by name or number using --steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are also supported.
swh graph compress [OPTIONS]
Options
- -i, --input-dataset <input_dataset>#
Required graph dataset directory, in ORC format
- -o, --output-directory <output_directory>#
Required directory where to store compressed graph
- -g, --graph-name <NAME>#
name of the output graph (default: ‘graph’)
- -s, --steps <STEPS>#
run only these compression steps (default: all steps)
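The --steps selection syntax described above (names, numbers, and ranges mixed freely) can be sketched as follows; this is an illustrative parser of the documented semantics, not the actual swh-graph implementation:

```python
# Illustrative parser for the --steps syntax described above; a sketch
# of the documented semantics, not the actual swh-graph code.
STEPS = [
    "extract_nodes", "mph", "bv", "bfs", "permute_bfs", "transpose_bfs",
    "simplify", "llp", "permute_llp", "obl", "compose_orders", "stats",
    "transpose", "transpose_obl", "maps", "extract_persons", "mph_persons",
    "node_properties", "mph_labels", "fcl_labels", "edge_labels",
    "edge_labels_obl", "edge_labels_transpose_obl", "clean_tmp",
]

def select_steps(spec):
    """Expand a comma-separated mix of step names, step numbers, and
    number ranges (e.g. '3-9', '6-') into an ordered list of step names."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if part in STEPS:                      # step given by name
            selected.add(STEPS.index(part) + 1)
        elif "-" in part:                      # range like '3-9' or '6-'
            start, _, end = part.partition("-")
            first = int(start) if start else 1
            last = int(end) if end else len(STEPS)
            selected.update(range(first, last + 1))
        else:                                  # single step number
            selected.add(int(part))
    return [STEPS[i - 1] for i in sorted(selected)]
```

For instance, selecting "mph,22-" runs the mph step plus everything from edge_labels_obl to clean_tmp.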
download#
Downloads a compressed SWH graph to the given target directory
swh graph download [OPTIONS] TARGET
Options
- --s3-url <s3_url>#
S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’
- --s3-prefix <s3_prefix>#
Base directory of Software Heritage’s graphs on S3
- --name <name>#
Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html
- -j, --parallelism <parallelism>#
Number of threads used to download/decompress files.
Arguments
- TARGET#
Required argument
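How the default --s3-url is derived from --s3-prefix and --name, per the documented template ‘{s3_prefix}/{name}/compressed/’, can be sketched as (the prefix and dataset name below are illustrative values, not guaranteed defaults):

```python
# Sketch of the documented --s3-url default '{s3_prefix}/{name}/compressed/'.
# Illustrative only; not the actual swh-graph download code.
def default_s3_url(s3_prefix, name):
    return f"{s3_prefix.rstrip('/')}/{name}/compressed/"
```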
find-context#
Utility to get the fully qualified SWHID for a given core SWHID. Uses the graph traversal to find the shortest path to an origin, and retains the first seen revision or release as anchor for cnt and dir types.
swh graph find-context [OPTIONS]
Options
- -g, --graph-grpc-server <GRAPH_GRPC_SERVER>#
Graph gRPC server address, as host:port
- Default:
'localhost:50091'
- -c, --content-swhid <CNTSWHID>#
SWHID of the content
- Default:
'swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870'
- -f, --filename <FILENAME>#
Name of file to search for
- Default:
''
- -o, --origin-url <ORIGINURL>#
URL of the origin where we look for a content
- Default:
''
- --all-origins, --no-all-origins#
Compute fqswhid for all origins
- --fqswhid, --no-fqswhid#
Compute fqswhid. If disabled, print only the origins.
- --trace, --no-trace#
Print nodes examined while building fully qualified SWHID.
- --random-origin, --no-random-origin#
Compute fqswhid for a random origin
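As an illustration, the following invocation uses the documented defaults for the server address and content SWHID; it assumes a graph gRPC server is already running at that address:

```shell
# Server address and SWHID are the documented defaults; a gRPC server
# must already be listening on localhost:50091.
swh graph find-context \
    --graph-grpc-server localhost:50091 \
    --content-swhid swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870 \
    --fqswhid
```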
grpc-serve#
Start the graph gRPC service.
This command uses execve to execute the Rust GRPC service.
swh graph grpc-serve [OPTIONS]
Options
- -p, --port <PORT>#
port to bind the server on (note: host is not configurable for now and will be 0.0.0.0). Defaults to 50091
- -g, --graph <GRAPH>#
Required compressed graph basename
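For example, assuming a compressed graph with the hypothetical basename /srv/graph/compressed/graph:

```shell
# /srv/graph/compressed/graph is a hypothetical basename; adjust to
# wherever your compressed graph lives. 50091 is the documented default port.
swh graph grpc-serve -g /srv/graph/compressed/graph -p 50091
```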
list-datasets#
List graph datasets available for download.
Print the names of the Software Heritage graph datasets that can be downloaded with the following command:
$ swh graph download --name <dataset_name> <target_directory>
swh graph list-datasets [OPTIONS]
Options
- --s3-bucket <s3_bucket>#
S3 bucket name containing Software Heritage graph datasets. Defaults to ‘softwareheritage’
luigi#
Calls Luigi with the given task and params, and automatically configures paths based on –base-directory and –dataset-name.
The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:
swh graph luigi \
--base-directory ~/tmp/ \
--dataset-name 2022-12-05_test \
-- \
RunAll \
--local-scheduler
to pass RunAll --local-scheduler as Luigi params.
Or, to compute a derived dataset:
swh graph luigi \
--graph-base-directory /dev/shm/swh-graph/default/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--athena-prefix swh \
--dataset-name 2022-04-25 \
--s3-athena-output-location s3://some-bucket/tmp/athena \
-- \
--log-level INFO \
FindEarliestRevisions \
--scheduler-url http://localhost:50092/ \
--blob-filter citation
swh graph luigi [OPTIONS] [LUIGI_PARAM]...
Options
- --base-directory <base_directory>#
Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.
- --base-sensitive-directory <base_sensitive_directory>#
The base directory for any data that should not be publicly available (e.g. because it contains people’s names). For example: /poolswh/softwareheritage/.
- --athena-prefix <athena_prefix>#
A prefix for the Athena Database that will be created and/or used. For example: swh.
- --s3-prefix <s3_prefix>#
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.
- --max-ram <max_ram>#
Maximum RAM that some scripts will try not to exceed
- --batch-size <batch_size>#
Default value for compression tasks handling objects in batch
- --grpc-api <grpc_api>#
Default value for the <hostname>:<port> of the gRPC server
- --s3-athena-output-location <s3_athena_output_location>#
The S3 “directory” where Athena will write its query results. For example: s3://some-bucket/tmp/athena.
- --graph-base-directory <graph_base_directory>#
Overrides the path of the graph to use. Defaults to {base_directory}/{dataset_name}/compressed/. For example: /dev/shm/swh-graph/default/.
- --export-base-directory <export_base_directory>#
Overrides the path of the export to use. Defaults to the value of --base-directory.
- --dataset-name <dataset_name>#
Required Should be a date and optionally a flavor, which will be used as the directory name. For example: 2022-04-25 or 2022-11-12_staging.
- --parent-dataset-name <parent_dataset_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of the full export (e.g. 2024-08-23) the subdataset should be built from.
- --export-name <export_name>#
Should be a date and optionally a flavor, which will be used as the directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.
- --parent-export-name <parent_export_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of the full export (e.g. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.
- --previous-dataset-name <previous_dataset_name>#
When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.
- --luigi-config <luigi_config>#
Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.
- --retry-luigi-delay <retry_luigi_delay>#
Time to wait before re-running Luigi, if some tasks are pending but stuck.
Arguments
- LUIGI_PARAM#
Optional argument(s)
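The default path layout implied by the options above (the compressed graph living under {base_directory}/{dataset_name}/compressed/) can be sketched as follows; this is illustrative only, not the actual luigi wrapper code:

```python
# Sketch of the default graph-path derivation described in the
# --graph-base-directory help above; illustrative only.
def default_graph_directory(base_directory, dataset_name):
    return f"{base_directory.rstrip('/')}/{dataset_name}/compressed/"
```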
reindex#
Reindex a SWH graph to the latest graph format.
GRAPH should be the graph folder followed by the graph prefix (“graph” by default), e.g. “graph_folder/graph”.
swh graph reindex [OPTIONS] GRAPH
Options
- --force#
Regenerate files even if they already exist. Implies --ef
- --ef#
Regenerate .ef files even if they already exist
- --debug#
Use debug executables instead of release executables
Arguments
- GRAPH#
Required argument
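For example, to regenerate the .ef files of a graph stored in a directory named graph_folder with the default “graph” prefix (the path is illustrative):

```shell
# GRAPH = graph folder + graph prefix, as described above.
swh graph reindex --ef graph_folder/graph
```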
rpc-serve#
Run the graph RPC service.
swh graph rpc-serve [OPTIONS]
Options
- -h, --host <IP>#
host IP address to bind the server on
- Default:
'0.0.0.0'
- -p, --port <PORT>#
port to bind the server on
- Default:
5009
- -g, --graph <GRAPH>#
compressed graph basename
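For example, using a hypothetical graph basename and the documented default host and port:

```shell
# Serve the graph RPC API; /srv/graph/compressed/graph is a
# hypothetical basename, 0.0.0.0:5009 the documented defaults.
swh graph rpc-serve -g /srv/graph/compressed/graph -h 0.0.0.0 -p 5009
```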