Command-line interface#

swh graph#

Software Heritage graph tools.

swh graph [OPTIONS] COMMAND [ARGS]...

Options

-C, --config-file <config_file>#

YAML configuration file

compress#

Compress a graph using WebGraph

Input: a directory containing a graph dataset in ORC format

Output: a directory containing a WebGraph compressed graph

Compression steps are: (1) extract_nodes, (2) mph, (3) bv, (4) bfs, (5) permute_bfs, (6) transpose_bfs, (7) simplify, (8) llp, (9) permute_llp, (10) obl, (11) compose_orders, (12) stats, (13) transpose, (14) transpose_obl, (15) maps, (16) extract_persons, (17) mph_persons, (18) node_properties, (19) mph_labels, (20) fcl_labels, (21) edge_labels, (22) edge_labels_obl, (23) edge_labels_transpose_obl, (24) clean_tmp. Compression steps can be selected by name or number using --steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are also supported.

swh graph compress [OPTIONS]

Options

-i, --input-dataset <input_dataset>#

Required graph dataset directory, in ORC format

-o, --output-directory <output_directory>#

Required directory where to store compressed graph

-g, --graph-name <NAME>#

name of the output graph (default: ‘graph’)

-s, --steps <STEPS>#

run only these compression steps (default: all steps)
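For example, a full compression run followed by a partial re-run restricted to a step range might look like this (the dataset paths below are hypothetical):

```shell
# Full compression of an ORC export into a WebGraph compressed graph
# (input/output paths are hypothetical):
swh graph compress \
    --input-dataset /srv/dataset/2022-12-07/orc/ \
    --output-directory /srv/graph/2022-12-07/compressed/ \
    --graph-name graph

# Re-run only steps 12 through 15 (stats, transpose, transpose_obl, maps):
swh graph compress \
    -i /srv/dataset/2022-12-07/orc/ \
    -o /srv/graph/2022-12-07/compressed/ \
    --steps 12-15
```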

download#

Downloads a compressed SWH graph to the given target directory

swh graph download [OPTIONS] TARGET

Options

--s3-url <s3_url>#

S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’

--s3-prefix <s3_prefix>#

Base directory of Software Heritage’s graphs on S3

--name <name>#

Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html

-j, --parallelism <parallelism>#

Number of threads used to download/decompress files.

Arguments

TARGET#

Required argument
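A typical invocation, with a hypothetical dataset name and target directory:

```shell
# Download the 2022-12-07 dataset using 4 threads
# (dataset name and target path are hypothetical):
swh graph download --name 2022-12-07 --parallelism 4 /srv/graph/2022-12-07/
```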

find-context#

Utility to get the fully qualified SWHID for a given core SWHID. Uses the graph traversal to find the shortest path to an origin, and retains the first seen revision or release as anchor for cnt and dir types.

swh graph find-context [OPTIONS]

Options

-g, --graph-grpc-server <GRAPH_GRPC_SERVER>#

Graph gRPC server address, as host:port

Default:

'localhost:50091'

-c, --content-swhid <CNTSWHID>#

SWHID of the content

Default:

'swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870'

-f, --filename <FILENAME>#

Name of file to search for

Default:

''

-o, --origin-url <ORIGINURL>#

URL of the origin where we look for a content

Default:

''

--all-origins, --no-all-origins#

Compute fqswhid for all origins

--fqswhid, --no-fqswhid#

Compute fqswhid. If disabled, print only the origins.

--trace, --no-trace#

Print nodes examined while building fully qualified SWHID.

--random-origin, --no-random-origin#

Compute fqswhid for a random origin
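For instance, to compute the fully qualified SWHID of a content against a local gRPC server (the content SWHID below is the documented default; the server address assumes a gRPC server running on the default port):

```shell
swh graph find-context \
    --graph-grpc-server localhost:50091 \
    --content-swhid swh:1:cnt:3b997e8ef2e38d5b31fb353214a54686e72f0870 \
    --fqswhid
```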

grpc-serve#

Start the graph gRPC service.

This command uses execve to execute the Rust gRPC service.

swh graph grpc-serve [OPTIONS]

Options

-p, --port <PORT>#

port to bind the server on (note: host is not configurable for now and will be 0.0.0.0). Defaults to 50091

-g, --graph <GRAPH>#

Required compressed graph basename
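For example, to serve a compressed graph on the default port (the path is hypothetical; "graph" is the basename shared by the compressed graph files):

```shell
swh graph grpc-serve --port 50091 --graph /srv/graph/2022-12-07/compressed/graph
```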

list-datasets#

List graph datasets available for download.

Print the names of the Software Heritage graph datasets that can be downloaded with the following command:

$ swh graph download --name <dataset_name> <target_directory>

swh graph list-datasets [OPTIONS]

Options

--s3-bucket <s3_bucket>#

S3 bucket name containing Software Heritage graph datasets. Defaults to ‘softwareheritage’
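For example, listing datasets from the default bucket, then from an explicitly named one:

```shell
swh graph list-datasets
swh graph list-datasets --s3-bucket softwareheritage
```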

luigi#

Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.

The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:

swh graph luigi \
        --base-directory ~/tmp/ \
        --dataset-name 2022-12-05_test ListOriginContributors \
        -- \
        RunAll \
        --local-scheduler

to pass RunAll --local-scheduler as Luigi params

Or, to compute a derived dataset:

swh graph luigi \
        --graph-base-directory /dev/shm/swh-graph/default/ \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --athena-prefix swh \
        --dataset-name 2022-04-25 \
        --s3-athena-output-location s3://some-bucket/tmp/athena \
        -- \
        --log-level INFO \
        FindEarliestRevisions \
        --scheduler-url http://localhost:50092/ \
        --blob-filter citation

swh graph luigi [OPTIONS] [LUIGI_PARAM]...

Options

--base-directory <base_directory>#

Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.

--base-sensitive-directory <base_sensitive_directory>#

The base directory for any data that should not be publicly available (e.g. because it contains people’s names). For example: /poolswh/softwareheritage/.

--athena-prefix <athena_prefix>#

A prefix for the Athena Database that will be created and/or used. For example: swh.

--s3-prefix <s3_prefix>#

The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.

--max-ram <max_ram>#

Maximum RAM that some scripts will try not to exceed

--batch-size <batch_size>#

Default value for compression tasks handling objects in batch

--grpc-api <grpc_api>#

Default value for the <hostname>:<port> of the gRPC server

--s3-athena-output-location <s3_athena_output_location>#

The S3 “directory” where Athena query results will be written. For example: s3://some-bucket/tmp/athena.

--graph-base-directory <graph_base_directory>#

Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/{compressed}/. For example: /dev/shm/swh-graph/default/.

--export-base-directory <export_base_directory>#

Overrides the path of the export to use. Defaults to the value of --base-directory.

--dataset-name <dataset_name>#

Required Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.

--parent-dataset-name <parent_dataset_name>#

When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from.

--export-name <export_name>#

Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.

--parent-export-name <parent_export_name>#

When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.

--previous-dataset-name <previous_dataset_name>#

When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.

--luigi-config <luigi_config>#

Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.

--retry-luigi-delay <retry_luigi_delay>#

Time to wait before re-running Luigi, if some tasks are pending but stuck.

Arguments

LUIGI_PARAM#

Optional argument(s)

reindex#

Reindex a SWH GRAPH to the latest graph format.

GRAPH should be composed of the graph folder followed by the graph prefix (by default “graph”), e.g. “graph_folder/graph”.

swh graph reindex [OPTIONS] GRAPH

Options

--force#

Regenerate files even if they already exist. Implies --ef

--ef#

Regenerate .ef files even if they already exist

--debug#

Use debug executables instead of release executables

Arguments

GRAPH#

Required argument
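For example, forcing a full reindex of a compressed graph (the path is hypothetical; the final "graph" component is the graph prefix, not a directory):

```shell
swh graph reindex --force /srv/graph/2022-12-07/compressed/graph
```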

rpc-serve#

Run the graph RPC service.

swh graph rpc-serve [OPTIONS]

Options

-h, --host <IP>#

host IP address to bind the server on

Default:

'0.0.0.0'

-p, --port <PORT>#

port to bind the server on

Default:

5009

-g, --graph <GRAPH>#

compressed graph basename
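A sketch of a typical invocation, binding the documented default host and port (the graph path is hypothetical):

```shell
swh graph rpc-serve --host 0.0.0.0 --port 5009 --graph /srv/graph/2022-12-07/compressed/graph
```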