Command-line interface#

swh datasets#

Software Heritage datasets tools.

swh datasets [OPTIONS] COMMAND [ARGS]...

Options

-C, --config-file <config_file>#

YAML configuration file
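
As the usage line above shows, global options such as -C must be given before the subcommand. For example, with a configuration file at a hypothetical path:

$ swh datasets -C ~/.config/swh/datasets.yml list-datasets   # config path is illustrative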

download#

Downloads a compressed SWH graph to the given target directory

swh datasets download [OPTIONS] TARGET

Options

--s3-url <s3_url>#

S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’

--s3-prefix <s3_prefix>#

Base directory of Software Heritage’s graphs on S3

--name <name>#

Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html

-j, --parallelism <parallelism>#

Number of threads used to download/decompress files.

Arguments

TARGET#

Required argument
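
For example, a typical invocation might look like the following; the dataset name and target directory are placeholders, and available names can be listed with the list-datasets subcommand below:

$ swh datasets download --name 2022-12-07 ./swh-graph/   # name and target are illustrative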

list-datasets#

List graph datasets available for download.

Print the names of the Software Heritage graph datasets that can be downloaded with the following command:

$ swh datasets download --name <dataset_name> <target_directory>

The list may contain datasets that are not suitable for production, or not yet fully available. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html for the official list of datasets, along with release notes.

swh datasets list-datasets [OPTIONS]

Options

--s3-bucket <s3_bucket>#

S3 bucket name containing Software Heritage graph datasets. Defaults to ‘softwareheritage’
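
For example, to list the available dataset names, either from the default bucket or from an explicitly named one:

$ swh datasets list-datasets
$ swh datasets list-datasets --s3-bucket softwareheritage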

luigi#

Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.

The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:

swh datasets luigi \
        --base-directory ~/tmp/ \
        --dataset-name 2022-12-05_test \
        -- \
        RunAll \
        --local-scheduler

to pass RunAll --local-scheduler as Luigi params.

Or, to compute a derived dataset:

swh datasets luigi \
        --graph-base-directory /dev/shm/swh-graph/default/ \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --athena-prefix swh \
        --dataset-name 2022-04-25 \
        --s3-athena-output-location s3://some-bucket/tmp/athena \
        -- \
        --log-level INFO \
        FindEarliestRevisions \
        --scheduler-url http://localhost:50092/ \
        --blob-filter citation

swh datasets luigi [OPTIONS] [LUIGI_PARAM]...

Options

--base-directory <base_directory>#

Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.

--base-sensitive-directory <base_sensitive_directory>#

The base directory for any data that should not be publicly available (eg. because it contains people’s names). For example: /poolswh/softwareheritage/.

--athena-prefix <athena_prefix>#

A prefix for the Athena Database that will be created and/or used. For example: swh.

--s3-prefix <s3_prefix>#

The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.

--max-ram <max_ram>#

Maximum RAM that some scripts will try not to exceed

--batch-size <batch_size>#

Default value for compression tasks handling objects in batch

--grpc-api <grpc_api>#

Default value for the <hostname>:<port> of the gRPC server

--s3-athena-output-location <s3_athena_output_location>#

The S3 “directory” where Athena query results are written. For example: s3://some-bucket/tmp/athena.

--graph-base-directory <graph_base_directory>#

Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/compressed/. For example: /dev/shm/swh-graph/default/.

--export-base-directory <export_base_directory>#

Overrides the path of the export to use. Defaults to the value of --base-directory.

--dataset-name <dataset_name>#

Required Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.

--parent-dataset-name <parent_dataset_name>#

When generating a subdataset (eg. 2024-08-23-python3k), this is the name of a full export (eg. 2024-08-23) the subdataset should be built from.

--export-name <export_name>#

Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.

--parent-export-name <parent_export_name>#

When generating a subdataset (eg. 2024-08-23-python3k), this is the name of a full export (eg. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.

--previous-dataset-name <previous_dataset_name>#

When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.

--luigi-config <luigi_config>#

Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.

--retry-luigi-delay <retry_luigi_delay>#

Time to wait before re-running Luigi, if some tasks are pending but stuck.

Arguments

LUIGI_PARAM#

Optional argument(s)

rpc-serve#

Run the graph RPC service.

swh datasets rpc-serve [OPTIONS]

Options

-h, --host <IP>#

host IP address to bind the server on

Default:

'0.0.0.0'

-p, --port <PORT>#

port to bind the server on

Default:

5009

-g, --graph <GRAPH>#

compressed graph basename
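
For example, a sketch of serving a locally available compressed graph on the default host and port (the graph basename is a placeholder for an actual compressed graph):

$ swh datasets rpc-serve -g ./compressed/graph   # graph basename is illustrative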