Command-line interface#
swh datasets#
Software Heritage datasets tools.
swh datasets [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>#
YAML configuration file
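For example, a configuration file can be passed to any of the subcommands below; the path used here is only an illustrative placeholder:
$ swh datasets -C ~/.config/swh/datasets.yml list-datasets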
download#
Downloads a compressed SWH graph to the given target directory
swh datasets download [OPTIONS] TARGET
Options
- --s3-url <s3_url>#
S3 directory containing the graph to download. Defaults to ‘{s3_prefix}/{name}/compressed/’
- --s3-prefix <s3_prefix>#
Base directory of Software Heritage’s graphs on S3
- --name <name>#
Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html
- -j, --parallelism <parallelism>#
Number of threads used to download/decompress files.
Arguments
- TARGET#
Required argument
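A sketch of a typical invocation combining the options above; the target directory is a placeholder to adapt, and the dataset name should be one of the names printed by list-datasets:
$ swh datasets download --name 2022-04-25 -j 4 /srv/swh/graph/2022-04-25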
list-datasets#
List graph datasets available for download.
Print the names of the Software Heritage graph datasets that can be downloaded with the following command:
$ swh datasets download --name <dataset_name> <target_directory>
The list may contain datasets that are not suitable for production, or not yet fully available. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html for the official list of datasets, along with release notes.
swh datasets list-datasets [OPTIONS]
Options
- --s3-bucket <s3_bucket>#
S3 bucket name containing Software Heritage graph datasets. Defaults to ‘softwareheritage’
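Putting the two commands together, a minimal workflow is to list the available dataset names and then download one of them to a local directory:
$ swh datasets list-datasets
$ swh datasets download --name <dataset_name> <target_directory>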
luigi#
Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.
The list of Luigi params should be prefixed with -- so they are not interpreted by the swh CLI. For example:
swh datasets luigi \
    --base-directory ~/tmp/ \
    --dataset-name 2022-12-05_test ListOriginContributors \
    -- \
    RunAll \
    --local-scheduler
to pass RunAll --local-scheduler as Luigi params.
Or, to compute a derived dataset:
swh datasets luigi \
    --graph-base-directory /dev/shm/swh-graph/default/ \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --athena-prefix swh \
    --dataset-name 2022-04-25 \
    --s3-athena-output-location s3://some-bucket/tmp/athena \
    -- \
    --log-level INFO \
    FindEarliestRevisions \
    --scheduler-url http://localhost:50092/ \
    --blob-filter citation
swh datasets luigi [OPTIONS] [LUIGI_PARAM]...
Options
- --base-directory <base_directory>#
Required The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.
- --base-sensitive-directory <base_sensitive_directory>#
The base directory for any data that should not be publicly available (e.g. because it contains people’s names). For example: /poolswh/softwareheritage/.
- --athena-prefix <athena_prefix>#
A prefix for the Athena Database that will be created and/or used. For example: swh.
- --s3-prefix <s3_prefix>#
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.
- --max-ram <max_ram>#
Maximum RAM that some scripts will try not to exceed
- --batch-size <batch_size>#
Default value for compression tasks handling objects in batch
- --grpc-api <grpc_api>#
Default value for the <hostname>:<port> of the gRPC server
- --s3-athena-output-location <s3_athena_output_location>#
The base S3 “directory” where Athena query results will be written. For example: s3://some-bucket/tmp/athena.
- --graph-base-directory <graph_base_directory>#
Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/compressed/. For example: /dev/shm/swh-graph/default/.
- --export-base-directory <export_base_directory>#
Overrides the path of the export to use. Defaults to the value of --base-directory.
- --dataset-name <dataset_name>#
Required Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.
- --parent-dataset-name <parent_dataset_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from.
- --export-name <export_name>#
Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.
- --parent-export-name <parent_export_name>#
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.
- --previous-dataset-name <previous_dataset_name>#
When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.
- --luigi-config <luigi_config>#
Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.
- --retry-luigi-delay <retry_luigi_delay>#
Time to wait before re-running Luigi, if some tasks are pending but stuck.
Arguments
- LUIGI_PARAM#
Optional argument(s)
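As a further sketch built only from the options documented above (paths and dataset names are placeholders), a dataset can be regenerated while reusing results from a previous run:
swh datasets luigi \
    --base-directory /poolswh/softwareheritage/ \
    --dataset-name 2022-12-05_test \
    --previous-dataset-name 2022-04-25 \
    -- \
    RunAll \
    --local-scheduler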
rpc-serve#
Run the graph RPC service.
swh datasets rpc-serve [OPTIONS]
Options
- -h, --host <IP>#
host IP address to bind the server on
- Default:
'0.0.0.0'
- -p, --port <PORT>#
port to bind the server on
- Default:
5009
- -g, --graph <GRAPH>#
compressed graph basename
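For example, assuming a compressed graph has already been downloaded locally, the service might be started as follows; the graph basename is a placeholder:
$ swh datasets rpc-serve -g /srv/swh/graph/2022-04-25/compressed/graph -p 5009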