Command-line interface
swh datasets
Software Heritage datasets tools.
Usage
swh datasets [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>
YAML configuration file
download-export
Downloads a SWH ORC dataset to the given target directory
Usage
swh datasets download-export [OPTIONS] TARGET_DIR
Options
- -j, --parallelism <parallelism>
Number of threads used to download/decompress files.
- --name <name>
Required. Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html
- --s3-prefix <s3_prefix>
Base directory of Software Heritage’s graphs on S3
Arguments
- TARGET_DIR
Required argument
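The --name value is described above as an ISO8601 date with an optional suffix. A minimal sketch of a wrapper that checks this shape before starting a long download might look as follows. The helper names is_dataset_name and download_export are hypothetical, as is the assumption that a suffix is introduced by "_" or "-" and the default of 4 threads; none of this is part of the swh CLI.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 looks like YYYY-MM-DD, optionally
# followed by "_suffix" or "-suffix". The separator rule is an assumption,
# and month/day ranges are not checked.
is_dataset_name() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][_-]?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: validates the name, then runs download-export.
# Usage: download_export NAME TARGET_DIR [PARALLELISM]
download_export() {
    if ! is_dataset_name "$1"; then
        echo "not a dataset name: $1" >&2
        return 2
    fi
    # Default of 4 threads is an arbitrary choice for this sketch.
    swh datasets download-export --name "$1" -j "${3:-4}" "$2"
}
```

With these definitions loaded, a call such as download_export 2022-04-25 ./export 8 would run the download with 8 threads (the target path is a placeholder), while a malformed name is rejected before any network access.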
download-graph
Downloads a compressed SWH graph to the given target directory
Usage
swh datasets download-graph [OPTIONS] TARGET_DIR
Options
- -j, --parallelism <parallelism>
Number of threads used to download/decompress files.
- --name <name>
Required. Name of the dataset to download. This is an ISO8601 date, optionally with a suffix. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html
- --s3-prefix <s3_prefix>
Base directory of Software Heritage’s graphs on S3
Arguments
- TARGET_DIR
Required argument
list
List datasets available for download.
Print the names of the Software Heritage datasets (exports or compressed graphs) that can be downloaded with the following commands:
$ swh datasets download-graph --name <dataset_name> <target_directory>
$ swh datasets download-export --name <dataset_name> <target_directory>
The list may contain datasets that are not suitable for production, or not yet fully available. See https://docs.softwareheritage.org/devel/swh-export/graph/dataset.html for the official list of datasets, along with release notes.
Usage
swh datasets list [OPTIONS]
Options
- --s3-bucket <s3_bucket>
S3 bucket name containing Software Heritage datasets. Defaults to ‘softwareheritage’
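Because dataset names start with an ISO8601 date, sorting them lexically also sorts them chronologically. The following sketch uses this to pick the most recent name that is a bare date. It assumes that swh datasets list prints one name per line, which the text above does not guarantee, and the function name latest_plain_dataset is hypothetical. Ignoring suffixed names is a choice made for this sketch, not a rule of the CLI.

```shell
#!/bin/sh
# Hypothetical helper: reads dataset names on stdin (one per line, an
# assumption about the output of `swh datasets list`) and prints the
# newest name that is a bare YYYY-MM-DD date, ignoring suffixed names.
# Prints nothing if no such name is found.
latest_plain_dataset() {
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -n 1
}
```

It could be used as swh datasets list | latest_plain_dataset. The result should still be checked against the official list of datasets, since the newest entry may not be fully available yet.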
luigi
Calls Luigi with the given task and params, and automatically configures paths based on --base-directory and --dataset-name.
The list of Luigi params should be prefixed with -- so they are not interpreted
by the swh CLI. For example:
swh datasets luigi \
--base-directory ~/tmp/ \
--dataset-name 2022-12-05_test ListOriginContributors \
-- \
RunAll \
--local-scheduler
to pass RunAll --local-scheduler as Luigi params.
Or, to compute a derived dataset:
swh datasets luigi \
--graph-base-directory /dev/shm/swh-graph/default/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--athena-prefix swh \
--dataset-name 2022-04-25 \
--s3-athena-output-location s3://some-bucket/tmp/athena \
-- \
--log-level INFO \
FindEarliestRevisions \
--scheduler-url http://localhost:50092/ \
--blob-filter citation
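The role of the -- separator in the examples above can be illustrated with a small shell sketch that splits an argument list at the first --, labelling what stays on the swh side and what is forwarded as Luigi params. The function split_at_separator is hypothetical and only mimics the convention; it is not how the swh CLI parses its arguments.

```shell
#!/bin/sh
# Hypothetical illustration: print each argument before the first "--"
# with a "swh:" prefix and each argument after it with a "luigi:" prefix.
# The separator itself is not printed.
split_at_separator() {
    side=swh
    for arg in "$@"; do
        if [ "$side" = swh ] && [ "$arg" = "--" ]; then
            side=luigi
            continue
        fi
        printf '%s: %s\n' "$side" "$arg"
    done
}

split_at_separator --dataset-name 2022-04-25 -- --log-level INFO FindEarliestRevisions
```

In this call, --dataset-name and 2022-04-25 are labelled as swh arguments, while --log-level, INFO and FindEarliestRevisions are labelled as Luigi params, even though --log-level looks like an option.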
Usage
swh datasets luigi [OPTIONS] [LUIGI_PARAM]...
Options
- --base-directory <base_directory>
Required. The base directory where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: /poolswh/softwareheritage/.
- --base-sensitive-directory <base_sensitive_directory>
The base directory for any data that should not be publicly available (e.g. because it contains people’s names). For example: /poolswh/softwareheritage/.
- --athena-prefix <athena_prefix>
A prefix for the Athena Database that will be created and/or used. For example: swh.
- --s3-prefix <s3_prefix>
The base S3 “directory” where all datasets and compressed graphs are. Its subdirectories should be named after a date (and optional flavor). For example: s3://softwareheritage/graph/.
- --max-ram <max_ram>
Maximum RAM that some scripts will try not to exceed
- --batch-size <batch_size>
Default value for compression tasks handling objects in batch
- --grpc-api <grpc_api>
Default value for the <hostname>:<port> of the gRPC server
- --s3-athena-output-location <s3_athena_output_location>
The S3 location where Athena query results are written. For example: s3://some-bucket/tmp/athena.
- --graph-base-directory <graph_base_directory>
Overrides the path of the graph to use. Defaults to the value of {base_directory}/{dataset_name}/{compressed}/. For example: /dev/shm/swh-graph/default/.
- --export-base-directory <export_base_directory>
Overrides the path of the export to use. Defaults to the value of --base-directory.
- --dataset-name <dataset_name>
Required. Should be a date and optionally a flavor, which will be used as directory name. For example: 2022-04-25 or 2022-11-12_staging.
- --parent-dataset-name <parent_dataset_name>
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from.
- --export-name <export_name>
Should be a date and optionally a flavor, which will be used as directory name for the export (not the compressed graph). For example: 2022-04-25 or 2022-11-12_staging. Defaults to the value of --dataset-name.
- --parent-export-name <parent_export_name>
When generating a subdataset (e.g. 2024-08-23-python3k), this is the name of a full export (e.g. 2024-08-23) the subdataset should be built from. Defaults to the value of --parent-dataset-name.
- --previous-dataset-name <previous_dataset_name>
When regenerating a derived dataset, this can be set to the name of a previous dataset the derived dataset was generated for. Some results from the previously generated dataset will be reused to speed up regeneration.
- --base-rust-executable-dir <base_rust_executable_dir>
Where to search for Rust executables that are not in $PATH
- --luigi-config <luigi_config>
Extra options to add to luigi.cfg, following the same format. This overrides any option that would otherwise be set automatically.
- --retry-luigi-delay <retry_luigi_delay>
Time to wait before re-running Luigi, if some tasks are pending but stuck.
Arguments
- LUIGI_PARAM
Optional argument(s)
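Two of the options above fall back to another option when omitted: --export-name defaults to the value of --dataset-name, and --parent-export-name defaults to the value of --parent-dataset-name. The fallback can be sketched with shell parameter expansion. The function and variable names below are hypothetical, and the sketch only mirrors the documented defaults, not the actual implementation.

```shell
#!/bin/sh
# Hypothetical sketch of the documented fallbacks between name options.
# Usage: resolve_names DATASET_NAME [EXPORT_NAME] [PARENT_DATASET_NAME] [PARENT_EXPORT_NAME]
# An empty or missing optional argument falls back to its documented default.
resolve_names() {
    dataset_name=$1
    export_name=${2:-$dataset_name}
    parent_dataset_name=${3:-}
    parent_export_name=${4:-$parent_dataset_name}
    printf 'dataset=%s export=%s parent_dataset=%s parent_export=%s\n' \
        "$dataset_name" "$export_name" "$parent_dataset_name" "$parent_export_name"
}

# Subdataset built from a full export, with both export names left unset.
resolve_names 2024-08-23-python3k "" 2024-08-23
```

Here the export name resolves to 2024-08-23-python3k and the parent export name to 2024-08-23, matching the defaults described for the two options.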