Command-line interface

Shared command-line interface

swh

Command line interface for Software Heritage.

swh [OPTIONS] COMMAND [ARGS]...

Options
  -l, --log-level <log_levels>
      Log level (defaults to INFO). Can override the log level for a specific
      module by using the specific.module:LOGLEVEL syntax (e.g. --log-level
      swh.core:DEBUG will enable DEBUG logging for swh.core).
  --log-config <log_config>
      Python YAML logging configuration file.
  --sentry-dsn <sentry_dsn>
      DSN of the Sentry instance to report to.
db

Software Heritage database generic tools.

swh db [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.

create

Create a database for the Software Heritage <module>, and potentially execute
superuser-level initialization steps.

Example:

  swh db create -d swh-test storage

If you want to specify non-default PostgreSQL connection parameters, please
provide them using standard environment variables or by means of a properly
crafted libpq connection URI. See the psql(1) man page (section ENVIRONMENT)
for details.

Note: this command requires a PostgreSQL connection with superuser permissions.

Examples:

  PGPORT=5434 swh db create indexer
  swh db create -d postgresql://superuser:passwd@pghost:5433/swh-storage storage

swh db create [OPTIONS] MODULE

Options
  -d, --db-name <db_name>
      Database name.
      Default: softwareheritage-dev
  -T, --template <template>
      Template database from which to build this database.
      Default: template1

Arguments
  MODULE
      Required argument
init

Initialize a database for the Software Heritage <module>.

Example:

  swh db init -d swh-test storage

If you want to specify non-default PostgreSQL connection parameters, please
provide them using standard environment variables. See the psql(1) man page
(section ENVIRONMENT) for details.

Examples:

  PGPORT=5434 swh db init indexer
  swh db init -d postgresql://user:passwd@pghost:5433/swh-storage storage
  swh db init --flavor read_replica -d swh-storage storage

swh db init [OPTIONS] MODULE

Options
  -d, --db-name <db_name>
      Database name.
      Default: softwareheritage-dev
  --flavor <flavor>
      Database flavor.

Arguments
  MODULE
      Required argument

init-admin

Execute superuser-level initialization steps (e.g. pg extensions, admin
functions, …).

Example:

  PGPASSWORD=… swh db init-admin -d swh-test scheduler

If you want to specify non-default PostgreSQL connection parameters, please
provide them using standard environment variables or by means of a properly
crafted libpq connection URI. See the psql(1) man page (section ENVIRONMENT)
for details.

Note: this command requires a PostgreSQL connection with superuser permissions
(e.g. postgres, swh-admin, …).

Examples:

  PGPORT=5434 swh db init-admin scheduler
  swh db init-admin -d postgresql://superuser:passwd@pghost:5433/swh-scheduler scheduler

swh db init-admin [OPTIONS] MODULE

Options
  -d, --db-name <db_name>
      Database name.
      Default: softwareheritage-dev

Arguments
  MODULE
      Required argument
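Putting these commands together, a typical bootstrap of a module database runs the superuser-level steps first, then the regular initialization. A sketch, not a definitive recipe: the roles, passwords, host, and port below are illustrative placeholders.

```shell
# 1. Superuser-level steps (extensions, admin functions) as a privileged role:
PGPASSWORD=secret swh db init-admin \
  -d postgresql://postgres@pghost:5433/swh-storage storage

# 2. Regular schema initialization as the application role:
swh db init -d postgresql://swh:passwd@pghost:5433/swh-storage storage
```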
deposit

Deposit main command.

swh deposit [OPTIONS] COMMAND [ARGS]...

admin

Server administration tasks (manipulate users or collections).

swh deposit admin [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Optional extra configuration file.
  --platform <platform>
      Development or production platform.
      Choices: development | production
collection

Manipulate collections.

swh deposit admin collection [OPTIONS] COMMAND [ARGS]...

create

swh deposit admin collection create [OPTIONS]

Options
  --name <name>
      Required. Collection's name.

list

List existing collections.

This entrypoint is not paginated yet, as there are not many entries.

swh deposit admin collection list [OPTIONS]
deposit

Manipulate deposits.

swh deposit admin deposit [OPTIONS] COMMAND [ARGS]...

reschedule

Reschedule the deposit loading.

This will:

- check that the deposit's status is something reasonable (failed or done);
  that means the checks have passed but something went wrong during the
  loading (failed: loading failed; done: loading ok but, for some reason such
  as a bug, we need to reschedule it)
- reset the deposit's status to 'verified' (prior to any loading but after
  the checks, which are fine) and remove the various archive identifiers
  (swh-id, …)
- trigger the loading task again through the scheduler

swh deposit admin deposit reschedule [OPTIONS]

Options
  --deposit-id <deposit_id>
      Required. Deposit identifier.
user

Manipulate users.

swh deposit admin user [OPTIONS] COMMAND [ARGS]...

create

Create a user with the needed information (password, collection).

If the collection does not exist, it is created alongside.

The password is stored encrypted using Django's utilities.

swh deposit admin user create [OPTIONS]

Options
  --username <username>
      Required. User's name.
  --password <password>
      Required. Desired user's password (plain).
  --firstname <firstname>
      User's first name.
  --lastname <lastname>
      User's last name.
  --email <email>
      User's email.
  --collection <collection>
      User's collection.
  --provider-url <provider_url>
      Provider URL.
  --domain <domain>
      The domain.

exists

Check if a user exists.

swh deposit admin user exists [OPTIONS] USERNAME

Arguments
  USERNAME
      Required argument

list

List existing users.

This entrypoint is not paginated yet, as there are not many entries.

swh deposit admin user list [OPTIONS]
metadata-only

Deposit a metadata-only upload.

swh deposit metadata-only [OPTIONS]

Options
  --url <url>
      (Optional) Deposit server API endpoint. By default,
      https://deposit.softwareheritage.org/1
  --username <username>
      Required. User's name.
  --password <password>
      Required. User's associated password.
  --metadata <metadata_path>
      Required. Path to an XML metadata file.
  -f, --format <output_format>
      Output format for results.
      Choices: logging | yaml | json
status

Deposit's status.

swh deposit status [OPTIONS]

Options
  --url <url>
      (Optional) Deposit server API endpoint. By default,
      https://deposit.softwareheritage.org/1
  --username <username>
      Required. User's name.
  --password <password>
      Required. User's associated password.
  --deposit-id <deposit_id>
      Required. Deposit identifier.
  -f, --format <output_format>
      Output format for results.
      Choices: logging | yaml | json
upload

Software Heritage Public Deposit Client.

Create or update a deposit through the command line.

More documentation can be found at
https://docs.softwareheritage.org/devel/swh-deposit/getting-started.html.

swh deposit upload [OPTIONS]

Options
  --url <url>
      (Optional) Deposit server API endpoint. By default,
      https://deposit.softwareheritage.org/1
  --username <username>
      Required. User's name.
  --password <password>
      Required. User's associated password.
  --archive <archive>
      (Optional) Software archive to deposit.
  --metadata <metadata>
      (Optional) Path to an XML metadata file. If not provided, this will use
      a file named <archive>.metadata.xml
  --archive-deposit, --no-archive-deposit
      Deprecated (ignored).
  --metadata-deposit, --no-metadata-deposit
      Deprecated (ignored).
  --collection <collection>
      (Optional) User's collection. If not provided, this will be fetched.
  --slug <slug>
      (Deprecated) (Optional) External system information identifier. If not
      provided, it will be generated.
  --create-origin <create_origin>
      (Optional) Origin URL to attach information to. To be used alongside
      --name and --author. This will be generated alongside the metadata to
      provide to the deposit server.
  --partial, --no-partial
      (Optional) The deposit will be partial; other deposits will have to
      take place to finalize it.
  --deposit-id <deposit_id>
      (Optional) Update an existing partial deposit with its identifier.
  --swhid <swhid>
      (Optional) Update an existing completed deposit (status done) with new
      metadata.
  --replace, --no-replace
      (Optional) Update by replacing the existing metadata of a deposit.
  --verbose, --no-verbose
      Verbose mode.
  --name <name>
      Software name.
  --author <author>
      Software author(s); this can be repeated as many times as there are
      authors.
  -f, --format <output_format>
      Output format for results.
      Choices: logging | yaml | json
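For example, a first deposit of an archive with its metadata, followed by a status check, might look like this. A sketch only: the credentials, file names, and deposit id are illustrative placeholders.

```shell
# Deposit an archive together with its metadata file:
swh deposit upload --username hal --password secret \
  --archive project-1.0.tar.gz \
  --metadata project-1.0.metadata.xml \
  --format json

# Later, check its processing status by deposit id:
swh deposit status --username hal --password secret --deposit-id 42
```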
fs

Software Heritage virtual file system.

swh fs [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file (default: /home/jenkins/.config/swh/global.yml)

mount

Mount the Software Heritage virtual file system at PATH.

If specified, objects referenced by the given SWHIDs will be prefetched and
used to populate the virtual file system (VFS). Otherwise the VFS will be
populated on demand, when accessing its content.

swh fs mount [OPTIONS] PATH [SWHID]...

Options
  -f, --foreground, -d, --daemon
      Whether to run FUSE attached to the console (foreground) or daemonized
      in the background (default: daemon).

Arguments
  PATH
      Required argument
  [SWHID]...
      Optional argument(s)
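A typical session mounts the VFS, browses an object by SWHID, then unmounts. A sketch under assumptions: the `archive/` top-level layout follows the swh-fuse documentation, and the SWHID value below is a made-up placeholder.

```shell
mkdir swhfs
swh fs mount swhfs/   # daemonized by default

# Objects appear under archive/<SWHID>; e.g. list a directory object:
ls swhfs/archive/swh:1:dir:0000000000000000000000000000000000000000/

# Unmount when done:
fusermount -u swhfs
```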
graph

Software Heritage graph tools.

swh graph [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      YAML configuration file.
api-client

Client for the graph REST service.

swh graph api-client [OPTIONS]

Options
  --host <host>
      Graph server host.
  --port <port>
      Graph server port.
cachemount

Cache the mmapped files of the compressed graph in a tmpfs.

This command creates a new directory at the path given by CACHE that has the
same structure as the compressed graph basename, except it copies the files
that require mmap access (*.graph) but uses symlinks from the source for all
the other files (.map, .bin, …).

The command outputs the path to the memory cache directory (particularly
useful when relying on the default value).

swh graph cachemount [OPTIONS]

Options
  -g, --graph <GRAPH>
      Required. Compressed graph basename.
  -c, --cache <CACHE>
      Memory cache path (defaults to /dev/shm/swh-graph/default).
compress

Compress a graph using WebGraph.

Input: a pair of files g.nodes.csv.gz, g.edges.csv.gz

Output: a directory containing a WebGraph compressed graph

Compression steps are: (1) mph, (2) bv, (3) bv_obl, (4) bfs, (5) permute,
(6) permute_obl, (7) stats, (8) transpose, (9) transpose_obl, (10) maps,
(11) clean_tmp. Compression steps can be selected by name or number using
--steps, separating them with commas; step ranges (e.g., 3-9, 6-, etc.) are
also supported.

swh graph compress [OPTIONS]

Options
  -g, --graph <GRAPH>
      Required. Input graph basename.
  -o, --outdir <DIR>
      Required. Directory where to store the compressed graph.
  -s, --steps <STEPS>
      Run only these compression steps (default: all steps).
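For instance, a full compression run, and then a re-run of only the transposition-related steps selected by range, could look like this (the dataset and output paths are illustrative placeholders):

```shell
# Full pipeline, from g.nodes.csv.gz / g.edges.csv.gz to a compressed graph:
swh graph compress -g /srv/dataset/g -o /srv/compressed

# Re-run only steps 8-9 (transpose, transpose_obl), selected by range:
swh graph compress -g /srv/dataset/g -o /srv/compressed --steps 8-9
```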
map

Manage swh-graph on-disk maps.

swh graph map [OPTIONS] COMMAND [ARGS]...

dump

Dump a binary SWHID<->node map to textual format.

swh graph map dump [OPTIONS] FILENAME

Options
  -t, --type <map_type>
      Required. Type of map to dump.
      Choices: swhid2node | node2swhid

Arguments
  FILENAME
      Required argument
lookup

Lookup identifiers using on-disk maps.

Depending on the identifier type, either look up a SWHID in a SWHID->node map
(and return the node integer identifier) or, vice versa, look up a node
integer identifier in a node->SWHID map (and return the SWHID). The desired
behavior is chosen depending on the syntax of each given identifier.

Identifiers can be passed either directly on the command line or on standard
input, separated by blanks. Logical lines (as returned by readline()) in
stdin will be preserved in stdout.

swh graph map lookup [OPTIONS] [IDENTIFIERS]...

Options
  -g, --graph <GRAPH>
      Required. Compressed graph basename.

Arguments
  IDENTIFIERS
      Optional argument(s)
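Since lookups auto-detect the direction, SWHIDs and node ids can be mixed freely, on the command line or via stdin. A sketch only: the graph basename, SWHID, and node ids below are illustrative.

```shell
# Direct lookup: a SWHID resolves to its node id, a node id to its SWHID.
swh graph map lookup -g /srv/compressed/g \
  swh:1:rev:9d76c8b7fd120ab76d50cf13ed521a5a1458084d 42

# Batch mode: one identifier per logical line on stdin, preserved on stdout.
printf '42\n1234\n' | swh graph map lookup -g /srv/compressed/g
```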
restore

Restore a binary SWHID<->node map from textual format.

swh graph map restore [OPTIONS] FILENAME

Options
  -t, --type <map_type>
      Required. Type of map to restore.
      Choices: swhid2node | node2swhid
  -l, --length <length>
      Map size in number of logical records (required for node2swhid maps).

Arguments
  FILENAME
      Required argument
write

Write a map to disk sequentially.

Read from stdin a textual SWHID->node mapping (for swhid2node, or a simple
sequence of SWHIDs for node2swhid) and write it to disk in the requested
binary map format.

Note that no sorting is applied, so the input should already be sorted as
required by the chosen map type (by SWHID for swhid2node, by int for
node2swhid).

swh graph map write [OPTIONS] FILENAME

Options
  -t, --type <map_type>
      Required. Type of map to write.
      Choices: swhid2node | node2swhid

Arguments
  FILENAME
      Required argument
identify

Compute the Software Heritage persistent identifier (SWHID) for the given
source code object(s).

For more details about SWHIDs, see the Software Heritage persistent
identifiers documentation.

Tip: you can pass "-" to identify the content of standard input.

swh identify [OPTIONS] OBJECTS...

Options
  --dereference, --no-dereference
      Follow (or not) symlinks for OBJECTS passed as arguments
      (default: follow).
  --filename, --no-filename
      Show/hide file name (default: show).
  -t, --type <obj_type>
      Type of object to identify (default: auto).
      Choices: auto | content | directory | origin | snapshot
  -x, --exclude <PATTERN>
      Exclude directories using glob patterns (e.g., '*.git' to exclude all
      .git directories).
  -v, --verify <SWHID>
      Reference identifier to be compared with the computed one.

Arguments
  OBJECTS
      Required argument(s)
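The content SWHID printed by `swh identify` embeds the sha1 of a git-style blob header followed by the file bytes, so it can be cross-checked with standard tools. A minimal sketch (the `swh identify` line is shown as a comment and assumes the swh CLI is installed):

```shell
# A 6-byte file ("hello" plus newline):
printf 'hello\n' > hello.txt

# swh identify -t content hello.txt would print
#   swh:1:cnt:<hex>  hello.txt
# where <hex> is the sha1 of "blob <size>\0<data>", i.e. the git blob id:
printf 'blob 6\0hello\n' | sha1sum
# → ce013625030ba8dba906f756967f9e9ca394464a  -
```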
indexer

Software Heritage Indexer tools.

The Indexer is used to mine the content of the archive and extract derived
information from archive source code artifacts.

swh indexer [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.
journal-client

Listens for new objects from the SWH Journal and schedules tasks to run
relevant indexers (currently, only origin-intrinsic-metadata) on these new
objects.

swh indexer journal-client [OPTIONS]

Options
  -s, --scheduler-url <scheduler_url>
      URL of the scheduler API.
  --origin-metadata-task-type <origin_metadata_task_type>
      Name of the task running the origin metadata indexer.
  --broker <brokers>
      Kafka broker to connect to.
  --prefix <prefix>
      Prefix of Kafka topic names to read from.
  --group-id <group_id>
      Consumer/group id for reading from Kafka.
  -m, --stop-after-objects <stop_after_objects>
      Maximum number of objects to replay. Default is to run forever.
mapping

Manage Software Heritage Indexer mappings.

swh indexer mapping [OPTIONS] COMMAND [ARGS]...

rpc-serve

Starts a Software Heritage Indexer RPC HTTP server.

swh indexer rpc-serve [OPTIONS] CONFIG_PATH

Options
  --host <host>
      Host to run the server on.
  --port <port>
      Binding port of the server.
  --debug, --nodebug
      Indicates if the server should run in debug mode.

Arguments
  CONFIG_PATH
      Required argument
schedule

Manipulate Software Heritage Indexer tasks via the SWH Scheduler's API.

swh indexer schedule [OPTIONS] COMMAND [ARGS]...

Options
  -s, --scheduler-url <scheduler_url>
      URL of the scheduler API.
  -i, --indexer-storage-url <indexer_storage_url>
      URL of the indexer storage API.
  -g, --storage-url <storage_url>
      URL of the (graph) storage API.
  --dry-run, --no-dry-run
      List only what would be scheduled.
reindex_origin_metadata

Schedules indexing tasks for origins that were already indexed.

swh indexer schedule reindex_origin_metadata [OPTIONS]

Options
  -b, --batch-size <origin_batch_size>
      Number of origins per task.
      Default: 10
  -t, --tool-id <tool_ids>
      Restrict the search of old metadata to this/these tool ids.
  -m, --mapping <mappings>
      Mapping(s) that should be re-scheduled (e.g. 'npm', 'gemspec', 'maven').
  --task-type <task_type>
      Name of the task type to schedule.
      Default: index-origin-metadata
lister

Software Heritage Lister tools.

swh lister [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.

run

Trigger a full listing run for a particular forge instance. The output of
this listing results in "oneshot" tasks in the scheduler db, with a priority
defined by the user.

swh lister run [OPTIONS] [OPTIONS]...

Options
  -l, --lister <lister>
      Lister to run.
      Choices: bitbucket | cgit | cran | debian | gitea | github | gitlab |
      gnu | launchpad | npm | packagist | phabricator | pypi

Arguments
  OPTIONS
      Optional argument(s)
loader

Loader CLI tools.

swh loader [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.
objstorage

Software Heritage Objstorage tools.

swh objstorage [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.

import

Import a local directory into an existing objstorage.

swh objstorage import [OPTIONS] DIRECTORY...

Arguments
  DIRECTORY
      Required argument(s)
replay

Fill a destination Object Storage using a journal stream.

This is typically used for a mirror configuration, by reading a Journal and
retrieving objects from an existing source ObjStorage.

There can be several 'replayers' filling a given ObjStorage, as long as they
use the same group-id. You can use the KAFKA_GROUP_INSTANCE_ID environment
variable to use KIP-345 static group membership.

This service retrieves the object ids to copy from the 'content' topic. It
will only copy an object's content if the object's description in the kafka
message has the status:visible set.

--exclude-sha1-file may be used to exclude some hashes to speed up the replay
in case many of the contents are already in the destination objstorage. It
must contain a concatenation of all (sha1) hashes, and it must be sorted.
This file will not be fully loaded into memory at any given time, so it can
be arbitrarily large.

--check-dst sets whether the replayer should check in the destination
ObjStorage before copying an object. You can turn that off if you know you're
copying to an empty ObjStorage.

swh objstorage replay [OPTIONS]

Options
  -n, --stop-after-objects <stop_after_objects>
      Stop after processing this many objects. Default is to run forever.
  --exclude-sha1-file <exclude_sha1_file>
      File containing a sorted array of hashes to be excluded.
  --check-dst, --no-check-dst
      Check whether the destination contains the object before copying.
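A mirror replayer therefore needs a journal client section plus source and destination object storages in its configuration file. The sketch below is an assumption-laden illustration: the key names (journal_client, objstorage_src, objstorage_dst) and the broker address, group id, and paths should all be checked against the swh-objstorage documentation for your version.

```yaml
journal_client:
  cls: kafka
  brokers:
    - kafka1.example.org:9092       # illustrative broker
  group_id: mirror-objstorage-replayer  # shared by all replayers of one mirror
  prefix: swh.journal.objects
objstorage_src:
  cls: remote
  url: http://objstorage.example.org:5003/
objstorage_dst:
  cls: pathslicing
  root: /srv/softwareheritage/objects
  slicing: "0:2/2:4"
```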
scanner

Software Heritage Scanner tools.

Configuration file:

swh scanner [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      YAML configuration file.

db

Manage the local knowledge base for swh-scanner.

swh scanner db [OPTIONS] COMMAND [ARGS]...

import

Create an SQLite database of known SWHIDs from a textual list of SWHIDs.

swh scanner db import [OPTIONS]

Options
  -i, --input <INPUT_FILE>
      Required. A file containing SWHIDs.
  -o, --output <OUTPUT_DB_FILE>
      Required. The name of the generated sqlite database.
  -s, --chunk-size <SIZE>
      The chunk size.
      Default: 10000
serve

Start an API service using the sqlite database generated with the
"db import" command.

swh scanner db serve [OPTIONS]

Options
  -h, --host <HOST>
      The host of the API server.
      Default: 127.0.0.1
  -p, --port <PORT>
      The port of the API server.
      Default: 5011
  -f, --db-file <DB_FILE>
      An sqlite database file (it can be generated with 'swh scanner db
      import').
      Default: SWHID_DB.sqlite
scan

Scan a source code project to discover files and directories already present
in the archive.

swh scanner scan [OPTIONS] ROOT_PATH

Options
  -u, --api-url <API_URL>
      URL for the API request.
  -x, --exclude <PATTERN>
      Exclude directories using glob patterns (e.g., '*.git' to exclude all
      .git directories).
  -f, --output-format <out_fmt>
      The output format.
      Default: text
      Choices: text | json | ndjson | sunburst
  -i, --interactive
      Show the result in a dashboard.

Arguments
  ROOT_PATH
      Required argument
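A typical scan excludes VCS metadata and emits machine-readable output for further processing. A sketch: the project path is an illustrative placeholder, and piping into jq assumes jq is installed.

```shell
# Scan a checkout, skipping .git directories, and emit JSON:
swh scanner scan -x '*.git' -f json ~/src/myproject | jq .
```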
scheduler

Software Heritage Scheduler tools.

Use a local scheduler instance by default (plugged into the main scheduler
db).

swh scheduler [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.
  -d, --database <database>
      Scheduling database DSN (implies cls is 'local').
  -u, --url <url>
      Scheduler's URL access (implies cls is 'remote').
  --no-stdout
      Do NOT output logs on the console.
celery-monitor

Monitoring of Celery.

swh scheduler celery-monitor [OPTIONS] COMMAND [ARGS]...

Options
  --timeout <timeout>
      Timeout for celery remote control.
  --pattern <pattern>
      Celery destination pattern.

list-running

List running tasks on the lister workers.

swh scheduler celery-monitor list-running [OPTIONS]

Options
  --format <format>
      Output format.
      Choices: pretty | csv

ping-workers

Check which workers respond to the celery remote control.

swh scheduler celery-monitor ping-workers [OPTIONS]
journal-client

Keep the origin visits stats table up to date from a swh kafka journal.

swh scheduler journal-client [OPTIONS]

Options
  -m, --stop-after-objects <stop_after_objects>
      Maximum number of objects to replay. Default is to run forever.
origin

Manipulate listed origins.

swh scheduler origin [OPTIONS] COMMAND [ARGS]...

grab-next

Grab the next COUNT origins to visit using the TYPE loader from the listed
origins table.

swh scheduler origin grab-next [OPTIONS] TYPE COUNT

Options
  -p, --policy <policy>
      Scheduling policy.
  -f, --fields <fields>
      Listed origin fields to print on output.
  --with-header, --without-header
      Print the CSV header?

Arguments
  TYPE
      Required argument
  COUNT
      Required argument
schedule-next

Send the next COUNT origin visits of the TYPE loader to the scheduler as
one-shot tasks.

swh scheduler origin schedule-next [OPTIONS] TYPE COUNT

Options
  -p, --policy <policy>
      Scheduling policy.

Arguments
  TYPE
      Required argument
  COUNT
      Required argument
update-metrics

Update the scheduler metrics on listed origins.

Examples:

  swh scheduler origin update-metrics
  swh scheduler origin update-metrics --lister github
  swh scheduler origin update-metrics --lister phabricator --instance llvm

swh scheduler origin update-metrics [OPTIONS]

Options
  --lister <lister>
      Only update metrics for this lister.
  --instance <instance>
      Only update metrics for this lister instance.
rpc-serve

Starts a swh-scheduler API HTTP server.

swh scheduler rpc-serve [OPTIONS]

Options
  --host <host>
      Host to run the scheduler server API on.
  --port <port>
      Binding port of the server.
  --debug, --nodebug
      Indicates if the server should run in debug mode. Defaults to True if
      log-level is DEBUG, False otherwise.
simulator

Scheduler simulator.

swh scheduler simulator [OPTIONS] COMMAND [ARGS]...

fill-test-data

Fill the scheduler with test data for simulation purposes.

swh scheduler simulator fill-test-data [OPTIONS]

Options
  -n, --num-origins <num_origins>
      Number of listed origins to add.
run

Run the scheduler simulator.

By default, the simulation runs forever. You can cap the simulated runtime
with the --runtime option, and you can always press Ctrl+C to interrupt the
running simulation.

'task_scheduler' is the "classic" task-based scheduler; 'origin_scheduler' is
the new origin-visit-aware simulator. The latter uses --policy to decide
which origins to schedule first, based on information from listers.

swh scheduler simulator run [OPTIONS]

Options
  -s, --scheduler <scheduler>
      Scheduler to simulate.
      Choices: task_scheduler | origin_scheduler
  -p, --policy <policy>
      Scheduling policy to simulate (only for origin_scheduler).
  -t, --runtime <runtime>
      Simulated runtime.
  -P, --plots, --no-plots
      Show results as plots (with plotille).
  -o, --csv <csvfile>
      Export results to a CSV file.
start-listener

Starts a swh-scheduler listener service.

This service is responsible for listening to task lifecycle events and
handling their workflow status in the database.

swh scheduler start-listener [OPTIONS]

start-runner

Starts a swh-scheduler runner service.

This process is responsible for checking for ready-to-run tasks and
scheduling them.

swh scheduler start-runner [OPTIONS]

Options
  -p, --period <period>
      Period (in s) at which pending tasks are checked and executed. Set to 0
      (default) for a one-shot run.
task

Manipulate tasks.

swh scheduler task [OPTIONS] COMMAND [ARGS]...

add

Schedule one task from arguments.

The first argument is the name of the task type; further ones are positional
and keyword argument(s) of the task, in YAML format. Keyword args are of the
form key=value.

Usage sample:

  swh-scheduler --database 'service=swh-scheduler' task add list-pypi

  swh-scheduler --database 'service=swh-scheduler' task add \
    list-debian-distribution --policy=oneshot distribution=stretch

Note: if the priority is not given, the task won't have the priority set,
which is considered the lowest priority level.

swh scheduler task add [OPTIONS] TYPE [OPTIONS]...

Options
  -p, --policy <policy>
      Choices: recurring | oneshot
  -P, --priority <priority>
      Choices: low | normal | high
  -n, --next-run <next_run>

Arguments
  TYPE
      Required argument
  OPTIONS
      Optional argument(s)
archive

Archive task/task_run whose (task_type is 'oneshot' and task_status is
'completed') or (task_type is 'recurring' and task_status is 'disabled').

With the --dry-run flag set (default), only list those.

swh scheduler task archive [OPTIONS]

Options
  -b, --before <before>
      Tasks whose ended date is anterior will be archived. Defaults to the
      current month's first day.
  -a, --after <after>
      Tasks whose ended date is after the specified date will be archived.
      Defaults to the prior month's first day.
  --batch-index <batch_index>
      Batch size of tasks to read from the db to archive.
  --bulk-index <bulk_index>
      Batch size of tasks to bulk index.
  --batch-clean <batch_clean>
      Batch size of tasks to clean after archival.
  --dry-run, --no-dry-run
      Default to listing only what would be archived.
  --verbose
      Verbose mode.
  --cleanup, --no-cleanup
      Clean up archived tasks (default).
  --start-from <start_from>
      (Optional) Default page to start from.
list

List tasks.

swh scheduler task list [OPTIONS]

Options
  -i, --task-id <ID>
      List only tasks whose id is ID.
  -t, --task-type <TYPE>
      List only tasks of type TYPE.
  -l, --limit <limit>
      The maximum number of tasks to fetch.
  -s, --status <STATUS>
      List tasks whose status is STATUS.
      Choices: next_run_not_scheduled | next_run_scheduled | completed |
      disabled
  -p, --policy <policy>
      List tasks whose policy is POLICY.
      Choices: recurring | oneshot
  -P, --priority <priority>
      List tasks whose priority is PRIORITY.
      Choices: all | low | normal | high
  -b, --before <DATETIME>
      Limit to tasks supposed to run before the given date.
  -a, --after <DATETIME>
      Limit to tasks supposed to run after the given date.
  -r, --list-runs
      Also list past executions of each task.
list-pending

List the tasks that are going to be run.

You can override the number of tasks to fetch.

swh scheduler task list-pending [OPTIONS] TASK_TYPES...

Options
  -l, --limit <limit>
      The maximum number of tasks to fetch.
  -b, --before <before>
      List all jobs supposed to run before the given date.

Arguments
  TASK_TYPES
      Required argument(s)
respawn

Respawn tasks.

Respawn tasks given by their ids (see the 'task list' command to find task
ids) at the given date (immediately by default).

Eg.

  swh-scheduler task respawn 1 3 12

swh scheduler task respawn [OPTIONS] TASK_IDS...

Options
  -n, --next-run <DATETIME>
      Respawn the selected tasks at this date.

Arguments
  TASK_IDS
      Required argument(s)
schedule

Schedule tasks from a CSV input file.

The following columns are expected, and can be set through the -c option:

- type: the type of the task to be scheduled (mandatory)
- args: the arguments passed to the task (JSON list, defaults to an empty
  list)
- kwargs: the keyword arguments passed to the task (JSON object, defaults to
  an empty dict)
- next_run: the date at which the task should run (datetime, defaults to now)

The CSV can be read either from a named file, or from stdin (use - as
filename).

Usage sample:

  cat scheduling-task.txt | \
    python3 -m swh.scheduler.cli --database 'service=swh-scheduler-dev' \
    task schedule --columns type --columns kwargs --columns policy \
    --delimiter ';' -

swh scheduler task schedule [OPTIONS] FILE

Options
  -c, --columns <columns>
      Columns present in the CSV file.
      Choices: type | args | kwargs | policy | next_run
  -d, --delimiter <delimiter>

Arguments
  FILE
      Required argument
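Tying the columns together, a minimal input file for the usage sample above could be built like this. A sketch only: the task type, repository URL, and file name are illustrative placeholders, and the actual `swh scheduler` invocation (shown as a comment) assumes a reachable scheduler database.

```shell
# One task per line; columns type;kwargs;policy, with ';' as delimiter:
cat > scheduling-task.txt <<'EOF'
load-git;{"url": "https://example.org/repo.git"};oneshot
EOF

# Then feed it to the scheduler:
#   swh scheduler task schedule --columns type --columns kwargs \
#     --columns policy --delimiter ';' scheduling-task.txt
```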
schedule_origins

Schedules tasks for origins that are already known.

The first argument is the name of the task type; further ones are keyword
argument(s) of the task in the form key=value, where value is in YAML format.

Usage sample:

  swh-scheduler --database 'service=swh-scheduler' task schedule_origins \
    index-origin-metadata

swh scheduler task schedule_origins [OPTIONS] TYPE [OPTIONS]...

Options
  -b, --batch-size <origin_batch_size>
      Number of origins per task.
      Default: 10
  --page-token <page_token>
      Only schedule tasks for origins whose ID is greater.
      Default: 0
  --limit <limit>
      Limit the task scheduling to up to this number of tasks.
  -g, --storage-url <storage_url>
      URL of the (graph) storage API.
  --dry-run, --no-dry-run
      List only what would be scheduled.

Arguments
  TYPE
      Required argument
  OPTIONS
      Optional argument(s)
task-type

Manipulate task types.

swh scheduler task-type [OPTIONS] COMMAND [ARGS]...

add

Create a new task type.

swh scheduler task-type add [OPTIONS] TYPE TASK_NAME DESCRIPTION

Options
  -i, --default-interval <default_interval>
      Default interval ("90 days" by default).
  --min-interval <min_interval>
      Minimum interval (default interval if not set).
  -i, --max-interval <max_interval>
      Maximal interval (default interval if not set).
  -f, --backoff-factor <backoff_factor>
      Backoff factor.

Arguments
  TYPE
      Required argument
  TASK_NAME
      Required argument
  DESCRIPTION
      Required argument
list

swh scheduler task-type list [OPTIONS]

Options
  -v, --verbose
      Verbose mode.
  -t, --task_type <task_type>
      List task types of the given type.
  -n, --task_name <task_name>
      List task types of the given backend task name.
register

Register missing task-type entries in the scheduler, according to the tasks
declared in each loaded worker (e.g. lister, loader, …) plugin.

swh scheduler task-type register [OPTIONS]

Options
  -p, --plugins <plugins>
      Registers task-types for the provided plugins. Defaults to all.
      Choices: all | loader.svn | loader.mercurial |
      loader.mercurial_from_disk | loader.git | loader.git_disk |
      loader.archive | loader.cran | loader.debian | loader.deposit |
      loader.nixguix | loader.npm | loader.pypi | lister.bitbucket |
      lister.cgit | lister.cran | lister.debian | lister.gitea |
      lister.github | lister.gitlab | lister.gnu | lister.launchpad |
      lister.npm | lister.packagist | lister.phabricator | lister.pypi |
      deposit.worker
search

Software Heritage Search tools.

swh search [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.

journal-client

swh search journal-client [OPTIONS] COMMAND [ARGS]...

objects

Listens for new objects from the SWH Journal and schedules tasks to run
relevant indexers (currently, origin and origin_visit) on these new objects.

swh search journal-client objects [OPTIONS]

Options
  -m, --stop-after-objects <stop_after_objects>
      Maximum number of objects to replay. Default is to run forever.
  -o, --object-type <object_type>
      Default list of object types to subscribe to.
  -p, --prefix <prefix>
      Topic prefix to use (e.g. swh.journal.indexed).
storage

Software Heritage Storage tools.

swh storage [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file.
  --check-config <check_config>
      Check the configuration of the storage at startup for read or write
      access; if set, this overrides the value present in the configuration
      file, if any. Defaults to 'read' for the 'backfill' command, and
      'write' for the 'rpc-serve' and 'replay' commands.
      Choices: no | read | write
backfill

Run the backfiller.

The backfiller lists objects from a Storage and produces journal entries from
there.

Typically used to rebuild a journal or compensate for missing objects in a
journal (e.g. due to a downtime of the latter).

The configuration file requires the following entries:

- brokers: a list of kafka endpoints (the journal) in which entries will be
  added.
- storage_dbconn: URL to connect to the storage DB.
- prefix: the prefix of the topics (topics will be <prefix>.<object_type>).
- client_id: the kafka client ID.

swh storage backfill [OPTIONS] OBJECT_TYPE

Options
  --start-object <start_object>
  --end-object <end_object>
  --dry-run

Arguments
  OBJECT_TYPE
      Required argument
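Based on the entries listed above, a minimal backfiller configuration file might look like this (the broker address, DSN, and client id are illustrative placeholders):

```yaml
brokers:
  - kafka1.example.org:9092        # the journal
storage_dbconn: service=swh-storage  # libpq connection string to the storage DB
prefix: swh.journal.objects        # topics will be <prefix>.<object_type>
client_id: swh.storage.backfiller
```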
replay

Fill a Storage by reading a Journal.

There can be several 'replayers' filling a Storage as long as they use the
same group-id.

swh storage replay [OPTIONS]

Options
  -n, --stop-after-objects <stop_after_objects>
      Stop after processing this many objects. Default is to run forever.
rpc-serve

Software Heritage Storage RPC server.

Do NOT use this in a production environment.

swh storage rpc-serve [OPTIONS]

Options
  --host <IP>
      Host IP address to bind the server on.
      Default: 0.0.0.0
  --port <PORT>
      Binding port of the server.
      Default: 5002
  --debug, --no-debug
      Indicates if the server should run in debug mode.
vault

Software Heritage Vault tools.

swh vault [OPTIONS] COMMAND [ARGS]...

rpc-serve

Software Heritage Vault RPC server.

swh vault rpc-serve [OPTIONS]

Options
  -C, --config-file <CONFIGFILE>
      Configuration file.
  --host <IP>
      Host IP address to bind the server on.
      Default: 0.0.0.0
  --port <PORT>
      Binding port of the server.
  --debug, --no-debug
      Indicates if the server should run in debug mode.
web

Software Heritage web client.

swh web [OPTIONS] COMMAND [ARGS]...

Options
  -C, --config-file <config_file>
      Configuration file (default: /home/jenkins/.config/swh/global.yml)

auth

Authenticate Software Heritage users with OpenID Connect.

This CLI tool eases the retrieval of a bearer token to authenticate a user
querying the Software Heritage Web API.

swh web auth [OPTIONS] COMMAND [ARGS]...

Options
  --oidc-server-url <oidc_server_url>
      URL of the OpenID Connect server (defaults to
      "https://auth.softwareheritage.org/auth/").
  --realm-name <realm_name>
      Name of the OpenID Connect authentication realm (defaults to
      "SoftwareHeritage").
  --client-id <client_id>
      OpenID Connect client identifier in the realm (defaults to "swh-web").
generate-token

Generate a new bearer token for Web API authentication.

Login with USERNAME, create a new OpenID Connect session and get a bearer
token.

The user will be prompted for their password, and the token will be printed
to standard output.

The created OpenID Connect session is an offline one, so the provided token
has a much longer expiration time than classical OIDC sessions (usually
several dozens of days).

swh web auth generate-token [OPTIONS] USERNAME

Arguments
  USERNAME
      Required argument

login

Alias for 'generate-token'.

swh web auth login [OPTIONS] USERNAME

Arguments
  USERNAME
      Required argument
search

Search a query (as a list of keywords) in the Software Heritage archive.

The search results are printed in CSV format, one result per line, using a
tabulation as the field delimiter.

swh web search [OPTIONS] KEYWORD...

Options
  --limit <limit>
      Maximum number of results to show.
      Default: 10
  --only-visited
      If set, only return origins with at least one visit by Software
      Heritage.
      Default: False
  --url-encode, --no-url-encode
      If set, escape origin URLs in results with percent encoding (RFC 3986).
      Default: False

Arguments
  KEYWORD...
      Required argument(s)
Database initialization utilities

swh db-init

Initialize a database for the Software Heritage <module>.

Example:

  swh db init -d swh-test storage

If you want to specify non-default PostgreSQL connection parameters, please
provide them using standard environment variables. See the psql(1) man page
(section ENVIRONMENT) for details.

Examples:

  PGPORT=5434 swh db init indexer
  swh db init -d postgresql://user:passwd@pghost:5433/swh-storage storage
  swh db init --flavor read_replica -d swh-storage storage

swh db-init [OPTIONS] MODULE

Options
  -d, --db-name <db_name>
      Database name.
      Default: softwareheritage-dev
  --flavor <flavor>
      Database flavor.

Arguments
  MODULE
      Required argument