Command-line interface

swh indexer

Software Heritage Indexer tools.

The Indexer mines the content of the archive and extracts derived information from its source code artifacts.

swh indexer [OPTIONS] COMMAND [ARGS]...

Options

-C, --config-file <config_file>

Configuration file.

journal-client

Listens for new objects from the SWH Journal, and schedules tasks to run relevant indexers (currently, only origin-intrinsic-metadata) on these new objects.

swh indexer journal-client [OPTIONS]

Options

-s, --scheduler-url <scheduler_url>

URL of the scheduler API

--origin-metadata-task-type <origin_metadata_task_type>

Name of the task running the origin metadata indexer.

--broker <brokers>

Kafka broker to connect to.

--prefix <prefix>

Prefix of Kafka topic names to read from.

--group-id <group_id>

Consumer/group id for reading from Kafka.

-m, --stop-after-objects <stop_after_objects>

Maximum number of objects to replay. Default is to run forever.
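As a minimal sketch, the options above can be assembled into a journal-client invocation from Python. The helper name journal_client_argv and every value passed to it (scheduler URL, broker address, topic prefix, group id) are placeholders chosen for illustration, not defaults of the tool.

```python
import shlex


def journal_client_argv(scheduler_url, broker, prefix, group_id, stop_after=None):
    """Build the argument list for `swh indexer journal-client`.

    Hypothetical helper: it only uses the long options documented above.
    """
    argv = [
        "swh", "indexer", "journal-client",
        "--scheduler-url", scheduler_url,
        "--broker", broker,
        "--prefix", prefix,
        "--group-id", group_id,
    ]
    # Without --stop-after-objects the client runs forever (documented default).
    if stop_after is not None:
        argv += ["--stop-after-objects", str(stop_after)]
    return argv


# All values below are placeholders.
argv = journal_client_argv(
    scheduler_url="http://localhost:5008/",
    broker="kafka.example.org:9092",
    prefix="swh.journal.objects",
    group_id="example-indexer-client",
    stop_after=1000,
)
print(shlex.join(argv))
```

Passing stop_after=None leaves the option out, so the client keeps consuming until it is interrupted.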

mapping

Manage Software Heritage Indexer mappings.

swh indexer mapping [OPTIONS] COMMAND [ARGS]...

list

Prints the list of known mappings.

swh indexer mapping list [OPTIONS]

list-terms

Prints the list of known CodeMeta terms, and which mappings support them.

swh indexer mapping list-terms [OPTIONS]

Options

--exclude-mapping <exclude_mapping>

Exclude the given mapping from the output

--concise

Don’t print the list of mappings supporting each term.

translate

Translates the metadata in FILE to CodeMeta using the mapping MAPPING_NAME, and prints the result.

swh indexer mapping translate [OPTIONS] MAPPING_NAME FILE

Arguments

MAPPING_NAME

Required argument

FILE

Required argument
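The following sketch shows one way to call translate from Python and capture what it prints. The helper names are hypothetical, "npm" is one of the mapping names this reference cites as an example, and package.json is an assumed input file. The format of the printed output is not described here, so the sketch returns it as raw text.

```python
import shlex
import shutil
import subprocess


def translate_argv(mapping_name, file_path):
    """Build the argument list; both positional arguments are required, in this order."""
    return ["swh", "indexer", "mapping", "translate", mapping_name, file_path]


def run_translate(mapping_name, file_path):
    """Run the command and return its standard output as text.

    Returns None when the `swh` executable is not found on PATH.
    """
    if shutil.which("swh") is None:
        return None
    result = subprocess.run(
        translate_argv(mapping_name, file_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# "npm" and "package.json" are illustrative values.
argv = translate_argv("npm", "package.json")
print(shlex.join(argv))
```

The names accepted as MAPPING_NAME can be looked up with swh indexer mapping list.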

rpc-serve

Starts a Software Heritage Indexer RPC HTTP server.

swh indexer rpc-serve [OPTIONS] CONFIG_PATH

Options

--host <host>

Host to run the server on.

--port <port>

Binding port of the server

--debug, --nodebug

Whether the server should run in debug mode.

Arguments

CONFIG_PATH

Required argument
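A minimal sketch of building an rpc-serve command line, with the options placed before the required CONFIG_PATH as in the usage line. The helper name, the host, the port number, and the configuration file name are all assumptions made for the example.

```python
import shlex


def rpc_serve_argv(config_path, host=None, port=None, debug=None):
    """Build the argument list for `swh indexer rpc-serve` (hypothetical helper)."""
    argv = ["swh", "indexer", "rpc-serve"]
    if host is not None:
        argv += ["--host", host]
    if port is not None:
        argv += ["--port", str(port)]
    # --debug and --nodebug are the two sides of one boolean switch;
    # leaving debug=None omits both and keeps the server's own default.
    if debug is not None:
        argv.append("--debug" if debug else "--nodebug")
    argv.append(config_path)  # CONFIG_PATH is the only positional argument
    return argv


# Placeholder values, not documented defaults.
argv = rpc_serve_argv("indexer-storage.yml", host="127.0.0.1", port=5007, debug=False)
print(shlex.join(argv))
```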

schedule

Manipulate Software Heritage Indexer tasks via the SWH Scheduler’s API.

swh indexer schedule [OPTIONS] COMMAND [ARGS]...

Options

-s, --scheduler-url <scheduler_url>

URL of the scheduler API

-i, --indexer-storage-url <indexer_storage_url>

URL of the indexer storage API

-g, --storage-url <storage_url>

URL of the (graph) storage API

--dry-run, --no-dry-run

Only list what would be scheduled, without scheduling anything.

reindex_origin_metadata

Schedules indexing tasks for origins that were already indexed.

swh indexer schedule reindex_origin_metadata [OPTIONS]

Options

-b, --batch-size <origin_batch_size>

Number of origins per task

Default

10

-t, --tool-id <tool_ids>

Restrict the search for old metadata to the given tool id(s).

-m, --mapping <mappings>

Mapping(s) that should be re-scheduled (e.g. ‘npm’, ‘gemspec’, ‘maven’)

--task-type <task_type>

Name of the task type to schedule.

Default

index-origin-metadata
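Because schedule is a command group, its own options (such as --scheduler-url and --dry-run) go before the subcommand name, while the options of reindex_origin_metadata go after it. The sketch below makes that ordering explicit. The helper name and the scheduler URL are placeholders, and the batch size and task type simply restate the documented defaults.

```python
import shlex


def reindex_argv(scheduler_url, batch_size=10, mapping=None,
                 task_type="index-origin-metadata", dry_run=True):
    """Build the argument list for `swh indexer schedule reindex_origin_metadata`."""
    # Options of the `schedule` group come first.
    argv = ["swh", "indexer", "schedule", "--scheduler-url", scheduler_url]
    argv.append("--dry-run" if dry_run else "--no-dry-run")
    # Then the subcommand, followed by its own options.
    argv.append("reindex_origin_metadata")
    argv += ["--batch-size", str(batch_size), "--task-type", task_type]
    if mapping is not None:
        argv += ["--mapping", mapping]
    return argv


# Placeholder URL; "npm" is one of the example mappings named above.
argv = reindex_argv("http://localhost:5008/", mapping="npm")
print(shlex.join(argv))
```

Starting with dry_run=True lists what would be scheduled, which is a cautious way to check the selection before switching to --no-dry-run.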