Command-line interface#

swh scrubber#

Main command group of the datastore scrubber.

Expected config format:

scrubber_db:
    cls: local
    db: "service=..."    # libpq DSN

# for storage checkers + origin locator only:
storage:
    cls: postgresql     # cannot be remote for checkers, as they need direct
                        # access to the pg DB
    db: "service=..."   # libpq DSN
    objstorage:
        cls: memory

# for journal checkers only:
journal:
    # see https://docs.softwareheritage.org/devel/apidoc/swh.journal.client.html
    # for the full list of options
    sasl.mechanism: SCRAM-SHA-512
    security.protocol: SASL_SSL
    sasl.username: ...
    sasl.password: ...
    group_id: ...
    privileged: True
    message.max.bytes: 524288000
    brokers:
      - "broker1.journal.softwareheritage.org:9093"
      - "broker2.journal.softwareheritage.org:9093"
      - "broker3.journal.softwareheritage.org:9093"
      - "broker4.journal.softwareheritage.org:9093"
      - "broker5.journal.softwareheritage.org:9093"
    object_types: [directory, revision, snapshot, release]
    auto_offset_reset: earliest

swh scrubber [OPTIONS] COMMAND [ARGS]...
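
The comments in the sample config above imply a few constraints that can be checked before launching a command. The following is a minimal sketch, not part of swh-scrubber: `check_config` and its `mode` argument are hypothetical names, and the function works on an already-parsed dict (for instance, the result of loading the YAML file).

```python
def check_config(config: dict, mode: str) -> list:
    """Return a list of problems found in a parsed scrubber config.

    `mode` is a hypothetical label: "storage" for storage checkers,
    "journal" for journal checkers.
    """
    problems = []
    # Every command reports to (or reads from) the scrubber DB.
    if "scrubber_db" not in config:
        problems.append("missing 'scrubber_db' section")
    if mode == "storage":
        storage = config.get("storage")
        if storage is None:
            problems.append("missing 'storage' section")
        elif storage.get("cls") == "remote":
            # Checkers need direct access to the PostgreSQL database.
            problems.append("storage.cls cannot be 'remote' for checkers")
    elif mode == "journal":
        if "journal" not in config:
            problems.append("missing 'journal' section")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the configuration is otherwise valid.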

Options

-C, --config-file <config_file>#

Configuration file.

check#

Group of commands which read from data stores and report errors.

swh scrubber check [OPTIONS] COMMAND [ARGS]...

journal#

Reads a complete Kafka journal, and reports corrupt objects to the scrubber DB.

swh scrubber check journal [OPTIONS]

storage#

Reads a swh-storage instance, and reports corrupt objects to the scrubber DB.

This runs a single thread; parallelism is achieved by running this command multiple times, on disjoint ranges.

All objects of type object_type are ordered, and split into the given number of partitions. When running in parallel, the number of partitions should be the same for all workers or they may work on overlapping or non-exhaustive ranges.

Then, this process will check all partitions in the given [start_partition_id, end_partition_id) range. When running in parallel, these ranges should be set so that the processes together cover the whole [0, nb_partitions) range.
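
Before starting several workers, it can help to verify that their ranges are disjoint and cover every partition. The sketch below is an illustration only; `check_coverage` is a hypothetical helper, not something provided by swh-scrubber. It takes the half-open (start, end) pairs planned for each worker and reports the first gap or overlap it finds.

```python
def check_coverage(ranges, nb_partitions):
    """Check that half-open [start, end) ranges tile [0, nb_partitions).

    Returns None if the ranges are disjoint and exhaustive, otherwise a
    short description of the first problem found.
    """
    expected_start = 0
    # Sorting by start lets us walk the ranges left to right.
    for start, end in sorted(ranges):
        if start >= end:
            return f"empty or inverted range [{start}, {end})"
        if start < expected_start:
            return f"overlap before partition {expected_start}"
        if start > expected_start:
            return f"gap: partitions [{expected_start}, {start}) are unchecked"
        expected_start = end
    if expected_start < nb_partitions:
        return f"gap: partitions [{expected_start}, {nb_partitions}) are unchecked"
    if expected_start > nb_partitions:
        return f"range extends past nb_partitions ({nb_partitions})"
    return None
```

For instance, `check_coverage([(0, 4), (5, 8)], 8)` reports that partition 4 would never be checked.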

For example, in order to have 8 threads checking revisions in parallel and with 64k checkpoints (to recover on crashes), the CLI should be run 8 times with these parameters:

--object-type revision --nb-partitions 65536 --start-partition-id 0 --end-partition-id 8192
--object-type revision --nb-partitions 65536 --start-partition-id 8192 --end-partition-id 16384
--object-type revision --nb-partitions 65536 --start-partition-id 16384 --end-partition-id 24576
--object-type revision --nb-partitions 65536 --start-partition-id 24576 --end-partition-id 32768
--object-type revision --nb-partitions 65536 --start-partition-id 32768 --end-partition-id 40960
--object-type revision --nb-partitions 65536 --start-partition-id 40960 --end-partition-id 49152
--object-type revision --nb-partitions 65536 --start-partition-id 49152 --end-partition-id 57344
--object-type revision --nb-partitions 65536 --start-partition-id 57344 --end-partition-id 65536
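
The eight parameter sets above can also be generated rather than written by hand. This is a minimal sketch; `worker_args` is a hypothetical helper name, and only the four options documented for this command are used.

```python
def worker_args(object_type, nb_partitions, nb_workers):
    """Yield one argument string per worker, covering [0, nb_partitions)."""
    for i in range(nb_workers):
        # Integer arithmetic keeps ranges contiguous even when nb_partitions
        # is not a multiple of nb_workers.
        start = i * nb_partitions // nb_workers
        end = (i + 1) * nb_partitions // nb_workers
        yield (
            f"--object-type {object_type} --nb-partitions {nb_partitions} "
            f"--start-partition-id {start} --end-partition-id {end}"
        )


# Print the eight command lines for the revision example.
for args in worker_args("revision", 65536, 8):
    print("swh scrubber check storage", args)
```

Each printed line can then be launched as its own process, since a single invocation runs a single thread.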
swh scrubber check storage [OPTIONS]

Options

--object-type <object_type>#
Options:

snapshot | revision | release | directory

--start-partition-id <start_partition_id>#
--end-partition-id <end_partition_id>#
--nb-partitions <nb_partitions>#

fix#

For each known corrupt object reported in the scrubber DB, looks up origins that may contain this object and records them, so they can be used later for recovery.

swh scrubber fix [OPTIONS]

Options

--start-object <start_object>#
--end-object <end_object>#

locate#

For each known corrupt object reported in the scrubber DB, looks up origins that may contain this object and records them, so they can be used later for recovery.

swh scrubber locate [OPTIONS]

Options

--start-object <start_object>#
--end-object <end_object>#