Command-line interface
swh scrubber
Main command group of the datastore scrubber.
Expected config format:

    scrubber_db:
        cls: local
        db: "service=..."    # libpq DSN

    # for storage checkers + origin locator only:
    storage:
        cls: postgresql     # cannot be remote for checkers, as they need direct
                            # access to the pg DB
        db: "service=..."   # libpq DSN
        objstorage:
            cls: memory

    # for journal checkers only:
    journal:
        # see https://docs.softwareheritage.org/devel/apidoc/swh.journal.client.html
        # for the full list of options
        sasl.mechanism: SCRAM-SHA-512
        security.protocol: SASL_SSL
        sasl.username: ...
        sasl.password: ...
        group_id: ...
        privileged: True
        message.max.bytes: 524288000
        brokers:
          - "broker1.journal.softwareheritage.org:9093"
          - "broker2.journal.softwareheritage.org:9093"
          - "broker3.journal.softwareheritage.org:9093"
          - "broker4.journal.softwareheritage.org:9093"
          - "broker5.journal.softwareheritage.org:9093"
        object_types: [directory, revision, snapshot, release]
        auto_offset_reset: earliest

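The comments in the config above indicate which sections each kind of command relies on. A minimal sketch of a pre-flight check built on that reading follows. The helper `missing_sections` and the `REQUIRED_SECTIONS` table are hypothetical and not part of swh-scrubber. That `scrubber_db` is needed by every command is an assumption, and `fix` is left out because the config comments do not mention it.

```python
# Hypothetical helper, not part of swh-scrubber. The mapping is inferred from
# the comments in the expected config format; treating scrubber_db as always
# required is an assumption.
REQUIRED_SECTIONS = {
    "check storage": {"scrubber_db", "storage"},
    "check journal": {"scrubber_db", "journal"},
    "locate": {"scrubber_db", "storage"},
}


def missing_sections(command, config):
    """Return the top-level sections that `command` needs but `config` lacks."""
    return REQUIRED_SECTIONS[command] - set(config)


# Example: a config that only defines what the journal checker needs.
config = {
    "scrubber_db": {"cls": "local", "db": "service=..."},
    "journal": {"group_id": "...", "brokers": ["..."]},
}
print(missing_sections("check journal", config))
print(missing_sections("check storage", config))
```

With this config, the first call reports nothing missing and the second reports the `storage` section.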
swh scrubber [OPTIONS] COMMAND [ARGS]...
Options
- -C, --config-file <config_file>
Configuration file.
check
Group of commands that read from data stores and report errors.
swh scrubber check [OPTIONS] COMMAND [ARGS]...
journal
Reads a complete Kafka journal and reports corrupt objects to the scrubber DB.
swh scrubber check journal [OPTIONS]
storage
Reads a swh-storage instance and reports corrupt objects to the scrubber DB.
This runs a single thread; parallelism is achieved by running this command multiple times, on disjoint ranges.
All objects of type object_type are ordered and split into the given number of partitions. When running in parallel, the number of partitions should be the same for all workers, or they may work on overlapping or non-exhaustive ranges.
Then, this process checks all partitions in the given [start_partition_id, end_partition_id) range. When running in parallel, these ranges should be set so that the processes together cover the whole [0, nb_partitions) range.
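The coverage requirement can be verified before launching any worker. The sketch below checks that a set of half-open ranges tiles [0, nb_partitions) with no gap and no overlap. `check_coverage` is a hypothetical helper written for illustration, not something swh-scrubber provides.

```python
def check_coverage(ranges, nb_partitions):
    """Raise ValueError unless the half-open (start, end) ranges tile
    [0, nb_partitions) exactly, with no gap and no overlap."""
    expected_start = 0
    for start, end in sorted(ranges):
        if end <= start:
            raise ValueError(f"empty or reversed range [{start}, {end})")
        if start > expected_start:
            raise ValueError(
                f"gap: partitions [{expected_start}, {start}) are never checked"
            )
        if start < expected_start:
            raise ValueError(
                f"overlap: partitions [{start}, {expected_start}) are checked twice"
            )
        expected_start = end
    if expected_start != nb_partitions:
        raise ValueError(
            f"ranges end at {expected_start}, expected {nb_partitions}"
        )


# Eight equal ranges over 65536 partitions: passes silently.
ranges = [(i * 8192, (i + 1) * 8192) for i in range(8)]
check_coverage(ranges, 65536)
```

A worker started with a different partition count cannot be caught by this check, so the value passed as `nb_partitions` still has to match what every worker receives.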
For example, to have 8 threads checking revisions in parallel, with 64k checkpoints (to recover from crashes), the CLI should be run 8 times with these parameters:
--object-type revision --nb-partitions 65536 --start-partition-id 0 --end-partition-id 8192
--object-type revision --nb-partitions 65536 --start-partition-id 8192 --end-partition-id 16384
--object-type revision --nb-partitions 65536 --start-partition-id 16384 --end-partition-id 24576
--object-type revision --nb-partitions 65536 --start-partition-id 24576 --end-partition-id 32768
--object-type revision --nb-partitions 65536 --start-partition-id 32768 --end-partition-id 40960
--object-type revision --nb-partitions 65536 --start-partition-id 40960 --end-partition-id 49152
--object-type revision --nb-partitions 65536 --start-partition-id 49152 --end-partition-id 57344
--object-type revision --nb-partitions 65536 --start-partition-id 57344 --end-partition-id 65536
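The parameter lines above follow a regular pattern, so they can be generated instead of written by hand. A minimal sketch, assuming the workers should receive contiguous ranges of near-equal size: `worker_commands` is a hypothetical helper, and it only builds argument lists using the options documented here, without running anything.

```python
def worker_commands(object_type, nb_partitions, nb_workers):
    """Build one `swh scrubber check storage` argument list per worker,
    splitting [0, nb_partitions) into contiguous half-open ranges."""
    if not 0 < nb_workers <= nb_partitions:
        raise ValueError("need between 1 and nb_partitions workers")
    commands = []
    for i in range(nb_workers):
        # Integer arithmetic keeps the ranges contiguous even when
        # nb_partitions is not a multiple of nb_workers.
        start = i * nb_partitions // nb_workers
        end = (i + 1) * nb_partitions // nb_workers
        commands.append([
            "swh", "scrubber", "check", "storage",
            "--object-type", object_type,
            "--nb-partitions", str(nb_partitions),
            "--start-partition-id", str(start),
            "--end-partition-id", str(end),
        ])
    return commands


# Print the eight invocations for the revision example.
for argv in worker_commands("revision", 65536, 8):
    print(" ".join(argv))
```

Each list can be passed to a process launcher of choice. The `-C` option of the `swh scrubber` group, if needed, goes right after `swh scrubber` and before `check`.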
swh scrubber check storage [OPTIONS]
Options
- --object-type <object_type>
- Options:
snapshot | revision | release | directory
- --start-partition-id <start_partition_id>
- --end-partition-id <end_partition_id>
- --nb-partitions <nb_partitions>
fix
For each known corrupt object reported in the scrubber DB, looks up origins that may contain this object and records them, so they can be used later for recovery.
swh scrubber fix [OPTIONS]
Options
- --start-object <start_object>
- --end-object <end_object>
locate
For each known corrupt object reported in the scrubber DB, looks up origins that may contain this object and records them, so they can be used later for recovery.
swh scrubber locate [OPTIONS]
Options
- --start-object <start_object>
- --end-object <end_object>