Software Heritage - Datastore Scrubber
Tools to periodically check data integrity in swh-storage and swh-objstorage, report errors, and (try to) fix them.
This is a work in progress; some of the components described below do not exist yet (Cassandra storage checker, objstorage checker, recovery, and reinjection).
The Scrubber package is made of the following parts:
Highly parallel processes continuously read objects from a data store, compute checksums, and record any failure in a database, along with the data of the corrupt object.
There is one “checker” for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and objstorage.
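As a rough illustration of the checking step, the sketch below recomputes a checksum for each object and collects the ones that do not match, together with their data. It is not the scrubber's actual code: the StoredObject type, the check_objects function, and the use of a plain SHA-1 over the raw bytes are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical record: an identifier and the bytes stored under it."""

    object_id: bytes  # expected checksum, as raw bytes
    data: bytes       # payload read back from the data store


def check_objects(objects: Iterable[StoredObject]) -> List[StoredObject]:
    """Return the objects whose recomputed SHA-1 differs from their identifier."""
    corrupt = []
    for obj in objects:
        if hashlib.sha1(obj.data).digest() != obj.object_id:
            # A real checker would record this failure in the scrubber database.
            corrupt.append(obj)
    return corrupt
```

Returning the full object, not just its identifier, mirrors the idea of keeping the corrupt data around for a later recovery attempt.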
The journal is “crawled” using its native streaming; others are crawled by range,
reusing swh-storage’s backfiller utilities, and checkpointed from time to time
to the scrubber’s database (in the
For the storage checker, a checking configuration must be created before being able to spawn a number of checkers.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp  for checking snapshot in datastore storage postgresql
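The --nb-partitions option suggests that the identifier space is split into partitions that workers can process independently. One way such a split could work is to cut the space of 20-byte identifiers into equal-width ranges. This is purely an assumption for illustration: the partition_bounds function and the equal-width scheme are not taken from swh-scrubber or swh-storage.

```python
from typing import Tuple

ID_BITS = 160  # assumed identifier width: a 20-byte digest


def partition_bounds(partition: int, nb_partitions: int) -> Tuple[int, int]:
    """Return (start, end) of a partition as integers; end is exclusive."""
    if not 0 <= partition < nb_partitions:
        raise ValueError("partition index out of range")
    space = 1 << ID_BITS
    # Integer arithmetic keeps consecutive partitions contiguous, even when
    # nb_partitions does not divide the identifier space evenly.
    start = partition * space // nb_partitions
    end = (partition + 1) * space // nb_partitions
    return start, end
```

With 65536 partitions, each range in this sketch covers the identifiers that share the same first two bytes.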
One or more checking workers can then be spawned using the check storage command:
$ swh scrubber check storage chk-snp [...]
- A configuration file is expected, as for most
This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop
The configuration section scrubber_db has been renamed as scrubber in swh-scrubber version 2.0.0.
Then, from time to time, jobs go through the list of known corrupt objects, and try to recover the original objects, through various means:
Brute-forcing variations until they match their checksum
Recovering from another data store
As a last resort, recovering from known origins, if any
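The first strategy can be sketched as follows, assuming the corruption is a single flipped bit and the expected checksum is a SHA-1 of the raw bytes. Both assumptions, and the function name recover_single_bit_flip, are illustrative only; the text above does not say which variations the recovery jobs would try.

```python
import hashlib
from typing import Optional


def recover_single_bit_flip(corrupt: bytes, expected_sha1: bytes) -> Optional[bytes]:
    """Try every single-bit variation of `corrupt` until one matches the checksum."""
    if hashlib.sha1(corrupt).digest() == expected_sha1:
        return corrupt  # already matches, nothing to fix
    candidate = bytearray(corrupt)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(candidate).digest() == expected_sha1:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit  # restore it before moving on
    return None  # no single-bit variation matches
```

The search costs one hash per bit of the object, so it is only practical for small objects or narrow classes of corruption; when it returns None, the other strategies in the list would be the next thing to try.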
Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.