Software Heritage - Datastore Scrubber#
Tools to periodically check data integrity in swh-storage, swh-objstorage and swh-journal, report errors, and (try to) fix them.
The Scrubber package is made of the following parts:
Checking#
Highly parallel processes continuously read objects from a data store, compute checksums, and write any failures to a database, along with the data of the corrupt object.
There is one “checker” for each datastore package: storage (postgresql and cassandra), journal (kafka), and object storage (any backend).
The journal is “crawled” using its native streaming; the others are crawled by range, reusing swh-storage’s backfiller utilities, and checkpointed from time to time to the scrubber’s database (in the checked_range table).
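To make the range-based checking loop concrete, here is a minimal Python sketch of one partition pass. It is an illustration only: the CandidateObject type, the plain SHA-1 recomputation, and the report/checkpoint callbacks are hypothetical simplifications, not the actual swh-scrubber API.
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class CandidateObject:
    # Hypothetical, simplified view of a stored object: the intrinsic
    # identifier it claims to have, and the serialized form ("manifest")
    # that identifier was originally computed from.
    expected_id: str
    manifest: bytes

def check_partition(
    objects: Iterable[CandidateObject],
    report_corruption: Callable[[str, bytes], None],
    checkpoint: Callable[[], None],
) -> None:
    """Sketch of one partition pass: recompute each object's checksum,
    report mismatches along with the corrupt data, then checkpoint."""
    for obj in objects:
        # Simplified: the real checkers recompute the swh intrinsic hashes,
        # not a plain SHA-1 of an opaque manifest.
        recomputed = hashlib.sha1(obj.manifest).hexdigest()
        if recomputed != obj.expected_id:
            # The corrupt object's data is stored with the failure report,
            # so that later recovery jobs can work on it.
            report_corruption(obj.expected_id, obj.manifest)
    # Record the completed range (conceptually, a row in the checked_range
    # table of the scrubber database).
    checkpoint()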
Storage#
For the storage checker, a checking configuration must be created before checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init storage --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql
Note
A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For storage checking operations, this configuration file must also have a storage configuration section. See the swh-storage documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop
Note
The configuration section scrubber_db has been renamed to scrubber in swh-scrubber version 2.0.0.
One or more checking workers can then be spawned using the swh scrubber check run command:
$ swh scrubber check run chk-snp
[...]
Object storage#
As with the storage checker, a checking configuration must be created before checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init objstorage --object-type content --nb-partitions 65536 --name check-contents
Created configuration check-contents [3] for checking content in datastore objstorage remote
Note
A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For object storage checking operations, this configuration file must have:
a storage configuration section if content ids are read from it (the default)
a journal configuration section if content ids are read from a kafka content topic (this requires the --use-journal flag of the swh scrubber check run command)
an objstorage configuration section targeting the object storage to check
See the swh-storage documentation, swh-objstorage documentation and swh-journal documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop

journal:
  cls: kafka
  brokers:
  - broker1.journal.softwareheritage.org:9093
  - broker2.journal.softwareheritage.org:9093
  - broker3.journal.softwareheritage.org:9093
  - broker4.journal.softwareheritage.org:9093
  group_id: swh.scrubber
  prefix: swh.journal.objects
  on_eof: stop

objstorage:
  cls: remote
  url: https://objstorage.softwareheritage.org/
By default, an object storage checker detects both missing and corrupted contents.
To disable detection of missing contents, use the --no-check-references option of the swh scrubber check init command.
To disable detection of corrupted contents, use the --no-check-hashes option of the swh scrubber check init command.
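To make the two detection modes concrete, here is a hedged Python sketch contrasting a reference check (presence only) with a hash check (download and recompute). The mapping-like objstorage parameter and the plain SHA-1 comparison are simplified stand-ins, not the real swh-objstorage client interface.
import hashlib
from typing import Mapping, Optional

def check_content(content_sha1: str, objstorage: Mapping[str, bytes]) -> Optional[str]:
    """Return a problem label for one content id, or None if it looks fine."""
    # Reference check (disabled by --no-check-references):
    # is the content present in the object storage at all?
    if content_sha1 not in objstorage:
        return "missing"
    # Hash check (disabled by --no-check-hashes): does the stored data
    # still match its checksum? (Simplified to a plain SHA-1 of the raw data.)
    data = objstorage[content_sha1]
    if hashlib.sha1(data).hexdigest() != content_sha1:
        return "corrupted"
    return None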
One or more checking workers can then be spawned using the swh scrubber check run command:
if the content ids must be read from a storage instance:
$ swh scrubber check run check-contents
[...]
if the content ids must be read from a kafka content topic of swh-journal:
$ swh scrubber check run check-contents --use-journal
[...]
Journal#
As with the other checkers, a checking configuration must be created before checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init journal --object-type directory --name check-dirs-journal
Created configuration check-dirs-journal [4] for checking directory in datastore journal kafka
Note
A configuration file is expected, as for most swh tools. This file must have a scrubber section with the configuration of the scrubber database. For journal checking operations, this configuration file must also have a journal configuration section. See the swh-journal documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824

journal:
  cls: kafka
  brokers:
  - broker1.journal.softwareheritage.org:9093
  - broker2.journal.softwareheritage.org:9093
  - broker3.journal.softwareheritage.org:9093
  - broker4.journal.softwareheritage.org:9093
  group_id: swh.scrubber
  prefix: swh.journal.objects
  on_eof: stop
One or more checking workers can then be spawned using the swh scrubber check run command:
$ swh scrubber check run check-dirs-journal
[...]
Recovery#
Then, from time to time, jobs go through the list of known corrupt objects and try to recover the original objects through various means:
Brute-forcing variations until they match their checksum (a sketch of this approach follows the list)
Recovering from another data store
As a last resort, recovering from known origins, if any
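As an illustration of the brute-force strategy mentioned above, here is a hypothetical Python sketch that tries small variations of a corrupt object until one matches its expected checksum. The single-byte variation generator and the plain SHA-1 check are deliberately naive placeholders, not the actual swh-scrubber recovery logic.
import hashlib
from typing import Iterable, Optional

def candidate_variations(manifest: bytes) -> Iterable[bytes]:
    # Naive placeholder: yield the manifest as-is, then every single-byte
    # variation of it. Real recovery would target known corruption patterns.
    yield manifest
    for i in range(len(manifest)):
        for b in range(256):
            if manifest[i] != b:
                yield manifest[:i] + bytes([b]) + manifest[i + 1:]

def recover_by_bruteforce(expected_id: str, corrupt_manifest: bytes) -> Optional[bytes]:
    """Return a variation of the corrupt manifest whose checksum matches the
    expected identifier, or None if no variation matches."""
    for candidate in candidate_variations(corrupt_manifest):
        if hashlib.sha1(candidate).hexdigest() == expected_id:
            return candidate
    return None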
Reinjection#
Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.