swh.scrubber.db module#
- class swh.scrubber.db.Datastore(package: str, cls: str, instance: str)[source]#
Bases:
object
Represents a datastore being scrubbed; eg. swh-storage or swh-journal.
- class swh.scrubber.db.ConfigEntry(name: str, datastore: Datastore, object_type: ObjectType, nb_partitions: int, check_hashes: bool, check_references: bool)[source]#
Bases:
object
Represents a datastore being scrubbed; eg. swh-storage or swh-journal.
- object_type: ObjectType#
- class swh.scrubber.db.CorruptObject(id: swh.model.swhids.CoreSWHID, config: swh.scrubber.db.ConfigEntry, first_occurrence: datetime.datetime, object_: bytes)[source]#
Bases:
object
- config: ConfigEntry#
- class swh.scrubber.db.MissingObject(id: swh.model.swhids.CoreSWHID, config: swh.scrubber.db.ConfigEntry, first_occurrence: datetime.datetime)[source]#
Bases:
object
- config: ConfigEntry#
- class swh.scrubber.db.MissingObjectReference(missing_id: swh.model.swhids.CoreSWHID, reference_id: swh.model.swhids.CoreSWHID, config: swh.scrubber.db.ConfigEntry, first_occurrence: datetime.datetime)[source]#
Bases:
object
- config: ConfigEntry#
- class swh.scrubber.db.FixedObject(id: swh.model.swhids.CoreSWHID, object_: bytes, method: str, recovery_date: datetime.datetime | None = None)[source]#
Bases:
object
- class swh.scrubber.db.ScrubberDb(db, **kwargs)[source]#
Bases:
BaseDb
create a DB proxy
- Parameters:
conn – psycopg2 connection to the SWH DB
pool – psycopg2 pool of connections
- current_version = 7#
- datastore_get_or_add(datastore: Datastore) int [source]#
Creates a datastore if it does not exist, and returns its id.
- datastore_get(datastore_id: int) Datastore [source]#
Returns a datastore’s id. Raises
ValueError
if it does not exist.
- config_add(name: str | None, datastore: Datastore, object_type: ObjectType, nb_partitions: int, check_hashes: bool = True, check_references: bool = True) int [source]#
Creates a configuration entry (and potentially a datastore);
Will fail if a config with same (datastore. object_type, nb_paritions) already exists.
- config_get(config_id: int) ConfigEntry [source]#
- config_get_by_name(name: str, datastore: int | None = None) int | None [source]#
Get the configuration entry for given name, if any
- config_get_stats(config_id: int) Dict[str, Any] [source]#
Return statistics for the check configuration <check_id>.
- checked_partition_iter_next(config_id: int) Iterator[int] [source]#
Generates partitions to be checked for the given configuration
At each iteration, look for the next “free” partition in the checked_partition, for the given config_id, reserve it and return its id.
Reserving the partition means make sure there is a row in the table for this partition id with the start_date column set.
To chose a “free” partition is to select either the smaller partition is for which the start_date is NULL, or the first partition id not yet in the table.
Stops the iteration when the number of partitions for the config id is reached.
- checked_partition_reset(config_id: int, partition_id: int) bool [source]#
Reset the partition, aka clear start_date and end_date
- checked_partition_upsert(config_id: int, partition_id: int, date: datetime | None = None) None [source]#
Records in the database the given partition was last checked at the given date.
- checked_partition_get_last_date(config_id: int, partition_id: int) datetime | None [source]#
Returns the last date the given partition was checked in the given datastore, or
None
if it was never checked.Currently, this matches partition id and number exactly, with no regard for partitions that contain or are contained by it.
- checked_partition_get_running(config_id: int) Iterator[Tuple[int, datetime]] [source]#
Yields the partitions which are currently being checked; i.e. which have a start_date but no end_date.
- checked_partition_get_stuck(config_id: int, since: timedelta | None = None) Iterator[Tuple[int, datetime]] [source]#
Yields the partitions which are currently running for more than since; if not set, automatically guess a reasonable delay from completed partitions. If no such a delay can be extracted, fall back to 1 hour.
The heuristic for the automatic delay is 2x max(end_date-start_date) for the last 10 partitions checked.
- checked_partition_iter(config_id: int) Iterator[Tuple[int, int, datetime, datetime | None]] [source]#
Yields tuples of
(partition_id, nb_partitions, start_date, end_date)
- corrupt_object_iter() Iterator[CorruptObject] [source]#
Yields all records in the ‘corrupt_object’ table.
- corrupt_object_get(start_id: CoreSWHID, end_id: CoreSWHID, limit: int = 100) Iterator[CorruptObject] [source]#
Yields a page of records in the ‘corrupt_object’ table, ordered by id.
- Parameters:
start_id – Only return objects after this id
end_id – Only return objects before this id
in_origin – An origin URL. If provided, only returns objects that may be found in the given origin
- corrupt_object_grab_by_id(cur: cursor, start_id: CoreSWHID, end_id: CoreSWHID, limit: int = 100) Iterator[CorruptObject] [source]#
Returns a page of records in the ‘corrupt_object’ table for a fixer, ordered by id
These records are not already fixed (ie. do not have a corresponding entry in the ‘fixed_object’ table), and they are selected with an exclusive update lock.
- Parameters:
start_id – Only return objects after this id
end_id – Only return objects before this id
- corrupt_object_grab_by_origin(cur: cursor, origin_url: str, start_id: CoreSWHID | None = None, end_id: CoreSWHID | None = None, limit: int = 100) Iterator[CorruptObject] [source]#
Returns a page of records in the ‘corrupt_object’ table for a fixer, ordered by id
These records are not already fixed (ie. do not have a corresponding entry in the ‘fixed_object’ table), and they are selected with an exclusive update lock.
- Parameters:
origin_url – only returns objects that may be found in the given origin
- missing_object_add(id: CoreSWHID, reference_ids: Iterable[CoreSWHID], config: ConfigEntry) None [source]#
Adds a “hole” to the inventory, ie. an object missing from a datastore that is referenced by an other object of the same datastore.
If the missing object is already known to be missing by the scrubber database, this only records the reference (which can be useful to locate an origin to recover the object from). If that reference is already known too, this is a noop.
- Parameters:
id – SWHID of the missing object (the hole)
reference_id – SWHID of the object referencing the missing object
datastore – representation of the swh-storage/swh-journal/… instance containing this hole
- missing_object_iter() Iterator[MissingObject] [source]#
Yields all records in the ‘missing_object’ table.
- missing_object_reference_iter(missing_id: CoreSWHID) Iterator[MissingObjectReference] [source]#
Yields all records in the ‘missing_object_reference’ table.
- object_origin_get(after: str = '', limit: int = 1000) List[str] [source]#
Returns origins with non-fixed corrupt objects, ordered by URL.
- Parameters:
after – if given, only returns origins with an URL after this value
- fixed_object_add(cur: cursor, fixed_objects: List[FixedObject]) None [source]#
- fixed_object_iter() Iterator[FixedObject] [source]#