swh.scrubber.db module#

class swh.scrubber.db.Datastore(package: str, cls: str, instance: str)[source]#

Bases: object

Represents a datastore being scrubbed; eg. swh-storage or swh-journal.

package: str#

‘storage’, ‘journal’, or ‘objstorage’.

cls: str#

‘postgresql’/’cassandra’ for storage, ‘kafka’ for journal, ‘pathslicer’/’winery’/… for objstorage.

instance: str#

Human readable string.

class swh.scrubber.db.CorruptObject(id: swh.model.swhids.CoreSWHID, datastore: swh.scrubber.db.Datastore, first_occurrence: datetime.datetime, object_: bytes)[source]#

Bases: object

id: CoreSWHID#
datastore: Datastore#
first_occurrence: datetime#
object_: bytes#
class swh.scrubber.db.MissingObject(id: swh.model.swhids.CoreSWHID, datastore: swh.scrubber.db.Datastore, first_occurrence: datetime.datetime)[source]#

Bases: object

id: CoreSWHID#
datastore: Datastore#
first_occurrence: datetime#
class swh.scrubber.db.MissingObjectReference(missing_id: swh.model.swhids.CoreSWHID, reference_id: swh.model.swhids.CoreSWHID, datastore: swh.scrubber.db.Datastore, first_occurrence: datetime.datetime)[source]#

Bases: object

missing_id: CoreSWHID#
reference_id: CoreSWHID#
datastore: Datastore#
first_occurrence: datetime#
class swh.scrubber.db.FixedObject(id: swh.model.swhids.CoreSWHID, object_: bytes, method: str, recovery_date: Union[datetime.datetime, NoneType] = None)[source]#

Bases: object

id: CoreSWHID#
object_: bytes#
method: str#
recovery_date: Optional[datetime] = None#
class swh.scrubber.db.ScrubberDb(conn: connection, pool: Optional[AbstractConnectionPool] = None)[source]#

Bases: BaseDb

create a DB proxy

Parameters:
  • conn – psycopg2 connection to the SWH DB

  • pool – psycopg2 pool of connections

current_version = 5#
datastore_get_or_add(datastore: Datastore) int[source]#

Creates a datastore if it does not exist, and returns its id.

checked_partition_upsert(datastore: Datastore, object_type: ObjectType, partition_id: int, nb_partitions: int, date: datetime) None[source]#

Records in the database the given partition was last checked at the given date.

checked_partition_get_last_date(datastore: Datastore, object_type: ObjectType, partition_id: int, nb_partitions: int) Optional[datetime][source]#

Returns the last date the given partition was checked in the given datastore, or None if it was never checked.

Currently, this matches partition id and number exactly, with no regard for partitions that contain or are contained by it.

checked_partition_iter(datastore: Datastore) Iterator[Tuple[ObjectType, int, int, datetime]][source]#

Yields tuples of (partition_id, nb_partitions, last_date)

corrupt_object_add(id: CoreSWHID, datastore: Datastore, serialized_object: bytes) None[source]#
corrupt_object_iter() Iterator[CorruptObject][source]#

Yields all records in the ‘corrupt_object’ table.

corrupt_object_get(start_id: CoreSWHID, end_id: CoreSWHID, limit: int = 100) List[CorruptObject][source]#

Yields a page of records in the ‘corrupt_object’ table, ordered by id.

Parameters:
  • start_id – Only return objects after this id

  • end_id – Only return objects before this id

  • in_origin – An origin URL. If provided, only returns objects that may be found in the given origin

corrupt_object_grab_by_id(cur: cursor, start_id: CoreSWHID, end_id: CoreSWHID, limit: int = 100) List[CorruptObject][source]#

Returns a page of records in the ‘corrupt_object’ table for a fixer, ordered by id

These records are not already fixed (ie. do not have a corresponding entry in the ‘fixed_object’ table), and they are selected with an exclusive update lock.

Parameters:
  • start_id – Only return objects after this id

  • end_id – Only return objects before this id

corrupt_object_grab_by_origin(cur: cursor, origin_url: str, start_id: Optional[CoreSWHID] = None, end_id: Optional[CoreSWHID] = None, limit: int = 100) List[CorruptObject][source]#

Returns a page of records in the ‘corrupt_object’ table for a fixer, ordered by id

These records are not already fixed (ie. do not have a corresponding entry in the ‘fixed_object’ table), and they are selected with an exclusive update lock.

Parameters:

origin_url – only returns objects that may be found in the given origin

missing_object_add(id: CoreSWHID, reference_ids: Iterable[CoreSWHID], datastore: Datastore) None[source]#

Adds a “hole” to the inventory, ie. an object missing from a datastore that is referenced by an other object of the same datastore.

If the missing object is already known to be missing by the scrubber database, this only records the reference (which can be useful to locate an origin to recover the object from). If that reference is already known too, this is a noop.

Parameters:
  • id – SWHID of the missing object (the hole)

  • reference_id – SWHID of the object referencing the missing object

  • datastore – representation of the swh-storage/swh-journal/… instance containing this hole

missing_object_iter() Iterator[MissingObject][source]#

Yields all records in the ‘missing_object’ table.

missing_object_reference_iter(missing_id: CoreSWHID) Iterator[MissingObjectReference][source]#

Yields all records in the ‘missing_object_reference’ table.

object_origin_add(cur: cursor, swhid: CoreSWHID, origins: List[str]) None[source]#
object_origin_get(after: str = '', limit: int = 1000) List[str][source]#

Returns origins with non-fixed corrupt objects, ordered by URL.

Parameters:

after – if given, only returns origins with an URL after this value

fixed_object_add(cur: cursor, fixed_objects: List[FixedObject]) None[source]#
fixed_object_iter() Iterator[FixedObject][source]#