swh.storage.backfill module

Storage backfiller.

The backfiller's goal is to send part or all of the objects from a storage back to the journal topics.

The current implementation consists of the JournalBackfiller class.

It simply reads the objects from the storage and sends every object identifier back to the journal.

swh.storage.backfill.directory_converter(db: BaseDb, directory_d: Dict[str, Any]) → Directory

Convert directory from the flat representation to swh model compatible objects.

swh.storage.backfill.raw_extrinsic_metadata_converter(db: BaseDb, metadata: Dict[str, Any]) → RawExtrinsicMetadata

Convert a raw extrinsic metadata from the flat representation to swh model compatible objects.

swh.storage.backfill.extid_converter(db: BaseDb, extid: Dict[str, Any]) → ExtID

Convert an extid from the flat representation to swh model compatible objects.

swh.storage.backfill.revision_converter(db: BaseDb, revision_d: Dict[str, Any]) → Revision

Convert revision from the flat representation to swh model compatible objects.

swh.storage.backfill.release_converter(db: BaseDb, release_d: Dict[str, Any]) → Release

Convert release from the flat representation to swh model compatible objects.

swh.storage.backfill.snapshot_converter(db: BaseDb, snapshot_d: Dict[str, Any]) → Snapshot

Convert snapshot from the flat representation to swh model compatible objects.

swh.storage.backfill.object_to_offset(object_id, numbits)

Compute the index of the range containing object id, when dividing space into 2^numbits.

Parameters:
  • object_id (str) – The hex representation of object_id

  • numbits (int) – Number of bits in which we divide input space

Returns:

The index of the range containing object id
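
The index described above amounts to reading the leading numbits bits of the identifier. The following is a minimal sketch of that idea, not the module's own code: the name sketch_object_to_offset is hypothetical, and padding short identifiers with zeros is an assumption.

```python
def sketch_object_to_offset(object_id: str, numbits: int) -> int:
    """Return the index of the range holding object_id when the id space
    is split into 2**numbits equal ranges (hypothetical helper)."""
    if numbits == 0:
        # A single range covers the whole space.
        return 0
    # Number of hex digits needed to cover numbits bits (ceiling division).
    ndigits = -(-numbits // 4)
    # Assumption: identifiers shorter than that are padded with zeros.
    prefix = object_id[:ndigits].ljust(ndigits, "0")
    # Drop the extra low-order bits when numbits is not a multiple of 4.
    return int(prefix, 16) >> (4 * ndigits - numbits)


print(sketch_object_to_offset("abcdef", 12))  # prints 2748 (0xabc)
print(sketch_object_to_offset("8000", 1))     # prints 1 (upper half of the space)
```

With numbits=12, for instance, the index is simply the first three hex digits of the identifier read as an integer.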

swh.storage.backfill.byte_ranges(numbits: int, start_object: str | None = None, end_object: str | None = None) → Iterator[Tuple[bytes | None, bytes | None]]

Generate start/end pairs of bytes spanning numbits bits and constrained by optional start_object and end_object.

Parameters:
  • numbits – Number of bits in which we divide input space

  • start_object – Hex object id contained in the first range returned

  • end_object – Hex object id contained in the last range returned

Yields:

2^numbits pairs of bytes
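
To illustrate how such pairs can be produced, here is a rough sketch under stated assumptions rather than the module's implementation. The names sketch_offset and sketch_byte_ranges are hypothetical, and using None for the open lower and upper ends is an assumption suggested by the `bytes | None` return type.

```python
from typing import Iterator, Optional, Tuple


def sketch_offset(object_id: str, numbits: int) -> int:
    """Index of the range holding object_id (hypothetical helper)."""
    if numbits == 0:
        return 0
    ndigits = -(-numbits // 4)  # hex digits needed to cover numbits bits
    prefix = object_id[:ndigits].ljust(ndigits, "0")
    return int(prefix, 16) >> (4 * ndigits - numbits)


def sketch_byte_ranges(
    numbits: int,
    start_object: Optional[str] = None,
    end_object: Optional[str] = None,
) -> Iterator[Tuple[Optional[bytes], Optional[bytes]]]:
    """Yield (start, end) byte boundaries splitting the id space into
    2**numbits ranges, limited to the ranges holding start_object and
    end_object when these are given."""
    nbytes = -(-numbits // 8)     # bytes needed to hold numbits bits
    shift = 8 * nbytes - numbits  # left-align the index inside those bytes
    total = 1 << numbits

    def boundary(index: int) -> bytes:
        return (index << shift).to_bytes(nbytes, "big")

    first = 0 if start_object is None else sketch_offset(start_object, numbits)
    last = total - 1 if end_object is None else sketch_offset(end_object, numbits)
    for index in range(first, last + 1):
        # Assumption: the outermost ranges are left open-ended with None.
        start = None if index == 0 else boundary(index)
        end = None if index == total - 1 else boundary(index + 1)
        yield start, end


print(list(sketch_byte_ranges(1)))
# prints [(None, b'\x80'), (b'\x80', None)]
```

Passing start_object or end_object only narrows which of the ranges are yielded; the boundaries themselves stay the same.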

swh.storage.backfill.raw_extrinsic_metadata_target_ranges(start_object: str | None = None, end_object: str | None = None) → Iterator[Tuple[str | None, str | None]]

Generate ranges of values for the target attribute of raw_extrinsic_metadata objects.

This generates one range for all values before the first SWHID (which would correspond to raw origin URLs), then a number of hex-based ranges for each known type of SWHID (2**12 ranges for directories, 2**8 ranges for all other types). Finally, it generates one extra range for values above all possible SWHIDs.
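
The layout described above can be sketched as follows. This is an illustration only: sketch_target_ranges is a hypothetical name, the list of type tags is passed in by the caller rather than taken from the swh model, and the exact boundary strings (in particular how the last range of each type ends) are assumptions.

```python
from typing import Iterator, List, Optional, Tuple


def sketch_target_ranges(
    type_tags: List[str],
) -> Iterator[Tuple[Optional[str], Optional[str]]]:
    """Yield (start, end) string ranges covering every possible target."""
    tags = sorted(type_tags)
    # One range for everything sorting before the first SWHID (origin URLs).
    yield None, f"swh:1:{tags[0]}:"
    for tag in tags:
        prefix = f"swh:1:{tag}:"
        # 2**12 ranges for directories, 2**8 for every other type.
        numbits = 12 if tag == "dir" else 8
        ndigits = numbits // 4
        count = 1 << numbits
        highest = prefix + "f" * 40  # largest identifier of this type
        for i in range(count):
            lo = prefix + format(i, f"0{ndigits}x")
            hi = prefix + format(i + 1, f"0{ndigits}x") if i + 1 < count else highest
            yield lo, hi
    # One extra range for values above all possible SWHIDs.
    yield f"swh:1:{tags[-1]}:" + "f" * 40, None


# Illustrative tag list; the real function derives its own set of types.
ranges = list(sketch_target_ranges(["cnt", "dir", "rel", "rev", "snp"]))
print(len(ranges))  # prints 5122, i.e. 1 + 2**12 + 4 * 2**8 + 1
```

The count shows why directories get finer-grained ranges: with this tag list they account for 4096 of the 5122 ranges.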

swh.storage.backfill.integer_ranges(start: str, end: str, block_size: int = 1000) → Iterator[Tuple[int | None, int | None]]
swh.storage.backfill.compute_query(obj_type, start, end)
swh.storage.backfill.fetch(db, obj_type, start, end)

Fetch all obj_type’s identifiers from db.

This opens one connection, streams the objects and, when done, closes the connection.

Parameters:
  • db (BaseDb) – Db connection object

  • obj_type (str) – Object type

  • start (Union[bytes|Tuple]) – Range start identifier

  • end (Union[bytes|Tuple]) – Range end identifier

Raises:

ValueError if obj_type is not supported

Yields:

Objects in the given range

class swh.storage.backfill.JournalBackfiller(config=None)

Bases: object

Class in charge of reading the storage’s objects and sending them back to the journal’s topics.

This is designed to be run periodically.

property db
check_config(config)
parse_arguments(object_type, start_object, end_object)

Parse arguments

Raises:
  • ValueError for unsupported object type

  • ValueError if object ids are not parseable

Returns:

Parsed start and end object ids

run(object_type, start_object, end_object, dry_run=False)

Reads the storage’s subscribed object types and sends them to the journal’s reading topic.