swh.indexer.rehash module

class swh.indexer.rehash.RecomputeChecksums

Bases: object

Class in charge of (re)computing the hashes of contents.

The hashes to compute are defined by two configuration options:

compute_checksums ([str])

list of hash algorithms that the swh.model.hashutil.MultiHash.from_data function should be able to deal with. For variable-length checksums, the desired checksum length must also be provided, using the format <algorithm’s name>:<variable-length>, e.g. blake2:512

recompute_checksums (bool)

a boolean indicating whether hashes listed in compute_checksums that already exist on a content should also be recomputed. Defaults to False.
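The interplay of the two options can be sketched as follows. This is a minimal illustration, not the module's implementation: `parse_checksum_spec` and `checksums_to_compute` are hypothetical helpers, and the sketch assumes that a content dictionary stores each hash under the same string used in compute_checksums.

```python
from typing import Any, Dict, List, Optional, Tuple


def parse_checksum_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split '<name>:<length>' into (name, length); length is None if absent."""
    name, sep, length = spec.partition(":")
    return (name, int(length)) if sep else (name, None)


def checksums_to_compute(
    content: Dict[str, Any],
    compute_checksums: List[str],
    recompute_checksums: bool = False,
) -> List[str]:
    """Select the configured hashes that should be computed for one content."""
    if recompute_checksums:
        # Recompute everything that is configured, even if already present.
        return list(compute_checksums)
    # Otherwise only fill in the hashes the content does not have yet.
    return [algo for algo in compute_checksums if not content.get(algo)]


# Hypothetical configuration values, using the documented format.
config = {
    "compute_checksums": ["sha1", "sha256", "blake2:512"],
    "recompute_checksums": False,
}
parsed = [parse_checksum_spec(s) for s in config["compute_checksums"]]
```

With recompute_checksums left at False, a content that already carries a sha1 would only have the other two hashes computed.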

get_new_contents_metadata(all_contents: List[Dict[str, Any]]) → Generator[Tuple[Dict[str, Any], List[Any]], Any, None]
Retrieve raw contents and compute new checksums on the contents. Unknown or corrupted contents are skipped.

Parameters:

all_contents – list of contents, each a dictionary holding the necessary primary keys

Yields:

tuple – tuple of (content to update, list of checksums computed)
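The behavior described above (fetch the raw data, skip what is unknown or corrupted, yield the updated content together with the computed checksums) can be sketched with the standard library alone. Everything here is an assumption made for illustration: `OBJSTORAGE` is a plain dictionary standing in for the object storage, `new_contents_metadata` is a hypothetical function, contents are keyed by their sha1 digest, and "corrupted" is taken to mean that the stored bytes no longer match that digest.

```python
import hashlib
from typing import Any, Dict, Generator, List, Tuple

# Hypothetical stand-in for the object storage: sha1 digest -> raw bytes.
OBJSTORAGE: Dict[bytes, bytes] = {}


def new_contents_metadata(
    all_contents: List[Dict[str, Any]], algorithms: List[str]
) -> Generator[Tuple[Dict[str, Any], List[str]], None, None]:
    """Yield (content to update, list of checksums computed) pairs."""
    for content in all_contents:
        raw = OBJSTORAGE.get(content["sha1"])
        if raw is None:
            continue  # unknown content: skipped
        if hashlib.sha1(raw).digest() != content["sha1"]:
            continue  # corrupted content: skipped
        updated = dict(content)
        for algo in algorithms:
            # hashlib.new accepts any algorithm name hashlib supports.
            updated[algo] = hashlib.new(algo, raw).digest()
        yield updated, list(algorithms)
```

Because the function is a generator, contents are processed lazily, one at a time, and a caller can batch the yielded updates as it sees fit.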

run(contents: List[Dict[str, Any]]) → Dict

Given a list of contents:

  • (re)compute a given set of checksums on contents available in our object storage

  • update those contents with the new metadata

Parameters:

contents – contents as dictionaries with the necessary keys. The keys present in each dictionary should be the ones defined in the ‘primary_key’ option.

Returns:

A summary dict with the keys ‘status’ (the task’s status) and ‘count’ (the number of updated contents).
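The shape of the returned summary can be illustrated with a short sketch. `summarize` is a hypothetical helper, and the status values "eventful" and "uneventful" are assumptions; the documentation above only states that the dict carries a ‘status’ and a ‘count’.

```python
from typing import Any, Dict, Iterable, List, Tuple


def summarize(
    updates: Iterable[Tuple[Dict[str, Any], List[str]]]
) -> Dict[str, Any]:
    """Build a summary dict from the (content, checksums) pairs that were updated."""
    count = sum(1 for _ in updates)
    # Assumed status labels: "eventful" if anything was updated, else "uneventful".
    return {"status": "eventful" if count else "uneventful", "count": count}
```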