swh.indexer.rehash module
- class swh.indexer.rehash.RecomputeChecksums
Bases: object
Class in charge of (re)computing content’s hashes.
Hashes to compute are defined across 2 configuration options:
- compute_checksums ([str])
List of hash algorithms that the swh.model.hashutil.MultiHash.from_data() function should be able to handle. For variable-length checksums, the desired checksum length should also be provided, in the format <algorithm’s name>:<variable-length>, e.g. blake2:512.
- recompute_checksums (bool)
Whether to also recompute any existing hashes listed in compute_checksums. Defaults to False.
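The entry format described above can be illustrated with a minimal sketch. It does not use swh.model; it relies on the standard hashlib module as a stand-in. The helpers parse_checksum_spec and compute_checksums_sketch, the config dictionary, and the mapping of blake2 to BLAKE2b with a length in bits are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def parse_checksum_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split 'blake2:512' into ('blake2', 512) and 'sha1' into ('sha1', None)."""
    name, sep, length = spec.partition(":")
    return (name, int(length)) if sep else (name, None)


def compute_checksums_sketch(data: bytes, specs: List[str]) -> Dict[str, bytes]:
    """Compute one digest per entry, keyed by the entry exactly as written."""
    digests = {}
    for spec in specs:
        name, bits = parse_checksum_spec(spec)
        if bits is None:
            # Fixed-length algorithm: hashlib resolves it by name.
            digests[spec] = hashlib.new(name, data).digest()
        elif name == "blake2":
            # Assumption: 'blake2' means BLAKE2b and the length is given in bits.
            digests[spec] = hashlib.blake2b(data, digest_size=bits // 8).digest()
        else:
            raise ValueError(f"unsupported variable-length algorithm: {name}")
    return digests


# Hypothetical configuration mirroring the two options described above.
config = {
    "compute_checksums": ["sha1", "sha256", "blake2:512"],
    "recompute_checksums": False,
}
digests = compute_checksums_sketch(b"hello", config["compute_checksums"])
```

Keying the result by the entry as written keeps the sketch neutral about how the real implementation names variable-length hashes.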
- get_new_contents_metadata(all_contents: List[Dict[str, Any]]) → Generator[Tuple[Dict[str, Any], List[Any]], Any, None]
Retrieve raw contents and compute new checksums on the contents. Unknown or corrupted contents are skipped.
- Parameters:
all_contents – list of contents, each a dictionary with the necessary primary keys
- Yields:
tuple – tuple of (content to update, list of checksums computed)
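The skip-and-yield behaviour described above could look roughly like the following sketch. It is not the actual implementation: the function new_contents_metadata_sketch, the in-memory raw_store mapping standing in for the object storage, the use of sha1 as the primary key, and the definition of "corrupted" as a sha1 mismatch are all assumptions.

```python
import hashlib
from typing import Any, Dict, Iterator, List, Mapping, Tuple


def new_contents_metadata_sketch(
    all_contents: List[Dict[str, Any]],
    raw_store: Mapping[bytes, bytes],
    wanted: List[str],
    recompute: bool = False,
) -> Iterator[Tuple[Dict[str, Any], List[str]]]:
    """Yield (content to update, list of checksums computed) pairs."""
    for content in all_contents:
        # Without recompute, only hashes the content lacks are computed.
        todo = [h for h in wanted if recompute or h not in content]
        if not todo:
            # Assumption: contents with nothing to compute are not yielded.
            continue
        raw = raw_store.get(content["sha1"])
        if raw is None:
            # Unknown content: skipped.
            continue
        if hashlib.sha1(raw).digest() != content["sha1"]:
            # Corrupted content (stored bytes do not match the key): skipped.
            continue
        updated = dict(content)
        for h in todo:
            updated[h] = hashlib.new(h, raw).digest()
        yield updated, todo


foo_id = hashlib.sha1(b"foo").digest()
store = {foo_id: b"foo", b"\x00" * 20: b"tampered"}
contents = [
    {"sha1": foo_id},                        # known and intact
    {"sha1": hashlib.sha1(b"bar").digest()},  # unknown to the store
    {"sha1": b"\x00" * 20},                  # stored bytes do not match
]
results = list(new_contents_metadata_sketch(contents, store, ["sha256"]))
```

Only the first content is yielded here; the other two are silently skipped, matching the "unknown or corrupted contents are skipped" rule.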
- run(contents: List[Dict[str, Any]]) → Dict
Given a list of contents:
- (re)compute a given set of checksums on contents available in our object storage
- update those contents with the new metadata
- Parameters:
contents – contents as dictionaries with the necessary keys. The keys present in each dictionary should be the ones defined in the ‘primary_key’ option.
- Returns:
A summary dict with keys ‘status’ (the task’s status) and ‘count’ (the number of updated contents).
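The shape of the returned summary can be sketched as follows. This is an illustration only: run_sketch and its update_content callback are hypothetical, and the status values "eventful" and "uneventful" are assumptions, since the text above does not say which values ‘status’ can take.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_sketch(
    updates: Iterable[Tuple[Dict[str, Any], List[str]]],
    update_content: Callable[[Dict[str, Any], List[str]], None],
) -> Dict[str, Any]:
    """Apply each update and return a summary with 'status' and 'count'."""
    count = 0
    for content, checksums in updates:
        if not checksums:
            # Nothing was computed for this content, so nothing to store.
            continue
        update_content(content, checksums)
        count += 1
    # Assumed status values; the real task may use different ones.
    return {"status": "eventful" if count else "uneventful", "count": count}


stored: List[Tuple[Dict[str, Any], List[str]]] = []
summary = run_sketch(
    [({"sha1": b"a"}, ["sha256"]), ({"sha1": b"b"}, [])],
    lambda content, keys: stored.append((content, keys)),
)
print(summary)  # {'status': 'eventful', 'count': 1}
```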