swh.indexer.rehash module

class swh.indexer.rehash.RecomputeChecksums

Bases: swh.core.config.SWHConfig

Class in charge of (re)computing content’s hashes.

Hashes to compute are defined across 2 configuration options:

compute_checksums ([str])

List of hash algorithms that the swh.model.hashutil.MultiHash.from_data function should be able to deal with. For variable-length checksums, a desired checksum length should also be provided, in the format <algorithm’s name>:<variable-length>, e.g. blake2:512.

recompute_checksums (bool)

Boolean indicating whether hashes listed in compute_checksums should be recomputed even when they already exist on a content. Defaults to False.
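The interaction between the two options can be sketched as follows. This is a minimal illustration, not the class's actual code: the helpers `parse_checksum_spec` and `checksums_to_compute` are hypothetical names, and the sketch assumes that a content dictionary is keyed by the algorithm name that appears before the colon.

```python
from typing import Dict, List, Optional, Tuple


def parse_checksum_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split '<name>:<length>' into (name, length); length is None if absent."""
    name, sep, length = spec.partition(":")
    return name, (int(length) if sep else None)


def checksums_to_compute(
    content: Dict[str, bytes],
    compute_checksums: List[str],
    recompute_checksums: bool = False,
) -> List[str]:
    """Hypothetical helper: pick the configured specs that apply to a content."""
    if recompute_checksums:
        # Recompute everything that is configured, existing or not.
        return list(compute_checksums)
    # Assumption: an existing hash is stored under the bare algorithm name.
    return [
        spec
        for spec in compute_checksums
        if parse_checksum_spec(spec)[0] not in content
    ]
```

With `recompute_checksums` left at False, only the hashes missing from the content are selected; setting it to True selects every configured hash.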

DEFAULT_CONFIG = {'batch_size_retrieve_content': ('int', 10), 'batch_size_update': ('int', 100), 'compute_checksums': ('list[str]', []), 'objstorage': ('dict', {'cls': 'pathslicing', 'args': {'root': '/srv/softwareheritage/objects', 'slicing': '0:2/2:4/4:6'}}), 'recompute_checksums': ('bool', False), 'storage': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}})}
CONFIG_BASE_FILENAME = 'indexer/rehash'
get_new_contents_metadata(all_contents: List[Dict[str, Any]]) → Generator[Tuple[Dict[str, Any], List[Any]], Any, None]
Retrieve raw contents and compute new checksums on the contents. Unknown or corrupted contents are skipped.

Parameters

all_contents – List of contents, each a dictionary with the necessary primary keys

Yields

tuple – tuple of (content to update, list of checksums computed)
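The skip-and-yield behaviour described above can be illustrated with a small generator. This is a sketch under stated assumptions, not the method's implementation: a plain dictionary stands in for the object storage, contents are looked up by a `sha1` key, corruption is detected by re-hashing with SHA-1, and `hashlib` replaces `MultiHash`. The function name `new_contents_metadata` is hypothetical.

```python
import hashlib
from typing import Any, Dict, Iterator, List, Tuple


def new_contents_metadata(
    all_contents: List[Dict[str, Any]],
    objstorage: Dict[bytes, bytes],  # assumption: dict standing in for object storage
    algorithms: List[str],
) -> Iterator[Tuple[Dict[str, Any], List[str]]]:
    """Yield (content to update, list of checksums computed) for readable contents."""
    for content in all_contents:
        raw = objstorage.get(content["sha1"])
        if raw is None:
            continue  # unknown content: skip it
        if hashlib.sha1(raw).digest() != content["sha1"]:
            continue  # corrupted content (assumed check): skip it
        updated = dict(content)
        for algo in algorithms:
            updated[algo] = hashlib.new(algo, raw).digest()
        yield updated, list(algorithms)
```

Because the function is a generator, contents are processed lazily and a caller can batch the yielded updates as it sees fit.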

run(contents: List[Dict[str, Any]]) → Dict

Given a list of contents:

  • (re)compute a given set of checksums on contents available in our object storage

  • update those contents with the new metadata


Parameters

contents – contents as dictionaries with the necessary keys. The keys present in each dictionary should be the ones defined in the ‘primary_key’ option.

Returns

A summary dict with the keys ‘status’ (the task’s status) and ‘count’ (the number of updated contents).
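The batching and summary steps of run can be sketched as below. This is only an illustration: `run_sketch` and the `update_batch` callback are hypothetical, the batch size mirrors the `batch_size_update` default from DEFAULT_CONFIG, and the status strings `eventful` and `uneventful` are assumed values rather than ones stated on this page.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_sketch(
    updates: Iterable[Tuple[Dict[str, Any], List[str]]],
    update_batch: Callable[[List[Dict[str, Any]]], None],  # hypothetical storage hook
    batch_size_update: int = 100,
) -> Dict[str, Any]:
    """Send updated contents to storage in batches and return a summary dict."""
    count = 0
    batch: List[Dict[str, Any]] = []
    for content, _checksums in updates:
        batch.append(content)
        if len(batch) >= batch_size_update:
            update_batch(batch)
            count += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        update_batch(batch)
        count += len(batch)
    # Assumption: the status values used here are illustrative.
    return {"status": "eventful" if count else "uneventful", "count": count}
```

A caller would typically pass the pairs yielded by get_new_contents_metadata as `updates` and inspect the returned ‘count’ to see how many contents were written.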