swh.model.discovery module
Primitives for finding unknown content efficiently.
- class swh.model.discovery.Sample(contents, skipped_contents, directories)
Bases: tuple
Create a new instance of Sample(contents, skipped_contents, directories).
- contents
Alias for field number 0
- directories
Alias for field number 2
- skipped_contents
Alias for field number 1
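Because Sample is a named tuple, its fields can be read by name or by position, and it unpacks like a plain tuple. The sketch below illustrates this with a stand-in built from collections.namedtuple that mirrors the documented field order. It does not import swh.model, and the byte strings are placeholder values rather than real identifiers.

```python
from collections import namedtuple

# Stand-in mirroring the documented field order; the real class is
# swh.model.discovery.Sample.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

# Placeholder identifiers, chosen only for illustration.
sample = Sample(
    contents={b"c1", b"c2"},
    skipped_contents=set(),
    directories={b"d1"},
)

# Fields are reachable by name or by index
# (0: contents, 1: skipped_contents, 2: directories).
by_name = sample.directories
by_index = sample[2]

# Tuple unpacking follows the same positional order.
contents, skipped_contents, directories = sample
```

Note that the attribute listing above is alphabetical, while the positional order is contents, skipped_contents, directories.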
- class swh.model.discovery.ArchiveDiscoveryInterface(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory])
Bases: Protocol
Interface used in discovery code to abstract over ways of connecting to the SWH archive (direct storage, web API, etc.) for all methods needed by discovery algorithms.
- skipped_contents: List[SkippedContent]
- content_missing(contents: List[bytes]) -> Iterable[bytes]
List the contents missing from the archive, by sha1.
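Since the interface is a Protocol, any object with matching methods can serve as a backend, without inheriting from it. The following is a minimal sketch of that idea, assuming a reduced stand-in protocol that declares only the content_missing method documented above. ContentLookup, InMemoryArchive and count_missing are hypothetical names introduced for illustration and are not part of swh.model.

```python
from typing import Iterable, List, Protocol, Set


class ContentLookup(Protocol):
    """Hypothetical, reduced stand-in for the documented interface."""

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        ...


class InMemoryArchive:
    """Toy backend that treats a fixed set of sha1 hashes as already archived."""

    def __init__(self, known: Set[bytes]) -> None:
        self._known = known

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        # Return the hashes this toy archive has never seen, in input order.
        return [sha1 for sha1 in contents if sha1 not in self._known]


def count_missing(archive: ContentLookup, hashes: List[bytes]) -> int:
    # Accepts any object that structurally satisfies the protocol.
    return len(list(archive.content_missing(hashes)))


# Placeholder 20-byte values standing in for sha1 digests.
archive = InMemoryArchive(known={b"a" * 20})
missing = list(archive.content_missing([b"a" * 20, b"b" * 20]))
```

A backend that talks to a direct storage or to the web API would expose the same method and differ only in how the lookup is performed.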
- class swh.model.discovery.BaseDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)
Bases: object
Creates the base structures and methods needed for discovery algorithms. Subclasses should override get_sample to affect how the discovery is made.
The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known), where content is the relevant model.Content object and known is a boolean: True if the file is known to the archive, False otherwise.
- mark_known(entries: Iterable[bytes])
Mark entries and those they imply as known in the SWH archive.
- mark_unknown(entries: Iterable[bytes])
Mark entries and those they imply as unknown in the SWH archive.
- get_sample() -> Sample
Return a three-tuple of samples from the undecided sets of contents, skipped contents and directories respectively. These samples will be queried against the storage which will tell us which are known.
- do_query(archive: ArchiveDiscoveryInterface, sample: Sample) -> None
Given a three-tuple of samples, ask the archive which are known or unknown and mark them as such.
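The interplay between get_sample, do_query, mark_known and mark_unknown can be sketched with a toy model. This is not the swh.model implementation. It assumes the usual Merkle-DAG reasoning for "those they imply": a known directory implies that everything beneath it is known, and an unknown entry implies that every directory above it is unknown. ToyArchive, ToyDiscoveryGraph and the directory_missing method (assumed by analogy with content_missing) are hypothetical, and skipped contents are left out for brevity.

```python
from collections import namedtuple
from typing import Dict, Iterable, List, Set

Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])


class ToyArchive:
    """Hypothetical in-memory archive; directory_missing is an assumption."""

    def __init__(self, archived: Set[bytes]) -> None:
        self._archived = archived

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        return [c for c in contents if c not in self._archived]

    def directory_missing(self, directories: List[bytes]) -> Iterable[bytes]:
        return [d for d in directories if d not in self._archived]


class ToyDiscoveryGraph:
    """Simplified stand-in for a discovery graph (skipped contents omitted)."""

    def __init__(self, children: Dict[bytes, Set[bytes]]) -> None:
        self.children = children  # directory id -> ids of its direct entries
        self.parents: Dict[bytes, Set[bytes]] = {}
        entries = set(children)
        for parent, kids in children.items():
            entries |= kids
            for kid in kids:
                self.parents.setdefault(kid, set()).add(parent)
        self.undecided: Set[bytes] = entries
        self.known: Set[bytes] = set()
        self.unknown: Set[bytes] = set()

    def _mark(self, entries, target, neighbours) -> None:
        # Move entries into target, then follow the implication edges.
        stack = list(entries)
        while stack:
            entry = stack.pop()
            if entry not in target:
                target.add(entry)
                self.undecided.discard(entry)
                stack.extend(neighbours.get(entry, ()))

    def mark_known(self, entries: Iterable[bytes]) -> None:
        # Known directory => all of its descendants are known.
        self._mark(entries, self.known, self.children)

    def mark_unknown(self, entries: Iterable[bytes]) -> None:
        # Unknown entry => all of its ancestors are unknown.
        self._mark(entries, self.unknown, self.parents)

    def get_sample(self) -> Sample:
        # Deterministic for the sake of the example: one directory at a
        # time, then whatever contents are still undecided.
        directories = sorted(e for e in self.undecided if e in self.children)
        if directories:
            return Sample(set(), set(), {directories[0]})
        return Sample(set(self.undecided), set(), set())

    def do_query(self, archive: ToyArchive, sample: Sample) -> None:
        missing = set(archive.content_missing(list(sample.contents)))
        missing |= set(archive.directory_missing(list(sample.directories)))
        self.mark_unknown(missing)
        self.mark_known((sample.contents | sample.directories) - missing)


# root/ holds lib/ and new.txt; lib/ holds a.c and b.c.
tree = {b"root": {b"lib", b"new.txt"}, b"lib": {b"a.c", b"b.c"}}
archive = ToyArchive(archived={b"lib", b"a.c", b"b.c"})
graph = ToyDiscoveryGraph(tree)

queries = 0
while graph.undecided:
    graph.do_query(archive, graph.get_sample())
    queries += 1
```

In this toy run, three queries settle all five entries: learning that lib is known also settles a.c and b.c without asking about them individually.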
- class swh.model.discovery.RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)
Bases: BaseDiscoveryGraph
Use random sampling over directories only.
This gives a statistically good spread of entries in the graph from a smaller population than sampling all types of entries would. Once no directories remain undecided, any contents or skipped contents that are still undecided are sent directly to the storage, since they should be few and their structure is flat.
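The strategy described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's code: sample_directories_first and SAMPLE_SIZE are hypothetical names, and the sample size of 3 is an arbitrary value picked for the example.

```python
import random
from collections import namedtuple
from typing import Set

Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

SAMPLE_SIZE = 3  # hypothetical value chosen for this sketch


def sample_directories_first(
    undecided_contents: Set[bytes],
    undecided_skipped: Set[bytes],
    undecided_directories: Set[bytes],
    rng: random.Random,
) -> Sample:
    if undecided_directories:
        # While directories remain undecided, sample among them only.
        size = min(SAMPLE_SIZE, len(undecided_directories))
        picked = set(rng.sample(sorted(undecided_directories), size))
        return Sample(set(), set(), picked)
    # No directories left: send the remaining flat entries in one go.
    return Sample(set(undecided_contents), set(undecided_skipped), set())


rng = random.Random(0)  # seeded so that runs are repeatable
all_dirs = {b"d1", b"d2", b"d3", b"d4", b"d5"}

with_dirs = sample_directories_first({b"c1"}, {b"s1"}, all_dirs, rng)
without_dirs = sample_directories_first({b"c1"}, {b"s1"}, set(), rng)
```

Sorting the set before calling rng.sample is needed because random.sample requires a sequence, and it also makes the seeded draw reproducible.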
- swh.model.discovery.filter_known_objects(archive: ArchiveDiscoveryInterface, update_info_callback: Callable[[Any, bool], None] | None = None)
Filter archive's contents, skipped_contents and directories to only return those that are unknown to the SWH archive, using a discovery algorithm.
The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known), where content is the relevant model.Content object and known is a boolean: True if the file is known to the archive, False otherwise.
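A callback matching the documented Callable[[Any, bool], None] shape could look like the sketch below. ProgressTracker and report_all are hypothetical: report_all merely stands in for whatever code invokes the callback, and plain strings are used in place of model.Content objects so that the example runs without swh.model installed.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

UpdateInfoCallback = Callable[[Any, bool], None]


class ProgressTracker:
    """Collects each (content, known) notification as it arrives."""

    def __init__(self) -> None:
        self.known: List[Any] = []
        self.unknown: List[Any] = []

    def __call__(self, content: Any, known: bool) -> None:
        (self.known if known else self.unknown).append(content)


def report_all(
    results: Iterable[Tuple[Any, bool]],
    update_info_callback: Optional[UpdateInfoCallback] = None,
) -> None:
    # Hypothetical driver: the callback is optional, so guard against None.
    for content, known in results:
        if update_info_callback is not None:
            update_info_callback(content, known)


tracker = ProgressTracker()
# Placeholder strings stand in for model.Content objects.
report_all([("README", True), ("new.py", False), ("LICENSE", True)], tracker)
report_all([("ignored", True)])  # no callback supplied: nothing is recorded
```

An instance with a __call__ method is used here so that state accumulates across notifications. A plain function or a lambda with the same two parameters would fit the signature equally well.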