swh.model.discovery module

Primitives for finding unknown content efficiently.

class swh.model.discovery.Sample(contents, skipped_contents, directories)

Bases: tuple

Create new instance of Sample(contents, skipped_contents, directories)

contents

Alias for field number 0

directories

Alias for field number 2

skipped_contents

Alias for field number 1
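As a quick illustration of the field layout, the sketch below rebuilds a stand-in tuple with the same field names and order as documented above. It uses `collections.namedtuple` directly rather than importing from swh.model, and the identifiers are made-up placeholders.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented Sample tuple.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

# Placeholder identifiers, purely illustrative.
sample = Sample(contents={b"c1"}, skipped_contents=set(), directories={b"d1", b"d2"})

# Fields are reachable by name or by position.
assert sample.contents is sample[0]
assert sample.skipped_contents is sample[1]
assert sample.directories is sample[2]

# Being a tuple, it also unpacks in declaration order.
contents, skipped_contents, directories = sample
```

Note that the attributes are listed alphabetically above, while the positional order is contents, skipped_contents, directories.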

class swh.model.discovery.ArchiveDiscoveryInterface(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory])

Bases: Protocol

Interface used in discovery code to abstract over ways of connecting to the SWH archive (direct storage, web API, etc.) for all methods needed by discovery algorithms.

contents: List[Content]

skipped_contents: List[SkippedContent]

directories: List[Directory]

content_missing(contents: List[bytes]) -> Iterable[bytes]

List the contents missing from the archive, identified by sha1.

skipped_content_missing(skipped_contents: List[bytes]) -> Iterable[bytes]

List the skipped contents missing from the archive, identified by sha1.

directory_missing(directories: List[bytes]) -> Iterable[bytes]

List the directories missing from the archive, identified by sha1.
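Because the interface is a Protocol, any object exposing these attributes and methods should satisfy it structurally. The sketch below is a hypothetical in-memory stand-in, useful for tests: `InMemoryArchive` and its `archived` parameter are illustrative names, not part of swh.model.

```python
from typing import Iterable, List, Set


class InMemoryArchive:
    """Hypothetical stand-in: answers queries from a set of archived ids."""

    def __init__(self, contents, skipped_contents, directories, archived: Set[bytes]):
        self.contents = contents
        self.skipped_contents = skipped_contents
        self.directories = directories
        self._archived = archived

    def _missing(self, ids: List[bytes]) -> Iterable[bytes]:
        # Keep only the identifiers the stand-in archive has never seen.
        return [i for i in ids if i not in self._archived]

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        return self._missing(contents)

    def skipped_content_missing(self, skipped_contents: List[bytes]) -> Iterable[bytes]:
        return self._missing(skipped_contents)

    def directory_missing(self, directories: List[bytes]) -> Iterable[bytes]:
        return self._missing(directories)


# Placeholder 20-byte identifiers; only the first one is "archived".
archive = InMemoryArchive([], [], [], archived={b"\x01" * 20})
missing = list(archive.content_missing([b"\x01" * 20, b"\x02" * 20]))
```

A real implementation would forward the same three calls to a storage backend or to the web API.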

class swh.model.discovery.BaseDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)

Bases: object

Provides the base structures and methods needed by discovery algorithms. Subclasses should override get_sample to change how discovery is performed.

The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known):

- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.
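A callback matching the documented `Callable[[Any, bool], None]` shape could simply record what it is told. In the minimal sketch below, `record_update` and `seen` are hypothetical names, and plain strings stand in for the model objects the callback would normally receive.

```python
from typing import Any, List, Tuple

# Accumulates (object, known) pairs as they are reported.
seen: List[Tuple[Any, bool]] = []


def record_update(obj: Any, known: bool) -> None:
    """Store each piece of information reported by the discovery code."""
    seen.append((obj, known))


# Simulated notifications; real calls would come from the discovery graph.
record_update("README.md", True)
record_update("new_file.py", False)

# Objects the archive does not know yet.
unknown = [obj for obj, known in seen if not known]
```

Such a function would be passed as update_info_callback, for instance to drive a progress display.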

mark_known(entries: Iterable[bytes])

Mark entries and those they imply as known in the SWH archive.

mark_unknown(entries: Iterable[bytes])

Mark entries and those they imply as unknown in the SWH archive.
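What "those they imply" covers is not spelled out here. A plausible reading, assumed in the sketch below, is that a known directory implies its descendants are known, while an unknown entry implies its ancestors are unknown. `TinyGraph` and all of its attributes are hypothetical and do not mirror the actual class internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


class TinyGraph:
    """Hypothetical parent/child graph keyed by object identifier."""

    def __init__(self, edges: Iterable[Tuple[bytes, bytes]]):
        self.children: Dict[bytes, Set[bytes]] = defaultdict(set)
        self.parents: Dict[bytes, Set[bytes]] = defaultdict(set)
        self.undecided: Set[bytes] = set()
        self.known: Set[bytes] = set()
        self.unknown: Set[bytes] = set()
        for parent, child in edges:
            self.children[parent].add(child)
            self.parents[child].add(parent)
            self.undecided.update((parent, child))

    def _mark(self, entries, target, neighbours) -> None:
        stack = list(entries)
        while stack:
            current = stack.pop()
            if current not in self.undecided:
                continue  # already decided, nothing to propagate
            self.undecided.discard(current)
            target.add(current)
            stack.extend(neighbours.get(current, ()))

    def mark_known(self, entries: Iterable[bytes]) -> None:
        # Assumption: a known entry implies all of its descendants are known.
        self._mark(entries, self.known, self.children)

    def mark_unknown(self, entries: Iterable[bytes]) -> None:
        # Assumption: an unknown entry implies all of its ancestors are unknown.
        self._mark(entries, self.unknown, self.parents)


graph = TinyGraph([(b"root", b"src"), (b"root", b"docs"), (b"src", b"main.py")])
graph.mark_known([b"src"])     # also decides b"main.py"
graph.mark_unknown([b"docs"])  # also decides b"root"
```

Under these assumptions, two answers from the archive are enough to decide all four entries, which is where the efficiency of discovery comes from.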

get_sample() -> Sample

Return a three-tuple of samples from the undecided sets of contents, skipped contents and directories respectively. These samples will be queried against the storage which will tell us which are known.

do_query(archive: ArchiveDiscoveryInterface, sample: Sample) -> None

Given a three-tuple of samples, ask the archive which are known or unknown and mark them as such.
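Since the archive methods return only the missing identifiers, the known ones can be derived as the sample minus the missing set. The sketch below shows that split for a single sampled set. `split_sample` is a hypothetical helper, not a function of this module, and `directory_missing` here is a local stand-in for the archive call.

```python
from typing import Callable, Iterable, List, Set, Tuple


def split_sample(
    sample_ids: Set[bytes],
    missing_query: Callable[[List[bytes]], Iterable[bytes]],
) -> Tuple[Set[bytes], Set[bytes]]:
    """Return (known, unknown) for one sampled set of identifiers."""
    unknown = set(missing_query(list(sample_ids)))
    known = sample_ids - unknown
    return known, unknown


# Stand-in for the archive: only b"d1" has been archived.
archived = {b"d1"}


def directory_missing(ids: List[bytes]) -> Iterable[bytes]:
    return [i for i in ids if i not in archived]


known, unknown = split_sample({b"d1", b"d2"}, directory_missing)
```

The two resulting sets would then be handed to mark_known and mark_unknown respectively, once per field of the sample.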

class swh.model.discovery.RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)

Bases: BaseDiscoveryGraph

Discover unknown entries by randomly sampling directories only.

This gives a statistically good spread of entries across the graph while drawing from a smaller population than one that includes every type of entry. Once no undecided directories remain, any entries still undecided can only be contents or skipped contents; these are sent directly to the storage, since they should be few and their structure is flat.

get_sample() -> Sample

Return a three-tuple of samples from the undecided sets of contents, skipped contents and directories respectively. These samples will be queried against the storage which will tell us which are known.
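The strategy described for this class can be sketched as follows: sample directories while any are undecided, then fall back to sending all remaining contents and skipped contents. The function name, its parameters and the default `sample_size` are assumptions made for illustration, and `Sample` is a local stand-in for the documented tuple.

```python
import random
from collections import namedtuple
from typing import Optional, Set

# Local stand-in with the same field order as the documented Sample tuple.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])


def random_dir_sample(
    undecided_contents: Set[bytes],
    undecided_skipped: Set[bytes],
    undecided_dirs: Set[bytes],
    sample_size: int = 1000,  # illustrative default, an assumption
    rng: Optional[random.Random] = None,
) -> Sample:
    rng = rng or random.Random()
    if undecided_dirs:
        # Directories remain: sample only among them.
        size = min(sample_size, len(undecided_dirs))
        # random.sample needs a sequence, so the set is sorted into a list.
        picked = set(rng.sample(sorted(undecided_dirs), size))
        return Sample(set(), set(), picked)
    # No directories left: send every remaining file-level entry at once.
    return Sample(set(undecided_contents), set(undecided_skipped), set())


first = random_dir_sample({b"c"}, set(), {b"d1", b"d2", b"d3"}, sample_size=2)
last = random_dir_sample({b"c"}, {b"s"}, set())
```

Passing a seeded `random.Random` as `rng` makes the sampling reproducible in tests.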

swh.model.discovery.filter_known_objects(archive: ArchiveDiscoveryInterface, update_info_callback: Callable[[Any, bool], None] | None = None)

Filter the archive's contents, skipped_contents and directories using a discovery algorithm, returning only those that are unknown to the SWH archive.

The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known):

- content: the relevant model.Content object,
- known: a boolean, True if the file is known to the archive, False otherwise.