swh.model.discovery module

Primitives for finding unknown content efficiently.

class swh.model.discovery.Sample(contents, skipped_contents, directories)

Bases: tuple

Create new instance of Sample(contents, skipped_contents, directories)

property contents

Alias for field number 0

property directories

Alias for field number 2

property skipped_contents

Alias for field number 1
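As a quick illustration of the field numbering above, the following sketch rebuilds a stand-in tuple with the same field order using `collections.namedtuple`. It is not the library's own definition, and the byte-string ids are made-up placeholders.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented Sample tuple.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

sample = Sample(
    contents={b"c1", b"c2"},      # placeholder ids, not real hashes
    skipped_contents=set(),
    directories={b"d1"},
)

# Each field can be read by name or by its field number.
assert sample.contents is sample[0]
assert sample.skipped_contents is sample[1]
assert sample.directories is sample[2]

# Being a tuple, it also unpacks in field order.
contents, skipped_contents, directories = sample
```

Note that `directories` is field number 2 even though the properties are listed alphabetically above.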

class swh.model.discovery.ArchiveDiscoveryInterface(*args, **kwds)

Bases: Protocol

Interface used in discovery code to abstract over ways of connecting to the SWH archive (direct storage, web API, etc.) for all methods needed by discovery algorithms.

contents: List[Content]
skipped_contents: List[SkippedContent]
directories: List[Directory]
async content_missing(contents: List[bytes]) → Iterable[bytes]

List content missing from the archive by sha1

async skipped_content_missing(skipped_contents: List[bytes]) → Iterable[bytes]

List skipped content missing from the archive by sha1

async directory_missing(directories: List[bytes]) → Iterable[bytes]

List directories missing from the archive by sha1
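To make the shape of this interface concrete, here is a minimal sketch of a class that provides the three documented coroutines. `InMemoryArchive` and its `known_ids` argument are hypothetical names invented for this example; a real implementation would talk to a storage backend or the web API instead of a local set.

```python
import asyncio
from typing import Iterable, List, Set


class InMemoryArchive:
    """Hypothetical stand-in that answers queries from a set of known ids."""

    def __init__(self, known_ids: Set[bytes]) -> None:
        # The documented interface also carries the objects under discovery;
        # plain empty lists keep this sketch self-contained.
        self.contents: list = []
        self.skipped_contents: list = []
        self.directories: list = []
        self._known_ids = known_ids

    def _missing(self, ids: List[bytes]) -> List[bytes]:
        # Keep only the ids the "archive" has never seen.
        return [i for i in ids if i not in self._known_ids]

    async def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        return self._missing(contents)

    async def skipped_content_missing(
        self, skipped_contents: List[bytes]
    ) -> Iterable[bytes]:
        return self._missing(skipped_contents)

    async def directory_missing(self, directories: List[bytes]) -> Iterable[bytes]:
        return self._missing(directories)


archive = InMemoryArchive(known_ids={b"content-a", b"dir-a"})
missing_contents = asyncio.run(
    archive.content_missing([b"content-a", b"content-b"])
)
missing_dirs = asyncio.run(archive.directory_missing([b"dir-a"]))
```

Each method returns the subset of the queried ids that the archive does not have, so an empty result means everything queried is already known.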

class swh.model.discovery.BaseDiscoveryGraph(contents, skipped_contents, directories)

Bases: object

Creates the base structures and methods needed for discovery algorithms. Subclasses should override get_sample to control how discovery proceeds.

mark_known(entries: Iterable[bytes])

Mark entries and those they imply as known in the SWH archive

mark_unknown(entries: Iterable[bytes])

Mark entries and those they imply as unknown in the SWH archive
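One natural reading of "those they imply" is sketched below on a toy graph. It assumes that a known directory implies all of its descendants are known, and that an unknown entry implies every directory containing it is unknown. This reading is an assumption, not something stated on this page; the dictionaries, sets and the `_propagate` helper are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy graph: directory id -> ids of its direct entries.
children: Dict[bytes, Set[bytes]] = {
    b"root": {b"src", b"README"},
    b"src": {b"main.py"},
}

# Reverse edges: entry id -> ids of the directories that contain it.
parents: Dict[bytes, Set[bytes]] = {}
for parent, kids in children.items():
    for kid in kids:
        parents.setdefault(kid, set()).add(parent)

undecided: Set[bytes] = {b"root", b"src", b"README", b"main.py"}
known: Set[bytes] = set()
unknown: Set[bytes] = set()


def _propagate(
    entries: Iterable[bytes], edges: Dict[bytes, Set[bytes]], target: Set[bytes]
) -> None:
    """Move entries, and everything reachable through edges, into target."""
    stack = list(entries)
    while stack:
        entry = stack.pop()
        if entry in target:
            continue
        target.add(entry)
        undecided.discard(entry)
        stack.extend(edges.get(entry, ()))


def mark_known(entries: Iterable[bytes]) -> None:
    # Assumption: a known directory implies its whole subtree is known.
    _propagate(entries, children, known)


def mark_unknown(entries: Iterable[bytes]) -> None:
    # Assumption: an unknown entry implies all its ancestors are unknown.
    _propagate(entries, parents, unknown)


mark_known([b"src"])       # also decides b"main.py"
mark_unknown([b"README"])  # also decides b"root"
```

Under these assumptions, two answers from the archive are enough to decide all four entries, which is why propagation can save queries.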

async get_sample() → Sample

Return a three-tuple of samples drawn from the undecided sets of contents, skipped contents and directories, respectively. The samples are then queried against the storage, which reports which entries are known.

async do_query(archive: ArchiveDiscoveryInterface, sample: Sample) → None

Given a three-tuple of samples, ask the archive which are known or unknown and mark them as such.
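The query step described above can be sketched as follows. `ToyArchive` and `ToyGraph` are hypothetical stand-ins written for this example, and the marking methods here only record ids without propagating to implied entries.

```python
import asyncio
from collections import namedtuple

Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])


class ToyArchive:
    """Hypothetical archive that knows a fixed set of ids."""

    def __init__(self, known_ids):
        self._known_ids = set(known_ids)

    async def content_missing(self, contents):
        return [c for c in contents if c not in self._known_ids]

    async def skipped_content_missing(self, skipped_contents):
        return [s for s in skipped_contents if s not in self._known_ids]

    async def directory_missing(self, directories):
        return [d for d in directories if d not in self._known_ids]


class ToyGraph:
    """Hypothetical graph that only records decisions (no propagation)."""

    def __init__(self):
        self.known = set()
        self.unknown = set()

    def mark_known(self, entries):
        self.known.update(entries)

    def mark_unknown(self, entries):
        self.unknown.update(entries)

    async def do_query(self, archive, sample):
        # Pair each part of the sample with the matching archive query.
        queries = [
            (sample.contents, archive.content_missing),
            (sample.skipped_contents, archive.skipped_content_missing),
            (sample.directories, archive.directory_missing),
        ]
        for ids, query in queries:
            if not ids:
                continue
            missing = set(await query(list(ids)))
            # Whatever the archive reports missing is unknown; the rest is known.
            self.mark_unknown(missing)
            self.mark_known(set(ids) - missing)


graph = ToyGraph()
archive = ToyArchive(known_ids={b"c1", b"d1"})
sample = Sample(contents={b"c1", b"c2"}, skipped_contents=set(), directories={b"d1"})
asyncio.run(graph.do_query(archive, sample))
```

The archive only ever reports what is missing, so the known entries are obtained by subtracting the missing ids from the queried ones.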

class swh.model.discovery.RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories)

Bases: BaseDiscoveryGraph

Sample randomly, using only directories.

This allows us to find a statistically good spread of entries in the graph with a smaller population than using all types of entries. Once no undecided directories remain, any entries still undecided can only be contents or skipped contents; these are sent directly to the storage, since they should be few and their structure is flat.

async get_sample() → Sample

Return a three-tuple of samples drawn from the undecided sets of contents, skipped contents and directories, respectively. The samples are then queried against the storage, which reports which entries are known.
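The sampling strategy described for this class can be sketched as a standalone function. This is an illustration under stated assumptions rather than the class's actual code: the function signature, the `sample_size` parameter and the `rng` argument are all hypothetical.

```python
import random
from collections import namedtuple
from typing import Optional, Set

Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])


def get_sample(
    undecided_contents: Set[bytes],
    undecided_skipped_contents: Set[bytes],
    undecided_directories: Set[bytes],
    sample_size: int = 1000,             # hypothetical parameter
    rng: Optional[random.Random] = None,  # hypothetical, for reproducibility
) -> Sample:
    rng = rng or random.Random()
    if undecided_directories:
        # While directories remain, sample only among them.
        population = sorted(undecided_directories)
        k = min(sample_size, len(population))
        return Sample(set(), set(), set(rng.sample(population, k)))
    # No directories left: send all remaining flat entries in one go.
    return Sample(set(undecided_contents), set(undecided_skipped_contents), set())


rng = random.Random(0)
first = get_sample({b"c1"}, set(), {b"d1", b"d2", b"d3"}, sample_size=2, rng=rng)
last = get_sample({b"c1"}, {b"s1"}, set(), sample_size=2, rng=rng)
```

Here `first` holds two randomly chosen directories and no contents, while `last` holds every remaining content and skipped content because no directory is left to sample.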

async swh.model.discovery.filter_known_objects(archive: ArchiveDiscoveryInterface)

Filter the archive's contents, skipped_contents and directories, returning only those that a discovery algorithm finds to be unknown to the SWH archive.
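The overall loop that such a function might run is sketched below: sample, query, record, and repeat until nothing is undecided. It is a simplified illustration, not the module's implementation. It works on raw ids rather than model objects, omits propagation to implied entries, and `ToyArchive`, `filter_unknown_ids`, `sample_size` and `seed` are hypothetical names.

```python
import asyncio
import random


class ToyArchive:
    """Hypothetical archive holding ids to check plus a set of known ids."""

    def __init__(self, contents, skipped_contents, directories, known_ids):
        self.contents = contents
        self.skipped_contents = skipped_contents
        self.directories = directories
        self._known_ids = set(known_ids)

    async def _missing(self, ids):
        return [i for i in ids if i not in self._known_ids]

    # In this toy, all three queries share the same lookup.
    content_missing = skipped_content_missing = directory_missing = _missing


async def filter_unknown_ids(archive, sample_size=2, seed=0):
    rng = random.Random(seed)
    undecided = {
        "contents": set(archive.contents),
        "skipped_contents": set(archive.skipped_contents),
        "directories": set(archive.directories),
    }
    queries = {
        "contents": archive.content_missing,
        "skipped_contents": archive.skipped_content_missing,
        "directories": archive.directory_missing,
    }
    unknown = set()
    while any(undecided.values()):
        if undecided["directories"]:
            # Sample directories only, as long as some are undecided.
            pool = sorted(undecided["directories"])
            batches = {"directories": rng.sample(pool, min(sample_size, len(pool)))}
        else:
            # Otherwise send every remaining flat entry at once.
            batches = {
                kind: sorted(undecided[kind])
                for kind in ("contents", "skipped_contents")
            }
        for kind, batch in batches.items():
            if not batch:
                continue
            unknown |= set(await queries[kind](batch))
            undecided[kind] -= set(batch)
    return unknown


archive = ToyArchive(
    contents=[b"c1", b"c2"],
    skipped_contents=[],
    directories=[b"d1", b"d2", b"d3"],
    known_ids={b"c1", b"d1"},
)
unknown_ids = asyncio.run(filter_unknown_ids(archive))
```

Every pass removes at least one id from the undecided sets, so the loop terminates. With propagation added, a single answer could decide many entries at once and reduce the number of queries.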