swh.loader.core.discovery module
Primitives for finding the unknown parts of disk contents efficiently.
- class swh.loader.core.discovery.Sample(contents, skipped_contents, directories)
Bases: tuple
Create new instance of Sample(contents, skipped_contents, directories)
- property contents
Alias for field number 0
- property directories
Alias for field number 2
- property skipped_contents
Alias for field number 1
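The field numbering above can be illustrated with a small sketch. It uses a local stand-in built with `collections.namedtuple` rather than the real class, and the byte strings are made-up identifiers, so treat it as an illustration of the tuple layout only.

```python
from collections import namedtuple

# Local stand-in that mirrors the documented field order.
# It is NOT imported from swh.loader.core.discovery.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

# The identifiers below are invented placeholders, not real hashes.
sample = Sample(
    contents={b"c1"},
    skipped_contents=set(),
    directories={b"d1", b"d2"},
)

# Fields can be read by name, by index, or by unpacking.
contents, skipped_contents, directories = sample
first_field = sample[0]
```

Because the fields are positional, unpacking order matters: skipped contents come second and directories third.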
- class swh.loader.core.discovery.ArchiveDiscoveryInterface(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory])
Bases: ABC
Interface used by the discovery code to abstract over the ways of connecting to the SWH archive (direct storage, web API, etc.). It declares all the methods that the discovery algorithms need.
- skipped_contents: List[SkippedContent]
- abstract async content_missing(contents: List[bytes]) → Iterable[bytes]
List contents missing from the archive, by sha1
- class swh.loader.core.discovery.DiscoveryStorageConnection(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory], swh_storage: StorageInterface)
Bases: ArchiveDiscoveryInterface
Use the storage APIs to query the archive
- async content_missing(contents: List[bytes]) → Iterable[bytes]
List content missing from the archive by sha1
- async skipped_content_missing(skipped_contents: List[bytes]) → Iterable[bytes]
List skipped content missing from the archive by sha1
- async directory_missing(directories: List[bytes]) → Iterable[bytes]
List directories missing from the archive by sha1
- skipped_contents: List[SkippedContent]
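To show the shape of a backend behind this interface, here is a minimal sketch of an in-memory implementation. `ArchiveLike` and `InMemoryArchive` are hypothetical names defined locally; only the three method names and their signatures come from the listing above, and the "archive" is assumed to be a plain set of known identifiers.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Iterable, List, Set


class ArchiveLike(ABC):
    """Hypothetical stand-in exposing the three documented query methods."""

    @abstractmethod
    async def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        ...

    @abstractmethod
    async def skipped_content_missing(
        self, skipped_contents: List[bytes]
    ) -> Iterable[bytes]:
        ...

    @abstractmethod
    async def directory_missing(self, directories: List[bytes]) -> Iterable[bytes]:
        ...


class InMemoryArchive(ArchiveLike):
    """Toy backend (assumption): the archive is a set of known identifiers."""

    def __init__(self, known: Set[bytes]) -> None:
        self.known = known

    def _missing(self, ids: List[bytes]) -> List[bytes]:
        # Keep the input order and drop every identifier the archive knows.
        return [i for i in ids if i not in self.known]

    async def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        return self._missing(contents)

    async def skipped_content_missing(
        self, skipped_contents: List[bytes]
    ) -> Iterable[bytes]:
        return self._missing(skipped_contents)

    async def directory_missing(self, directories: List[bytes]) -> Iterable[bytes]:
        return self._missing(directories)


archive = InMemoryArchive(known={b"a", b"d1"})
missing_contents = asyncio.run(archive.content_missing([b"a", b"b"]))
missing_dirs = asyncio.run(archive.directory_missing([b"d1", b"d2"]))
```

A backend like this could be handy in tests, since the discovery code only needs the three `*_missing` coroutines and does not care how they are answered.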
- class swh.loader.core.discovery.BaseDiscoveryGraph(contents, skipped_contents, directories)
Bases: object
Creates the base structures and methods needed for discovery algorithms. Subclasses should override get_sample to affect how the discovery is made.
- mark_known(entries: Iterable[bytes])
Mark entries and those they imply as known in the SWH archive
- mark_unknown(entries: Iterable[bytes])
Mark entries and those they imply as unknown in the SWH archive
- async get_sample() → Sample
Return a three-tuple of samples from the undecided sets of contents, skipped contents and directories respectively. These samples will be queried against the storage which will tell us which are known.
- async do_query(archive: ArchiveDiscoveryInterface, sample: Sample) → None
Given a three-tuple of samples, ask the archive which are known or unknown and mark them as such.
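The phrase "those they imply" is not spelled out here. One plausible reading, assumed in the sketch below, is that a known directory implies its whole subtree is known, while an unknown entry implies all of its ancestors are unknown. `TinyGraph` and the node names are hypothetical and exist only for this illustration.

```python
from typing import Dict, Iterable, Set


class TinyGraph:
    """Hypothetical miniature of a discovery graph (not the real class)."""

    def __init__(self, children: Dict[bytes, Set[bytes]]) -> None:
        self.children = children
        # Build the reverse mapping so we can walk upwards too.
        self.parents: Dict[bytes, Set[bytes]] = {}
        for parent, kids in children.items():
            for kid in kids:
                self.parents.setdefault(kid, set()).add(parent)
        self.undecided: Set[bytes] = set(children) | set(self.parents)
        self.known: Set[bytes] = set()
        self.unknown: Set[bytes] = set()

    def _mark(
        self,
        entries: Iterable[bytes],
        target: Set[bytes],
        neighbours: Dict[bytes, Set[bytes]],
    ) -> None:
        stack = list(entries)
        while stack:
            node = stack.pop()
            if node not in self.undecided:
                continue  # already decided, nothing to propagate
            self.undecided.discard(node)
            target.add(node)
            stack.extend(neighbours.get(node, ()))

    def mark_known(self, entries: Iterable[bytes]) -> None:
        # Assumption: a known entry implies all of its descendants are known.
        self._mark(entries, self.known, self.children)

    def mark_unknown(self, entries: Iterable[bytes]) -> None:
        # Assumption: an unknown entry implies all of its ancestors are unknown.
        self._mark(entries, self.unknown, self.parents)


# root contains sub and f1; sub contains f2
graph = TinyGraph({b"root": {b"sub", b"f1"}, b"sub": {b"f2"}})
graph.mark_known([b"sub"])    # also decides f2
graph.mark_unknown([b"f1"])   # also decides root
```

Under this reading, two answers from the archive are enough to decide all four nodes, which is why sampling can be cheaper than querying every entry.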
- class swh.loader.core.discovery.RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories)
Bases: BaseDiscoveryGraph
Use random sampling over directories only.
This gives a statistically good spread of entries in the graph from a smaller population than sampling all types of entries would need. Once no directories remain, any entries still undecided are contents or skipped contents: they are sent directly to the storage, since they should be few and their structure is flat.
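The sampling rule described above can be sketched as a standalone function. `pick_sample`, its `sample_size` default and the `rng` parameter are hypothetical, and `Sample` is redefined locally, so this is an illustration of the strategy rather than the module's implementation.

```python
import random
from collections import namedtuple
from typing import Optional, Set

# Local stand-in for the documented three-tuple.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])


def pick_sample(
    undecided_contents: Set[bytes],
    undecided_skipped: Set[bytes],
    undecided_dirs: Set[bytes],
    sample_size: int = 2,  # hypothetical value, chosen small for the demo
    rng: Optional[random.Random] = None,
) -> Sample:
    rng = rng or random.Random()
    if undecided_dirs:
        # While directories remain, sample only among them.
        k = min(sample_size, len(undecided_dirs))
        # random.sample needs a sequence, so sort the set first.
        chosen = rng.sample(sorted(undecided_dirs), k)
        return Sample(set(), set(), set(chosen))
    # No directories left: send every remaining content in one go.
    return Sample(set(undecided_contents), set(undecided_skipped), set())


dirs = {b"d1", b"d2", b"d3"}
with_dirs = pick_sample({b"c1"}, set(), dirs, rng=random.Random(0))
without_dirs = pick_sample({b"c1"}, {b"s1"}, set())
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when debugging a discovery session.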
- async swh.loader.core.discovery.filter_known_objects(archive: ArchiveDiscoveryInterface)
Filter archive's contents, skipped_contents and directories to only return those that are unknown to the SWH archive using a discovery algorithm.
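As a rough picture of what a discovery loop does, the sketch below repeatedly takes a batch of undecided identifiers, asks a fake archive which ones are missing, and collects them. It is deliberately flat: it ignores the graph structure and the implication rules, and `fake_missing`, `filter_unknown` and `batch_size` are hypothetical names, not part of the module.

```python
import asyncio
from typing import Iterable, List, Set


async def fake_missing(ids: List[bytes], archived: Set[bytes]) -> Iterable[bytes]:
    # Stand-in for an archive query: report the ids that are not archived.
    return [i for i in ids if i not in archived]


async def filter_unknown(
    ids: Set[bytes], archived: Set[bytes], batch_size: int = 2
) -> Set[bytes]:
    undecided = set(ids)
    unknown: Set[bytes] = set()
    while undecided:
        # Take a batch of undecided ids (sorted only to keep runs deterministic).
        batch = sorted(undecided)[:batch_size]
        missing = set(await fake_missing(batch, archived))
        unknown |= missing          # missing from the archive: must be loaded
        undecided -= set(batch)     # every queried id is now decided
    return unknown


result = asyncio.run(filter_unknown({b"a", b"b", b"c"}, archived={b"b"}))
```

A graph-aware algorithm would additionally remove implied entries from the undecided set after each answer, so it would usually need fewer queries than this flat loop.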