swh.scanner.policy module

swh.scanner.policy.source_size(source_tree: Directory)

Return the size of a source tree as the number of nodes it contains.
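To illustrate what "number of nodes" means here, the following is a minimal sketch that uses a nested `dict` as a hypothetical stand-in for a `Directory`. The name `count_nodes` and the choice to count the root itself are assumptions made for illustration, not the library's implementation.

```python
# Hypothetical stand-in for a Merkle directory: a dict maps entry names to
# sub-directories (dicts) or to file contents (bytes).
def count_nodes(tree):
    """Count a node plus every directory and file beneath it."""
    total = 1  # the node itself (assumption: the root is counted too)
    if isinstance(tree, dict):
        for child in tree.values():
            total += count_nodes(child)
    return total


project = {"README": b"hi", "src": {"main.py": b"print()", "util.py": b""}}
print(count_nodes(project))  # root + README + src + 2 files
```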

class swh.scanner.policy.Policy(source_tree: Directory, data: MerkleNodeInfo)

Bases: object

source_tree: Directory

Representation of a source code project directory in the Merkle tree.

data: MerkleNodeInfo

Information about the contents and directories of the Merkle tree.

abstract async run(client: Client)

Scan a source code project
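The shape of this base class (two attributes plus one abstract coroutine) can be sketched with the standard library alone. `SketchPolicy` and `MarkAllUnknown` are hypothetical names, the tree is reduced to a list of node identifiers, and `data` is a plain `dict`; none of this is the scanner's own code.

```python
import asyncio
from abc import ABC, abstractmethod


class SketchPolicy(ABC):
    """Hypothetical mirror of the documented base class."""

    def __init__(self, source_tree, data):
        self.source_tree = source_tree  # here: a list of node identifiers
        self.data = data                # here: a dict of per-node information

    @abstractmethod
    async def run(self, client):
        """Scan a source code project."""


class MarkAllUnknown(SketchPolicy):
    """Trivial subclass: records every node as unknown, ignoring the client."""

    async def run(self, client):
        for node_id in self.source_tree:
            self.data[node_id] = {"known": False}


data = {}
asyncio.run(MarkAllUnknown(["a", "b"], data).run(client=None))
```

Because `run` is abstract, instantiating `SketchPolicy` directly raises `TypeError`; each concrete policy supplies its own traversal.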

class swh.scanner.policy.LazyBFS(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Read the nodes of the Merkle tree in breadth-first (BFS) order. Only directories that are unknown are explored further; when a directory is known, all of its downstream contents are set to known.

async run(client: Client)

Scan a source code project
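The lazy breadth-first strategy described for this class can be sketched on a toy tree. Everything below is hypothetical: `CHILDREN` stands in for the Merkle tree, membership in `ARCHIVED` stands in for a Web API lookup, and nodes are looked up one at a time for simplicity, which may differ from how the real policy batches its requests.

```python
from collections import deque

# Hypothetical tree: directory name -> list of children. Leaves are files.
CHILDREN = {
    "root": ["docs", "src", "LICENSE"],
    "docs": ["index.md"],
    "src": ["main.py", "vendor"],
    "vendor": ["lib.py"],
}
# Hypothetical archive state, standing in for a Web API answer.
ARCHIVED = {"LICENSE", "docs", "index.md", "vendor", "lib.py"}


def descendants(node):
    """Yield every node below `node`."""
    for child in CHILDREN.get(node, []):
        yield child
        yield from descendants(child)


def lazy_bfs(root):
    known, lookups = {}, []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        lookups.append(node)              # one simulated archive lookup
        known[node] = node in ARCHIVED
        if node in CHILDREN:              # directories only
            if known[node]:
                # Known directory: everything below it is known as well.
                for sub in descendants(node):
                    known[sub] = True
            else:
                # Unknown directory: its children must be looked up.
                queue.extend(CHILDREN[node])
    return known, lookups


known, lookups = lazy_bfs("root")
print(len(known), len(lookups))  # 8 nodes classified with 6 lookups
```

The saving comes from known directories: `index.md` and `lib.py` are never looked up, because their parent directories are already known.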

class swh.scanner.policy.GreedyBFS(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Query graph nodes in chunks (to make full use of the Web API rate limit) and set the downstream contents of known directories to known.

async run(client: Client)

Scan a source code project

async get_nodes_chunks(client: Client, ssize: int)

Query chunks of QUERY_LIMIT nodes at once in order to fill the Web API rate limit. If the source code contains fewer than QUERY_LIMIT nodes, all of them are queried at once.
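The chunking idea can be sketched as follows. The value of `QUERY_LIMIT` used here, the `fake_known` coroutine, and `query_in_chunks` are all hypothetical; the sketch only shows how a node list is split into requests of at most `QUERY_LIMIT` identifiers.

```python
import asyncio

QUERY_LIMIT = 3  # hypothetical value, chosen small for the example


async def fake_known(ids):
    """Stand-in for a Web API call: even-numbered ids are 'known'."""
    return {i: i % 2 == 0 for i in ids}


async def query_in_chunks(node_ids, limit=QUERY_LIMIT):
    """Query `node_ids` in slices of at most `limit` identifiers."""
    results, calls = {}, 0
    for start in range(0, len(node_ids), limit):
        chunk = node_ids[start:start + limit]
        results.update(await fake_known(chunk))
        calls += 1
    return results, calls


results, calls = asyncio.run(query_in_chunks(list(range(7))))
print(calls)  # 7 ids with a limit of 3 -> 3 requests
```

When the list holds fewer than `limit` identifiers, the loop body runs once and a single request covers every node.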

class swh.scanner.policy.FilePriority(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Check the Merkle tree by querying all the file contents, and set all the upstream directories to unknown when a file content is unknown. Finally, check all the directories whose status is still unknown, and set all the sub-directories of known directories to known.

async run(client: Client)

Scan a source code project
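A toy version of the file-first strategy might look like this. `PARENT`, `FILES`, and `ARCHIVED` are hypothetical data, and processing the remaining directories from the top down is an assumption of this sketch rather than documented behaviour.

```python
# Hypothetical tree given as child -> parent links; "root" has no parent.
PARENT = {
    "docs": "root", "index.md": "docs",
    "src": "root", "main.py": "src",
    "assets": "root", "img": "assets", "logo.png": "img",
}
FILES = {"index.md", "main.py", "logo.png"}
NODES = set(PARENT) | set(PARENT.values())
# Hypothetical archive state, standing in for Web API answers.
ARCHIVED = {"index.md", "logo.png", "docs", "assets", "img"}


def ancestors(node):
    """Yield the parent, grandparent, ... of `node` up to the root."""
    while node in PARENT:
        node = PARENT[node]
        yield node


def depth(node):
    return sum(1 for _ in ancestors(node))


def file_priority():
    status, lookups = {}, 0
    # Phase 1: look up every file; an unknown file makes all of its
    # upstream directories unknown without any further lookup.
    for f in sorted(FILES):
        status[f] = f in ARCHIVED
        lookups += 1
        if not status[f]:
            for d in ancestors(f):
                status[d] = False
    # Phase 2: look up directories that are still undecided, top-down
    # (assumption), and propagate "known" to their sub-directories.
    dirs = sorted((n for n in NODES if n not in FILES), key=depth)
    for d in dirs:
        if d in status:
            continue
        status[d] = d in ARCHIVED
        lookups += 1
        if status[d]:
            for sub in dirs:
                if d in ancestors(sub):
                    status[sub] = True
    return status, lookups


status, lookups = file_priority()
print(lookups)  # 3 file lookups + 2 directory lookups for 8 nodes
```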

class swh.scanner.policy.DirectoryPriority(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Check the Merkle tree by querying all the directories that have at least one file content. If a directory is unknown, set all of its upstream directories to unknown; otherwise, set all of its downstream contents to known. Finally, check the status of empty directories and of all the remaining file contents.

async run(client: Client)

Scan a source code project

has_contents(directory: Directory)

Check whether the given directory has contents

get_contents(dir_: Directory)

Get all the contents of a given directory
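The directory-first strategy, together with the two helpers, can be sketched on a toy tree. All data below is hypothetical, the helpers only share their names with the documented methods, and treating "has contents" as "has at least one direct file child" is an assumption of this sketch.

```python
# Hypothetical tree: directory name -> list of children. Leaves are files.
CHILDREN = {
    "root": ["README", "docs", "src", "empty"],
    "docs": ["index.md"],
    "src": ["main.py", "util.py"],
    "empty": [],
}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}
# Hypothetical archive state, standing in for Web API answers.
ARCHIVED = {"docs", "index.md", "util.py", "README", "empty"}


def get_contents(dir_):
    """Direct file children of a directory (assumed meaning)."""
    return [c for c in CHILDREN[dir_] if c not in CHILDREN]


def has_contents(directory):
    return bool(get_contents(directory))


def descendants(node):
    for child in CHILDREN.get(node, []):
        yield child
        yield from descendants(child)


def ancestors(node):
    while node in PARENT:
        node = PARENT[node]
        yield node


def directory_priority():
    status, lookups = {}, 0

    def lookup(node):
        nonlocal lookups
        lookups += 1
        status[node] = node in ARCHIVED
        return status[node]

    # Phase 1: directories holding at least one file.
    for d in CHILDREN:
        if d in status or not has_contents(d):
            continue
        if lookup(d):
            for sub in descendants(d):   # known: everything below is known
                status[sub] = True
        else:
            for up in ancestors(d):      # unknown: everything above is unknown
                status[up] = False
    # Phase 2: empty directories, then files that are still undecided.
    for d in CHILDREN:
        if d not in status and not has_contents(d):
            lookup(d)
    for d in CHILDREN:
        for f in get_contents(d):
            if f not in status:
                lookup(f)
    return status, lookups


status, lookups = directory_priority()
```

In this example only `index.md` is resolved without a lookup; the benefit grows with the number of files under known directories.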

class swh.scanner.policy.WebAPIConnection(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory], client: Client)

Bases: ArchiveDiscoveryInterface

Use the web APIs to query the archive

async content_missing(contents: List[bytes]) → List[bytes]

List content missing from the archive by sha1

async skipped_content_missing(skipped_contents: List[bytes]) → Iterable[bytes]

List skipped content missing from the archive by sha1

async directory_missing(directories: List[bytes]) → Iterable[bytes]

List directories missing from the archive by sha1

class swh.scanner.policy.RandomDirSamplingPriority(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Check the Merkle tree by querying random directories. For each unknown directory, set all of its ancestors to unknown; otherwise, set all of its descendants to known. Finally, check all the remaining file contents.

async run(client: Client)

Scan a source code project
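Random directory sampling can be sketched as below. The tree, the archive state, and the use of a seeded `random.Random` are hypothetical choices for a reproducible illustration; the real policy may select and batch directories differently.

```python
import random

# Hypothetical tree: directory name -> list of children. Leaves are files.
CHILDREN = {
    "root": ["docs", "src", "README"],
    "docs": ["api", "index.md"],
    "api": ["ref.md"],
    "src": ["main.py"],
}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}
# Hypothetical archive state; a known directory implies known descendants.
ARCHIVED = {"docs", "api", "index.md", "ref.md", "README"}


def descendants(node):
    for child in CHILDREN.get(node, []):
        yield child
        yield from descendants(child)


def ancestors(node):
    while node in PARENT:
        node = PARENT[node]
        yield node


def random_dir_sampling(seed=0):
    rng = random.Random(seed)
    status, lookups = {}, 0
    pending = sorted(CHILDREN)            # directories not yet decided
    while pending:
        d = rng.choice(pending)
        status[d] = d in ARCHIVED         # one simulated archive lookup
        lookups += 1
        if status[d]:
            for sub in descendants(d):    # known: all descendants known
                status[sub] = True
        else:
            for up in ancestors(d):       # unknown: all ancestors unknown
                status[up] = False
        pending = [x for x in pending if x not in status]
    # Finally, look up the file contents that are still undecided.
    for f in PARENT:
        if f not in CHILDREN and f not in status:
            status[f] = f in ARCHIVED
            lookups += 1
    return status, lookups


status, lookups = random_dir_sampling(seed=0)
```

The number of lookups depends on which directories happen to be drawn first, but the final status of every node is the same for any seed.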

class swh.scanner.policy.QueryAll(source_tree: Directory, data: MerkleNodeInfo)

Bases: Policy

Check the status of every node in the Merkle tree.

async run(client: Client)

Scan a source code project