swh.scanner.scanner module

async swh.scanner.scanner.pids_discovery(pids: List[str], session: aiohttp.client.ClientSession, api_url: str) → Dict[str, Dict[str, bool]]

Send an API request to look up the given persistent identifiers.

Parameters
  • pids – a list of persistent identifiers

  • api_url – url for the API request

Returns

A dictionary keyed by each persistent identifier searched. Each value is a dictionary in which value['known'] is True if the pid is found and False if it is not.

Return type

Dict[str, Dict[str, bool]]
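The shape of the returned dictionary can be illustrated without a network call. The sketch below uses a hand-written sample in place of a real response. The helper `split_by_known` and the placeholder identifiers are hypothetical and are not part of swh.scanner.

```python
from typing import Dict, List, Tuple


def split_by_known(result: Dict[str, Dict[str, bool]]) -> Tuple[List[str], List[str]]:
    """Partition persistent identifiers into (known, unknown) lists."""
    known = [pid for pid, info in result.items() if info["known"]]
    unknown = [pid for pid, info in result.items() if not info["known"]]
    return known, unknown


# Hand-written sample in the documented shape; the hashes are placeholders.
sample = {
    "swh:1:cnt:" + "a" * 40: {"known": True},
    "swh:1:dir:" + "b" * 40: {"known": False},
}
known, unknown = split_by_known(sample)
```

Here `known` holds the identifiers reported as present in the archive and `unknown` holds the rest.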

swh.scanner.scanner.directory_filter(path_name: Union[str, bytes], exclude_patterns: Set[Any]) → bool

Check whether path_name matches any of the given exclude patterns.

It is also used as the dir_filter function when generating the directory object from swh.model.from_disk.

Returns

False if the directory has to be ignored, True otherwise
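A minimal sketch of this kind of filter follows. It assumes the exclude patterns are compiled regular expressions built from glob patterns with `fnmatch.translate`, which this page does not confirm. The name `sketch_directory_filter` and the sample patterns are hypothetical.

```python
import fnmatch
import os
import re
from typing import Any, Set, Union


def sketch_directory_filter(path_name: Union[str, bytes], exclude_patterns: Set[Any]) -> bool:
    """Return False if path_name matches any exclude pattern, True otherwise."""
    # Accept both str and bytes paths, as the documented signature does.
    path = os.fsdecode(path_name)
    for pattern in exclude_patterns:
        if pattern.match(path):
            return False
    return True


# Assumption: patterns are glob expressions compiled to regular expressions.
patterns = {re.compile(fnmatch.translate(p)) for p in ("*/.git", "*/node_modules")}

keep_src = sketch_directory_filter("/project/src", patterns)
keep_git = sketch_directory_filter("/project/.git", patterns)
```

With these patterns, `keep_src` is True and `keep_git` is False, so a directory walk would skip `.git` and descend into `src`.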

swh.scanner.scanner.get_subpaths(path: pathlib.PosixPath, exclude_patterns: Set[Any]) → Iterator[Tuple[pathlib.PosixPath, str]]

Find the persistent identifier of the directories and files under a given path.

Parameters

path – the root path

Yields

pairs of (path, persistent identifier of that path)
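The sketch below illustrates the idea of yielding (path, identifier) pairs. It is not the module's implementation, which relies on swh.model and also covers directories. This version handles regular files only and assumes a content identifier is the git-style blob SHA-1 prefixed with `swh:1:cnt:`. The names `content_pid` and `sketch_file_subpaths` are hypothetical.

```python
import hashlib
import os
from pathlib import Path
from typing import Iterator, Tuple


def content_pid(data: bytes) -> str:
    """Identifier for file content: SHA-1 over a git-style blob header plus the data."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def sketch_file_subpaths(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (path, identifier) pairs for regular files under root (files only)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            file_path = Path(dirpath) / name
            yield file_path, content_pid(file_path.read_bytes())
```

Because the generator is lazy, files are read and hashed only as the caller iterates over the pairs.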

async swh.scanner.scanner.parse_path(path: pathlib.PosixPath, session: aiohttp.client.ClientSession, api_url: str, exclude_patterns: Set[Any]) → Iterator[Tuple[str, str, bool]]

Check whether the subpaths of the given path are present in the archive.

Parameters
  • path – the source path

  • api_url – url for the API request

Returns

a map containing tuples, each with a subpath of the given path, the pid of that subpath, and the result of the API call

Return type

Iterator[Tuple[str, str, bool]]
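The data flow described here (collect path and pid pairs, look the pids up, then join the results) can be sketched as follows. This is an illustration under assumptions, not the module's code: `fake_discovery` is a hypothetical stand-in for the API call, and the paths and pids are made-up placeholders.

```python
import asyncio
from typing import Dict, Iterator, List, Tuple


async def fake_discovery(pids: List[str]) -> Dict[str, Dict[str, bool]]:
    """Hypothetical stand-in for the API call: marks pids ending in '1' as known."""
    return {pid: {"known": pid.endswith("1")} for pid in pids}


async def sketch_parse_path(subpaths: Dict[str, str]) -> Iterator[Tuple[str, str, bool]]:
    """Join (path, pid) pairs with lookup results into (path, pid, known) tuples."""
    found = await fake_discovery(list(subpaths.values()))
    return ((path, pid, found[pid]["known"]) for path, pid in subpaths.items())


# Placeholder input: path -> pid, as a subpath walk might produce.
subpaths = {"src/a.py": "pid-1", "src/b.py": "pid-2"}
results = list(asyncio.run(sketch_parse_path(subpaths)))
```

Sending all pids in one lookup keeps the number of requests independent of how many subpaths were found.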

async swh.scanner.scanner.run(root: pathlib.PosixPath, api_url: str, source_tree: swh.scanner.model.Tree, exclude_patterns: Set[Any]) → None

Start scanning from the given root.

It fills the source tree with the paths discovered.

Parameters
  • root – the root path to scan

  • api_url – url for the API request