swh.indexer.storage.api.client module

class swh.indexer.storage.api.client.RemoteStorage(url, api_exception=None, timeout=None, chunk_size=4096, reraise_exceptions=None, **kwargs)

Bases: RPCClient

Proxy to a remote storage API

backend_class

alias of IndexerStorageInterface

api_exception

alias of IndexerStorageAPIError

reraise_exceptions: ClassVar[List[Type[Exception]]] = [<class 'swh.indexer.storage.exc.IndexerStorageArgumentException'>, <class 'swh.indexer.storage.exc.DuplicateId'>]

On server errors, if any of the exception classes in this list has the same name as the error name, then the exception will be instantiated and raised instead of a generic RemoteException.
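
The name-matching rule described above can be illustrated with a small self-contained sketch. It is not the actual RPCClient code: RemoteException and DuplicateId below are local stand-ins for the real classes, and raise_from_server_error is a hypothetical helper introduced only for this example.

```python
from typing import List, Type


class RemoteException(Exception):
    """Stand-in for the generic exception raised on server errors."""


class DuplicateId(Exception):
    """Stand-in for swh.indexer.storage.exc.DuplicateId."""


# Exception classes the client is willing to re-create locally.
reraise_exceptions: List[Type[Exception]] = [DuplicateId]


def raise_from_server_error(error_name: str, args: tuple) -> None:
    """Raise the listed exception class whose name matches ``error_name``.

    Hypothetical helper: falls back to RemoteException when no name matches.
    """
    for exc_class in reraise_exceptions:
        if exc_class.__name__ == error_name:
            raise exc_class(*args)
    raise RemoteException(error_name, *args)
```

With this arrangement, callers can catch DuplicateId directly instead of inspecting a generic remote error.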

extra_type_decoders: Dict[str, Callable] = {'idx_model': <function <lambda>>}

Value of extra_decoders passed to json_loads or msgpack_loads to be able to deserialize more object types.

extra_type_encoders: List[Tuple[type, str, Callable]] = [(<class 'swh.indexer.storage.model.BaseRow'>, 'idx_model', <function _encode_model_object>)]

Value of extra_encoders passed to json_dumps or msgpack_dumps to be able to serialize more object types.
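
To show how an encoder list and a decoder mapping of these shapes can cooperate, here is a minimal sketch built on the standard json module. The Row class, the dumps/loads helpers, and the "swhtype"/"d" wrapper keys are assumptions made for this example, not the library's actual serializers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a BaseRow subclass."""

    id: str
    mimetype: str


# Same shape as extra_type_encoders: (type, tag, encoder function).
extra_encoders = [(Row, "idx_model", lambda row: asdict(row))]
# Same shape as extra_type_decoders: tag -> decoder function.
extra_decoders = {"idx_model": lambda d: Row(**d)}


def dumps(obj) -> str:
    def default(o):
        for type_, tag, encoder in extra_encoders:
            if isinstance(o, type_):
                # Wrap the encoded value with its tag so a decoder can be found.
                return {"swhtype": tag, "d": encoder(o)}
        raise TypeError(f"cannot serialize {o!r}")

    return json.dumps(obj, default=default)


def loads(data: str):
    def object_hook(d):
        # Untagged dicts are returned unchanged.
        decoder = extra_decoders.get(d.get("swhtype"))
        return decoder(d["d"]) if decoder else d

    return json.loads(data, object_hook=object_hook)
```

A round trip such as loads(dumps([Row("a1", "text/plain")])) gives back equal Row objects, which is the property the extra encoders and decoders exist to provide.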

check_config(*, check_write)

Check that the storage is configured and ready to go.

content_fossology_license_add(licenses: List[ContentLicenseRow]) → Dict[str, int]

Add licenses not present in storage.

Parameters:

licenses – license rows to be added, with their tool attribute set to None.

Returns:

Dict summary of number of rows added

content_fossology_license_get(ids: Iterable[bytes]) → List[ContentLicenseRow]

Retrieve licenses per id.

Parameters:

ids – sha1 identifiers

Returns:

license rows; possibly more than one per (sha1, tool_id) if there are multiple licenses.

content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) → PagedResult[bytes, str]

Retrieve licenses within the partition partition_id bound by limit.

Parameters:
  • indexer_configuration_id – the tool used to index data

  • partition_id – index of the partition to fetch

  • nb_partitions – total number of partitions to split into

  • page_token – opaque token used for pagination

  • limit – maximum number of results to return (defaults to 1000)

Raises:

IndexerStorageArgumentException – raised when:

  • limit is None

  • a wrong indexer_type is provided

Returns:

PagedResult of Sha1. If next_page_token is None, there is no more data to fetch
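
The pagination contract above (fetch a page, then continue until next_page_token is None) can be sketched as a loop. The sketch assumes a result object exposing results and next_page_token. PagedResult and FakeStorage below are local stand-ins so the example runs without a server, and iter_partition is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class PagedResult:
    """Stand-in with the two fields the loop relies on."""

    results: List[bytes]
    next_page_token: Optional[str]


class FakeStorage:
    """Hypothetical in-memory stand-in for the remote client."""

    def __init__(self, ids: List[bytes]):
        self._ids = sorted(ids)

    def content_fossology_license_get_partition(
        self, indexer_configuration_id, partition_id, nb_partitions,
        page_token=None, limit=1000,
    ) -> PagedResult:
        # Simplification: partitions are ignored and the token is a list offset.
        start = int(page_token) if page_token else 0
        end = start + limit
        token = str(end) if end < len(self._ids) else None
        return PagedResult(self._ids[start:end], token)


def iter_partition(storage, tool_id: int, partition_id: int,
                   nb_partitions: int, limit: int = 1000) -> Iterator[bytes]:
    """Yield every sha1 of one partition, following next_page_token."""
    page_token = None
    while True:
        page = storage.content_fossology_license_get_partition(
            tool_id, partition_id, nb_partitions,
            page_token=page_token, limit=limit,
        )
        yield from page.results
        if page.next_page_token is None:
            break
        page_token = page.next_page_token
```

The token is treated as opaque by iter_partition: it is only passed back unchanged, which keeps the loop valid whatever encoding the server uses.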

content_metadata_add(metadata: List[ContentMetadataRow]) → Dict[str, int]

Add metadata not present in storage.

Parameters:

metadata (iterable) –

dictionaries with keys:

  • id: sha1

  • metadata: arbitrary dict

Returns:

Dict summary of number of rows added

content_metadata_get(ids: Iterable[bytes]) → List[ContentMetadataRow]

Retrieve metadata per id.

Parameters:

ids (iterable) – sha1 checksums

Returns:

dictionaries with the following keys:

  • id (bytes)

  • metadata (str): associated metadata

  • tool (dict): tool used to compute metadata

content_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]

List metadata missing from storage.

Parameters:

metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier

  • indexer_configuration_id (int): tool used to compute the results

Returns:

missing sha1s

content_mimetype_add(mimetypes: List[ContentMimetypeRow]) → Dict[str, int]

Add mimetypes not present in storage.

Parameters:
  • mimetypes – mimetype rows to be added, with their tool attribute set to None.

  • overwrite (True) –

  • default)

Returns:

Dict summary of number of rows added

content_mimetype_get(ids: Iterable[bytes]) → List[ContentMimetypeRow]

Retrieve full content mimetypes per id.

Parameters:

ids – sha1 identifiers

Returns:

mimetype row objects

content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) → PagedResult[bytes, str]

Retrieve mimetypes within partition partition_id bound by limit.

Parameters:
  • indexer_configuration_id – the tool used to index data

  • partition_id – index of the partition to fetch

  • nb_partitions – total number of partitions to split into

  • page_token – opaque token used for pagination

  • limit – maximum number of results to return (defaults to 1000)

Raises:

IndexerStorageArgumentException – raised when:

  • limit is None

  • a wrong indexer_type is provided

Returns:

PagedResult of Sha1. If next_page_token is None, there is no more data to fetch

content_mimetype_missing(mimetypes: Iterable[Dict]) → List[Tuple[bytes, int]]

List mimetypes missing from storage.

Parameters:

mimetypes (iterable) –

iterable of dict with keys:

  • id (bytes): sha1 identifier

  • indexer_configuration_id (int): tool used to compute the results

Returns:

list of tuple (id, indexer_configuration_id) missing
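
A typical use of a *_missing method is to skip content that a given tool has already indexed. The sketch below builds the query dictionaries with the two documented keys and keeps only the ids reported missing. FakeStorage is a hypothetical in-memory stand-in for the client, and ids_to_index is a helper name chosen for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple


class FakeStorage:
    """Hypothetical stand-in that remembers which (id, tool) pairs are indexed."""

    def __init__(self, known: Set[Tuple[bytes, int]]):
        self._known = known

    def content_mimetype_missing(
        self, mimetypes: Iterable[Dict]
    ) -> List[Tuple[bytes, int]]:
        return [
            (m["id"], m["indexer_configuration_id"])
            for m in mimetypes
            if (m["id"], m["indexer_configuration_id"]) not in self._known
        ]


def ids_to_index(storage, sha1s: Iterable[bytes], tool_id: int) -> List[bytes]:
    """Return the sha1s that have no mimetype yet for the given tool."""
    query = [{"id": sha1, "indexer_configuration_id": tool_id} for sha1 in sha1s]
    return [sha1 for sha1, _tool in storage.content_mimetype_missing(query)]
```

Because the tool id is part of each query entry, the same sha1 can be reported missing for one tool and present for another.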

directory_intrinsic_metadata_add(metadata: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int]

Add metadata not present in storage.

Parameters:

metadata – DirectoryIntrinsicMetadataRow objects

Returns:

Dict summary of number of rows added

directory_intrinsic_metadata_get(ids: Iterable[bytes]) → List[DirectoryIntrinsicMetadataRow]

Retrieve directory metadata per id.

Parameters:

ids (iterable) – sha1 checksums

Returns:

DirectoryIntrinsicMetadataRow objects

directory_intrinsic_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]

List metadata missing from storage.

Parameters:

metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1_git directory identifier

  • indexer_configuration_id (int): tool used to compute the results

Returns:

missing ids

indexer_configuration_add(tools)

Add new tools to the storage.

Parameters:

tools ([dict]) –

List of dictionaries, each representing a tool to insert in the db, with the following keys:

  • tool_name (str): tool’s name

  • tool_version (str): tool’s version

  • tool_configuration (dict): tool’s configuration (free form dict)

Returns:

List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
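
Because the returned list is not guaranteed to follow the input order, a caller that needs the ids in request order has to match entries on their identifying keys. The sketch below shows one way to do that. tool_key and ids_in_request_order are hypothetical helpers, and serializing the configuration with sorted keys is an assumption about how to compare free-form dicts.

```python
import json
from typing import Dict, List, Tuple


def tool_key(tool: Dict) -> Tuple[str, str, str]:
    """Build a hashable key from the three identifying fields of a tool."""
    return (
        tool["tool_name"],
        tool["tool_version"],
        # Serialize with sorted keys so equal configurations compare equal.
        json.dumps(tool["tool_configuration"], sort_keys=True),
    )


def ids_in_request_order(requested: List[Dict], inserted: List[Dict]) -> List[int]:
    """Map each requested tool to the id found in the returned list."""
    id_by_key = {tool_key(tool): tool["id"] for tool in inserted}
    return [id_by_key[tool_key(tool)] for tool in requested]
```

Here requested is the list passed to indexer_configuration_add and inserted is the list it returned. The lookup raises KeyError if a requested tool is absent from the result.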

indexer_configuration_get(tool)

Retrieve tool information.

Parameters:

tool (dict) –

Dictionary representing a tool with the following keys:

  • tool_name (str): tool’s name

  • tool_version (str): tool’s version

  • tool_configuration (dict): tool’s configuration (free form dict)

Returns:

The same dictionary with an added id key if the tool is found, None otherwise.

origin_extrinsic_metadata_add(metadata: List[OriginExtrinsicMetadataRow]) → Dict[str, int]

Add origin metadata not present in storage.

Parameters:

metadata – list of OriginExtrinsicMetadataRow objects

Returns:

Dict summary of number of rows added

origin_extrinsic_metadata_get(urls: Iterable[str]) → List[OriginExtrinsicMetadataRow]

Retrieve origin metadata per origin URL.

Parameters:

urls (iterable) – origin URLs

Returns:

list of OriginExtrinsicMetadataRow

origin_intrinsic_metadata_add(metadata: List[OriginIntrinsicMetadataRow]) → Dict[str, int]

Add origin metadata not present in storage.

Parameters:

metadata – list of OriginIntrinsicMetadataRow objects

Returns:

Dict summary of number of rows added

origin_intrinsic_metadata_get(urls: Iterable[str]) → List[OriginIntrinsicMetadataRow]

Retrieve origin metadata per origin URL.

Parameters:

urls (iterable) – origin URLs

Returns:

list of OriginIntrinsicMetadataRow

origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: List[str] | None = None, tool_ids: List[int] | None = None) → PagedResult[str | OriginIntrinsicMetadataRow, str]

Returns the list of origins whose intrinsic metadata were generated by the given mappings or tools.

Parameters:
  • page_token (str) – Opaque token used for pagination.

  • limit (int) – The maximum number of results to return

  • ids_only (bool) – Determines whether only origin urls are returned or the content as well

  • mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.

Returns:

PagedResult of origin URLs (if ids_only is true) or of OriginIntrinsicMetadataRow objects

origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) → List[OriginIntrinsicMetadataRow]

Returns the list of origins whose metadata contain all the terms.

Parameters:
  • conjunction – List of terms to be searched for.

  • limit – The maximum number of results to return

Returns:

list of OriginIntrinsicMetadataRow

origin_intrinsic_metadata_stats()

Returns counts of indexed metadata per origin, broken down by metadata type.

Returns:

dictionary with keys:

  • total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)

  • non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from

  • per_mapping (dict): a dictionary with mapping names as keys and, as values, the number of origins whose indexing used that mapping. Note that indexing a given origin may use 0, 1, or many mappings.

Return type:

dict
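
The shape of the returned dictionary lends itself to simple ratio computations. The sketch below is one possible way to summarize it. summarize_stats is a hypothetical helper, and the numbers and mapping names in example are illustrative values only, not real measurements.

```python
from typing import Dict


def summarize_stats(stats: Dict) -> Dict[str, float]:
    """Compute the share of non-empty results and the share per mapping."""
    total = stats["total"]
    if total == 0:
        # Avoid dividing by zero when nothing has been indexed yet.
        return {"non_empty_ratio": 0.0}
    summary = {"non_empty_ratio": stats["non_empty"] / total}
    for mapping, count in stats["per_mapping"].items():
        summary[f"mapping:{mapping}"] = count / total
    return summary


# Illustrative values only; the dictionary has the three documented keys.
example = {"total": 4, "non_empty": 3, "per_mapping": {"npm": 2, "codemeta": 1}}
```

The per-mapping shares need not sum to 1, since indexing one origin may use zero, one, or several mappings.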