swh.indexer.storage.interface module
- class swh.indexer.storage.interface.IndexerStorageInterface(*args, **kwargs)
Bases: Protocol
- content_mimetype_missing(mimetypes: Iterable[Dict]) -> List[Tuple[bytes, int]]
Generate mimetypes missing from storage.
- Parameters:
mimetypes (iterable) –
iterable of dict with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns:
list of (id, indexer_configuration_id) tuples missing from storage
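The missing check can be pictured as a set difference over (id, indexer_configuration_id) pairs. The following is a minimal in-memory sketch, not the library's implementation: the `find_missing` helper and the `known` set are hypothetical stand-ins for a real storage backend.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_missing(
    mimetypes: Iterable[Dict], known: Set[Tuple[bytes, int]]
) -> List[Tuple[bytes, int]]:
    """Return the (id, indexer_configuration_id) pairs absent from `known`.

    `known` is a hypothetical stand-in for what a backend already holds.
    """
    missing = []
    for entry in mimetypes:
        key = (entry["id"], entry["indexer_configuration_id"])
        if key not in known:
            missing.append(key)
    return missing


known = {(b"\x01" * 20, 1)}
queries = [
    {"id": b"\x01" * 20, "indexer_configuration_id": 1},  # already indexed
    {"id": b"\x02" * 20, "indexer_configuration_id": 1},  # not indexed yet
]
result = find_missing(queries, known)
```

Only the second query comes back, since its (id, tool) pair is not in `known`.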
- content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) -> PagedResult[bytes, str]
Retrieve mimetypes within partition partition_id bound by limit.
- Parameters:
indexer_configuration_id – the tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results (defaults to 1000)
- Raises:
IndexerStorageArgumentException – if limit is None or a wrong indexer_type is provided
- Returns:
PagedResult of sha1 identifiers. If next_page_token is None, there is no more data to fetch.
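A caller typically drains a partition by passing each next_page_token back in until it is None. The sketch below illustrates that loop against a hypothetical in-memory backend: `Page`, `FakeStorage`, and `fetch_partition` are made up for illustration, and the way ids are assigned to partitions here (by first byte) is an assumption, not the library's scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for PagedResult: results plus an opaque token."""
    results: List[bytes]
    next_page_token: Optional[str]


class FakeStorage:
    """Hypothetical in-memory backend holding sorted ids for one tool."""

    def __init__(self, sha1s: List[bytes]):
        self._sha1s = sorted(sha1s)

    def content_mimetype_get_partition(
        self, indexer_configuration_id: int, partition_id: int,
        nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000,
    ) -> Page:
        # Assumption: a partition is a contiguous slice of the first-byte space.
        in_part = [
            s for s in self._sha1s if s[0] * nb_partitions // 256 == partition_id
        ]
        start = int(page_token) if page_token is not None else 0
        chunk = in_part[start:start + limit]
        more = start + limit < len(in_part)
        return Page(chunk, str(start + limit) if more else None)


def fetch_partition(storage, tool_id, partition_id, nb_partitions, limit=1000):
    """Collect every id in a partition by following next_page_token."""
    ids, token = [], None
    while True:
        page = storage.content_mimetype_get_partition(
            tool_id, partition_id, nb_partitions, page_token=token, limit=limit
        )
        ids.extend(page.results)
        token = page.next_page_token
        if token is None:  # no more data to fetch
            return ids


storage = FakeStorage([bytes([i]) * 20 for i in range(0, 256, 16)])
first_half = fetch_partition(storage, tool_id=1, partition_id=0, nb_partitions=2, limit=3)
second_half = fetch_partition(storage, tool_id=1, partition_id=1, nb_partitions=2, limit=3)
```

The token is treated as opaque by `fetch_partition`; only the backend interprets it.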
- content_mimetype_add(mimetypes: List[ContentMimetypeRow]) -> Dict[str, int]
Add mimetypes not present in storage.
- Parameters:
mimetypes – mimetype rows to be added, with their tool attribute set to None.
- Returns:
Dict summary of number of rows added
- content_mimetype_get(ids: Iterable[bytes]) -> List[ContentMimetypeRow]
Retrieve full content mimetype per ids.
- Parameters:
ids – sha1 identifiers
- Returns:
mimetype row objects
- content_fossology_license_get(ids: Iterable[bytes]) -> List[ContentLicenseRow]
Retrieve licenses per id.
- Parameters:
ids – sha1 identifiers
- Returns:
license rows; possibly more than one per (sha1, tool_id) if there are multiple licenses.
- content_fossology_license_add(licenses: List[ContentLicenseRow]) -> Dict[str, int]
Add licenses not present in storage.
- Parameters:
licenses – license rows to be added, with their tool attribute set to None.
- Returns:
Dict summary of number of rows added
- content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) -> PagedResult[bytes, str]
Retrieve licenses within the partition partition_id bound by limit.
- Parameters:
indexer_configuration_id – the tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results (defaults to 1000)
- Raises:
IndexerStorageArgumentException – if limit is None or a wrong indexer_type is provided
- Returns:
PagedResult of sha1 identifiers. If next_page_token is None, there is no more data to fetch.
- content_metadata_missing(metadata: Iterable[Dict]) -> List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters:
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns:
missing sha1s
- content_metadata_get(ids: Iterable[bytes]) -> List[ContentMetadataRow]
Retrieve metadata per id.
- Parameters:
ids (iterable) – sha1 checksums
- Returns:
dictionaries with the following keys:
id (bytes)
metadata (str): associated metadata
tool (dict): tool used to compute metadata
- content_metadata_add(metadata: List[ContentMetadataRow]) -> Dict[str, int]
Add metadata not present in storage.
- Parameters:
metadata (iterable) –
dictionaries with keys:
id: sha1
metadata: arbitrary dict
- Returns:
Dict summary of number of rows added
- directory_intrinsic_metadata_missing(metadata: Iterable[Dict]) -> List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters:
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1_git directory identifier
indexer_configuration_id (int): tool used to compute the results
- Returns:
missing ids
- directory_intrinsic_metadata_get(ids: Iterable[bytes]) -> List[DirectoryIntrinsicMetadataRow]
Retrieve directory metadata per id.
- Parameters:
ids (iterable) – sha1 checksums
- Returns:
DirectoryIntrinsicMetadataRow objects
- directory_intrinsic_metadata_add(metadata: List[DirectoryIntrinsicMetadataRow]) -> Dict[str, int]
Add metadata not present in storage.
- Parameters:
metadata – DirectoryIntrinsicMetadataRow objects
- Returns:
Dict summary of number of rows added
- origin_intrinsic_metadata_get(urls: Iterable[str]) -> List[OriginIntrinsicMetadataRow]
Retrieve origin metadata per origin URL.
- Parameters:
urls (iterable) – origin URLs
- Returns:
list of OriginIntrinsicMetadataRow
- origin_intrinsic_metadata_add(metadata: List[OriginIntrinsicMetadataRow]) -> Dict[str, int]
Add origin metadata not present in storage.
- Parameters:
metadata – list of OriginIntrinsicMetadataRow objects
- Returns:
Dict summary of number of rows added
- origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) -> List[OriginIntrinsicMetadataRow]
Returns the list of origins whose metadata contain all the terms.
- Parameters:
conjunction – List of terms to be searched for.
limit – The maximum number of results to return
- Returns:
list of OriginIntrinsicMetadataRow
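The "conjunction" semantics mean every term must match for an origin to be returned. Here is a minimal sketch of that rule over a hypothetical in-memory index. The `search_fulltext` helper, the example URLs, and the case-insensitive substring matching are all assumptions for illustration; a real backend may tokenize and rank results differently.

```python
from typing import Dict, List


def search_fulltext(
    metadata_by_origin: Dict[str, str], conjunction: List[str], limit: int = 100
) -> List[str]:
    """Return origin URLs whose metadata text contains every term.

    Matching is a plain case-insensitive substring test (an assumption).
    """
    matches = []
    for url, text in metadata_by_origin.items():
        lowered = text.lower()
        if all(term.lower() in lowered for term in conjunction):
            matches.append(url)
    return matches[:limit]  # keep at most `limit` results


index = {
    "https://example.org/a": "Python indexer for archive metadata",
    "https://example.org/b": "Rust archive tool",
}
hits = search_fulltext(index, ["archive", "python"])
```

Both origins mention "archive", but only the first also mentions "python", so it is the only hit.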
- origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: List[str] | None = None, tool_ids: List[int] | None = None) -> PagedResult[str | OriginIntrinsicMetadataRow, str]
Returns a page of origins, filtered by the mappings and tools used to produce their intrinsic metadata.
- Parameters:
page_token (str) – Opaque token used for pagination.
limit (int) – The maximum number of results to return
ids_only (bool) – Determines whether only origin urls are returned or the content as well
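mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.
tool_ids (List[int]) – Returns origins whose intrinsic metadata were generated using at least one of these tools.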
- Returns:
PagedResult of origin URLs (if ids_only is True) or OriginIntrinsicMetadataRow objects
- origin_intrinsic_metadata_stats()
Returns counts of indexed metadata per origin, broken down into metadata types.
- Returns:
dictionary with keys:
total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)
non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from
per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings.
- Return type:
dict
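To make the shape of the returned dictionary concrete, here is a small sketch that aggregates hypothetical per-origin rows into the total, non_empty, and per_mapping keys described above. The `compute_stats` helper and the row layout (a `metadata` dict and a `mappings` list) are assumptions for illustration, not the storage's internal representation.

```python
from collections import Counter
from typing import Any, Dict, List


def compute_stats(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate hypothetical per-origin rows into the documented shape."""
    per_mapping: Counter = Counter()
    non_empty = 0
    for row in rows:
        if row["metadata"]:  # an empty dict still counts toward `total`
            non_empty += 1
        # An origin may use 0, 1, or many mappings; count it once per mapping.
        per_mapping.update(set(row["mappings"]))
    return {
        "total": len(rows),
        "non_empty": non_empty,
        "per_mapping": dict(per_mapping),
    }


rows = [
    {"metadata": {"name": "foo"}, "mappings": ["npm", "codemeta"]},
    {"metadata": {}, "mappings": []},
    {"metadata": {"name": "bar"}, "mappings": ["npm"]},
]
stats = compute_stats(rows)
```

Because one origin can use several mappings, the per_mapping counts need not sum to total.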
- origin_extrinsic_metadata_get(urls: Iterable[str]) -> List[OriginExtrinsicMetadataRow]
Retrieve origin metadata per origin URL.
- Parameters:
urls (iterable) – origin URLs
- Returns:
list of OriginExtrinsicMetadataRow
- origin_extrinsic_metadata_add(metadata: List[OriginExtrinsicMetadataRow]) -> Dict[str, int]
Add origin metadata not present in storage.
- Parameters:
metadata – list of OriginExtrinsicMetadataRow objects
- Returns:
Dict summary of number of rows added
- indexer_configuration_add(tools)
Add new tools to the storage.
- Parameters:
tools ([dict]) –
List of dictionaries, each representing a tool to insert in the db, with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns:
List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
- indexer_configuration_get(tool)
Retrieve tool information.
- Parameters:
tool (dict) –
Dictionary representing a tool with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns:
The same dictionary with an id key if the tool is found, None otherwise.
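The add/get pair for tool configurations can be illustrated with a small in-memory registry. This is a sketch under stated assumptions, not the storage's implementation: the `ToolRegistry` class is hypothetical, ids are assumed to be assigned sequentially from 1, and re-adding a known tool is assumed to return the existing entry.

```python
from typing import Any, Dict, List, Optional


class ToolRegistry:
    """Hypothetical in-memory stand-in for the tool configuration table."""

    _KEYS = ("tool_name", "tool_version", "tool_configuration")

    def __init__(self) -> None:
        self._tools: List[Dict[str, Any]] = []

    def indexer_configuration_get(
        self, tool: Dict[str, Any]
    ) -> Optional[Dict[str, Any]]:
        """Return the matching tool with its id, or None if unknown."""
        for known in self._tools:
            if all(known[k] == tool[k] for k in self._KEYS):
                return dict(known)
        return None

    def indexer_configuration_add(
        self, tools: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """Register tools and return them with their id key set."""
        inserted = []
        for tool in tools:
            existing = self.indexer_configuration_get(tool)
            if existing is None:
                # Assumption: ids are assigned sequentially starting at 1.
                existing = {**tool, "id": len(self._tools) + 1}
                self._tools.append(existing)
            inserted.append(dict(existing))
        return inserted


registry = ToolRegistry()
tool = {
    "tool_name": "file",
    "tool_version": "5.41",
    "tool_configuration": {"command": "file --mime"},
}
added = registry.indexer_configuration_add([tool])
found = registry.indexer_configuration_get(tool)
absent = registry.indexer_configuration_get({**tool, "tool_version": "0.0"})
```

A tool is identified by the full (name, version, configuration) triple, so changing only the version yields None here.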