swh.indexer.storage.interface module
-
class swh.indexer.storage.interface.IndexerStorageInterface(*args, **kwds)
Bases: typing_extensions.Protocol
-
content_mimetype_missing(mimetypes: Iterable[Dict]) → List[Tuple[bytes, int]]
Generate mimetypes missing from storage.
- Parameters
mimetypes (iterable) –
iterable of dict with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of tuple (id, indexer_configuration_id) missing
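The "missing" contract described above can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: `missing_pairs` and the `stored` set are hypothetical names that do not exist in swh.indexer, and a real backend would query its database instead of a Python set.

```python
from typing import Dict, Iterable, List, Set, Tuple


def missing_pairs(
    candidates: Iterable[Dict], stored: Set[Tuple[bytes, int]]
) -> List[Tuple[bytes, int]]:
    """Return the (id, indexer_configuration_id) pairs absent from `stored`."""
    missing = []
    for row in candidates:
        key = (row["id"], row["indexer_configuration_id"])
        if key not in stored:
            missing.append(key)
    return missing


# Hypothetical data: the first pair is already indexed, the second is not.
stored = {(b"\x01" * 20, 1)}
candidates = [
    {"id": b"\x01" * 20, "indexer_configuration_id": 1},
    {"id": b"\x02" * 20, "indexer_configuration_id": 1},
]
result = missing_pairs(candidates, stored)
```

The other `*_missing` methods of this interface document the same input shape, so the same reading should apply to them.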
-
content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str]
Retrieve mimetypes within partition partition_id bound by limit.
- Parameters
indexer_configuration_id – The tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results (defaults to 1000)
- Raises
IndexerStorageArgumentException – if limit is None, or if a wrong indexer_type is provided
- Returns
PagedResult of Sha1. If next_page_token is None, there is no more data to fetch
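The pagination protocol described above (call again with the returned token until next_page_token is None) can be sketched as a small loop. The `Page` class below is a hypothetical stand-in for the paged result, assuming it exposes `results` and `next_page_token`; `iter_partition` and `fake_fetch` are illustrative names, not part of swh.indexer.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical stand-in with the two fields assumed on a paged result."""

    results: List[bytes]
    next_page_token: Optional[str]


def iter_partition(
    fetch_page: Callable[[Optional[str], int], Page], limit: int = 1000
) -> Iterator[bytes]:
    """Yield every id of one partition, following next_page_token until None."""
    page_token: Optional[str] = None
    while True:
        page = fetch_page(page_token, limit)
        yield from page.results
        if page.next_page_token is None:
            break
        page_token = page.next_page_token


# Fake backend for the sketch: serves a fixed list of ids, `limit` at a time,
# and uses the decimal offset as its page token (opaque to the caller).
IDS = [bytes([i]) * 20 for i in range(5)]


def fake_fetch(page_token: Optional[str], limit: int) -> Page:
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    token = str(end) if end < len(IDS) else None
    return Page(results=IDS[start:end], next_page_token=token)


collected = list(iter_partition(fake_fetch, limit=2))
```

With a real storage object, `fetch_page` would presumably wrap a call such as `content_mimetype_get_partition(indexer_configuration_id, partition_id, nb_partitions, page_token=token, limit=limit)`; callers should treat the token as opaque and never parse it.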
-
content_mimetype_add(mimetypes: List[swh.indexer.storage.model.ContentMimetypeRow]) → Dict[str, int]
Add mimetypes not present in storage.
- Parameters
mimetypes – mimetype rows to be added, with their tool attribute set to None.
- Returns
Dict summary of number of rows added
-
content_mimetype_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMimetypeRow]
Retrieve full content mimetype per ids.
- Parameters
ids – sha1 identifiers
- Returns
mimetype row objects
-
content_language_missing(languages: Iterable[Dict]) → List[Tuple[bytes, int]]
List languages missing from storage.
- Parameters
languages (iterable) –
dictionaries with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of tuple (id, indexer_configuration_id) missing
-
content_language_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLanguageRow]
Retrieve full content language per ids.
- Parameters
ids (iterable) – sha1 identifiers
- Returns
language row objects
-
content_language_add(languages: List[swh.indexer.storage.model.ContentLanguageRow]) → Dict[str, int]
Add languages not present in storage.
- Parameters
languages – language row objects
- Returns
Dict summary of number of rows added
-
List ctags missing from storage.
- Parameters
ctags (iterable) –
dicts with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of missing ids, one per (id, indexer_configuration_id) tuple
-
Retrieve ctags per id.
- Parameters
ids (iterable) – sha1 checksums
- Returns
list of ctags rows
-
Add ctags not present in storage.
- Parameters
ctags (iterable) –
dictionaries with keys:
id (bytes): sha1
ctags (list): list of dictionaries with keys: name, kind, line, lang
- Returns
Dict summary of number of rows added
-
Search through content’s raw ctags symbols.
- Parameters
expression (str) – Expression to search for
limit (int) – Number of rows to return (defaults to 10).
last_sha1 (str) – Offset from which to retrieve data (defaults to ‘’).
- Returns
rows of ctags including id, name, lang, kind, line, etc…
-
content_fossology_license_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLicenseRow]
Retrieve licenses per id.
- Parameters
ids – sha1 identifiers
- Yields
license rows; possibly more than one per (sha1, tool_id) if there are multiple licenses.
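Because several rows may come back for the same (sha1, tool_id) pair, callers will often want to regroup them. The sketch below does so on hypothetical flat tuples of the form (sha1, tool_id, license); the tuple layout and the `group_licenses` helper are assumptions made for illustration and do not reflect the actual attributes of ContentLicenseRow.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical flat row layout: (sha1, tool_id, license).
Row = Tuple[bytes, int, str]


def group_licenses(rows: Iterable[Row]) -> Dict[Tuple[bytes, int], List[str]]:
    """Collect all licenses reported for each (sha1, tool_id) pair."""
    grouped: Dict[Tuple[bytes, int], List[str]] = defaultdict(list)
    for sha1, tool_id, license_name in rows:
        grouped[(sha1, tool_id)].append(license_name)
    return dict(grouped)


# Illustrative data: the first content carries two licenses for the same tool.
rows = [
    (b"\x01" * 20, 1, "MIT"),
    (b"\x01" * 20, 1, "Apache-2.0"),
    (b"\x02" * 20, 1, "GPL-3.0-only"),
]
grouped = group_licenses(rows)
```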
-
content_fossology_license_add(licenses: List[swh.indexer.storage.model.ContentLicenseRow]) → Dict[str, int]
Add licenses not present in storage.
- Parameters
licenses – license rows to be added, with their tool attribute set to None.
- Returns
Dict summary of number of rows added
-
content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str]
Retrieve licenses within the partition partition_id bound by limit.
- Parameters
indexer_configuration_id – The tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results (defaults to 1000)
- Raises
IndexerStorageArgumentException – if limit is None, or if a wrong indexer_type is provided
- Returns
PagedResult of Sha1. If next_page_token is None, there is no more data to fetch
-
content_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Yields
missing sha1s
-
content_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMetadataRow]
Retrieve metadata per id.
- Parameters
ids (iterable) – sha1 checksums
- Yields
dictionaries with the following keys:
id (bytes)
metadata (str): associated metadata
tool (dict): tool used to compute metadata
-
content_metadata_add(metadata: List[swh.indexer.storage.model.ContentMetadataRow]) → Dict[str, int]
Add metadata not present in storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id: sha1
metadata: arbitrary dict
- Returns
Dict summary of number of rows added
-
revision_intrinsic_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1_git revision identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
missing ids
-
revision_intrinsic_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]
Retrieve revision metadata per id.
- Parameters
ids (iterable) – sha1 checksums
- Returns
RevisionIntrinsicMetadataRow objects
-
revision_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]) → Dict[str, int]
Add metadata not present in storage.
- Parameters
metadata – RevisionIntrinsicMetadataRow objects
- Returns
Dict summary of number of rows added
-
origin_intrinsic_metadata_get(urls: Iterable[str]) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]
Retrieve origin metadata per URL.
- Parameters
urls (iterable) – origin URLs
- Returns
list of OriginIntrinsicMetadataRow
-
origin_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]) → Dict[str, int]
Add origin metadata not present in storage.
- Parameters
metadata – list of OriginIntrinsicMetadataRow objects
- Returns
Dict summary of number of rows added
-
origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]
Returns the list of origins whose metadata contain all the terms.
- Parameters
conjunction – List of terms to be searched for.
limit – The maximum number of results to return
- Returns
list of OriginIntrinsicMetadataRow
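The "all the terms" requirement means the search is a conjunction: an origin matches only if every term is found. The sketch below illustrates that semantics with naive whitespace tokenisation over an in-memory dict. It is an assumption-laden toy, not the storage's actual matching or ranking logic, and `search_conjunction` and the example URLs are hypothetical.

```python
from typing import Dict, List


def search_conjunction(
    metadata_by_origin: Dict[str, str], conjunction: List[str], limit: int = 100
) -> List[str]:
    """Return up to `limit` origin URLs whose text contains every term."""
    terms = [term.lower() for term in conjunction]
    matches: List[str] = []
    for url, text in metadata_by_origin.items():
        if len(matches) >= limit:
            break
        words = set(text.lower().split())
        # Conjunction: every term must be present, not just one of them.
        if all(term in words for term in terms):
            matches.append(url)
    return matches


# Illustrative data only.
metadata = {
    "https://example.org/a": "Django web framework",
    "https://example.org/b": "web scraping toolkit",
    "https://example.org/c": "A web framework for Rust",
}
hits = search_conjunction(metadata, ["web", "framework"])
```

Here `hits` contains origins a and c but not b, since b lacks the term "framework".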
-
origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: Optional[List[str]] = None, tool_ids: Optional[List[int]] = None) → swh.core.api.classes.PagedResult[Union[str, swh.indexer.storage.model.OriginIntrinsicMetadataRow], str]
Returns the list of origins whose intrinsic metadata were produced by the given mappings or tools.
- Parameters
page_token (str) – Opaque token used for pagination.
limit (int) – The maximum number of results to return
ids_only (bool) – Determines whether only origin urls are returned or the content as well
mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.
- Returns
OriginIntrinsicMetadataRow objects
-
origin_intrinsic_metadata_stats()
Returns counts of indexed metadata per origins, broken down into metadata types.
- Returns
dictionary with keys:
total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)
non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from
per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings.
- Return type
dict
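To make the shape of the returned dictionary concrete, the sketch below computes the three documented keys from an in-memory list. The row layout (a `metadata` dict plus a `mappings` list) and the `compute_stats` helper are assumptions for illustration; the mapping names in the sample data are placeholders.

```python
from collections import Counter
from typing import Dict, List


def compute_stats(rows: List[Dict]) -> Dict:
    """Build a dict with the documented keys: total, non_empty, per_mapping."""
    per_mapping: Counter = Counter()
    non_empty = 0
    for row in rows:
        if row["metadata"]:
            non_empty += 1
        # An origin counts once per mapping, even if a mapping is repeated.
        per_mapping.update(set(row["mappings"]))
    return {
        "total": len(rows),
        "non_empty": non_empty,
        "per_mapping": dict(per_mapping),
    }


# Illustrative rows: one origin used no mapping and yielded empty metadata.
rows = [
    {"metadata": {"name": "foo"}, "mappings": ["npm"]},
    {"metadata": {}, "mappings": []},
    {"metadata": {"name": "bar"}, "mappings": ["npm", "codemeta"]},
]
stats = compute_stats(rows)
```

Note that the per_mapping counts need not sum to total, since an origin may use zero, one, or many mappings.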
-
indexer_configuration_add(tools)
Add new tools to the storage.
- Parameters
tools ([dict]) –
List of dictionary representing tool to insert in the db. Dictionary with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns
List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
-
indexer_configuration_get(tool)
Retrieve tool information.
- Parameters
tool (dict) –
Dictionary representing a tool with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns
The same dictionary with an id key, None otherwise.
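The two tool methods can be read together as a small registry keyed on (tool_name, tool_version, tool_configuration). The in-memory class below is a hypothetical sketch of that reading, not the storage implementation. In particular, returning the existing entry when a tool is added twice is an assumption, and the tool values are made up.

```python
import json
from typing import Dict, List, Optional, Tuple


class ToolRegistry:
    """Hypothetical in-memory stand-in for the tool configuration table."""

    def __init__(self) -> None:
        self._tools: Dict[Tuple[str, str, str], Dict] = {}
        self._next_id = 1

    @staticmethod
    def _key(tool: Dict) -> Tuple[str, str, str]:
        # Serialise the free-form configuration so it can be used as a key.
        config = json.dumps(tool["tool_configuration"], sort_keys=True)
        return (tool["tool_name"], tool["tool_version"], config)

    def add(self, tools: List[Dict]) -> List[Dict]:
        """Register tools and return them with an `id` key set."""
        result = []
        for tool in tools:
            key = self._key(tool)
            if key not in self._tools:
                self._tools[key] = {**tool, "id": self._next_id}
                self._next_id += 1
            # Assumption: a tool that already exists is returned unchanged.
            result.append(self._tools[key])
        return result

    def get(self, tool: Dict) -> Optional[Dict]:
        """Return the tool dict with its `id`, or None if it is unknown."""
        return self._tools.get(self._key(tool))


# Illustrative tool description; the values are invented for the example.
tool = {
    "tool_name": "example-tool",
    "tool_version": "1.0",
    "tool_configuration": {"mode": "strict"},
}
registry = ToolRegistry()
added = registry.add([tool])
found = registry.get(tool)
unknown = registry.get({**tool, "tool_version": "2.0"})
```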
-