swh.indexer.storage.api.client module
-
class swh.indexer.storage.api.client.RemoteStorage(url, api_exception=None, timeout=None, chunk_size=4096, reraise_exceptions=None, **kwargs)
Bases: swh.core.api.RPCClient
Proxy to a remote storage API.
-
backend_class
alias of swh.indexer.storage.interface.IndexerStorageInterface
-
api_exception
-
reraise_exceptions: ClassVar[List[Type[Exception]]] = [<class 'swh.indexer.storage.exc.IndexerStorageArgumentException'>, <class 'swh.indexer.storage.exc.DuplicateId'>]
-
extra_type_decoders: Dict[str, Callable] = {'idx_model': <function <lambda>>}
-
extra_type_encoders: List[Tuple[type, str, Callable]] = [(<class 'swh.indexer.storage.model.BaseRow'>, 'idx_model', <function _encode_model_object>)]
-
check_config(*, check_write)
Check that the storage is configured and ready to go.
Add ctags not present in storage
- Parameters
ctags (iterable) –
dictionaries with keys:
id (bytes): sha1
ctags (list): list of dictionaries with keys: name, kind, line, lang
- Returns
Dict summary of number of rows added
Retrieve ctags per id.
- Parameters
ids (iterable) – sha1 checksums
- Returns
list of ctags rows
List ctags missing from storage.
- Parameters
ctags (iterable) –
dicts with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of missing id for the tuple (id, indexer_configuration_id)
Search through content’s raw ctags symbols.
- Parameters
expression (str) – Expression to search for
limit (int) – Number of rows to return (defaults to 10).
last_sha1 (str) – Offset from which to retrieve data (defaults to '').
- Returns
rows of ctags including id, name, lang, kind, line, etc…
-
content_fossology_license_add(licenses: List[swh.indexer.storage.model.ContentLicenseRow]) → Dict[str, int]
Add licenses not present in storage.
- Parameters
licenses – license rows to be added, with their tool attribute set to None.
- Returns
Dict summary of number of rows added
-
content_fossology_license_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLicenseRow]
Retrieve licenses per id.
- Parameters
ids – sha1 identifiers
- Yields
license rows; possibly more than one per (sha1, tool_id) if there are multiple licenses.
-
content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str]
Retrieve licenses within the partition partition_id, bounded by limit.
- Parameters
indexer_configuration_id – The tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results to return (defaults to 1000)
- Raises
IndexerStorageArgumentException – if limit is None or a wrong indexer_type is provided
- Returns
PagedResult of Sha1. If next_page_token is None, there is no more data to fetch.
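The pagination contract described above (keep fetching until next_page_token is None) can be sketched as a small loop. This is a minimal illustration, not the client's own code: the PagedResult class below is a local stand-in whose results field name is an assumption, and fake_get_partition is a hypothetical in-memory backend used only so the loop can run without a server.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class PagedResult:
    """Local stand-in for swh.core.api.classes.PagedResult (assumed shape)."""

    results: List[bytes] = field(default_factory=list)
    next_page_token: Optional[str] = None


def iter_partition(
    get_partition: Callable[..., PagedResult],
    indexer_configuration_id: int,
    partition_id: int,
    nb_partitions: int,
    limit: int = 1000,
) -> Iterator[bytes]:
    """Yield every sha1 of one partition by following next_page_token."""
    page_token: Optional[str] = None
    while True:
        page = get_partition(
            indexer_configuration_id,
            partition_id,
            nb_partitions,
            page_token=page_token,
            limit=limit,
        )
        yield from page.results
        if page.next_page_token is None:
            break  # no more data to fetch
        page_token = page.next_page_token


def fake_get_partition(
    indexer_configuration_id, partition_id, nb_partitions, page_token=None, limit=1000
):
    """Hypothetical backend: serves five fake sha1s, `limit` at a time."""
    data = [bytes([i]) * 20 for i in range(5)]
    start = int(page_token) if page_token else 0
    end = start + limit
    token = str(end) if end < len(data) else None
    return PagedResult(results=data[start:end], next_page_token=token)


sha1s = list(iter_partition(fake_get_partition, 1, 0, 16, limit=2))
```

With a real client, the bound method (for example content_fossology_license_get_partition or content_mimetype_get_partition) would be passed in place of fake_get_partition, assuming the returned object exposes the same two attributes.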
-
content_language_add(languages: List[swh.indexer.storage.model.ContentLanguageRow]) → Dict[str, int]
Add languages not present in storage.
- Parameters
languages – language row objects
- Returns
Dict summary of number of rows added
-
content_language_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLanguageRow]
Retrieve full content language per id.
- Parameters
ids (iterable) – sha1 identifiers
- Returns
language row objects
-
content_language_missing(languages: Iterable[Dict]) → List[Tuple[bytes, int]]
List languages missing from storage.
- Parameters
languages (iterable) –
dictionaries with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of tuple (id, indexer_configuration_id) missing
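The input and output shapes of the *_missing methods can be illustrated with a short sketch. FakeIndexerStorage is a hypothetical in-memory stand-in written for this example, not the real client, and the tool id 7 and the sha1 values are arbitrary.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_missing_query(sha1s: Iterable[bytes], tool_id: int) -> List[Dict]:
    """Build the list of dicts expected by the *_missing methods."""
    return [{"id": sha1, "indexer_configuration_id": tool_id} for sha1 in sha1s]


class FakeIndexerStorage:
    """Hypothetical stand-in that remembers which (id, tool) pairs are indexed."""

    def __init__(self, known: Set[Tuple[bytes, int]]):
        self._known = set(known)

    def content_language_missing(
        self, languages: Iterable[Dict]
    ) -> List[Tuple[bytes, int]]:
        # Return the (id, indexer_configuration_id) pairs not yet in storage.
        return [
            (d["id"], d["indexer_configuration_id"])
            for d in languages
            if (d["id"], d["indexer_configuration_id"]) not in self._known
        ]


sha1_a = b"\x01" * 20
sha1_b = b"\x02" * 20
storage = FakeIndexerStorage(known={(sha1_a, 7)})

query = build_missing_query([sha1_a, sha1_b], tool_id=7)
missing = storage.content_language_missing(query)
```

Here only sha1_b is reported as missing for tool 7, so an indexer following this pattern would compute the language for that content alone.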
-
content_metadata_add(metadata: List[swh.indexer.storage.model.ContentMetadataRow]) → Dict[str, int]
Add metadata not present in storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id: sha1
metadata: arbitrary dict
- Returns
Dict summary of number of rows added
-
content_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMetadataRow]
Retrieve metadata per id.
- Parameters
ids (iterable) – sha1 checksums
- Yields
dictionaries with the following keys:
id (bytes)
metadata (str): associated metadata
tool (dict): tool used to compute metadata
-
content_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Yields
missing sha1s
-
content_mimetype_add(mimetypes: List[swh.indexer.storage.model.ContentMimetypeRow]) → Dict[str, int]
Add mimetypes not present in storage.
- Parameters
mimetypes – mimetype rows to be added, with their tool attribute set to None.
overwrite (
True
) –default) –
- Returns
Dict summary of number of rows added
-
content_mimetype_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMimetypeRow]
Retrieve full content mimetype per id.
- Parameters
ids – sha1 identifiers
- Returns
mimetype row objects
-
content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str]
Retrieve mimetypes within the partition partition_id, bounded by limit.
- Parameters
indexer_configuration_id – The tool used to index data
partition_id – index of the partition to fetch
nb_partitions – total number of partitions to split into
page_token – opaque token used for pagination
limit – maximum number of results to return (defaults to 1000)
- Raises
IndexerStorageArgumentException – if limit is None or a wrong indexer_type is provided
- Returns
PagedResult of Sha1. If next_page_token is None, there is no more data to fetch.
-
content_mimetype_missing(mimetypes: Iterable[Dict]) → List[Tuple[bytes, int]]
Generate mimetypes missing from storage.
- Parameters
mimetypes (iterable) –
iterable of dict with keys:
id (bytes): sha1 identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
list of tuple (id, indexer_configuration_id) missing
-
indexer_configuration_add(tools)
Add new tools to the storage.
- Parameters
tools ([dict]) –
List of dictionaries, each representing a tool to insert in the db, with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns
List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
-
indexer_configuration_get(tool)
Retrieve tool information.
- Parameters
tool (dict) –
Dictionary representing a tool with the following keys:
tool_name (str): tool’s name
tool_version (str): tool’s version
tool_configuration (dict): tool’s configuration (free form dict)
- Returns
The same dictionary with an added id key if the tool is found, None otherwise.
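The tool dictionaries accepted by indexer_configuration_add and indexer_configuration_get can be sketched as follows. FakeToolStorage is a hypothetical in-memory stand-in that only mimics the documented inputs and outputs, and the tool name, version, and configuration are made-up values.

```python
from typing import Dict, List, Optional

TOOL_KEYS = ("tool_name", "tool_version", "tool_configuration")


def make_tool(name: str, version: str, configuration: Dict) -> Dict:
    """Build a tool dictionary with the three documented keys."""
    return {
        "tool_name": name,
        "tool_version": version,
        "tool_configuration": configuration,
    }


class FakeToolStorage:
    """Hypothetical stand-in mimicking the documented add/get behaviour."""

    def __init__(self) -> None:
        self._tools: List[Dict] = []

    def indexer_configuration_add(self, tools: List[Dict]) -> List[Dict]:
        inserted = []
        for tool in tools:
            # Assign a sequential id, as a database would.
            row = dict(tool, id=len(self._tools) + 1)
            self._tools.append(row)
            inserted.append(row)
        return inserted

    def indexer_configuration_get(self, tool: Dict) -> Optional[Dict]:
        for row in self._tools:
            if all(row[key] == tool[key] for key in TOOL_KEYS):
                return dict(row)  # same dictionary plus the id key
        return None


storage = FakeToolStorage()
tool = make_tool("example-tool", "1.0.0", {"mode": "default"})  # made-up values
added = storage.indexer_configuration_add([tool])
found = storage.indexer_configuration_get(tool)
not_found = storage.indexer_configuration_get(make_tool("other-tool", "0.1", {}))
```

Because the order of the list returned by indexer_configuration_add is not guaranteed, callers that register several tools at once would match the returned rows by their keys, not by position.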
-
origin_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]) → Dict[str, int]
Add origin metadata not present in storage.
- Parameters
metadata – list of OriginIntrinsicMetadataRow objects
- Returns
Dict summary of number of rows added
-
origin_intrinsic_metadata_get(urls: Iterable[str]) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]
Retrieve origin metadata per origin URL.
- Parameters
urls (iterable) – origin URLs
- Returns
list of OriginIntrinsicMetadataRow
-
origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: Optional[List[str]] = None, tool_ids: Optional[List[int]] = None) → swh.core.api.classes.PagedResult[Union[str, swh.indexer.storage.model.OriginIntrinsicMetadataRow], str]
Returns the list of origins whose metadata contain all the terms.
- Parameters
page_token (str) – Opaque token used for pagination.
limit (int) – The maximum number of results to return
ids_only (bool) – Determines whether only origin urls are returned or the content as well
mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.
- Returns
OriginIntrinsicMetadataRow objects
-
origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]
Returns the list of origins whose metadata contain all the terms.
- Parameters
conjunction – List of terms to be searched for.
limit – The maximum number of results to return
- Returns
list of OriginIntrinsicMetadataRow
-
origin_intrinsic_metadata_stats()
Returns counts of indexed metadata per origin, broken down by metadata type.
- Returns
dictionary with keys:
total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)
non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from
per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings.
- Return type
dict
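A short sketch of how the returned dictionary might be consumed. The counts and mapping names below are made up for illustration and do not come from a real archive; summarize_stats is a hypothetical helper, not part of the client.

```python
from typing import Dict, List, Union


def summarize_stats(stats: Dict) -> Dict[str, Union[float, List[str]]]:
    """Derive a non-empty ratio and a usage ranking from the stats dict."""
    total = stats["total"]
    # Guard against an empty index to avoid dividing by zero.
    ratio = stats["non_empty"] / total if total else 0.0
    ranked = sorted(
        stats["per_mapping"].items(), key=lambda item: item[1], reverse=True
    )
    return {
        "non_empty_ratio": ratio,
        "mappings_by_usage": [name for name, _count in ranked],
    }


# Made-up sample following the documented keys.
sample_stats = {
    "total": 10,
    "non_empty": 8,
    "per_mapping": {"mapping_a": 5, "mapping_b": 3},
}
summary = summarize_stats(sample_stats)
```

Since an origin may use zero, one, or many mappings, the per_mapping counts are not expected to add up to total.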
-
revision_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]) → Dict[str, int]
Add metadata not present in storage.
- Parameters
metadata – RevisionIntrinsicMetadataRow objects
- Returns
Dict summary of number of rows added
-
revision_intrinsic_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]
Retrieve revision metadata per id.
- Parameters
ids (iterable) – sha1 checksums
- Returns
RevisionIntrinsicMetadataRow objects
-
revision_intrinsic_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]]
List metadata missing from storage.
- Parameters
metadata (iterable) –
dictionaries with keys:
id (bytes): sha1_git revision identifier
indexer_configuration_id (int): tool used to compute the results
- Returns
missing ids
-