swh.indexer.storage.in_memory module

swh.indexer.storage.in_memory.check_id_types(data: List[Dict[str, Any]])[source]

Checks all elements of the list have an ‘id’ whose type is ‘bytes’.
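
As a rough illustration of the check described above, the sketch below validates a list of dictionaries. The function name check_ids_are_bytes and the choice of TypeError are assumptions made for this example; they are not taken from the module, which may signal failures differently.

```python
from typing import Any, Dict, List


def check_ids_are_bytes(data: List[Dict[str, Any]]) -> None:
    """Raise TypeError unless every element has an 'id' of type bytes.

    Hypothetical helper; the exception type is an assumption.
    """
    for index, item in enumerate(data):
        # A missing 'id' yields None here, which also fails the type test.
        if not isinstance(item.get("id"), bytes):
            raise TypeError(f"element {index}: 'id' must be bytes")


# Passes silently: the id is a 20-byte value, like a sha1 digest.
check_ids_are_bytes([{"id": b"\x01" * 20}])
```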

class swh.indexer.storage.in_memory.SubStorage(row_class: Type[swh.indexer.storage.in_memory.TValue], tools, journal_writer)[source]

Bases: Generic[swh.indexer.storage.in_memory.TValue]

Implements common missing/get/add logic for each indexer type.

missing(keys: Iterable[Dict]) → List[bytes][source]

List data missing from storage.

Parameters

keys (iterable) – dictionaries with keys:

  • id (bytes): sha1 identifier

  • indexer_configuration_id (int): tool used to compute the results

Returns

missing sha1s
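
The lookup described above can be sketched with a set of (id, indexer_configuration_id) pairs. This is a minimal illustration under that assumption; the variable stored and the standalone function are hypothetical and do not reflect how SubStorage keeps its data.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record of what is already stored:
# one (sha1, indexer_configuration_id) pair per indexed result.
stored: Set[Tuple[bytes, int]] = {(b"\x01" * 20, 1)}


def missing(keys: Iterable[Dict]) -> List[bytes]:
    """Return the ids whose (id, tool) pair is not in `stored`."""
    result = []
    for key in keys:
        pair = (key["id"], key["indexer_configuration_id"])
        if pair not in stored:
            result.append(key["id"])
    return result


queries = [
    {"id": b"\x01" * 20, "indexer_configuration_id": 1},  # already stored
    {"id": b"\x01" * 20, "indexer_configuration_id": 2},  # same id, other tool
    {"id": b"\x02" * 20, "indexer_configuration_id": 1},  # unknown id
]
missing_ids = missing(queries)
```

In this sketch an id counts as missing for each tool that has not yet produced a result for it, so the same sha1 can be present for one tool and missing for another.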

get(ids: Iterable[bytes]) → List[swh.indexer.storage.in_memory.TValue][source]

Retrieve data per id.

Parameters

ids (iterable) – sha1 checksums

Returns

list of TValue rows, each with the following keys:

  • id (bytes)

  • tool (dict): tool used to compute metadata

  • arbitrary data (as provided to add)
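
A minimal sketch of the retrieval described above, using plain dicts in place of row objects. The tools registry, the results mapping, and the sample values are all hypothetical; the sketch only shows how an id, the resolved tool, and the stored data could be combined into one result.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical tool registry and stored results (illustrative values only).
tools: Dict[int, Dict[str, Any]] = {
    1: {"id": 1, "name": "example-tool", "version": "1.0"},
}
results: Dict[bytes, List[Dict[str, Any]]] = {
    b"\x01" * 20: [{"indexer_configuration_id": 1, "mimetype": "text/plain"}],
}


def get(ids: Iterable[bytes]) -> List[Dict[str, Any]]:
    """Return one dict per stored result for the requested ids."""
    rows = []
    for id_ in ids:
        # Ids with no stored result contribute nothing.
        for stored in results.get(id_, []):
            row = dict(stored)  # copy, so the stored entry is not mutated
            tool_id = row.pop("indexer_configuration_id")
            row["id"] = id_
            row["tool"] = tools[tool_id]  # replace the tool id by the tool dict
            rows.append(row)
    return rows


rows = get([b"\x01" * 20, b"\x02" * 20])
```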

get_all() → List[swh.indexer.storage.in_memory.TValue][source]

get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str][source]

Retrieve the ids of content with indexer_type that fall within partition partition_id, returning at most limit ids per page.

Parameters
  • **indexer_type** – Type of data content to index (mimetype, language, etc…)

  • **indexer_configuration_id** – The tool used to index data

  • **partition_id** – index of the partition to fetch

  • **nb_partitions** – total number of partitions to split into

  • **page_token** – opaque token used for pagination

  • **limit** – maximum number of results to return (defaults to 1000)

  • **with_textual_data** (bool) – restrict to textual content (True) or include all content (False, the default)

Raises
  • IndexerStorageArgumentException – if limit is None, or if a wrong indexer_type is provided

Returns

PagedResult of Sha1. If next_page_token is None, there is no more data to fetch
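
The partition and pagination behaviour described above can be sketched as follows. Everything here is an assumption made for illustration: the Page tuple stands in for PagedResult, the sha1 space is split into equal-width ranges, the page token is the hex form of the next id, ValueError replaces the documented exception, and filtering by indexer_configuration_id is left out. The module may compute partition bounds and tokens differently.

```python
from typing import Iterable, List, NamedTuple, Optional

SHA1_BITS = 160  # a sha1 digest is 20 bytes


class Page(NamedTuple):
    """Hypothetical stand-in for PagedResult."""
    results: List[bytes]
    next_page_token: Optional[str]


def partition_of(sha1: bytes, nb_partitions: int) -> int:
    """Map a 20-byte sha1 onto one of nb_partitions equal-width ranges."""
    return (int.from_bytes(sha1, "big") * nb_partitions) >> SHA1_BITS


def get_partition(
    ids: Iterable[bytes],
    partition_id: int,
    nb_partitions: int,
    page_token: Optional[str] = None,
    limit: int = 1000,
) -> Page:
    if limit is None:
        raise ValueError("limit should not be None")
    # Assumed token format: hex of the first id of the next page.
    start = bytes.fromhex(page_token) if page_token else b""
    candidates = sorted(
        i for i in ids
        if partition_of(i, nb_partitions) == partition_id and i >= start
    )
    page, rest = candidates[:limit], candidates[limit:]
    return Page(page, rest[0].hex() if rest else None)


ids = [bytes([b]) * 20 for b in (0x00, 0x10, 0x7F, 0x80, 0xFF)]
first = get_partition(ids, partition_id=0, nb_partitions=2, limit=2)
second = get_partition(
    ids, partition_id=0, nb_partitions=2,
    page_token=first.next_page_token, limit=2,
)
```

A caller would keep requesting pages with the returned token until next_page_token is None, as the Returns note above describes.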

add(data: Iterable[swh.indexer.storage.in_memory.TValue]) → int[source]

Add data not present in storage.

Parameters

data (iterable) – dictionaries with keys:

  • id: sha1

  • indexer_configuration_id: tool used to compute the results

  • arbitrary data
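
One possible reading of "add data not present in storage" is sketched below: entries are keyed by (id, indexer_configuration_id), entries whose key already exists are skipped, and the return value counts the entries actually added. The storage dict, the skip-on-duplicate policy, and the meaning of the returned int are assumptions for this example, not statements about the module.

```python
from typing import Any, Dict, Iterable, Tuple

# Hypothetical backing store keyed by (sha1, indexer_configuration_id).
storage: Dict[Tuple[bytes, int], Dict[str, Any]] = {}


def add(data: Iterable[Dict[str, Any]]) -> int:
    """Store entries whose key is new and return how many were added."""
    added = 0
    for item in data:
        key = (item["id"], item["indexer_configuration_id"])
        if key not in storage:  # assumed policy: keep the existing entry
            storage[key] = item
            added += 1
    return added


a = {"id": b"\x01" * 20, "indexer_configuration_id": 1, "lang": "python"}
b = {"id": b"\x02" * 20, "indexer_configuration_id": 1, "lang": "c"}
c = {"id": b"\x01" * 20, "indexer_configuration_id": 2, "lang": "python"}

first_count = add([a, b])   # both keys are new
second_count = add([a, c])  # only c is new; a is already stored
```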

class swh.indexer.storage.in_memory.IndexerStorage(journal_writer=None)[source]

Bases: object

In-memory SWH indexer storage.

check_config(*, check_write)[source]
content_mimetype_missing(mimetypes: Iterable[Dict]) → List[Tuple[bytes, int]][source]
content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str][source]
content_mimetype_add(mimetypes: List[swh.indexer.storage.model.ContentMimetypeRow]) → Dict[str, int][source]
content_mimetype_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMimetypeRow][source]
content_language_missing(languages: Iterable[Dict]) → List[Tuple[bytes, int]][source]
content_language_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLanguageRow][source]
content_language_add(languages: List[swh.indexer.storage.model.ContentLanguageRow]) → Dict[str, int][source]
content_ctags_missing(ctags: Iterable[Dict]) → List[Tuple[bytes, int]][source]
content_ctags_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentCtagsRow][source]
content_ctags_add(ctags: List[swh.indexer.storage.model.ContentCtagsRow]) → Dict[str, int][source]
content_fossology_license_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentLicenseRow][source]
content_fossology_license_add(licenses: List[swh.indexer.storage.model.ContentLicenseRow]) → Dict[str, int][source]
content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: Optional[str] = None, limit: int = 1000) → swh.core.api.classes.PagedResult[bytes, str][source]
content_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]][source]
content_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.ContentMetadataRow][source]
content_metadata_add(metadata: List[swh.indexer.storage.model.ContentMetadataRow]) → Dict[str, int][source]
revision_intrinsic_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]][source]
revision_intrinsic_metadata_get(ids: Iterable[bytes]) → List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow][source]
revision_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]) → Dict[str, int][source]
origin_intrinsic_metadata_get(urls: Iterable[str]) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow][source]
origin_intrinsic_metadata_add(metadata: List[swh.indexer.storage.model.OriginIntrinsicMetadataRow]) → Dict[str, int][source]
origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) → List[swh.indexer.storage.model.OriginIntrinsicMetadataRow][source]
origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: Optional[List[str]] = None, tool_ids: Optional[List[int]] = None) → swh.core.api.classes.PagedResult[Union[str, swh.indexer.storage.model.OriginIntrinsicMetadataRow], str][source]
origin_intrinsic_metadata_stats()[source]
indexer_configuration_add(tools)[source]
indexer_configuration_get(tool)[source]