swh.indexer.storage package#



Module contents:


Recursively replaces NUL characters, as PostgreSQL does not allow them in text fields.

swh.indexer.storage.get_indexer_storage(cls: str, **kwargs) → IndexerStorageInterface[source]#

Instantiate an indexer storage implementation of class cls with arguments kwargs.

Parameters:

  • cls – indexer storage class (local, remote or memory)

  • kwargs – dictionary of arguments passed to the indexer storage class constructor

Returns:

an instance of swh.indexer.storage

Raises:

ValueError – if passed an unknown storage class.
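The class dispatch can be illustrated with a self-contained sketch; the registry and the MemoryIndexerStorage stand-in below are illustrative, not swh's actual code:

```python
class MemoryIndexerStorage:
    """Illustrative stand-in for an in-memory indexer storage backend."""

    def __init__(self, **kwargs):
        self.config = kwargs


# Hypothetical registry mapping class names to implementations.
STORAGE_CLASSES = {"memory": MemoryIndexerStorage}


def get_indexer_storage(cls, **kwargs):
    """Instantiate the indexer storage implementation named by cls,
    forwarding kwargs to its constructor; unknown names raise ValueError."""
    try:
        storage_class = STORAGE_CLASSES[cls]
    except KeyError:
        raise ValueError(f"Unknown indexer storage class `{cls}`")
    return storage_class(**kwargs)
```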


swh.indexer.storage.check_id_duplicates(data)[source]#

If any two row models in data have the same unique key, raises a ValueError.

Values associated to the key must be hashable.

Parameters:

data (List[dict]) – List of dictionaries to be inserted

>>> tool1 = {"name": "foo", "version": "1.2.3", "configuration": {}}
>>> tool2 = {"name": "foo", "version": "1.2.4", "configuration": {}}
>>> check_id_duplicates([
...     ContentLicenseRow(id=b'foo', tool=tool1, license="GPL"),
...     ContentLicenseRow(id=b'foo', tool=tool2, license="GPL"),
... ])
>>> check_id_duplicates([
...     ContentLicenseRow(id=b'foo', tool=tool1, license="AGPL"),
...     ContentLicenseRow(id=b'foo', tool=tool1, license="AGPL"),
... ])
Traceback (most recent call last):
  ...
swh.indexer.storage.exc.DuplicateId: [{'id': b'foo', 'license': 'AGPL', 'tool_configuration': '{}', 'tool_name': 'foo', 'tool_version': '1.2.3'}]
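The check exercised by the doctest can be sketched as follows; keying each row by (id, tool name, tool version) is a simplification of the real rows' unique keys:

```python
from collections import Counter


def check_id_duplicates(rows):
    """Raise ValueError listing every (id, tool name, tool version) key
    that occurs more than once in rows. This is a simplified stand-in
    for swh's per-row-model unique keys."""
    keys = [(r["id"], r["tool"]["name"], r["tool"]["version"]) for r in rows]
    duplicated = [key for key, count in Counter(keys).items() if count > 1]
    if duplicated:
        raise ValueError(f"The same keys are used by several rows: {duplicated}")
```

Two rows with the same id but different tools pass the check; the same id and the same tool do not.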
class swh.indexer.storage.IndexerStorage(db, min_pool_conns=1, max_pool_conns=10, journal_writer=None)[source]#

Bases: object

SWH Indexer Storage Datastore

Parameters:

  • db – either a libpq connection string, or a psycopg2 connection

  • journal_writer – configuration passed to swh.journal.writer.get_journal_writer

current_version = 137#
check_config(*, check_write)[source]#
content_mimetype_missing(mimetypes: Iterable[Dict]) → List[Tuple[bytes, int]][source]#
get_partition(indexer_type: str, indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000, with_textual_data=False) → PagedResult[bytes, str][source]#

Retrieve ids of content with indexer_type within partition partition_id, bound by limit.

Parameters:

  • indexer_type – Type of data content to index (mimetype, etc.)

  • indexer_configuration_id – The tool used to index data

  • partition_id – index of the partition to fetch

  • nb_partitions – total number of partitions to split into

  • page_token – opaque token used for pagination

  • limit – Limit result (default to 1000)

  • with_textual_data (bool) – Deal with only textual content (True) or all content (False, the default)

Raises:

IndexerStorageArgumentException – if limit is None, or if an unknown indexer_type is provided.

Returns:

PagedResult of Sha1. If next_page_token is None, there is no more data to fetch.
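Partitioning splits the 160-bit sha1 keyspace into nb_partitions contiguous ranges. A sketch of the membership test, assuming equal-width partitioning (illustrative, not necessarily swh's exact scheme):

```python
def partition_of(sha1: bytes, nb_partitions: int) -> int:
    """Map a 20-byte sha1 to one of nb_partitions contiguous,
    equal-width ranges of the 160-bit keyspace."""
    value = int.from_bytes(sha1, "big")
    # Floor of value * nb_partitions / 2**160: ids sort into ranges in
    # keyspace order, so each partition covers a contiguous sha1 interval.
    return value * nb_partitions >> 160
```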

content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) → PagedResult[bytes, str][source]#
content_mimetype_add(mimetypes: List[ContentMimetypeRow]) → Dict[str, int][source]#
content_mimetype_get(ids: Iterable[bytes]) → List[ContentMimetypeRow][source]#
content_fossology_license_get(ids: Iterable[bytes]) → List[ContentLicenseRow][source]#
content_fossology_license_add(licenses: List[ContentLicenseRow]) → Dict[str, int][source]#
content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) → PagedResult[bytes, str][source]#
content_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]][source]#
content_metadata_get(ids: Iterable[bytes]) → List[ContentMetadataRow][source]#
content_metadata_add(metadata: List[ContentMetadataRow]) → Dict[str, int][source]#
directory_intrinsic_metadata_missing(metadata: Iterable[Dict]) → List[Tuple[bytes, int]][source]#
directory_intrinsic_metadata_get(ids: Iterable[bytes]) → List[DirectoryIntrinsicMetadataRow][source]#
directory_intrinsic_metadata_add(metadata: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int][source]#
origin_intrinsic_metadata_get(urls: Iterable[str]) → List[OriginIntrinsicMetadataRow][source]#
origin_intrinsic_metadata_add(metadata: List[OriginIntrinsicMetadataRow]) → Dict[str, int][source]#
origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) → List[OriginIntrinsicMetadataRow][source]#
origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: List[str] | None = None, tool_ids: List[int] | None = None) → PagedResult[str | OriginIntrinsicMetadataRow, str][source]#
origin_extrinsic_metadata_get(urls: Iterable[str]) → List[OriginExtrinsicMetadataRow][source]#
origin_extrinsic_metadata_add(metadata: List[OriginExtrinsicMetadataRow]) → Dict[str, int][source]#
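The *_get_partition methods paginate via page_token. A typical consumption loop, with a minimal PagedResult stand-in (the real class lives in swh.core), could look like:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PagedResult:
    """Minimal stand-in for swh's PagedResult container."""
    results: List[bytes]
    next_page_token: Optional[str] = None


def iter_partition(get_partition, tool_id, partition_id, nb_partitions):
    """Yield every id in one partition, following next_page_token
    until the backend reports no more pages."""
    page_token = None
    while True:
        page = get_partition(tool_id, partition_id, nb_partitions,
                             page_token=page_token)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break
```

In practice `get_partition` would be a bound method such as `storage.content_mimetype_get_partition`.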