swh.indexer.storage package#

Module contents#

swh.indexer.storage.sanitize_json(doc)[source]#

Recursively replaces NUL characters, as PostgreSQL does not allow them in text fields.
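
The behavior can be pictured with a short sketch (illustrative only, not the actual implementation; it assumes NUL characters are simply dropped from strings):

>>> def sanitize_json(doc):
...     # Strings: drop NUL characters, which PostgreSQL rejects in text fields
...     if isinstance(doc, str):
...         return doc.replace("\x00", "")
...     # Mappings and lists: recurse into the values
...     elif isinstance(doc, dict):
...         return {key: sanitize_json(value) for key, value in doc.items()}
...     elif isinstance(doc, list):
...         return [sanitize_json(value) for value in doc]
...     # Anything else (numbers, booleans, None) passes through unchanged
...     else:
...         return doc
>>> sanitize_json({"name": "foo\x00bar", "tags": ["a\x00", "b"]})
{'name': 'foobar', 'tags': ['a', 'b']}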

swh.indexer.storage.get_indexer_storage(cls: str, **kwargs) IndexerStorageInterface[source]#

Instantiate an indexer storage implementation of class cls with arguments kwargs.

Parameters:
  • cls – indexer storage class (local, remote or memory)

  • kwargs – dictionary of arguments passed to the indexer storage class constructor

Returns:

an instance of an IndexerStorageInterface implementation

Raises:

ValueError if passed an unknown storage class.
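
For example (the "memory" and "remote" class names come from the parameter description above; the remote URL is hypothetical, and building the remote client does not open a connection):

>>> from swh.indexer.storage import get_indexer_storage
>>> storage = get_indexer_storage("memory")
>>> remote = get_indexer_storage("remote", url="http://localhost:5007/")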

swh.indexer.storage.check_id_duplicates(data)[source]#

If any two row models in data have the same unique key, raises a DuplicateId exception.

Values associated with the key must be hashable.

Parameters:

data (List[BaseRow]) – list of row models to be inserted

>>> from swh.indexer.storage import check_id_duplicates
>>> from swh.indexer.storage.model import ContentLicenseRow
>>> tool1 = {"name": "foo", "version": "1.2.3", "configuration": {}}
>>> tool2 = {"name": "foo", "version": "1.2.4", "configuration": {}}
>>> check_id_duplicates([
...     ContentLicenseRow(id=b'foo', tool=tool1, license="GPL"),
...     ContentLicenseRow(id=b'foo', tool=tool2, license="GPL"),
... ])
>>> check_id_duplicates([
...     ContentLicenseRow(id=b'foo', tool=tool1, license="AGPL"),
...     ContentLicenseRow(id=b'foo', tool=tool1, license="AGPL"),
... ])
Traceback (most recent call last):
...
swh.indexer.storage.exc.DuplicateId: [{'id': b'foo', 'license': 'AGPL', 'tool_configuration': '{}', 'tool_name': 'foo', 'tool_version': '1.2.3'}]
class swh.indexer.storage.IndexerStorage(db, min_pool_conns=1, max_pool_conns=10, journal_writer=None)[source]#

Bases: object

SWH Indexer Storage Datastore

Parameters:
  • db – either a libpq connection string, or a psycopg2 connection

  • min_pool_conns – minimum number of pooled database connections

  • max_pool_conns – maximum number of pooled database connections

  • journal_writer – configuration passed to swh.journal.writer.get_journal_writer
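
A direct instantiation sketch (it assumes a reachable PostgreSQL database; the database name is hypothetical):

>>> from swh.indexer.storage import IndexerStorage
>>> storage = IndexerStorage(db="dbname=softwareheritage-indexer")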

current_version = 137#
get_db()[source]#
put_db(db)[source]#
check_config(*, check_write)[source]#
content_mimetype_missing(mimetypes: Iterable[Dict]) List[Tuple[bytes, int]][source]#
get_partition(indexer_type: str, indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000, with_textual_data=False) PagedResult[bytes, str][source]#

Retrieve ids of content with indexer_type within partition partition_id, bound by limit.

Parameters:
  • indexer_type – type of data content to index (mimetype, etc.)

  • indexer_configuration_id – the tool used to index data

  • partition_id – index of the partition to fetch

  • nb_partitions – total number of partitions to split into

  • page_token – opaque token used for pagination

  • limit – maximum number of results to return (defaults to 1000)

  • with_textual_data (bool) – if True, only deal with textual content; if False (the default), deal with all content

Raises:

IndexerStorageArgumentException – if limit is None, or if an unknown indexer_type is provided.

Returns:

PagedResult of Sha1 ids. If next_page_token is None, there is no more data to fetch.
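
A typical pagination loop, sketched with the content_mimetype_get_partition wrapper below (it assumes a storage instance as in the earlier examples; the tool id and partition split are arbitrary):

>>> page_token = None
>>> while True:
...     result = storage.content_mimetype_get_partition(
...         indexer_configuration_id=1, partition_id=0, nb_partitions=16,
...         page_token=page_token)
...     for content_id in result.results:
...         pass  # process each sha1 here
...     if result.next_page_token is None:
...         break
...     page_token = result.next_page_token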

content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) PagedResult[bytes, str][source]#
content_mimetype_add(mimetypes: List[ContentMimetypeRow]) Dict[str, int][source]#
content_mimetype_get(ids: Iterable[bytes]) List[ContentMimetypeRow][source]#
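
A round-trip sketch on the in-memory backend (the sha1 and the tool values are made up; ContentMimetypeRow lives in swh.indexer.storage.model):

>>> from swh.indexer.storage import get_indexer_storage
>>> from swh.indexer.storage.model import ContentMimetypeRow
>>> storage = get_indexer_storage("memory")
>>> tool = storage.indexer_configuration_add(
...     [{"tool_name": "file", "tool_version": "5.39", "tool_configuration": {}}])[0]
>>> summary = storage.content_mimetype_add([ContentMimetypeRow(
...     id=b"\x01" * 20, indexer_configuration_id=tool["id"],
...     mimetype="text/plain", encoding="us-ascii")])
>>> rows = storage.content_mimetype_get([b"\x01" * 20])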
content_fossology_license_get(ids: Iterable[bytes]) List[ContentLicenseRow][source]#
content_fossology_license_add(licenses: List[ContentLicenseRow]) Dict[str, int][source]#
content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) PagedResult[bytes, str][source]#
content_metadata_missing(metadata: Iterable[Dict]) List[Tuple[bytes, int]][source]#
content_metadata_get(ids: Iterable[bytes]) List[ContentMetadataRow][source]#
content_metadata_add(metadata: List[ContentMetadataRow]) Dict[str, int][source]#
directory_intrinsic_metadata_missing(metadata: Iterable[Dict]) List[Tuple[bytes, int]][source]#
directory_intrinsic_metadata_get(ids: Iterable[bytes]) List[DirectoryIntrinsicMetadataRow][source]#
directory_intrinsic_metadata_add(metadata: List[DirectoryIntrinsicMetadataRow]) Dict[str, int][source]#
origin_intrinsic_metadata_get(urls: Iterable[str]) List[OriginIntrinsicMetadataRow][source]#
origin_intrinsic_metadata_add(metadata: List[OriginIntrinsicMetadataRow]) Dict[str, int][source]#
origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) List[OriginIntrinsicMetadataRow][source]#
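
The conjunction argument is a list of terms that must all appear in an origin's metadata for it to match. A minimal sketch (assuming a storage instance as above):

>>> matches = storage.origin_intrinsic_metadata_search_fulltext(
...     ["python", "library"], limit=10)  # [] when nothing matches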
origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: List[str] | None = None, tool_ids: List[int] | None = None) PagedResult[str | OriginIntrinsicMetadataRow, str][source]#
origin_intrinsic_metadata_stats()[source]#
origin_extrinsic_metadata_get(urls: Iterable[str]) List[OriginExtrinsicMetadataRow][source]#
origin_extrinsic_metadata_add(metadata: List[OriginExtrinsicMetadataRow]) Dict[str, int][source]#
indexer_configuration_add(tools)[source]#
indexer_configuration_get(tool)[source]#
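
Tools are plain dictionaries keyed by tool_name, tool_version and tool_configuration; a sketch (the tool values are made up, and storage is assumed as above):

>>> tool = {"tool_name": "nomos", "tool_version": "3.1.0", "tool_configuration": {}}
>>> [inserted] = storage.indexer_configuration_add([tool])  # tools get their id filled in
>>> found = storage.indexer_configuration_get(tool)  # the stored tool dict, or None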