swh.indexer.storage.db module

class swh.indexer.storage.db.Db(conn: connection, pool: AbstractConnectionPool | None = None)

Bases: BaseDb

Proxy to the SWH Indexer DB, with wrappers around stored procedures

Create a DB proxy.

Parameters:
  • conn – psycopg2 connection to the SWH DB

  • pool – psycopg2 pool of connections

content_mimetype_hash_keys = ['id', 'indexer_configuration_id']
content_mimetype_missing_from_list(mimetypes: Iterable[Dict], cur=None) → Iterator[bytes]

List missing mimetypes.
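
As a rough illustration of what "missing" can mean here, the sketch below treats an entry as already indexed when its (id, indexer_configuration_id) pair, the two keys named in content_mimetype_hash_keys, is present. The function name `missing_from_list` and the `existing` set are hypothetical in-memory stand-ins for the database table; the real method runs its query through the cursor and may differ in detail.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical in-memory stand-in for the content_mimetype table:
# each entry is an (id, indexer_configuration_id) pair already indexed.
Existing = Set[Tuple[bytes, int]]


def missing_from_list(mimetypes: Iterable[Dict], existing: Existing) -> Iterator[bytes]:
    """Yield the id of every entry whose key pair is not in `existing`."""
    for entry in mimetypes:
        key = (entry["id"], entry["indexer_configuration_id"])
        if key not in existing:
            yield entry["id"]


existing = {(b"\x01", 1), (b"\x02", 1)}
queries = [
    {"id": b"\x01", "indexer_configuration_id": 1},  # already indexed
    {"id": b"\x02", "indexer_configuration_id": 2},  # same id, other tool
    {"id": b"\x03", "indexer_configuration_id": 1},  # never indexed
]
missing = list(missing_from_list(queries, existing))
```

In this model the same content id counts as missing again when it is requested for a different indexer configuration.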

content_mimetype_cols = ['id', 'mimetype', 'encoding', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_mimetype(cur=None)
content_mimetype_add_from_temp(cur=None)
content_indexer_names = {'fossology_license': 'content_fossology_license', 'mimetype': 'content_mimetype'}
content_get_range(content_type, start, end, indexer_configuration_id, limit=1000, with_textual_data=False, cur=None)

Retrieve contents with content_type, within range [start, end] bound by limit and associated to the given indexer configuration id.

When with_textual_data is set, the results are additionally filtered through the mimetype table so that only contents whose mimetype is not binary are returned.
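
The semantics described above can be sketched with a small in-memory model. This is not the SQL the method runs: the `get_range` function and the row dictionaries are hypothetical, the ordering by id is an assumption, and "not binary" is approximated here as "mimetype starts with text/", which is also an assumption.

```python
def get_range(rows, start, end, indexer_configuration_id,
              limit=1000, with_textual_data=False):
    """Return up to `limit` ids in the inclusive range [start, end]."""
    selected = [
        r for r in rows
        if start <= r["id"] <= end
        and r["indexer_configuration_id"] == indexer_configuration_id
        # Assumption: textual content is anything under text/*.
        and (not with_textual_data or r["mimetype"].startswith("text/"))
    ]
    # Assumption: results come back ordered by id before the limit applies.
    selected.sort(key=lambda r: r["id"])
    return [r["id"] for r in selected[:limit]]


rows = [
    {"id": b"\x01", "indexer_configuration_id": 1, "mimetype": "text/plain"},
    {"id": b"\x02", "indexer_configuration_id": 1, "mimetype": "application/octet-stream"},
    {"id": b"\x03", "indexer_configuration_id": 1, "mimetype": "text/x-python"},
    {"id": b"\x04", "indexer_configuration_id": 2, "mimetype": "text/plain"},
    {"id": b"\x05", "indexer_configuration_id": 1, "mimetype": "text/html"},
]

everything = get_range(rows, b"\x01", b"\x05", 1)
textual = get_range(rows, b"\x01", b"\x05", 1, with_textual_data=True)
first_two = get_range(rows, b"\x01", b"\x05", 1, limit=2)
```

A caller paging through a large range would typically restart from the last id returned once `limit` is reached; that pagination strategy is left out of the sketch.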

content_mimetype_get_from_list(ids, cur=None)
content_fossology_license_cols = ['id', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration', 'license']
mktemp_content_fossology_license(cur=None)
content_fossology_license_add_from_temp(cur=None)

Add new licenses per content.

content_fossology_license_get_from_list(ids, cur=None)

Retrieve licenses per id.

content_metadata_hash_keys = ['id', 'indexer_configuration_id']
content_metadata_missing_from_list(metadata, cur=None)

List missing metadata.

content_metadata_cols = ['id', 'metadata', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_metadata(cur=None)
content_metadata_add_from_temp(cur=None)
content_metadata_get_from_list(ids, cur=None)
directory_intrinsic_metadata_hash_keys = ['id', 'indexer_configuration_id']
directory_intrinsic_metadata_missing_from_list(metadata, cur=None)

List missing metadata.

directory_intrinsic_metadata_cols = ['id', 'metadata', 'mappings', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_directory_intrinsic_metadata(cur=None)
directory_intrinsic_metadata_add_from_temp(cur=None)
directory_intrinsic_metadata_get_from_list(ids, cur=None)
origin_intrinsic_metadata_cols = ['id', 'metadata', 'from_directory', 'mappings', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
origin_intrinsic_metadata_regconfig = 'pg_catalog.simple'

The dictionary used to normalize ‘metadata’ and queries. ‘pg_catalog.simple’ defines no stopwords, so it should be suitable for proper names and non-English content. When updating this value, make sure to add a new index on origin_intrinsic_metadata.metadata.
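
To see why a stopword-free configuration matters for proper names, consider the toy normalizer below. It is only a model: the `normalize` function and the tiny stopword set are hypothetical and do not reproduce PostgreSQL's parser or its shipped stopword lists.

```python
import re

# Tiny hypothetical stopword list standing in for a language-specific
# configuration; it is not the list PostgreSQL ships.
TOY_ENGLISH_STOPWORDS = frozenset({"the", "who", "a", "of"})


def normalize(text, stopwords=frozenset()):
    """Lowercase, split into word tokens, and drop any stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


# With no stopwords (the 'simple'-like case) every token is kept.
simple_tokens = normalize("The Who")
# With a stopword list, a proper name made of common words vanishes.
filtered_tokens = normalize("The Who", TOY_ENGLISH_STOPWORDS)
```

In this model a name made entirely of common words normalizes to nothing once stopwords are removed, so a search for it could never match; keeping every token avoids that failure mode.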

mktemp_origin_intrinsic_metadata(cur=None)
origin_intrinsic_metadata_add_from_temp(cur=None)
origin_intrinsic_metadata_get_from_list(ids, cur=None)
origin_intrinsic_metadata_search_fulltext(terms, *, limit, cur)
origin_intrinsic_metadata_search_by_producer(last, limit, ids_only, mappings, tool_ids, cur)
origin_extrinsic_metadata_cols = ['id', 'metadata', 'from_remd_id', 'mappings', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_origin_extrinsic_metadata(cur=None)
origin_extrinsic_metadata_add_from_temp(cur=None)
origin_extrinsic_metadata_get_from_list(ids, cur=None)
indexer_configuration_cols = ['id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_indexer_configuration(cur=None)
indexer_configuration_add_from_temp(cur=None)
indexer_configuration_get(tool_name, tool_version, tool_configuration, cur=None)
indexer_configuration_get_from_id(id_, cur=None)