swh.indexer.storage package

Submodules

swh.indexer.storage.converters module

swh.indexer.storage.converters.ctags_to_db(ctags)[source]

Convert a ctags entry into a ready ctags entry.

Parameters:ctags (dict) –

ctags entry with the following keys:

  • id (bytes): content’s identifier
  • tool_id (int): tool id used to compute ctags
  • ctags ([dict]): List of dictionary with the following keys:
    • name (str): symbol’s name
    • kind (str): symbol’s kind
    • line (int): symbol’s line in the content
    • language (str): language
Returns:list of ctags entries as dicts with the following keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • language (str): language for that content
  • tool_id (int): tool id used to compute ctags
Return type:list
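The flattening described above can be sketched as follows. This is a minimal illustration written only from the documented input and output keys, not the module's actual implementation; the name ctags_to_db_sketch is hypothetical, and the sketch drops line because the documented return keys omit it.

```python
def ctags_to_db_sketch(ctags):
    """Flatten one ctags entry into a list of per-symbol dicts.

    Hypothetical re-implementation based only on the documented keys.
    """
    return [
        {
            'id': ctags['id'],
            'name': symbol['name'],
            'kind': symbol['kind'],
            'language': symbol['language'],
            'tool_id': ctags['tool_id'],
        }
        for symbol in ctags['ctags']
    ]


entry = {
    'id': b'\x01' * 20,
    'tool_id': 1,
    'ctags': [
        {'name': 'main', 'kind': 'function', 'line': 3, 'language': 'C'},
        {'name': 'Foo', 'kind': 'class', 'line': 10, 'language': 'C++'},
    ],
}
rows = ctags_to_db_sketch(entry)
```

Each symbol of the input becomes one output row that repeats the content id and the tool id.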
swh.indexer.storage.converters.db_to_ctags(ctag)[source]

Convert a ctags entry into a ready ctags entry.

Parameters:ctag (dict) –

ctags entry with the following keys:

  • id (bytes): content’s identifier
  • ctags ([dict]): List of dictionaries with the following keys:
    • name (str): symbol’s name
    • kind (str): symbol’s kind
    • line (int): symbol’s line in the content
    • language (str): language
Returns:list of ctags ready entry (dict with the following keys):
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • language (str): language for that content
  • tool (dict): tool used to compute the ctags
Return type:list
swh.indexer.storage.converters.db_to_mimetype(mimetype)[source]

Convert a mimetype entry into a ready mimetype output.

swh.indexer.storage.converters.db_to_language(language)[source]

Convert a language entry into a ready language output.

swh.indexer.storage.converters.db_to_metadata(metadata)[source]

Convert a metadata entry into a ready metadata output.

swh.indexer.storage.converters.db_to_fossology_license(license)[source]

swh.indexer.storage.db module

class swh.indexer.storage.db.Db(conn, pool=None)[source]

Bases: swh.core.db.BaseDb

Proxy to the SWH Indexer DB, with wrappers around stored procedures

content_mimetype_hash_keys = ['id', 'indexer_configuration_id']
_missing_from_list(table, data, hash_keys, cur=None)[source]

Read from table the data with hash_keys that are missing.

Parameters:
  • table (str) – Table name (e.g. content_mimetype, content_language, etc.)
  • data (dict) – Dict of data to read from
  • hash_keys ([str]) – List of keys to read in the data dict.
Yields:

The data which is missing from the db.
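The "missing" check can be pictured with plain Python containers. In this sketch a set of key tuples stands in for the database table, data is treated as an iterable of dicts, and the function name is hypothetical; the real method runs against the database instead.

```python
def missing_from_list_sketch(stored_keys, data, hash_keys):
    """Yield the key tuples of `data` that are absent from `stored_keys`.

    `stored_keys` is a set of tuples standing in for the database table
    (an assumption made for this sketch).
    """
    for entry in data:
        key = tuple(entry[k] for k in hash_keys)
        if key not in stored_keys:
            yield key


stored = {(b'a', 1), (b'b', 1)}
candidates = [
    {'id': b'a', 'indexer_configuration_id': 1},
    {'id': b'a', 'indexer_configuration_id': 2},
    {'id': b'c', 'indexer_configuration_id': 1},
]
missing = list(missing_from_list_sketch(
    stored, candidates, ['id', 'indexer_configuration_id']))
```

The same content id counts as missing when it was indexed only by a different tool, since the key includes indexer_configuration_id.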

content_mimetype_missing_from_list(mimetypes, cur=None)[source]

List missing mimetypes.

content_mimetype_cols = ['id', 'mimetype', 'encoding', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_mimetype(cur=None)[source]
content_mimetype_add_from_temp(conflict_update, cur=None)[source]
_convert_key(key, main_table='c')[source]

Convert keys according to specific use in the module.

Parameters:
  • key (str) – Key expression to change according to the alias used in the query
  • main_table (str) – Alias to use for the main table. Defaults to c for content_{something}.
Expected:
Tables content_{something} being aliased as ‘c’ (something in {language, mimetype, …}), table indexer_configuration being aliased as ‘i’.
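A key conversion of this kind might look like the sketch below. The two special cases are assumptions chosen for illustration (the documentation above only states the aliasing convention), and convert_key_sketch is a hypothetical name.

```python
def convert_key_sketch(key, main_table='c'):
    """Qualify a column name with the table alias used in the query.

    The special cases below are assumptions for illustration only.
    """
    if key == 'id':
        # The id column is assumed to live on the main (content_*) table.
        return '%s.id' % main_table
    if key == 'tool_id':
        # The tool id is assumed to come from indexer_configuration ('i').
        return 'i.id as tool_id'
    return key


cols = ['id', 'mimetype', 'tool_id']
select_list = ', '.join(convert_key_sketch(c) for c in cols)
```

Here select_list is the comma-separated column list that a SELECT over the two aliased tables could use.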
_get_from_list(table, ids, cols, cur=None, id_col='id')[source]

Fetches entries from the table such that their id field (or whatever is given to id_col) is in ids. Returns the columns cols. The cursor cur is used to connect to the database.
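The lookup can be sketched with the standard library's sqlite3 module standing in for the PostgreSQL database. The function name, table and data are hypothetical, and the query shape is an assumption rather than the SQL the module actually issues.

```python
import sqlite3


def get_from_list_sketch(cur, table, ids, cols, id_col='id'):
    """Yield rows of `cols` from `table` whose `id_col` is in `ids`.

    `table`, `cols` and `id_col` are interpolated into the SQL text, so
    they must come from trusted code; only `ids` is passed as parameters.
    """
    ids = list(ids)
    if not ids:
        return
    placeholders = ', '.join('?' for _ in ids)
    query = 'SELECT %s FROM %s WHERE %s IN (%s) ORDER BY %s' % (
        ', '.join(cols), table, id_col, placeholders, id_col)
    cur.execute(query, ids)
    yield from cur.fetchall()


conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE content_language (id BLOB PRIMARY KEY, lang TEXT)')
cur.executemany(
    'INSERT INTO content_language VALUES (?, ?)',
    [(b'a', 'python'), (b'b', 'c'), (b'c', 'haskell')])
rows = list(get_from_list_sketch(
    cur, 'content_language', [b'a', b'c', b'z'], ['id', 'lang']))
```

Ids that are not stored (b'z' here) simply produce no row.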

content_indexer_names = {'fossology_license': 'content_fossology_license', 'mimetype': 'content_mimetype'}
content_get_range(content_type, start, end, indexer_configuration_id, limit=1000, with_textual_data=False, cur=None)[source]

Retrieve contents of type content_type within range [start, end], bound by limit and associated with the given indexer configuration id.

When with_textual_data is true, the results are additionally filtered through the mimetype table to keep only contents whose mimetype is not binary.

content_mimetype_get_from_list(ids, cur=None)[source]
content_language_hash_keys = ['id', 'indexer_configuration_id']
content_language_missing_from_list(languages, cur=None)[source]

List missing languages.

content_language_cols = ['id', 'lang', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_language(cur=None)[source]
content_language_add_from_temp(conflict_update, cur=None)[source]
content_language_get_from_list(ids, cur=None)[source]
content_ctags_hash_keys = ['id', 'indexer_configuration_id']
content_ctags_missing_from_list(ctags, cur=None)[source]

List missing ctags.

content_ctags_cols = ['id', 'name', 'kind', 'line', 'lang', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_ctags(cur=None)[source]
content_ctags_add_from_temp(conflict_update, cur=None)[source]
content_ctags_get_from_list(ids, cur=None)[source]
content_fossology_license_cols = ['id', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration', 'licenses']
mktemp_content_fossology_license(cur=None)[source]
content_fossology_license_add_from_temp(conflict_update, cur=None)[source]

Add new licenses per content.

content_fossology_license_get_from_list(ids, cur=None)[source]

Retrieve licenses per id.

content_metadata_hash_keys = ['id', 'indexer_configuration_id']
content_metadata_missing_from_list(metadata, cur=None)[source]

List missing metadata.

content_metadata_cols = ['id', 'metadata', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_metadata(cur=None)[source]
content_metadata_add_from_temp(conflict_update, cur=None)[source]
content_metadata_get_from_list(ids, cur=None)[source]
revision_intrinsic_metadata_hash_keys = ['id', 'indexer_configuration_id']
revision_intrinsic_metadata_missing_from_list(metadata, cur=None)[source]

List missing metadata.

revision_intrinsic_metadata_cols = ['id', 'metadata', 'mappings', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_revision_intrinsic_metadata(cur=None)[source]
revision_intrinsic_metadata_add_from_temp(conflict_update, cur=None)[source]
revision_intrinsic_metadata_delete(entries, cur=None)[source]
revision_intrinsic_metadata_get_from_list(ids, cur=None)[source]
origin_intrinsic_metadata_cols = ['id', 'metadata', 'from_revision', 'mappings', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
origin_intrinsic_metadata_regconfig = 'pg_catalog.simple'

The dictionary used to normalize ‘metadata’ and queries. ‘pg_catalog.simple’ provides no stopwords, so it should be suitable for proper names and non-English content. When updating this value, make sure to add a new index on origin_intrinsic_metadata.metadata.

mktemp_origin_intrinsic_metadata(cur=None)[source]
origin_intrinsic_metadata_add_from_temp(conflict_update, cur=None)[source]
origin_intrinsic_metadata_delete(entries, cur=None)[source]
origin_intrinsic_metadata_get_from_list(orig_ids, cur=None)[source]
origin_intrinsic_metadata_search_fulltext(terms, *, limit, cur)[source]
origin_intrinsic_metadata_search_by_producer(start, end, limit, ids_only, mappings, tool_ids, cur)[source]
indexer_configuration_cols = ['id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_indexer_configuration(cur=None)[source]
indexer_configuration_add_from_temp(cur=None)[source]
indexer_configuration_get(tool_name, tool_version, tool_configuration, cur=None)[source]

swh.indexer.storage.in_memory module

swh.indexer.storage.in_memory._transform_tool(tool)[source]
class swh.indexer.storage.in_memory.SubStorage(tools)[source]

Bases: object

Implements common missing/get/add logic for each indexer type.

__init__(tools)[source]

Initialize self. See help(type(self)) for accurate signature.

missing(ids)[source]

List data missing from storage.

Parameters:ids (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
get(ids)[source]

Retrieve data per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dict

dictionaries with the following keys:

  • id (bytes)
  • tool (dict): tool used to compute metadata
  • arbitrary data (as provided to add)
get_all()[source]
get_range(start, end, indexer_configuration_id, limit)[source]

Retrieve data within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1, if any

Return type:

dict
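The range-plus-next contract can be illustrated with a small paginated lookup over a sorted list. This is a sketch under stated assumptions: the sorted list stands in for the storage, the filter on indexer_configuration_id is left out, and get_range_sketch is a hypothetical name.

```python
import bisect


def get_range_sketch(sorted_ids, start, end, limit):
    """Return up to `limit` ids in [start, end] plus the next id, if any.

    `sorted_ids` is a sorted list of bytes identifiers standing in for the
    storage (an assumption made for this sketch).
    """
    if limit is None:
        raise ValueError('limit should not be None')
    lo = bisect.bisect_left(sorted_ids, start)
    hi = bisect.bisect_right(sorted_ids, end)
    in_range = sorted_ids[lo:hi]
    ids = in_range[:limit]
    # If more ids remain in the range, the next page starts at the first
    # id that did not fit.
    next_id = in_range[limit] if len(in_range) > limit else None
    return {'ids': ids, 'next': next_id}


all_ids = [b'\x01', b'\x02', b'\x03', b'\x04', b'\x05']
page1 = get_range_sketch(all_ids, b'\x01', b'\x04', limit=2)
page2 = get_range_sketch(all_ids, page1['next'], b'\x04', limit=2)
```

A caller pages through a range by passing the previous result's next value as the new start until next is None.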

add(data, conflict_update)[source]

Add data not present in storage.

Parameters:
  • data (iterable) –

    dictionaries with keys:

    • id: sha1
    • indexer_configuration_id: tool used to compute the results
    • arbitrary data
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false)
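The effect of conflict_update can be sketched with a dict keyed by (id, indexer_configuration_id). The store layout and the name add_sketch are assumptions for illustration, not the class's internal representation.

```python
def add_sketch(store, data, conflict_update):
    """Insert entries into `store`, a dict keyed by (id, tool id).

    Existing keys are overwritten only when `conflict_update` is true.
    """
    for entry in data:
        key = (entry['id'], entry['indexer_configuration_id'])
        if key in store and not conflict_update:
            continue  # skip duplicates
        store[key] = dict(entry)


store = {}
add_sketch(store, [{'id': b'a', 'indexer_configuration_id': 1, 'lang': 'c'}],
           conflict_update=False)
add_sketch(store, [{'id': b'a', 'indexer_configuration_id': 1, 'lang': 'go'}],
           conflict_update=False)
skipped = store[(b'a', 1)]['lang']
add_sketch(store, [{'id': b'a', 'indexer_configuration_id': 1, 'lang': 'go'}],
           conflict_update=True)
overwritten = store[(b'a', 1)]['lang']
```

The second call leaves the stored entry untouched, while the third replaces it.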
add_merge(new_data, conflict_update, merged_key)[source]
delete(entries)[source]

class swh.indexer.storage.in_memory.IndexerStorage[source]

Bases: object

In-memory SWH indexer storage.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

content_mimetype_missing(mimetypes)[source]

Generate mimetypes missing from storage.

Parameters:mimetypes (iterable) –

iterable of dict with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:tuple (id, indexer_configuration_id) – missing id
content_mimetype_get_range(start, end, indexer_configuration_id, limit=1000)[source]

Retrieve mimetypes within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (defaults to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1, if any

Return type:

dict

content_mimetype_add(mimetypes, conflict_update=False)[source]

Add mimetypes not present in storage.

Parameters:
  • mimetypes (iterable) –

    dictionaries with keys:

    • id (bytes): sha1 identifier
    • mimetype (bytes): raw content’s mimetype
    • encoding (bytes): raw content’s encoding
    • indexer_configuration_id (int): tool’s id used to compute the results
  • conflict_update (bool) – Flag to determine if we want to overwrite (True) or skip duplicates (False, the default)
content_mimetype_get(ids, db=None, cur=None)[source]

Retrieve full content mimetype per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

mimetypes (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • mimetype (bytes): raw content’s mimetype
  • encoding (bytes): raw content’s encoding
  • tool (dict): Tool used to compute the mimetype
content_language_missing(languages)[source]

List languages missing from storage.

Parameters:languages (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:an iterable of missing id for the tuple (id, indexer_configuration_id)
content_language_get(ids)[source]

Retrieve full content language per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

languages (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • lang (bytes): raw content’s language
  • tool (dict): Tool used to compute the language
content_language_add(languages, conflict_update=False)[source]

Add languages not present in storage.

Parameters:
  • languages (iterable) –

    dictionaries with keys:

    • id (bytes): sha1
    • lang (bytes): language detected
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
content_ctags_missing(ctags)[source]

List ctags missing from storage.

Parameters:ctags (iterable) –

dicts with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:an iterable of missing id for the tuple (id, indexer_configuration_id)
content_ctags_get(ids)[source]

Retrieve ctags per id.

Parameters:ids (iterable) – sha1 checksums
Yields:Dictionaries with keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • lang (str): language for that content
  • tool (dict): tool used to compute the ctags’ info
content_ctags_add(ctags, conflict_update=False)[source]

Add ctags not present in storage

Parameters:ctags (iterable) –

dictionaries with keys:

  • id (bytes): sha1
  • ctags ([dict]): List of dictionaries with keys: name, kind, line, lang
  • indexer_configuration_id: tool used to compute the results

content_ctags_search(expression, limit=10, last_sha1='')[source]

Search through content’s raw ctags symbols.

Parameters:
  • expression (str) – Expression to search for
  • limit (int) – Number of rows to return (default to 10).
  • last_sha1 (str) – Offset from which to retrieve data (defaults to ‘’).
Yields:

rows of ctags including id, name, lang, kind, line, etc…

content_fossology_license_get(ids)[source]

Retrieve licenses per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dict{id: facts} where facts is a dict with the following keys:

  • licenses ([str]): associated licenses for that content
  • tool (dict): Tool used to compute the license
content_fossology_license_add(licenses, conflict_update=False)[source]

Add licenses not present in storage.

Parameters:
  • licenses (iterable) –

    dictionaries with keys:

    • id: sha1
    • licenses ([bytes]): List of licenses associated to sha1
    • tool (str): nomossa
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
Returns:

content_license entries which failed due to unknown licenses

Return type:

list

content_fossology_license_get_range(start, end, indexer_configuration_id, limit=1000)[source]

Retrieve licenses within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (defaults to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1, if any

Return type:

dict

content_metadata_missing(metadata)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
content_metadata_get(ids)[source]

Retrieve metadata per id.

Parameters:ids (iterable) – sha1 checksums
Yields:dictionaries with the following keys:
  • id (bytes)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
content_metadata_add(metadata, conflict_update=False)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute the results
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
revision_intrinsic_metadata_missing(metadata)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1_git revision identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids
revision_intrinsic_metadata_get(ids)[source]

Retrieve revision metadata per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dictionaries with the following keys

  • id (bytes)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
revision_intrinsic_metadata_add(metadata, conflict_update=False)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1_git of revision
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
    • mappings (List[str]): list of mappings used to translate these metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
revision_intrinsic_metadata_delete(entries)[source]

Remove revision metadata from the storage.

Parameters:entries (dict) –

dictionaries with the following keys:

  • revision (int): origin identifier
  • id (int): tool used to compute metadata
origin_intrinsic_metadata_get(ids)[source]

Retrieve origin metadata per id.

Parameters:

ids (iterable) – origin identifiers

Yields:

list

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
origin_intrinsic_metadata_add(metadata, conflict_update=False)[source]

Add origin metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: origin identifier
    • from_revision: sha1 id of the revision used to generate these metadata.
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
    • mappings (List[str]): list of mappings used to translate these metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_delete(entries)[source]

Remove origin metadata from the storage.

Parameters:entries (dict) –

dictionaries with the following keys:

  • id (int): origin identifier
  • indexer_configuration_id (int): tool used to compute metadata
origin_intrinsic_metadata_search_fulltext(conjunction, limit=100)[source]

Returns the list of origins whose metadata contain all the terms.

Parameters:
  • conjunction (List[str]) – List of terms to be searched for.
  • limit (int) – The maximum number of results to return
Yields:

list

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
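The "all terms must match" semantics of conjunction can be illustrated with a naive in-memory search. The matching rule (case-insensitive substring test on the JSON dump of the metadata) is an assumption of this sketch, not how either storage backend tokenizes or ranks results, and search_fulltext_sketch is a hypothetical name.

```python
import json


def search_fulltext_sketch(entries, conjunction, limit=100):
    """Yield entries whose metadata mention every term in `conjunction`.

    Matching is a naive case-insensitive substring test on the JSON dump
    of the metadata (an assumption made for this sketch).
    """
    terms = [term.lower() for term in conjunction]
    found = 0
    for entry in entries:
        if found >= limit:
            return
        text = json.dumps(entry['metadata']).lower()
        if all(term in text for term in terms):
            found += 1
            yield entry


entries = [
    {'id': 1, 'metadata': {'name': 'Foo Parser', 'author': 'Jane Doe'}},
    {'id': 2, 'metadata': {'name': 'Bar Parser', 'author': 'John Doe'}},
    {'id': 3, 'metadata': {'name': 'Foo Lexer', 'author': 'Jane Roe'}},
]
hits = [e['id'] for e in search_fulltext_sketch(entries, ['foo', 'jane'])]
```

An entry that matches only some of the terms, such as entry 2 above, is not returned.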
origin_intrinsic_metadata_search_by_producer(start=0, end=None, limit=100, ids_only=False, mappings=None, tool_ids=None, db=None, cur=None)[source]

Returns the list of origins, within the given id range, whose intrinsic metadata were generated using the given mappings.

Parameters:
  • start (int) – The minimum origin id to return
  • end (int) – The maximum origin id to return
  • limit (int) – The maximum number of results to return
  • ids_only (bool) – Determines whether only origin ids are returned or the content as well
  • mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.
Yields:

list

list of origin ids (int) if ids_only=True, else

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
origin_intrinsic_metadata_stats()[source]

Returns statistics on stored intrinsic metadata.

Returns:dictionary with keys:
  • total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)
  • non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from
  • per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings.
Return type:dict
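The three statistics can be computed as in the sketch below, which assumes each stored entry is a dict with metadata and mappings keys. The function name and the sample data are hypothetical.

```python
from collections import Counter


def stats_sketch(entries):
    """Compute total, non_empty and per_mapping counts over `entries`.

    Each entry is assumed to be a dict with 'metadata' and 'mappings' keys.
    """
    per_mapping = Counter()
    non_empty = 0
    for entry in entries:
        if entry['metadata']:
            non_empty += 1
        # An origin counts once per distinct mapping used to index it.
        per_mapping.update(set(entry['mappings']))
    return {
        'total': len(entries),
        'non_empty': non_empty,
        'per_mapping': dict(per_mapping),
    }


stats = stats_sketch([
    {'metadata': {'name': 'foo'}, 'mappings': ['npm']},
    {'metadata': {}, 'mappings': []},
    {'metadata': {'name': 'bar'}, 'mappings': ['npm', 'codemeta']},
])
```

Because one origin may use several mappings or none, the per_mapping counts need not sum to total.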
indexer_configuration_add(tools)[source]

Add new tools to the storage.

Parameters:tools ([dict]) –

List of dictionaries, each representing a tool to insert in the db, with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
Return type:list
indexer_configuration_get(tool)[source]

Retrieve tool information.

Parameters:tool (dict) –

Dictionary representing a tool with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:The same dictionary with an id key, None otherwise.
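Looking a tool up by its three descriptive fields requires a hashable key, although tool_configuration is a free-form dict. One way to build such a key is sketched below; the function names and the sample tool are hypothetical, and serializing the configuration as JSON with sorted keys is an assumption of this sketch.

```python
import json


def tool_key_sketch(tool):
    """Build a hashable key from a tool description.

    The configuration dict is serialized with sorted keys so that two
    equal configurations map to the same key.
    """
    return (
        tool['tool_name'],
        tool['tool_version'],
        json.dumps(tool['tool_configuration'], sort_keys=True),
    )


def configuration_get_sketch(tools, tool):
    """Return the stored tool (with its id) matching `tool`, else None."""
    return tools.get(tool_key_sketch(tool))


tool = {'tool_name': 'file', 'tool_version': '5.22',
        'tool_configuration': {'command_line': 'file --mime', 'x': 1}}
tools = {tool_key_sketch(tool): dict(tool, id=1)}
same_tool_reordered = {
    'tool_name': 'file', 'tool_version': '5.22',
    'tool_configuration': {'x': 1, 'command_line': 'file --mime'}}
found = configuration_get_sketch(tools, same_tool_reordered)
not_found = configuration_get_sketch(tools, dict(tool, tool_version='6.0'))
```

Sorting the keys makes the lookup insensitive to the order in which the configuration dict was built.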
_tool_key(tool)[source]

Module contents

swh.indexer.storage.get_indexer_storage(cls, args)[source]

Get an indexer storage object of class cls, constructed with the arguments args.

Parameters:
  • cls (str) – storage’s class, either ‘local’ or ‘remote’
  • args (dict) – dictionary of arguments passed to the storage class constructor
Returns:

an instance of swh.indexer’s storage (either local or remote)

Raises:

ValueError if passed an unknown storage class.
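The dispatch on cls can be sketched as a small factory. Everything here is hypothetical: the stub classes stand in for the real local and remote storage classes, the connection string is a placeholder, and the error message is not the library's.

```python
class LocalStorageStub:
    """Hypothetical stand-in for a local storage class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class RemoteStorageStub:
    """Hypothetical stand-in for a remote storage client class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


STORAGE_CLASSES = {'local': LocalStorageStub, 'remote': RemoteStorageStub}


def get_storage_sketch(cls, args):
    """Instantiate the storage class named `cls` with keyword `args`."""
    try:
        storage_class = STORAGE_CLASSES[cls]
    except KeyError:
        raise ValueError('Unknown indexer storage class %r' % cls) from None
    return storage_class(**args)


# 'dbname=example' is a placeholder connection string.
storage = get_storage_sketch('local', {'db': 'dbname=example'})
try:
    get_storage_sketch('bogus', {})
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```

Keeping the class names in a dict means an unknown name fails early with a ValueError rather than deep inside a constructor.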

swh.indexer.storage._check_id_duplicates(data)[source]

If any two dictionaries in data have the same id, raises a ValueError.

Values associated to the key must be hashable.

Parameters:data (List[dict]) – List of dictionaries to be inserted
>>> _check_id_duplicates([
...     {'id': 'foo', 'data': 'spam'},
...     {'id': 'bar', 'data': 'egg'},
... ])
>>> _check_id_duplicates([
...     {'id': 'foo', 'data': 'spam'},
...     {'id': 'foo', 'data': 'egg'},
... ])
Traceback (most recent call last):
  ...
ValueError: The same id is present more than once.
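A check with the behavior shown in the doctest above could be written as follows. This is a sketch, not the function's actual source, and check_id_duplicates_sketch is a hypothetical name; counting with a Counter is also why the id values must be hashable.

```python
from collections import Counter


def check_id_duplicates_sketch(data):
    """Raise ValueError if two dictionaries in `data` share the same id."""
    counts = Counter(item['id'] for item in data)
    if any(count > 1 for count in counts.values()):
        raise ValueError('The same id is present more than once.')


check_id_duplicates_sketch([{'id': 'foo'}, {'id': 'bar'}])  # no error
try:
    check_id_duplicates_sketch([{'id': 'foo'}, {'id': 'foo'}])
    raised = False
except ValueError:
    raised = True
```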
class swh.indexer.storage.IndexerStorage(db, min_pool_conns=1, max_pool_conns=10)[source]

Bases: object

SWH Indexer Storage

__init__(db, min_pool_conns=1, max_pool_conns=10)[source]
Parameters:db – either a libpq connection string, or a psycopg2 connection
get_db()[source]
put_db(db)[source]
check_config(*, check_write, db=None, cur=None)[source]

Check that the storage is configured and ready to go.

content_mimetype_missing(mimetypes, db=None, cur=None)[source]

Generate mimetypes missing from storage.

Parameters:mimetypes (iterable) –

iterable of dict with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:tuple (id, indexer_configuration_id) – missing id
_content_get_range(content_type, start, end, indexer_configuration_id, limit=1000, with_textual_data=False, db=None, cur=None)[source]

Retrieve ids of type content_type within range [start, end] bound by limit.

Parameters:
  • content_type (str) – content’s type (mimetype, language, etc.)
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (defaults to 1000)
  • with_textual_data (bool) – Deal with only textual content (True) or all content (False, the default)
Raises:

ValueError if limit is None or if an unknown content_type is provided

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1, if any

Return type:

dict

content_mimetype_get_range(start, end, indexer_configuration_id, limit=1000, db=None, cur=None)[source]

Retrieve mimetypes within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (defaults to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1, if any

Return type:

dict

content_mimetype_add(mimetypes, conflict_update=False, db=None, cur=None)[source]

Add mimetypes not present in storage.

Parameters:
  • mimetypes (iterable) –

    dictionaries with keys:

    • id (bytes): sha1 identifier
    • mimetype (bytes): raw content’s mimetype
    • encoding (bytes): raw content’s encoding
    • indexer_configuration_id (int): tool’s id used to compute the results
  • conflict_update (bool) – Flag to determine if we want to overwrite (True) or skip duplicates (False, the default)
content_mimetype_get(ids, db=None, cur=None)[source]

Retrieve full content mimetype per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

mimetypes (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • mimetype (bytes): raw content’s mimetype
  • encoding (bytes): raw content’s encoding
  • tool (dict): Tool used to compute the mimetype
content_language_missing(languages, db=None, cur=None)[source]

List languages missing from storage.

Parameters:languages (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:an iterable of missing id for the tuple (id, indexer_configuration_id)
content_language_get(ids, db=None, cur=None)[source]

Retrieve full content language per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

languages (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • lang (bytes): raw content’s language
  • tool (dict): Tool used to compute the language
content_language_add(languages, conflict_update=False, db=None, cur=None)[source]

Add languages not present in storage.

Parameters:
  • languages (iterable) –

    dictionaries with keys:

    • id (bytes): sha1
    • lang (bytes): language detected
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
content_ctags_missing(ctags, db=None, cur=None)[source]

List ctags missing from storage.

Parameters:ctags (iterable) –

dicts with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:an iterable of missing id for the tuple (id, indexer_configuration_id)
content_ctags_get(ids, db=None, cur=None)[source]

Retrieve ctags per id.

Parameters:ids (iterable) – sha1 checksums
Yields:Dictionaries with keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • lang (str): language for that content
  • tool (dict): tool used to compute the ctags’ info
content_ctags_add(ctags, conflict_update=False, db=None, cur=None)[source]

Add ctags not present in storage

Parameters:ctags (iterable) –

dictionaries with keys:

  • id (bytes): sha1
  • ctags ([dict]): List of dictionaries with keys: name, kind, line, lang

content_ctags_search(expression, limit=10, last_sha1='', db=None, cur=None)[source]

Search through content’s raw ctags symbols.

Parameters:
  • expression (str) – Expression to search for
  • limit (int) – Number of rows to return (default to 10).
  • last_sha1 (str) – Offset from which to retrieve data (defaults to ‘’).
Yields:

rows of ctags including id, name, lang, kind, line, etc…

content_fossology_license_get(ids, db=None, cur=None)[source]

Retrieve licenses per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dict{id: facts} where facts is a dict with the following keys:

  • licenses ([str]): associated licenses for that content
  • tool (dict): Tool used to compute the license
content_fossology_license_add(licenses, conflict_update=False, db=None, cur=None)[source]

Add licenses not present in storage.

Parameters:
  • licenses (iterable) –

    dictionaries with keys:

    • id: sha1
    • licenses ([bytes]): List of licenses associated to sha1
    • tool (str): nomossa
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
Returns:

content_license entries which failed due to unknown licenses

Return type:

list
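
As a rough sketch of the return value described above, the following in-memory stand-in rejects entries that mention a license it does not know and returns them to the caller. The names add_licenses and KNOWN_LICENSES are hypothetical, and rejecting a whole entry when any one of its licenses is unknown is an assumption; the source only states that failed entries are returned.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical set of licenses the storage already knows about.
KNOWN_LICENSES: Set[bytes] = {b"MIT", b"GPL-3.0"}


def add_licenses(store: Dict[bytes, Set[bytes]], entries: Iterable[Dict]) -> List[Dict]:
    """Store entries with known licenses; return the entries that failed."""
    failed = []
    for entry in entries:
        if any(lic not in KNOWN_LICENSES for lic in entry["licenses"]):
            # Assumption: one unknown license fails the whole entry.
            failed.append(entry)
            continue
        store.setdefault(entry["id"], set()).update(entry["licenses"])
    return failed


store: Dict[bytes, Set[bytes]] = {}
failed = add_licenses(
    store,
    [
        {"id": b"\x01" * 20, "licenses": [b"MIT"], "tool": "nomossa"},
        {"id": b"\x02" * 20, "licenses": [b"Not-A-License"], "tool": "nomossa"},
    ],
)
```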

content_fossology_license_get_range(start, end, indexer_configuration_id, limit=1000, db=None, cur=None)[source]

Retrieve licenses within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (default to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids [bytes]: iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1 if any

Return type:

dict
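
A caller would typically walk a large range by feeding each result's next value back in as the new start. The sketch below shows that loop against a hypothetical in-memory stand-in; get_range, iter_range and IDS are names invented for the example, the indexer_configuration_id argument is left out for brevity, and treating both bounds as inclusive follows the "[start, end]" notation above.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical sorted content ids standing in for the indexed table.
IDS: List[bytes] = [bytes([i]) * 20 for i in range(1, 8)]


def get_range(start: bytes, end: bytes, limit: int = 1000) -> Dict:
    """Return up to `limit` ids in [start, end] plus the id to resume from."""
    if limit is None:
        raise ValueError("limit must not be None")
    in_range = [i for i in IDS if start <= i <= end]
    next_id: Optional[bytes] = in_range[limit] if len(in_range) > limit else None
    return {"ids": in_range[:limit], "next": next_id}


def iter_range(start: bytes, end: bytes, limit: int) -> Iterator[bytes]:
    """Yield every id in [start, end], one page at a time."""
    current: Optional[bytes] = start
    while current is not None:
        result = get_range(current, end, limit)
        yield from result["ids"]
        current = result["next"]


all_ids = list(iter_range(IDS[0], IDS[-1], limit=3))
```

Because next is the first id of the following page rather than the last id of the current one, no id is returned twice and none is skipped.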

content_metadata_missing(metadata, db=None, cur=None)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
content_metadata_get(ids, db=None, cur=None)[source]

Retrieve metadata per id.

Parameters:ids (iterable) – sha1 checksums
Yields:dictionaries with the following keys:

  • id (bytes)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
content_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1
    • metadata: arbitrary dict
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
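
The effect of conflict_update can be sketched with a plain dictionary: existing entries are kept unless the flag is set. The function add_entries and the choice to key the store by id alone are assumptions for this illustration; the real storage may well distinguish entries by tool as well.

```python
from typing import Dict, Iterable


def add_entries(
    store: Dict[bytes, dict], entries: Iterable[dict], conflict_update: bool = False
) -> None:
    """Insert entries; overwrite duplicates only when `conflict_update` is true."""
    for entry in entries:
        if conflict_update or entry["id"] not in store:
            store[entry["id"]] = entry["metadata"]


store: Dict[bytes, dict] = {}
sha1 = b"\x01" * 20  # hypothetical content identifier

add_entries(store, [{"id": sha1, "metadata": {"name": "old"}}])
# Duplicate is skipped by default.
add_entries(store, [{"id": sha1, "metadata": {"name": "new"}}])
after_skip = dict(store)
# Duplicate overwrites the stored value when the flag is set.
add_entries(store, [{"id": sha1, "metadata": {"name": "new"}}], conflict_update=True)
after_update = dict(store)
```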
revision_intrinsic_metadata_missing(metadata, db=None, cur=None)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1_git revision identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids
revision_intrinsic_metadata_get(ids, db=None, cur=None)[source]

Retrieve revision metadata per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dictionaries with the following keys:

  • id (bytes)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata

revision_intrinsic_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1_git of revision
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
    • mappings (List[str]): list of mappings used to translate these metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
revision_intrinsic_metadata_delete(entries, db=None, cur=None)[source]

Remove revision metadata from the storage.

Parameters:entries (dict) –

dictionaries with the following keys:

  • id (bytes): revision identifier
  • indexer_configuration_id (int): tool used to compute metadata
origin_intrinsic_metadata_get(ids, db=None, cur=None)[source]

Retrieve origin metadata per id.

Parameters:

ids (iterable) – origin identifiers

Yields:

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
origin_intrinsic_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add origin metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: origin identifier
    • from_revision: sha1 id of the revision used to generate these metadata.
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
    • mappings (List[str]): list of mappings used to translate these metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_delete(entries, db=None, cur=None)[source]

Remove origin metadata from the storage.

Parameters:entries (dict) –

dictionaries with the following keys:

  • id (int): origin identifier
  • indexer_configuration_id (int): tool used to compute metadata
origin_intrinsic_metadata_search_fulltext(conjunction, limit=100, db=None, cur=None)[source]

Returns the list of origins whose metadata contain all the terms.

Parameters:
  • conjunction (List[str]) – List of terms to be searched for.
  • limit (int) – The maximum number of results to return
Yields:

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
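
"Contain all the terms" means the search is a conjunction: an origin matches only if every term is found. The sketch below shows that rule with a naive, case-insensitive substring match over hypothetical in-memory data. The names search_fulltext and ORIGINS are invented for the example, and the substring matching is only a stand-in; it makes no claim about how the storage actually performs full-text search.

```python
import json
from typing import Dict, Iterator, List

# Hypothetical indexed origins.
ORIGINS: List[Dict] = [
    {"id": 1, "metadata": {"name": "foo parser", "license": "MIT"}},
    {"id": 2, "metadata": {"name": "bar parser", "license": "GPL-3.0"}},
    {"id": 3, "metadata": {"name": "foo lexer", "license": "MIT"}},
]


def search_fulltext(conjunction: List[str], limit: int = 100) -> Iterator[Dict]:
    """Yield up to `limit` origins whose metadata mention every term."""
    count = 0
    for origin in ORIGINS:
        if count >= limit:
            return
        text = json.dumps(origin["metadata"]).lower()
        if all(term.lower() in text for term in conjunction):
            yield origin
            count += 1


hits = [origin["id"] for origin in search_fulltext(["foo", "mit"])]
```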
origin_intrinsic_metadata_search_by_producer(start=0, end=None, limit=100, ids_only=False, mappings=None, tool_ids=None, db=None, cur=None)[source]

Returns the list of origins in the given id range whose intrinsic metadata were generated using at least one of the given mappings.

Parameters:
  • start (int) – The minimum origin id to return
  • end (int) – The maximum origin id to return
  • limit (int) – The maximum number of results to return
  • ids_only (bool) – Determines whether only origin ids are returned or the content as well
  • mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings.
Yields:

list of origin ids (int) if ids_only=True, else dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
  • mappings (List[str]): list of mappings used to translate these metadata
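
The documented parameters combine an id window, a result cap and an "at least one of these mappings" filter. A minimal in-memory sketch of that combination follows. The names search_by_producer and ENTRIES and the mapping names in the data are hypothetical, and three behaviours are assumptions: start and end are treated as inclusive, mappings=None means no filtering, and the undocumented tool_ids parameter is left out.

```python
from typing import Dict, List, Optional

# Hypothetical indexed origins and the mappings used for each.
ENTRIES: List[Dict] = [
    {"id": 1, "metadata": {"name": "a"}, "mappings": ["npm"]},
    {"id": 2, "metadata": {"name": "b"}, "mappings": ["pkg-info"]},
    {"id": 3, "metadata": {"name": "c"}, "mappings": ["npm", "codemeta"]},
    {"id": 4, "metadata": {}, "mappings": []},
]


def search_by_producer(
    start: int = 0,
    end: Optional[int] = None,
    limit: int = 100,
    ids_only: bool = False,
    mappings: Optional[List[str]] = None,
) -> List:
    """Return origins in [start, end] that used at least one given mapping."""
    results: List = []
    for entry in sorted(ENTRIES, key=lambda e: e["id"]):
        if len(results) >= limit:
            break
        if entry["id"] < start or (end is not None and entry["id"] > end):
            continue
        if mappings is not None and not set(mappings) & set(entry["mappings"]):
            continue
        results.append(entry["id"] if ids_only else entry)
    return results


npm_ids = search_by_producer(mappings=["npm"], ids_only=True)
```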
origin_intrinsic_metadata_stats(db=None, cur=None)[source]

Returns counts of indexed metadata per origin, broken down by metadata type.

Returns:dictionary with keys:
  • total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary)
  • non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from
  • per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings.
Return type:dict
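
To make the three counters concrete, the sketch below computes them from a hypothetical in-memory list of indexed origins. The function compute_stats and the sample data are assumptions for illustration; the storage computes these figures from the database.

```python
from collections import Counter
from typing import Dict, List


def compute_stats(entries: List[Dict]) -> Dict:
    """Compute total, non_empty and per_mapping counts for indexed origins."""
    per_mapping: Counter = Counter()
    non_empty = 0
    for entry in entries:
        if entry["metadata"]:
            non_empty += 1
        # An origin counts once per distinct mapping it used.
        for mapping in set(entry["mappings"]):
            per_mapping[mapping] += 1
    return {
        "total": len(entries),
        "non_empty": non_empty,
        "per_mapping": dict(per_mapping),
    }


# Hypothetical data: the second origin was indexed but produced no metadata.
stats = compute_stats(
    [
        {"metadata": {"name": "a"}, "mappings": ["npm"]},
        {"metadata": {}, "mappings": []},
        {"metadata": {"name": "b"}, "mappings": ["npm", "codemeta"]},
    ]
)
```

Since one origin may use several mappings or none, the per_mapping values need not add up to total.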
indexer_configuration_add(tools, db=None, cur=None)[source]

Add new tools to the storage.

Parameters:tools ([dict]) –

List of dictionaries, each representing a tool to insert in the db, with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
indexer_configuration_get(tool, db=None, cur=None)[source]

Retrieve tool information.

Parameters:tool (dict) –

Dictionary representing a tool with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:The same dictionary with an added id key if the tool is found, None otherwise.
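
The add and get calls above can be pictured as a small registry that assigns an id to each distinct (tool_name, tool_version, tool_configuration) triple. The class ToolRegistry below is a hypothetical in-memory stand-in, and the tool version and configuration values are made up for the example. It also assumes that re-adding a known tool returns its existing id, which the source does not state.

```python
import json
from typing import Dict, List, Optional, Tuple


class ToolRegistry:
    """In-memory stand-in for the indexer configuration table."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str, str], int] = {}
        self._next_id = 1

    @staticmethod
    def _key(tool: Dict) -> Tuple[str, str, str]:
        # The free-form configuration is serialized so it can be part of the key.
        config = json.dumps(tool["tool_configuration"], sort_keys=True)
        return (tool["tool_name"], tool["tool_version"], config)

    def add(self, tools: List[Dict]) -> List[Dict]:
        """Register tools and return them with their `id` key filled in."""
        added = []
        for tool in tools:
            key = self._key(tool)
            if key not in self._ids:
                self._ids[key] = self._next_id
                self._next_id += 1
            added.append({**tool, "id": self._ids[key]})
        return added

    def get(self, tool: Dict) -> Optional[Dict]:
        """Return the tool with its `id`, or None if it is not registered."""
        tool_id = self._ids.get(self._key(tool))
        return None if tool_id is None else {**tool, "id": tool_id}


registry = ToolRegistry()
tool = {
    "tool_name": "nomossa",
    "tool_version": "0.0.0",  # made-up version
    "tool_configuration": {"command_line": "nomossa <filepath>"},  # made-up config
}
added = registry.add([tool])
```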