swh.indexer.storage package

Submodules

swh.indexer.storage.converters module

swh.indexer.storage.converters.ctags_to_db(ctags)[source]

Convert a ctags entry into a list of ctags entries ready to be stored in the db.

Parameters:ctags (dict) –

ctags entry with the following keys:

  • id (bytes): content’s identifier
  • tool_id (int): tool id used to compute ctags
  • ctags ([dict]): List of dictionary with the following keys:
    • name (str): symbol’s name
    • kind (str): symbol’s kind
    • line (int): symbol’s line in the content
    • language (str): language
Returns:list of ctags entries as dicts with the following keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • language (str): language for that content
  • tool_id (int): tool id used to compute ctags
Return type:list
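The flattening described above can be sketched as follows. This is a minimal illustration that follows the documented input and output keys, not the library's implementation; `flatten_ctags` and the sample entry are hypothetical names and data.

```python
def flatten_ctags(entry):
    """Flatten one nested ctags entry into one flat dict per symbol.

    Hypothetical helper mirroring the documented keys only.
    """
    return [
        {
            "id": entry["id"],
            "name": symbol["name"],
            "kind": symbol["kind"],
            "language": symbol["language"],
            "tool_id": entry["tool_id"],
        }
        for symbol in entry["ctags"]
    ]


# Sample entry (made-up data): one content with two symbols.
entry = {
    "id": b"\x01" * 20,
    "tool_id": 1,
    "ctags": [
        {"name": "main", "kind": "function", "line": 10, "language": "C"},
        {"name": "helper", "kind": "function", "line": 42, "language": "C"},
    ],
}
rows = flatten_ctags(entry)
```

Each symbol becomes its own row, and the content id and tool id are repeated on every row.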
swh.indexer.storage.converters.db_to_ctags(ctag)[source]

Convert a ctags entry into a ready ctags entry.

Parameters:ctag (dict) –

ctags entry with the following keys:

  • id (bytes): content’s identifier
  • ctags ([dict]): List of dictionary with the following keys:
    • name (str): symbol’s name
    • kind (str): symbol’s kind
    • line (int): symbol’s line in the content
    • language (str): language
Returns:list of ctags ready entry (dict with the following keys):
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • language (str): language for that content
  • tool (dict): tool used to compute the ctags
Return type:list
swh.indexer.storage.converters.db_to_mimetype(mimetype)[source]

Convert a mimetype entry into a ready mimetype output.

swh.indexer.storage.converters.db_to_language(language)[source]

Convert a language entry into a ready language output.

swh.indexer.storage.converters.db_to_metadata(metadata)[source]

Convert a metadata entry into a ready metadata output.

swh.indexer.storage.converters.db_to_fossology_license(license)[source]

swh.indexer.storage.db module

class swh.indexer.storage.db.Db(conn, pool=None)[source]

Bases: swh.storage.db.BaseDb

Proxy to the SWH Indexer DB, with wrappers around stored procedures

content_mimetype_hash_keys = ['id', 'indexer_configuration_id']
content_mimetype_missing_from_list(mimetypes, cur=None)[source]

List missing mimetypes.

content_mimetype_cols = ['id', 'mimetype', 'encoding', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_mimetype(cur=None)[source]
content_mimetype_add_from_temp(conflict_update, cur=None)[source]
content_indexer_names = {'fossology_license': 'content_fossology_license', 'mimetype': 'content_mimetype'}
content_get_range(content_type, start, end, indexer_configuration_id, limit=1000, with_textual_data=False, cur=None)[source]

Retrieve contents with content_type, within range [start, end] bound by limit and associated to the given indexer configuration id.

When with_textual_data is true, the results are further filtered through the mimetype table to keep only contents whose mimetype is not binary.

content_mimetype_get_from_list(ids, cur=None)[source]
content_language_hash_keys = ['id', 'indexer_configuration_id']
content_language_missing_from_list(languages, cur=None)[source]

List missing languages.

content_language_cols = ['id', 'lang', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_language(cur=None)[source]
content_language_add_from_temp(conflict_update, cur=None)[source]
content_language_get_from_list(ids, cur=None)[source]
content_ctags_hash_keys = ['id', 'indexer_configuration_id']
content_ctags_missing_from_list(ctags, cur=None)[source]

List missing ctags.

content_ctags_cols = ['id', 'name', 'kind', 'line', 'lang', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_ctags(cur=None)[source]
content_ctags_add_from_temp(conflict_update, cur=None)[source]
content_ctags_get_from_list(ids, cur=None)[source]
content_fossology_license_cols = ['id', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration', 'licenses']
mktemp_content_fossology_license(cur=None)[source]
content_fossology_license_add_from_temp(conflict_update, cur=None)[source]

Add new licenses per content.

content_fossology_license_get_from_list(ids, cur=None)[source]

Retrieve licenses per id.

content_metadata_hash_keys = ['id', 'indexer_configuration_id']
content_metadata_missing_from_list(metadata, cur=None)[source]

List missing metadata.

content_metadata_cols = ['id', 'translated_metadata', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_content_metadata(cur=None)[source]
content_metadata_add_from_temp(conflict_update, cur=None)[source]
content_metadata_get_from_list(ids, cur=None)[source]
revision_metadata_hash_keys = ['id', 'indexer_configuration_id']
revision_metadata_missing_from_list(metadata, cur=None)[source]

List missing metadata.

revision_metadata_cols = ['id', 'translated_metadata', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_revision_metadata(cur=None)[source]
revision_metadata_add_from_temp(conflict_update, cur=None)[source]
revision_metadata_get_from_list(ids, cur=None)[source]
origin_intrinsic_metadata_cols = ['origin_id', 'metadata', 'from_revision', 'tool_id', 'tool_name', 'tool_version', 'tool_configuration']
origin_intrinsic_metadata_regconfig = 'pg_catalog.simple'

The dictionary used to normalize ‘metadata’ and queries. ‘pg_catalog.simple’ provides no stopword, so it should be suitable for proper names and non-English content. When updating this value, make sure to add a new index on origin_intrinsic_metadata.metadata.

mktemp_origin_intrinsic_metadata(cur=None)[source]
origin_intrinsic_metadata_add_from_temp(conflict_update, cur=None)[source]
origin_intrinsic_metadata_get_from_list(orig_ids, cur=None)[source]
origin_intrinsic_metadata_search_fulltext(terms, *, limit, cur=None)[source]
indexer_configuration_cols = ['id', 'tool_name', 'tool_version', 'tool_configuration']
mktemp_indexer_configuration(cur=None)[source]
indexer_configuration_add_from_temp(cur=None)[source]
indexer_configuration_get(tool_name, tool_version, tool_configuration, cur=None)[source]

swh.indexer.storage.in_memory module

class swh.indexer.storage.in_memory.SubStorage(tools)[source]

Bases: object

Implements common missing/get/add logic for each indexer type.

missing(ids)[source]

List data missing from storage.

Parameters:ids (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
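As a rough sketch of the "missing" lookup described above, assuming the storage can be modelled as a set of (id, indexer_configuration_id) pairs. The free function `missing` and the `stored` set are illustrative assumptions, not the class's actual internals.

```python
def missing(stored, queries):
    """Yield the id of every query whose (id, tool) pair is not stored.

    `stored` is a set of (id, indexer_configuration_id) tuples; this
    representation is an assumption made for the example.
    """
    for query in queries:
        key = (query["id"], query["indexer_configuration_id"])
        if key not in stored:
            yield query["id"]


stored = {(b"aaa", 1), (b"bbb", 1)}
queries = [
    {"id": b"aaa", "indexer_configuration_id": 1},  # already indexed
    {"id": b"bbb", "indexer_configuration_id": 2},  # same id, other tool
    {"id": b"ccc", "indexer_configuration_id": 1},  # never indexed
]
result = list(missing(stored, queries))
```

An id indexed by one tool still counts as missing for a different tool, since the lookup key includes the tool id.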
get(ids)[source]

Retrieve data per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dict

dictionaries with the following keys:

  • id (bytes)
  • tool (dict): tool used to compute metadata
  • arbitrary data (as provided to add)
get_all()[source]
get_range(start, end, indexer_configuration_id, limit)[source]

Retrieve data within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict
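A caller would typically follow the `next` key to walk a whole range page by page. The sketch below models the storage as a sorted list of ids and leaves out `indexer_configuration_id` for brevity; both functions are hypothetical stand-ins written for this example, not the methods documented here.

```python
import bisect


def get_range(sorted_ids, start, end, limit):
    """Return up to `limit` ids in [start, end] plus the next start, if any."""
    if limit is None:
        raise ValueError("limit must not be None")
    lo = bisect.bisect_left(sorted_ids, start)
    hi = bisect.bisect_right(sorted_ids, end)
    window = sorted_ids[lo:hi]
    # The first id left out of this page is where the next page starts.
    next_id = window[limit] if len(window) > limit else None
    return {"ids": window[:limit], "next": next_id}


def iter_range(sorted_ids, start, end, limit):
    """Yield every id in [start, end], following `next` (limit must be > 0)."""
    while start is not None:
        page = get_range(sorted_ids, start, end, limit)
        yield from page["ids"]
        start = page["next"]


all_ids = [bytes([i]) for i in range(10)]
first_page = get_range(all_ids, b"\x02", b"\x08", limit=3)
collected = list(iter_range(all_ids, b"\x02", b"\x08", limit=3))
```

Here `first_page["next"]` is the first id that did not fit in the page, so passing it back as `start` resumes without gaps or duplicates.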

add(data, conflict_update)[source]

Add data not present in storage.

Parameters:
  • data (iterable) –

    dictionaries with keys:

    • id: sha1
    • indexer_configuration_id: tool used to compute the results
    • arbitrary data
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false)
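The effect of `conflict_update` can be sketched with a plain dict. This example assumes duplicates are detected on the (id, indexer_configuration_id) pair; that key choice and the free function `add` are assumptions for illustration only.

```python
def add(store, data, conflict_update):
    """Insert items into `store`, overwriting duplicates only on request."""
    for item in data:
        key = (item["id"], item["indexer_configuration_id"])
        if key in store and not conflict_update:
            continue  # skip the duplicate, keep the existing value
        store[key] = item


store = {}
add(store, [{"id": b"a", "indexer_configuration_id": 1, "lang": "c"}], False)

# Same key again with conflict_update=False: the first value is kept.
add(store, [{"id": b"a", "indexer_configuration_id": 1, "lang": "python"}], False)
after_skip = store[(b"a", 1)]["lang"]

# Same key with conflict_update=True: the stored value is replaced.
add(store, [{"id": b"a", "indexer_configuration_id": 1, "lang": "python"}], True)
after_overwrite = store[(b"a", 1)]["lang"]
```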
add_merge(new_data, conflict_update, merged_key)[source]
class swh.indexer.storage.in_memory.IndexerStorage[source]

Bases: object

In-memory SWH indexer storage.

content_mimetype_missing(mimetypes)[source]

Generate mimetypes missing from storage.

Parameters:mimetypes (iterable) –

iterable of dict with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:tuple (id, indexer_configuration_id) – missing id
content_mimetype_get_range(start, end, indexer_configuration_id, limit=1000)[source]

Retrieve mimetypes within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (default to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict

content_mimetype_add(mimetypes, conflict_update=False)[source]

Add mimetypes not present in storage.

Parameters:
  • mimetypes (iterable) –

    dictionaries with keys:

    • id (bytes): sha1 identifier
    • mimetype (bytes): raw content’s mimetype
    • encoding (bytes): raw content’s encoding
    • indexer_configuration_id (int): tool’s id used to compute the results
  • conflict_update (bool) – Flag to determine if we want to overwrite (True) or skip duplicates (False, the default)
content_mimetype_get(ids, db=None, cur=None)[source]

Retrieve full content mimetype per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

mimetypes (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • mimetype (bytes): raw content’s mimetype
  • encoding (bytes): raw content’s encoding
  • tool (dict): Tool used to compute the mimetype
content_language_missing(languages)[source]

List languages missing from storage.

Parameters:languages (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids, one for each (id, indexer_configuration_id) tuple absent from storage
content_language_get(ids)[source]

Retrieve full content language per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

languages (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • lang (bytes): raw content’s language
  • tool (dict): Tool used to compute the language
content_language_add(languages, conflict_update=False)[source]

Add languages not present in storage.

Parameters:
  • languages (iterable) –

    dictionaries with keys:

    • id (bytes): sha1
    • lang (bytes): language detected
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
content_ctags_missing(ctags)[source]

List ctags missing from storage.

Parameters:ctags (iterable) –

dicts with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids, one for each (id, indexer_configuration_id) tuple absent from storage
content_ctags_get(ids)[source]

Retrieve ctags per id.

Parameters:ids (iterable) – sha1 checksums
Yields:Dictionaries with keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • lang (str): language for that content
  • tool (dict): tool used to compute the ctags’ info
content_ctags_add(ctags, conflict_update=False)[source]

Add ctags not present in storage

Parameters:ctags (iterable) –

dictionaries with keys:

  • id (bytes): sha1
  • ctags ([dict]): List of dictionaries with keys: name, kind, line, lang
  • indexer_configuration_id: tool used to compute the results

Search through content’s raw ctags symbols.

Parameters:
  • expression (str) – Expression to search for
  • limit (int) – Number of rows to return (default to 10).
  • last_sha1 (str) – Offset from which retrieving data (default to ‘’).
Yields:

rows of ctags including id, name, lang, kind, line, etc…

content_fossology_license_get(ids)[source]

Retrieve licenses per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

{id: facts} where facts is a dict with the following keys:

  • licenses ([str]): associated licenses for that content
  • tool (dict): Tool used to compute the license
content_fossology_license_add(licenses, conflict_update=False)[source]

Add licenses not present in storage.

Parameters:
  • licenses (iterable) –

    dictionaries with keys:

    • id: sha1
    • licenses ([bytes]): List of licenses associated to sha1
    • tool (str): nomossa
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
Returns:

content_license entries which failed due to unknown licenses

Return type:

list

content_fossology_license_get_range(start, end, indexer_configuration_id, limit=1000)[source]

Retrieve licenses within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (default to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict

content_metadata_missing(metadata)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
content_metadata_get(ids)[source]

Retrieve metadata per id.

Parameters:ids (iterable) – sha1 checksums
Yields:dictionaries with the following keys:
  • id (bytes)
  • translated_metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
content_metadata_add(metadata, conflict_update=False)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1
    • translated_metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute the results
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
revision_metadata_missing(metadata)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1_git revision identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids
revision_metadata_get(ids)[source]

Retrieve revision metadata per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

dictionaries with the following keys

  • id (bytes)
  • translated_metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
revision_metadata_add(metadata, conflict_update=False)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1_git of revision
    • translated_metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_get(ids)[source]

Retrieve origin metadata per id.

Parameters:

ids (iterable) – origin identifiers

Yields:

list

dictionaries with the following keys:

  • origin_id (int)
  • translated_metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
origin_intrinsic_metadata_add(metadata, conflict_update=False)[source]

Add origin metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • origin_id: origin identifier
    • from_revision: sha1 id of the revision used to generate these metadata.
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_search_fulltext(conjunction, limit=100)[source]

Returns the list of origins whose metadata contain all the terms.

Parameters:
  • conjunction (List[str]) – List of terms to be searched for.
  • limit (int) – The maximum number of results to return
Yields:

list

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
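The "contain all the terms" semantics can be illustrated with a naive matcher. This sketch substitutes simple word tokenization for a real full-text index, so it is not how the storage backends search; the function and the sample entries are hypothetical.

```python
import json
import re


def search_fulltext(entries, conjunction, limit=100):
    """Return up to `limit` entries whose metadata contain every term."""
    terms = [term.lower() for term in conjunction]
    results = []
    for entry in entries:
        if len(results) >= limit:
            break
        # Tokenize the serialized metadata (keys and values) into words.
        text = json.dumps(entry["metadata"]).lower()
        words = set(re.findall(r"\w+", text))
        if all(term in words for term in terms):
            results.append(entry)
    return results


entries = [
    {"id": 1, "metadata": {"name": "swh indexer", "author": "Jane Doe"}},
    {"id": 2, "metadata": {"name": "swh storage", "author": "John Doe"}},
]
both = [e["id"] for e in search_fulltext(entries, ["swh", "doe"])]
only_first = [e["id"] for e in search_fulltext(entries, ["indexer", "doe"])]
```

An entry is returned only when every term of the conjunction matches, so adding terms can only narrow the result set.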
indexer_configuration_add(tools)[source]

Add new tools to the storage.

Parameters:tools ([dict]) –

List of dictionaries, each representing a tool to insert in the db, with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
Return type:list
indexer_configuration_get(tool)[source]

Retrieve tool information.

Parameters:tool (dict) –

Dictionary representing a tool with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:The same dictionary with an id key, None otherwise.
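The add and get behaviour for tools can be sketched with a small registry. `ToolRegistry` is a hypothetical class written for this example; in particular, returning an already registered tool from `add` is an assumption, not documented behaviour.

```python
import json


class ToolRegistry:
    """Toy registry keyed on (tool_name, tool_version, tool_configuration)."""

    def __init__(self):
        self._tools = {}
        self._next_id = 1

    @staticmethod
    def _key(tool):
        # Serialize the free-form configuration dict so it can be hashed.
        config = json.dumps(tool["tool_configuration"], sort_keys=True)
        return (tool["tool_name"], tool["tool_version"], config)

    def add(self, tools):
        registered = []
        for tool in tools:
            key = self._key(tool)
            if key not in self._tools:
                self._tools[key] = dict(tool, id=self._next_id)
                self._next_id += 1
            registered.append(self._tools[key])
        return registered

    def get(self, tool):
        # The same dictionary with an id key, or None if unknown.
        return self._tools.get(self._key(tool))


registry = ToolRegistry()
tool = {
    "tool_name": "example-tool",
    "tool_version": "1.0",
    "tool_configuration": {"mode": "fast"},
}
added = registry.add([tool])
found = registry.get(tool)
unknown = registry.get(dict(tool, tool_version="2.0"))
```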

Module contents

swh.indexer.storage.get_indexer_storage(cls, args)[source]

Get an indexer storage object of class cls, constructed with arguments args.

Parameters:
  • cls (str) – storage’s class, either ‘local’ or ‘remote’
  • args (dict) – dictionary of arguments passed to the storage class constructor
Returns:

an instance of swh.indexer’s storage (either local or remote)

Raises:

ValueError if passed an unknown storage class.
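The dispatch described above follows a common factory pattern, sketched here with placeholder classes. `LocalStorage`, `RemoteStorage`, `get_storage` and the constructor arguments are all hypothetical; only the 'local'/'remote' names and the ValueError come from the description.

```python
class LocalStorage:
    """Placeholder for a storage talking to a database directly."""

    def __init__(self, db):
        self.db = db


class RemoteStorage:
    """Placeholder for a storage reached through a remote API."""

    def __init__(self, url):
        self.url = url


_STORAGE_CLASSES = {"local": LocalStorage, "remote": RemoteStorage}


def get_storage(cls, args):
    """Instantiate the storage class named `cls` with keyword `args`."""
    try:
        storage_class = _STORAGE_CLASSES[cls]
    except KeyError:
        raise ValueError(f"Unknown storage class {cls!r}") from None
    return storage_class(**args)


storage = get_storage("local", {"db": "service=example"})

try:
    get_storage("bogus", {})
except ValueError as exc:
    error = str(exc)
```

Keeping the name-to-class mapping in one dictionary means that supporting a new backend only requires adding an entry there.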

class swh.indexer.storage.IndexerStorage(db, min_pool_conns=1, max_pool_conns=10)[source]

Bases: object

SWH Indexer Storage

get_db()[source]
check_config(*, check_write)[source]

Check that the storage is configured and ready to go.

content_mimetype_missing(mimetypes, db=None, cur=None)[source]

Generate mimetypes missing from storage.

Parameters:mimetypes (iterable) –

iterable of dict with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:tuple (id, indexer_configuration_id) – missing id
content_mimetype_get_range(start, end, indexer_configuration_id, limit=1000, db=None, cur=None)[source]

Retrieve mimetypes within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (default to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict

content_mimetype_add(mimetypes, conflict_update=False, db=None, cur=None)[source]

Add mimetypes not present in storage.

Parameters:
  • mimetypes (iterable) –

    dictionaries with keys:

    • id (bytes): sha1 identifier
    • mimetype (bytes): raw content’s mimetype
    • encoding (bytes): raw content’s encoding
    • indexer_configuration_id (int): tool’s id used to compute the results
  • conflict_update (bool) – Flag to determine if we want to overwrite (True) or skip duplicates (False, the default)
content_mimetype_get(ids, db=None, cur=None)[source]

Retrieve full content mimetype per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

mimetypes (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • mimetype (bytes): raw content’s mimetype
  • encoding (bytes): raw content’s encoding
  • tool (dict): Tool used to compute the mimetype
content_language_missing(languages, db=None, cur=None)[source]

List languages missing from storage.

Parameters:languages (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids, one for each (id, indexer_configuration_id) tuple absent from storage
content_language_get(ids, db=None, cur=None)[source]

Retrieve full content language per ids.

Parameters:

ids (iterable) – sha1 identifier

Yields:

languages (iterable)

dictionaries with keys:

  • id (bytes): sha1 identifier
  • lang (bytes): raw content’s language
  • tool (dict): Tool used to compute the language
content_language_add(languages, conflict_update=False, db=None, cur=None)[source]

Add languages not present in storage.

Parameters:
  • languages (iterable) –

    dictionaries with keys:

    • id (bytes): sha1
    • lang (bytes): language detected
  • conflict_update (bool) – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
content_ctags_missing(ctags, db=None, cur=None)[source]

List ctags missing from storage.

Parameters:ctags (iterable) –

dicts with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids, one for each (id, indexer_configuration_id) tuple absent from storage
content_ctags_get(ids, db=None, cur=None)[source]

Retrieve ctags per id.

Parameters:ids (iterable) – sha1 checksums
Yields:Dictionaries with keys:
  • id (bytes): content’s identifier
  • name (str): symbol’s name
  • kind (str): symbol’s kind
  • lang (str): language for that content
  • tool (dict): tool used to compute the ctags’ info
content_ctags_add(ctags, conflict_update=False, db=None, cur=None)[source]

Add ctags not present in storage

Parameters:ctags (iterable) –

dictionaries with keys:

  • id (bytes): sha1
  • ctags ([dict]): List of dictionaries with keys: name, kind, line, lang

Search through content’s raw ctags symbols.

Parameters:
  • expression (str) – Expression to search for
  • limit (int) – Number of rows to return (default to 10).
  • last_sha1 (str) – Offset from which retrieving data (default to ‘’).
Yields:

rows of ctags including id, name, lang, kind, line, etc…

content_fossology_license_get(ids, db=None, cur=None)[source]

Retrieve licenses per id.

Parameters:

ids (iterable) – sha1 checksums

Yields:

{id: facts} where facts is a dict with the following keys:

  • licenses ([str]): associated licenses for that content
  • tool (dict): Tool used to compute the license
content_fossology_license_add(licenses, conflict_update=False, db=None, cur=None)[source]

Add licenses not present in storage.

Parameters:
  • licenses (iterable) –

    dictionaries with keys:

    • id: sha1
    • licenses ([bytes]): List of licenses associated to sha1
    • tool (str): nomossa
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
Returns:

content_license entries which failed due to unknown licenses

Return type:

list

content_fossology_license_get_range(start, end, indexer_configuration_id, limit=1000, db=None, cur=None)[source]

Retrieve licenses within range [start, end] bound by limit.

Parameters:
  • start (bytes) – Starting identifier range (expected smaller than end)
  • end (bytes) – Ending identifier range (expected larger than start)
  • indexer_configuration_id (int) – The tool used to index data
  • limit (int) – Limit result (default to 1000)
Raises:

ValueError if limit is None

Returns:

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict

content_metadata_missing(metadata, db=None, cur=None)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1 identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing sha1s
content_metadata_get(ids, db=None, cur=None)[source]

Retrieve metadata per id.

Parameters:ids (iterable) – sha1 checksums
Yields:dictionaries with the following keys:
  • id (bytes)
  • translated_metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
content_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1
    • translated_metadata: arbitrary dict
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
revision_metadata_missing(metadata, db=None, cur=None)[source]

List metadata missing from storage.

Parameters:metadata (iterable) –

dictionaries with keys:

  • id (bytes): sha1_git revision identifier
  • indexer_configuration_id (int): tool used to compute the results
Yields:missing ids
revision_metadata_get(ids, db=None, cur=None)[source]

Retrieve revision metadata per id.

Parameters:ids (iterable) – sha1 checksums
Yields:dictionaries with the following keys:
  • id (bytes)
  • translated_metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
revision_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • id: sha1_git of revision
    • translated_metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_get(ids, db=None, cur=None)[source]

Retrieve origin metadata per id.

Parameters:

ids (iterable) – origin identifiers

Yields:

list

dictionaries with the following keys:

  • origin_id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
origin_intrinsic_metadata_add(metadata, conflict_update=False, db=None, cur=None)[source]

Add origin metadata not present in storage.

Parameters:
  • metadata (iterable) –

    dictionaries with keys:

    • origin_id: origin identifier
    • from_revision: sha1 id of the revision used to generate these metadata.
    • metadata: arbitrary dict
    • indexer_configuration_id: tool used to compute metadata
  • conflict_update – Flag to determine if we want to overwrite (true) or skip duplicates (false, the default)
origin_intrinsic_metadata_search_fulltext(conjunction, limit=100, db=None, cur=None)[source]

Returns the list of origins whose metadata contain all the terms.

Parameters:
  • conjunction (List[str]) – List of terms to be searched for.
  • limit (int) – The maximum number of results to return
Yields:

list

dictionaries with the following keys:

  • id (int)
  • metadata (str): associated metadata
  • tool (dict): tool used to compute metadata
indexer_configuration_add(tools, db=None, cur=None)[source]

Add new tools to the storage.

Parameters:tools ([dict]) –

List of dictionaries, each representing a tool to insert in the db, with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list.
indexer_configuration_get(tool, db=None, cur=None)[source]

Retrieve tool information.

Parameters:tool (dict) –

Dictionary representing a tool with the following keys:

  • tool_name (str): tool’s name
  • tool_version (str): tool’s version
  • tool_configuration (dict): tool’s configuration (free form dict)
Returns:The same dictionary with an id key, None otherwise.