swh.indexer.metadata module

swh.indexer.metadata.call_with_batches(f: Callable[[List[Dict[str, Any]]], Dict[str, Any]], args: List[Dict[str, str]], batch_size: int) → Iterator[str][source]

Calls a function with batches of args, and concatenates the results.
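
The batching behaviour can be illustrated with a minimal sketch. It is not the module's implementation: the name call_with_batches_sketch, the fake_lookup callback and the example URLs are hypothetical, and the sketch assumes that "concatenates" means yielding from each batch's return value in order. Since f is annotated as returning a dict, iterating over that return value yields its keys, which would be consistent with the Iterator[str] return annotation.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def call_with_batches_sketch(
    f: Callable[[List[Any]], Iterable[Any]],
    args: List[Any],
    batch_size: int,
) -> Iterator[Any]:
    """Call f on successive slices of args and chain whatever f returns."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(args)
    while True:
        # Take the next batch_size items; the last batch may be shorter.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield from f(batch)


calls: List[int] = []


def fake_lookup(batch: List[dict]) -> dict:
    """Hypothetical callback: records the batch size, returns a dict keyed by URL."""
    calls.append(len(batch))
    return {item["url"]: None for item in batch}


origins = [{"url": f"https://example.org/repo{i}"} for i in range(5)]
urls = list(call_with_batches_sketch(fake_lookup, origins, batch_size=2))
```

With five arguments and a batch size of 2, the callback is invoked three times (batches of 2, 2 and 1 items), and the caller sees a single flat sequence.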

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.ContentIndexer

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata

  • reading content from objstorage using the content’s sha1 identifier

  • computing metadata according to the given context

  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool

  • storing the result in the content_metadata table

filter(ids)[source]

Filter out known sha1s and return only missing ones.
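
As a rough illustration of the filtering step, the sketch below keeps only identifiers that have no stored result yet. The function name and the in-memory set standing in for the indexer storage are assumptions made for the example; the real method queries the storage backend.

```python
from typing import Iterable, Iterator, Set


def filter_missing_sketch(
    ids: Iterable[bytes], already_indexed: Set[bytes]
) -> Iterator[bytes]:
    """Yield only the ids that have no stored result yet, preserving order."""
    for content_id in ids:
        if content_id not in already_indexed:
            yield content_id


# Hypothetical data: two of the three sha1s are already indexed.
known = {b"\x01" * 20, b"\x02" * 20}
candidates = [b"\x01" * 20, b"\x03" * 20, b"\x02" * 20]
missing = list(filter_missing_sketch(candidates, known))
```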

index(id, data, log_suffix='unknown revision')[source]

Index sha1s’ content and store result.

Parameters
  • id (bytes) – content’s identifier

  • data (bytes) – raw content in bytes

Returns

dictionary representing a content_metadata. If the translation was not successful, the metadata keys are returned as None.

Return type

dict
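
The shape of the returned dictionary can be sketched as follows. This is an illustration under assumptions, not the indexer's code: index_content_sketch is a hypothetical name, json.loads stands in for the metadata translator, and only the id and metadata keys documented for content_metadata are shown.

```python
import json
from typing import Any, Callable, Dict


def index_content_sketch(
    content_id: bytes,
    data: bytes,
    translate: Callable[[bytes], Dict[str, Any]],
) -> Dict[str, Any]:
    """Build a content_metadata-like dict; metadata stays None on failure."""
    result: Dict[str, Any] = {"id": content_id, "metadata": None}
    try:
        result["metadata"] = translate(data)
    except ValueError:
        # The stand-in translator could not parse the content:
        # keep metadata as None instead of propagating the error.
        pass
    return result


ok = index_content_sketch(b"\x01" * 20, b'{"name": "demo"}', json.loads)
failed = index_content_sketch(b"\x02" * 20, b"not json", json.loads)
```

Here ok carries the parsed metadata, while failed keeps its identifier and has metadata set to None.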

persist_index_computations(results: List[Dict], policy_update: str) → Dict[str, int][source]

Persist the results in storage.

Parameters
  • results – list of content_metadata dicts, each with the following keys:

      - id (bytes): content’s identifier (sha1)

      - metadata (jsonb): detected metadata

  • policy_update – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
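
The difference between the two policy_update values can be sketched with an in-memory dict playing the role of the storage. Everything here is hypothetical: the function name, the dict-based storage and the "written" summary key are chosen for illustration only, and the actual summary keys returned by the indexer may differ.

```python
from typing import Any, Dict, List


def persist_sketch(
    storage: Dict[bytes, Any],
    results: List[Dict[str, Any]],
    policy_update: str,
) -> Dict[str, int]:
    """Write results into storage, updating or skipping duplicates."""
    if policy_update not in ("update-dups", "ignore-dups"):
        raise ValueError(f"unknown policy_update: {policy_update!r}")
    written = 0
    for result in results:
        is_duplicate = result["id"] in storage
        if is_duplicate and policy_update == "ignore-dups":
            continue  # keep the existing row untouched
        storage[result["id"]] = result["metadata"]
        written += 1
    # Hypothetical summary key, counting rows inserted or overwritten.
    return {"written": written}


new_results = [
    {"id": b"a", "metadata": {"name": "new"}},
    {"id": b"b", "metadata": {"name": "other"}},
]

storage_ignore = {b"a": {"name": "old"}}
summary_ignore = persist_sketch(storage_ignore, new_results, "ignore-dups")

storage_update = {b"a": {"name": "old"}}
summary_update = persist_sketch(storage_update, new_results, "update-dups")
```

With ‘ignore-dups’ the existing entry for b"a" is left as it was and only b"b" is written; with ‘update-dups’ both entries are written and b"a" is overwritten.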

results: List[Dict]
scheduler: Any
class swh.indexer.metadata.RevisionMetadataIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.RevisionIndexer

Revision-level indexer

This indexer is in charge of:

  • filtering out revisions already indexed in the revision_intrinsic_metadata table with the defined computation tool

  • retrieving all entry_files in the root directory

  • using metadata_detector for file_names containing metadata

  • computing the metadata translation if necessary and possible (depends on the tool)

  • sending sha1s to content indexing if possible

  • storing the results for the revision

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'swh-metadata-detector', 'version': '0.0.2', 'configuration': {}})}
filter(sha1_gits)[source]

Filter out known sha1s and return only missing ones.

index(rev)[source]

Index rev by processing it and organizing the result.

Uses metadata_detector to iterate over filenames:

  • if one filename is detected, the file is sent to the content indexer

  • if multiple files are detected, translation is needed at the revision level

Parameters

rev – revision model object from storage

Returns

dictionary representing a revision_intrinsic_metadata, with keys:

  • id (str): rev’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

Return type

dict

persist_index_computations(results: List[Dict], policy_update: str) → Dict[str, int][source]

Persist the results in storage.

Parameters
  • results – list of revision_intrinsic_metadata dicts, as returned by index, each with the following keys:

      - id: rev’s identifier (sha1_git)

      - indexer_configuration_id: tool used

      - metadata: dict of retrieved metadata

  • policy_update – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

translate_revision_intrinsic_metadata(detected_files: Dict[str, List[Any]], log_suffix: str) → Tuple[List[Any], List[Any]][source]

Determine the plan of action to translate metadata when one or multiple files are detected.

Parameters

detected_files – dictionary mapping context names (e.g., “npm”, “authors”) to lists of sha1s

Returns

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type

(List[str], dict)
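
A simplified sketch of this translation plan is given below. It follows the documented return value (a list of mappings and a metadata dict) and makes several assumptions: the function name, the translate_file callback and the fake data are hypothetical, and plain dict.update is used as a stand-in for a proper merge of CodeMeta documents, so later files simply overwrite earlier keys.

```python
from typing import Any, Callable, Dict, List, Tuple


def translate_detected_files_sketch(
    detected_files: Dict[str, List[bytes]],
    translate_file: Callable[[str, bytes], Dict[str, Any]],
) -> Tuple[List[str], Dict[str, Any]]:
    """Translate every detected file and merge the results into one dict."""
    used_mappings: List[str] = []
    merged: Dict[str, Any] = {}
    for context, sha1s in detected_files.items():
        translated_any = False
        for sha1 in sha1s:
            metadata = translate_file(context, sha1)
            if metadata:
                merged.update(metadata)  # simplification: last writer wins
                translated_any = True
        if translated_any:
            # Record a mapping only if it produced some metadata.
            used_mappings.append(context)
    return used_mappings, merged


# Hypothetical per-file translations, keyed by (context, sha1).
fake_translations = {
    ("npm", b"\x01" * 20): {"name": "demo", "version": "1.0.0"},
    ("authors", b"\x02" * 20): {"author": ["Jane Doe"]},
}


def fake_translate(context: str, sha1: bytes) -> Dict[str, Any]:
    return fake_translations.get((context, sha1), {})


detected = {
    "npm": [b"\x01" * 20],
    "authors": [b"\x02" * 20],
    "codemeta": [b"\x03" * 20],  # no translation available for this file
}
mappings, metadata = translate_detected_files_sketch(detected, fake_translate)
```

In this run the "codemeta" context yields nothing, so it is left out of the list of mappings used.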

results: List[Dict]
scheduler: Any
class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)[source]

Bases: swh.indexer.indexer.OriginIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'swh-metadata-detector', 'version': '0.0.2', 'configuration': {}})}
USE_TOOLS = False
index_list(origin_urls, **kwargs)[source]
persist_index_computations(results: List[Dict], policy_update: str) → Dict[str, int][source]

Persist the computation resulting from the index.

Parameters
  • results ([result]) – List of results. One result is the result of the index function.

  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

Returns

a summary dict of what has been inserted in the storage

results: List[Dict]
scheduler: Any