swh.indexer.metadata module

swh.indexer.metadata.call_with_batches(f: Callable[[List[swh.indexer.metadata.T1]], Iterable[swh.indexer.metadata.T2]], args: List[swh.indexer.metadata.T1], batch_size: int) → Iterator[swh.indexer.metadata.T2][source]

Calls a function with batches of args, and concatenates the results.
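The batching behaviour described above can be sketched as follows. This is a minimal illustration built only from the documented signature, not the actual swh.indexer implementation; the names call_with_batches_sketch, double_all and seen_batches are hypothetical, as is the ValueError raised for a non-positive batch size.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int
) -> Iterator[T2]:
    """Illustrative stand-in (assumption), not the swh.indexer code."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Slice args into consecutive chunks of at most batch_size items.
    for start in range(0, len(args), batch_size):
        batch = args[start : start + batch_size]
        # Re-yield every result of f(batch) so callers see one flat stream.
        yield from f(batch)


# Records the batches that the toy function below receives.
seen_batches: List[List[int]] = []


def double_all(batch: List[int]) -> List[int]:
    """Toy batch function: remember the batch, then double each element."""
    seen_batches.append(batch)
    return [2 * x for x in batch]


doubled = list(call_with_batches_sketch(double_all, [1, 2, 3, 4, 5], batch_size=2))
```

With these toy inputs, double_all is called three times (with batches of 2, 2 and 1 items) and doubled is the flat list [2, 4, 6, 8, 10].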

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.ContentIndexer[swh.indexer.storage.model.ContentMetadataRow]

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata

  • reading content from objstorage using the content’s sha1 identifier

  • computing metadata for the given context

  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool

  • storing the result in the content_metadata table

Prepare and check that the indexer is ready to run.

filter(ids)[source]

Filter out known sha1s and return only missing ones.
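The filtering step can be pictured with the sketch below. It assumes an in-memory set of already-indexed identifiers standing in for the real lookup (which presumably goes through the indexer storage); filter_missing, known and candidates are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, Set


def filter_missing(ids: Iterable[bytes], already_indexed: Set[bytes]) -> Iterator[bytes]:
    """Yield only the ids that have no existing entry (hypothetical helper)."""
    for sha1 in ids:
        # Keep the id only when nothing has been indexed for it yet.
        if sha1 not in already_indexed:
            yield sha1


# Toy 20-byte identifiers; real sha1s would come from the archive.
known = {b"\x01" * 20}
candidates = [b"\x01" * 20, b"\x02" * 20]
missing = list(filter_missing(candidates, known))
```

Here missing contains only the second identifier, since the first one is already in known.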

index(id: bytes, data: Optional[bytes] = None, log_suffix='unknown revision', **kwargs) → List[swh.indexer.storage.model.ContentMetadataRow][source]

Index a sha1’s content and store the result.

Parameters
  • id – content’s identifier

  • data – raw content in bytes

Returns

dictionary representing a content_metadata entry. If the translation was not successful, the metadata keys are returned as None.

Return type

dict

persist_index_computations(results: List[swh.indexer.storage.model.ContentMetadataRow]) → Dict[str, int][source]

Persist the results in storage.

Parameters

results – list of content_metadata, each a dict with the following keys:

  • id (bytes): content’s identifier (sha1)

  • metadata (jsonb): detected metadata

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface
class swh.indexer.metadata.RevisionMetadataIndexer(*args, **kwargs)[source]

Bases: swh.indexer.indexer.RevisionIndexer[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]

Revision-level indexer

This indexer is in charge of:

  • filtering out revisions already indexed in the revision_intrinsic_metadata table with the defined computation tool

  • retrieving all entry_files in the root directory

  • using metadata_detector for file_names containing metadata

  • computing the metadata translation if necessary and possible (depends on the tool)

  • sending sha1s to content indexing if possible

  • storing the results for the revision

Prepare and check that the indexer is ready to run.

filter(sha1_gits)[source]

Filter out known sha1s and return only missing ones.

index(id: bytes, data: Optional[swh.model.model.Revision], **kwargs) → List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow][source]

Index a revision by processing it and organizing the result.

Use metadata_detector to iterate on filenames:

  • if one filename is detected -> send the file to the content indexer

  • if multiple files are detected -> translation is needed at the revision level
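The two cases above amount to a small dispatch on the number of detected files. The sketch below is only an illustration of that rule: the function name, the returned labels, the choice to count files across all contexts, and the handling of the "no file detected" case are all assumptions, not behaviour documented here.

```python
from typing import Dict, List


def plan_for_detected_files(detected_files: Dict[str, List[bytes]]) -> str:
    """Return a label for the action to take (hypothetical helper)."""
    # Assumption: files are counted across every context, not per context.
    total = sum(len(sha1s) for sha1s in detected_files.values())
    if total == 0:
        # Assumption: nothing detected means nothing to translate.
        return "nothing-to-translate"
    if total == 1:
        # One detected file: hand it over to the content indexer.
        return "content-indexer"
    # Several detected files: translation happens at the revision level.
    return "revision-level-translation"


single = plan_for_detected_files({"npm": [b"a" * 20]})
several = plan_for_detected_files({"npm": [b"a" * 20], "authors": [b"b" * 20]})
```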

Parameters
  • id – sha1_git of the revision

  • data – revision model object from storage

Returns

dictionary representing a revision_intrinsic_metadata, with keys:

  • id (str): revision’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

Return type

dict

persist_index_computations(results: List[swh.indexer.storage.model.RevisionIntrinsicMetadataRow]) → Dict[str, int][source]

Persist the results in storage.

Parameters

results – list of revision_intrinsic_metadata results, each with the keys described for index above (id, indexer_configuration_id, metadata)

translate_revision_intrinsic_metadata(detected_files: Dict[str, List[Any]], log_suffix: str) → Tuple[List[Any], Any][source]

Determine the plan of action for translating metadata when one or multiple files are detected.

Parameters

detected_files – dictionary mapping context names (e.g., “npm”, “authors”) to lists of sha1s

Returns

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type

(List[str], dict)
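To make the documented input and output shapes concrete, here is a rough sketch under stated assumptions. It takes already-translated per-file metadata from a hypothetical in-memory lookup (metadata_by_sha1) and merges it with a naive "first value wins" rule; the real method's lookup and merge strategy are not described here and may well differ. The name translate_sketch and the sample data are invented for illustration.

```python
from typing import Any, Dict, List, Tuple


def translate_sketch(
    detected_files: Dict[str, List[bytes]],
    metadata_by_sha1: Dict[bytes, Dict[str, Any]],
) -> Tuple[List[str], Dict[str, Any]]:
    """Return (mappings used, merged metadata); illustrative only."""
    used_mappings: List[str] = []
    merged: Dict[str, Any] = {}
    for context, sha1s in detected_files.items():
        # Collect the metadata available for this context's files.
        found = [metadata_by_sha1[s] for s in sha1s if s in metadata_by_sha1]
        if not found:
            continue
        used_mappings.append(context)
        for metadata in found:
            # Naive merge (assumption): the first value seen for a key wins.
            for key, value in metadata.items():
                merged.setdefault(key, value)
    return used_mappings, merged


detected = {"npm": [b"a"], "authors": [b"b"], "codemeta": []}
lookup = {
    b"a": {"name": "demo", "license": "MIT"},
    b"b": {"name": "other", "author": ["Jane"]},
}
mappings, metadata = translate_sketch(detected, lookup)
```

In this toy run, mappings lists only the contexts that contributed metadata ("npm" and "authors"), and metadata keeps the "name" value from the first file that defined it.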

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface
class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)[source]

Bases: swh.indexer.indexer.OriginIndexer[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.RevisionIntrinsicMetadataRow]]

Prepare and check that the indexer is ready to run.

USE_TOOLS = False
index_list(origin_urls: List[str], **kwargs) → List[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.RevisionIntrinsicMetadataRow]][source]
persist_index_computations(results: List[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.RevisionIntrinsicMetadataRow]]) → Dict[str, int][source]

Persist the computation resulting from the index.

Parameters

results – list of results, where each item is one result produced by the index function

Returns

a summary dict of what has been inserted in the storage

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface