swh.indexer.metadata module

swh.indexer.metadata.call_with_batches(f: Callable[[List[swh.indexer.metadata.T1]], Iterable[swh.indexer.metadata.T2]], args: List[swh.indexer.metadata.T1], batch_size: int) → Iterator[swh.indexer.metadata.T2]

Calls a function with batches of args, and concatenates the results.
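The batching behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: the name call_with_batches_sketch is hypothetical, and the sketch assumes that batches are consecutive slices of args, processed in order.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int
) -> Iterator[T2]:
    """Hypothetical stand-in: call f on consecutive slices and chain the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(args), batch_size):
        # Each call to f receives at most batch_size arguments.
        yield from f(args[start : start + batch_size])


# Record the batches that f receives, to make the slicing visible.
seen_batches: List[List[int]] = []


def double_all(batch: List[int]) -> List[int]:
    seen_batches.append(batch)
    return [2 * x for x in batch]


doubled = list(call_with_batches_sketch(double_all, [1, 2, 3, 4, 5], batch_size=2))
```

In this sketch the last batch is shorter than batch_size when len(args) is not a multiple of it, and the caller sees a single flat iterator regardless of how many calls to f were made.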

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)

Bases: swh.indexer.indexer.ContentIndexer[swh.indexer.storage.model.ContentMetadataRow]

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata

  • reading content from objstorage using the content’s sha1 identifier

  • computing metadata for the given context

  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool

  • storing the result in the content_metadata table

Prepare and check that the indexer is ready to run.

filter(ids)

Filter out known sha1s and return only missing ones.
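The filtering step can be sketched as follows. This is only an illustration under stated assumptions: filter_missing is a hypothetical helper, and an in-memory set stands in for whatever lookup the indexer performs against idx_storage.

```python
from typing import Iterable, Iterator, Set


def filter_missing(ids: Iterable[bytes], already_indexed: Set[bytes]) -> Iterator[bytes]:
    """Hypothetical helper: yield only the ids that have no stored result yet."""
    for sha1 in ids:
        if sha1 not in already_indexed:
            yield sha1


# Toy 20-byte identifiers; real sha1s would come from the archive.
known = {bytes.fromhex("aa" * 20)}
candidates = [bytes.fromhex("aa" * 20), bytes.fromhex("bb" * 20)]
missing = list(filter_missing(candidates, known))
```

Filtering before indexing means that content which already has a stored result is never read from objstorage again.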

index(id: bytes, data: Optional[bytes] = None, log_suffix='unknown directory', **kwargs) → List[swh.indexer.storage.model.ContentMetadataRow]

Index sha1s’ content and store result.

Parameters
  • id – content’s identifier

  • data – raw content in bytes

Returns

a list of ContentMetadataRow objects, each representing a content_metadata entry. If the translation was not successful, the metadata keys are returned as None.

Return type

List[ContentMetadataRow]

persist_index_computations(results: List[swh.indexer.storage.model.ContentMetadataRow]) → Dict[str, int]

Persist the results in storage.

Parameters

results – list of content_metadata, each a dict with the following keys:

  • id (bytes): content’s identifier (sha1)

  • metadata (jsonb): detected metadata

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface

class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)

Bases: swh.indexer.indexer.DirectoryIndexer[swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]

Directory-level indexer

This indexer is in charge of:

  • filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool

  • retrieving all entry_files in the directory

  • using metadata_detector on file_names that contain metadata

  • computing the metadata translation if necessary and possible (depends on the tool)

  • sending sha1s to content indexing if possible

  • storing the results for the directory

Prepare and check that the indexer is ready to run.

filter(sha1_gits)

Filter out known sha1s and return only missing ones.

index(id: bytes, data: Optional[swh.model.model.Directory] = None, **kwargs) → List[swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]

Index a directory by processing it and organizing the result.

Uses metadata_detector to iterate over filenames:

  • if one file is detected, the file is sent to the content indexer

  • if multiple files are detected, translation is needed at the directory level

Parameters
  • id – sha1_git of the directory

  • data – directory model object from storage

Returns

a list of DirectoryIntrinsicMetadataRow objects, each representing a directory_intrinsic_metadata entry, with fields:

  • id: directory’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

Return type

List[DirectoryIntrinsicMetadataRow]

persist_index_computations(results: List[swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]) → Dict[str, int]

Persist the results in storage.

Parameters

results – list of directory_intrinsic_metadata rows, each with the following fields:

  • id: directory’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

translate_directory_intrinsic_metadata(detected_files: Dict[str, List[Any]], log_suffix: str) → Tuple[List[Any], Any]

Determine the plan of action for translating metadata when the directory contains one or more detected files.

Parameters

detected_files – dictionary mapping context names (e.g., “npm”, “authors”) to lists of sha1s

Returns

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type

(List[str], dict)
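To make the shapes of detected_files and of the returned pair concrete, here is a toy sketch. It does not reproduce the indexer's actual logic: translate_sketch, TOY_TRANSLATORS, the per-context translator functions, the sample values, and the "later context overwrites earlier keys" merge rule are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical per-context translators returning CodeMeta-style keys.
def translate_npm(sha1s: List[bytes]) -> Dict[str, Any]:
    return {"name": "example-package"}


def translate_authors(sha1s: List[bytes]) -> Dict[str, Any]:
    return {"author": ["Jane Doe"]}


TOY_TRANSLATORS: Dict[str, Callable[[List[bytes]], Dict[str, Any]]] = {
    "npm": translate_npm,
    "authors": translate_authors,
}


def translate_sketch(
    detected_files: Dict[str, List[bytes]]
) -> Tuple[List[str], Dict[str, Any]]:
    """Return (mappings used, merged metadata) for the detected files."""
    used_mappings: List[str] = []
    merged: Dict[str, Any] = {}
    for context, sha1s in detected_files.items():
        translator = TOY_TRANSLATORS.get(context)
        if translator is None or not sha1s:
            continue  # unknown context or nothing to translate
        used_mappings.append(context)
        # Naive merge (assumption): later contexts overwrite earlier keys.
        merged.update(translator(sha1s))
    return used_mappings, merged


detected = {"npm": [b"\x01" * 20], "authors": [b"\x02" * 20]}
mappings, metadata = translate_sketch(detected)
```

The first element of the result lists which contexts contributed, and the second is a single dictionary combining what each context produced.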

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface

class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)

Bases: swh.indexer.indexer.OriginIndexer[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]]

Prepare and check that the indexer is ready to run.

USE_TOOLS = False

index_list(origins: List[swh.model.model.Origin], check_origin_known: bool = True, **kwargs) → List[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]]

persist_index_computations(results: List[Tuple[swh.indexer.storage.model.OriginIntrinsicMetadataRow, swh.indexer.storage.model.DirectoryIntrinsicMetadataRow]]) → Dict[str, int]

Persist the computation resulting from the index.

Parameters

results – List of results. One result is the result of the index function.

Returns

a summary dict of what has been inserted in the storage
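As a rough illustration of what such a summary might look like, the sketch below counts distinct rows in a list of (origin row, directory row) pairs. Everything here is hypothetical: plain dicts stand in for the row classes, the key names "origin_rows:add" and "directory_rows:add" are invented for the example, and deduplicating by id is an assumption rather than documented behaviour.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # plain-dict stand-in for the real row classes


def summarize_persisted(results: List[Tuple[Row, Row]]) -> Dict[str, int]:
    """Hypothetical summary: count distinct origin and directory rows."""
    origin_ids = {origin_row["id"] for origin_row, _ in results}
    directory_ids = {directory_row["id"] for _, directory_row in results}
    return {
        "origin_rows:add": len(origin_ids),        # hypothetical key name
        "directory_rows:add": len(directory_ids),  # hypothetical key name
    }


# Two origins that point at the same directory.
shared_directory = {"id": b"\x0a" * 20, "metadata": {"name": "example"}}
results = [
    ({"id": "https://example.org/repo-a", "metadata": {"name": "example"}}, shared_directory),
    ({"id": "https://example.org/repo-b", "metadata": {"name": "example"}}, shared_directory),
]
summary = summarize_persisted(results)
```

Because each result pairs an origin row with a directory row, several origins can share one directory row, which is why the two counts may differ.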

results: List[swh.indexer.indexer.TResult]
scheduler: Any
storage: swh.storage.interface.StorageInterface
objstorage: Any
idx_storage: swh.indexer.storage.interface.IndexerStorageInterface