swh.indexer.metadata module

swh.indexer.metadata.call_with_batches(f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) -> Iterator[T2]

Calls a function with batches of args, and concatenates the results.
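The batching behaviour described above can be sketched as follows. This is a minimal illustration, assuming batches are consecutive slices of args holding at most batch_size items each; the names call_with_batches_sketch and double_all are hypothetical, and this is not the module's actual implementation.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]],
    args: List[T1],
    batch_size: int,
) -> Iterator[T2]:
    """Hypothetical re-implementation, for illustration only."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Slice args into consecutive chunks of at most batch_size items.
    for start in range(0, len(args), batch_size):
        batch = args[start : start + batch_size]
        # Yield every result of the call, so all batches form one stream.
        yield from f(batch)


seen_batches: List[List[int]] = []


def double_all(batch: List[int]) -> List[int]:
    seen_batches.append(list(batch))  # record how args were grouped
    return [2 * x for x in batch]


results = list(call_with_batches_sketch(double_all, [1, 2, 3, 4, 5], 2))
```

Because the sketch is a generator, f is only called as the caller consumes the results, which keeps memory use bounded by one batch at a time.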

class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)

Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]

Prepare and check that the indexer is ready to run.

process_journal_objects(objects: ObjectsDict) -> Dict

Read swh message objects (content, origin, …) from the journal to:

  • retrieve the associated objects from the storage backend (e.g. storage, objstorage…)

  • execute the associated indexing computations

  • store the results in the indexer storage
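The three steps above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: fake_storage and fake_indexer_storage are plain dicts rather than real services, the message shape ({"id": ...}) and the "stored" summary key are hypothetical, and the "indexing computation" is just a byte count.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the storage backend and the
# indexer storage; the real services are not modelled here.
fake_storage: Dict[str, bytes] = {"id-1": b"hello", "id-2": b"journal"}
fake_indexer_storage: Dict[str, int] = {}


def process_journal_objects_sketch(
    objects: Dict[str, List[Dict[str, str]]]
) -> Dict[str, int]:
    """Illustrative only: retrieve, compute, then store."""
    stored = 0
    for messages in objects.values():
        for message in messages:
            # 1. Retrieve the associated object from the storage backend.
            data = fake_storage.get(message["id"])
            if data is None:
                continue  # nothing to index for unknown objects
            # 2. Execute the indexing computation (here: a byte count).
            result = len(data)
            # 3. Store the result in the indexer storage.
            fake_indexer_storage[message["id"]] = result
            stored += 1
    # Assumed summary shape: how many results were written.
    return {"stored": stored}


summary = process_journal_objects_sketch(
    {
        "content": [{"id": "id-1"}, {"id": "missing"}],
        "origin": [{"id": "id-2"}],
    }
)
```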

index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) -> List[OriginExtrinsicMetadataRow]

Index computation for the id and associated raw data.

Parameters:

  • id – identifier or Dict object

  • data – id’s data from storage or objstorage depending on object type

Returns:

a dict that makes sense for the persist_index_computations() method.

persist_index_computations(results: List[OriginExtrinsicMetadataRow]) -> Dict[str, int]

Persist the results in storage.

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)

Bases: ContentIndexer[ContentMetadataRow]

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata

  • reading content from objstorage using the content’s sha1 id

  • computing metadata according to the given context

  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool

  • storing the result in the content_metadata table
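To illustrate the translation step in the list above, here is a minimal sketch that maps a few package.json keys to CodeMeta-style terms. The mapping table NPM_TO_CODEMETA and the function translate_package_json_sketch are hypothetical and heavily reduced; they do not reproduce the real metadata_dictionary, and returning None on a failed parse is an assumption of this sketch.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical, heavily reduced mapping from package.json keys to
# CodeMeta-style terms; the real metadata_dictionary is not reproduced.
NPM_TO_CODEMETA: Dict[str, str] = {
    "name": "name",
    "description": "description",
    "version": "version",
    "license": "license",
}


def translate_package_json_sketch(raw: bytes) -> Optional[Dict[str, Any]]:
    """Return translated metadata, or None when the content cannot be parsed."""
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # assumed fallback: unsuccessful translation yields None
    if not isinstance(parsed, dict):
        return None
    # Keep only the keys this sketch knows how to translate.
    return {
        NPM_TO_CODEMETA[key]: value
        for key, value in parsed.items()
        if key in NPM_TO_CODEMETA
    }


ok = translate_package_json_sketch(
    b'{"name": "demo", "version": "1.0.0", "scripts": {}}'
)
failed = translate_package_json_sketch(b"not json")
```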

Prepare and check that the indexer is ready to run.

filter(ids: List[ObjId])

Filter out known sha1s and return only missing ones.
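A minimal sketch of this kind of filtering, assuming the set of already-indexed identifiers is available in memory. In practice that information would come from the indexer storage; filter_missing_sketch and its already_indexed parameter are hypothetical names.

```python
from typing import Iterable, Iterator, Set


def filter_missing_sketch(
    ids: Iterable[bytes], already_indexed: Set[bytes]
) -> Iterator[bytes]:
    """Yield only the ids that have not been indexed yet, in input order."""
    for id_ in ids:
        if id_ not in already_indexed:
            yield id_


known = {b"\x01", b"\x03"}
missing = list(filter_missing_sketch([b"\x01", b"\x02", b"\x03", b"\x04"], known))
```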

index(id: ObjId, data: bytes | None = None, log_suffix='unknown directory', **kwargs) -> List[ContentMetadataRow]

Index the content identified by its sha1 and store the result.

Parameters:

  • id – content’s identifier

  • data – raw content in bytes

Returns:

a dictionary representing a content_metadata row. If the translation was not successful, the metadata keys are returned as None.

persist_index_computations(results: List[ContentMetadataRow]) -> Dict[str, int]

Persist the results in storage.

class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)

Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]

Directory-level indexer

This indexer is in charge of:

  • filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool

  • retrieving all entry_files in the directory

  • using metadata_detector for file_names containing metadata

  • computing the metadata translation if necessary and possible (depends on the tool)

  • sending sha1s to content indexing if possible

  • storing the results for the directory

Prepare and check that the indexer is ready to run.


Filter out known sha1s and return only missing ones.

index(id: bytes, data: Directory | None = None, **kwargs) -> List[DirectoryIntrinsicMetadataRow]

Index a directory by processing it and organizing the result.

Uses metadata_detector to iterate over the filenames, passes them to the content indexers, then merges the results (if there is more than one).

Parameters:

  • id – sha1_git of the directory

  • data – should always be None

Returns:

dictionary representing a directory_intrinsic_metadata, with keys:

  • id: directory’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) -> Dict[str, int]

Persist the results in storage.

translate_directory_intrinsic_metadata(files: List[DirectoryLsEntry], log_suffix: str) -> Tuple[List[Any], Any]

Determine the plan of action for translating metadata in the given root directory.

Parameters:

files – list of file entries, as returned by swh.storage.interface.StorageInterface.directory_ls()

Returns:

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type:

(List[str], dict)
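The return shape described above (the mappings used plus the merged metadata) can be sketched as follows. This is an illustration under stated assumptions, not the method's actual logic: directory entries are modelled as plain dicts with assumed keys type, name, and sha1; the DETECTORS table, the translate callback, and the "first value wins" merge rule are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical detectors: mapping name -> file name that mapping recognises.
DETECTORS: Dict[str, bytes] = {
    "npm": b"package.json",
    "codemeta": b"codemeta.json",
}


def translate_directory_sketch(
    files: List[Dict[str, Any]],
    translate: Callable[[str, bytes], Dict[str, Any]],
) -> Tuple[List[str], Dict[str, Any]]:
    """Return (mappings used, merged metadata) for a root directory listing."""
    used: List[str] = []
    merged: Dict[str, Any] = {}
    for entry in files:
        if entry["type"] != "file":
            continue  # only regular files can carry metadata
        for mapping_name, filename in DETECTORS.items():
            if entry["name"] != filename:
                continue
            translated = translate(mapping_name, entry["sha1"])
            if not translated:
                continue  # translation failed or produced nothing
            used.append(mapping_name)
            # Naive merge (an assumption): the first value seen for a key wins.
            for key, value in translated.items():
                merged.setdefault(key, value)
    return used, merged


def fake_translate(mapping_name: str, sha1: bytes) -> Dict[str, Any]:
    # Stand-in for reading the content and translating it.
    canned = {
        "npm": {"name": "demo"},
        "codemeta": {"name": "other", "license": "MIT"},
    }
    return canned[mapping_name]


listing = [
    {"type": "dir", "name": b"src", "sha1": None},
    {"type": "file", "name": b"package.json", "sha1": b"\x01"},
    {"type": "file", "name": b"codemeta.json", "sha1": b"\x02"},
]
mappings, metadata = translate_directory_sketch(listing, fake_translate)
```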

class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)

Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

Prepare and check that the indexer is ready to run.

USE_TOOLS = False

index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) -> List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) -> Dict[str, int]

Persist the computation resulting from the index.


Parameters:

results – List of results. One result is the result of the index function.

Returns:

a summary dict of what has been inserted in the storage