swh.indexer.metadata module

swh.indexer.metadata.fetch_in_batches(fetch_fn: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) → Iterator[T2]

Calls fetch_fn on batches of args and yields the results of each batch that succeeds.

When a batch raises an exception, processing continues with the next batch.

The failed batches are then retried one object at a time. Any further failure is logged and skipped, so callers receive a partial result set rather than a total failure.
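
The two-pass behaviour described above can be sketched as follows. This is an illustrative re-implementation written for this page, not the actual swh.indexer code: the name fetch_in_batches_sketch, the log messages, and the choice to materialize each batch before yielding are all assumptions.

```python
import logging
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")

logger = logging.getLogger(__name__)


def fetch_in_batches_sketch(
    fetch_fn: Callable[[List[T1]], Iterable[T2]],
    args: List[T1],
    batch_size: int,
) -> Iterator[T2]:
    """Hypothetical sketch of batch fetching with per-object retry."""
    failed: List[T1] = []

    # First pass: fetch whole batches; a failing batch is set aside.
    for start in range(0, len(args), batch_size):
        batch = args[start : start + batch_size]
        try:
            # Materialize inside the try so lazy iterables fail here.
            results = list(fetch_fn(batch))
        except Exception:
            logger.warning("Batch of %d objects failed, will retry", len(batch))
            failed.extend(batch)
            continue
        yield from results

    # Second pass: retry the failed objects one at a time.
    for arg in failed:
        try:
            results = list(fetch_fn([arg]))
        except Exception:
            logger.exception("Failed to fetch %r, skipping", arg)
            continue
        yield from results


def flaky_fetch(ids: List[int]) -> List[int]:
    """Demo fetch function (hypothetical): id 5 can never be read."""
    if 5 in ids:
        raise RuntimeError("cannot read object 5")
    return [i * 10 for i in ids]


partial = list(fetch_in_batches_sketch(flaky_fetch, [1, 2, 3, 4, 5, 6], 2))
```

In this sketch, results recovered during the retry pass are yielded after all results from the first pass, so output order may differ from input order when a batch fails.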

swh.indexer.metadata.fetch_as_dict(fetch_fn: Callable[[List[T1]], Iterable[T2]], ids: List[T1], batch_size: int) → Dict[T1, T2]

Return a dict {id: object}; missing items are logged.
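
A minimal sketch of the idea, under stated assumptions: the documented signature does not say how an id is recovered from a fetched object, so the get_id parameter below is hypothetical and exists only for illustration. The name fetch_as_dict_sketch and the log message are also assumptions.

```python
import logging
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")

logger = logging.getLogger(__name__)


def fetch_as_dict_sketch(
    fetch_fn: Callable[[List[K]], Iterable[V]],
    ids: List[K],
    batch_size: int,
    get_id: Callable[[V], K],  # hypothetical: not part of the documented API
) -> Dict[K, V]:
    """Hypothetical sketch: build {id: object} and log ids with no object."""
    found: Dict[K, V] = {}
    for start in range(0, len(ids), batch_size):
        for obj in fetch_fn(ids[start : start + batch_size]):
            found[get_id(obj)] = obj

    # Requested ids without a fetched object are logged, not raised.
    for missing_id in ids:
        if missing_id not in found:
            logger.warning("Object %r is missing", missing_id)
    return found
```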

class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)

Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]

Indexer for Raw Extrinsic Metadata

For supported extrinsic metadata formats, translate the original format into CodeMeta, and attach the result to the Origin.

Use XXX to get registered mapping formats.

Prepare and check that the indexer is ready to run.

object_types: List[str] = ['raw_extrinsic_metadata']

process_journal_objects(objects: ObjectsDict) → Dict

Read swh message objects (content, origin, …) from the journal to:

  • retrieve the associated objects from the storage backend (e.g. storage, objstorage…)

  • execute the associated indexing computations

  • store the results in the indexer storage
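
The three steps above can be sketched as a small pipeline. This is an illustration only: the function name, the fetch, index and persist callables, and the assumed shape of the journal messages (a mapping from object type to a list of dicts carrying an "id" key) are all hypothetical and are not taken from the swh.indexer code.

```python
from typing import Any, Callable, Dict, List


def process_journal_objects_sketch(
    objects: Dict[str, List[Dict[str, Any]]],
    object_type: str,
    fetch: Callable[[Any], Any],
    index: Callable[[Any, Any], List[Any]],
    persist: Callable[[List[Any]], Dict[str, int]],
) -> Dict[str, int]:
    """Hypothetical sketch of the retrieve / index / store sequence."""
    results: List[Any] = []
    for message in objects.get(object_type, []):
        obj_id = message["id"]
        data = fetch(obj_id)  # 1. retrieve the object from a backend
        results.extend(index(obj_id, data))  # 2. run the indexing computation
    return persist(results)  # 3. store the results, return a summary


stored: List[str] = []


def persist_in_memory(rows: List[str]) -> Dict[str, int]:
    """Demo persistence function (hypothetical): keeps rows in a list."""
    stored.extend(rows)
    return {"added": len(rows)}


summary = process_journal_objects_sketch(
    {"content": [{"id": 1}, {"id": 2}]},
    "content",
    fetch=lambda obj_id: f"data-{obj_id}",
    index=lambda obj_id, data: [f"{obj_id}:{data}"],
    persist=persist_in_memory,
)
```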

index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) → List[OriginExtrinsicMetadataRow]

Index computation for the id and associated raw data.

Parameters:
  • id – identifier or Dict object

  • data – id’s data from storage or objstorage depending on object type

Returns:

a list of rows suitable for the persist_index_computations() method.

Return type:

List[OriginExtrinsicMetadataRow]

persist_index_computations(results: List[OriginExtrinsicMetadataRow]) → Dict[str, int]

Persist the results in storage.

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)

Bases: ContentIndexer[ContentMetadataRow]

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata

  • reading content from objstorage with the content’s id sha1

  • computing metadata by given context

  • using the metadata_mapping as the ‘swh-metadata-translator’ tool

  • storing the result in the content_metadata table

Prepare and check that the indexer is ready to run.

filter(ids: List[HashDict])

Filter out known sha1s and return only missing ones.

index(id: HashDict, data: bytes | None = None, log_suffix='unknown directory', **kwargs) → List[ContentMetadataRow]

Index sha1s’ content and store result.

Parameters:
  • id – content’s identifier

  • data – raw content in bytes

Returns:

a list of rows, each representing a content_metadata entry. If the translation was not successful, the metadata keys are returned as None.

Return type:

List[ContentMetadataRow]

persist_index_computations(results: List[ContentMetadataRow]) → Dict[str, int]

Persist the results in storage.

swh.indexer.metadata.directory_get(storage: StorageInterface, directory_id: bytes) → Tuple[Directory | None, bool]

Get the directory from the storage. This uses a more efficient implementation to read the directory from the storage. It is currently limited, however: it may read only part of a directory.

Parameters:
  • storage – the storage instance

  • directory_id – the directory’s identifier

Returns:

  • The directory if it could be properly put back together. None otherwise.

  • Whether the list of entries was truncated

class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)

Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]

Directory-level indexer

This indexer is in charge of:

  • filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool

  • retrieving all entry_files in the directory

  • using metadata_detector for file_names containing metadata

  • computing the metadata translation if necessary and possible (depends on the tool)

  • sending sha1s to content indexing if possible

  • storing the results for the directory

Prepare and check that the indexer is ready to run.

filter(sha1_gits)

Filter out known sha1s and return only missing ones.

index(id: bytes, data: Directory | None = None, **kwargs) → List[DirectoryIntrinsicMetadataRow]

Index directory by processing it and organizing result.

Uses metadata_detector to iterate over filenames, passes them to the content indexers, then merges the results (if there is more than one).

Parameters:
  • id – sha1_git of the directory

  • data – should always be None

Returns:

a list of rows, each representing a directory_intrinsic_metadata entry with the fields:

  • id: directory’s identifier (sha1_git)

  • indexer_configuration_id (bytes): tool used

  • metadata: dict of retrieved metadata

Return type:

List[DirectoryIntrinsicMetadataRow]
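
The detect, translate, then merge flow of index() can be sketched as below. Everything here is an assumption made for illustration: the KNOWN_METADATA_FILES table, the function names, the translate callable, and above all the naive "first value wins" merge, which stands in for whatever merge strategy the real indexer applies.

```python
import json
from typing import Callable, Dict, List

# Hypothetical table: file name -> name of the mapping able to translate it.
KNOWN_METADATA_FILES: Dict[str, str] = {
    "package.json": "npm",
    "codemeta.json": "codemeta",
}


def detect_metadata_files(filenames: List[str]) -> Dict[str, List[str]]:
    """Group file names by the (hypothetical) mapping that recognizes them."""
    detected: Dict[str, List[str]] = {}
    for name in filenames:
        mapping = KNOWN_METADATA_FILES.get(name)
        if mapping is not None:
            detected.setdefault(mapping, []).append(name)
    return detected


def index_directory_sketch(
    files: Dict[str, bytes],
    translate: Callable[[str, bytes], dict],
) -> Dict[str, object]:
    """Hypothetical sketch: translate each detected file, then merge."""
    detected = detect_metadata_files(list(files))
    merged: dict = {}
    for mapping, names in sorted(detected.items()):
        for name in names:
            for key, value in translate(mapping, files[name]).items():
                # Naive merge (assumption): the first value seen wins.
                merged.setdefault(key, value)
    return {"mappings": sorted(detected), "metadata": merged}


row = index_directory_sketch(
    {
        "package.json": b'{"name": "a"}',
        "README": b"not metadata",
        "codemeta.json": b'{"name": "b", "license": "MIT"}',
    },
    translate=lambda mapping, raw: json.loads(raw),
)
```

With the sorted iteration used here, the codemeta file is translated before the npm one, so its values take precedence in the merged result.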

persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int]

Persist the results in storage.

translate_directory_intrinsic_metadata(mapping_contents: Dict[str, Set[Content]], log_suffix: str) → Tuple[List[Any], Any]

Determine how to translate metadata from the directory file entries.

Parameters:

files – list of file entries DirectoryEntry of type ‘file’

Returns:

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type:

(List[str], dict)

class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)

Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

Indexer for intrinsic metadata found within origin’s root directory

If there is a metadata file corresponding to a known format in the root directory of an Origin (i.e. in the root directory of the , read it and

Prepare and check that the indexer is ready to run.

USE_TOOLS = False

index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) → List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) → Dict[str, int]

Persist the computation resulting from the index.

Parameters:

results – List of results. One result is the result of the index function.

Returns:

a summary dict of what has been inserted in the storage