swh.indexer.metadata module

swh.indexer.metadata.call_with_batches(f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) → Iterator[T2]: Calls a function with batches of args and concatenates the results.
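
The batching behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: `call_with_batches_sketch`, `double_all`, and the rejection of non-positive batch sizes are assumptions made for illustration, and only the signature shape is taken from the entry above.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int
) -> Iterator[T2]:
    """Hypothetical re-implementation: call f on slices of args, in order."""
    if batch_size <= 0:
        # Assumption: the sketch rejects non-positive sizes explicitly.
        raise ValueError("batch_size must be positive")
    for start in range(0, len(args), batch_size):
        # Each call to f receives at most batch_size arguments.
        yield from f(args[start : start + batch_size])


batch_sizes: List[int] = []


def double_all(batch: List[int]) -> List[int]:
    batch_sizes.append(len(batch))  # record how f was called
    return [2 * x for x in batch]


doubled = list(call_with_batches_sketch(double_all, [1, 2, 3, 4, 5], batch_size=2))
```

Because the sketch is a generator, `f` is only invoked as the caller consumes the results, which keeps memory use bounded by one batch at a time.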

class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)

Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]

Prepare and check that the indexer is ready to run.

process_journal_objects(objects: ObjectsDict) → Dict

Read swh message objects (content, origin, …) from the journal to:

retrieve the associated objects from the storage backend (e.g. storage, objstorage…)
execute the associated indexing computations
store the results in the indexer storage
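
The three steps above (retrieve, compute, store) can be sketched as follows. Everything in this block is a stand-in: plain dicts play the role of the storage backend and the indexer storage, the "indexing computation" is a trivial byte count, and the `"indexed"` summary key is hypothetical rather than the key the real method returns.

```python
from typing import Dict, List


def process_journal_objects_sketch(
    objects: Dict[str, List[dict]],
    storage: Dict[str, bytes],
    indexer_storage: Dict[str, dict],
) -> Dict[str, int]:
    """Hypothetical flow; dicts stand in for storage and indexer storage."""
    indexed = 0
    for message in objects.get("content", []):
        obj_id = message["id"]
        # 1. Retrieve the associated object from the (stand-in) storage backend.
        data = storage.get(obj_id)
        if data is None:
            continue  # nothing to index for this message
        # 2. Execute the indexing computation (here: a trivial byte count).
        result = {"id": obj_id, "length": len(data)}
        # 3. Store the result in the (stand-in) indexer storage.
        indexer_storage[obj_id] = result
        indexed += 1
    return {"indexed": indexed}


storage = {"a": b"hello"}
indexer_storage: Dict[str, dict] = {}
summary = process_journal_objects_sketch(
    {"content": [{"id": "a"}, {"id": "missing"}]}, storage, indexer_storage
)
```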

index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) → List[OriginExtrinsicMetadataRow]

Index computation for the id and associated raw data.

Parameters:

id – identifier or Dict object
data – id’s data from storage or objstorage depending on object type

Returns:

a list of OriginExtrinsicMetadataRow objects, in the form expected by the persist_index_computations() method.

Return type:

List[OriginExtrinsicMetadataRow]

persist_index_computations(results: List[OriginExtrinsicMetadataRow]) → Dict[str, int]: Persist the results in storage.

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)

Bases: ContentIndexer[ContentMetadataRow]

Content-level indexer

This indexer is in charge of:

filtering out content already indexed in content_metadata
reading content from objstorage using the content’s sha1 identifier
computing metadata from the given context
using the metadata_dictionary as the ‘swh-metadata-translator’ tool
storing the result in the content_metadata table

Prepare and check that the indexer is ready to run.

filter(ids: List[ObjId]): Filter out known sha1s and return only missing ones.
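
The filtering step can be pictured with a small sketch. It is an illustration only: identifiers are modeled as plain bytes (which may differ from the real ObjId type), and the `already_indexed` set is a hypothetical stand-in for the lookup against the content_metadata table.

```python
from typing import Iterable, Iterator, Set


def filter_missing_sketch(
    ids: Iterable[bytes], already_indexed: Set[bytes]
) -> Iterator[bytes]:
    """Yield only the identifiers that have no indexed result yet."""
    for id_ in ids:
        if id_ not in already_indexed:
            yield id_


missing = list(filter_missing_sketch([b"a", b"b", b"c"], already_indexed={b"b"}))
```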

index(id: ObjId, data: bytes | None = None, log_suffix='unknown directory', **kwargs) → List[ContentMetadataRow]

Index the content identified by a sha1 and store the result.

Parameters:

id – content’s identifier
data – raw content in bytes

Returns:

a list of ContentMetadataRow objects, each representing a content_metadata entry. If the translation was not successful, the metadata keys are returned as None.

Return type:

List[ContentMetadataRow]

persist_index_computations(results: List[ContentMetadataRow]) → Dict[str, int]: Persist the results in storage.

class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)

Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]

Directory-level indexer

This indexer is in charge of:

filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool
retrieving all entry_files in the directory
using metadata_detector on file_names containing metadata
computing the metadata translation if necessary and possible (depends on the tool)
sending sha1s to content indexing if possible
storing the results for the directory
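
The detection step in the list above can be sketched as a filename lookup over the directory's entries. This is a simplified illustration, not the real metadata_detector: the `KNOWN_METADATA_FILES` table, the mapping names in it, and the entry layout (string names, a `"sha1"` key) are all assumptions.

```python
from typing import Dict, List

# Hypothetical filename -> mapping-name table; not the real swh.indexer registry.
KNOWN_METADATA_FILES = {"package.json": "npm", "codemeta.json": "codemeta"}


def detect_metadata_files_sketch(entries: List[dict]) -> Dict[str, List[bytes]]:
    """Group the sha1 of each recognised file by the mapping that handles it."""
    detected: Dict[str, List[bytes]] = {}
    for entry in entries:
        if entry["type"] != "file":
            continue  # sub-directories are not inspected in this sketch
        mapping = KNOWN_METADATA_FILES.get(entry["name"].lower())
        if mapping is not None:
            detected.setdefault(mapping, []).append(entry["sha1"])
    return detected


entries = [
    {"type": "file", "name": "package.json", "sha1": b"\x01"},
    {"type": "dir", "name": "src", "sha1": b"\x02"},
    {"type": "file", "name": "README.md", "sha1": b"\x03"},
]
detected = detect_metadata_files_sketch(entries)
```

The sha1s collected per mapping are what would then be handed to content-level indexing.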

Prepare and check that the indexer is ready to run.

filter(sha1_gits): Filter out known sha1s and return only missing ones.

index(id: bytes, data: Directory | None = None, **kwargs) → List[DirectoryIntrinsicMetadataRow]

Index a directory by processing it and organizing the result.

Uses metadata_detector to iterate over filenames, passes them to the content indexers, then merges the results (if there is more than one).

Parameters:

id – sha1_git of the directory
data – should always be None

Returns:

a list of DirectoryIntrinsicMetadataRow objects, each representing a directory_intrinsic_metadata entry with the fields:

id: directory’s identifier (sha1_git)
indexer_configuration_id (bytes): tool used
metadata: dict of retrieved metadata

Return type:

List[DirectoryIntrinsicMetadataRow]

persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int]: Persist the results in storage.

translate_directory_intrinsic_metadata(files: List[DirectoryLsEntry], log_suffix: str) → Tuple[List[Any], Any]

Determine the plan of action to translate metadata in the given root directory.

Parameters:

files – list of file entries, as returned by swh.storage.interface.StorageInterface.directory_ls()

Returns:

list of mappings used and dict with translated metadata according to the CodeMeta vocabulary

Return type:

(List[str], dict)
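
The shape of the return value (mappings used, plus one merged metadata dict) can be illustrated with a short sketch. The function name, the input format, and the "first mapping wins" merge rule are assumptions; the real method may merge CodeMeta documents differently.

```python
from typing import Any, Dict, List, Tuple


def merge_translations_sketch(
    translated: List[Tuple[str, Dict[str, Any]]]
) -> Tuple[List[str], Dict[str, Any]]:
    """Combine per-mapping translations into (mappings used, merged metadata)."""
    used: List[str] = []
    merged: Dict[str, Any] = {}
    for mapping_name, metadata in translated:
        if not metadata:
            continue  # this mapping produced nothing, so it is not reported
        used.append(mapping_name)
        for key, value in metadata.items():
            # Assumption: the first mapping to provide a key wins.
            merged.setdefault(key, value)
    return used, merged


mappings, metadata = merge_translations_sketch(
    [
        ("npm", {"name": "foo", "version": "1.0"}),
        ("codemeta", {"name": "bar", "license": "MIT"}),
        ("pkg-info", {}),
    ]
)
```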

class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)

Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

Prepare and check that the indexer is ready to run.

USE_TOOLS = False

index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) → List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]

persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) → Dict[str, int]

Persist the computation resulting from the index.

Parameters:

results – List of results. One result is the result of the index function.

Returns:

a summary dict of what has been inserted in the storage
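
A summary dict of this kind can be produced as in the following sketch, which persists (origin row, directory row) pairs into two in-memory tables. The dict-based rows, the table stand-ins, and the summary key names are all hypothetical; they only illustrate counting what was actually inserted.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def persist_pairs_sketch(
    results: List[Tuple[Row, Row]],
    origin_table: Dict[str, Row],
    directory_table: Dict[str, Row],
) -> Dict[str, int]:
    """Insert each pair once and report how many rows were added."""
    origins_added = 0
    directories_added = 0
    for origin_row, directory_row in results:
        # Several origins may point at the same directory: store it only once.
        if directory_row["id"] not in directory_table:
            directory_table[directory_row["id"]] = directory_row
            directories_added += 1
        if origin_row["id"] not in origin_table:
            origin_table[origin_row["id"]] = origin_row
            origins_added += 1
    return {
        "origin_rows_added": origins_added,
        "directory_rows_added": directories_added,
    }


results = [
    ({"id": "https://a.example", "from_directory": "d1"}, {"id": "d1"}),
    ({"id": "https://b.example", "from_directory": "d1"}, {"id": "d1"}),
]
persist_summary = persist_pairs_sketch(results, origin_table={}, directory_table={})
```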
