swh.indexer.metadata module
- swh.indexer.metadata.call_with_batches(f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) -> Iterator[T2]
Calls a function with batches of args, and concatenates the results.
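The batching behaviour described above can be illustrated with a minimal sketch. This is a re-implementation written for illustration only, not the module's source; the name call_with_batches_sketch, the helper double_all, and the ValueError check are assumptions.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]],
    args: List[T1],
    batch_size: int,
) -> Iterator[T2]:
    """Call f on successive slices of args and yield every result in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(args)
    while True:
        # Take at most batch_size items; an empty batch means we are done.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield from f(batch)


# Hypothetical helper that records the batches it receives.
seen_batches: List[List[int]] = []


def double_all(batch: List[int]) -> List[int]:
    seen_batches.append(batch)
    return [2 * x for x in batch]


results = list(call_with_batches_sketch(double_all, [1, 2, 3, 4, 5], batch_size=2))
```

Because the sketch is a generator, f is only called as the caller consumes the results, so a large args list never needs all its outputs in memory at once.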
- class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)
Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]
Prepare and check that the indexer is ready to run.
- process_journal_objects(objects: ObjectsDict) -> Dict
Read swh message objects (content, origin, …) from the journal to:
retrieve the associated objects from the storage backend (e.g. storage, objstorage…)
execute the associated indexing computations
store the results in the indexer storage
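The three steps above (retrieve, compute, store) can be sketched with in-memory stand-ins. Everything here is hypothetical: the dictionaries raw_storage and indexer_storage, the toy "index" (the byte length of the data), and the summary key are illustrative assumptions, not the real swh storage or journal APIs.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the storage backend and the indexer storage.
raw_storage: Dict[str, bytes] = {"id-1": b"hello", "id-2": b"software heritage"}
indexer_storage: Dict[str, int] = {}


def process_journal_objects_sketch(objects: Dict[str, List[dict]]) -> Dict[str, int]:
    """Toy version of the retrieve / compute / store loop."""
    added = 0
    for messages in objects.values():
        for message in messages:
            obj_id = message["id"]
            # 1. Retrieve the associated object from the storage backend.
            data = raw_storage.get(obj_id)
            if data is None:
                continue  # nothing to index for unknown objects
            # 2. Execute the indexing computation (here: a toy length index).
            result = len(data)
            # 3. Store the result in the indexer storage.
            indexer_storage[obj_id] = result
            added += 1
    # Hypothetical summary key, chosen for this sketch only.
    return {"toy_index:add": added}


summary = process_journal_objects_sketch(
    {"content": [{"id": "id-1"}, {"id": "missing"}], "origin": [{"id": "id-2"}]}
)
```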
- index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) -> List[OriginExtrinsicMetadataRow]
Index computation for the id and associated raw data.
- Parameters:
id – identifier or Dict object
data – id’s data from storage or objstorage depending on object type
- Returns:
a dict that makes sense for the persist_index_computations() method.
- class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)
Bases: ContentIndexer[ContentMetadataRow]
Content-level indexer
This indexer is in charge of:
filtering out content already indexed in content_metadata
reading content from objstorage using the content’s sha1 identifier
computing metadata from the given context
using the metadata_dictionary as the ‘swh-metadata-translator’ tool
storing the result in the content_metadata table
Prepare and check that the indexer is ready to run.
- filter(ids: List[CompositeObjId])
Filter out known sha1s and return only missing ones.
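A minimal sketch of this kind of filtering follows. It assumes each identifier is a dict carrying a "sha1" key and that the set of already-indexed sha1s is available in memory; the real method queries the indexer storage instead, and the names below are hypothetical.

```python
from typing import Dict, Iterator, List, Set

ObjId = Dict[str, bytes]  # assumed shape: {"sha1": ...}


def filter_missing_sketch(ids: List[ObjId], already_indexed: Set[bytes]) -> Iterator[ObjId]:
    """Yield only the identifiers whose sha1 has not been indexed yet."""
    for obj_id in ids:
        if obj_id["sha1"] not in already_indexed:
            yield obj_id


known = {b"\xaa" * 20}
candidates = [{"sha1": b"\xaa" * 20}, {"sha1": b"\xbb" * 20}]
missing = list(filter_missing_sketch(candidates, known))
```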
- index(id: CompositeObjId, data: bytes | None = None, log_suffix='unknown directory', **kwargs) -> List[ContentMetadataRow]
Index the content identified by a sha1 and store the result.
- Parameters:
id – content’s identifier
data – raw content in bytes
- Returns:
dictionary representing a content_metadata. If the translation was not successful, the metadata keys are returned as None.
- class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)
Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]
Directory-level indexer
This indexer is in charge of:
filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool
retrieving all entry_files in the directory
using metadata_detector on file_names containing metadata
computing the metadata translation if necessary and possible (depends on the tool)
sending sha1s to content indexing if possible
storing the results for the directory
Prepare and check that the indexer is ready to run.
- index(id: bytes, data: Directory | None = None, **kwargs) -> List[DirectoryIntrinsicMetadataRow]
Index directory by processing it and organizing result.
Uses metadata_detector to iterate on filenames, passes them to the content indexers, then merges the results (if there is more than one).
- Parameters:
id – sha1_git of the directory
data – should always be None
- Returns:
dictionary representing a directory_intrinsic_metadata, with keys:
id: directory’s identifier (sha1_git)
indexer_configuration_id (bytes): tool used
metadata: dict of retrieved metadata
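To make the merge step and the result shape concrete, here is a toy sketch. The merge strategy (keep a single value when files agree, collect differing values into a list) is an assumption chosen for illustration and is not necessarily how the module merges documents; the identifier values are placeholders.

```python
from typing import Any, Dict, List


def merge_metadata_sketch(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Toy merge of per-file metadata dicts holding scalar values."""
    merged: Dict[str, Any] = {}
    for doc in documents:
        for key, value in doc.items():
            if key not in merged:
                merged[key] = value
            elif merged[key] != value:
                # Conflicting values: keep all of them in a list.
                current = merged[key]
                existing = list(current) if isinstance(current, list) else [current]
                if value not in existing:
                    existing.append(value)
                merged[key] = existing
    return merged


per_file = [
    {"name": "foo", "license": "MIT"},
    {"name": "foo", "version": "1.0"},
    {"license": "GPL-3.0"},
]
merged = merge_metadata_sketch(per_file)

# Result shaped after the keys listed above; both identifiers are placeholders.
row = {
    "id": b"\x00" * 20,               # directory's sha1_git (placeholder)
    "indexer_configuration_id": b"1",  # tool used (placeholder)
    "metadata": merged,
}
```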
- persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) -> Dict[str, int]
Persist the results in storage.
- translate_directory_intrinsic_metadata(files: List[DirectoryLsEntry], log_suffix: str) -> Tuple[List[Any], Any]
Determine the plan of action to translate metadata in the given root directory.
- Parameters:
files – list of file entries, as returned by
swh.storage.interface.StorageInterface.directory_ls()
- Returns:
list of mappings used and dict with translated metadata according to the CodeMeta vocabulary
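One way to picture the "plan of action" is a lookup from file names to mappings. The sketch below only covers that detection step, not the translation itself. The table KNOWN_METADATA_FILES, the function name, and the assumption that each entry carries "name", "type" and "sha1" keys are illustrative, not taken from the module.

```python
from typing import Any, Dict, List, Tuple

# Illustrative file-name -> mapping-name table (an assumption for this sketch).
KNOWN_METADATA_FILES: Dict[bytes, str] = {
    b"package.json": "npm",
    b"codemeta.json": "codemeta",
}


def plan_translation_sketch(
    files: List[Dict[str, Any]],
) -> Tuple[List[str], Dict[str, List[bytes]]]:
    """Return the mappings to use and, for each mapping, the contents to translate."""
    plan: Dict[str, List[bytes]] = {}
    for entry in files:
        if entry.get("type") != "file":
            continue  # only regular files can hold metadata
        mapping = KNOWN_METADATA_FILES.get(entry["name"].lower())
        if mapping is not None:
            plan.setdefault(mapping, []).append(entry["sha1"])
    return sorted(plan), plan


entries = [
    {"name": b"package.json", "type": "file", "sha1": b"\x01" * 20},
    {"name": b"src", "type": "dir", "sha1": None},
    {"name": b"README.md", "type": "file", "sha1": b"\x02" * 20},
]
mappings_used, plan = plan_translation_sketch(entries)
```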
- class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)
Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
Prepare and check that the indexer is ready to run.
- USE_TOOLS = False
- index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) -> List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
- persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) -> Dict[str, int]
Persist the computation resulting from the index.
- Parameters:
results – List of results. One result is the result of the index function.
- Returns:
a summary dict of what has been inserted in the storage
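As a rough illustration of persisting paired results and returning a summary, consider the sketch below. The row classes, the in-memory tables, the deduplication by id, and the summary keys are all assumptions made for this example; the real method writes to the indexer storage and defines its own keys.

```python
from typing import Dict, List, NamedTuple, Tuple


class OriginRow(NamedTuple):  # hypothetical stand-in for OriginIntrinsicMetadataRow
    id: str
    metadata: dict


class DirectoryRow(NamedTuple):  # hypothetical stand-in for DirectoryIntrinsicMetadataRow
    id: bytes
    metadata: dict


# Hypothetical in-memory tables standing in for the indexer storage.
origin_table: Dict[str, OriginRow] = {}
directory_table: Dict[bytes, DirectoryRow] = {}


def persist_sketch(results: List[Tuple[OriginRow, DirectoryRow]]) -> Dict[str, int]:
    """Store each pair once per id and report how many rows were inserted."""
    origins_added = 0
    directories_added = 0
    for origin_row, directory_row in results:
        if directory_row.id not in directory_table:
            directory_table[directory_row.id] = directory_row
            directories_added += 1
        if origin_row.id not in origin_table:
            origin_table[origin_row.id] = origin_row
            origins_added += 1
    # Summary keys are invented for this sketch.
    return {"origin_rows:add": origins_added, "directory_rows:add": directories_added}


shared_dir = DirectoryRow(id=b"\x01" * 20, metadata={"name": "demo"})
summary = persist_sketch(
    [
        (OriginRow(id="https://example.org/a.git", metadata={"name": "demo"}), shared_dir),
        (OriginRow(id="https://example.org/b.git", metadata={"name": "demo"}), shared_dir),
    ]
)
```

Two origins that point at the same directory produce two origin rows but only one directory row in this sketch.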