swh.indexer.metadata module
- swh.indexer.metadata.fetch_in_batches(fetch_fn: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) → Iterator[T2]
Calls a function fetch_fn on batches of args and yields the results of each batch that succeeds.
When a batch raises, processing continues with the next batch of data to read.
The failed batches are then read again, one object at a time. Any further failure is logged and skipped, so callers receive a partial result set rather than a total failure.
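The retry strategy described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the name fetch_in_batches_sketch, the logger, and the choice to materialize each batch before yielding its results are assumptions made for the example.

```python
import logging
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")

logger = logging.getLogger(__name__)


def fetch_in_batches_sketch(
    fetch_fn: Callable[[List[T1]], Iterable[T2]],
    args: List[T1],
    batch_size: int,
) -> Iterator[T2]:
    """Hypothetical re-creation of the documented behaviour."""
    failed: List[T1] = []

    # First pass: read batch by batch, remembering the arguments of any
    # batch that raised.
    for start in range(0, len(args), batch_size):
        batch = args[start:start + batch_size]
        try:
            # Assumption: consume the batch fully before yielding, so a
            # batch that fails midway contributes no partial results.
            results = list(fetch_fn(batch))
        except Exception:
            logger.warning("Batch of %d objects failed, will retry", len(batch))
            failed.extend(batch)
        else:
            yield from results

    # Second pass: retry the failed arguments one object at a time.
    for arg in failed:
        try:
            results = list(fetch_fn([arg]))
        except Exception:
            logger.exception("Skipping object that failed on retry: %r", arg)
        else:
            yield from results


def flaky_fetch(batch: List[int]) -> List[int]:
    """Toy fetch function: 4 always fails, 3 fails only inside a batch."""
    if 4 in batch or (3 in batch and len(batch) > 1):
        raise ValueError("simulated storage error")
    return [value * 10 for value in batch]


partial = list(fetch_in_batches_sketch(flaky_fetch, [1, 2, 3, 4, 5, 6], 2))
```

In this toy run the batch containing 3 and 4 fails, 3 succeeds on the individual retry, and 4 is logged and skipped, so results from retried objects arrive after those of the batches that succeeded.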
- swh.indexer.metadata.fetch_as_dict(fetch_fn: Callable[[List[T1]], Iterable[T2]], ids: List[T1], batch_size: int) → Dict[T1, T2]
Return a dict {id: object}; missing items are logged.
- class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)
Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]
Indexer for Raw Extrinsic Metadata
For supported extrinsic metadata formats, translate the original format into CodeMeta, and attach the result to the Origin.
Use XXX to get registered mapping formats.
Prepare and check that the indexer is ready to run.
- process_journal_objects(objects: ObjectsDict) → Dict
Read swh message objects (content, origin, …) from the journal to:
retrieve the associated objects from the storage backend (e.g. storage, objstorage…)
execute the associated indexing computations
store the results in the indexer storage
- index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) → List[OriginExtrinsicMetadataRow]
Index computation for the id and associated raw data.
- Parameters:
id – identifier or Dict object
data – id’s data from storage or objstorage depending on object type
- Returns:
a dict that makes sense for the persist_index_computations() method.
- Return type:
List[OriginExtrinsicMetadataRow]
- class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)
Bases: ContentIndexer[ContentMetadataRow]
Content-level indexer
This indexer is in charge of:
filtering out content already indexed in content_metadata
reading content from objstorage with the content’s id sha1
computing metadata by given context
using the metadata_mapping as the ‘swh-metadata-translator’ tool
storing the result in the content_metadata table
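The responsibilities listed above can be pictured with a small, self-contained sketch. It does not use the real swh classes or storage backends: index_contents_sketch, parse_json_metadata, the in-memory dict standing in for the objstorage, and the set of already-indexed ids are all hypothetical stand-ins chosen for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Set


def parse_json_metadata(data: bytes) -> Optional[Dict[str, Any]]:
    """Hypothetical translator: returns None when the content is not JSON."""
    try:
        parsed = json.loads(data)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def index_contents_sketch(
    ids: List[bytes],
    already_indexed: Set[bytes],
    read_content: Callable[[bytes], Optional[bytes]],
    translate: Callable[[bytes], Optional[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Filter, read, translate, and collect rows (toy version)."""
    rows: List[Dict[str, Any]] = []
    for content_id in ids:
        # Step 1: filter out content that was already indexed.
        if content_id in already_indexed:
            continue
        # Step 2: read the raw bytes (a dict lookup here, not an objstorage).
        data = read_content(content_id)
        if data is None:
            continue
        # Step 3: translate; a failed translation is kept with metadata None.
        rows.append({"id": content_id, "metadata": translate(data)})
    # Step 4 (storing the rows) is left to the caller in this sketch.
    return rows


fake_objstorage = {b"b": b'{"name": "demo"}', b"c": b"not json"}
rows = index_contents_sketch(
    [b"a", b"b", b"c"],
    already_indexed={b"a"},
    read_content=fake_objstorage.get,
    translate=parse_json_metadata,
)
```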
Prepare and check that the indexer is ready to run.
- index(id: HashDict, data: bytes | None = None, log_suffix='unknown directory', **kwargs) → List[ContentMetadataRow]
Index sha1s’ content and store result.
- Parameters:
id – content’s identifier
data – raw content in bytes
- Returns:
dictionary representing a content_metadata. If the translation wasn’t successful, the metadata keys will be returned as None.
- Return type:
List[ContentMetadataRow]
- swh.indexer.metadata.directory_get(storage: StorageInterface, directory_id: bytes) → Tuple[Directory | None, bool]
Get the directory from the storage. This uses a more efficient implementation to read the directory from the storage, but it is currently limited: it can read a directory only partially.
- Parameters:
storage – the storage instance
directory_id – the directory’s identifier
- Returns:
The directory if it could be properly put back together. None otherwise.
Whether the list of entries was truncated
- class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)
Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]
Directory-level indexer
This indexer is in charge of:
filtering out directories already indexed in the directory_intrinsic_metadata table with the defined computation tool
retrieving all entry_files in the directory
using metadata_detector for file_names containing metadata
computing the metadata translation if necessary and possible (depends on the tool)
sending sha1s to content indexing if possible
storing the results for the directory
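The detect, translate, and merge flow listed above can be illustrated with a toy example. Nothing here is the real swh API: index_directory_sketch, merge_metadata_sketch, the filename-to-translator table, and the "first value wins" merge policy are assumptions made for the sketch, and the actual detector and merge logic may behave differently.

```python
import json
from typing import Any, Callable, Dict, List, Optional

Translator = Callable[[bytes], Optional[Dict[str, Any]]]


def parse_json_file(data: bytes) -> Optional[Dict[str, Any]]:
    """Hypothetical translator for JSON-based metadata files."""
    try:
        parsed = json.loads(data)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def merge_metadata_sketch(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assumed merge policy: the first document providing a key wins."""
    merged: Dict[str, Any] = {}
    for document in documents:
        for key, value in document.items():
            merged.setdefault(key, value)
    return merged


def index_directory_sketch(
    entries: Dict[str, bytes],
    translators: Dict[str, Translator],
) -> Optional[Dict[str, Any]]:
    """Detect known metadata filenames, translate each, merge the results."""
    documents: List[Dict[str, Any]] = []
    # Sorted iteration keeps the merge order deterministic.
    for name in sorted(entries):
        translator = translators.get(name.lower())
        if translator is None:
            continue  # not a recognised metadata file
        translated = translator(entries[name])
        if translated is not None:
            documents.append(translated)
    if not documents:
        return None
    return merge_metadata_sketch(documents)


directory_entries = {
    "package.json": b'{"name": "demo", "version": "1.0"}',
    "README.md": b"# demo",
    "codemeta.json": b'{"name": "other", "license": "MIT"}',
}
known_files = {"package.json": parse_json_file, "codemeta.json": parse_json_file}
merged = index_directory_sketch(directory_entries, known_files)
```

With these inputs codemeta.json is visited before package.json, so its name is kept while version is filled in from package.json.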
Prepare and check that the indexer is ready to run.
- index(id: bytes, data: Directory | None = None, **kwargs) → List[DirectoryIntrinsicMetadataRow]
Index a directory by processing it and organizing the result.
Uses metadata_detector to iterate over filenames, passes them to the content indexers, then merges the results (if there is more than one).
- Parameters:
id – sha1_git of the directory
data – should always be None
- Returns:
dictionary representing a directory_intrinsic_metadata, with keys:
id: directory’s identifier (sha1_git)
indexer_configuration_id (bytes): tool used
metadata: dict of retrieved metadata
- Return type:
List[DirectoryIntrinsicMetadataRow]
- persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int]
Persist the results in storage.
- class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)
Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
Indexer for intrinsic metadata found within origin’s root directory
If there is a metadata file corresponding to a known format in the root directory of an Origin (i.e. in the root directory of the , read it and
Prepare and check that the indexer is ready to run.
- USE_TOOLS = False
- index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) → List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
- persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) → Dict[str, int]
Persist the computation resulting from the index.
- Parameters:
results – List of results. One result is the result of the index function.
- Returns:
a summary dict of what has been inserted in the storage