swh.indexer.mimetype module

swh.indexer.mimetype.compute_mimetype_encoding(raw_content: bytes) → Dict[str, str]

Determine mimetype and encoding from the raw content.

Parameters:

raw_content – content’s raw data

Returns:

mimetype and encoding key and corresponding values.

Return type:

dict
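
To make the shape of the returned dict concrete, here is a minimal sketch. It assumes the detector produces a libmagic-style description such as "text/plain; charset=us-ascii"; the helper name split_mime_description is hypothetical and is not part of swh.indexer.

```python
from typing import Dict


def split_mime_description(description: str) -> Dict[str, str]:
    """Split a 'type/subtype; charset=name' string into mimetype and encoding.

    Hypothetical helper for illustration only; it is not the swh.indexer code.
    """
    mimetype, _sep, encoding = description.partition("; charset=")
    # When no charset is present, partition() leaves `encoding` empty.
    return {"mimetype": mimetype.strip(), "encoding": encoding.strip()}


result = split_mime_description("text/plain; charset=us-ascii")
print(result)
```

The two keys, mimetype and encoding, mirror the keys described under Returns above; how the real function obtains the description is not shown here.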

class swh.indexer.mimetype.MixinMimetypeIndexer(*args, **kwargs)

Bases: object

Mixin mimetype indexer.

See MimetypeIndexer and MimetypePartitionIndexer

tool: Any
idx_storage: IndexerStorageInterface
index(id: bytes, data: Optional[bytes] = None, **kwargs) → List[ContentMimetypeRow]

Index the content identified by the given sha1 and return the result.

Parameters:
  • id – content’s identifier

  • data – raw content in bytes

Returns:

a list of content mimetype rows, each carrying

  • id: content’s identifier (sha1)

  • mimetype: content’s mimetype

  • encoding: content’s encoding

Return type:

List[ContentMimetypeRow]

persist_index_computations(results: List[ContentMimetypeRow]) → Dict[str, int]

Persist the results in storage.

Parameters:

results – list of content mimetype rows (see index())
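
The index-then-persist flow of the mixin can be sketched as follows. Everything here is a stand-in under stated assumptions: Row replaces ContentMimetypeRow, InMemoryStorage replaces the indexer storage, the ASCII check is a toy detector, and the "added" summary key is invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for ContentMimetypeRow."""

    id: bytes
    mimetype: str
    encoding: str


class InMemoryStorage:
    """Hypothetical stand-in for the indexer storage."""

    def __init__(self) -> None:
        self.rows: Dict[bytes, Row] = {}

    def add(self, rows: List[Row]) -> Dict[str, int]:
        added = 0
        for row in rows:
            if row.id not in self.rows:
                self.rows[row.id] = row
                added += 1
        # The "added" key is an assumption made for this sketch.
        return {"added": added}


class SketchMimetypeMixin:
    """Index one content, then persist a batch of results."""

    idx_storage: InMemoryStorage

    def index(self, id: bytes, data: Optional[bytes] = None, **kwargs) -> List[Row]:
        if data is None:
            return []  # nothing to analyse
        # Toy detection (assumption): ASCII-decodable data counts as text.
        try:
            data.decode("ascii")
            mimetype, encoding = "text/plain", "us-ascii"
        except UnicodeDecodeError:
            mimetype, encoding = "application/octet-stream", "binary"
        return [Row(id=id, mimetype=mimetype, encoding=encoding)]

    def persist_index_computations(self, results: List[Row]) -> Dict[str, int]:
        return self.idx_storage.add(results)


indexer = SketchMimetypeMixin()
indexer.idx_storage = InMemoryStorage()
rows = indexer.index(b"\x01" * 20, b"hello")
summary = indexer.persist_index_computations(rows)
print(rows, summary)
```

Keeping index() free of storage writes and leaving persistence to a separate method lets a caller batch many results into one storage call, which is the split the two method signatures above suggest.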

class swh.indexer.mimetype.MimetypeIndexer(*args, **kwargs)

Bases: MixinMimetypeIndexer, ContentIndexer[ContentMimetypeRow]

Mimetype indexer working on a list of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. filter())

  • reads content from objstorage per the content’s id (sha1)

  • computes {mimetype, encoding} from that content

  • stores result in storage

Prepare and check that the indexer is ready to run.

filter(ids)

Filter out known sha1s and return only missing ones.
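
The filtering step can be pictured with a minimal sketch. It assumes the set of already indexed sha1s is available in memory; the function name filter_missing and the already_indexed argument are hypothetical, and the real method is expected to consult the indexer storage instead.

```python
from typing import Iterable, Iterator, Set


def filter_missing(ids: Iterable[bytes], already_indexed: Set[bytes]) -> Iterator[bytes]:
    """Yield only the sha1s that have not been indexed yet (illustrative)."""
    for sha1 in ids:
        if sha1 not in already_indexed:
            yield sha1


known = {b"a" * 20}
todo = list(filter_missing([b"a" * 20, b"b" * 20], known))
print(todo)
```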

idx_storage: IndexerStorageInterface

class swh.indexer.mimetype.MimetypePartitionIndexer(*args, **kwargs)

Bases: MixinMimetypeIndexer, ContentPartitionIndexer[ContentMimetypeRow]

Mimetype partition indexer working on a partition of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. indexed_contents_in_partition())

  • reads content from objstorage per the content’s id (sha1)

  • computes {mimetype, encoding} from that content

  • stores result in storage

Prepare and check that the indexer is ready to run.

indexed_contents_in_partition(partition_id: int, nb_partitions: int) → Iterable[bytes]

Retrieve indexed content ids within partition_id.

Parameters:
  • partition_id – Index of the partition to fetch

  • nb_partitions – Total number of partitions to split into

  • page_token – opaque token used for pagination
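
To illustrate how partition_id and nb_partitions relate, the sketch below splits the 160-bit sha1 space into equal-width partitions. This partitioning scheme is an assumption made for the example, not a statement of how the storage backend actually assigns ids; partition_of and ids_in_partition are hypothetical names, and pagination is left out.

```python
from typing import Iterable, Iterator

SHA1_BITS = 160  # a sha1 digest is 20 bytes


def partition_of(sha1: bytes, nb_partitions: int) -> int:
    """Map a 20-byte sha1 to a partition index in [0, nb_partitions)."""
    value = int.from_bytes(sha1, "big")
    return (value * nb_partitions) >> SHA1_BITS


def ids_in_partition(
    indexed_ids: Iterable[bytes], partition_id: int, nb_partitions: int
) -> Iterator[bytes]:
    """Yield the indexed ids that fall within the requested partition."""
    if not 0 <= partition_id < nb_partitions:
        raise ValueError("partition_id must be in [0, nb_partitions)")
    for sha1 in indexed_ids:
        if partition_of(sha1, nb_partitions) == partition_id:
            yield sha1


sample = [b"\x00" * 20, b"\x80" + b"\x00" * 19, b"\xff" * 20]
print(list(ids_in_partition(sample, 1, 2)))
```

With nb_partitions set to 2, ids whose first bit is 0 land in partition 0 and the rest in partition 1, so each partition can be processed independently.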

idx_storage: IndexerStorageInterface