swh.indexer.mimetype module
-
swh.indexer.mimetype.compute_mimetype_encoding(raw_content: bytes) → Dict[str, str]
Determine mimetype and encoding from the raw content.
- Parameters
raw_content – content’s raw data
- Returns
a dict with the keys mimetype and encoding and their corresponding values.
- Return type
dict
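The shape of the returned dict can be illustrated with a small sketch. The documentation above does not say which detector backs compute_mimetype_encoding, so the helper below, split_mime_description (a hypothetical name), simply assumes a description string of the form printed by `file --mime`, for example "text/plain; charset=us-ascii", and shows how such a string would map onto the two documented keys.

```python
from typing import Dict


def split_mime_description(description: str) -> Dict[str, str]:
    """Split 'type/subtype; charset=enc' into mimetype and encoding.

    Hypothetical helper: the input format is an assumption, not something
    the documentation above specifies.
    """
    mimetype, _sep, encoding = description.partition("; charset=")
    # When no charset part is present, partition() leaves encoding empty.
    return {"mimetype": mimetype.strip(), "encoding": encoding.strip()}


result = split_mime_description("text/plain; charset=us-ascii")
print(result)
```

A description without a charset part yields an empty string for encoding rather than raising.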
-
class swh.indexer.mimetype.MixinMimetypeIndexer(*args, **kwargs)
Bases: object
Mixin mimetype indexer.
See MimetypeIndexer and MimetypePartitionIndexer.
-
tool: Any
-
index(id: bytes, data: Optional[bytes] = None, **kwargs) → List[swh.indexer.storage.model.ContentMimetypeRow]
Index the content identified by a sha1 and store the result.
- Parameters
id – content’s identifier
data – raw content in bytes
- Returns
a list of rows describing the content’s mimetype, each with the fields
id: content’s identifier (sha1)
mimetype: mimetype in bytes
encoding: encoding in bytes
- Return type
List[swh.indexer.storage.model.ContentMimetypeRow]
-
persist_index_computations(results: List[swh.indexer.storage.model.ContentMimetypeRow]) → Dict[str, int]
Persist the results in storage.
- Parameters
results – list of content’s mimetype dicts (see index())
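The interplay between index() and persist_index_computations() can be sketched with a toy mixin. Everything below is illustrative and not the swh implementation: MimetypeRow is a stand-in for ContentMimetypeRow, the in-memory rows dict stands in for the indexer storage, the ASCII check is a crude placeholder for real detection, and the "added" summary key is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MimetypeRow:
    """Stand-in for ContentMimetypeRow; the str types are an assumption
    chosen to match compute_mimetype_encoding's Dict[str, str]."""
    id: bytes
    mimetype: str
    encoding: str


class ToyMimetypeMixin:
    """Illustrative mixin, not the swh implementation."""

    def __init__(self) -> None:
        # Hypothetical in-memory replacement for the indexer storage.
        self.rows: Dict[bytes, MimetypeRow] = {}

    def index(self, id: bytes, data: Optional[bytes] = None) -> List[MimetypeRow]:
        if data is None:
            return []  # nothing to index without raw content
        try:
            data.decode("ascii")
            mimetype, encoding = "text/plain", "us-ascii"
        except UnicodeDecodeError:
            mimetype, encoding = "application/octet-stream", "binary"
        return [MimetypeRow(id=id, mimetype=mimetype, encoding=encoding)]

    def persist_index_computations(self, results: List[MimetypeRow]) -> Dict[str, int]:
        added = 0
        for row in results:
            if row.id not in self.rows:
                self.rows[row.id] = row
                added += 1
        # Summary of what was written; the key name is hypothetical.
        return {"added": added}


indexer = ToyMimetypeMixin()
rows = indexer.index(b"\x01" * 20, b"hello world")
summary = indexer.persist_index_computations(rows)
print(summary)
```

Returning a list from index() lets the persistence step accept the output of many calls concatenated into one batch.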
-
-
class swh.indexer.mimetype.MimetypeIndexer(*args, **kwargs)
Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentIndexer[swh.indexer.storage.model.ContentMimetypeRow]
Mimetype Indexer working on a list of content identifiers.
It:
- (optionally) filters out content already indexed (cf. filter())
- reads content from objstorage per the content’s id (sha1)
- computes {mimetype, encoding} from that content
- stores result in storage
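The four steps above can be sketched as a plain function over in-memory stand-ins. This is a minimal illustration under stated assumptions, not the indexer's actual code: run_pipeline and fake_compute are hypothetical names, the objstorage and storage are ordinary dicts, and already_indexed plays the role of whatever filter() consults.

```python
from typing import Callable, Dict, Iterable, List, Set


def run_pipeline(
    ids: Iterable[bytes],
    objstorage: Dict[bytes, bytes],
    already_indexed: Set[bytes],
    compute: Callable[[bytes], Dict[str, str]],
    storage: Dict[bytes, Dict[str, str]],
    skip_indexed: bool = True,
) -> List[bytes]:
    """Apply the four steps and return the ids that were newly indexed."""
    done: List[bytes] = []
    for sha1 in ids:
        # 1. (optionally) filter out content already indexed
        if skip_indexed and sha1 in already_indexed:
            continue
        # 2. read the content from the object storage by its sha1
        raw = objstorage.get(sha1)
        if raw is None:
            continue  # content not available, nothing to compute
        # 3. compute {mimetype, encoding} from that content
        result = compute(raw)
        # 4. store the result
        storage[sha1] = result
        done.append(sha1)
    return done


def fake_compute(raw: bytes) -> Dict[str, str]:
    """Crude placeholder for real detection (assumption for the sketch)."""
    if raw.isascii():
        return {"mimetype": "text/plain", "encoding": "us-ascii"}
    return {"mimetype": "application/octet-stream", "encoding": "binary"}


objects = {b"a": b"hello", b"b": b"\xff\x00", b"c": b"seen before"}
index_storage: Dict[bytes, Dict[str, str]] = {}
indexed = run_pipeline([b"a", b"b", b"c", b"d"], objects, {b"c"}, fake_compute, index_storage)
print(indexed)
```

Here b"c" is skipped because it is already indexed and b"d" is skipped because the object storage has no content for it.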
-
class swh.indexer.mimetype.MimetypePartitionIndexer(*args, **kwargs)
Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentPartitionIndexer[swh.indexer.storage.model.ContentMimetypeRow]
Mimetype Range Indexer working on a range of content identifiers.
It:
- (optionally) filters out content already indexed (cf. indexed_contents_in_partition())
- reads content from objstorage per the content’s id (sha1)
- computes {mimetype, encoding} from that content
- stores result in storage
-
indexed_contents_in_partition(partition_id: int, nb_partitions: int, page_token: Optional[str] = None) → swh.core.api.classes.PagedResult[bytes, str]
Retrieve indexed content ids within partition_id.
- Parameters
partition_id – Index of the partition to fetch
nb_partitions – Total number of partitions to split into
page_token – opaque token used for pagination
- Returns
PagedResult of Sha1. If next_page_token is None, there is no more data to fetch.
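A caller would typically loop until next_page_token comes back as None. The sketch below shows that loop against a stand-in: Page imitates PagedResult[bytes, str] (the results field name is an assumption; next_page_token is taken from the description above), and fake_fetch is a hypothetical replacement for indexed_contents_in_partition that serves two canned pages.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Stand-in for PagedResult[bytes, str]; `results` is an assumed field name."""
    results: List[bytes] = field(default_factory=list)
    next_page_token: Optional[str] = None


def iter_partition(
    fetch: Callable[[int, int, Optional[str]], Page],
    partition_id: int,
    nb_partitions: int,
) -> Iterator[bytes]:
    """Yield every sha1 in the partition, following page tokens."""
    page_token: Optional[str] = None
    while True:
        page = fetch(partition_id, nb_partitions, page_token)
        yield from page.results
        if page.next_page_token is None:
            break  # no more data to fetch
        page_token = page.next_page_token


def fake_fetch(partition_id: int, nb_partitions: int, page_token: Optional[str] = None) -> Page:
    """Hypothetical two-page backend used only for this sketch."""
    pages = {
        None: Page([b"\x00", b"\x01"], "t1"),
        "t1": Page([b"\x02"], None),
    }
    return pages[page_token]


sha1s = list(iter_partition(fake_fetch, partition_id=0, nb_partitions=4))
print(sha1s)
```

Treating the token as opaque, as the parameter description asks, means the loop only ever passes back what the previous page returned.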