swh.indexer.mimetype module

swh.indexer.mimetype.compute_mimetype_encoding(raw_content: bytes) → Dict[str, bytes]

Determine mimetype and encoding from the raw content.

Parameters

raw_content – content’s raw data

Returns

A dict with the keys mimetype and encoding and their corresponding values.

Return type

dict
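
To make the return shape concrete, here is a minimal sketch. It does not call the real function; the tool configuration listed under MixinMimetypeIndexer names the file tool through python3-magic, which this example deliberately avoids. Instead, a hypothetical stand-in named sketch_mimetype_encoding guesses text versus binary using only the standard library, purely to show a dict whose mimetype and encoding keys hold bytes values.

```python
from typing import Dict


def sketch_mimetype_encoding(raw_content: bytes) -> Dict[str, bytes]:
    """Hypothetical stand-in, NOT the swh.indexer implementation.

    It only distinguishes "decodes as UTF-8" from "does not", which is
    far cruder than what the file tool reports.
    """
    if not raw_content:
        return {"mimetype": b"application/x-empty", "encoding": b"binary"}
    try:
        raw_content.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat the content as opaque binary data.
        return {"mimetype": b"application/octet-stream", "encoding": b"binary"}
    return {"mimetype": b"text/plain", "encoding": b"utf-8"}


text_result = sketch_mimetype_encoding(b"hello world\n")
binary_result = sketch_mimetype_encoding(b"\xff\xfe\x00\x01")
```

Both results carry exactly the two documented keys; only the values differ between the text and the binary input.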

class swh.indexer.mimetype.MixinMimetypeIndexer

Bases: object

Mixin mimetype indexer.

See MimetypeIndexer and MimetypeRangeIndexer

tool: Any
idx_storage: Any
ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'file', 'version': '1:5.30-1+deb9u1', 'configuration': {'type': 'library', 'debian-package': 'python3-magic'}}), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/mimetype'

index(id: bytes, data: Optional[bytes] = None, **kwargs) → Dict[str, Any]

Index the content identified by a sha1 and store the result.

Parameters
  • id – content’s identifier

  • data – raw content in bytes

Returns

A dict describing the content’s mimetype, with keys:

  • id: content’s identifier (sha1)

  • mimetype: mimetype in bytes

  • encoding: encoding in bytes

Return type

dict
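
The documented result can be pictured with a small sketch. The helper sketch_index and the fake_compute callback are hypothetical names introduced for illustration; the real method may attach further keys that this page does not list.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


def sketch_index(
    id: bytes,
    data: Optional[bytes],
    compute: Callable[[bytes], Dict[str, bytes]],
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented return shape of index()."""
    # Start from the {mimetype, encoding} pair, then attach the identifier.
    properties: Dict[str, Any] = dict(compute(data or b""))
    properties["id"] = id
    return properties


def fake_compute(raw: bytes) -> Dict[str, bytes]:
    # Assumed fixed answer, standing in for compute_mimetype_encoding().
    return {"mimetype": b"text/plain", "encoding": b"us-ascii"}


content = b"some text"
sha1 = hashlib.sha1(content).digest()  # 20-byte content identifier
result = sketch_index(sha1, content, fake_compute)
```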

persist_index_computations(results: List[Dict], policy_update: str) → Dict[str, int]

Persist the results in storage.

Parameters
  • results – list of content’s mimetype dicts (see index())

  • policy_update – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
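
The difference between the two policies can be sketched with an in-memory dict standing in for the indexer storage. Everything here is an assumption made for illustration: the function name sketch_persist, the dict-based store, and the "written" key of the returned summary are not taken from the library.

```python
from typing import Dict, List


def sketch_persist(
    store: Dict[bytes, Dict],
    results: List[Dict],
    policy_update: str,
) -> Dict[str, int]:
    """Hypothetical in-memory stand-in for the indexer storage write."""
    if policy_update not in ("update-dups", "ignore-dups"):
        raise ValueError(f"unknown policy_update: {policy_update!r}")
    written = 0
    for result in results:
        is_duplicate = result["id"] in store
        if is_duplicate and policy_update == "ignore-dups":
            continue  # keep the existing entry untouched
        store[result["id"]] = result  # insert, or overwrite a duplicate
        written += 1
    return {"written": written}  # summary key name is an assumption


old = {"id": b"\x01", "mimetype": b"text/plain", "encoding": b"us-ascii"}
new = {"id": b"\x01", "mimetype": b"text/x-python", "encoding": b"us-ascii"}

ignoring = {b"\x01": old}
ignore_summary = sketch_persist(ignoring, [new], "ignore-dups")

updating = {b"\x01": old}
update_summary = sketch_persist(updating, [new], "update-dups")
```

With ‘ignore-dups’ the previously stored entry survives; with ‘update-dups’ the new result replaces it.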

class swh.indexer.mimetype.MimetypeIndexer(config=None, **kw)

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentIndexer

Mimetype Indexer working on list of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. filter())

  • reads content from objstorage per the content’s id (sha1)

  • computes {mimetype, encoding} from that content

  • stores result in storage
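
The four steps above can be sketched as a single loop. This is a simplified illustration under stated assumptions, not the class’s actual code: plain dicts stand in for objstorage and the indexer storage, and sketch_run, fake_compute and skip_indexed are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def sketch_run(
    ids: Iterable[bytes],
    objstorage: Dict[bytes, bytes],
    idx_store: Dict[bytes, Dict],
    compute: Callable[[bytes], Dict[str, bytes]],
    skip_indexed: bool = True,
) -> List[Dict]:
    """Hypothetical end-to-end pass over a list of content ids."""
    results = []
    for sha1 in ids:
        # 1. (optionally) filter out content that is already indexed
        if skip_indexed and sha1 in idx_store:
            continue
        # 2. read the raw content from objstorage by its sha1
        raw = objstorage[sha1]
        # 3. compute {mimetype, encoding} from that content
        result = dict(compute(raw), id=sha1)
        # 4. store the result
        idx_store[sha1] = result
        results.append(result)
    return results


def fake_compute(raw: bytes) -> Dict[str, bytes]:
    # Assumed fixed answer, standing in for compute_mimetype_encoding().
    return {"mimetype": b"text/plain", "encoding": b"us-ascii"}


contents = [b"first", b"second"]
objstorage = {hashlib.sha1(c).digest(): c for c in contents}
known, missing = list(objstorage)
idx_store = {known: dict(fake_compute(b"first"), id=known)}

new_results = sketch_run(list(objstorage), objstorage, idx_store, fake_compute)
```

Only the content that was missing from the index is read and computed; the already indexed one is skipped by the filter step.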

filter(ids)

Filter out known sha1s and return only missing ones.

idx_storage

class swh.indexer.mimetype.MimetypeRangeIndexer(config=None, **kw)

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentRangeIndexer

Mimetype Range Indexer working on range of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. indexed_contents_in_range())

  • reads content from objstorage per the content’s id (sha1)

  • computes {mimetype, encoding} from that content

  • stores result in storage

indexed_contents_in_range(start: bytes, end: bytes) → Dict[str, Optional[bytes]]

Retrieve the ids of indexed contents within the range [start, end].

Parameters
  • start – lower bound of the range of content identifiers

  • end – upper bound of the range of content identifiers

Returns

a dict with keys:

  • ids: iterable of content ids within the range.

  • next: the sha1 at which the next range starts, if any

Return type

dict
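
A caller would typically follow the next key until it is exhausted. The sketch below illustrates that pattern with an in-memory list; sketch_indexed_in_range, iter_all and the limit parameter (page size) are hypothetical and not part of the documented signature.

```python
from typing import Dict, Iterator, List, Optional


def sketch_indexed_in_range(
    indexed: List[bytes], start: bytes, end: bytes, limit: int = 2
) -> Dict:
    """Hypothetical paginated lookup over a list of indexed ids."""
    in_range = [i for i in sorted(indexed) if start <= i <= end]
    page, rest = in_range[:limit], in_range[limit:]
    # 'next' is the first id of the following page, or None when done.
    return {"ids": page, "next": rest[0] if rest else None}


def iter_all(indexed: List[bytes], start: bytes, end: bytes) -> Iterator[bytes]:
    """Walk every page by restarting the lookup from 'next'."""
    cursor: Optional[bytes] = start
    while cursor is not None:
        page = sketch_indexed_in_range(indexed, cursor, end)
        yield from page["ids"]
        cursor = page["next"]


indexed = [bytes([n]) for n in (1, 3, 5, 7, 9)]
first_page = sketch_indexed_in_range(indexed, b"\x02", b"\x08")
all_ids = list(iter_all(indexed, b"\x02", b"\x08"))
```

The first page holds two ids and points at the third through next; iterating until next is None yields all three ids inside the range.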

idx_storage