Software Heritage - Indexer
===========================

Tools to compute multiple indexes on SWH's raw contents:

- content:

  - mimetype
  - fossology-license
  - metadata

- origin:

  - metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:

- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:

- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers

Indexation procedure:

- receive a batch of ids
- retrieve the associated data, depending on the object type
- compute an index for each object
- store the result in swh's storage

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype
- fossology-license (queue swh_indexer_fossology_license): compute the license
- metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta vocabulary)

Current origin indexers:

- metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta and ForgeFed vocabularies)
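
The indexation procedure described above (receive ids, retrieve the data, compute an index, store the result) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `run_indexer`, `toy_mimetype`, the in-memory `objects` dict and the `indexer_db` list are hypothetical stand-ins, not the swh-indexer API, and the toy detector is not the detection logic the real mimetype indexer uses.

```python
from typing import Callable, Dict, Iterable, List, Optional

Index = Dict[str, str]


def run_indexer(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    compute: Callable[[bytes], Index],
    store: Callable[[List[Index]], None],
) -> List[Index]:
    """Run one indexing pass over a batch of ids (hypothetical helper)."""
    results: List[Index] = []
    for obj_id in ids:               # 1. receive a batch of ids
        data = fetch(obj_id)         # 2. retrieve the associated data
        if data is None:             #    ids with no data are skipped
            continue
        index = {"id": obj_id}
        index.update(compute(data))  # 3. compute an index for the object
        results.append(index)
    store(results)                   # 4. store the results
    return results


def toy_mimetype(data: bytes) -> Index:
    """Toy mimetype/encoding detector, for illustration only."""
    if data.startswith(b"%PDF-"):
        return {"mimetype": "application/pdf", "encoding": "binary"}
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return {"mimetype": "application/octet-stream", "encoding": "binary"}
    return {"mimetype": "text/plain", "encoding": "utf-8"}


# In-memory stand-ins for the object storage and the indexer db (assumptions).
objects = {
    "aaaa": b"hello world\n",
    "bbbb": b"%PDF-1.7 ...",
    "cccc": b"\xff\xfe\x00",
}
indexer_db: List[Index] = []

run_indexer(
    ["aaaa", "bbbb", "missing", "cccc"],
    objects.get,
    toy_mimetype,
    indexer_db.extend,
)
```

Passing `fetch`, `compute` and `store` as plain callables keeps the four steps visible and lets the same loop serve any object type; a real indexer would replace them with calls to the actual storage and indexer db.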