swh-indexer

Tools to compute multiple indexes on SWH’s raw contents:

  • content:
    • mimetype
    • ctags
    • language
    • fossology-license
    • metadata
  • revision:
    • metadata

An indexer is in charge of:

  • looking up objects
  • extracting information from those objects
  • storing that information in the swh-indexer db

There are multiple indexers working on different object types:

  • content indexer: works with content sha1 hashes
  • revision indexer: works with revision sha1 hashes
  • origin indexer: works with origin identifiers

Indexing procedure:

  • receive a batch of ids
  • retrieve the associated data, depending on the object type
  • compute some index for each object
  • store the result in swh’s storage
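
The procedure above can be sketched as a small loop. This is only an illustration, not the actual swh-indexer code: `index_batch`, `fetch`, `compute`, `store` and the toy mimetype function are all hypothetical names, and plain in-memory Python objects stand in for the object storage and the index db.

```python
from typing import Callable, Dict, Iterable, List, Optional


def index_batch(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    compute: Callable[[str, bytes], Dict],
    store: Callable[[List[Dict]], None],
) -> List[Dict]:
    """Run one indexing pass over a batch of object ids (hypothetical helper)."""
    results: List[Dict] = []
    for obj_id in ids:
        data = fetch(obj_id)  # retrieve the associated data
        if data is None:  # unknown object: nothing to index
            continue
        entry = {"id": obj_id}
        entry.update(compute(obj_id, data))  # compute the index for this object
        results.append(entry)
    store(results)  # persist the results for the whole batch at once
    return results


# Toy usage: a dict plays the object storage, a list plays the index db.
objects = {"aa11": b"#!/bin/sh\necho hi\n", "bb22": b"\x89PNG..."}
index_db: List[Dict] = []


def toy_mimetype(obj_id: str, data: bytes) -> Dict:
    # Deliberately naive detection, only here to make the sketch runnable.
    if data.startswith(b"#!"):
        return {"mimetype": "text/x-shellscript"}
    return {"mimetype": "application/octet-stream"}


indexed = index_batch(
    ["aa11", "bb22", "missing"], objects.get, toy_mimetype, index_db.extend
)
```

Passing the fetch, compute and store steps as callables keeps the loop identical for content, revision and origin ids; only the callables would change per object type.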

Current content indexers:

  • mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype
  • language (queue swh_indexer_content_language): detect the programming language
  • ctags (queue swh_indexer_content_ctags): compute tags information
  • fossology-license (queue swh_indexer_fossology_license): compute the license
  • metadata: translate a file into a translated_metadata dict

Current revision indexers:

  • metadata: detects files containing metadata, then either retrieves their translated_metadata from the content_metadata table in storage or runs the content indexer to translate them.
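
The "retrieve or translate" behaviour of the revision metadata indexer might look like the sketch below. It rests on several assumptions: the set of metadata file names, the `revision_metadata` function, and the `translate` callable are hypothetical, and a plain dict stands in for the content_metadata table.

```python
from typing import Callable, Dict, List

# Assumption: which file names count as metadata files is not stated in this
# document; these two are placeholders for illustration.
METADATA_FILENAMES = {"package.json", "codemeta.json"}


def revision_metadata(
    files: Dict[str, str],
    content_metadata: Dict[str, Dict],
    translate: Callable[[str], Dict],
) -> List[Dict]:
    """Collect translated_metadata for the metadata files of one revision.

    files maps a file name to its content sha1; content_metadata is a dict
    standing in for the content_metadata table.
    """
    results: List[Dict] = []
    for name, sha1 in files.items():
        if name.lower() not in METADATA_FILENAMES:
            continue  # not a metadata file
        translated = content_metadata.get(sha1)
        if translated is None:
            # Not indexed yet: run the content-level translation and keep it.
            translated = translate(sha1)
            content_metadata[sha1] = translated
        results.append(translated)
    return results


# Toy usage: one file is already in the table, one has to be translated.
table = {"sha-a": {"name": "already-indexed"}}
translated_now: List[str] = []


def toy_translate(sha1: str) -> Dict:
    translated_now.append(sha1)
    return {"name": "translated-" + sha1}


found = revision_metadata(
    {"package.json": "sha-a", "codemeta.json": "sha-b", "README": "sha-c"},
    table,
    toy_translate,
)
```

In this sketch only `sha-b` triggers a translation, since `sha-a` is already present in the table and `README` is not treated as a metadata file.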