swh.indexer package

Submodules

swh.indexer.codemeta module

swh.indexer.codemeta.make_absolute_uri(local_name)[source]
swh.indexer.codemeta.compact(doc)[source]

Same as pyld.jsonld.compact, but in the context of CodeMeta.

swh.indexer.codemeta.expand(doc)[source]

Same as pyld.jsonld.expand, but in the context of CodeMeta.
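For intuition, here is a toy version of the term/IRI substitution these wrappers perform. The two-entry context is an assumed excerpt; the real functions delegate to pyld.jsonld with the full CodeMeta context:

```python
# Toy illustration of JSON-LD compaction/expansion over flat documents.
# The real compact()/expand() delegate to pyld.jsonld with the full
# CodeMeta context; this two-entry context is an assumed excerpt.

CODEMETA_CONTEXT = {
    "name": "http://schema.org/name",
    "license": "http://schema.org/license",
}

def toy_compact(doc):
    """Replace full IRIs with their short CodeMeta terms (flat docs only)."""
    iri_to_term = {iri: term for term, iri in CODEMETA_CONTEXT.items()}
    return {iri_to_term.get(key, key): value for key, value in doc.items()}

def toy_expand(doc):
    """Replace short CodeMeta terms with their full IRIs (flat docs only)."""
    return {CODEMETA_CONTEXT.get(key, key): value for key, value in doc.items()}
```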

swh.indexer.ctags module

swh.indexer.ctags.run_ctags(path, lang=None, ctags_command='ctags')[source]

Run ctags on file path with optional language.

Parameters:
  • path – path to the file
  • lang – language for that path (optional)
Yields:

dict – ctags’ output
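The configured command line (ctags --output-format=json, see CtagsIndexer.ADDITIONAL_CONFIG below) emits one JSON object per line; a minimal sketch of parsing that stream into the dicts run_ctags yields. The sample line is illustrative, not captured from a real ctags run:

```python
# Sketch: parse universal-ctags JSON output (one object per line) into
# dicts, mirroring what run_ctags yields. The sample line below is
# illustrative, not real tool output.
import json

def parse_ctags_json(output, lang=None):
    for line in output.splitlines():
        if not line.strip():
            continue
        tag = json.loads(line)
        if lang is not None:
            tag['language'] = lang  # apply the optional language override
        yield tag

sample = ('{"_type": "tag", "name": "main", "path": "hello.c", '
          '"line": 3, "kind": "function", "language": "C"}')
tags = list(parse_ctags_json(sample))
```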

class swh.indexer.ctags.CtagsIndexer[source]

Bases: swh.indexer.indexer.ContentIndexer, swh.indexer.indexer.DiskIndexer

CONFIG_BASE_FILENAME = 'indexer/ctags'
ADDITIONAL_CONFIG = {'languages': ('dict', {'adl': None, 'agda': None, 'ada': 'Ada'}), 'tools': ('dict', {'version': '~git7859817b', 'name': 'universal-ctags', 'configuration': {'command_line': 'ctags --fields=+lnz --sort=no --links=no --output-format=json <filepath>'}}), 'workdir': ('str', '/tmp/swh/indexer.ctags')}
prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

filter(ids)[source]

Filter out known sha1s and return only missing ones.

compute_ctags(path, lang)[source]

Compute ctags on file at path with language lang.

index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

a dict representing a content_ctags with keys:

  • id (bytes): content’s identifier (sha1)
  • ctags ([dict]): ctags list of symbols

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_ctags, dicts with the following keys: - id (bytes): content’s identifier (sha1) - ctags ([dict]): ctags list of symbols
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

swh.indexer.fossology_license module

swh.indexer.fossology_license.compute_license(path, log=None)[source]

Determine license from file at path.

Parameters:path – filepath to determine the license
Returns:A dict with the following keys:
  • licenses ([str]): associated detected licenses to path
  • path (bytes): content filepath
Return type:dict
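A sketch of turning the tool’s output into the dict above. The "File … contains license(s) …" line format is an assumption about nomossa’s stdout, not something this documentation specifies:

```python
# Sketch, assuming nomossa prints a line of the form
#   File <name> contains license(s) <lic1>,<lic2>
# (an assumption about the tool's output format).

def parse_nomossa(stdout, path):
    """Build the {licenses, path} dict that compute_license returns."""
    marker = 'contains license(s) '
    pos = stdout.find(marker)
    licenses = []
    if pos != -1:
        licenses = stdout[pos + len(marker):].strip().split(',')
    return {'licenses': licenses, 'path': path}
```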
class swh.indexer.fossology_license.MixinFossologyLicenseIndexer[source]

Bases: object

Mixin fossology license indexer.

See FossologyLicenseIndexer and FossologyLicenseRangeIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'version': '3.1.0rc2-31-ga2cbb8c', 'name': 'nomos', 'configuration': {'command_line': 'nomossa <filepath>'}}), 'workdir': ('str', '/tmp/swh/indexer.fossology.license'), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/fossology_license'
prepare()[source]
compute_license(path, log=None)[source]

Determine license from file at path.

Parameters:path – filepath to determine the license
Returns:A dict with the following keys:
  • licenses ([str]): associated detected licenses to path
  • path (bytes): content filepath
Return type:dict
index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • raw_content (bytes) – associated raw content to content id
Returns:

A dict, representing a content_license, with keys:

  • id (bytes): content’s identifier (sha1)
  • license (bytes): license in bytes
  • path (bytes): path
  • indexer_configuration_id (int): tool used to compute the output

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) –

    list of content_license, dict with the following keys:

    • id (bytes): content’s identifier (sha1)
    • license (bytes): license in bytes
    • path (bytes): path
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
class swh.indexer.fossology_license.FossologyLicenseIndexer[source]

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.DiskIndexer, swh.indexer.indexer.ContentIndexer

Indexer in charge of:

  • filtering out content already indexed
  • reading content from objstorage per the content’s id (sha1)
  • computing the license from that content
  • storing the result in storage
filter(ids)[source]

Filter out known sha1s and return only missing ones.

class swh.indexer.fossology_license.FossologyLicenseRangeIndexer[source]

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.DiskIndexer, swh.indexer.indexer.ContentRangeIndexer

FossologyLicense Range Indexer working on range of content identifiers.

  • filters out the non textual content
  • (optionally) filters out content already indexed (cf indexed_contents_in_range())
  • reads content from objstorage per the content’s id (sha1)
  • computes the license from that content
  • stores result in storage
indexed_contents_in_range(start, end)[source]

Retrieve indexed content ids within range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Returns:

a dict with keys:

  • ids [bytes]: iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1 if any

Return type:

dict

swh.indexer.indexer module

class swh.indexer.indexer.DiskIndexer[source]

Bases: object

Mixin intended to be used with other SomethingIndexer classes.

Indexers inheriting from this class are a category of indexers which need the disk for their computations.

Note

This expects the self.working_directory variable to be defined at runtime.

write_to_temp(filename, data)[source]

Write the sha1’s content in a temporary file.

Parameters:
  • filename (str) – one of sha1’s many filenames
  • data (bytes) – the sha1’s content to write in temporary file
Returns:

The path to the temporary file created. That file is filled in with the raw content’s data.

cleanup(content_path)[source]

Remove content_path from working directory.

Parameters:content_path (str) – the file to remove
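A self-contained sketch of the mixin contract using only the standard library. The working_directory default is an assumption for the example; concrete indexers set it from their runtime configuration:

```python
# Minimal stand-alone sketch of the DiskIndexer contract. The
# working_directory default is an assumption for the example; concrete
# indexers define it from their configuration.
import os
import tempfile

class ToyDiskIndexer:
    working_directory = tempfile.gettempdir()

    def write_to_temp(self, filename, data):
        """Write data to a temporary file and return its path."""
        os.makedirs(self.working_directory, exist_ok=True)
        fd, path = tempfile.mkstemp(suffix='-' + filename,
                                    dir=self.working_directory)
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        return path

    def cleanup(self, content_path):
        """Remove the temporary file."""
        os.unlink(content_path)
```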
class swh.indexer.indexer.BaseIndexer[source]

Bases: swh.core.config.SWHConfig

Base class for indexers to inherit from.

The main entry point is the run() function which is in charge of triggering the computations on the batch dict/ids received.

Indexers can:

  • filter out ids whose data has already been indexed
  • retrieve ids’ data from storage or objstorage
  • index this data depending on the object and store the result in storage

To implement a new object type indexer, inherit from the BaseIndexer and implement indexing:

run():
object ids differ per object type: sha1 for content; sha1_git for revision, directory, and release; and id for origin

To implement a new concrete indexer, inherit from the object level classes: ContentIndexer, RevisionIndexer, OriginIndexer.

Then you need to implement the following functions:

filter():
filter out data already indexed (in storage).
index_object():
compute index on id with data (retrieved from the storage or the objstorage by the id key) and return the resulting index computation.
persist_index_computations():
persist the results of multiple index computations in the storage.

The new indexer implementation can also override the following functions:

prepare():
Configuration preparation for the indexer. When overriding, this must call super().prepare().
check():
Configuration check for the indexer. When overriding, this must call super().check().
register_tools():
This should return a dict of the tool(s) to use when indexing or filtering.
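A schematic (hypothetical) indexer showing how the three required hooks fit together; the plain dict stands in for the indexer storage, and real implementations inherit from the classes documented below:

```python
# Hypothetical sketch of the three required hooks, with a plain dict
# standing in for the indexer storage. Real indexers inherit from
# ContentIndexer / RevisionIndexer / OriginIndexer instead.

class ToyContentIndexer:
    def __init__(self):
        self.store = {}  # stands in for the indexer storage

    def filter(self, ids):
        """Yield only the ids not indexed yet."""
        yield from (id_ for id_ in ids if id_ not in self.store)

    def index(self, id, data):
        """Compute something from the raw data (here: its length)."""
        return {'id': id, 'length': len(data)}

    def persist_index_computations(self, results, policy_update):
        """Store results, honoring the duplicate-update policy."""
        for result in results:
            if policy_update == 'update-dups' or result['id'] not in self.store:
                self.store[result['id']] = result
```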
CONFIG = 'indexer/base'
DEFAULT_CONFIG = {'indexer_storage': ('dict', {'args': {'url': 'http://localhost:5007/'}, 'cls': 'remote'}), 'objstorage': ('dict', {'args': {'url': 'http://localhost:5003/'}, 'cls': 'remote'}), 'storage': ('dict', {'args': {'url': 'http://localhost:5002/'}, 'cls': 'remote'})}
ADDITIONAL_CONFIG = {}
prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

check(*, check_tools=True)[source]

Check that the indexer’s configuration is valid before proceeding. Does nothing if valid; raises an error otherwise.

register_tools(tools)[source]

Register tools in the storage.

Add a sensible default which can be overridden if not sufficient. (For now, all indexers use only one tool)

Expects the self.config[‘tools’] property to be set with one or more tools.

Parameters:tools (dict/[dict]) – Either a dict or a list of dict.
Returns:List of dicts with additional id key.
Return type:list
Raises:ValueError – if not a list nor a dict.
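The dict-or-list normalization this implies can be sketched as follows (a hypothetical helper, not the actual implementation):

```python
# Hypothetical helper showing the normalization register_tools implies:
# accept one dict or a list of dicts, reject anything else.

def normalize_tools(tools):
    if isinstance(tools, dict):
        tools = [tools]
    if not isinstance(tools, list):
        raise ValueError('tools must be a dict or a list of dicts')
    return tools
```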
index(id, data)[source]

Index computation for the id and associated raw data.

Parameters:
  • id (bytes) – identifier
  • data (bytes) – id’s data from storage or objstorage depending on object type
Returns:

a dict that makes sense for the persist_index_computations() method.

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the computation resulting from the index.

Parameters:
  • results ([result]) – List of results. One result is the result of the index function.
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
Returns:

None

next_step(results, task)[source]

Do something else with computations results (e.g. send to another queue, …).

(This is not an abstractmethod since it is optional).

Parameters:
  • results ([result]) – List of results (dict) as returned by index function.
  • task (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
Returns:

None

run(ids, policy_update, next_step=None, **kwargs)[source]

Given a list of ids:

  • retrieves the data from the storage
  • executes the indexing computations
  • stores the results (according to policy_update)
Parameters:
  • ids ([bytes]) – list of object identifiers
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus a result_name key.
  • **kwargs – passed to the index method
class swh.indexer.indexer.ContentIndexer[source]

Bases: swh.indexer.indexer.BaseIndexer

A content indexer working on a list of ids directly.

To work on indexer range, use the ContentRangeIndexer instead.

Note: ContentIndexer is not an instantiable object. To use it, one should inherit from this class and override the methods mentioned in the BaseIndexer class.

filter(ids)[source]

Filter missing ids for that particular indexer.

Parameters:ids ([bytes]) – list of ids
Yields:iterator of missing ids
run(ids, policy_update, next_step=None, **kwargs)[source]

Given a list of ids:

  • retrieve the content from the storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids (Iterable[Union[bytes, str]]) – list of sha1 identifiers
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
  • **kwargs – passed to the index method
class swh.indexer.indexer.ContentRangeIndexer[source]

Bases: swh.indexer.indexer.BaseIndexer

A content range indexer.

This expects as input a range of ids to index.

To work on a list of ids, use the ContentIndexer instead.

Note: ContentRangeIndexer is not an instantiable object. To use it, one should inherit from this class and override the methods mentioned in the BaseIndexer class.

indexed_contents_in_range(start, end)[source]

Retrieve indexed contents within range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Yields:

bytes – Content identifier present in the range [start, end]

run(start, end, skip_existing=True, **kwargs)[source]

Given a range of content ids, run the indexing computations on the contents within. The indexer is either incremental (filtering out already-indexed data) or not (computing everything from scratch).
Parameters:
  • start (Union[bytes, str]) – Starting range identifier
  • end (Union[bytes, str]) – Ending range identifier
  • skip_existing (bool) – Skip existing indexed data (default) or not
  • **kwargs – passed to the index method
Returns:

True if data was indexed, False otherwise.

Return type:

bool

class swh.indexer.indexer.OriginIndexer[source]

Bases: swh.indexer.indexer.BaseIndexer

An object-type indexer that inherits from BaseIndexer and implements origin indexing through its run method

Note: the OriginIndexer is not an instantiable object. To use it in another context one should inherit from this class and override the methods mentioned in the BaseIndexer class.

run(ids, policy_update='update-dups', parse_ids=True, next_step=None, **kwargs)[source]

Given a list of origin ids:

  • retrieve origins from storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids ([Union[int, Tuple[str, bytes]]]) – list of origin ids or (type, url) tuples.
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates (default) or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
  • parse_ids (bool) – Do we need to parse id or not (default)
  • **kwargs – passed to the index method
class swh.indexer.indexer.RevisionIndexer[source]

Bases: swh.indexer.indexer.BaseIndexer

An object-type indexer that inherits from BaseIndexer and implements revision indexing through its run method

Note: the RevisionIndexer is not an instantiable object. To use it in another context one should inherit from this class and override the methods mentioned in the BaseIndexer class.

run(ids, policy_update, next_step=None)[source]

Given a list of sha1_gits:

  • retrieve revisions from storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids ([bytes or str]) – list of sha1_git identifiers
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

swh.indexer.journal_client module

class swh.indexer.journal_client.IndexerJournalClient[source]

Bases: swh.journal.client.JournalClient

Client in charge of listing new received origins and origin_visits in the swh journal.

CONFIG_BASE_FILENAME = 'indexer/journal_client'
ADDITIONAL_CONFIG = {'origin_visit_tasks': ('List[dict]', [{'kwargs': {'policy_update': 'update-dups', 'parse_ids': False}, 'type': 'indexer_origin_head'}]), 'scheduler': ('dict', {'args': {'url': 'http://localhost:5008/'}, 'cls': 'remote'})}
process_objects(messages)[source]

Process the objects (store, compute, etc…)

Parameters:
  • messages (dict) – dict mapping object types (as per configuration) to their associated values
process_origin_visit(origin_visit)[source]

swh.indexer.language module

swh.indexer.language.compute_language_from_chunk(encoding, length, raw_content, max_size, log=None)[source]

Determine the raw content’s language.

Parameters:
  • encoding (str) – Encoding to use to decode the content
  • length (int) – raw_content’s length
  • raw_content (bytes) – raw content to work with
  • max_size (int) – max size to split the raw content at
Returns:

Dict with key lang: the detected language, or None if nothing was found

Return type:

dict

swh.indexer.language.compute_language(raw_content, encoding=None, log=None)[source]

Determine the raw content’s language.

Parameters:raw_content (bytes) – raw content to work with
Returns:Dict with key lang: the detected language, or None if nothing was found
Return type:dict
class swh.indexer.language.LanguageIndexer[source]

Bases: swh.indexer.indexer.ContentIndexer

Indexer in charge of:

  • filtering out content already indexed
  • reading content from objstorage per the content’s id (sha1)
  • computing the language from that content
  • storing the result in storage
CONFIG_BASE_FILENAME = 'indexer/language'
ADDITIONAL_CONFIG = {'tools': ('dict', {'version': '2.0.1+dfsg-1.1+deb8u1', 'name': 'pygments', 'configuration': {'debian-package': 'python3-pygments', 'max_content_size': 10240, 'type': 'library'}})}
prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

filter(ids)[source]

Filter out known sha1s and return only missing ones.

index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

Dict that represents a content_language, with keys: - id (bytes): content’s identifier (sha1) - lang (bytes): detected language

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_language, dicts with the following keys: - id (bytes): content’s identifier (sha1) - lang (bytes): detected language
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

swh.indexer.metadata module

class swh.indexer.metadata.ContentMetadataIndexer(tool, config)[source]

Bases: swh.indexer.indexer.ContentIndexer

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata
  • reading content from objstorage with the content’s id sha1
  • computing translated_metadata for the given context
  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool
  • storing the result in the content_metadata table
CONFIG_BASE_FILENAME = 'indexer/content_metadata'
filter(ids)[source]

Filter out known sha1s and return only missing ones.

index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

dictionary representing a content_metadata. If the translation wasn’t successful, the translated_metadata key will be returned as None

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_metadata, dict with the following keys: - id (bytes): content’s identifier (sha1) - translated_metadata (jsonb): detected metadata
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
class swh.indexer.metadata.RevisionMetadataIndexer[source]

Bases: swh.indexer.indexer.RevisionIndexer

Revision-level indexer

This indexer is in charge of:

  • filtering out revisions already indexed in the revision_metadata table with the defined computation tool
  • retrieving all entry_files in the root directory
  • using metadata_detector for file_names containing metadata
  • computing metadata translation if necessary and possible (depends on tool)
  • sending sha1s to content indexing if possible
  • storing the results for the revision
CONFIG_BASE_FILENAME = 'indexer/revision_metadata'
ADDITIONAL_CONFIG = {'tools': ('dict', {'version': '0.0.2', 'name': 'swh-metadata-detector', 'configuration': {'context': ['NpmMapping', 'CodemetaMapping'], 'type': 'local'}})}
class ContentMetadataIndexer(tool, config)

Bases: swh.indexer.indexer.ContentIndexer

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata
  • reading content from objstorage with the content’s id sha1
  • computing translated_metadata for the given context
  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool
  • storing the result in the content_metadata table
CONFIG_BASE_FILENAME = 'indexer/content_metadata'
filter(ids)

Filter out known sha1s and return only missing ones.

index(id, data)

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

dictionary representing a content_metadata. If the translation wasn’t successful, the translated_metadata key will be returned as None

Return type:

dict

persist_index_computations(results, policy_update)

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_metadata, dict with the following keys: - id (bytes): content’s identifier (sha1) - translated_metadata (jsonb): detected metadata
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

filter(sha1_gits)[source]

Filter out known sha1s and return only missing ones.

index(rev)[source]

Index rev by processing it and organizing result.

use metadata_detector to iterate on filenames

  • if one filename is detected -> send the file to the content indexer
  • if multiple files are detected -> translation is needed at the revision level
Parameters:rev (bytes) – revision artifact from storage
Returns:dictionary representing a revision_metadata, with keys:
  • id (str): rev’s identifier (sha1_git)
  • indexer_configuration_id (bytes): tool used
  • translated_metadata: dict of retrieved metadata
Return type:dict
persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of revision_metadata, dicts with the following keys: - id (bytes): revision’s identifier (sha1_git) - translated_metadata (jsonb): detected metadata
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
translate_revision_metadata(detected_files)[source]

Determine the plan of action for translating metadata, depending on whether one or multiple files were detected:

Parameters:detected_files (dict) – dictionary mapping context names (e.g., “npm”, “authors”) to list of sha1
Returns:dict with translated metadata according to the CodeMeta vocabulary
Return type:dict
class swh.indexer.metadata.OriginMetadataIndexer[source]

Bases: swh.indexer.indexer.OriginIndexer

CONFIG_BASE_FILENAME = 'indexer/origin_intrinsic_metadata'
ADDITIONAL_CONFIG = {'tools': ('list', [])}
check(**kwargs)[source]

Check that the indexer’s configuration is valid before proceeding. Does nothing if valid; raises an error otherwise.

filter(ids)[source]
run(origin_head, policy_update)[source]

Expected to be called with the result of RevisionMetadataIndexer as its first argument; i.e. not with a list of ids, as other indexers are.

Parameters:
  • origin_head (dict) – mapping {str(origin_id): rev_id}, as produced by OriginHeadIndexer.
  • policy_update (str) – ‘ignore-dups’ or ‘update-dups’
index(origin, *, origin_head_map)[source]

Index computation for the id and associated raw data.

Parameters:
  • id (bytes) – identifier
  • data (bytes) – id’s data from storage or objstorage depending on object type
Returns:

a dict that makes sense for the persist_index_computations() method.

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the computation resulting from the index.

Parameters:
  • results ([result]) – List of results. One result is the result of the index function.
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
Returns:

None

swh.indexer.metadata_detector module

swh.indexer.metadata_detector.detect_metadata(files)[source]

Detects files potentially containing metadata

Parameters:file_entries (list) – list of files
Returns:{mapping_filenames[name]:f[‘sha1’]} (may be empty)
Return type:dict
swh.indexer.metadata_detector.extract_minimal_metadata_dict(metadata_list)[source]

Every item in the metadata_list is a dict of translated_metadata in the CodeMeta vocabulary.

We wish to extract a minimal set of terms and keep all values corresponding to this term without duplication.

Parameters:metadata_list (list) – list of dicts of translated_metadata
Returns:minimal_dict; dict with selected values of metadata
Return type:dict
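The term-wise merge described above might look like this (a sketch; the real function’s handling of terms and values may differ):

```python
# Sketch of term-wise merging: keep every distinct value seen for each
# term across the per-file dicts, without duplication. The real
# function's handling of terms and values may differ.

def merge_translated(metadata_list):
    minimal = {}
    for metadata in metadata_list:
        for term, value in metadata.items():
            values = minimal.setdefault(term, [])
            if value not in values:
                values.append(value)
    return minimal
```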

swh.indexer.metadata_dictionary module

swh.indexer.metadata_dictionary.register_mapping(cls)[source]
class swh.indexer.metadata_dictionary.BaseMapping[source]

Bases: object

Base class for mappings to inherit from

To implement a new mapping:

  • inherit this class
  • override translate function
detect_metadata_files(files)[source]

Detects files potentially containing metadata

Parameters:file_entries (list) – list of files
Returns:list of sha1 (possibly empty)
Return type:list
translate(file_content)[source]
normalize_translation(metadata)[source]
class swh.indexer.metadata_dictionary.SingleFileMapping[source]

Bases: swh.indexer.metadata_dictionary.BaseMapping

Base class for all mappings that use a single file as input.

filename

The .json file to extract metadata from.

detect_metadata_files(file_entries)[source]

Detects files potentially containing metadata

Parameters:file_entries (list) – list of files
Returns:list of sha1 (possibly empty)
Return type:list
class swh.indexer.metadata_dictionary.DictMapping[source]

Bases: swh.indexer.metadata_dictionary.BaseMapping

Base class for mappings that take as input a file that is mostly a key-value store (e.g. a shallow JSON dict).

mapping

A translation dict to map dict keys into a canonical name.

translate_dict(content_dict, *, normalize=True)[source]

Translates content by parsing content from a dict object and translating with the appropriate mapping

Parameters:content_dict (dict) – content dict to translate
Returns:translated metadata in json-friendly form needed for the indexer
Return type:dict
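In miniature, the key renaming translate_dict performs; the two-entry mapping is an assumed excerpt modeled on NpmMapping.mapping, and per-field normalization is omitted:

```python
# Miniature of translate_dict's key renaming: source keys become their
# canonical IRIs, unknown keys are dropped. The two-entry mapping is an
# assumed excerpt; per-field normalization is omitted.

class ToyDictMapping:
    mapping = {
        'name': 'http://schema.org/name',
        'version': 'http://schema.org/version',
    }

    def translate_dict(self, content_dict):
        return {iri: content_dict[key]
                for key, iri in self.mapping.items()
                if key in content_dict}
```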
class swh.indexer.metadata_dictionary.JsonMapping[source]

Bases: swh.indexer.metadata_dictionary.DictMapping, swh.indexer.metadata_dictionary.SingleFileMapping

Base class for all mappings that use a JSON file as input.

translate(raw_content)[source]

Translates content by parsing content from a bytestring containing json data and translating with the appropriate mapping

Parameters:raw_content (bytes) – raw content to translate
Returns:translated metadata in json-friendly form needed for the indexer
Return type:dict
class swh.indexer.metadata_dictionary.NpmMapping[source]

Bases: swh.indexer.metadata_dictionary.JsonMapping

dedicated class for NPM (package.json) mapping and translation

mapping = {'author': 'http://schema.org/author', 'author.email': 'http://schema.org/email', 'author.name': 'http://schema.org/name', 'bugs': 'https://codemeta.github.io/terms/issueTracker', 'contributor': 'http://schema.org/contributor', 'cpu': 'http://schema.org/processorRequirements', 'description': 'http://schema.org/description', 'engines': 'http://schema.org/processorRequirements', 'homepage': 'http://schema.org/url', 'keywords': 'http://schema.org/keywords', 'license': 'http://schema.org/license', 'name': 'http://schema.org/name', 'os': 'http://schema.org/operatingSystem', 'repository': 'http://schema.org/codeRepository', 'version': 'http://schema.org/version'}
filename = b'package.json'
normalize_repository(d)[source]

https://docs.npmjs.com/files/package.json#repository

normalize_bugs(d)[source]

https://docs.npmjs.com/files/package.json#bugs

normalize_author(d)[source]

https://docs.npmjs.com/files/package.json#people-fields-author-contributors

normalize_license(s)[source]
normalize_homepage(s)[source]
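npm people fields (see the link under normalize_author()) may be either an object or the shorthand string "Name <email> (url)"; a sketch of normalizing both forms (a hypothetical helper, not the actual method):

```python
# Hypothetical sketch of people-field normalization: npm allows either
# an object or the shorthand string "Name <email> (url)".
import re

PERSON_RE = re.compile(r'^(?P<name>[^<(]*?)'
                       r'\s*(?:<(?P<email>[^>]*)>)?'
                       r'\s*(?:\((?P<url>[^)]*)\))?\s*$')

def normalize_person(d):
    if isinstance(d, dict):
        return d
    match = PERSON_RE.match(d)
    return {key: value for key, value in match.groupdict().items() if value}
```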
class swh.indexer.metadata_dictionary.CodemetaMapping[source]

Bases: swh.indexer.metadata_dictionary.SingleFileMapping

dedicated class for CodeMeta (codemeta.json) mapping and translation

filename = b'codemeta.json'
translate(content)[source]
class swh.indexer.metadata_dictionary.MavenMapping[source]

Bases: swh.indexer.metadata_dictionary.DictMapping, swh.indexer.metadata_dictionary.SingleFileMapping

dedicated class for Maven (pom.xml) mapping and translation

filename = b'pom.xml'
mapping = {'ciManagement': 'https://codemeta.github.io/terms/contIntegration', 'description': 'http://schema.org/description', 'groupId': 'http://schema.org/identifier', 'issuesManagement': 'https://codemeta.github.io/terms/issueTracker', 'license': 'http://schema.org/license', 'name': 'http://schema.org/name', 'repositories': 'http://schema.org/codeRepository', 'version': 'http://schema.org/version'}
translate(content)[source]
parse_repositories(d)[source]

https://maven.apache.org/pom.html#Repositories

parse_repository(d, repo)[source]
normalize_groupId(id_)[source]
parse_licenses(d)[source]

https://maven.apache.org/pom.html#Licenses

The origin XML has the form:

<licenses>
  <license>
    <name>Apache License, Version 2.0</name>
    <url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
  </license>
</licenses>

Which was translated to a dict by xmltodict and is given as d:

>>> d = {
...     # ...
...     "licenses": {
...         "license": {
...             "name": "Apache License, Version 2.0",
...             "url":
...             "https://www.apache.org/licenses/LICENSE-2.0.txt"
...         }
...     }
... }
>>> MavenMapping().parse_licenses(d)
[{'@id': 'https://www.apache.org/licenses/LICENSE-2.0.txt'}]

or, if there is more than one license:

>>> from pprint import pprint
>>> d = {
...     # ...
...     "licenses": {
...         "license": [
...             {
...                 "name": "Apache License, Version 2.0",
...                 "url":
...                 "https://www.apache.org/licenses/LICENSE-2.0.txt"
...             },
...             {
...                 "name": "MIT License, ",
...                 "url": "https://opensource.org/licenses/MIT"
...             }
...         ]
...     }
... }
>>> pprint(MavenMapping().parse_licenses(d))
[{'@id': 'https://www.apache.org/licenses/LICENSE-2.0.txt'},
 {'@id': 'https://opensource.org/licenses/MIT'}]
class swh.indexer.metadata_dictionary.PythonPkginfoMapping[source]

Bases: swh.indexer.metadata_dictionary.DictMapping, swh.indexer.metadata_dictionary.SingleFileMapping

Dedicated class for Python’s PKG-INFO mapping and translation.

https://www.python.org/dev/peps/pep-0314/

filename = b'PKG-INFO'
mapping = {'author': 'http://schema.org/author', 'author-email': 'http://schema.org/email', 'description': 'http://schema.org/description', 'download-url': 'http://schema.org/downloadUrl', 'home-page': 'http://schema.org/url', 'keywords': 'http://schema.org/keywords', 'license': 'http://schema.org/license', 'name': 'http://schema.org/name', 'summary': 'http://schema.org/description', 'version': 'http://schema.org/version'}
translate(content)[source]
translate_summary(translated_metadata, v)[source]
translate_description(translated_metadata, v)[source]
normalize_home_page(urls)[source]
normalize_license(licenses)[source]
swh.indexer.metadata_dictionary.main()[source]

swh.indexer.mimetype module

swh.indexer.mimetype.compute_mimetype_encoding(raw_content)[source]

Determine mimetype and encoding from the raw content.

Parameters:raw_content (bytes) – content’s raw data
Returns:mimetype and encoding keys and their corresponding values (as bytes).
Return type:dict
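The real detection uses the file library (python3-magic, per ADDITIONAL_CONFIG below); this heuristic stand-in only illustrates the shape of the returned dict:

```python
# Heuristic stand-in for the libmagic-based detection, only to show the
# shape of the returned dict (bytes values for mimetype and encoding).

def toy_mimetype_encoding(raw_content):
    if b'\x00' in raw_content:
        return {'mimetype': b'application/octet-stream',
                'encoding': b'binary'}
    try:
        raw_content.decode('utf-8')
        encoding = b'utf-8'
    except UnicodeDecodeError:
        encoding = b'unknown-8bit'
    return {'mimetype': b'text/plain', 'encoding': encoding}
```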
class swh.indexer.mimetype.MixinMimetypeIndexer[source]

Bases: object

Mixin mimetype indexer.

See MimetypeIndexer and MimetypeRangeIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'version': '1:5.30-1+deb9u1', 'name': 'file', 'configuration': {'debian-package': 'python3-magic', 'type': 'library'}}), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/mimetype'
prepare()[source]
index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

content’s mimetype; dict keys being

  • id (bytes): content’s identifier (sha1)
  • mimetype (bytes): mimetype in bytes
  • encoding (bytes): encoding in bytes

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content’s mimetype dicts (see index())
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
class swh.indexer.mimetype.MimetypeIndexer[source]

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentIndexer

Mimetype Indexer working on list of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. filter())
  • reads content from objstorage per the content’s id (sha1)
  • computes {mimetype, encoding} from that content
  • stores result in storage
filter(ids)[source]

Filter out known sha1s and return only missing ones.

class swh.indexer.mimetype.MimetypeRangeIndexer[source]

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentRangeIndexer

Mimetype Range Indexer working on range of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. indexed_contents_in_range())
  • reads content from objstorage per the content’s id (sha1)
  • computes {mimetype, encoding} from that content
  • stores result in storage
indexed_contents_in_range(start, end)[source]

Retrieve indexed content ids within the range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Returns:

a dict with keys:

  • ids [bytes]: iterable of content ids within the range.
  • next (Optional[bytes]): the sha1 at which the next range starts, if any

Return type:

dict
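A caller can page through a large range by following the next cursor until it is exhausted. A minimal sketch, with a stub standing in for the range indexer (FakeRangeIndexer and all_indexed_ids are hypothetical names, not part of the package):

```python
class FakeRangeIndexer:
    """Stub mimicking indexed_contents_in_range() (illustrative only)."""
    def __init__(self, indexed, page_size=2):
        self.indexed = sorted(indexed)
        self.page_size = page_size

    def indexed_contents_in_range(self, start, end):
        in_range = [i for i in self.indexed if start <= i <= end]
        page, rest = in_range[:self.page_size], in_range[self.page_size:]
        return {'ids': page, 'next': rest[0] if rest else None}

def all_indexed_ids(indexer, start, end):
    """Collect every indexed content id in [start, end] by following
    the 'next' cursor across successive calls."""
    ids = []
    while start is not None:
        page = indexer.indexed_contents_in_range(start, end)
        ids.extend(page['ids'])
        start = page.get('next')
    return ids

indexer = FakeRangeIndexer([b'\x01', b'\x03', b'\x05', b'\x07'])
print(all_indexed_ids(indexer, b'\x00', b'\x06'))
```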

swh.indexer.origin_head module

class swh.indexer.origin_head.OriginHeadIndexer[source]

Bases: swh.indexer.indexer.OriginIndexer

Origin-level indexer.

This indexer is in charge of looking up the revision that acts as the “head” of an origin.

In git, this is usually the commit pointed to by the ‘master’ branch.

ADDITIONAL_CONFIG = {'tasks': ('dict', {'origin_intrinsic_metadata': 'origin_metadata', 'revision_metadata': 'revision_metadata'}), 'tools': ('dict', {'version': '0.0.1', 'name': 'origin-metadata', 'configuration': {}})}
CONFIG_BASE_FILENAME = 'indexer/origin_head'
filter(ids)[source]
persist_index_computations(results, policy_update)[source]

Do nothing. The indexer’s results are not persistent; they should only be piped to another indexer.

next_step(results, task)[source]

Once the head is found, call the RevisionMetadataIndexer on these revisions, then call the OriginMetadataIndexer with both the origin_id and the revision metadata, so it can copy the revision metadata to the origin’s metadata.

Parameters:results (Iterable[dict]) – Iterable of return values from index.
index(origin)[source]

Index computation for the id and associated raw data.

Parameters:
  • id (bytes) – identifier
  • data (bytes) – id’s data from storage or objstorage depending on object type
Returns:

a dict that makes sense for the persist_index_computations() method.

Return type:

dict

swh.indexer.rehash module

class swh.indexer.rehash.RecomputeChecksums[source]

Bases: swh.core.config.SWHConfig

Class in charge of (re)computing content’s hashes.

Hashes to compute are defined across 2 configuration options:

compute_checksums ([str])
list of hash algorithms that the swh.model.hashutil.MultiHash.from_data function should be able to deal with. For variable-length checksums, a desired checksum length should also be provided, in the format <algorithm’s name>:<variable-length>, e.g. blake2:512.
recompute_checksums (bool)
a boolean notifying that potential existing hashes specified in compute_checksums should also be recomputed. Defaults to False.
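As a sketch of the <algorithm’s name>:<variable-length> format, the option could be interpreted with hashlib alone, assuming only blake2 takes a length. The real implementation delegates to swh.model.hashutil.MultiHash, not to a helper like this:

```python
import hashlib

def compute_checksums(data, algorithms):
    """Compute the requested checksums on data.

    Variable-length algorithms use the '<name>:<bits>' form of the
    compute_checksums option, e.g. 'blake2:512'.
    """
    results = {}
    for spec in algorithms:
        if ':' in spec:
            name, bits = spec.split(':')
            if name != 'blake2':
                raise ValueError('unsupported variable-length algorithm: %s' % name)
            # blake2b takes its digest size in bytes, the spec is in bits.
            h = hashlib.blake2b(data, digest_size=int(bits) // 8)
        else:
            h = hashlib.new(spec, data)
        results[spec] = h.hexdigest()
    return results

print(compute_checksums(b'foo', ['sha1', 'blake2:512']))
```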
DEFAULT_CONFIG = {'batch_size_retrieve_content': ('int', 10), 'batch_size_update': ('int', 100), 'compute_checksums': ('list[str]', []), 'objstorage': ('dict', {'args': {'slicing': '0:2/2:4/4:6', 'root': '/srv/softwareheritage/objects'}, 'cls': 'pathslicing'}), 'recompute_checksums': ('bool', False), 'storage': ('dict', {'args': {'url': 'http://localhost:5002/'}, 'cls': 'remote'})}
CONFIG_BASE_FILENAME = 'indexer/rehash'
get_new_contents_metadata(all_contents)[source]

Retrieve raw contents and compute new checksums on the contents. Unknown or corrupted contents are skipped.

Parameters:
  • all_contents ([dict]) – list of contents as dictionaries with the necessary primary keys
  • checksum_algorithms ([str]) – list of checksums to compute
Yields:

tuple – tuple of (content to update, list of checksums computed)
run(contents)[source]

Given a list of contents:

  • (re)compute a given set of checksums on contents available in our object storage
  • update those contents with the new metadata
Parameters:contents (dict) – contents as dictionaries with the necessary keys; the keys present in each dictionary should be the ones defined in the ‘primary_key’ option.

swh.indexer.tasks module

class swh.indexer.tasks.Task[source]

Bases: swh.scheduler.task.Task

Task whose result is needed for other computations.

run_task(*args, **kwargs)[source]

Perform the task.

Must return a json-serializable value as it is passed back to the task scheduler using a celery event.

ignore_result = False
rate_limit = None
reject_on_worker_lost = None
request_stack = <celery.utils.threads._LocalStack object>
serializer = 'json'
store_errors_even_if_ignored = False
track_started = False
typing = True
class swh.indexer.tasks.StatusTask[source]

Bases: swh.scheduler.task.Task

Task which returns a status, either eventful or uneventful.

run_task(*args, **kwargs)[source]

Perform the task.

Must return a json-serializable value as it is passed back to the task scheduler using a celery event.

ignore_result = False
rate_limit = None
reject_on_worker_lost = None
request_stack = <celery.utils.threads._LocalStack object>
serializer = 'json'
store_errors_even_if_ignored = False
track_started = False
typing = True
class swh.indexer.tasks.RevisionMetadata[source]

Bases: swh.indexer.tasks.Task

task_queue = 'swh_indexer_revision_metadata'
serializer = 'msgpack'
Indexer

alias of swh.indexer.metadata.RevisionMetadataIndexer

class swh.indexer.tasks.OriginMetadata[source]

Bases: swh.indexer.tasks.Task

task_queue = 'swh_indexer_origin_intrinsic_metadata'
Indexer

alias of swh.indexer.metadata.OriginMetadataIndexer

class swh.indexer.tasks.OriginHead[source]

Bases: swh.indexer.tasks.Task

task_queue = 'swh_indexer_origin_head'
Indexer

alias of swh.indexer.origin_head.OriginHeadIndexer

class swh.indexer.tasks.ContentMimetype[source]

Bases: swh.indexer.tasks.StatusTask

Compute (mimetype, encoding) on a list of sha1s’ content.

task_queue = 'swh_indexer_content_mimetype'
Indexer

alias of swh.indexer.mimetype.MimetypeIndexer

class swh.indexer.tasks.ContentRangeMimetype[source]

Bases: swh.indexer.tasks.StatusTask

Compute (mimetype, encoding) on a range of sha1s.

task_queue = 'swh_indexer_content_mimetype_range'
Indexer

alias of swh.indexer.mimetype.MimetypeRangeIndexer

class swh.indexer.tasks.ContentLanguage[source]

Bases: swh.indexer.tasks.Task

Task which computes the language from the sha1’s content.

task_queue = 'swh_indexer_content_language'
Indexer

alias of swh.indexer.language.LanguageIndexer

class swh.indexer.tasks.Ctags[source]

Bases: swh.indexer.tasks.Task

Task which computes ctags from the sha1’s content.

task_queue = 'swh_indexer_content_ctags'
Indexer

alias of swh.indexer.ctags.CtagsIndexer

class swh.indexer.tasks.ContentFossologyLicense[source]

Bases: swh.indexer.tasks.Task

Compute fossology licenses on a list of sha1s’ content.

task_queue = 'swh_indexer_content_fossology_license'
Indexer

alias of swh.indexer.fossology_license.FossologyLicenseIndexer

class swh.indexer.tasks.ContentRangeFossologyLicense[source]

Bases: swh.indexer.tasks.StatusTask

Compute fossology license on a range of sha1s.

task_queue = 'swh_indexer_content_fossology_license_range'
Indexer

alias of swh.indexer.fossology_license.FossologyLicenseRangeIndexer

class swh.indexer.tasks.RecomputeChecksums[source]

Bases: swh.indexer.tasks.Task

Task which recomputes existing hashes and possibly computes new ones.

task_queue = 'swh_indexer_content_rehash'
Indexer

alias of swh.indexer.rehash.RecomputeChecksums

Module contents