swh.indexer package

Submodules

swh.indexer.cli module

swh.indexer.cli._get_api(getter, config, config_key, url)[source]
swh.indexer.cli.list_origins_by_producer(idx_storage, mappings, tool_ids)[source]
swh.indexer.cli.main()[source]

swh.indexer.codemeta module

swh.indexer.codemeta.make_absolute_uri(local_name)[source]
swh.indexer.codemeta._read_crosstable(fd)[source]
swh.indexer.codemeta._document_loader(url)[source]

Document loader for pyld.

Reads the local codemeta.jsonld file instead of fetching it from the Internet every single time.

swh.indexer.codemeta.compact(doc)[source]

Same as pyld.jsonld.compact, but in the context of CodeMeta.

swh.indexer.codemeta.expand(doc)[source]

Same as pyld.jsonld.expand, but in the context of CodeMeta.
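
For illustration, a minimal sketch of how these two helpers could be combined; the document contents and the CodeMeta 2.0 context URL are assumptions, not part of this API:

from swh.indexer.codemeta import compact, expand

# A hypothetical CodeMeta document.
doc = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "name": "my-project",
    "license": "https://spdx.org/licenses/MIT",
}

expanded = expand(doc)         # property names become absolute IRIs
compacted = compact(expanded)  # back to CodeMeta short names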

swh.indexer.ctags module

swh.indexer.ctags.compute_language(content, log=None)[source]
swh.indexer.ctags.run_ctags(path, lang=None, ctags_command='ctags')[source]

Run ctags on file path with optional language.

Parameters:
  • path – path to the file
  • lang – language for that path (optional)
Yields:

dict – ctags’ output
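
A minimal usage sketch, assuming a universal-ctags binary with JSON output support is installed; the file path is illustrative:

from swh.indexer.ctags import run_ctags

# run_ctags is a generator; each yielded dict is one symbol parsed from
# ctags' JSON output (name, kind, line, lang, ...).
for symbol in run_ctags('/tmp/example.py', lang='Python'):
    print(symbol)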

class swh.indexer.ctags.CtagsIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.ContentIndexer

CONFIG_BASE_FILENAME = 'indexer/ctags'
ADDITIONAL_CONFIG = {'languages': ('dict', {'ada': 'Ada', 'agda': None, 'adl': None}), 'tools': ('dict', {'name': 'universal-ctags', 'configuration': {'command_line': 'ctags --fields=+lnz --sort=no --links=no --output-format=json <filepath>'}, 'version': '~git7859817b'}), 'workdir': ('str', '/tmp/swh/indexer.ctags')}
prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

filter(ids)[source]

Filter out known sha1s and return only missing ones.

index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

a dict representing a content_ctags with keys:

  • id (bytes): content’s identifier (sha1)
  • ctags ([dict]): ctags list of symbols

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_ctags, dicts with the following keys:

    • id (bytes): content’s identifier (sha1)
    • ctags ([dict]): ctags list of symbols
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

swh.indexer.fossology_license module

swh.indexer.fossology_license.compute_license(path, log=None)[source]

Determine license from file at path.

Parameters:path – filepath to determine the license
Returns:A dict with the following keys:
  • licenses ([str]): licenses detected for path
  • path (bytes): content filepath
Return type:dict
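
A usage sketch, assuming the nomos license scanner is installed; the file path and detected license are illustrative:

from swh.indexer.fossology_license import compute_license

result = compute_license('/tmp/hello.c')
# e.g. {'licenses': ['GPL-2.0'], 'path': b'/tmp/hello.c'}
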
class swh.indexer.fossology_license.MixinFossologyLicenseIndexer[source]

Bases: object

Mixin fossology license indexer.

See FossologyLicenseIndexer and FossologyLicenseRangeIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'nomos', 'configuration': {'command_line': 'nomossa <filepath>'}, 'version': '3.1.0rc2-31-ga2cbb8c'}), 'workdir': ('str', '/tmp/swh/indexer.fossology.license'), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/fossology_license'
prepare()[source]
index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content associated with the id
Returns:

A dict, representing a content_license, with keys:

  • id (bytes): content’s identifier (sha1)
  • license (bytes): license in bytes
  • path (bytes): path
  • indexer_configuration_id (int): tool used to compute the output

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) –

    list of content_license, dict with the following keys:

    • id (bytes): content’s identifier (sha1)
    • license (bytes): license in bytes
    • path (bytes): path
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

class swh.indexer.fossology_license.FossologyLicenseIndexer(config=None, **kw)[source]

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.ContentIndexer

Indexer in charge of:

  • filtering out content already indexed
  • reading content from objstorage per the content’s id (sha1)
  • computing the license from that content
  • storing the result in storage
filter(ids)[source]

Filter out known sha1s and return only missing ones.

class swh.indexer.fossology_license.FossologyLicenseRangeIndexer(config=None, **kw)[source]

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.ContentRangeIndexer

FossologyLicense Range Indexer working on range of content identifiers.

  • filters out non-textual content
  • (optionally) filters out content already indexed (cf. indexed_contents_in_range())
  • reads content from objstorage per the content’s id (sha1)
  • computes the license from that content
  • stores result in storage
indexed_contents_in_range(start, end)[source]

Retrieve indexed content ids within the range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Returns:

a dict with keys:

  • ids [bytes]: iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1 if any

Return type:

dict


swh.indexer.indexer module

swh.indexer.indexer.write_to_temp(filename, data, working_directory)[source]

Write the sha1’s content in a temporary file.

Parameters:
  • filename (str) – one of sha1’s many filenames
  • data (bytes) – the sha1’s content to write in temporary file
Returns:

The path to the temporary file created. That file is filled in with the raw content’s data.
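
In recent swh.indexer sources this helper is a context manager that yields the temporary path and cleans up on exit; a sketch under that assumption (the working directory is illustrative):

from swh.indexer.indexer import write_to_temp

data = b'print("hello")\n'
with write_to_temp('hello.py', data, '/tmp/swh/indexer.workdir') as path:
    # path points to a temporary file containing `data`; run an external
    # tool on it here, before the file is cleaned up.
    print(path)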

class swh.indexer.indexer.BaseIndexer(config=None, **kw)[source]

Bases: swh.core.config.SWHConfig

Base class for indexers to inherit from.

The main entry point is the run() function which is in charge of triggering the computations on the batch dict/ids received.

Indexers can:

  • filter out ids whose data has already been indexed.
  • retrieve the ids’ data from storage or objstorage
  • index this data depending on the object and store the result in storage.

To implement a new object type indexer, inherit from the BaseIndexer and implement indexing:

run():
object ids differ depending on the object type. For example: sha1 for content; sha1_git for revision, directory, and release; and id for origin.

To implement a new concrete indexer, inherit from the object level classes: ContentIndexer, RevisionIndexer, OriginIndexer.

Then you need to implement the following functions:

filter():
filter out data already indexed (in storage).
index():
compute index on id with data (retrieved from the storage or the objstorage by the id key) and return the resulting index computation.
persist_index_computations():
persist the results of multiple index computations in the storage.

The new indexer implementation can also override the following functions:

prepare():
Configuration preparation for the indexer. When overriding, this must call super().prepare().
check():
Configuration check for the indexer. When overriding, this must call super().check().
register_tools():
This should return a dict of the tool(s) to use when indexing or filtering.
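
For illustration, a skeleton of a concrete content indexer following this contract; all names are hypothetical and the storage lookups are elided:

from swh.indexer.indexer import ContentIndexer

class MyContentIndexer(ContentIndexer):
    def filter(self, ids):
        # Yield only ids not yet indexed; a real implementation would
        # query its indexer storage here.
        yield from ids

    def index(self, id, data):
        # Compute something from the raw content.
        return {'id': id, 'length': len(data)}

    def persist_index_computations(self, results, policy_update):
        # Persist results in the indexer storage, honoring policy_update
        # ('update-dups' or 'ignore-dups'); elided in this sketch.
        pass
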
CONFIG = 'indexer/base'
DEFAULT_CONFIG = {'indexer_storage': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5007/'}}), 'objstorage': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5003/'}}), 'storage': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}})}
ADDITIONAL_CONFIG = {}
USE_TOOLS = True
catch_exceptions = True

Prevents exceptions raised in index() from propagating further up. Set to False in tests so that all exceptions are caught.

__init__(config=None, **kw)[source]

Prepare and check that the indexer is ready to run.

prepare()[source]

Prepare the indexer’s needed runtime configuration. Without this step, the indexer cannot possibly run.

tool
check()[source]

Check that the indexer’s configuration is valid before proceeding. If valid, do nothing; if not, raise an error.

_prepare_tool(tool)[source]

Prepare the tool dict to be compliant with the storage api.

register_tools(tools)[source]

Register tools in the storage.

Adds a sensible default which can be overridden if insufficient. (For now, all indexers use only one tool.)

Expects the self.config[‘tools’] property to be set with one or more tools.

Parameters:tools (dict/[dict]) – Either a dict or a list of dict.
Returns:List of dicts with additional id key.
Return type:list
Raises:ValueError – if tools is neither a list nor a dict.
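
For illustration, a ‘tools’ entry in the expected shape, mirroring the ADDITIONAL_CONFIG defaults shown elsewhere on this page (a list of such dicts is also accepted):

tools = {
    'name': 'universal-ctags',
    'version': '~git7859817b',
    'configuration': {
        'command_line': 'ctags --fields=+lnz --sort=no --links=no '
                        '--output-format=json <filepath>',
    },
}
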
index(id, data)[source]

Index computation for the id and associated raw data.

Parameters:
  • id (bytes) – identifier
  • data (bytes) – id’s data from storage or objstorage depending on object type
Returns:

a dict that makes sense for the persist_index_computations() method.

Return type:

dict

filter(ids)[source]

Filter missing ids for that particular indexer.

Parameters:ids ([bytes]) – list of ids
Yields:iterator of missing ids
persist_index_computations(results, policy_update)[source]

Persist the computation resulting from the index.

Parameters:
  • results ([result]) – List of results. One result is the result of the index function.
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
Returns:

None

next_step(results, task)[source]

Do something else with computations results (e.g. send to another queue, …).

(This is not an abstractmethod since it is optional).

Parameters:
  • results ([result]) – List of results (dict) as returned by index function.
  • task (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
Returns:

None

run(ids, policy_update, next_step=None, **kwargs)[source]

Given a list of ids:

  • retrieves the data from the storage
  • executes the indexing computations
  • stores the results (according to policy_update)
Parameters:
  • ids ([bytes]) – id’s identifier list
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus a result_name key.
  • **kwargs – passed to the index method
class swh.indexer.indexer.ContentIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.BaseIndexer

A content indexer working on a list of ids directly.

To work on a range of ids, use the ContentRangeIndexer instead.

Note: ContentIndexer is not an instantiable object. To use it, one should inherit from this class and override the methods mentioned in the BaseIndexer class.

run(ids, policy_update, next_step=None, **kwargs)[source]

Given a list of ids:

  • retrieve the content from the storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids (Iterable[Union[bytes, str]]) – sha1’s identifier list
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
  • **kwargs – passed to the index method
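
A hypothetical invocation, reusing the MyContentIndexer sketch from the BaseIndexer section above (the sha1 is illustrative):

indexer = MyContentIndexer()
sha1 = bytes.fromhex('f572d396fae9206628714fb2ce00f72e94f2258f')
indexer.run([sha1], policy_update='update-dups')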
class swh.indexer.indexer.ContentRangeIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.BaseIndexer

A content range indexer.

This expects as input a range of ids to index.

To work on a list of ids, use the ContentIndexer instead.

Note: ContentRangeIndexer is not an instantiable object. To use it, one should inherit from this class and override the methods mentioned in the BaseIndexer class.

indexed_contents_in_range(start, end)[source]

Retrieve indexed contents within range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Yields:

bytes – Content identifier present in the range [start, end]

_list_contents_to_index(start, end, indexed)[source]

Compute from storage the new contents to index in the range [start, end]. Already indexed contents are skipped.

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
  • indexed (Set[bytes]) – Set of content already indexed.
Yields:

bytes – Identifier of contents to index.

_index_contents(start, end, indexed, **kwargs)[source]

Index the contents from within range [start, end]

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
  • indexed (Set[bytes]) – Set of content already indexed.
Yields:

dict – Data indexed to persist using the indexer storage

_index_with_skipping_already_done(start, end)[source]

Index not already indexed contents in range [start, end].

Parameters:
  • start (Union[bytes, str]) – Starting range identifier
  • end (Union[bytes, str]) – Ending range identifier
Yields:

bytes – Content identifiers present in the range [start, end] which are not already indexed.

run(start, end, skip_existing=True, **kwargs)[source]

Given a range of content ids, compute the indexing computations on the contents within. The indexer is either incremental (filtering out already-computed data) or not (computing everything from scratch).

Parameters:
  • start (Union[bytes, str]) – Starting range identifier
  • end (Union[bytes, str]) – Ending range identifier
  • skip_existing (bool) – Skip existing indexed data (default) or not
  • **kwargs – passed to the index method
Returns:

True if data was indexed, False otherwise.

Return type:

bool
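
A hypothetical invocation over the full sha1 space; range_indexer stands for an instance of a concrete subclass:

start = bytes.fromhex('00' * 20)
end = bytes.fromhex('ff' * 20)
indexed = range_indexer.run(start, end, skip_existing=True)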

swh.indexer.indexer.origin_get_params(id_)[source]

From either of the two types of origin identifier (an int, or a (type, url) pair), returns a dict that can be passed to Storage.origin_get. Also accepts JSON-encoded forms of these (used via the task scheduler).

>>> from pprint import pprint
>>> origin_get_params(123)
{'id': 123}
>>> pprint(origin_get_params(['git', 'https://example.com/foo.git']))
{'type': 'git', 'url': 'https://example.com/foo.git'}
>>> origin_get_params("123")
{'id': 123}
>>> pprint(origin_get_params('["git", "https://example.com/foo.git"]'))
{'type': 'git', 'url': 'https://example.com/foo.git'}
class swh.indexer.indexer.OriginIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.BaseIndexer

An object-type indexer that inherits from BaseIndexer and implements origin indexing through its run() method.

Note: the OriginIndexer is not an instantiable object. To use it in another context one should inherit from this class and override the methods mentioned in the BaseIndexer class.

run(ids, policy_update='update-dups', parse_ids=True, next_step=None, **kwargs)[source]

Given a list of origin ids:

  • retrieve origins from storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids ([Union[int, Tuple[str, bytes]]]) – list of origin ids or (type, url) tuples.
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates (default) or ignore them
  • next_step (dict) – a dict in the form expected by scheduler.backend.SchedulerBackend.create_tasks without next_run, plus an optional result_name key.
  • parse_ids (bool) – whether the ids should be parsed first (defaults to True)
  • **kwargs – passed to the index method
index_list(origins, **kwargs)[source]
class swh.indexer.indexer.RevisionIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.BaseIndexer

An object-type indexer that inherits from BaseIndexer and implements revision indexing through its run() method.

Note: the RevisionIndexer is not an instantiable object. To use it in another context one should inherit from this class and override the methods mentioned in the BaseIndexer class.

run(ids, policy_update, next_step=None)[source]

Given a list of sha1_gits:

  • retrieve revisions from storage
  • execute the indexing computations
  • store the results (according to policy_update)
Parameters:
  • ids ([bytes or str]) – sha1_git’s identifier list
  • policy_update (str) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

swh.indexer.journal_client module

class swh.indexer.journal_client.IndexerJournalClient[source]

Bases: swh.journal.client.JournalClient

Client in charge of listing new received origins and origin_visits in the swh journal.

CONFIG_BASE_FILENAME = 'indexer/journal_client'
ADDITIONAL_CONFIG = {'origin_visit_tasks': ('List[dict]', [{'type': 'index-origin-metadata', 'kwargs': {'policy_update': 'update-dups', 'parse_ids': False}}]), 'scheduler': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}})}
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

process_objects(messages)[source]
process_origin_visit(origin_visit)[source]

swh.indexer.metadata module

class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.ContentIndexer

Content-level indexer

This indexer is in charge of:

  • filtering out content already indexed in content_metadata
  • reading content from objstorage with the content’s id sha1
  • computing metadata for the given context
  • using the metadata_dictionary as the ‘swh-metadata-translator’ tool
  • storing the result in the content_metadata table
filter(ids)[source]

Filter out known sha1s and return only missing ones.

index(id, data, log_suffix='unknown revision')[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

dictionary representing a content_metadata. If the translation wasn’t successful, the metadata keys are returned as None

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content_metadata, dicts with the following keys:

    • id (bytes): content’s identifier (sha1)
    • metadata (jsonb): detected metadata
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
class swh.indexer.metadata.RevisionMetadataIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.RevisionIndexer

Revision-level indexer

This indexer is in charge of:

  • filtering out revisions already indexed in the revision_intrinsic_metadata table with the defined computation tool
  • retrieving all entry_files in the root directory
  • using metadata_detector for file_names containing metadata
  • computing the metadata translation if necessary and possible (depends on the tool)
  • sending sha1s to content indexing if possible
  • storing the results for the revision
ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'swh-metadata-detector', 'configuration': {}, 'version': '0.0.2'})}
filter(sha1_gits)[source]

Filter out known sha1s and return only missing ones.

index(rev)[source]

Index rev by processing it and organizing the result.

Uses metadata_detector to iterate over filenames:

  • if one filename is detected -> sends the file to the content indexer
  • if multiple files are detected -> translation is needed at the revision level
Parameters:rev (dict) – revision artifact from storage
Returns:dictionary representing a revision_intrinsic_metadata, with keys:
  • id (str): rev’s identifier (sha1_git)
  • indexer_configuration_id (bytes): tool used
  • metadata: dict of retrieved metadata
Return type:dict
persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of revision_intrinsic_metadata, dicts with the following keys:

    • id (bytes): revision’s identifier (sha1_git)
    • metadata: dict of retrieved metadata
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
translate_revision_intrinsic_metadata(detected_files, log_suffix)[source]

Determine the plan of action for translating metadata when one or multiple files are detected:

Parameters:detected_files (dict) – dictionary mapping context names (e.g., “npm”, “authors”) to list of sha1
Returns:list of mappings used and dict with translated metadata according to the CodeMeta vocabulary
Return type:(List[str], dict)
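
An illustrative call, assuming indexer is a RevisionMetadataIndexer instance; the mapping name and sha1 placeholder are made up:

detected_files = {'npm': [b'<sha1 of package.json>']}
mappings, metadata = indexer.translate_revision_intrinsic_metadata(
    detected_files, log_suffix='revision abc123')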
class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)[source]

Bases: swh.indexer.indexer.OriginIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'swh-metadata-detector', 'configuration': {}, 'version': '0.0.2'})}
USE_TOOLS = False
__init__(config=None, **kwargs)[source]

Prepare and check that the indexer is ready to run.

index_list(origins)[source]
persist_index_computations(results, policy_update)[source]

Persist the computation resulting from the index.

Parameters:
  • results ([result]) – List of results. One result is the result of the index function.
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
Returns:

None


swh.indexer.metadata_detector module

swh.indexer.metadata_detector.detect_metadata(files)[source]

Detects files potentially containing metadata

Parameters:files (list) – list of file entries
Returns:{mapping_filenames[name]:f[‘sha1’]} (may be empty)
Return type:dict
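
An illustrative call; the exact keys of the result depend on the configured mappings, and the sha1 values are placeholders:

from swh.indexer.metadata_detector import detect_metadata

files = [
    {'name': b'package.json', 'sha1': b'...'},
    {'name': b'README.md', 'sha1': b'...'},
]
detect_metadata(files)
# e.g. {'npm': [b'...']}; an empty dict when no metadata file is detected
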
swh.indexer.metadata_detector.extract_minimal_metadata_dict(metadata_list)[source]

Every item in the metadata_list is a dict of translated_metadata in the CodeMeta vocabulary.

We wish to extract a minimal set of terms and keep all values corresponding to these terms without duplication.

Parameters:metadata_list (list) – list of dicts of translated_metadata
Returns:minimal_dict, a dict with the selected metadata values
Return type:dict
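
An illustrative input/output pair (values are made up):

metadata_list = [
    {'name': 'foo', 'author': 'Jane'},
    {'name': 'foo', 'license': 'MIT'},
]
# extract_minimal_metadata_dict(metadata_list) would merge these into
# something like {'name': 'foo', 'author': 'Jane', 'license': 'MIT'},
# keeping each selected term once, without duplicated values.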

swh.indexer.mimetype module

swh.indexer.mimetype.compute_mimetype_encoding(raw_content)[source]

Determine mimetype and encoding from the raw content.

Parameters:raw_content (bytes) – content’s raw data
Returns:dict with mimetype and encoding keys and their corresponding values (as bytes).
Return type:dict
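
A usage sketch; the exact values depend on the libmagic version in use:

from swh.indexer.mimetype import compute_mimetype_encoding

compute_mimetype_encoding(b'def main():\n    pass\n')
# e.g. {'mimetype': b'text/x-python', 'encoding': b'us-ascii'}
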
class swh.indexer.mimetype.MixinMimetypeIndexer[source]

Bases: object

Mixin mimetype indexer.

See MimetypeIndexer and MimetypeRangeIndexer

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'file', 'configuration': {'type': 'library', 'debian-package': 'python3-magic'}, 'version': '1:5.30-1+deb9u1'}), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/mimetype'
index(id, data)[source]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier
  • data (bytes) – raw content in bytes
Returns:

content’s mimetype; dict keys being

  • id (bytes): content’s identifier (sha1)
  • mimetype (bytes): mimetype in bytes
  • encoding (bytes): encoding in bytes

Return type:

dict

persist_index_computations(results, policy_update)[source]

Persist the results in storage.

Parameters:
  • results ([dict]) – list of content’s mimetype dicts (see index())
  • policy_update ([str]) – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them

class swh.indexer.mimetype.MimetypeIndexer(config=None, **kw)[source]

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentIndexer

Mimetype Indexer working on list of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. filter())
  • reads content from objstorage per the content’s id (sha1)
  • computes {mimetype, encoding} from that content
  • stores result in storage
filter(ids)[source]

Filter out known sha1s and return only missing ones.

class swh.indexer.mimetype.MimetypeRangeIndexer(config=None, **kw)[source]

Bases: swh.indexer.mimetype.MixinMimetypeIndexer, swh.indexer.indexer.ContentRangeIndexer

Mimetype Range Indexer working on range of content identifiers.

It:

  • (optionally) filters out content already indexed (cf. indexed_contents_in_range())
  • reads content from objstorage per the content’s id (sha1)
  • computes {mimetype, encoding} from that content
  • stores result in storage
indexed_contents_in_range(start, end)[source]

Retrieve indexed content ids within the range [start, end].

Parameters:
  • start (bytes) – Starting bound from range identifier
  • end (bytes) – End range identifier
Returns:

a dict with keys:

  • ids [bytes]: iterable of content ids within the range.
  • next (Optional[bytes]): The next range of sha1 starts at this sha1 if any

Return type:

dict


swh.indexer.origin_head module

class swh.indexer.origin_head.OriginHeadIndexer(config=None, **kw)[source]

Bases: swh.indexer.indexer.OriginIndexer

Origin-level indexer.

This indexer is in charge of looking up the revision that acts as the “head” of an origin.

In git, this is usually the commit pointed to by the ‘master’ branch.

USE_TOOLS = False
persist_index_computations(results, policy_update)[source]

Do nothing. The indexer’s results are not persistent, they should only be piped to another indexer.

index(origin)[source]

Index computation for the id and associated raw data.

Parameters:
  • id (bytes) – identifier
  • data (bytes) – id’s data from storage or objstorage depending on object type
Returns:

a dict that makes sense for the persist_index_computations() method.

Return type:

dict

_try_get_vcs_head(snapshot)[source]
_try_get_hg_head(snapshot)
_try_get_git_head(snapshot)
_archive_filename_re = re.compile(b'^(?P<pkgname>.*)[-_](?P<version>[0-9]+(\\.[0-9])*)(?P<preversion>[-+][a-zA-Z0-9.~]+?)?(?P<extension>(\\.[a-zA-Z0-9]+)+)$')
classmethod _parse_version(filename)[source]

Extracts the release version from an archive filename, to get an ordering whose maximum is likely to be the last version of the software

>>> OriginHeadIndexer._parse_version(b'foo')
(-inf,)
>>> OriginHeadIndexer._parse_version(b'foo.tar.gz')
(-inf,)
>>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1.tar.gz')
(0, 0, 1, 0)
>>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1-beta2.tar.gz')
(0, 0, 1, -1, 'beta2')
>>> OriginHeadIndexer._parse_version(b'gnu-hello-0.0.1+foobar.tar.gz')
(0, 0, 1, 1, 'foobar')
_try_get_ftp_head(snapshot)[source]
_try_get_head_generic(snapshot)[source]
_try_resolve_target(branches, target_name)[source]

swh.indexer.rehash module

class swh.indexer.rehash.RecomputeChecksums[source]

Bases: swh.core.config.SWHConfig

Class in charge of (re)computing content’s hashes.

Hashes to compute are defined across two configuration options:

compute_checksums ([str])
list of hash algorithms that the swh.model.hashutil.MultiHash.from_data function should be able to deal with. For variable-length checksums, a desired checksum length must also be provided, in the format <algorithm name>:<length>, e.g. blake2:512.
recompute_checksums (bool)
whether hashes already present among those specified in compute_checksums should also be recomputed. Defaults to False.
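
For illustration, a configuration fragment selecting two extra checksums; the algorithm names are hypothetical examples:

config = {
    'compute_checksums': ['sha3-256', 'blake2:512'],
    'recompute_checksums': False,
}
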
DEFAULT_CONFIG = {'batch_size_retrieve_content': ('int', 10), 'batch_size_update': ('int', 100), 'compute_checksums': ('list[str]', []), 'objstorage': ('dict', {'cls': 'pathslicing', 'args': {'root': '/srv/softwareheritage/objects', 'slicing': '0:2/2:4/4:6'}}), 'recompute_checksums': ('bool', False), 'storage': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}})}
CONFIG_BASE_FILENAME = 'indexer/rehash'
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

_read_content_ids(contents)[source]

Read the content identifiers from the contents.

get_new_contents_metadata(all_contents)[source]

Retrieve raw contents and compute new checksums on the contents. Unknown or corrupted contents are skipped.

Parameters:
  • all_contents ([dict]) – List of contents as dictionary with the necessary primary keys
  • checksum_algorithms ([str]) – List of checksums to compute
Yields:

tuple – tuple of (content to update, list of checksums computed)

run(contents)[source]

Given a list of contents:

  • (re)compute a given set of checksums on contents available in our object storage
  • update those contents with the new metadata
Parameters:contents ([dict]) – contents as dictionaries with the necessary keys; the keys present in each dictionary should be the ones defined in the ‘primary_key’ option.

swh.indexer.tasks module

Module contents