swh.indexer.fossology_license module

swh.indexer.fossology_license.compute_license(path)

Determine license from file at path.

Parameters

path – path of the file whose license is to be determined

Returns

A dict with the following keys:

  • licenses ([str]): licenses detected in the file at path

  • path (bytes): content filepath

Return type

dict
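
As a rough illustration of the contract above, the following sketch runs the nomossa command named in ADDITIONAL_CONFIG and builds the documented dict. The names compute_license_sketch and parse_nomos_output are hypothetical, and the assumed output form ("File <name> contains license(s) <l1>,<l2>") is an assumption about nomossa, not something this page states.

```python
import subprocess
from typing import Any, Dict, List


def parse_nomos_output(output: str) -> List[str]:
    """Extract license names from one line of nomossa output.

    Assumption: the line looks like
    'File <name> contains license(s) <l1>,<l2>'.
    """
    marker = " contains license(s) "
    line = output.strip()
    if marker not in line:
        # Nothing recognizable: report no detected licenses.
        return []
    return line.split(marker, 1)[1].split(",")


def compute_license_sketch(path: str) -> Dict[str, Any]:
    """Run nomossa on path and return a dict with the documented keys."""
    output = subprocess.check_output(["nomossa", path], universal_newlines=True)
    return {"licenses": parse_nomos_output(output), "path": path}
```

Keeping the parsing separate from the subprocess call makes the parsing testable on machines where nomossa is not installed.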

class swh.indexer.fossology_license.MixinFossologyLicenseIndexer

Bases: object

Mixin fossology license indexer.

See FossologyLicenseIndexer and FossologyLicenseRangeIndexer.

ADDITIONAL_CONFIG = {'tools': ('dict', {'name': 'nomos', 'version': '3.1.0rc2-31-ga2cbb8c', 'configuration': {'command_line': 'nomossa <filepath>'}}), 'workdir': ('str', '/tmp/swh/indexer.fossology.license'), 'write_batch_size': ('int', 1000)}
CONFIG_BASE_FILENAME = 'indexer/fossology_license'
tool: Any
idx_storage: Any
prepare()
index(id: bytes, data: Optional[bytes] = None, **kwargs) → Dict[str, Any]

Index the content of the given sha1 and store the result.

Parameters
  • id (bytes) – content’s identifier

  • data (bytes) – raw content associated with the content id

Returns

A dict, representing a content_license, with keys:

  • id (bytes): content’s identifier (sha1)

  • license (bytes): license in bytes

  • path (bytes): path

  • indexer_configuration_id (int): tool used to compute the output

Return type

dict
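
One way to picture the flow described above: write the raw content to a temporary file, run a path-based license detector on it, then attach the content id and the tool id. This is only a sketch. index_sketch, tool_id and license_fn are hypothetical names, and the keys returned by the detector are passed through unchanged rather than mapped to the exact keys listed above.

```python
import os
import tempfile
from typing import Any, Callable, Dict


def index_sketch(
    id: bytes,
    data: bytes,
    tool_id: int,
    license_fn: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Detect licenses for one content and tag the result with its ids."""
    # The detector works on file paths, so the raw bytes go to a temp file.
    fd, path = tempfile.mkstemp(prefix="swh-license-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        result = dict(license_fn(path))
    finally:
        # Remove the temp file whether or not detection succeeded.
        os.unlink(path)
    result.update({"id": id, "indexer_configuration_id": tool_id})
    return result
```

Passing the detector in as a function keeps the sketch independent of any particular license tool.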

persist_index_computations(results: List[Dict], policy_update: str) → Dict[str, int]

Persist the results in storage.

Parameters
  • results

    list of content_license dicts with the following keys:

    • id (bytes): content’s identifier (sha1)

    • license (bytes): license in bytes

    • path (bytes): path

  • policy_update – either ‘update-dups’ or ‘ignore-dups’ to respectively update duplicates or ignore them
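
The effect of policy_update can be sketched against an in-memory dict standing in for the indexer storage. Everything here is illustrative: persist_sketch, the store argument and the "written" summary key are hypothetical and do not reflect the real storage API.

```python
from typing import Dict, List


def persist_sketch(
    store: Dict[bytes, Dict],
    results: List[Dict],
    policy_update: str,
) -> Dict[str, int]:
    """Write results into store, honoring the duplicate policy."""
    if policy_update not in ("update-dups", "ignore-dups"):
        raise ValueError(f"unknown policy_update: {policy_update!r}")
    overwrite = policy_update == "update-dups"
    written = 0
    for row in results:
        if row["id"] in store and not overwrite:
            # 'ignore-dups': keep the row that is already stored.
            continue
        store[row["id"]] = row
        written += 1
    # Hypothetical summary; the real method's dict keys may differ.
    return {"written": written}
```

With 'ignore-dups' an existing row for the same id is left alone; with 'update-dups' it is replaced by the new one.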

class swh.indexer.fossology_license.FossologyLicenseIndexer(config=None, **kw)

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.ContentIndexer

Indexer in charge of:

  • filtering out content already indexed

  • reading content from objstorage per the content’s id (sha1)

  • computing {license, encoding} from that content

  • storing the result in storage

filter(ids)

Filter out known sha1s and return only missing ones.
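
A minimal sketch of this filtering step, assuming the already indexed sha1s are available as a set. In the real indexer that information would come from idx_storage; filter_sketch and already_indexed are hypothetical names.

```python
from typing import Iterable, Iterator, Set


def filter_sketch(ids: Iterable[bytes], already_indexed: Set[bytes]) -> Iterator[bytes]:
    """Yield only the sha1s that have not been indexed yet."""
    for sha1 in ids:
        if sha1 not in already_indexed:
            yield sha1
```

For instance, list(filter_sketch([b"a", b"b", b"c"], {b"b"})) keeps b"a" and b"c" and drops the known b"b".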

idx_storage
class swh.indexer.fossology_license.FossologyLicenseRangeIndexer(config=None, **kw)

Bases: swh.indexer.fossology_license.MixinFossologyLicenseIndexer, swh.indexer.indexer.ContentRangeIndexer

Fossology license range indexer working on a range of content identifiers.

  • filters out non-textual content

  • (optionally) filters out content already indexed (cf indexed_contents_in_range())

  • reads content from objstorage per the content’s id (sha1)

  • computes the licenses from that content

  • stores result in storage

indexed_contents_in_range(start, end)

Retrieve the indexed content ids within the range [start, end].

Parameters
  • start (bytes) – lower bound of the identifier range

  • end (bytes) – upper bound of the identifier range

Returns

a dict with keys:

  • ids ([bytes]): iterable of content ids within the range

  • next (Optional[bytes]): sha1 at which the next range starts, if any

Return type

dict
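
The ids/next pair suggests a cursor-style iteration: call again with next as the new start until next is None. The sketch below assumes exactly that reading. iter_indexed_ids and fetch are hypothetical names; fetch stands in for a bound indexed_contents_in_range.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_indexed_ids(
    fetch: Callable[[bytes, bytes], Dict[str, Any]],
    start: bytes,
    end: bytes,
) -> Iterator[bytes]:
    """Yield every indexed id in [start, end], following 'next' cursors."""
    cursor: Optional[bytes] = start
    while cursor is not None:
        page = fetch(cursor, end)
        yield from page["ids"]
        # None means there is no further page in the range.
        cursor = page["next"]
```

A caller would pass something like indexer.indexed_contents_in_range as fetch and consume the generator lazily, so only one page of ids is held at a time.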

idx_storage