swh.indexer.fossology_license module

swh.indexer.fossology_license.compute_license(path) → Dict

Determine the license of the file at path.

Parameters:

path – path of the file whose license is to be determined

Returns:

A dict with the following keys:

  • licenses ([str]): licenses detected for the file at path

  • path (bytes): content filepath

Return type:

dict
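The returned dict can be consumed as sketched below. This is only an illustration of the documented shape: `format_license_report` is a hypothetical helper, and the sample values are hand-written rather than real tool output.

```python
from typing import Dict, List


def format_license_report(result: Dict) -> str:
    """Render a compute_license-style dict as a one-line report.

    Hypothetical helper for illustration; not part of swh.indexer.
    """
    licenses: List[str] = result.get("licenses", [])
    path = result["path"]
    # path is documented as bytes; decode defensively for display.
    if isinstance(path, bytes):
        shown = path.decode("utf-8", errors="replace")
    else:
        shown = str(path)
    if not licenses:
        return f"{shown}: no license detected"
    return f"{shown}: {', '.join(licenses)}"


# Hand-written sample mirroring the documented keys.
sample = {"licenses": ["GPL-2.0", "MIT"], "path": b"/tmp/example.c"}
report = format_license_report(sample)
```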

class swh.indexer.fossology_license.MixinFossologyLicenseIndexer(*args, **kwargs)

Bases: object

Mixin fossology license indexer.

See FossologyLicenseIndexer and FossologyLicensePartitionIndexer

tool: Any
idx_storage: IndexerStorageInterface
index(id: bytes, data: Optional[bytes] = None, **kwargs) → List[ContentLicenseRow]

Index sha1s’ content and store result.

Parameters:
  • id (bytes) – content’s identifier

  • data (bytes) – raw content associated with the content id

Returns:

A list of ContentLicenseRow objects, each representing a content_license, with fields:

  • id (bytes): content’s identifier (sha1)

  • license (bytes): license in bytes

  • path (bytes): path

  • indexer_configuration_id (int): tool used to compute the output

Return type:

List[ContentLicenseRow]
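Since the signature returns a list, one plausible reading is that each detected license becomes its own row. The sketch below illustrates that reading only; `LicenseRow` and `rows_from_result` are hypothetical stand-ins (not the real ContentLicenseRow or the indexer's code), and licenses are kept as strings as returned by compute_license.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LicenseRow:
    """Hypothetical stand-in for ContentLicenseRow, for illustration only."""

    id: bytes
    license: str
    indexer_configuration_id: int


def rows_from_result(content_id: bytes, result: Dict, tool_id: int) -> List[LicenseRow]:
    """Build one row per detected license (assumed fan-out, not confirmed)."""
    return [
        LicenseRow(id=content_id, license=lic, indexer_configuration_id=tool_id)
        for lic in result["licenses"]
    ]


# Hand-written input mirroring the compute_license return shape.
rows = rows_from_result(
    b"\x01" * 20,
    {"licenses": ["MIT", "GPL-2.0"], "path": b"/tmp/example.c"},
    tool_id=7,
)
```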

persist_index_computations(results: List[ContentLicenseRow]) → Dict[str, int]

Persist the results in storage.

Parameters:

results – list of content_license rows (ContentLicenseRow) with the following keys:

  • id (bytes): content’s identifier (sha1)

  • license (bytes): license in bytes

  • path (bytes): path
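The Dict[str, int] return type suggests a summary of counters. A minimal in-memory sketch of that idea follows; `InMemoryLicenseStore`, its `add` method, and the counter name are all invented for illustration and are not the indexer storage API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class InMemoryLicenseStore:
    """Hypothetical stand-in for the indexer storage; not the swh API."""

    rows: Set[Tuple[bytes, bytes]] = field(default_factory=set)

    def add(self, results: List[Dict]) -> Dict[str, int]:
        """Store (id, license) pairs and report how many were new."""
        added = 0
        for row in results:
            key = (row["id"], row["license"])
            if key not in self.rows:
                self.rows.add(key)
                added += 1
        # The counter name is invented for this sketch.
        return {"license_rows:add": added}


store = InMemoryLicenseStore()
summary = store.add(
    [
        {"id": b"\x01" * 20, "license": b"MIT", "path": b"a.c"},
        {"id": b"\x01" * 20, "license": b"MIT", "path": b"a.c"},  # duplicate
        {"id": b"\x02" * 20, "license": b"GPL-2.0", "path": b"b.c"},
    ]
)
```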

class swh.indexer.fossology_license.FossologyLicenseIndexer(*args, **kwargs)

Bases: MixinFossologyLicenseIndexer, ContentIndexer[ContentLicenseRow]

Indexer in charge of:

  • filtering out content already indexed

  • reading content from objstorage per the content’s id (sha1)

  • computing the licenses from that content

  • storing the result in storage

Prepare and check that the indexer is ready to run.

filter(ids)

Filter out known sha1s and return only missing ones.
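The filtering step can be pictured as a set difference that preserves input order. The sketch below assumes the already-indexed ids are available as a plain set; `filter_missing` is a hypothetical function, and the real method presumably queries the indexer storage instead.

```python
from typing import Iterable, Iterator, Set


def filter_missing(ids: Iterable[bytes], already_indexed: Set[bytes]) -> Iterator[bytes]:
    """Yield only the ids that are not yet indexed, in input order."""
    for sha1 in ids:
        if sha1 not in already_indexed:
            yield sha1


# Short byte strings stand in for real 20-byte sha1 digests.
missing = list(filter_missing([b"a", b"b", b"c"], already_indexed={b"b"}))
```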

idx_storage: IndexerStorageInterface

class swh.indexer.fossology_license.FossologyLicensePartitionIndexer(*args, **kwargs)

Bases: MixinFossologyLicenseIndexer, ContentPartitionIndexer[ContentLicenseRow]

FossologyLicense range indexer working on a range/partition of content identifiers.

  • filters out non-textual content

  • (optionally) filters out content already indexed (cf indexed_contents_in_partition())

  • reads content from objstorage per the content’s id (sha1)

  • computes the licenses from that content

  • stores result in storage

Prepare and check that the indexer is ready to run.

indexed_contents_in_partition(partition_id: int, nb_partitions: int, page_token: Optional[str] = None) → Iterable[bytes]

Retrieve the ids of the indexed contents within the given partition.

Parameters:
  • partition_id – Index of the partition to fetch

  • nb_partitions – Total number of partitions to split into

  • page_token – opaque token used for pagination
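To make partition_id and nb_partitions concrete, the sketch below splits the sha1 space into equal-width ranges and selects the ids falling in one of them. The even split and both function names are assumptions made for illustration; the actual partitioning scheme used by the storage is not described here.

```python
from typing import Iterable, List


def partition_of(sha1: bytes, nb_partitions: int) -> int:
    """Map a digest to a partition index in [0, nb_partitions).

    Assumes the id space is split into equal-width ranges.
    """
    if nb_partitions <= 0:
        raise ValueError("nb_partitions must be positive")
    space = 1 << (8 * len(sha1))  # size of the id space for this digest length
    return int.from_bytes(sha1, "big") * nb_partitions // space


def ids_in_partition(ids: Iterable[bytes], partition_id: int, nb_partitions: int) -> List[bytes]:
    """Keep only the ids that fall in the requested partition."""
    return [i for i in ids if partition_of(i, nb_partitions) == partition_id]


low = b"\x00" * 20
mid = b"\x80" + b"\x00" * 19
high = b"\xff" * 20
selected = ids_in_partition([low, mid, high], partition_id=2, nb_partitions=4)
```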

idx_storage: IndexerStorageInterface