swh.datasets.luigi package
Submodules:
- swh.datasets.luigi.aggregate_datasets module
- Luigi tasks for producing the aggregated derived datasets
  - ExportNodesTable
  - AggregateContentDatasets: local_graph_path, graph_name, popular_content_names_path, provenance_dir, aggregate_datasets_path, requires(), output(), run()
  - UploadNodesTable
  - UploadAggregatedContentDataset
  - RunAggregatedDatasets
- swh.datasets.luigi.blobs_datasets module
- Luigi tasks for blob-centric datasets
  - atomic_zstd_writer()
  - atomic_csv_zstd_writer()
  - check_csv()
  - SelectBlobs
  - DownloadBlobs
  - MakeBlobTarball
  - MakeSampleBlobTarball
  - ComputeBlobFileinfo
  - BlobScancode: blob_filter, derived_datasets_path, FIELDNAMES, DEFAULT_MIN_SCORE, DEFAULT_JOBS, DEFAULT_TIMEOUT, MAP_CHUNKSIZE, WORKER_MAX_TASKS, FIELD_SEP, READABLE_ENCODINGS, requires(), output(), run()
  - FindBlobOrigins
  - CountBlobOrigins
  - FindEarliestRevisions
  - RunBlobDataset
- swh.datasets.luigi.file_names module
- Luigi tasks for producing the most common names of every content and datasets based on file names
  - PopularContentNames: local_graph_path, popular_contents_path, graph_name, max_results_per_content, popularity_threshold, resources, requires(), output(), run()
  - PopularContentPaths
  - PopularContentNamesOrcToS3
  - ListFilesByName
- swh.datasets.luigi.impact module
- Luigi tasks to measure institutional impact
  - SelectPersons
  - ComputeRawImpact: local_graph_path, local_sensitive_graph_path, graph_name, impact_dir, output_emails, include_ranges, exclude_ranges, requires(), output(), run()
  - opt_int()
  - typeddict_parser()
  - OriginWithEmails: origin_id, origin_url, num_contributed_revs, num_contributed_rels, num_contributed_revs_in_main_branch, total_revs, total_rels, total_revs_in_main_branch, first_contributed_revrel, first_contributed_revrel_ts, first_contributed_revrel_author, first_contributed_revrel_committer, last_contributed_revrel, last_contributed_revrel_ts, last_contributed_revrel_author, last_contributed_revrel_committer, first_revrel_ts, last_revrel_ts, codemeta_json, citation_cff, forward_paths, forward_paths_to_roots, forward_descendants, histhost_forward_paths, histhost_forward_paths_to_roots, histhost_forward_descendants
  - OriginWithoutEmails: origin_id, origin_url, num_contributed_revs, num_contributed_rels, num_contributed_revs_in_main_branch, total_revs, total_rels, total_revs_in_main_branch, first_contributed_revrel, first_contributed_revrel_ts, last_contributed_revrel, last_contributed_revrel_ts, first_revrel_ts, last_revrel_ts, codemeta_json, citation_cff, forward_paths, forward_paths_to_roots, forward_descendants, histhost_forward_paths, histhost_forward_paths_to_roots, histhost_forward_descendants
  - ComputeIndexedImpact: indexer_storage_url, swh_scheduler_url, FORK_FILTERS, local_graph_path, graph_name, impact_dir, output_emails, fork_filter, requires(), output(), run()
  - DenormalizeImpactedRevrels: indexer_storage_url, swh_scheduler_url, local_graph_path, local_sensitive_graph_path, graph_name, impact_dir, write_revrels, write_origin_revrels, revrel_filter, algorithm, include_ranges, exclude_ranges, requires(), output(), run()
- swh.datasets.luigi.origin_contributors module
- Luigi tasks for contribution graph
  - ListOriginContributors: local_graph_path, topology_dir, origin_contributors_path, origin_urls_path, graph_name, max_ram_mb, resources, requires(), output(), run()
  - ExportDeanonymizationTable
  - DeanonymizeContributors: local_export_path, local_graph_path, graph_name, deanonymization_table_path, mph_algo, deanonymization_mapping_path, requires(), output(), run()
  - DeanonymizeOriginContributors: local_graph_path, graph_name, origin_contributors_path, deanonymization_table_path, deanonymized_origin_contributors_path, mph_algo, deanonymization_mapping_path, requires(), output(), run()
  - RunOriginContributors: local_graph_path, graph_name, origin_urls_path, origin_contributors_path, deanonymized_origin_contributors_path, skip_integrity_check, test_origin, test_person, test_years, requires(), run()
- swh.datasets.luigi.specific_languages_datasets module
  - ExtractFileExtension
  - JoinFilteredContentsWithNodes: aggregate_datasets_path, specific_languages_path, digestmap_path, extensions, requires(), output(), run()
Module contents:
Luigi tasks
This package contains Luigi tasks. These come in two kinds:
- in swh.graph.luigi.compressed_graph: an alternative to the ‘swh graph compress’ CLI that can be composed with other tasks, such as swh-dataset’s
- in other submodules: tasks driving the creation of specific datasets that are generated using the compressed graph
The overall directory structure is:
    base_dir/
        <date>[_<flavor>]/
            edges/
                ...
            orc/
                ...
            compressed/
                graph.graph
                graph.mph
                ...
                meta/
                    export.json
                    compression.json
            datasets/
                contribution_graph.csv.zst
            topology/
                topological_order_dfs.csv.zst
And optionally:
    sensitive_base_dir/
        <date>[_<flavor>]/
            persons_sha256_to_name.csv.zst
            datasets/
                contribution_graph.deanonymized.csv.zst
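As a minimal sketch of how the layout above maps to concrete paths, the helpers below build a few of the documented locations with `pathlib`. The function names `dataset_dir` and `layout_paths` are hypothetical and are not part of the package. The nesting is an assumption read off the listing: every entry is taken to live under `base_dir/<date>[_<flavor>]/`.

```python
from pathlib import Path
from typing import Dict, Optional


def dataset_dir(base_dir: str, date: str, flavor: Optional[str] = None) -> Path:
    """Return base_dir/<date>[_<flavor>]/ (hypothetical helper)."""
    name = f"{date}_{flavor}" if flavor else date
    return Path(base_dir) / name


def layout_paths(base_dir: str, date: str, flavor: Optional[str] = None) -> Dict[str, Path]:
    """Map a few entries of the documented layout to concrete paths."""
    root = dataset_dir(base_dir, date, flavor)
    return {
        "edges": root / "edges",
        "orc": root / "orc",
        "compressed": root / "compressed",
        # Assumed to sit under datasets/ and topology/ respectively.
        "contribution_graph": root / "datasets" / "contribution_graph.csv.zst",
        "topological_order": root / "topology" / "topological_order_dfs.csv.zst",
    }


if __name__ == "__main__":
    # Example values only; the base directory and date are placeholders.
    for key, path in layout_paths("/srv/base_dir", "2022-11-08").items():
        print(f"{key}: {path}")
```

The optional `sensitive_base_dir/` tree follows the same `<date>[_<flavor>]/` naming, so `dataset_dir` can be reused for it with a different base directory.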
- class swh.datasets.luigi.UploadExportAndCompressedGraphToS3(*args, **kwargs)
Bases: WrapperTask
Uploads the local dataset export to S3, then the compressed graph to S3.
It automatically creates the local dataset export or the compressed graph if either is missing.
This task is a combination of UploadExportToS3 and UploadGraphToS3.
Example invocation:
luigi --local-scheduler --module swh.graph.luigi UploadExportAndCompressedGraphToS3 --local-graph-path=graph/ --s3-graph-path=s3://softwareheritage/graph/swh_2022-11-08/compressed/ ...
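The invocation above can also be assembled programmatically, for instance to launch it from a script. The sketch below only builds the argument list; it relies on Luigi's convention that a task parameter such as `local_graph_path` is exposed on the command line as `--local-graph-path`. The helper name `luigi_argv` is hypothetical, and only the two options shown in the example are included: the options elided by `...` still have to be supplied.

```python
import shlex
from typing import Dict, List


def luigi_argv(module: str, task: str, params: Dict[str, str]) -> List[str]:
    """Build an argv list for the `luigi` CLI using the local scheduler."""
    argv = ["luigi", "--local-scheduler", "--module", module, task]
    for name, value in params.items():
        # Parameter names use underscores in Python and dashes on the CLI.
        argv.append(f"--{name.replace('_', '-')}={value}")
    return argv


argv = luigi_argv(
    "swh.graph.luigi",
    "UploadExportAndCompressedGraphToS3",
    {
        "local_graph_path": "graph/",
        "s3_graph_path": "s3://softwareheritage/graph/swh_2022-11-08/compressed/",
    },
)

# Print a shell-quoted form of the command for inspection.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the remaining required options have been added.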
- requires()
Returns instances of swh.export.luigi.UploadExportToS3 and swh.graph.luigi.compressed_graph.UploadGraphToS3, which recursively depend on the whole export and compression pipeline.
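To illustrate what "recursively depend" means for a wrapper task, here is a toy model in plain Python. It does not use Luigi and is not how the scheduler is implemented. Only the first entry of the table reflects the `requires()` described above; the other edges, and the use of `ExportGraph` and `CompressGraph` as upstream steps, are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Toy dependency table. The first entry mirrors the documented requires();
# every other edge is a hypothetical stand-in for the real pipeline.
REQUIRES: Dict[str, List[str]] = {
    "UploadExportAndCompressedGraphToS3": ["UploadExportToS3", "UploadGraphToS3"],
    "UploadExportToS3": ["ExportGraph"],
    "UploadGraphToS3": ["CompressGraph"],
    "CompressGraph": ["ExportGraph"],
    "ExportGraph": [],
}


def run_order(task: str, done: Optional[List[str]] = None) -> List[str]:
    """Return tasks so that every dependency precedes the task needing it."""
    if done is None:
        done = []
    for dep in REQUIRES[task]:
        run_order(dep, done)
    if task not in done:
        # A task shared by several dependents is scheduled only once.
        done.append(task)
    return done


if __name__ == "__main__":
    for step in run_order("UploadExportAndCompressedGraphToS3"):
        print(step)
```

Requesting the wrapper alone is enough to pull in every upstream step, which is why the task can create a missing export or compressed graph before uploading.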
- class swh.datasets.luigi.RunNewGraph(*args, **kwargs)
Bases: WrapperTask
Runs dataset export, graph compression, and generates datasets using the graph.
- requires() → List[Task]
Returns instances of swh.export.luigi.RunExportAll and swh.export.luigi.UploadExportAndCompressedGraphToS3, which recursively depend on the whole export and compression pipeline. Also runs some of the derived datasets through swh.datasets.topology.UploadGenerationsToS3 and swh.datasets.aggregate_datasets.RunAggregatedDatasets.