swh.datasets.luigi package
Submodules
- swh.datasets.luigi.aggregate_datasets module
  - Luigi tasks for producing the aggregated derived datasets
    - ExportNodesTable
    - AggregateContentDatasets
      - AggregateContentDatasets.local_graph_path
      - AggregateContentDatasets.graph_name
      - AggregateContentDatasets.popular_content_names_path
      - AggregateContentDatasets.provenance_dir
      - AggregateContentDatasets.aggregate_datasets_path
      - AggregateContentDatasets.requires()
      - AggregateContentDatasets.output()
      - AggregateContentDatasets.run()
    - UploadNodesTable
    - UploadAggregatedContentDataset
    - RunAggregatedDatasets
- swh.datasets.luigi.blobs_datasets module
  - Luigi tasks for blob-centric datasets
    - atomic_zstd_writer()
    - atomic_csv_zstd_writer()
    - check_csv()
    - SelectBlobs
    - DownloadBlobs
    - MakeBlobTarball
    - MakeSampleBlobTarball
    - ComputeBlobFileinfo
    - BlobScancode
      - BlobScancode.blob_filter
      - BlobScancode.derived_datasets_path
      - BlobScancode.FIELDNAMES
      - BlobScancode.DEFAULT_MIN_SCORE
      - BlobScancode.DEFAULT_JOBS
      - BlobScancode.DEFAULT_TIMEOUT
      - BlobScancode.MAP_CHUNKSIZE
      - BlobScancode.WORKER_MAX_TASKS
      - BlobScancode.FIELD_SEP
      - BlobScancode.READABLE_ENCODINGS
      - BlobScancode.requires()
      - BlobScancode.output()
      - BlobScancode.run()
      - BlobScancode.previous_derived_datasets_path
    - FindBlobOrigins
    - CountBlobOrigins
    - FindEarliestRevisions
    - RunBlobDataset
- swh.datasets.luigi.file_names module
  - Luigi tasks for producing the most common names of every content and datasets based on file names
    - PopularContentNames
      - PopularContentNames.local_graph_path
      - PopularContentNames.popular_contents_path
      - PopularContentNames.graph_name
      - PopularContentNames.max_results_per_content
      - PopularContentNames.popularity_threshold
      - PopularContentNames.resources
      - PopularContentNames.requires()
      - PopularContentNames.output()
      - PopularContentNames.run()
    - PopularContentPaths
    - PopularContentNamesOrcToS3
    - ListFilesByName
- swh.datasets.luigi.impact module
  - Luigi tasks to measure institutional impact
    - ComputeRawImpact
    - ComputeIndexedImpact
      - ComputeIndexedImpact.indexer_storage_url
      - ComputeIndexedImpact.swh_scheduler_url
      - ComputeIndexedImpact.FORK_FILTERS
      - ComputeIndexedImpact.local_graph_path
      - ComputeIndexedImpact.graph_name
      - ComputeIndexedImpact.persons_path
      - ComputeIndexedImpact.raw_impact_path
      - ComputeIndexedImpact.output_emails
      - ComputeIndexedImpact.indexed_impact_path
      - ComputeIndexedImpact.fork_filter
      - ComputeIndexedImpact.requires()
      - ComputeIndexedImpact.output()
      - ComputeIndexedImpact.run()
- swh.datasets.luigi.origin_contributors module
  - Luigi tasks for contribution graph
    - ListOriginContributors
      - ListOriginContributors.local_graph_path
      - ListOriginContributors.topological_order_dir
      - ListOriginContributors.origin_contributors_path
      - ListOriginContributors.origin_urls_path
      - ListOriginContributors.graph_name
      - ListOriginContributors.max_ram_mb
      - ListOriginContributors.resources
      - ListOriginContributors.requires()
      - ListOriginContributors.output()
      - ListOriginContributors.run()
    - ExportDeanonymizationTable
    - DeanonymizeContributors
      - DeanonymizeContributors.local_graph_path
      - DeanonymizeContributors.graph_name
      - DeanonymizeContributors.deanonymization_table_path
      - DeanonymizeContributors.mph_algo
      - DeanonymizeContributors.deanonymization_mapping_path
      - DeanonymizeContributors.requires()
      - DeanonymizeContributors.output()
      - DeanonymizeContributors.run()
    - DeanonymizeOriginContributors
      - DeanonymizeOriginContributors.local_graph_path
      - DeanonymizeOriginContributors.graph_name
      - DeanonymizeOriginContributors.origin_contributors_path
      - DeanonymizeOriginContributors.deanonymization_table_path
      - DeanonymizeOriginContributors.deanonymized_origin_contributors_path
      - DeanonymizeOriginContributors.mph_algo
      - DeanonymizeOriginContributors.deanonymization_mapping_path
      - DeanonymizeOriginContributors.requires()
      - DeanonymizeOriginContributors.output()
      - DeanonymizeOriginContributors.run()
    - RunOriginContributors
      - RunOriginContributors.local_graph_path
      - RunOriginContributors.graph_name
      - RunOriginContributors.origin_urls_path
      - RunOriginContributors.origin_contributors_path
      - RunOriginContributors.deanonymized_origin_contributors_path
      - RunOriginContributors.skip_integrity_check
      - RunOriginContributors.test_origin
      - RunOriginContributors.test_person
      - RunOriginContributors.test_years
      - RunOriginContributors.requires()
      - RunOriginContributors.run()
Module contents
Luigi tasks
This package contains Luigi tasks. These come in two kinds:
- in swh.graph.luigi.compressed_graph: an alternative to the ‘swh graph compress’ CLI that can be composed with other tasks, such as swh-dataset’s
- in other submodules: tasks driving the creation of specific datasets that are generated using the compressed graph
The overall directory structure is:
    base_dir/
        <date>[_<flavor>]/
            edges/
                ...
            orc/
                ...
            compressed/
                graph.graph
                graph.mph
                ...
                meta/
                    export.json
                    compression.json
            datasets/
                contribution_graph.csv.zst
            topology/
                topological_order_dfs.csv.zst
And optionally:
    sensitive_base_dir/
        <date>[_<flavor>]/
            persons_sha256_to_name.csv.zst
            datasets/
                contribution_graph.deanonymized.csv.zst
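The layout above can be expressed as path arithmetic. The following is a minimal sketch, not part of the package: the helper names `dataset_root` and `layout`, the base directory `/srv/swh`, and the flavor `example` are hypothetical, and the nesting is assumed to follow the tree shown above.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def dataset_root(base_dir: str, date: str, flavor: Optional[str] = None) -> PurePosixPath:
    """Build base_dir/<date>[_<flavor>]/ (hypothetical helper)."""
    name = f"{date}_{flavor}" if flavor else date
    return PurePosixPath(base_dir) / name


def layout(root: PurePosixPath) -> Dict[str, PurePosixPath]:
    """Map a few well-known entries of the tree to their paths under root."""
    return {
        "edges": root / "edges",
        "orc": root / "orc",
        "graph": root / "compressed" / "graph.graph",
        "mph": root / "compressed" / "graph.mph",
        "contribution_graph": root / "datasets" / "contribution_graph.csv.zst",
        "topological_order": root / "topology" / "topological_order_dfs.csv.zst",
    }


# Hypothetical base directory, date, and flavor, for illustration only.
root = dataset_root("/srv/swh", "2022-11-08", flavor="example")
paths = layout(root)
```

`PurePosixPath` is used so the sketch only computes names and never touches the filesystem; a real task would more likely take these locations from its Luigi parameters, such as `local_graph_path`.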
- class swh.datasets.luigi.UploadExportAndCompressedGraphToS3(*args, **kwargs)
Bases: Task
Uploads the local dataset export to S3, then the compressed graph to S3.
If the local dataset export or the compressed graph is missing, it is created automatically.
This task is a combination of UploadExportToS3 and UploadGraphToS3.
Example invocation:
luigi --local-scheduler --module swh.datasets.luigi UploadExportAndCompressedGraphToS3 --local-graph-path=graph/ --s3-graph-path=s3://softwareheritage/graph/swh_2022-11-08/compressed/ ...
- requires()
Returns instances of swh.export.luigi.UploadExportToS3 and swh.graph.luigi.compressed_graph.UploadGraphToS3, which recursively depend on the whole export and compression pipeline.
- class swh.datasets.luigi.RunNewGraph(*args, **kwargs)
Bases: Task
Runs dataset export, graph compression, and generates datasets using the graph.
- requires() → List[Task]
Returns instances of swh.export.luigi.RunExportAll and swh.export.luigi.UploadExportAndCompressedGraphToS3, which recursively depend on the whole export and compression pipeline. Also runs some of the derived datasets through swh.datasets.topology.UploadGenerationsToS3 and swh.datasets.aggregate_datasets.RunAggregatedDatasets.
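
To illustrate what "recursively depend" means for these wrapper tasks, here is a minimal sketch in plain Python that does not use Luigi. The `REQUIRES` table only records the direct dependencies named in the two `requires()` descriptions above; the helper `run_order` is hypothetical, and the real scheduler additionally checks each task's `output()` before deciding whether to run it.

```python
from typing import Dict, List, Set

# Direct dependencies as named in the requires() descriptions above.
# The upstream export and compression tasks are deliberately left out,
# so the leaves of this table are not leaves of the real pipeline.
REQUIRES: Dict[str, List[str]] = {
    "RunNewGraph": [
        "RunExportAll",
        "UploadExportAndCompressedGraphToS3",
        "UploadGenerationsToS3",
        "RunAggregatedDatasets",
    ],
    "UploadExportAndCompressedGraphToS3": [
        "UploadExportToS3",
        "UploadGraphToS3",
    ],
}


def run_order(task: str, requires: Dict[str, List[str]]) -> List[str]:
    """Return tasks so that each one appears after all of its dependencies."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return  # already scheduled through another path
        seen.add(name)
        for dep in requires.get(name, []):
            visit(dep)  # depth-first: dependencies are emitted first
        order.append(name)

    visit(task)
    return order


order = run_order("RunNewGraph", REQUIRES)
```

Requesting the top-level task is enough to pull in everything beneath it, which is why `RunNewGraph` and `UploadExportAndCompressedGraphToS3` need no `run()` logic of their own to drive the pipeline.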