swh.datasets.luigi package
Submodules
- swh.datasets.luigi.aggregate_datasets module
- Luigi tasks for producing the aggregated derived datasets
ExportNodesTable
AggregateContentDatasets
AggregateContentDatasets.local_graph_path
AggregateContentDatasets.graph_name
AggregateContentDatasets.popular_content_names_path
AggregateContentDatasets.provenance_dir
AggregateContentDatasets.aggregate_datasets_path
AggregateContentDatasets.requires()
AggregateContentDatasets.output()
AggregateContentDatasets.run()
UploadNodesTable
UploadAggregatedContentDataset
RunAggregatedDatasets
- swh.datasets.luigi.blobs_datasets module
- Luigi tasks for blob-centric datasets
atomic_zstd_writer()
atomic_csv_zstd_writer()
check_csv()
SelectBlobs
DownloadBlobs
MakeBlobTarball
MakeSampleBlobTarball
ComputeBlobFileinfo
BlobScancode
BlobScancode.blob_filter
BlobScancode.derived_datasets_path
BlobScancode.FIELDNAMES
BlobScancode.DEFAULT_MIN_SCORE
BlobScancode.DEFAULT_JOBS
BlobScancode.DEFAULT_TIMEOUT
BlobScancode.MAP_CHUNKSIZE
BlobScancode.WORKER_MAX_TASKS
BlobScancode.FIELD_SEP
BlobScancode.READABLE_ENCODINGS
BlobScancode.requires()
BlobScancode.output()
BlobScancode.run()
BlobScancode.previous_derived_datasets_path
FindBlobOrigins
CountBlobOrigins
FindEarliestRevisions
RunBlobDataset
- swh.datasets.luigi.file_names module
- Luigi tasks for producing the most common names of every content and datasets based on file names
PopularContentNames
PopularContentNames.local_graph_path
PopularContentNames.popular_contents_path
PopularContentNames.graph_name
PopularContentNames.max_results_per_content
PopularContentNames.popularity_threshold
PopularContentNames.resources
PopularContentNames.requires()
PopularContentNames.output()
PopularContentNames.run()
PopularContentPaths
PopularContentNamesOrcToS3
ListFilesByName
- swh.datasets.luigi.origin_contributors module
- Luigi tasks for contribution graph
ListOriginContributors
ListOriginContributors.local_graph_path
ListOriginContributors.topological_order_dir
ListOriginContributors.origin_contributors_path
ListOriginContributors.origin_urls_path
ListOriginContributors.graph_name
ListOriginContributors.max_ram_mb
ListOriginContributors.resources
ListOriginContributors.requires()
ListOriginContributors.output()
ListOriginContributors.run()
ExportDeanonymizationTable
DeanonymizeOriginContributors
DeanonymizeOriginContributors.local_graph_path
DeanonymizeOriginContributors.graph_name
DeanonymizeOriginContributors.origin_contributors_path
DeanonymizeOriginContributors.deanonymization_table_path
DeanonymizeOriginContributors.deanonymized_origin_contributors_path
DeanonymizeOriginContributors.mph_algo
DeanonymizeOriginContributors.requires()
DeanonymizeOriginContributors.output()
DeanonymizeOriginContributors.run()
RunOriginContributors
RunOriginContributors.local_graph_path
RunOriginContributors.graph_name
RunOriginContributors.origin_urls_path
RunOriginContributors.origin_contributors_path
RunOriginContributors.deanonymized_origin_contributors_path
RunOriginContributors.skip_integrity_check
RunOriginContributors.test_origin
RunOriginContributors.test_person
RunOriginContributors.test_years
RunOriginContributors.requires()
RunOriginContributors.run()
Module contents
Luigi tasks
This package contains Luigi tasks. These come in two kinds:
- in swh.graph.luigi.compressed_graph: an alternative to the ‘swh graph compress’ CLI that can be composed with other tasks, such as swh-dataset’s
- in other submodules: tasks driving the creation of specific datasets that are generated using the compressed graph
The overall directory structure is:
base_dir/
    <date>[_<flavor>]/
        edges/
            ...
        orc/
            ...
        compressed/
            graph.graph
            graph.mph
            ...
            meta/
                export.json
                compression.json
        datasets/
            contribution_graph.csv.zst
        topology/
            topological_order_dfs.csv.zst
And optionally:
sensitive_base_dir/
    <date>[_<flavor>]/
        persons_sha256_to_name.csv.zst
        datasets/
            contribution_graph.deanonymized.csv.zst
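The layout above can be sketched as a pair of small path helpers. This is only an illustration, not part of the package: the function names, the example base directories, and the example date and flavor values are hypothetical, and the nesting (each file sitting under the directory listed just before it) is an assumption read off the listing.

```python
from pathlib import Path
from typing import Dict, Optional


def dataset_dir(base_dir: str, date: str, flavor: Optional[str] = None) -> Path:
    """Build base_dir/<date>[_<flavor>]/ (hypothetical helper)."""
    name = date if flavor is None else f"{date}_{flavor}"
    return Path(base_dir) / name


def expected_paths(base_dir: str, date: str, flavor: Optional[str] = None) -> Dict[str, Path]:
    """Public outputs; nesting is assumed from the listing above."""
    root = dataset_dir(base_dir, date, flavor)
    return {
        "graph": root / "compressed" / "graph.graph",
        "mph": root / "compressed" / "graph.mph",
        "contribution_graph": root / "datasets" / "contribution_graph.csv.zst",
        "topological_order": root / "topology" / "topological_order_dfs.csv.zst",
    }


def sensitive_paths(sensitive_base_dir: str, date: str, flavor: Optional[str] = None) -> Dict[str, Path]:
    """Optional outputs that live under the separate sensitive base directory."""
    root = dataset_dir(sensitive_base_dir, date, flavor)
    return {
        "persons": root / "persons_sha256_to_name.csv.zst",
        "deanonymized": root / "datasets" / "contribution_graph.deanonymized.csv.zst",
    }


if __name__ == "__main__":
    # Example values are made up for illustration.
    for key, path in expected_paths("/srv/base", "2022-12-07", "history_hosting").items():
        print(key, path.as_posix())
```

Keeping the public and sensitive trees in separate functions mirrors the split between base_dir and sensitive_base_dir, so the deanonymized files never share a root with the public ones.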
- class swh.datasets.luigi.RunNewGraph(*args, **kwargs)
Bases: Task
Runs dataset export, graph compression, and generates datasets using the graph.
- requires() -> List[Task]
Returns instances of swh.export.luigi.RunExportAll, swh.export.luigi.UploadExportToS3, and swh.graph.luigi.compressed_graph.UploadGraphToS3, which recursively depend on the whole export and compression pipeline. Also runs some of the derived datasets through swh.datasets.topology.UploadGenerationsToS3 and swh.datasets.aggregate_datasets.RunAggregatedDatasets.
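To illustrate what "recursively depend" means for a top-level task like this one, here is a toy sketch in plain Python. It deliberately does not use Luigi: ToyTask and resolve are hypothetical stand-ins, and the task names and the dependency edges between them are invented for the example rather than taken from the real pipeline.

```python
from typing import List, Optional


class ToyTask:
    """Minimal stand-in for a task: a name plus its direct dependencies."""

    def __init__(self, name: str, deps: Optional[List["ToyTask"]] = None):
        self.name = name
        self._deps = deps or []

    def requires(self) -> List["ToyTask"]:
        # Only the direct dependencies are listed here; the transitive
        # ones are discovered by walking requires() recursively.
        return self._deps


def resolve(task: ToyTask, seen: Optional[List[str]] = None) -> List[str]:
    """Return task names so that every dependency precedes its dependents."""
    if seen is None:
        seen = []
    for dep in task.requires():
        resolve(dep, seen)
    if task.name not in seen:
        seen.append(task.name)
    return seen


if __name__ == "__main__":
    # Invented dependency edges, for illustration only.
    export = ToyTask("export")
    compress = ToyTask("compress", [export])
    upload = ToyTask("upload", [compress])
    top = ToyTask("run_new_graph", [export, upload])
    print(resolve(top))
```

The top-level task names only two direct dependencies, yet resolving it also pulls in the compression step, which is the same effect the requires() description relies on.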