swh.graph.luigi package
Submodules:
- swh.graph.luigi.blobs_datasets module
- Luigi tasks for blob-centric datasets
atomic_zstd_writer()
atomic_csv_zstd_writer()
check_csv()
SelectBlobs
DownloadBlobs
MakeBlobTarball
MakeSampleBlobTarball
ComputeBlobFileinfo
BlobScancode
BlobScancode.blob_filter
BlobScancode.derived_datasets_path
BlobScancode.FIELDNAMES
BlobScancode.DEFAULT_MIN_SCORE
BlobScancode.DEFAULT_JOBS
BlobScancode.DEFAULT_TIMEOUT
BlobScancode.MAP_CHUNKSIZE
BlobScancode.WORKER_MAX_TASKS
BlobScancode.FIELD_SEP
BlobScancode.READABLE_ENCODINGS
BlobScancode.requires()
BlobScancode.output()
BlobScancode.run()
BlobScancode.previous_derived_datasets_path
FindBlobOrigins
CountBlobOrigins
FindEarliestRevisions
RunBlobDataset
- swh.graph.luigi.compressed_graph module
- Luigi tasks for compression
ObjectTypesParameter
ExtractNodes
Mph
Bv
Bfs
PermuteBfs
TransposeBfs
Simplify
Llp
PermuteLlp
Obl
ComposeOrders
Stats
Transpose
TransposeObl
Maps
ExtractPersons
MphPersons
NodeProperties
MphLabels
FclLabels
EdgeLabels
EdgeLabelsObl
EdgeLabelsTransposeObl
CompressGraph
UploadGraphToS3
DownloadGraphFromS3
LocalGraph
- swh.graph.luigi.file_names module
- Luigi tasks for producing the most common names of every content
PopularContentNames
PopularContentNames.local_graph_path
PopularContentNames.popular_contents_path
PopularContentNames.graph_name
PopularContentNames.max_results_per_content
PopularContentNames.popularity_threshold
PopularContentNames.resources
PopularContentNames.requires()
PopularContentNames.output()
PopularContentNames.run()
PopularContentPaths
PopularContentNamesOrcToS3
- swh.graph.luigi.origin_contributors module
- Luigi tasks for contribution graph
ListOriginContributors
ListOriginContributors.local_graph_path
ListOriginContributors.topological_order_dir
ListOriginContributors.origin_contributors_path
ListOriginContributors.origin_urls_path
ListOriginContributors.graph_name
ListOriginContributors.max_ram_mb
ListOriginContributors.resources
ListOriginContributors.requires()
ListOriginContributors.output()
ListOriginContributors.run()
ExportDeanonymizationTable
DeanonymizeOriginContributors
DeanonymizeOriginContributors.local_graph_path
DeanonymizeOriginContributors.graph_name
DeanonymizeOriginContributors.origin_contributors_path
DeanonymizeOriginContributors.deanonymization_table_path
DeanonymizeOriginContributors.deanonymized_origin_contributors_path
DeanonymizeOriginContributors.requires()
DeanonymizeOriginContributors.output()
DeanonymizeOriginContributors.run()
RunOriginContributors
RunOriginContributors.local_graph_path
RunOriginContributors.graph_name
RunOriginContributors.origin_urls_path
RunOriginContributors.origin_contributors_path
RunOriginContributors.deanonymized_origin_contributors_path
RunOriginContributors.skip_integrity_check
RunOriginContributors.test_origin
RunOriginContributors.test_person
RunOriginContributors.test_years
RunOriginContributors.requires()
RunOriginContributors.run()
- swh.graph.luigi.provenance module
- swh.graph.luigi.shell module
- swh.graph.luigi.topology module
- swh.graph.luigi.utils module
Module contents:
Luigi tasks
This package contains Luigi tasks. These come in two kinds:
- in swh.graph.luigi.compressed_graph: an alternative to the ‘swh graph compress’ CLI that can be composed with other tasks, such as swh-dataset’s
- in other submodules: tasks driving the creation of specific datasets that are generated using the compressed graph
The overall directory structure is:
    base_dir/
        <date>[_<flavor>]/
            edges/
                ...
            orc/
                ...
            compressed/
                graph.graph
                graph.mph
                ...
                meta/
                    export.json
                    compression.json
            datasets/
                contribution_graph.csv.zst
            topology/
                topological_order_dfs.csv.zst
And optionally:
    sensitive_base_dir/
        <date>[_<flavor>]/
            persons_sha256_to_name.csv.zst
            datasets/
                contribution_graph.deanonymized.csv.zst
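As a rough illustration of the main layout above, the sketch below builds a few of the expected paths from a base directory, a date, and an optional flavor. The helper names `graph_dir` and `expected_paths` are hypothetical and not part of swh.graph, the example date and flavor values are made up, and the sketch assumes that `datasets/` and `topology/` sit directly under the dated directory.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def graph_dir(base_dir: str, date: str, flavor: Optional[str] = None) -> PurePosixPath:
    """Return base_dir/<date>[_<flavor>] (hypothetical helper, not part of swh.graph)."""
    name = date if flavor is None else f"{date}_{flavor}"
    return PurePosixPath(base_dir) / name


def expected_paths(
    base_dir: str, date: str, flavor: Optional[str] = None
) -> Dict[str, PurePosixPath]:
    """Map a few labels to the paths named in the directory listing.

    Assumption: datasets/ and topology/ are direct children of the dated directory.
    """
    root = graph_dir(base_dir, date, flavor)
    return {
        "edges": root / "edges",
        "orc": root / "orc",
        "compressed": root / "compressed",
        "contribution_graph": root / "datasets" / "contribution_graph.csv.zst",
        "topological_order": root / "topology" / "topological_order_dfs.csv.zst",
    }


if __name__ == "__main__":
    # "/srv/graph", "2024-01-01" and "example" are placeholder values.
    for label, path in expected_paths("/srv/graph", "2024-01-01", "example").items():
        print(f"{label}: {path}")
```

The flavor suffix is only appended when a flavor is given, mirroring the optional `[_<flavor>]` part of the directory name.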
- class swh.graph.luigi.RunExportCompressUpload(*args, **kwargs)
Bases: Task
Runs dataset export and graph compression, then generates datasets using the graph.
- requires() -> List[Task]
Returns instances of swh.dataset.luigi.RunExportAll and swh.graph.luigi.compressed_graph.UploadGraphToS3, which recursively depend on the whole export and compression pipeline.
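To make "recursively depend" concrete, here is a toy, standard-library-only model of how a chain of requires() declarations expands into a full run order. It is a sketch, not Luigi's scheduler: only the first entry of the table comes from the description above, while the `Hypothetical...` task names and every other edge are assumptions made up for illustration.

```python
from typing import Dict, List

# Toy dependency table: task name -> names returned by its requires().
# Only the first entry reflects the documentation above; the remaining edges
# are illustrative assumptions, not the real requires() of these tasks.
REQUIRES: Dict[str, List[str]] = {
    "RunExportCompressUpload": ["RunExportAll", "UploadGraphToS3"],
    "RunExportAll": ["HypotheticalExportStep"],
    "UploadGraphToS3": ["HypotheticalCompressionStep"],
    "HypotheticalCompressionStep": ["HypotheticalExportStep"],
    "HypotheticalExportStep": [],
}


def execution_order(task: str, requires: Dict[str, List[str]]) -> List[str]:
    """Return tasks in an order where every task comes after its dependencies."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return  # shared dependencies are scheduled only once
        seen.add(name)
        for dep in requires[name]:
            visit(dep)
        order.append(name)  # post-order: dependencies first, then the task

    visit(task)
    return order


if __name__ == "__main__":
    for name in execution_order("RunExportCompressUpload", REQUIRES):
        print(name)
```

Requesting the top-level task is enough to pull in everything beneath it, which is why a wrapper task only needs to name the last task of each branch.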