swh.graph.luigi package
Submodules
- swh.graph.luigi.aggregate_datasets module
- Luigi tasks for producing the aggregated derived datasets
ExportNodesTable
AggregateContentDatasets
AggregateContentDatasets.local_graph_path
AggregateContentDatasets.graph_name
AggregateContentDatasets.popular_content_names_path
AggregateContentDatasets.provenance_dir
AggregateContentDatasets.aggregate_datasets_path
AggregateContentDatasets.requires()
AggregateContentDatasets.output()
AggregateContentDatasets.run()
UploadNodesTable
UploadAggregatedContentDataset
RunAggregatedDatasets
- swh.graph.luigi.blobs_datasets module
- Luigi tasks for blob-centric datasets
atomic_zstd_writer()
atomic_csv_zstd_writer()
check_csv()
SelectBlobs
DownloadBlobs
MakeBlobTarball
MakeSampleBlobTarball
ComputeBlobFileinfo
BlobScancode
BlobScancode.blob_filter
BlobScancode.derived_datasets_path
BlobScancode.FIELDNAMES
BlobScancode.DEFAULT_MIN_SCORE
BlobScancode.DEFAULT_JOBS
BlobScancode.DEFAULT_TIMEOUT
BlobScancode.MAP_CHUNKSIZE
BlobScancode.WORKER_MAX_TASKS
BlobScancode.FIELD_SEP
BlobScancode.READABLE_ENCODINGS
BlobScancode.requires()
BlobScancode.output()
BlobScancode.run()
FindBlobOrigins
CountBlobOrigins
FindEarliestRevisions
RunBlobDataset
- swh.graph.luigi.compressed_graph module
- Luigi tasks for compression
ObjectTypesParameter
ExtractNodes
ExtractLabels
NodeStats
EdgeStats
LabelStats
Mph
Bv
BvEf
BfsRoots
Bfs
PermuteAndSimplifyBfs
BfsEf
BfsDcf
Llp
PermuteLlp
Offsets
Ef
ComposeOrders
Transpose
TransposeOffsets
TransposeEf
Maps
ExtractPersons
PersonsStats
MphPersons
NodeProperties
PthashLabels
LabelsOrder
FclLabels
EdgeLabels
EdgeLabelsTranspose
EdgeLabelsEf
EdgeLabelsTransposeEf
Stats
CompressGraph
UploadGraphToS3
DownloadGraphFromS3
LocalGraph
- swh.graph.luigi.file_names module
- Luigi tasks for producing the most common names of every content and datasets based on file names
PopularContentNames
PopularContentNames.local_graph_path
PopularContentNames.popular_contents_path
PopularContentNames.graph_name
PopularContentNames.max_results_per_content
PopularContentNames.popularity_threshold
PopularContentNames.resources
PopularContentNames.requires()
PopularContentNames.output()
PopularContentNames.run()
PopularContentPaths
PopularContentNamesOrcToS3
ListFilesByName
- swh.graph.luigi.origin_contributors module
- Luigi tasks for contribution graph
ListOriginContributors
ListOriginContributors.local_graph_path
ListOriginContributors.topological_order_dir
ListOriginContributors.origin_contributors_path
ListOriginContributors.origin_urls_path
ListOriginContributors.graph_name
ListOriginContributors.max_ram_mb
ListOriginContributors.resources
ListOriginContributors.requires()
ListOriginContributors.output()
ListOriginContributors.run()
ExportDeanonymizationTable
DeanonymizeOriginContributors
DeanonymizeOriginContributors.local_graph_path
DeanonymizeOriginContributors.graph_name
DeanonymizeOriginContributors.origin_contributors_path
DeanonymizeOriginContributors.deanonymization_table_path
DeanonymizeOriginContributors.deanonymized_origin_contributors_path
DeanonymizeOriginContributors.mph_algo
DeanonymizeOriginContributors.requires()
DeanonymizeOriginContributors.output()
DeanonymizeOriginContributors.run()
RunOriginContributors
RunOriginContributors.local_graph_path
RunOriginContributors.graph_name
RunOriginContributors.origin_urls_path
RunOriginContributors.origin_contributors_path
RunOriginContributors.deanonymized_origin_contributors_path
RunOriginContributors.skip_integrity_check
RunOriginContributors.test_origin
RunOriginContributors.test_person
RunOriginContributors.test_years
RunOriginContributors.requires()
RunOriginContributors.run()
- swh.graph.luigi.provenance module
- Luigi tasks to help compute the provenance of content blobs
default_max_ram_mb()
ListProvenanceNodes
ComputeEarliestTimestamps
ComputeEarliestTimestamps.local_export_path
ComputeEarliestTimestamps.local_graph_path
ComputeEarliestTimestamps.graph_name
ComputeEarliestTimestamps.provenance_dir
ComputeEarliestTimestamps.provenance_node_filter
ComputeEarliestTimestamps.resources
ComputeEarliestTimestamps.requires()
ComputeEarliestTimestamps.output()
ComputeEarliestTimestamps.run()
ListDirectoryMaxLeafTimestamp
ListDirectoryMaxLeafTimestamp.local_export_path
ListDirectoryMaxLeafTimestamp.local_graph_path
ListDirectoryMaxLeafTimestamp.graph_name
ListDirectoryMaxLeafTimestamp.provenance_dir
ListDirectoryMaxLeafTimestamp.provenance_node_filter
ListDirectoryMaxLeafTimestamp.resources
ListDirectoryMaxLeafTimestamp.requires()
ListDirectoryMaxLeafTimestamp.output()
ListDirectoryMaxLeafTimestamp.run()
ComputeDirectoryFrontier
ComputeDirectoryFrontier.local_export_path
ComputeDirectoryFrontier.local_graph_path
ComputeDirectoryFrontier.graph_name
ComputeDirectoryFrontier.provenance_dir
ComputeDirectoryFrontier.provenance_node_filter
ComputeDirectoryFrontier.max_ram_mb
ComputeDirectoryFrontier.resources
ComputeDirectoryFrontier.requires()
ComputeDirectoryFrontier.output()
ComputeDirectoryFrontier.run()
ListFrontierDirectoriesInRevisions
ListFrontierDirectoriesInRevisions.local_export_path
ListFrontierDirectoriesInRevisions.local_graph_path
ListFrontierDirectoriesInRevisions.graph_name
ListFrontierDirectoriesInRevisions.provenance_dir
ListFrontierDirectoriesInRevisions.provenance_node_filter
ListFrontierDirectoriesInRevisions.max_ram_mb
ListFrontierDirectoriesInRevisions.resources
ListFrontierDirectoriesInRevisions.requires()
ListFrontierDirectoriesInRevisions.output()
ListFrontierDirectoriesInRevisions.run()
ListContentsInRevisionsWithoutFrontier
ListContentsInRevisionsWithoutFrontier.local_export_path
ListContentsInRevisionsWithoutFrontier.local_graph_path
ListContentsInRevisionsWithoutFrontier.graph_name
ListContentsInRevisionsWithoutFrontier.provenance_dir
ListContentsInRevisionsWithoutFrontier.provenance_node_filter
ListContentsInRevisionsWithoutFrontier.max_ram_mb
ListContentsInRevisionsWithoutFrontier.resources
ListContentsInRevisionsWithoutFrontier.requires()
ListContentsInRevisionsWithoutFrontier.output()
ListContentsInRevisionsWithoutFrontier.run()
ListContentsInFrontierDirectories
ListContentsInFrontierDirectories.local_export_path
ListContentsInFrontierDirectories.local_graph_path
ListContentsInFrontierDirectories.graph_name
ListContentsInFrontierDirectories.provenance_dir
ListContentsInFrontierDirectories.provenance_node_filter
ListContentsInFrontierDirectories.max_ram_mb
ListContentsInFrontierDirectories.resources
ListContentsInFrontierDirectories.requires()
ListContentsInFrontierDirectories.output()
ListContentsInFrontierDirectories.run()
RunProvenance
- swh.graph.luigi.subdataset module
SelectTopGithubOrigins
ListSwhidsForSubdataset
CreateSubdatasetOnAthena
CreateSubdatasetOnAthena.local_export_path
CreateSubdatasetOnAthena.s3_parent_export_path
CreateSubdatasetOnAthena.s3_export_path
CreateSubdatasetOnAthena.s3_athena_output_location
CreateSubdatasetOnAthena.athena_db_name
CreateSubdatasetOnAthena.athena_parent_db_name
CreateSubdatasetOnAthena.object_types
CreateSubdatasetOnAthena.requires()
CreateSubdatasetOnAthena.output()
CreateSubdatasetOnAthena.run()
- swh.graph.luigi.topology module
- swh.graph.luigi.utils module
Module contents
Luigi tasks
This package contains Luigi tasks. These come in two kinds:
- in swh.graph.luigi.compressed_graph: an alternative to the ‘swh graph compress’ CLI that can be composed with other tasks, such as those from swh-dataset
- in other submodules: tasks driving the creation of specific datasets that are generated using the compressed graph
The overall directory structure is:
    base_dir/
        <date>[_<flavor>]/
            edges/
                ...
            orc/
                ...
            compressed/
                graph.graph
                graph.mph
                ...
                meta/
                    export.json
                    compression.json
            datasets/
                contribution_graph.csv.zst
            topology/
                topological_order_dfs.csv.zst
And optionally:
    sensitive_base_dir/
        <date>[_<flavor>]/
            persons_sha256_to_name.csv.zst
            datasets/
                contribution_graph.deanonymized.csv.zst
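To make the layout above concrete, here is a minimal sketch that builds the documented paths for a single export with pathlib. The helper name dataset_paths, the dictionary keys, and the example date and flavor are hypothetical and not part of swh.graph. The sketch also assumes that edges/, orc/, compressed/, datasets/ and topology/ sit directly under the dated directory.

```python
from pathlib import Path
from typing import Dict, Optional


def dataset_paths(
    base_dir: str, date: str, flavor: Optional[str] = None
) -> Dict[str, Path]:
    """Return paths following the documented layout (hypothetical helper)."""
    # The dated directory is named <date> or <date>_<flavor>.
    name = date if flavor is None else f"{date}_{flavor}"
    root = Path(base_dir) / name
    return {
        "edges": root / "edges",
        "orc": root / "orc",
        "compressed": root / "compressed",
        "graph": root / "compressed" / "graph.graph",
        "contribution_graph": root / "datasets" / "contribution_graph.csv.zst",
        "topological_order": root / "topology" / "topological_order_dfs.csv.zst",
    }


if __name__ == "__main__":
    # Illustrative values only; real exports may use other dates and flavors.
    for key, path in dataset_paths("base_dir", "2022-12-07", "example").items():
        print(f"{key}: {path}")
```

Keeping path construction in one place like this is only a convenience for scripts that read the generated files; the Luigi tasks themselves take their paths from task parameters such as local_graph_path.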
- class swh.graph.luigi.RunExportCompressUpload(*args, **kwargs)
Bases: Task
Runs dataset export and graph compression, and generates datasets using the graph.
- requires() -> List[Task]
Returns instances of swh.dataset.luigi.RunExportAll and swh.graph.luigi.compressed_graph.UploadGraphToS3, which recursively depend on the whole export and compression pipeline.
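The following toy model illustrates what "recursively depend" means for a wrapper task: requesting the top-level task pulls in everything upstream, and each dependency is completed before the tasks that need it. It is a plain-Python sketch, not Luigi code. Only the first entry of the table reflects the documentation above; the names ExportStep and CompressionStep and the edges between them are hypothetical placeholders.

```python
from typing import Dict, List, Optional

# Toy dependency table: task name -> names of the tasks it requires.
# Only the "RunExportCompressUpload" entry comes from the documentation;
# the upstream entries are hypothetical placeholders.
REQUIRES: Dict[str, List[str]] = {
    "RunExportCompressUpload": ["RunExportAll", "UploadGraphToS3"],
    "RunExportAll": ["ExportStep"],
    "UploadGraphToS3": ["CompressionStep"],
    "CompressionStep": ["ExportStep"],
    "ExportStep": [],
}


def execution_order(task: str, done: Optional[List[str]] = None) -> List[str]:
    """Return tasks so that every dependency precedes its dependents."""
    if done is None:
        done = []
    # Resolve all requirements first (depth-first), then the task itself.
    for dep in REQUIRES[task]:
        execution_order(dep, done)
    if task not in done:
        done.append(task)
    return done


if __name__ == "__main__":
    for name in execution_order("RunExportCompressUpload"):
        print(name)
```

In this model a shared dependency such as ExportStep appears only once in the resulting order, which mirrors the way a task scheduler avoids re-running a requirement that is already complete.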