swh.datasets.luigi.aggregate_datasets module#
Luigi tasks for producing the aggregated derived datasets#
- class swh.datasets.luigi.aggregate_datasets.ExportNodesTable(*args, **kwargs)[source]#
Bases: Task
Creates a Parquet dataset that contains the id and SWHID of each node.
- local_graph_path = <luigi.parameter.PathParameter object>#
- graph_name = <luigi.parameter.Parameter object>#
- aggregate_datasets_path = <luigi.parameter.PathParameter object>#
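As an illustration of how the three parameters above might be supplied, the sketch below builds a command line for the generic `luigi` launcher. It assumes the task can be reached through `luigi --module` under its bare class name, and it relies on Luigi's convention of spelling parameter names with hyphens on the command line. The helper `luigi_argv` and all paths are hypothetical placeholders, and the project may well provide its own entry point instead.

```python
import shlex
from typing import Dict, List


def luigi_argv(module: str, task: str, params: Dict[str, str]) -> List[str]:
    """Hypothetical helper: build an argv list for the `luigi` launcher."""
    argv = ["luigi", "--module", module, task]
    for name, value in params.items():
        # Luigi exposes a parameter `foo_bar` as the option `--foo-bar`.
        argv += ["--" + name.replace("_", "-"), str(value)]
    # Run with the in-process scheduler instead of a central one.
    argv.append("--local-scheduler")
    return argv


argv = luigi_argv(
    "swh.datasets.luigi.aggregate_datasets",
    "ExportNodesTable",
    {
        # Placeholder values; substitute real locations.
        "local_graph_path": "/path/to/compressed/graph",
        "graph_name": "graph",
        "aggregate_datasets_path": "/path/to/aggregate_datasets",
    },
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the placeholder paths are replaced.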
- class swh.datasets.luigi.aggregate_datasets.AggregateContentDatasets(*args, **kwargs)[source]#
Bases: Task
Creates a Parquet dataset that contains a column for each of:
the content id
the content’s length
the content’s most popular name
the number of occurrences of that name for the content
its date of first occurrence in a revision or release, if any
said revision or release, if any
an origin containing said revision or release, if any
- local_graph_path = <luigi.parameter.PathParameter object>#
- graph_name = <luigi.parameter.Parameter object>#
- popular_content_names_path = <luigi.parameter.PathParameter object>#
- provenance_dir = <luigi.parameter.PathParameter object>#
- aggregate_datasets_path = <luigi.parameter.PathParameter object>#
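To make the shape of one row of this dataset concrete, here is a minimal sketch of the seven columns as a Python dataclass. The field names and types are assumptions chosen for illustration, not the actual Parquet schema. The only property taken from the description above is that the last three columns may be empty ("if any"), which is modelled here with `Optional`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AggregatedContentRow:
    """Hypothetical view of one row; names and types are assumed."""

    content_id: int  # the content id
    length: int  # the content's length
    most_popular_name: bytes  # the content's most popular name
    name_occurrences: int  # occurrences of that name for the content
    first_occurrence_date: Optional[datetime]  # None if never seen in a revision or release
    first_occurrence_revrel: Optional[int]  # said revision or release, if any
    first_occurrence_origin: Optional[int]  # an origin containing it, if any

    def has_first_occurrence(self) -> bool:
        """True when the content was found in at least one revision or release."""
        return self.first_occurrence_date is not None


# A content that appears in no revision or release leaves the last columns empty.
orphan = AggregatedContentRow(1, 42, b"README", 3, None, None, None)
print(orphan.has_first_occurrence())  # prints False
```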
- class swh.datasets.luigi.aggregate_datasets.UploadNodesTable(*args, **kwargs)[source]#
Bases: _ParquetToS3ToAthenaTask
Uploads the result of ExportNodesTable to S3 and registers a table on Athena to query it.
- aggregate_datasets_path = <luigi.parameter.PathParameter object>#
- dataset_name = <luigi.parameter.Parameter object>#
- s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#
- requires() Task[source]#
Returns an instance of ExportNodesTable.
- class swh.datasets.luigi.aggregate_datasets.UploadAggregatedContentDataset(*args, **kwargs)[source]#
Bases: _ParquetToS3ToAthenaTask
Uploads the result of AggregateContentDatasets to S3 and registers a table on Athena to query it.
- aggregate_datasets_path = <luigi.parameter.PathParameter object>#
- dataset_name = <luigi.parameter.Parameter object>#
- s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#
- requires() Task[source]#
Returns an instance of AggregateContentDatasets.
- class swh.datasets.luigi.aggregate_datasets.RunAggregatedDatasets(*args, **kwargs)[source]#
Bases: WrapperTask
Runs UploadNodesTable, UploadAggregatedContentDataset, and their recursive dependencies.
- aggregate_datasets_path = <luigi.parameter.PathParameter object>#
- dataset_name = <luigi.parameter.Parameter object>#
- s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#
- requires() Dict[str, Task][source]#
Returns a dictionary of the tasks this wrapper runs: instances of UploadNodesTable and UploadAggregatedContentDataset.
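The phrase "recursive dependencies" can be pictured with a small model of the task graph. The sketch below encodes only the relationships stated on this page (each upload task requires its producer, and the wrapper runs both uploads) as a plain dictionary, then derives a valid execution order with the standard library's `graphlib`. It does not import or run Luigi, and it ignores any upstream tasks that ExportNodesTable and AggregateContentDatasets may themselves require.

```python
from graphlib import TopologicalSorter

# Map each task name to the set of tasks it requires, as described on this page.
REQUIRES = {
    "ExportNodesTable": set(),
    "AggregateContentDatasets": set(),
    "UploadNodesTable": {"ExportNodesTable"},
    "UploadAggregatedContentDataset": {"AggregateContentDatasets"},
    "RunAggregatedDatasets": {"UploadNodesTable", "UploadAggregatedContentDataset"},
}

# static_order() yields every task after all of the tasks it requires.
order = list(TopologicalSorter(REQUIRES).static_order())
print(order)
```

Whatever order the two independent branches come out in, the wrapper task is always last, since it transitively depends on every other task in the dictionary.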