swh.datasets.luigi.aggregate_datasets module#

Luigi tasks for producing the aggregated derived datasets#

class swh.datasets.luigi.aggregate_datasets.ExportNodesTable(*args, **kwargs)[source]#

Bases: Task

Creates a Parquet dataset that contains the id and SWHID of each node

local_graph_path = <luigi.parameter.PathParameter object>#

graph_name = <luigi.parameter.Parameter object>#

aggregate_datasets_path = <luigi.parameter.PathParameter object>#

requires() → Dict[str, Task][source]#: Returns an instance of LocalGraph.

output() → LocalTarget[source]#: Directory of Parquet files.

run() → None[source]#: Runs export-nodes from tools/aggregate

class swh.datasets.luigi.aggregate_datasets.AggregateContentDatasets(*args, **kwargs)[source]#

Bases: Task

Creates a Parquet dataset that contains a column for each of:

the content id
the content’s length
the most popular name of each content
number of occurrences of that name for the content
its date of first occurrence in a revision or release, if any
said revision or release, if any
an origin containing said revision or release, if any

local_graph_path = <luigi.parameter.PathParameter object>#

graph_name = <luigi.parameter.Parameter object>#

popular_content_names_path = <luigi.parameter.PathParameter object>#

provenance_dir = <luigi.parameter.PathParameter object>#

aggregate_datasets_path = <luigi.parameter.PathParameter object>#

requires() → Dict[str, Task][source]#: Returns an instance of LocalGraph.

output() → LocalTarget[source]#: Directory of Parquet files.

run() → None[source]#: Runs aggregate-content-datasets from tools/aggregate

class swh.datasets.luigi.aggregate_datasets.UploadNodesTable(*args, **kwargs)[source]#

Bases: _ParquetToS3ToAthenaTask

Uploads the result of AggregateContentDatasets to S3 and registers a table on Athena to query it

aggregate_datasets_path = <luigi.parameter.PathParameter object>#

dataset_name = <luigi.parameter.Parameter object>#

s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#

requires() → Task[source]#: Returns an instance of ExportNodesTable.

create_table_extras() → str[source]#: Extra clauses to add to the CREATE EXTERNAL TABLE statement.

class swh.datasets.luigi.aggregate_datasets.UploadAggregatedContentDataset(*args, **kwargs)[source]#

Bases: _ParquetToS3ToAthenaTask

Uploads the result of AggregateContentDatasets to S3 and registers a table on Athena to query it

aggregate_datasets_path = <luigi.parameter.PathParameter object>#

dataset_name = <luigi.parameter.Parameter object>#

s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#

requires() → Task[source]#: Returns an instance of AggregateContentDatasets.

class swh.datasets.luigi.aggregate_datasets.RunAggregatedDatasets(*args, **kwargs)[source]#

Bases: WrapperTask

Runs UploadNodesTable, UploadAggregatedContentDataset, and their recursive dependencies.

aggregate_datasets_path = <luigi.parameter.PathParameter object>#

dataset_name = <luigi.parameter.Parameter object>#

s3_athena_output_location = <swh.export.luigi.S3PathParameter object>#

requires() → Dict[str, Task][source]#: Returns an instance of AggregateContentDatasets.