swh.graph.luigi.subdataset module

class swh.graph.luigi.subdataset.SelectTopGithubOrigins(*args, **kwargs)

Bases: Task

Writes a list of origins selected from popular GitHub repositories

local_export_path = <luigi.parameter.PathParameter object>
num_origins = <luigi.parameter.IntParameter object>
query = <luigi.parameter.Parameter object>
output() → Target

Text file with a list of origin URLs

run() → None

Sends a query to the GitHub API to get a list of origins
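
The query step can be pictured with the sketch below. It is not the task's actual implementation: it assumes GitHub's public repository search endpoint, and the helper names `build_search_url` and `origins_from_response` as well as the sample payload are hypothetical. No network request is made; the response is a hand-written stand-in.

```python
from urllib.parse import urlencode

# Assumption: GitHub's public repository search endpoint.
GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query: str, per_page: int, page: int = 1) -> str:
    """Build a search URL that ranks matching repositories by star count."""
    params = {
        "q": query,
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def origins_from_response(payload: dict, num_origins: int) -> list:
    """Extract at most num_origins repository URLs from a decoded response."""
    urls = [item["html_url"] for item in payload.get("items", [])]
    return urls[:num_origins]


# Hand-written stand-in for a decoded API response (not real data).
sample_payload = {
    "items": [
        {"html_url": "https://github.com/example/alpha"},
        {"html_url": "https://github.com/example/beta"},
        {"html_url": "https://github.com/example/gamma"},
    ]
}

search_url = build_search_url("stars:>=1000", per_page=100)
origin_urls = origins_from_response(sample_payload, num_origins=2)

# One origin URL per line, matching the text file described by output().
origins_text = "\n".join(origin_urls) + "\n"
```

Here `num_origins` caps the list in the same way the task's `num_origins` parameter presumably does, and the search string plays the role of `query`; the default values of both parameters are not stated in this page.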

class swh.graph.luigi.subdataset.ListSwhidsForSubdataset(*args, **kwargs)

Bases: Task

Lists all SWHIDs reachable from a set of origins

select_task = <luigi.parameter.ChoiceParameter object>
local_export_path = <luigi.parameter.PathParameter object>
grpc_api = <luigi.parameter.Parameter object>
requires() → Task

Returns an instance of self.select_task

output() → Target

Text file with a list of SWHIDs

run() → None

Builds the list of SWHIDs reachable from the selected origins
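
What "reachable from a set of origins" means can be illustrated with a breadth-first traversal over a toy in-memory graph. This is only a sketch of the idea: the real task talks to the graph through `grpc_api`, whereas the adjacency dictionary, the `fake_swhid` helper, and all identifiers below are made up for illustration.

```python
from collections import deque


def fake_swhid(kind: str, n: int) -> str:
    """Build a made-up SWHID-shaped string (40 hex digits) for the toy graph."""
    return f"swh:1:{kind}:{n:040x}"


def reachable_swhids(successors: dict, origins: list) -> list:
    """Return every node reachable from the origins, in breadth-first order."""
    seen = set()
    order = []
    queue = deque()
    for origin in origins:
        if origin not in seen:
            seen.add(origin)
            order.append(origin)
            queue.append(origin)
    while queue:
        node = queue.popleft()
        for succ in successors.get(node, ()):
            if succ not in seen:  # shared objects are listed only once
                seen.add(succ)
                order.append(succ)
                queue.append(succ)
    return order


ori1, ori2, ori3 = (fake_swhid("ori", i) for i in (1, 2, 3))
snp1, snp2, snp3 = (fake_swhid("snp", i) for i in (1, 2, 3))
rev1 = fake_swhid("rev", 1)
dir1 = fake_swhid("dir", 1)
cnt1 = fake_swhid("cnt", 1)

# Toy graph: two selected origins share one revision; ori3 is not selected.
toy_graph = {
    ori1: [snp1],
    ori2: [snp2],
    ori3: [snp3],
    snp1: [rev1],
    snp2: [rev1],
    rev1: [dir1],
    dir1: [cnt1],
}

swhids = reachable_swhids(toy_graph, [ori1, ori2])
```

In this toy run `swhids` contains the two selected origins and everything below them, while `ori3` and `snp3` are left out because no selected origin leads to them.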

class swh.graph.luigi.subdataset.CreateSubdatasetOnAthena(*args, **kwargs)

Bases: Task

Generates an ORC export from an existing ORC export, filtering out SWHIDs not in the given list.

local_export_path = <luigi.parameter.PathParameter object>
s3_parent_export_path = <swh.dataset.luigi.S3PathParameter object>
s3_export_path = <swh.dataset.luigi.S3PathParameter object>
s3_athena_output_location = <swh.dataset.luigi.S3PathParameter object>
athena_db_name = <luigi.parameter.Parameter object>
athena_parent_db_name = <luigi.parameter.Parameter object>
object_types = <luigi.parameter.EnumListParameter object>
requires() → Dict[str, Task]

Returns an instance of ListSwhidsForSubdataset and one of CreateAthena

output() → Dict[str, Target]

Returns the S3 location and Athena database for the subdataset

run() → None

Runs a query on Athena, producing files on S3
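
The effect of the filtering, independent of Athena, can be sketched in plain Python. This is an illustration of the semantics only, not the SQL the task runs: the in-memory tables, the `swhid` row key, and the `filter_tables` helper are all hypothetical, and the identifiers are made up.

```python
def filter_tables(tables: dict, keep_swhids) -> dict:
    """Keep, in every table, only the rows whose SWHID is in keep_swhids."""
    keep = set(keep_swhids)  # set membership keeps each row test O(1)
    return {
        name: [row for row in rows if row["swhid"] in keep]
        for name, rows in tables.items()
    }


def fake_swhid(kind: str, n: int) -> str:
    """Build a made-up SWHID-shaped string (40 hex digits)."""
    return f"swh:1:{kind}:{n:040x}"


# Stand-in for the parent export: one list of rows per object type.
parent_tables = {
    "revision": [
        {"swhid": fake_swhid("rev", 1), "message": "first"},
        {"swhid": fake_swhid("rev", 2), "message": "second"},
    ],
    "directory": [
        {"swhid": fake_swhid("dir", 1)},
        {"swhid": fake_swhid("dir", 2)},
    ],
}

# Stand-in for the list produced by ListSwhidsForSubdataset.
selected = [fake_swhid("rev", 1), fake_swhid("dir", 2)]

subdataset = filter_tables(parent_tables, selected)
```

Each table of the parent export yields a table of the same name in the subdataset, restricted to the selected objects; in the real task the tables are ORC files on S3 and the set of object types is controlled by `object_types`.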