swh.graph.luigi.file_names module
Luigi tasks for producing the most common name(s) of every content, and datasets based on file names.
- class swh.graph.luigi.file_names.PopularContentNames(*args, **kwargs)
Bases: Task
Creates a CSV file that contains the most popular name(s) of each content.
- local_graph_path = <luigi.parameter.PathParameter object>
- popular_contents_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- max_results_per_content = <luigi.parameter.IntParameter object>
- popularity_threshold = <luigi.parameter.IntParameter object>
- property resources
Return the estimated RAM use of this task.
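The kind of aggregation this task describes can be illustrated with a small, self-contained sketch. It does not use the swh.graph API. The function name `popular_names`, the input shape, and the way `max_results_per_content` and `popularity_threshold` are interpreted are all assumptions made for illustration; the actual task may define them differently.

```python
from collections import Counter, defaultdict


def popular_names(edges, max_results_per_content=1, popularity_threshold=0):
    """Return the most frequent file names for each content.

    `edges` is an iterable of (content_id, file_name) pairs, one pair per
    directory entry that points at the content. The meaning given to the
    two keyword arguments is an assumption, not the task's documented
    behavior.
    """
    counts = defaultdict(Counter)
    for content_id, file_name in edges:
        counts[content_id][file_name] += 1

    result = {}
    for content_id, names in counts.items():
        # Sort by descending count, then by name so that ties are stable.
        ranked = sorted(names.items(), key=lambda kv: (-kv[1], kv[0]))
        # Assumed semantics: drop names seen fewer times than the threshold.
        kept = [(name, n) for name, n in ranked if n >= popularity_threshold]
        result[content_id] = kept[:max_results_per_content]
    return result


# Hypothetical toy input: content "c1" is referenced under two names.
edges = [
    ("c1", "README.md"),
    ("c1", "README.md"),
    ("c1", "readme.txt"),
    ("c2", "main.py"),
]
top = popular_names(edges, max_results_per_content=1)
```

Writing `top` out with one row per (content, name, count) would give a CSV of the general shape the description suggests, although the real column layout is not specified here.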
- class swh.graph.luigi.file_names.PopularContentPaths(*args, **kwargs)
Bases: Task
Creates a CSV file that contains the most popular path of each content/directory given as input.
- local_graph_path = <luigi.parameter.PathParameter object>
- popular_contents_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- input_swhids = <luigi.parameter.PathParameter object>
- max_depth = <luigi.parameter.IntParameter object>
- property resources
Return the estimated RAM use of this task.
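One plausible way to build a "most popular path" is to walk upward from the input object, picking the most common entry name at each level and stopping after `max_depth` levels. The sketch below illustrates that idea on a toy in-memory structure. The function `popular_path`, the `parents` mapping, and the tie-breaking rule are hypothetical, and the real task may compute paths differently.

```python
from collections import Counter


def popular_path(node, parents, max_depth=2):
    """Build a path for `node` from its most common name at each level.

    `parents` maps a node to a list of (parent_directory, entry_name)
    pairs. This is an illustrative heuristic, not the task's algorithm.
    """
    parts = []
    for _ in range(max_depth):
        entries = parents.get(node, [])
        if not entries:
            break  # reached a root: nothing refers to this node
        name, _count = Counter(n for _, n in entries).most_common(1)[0]
        parts.append(name)
        # Continue from the smallest parent id that uses this name,
        # which keeps the walk deterministic.
        node = min(p for p, n in entries if n == name)
    return "/".join(reversed(parts))


# Hypothetical toy graph: "cnt" is mostly called utils.py, inside "src".
parents = {
    "cnt": [("d1", "utils.py"), ("d2", "utils.py"), ("d3", "helpers.py")],
    "d1": [("root1", "src"), ("root2", "src")],
    "d2": [("root1", "lib")],
}
path = popular_path("cnt", parents, max_depth=2)
```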
- class swh.graph.luigi.file_names.PopularContentNamesOrcToS3(*args, **kwargs)
Bases: _ParquetToS3ToAthenaTask
Reads the CSV from PopularContentNames, converts it to ORC, uploads the ORC to S3, and creates an Athena table for it.
- popular_contents_path = <luigi.parameter.PathParameter object>
- dataset_name = <luigi.parameter.Parameter object>
- s3_athena_output_location = <swh.dataset.luigi.S3PathParameter object>
- requires() → PopularContentNames
Returns the corresponding PopularContentNames instance.
- class swh.graph.luigi.file_names.ListFilesByName(*args, **kwargs)
Bases: Task
From every refs/heads/master, refs/heads/main, or HEAD branch in any snapshot, browses the whole directory tree looking for files named <filename>, and lists them to stdout.
- local_graph_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- output_path = <luigi.parameter.PathParameter object>
- file_name = <luigi.parameter.Parameter object>
- property resources
Return the estimated RAM use of this task.
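The traversal described for ListFilesByName can be sketched on a toy in-memory model. This is only an illustration of the idea: the `snapshots` and `directories` dictionaries, the function `list_files_by_name`, and the shape of the yielded tuples are assumptions, and the sketch does not use the swh.graph API. It also assumes the directory structure is acyclic, so the walk terminates.

```python
# Branch names taken from the task description above.
BRANCHES = ("refs/heads/master", "refs/heads/main", "HEAD")


def list_files_by_name(snapshots, directories, file_name):
    """Yield (snapshot, branch, path, content) for each matching file.

    `snapshots` maps a snapshot id to {branch_name: root_directory_id};
    `directories` maps a directory id to a list of (entry_name, target_id)
    pairs. Both structures are hypothetical stand-ins for the graph.
    """
    for snapshot, branches in snapshots.items():
        for branch in BRANCHES:
            root = branches.get(branch)
            if root is None:
                continue  # this snapshot has no such branch
            stack = [(root, "")]
            while stack:
                directory, prefix = stack.pop()
                for name, target in directories.get(directory, []):
                    path = prefix + name
                    if target in directories:
                        # Sub-directory: descend into it later.
                        stack.append((target, path + "/"))
                    elif name == file_name:
                        yield snapshot, branch, path, target


# Hypothetical toy data: only the "main" branch should be visited.
snapshots = {"snp1": {"refs/heads/main": "d_root", "refs/heads/dev": "d_other"}}
directories = {
    "d_root": [("README.md", "c1"), ("src", "d_src")],
    "d_src": [("README.md", "c2"), ("main.py", "c3")],
    "d_other": [("README.md", "c9")],
}
matches = sorted(list_files_by_name(snapshots, directories, "README.md"))
```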