swh.graph.luigi.file_names module
Luigi tasks for producing the most common names of every content and datasets based on file names
- class swh.graph.luigi.file_names.PopularContentNames(*args, **kwargs)
Bases: Task
Creates a CSV file that contains the most popular name(s) of each content
- local_graph_path = <luigi.parameter.PathParameter object>
- popular_contents_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- max_results_per_content = <luigi.parameter.IntParameter object>
- popularity_threshold = <luigi.parameter.IntParameter object>
- property resources
Return the estimated RAM use of this task.
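As a rough illustration of what the two tuning parameters might control, the sketch below picks the top names per content from in-memory (content, name) pairs. It assumes that `popularity_threshold` is a minimum occurrence count and that `max_results_per_content` caps the number of names kept per content; this page confirms neither, and `popular_names` is a hypothetical helper, not part of swh.graph.

```python
from collections import Counter, defaultdict


def popular_names(pairs, max_results_per_content=1, popularity_threshold=0):
    """Hypothetical helper: most frequent names for each content.

    Assumes popularity_threshold is a minimum occurrence count and
    max_results_per_content caps how many names are kept per content.
    """
    counts = defaultdict(Counter)
    for content, name in pairs:
        counts[content][name] += 1

    result = {}
    for content, names in counts.items():
        # Sort by descending count, then by name so ties are deterministic.
        ranked = sorted(names.items(), key=lambda item: (-item[1], item[0]))
        kept = [(n, c) for n, c in ranked if c >= popularity_threshold]
        result[content] = kept[:max_results_per_content]
    return result


# Toy input: identifiers and names are made up for the example.
pairs = [
    ("cnt1", "README.md"),
    ("cnt1", "README.md"),
    ("cnt1", "readme.txt"),
    ("cnt2", "LICENSE"),
]
top = popular_names(pairs, max_results_per_content=1, popularity_threshold=1)
```

Raising `popularity_threshold` in this sketch drops rarely used names entirely, so a content can end up with an empty list.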
- class swh.graph.luigi.file_names.PopularContentPaths(*args, **kwargs)
Bases: Task
Creates a CSV file that contains the most popular path of each content/directory given as input
- local_graph_path = <luigi.parameter.PathParameter object>
- popular_contents_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- input_swhids = <luigi.parameter.PathParameter object>
- max_depth = <luigi.parameter.IntParameter object>
- property resources
Return the estimated RAM use of this task.
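One way to picture a "most popular path" is to walk from a content up through its parent directories, following the most frequently used name at each step. The sketch below does this over a toy in-memory mapping. The data layout, the tie-breaking rules, and the reading of `max_depth` as a cap on the number of path components are all assumptions, and `popular_path` is a hypothetical helper rather than the task's actual implementation.

```python
from collections import Counter


def popular_path(node, parents, max_depth=2):
    """Hypothetical helper: build a path by following the most common name upward.

    parents maps a node to a list of (parent_node, name) pairs, one per
    directory entry pointing at that node. max_depth is assumed to cap
    the number of path components.
    """
    components = []
    current = node
    for _ in range(max_depth):
        entries = parents.get(current, [])
        if not entries:
            break
        # Pick the most frequent name; ties go to the alphabetically first name.
        name_counts = Counter(name for _parent, name in entries)
        name = min(name_counts, key=lambda n: (-name_counts[n], n))
        components.append(name)
        # Continue from one parent using that name (arbitrary but deterministic).
        current = min(parent for parent, n in entries if n == name)
    return "/".join(reversed(components))


# Toy graph: node identifiers are made up for the example.
parents = {
    "file": [("dirA", "main.py"), ("dirB", "main.py"), ("dirC", "app.py")],
    "dirA": [("root1", "src"), ("root2", "src"), ("root3", "lib")],
}
path = popular_path("file", parents, max_depth=2)
```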
- class swh.graph.luigi.file_names.PopularContentNamesOrcToS3(*args, **kwargs)
Bases: _CsvToOrcToS3ToAthenaTask
Reads the CSV from PopularContentNames, converts it to ORC, uploads the ORC to S3, and creates an Athena table for it.
- popular_contents_path = <luigi.parameter.PathParameter object>
- dataset_name = <luigi.parameter.Parameter object>
- s3_athena_output_location = <swh.dataset.luigi.S3PathParameter object>
- requires() → PopularContentNames
Returns the corresponding PopularContentNames instance.
- class swh.graph.luigi.file_names.ListFilesByName(*args, **kwargs)
Bases: Task
Creates a CSV file that lists the files whose name is file_name
- local_graph_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- output_path = <luigi.parameter.PathParameter object>
- file_name = <luigi.parameter.Parameter object>
- num_threads = <luigi.parameter.IntParameter object>
- batch_size = <luigi.parameter.IntParameter object>
- property resources
Return the estimated RAM use of this task.
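The `num_threads` and `batch_size` parameters suggest that the scan is split into batches processed in parallel. A minimal sketch of that pattern, using only the standard library and a toy list of (identifier, name) entries, might look like this. How the real task partitions the graph is not described on this page, so `list_files_by_name` and its data model are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def list_files_by_name(entries, file_name, num_threads=2, batch_size=2):
    """Hypothetical helper: return identifiers of entries named file_name."""
    # Split the input into consecutive batches of at most batch_size items.
    batches = [
        entries[i:i + batch_size] for i in range(0, len(entries), batch_size)
    ]

    def scan(batch):
        # Keep only the identifiers whose name matches exactly.
        return [ident for ident, name in batch if name == file_name]

    matches = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in batch order, so the output order is stable.
        for found in pool.map(scan, batches):
            matches.extend(found)
    return matches


# Toy input: identifiers and names are made up for the example.
entries = [
    ("id1", "Makefile"),
    ("id2", "README.md"),
    ("id3", "Makefile"),
    ("id4", "setup.py"),
    ("id5", "Makefile"),
]
found = list_files_by_name(entries, "Makefile", num_threads=2, batch_size=2)
```

In this sketch a larger `batch_size` means fewer, bigger units of work per thread, while `num_threads` only bounds how many batches are scanned at once.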