swh.graph.luigi.file_names module

Luigi tasks for producing the most common names of every content, and for producing datasets based on file names

class swh.graph.luigi.file_names.PopularContentNames(*args, **kwargs)

Bases: Task

Creates a CSV file that contains the most popular name(s) of each content

local_graph_path = <luigi.parameter.PathParameter object>
popular_contents_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
max_results_per_content = <luigi.parameter.IntParameter object>
popularity_threshold = <luigi.parameter.IntParameter object>
property resources

Return the estimated RAM use of this task.

requires() → List[Task]

Returns an instance of LocalGraph.

output() → LocalTarget

.csv.zst file that contains the most popular name(s) of each content.

run() → None

Runs popular-content-names from tools/file_names
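
As an illustration of what "most popular name(s)" could mean, here is a minimal sketch in plain Python. It assumes that a name's popularity is the number of times it is used for a content, that max_results_per_content caps how many names are kept, and that popularity_threshold drops names seen fewer times than the threshold. These semantics are guesses based on the parameter names only; the task itself delegates to the popular-content-names tool, and the function and identifiers below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def popular_names(
    occurrences: Iterable[Tuple[str, str]],
    max_results_per_content: int = 1,
    popularity_threshold: int = 0,
) -> Dict[str, List[Tuple[str, int]]]:
    """Toy model: rank the names seen for each content by occurrence count.

    `occurrences` yields hypothetical (content_id, file_name) pairs, one per
    directory entry that points at the content.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for content_id, file_name in occurrences:
        counts[content_id][file_name] += 1

    result: Dict[str, List[Tuple[str, int]]] = {}
    for content_id, counter in counts.items():
        # Sort by decreasing count, then by name so that ties are deterministic.
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        # Assumed meaning of the threshold: keep names seen at least that often.
        kept = [(name, n) for name, n in ranked if n >= popularity_threshold]
        result[content_id] = kept[:max_results_per_content]
    return result


# Made-up identifiers and names, for illustration only.
sample = [
    ("cnt:1", "README.md"),
    ("cnt:1", "README.md"),
    ("cnt:1", "readme.txt"),
    ("cnt:2", "LICENSE"),
]
top_names = popular_names(sample, max_results_per_content=1)
```

Under these assumptions, each content maps to a short list of (name, count) pairs, which corresponds to one or more rows per content in the CSV output.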

class swh.graph.luigi.file_names.PopularContentPaths(*args, **kwargs)

Bases: Task

Creates a CSV file that contains the most popular path of each content/directory given as input

local_graph_path = <luigi.parameter.PathParameter object>
popular_contents_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
input_swhids = <luigi.parameter.PathParameter object>
max_depth = <luigi.parameter.IntParameter object>
property resources

Return the estimated RAM use of this task.

requires() → List[Task]

Returns an instance of LocalGraph.

output() → LocalTarget

.csv.zst file that contains the most popular path of each input content/directory.

run() → None

Runs org.softwareheritage.graph.utils.PopularContentPaths and compresses its output
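
Since this is a regular Luigi task, it can in principle be launched through the luigi command line, where a parameter named foo_bar is passed as --foo-bar. The sketch below only builds such a command line; it does not run it. The helper name luigi_argv and all paths and values are hypothetical, and a real run may need additional parameters or configuration for upstream tasks such as LocalGraph.

```python
from typing import Dict, List


def luigi_argv(module: str, task: str, params: Dict[str, object]) -> List[str]:
    """Build a `luigi` command line for a task and its parameters."""
    argv = ["luigi", "--module", module, task]
    for name, value in params.items():
        # Luigi exposes the parameter `foo_bar` as the flag `--foo-bar`.
        argv += ["--" + name.replace("_", "-"), str(value)]
    # Use the in-process scheduler instead of a central luigid server.
    argv.append("--local-scheduler")
    return argv


# Hypothetical values, for illustration only.
example_argv = luigi_argv(
    "swh.graph.luigi.file_names",
    "PopularContentPaths",
    {
        "local_graph_path": "/tmp/graph",
        "graph_name": "example",
        "input_swhids": "/tmp/swhids.txt",
        "popular_contents_path": "/tmp/popular_paths.csv.zst",
        "max_depth": 2,
    },
)
print(" ".join(example_argv))
```

The resulting list can be passed to subprocess.run once the values are replaced with real paths.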

class swh.graph.luigi.file_names.PopularContentNamesOrcToS3(*args, **kwargs)

Bases: _ParquetToS3ToAthenaTask

Reads the CSV from PopularContentNames, converts it to ORC, uploads the ORC to S3, and creates an Athena table for it.

popular_contents_path = <luigi.parameter.PathParameter object>
dataset_name = <luigi.parameter.Parameter object>
s3_athena_output_location = <swh.dataset.luigi.S3PathParameter object>
requires() → PopularContentNames

Returns corresponding PopularContentNames instance

class swh.graph.luigi.file_names.ListFilesByName(*args, **kwargs)

Bases: Task

From every refs/heads/master, refs/heads/main, or HEAD branch in any snapshot, browses the whole directory tree looking for files named <filename>, and lists them to stdout.

local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
output_path = <luigi.parameter.PathParameter object>
file_name = <luigi.parameter.Parameter object>
property resources

Return the estimated RAM use of this task.

requires() → List[Task]

Returns an instance of LocalGraph.

output() → LocalTarget

Directory of .csv.zst files containing the list of file occurrences with that name.

run() → None

Runs org.softwareheritage.graph.utils.PopularContentNames and compresses its output
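
The traversal described for this class can be pictured with a small in-memory model. The sketch below is not the task's implementation, which runs on the compressed graph. It assumes a snapshot is a dict from branch names to directory trees, and a directory tree is a dict whose values are either sub-directories (dicts) or made-up content identifiers (strings). The function names are hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple

# Branches considered by the task, as listed in the class description.
HEAD_BRANCHES = ("refs/heads/master", "refs/heads/main", "HEAD")


def find_files(
    tree: Dict[str, Any], file_name: str, prefix: str = ""
) -> Iterator[Tuple[str, str]]:
    """Yield (path, content_id) for every file called `file_name` in `tree`."""
    for name, entry in sorted(tree.items()):
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(entry, dict):
            # Sub-directory: keep walking down the tree.
            yield from find_files(entry, file_name, path)
        elif name == file_name:
            yield path, entry


def list_files_by_name(
    snapshot: Dict[str, Dict[str, Any]], file_name: str
) -> Iterator[Tuple[str, str, str]]:
    """Yield (branch, path, content_id) for matches on the main branches only."""
    for branch in HEAD_BRANCHES:
        if branch in snapshot:
            for path, content_id in find_files(snapshot[branch], file_name):
                yield branch, path, content_id


# Toy snapshot with made-up identifiers.
snapshot = {
    "refs/heads/main": {
        "README.md": "cnt:1",
        "src": {"README.md": "cnt:2", "main.py": "cnt:3"},
    },
    "refs/heads/dev": {"README.md": "cnt:4"},
}
matches = list(list_files_by_name(snapshot, "README.md"))
```

In this toy model the refs/heads/dev branch is ignored, and both the top-level and the nested README.md on refs/heads/main are reported.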