swh.graph.luigi.origin_contributors module
Luigi tasks for contribution graph
This module contains Luigi tasks driving the creation of the graph of contributions of people (pseudonymized by default).
File layout
This assumes a local compressed graph (from swh.graph.luigi.compressed_graph) is present, and generates/manipulates the following files:
base_dir/
    <date>[_<flavor>]/
        datasets/
            contribution_graph.csv.zst
        topology/
            topological_order_dfs.csv.zst
And optionally:
sensitive_base_dir/
    <date>[_<flavor>]/
        persons_sha256_to_name.csv.zst
        datasets/
            contribution_graph.deanonymized.csv.zst
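The layout above can be sketched as path construction in Python. This is only an illustration: the helper names `dataset_dir` and `contribution_graph_paths`, the dictionary keys, and the example date and flavor values are hypothetical and not part of this module, and the nesting is assumed to follow the order of the listing.

```python
from pathlib import Path
from typing import Dict, Optional


def dataset_dir(base_dir: Path, date: str, flavor: Optional[str] = None) -> Path:
    """Return <base_dir>/<date>[_<flavor>]/ (hypothetical helper)."""
    name = date if flavor is None else f"{date}_{flavor}"
    return Path(base_dir) / name


def contribution_graph_paths(
    base_dir: Path,
    sensitive_base_dir: Path,
    date: str,
    flavor: Optional[str] = None,
) -> Dict[str, Path]:
    """Build the file paths listed in the layout (assumed nesting)."""
    public = dataset_dir(base_dir, date, flavor)
    sensitive = dataset_dir(sensitive_base_dir, date, flavor)
    return {
        "contribution_graph": public / "datasets" / "contribution_graph.csv.zst",
        "topological_order": public / "topology" / "topological_order_dfs.csv.zst",
        # Optional files, kept under a separate base directory.
        "deanonymization_table": sensitive / "persons_sha256_to_name.csv.zst",
        "deanonymized_contribution_graph": (
            sensitive / "datasets" / "contribution_graph.deanonymized.csv.zst"
        ),
    }


if __name__ == "__main__":
    # Example values are placeholders, not real dataset names.
    paths = contribution_graph_paths(
        Path("/srv/base"), Path("/srv/sensitive"), "2024-01-01", "example"
    )
    for key, path in paths.items():
        print(key, path)
```

Keeping the deanonymized files under a separate base directory makes it possible to apply stricter access controls to them than to the pseudonymized outputs.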
- class swh.graph.luigi.origin_contributors.ListOriginContributors(*args, **kwargs)
Bases: Task
Creates a file that contains all SWHIDs in topological order from a compressed graph.
- local_graph_path = <luigi.parameter.PathParameter object>
- topological_order_dir = <luigi.parameter.PathParameter object>
- origin_contributors_path = <luigi.parameter.PathParameter object>
- origin_urls_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- max_ram_mb = <luigi.parameter.IntParameter object>
- property resources
Returns the value of self.max_ram_mb
- requires() -> Dict[str, Task]
Returns instances of swh.graph.luigi.compressed_graph.LocalGraph and swh.graph.luigi.topology.ComputeGenerations.
- class swh.graph.luigi.origin_contributors.ExportDeanonymizationTable(*args, **kwargs)
Bases: Task
Exports (from swh-storage) a .csv.zst file that contains the columns: base64(sha256(full_name)), base64(full_name), and escape(full_name).
The first column is the anonymized full name found in graph.persons.csv.zst in the compressed graph, and the latter two are the original name.
- storage_dsn = <luigi.parameter.Parameter object>
- deanonymization_table_path = <luigi.parameter.PathParameter object>
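As a minimal sketch of the three columns described for this table, the following computes one row from a raw full name. The function name `deanonymization_row` is hypothetical, and the escaping shown is a stand-in: the exact rule behind escape(full_name) is not specified here, so the `backslashreplace` decoding is an assumption.

```python
import base64
import hashlib
from typing import Tuple


def deanonymization_row(full_name: bytes) -> Tuple[str, str, str]:
    """Return (base64(sha256(full_name)), base64(full_name), escaped name)."""
    # First column: the pseudonymized identifier.
    sha256_b64 = base64.b64encode(hashlib.sha256(full_name).digest()).decode("ascii")
    # Second column: the original name, encoded so any byte sequence is safe in CSV.
    name_b64 = base64.b64encode(full_name).decode("ascii")
    # Third column: a human-readable form.
    # ASSUMPTION: invalid UTF-8 bytes are rendered as backslash escapes; the
    # real escape() may behave differently.
    escaped = full_name.decode("utf-8", errors="backslashreplace")
    return sha256_b64, name_b64, escaped


if __name__ == "__main__":
    print(deanonymization_row(b"Jane Doe <jane@example.org>"))
```

Because SHA256 digests are 32 bytes long, the first column is always a 44-character base64 string, whatever the length of the name.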
- class swh.graph.luigi.origin_contributors.DeanonymizeOriginContributors(*args, **kwargs)
Bases: Task
Generates a .csv.zst file similar to ListOriginContributors’s, but with contributor_base64 and contributor_escaped columns in addition to contributor_id.
This assumes that graph.persons.csv.zst is anonymized (SHA256 of names instead of names), which may not be true depending on how the swh-dataset export was configured.
- local_graph_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- origin_contributors_path = <luigi.parameter.PathParameter object>
- deanonymization_table_path = <luigi.parameter.PathParameter object>
- deanonymized_origin_contributors_path = <luigi.parameter.PathParameter object>
- mph_algo = <luigi.parameter.ChoiceParameter object>
- requires() -> List[Task]
Returns instances of LocalGraph, ListOriginContributors, and ExportDeanonymizationTable.
- output() -> Target
.csv.zst file similar to ListOriginContributors.output()’s, but with contributor_base64 and contributor_escaped columns in addition to contributor_id.
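The join this task describes can be illustrated on small in-memory data. This is a simplified sketch under stated assumptions, not the task's implementation: the function name `deanonymize_contributors` is hypothetical, `contributor_id` is assumed to be a row index into the list of hashed names, the `origin_id` column and all sample values are placeholders, and decompression of the .csv.zst files is left out.

```python
import csv
import io
from typing import Iterable, List, Tuple


def deanonymize_contributors(
    contributors_csv: str,
    persons: List[str],
    table_rows: Iterable[Tuple[str, str, str]],
) -> str:
    """Append contributor_base64 and contributor_escaped to each CSV row."""
    # Index the deanonymization table by its first column (the hashed name).
    names = {sha: (b64, escaped) for sha, b64, escaped in table_rows}

    reader = csv.DictReader(io.StringIO(contributors_csv))
    fieldnames = list(reader.fieldnames) + ["contributor_base64", "contributor_escaped"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # ASSUMPTION: contributor_id is the position of the hashed name in `persons`.
        sha = persons[int(row["contributor_id"])]
        row["contributor_base64"], row["contributor_escaped"] = names[sha]
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder strings stand in for real base64(sha256(...)) values.
    persons = ["c2hhQQ==", "c2hhQg=="]
    table = [("c2hhQg==", "Qm9i", "Bob"), ("c2hhQQ==", "QWxpY2U=", "Alice")]
    print(deanonymize_contributors("origin_id,contributor_id\n10,0\n10,1\n", persons, table))
```

Existing columns are passed through unchanged, which matches the description of the output as the same file with two extra columns.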
- class swh.graph.luigi.origin_contributors.RunOriginContributors(*args, **kwargs)
Bases: Task
- local_graph_path = <luigi.parameter.PathParameter object>
- graph_name = <luigi.parameter.Parameter object>
- origin_urls_path = <luigi.parameter.PathParameter object>
- origin_contributors_path = <luigi.parameter.PathParameter object>
- deanonymized_origin_contributors_path = <luigi.parameter.PathParameter object>
- skip_integrity_check = <luigi.parameter.BoolParameter object>
- test_origin = <luigi.parameter.Parameter object>
- test_person = <luigi.parameter.Parameter object>
- test_years = <luigi.parameter.Parameter object>
- requires() -> List[Task]
Returns instances of LocalGraph, ListOriginContributors, and ExportDeanonymizationTable.