swh.graph.luigi.origin_contributors module

Luigi tasks for contribution graph

This module contains Luigi tasks driving the creation of the graph of contributions of people (pseudonymized by default).

File layout

This assumes a local compressed graph (from swh.graph.luigi.compressed_graph) is present, and generates/manipulates the following files:

base_dir/
    <date>[_<flavor>]/
        datasets/
            contribution_graph.csv.zst
        topology/
            topological_order_dfs.csv.zst

And optionally:

sensitive_base_dir/
    <date>[_<flavor>]/
        persons_sha256_to_name.csv.zst
        datasets/
            contribution_graph.deanonymized.csv.zst
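
The layout above can be expressed as a small path helper. This is a minimal sketch, not part of swh.graph: the function names and any base directory, date, or flavor values passed to them are hypothetical; only the directory and file names come from the layout above.

```python
from pathlib import PurePosixPath
from typing import Optional


def dataset_dir(
    base_dir: PurePosixPath, date: str, flavor: Optional[str] = None
) -> PurePosixPath:
    """Return base_dir/<date>[_<flavor>] (hypothetical helper)."""
    name = date if flavor is None else f"{date}_{flavor}"
    return base_dir / name


def contribution_graph_path(
    base_dir: PurePosixPath, date: str, flavor: Optional[str] = None
) -> PurePosixPath:
    """Path of the pseudonymized contribution graph."""
    return dataset_dir(base_dir, date, flavor) / "datasets" / "contribution_graph.csv.zst"


def deanonymized_contribution_graph_path(
    sensitive_base_dir: PurePosixPath, date: str, flavor: Optional[str] = None
) -> PurePosixPath:
    """Path of the optional deanonymized contribution graph."""
    return (
        dataset_dir(sensitive_base_dir, date, flavor)
        / "datasets"
        / "contribution_graph.deanonymized.csv.zst"
    )
```

Keeping the sensitive files under a separate root, as the layout does, makes it possible to apply stricter access controls to the deanonymized outputs.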

class swh.graph.luigi.origin_contributors.ListOriginContributors(*args, **kwargs)

Bases: Task

Creates .csv.zst files that contain the origin_id<->contributor_id map and the list of origins, computed from a compressed graph.

local_graph_path = <luigi.parameter.PathParameter object>
topological_order_dir = <luigi.parameter.PathParameter object>
origin_contributors_path = <luigi.parameter.PathParameter object>
origin_urls_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
max_ram_mb = <luigi.parameter.IntParameter object>
property resources

Returns the value of self.max_ram_mb

requires() -> Dict[str, Task]

Returns instances of swh.graph.luigi.compressed_graph.LocalGraph and swh.graph.luigi.topology.ComputeGenerations.

output() -> List[Target]

.csv.zst files that contain the origin_id<->contributor_id map and the list of origins.

run() -> None

Runs org.softwareheritage.graph.utils.ListOriginContributors and compresses its output.

class swh.graph.luigi.origin_contributors.ExportDeanonymizationTable(*args, **kwargs)

Bases: Task

Exports (from swh-storage) a .csv.zst file that contains the columns base64(sha256(full_name)), base64(full_name), and escape(full_name).

The first column is the anonymized full name found in graph.persons.csv.zst in the compressed graph, and the latter two are encodings of the original name.
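
As an illustration of the first two columns, here is a minimal sketch using only the Python standard library. It assumes the standard (not URL-safe) base64 alphabet and that full_name is handled as raw bytes; the function name is hypothetical and this is not the code the task runs. The third column is omitted because the exact behavior of escape() is not specified here.

```python
import base64
import hashlib
from typing import Tuple


def anonymized_and_encoded(full_name: bytes) -> Tuple[str, str]:
    """Return (base64(sha256(full_name)), base64(full_name)).

    Assumption: standard base64 alphabet with padding.
    """
    digest = hashlib.sha256(full_name).digest()
    sha256_b64 = base64.b64encode(digest).decode("ascii")
    name_b64 = base64.b64encode(full_name).decode("ascii")
    return sha256_b64, name_b64
```

The hash column cannot be reversed on its own, which is why the table pairs it with the encoded original name: looking up a hash from the compressed graph yields the name it was derived from.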

storage_dsn = <luigi.parameter.Parameter object>
deanonymization_table_path = <luigi.parameter.PathParameter object>
output() -> Target

.csv.zst file that contains the table.

run() -> None

Runs a PostgreSQL query to compute the table.

class swh.graph.luigi.origin_contributors.DeanonymizeOriginContributors(*args, **kwargs)

Bases: Task

Generates a .csv.zst file similar to ListOriginContributors’s, but with contributor_base64 and contributor_escaped columns in addition to contributor_id.

This assumes that graph.persons.csv.zst is anonymized (SHA256 of names instead of names), which may not be true depending on how the swh-dataset export was configured.

local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
origin_contributors_path = <luigi.parameter.PathParameter object>
deanonymization_table_path = <luigi.parameter.PathParameter object>
deanonymized_origin_contributors_path = <luigi.parameter.PathParameter object>
mph_algo = <luigi.parameter.ChoiceParameter object>
requires() -> List[Task]

Returns instances of LocalGraph, ListOriginContributors, and ExportDeanonymizationTable.

output() -> Target

.csv.zst file similar to ListOriginContributors.output()’s, but with contributor_base64 and contributor_escaped columns in addition to contributor_id.

run() -> None

Loads the list of persons (graph.persons.csv.zst in the graph dataset) and the deanonymization table into memory, then uses them to map each row in the original (anonymized) contributors list to the deanonymized one.
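
The mapping step can be pictured as an in-memory join. The sketch below is illustrative only and is not the task's implementation: it assumes, for simplicity, that contributor_id is the position of a person's hash in the persons list, whereas the real task may resolve identifiers differently (for example through the function selected by mph_algo). The function name and the placeholder data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def deanonymize_rows(
    rows: Iterable[Dict[str, str]],
    persons: List[str],
    table: Dict[str, Tuple[str, str]],
) -> Iterator[Dict[str, str]]:
    """Add contributor_base64 and contributor_escaped to each row.

    rows:    anonymized rows with at least a contributor_id column
    persons: base64(sha256(full_name)) values, indexed by contributor_id
             (an assumption made for this sketch)
    table:   hash -> (base64(full_name), escape(full_name))
    """
    for row in rows:
        sha256_b64 = persons[int(row["contributor_id"])]
        name_b64, name_escaped = table[sha256_b64]
        yield {
            **row,
            "contributor_base64": name_b64,
            "contributor_escaped": name_escaped,
        }


# Placeholder data: the hash strings are stand-ins, not real digests.
PERSONS = ["hash-of-alice", "hash-of-bob"]
TABLE = {
    "hash-of-alice": ("QWxpY2U=", "Alice"),
    "hash-of-bob": ("Qm9i", "Bob"),
}
ROWS = [
    {"origin_id": "10", "contributor_id": "1"},
    {"origin_id": "11", "contributor_id": "0"},
]
DEANONYMIZED = list(deanonymize_rows(ROWS, PERSONS, TABLE))
```

Holding both lookup structures in memory keeps the per-row work to two constant-time lookups, at the cost of memory proportional to the number of persons.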

class swh.graph.luigi.origin_contributors.RunOriginContributors(*args, **kwargs)

Bases: Task

local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
origin_urls_path = <luigi.parameter.PathParameter object>
origin_contributors_path = <luigi.parameter.PathParameter object>
deanonymized_origin_contributors_path = <luigi.parameter.PathParameter object>
skip_integrity_check = <luigi.parameter.BoolParameter object>
test_origin = <luigi.parameter.Parameter object>
test_person = <luigi.parameter.Parameter object>
test_years = <luigi.parameter.Parameter object>
requires() -> List[Task]

Returns instances of LocalGraph, ListOriginContributors, and ExportDeanonymizationTable.

run() -> None

Checks the integrity of the produced dataset using a well-known example.
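
A check of this kind amounts to confirming that an expected row is present in the output. The following is a rough sketch under stated assumptions, not the task's actual logic: it works on already-decompressed CSV text, the function name and sample values are hypothetical, and how test_origin, test_person, and test_years map onto columns is not specified here.

```python
import csv
import io
from typing import Dict


def contains_row(csv_text: str, expected: Dict[str, str]) -> bool:
    """Return True if some row matches every (column, value) pair in expected."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return any(
        all(row.get(column) == value for column, value in expected.items())
        for row in reader
    )


# Hypothetical sample; only the two column names come from the text above.
SAMPLE = "origin_id,contributor_id\n1,42\n2,7\n"
FOUND = contains_row(SAMPLE, {"origin_id": "1", "contributor_id": "42"})
MISSING = contains_row(SAMPLE, {"origin_id": "2", "contributor_id": "42"})
```

Testing against a single known contribution is cheap and catches gross failures such as an empty output or shifted identifiers, although it cannot prove the whole dataset is correct.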