swh.datasets.luigi.impact module#

Luigi tasks to measure institutional impact#

This module contains Luigi tasks that compute the impact of an institution across all origins. An institution is identified by a regular expression matched against authors’ email addresses, such as .*@softwareheritage.org.
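As a rough illustration of how such an expression selects addresses, the sketch below uses Python's `re` module. Whether the tasks match the whole address or only part of it, and which regex dialect they use, is not stated here, so the use of `re.fullmatch` is an assumption. The sample addresses are made up. The sketch also shows that an unescaped `.` matches any character, which may select more addresses than intended.

```python
import re

# Pattern taken from the text above. The '.' before 'org' is unescaped,
# so it matches any single character, not only a literal dot.
loose = re.compile(r".*@softwareheritage.org")
# Stricter variant with the dot escaped (an illustration only).
strict = re.compile(r".*@softwareheritage\.org")

# Made-up addresses for the demonstration.
emails = [
    "alice@softwareheritage.org",
    "bob@softwareheritageXorg",
    "carol@example.com",
]

# Assumption: the whole address must match the expression.
loose_matches = [e for e in emails if loose.fullmatch(e)]
strict_matches = [e for e in emails if strict.fullmatch(e)]
```

Here `loose_matches` also contains the second, malformed address, while `strict_matches` keeps only the first one.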

Output is written to the directory configured with impact_dir, which contains:

  • persons.csv: a CSV file with header fullname_base64,email_base64. It is filled from the “sensitive export”, which complements public exports dated 2026-03-02 or newer. For older exports, it can be generated with:

    $ psql service=swh
    softwareheritage=> \copy (select translate(encode(fullname, 'base64'), E'\n', '') as fullname_base64, translate(encode(email, 'base64'), E'\n', '') as email_base64 from person where encode(email, 'escape') ~ '.*@(.*\.)?softwareheritage.org$') to '/path/to/impact/dir/persons.csv' with (delimiter ',', header true)
    
  • raw_origins.csv.zst: all origins that contain a commit authored or committed by someone listed in persons.csv

  • indexed_origins.csv.zst: similar to raw_origins.csv.zst but with some irrelevant origins filtered out, and enriched with metadata from Software Heritage - Indexer and swh-scheduler.

  • revrels: a directory containing all revisions in any origin listed in indexed_origins.csv.zst, as Parquet files

  • origin_revrels: a directory describing the many-to-many relation between origins listed in indexed_origins.csv.zst and revisions in revrels. To avoid the combinatorial explosion of writing every (origin, revrel) pair, the relation is decomposed via an intermediate “oriset” (origin set) identifier:

    • origin_revrels/orisets_of_origin

    • origin_revrels/revrels_in_oriset
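The decomposition can be pictured with plain Python dictionaries. This is only a sketch: the identifiers and the in-memory layout below are hypothetical, and the real tables are Parquet files whose column names are not given here. It shows how the full (origin, revrel) relation can be recovered by joining the two tables on the oriset identifier.

```python
# Hypothetical contents of origin_revrels/orisets_of_origin:
# each origin is associated with one or more oriset identifiers.
orisets_of_origin = {
    "origin_a": [1, 2],
    "origin_b": [2],
}

# Hypothetical contents of origin_revrels/revrels_in_oriset:
# each oriset identifier is associated with the revrels it groups.
revrels_in_oriset = {
    1: ["rev_1", "rev_2"],
    2: ["rev_3", "rel_1"],
}


def revrels_of_origin(origin):
    """Join the two tables on the oriset identifier for one origin."""
    revrels = set()
    for oriset in orisets_of_origin[origin]:
        revrels.update(revrels_in_oriset[oriset])
    return revrels


# Expanding every origin yields the flat many-to-many relation.
pairs = sorted(
    (origin, revrel)
    for origin in orisets_of_origin
    for revrel in revrels_of_origin(origin)
)
```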

class swh.datasets.luigi.impact.SelectPersons(*args, **kwargs)[source]#

Bases: Task

Extracts from the output of Software Heritage Datasets the persons whose email address matches the provided regexp.

local_sensitive_export_path#

Parameter whose value is a path.

In the task definition, use

import luigi


class MyTask(luigi.Task):
    existing_file_path = luigi.PathParameter(exists=True)
    new_file_path = luigi.PathParameter()

    def run(self):
        # Get data from existing file
        with self.existing_file_path.open("r", encoding="utf-8") as f:
            data = f.read()

        # Output message in new file
        self.new_file_path.parent.mkdir(parents=True, exist_ok=True)
        with self.new_file_path.open("w", encoding="utf-8") as f:
            f.write("hello from a PathParameter => ")
            f.write(data)

At the command line, use

$ luigi --module my_tasks MyTask --existing-file-path <path> --new-file-path <path>

impact_dir#

Parameter whose value is a path.

email_regexp#

Parameter whose value is a str.

requires() Dict[str, Task][source]#

Returns an instance of LocalExport

output() LocalTarget[source]#

Returns impact_dir / "persons.csv"

run() None[source]#

Reads the input .orc, filters persons matching the regexp, and writes them to the output .csv
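The filtering and writing steps can be sketched with the standard library alone. This is not the task's implementation: reading the ORC input is left out, the sample rows are made up, and matching the whole email with `fullmatch` is an assumption. Only the CSV header comes from the description of persons.csv above.

```python
import base64
import csv
import io
import re

# Made-up (fullname, email) rows as raw bytes; the real task reads
# them from the ORC export instead.
persons = [
    (b"Alice Example <alice@softwareheritage.org>", b"alice@softwareheritage.org"),
    (b"Bob Example <bob@example.com>", b"bob@example.com"),
]
email_regexp = re.compile(rb".*@softwareheritage\.org")


def b64(data: bytes) -> str:
    """Encode raw bytes as single-line base64 text."""
    return base64.b64encode(data).decode("ascii")


# Write the matching persons with the documented header.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["fullname_base64", "email_base64"])
for fullname, email in persons:
    if email_regexp.fullmatch(email):  # assumption: whole-address match
        writer.writerow([b64(fullname), b64(email)])

# Read the CSV back and decode the email column.
rows = list(csv.DictReader(io.StringIO(out.getvalue())))
decoded_emails = [base64.b64decode(row["email_base64"]) for row in rows]
```

Base64 keeps arbitrary bytes (names and emails are not guaranteed to be valid UTF-8) safe inside a text CSV file.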

class swh.datasets.luigi.impact.ComputeRawImpact(*args, **kwargs)[source]#

Bases: Task

Creates a file that lists all origins that contain revrels from a given set of persons, as well as the number of revrels and the first/latest timestamp for each origin.

local_graph_path#

Parameter whose value is a path.

local_sensitive_graph_path#

Optional parameter whose value is a path.

graph_name#

Parameter whose value is a str.

impact_dir#

Parameter whose value is a path.

output_emails#

A Parameter whose value is a bool. This parameter has an implicit default value of False. For the command line interface this means that the value is False unless you add "--the-bool-parameter" to your command without giving a parameter value. This is considered implicit parsing (the default). However, in some situations one might want to give the explicit bool value ("--the-bool-parameter true|false"), e.g. when you configure the default value to be True. This is called explicit parsing. When omitting the parameter value, it is still considered True but to avoid ambiguities during argument parsing, make sure to always place bool parameters behind the task family on the command line when using explicit parsing.

You can toggle between the two parsing modes on a per-parameter base via

import luigi


class MyTask(luigi.Task):
    implicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.IMPLICIT_PARSING)
    explicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.EXPLICIT_PARSING)

or globally by

luigi.BoolParameter.parsing = luigi.BoolParameter.EXPLICIT_PARSING

for all bool parameters instantiated after this line.

include_ranges#

Parameter whose value is a str.

exclude_ranges#

Parameter whose value is a str.

requires() Dict[str, Task][source]#

Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph, swh.graph.libs.luigi.topology.ComputeGenerations, and two instances of each of swh.graph.libs.luigi.topology.CountPaths and swh.graph.libs.luigi.topology.CountDescendants (forward and backward)

output() Dict[str, LocalTarget][source]#

{"origin_csv": self.impact_dir / "raw_origins.csv.zst"}

run() None[source]#

Runs ‘impact’ and compresses its output.

swh.datasets.luigi.impact.opt_int(s: str) int | None[source]#
swh.datasets.luigi.impact.typeddict_parser(dt: Type[D]) Callable[[Dict[str, str]], D][source]#
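Only the signatures of these two helpers are documented. The sketch below shows one way helpers with these signatures could behave, assuming they convert the string cells of a CSV row into the field types of a TypedDict and that an empty cell stands for a missing integer. The `_sketch` names and the `ExampleOrigin` type are hypothetical and are not the module's code.

```python
from typing import (
    Callable, Dict, Optional, Type, TypedDict, TypeVar, get_args, get_type_hints,
)

D = TypeVar("D")


def opt_int_sketch(s: str) -> Optional[int]:
    """Assumed behaviour: an empty CSV cell means 'no value'."""
    return int(s) if s else None


def typeddict_parser_sketch(dt: Type[D]) -> Callable[[Dict[str, str]], D]:
    """Build a function converting a row of strings into a typed dict."""
    converters = {}
    for name, hint in get_type_hints(dt).items():
        if type(None) in get_args(hint):
            # Optional[int] fields go through the optional-int parser.
            converters[name] = opt_int_sketch
        else:
            # int, float and str can parse their own string form.
            converters[name] = hint

    def parse(row: Dict[str, str]) -> D:
        return dt(**{name: conv(row[name]) for name, conv in converters.items()})

    return parse


class ExampleOrigin(TypedDict):
    """Hypothetical subset of the fields listed for the origin types."""
    origin_id: int
    origin_url: str
    first_revrel_ts: Optional[int]
    forward_paths: float


parse_origin = typeddict_parser_sketch(ExampleOrigin)
parsed = parse_origin({
    "origin_id": "42",
    "origin_url": "https://example.org/repo.git",
    "first_revrel_ts": "",
    "forward_paths": "1.5",
})
```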
class swh.datasets.luigi.impact.OriginWithEmails[source]#

Bases: TypedDict

origin_id: int#
origin_url: str#
num_contributed_revs: int#
num_contributed_rels: int#
num_contributed_revs_in_main_branch: int | None#
total_revs: int#
total_rels: int#
total_revs_in_main_branch: int | None#
first_contributed_revrel: str#
first_contributed_revrel_ts: int | None#
first_contributed_revrel_author: str#
first_contributed_revrel_committer: str#
last_contributed_revrel: str#
last_contributed_revrel_ts: int | None#
last_contributed_revrel_author: str#
last_contributed_revrel_committer: str#
first_revrel_ts: int | None#
last_revrel_ts: int | None#
codemeta_json: str#
citation_cff: str#
forward_paths: float#
forward_paths_to_roots: float#
forward_descendants: int#
histhost_forward_paths: float#
histhost_forward_paths_to_roots: float#
histhost_forward_descendants: int#
class swh.datasets.luigi.impact.OriginWithoutEmails[source]#

Bases: TypedDict

origin_id: int#
origin_url: str#
num_contributed_revs: int#
num_contributed_rels: int#
num_contributed_revs_in_main_branch: int | None#
total_revs: int#
total_rels: int#
total_revs_in_main_branch: int | None#
first_contributed_revrel: str#
first_contributed_revrel_ts: int | None#
last_contributed_revrel: str#
last_contributed_revrel_ts: int | None#
first_revrel_ts: int | None#
last_revrel_ts: int | None#
codemeta_json: str#
citation_cff: str#
forward_paths: float#
forward_paths_to_roots: float#
forward_descendants: int#
histhost_forward_paths: float#
histhost_forward_paths_to_roots: float#
histhost_forward_descendants: int#
class swh.datasets.luigi.impact.ComputeIndexedImpact(*args, **kwargs)[source]#

Bases: Task

Removes forks from ComputeRawImpact’s output, unless they contain more revrels (or older/newer ones) than the upstream origin.
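One possible reading of this rule is sketched below. The way an upstream is identified, the exact comparison, and the effect of each FORK_FILTERS value are not described here, so the `fork_adds_history` function and the sample values are hypothetical. The field names are borrowed from the origin types above.

```python
from typing import Optional, TypedDict


class OriginStats(TypedDict):
    """Hypothetical subset of the per-origin fields used for the comparison."""
    total_revs: int
    total_rels: int
    first_revrel_ts: Optional[int]
    last_revrel_ts: Optional[int]


def fork_adds_history(fork: OriginStats, upstream: OriginStats) -> bool:
    """Return True if the fork should be kept under the assumed rule."""
    more = (
        fork["total_revs"] + fork["total_rels"]
        > upstream["total_revs"] + upstream["total_rels"]
    )
    older = (
        fork["first_revrel_ts"] is not None
        and upstream["first_revrel_ts"] is not None
        and fork["first_revrel_ts"] < upstream["first_revrel_ts"]
    )
    newer = (
        fork["last_revrel_ts"] is not None
        and upstream["last_revrel_ts"] is not None
        and fork["last_revrel_ts"] > upstream["last_revrel_ts"]
    )
    return more or older or newer


upstream: OriginStats = {
    "total_revs": 100, "total_rels": 2,
    "first_revrel_ts": 1000, "last_revrel_ts": 5000,
}
# A fork that is an exact copy of its upstream would be dropped.
plain_fork: OriginStats = dict(upstream)
# A fork with extra, more recent revisions would be kept.
active_fork: OriginStats = {
    "total_revs": 120, "total_rels": 2,
    "first_revrel_ts": 1000, "last_revrel_ts": 6000,
}

keep_plain = fork_adds_history(plain_fork, upstream)
keep_active = fork_adds_history(active_fork, upstream)
```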

indexer_storage_url#

Parameter whose value is a str.

swh_scheduler_url#

Parameter whose value is a str.

FORK_FILTERS = ['all', 'none', 'without-upstream-contribution', 'with-original-content']#
local_graph_path#

Parameter whose value is a path.

graph_name#

Parameter whose value is a str.

impact_dir#

Parameter whose value is a path.

output_emails#

A Parameter whose value is a bool.

fork_filter#

A parameter which takes two values:

  1. an instance of Iterable and

  2. the class of the variables to convert to.

In the task definition, use

import luigi


class MyTask(luigi.Task):
    my_param = luigi.ChoiceParameter(choices=[0.1, 0.2, 0.3], var_type=float)

At the command line, use

$ luigi --module my_tasks MyTask --my-param 0.1

Consider using EnumParameter for a typed, structured alternative. This class can perform the same role when all choices are the same type and transparency of parameter value on the command line is desired.

requires() Dict[str, Task][source]#

Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph and swh.graph.libs.luigi.topology.ComputeGenerations.

output() Dict[str, Target][source]#

{"origin_csv": self.impact_dir / "indexed_origins.csv.zst"}

run() None[source]#

Reads the input CSV, optionally filters out some lines (forks), adds columns, and writes the result.

class swh.datasets.luigi.impact.DenormalizeImpactedRevrels(*args, **kwargs)[source]#

Bases: Task

Denormalizes ComputeIndexedImpact’s output by writing Parquet files listing individual revrels and origin-to-revrel mappings.

indexer_storage_url#

Parameter whose value is a str.

swh_scheduler_url#

Parameter whose value is a str.

local_graph_path#

Parameter whose value is a path.

local_sensitive_graph_path#

Optional parameter whose value is a path.

graph_name#

Parameter whose value is a str.

impact_dir#

Parameter whose value is a path.

write_revrels#

A Parameter whose value is a bool.

write_origin_revrels#

A Parameter whose value is a bool.

revrel_filter#

Parameter whose value is one of a fixed list of choices.

algorithm#

Parameter whose value is one of a fixed list of choices.

include_ranges#

Parameter whose value is a str.

exclude_ranges#

Parameter whose value is a str.

requires() Dict[str, Task][source]#

Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph, swh.graph.libs.luigi.topology.ComputeGenerations, and two instances of each of swh.graph.libs.luigi.topology.CountPaths and swh.graph.libs.luigi.topology.CountDescendants (forward and backward)

output() Dict[str, LocalTarget][source]#

self.impact_dir / "revrels" and/or self.impact_dir / "origin_revrels" depending on self.write_revrels and self.write_origin_revrels.

For algorithm="oriset", origin_revrels is a directory containing two subdirectories, orisets_of_origin and revrels_in_oriset, that together encode the many-to-many origin->revrel relation. For algorithm="bfs" it contains flat (origin, revrel) Parquet files directly.
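To see why the two layouts differ in size, the sketch below derives an oriset-style decomposition from a flat list of pairs. It assumes that an oriset is a distinct set of origins sharing the same revrels, which is an interpretation of the name rather than something stated here. All identifiers are made up, and plain dictionaries stand in for the Parquet files.

```python
from collections import defaultdict

# Made-up flat relation, similar in shape to the "bfs" layout:
# three origins share four revrels, and one origin has an extra one.
flat_pairs = [
    (origin, revrel)
    for origin in ("origin_a", "origin_b", "origin_c")
    for revrel in ("rev_1", "rev_2", "rev_3", "rev_4")
]
flat_pairs.append(("origin_a", "rev_5"))

# Group each revrel by the set of origins that contain it.
origins_of_revrel = defaultdict(set)
for origin, revrel in flat_pairs:
    origins_of_revrel[revrel].add(origin)

# Assumption: one oriset identifier per distinct set of origins.
oriset_ids = {}
revrels_in_oriset = defaultdict(list)
for revrel, origins in sorted(origins_of_revrel.items()):
    oriset = oriset_ids.setdefault(frozenset(origins), len(oriset_ids))
    revrels_in_oriset[oriset].append(revrel)

# Record, for each origin, the orisets it is a member of.
orisets_of_origin = defaultdict(list)
for origins, oriset in oriset_ids.items():
    for origin in sorted(origins):
        orisets_of_origin[origin].append(oriset)

flat_rows = len(flat_pairs)
decomposed_rows = (
    sum(len(revrels) for revrels in revrels_in_oriset.values())
    + sum(len(orisets) for orisets in orisets_of_origin.values())
)
```

In this toy input the decomposed form needs fewer rows than the flat one, and the gap widens as more origins share the same history.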

run() None[source]#

Runs impact-denormalize