swh.datasets.luigi.impact module
Luigi tasks to measure institutional impact
This module contains Luigi tasks
computing the impact of an institution across all origins.
Institutions are identified by a regular expression matched against authors’ email addresses,
such as .*@softwareheritage.org.
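As a rough illustration of how such an expression selects addresses, the sketch below applies the stricter pattern from the psql command in this section to a few invented addresses. Treating emails as bytes and anchoring the match at the start are assumptions made for this example; how the tasks themselves apply the expression is not stated here.

```python
import re

# Pattern copied from the psql example in this section. The "." before "org"
# is unescaped, so it matches any character; escape it for stricter matching.
EMAIL_REGEXP = re.compile(rb".*@(.*\.)?softwareheritage.org$")

# Invented addresses, for illustration only.
emails = [
    b"alice@softwareheritage.org",
    b"bob@staff.softwareheritage.org",
    b"carol@example.com",
    b"dave@notsoftwareheritage.org",
]

# Keep only the addresses that belong to the institution's domain
# or to one of its subdomains.
matching = [email for email in emails if EMAIL_REGEXP.match(email)]
```

The optional `(.*\.)?` group accepts subdomains while still rejecting a domain that merely ends with the institution's name.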
Output is written to the directory configured with impact_dir, which contains:
persons.csv: a CSV file with header fullname_base64,email_base64. It is filled from the “sensitive export”, which complements public exports dated 2026-03-02 or newer. For older exports, it can be generated with:
$ psql service=swh
softwareheritage=> \copy (select translate(encode(fullname, 'base64'), E'\n', '') as fullname_base64, translate(encode(email, 'base64'), E'\n', '') as email_base64 from person where encode(email, 'escape') ~ '.*@(.*\.)?softwareheritage.org$') to '/path/to/impact/dir/persons.csv' with (delimiter ',', header true)
raw_origins.csv.zst: all origins that contain a commit authored or committed by someone listed in persons.csv
indexed_origins.csv.zst: similar to raw_origins.csv.zst but with some irrelevant origins filtered out, and enriched with metadata from Software Heritage - Indexer and swh-scheduler.
revrels: a directory containing all revisions in any origin listed in indexed_origins.csv.zst, as Parquet files
origin_revrels: a directory describing the many-to-many relation between origins listed in indexed_origins.csv.zst and revisions in revrels. To avoid the combinatorial explosion of writing every (origin, revrel) pair, the relation is decomposed via an intermediate “oriset” (origin set) identifier:
origin_revrels/orisets_of_origin
origin_revrels/revrels_in_oriset
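Since both columns of persons.csv are base64-encoded, a consumer has to decode them before use. The following is a minimal sketch of doing so with the standard library; the in-memory sample row and the helper name read_persons are made up for illustration, and only the header names come from the description above.

```python
import base64
import csv
import io

# Build a tiny in-memory stand-in for persons.csv (invented person).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["fullname_base64", "email_base64"])
writer.writerow([
    base64.b64encode(b"Jane Doe <jane@softwareheritage.org>").decode("ascii"),
    base64.b64encode(b"jane@softwareheritage.org").decode("ascii"),
])
sample.seek(0)


def read_persons(csv_file):
    """Yield (fullname, email) pairs as bytes from a persons.csv-like file."""
    for row in csv.DictReader(csv_file):
        yield (
            base64.b64decode(row["fullname_base64"]),
            base64.b64decode(row["email_base64"]),
        )


persons = list(read_persons(sample))
```

The values are kept as bytes rather than decoded to str, on the assumption that names and emails are not guaranteed to be valid UTF-8.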
- class swh.datasets.luigi.impact.SelectPersons(*args, **kwargs)
Bases: Task
Extract persons from the output of Software Heritage Datasets based on the provided regexp
- local_sensitive_export_path
Parameter whose value is a path.
In the task definition, use
class MyTask(luigi.Task):
    existing_file_path = luigi.PathParameter(exists=True)
    new_file_path = luigi.PathParameter()

    def run(self):
        # Get data from existing file
        with self.existing_file_path.open("r", encoding="utf-8") as f:
            data = f.read()
        # Output message in new file
        self.new_file_path.parent.mkdir(parents=True, exist_ok=True)
        with self.new_file_path.open("w", encoding="utf-8") as f:
            f.write("hello from a PathParameter => ")
            f.write(data)
At the command line, use
$ luigi --module my_tasks MyTask --existing-file-path <path> --new-file-path <path>
- impact_dir
Parameter whose value is a path.
- email_regexp
Parameter whose value is a str.
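Luigi exposes each task parameter as a command-line flag with underscores replaced by hyphens, as in the --existing-file-path example above. The sketch below assembles such a command line for SelectPersons from its three documented parameters. The helper luigi_command and both paths are placeholders invented for this example, and further options (for instance a scheduler setting) may be needed in practice.

```python
import shlex


def luigi_command(module, task, **params):
    """Build a luigi argv list; parameter names become --dashed-flags."""
    argv = ["luigi", "--module", module, task]
    for name, value in params.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


argv = luigi_command(
    "swh.datasets.luigi.impact",
    "SelectPersons",
    local_sensitive_export_path="/path/to/sensitive/export",  # placeholder
    impact_dir="/path/to/impact/dir",  # placeholder
    email_regexp=r".*@(.*\.)?softwareheritage.org$",
)

# shlex.join quotes the regular expression so a shell passes it unchanged.
print(shlex.join(argv))
```

Building the argument list first and quoting it with shlex.join avoids shell interpretation of characters such as `*`, `(` and `$` in the regular expression.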
- class swh.datasets.luigi.impact.ComputeRawImpact(*args, **kwargs)
Bases: Task
Creates a file that lists all origins that contain revrels from a given set of persons, as well as the number of revrels and the first/latest timestamp for each origin.
- local_graph_path
Parameter whose value is a path.
- local_sensitive_graph_path
Class to parse optional path parameters.
- graph_name
Parameter whose value is a str.
- impact_dir
Parameter whose value is a path.
- output_emails
A Parameter whose value is a bool. This parameter has an implicit default value of False. For the command line interface this means that the value is False unless you add "--the-bool-parameter" to your command without giving a parameter value. This is considered implicit parsing (the default). However, in some situations one might want to give the explicit bool value ("--the-bool-parameter true|false"), e.g. when you configure the default value to be True. This is called explicit parsing. When omitting the parameter value, it is still considered True, but to avoid ambiguities during argument parsing, make sure to always place bool parameters behind the task family on the command line when using explicit parsing.
You can toggle between the two parsing modes on a per-parameter basis via
class MyTask(luigi.Task):
    implicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.IMPLICIT_PARSING)
    explicit_bool = luigi.BoolParameter(parsing=luigi.BoolParameter.EXPLICIT_PARSING)
or globally by
luigi.BoolParameter.parsing = luigi.BoolParameter.EXPLICIT_PARSING
for all bool parameters instantiated after this line.
- include_ranges
Parameter whose value is a str.
- exclude_ranges
Parameter whose value is a str.
- requires() → Dict[str, Task]
Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph, swh.graph.libs.luigi.topology.ComputeGenerations, and two instances of each of swh.graph.libs.luigi.topology.CountPaths and swh.graph.libs.luigi.topology.CountDescendants (forward and backward).
- class swh.datasets.luigi.impact.ComputeIndexedImpact(*args, **kwargs)
Bases: Task
Removes forks from ComputeRawImpact’s output, unless they contain more revrels (or older/newer ones) than the upstream origin.
- indexer_storage_url
Parameter whose value is a str.
- swh_scheduler_url
Parameter whose value is a str.
- FORK_FILTERS = ['all', 'none', 'without-upstream-contribution', 'with-original-content']
- local_graph_path
Parameter whose value is a path.
- graph_name
Parameter whose value is a str.
- impact_dir
Parameter whose value is a path.
- output_emails
A Parameter whose value is a bool.
- fork_filter
A parameter which takes two values:
an instance of Iterable and
the class of the variables to convert to.
In the task definition, use
class MyTask(luigi.Task):
    my_param = luigi.ChoiceParameter(choices=[0.1, 0.2, 0.3], var_type=float)
At the command line, use
$ luigi --module my_tasks MyTask --my-param 0.1
Consider using EnumParameter for a typed, structured alternative. This class can perform the same role when all choices are the same type and transparency of parameter value on the command line is desired.
- requires() → Dict[str, Task]
Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph and swh.graph.libs.luigi.topology.ComputeGenerations.
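One possible reading of the fork rule described for ComputeIndexedImpact is sketched below: a fork is kept only if it has more revrels than its upstream, or an older first or newer latest timestamp. This is an interpretation of the one-sentence description, not the task's actual code; the OriginStats class, its field names, and the sample values are all hypothetical, and the sketch does not model the individual FORK_FILTERS values.

```python
from dataclasses import dataclass


@dataclass
class OriginStats:
    """Hypothetical per-origin summary; field names are illustrative."""

    num_revrels: int
    first_timestamp: int  # earliest matching revrel, Unix seconds
    latest_timestamp: int  # latest matching revrel, Unix seconds


def keep_fork(fork: OriginStats, upstream: OriginStats) -> bool:
    """Keep a fork only if it adds something its upstream does not have."""
    return (
        fork.num_revrels > upstream.num_revrels
        or fork.first_timestamp < upstream.first_timestamp
        or fork.latest_timestamp > upstream.latest_timestamp
    )


upstream = OriginStats(10, 1_600_000_000, 1_700_000_000)
plain_fork = OriginStats(10, 1_600_000_000, 1_700_000_000)  # identical copy
active_fork = OriginStats(12, 1_600_000_000, 1_710_000_000)  # extra, newer work

kept = [keep_fork(plain_fork, upstream), keep_fork(active_fork, upstream)]
```

Here the identical copy is dropped and the fork with additional, more recent revrels is kept.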
- class swh.datasets.luigi.impact.DenormalizeImpactedRevrels(*args, **kwargs)
Bases: Task
Denormalizes ComputeIndexedImpact’s output by writing Parquet files listing individual revrels and origin-to-revrel mappings.
- indexer_storage_url
Parameter whose value is a str.
- swh_scheduler_url
Parameter whose value is a str.
- local_graph_path
Parameter whose value is a path.
- local_sensitive_graph_path
Class to parse optional path parameters.
- graph_name
Parameter whose value is a str.
- impact_dir
Parameter whose value is a path.
- write_revrels
A Parameter whose value is a bool.
- write_origin_revrels
A Parameter whose value is a bool.
- revrel_filter
A parameter which takes two values: an instance of Iterable and the class of the variables to convert to.
- algorithm
A parameter which takes two values: an instance of Iterable and the class of the variables to convert to.
- include_ranges
Parameter whose value is a str.
- exclude_ranges
Parameter whose value is a str.
- requires() → Dict[str, Task]
Returns an instance of swh.graph.luigi.compressed_graph.LocalGraph, swh.graph.libs.luigi.topology.ComputeGenerations, and two instances of each of swh.graph.libs.luigi.topology.CountPaths and swh.graph.libs.luigi.topology.CountDescendants (forward and backward).
- output() → Dict[str, LocalTarget]
self.impact_dir / "revrels" and/or self.impact_dir / "origin_revrels" depending on self.write_revrels and self.write_origin_revrels.
For algorithm="oriset", origin_revrels is a directory containing two subdirectories, orisets_of_origin and revrels_in_oriset, that together encode the many-to-many origin->revrel relation. For algorithm="bfs" it contains flat (origin, revrel) Parquet files directly.
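To show how the two oriset tables can be combined back into flat (origin, revrel) pairs, here is a small in-memory sketch. The real data is stored as Parquet files whose column names are not given here, so the tuples, identifiers, and URLs below are invented; the sketch also assumes that an oriset groups revrels shared by the same set of origins.

```python
from collections import defaultdict

# Toy stand-in for origin_revrels/orisets_of_origin: (origin, oriset) rows.
orisets_of_origin = [
    ("https://example.org/a", 1),
    ("https://example.org/b", 1),
    ("https://example.org/c", 1),
    ("https://example.org/a", 2),
    ("https://example.org/d", 3),
]

# Toy stand-in for origin_revrels/revrels_in_oriset: (oriset, revrel) rows.
revrels_in_oriset = [
    (1, "rev1"), (1, "rev2"), (1, "rev3"), (1, "rev4"),
    (2, "rev5"),
    (3, "rev6"),
]

# Index revrels by oriset, then join each origin to the revrels of its orisets.
revrels_by_oriset = defaultdict(list)
for oriset, revrel in revrels_in_oriset:
    revrels_by_oriset[oriset].append(revrel)

pairs = sorted(
    (origin, revrel)
    for origin, oriset in orisets_of_origin
    for revrel in revrels_by_oriset[oriset]
)

# Rows stored in decomposed form versus rows in the flat relation.
decomposed_rows = len(orisets_of_origin) + len(revrels_in_oriset)
flat_rows = len(pairs)
```

In this toy case the decomposed form stores 11 rows where the flat relation needs 14; the gap widens as more origins share the same revrels.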