swh.graph.luigi.provenance module

Luigi tasks to help compute the provenance of content blobs

This module contains Luigi tasks driving the computation of the Provenance index.

class swh.graph.luigi.provenance.ListProvenanceNodes(*args, **kwargs)

Bases: Task

Lists all nodes reachable from releases and ‘head revisions’.

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
requires() → Dict[str, Task]

Returns LocalGraph and SortRevrelByDate instances.

output() → Dict[str, LocalTarget]

Returns {provenance_dir}/nodes/

run() → None

Runs list-provenance-nodes from tools/provenance

class swh.graph.luigi.provenance.ComputeEarliestTimestamps(*args, **kwargs)

Bases: Task

Creates an array storing, for each directory/content SWHID, the author date of the first revision/release that contains it.
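As a rough illustration of the idea (not the actual compute-earliest-timestamps implementation), the sketch below works on a toy in-memory graph. The node names, the integer dates and the earliest_timestamps helper are all hypothetical. It assumes revisions are visited in ascending author-date order, so the first revision that reaches a directory or content determines its timestamp.

```python
from typing import Dict, List

# Hypothetical toy graph: node -> children. Revisions point to directories,
# directories point to sub-directories and contents.
CHILDREN: Dict[str, List[str]] = {
    "rev1": ["dirA", "dirC"],
    "rev2": ["dirB"],
    "dirA": ["file1", "file2"],
    "dirB": ["dirA", "file3"],
    "dirC": ["file4"],
}
# Hypothetical author dates (plain integers stand in for timestamps).
AUTHOR_DATE: Dict[str, int] = {"rev1": 200, "rev2": 100}


def earliest_timestamps(
    children: Dict[str, List[str]], author_date: Dict[str, int]
) -> Dict[str, int]:
    """Map each directory/content to the date of the first revision containing it."""
    earliest: Dict[str, int] = {}
    # Visit revisions from oldest to newest.
    for revision in sorted(author_date, key=author_date.get):
        stack = list(children.get(revision, []))
        while stack:
            node = stack.pop()
            if node in earliest:
                # Already reached from an older (or equally old) revision,
                # and its subtree was fully visited then, so skip it.
                continue
            earliest[node] = author_date[revision]
            stack.extend(children.get(node, []))
    return earliest


timestamps = earliest_timestamps(CHILDREN, AUTHOR_DATE)
```

Here dirA is reachable from both revisions, so it gets the date of the older one (rev2), while dirC and file4 are only reachable from rev1.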

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and SortRevrelByDate instances.

output() → Dict[str, LocalTarget]

Returns {provenance_dir}/revrel_by_author_date/ and {provenance_dir}/earliest_timestamps.bin.

run() → None

Runs compute-earliest-timestamps from tools/provenance

class swh.graph.luigi.provenance.ListDirectoryMaxLeafTimestamp(*args, **kwargs)

Bases: Task

Creates a file that contains all directory/content SWHIDs, along with the author date and SWHID of the first revision/release each occurs in.

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and ComputeEarliestTimestamps instances.

output() → LocalTarget

Returns {provenance_dir}/max_leaf_timestamps.bin

run() → None

Runs list-directory-with-max-leaf-timestamp from tools/provenance

class swh.graph.luigi.provenance.ComputeDirectoryFrontier(*args, **kwargs)

Bases: Task

Creates a file that contains the “directory frontier” as defined by swh-provenance.

In short, a frontier directory is a directory that directly contains at least one file (not a directory) and that is a non-root directory in a revision newer than the directory timestamp computed by ListDirectoryMaxLeafTimestamp.
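The short definition can be read as a predicate with three conditions. The sketch below is only an illustration under stated assumptions: DirectoryInfo, its fields and is_frontier are hypothetical names, timestamps are plain integers, and "newer" is interpreted as a strict comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryInfo:
    """Hypothetical per-directory record used only by this sketch."""

    max_leaf_timestamp: int  # timestamp attributed to the directory
    has_direct_file: bool  # True if at least one direct entry is a file


def is_frontier(directory: DirectoryInfo, revision_date: int, is_root: bool) -> bool:
    """Check the three conditions of the short definition for one revision."""
    return (
        directory.has_direct_file  # directly contains a file
        and not is_root  # is not the revision's root directory
        and revision_date > directory.max_leaf_timestamp  # revision is newer
    )


vendored = DirectoryInfo(max_leaf_timestamp=100, has_direct_file=True)
newer_revision = is_frontier(vendored, revision_date=150, is_root=False)
same_age_revision = is_frontier(vendored, revision_date=100, is_root=False)
```

Note that the result depends on the revision: the same directory can be a frontier for one revision and not for another.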

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
batch_size = <luigi.parameter.IntParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and ListDirectoryMaxLeafTimestamp instances.

output() → LocalTarget

Returns {provenance_dir}/directory_frontier/

run() → None

Runs compute-directory-frontier from tools/provenance

class swh.graph.luigi.provenance.ListFrontierDirectoriesInRevisions(*args, **kwargs)

Bases: Task

Creates a file that contains the list of revisions that any “frontier directory” (as defined by swh-provenance) is in.

A directory is considered a frontier only relative to a given revision. The produced file nevertheless lists all revisions a directory is in, for every directory that is a frontier for at least one revision.
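The distinction between "frontier for a revision" and "listed for a revision" can be shown with a small sketch. The data and the function name below are hypothetical, and the (directory, revision) pair layout is an assumption made for illustration, not the tool's file format.

```python
from typing import Iterable, List, Set, Tuple


def revisions_of_frontier_directories(
    containment: Iterable[Tuple[str, str]],
    frontier_pairs: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return every (directory, revision) pair whose directory is a frontier somewhere."""
    # A directory qualifies as soon as it is a frontier for at least one revision.
    frontier_directories = {directory for directory, _ in frontier_pairs}
    return sorted(
        (directory, revision)
        for directory, revision in containment
        if directory in frontier_directories
    )


# Hypothetical toy data: dirA is in rev1 and rev2, but is a frontier only for rev2.
CONTAINMENT = [("dirA", "rev1"), ("dirA", "rev2"), ("dirB", "rev2")]
FRONTIER_PAIRS = {("dirA", "rev2")}

rows = revisions_of_frontier_directories(CONTAINMENT, FRONTIER_PAIRS)
```

In this toy case, rows contains both ("dirA", "rev1") and ("dirA", "rev2"), even though dirA is a frontier only relative to rev2; dirB is never a frontier, so it is omitted.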

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
batch_size = <luigi.parameter.IntParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and ComputeDirectoryFrontier instances.

output() → LocalTarget

Returns {provenance_dir}/frontier_directories_in_revisions/

run() → None

Runs org.softwareheritage.graph.utils.ListFrontierDirectoriesInRevisions

class swh.graph.luigi.provenance.ListContentsInRevisionsWithoutFrontier(*args, **kwargs)

Bases: Task

Creates a file that contains the list of (file, revision) pairs where the file is reachable from the revision without going through any “directory frontier” as defined by swh-provenance.

In short, a frontier directory is a directory that directly contains at least one file (not a directory) and that is a non-root directory in a revision newer than the directory timestamp computed by ListDirectoryMaxLeafTimestamp.
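A minimal sketch of such a traversal follows, assuming a toy directory tree and a precomputed set of directories that are frontiers relative to the revision. All names and data are hypothetical; this is not the contents-in-revisions-without-frontier implementation.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, str]  # (kind, name), where kind is "file" or "dir"


def contents_without_frontier(
    revision: str,
    root_directory: str,
    entries: Dict[str, List[Entry]],
    frontier: Set[str],
) -> List[Tuple[str, str]]:
    """List (file, revision) pairs reachable without crossing a frontier directory."""
    pairs: Set[Tuple[str, str]] = set()
    visited: Set[str] = {root_directory}
    stack = [root_directory]
    while stack:
        directory = stack.pop()
        for kind, name in entries.get(directory, []):
            if kind == "file":
                pairs.add((name, revision))
            elif name not in frontier and name not in visited:
                # Descend only into directories that are not frontiers.
                visited.add(name)
                stack.append(name)
    return sorted(pairs)


# Hypothetical toy tree: "vendor" is assumed to be a frontier for rev1.
ENTRIES: Dict[str, List[Entry]] = {
    "root": [("file", "README"), ("dir", "src"), ("dir", "vendor")],
    "src": [("file", "main.py")],
    "vendor": [("file", "lib.py")],
}

pairs = contents_without_frontier("rev1", "root", ENTRIES, frontier={"vendor"})
```

The traversal always starts at the root directory, which by definition cannot be a frontier. Files under vendor are left out here because they sit behind a frontier directory.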

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
batch_size = <luigi.parameter.IntParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and ListDirectoryMaxLeafTimestamp instances.

output() → LocalTarget

Returns {provenance_dir}/contents_in_revisions_without_frontiers

run() → None

Runs contents-in-revisions-without-frontier from tools/provenance

class swh.graph.luigi.provenance.ListContentsInFrontierDirectories(*args, **kwargs)

Bases: Task

Enumerates all contents in all directories returned by ComputeDirectoryFrontier.
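As an illustration only, the enumeration can be pictured as a recursive walk below each frontier directory. The sketch assumes the walk covers the whole subtree and emits (content, directory) pairs; both points, along with every name and value below, are assumptions rather than the behaviour of contents-in-directories.

```python
from typing import Dict, Iterator, List, Tuple

Entry = Tuple[str, str]  # (kind, name), where kind is "file" or "dir"


def contents_in_directory(
    directory: str, entries: Dict[str, List[Entry]]
) -> Iterator[str]:
    """Yield every file reachable from `directory`, at any depth."""
    for kind, name in entries.get(directory, []):
        if kind == "file":
            yield name
        else:
            yield from contents_in_directory(name, entries)


# Hypothetical toy data: one frontier directory with a nested sub-directory.
ENTRIES: Dict[str, List[Entry]] = {
    "vendor": [("file", "lib.py"), ("dir", "vendor/docs")],
    "vendor/docs": [("file", "index.rst")],
}
FRONTIER_DIRECTORIES = ["vendor"]

pairs = [
    (content, directory)
    for directory in FRONTIER_DIRECTORIES
    for content in contents_in_directory(directory, ENTRIES)
]
```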

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
property resources

Returns the value of self.max_ram_mb

requires() → Dict[str, Task]

Returns LocalGraph and ComputeDirectoryFrontier instances.

output() → LocalTarget

Returns {provenance_dir}/contents_in_frontier_directories/

run() → None

Runs contents-in-directories from tools/provenance

class swh.graph.luigi.provenance.RunProvenance(*args, **kwargs)

Bases: WrapperTask

(Transitively) depends on all provenance tasks.

local_export_path = <luigi.parameter.PathParameter object>
local_graph_path = <luigi.parameter.PathParameter object>
graph_name = <luigi.parameter.Parameter object>
provenance_dir = <luigi.parameter.PathParameter object>
provenance_node_filter = <luigi.parameter.Parameter object>
requires()

Returns ListContentsInFrontierDirectories and ListContentsInRevisionsWithoutFrontier instances.
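To see what requiring these two tasks pulls in, the sketch below builds a dependency table from the requires() summaries on this page and derives one valid execution order with the standard library's graphlib (Python 3.9+). The table is transcribed by hand and may not match the actual code; it does not use Luigi itself, and transitive_requirements is a hypothetical helper.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Task -> tasks it requires, transcribed from the requires() summaries on this page.
REQUIRES: Dict[str, Set[str]] = {
    "ListProvenanceNodes": {"LocalGraph", "SortRevrelByDate"},
    "ComputeEarliestTimestamps": {"LocalGraph", "SortRevrelByDate"},
    "ListDirectoryMaxLeafTimestamp": {"LocalGraph", "ComputeEarliestTimestamps"},
    "ComputeDirectoryFrontier": {"LocalGraph", "ListDirectoryMaxLeafTimestamp"},
    "ListFrontierDirectoriesInRevisions": {"LocalGraph", "ComputeDirectoryFrontier"},
    "ListContentsInRevisionsWithoutFrontier": {
        "LocalGraph",
        "ListDirectoryMaxLeafTimestamp",
    },
    "ListContentsInFrontierDirectories": {"LocalGraph", "ComputeDirectoryFrontier"},
    "RunProvenance": {
        "ListContentsInFrontierDirectories",
        "ListContentsInRevisionsWithoutFrontier",
    },
}


def transitive_requirements(task: str, requires: Dict[str, Set[str]]) -> Set[str]:
    """Collect every task reachable from `task` through requires() edges."""
    seen: Set[str] = set()
    stack = [task]
    while stack:
        for dependency in requires.get(stack.pop(), set()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


needed = transitive_requirements("RunProvenance", REQUIRES)
# Keep only RunProvenance and what it needs, then order dependencies first.
subgraph = {task: REQUIRES.get(task, set()) for task in needed | {"RunProvenance"}}
order = list(TopologicalSorter(subgraph).static_order())
```

Several orders are valid, but in all of them ComputeEarliestTimestamps precedes ListDirectoryMaxLeafTimestamp, which precedes ComputeDirectoryFrontier, and RunProvenance comes last.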