Luigi workflows#

The preferred way to create and transfer swh-graph data is through Luigi tasks/workflows rather than the regular CLI. These cover all common operations in swh-graph, except running the gRPC server (swh graph grpc-serve).

Using Luigi makes it possible to automatically build any missing prerequisite datasets before building a new one, which is a common need when working with swh-graph data.

The swh graph luigi CLI wraps Luigi’s CLI to simplify passing common parameters to tasks. Command lines usually look like this:

swh graph luigi <common_parameters> <task_name> -- <direct_task_parameters>

where:

  • <common_parameters> are the parameters exposed by swh graph luigi. See the CLI documentation for the full list of common parameters. The most important ones are:

    • --dataset-name is the name of the dataset to work on, which is the date of the export, optionally with a suffix (e.g. 2022-12-07 or 2022-12-07-history-hosting).

    • --base-directory which is the root directory for all datasets. It contains a subdirectory named after --dataset-name, which contains the data the workflow will work with.

    • --graph-base-directory, the location of the compressed graph. To obtain reasonable performance, it should be on a tmpfs, set up as described in the Memory & Performance tuning documentation.

    • --s3-prefix, which is usually s3://softwareheritage/graph/

    • --athena-prefix, which is a prefix for tables created in Amazon Athena. Actual table names will be this prefix, followed by _ and the dataset name (with dashes stripped).

  • <task_name> is the CamelCase (and usually imperative-form) name of the last task to run; tasks it depends on will automatically run first.

  • <direct_task_parameters> are parameters passed directly to Luigi. In short, these are --scheduler-url, which takes the URL of the Luigi scheduler (or --local-scheduler if you do not want to use one); --log-level {INFO,DEBUG,...}; parameters to the main task (--kebab-cased-parameter); and parameters to other tasks (--TaskName-kebab-cased-parameter). See the Luigi CLI documentation for details.
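For instance, the following sketch shows how the three parts fit together (the base directory path and scheduler port are placeholders, not real values):

# <common_parameters>: --base-directory and --dataset-name
# <task_name>: DownloadGraphFromS3
# <direct_task_parameters>: everything after the standalone "--"
swh graph luigi \
    --base-directory /srv/softwareheritage/datasets/ \
    --dataset-name 2022-12-07 \
    DownloadGraphFromS3 \
    -- \
    --scheduler-url http://localhost:8082/ \
    --log-level INFO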

Note

Progress for most tasks is currently reported only on standard output, not in the Luigi scheduler dashboard.

Statistics#

Tasks may record statistics about the commands they ran, such as CPU time and memory usage. For example, the compression pipeline writes the following for each task in the meta/compression.json file:

"cgroup_stats": {
    "memory.events": "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\noom_group_kill 0",
    "memory.events.local": "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\noom_group_kill 0",
    "memory.swap.current": "0",
    "memory.zswap.current": "0",
    "memory.swap.events": "high 0\nmax 0\nfail 0",
    "cpu.stat": "usage_usec 531350\nuser_usec 424286\nsystem_usec 107063\n...",
    "memory.current": "614400",
    "memory.stat": "anon 0\nfile 110592\nkernel 176128\nkernel_stack 0\n...",
    "memory.numa_stat": "anon N0=0\nfile N0=110592\nkernel_stack N0=0\n...",
    "memory.peak": "49258496"
}

These rely on the parent control group allowing creation of sub-control groups.

This is generally not the case, as processes run in the /user.slice/user-XXXX.slice/session-YYYYY.scope cgroup, and systemd does not allow creation of sub-groups directly in /user.slice/user-XXXX.slice/.

A workaround is to start an interactive shell in a transient systemd unit using systemd-run --user -S, which creates a new cgroup /user.slice/user-XXXX.slice/user@XXXX.service/app.slice/run-uZZZ.service, and to run swh-graph inside it.

The user also needs permission to use some cgroup controllers, which can be granted with systemctl edit user@XXXX.service by adding:

[Service]
Delegate=pids memory cpu cpuacct io
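Putting it together, a sketch of the workaround (the user id 1000 and the final task invocation are placeholders; adjust to your setup):

# As root: grant controller delegation to the user manager by adding the
# [Service] / Delegate= lines shown above, then log in again.
systemctl edit user@1000.service

# As the user: open an interactive shell in a new transient unit,
# which gets its own cgroup where sub-groups can be created.
systemd-run --user -S

# Inside that shell: run the workflow, e.g.
swh graph luigi ... CompressGraph -- ...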

Graph export#

This section describes tasks which export a graph from the archive to ORC (and/or CSV) files. This is referred to as the “graph export”, not to be confused with the “compressed graph” (even though both are compressed).

There are three important tasks to deal with the graph export: ExportGraph, which produces the export itself; UploadExportToS3, which uploads it to S3; and DownloadExportFromS3, which fetches an existing export from S3.

In detail:

ExportGraph#

Implemented by swh.dataset.luigi.ExportGraph.

This consumes objects from the journal and writes a set of ORC (and/or edges CSV) files which contain all the data in the Software Heritage archive.

Example invocation:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    ExportGraph \
    -- \
    --scheduler-url http://localhost:50092/ \
    --ExportGraph-config ~/luigid/graph.prod.yml \
    --ExportGraph-processes 96

or, equivalently:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    ExportGraph \
    -- \
    --scheduler-url http://localhost:50092/ \
    --config ~/luigid/graph.prod.yml \
    --processes 96

~/luigid/graph.prod.yml must contain at least a journal block.
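A minimal sketch of such a file, assuming a Kafka journal reachable at a placeholder address (the broker address and group id are examples; production setups typically also need authentication settings):

journal:
  brokers:
    - broker1.journal.example.org:9092
  group_id: swh-dataset-export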

UploadExportToS3#

Implemented by swh.dataset.luigi.UploadExportToS3.

DownloadExportFromS3#

Implemented by swh.dataset.luigi.DownloadExportFromS3.
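A sketch of an invocation, mirroring the DownloadGraphFromS3 example further below (paths and scheduler URL as in the other examples on this page):

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    --s3-prefix s3://softwareheritage/graph/ \
    DownloadExportFromS3 \
    -- \
    --scheduler-url http://localhost:50092/ \
    --log-level INFO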

CreateAthena#

Implemented by swh.dataset.luigi.CreateAthena.

Depends on UploadExportToS3 and creates Amazon Athena tables for the ORC dataset.
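A sketch of an invocation using only the common parameters described above (this is not a verified command; depending on the setup, an Athena output location parameter may also be needed, as in the RunExportCompressUpload example below):

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    --s3-prefix s3://softwareheritage/graph/ \
    --athena-prefix swh \
    CreateAthena \
    -- \
    --scheduler-url http://localhost:50092/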

LocalExport#

Implemented by swh.dataset.luigi.LocalExport.

This is a pseudo-task used as a dependency by other tasks which need a graph export, but do not care whether it is generated locally or downloaded when missing.

It is configured through either --LocalExport-export-task-type DownloadExportFromS3 (the default) or --LocalExport-export-task-type ExportGraph (to locally produce a new export from scratch).
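For example, one might force a task that needs the export (here CompressGraph, as an illustration) to produce it locally rather than download it. This sketch only shows the relevant parameter; the other CompressGraph parameters from the example below still apply:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    CompressGraph \
    -- \
    --scheduler-url http://localhost:50092/ \
    --LocalExport-export-task-type ExportGraph \
    --ExportGraph-config ~/luigid/graph.prod.yml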

RunExportAll#

Implemented by swh.dataset.luigi.RunExportAll.

This is a pseudo-task which depends on ExportGraph, CreateAthena, and UploadExportToS3.
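An example call would be analogous to the RunExportCompressUpload one further below; this sketch reuses the same parameters, all of which appear elsewhere on this page:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --s3-prefix s3://softwareheritage/graph/ \
    --athena-prefix swh \
    --dataset-name 2022-12-07 \
    RunExportAll \
    -- \
    --scheduler-url http://localhost:50092/ \
    --RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07/ \
    --ExportGraph-config ~/luigid/graph.prod.yml \
    --ExportGraph-processes 96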

Compressed graph#

There are three important tasks to deal with the compressed graph: CompressGraph, which runs the compression pipeline; UploadGraphToS3, which uploads the result to S3; and DownloadGraphFromS3, which fetches an existing compressed graph from S3.

In detail:

CompressGraph#

Implemented by swh.graph.luigi.compressed_graph.CompressGraph. It depends on all leaf tasks of the compression pipeline, so they do not need to be called directly.

An example call is:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --s3-prefix s3://softwareheritage/graph/ \
    --athena-prefix swh \
    --dataset-name 2022-12-07 \
    CompressGraph \
    -- \
    --scheduler-url http://localhost:50092/ \
    --RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07/ \
    --ExportGraph-config ~/luigid/graph.prod.yml \
    --ExportGraph-processes 96

Note the final parameters: they are passed to the tasks it depends on, not directly to CompressGraph.

UploadGraphToS3#

Implemented by swh.graph.luigi.compressed_graph.UploadGraphToS3.
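An invocation would mirror the DownloadGraphFromS3 example below (a sketch using the same parameters):

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    --s3-prefix s3://softwareheritage/graph/ \
    UploadGraphToS3 \
    -- \
    --scheduler-url http://localhost:50092/ \
    --log-level INFO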

DownloadGraphFromS3#

Implemented by swh.graph.luigi.compressed_graph.DownloadGraphFromS3.

Example call:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --dataset-name 2022-12-07 \
    --s3-prefix s3://softwareheritage/graph/ \
    -- \
    --scheduler-url http://localhost:50092/ \
    --log-level INFO \
    DownloadGraphFromS3

RunExportCompressUpload#

Implemented by swh.graph.luigi.RunExportCompressUpload.

This is a pseudo-task which depends on ExportGraph, CreateAthena, CompressGraph, and UploadGraphToS3.

An example call is:

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --s3-prefix s3://softwareheritage/graph/ \
    --athena-prefix swh \
    --dataset-name 2022-12-07 \
    RunExportCompressUpload \
    -- \
    --scheduler-url http://localhost:50092/ \
    --RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07/ \
    --ExportGraph-config ~/luigid/graph.prod.yml \
    --ExportGraph-processes 96

Or, for a partial subgraph (note that --export-name keeps the original export date, because it reuses the same export but produces a different compressed graph):

swh graph luigi \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --s3-prefix s3://softwareheritage/graph/ \
    --athena-prefix swh \
    --dataset-name 2022-12-07-history-hosting \
    --export-name 2022-12-07 \
    RunExportCompressUpload \
    -- \
    --scheduler-url http://localhost:50092/ \
    --RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07-history-hosting/ \
    --ExportGraph-config ~/luigid/graph.prod.yml \
    --ExportGraph-processes 96 \
    --CompressGraph-object-types ori,snp,rel,rev

LocalGraph#

Implemented by swh.graph.luigi.LocalGraph.

This is a pseudo-task used as a dependency by other tasks which need a compressed graph, but do not care whether it is generated locally or downloaded when missing.

It is configured through either --LocalGraph-compression-task-type DownloadGraphFromS3 (the default) or --LocalGraph-compression-task-type CompressGraph (to locally compress a new graph from scratch).
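For example, to make a downstream task (such as RunBlobDataset in the next section) compress the graph locally instead of downloading it, the parameter is added to the direct task parameters. A sketch using the placeholders from the general syntax above:

swh graph luigi <common_parameters> RunBlobDataset \
    -- \
    --LocalGraph-compression-task-type CompressGraph \
    <other_direct_task_parameters>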

Blobs datasets#

swh.graph.luigi.blobs_datasets contains tasks to extract a subset of blobs from the archive, usually based on their names. It is normally triggered through RunBlobDataset. See the module’s documentation for details on other tasks.

RunBlobDataset#

Runs all tasks to select, download, and analyze a blob dataset.

Example call, to generate the license dataset:

swh graph luigi \
    --graph-base-directory /dev/shm/swh-graph/2022-12-07/ \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --previous-dataset-name 2022-04-25 \
    --dataset-name 2022-12-07 \
    --s3-prefix s3://softwareheritage/derived_datasets/ \
    --athena-prefix swh \
    --s3-athena-output-location s3://softwareheritage/tmp/athena \
    --grpc-api localhost:50093 \
    -- \
    --scheduler-url http://localhost:50092/ \
    --log-level INFO \
    RunBlobDataset \
    --blob-filter license \
    --DownloadBlobs-download-url 'https://softwareheritage.s3.amazonaws.com/content/{sha1}' \
    --DownloadBlobs-decompression-algo gzip

In particular, note the optional --previous-dataset-name parameter, which reuses a previous version of the blob dataset to speed up tasks by running them incrementally.

File names#

Provenance#

Origin contributors#

RunOriginContributors#

Example call:

swh graph luigi \
    --graph-base-directory /dev/shm/swh-graph/2022-12-07/ \
    --base-directory /poolswh/softwareheritage/vlorentz/ \
    --base-sensitive-directory /poolswh/softwareheritage/vlorentz/sensitive_datasets \
    --athena-prefix swh \
    --dataset-name 2022-12-07 \
    RunOriginContributors \
    -- \
    --scheduler-url http://localhost:50092/

Topology#