.. _swh-datasets-luigi:
Luigi workflows
===============
.. highlight:: bash
The preferred way to create and transfer swh-graph data is through
`Luigi <https://luigi.readthedocs.io/>`_ tasks/workflows rather than a regular CLI.
These include all common operations in swh-graph, except running the gRPC server (
:program:`swh graph grpc-serve`).
Using Luigi allows automatically building missing prerequisites datasets in order
to build a new dataset, which is common when working with swh-graph data.
The :program:`swh datasets luigi` CLI wraps Luigi's CLI to simplify
passing common parameters to tasks.
Command lines usually look like this::
    swh datasets luigi <common parameters> <TaskName> -- <luigi parameters>
where:
* ``<common parameters>`` are the parameters exposed by :program:`swh datasets luigi`.
  See the CLI documentation for the full list of common parameters. The most important
  ones are:
  * ``--dataset-name`` is the name of the dataset to work on, which is the date of
    the export, optionally with a suffix (e.g. ``2022-12-07`` or
    ``2022-12-07-history-hosting``)
  * ``--base-directory``, which is the root directory for all datasets. It contains
    a subdirectory named after ``--dataset-name``, which contains the data the workflow
    will work with.
  * ``--graph-base-directory``, the location of the compressed graph. To obtain
    reasonable performance, it should be a tmpfs set up as described in the
    :ref:`swh-graph-memory` documentation.
  * ``--s3-prefix``, which is usually ``s3://softwareheritage/graph/``
  * ``--athena-prefix``, which is a prefix for tables created in Amazon Athena.
    Actual table names will be this prefix, followed by ``_`` and the dataset name
    with dashes stripped (e.g. ``swh_20221207`` for prefix ``swh`` and dataset
    ``2022-12-07``)
* ``<TaskName>`` is the CamelCase (usually imperative-form) name of the last task
  to run (tasks it depends on will automatically run first)
* ``<luigi parameters>`` are parameters passed directly to Luigi.
  In short, these are ``--scheduler-url``, which takes a URL to the `Luigi scheduler
  <https://luigi.readthedocs.io/en/stable/central_scheduler.html>`_
  (or ``--local-scheduler`` if you do not want one), ``--log-level {INFO,DEBUG,...}``,
  parameters to the main task (``--kebab-cased-parameter``), and parameters to other tasks
  (``--TaskName-kebab-cased-parameter``).
See the `Luigi CLI documentation <https://luigi.readthedocs.io/en/stable/running_luigi.html>`_ for details.
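When using the central scheduler, it must be running before tasks are launched. A
minimal sketch (the port number is arbitrary, chosen to match the examples below)::

    luigid --port 50092

Tasks can then be pointed at it by passing ``--scheduler-url http://localhost:50092/``
after the ``--`` separator.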
Luigi also needs a configuration file with ``{hostname}_ram_mb = 32000`` (assuming 32GB of RAM)
in the `[resources] block <https://luigi.readthedocs.io/en/stable/configuration.html#resources>`_,
given either to the scheduler (when using the central scheduler, i.e. ``--scheduler-url``)
or through ``--luigi-config`` (when using ``--local-scheduler``).
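For example, on a machine named ``example-host`` with 32GB of RAM (the hostname and
the value are placeholders to adapt)::

    [resources]
    example-host_ram_mb = 32000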
.. note::
   Progress report for most tasks is currently displayed only on the standard output,
   and not on the Luigi scheduler dashboard.
Statistics
----------
Tasks may write statistics about the commands they ran, such as CPU time and memory usage.
For example, the compression pipeline writes this for each task in the :file:`meta/compression.json` file::
"cgroup_stats": {
"memory.events": "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\noom_group_kill 0",
"memory.events.local": "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\noom_group_kill 0",
"memory.swap.current": "0",
"memory.zswap.current": "0",
"memory.swap.events": "high 0\nmax 0\nfail 0",
"cpu.stat": "usage_usec 531350\nuser_usec 424286\nsystem_usec 107063\n...",
"memory.current": "614400",
"memory.stat": "anon 0\nfile 110592\nkernel 176128\nkernel_stack 0\n...",
"memory.numa_stat": "anon N0=0\nfile N0=110592\nkernel_stack N0=0\n...",
"memory.peak": "49258496"
}
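These files can be inspected with :program:`jq`; a sketch that assumes only that
``memory.peak`` values appear under ``cgroup_stats`` objects somewhere in the file::

    # print every peak memory usage (in bytes) recorded in the file
    jq '.. | .cgroup_stats? | select(. != null) | ."memory.peak"' meta/compression.json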
Collecting these statistics relies on the parent `control group <https://en.wikipedia.org/wiki/Cgroups>`_
allowing creation of sub-control groups.
This is generally not the case, as processes run in the :file:`/user.slice/user-XXXX.slice/session-YYYYY.scope`
cgroup, and systemd does not allow creation of sub-groups directly in :file:`/user.slice/user-XXXX.slice/`.
A workaround is to start an interactive shell in a transient systemd unit using ``systemd-run --user -S``,
which creates a new cgroup (:file:`/user.slice/user-XXXX.slice/user@XXXX.service/app.slice/run-uZZZ.service`),
and to run swh-graph inside it.
The user also needs permission to use some controllers, which can be configured with
``systemctl edit user@XXXX.service`` by adding::

    [Service]
    Delegate=pids memory cpu cpuacct io
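Putting this together, a hypothetical session (with ``1000`` standing in for the user ID)::

    # as root: allow the user service to delegate cgroup controllers,
    # by adding the [Service] block above
    systemctl edit user@1000.service

    # as the user: start an interactive shell in its own transient unit,
    # and therefore its own cgroup, then run swh-graph commands from it
    systemd-run --user -S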
.. _swh-datasets-luigi-graph-export:
Graph export
------------
This section describes tasks which export a graph from the archive to ORC (and/or CSV)
files. This is referred to as the "graph export", not to be confused with the "compressed
graph" (even though both are compressed).
There are three important tasks to deal with the graph export:
* :ref:`swh-datasets-luigi-ExportGraph` does the export itself
* :ref:`swh-datasets-luigi-RunExportAll` is a pseudo-task which depends on
:ref:`swh-datasets-luigi-ExportGraph`, :ref:`swh-datasets-luigi-CreateAthena`,
and :ref:`swh-datasets-luigi-UploadExportToS3`.
* :ref:`swh-datasets-luigi-LocalExport` is a pseudo-task that any other task should
depend on.
It can be configured by users to either trigger a new export or download it from S3.
In detail:
.. _swh-datasets-luigi-ExportGraph:
ExportGraph
^^^^^^^^^^^
Implemented by :class:`swh.export.luigi.ExportGraph`.
This consumes objects from the :ref:`journal <swh-journal>` and writes a set of ORC
(and/or edges CSV) files which contain all the data in the |swh| archive.
Example invocation::
swh datasets luigi \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--dataset-name 2022-12-07 \
ExportGraph \
-- \
--scheduler-url http://localhost:50092/ \
--ExportGraph-config ~/luigid/graph.prod.yml \
--ExportGraph-processes 96
or, equivalently::
swh datasets luigi \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--dataset-name 2022-12-07 \
ExportGraph \
-- \
--scheduler-url http://localhost:50092/ \
--config ~/luigid/graph.prod.yml \
--processes 96
:file:`~/luigid/graph.prod.yml` must contain at least a ``journal`` configuration block.
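A minimal sketch of such a configuration file (the broker address and group id are
placeholders to adapt)::

    journal:
      brokers:
        - broker1.journal.example.org:9093
      group_id: swh-dataset-export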
.. _swh-datasets-luigi-UploadExportToS3:
UploadExportToS3
^^^^^^^^^^^^^^^^
Implemented by :class:`swh.export.luigi.UploadExportToS3`.
.. _swh-datasets-luigi-DownloadExportFromS3:
DownloadExportFromS3
^^^^^^^^^^^^^^^^^^^^
Implemented by :class:`swh.export.luigi.DownloadExportFromS3`.
.. _swh-datasets-luigi-CreateAthena:
CreateAthena
^^^^^^^^^^^^
Implemented by :class:`swh.export.luigi.CreateAthena`.
Depends on :ref:`swh-datasets-luigi-UploadExportToS3` and creates Amazon Athena tables
for the ORC dataset.
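Once created, the tables can be queried like any other Athena tables. A hypothetical
query counting origins, assuming ``--athena-prefix swh`` and dataset ``2022-12-07``
(the database/table layout and the output location are assumptions following the
naming rule above)::

    aws athena start-query-execution \
        --query-string 'SELECT COUNT(*) FROM swh_20221207.origin;' \
        --result-configuration OutputLocation=s3://softwareheritage/tmp/athena/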
.. _swh-datasets-luigi-LocalExport:
LocalExport
^^^^^^^^^^^
Implemented by :class:`swh.datasets.dataset.LocalExport`.
This is a pseudo-task used as a dependency by other tasks which need a graph export,
but do not care whether it is generated locally or downloaded if missing.
It is configured through either ``--LocalExport-export-task-type DownloadExportFromS3``
(the default) or ``--LocalExport-export-task-type ExportGraph`` (to locally produce a new
export from scratch), as in the sketch below.
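For example, a hypothetical invocation of a task that depends on ``LocalExport``
(here :ref:`swh-datasets-luigi-CompressGraph`), forcing the export to be produced
locally instead of downloaded::

    swh datasets luigi \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --dataset-name 2022-12-07 \
        CompressGraph \
        -- \
        --local-scheduler \
        --LocalExport-export-task-type ExportGraph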
.. _swh-datasets-luigi-RunExportAll:
RunExportAll
^^^^^^^^^^^^
Implemented by :class:`swh.export.luigi.RunExportAll`.
This is a pseudo-task which depends on :ref:`swh-datasets-luigi-ExportGraph`,
:ref:`swh-datasets-luigi-CreateAthena`,
and :ref:`swh-datasets-luigi-UploadExportToS3`.
.. _swh-datasets-luigi-compressed-graph:
Compressed graph
----------------
There are three important tasks to deal with the compressed graph:
* :ref:`swh-datasets-luigi-CompressGraph` does the compression itself (and depends on a graph export)
* :ref:`swh-datasets-luigi-RunExportCompressUpload` is a pseudo-task which depends on
:ref:`swh-datasets-luigi-LocalExport` (so, indirectly :ref:`swh-datasets-luigi-ExportGraph`),
:ref:`swh-datasets-luigi-CreateAthena`, :ref:`swh-datasets-luigi-CompressGraph`,
and :ref:`swh-datasets-luigi-UploadGraphToS3`.
* :ref:`swh-datasets-luigi-LocalGraph` is a pseudo-task that any other task should
depend on.
It can be configured by users to either compress a new graph or download it from S3.
In detail:
.. _swh-datasets-luigi-CompressGraph:
CompressGraph
^^^^^^^^^^^^^
Implemented by :class:`swh.datasets.luigi.compressed_graph.CompressGraph`.
It depends on all leaf tasks
of the compression pipeline, so they do not need to be called directly.
An example call is::
swh datasets luigi \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--s3-prefix s3://softwareheritage/graph/ \
--athena-prefix swh \
--dataset-name 2022-12-07 \
CompressGraph \
-- \
--scheduler-url http://localhost:50092/ \
--RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07/ \
--ExportGraph-config ~/luigid/graph.prod.yml \
--ExportGraph-processes 96
Note the final parameters: they are passed to dependent tasks, not directly to
``CompressGraph``.
.. _swh-datasets-luigi-UploadGraphToS3:
UploadGraphToS3
^^^^^^^^^^^^^^^^
Implemented by :class:`swh.datasets.luigi.compressed_graph.UploadGraphToS3`.
.. _swh-datasets-luigi-DownloadGraphFromS3:
DownloadGraphFromS3
^^^^^^^^^^^^^^^^^^^^
Implemented by :class:`swh.datasets.luigi.compressed_graph.DownloadGraphFromS3`.
Example call::
    swh datasets luigi \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --dataset-name 2022-12-07 \
        --s3-prefix s3://softwareheritage/graph/ \
        -- \
        --scheduler-url http://localhost:50092/ \
        --log-level INFO \
        DownloadGraphFromS3
.. _swh-datasets-luigi-RunExportCompressUpload:
RunExportCompressUpload
^^^^^^^^^^^^^^^^^^^^^^^
Implemented by :class:`swh.datasets.luigi.RunExportCompressUpload`.
This is a pseudo-task which depends on :ref:`swh-datasets-luigi-ExportGraph`,
:ref:`swh-datasets-luigi-CreateAthena`, :ref:`swh-datasets-luigi-CompressGraph`,
and :ref:`swh-datasets-luigi-UploadGraphToS3`.
An example call is::
    swh datasets luigi \
        --base-directory /poolswh/softwareheritage/vlorentz/ \
        --s3-prefix s3://softwareheritage/graph/ \
        --athena-prefix swh \
        --dataset-name 2022-12-07 \
        RunExportCompressUpload \
        -- \
        --scheduler-url http://localhost:50092/ \
        --RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07/ \
        --ExportGraph-config ~/luigid/graph.prod.yml \
        --ExportGraph-processes 96
Or, for a partial subgraph (note that ``--export-name`` is unchanged, because it
uses the same export but produces a different compressed graph)::
swh datasets luigi \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--s3-prefix s3://softwareheritage/graph/ \
--athena-prefix swh \
--dataset-name 2022-12-07-history-hosting \
--export-name 2022-12-07 \
RunExportCompressUpload \
-- \
--scheduler-url http://localhost:50092/ \
--RunExportAll-s3-athena-output-location s3://softwareheritage/tmp/athena/import_of_2022-12-07-history-hosting/ \
--ExportGraph-config ~/luigid/graph.prod.yml \
--ExportGraph-processes 96 \
--CompressGraph-object-types ori,snp,rel,rev
.. _swh-datasets-luigi-LocalGraph:
LocalGraph
^^^^^^^^^^
Implemented by :class:`swh.datasets.luigi.LocalGraph`.
This is a pseudo-task used as a dependency by other tasks which need a compressed graph,
but do not care whether it is compressed locally or downloaded if missing.
It is configured through either ``--LocalGraph-compression-task-type DownloadGraphFromS3``
(the default) or ``--LocalGraph-compression-task-type CompressGraph`` (to locally compress a new
graph from scratch).
Blobs datasets
--------------
:mod:`swh.datasets.luigi.blobs_datasets` contains tasks to extract a subset of blobs
from the archive, usually based on their names.
It is normally triggered through :ref:`swh-datasets-luigi-RunBlobDataset`.
See the module's documentation for details on other tasks.
.. _swh-datasets-luigi-RunBlobDataset:
RunBlobDataset
^^^^^^^^^^^^^^
Runs all tasks to select, download, and analyze a blob dataset.
Example call, to generate the license dataset::
swh datasets luigi \
--graph-base-directory /dev/shm/swh-graph/2022-12-07/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--previous-dataset-name 2022-04-25 \
--dataset-name 2022-12-07 \
--s3-prefix s3://softwareheritage/derived_datasets/ \
--athena-prefix swh \
--s3-athena-output-location s3://softwareheritage/tmp/athena \
--grpc-api localhost:50093 \
-- \
--scheduler-url http://localhost:50092/ \
--log-level INFO \
RunBlobDataset \
--blob-filter license \
--DownloadBlobs-download-url 'https://softwareheritage.s3.amazonaws.com/content/{sha1}' \
--DownloadBlobs-decompression-algo gzip
In particular, note the optional ``--previous-dataset-name`` parameter, which
reuses a previous version of the blob dataset to speed up tasks by running them incrementally.
File names
----------
.. attention::

   This section is incomplete; see :mod:`swh.datasets.luigi.file_names` for documentation.
Provenance
----------
.. attention::

   This section is incomplete; see :mod:`swh.datasets.luigi.provenance` for documentation.
Origin contributors
-------------------
.. attention::

   This section is incomplete; see :mod:`swh.datasets.luigi.origin_contributors` for documentation.
.. _swh-datasets-luigi-RunOriginContributors:
RunOriginContributors
^^^^^^^^^^^^^^^^^^^^^
Example call::
swh datasets luigi \
--graph-base-directory /dev/shm/swh-graph/2022-12-07/ \
--base-directory /poolswh/softwareheritage/vlorentz/ \
--base-sensitive-directory /poolswh/softwareheritage/vlorentz/sensitive_datasets \
--athena-prefix swh \
--dataset-name 2022-12-07 \
RunOriginContributors \
-- \
--scheduler-url http://localhost:50092/
Topology
--------
.. attention::

   This section is incomplete; see :mod:`swh.graph.luigi.topology` for documentation.