How to generate a compressed graph dataset?#

Intended audience

sysadm staff members

This page describes how to declaratively deploy, in a kubernetes environment, the pipeline which generates a compressed graph dataset.

Over time, this pipeline will evolve to also allow generating derived datasets from the main compressed graph ones.

What are the services?#

It’s composed of 2 (kubernetes) deployments:

  • the luigi-scheduler service, which orchestrates the various luigi tasks in charge of reading, computing and generating the datasets

  • the dataset generator instance, which triggers the top-level luigi task of the compressed graph generation

Where does the dataset pipeline run?#

The dataset pipeline requires a lot of resources (disk, memory). Once computed, the current production dataset is about 5TB and the staging one is about 2TB.

There is no label set on machines yet; it’s up to the human operator to create the persistent volume on a machine that meets those requirements.

For production, that means the future maxxi2 instance. For staging, it’s rancher-node-staging-rke2-metal01.
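Note that for a local persistent volume, the backing directory must exist on the node beforehand. A minimal sketch for staging, using the local path declared later in this page:

root@rancher-node-staging-rke2-metal01:~# mkdir -p /srv/kubernetes/volumes/generated-datasets/2025-11-07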

How to connect to the dataset services?#

To access these services, you need access to the targeted kubernetes cluster (e.g. next-version, staging, production).

Either use k9s: in the pods view, select the pod dataset-$version and hit s to shell into it. Or use kubectl to connect to the pod.
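With kubectl, a minimal sketch (the swh-cassandra namespace and the exact pod name are assumptions, adapt to your cluster):

kubectl -n swh-cassandra get pods | grep dataset
kubectl -n swh-cassandra exec -it dataset-2025-11-07 -- /bin/bash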

How to generate a new compressed graph dataset (no publication)?#

In the following, we will detail how to generate a new compressed dataset step by step.

We will be using sample YAML excerpts of the swh-charts from a staging deployment. Adapt according to your needs.

  1. As mentioned previously, we are using persistent volumes (associated with a node) to store the dataset generated by the pipeline.

    So first, we need to declare 1 persistent volume (pv). It will hold the publicly exposed data (which will eventually be published to s3). Then we declare 2 persistent volume claims (pvc): one uses the newly created pv for the public data, the other is for the sensitive data (we let its pv get created accordingly).

In the following, for the staging environment, we create 1 persistent volume:

modified   swh/values/staging/swh-cassandra.yaml
@@ -2678,6 +2678,27 @@ volumes:
                 operator: In
                 values:
                 - rancher-node-staging-rke2-metal01
+    graph-20251107-generated-persistent-pv:
+      enabled: true
+      appName: datasets-2025-11-07
+      spec:
+        capacity:
+          storage: 500Gi
+        volumeMode: Filesystem
+        accessModes:
+        - ReadWriteOnce
+        persistentVolumeReclaimPolicy: Retain
+        storageClassName: local-persistent
+        local:
+          path: /srv/kubernetes/volumes/generated-datasets/2025-11-07
+        nodeAffinity:
+          required:
+            nodeSelectorTerms:
+            - matchExpressions:
+              - key: kubernetes.io/hostname
+                operator: In
+                values:
+                - rancher-node-staging-rke2-metal01

   persistentVolumeClaims:
     # grpc provenance volume
@@ -2770,6 +2791,27 @@ volumes:
         - ReadWriteOnce
         volumeMode: Filesystem
         storageClassName: local-persistent
+    graph-20251107-generated-persistent-pvc:
+      appName: datasets-2025-11-07
+      spec:
+        resources:
+          requests:
+            storage: 500Gi
+        accessModes:
+        - ReadWriteOnce
+        volumeMode: Filesystem
+        volumeName: graph-20251107-generated-persistent-pv
+        storageClassName: local-persistent
+    graph-sensitive-20251107-generated-persistent-pvc:
+      appName: datasets-2025-11-07
+      spec:
+        resources:
+          requests:
+            storage: 10Gi
+        accessModes:
+        - ReadWriteOnce
+        volumeMode: Filesystem
+        storageClassName: local-persistent
  2. Once we have the volumes ready, we can declare a new dataset $version instance using those newly created volumes. It’s currently in toolbox mode so we can connect and trigger the generation (future work should make it reliably automatic).

modified   swh/values/staging/swh-cassandra.yaml
@@ -2987,6 +2987,20 @@ datasetsGenerator:
     nbProcesses: 32
     # Datasets to generate
     deployments:
+      2025-11-07:
+        enabled: true
+        # So we can trigger the dataset generation manually
+        toolbox: true
+        extraVolumes:
+          dataset-persistent:
+            volumeDefinition:
+              persistentVolumeClaim:
+                claimName: graph-20251107-generated-persistent-pvc
+          dataset-sensitive-persistent:
+            mountPath: /srv/dataset-sensitive
+            volumeDefinition:
+              persistentVolumeClaim:
+                claimName: graph-sensitive-20251107-generated-persistent-pvc
  3. Connect to the newly deployed pod dataset-$version and trigger the dataset pipeline.

swh@dataset-2025-11-07:~$ source venv/bin/activate
(venv) swh@dataset-2025-11-07:~$ time /script/generate-and-publish-compressed-graph-dataset.sh
...

Note: As the generation of such a dataset takes some time (around 1 week for staging, ~2 weeks for production), it is best to run this inside a tmux session.
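For example, assuming tmux is available in the pod (the session name is arbitrary):

swh@dataset-2025-11-07:~$ tmux new -s dataset-generation
swh@dataset-2025-11-07:~$ source venv/bin/activate
(venv) swh@dataset-2025-11-07:~$ time /script/generate-and-publish-compressed-graph-dataset.sh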

  4. Once the dataset generation is complete, the /srv/dataset and /srv/dataset-sensitive folders (mountpoints) hold the dataset. It’s fine to reuse the public dataset mountpoint and provide it to a graph grpc instance as backend. /srv/dataset-sensitive is not currently used.

How to generate and publish a new compressed graph dataset?#

We can also enable the pipeline to generate the compressed graph dataset and publish it to our public s3 bucket. It will publish the compressed graph dataset and the orc files it used to build it. Those orc files could then be reused to generate other derived datasets.

The pipeline mechanism is working the same way as described in the previous chapter.

Its declaration needs a bit more configuration. First, the publish: true instance configuration option must be set (by default it’s false). It also requires a reference to the s3 bucket credentials, publicationCredentialsRef.

For example, in the following, we create a new dataset pipeline instance 2025-11-28 which will generate the compressed graph dataset and publish it to s3. Only access to the bucket is necessary.

modified   swh/values/production/swh-cassandra.yaml
@@ -2870,6 +2870,20 @@ alter:
 # Luigi's scheduler url to access
 luigiSchedulerUrl: http://luigi-scheduler-ingress:80

+# s3 access to bucket for dataset upload
+datasetUploadS3Configuration:
+  region: ${AWS_REGION}
+  secrets:
+    AWS_ACCESS_KEY_ID:
+      secretKeyRef: s3-dataset-upload-secret
+      secretKeyName: access-key
+    AWS_SECRET_ACCESS_KEY:
+      secretKeyRef: s3-dataset-upload-secret
+      secretKeyName: secret-key
+    AWS_REGION:
+      secretKeyRef: s3-dataset-upload-secret
+      secretKeyName: region

+  # The datasets export to build
+  datasets:
+    enabled: true
+    luigiSchedulerUrlRef: luigiSchedulerUrl
+    graphDirectoryPathRef: graphDirectoryPath
+    journalClientConfigurationRef: datasetsJournalClientConfiguration
+    maskingDbConfigurationRef: maskingQueryPostgresqlConfiguration
+    publicationCredentialsRef: datasetUploadS3Configuration
+    nbProcesses: 128
+    # Datasets to generate
+    deployments:
+      2025-11-28:
+        enabled: true
+        # So we can trigger the dataset generation manually
+        toolbox: true
+        publish: true
+        extraVolumes:
+          dataset-persistent:
+            volumeDefinition:
+              persistentVolumeClaim:
+                claimName: comp-graph-dataset-gen-20251128-persistent-local-pvc
+          dataset-sensitive-persistent:
+            mountPath: /srv/dataset-sensitive
+            volumeDefinition:
+              persistentVolumeClaim:
+                claimName: comp-graph-dataset-gen-sensitive-20251128-persistent-local-pvc

Then trigger the generation and publication script as explained before.

The location where the graph will be published in s3 is not configurable in the declaration.

Once the generation and publication of the compressed graph dataset is done, the dataset will be available in s3://softwareheritage/graph/${VERSION} as explained in the dedicated page.

VERSION=2025-11-28
aws s3 ls --no-sign-request "s3://softwareheritage/graph/${VERSION}"

Post compressed graph dataset generation actions#

Decommission previous instance#

As mentioned, once the new dataset is generated, we can decommission the previous dataset instance (to avoid unnecessary disk consumption). In some environments (staging, next-version), such resources are somewhat constrained.
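In practice, decommissioning means disabling (or removing) the previous instance entry in the values file, then deleting its pvc/pv and the data on the node. A minimal sketch, assuming a hypothetical previous instance 2025-05-07:

modified   swh/values/staging/swh-cassandra.yaml
     deployments:
       2025-05-07:
-        enabled: true
+        enabled: false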

Clean up unused data#

Depending on the environment, as already mentioned, the sensitive data is not used, so it could very well be dropped immediately at the end of the dataset generation.

In the pipeline’s current state, only the compressed graph is generated, but we keep all the orc files, which are the heaviest in terms of data size and are not used when running the graph. They will be used in future steps to generate derived datasets. In the meantime, we can drop them to free some disk space.
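For instance, assuming the mountpoints described above and that the orc files live under an orc subfolder of the public dataset mountpoint (an assumption: check the actual layout before deleting anything):

swh@dataset-2025-11-07:~$ du -sh /srv/dataset/* /srv/dataset-sensitive/*
swh@dataset-2025-11-07:~$ rm -rf /srv/dataset-sensitive/*
swh@dataset-2025-11-07:~$ rm -rf /srv/dataset/orc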