Exporting a dataset#
This repository aims to contain various pipelines to generate datasets of Software Heritage data, so that they can be used internally or by external researchers.
Exporting the full dataset#
Right now, the only supported export pipeline is the Graph Dataset, a set of
relational tables representing the Software Heritage Graph, as documented in
Software Heritage Graph Dataset. It can be run using the
swh dataset graph export command.
This dataset can be exported in two different formats: orc and edges.
To export a graph, you need to provide a comma-separated list of formats to
export with the --formats option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.
Note: exporting in the edges format is discouraged, as it is redundant
and can easily be generated directly from the ORC format.
Here is an example command to start a graph dataset export:
swh dataset -C graph_export_config.yml graph export \
    --formats orc \
    --export-id 2022-04-25 \
    -p 64 \
    /srv/softwareheritage/hdd/graph/2022-04-25
A full export usually takes more than a week, so it is advisable to run this command in a service or a tmux session.
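As a minimal sketch of the tmux approach, the helper below assembles the export command shown above from an export ID, and the commented lines show how it could be handed to a detached tmux session. The function name build_export_cmd, the EXPORT_BASE_DIR variable, and the session name graph-export are illustrative assumptions, not part of swh-dataset.

```shell
#!/bin/sh
# Sketch only: build the export command line for a given export ID.
# build_export_cmd and EXPORT_BASE_DIR are hypothetical names.

build_export_cmd() {
    export_id="$1"
    formats="${2:-orc}"   # comma-separated list, defaults to orc
    base_dir="${EXPORT_BASE_DIR:-/srv/softwareheritage/hdd/graph}"
    printf 'swh dataset -C graph_export_config.yml graph export --formats %s --export-id %s -p 64 %s/%s\n' \
        "$formats" "$export_id" "$base_dir" "$export_id"
}

# Usage (not executed here): start the export in a detached tmux session,
# then attach to it later to check on progress.
#   tmux new-session -d -s graph-export "$(build_export_cmd 2022-04-25)"
#   tmux attach -t graph-export
```

Running the command inside a named session means it survives a dropped SSH connection and can be found again by name.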
The configuration file should contain the configuration for the swh-journal clients, as well as various configuration options for the exporters. Here is an example configuration file:
journal:
    brokers:
        - kafka1.internal.softwareheritage.org:9094
        - kafka2.internal.softwareheritage.org:9094
        - kafka3.internal.softwareheritage.org:9094
        - kafka4.internal.softwareheritage.org:9094
    security.protocol: SASL_SSL
    sasl.mechanisms: SCRAM-SHA-512
    max.poll.interval.ms: 1000000

remove_pull_requests: true
The following configuration options can be used for the export:
remove_pull_requests: remove all edges from origin to snapshot matching
refs/* but not matching refs/tags/*. This removes all the pull requests that
are present in Software Heritage (archived with git clone --mirror).
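To make the matching rule concrete, here is a small sketch that classifies a branch name according to a literal reading of the description above. It is not the exporter's code: the function name is_removed_ref and the KEEP_PREFIXES variable are hypothetical, and the real implementation may keep additional namespaces beyond the one named here.

```shell
#!/bin/sh
# Sketch only: literal reading of the remove_pull_requests rule.
# KEEP_PREFIXES is a hypothetical knob, not an swh-dataset option.
KEEP_PREFIXES="${KEEP_PREFIXES:-refs/tags/}"

# Returns 0 (true) if the branch name would be removed, 1 otherwise.
is_removed_ref() {
    ref="$1"
    case "$ref" in
        refs/*) ;;          # under refs/: candidate for removal
        *) return 1 ;;      # anything else is never removed
    esac
    for prefix in $KEEP_PREFIXES; do
        case "$ref" in
            "$prefix"*) return 1 ;;   # explicitly kept namespace
        esac
    done
    return 0
}

# Usage:
#   is_removed_ref refs/pull/12/head && echo "removed"
#   is_removed_ref refs/tags/v1.0    || echo "kept"
```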
Uploading on S3 & on the annex#
The dataset should then be made publicly available by uploading it to S3 and to the public annex.
For S3:
aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
For the annex:
scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
ssh saam.internal.softwareheritage.org
cd /srv/softwareheritage/annex/public/dataset/graph
git annex add 2022-04-25
git annex sync --content
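Both upload destinations embed the export ID, so a mistyped date in one of them is easy to miss. The sketch below derives every path from a single argument and prints the two copy commands instead of running them, so they can be reviewed first. The function name print_upload_cmds is hypothetical. The bucket, host, and directories are copied from the commands above, except that the scp source is written as an absolute path.

```shell
#!/bin/sh
# Sketch only: print the upload commands for one export ID and format.
# print_upload_cmds is a hypothetical helper; nothing is uploaded.

print_upload_cmds() {
    export_id="$1"
    format="${2:-orc}"
    local_dir="/srv/softwareheritage/hdd/graph/$export_id/$format"
    annex_dir="/srv/softwareheritage/annex/public/dataset/graph"
    # S3 upload
    echo "aws s3 cp --recursive $local_dir s3://softwareheritage/graph/$export_id/$format"
    # Copy to the annex host (git annex add / sync still has to be run there)
    echo "scp -r $local_dir saam.internal.softwareheritage.org:$annex_dir/$export_id/"
}

# Usage:
#   print_upload_cmds 2022-04-25
```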
Documenting the new dataset#
In the swh-dataset repository, edit the file that lists the available datasets
to document the availability of the new dataset. You should usually mention:
the name of the dataset version (e.g., 2022-04-25)
the number of nodes
the number of edges
the available formats (notably whether the graph is also available in its compressed representation)
the total on-disk size of the dataset
the buckets/URIs to obtain the graph from S3 and from the annex
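Two of the items above, the available formats and the total on-disk size, can be read off the export directory. The following is a rough sketch under the assumption that each exported format lives in its own subdirectory (as with the orc directory used earlier). The function names are hypothetical, and node and edge counts are not covered.

```shell
#!/bin/sh
# Sketch only: gather figures for the documentation entry.
# Assumes one subdirectory per exported format under the export directory.

# Total on-disk size in KiB (du -sk prints "<KiB><TAB><path>").
dataset_size_kib() {
    du -sk "$1" | cut -f1
}

# Names of the format subdirectories, one per line.
list_formats() {
    for d in "$1"/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Usage:
#   dataset_size_kib /srv/softwareheritage/hdd/graph/2022-04-25
#   list_formats /srv/softwareheritage/hdd/graph/2022-04-25
```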