Exporting a dataset
This repository is meant to gather various pipelines for generating datasets of Software Heritage data, so that they can be used internally or by external researchers.
Graph dataset
Exporting the full dataset
Right now, the only supported export pipeline is the Graph Dataset, a set of
relational tables representing the Software Heritage Graph, as documented in
Software Heritage Graph Dataset. It can be run using the swh dataset graph export
command.
This dataset can be exported in two different formats: orc and edges.
To export a graph, you need to provide a comma-separated list of formats to
export with the --formats
option. You also need an export ID, a unique
identifier used by the Kafka server to store the current progress of the
export.
Note: exporting as the edges
format is discouraged, as it is redundant
and can easily be generated directly from the ORC format.
Here is an example command to start a graph dataset export:
swh dataset -C graph_export_config.yml graph export \
    --formats orc \
    --export-id 2022-04-25 \
    -p 64 \
    /srv/softwareheritage/hdd/graph/2022-04-25
This command usually takes more than a week for a full export; it is therefore advised to run it as a service or inside a tmux session.
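For example, one way to keep the export running across SSH disconnections is to wrap it in a detached tmux session; this is only a sketch, the session name is arbitrary and the paths and options are the same examples as above:
# Run the export in a detached tmux session (session name "graph-export" is arbitrary)
tmux new-session -d -s graph-export \
    "swh dataset -C graph_export_config.yml graph export \
        --formats orc --export-id 2022-04-25 -p 64 \
        /srv/softwareheritage/hdd/graph/2022-04-25"
# Re-attach later to check on progress
tmux attach -t graph-export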
The configuration file should contain the configuration for the swh-journal clients, as well as various configuration options for the exporters. Here is an example configuration file:
journal:
    brokers:
    - kafka1.internal.softwareheritage.org:9094
    - kafka2.internal.softwareheritage.org:9094
    - kafka3.internal.softwareheritage.org:9094
    - kafka4.internal.softwareheritage.org:9094
    security.protocol: SASL_SSL
    sasl.mechanisms: SCRAM-SHA-512
    max.poll.interval.ms: 1000000

remove_pull_requests: true
The following configuration options can be used for the export:
remove_pull_requests: remove all edges from origin to snapshot matching refs/* but not matching refs/heads/* or refs/tags/*. This removes all the pull requests that are present in Software Heritage (archived with git clone --mirror).
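As a rough illustration of this filtering rule (this is not the actual swh-dataset implementation, only a sketch of the pattern matching it describes):
# Sketch of the ref-filtering rule only; not the exporter's actual code.
is_pull_request_ref() {
    case "$1" in
        refs/heads/*|refs/tags/*) return 1 ;;  # kept: regular branches and tags
        refs/*)                   return 0 ;;  # dropped: e.g. refs/pull/123/head
        *)                        return 1 ;;  # kept: e.g. HEAD
    esac
}
is_pull_request_ref "refs/pull/123/head" && echo "dropped"
is_pull_request_ref "refs/heads/main"    || echo "kept"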
Uploading to S3 and to the annex
The dataset should then be made available publicly by uploading it to S3 and to the public annex.
For S3:
aws s3 cp --recursive /srv/softwareheritage/hdd/graph/2022-04-25/orc s3://softwareheritage/graph/2022-04-25/orc
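Optionally (this is not part of the official procedure), you can sanity-check the upload by comparing the number of local files with the number of uploaded objects:
# Optional sanity check: both counts should match
find /srv/softwareheritage/hdd/graph/2022-04-25/orc -type f | wc -l
aws s3 ls --recursive s3://softwareheritage/graph/2022-04-25/orc/ | wc -l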
For the annex:
scp -r 2022-04-25/orc saam.internal.softwareheritage.org:/srv/softwareheritage/annex/public/dataset/graph/2022-04-25/
ssh saam.internal.softwareheritage.org
cd /srv/softwareheritage/annex/public/dataset/graph
git annex add 2022-04-25
git annex sync --content
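Optionally, you can check that the annex now tracks the new content, for instance:
# Optional check: list where the annexed content is available
git annex whereis 2022-04-25/orc | head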
Documenting the new dataset
In the swh-dataset repository, edit the file docs/graph/dataset.rst to document the availability of the new dataset. You should usually mention:
- the name of the dataset version (e.g., 2022-04-25)
- the number of nodes
- the number of edges
- the available formats (notably whether the graph is also available in its compressed representation)
- the total on-disk size of the dataset
- the buckets/URIs to obtain the graph from S3 and from the annex
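Some of these figures can be collected directly from the export directory; for instance, using the same example paths as above:
# Total on-disk size of the dataset
du -sh /srv/softwareheritage/hdd/graph/2022-04-25/orc
# Contents of the ORC export directory
ls /srv/softwareheritage/hdd/graph/2022-04-25/orc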