Dataset
We aim to provide regular exports of the Software Heritage graph in two different formats:
Columnar data storage: a set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
Compressed graph: a compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed in the Boldi-Vigna representation, designed to be loaded by the WebGraph framework, specifically using our swh-graph library.
See also Using Software Heritage data.
Terms of Use
Usage of the datasets from the Software Heritage archive is covered by our Ethical Charter and the Terms of use for bulk access.
Downloading the datasets
All datasets below are available publicly and with no login required, subject to the terms of use above. After installing awscli, datasets hosted on Amazon S3 can be downloaded with this command:
aws s3 cp s3://softwareheritage/graph/... ./target/path/ --recursive --no-sign-request
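The command above can be wrapped in a small helper. This is only a sketch: it assumes the `s3://softwareheritage/graph/<dataset>/<format>` layout used by the S3 locations listed on this page, and the function names `swh_s3_uri` and `swh_download` are hypothetical, not part of awscli or swh-graph.

```shell
# Hypothetical helpers; nothing is downloaded until swh_download is called.

# Build the S3 URI of one export, e.g. "2024-05-16" and "compressed".
swh_s3_uri() {
    local dataset="$1" format="$2"
    printf 's3://softwareheritage/graph/%s/%s\n' "$dataset" "$format"
}

# Download one export into a local directory, without AWS credentials.
swh_download() {
    local dataset="$1" format="$2" target="$3"
    mkdir -p "$target"
    aws s3 cp "$(swh_s3_uri "$dataset" "$format")" "$target" \
        --recursive --no-sign-request
}

# Example (several TiB of data, so it is left commented out):
# swh_download 2024-05-16 compressed ./swh-graph-2024-05-16
```

Keeping the URI construction separate from the download makes it possible to print and check the location before starting a multi-terabyte transfer.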
The latest compressed graphs contain some .zst files, which must be decompressed with unzstd before they can be used with swh-graph.
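One possible way to decompress every .zst file in a downloaded directory is sketched below. The function names `list_zst` and `decompress_all_zst` are hypothetical; the sketch assumes `find` and `unzstd` are installed, and it uses the `--rm` option of zstd to delete each archive once it has been decompressed successfully.

```shell
# List the .zst files below a directory, one path per line, sorted.
list_zst() {
    find "$1" -type f -name '*.zst' | sort
}

# Decompress them in place; --rm removes each .zst file after success.
decompress_all_zst() {
    find "$1" -type f -name '*.zst' -exec unzstd --rm {} +
}

# Example:
# list_zst ./target/path/            # check what would be decompressed
# decompress_all_zst ./target/path/  # then decompress
```

Dropping `--rm` keeps the original archives, at the cost of extra disk space.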
Summary of dataset versions
Full graph:
| Name | # Nodes | # Edges | Columnar | Compressed |
|---|---|---|---|---|
| 2024-05-16 | 38,977,225,252 | 604,179,689,399 | ✔ | ✔ |
| 2023-09-06 | 34,121,566,250 | 517,399,308,984 | ✔ | ✔ |
| 2022-12-07 | 27,397,574,122 | 416,565,871,870 | ✔ | ✔ |
| 2022-04-25 | 25,340,003,875 | 375,867,687,011 | ✔ | ✔ |
| 2021-03-23 | 20,667,308,808 | 232,748,148,441 | ✔ | ✔ |
| 2020-12-15 | 19,330,739,526 | 213,848,749,638 | ✗ | ✔ |
| 2020-05-20 | 17,075,708,289 | 203,351,589,619 | ✗ | ✔ |
| 2019-01-28 | 11,683,687,950 | 159,578,271,511 | ✔ | ✔ |
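As a rough indication of graph density, the node and edge counts in the table can be turned into an average number of edges per node. The sketch below uses the figures from the first row (38,977,225,252 nodes and 604,179,689,399 edges); the function name `edges_per_node` is hypothetical.

```shell
# Print the average number of edges per node, to two decimal places.
edges_per_node() {
    awk -v nodes="$1" -v edges="$2" 'BEGIN { printf "%.2f\n", edges / nodes }'
}

# Figures from the first row of the table above.
edges_per_node 38977225252 604179689399
```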
Teaser datasets:
| Name | # Nodes | # Edges | Columnar | Compressed |
|---|---|---|---|---|
| | 176,569,127 | 11,322,432,687 | ✔ | ✔ |
| | 45,691,499 | 1,221,283,907 | ✔ | ✔ |
| | 1,083,011,764 | 27,919,670,049 | ✗ | ✔ |
| | 304,037,235 | 9,516,984,175 | ✗ | ✔ |
| | ? | ? | ✔ | ✗ |
| | 27,363,226 | 346,413,337 | ✔ | ✔ |
Full graph datasets
Because of their size, some of the latest datasets are only available for download from Amazon S3.
2024-05-16
A full export of the graph dated from May 2024.
Columnar tables (Apache ORC):
Total size: 18 TiB
S3: s3://softwareheritage/graph/2024-05-16/orc
Compressed graph:
Total size: 11 TiB
S3: s3://softwareheritage/graph/2024-05-16/compressed
This graph export contains all files needed by the Rust implementation of swh-graph, so running swh-graph/tools/swh-graph-java2rust.sh is no longer necessary.
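Given the total sizes listed for each export, it can be worth checking the available disk space before starting a download. The following is a minimal sketch, assuming a POSIX `df` and `awk`; the function name `has_free_tib` is hypothetical.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 TiB free.
has_free_tib() {
    local dir="$1" needed_tib="$2" avail_kib
    # With -P (POSIX format) and -k, column 4 of line 2 is available KiB.
    avail_kib="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    # 1 TiB = 1024 * 1024 * 1024 KiB; awk handles fractional sizes like 8.8.
    awk -v avail="$avail_kib" -v need="$needed_tib" \
        'BEGIN { exit !(avail >= need * 1024 * 1024 * 1024) }'
}

# Example: check for the 11 TiB needed by the 2024-05-16 compressed graph.
# has_free_tib ./target/path/ 11 || echo "not enough free space" >&2
```

The listed sizes are download sizes, so any .zst files that are decompressed afterwards will need additional space.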
2023-09-06
A full export of the graph dated from September 2023.
Columnar tables (Apache ORC):
Total size: 15 TiB
S3: s3://softwareheritage/graph/2023-09-06/orc
Compressed graph:
Total size: 8.8 TiB
S3: s3://softwareheritage/graph/2023-09-06/compressed
2022-12-07
A full export of the graph dated from December 2022.
Columnar tables (Apache ORC):
Total size: 13 TiB
S3: s3://softwareheritage/graph/2022-12-07/orc
Compressed graph:
Total size: 7.1 TiB
S3: s3://softwareheritage/graph/2022-12-07/compressed
“History and hosting” compressed graph:
This is a compressed graph of only the “history and hosting” layer (origins, snapshots, releases, revisions) and the root directory (or, rarely, content) of every revision/release; most directories and contents are excluded.
Total size: 1 TiB
S3: s3://softwareheritage/graph/2022-12-07-history-hosting/compressed
Erratum:
2022-04-25
A full export of the graph dated from April 2022.
Columnar tables (Apache ORC):
Total size: 11 TiB
S3: s3://softwareheritage/graph/2022-04-25/orc
Compressed graph:
Total size: 6.5 TiB
S3: s3://softwareheritage/graph/2022-04-25/compressed
2021-03-23
A full export of the graph dated from March 2021.
Columnar tables (Apache ORC):
Total size: 8.4 TiB
S3: s3://softwareheritage/graph/2021-03-23/orc
Compressed graph:
S3: s3://softwareheritage/graph/2021-03-23/compressed
2020-12-15
A full export of the graph dated from December 2020.
This export has a CSV representation of nodes and edges instead of columnar:
edges as graph.edges.cnt,ori,rel,rev,snp.csv.zst and graph.edges.dir.00..21.csv.zst
nodes as graph.nodes.csv.zst
deduplicated labels as graph.labels.csv.zst
statistics as graph.edges.count.txt, graph.edges.stats.txt, graph.labels.count.txt, graph.nodes.count.txt, and graph.nodes.stats.txt
Compressed graph:
S3: s3://softwareheritage/graph/2020-12-15/compressed
Edges:
S3: s3://softwareheritage/graph/2020-12-15/edges
2020-05-20
A full export of the graph dated from May 2020. Only available in compressed representation. (DEPRECATED: known issue with missing snapshot edges.)
Compressed graph:
2019-01-28
A full export of the graph dated from January 2019. The export was done in two phases, one of them called “2018-09-25” and the other “2019-01-28”. They both refer to the same dataset, but the different formats have various inconsistencies between them. (DEPRECATED: early export pipeline, various inconsistencies).
Columnar tables (Apache Parquet):
Total size: 1.2 TiB
S3: s3://softwareheritage/graph/2018-09-25/parquet
Compressed graph: