Dataset#
We aim to provide regular exports of the Software Heritage graph in two different formats:
Columnar data storage: a set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.
Compressed graph: a compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed using the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.
Summary of dataset versions#
Full graph:
| Name | # Nodes | # Edges | Columnar | Compressed |
|---|---|---|---|---|
| 2023-09-06 | 34,121,566,250 | 517,399,308,984 | ✔ | ✔ |
| 2022-12-07 | 27,397,574,122 | 416,565,871,870 | ✔ | ✔ |
| 2022-04-25 | 25,340,003,875 | 375,867,687,011 | ✔ | ✔ |
| 2021-03-23 | 20,667,308,808 | 232,748,148,441 | ✔ | ✔ |
| 2020-12-15 | 19,330,739,526 | 213,848,749,638 | ✗ | ✔ |
| 2020-05-20 | 17,075,708,289 | 203,351,589,619 | ✗ | ✔ |
| 2019-01-28 | 11,683,687,950 | 159,578,271,511 | ✔ | ✔ |
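To put the node and edge counts above in perspective, the rough sketch below computes the average number of edges per node for each full export. The counts are copied from the table. The row labels are an assumption: they are taken to be the export dates of the full graph sections further down, in the same newest-first order.

```python
# (nodes, edges) for each full export, copied from the summary table.
# Assumption: rows are labeled by export date, newest first.
FULL_EXPORTS = {
    "2023-09-06": (34_121_566_250, 517_399_308_984),
    "2022-12-07": (27_397_574_122, 416_565_871_870),
    "2022-04-25": (25_340_003_875, 375_867_687_011),
    "2021-03-23": (20_667_308_808, 232_748_148_441),
    "2020-12-15": (19_330_739_526, 213_848_749_638),
    "2020-05-20": (17_075_708_289, 203_351_589_619),
    "2019-01-28": (11_683_687_950, 159_578_271_511),
}


def edges_per_node(nodes, edges):
    """Average number of edges per node in one export."""
    return edges / nodes


if __name__ == "__main__":
    for name, (nodes, edges) in FULL_EXPORTS.items():
        print(f"{name}: {edges_per_node(nodes, edges):.2f} edges per node")
```

A ratio like this only gives a first-order feel for graph density; it says nothing about how edges are distributed across node types.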
Teaser datasets:
| Name | # Nodes | # Edges | Columnar | Compressed |
|---|---|---|---|---|
| | 45,691,499 | 1,221,283,907 | ✔ | ✔ |
| | 1,083,011,764 | 27,919,670,049 | ✗ | ✔ |
| | 304,037,235 | 9,516,984,175 | ✗ | ✔ |
| | ? | ? | ✔ | ✗ |
| | 27,363,226 | 346,413,337 | ✔ | ✔ |
Full graph datasets#
Because of their size, some of the latest datasets are only available for download from Amazon S3.
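As a minimal sketch of how one of the `s3://` locations listed in this section could be fetched, the snippet below splits the URI into bucket and key prefix and assembles an AWS CLI command. It assumes the AWS CLI is installed and that the bucket allows anonymous reads (hence `--no-sign-request`). The helper names and the local destination path are hypothetical.

```python
import shlex
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def build_download_command(uri, dest):
    """Build an `aws s3 cp` command that copies a whole prefix to `dest`.

    Assumption: the bucket is publicly readable, so requests are unsigned.
    """
    bucket, prefix = split_s3_uri(uri)
    source = f"s3://{bucket}/{prefix.rstrip('/')}/"
    return ["aws", "s3", "cp", "--no-sign-request", "--recursive", source, dest]


if __name__ == "__main__":
    cmd = build_download_command(
        "s3://softwareheritage/graph/2023-09-06/compressed",
        "/srv/swh/2023-09-06/compressed",  # hypothetical local path
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```

The exports are several TiB each, so it is worth checking free disk space against the "Total size" figures before running the printed command.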
2023-09-06#
A full export of the graph dated from September 2023.
Columnar tables (Apache ORC):
Total size: 15 TiB
S3:
s3://softwareheritage/graph/2023-09-06/orc
Compressed graph:
Total size: 8.8 TiB
S3:
s3://softwareheritage/graph/2023-09-06/compressed
2022-12-07#
A full export of the graph dated from December 2022.
Columnar tables (Apache ORC):
Total size: 13 TiB
S3:
s3://softwareheritage/graph/2022-12-07/orc
Compressed graph:
Total size: 7.1 TiB
S3:
s3://softwareheritage/graph/2022-12-07/compressed
“History and hosting” Compressed graph:
This is a compressed graph of only the “history and hosting” layer (origins, snapshots, releases, revisions) plus the root directory (or, rarely, content) of every revision/release. Most directories and contents are excluded.
Total size: 1 TiB
S3:
s3://softwareheritage/graph/2022-12-07-history-hosting/compressed
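The layer split described above can be illustrated with a small sketch that classifies a node by its type. It assumes nodes are identified by SWHID-style strings of the form `swh:1:<type>:<hex id>`, with `ori` standing for origins; the function names are hypothetical and not part of any library.

```python
# Node types the text lists as the "history and hosting" layer.
# Assumption: identifiers look like "swh:1:<type>:<hex id>", "ori" = origin.
HISTORY_HOSTING_TYPES = {"ori", "snp", "rel", "rev"}
# Filesystem-layer types, mostly excluded from this dataset.
FILESYSTEM_TYPES = {"dir", "cnt"}


def node_type(swhid):
    """Return the three-letter node type of a SWHID-style identifier."""
    parts = swhid.split(":")
    if len(parts) < 4 or parts[0] != "swh" or parts[1] != "1":
        raise ValueError(f"not a SWHID-style identifier: {swhid!r}")
    return parts[2]


def in_history_hosting_layer(swhid):
    """True if the node type belongs to the history and hosting layer."""
    return node_type(swhid) in HISTORY_HOSTING_TYPES


if __name__ == "__main__":
    for example in ("swh:1:rev:" + "0" * 40, "swh:1:dir:" + "0" * 40):
        print(example[:10], in_history_hosting_layer(example))
```

A `dir` or `cnt` node can still appear in this dataset when it is the root of a revision or release; that cannot be decided from the identifier alone and requires looking at the graph edges.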
Erratum:
2022-04-25#
A full export of the graph dated from April 2022.
Columnar tables (Apache ORC):
Total size: 11 TiB
S3:
s3://softwareheritage/graph/2022-04-25/orc
Compressed graph:
Total size: 6.5 TiB
S3:
s3://softwareheritage/graph/2022-04-25/compressed
2021-03-23#
A full export of the graph dated from March 2021.
Columnar tables (Apache ORC):
Total size: 8.4 TiB
S3:
s3://softwareheritage/graph/2021-03-23/orc
Compressed graph:
S3:
s3://softwareheritage/graph/2021-03-23/compressed
2020-12-15#
A full export of the graph dated from December 2020. Only available in compressed representation.
Compressed graph:
2020-05-20#
A full export of the graph dated from May 2020. Only available in compressed representation. (DEPRECATED: known issue with missing snapshot edges.)
Compressed graph:
2019-01-28#
A full export of the graph dated from January 2019. The export was done in two phases, named “2018-09-25” and “2019-01-28”. Both refer to the same dataset, but the different formats are inconsistent with each other in various ways. (DEPRECATED: early export pipeline, various inconsistencies.)
Columnar tables (Apache Parquet):
Total size: 1.2 TiB
S3:
s3://softwareheritage/graph/2018-09-25/parquet
Compressed graph: