Dataset#

We aim to provide regular exports of the Software Heritage graph in two different formats:

  • Columnar data storage: a set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.

  • Compressed graph: a compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed using the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.

Summary of dataset versions#

Full graph:

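| Name       | # Nodes        | # Edges         | Columnar | Compressed |
|------------|----------------|-----------------|----------|------------|
| 2023-09-06 | 34,121,566,250 | 517,399,308,984 |          |            |
| 2022-12-07 | 27,397,574,122 | 416,565,871,870 |          |            |
| 2022-04-25 | 25,340,003,875 | 375,867,687,011 |          |            |
| 2021-03-23 | 20,667,308,808 | 232,748,148,441 |          |            |
| 2020-12-15 | 19,330,739,526 | 213,848,749,638 |          |            |
| 2020-05-20 | 17,075,708,289 | 203,351,589,619 |          |            |
| 2019-01-28 | 11,683,687,950 | 159,578,271,511 |          |            |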
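
The node and edge counts above allow a few quick sanity checks. The following Python sketch uses only the figures copied from the full graph table to compute the average number of edges per node and the growth between two exports. The function names are illustrative and are not part of any Software Heritage tooling.

```python
from typing import Tuple

# (nodes, edges) copied from the "Full graph" table above.
FULL_GRAPH = {
    "2023-09-06": (34_121_566_250, 517_399_308_984),
    "2022-12-07": (27_397_574_122, 416_565_871_870),
    "2022-04-25": (25_340_003_875, 375_867_687_011),
    "2021-03-23": (20_667_308_808, 232_748_148_441),
    "2020-12-15": (19_330_739_526, 213_848_749_638),
    "2020-05-20": (17_075_708_289, 203_351_589_619),
    "2019-01-28": (11_683_687_950, 159_578_271_511),
}


def edges_per_node(name: str) -> float:
    """Average number of edges per node for one export."""
    nodes, edges = FULL_GRAPH[name]
    return edges / nodes


def growth(old: str, new: str) -> Tuple[float, float]:
    """Return (node growth factor, edge growth factor) from `old` to `new`."""
    old_nodes, old_edges = FULL_GRAPH[old]
    new_nodes, new_edges = FULL_GRAPH[new]
    return new_nodes / old_nodes, new_edges / old_edges


if __name__ == "__main__":
    for name in sorted(FULL_GRAPH):
        print(f"{name}: {edges_per_node(name):.1f} edges per node")
    node_factor, edge_factor = growth("2019-01-28", "2023-09-06")
    print(f"nodes x{node_factor:.2f}, edges x{edge_factor:.2f}")
```

Ratios like these give a rough idea of how much memory or storage a newer export needs compared with an older one you have already worked with.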

Teaser datasets:

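| Name                         | # Nodes       | # Edges        | Columnar | Compressed |
|------------------------------|---------------|----------------|----------|------------|
| 2021-03-23-popular-3k-python | 45,691,499    | 1,221,283,907  |          |            |
| 2020-12-15-gitlab-all        | 1,083,011,764 | 27,919,670,049 |          |            |
| 2020-12-15-gitlab-100k       | 304,037,235   | 9,516,984,175  |          |            |
| 2019-01-28-popular-4k        | ?             | ?              |          |            |
| 2019-01-28-popular-3k-python | 27,363,226    | 346,413,337    |          |            |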
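
To get a feel for how small a teaser is, the sketch below compares its node count with that of the full export whose date prefixes its name. It assumes, based on the naming alone, that each teaser is derived from the same-date full export. Counts are copied verbatim from the tables above, unknown "?" cells are treated as missing, and the helper names are hypothetical.

```python
from typing import Optional

# Node counts copied as written in the tables above.
FULL_NODES = {
    "2021-03-23": "20,667,308,808",
    "2020-12-15": "19,330,739,526",
    "2019-01-28": "11,683,687,950",
}
TEASER_NODES = {
    "2021-03-23-popular-3k-python": "45,691,499",
    "2020-12-15-gitlab-all": "1,083,011,764",
    "2020-12-15-gitlab-100k": "304,037,235",
    "2019-01-28-popular-4k": "?",
    "2019-01-28-popular-3k-python": "27,363,226",
}


def parse_count(cell: str) -> Optional[int]:
    """Turn a table cell such as '45,691,499' into an int; '?' becomes None."""
    cell = cell.strip()
    if cell == "?":
        return None
    return int(cell.replace(",", ""))


def node_share(teaser: str) -> Optional[float]:
    """Teaser node count divided by the node count of the same-date full export."""
    teaser_nodes = parse_count(TEASER_NODES[teaser])
    if teaser_nodes is None:
        return None
    # The first 10 characters of a teaser name are the export date (YYYY-MM-DD).
    full_nodes = parse_count(FULL_NODES[teaser[:10]])
    return teaser_nodes / full_nodes


if __name__ == "__main__":
    for name in TEASER_NODES:
        share = node_share(name)
        print(name, "unknown" if share is None else f"{share:.2%}")
```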

Full graph datasets#

Because of their size, some of the latest datasets are only available for download from Amazon S3.

2023-09-06#

A full export of the graph dated from September 2023.

  • Columnar tables (Apache ORC):

    • Total size: 15 TiB

    • S3: s3://softwareheritage/graph/2023-09-06/orc

  • Compressed graph:

    • Total size: 8.8 TiB

    • S3: s3://softwareheritage/graph/2023-09-06/compressed
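
The S3 locations on this page follow the pattern s3://softwareheritage/graph/<dataset>/<format>. As a minimal sketch, the Python helper below builds such a URI and the argument list for a recursive copy with the AWS CLI. It assumes the `aws` command is installed and that the bucket allows anonymous reads (hence `--no-sign-request`); neither is confirmed by this page, and the function names are hypothetical. The script only prints the command and does not run it.

```python
import shlex
from typing import List

BUCKET_PREFIX = "s3://softwareheritage/graph"
FORMATS = ("orc", "compressed")  # the two sub-directories listed on this page


def s3_uri(dataset: str, fmt: str) -> str:
    """Build the S3 URI of one dataset in one format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}, expected one of {FORMATS}")
    return f"{BUCKET_PREFIX}/{dataset}/{fmt}"


def download_command(dataset: str, fmt: str, dest: str) -> List[str]:
    """Argument list for a recursive copy with the AWS CLI.

    --no-sign-request skips credentials; it only works if the bucket
    permits anonymous access, which is an assumption here.
    """
    return [
        "aws", "s3", "cp", "--recursive", "--no-sign-request",
        s3_uri(dataset, fmt), dest,
    ]


if __name__ == "__main__":
    cmd = download_command("2023-09-06", "compressed", "./2023-09-06-compressed")
    print(shlex.join(cmd))
```

Returning an argument list instead of a single string means the command can be passed to `subprocess.run` without shell quoting issues once you have checked the destination has enough free space.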

2022-12-07#

A full export of the graph dated from December 2022.

  • Columnar tables (Apache ORC):

    • Total size: 13 TiB

    • S3: s3://softwareheritage/graph/2022-12-07/orc

  • Compressed graph:

    • Total size: 7.1 TiB

    • S3: s3://softwareheritage/graph/2022-12-07/compressed

  • “History and hosting” Compressed graph:

    • A compressed graph containing only the “history and hosting” layer (origins, snapshots, releases, revisions) plus the root directory (or, rarely, content) of every revision/release; most directories and contents are excluded.

    • Total size: 1 TiB

    • S3: s3://softwareheritage/graph/2022-12-07-history-hosting/compressed

  • Erratum:

2022-04-25#

A full export of the graph dated from April 2022.

  • Columnar tables (Apache ORC):

    • Total size: 11 TiB

    • S3: s3://softwareheritage/graph/2022-04-25/orc

  • Compressed graph:

    • Total size: 6.5 TiB

    • S3: s3://softwareheritage/graph/2022-04-25/compressed
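
Given the sizes listed for the exports above, it is worth estimating how long a download will take before starting one. The sketch below converts a size in TiB and a link speed in Gbit/s into hours. It assumes a sustained transfer rate and ignores protocol overhead, so real transfers will take longer; the sizes are copied from the sections above and the function name is illustrative.

```python
TIB = 2 ** 40  # bytes in one tebibyte

# Total sizes in TiB, copied from the dataset sections above.
SIZES_TIB = {
    ("2023-09-06", "orc"): 15,
    ("2023-09-06", "compressed"): 8.8,
    ("2022-12-07", "orc"): 13,
    ("2022-12-07", "compressed"): 7.1,
    ("2022-12-07-history-hosting", "compressed"): 1,
    ("2022-04-25", "orc"): 11,
    ("2022-04-25", "compressed"): 6.5,
}


def transfer_hours(size_tib: float, link_gbit_per_s: float) -> float:
    """Idealised transfer time in hours at a sustained link speed."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    size_bits = size_tib * TIB * 8
    seconds = size_bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    for (dataset, fmt), size in SIZES_TIB.items():
        hours = transfer_hours(size, link_gbit_per_s=1.0)
        print(f"{dataset} ({fmt}): {size} TiB, about {hours:.0f} h at 1 Gbit/s")
```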

2021-03-23#

A full export of the graph dated from March 2021.

  • Columnar tables (Apache ORC):

  • Compressed graph:

    • S3: s3://softwareheritage/graph/2021-03-23/compressed

2020-12-15#

A full export of the graph dated from December 2020. Only available in compressed representation.

2020-05-20#

A full export of the graph dated from May 2020. Only available in compressed representation. (DEPRECATED: known issue with missing snapshot edges.)

2019-01-28#

A full export of the graph dated from January 2019. The export was done in two phases, one of them called “2018-09-25” and the other “2019-01-28”. They both refer to the same dataset, but the different formats have various inconsistencies between them. (DEPRECATED: early export pipeline, various inconsistencies).

Teaser datasets#

If the above datasets are too big, we also provide “teaser” datasets with a much smaller footprint to help you get started.

2020-12-15-gitlab-all#

A teaser dataset containing the entirety of GitLab, exported in December 2020. Available in compressed graph format.

2020-12-15-gitlab-100k#

A teaser dataset containing the 100k most popular GitLab repositories, exported in December 2020. Available in compressed graph format.