Dataset

We aim to provide regular exports of the Software Heritage graph in two different formats:

  • Columnar data storage: a set of relational tables stored in a columnar format such as Apache ORC, which is particularly suited for scale-out analyses on data lakes and big data processing ecosystems such as the Hadoop environment.

  • Compressed graph: a compact and highly efficient representation of the graph dataset, suited for scale-up analysis on high-end machines with large amounts of memory. The graph is compressed in the Boldi-Vigna representation and is designed to be loaded by the WebGraph framework, specifically through our swh-graph library.

See also Using Software Heritage data.

Terms of Use

Usage of the datasets from the Software Heritage archive is covered by our Ethical Charter and the Terms of use for bulk access.

Downloading the datasets

All datasets below are publicly available with no login required, subject to the terms of use above. After installing awscli, you can download datasets hosted on Amazon S3 with this command:

aws s3 cp s3://softwareheritage/graph/... ./target/path/ --recursive --no-sign-request

The latest compressed graphs contain some .zst files, which must be decompressed with unzstd before they can be used with swh-graph.
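The two steps above (download, then decompress) can be scripted. The following is a minimal Python sketch, not an official tool: the helper names are hypothetical, the `s3://softwareheritage/graph/<name>/<format>` layout is inferred from the S3 paths listed on this page, and it assumes that `aws` and `unzstd` are installed and on `PATH`.

```python
import subprocess
from pathlib import Path

# Prefix shared by the S3 paths listed on this page.
BUCKET_PREFIX = "s3://softwareheritage/graph"


def s3_uri(dataset: str, fmt: str) -> str:
    """Return the S3 location of a dataset, e.g. ("2024-08-23", "compressed")."""
    return f"{BUCKET_PREFIX}/{dataset}/{fmt}"


def download_command(dataset: str, fmt: str, target: str) -> list[str]:
    """Build the anonymous, recursive `aws s3 cp` command shown above."""
    return ["aws", "s3", "cp", s3_uri(dataset, fmt), target,
            "--recursive", "--no-sign-request"]


def zst_files(target: str) -> list[Path]:
    """List every .zst file under the download directory, in sorted order."""
    return sorted(Path(target).rglob("*.zst"))


def download_and_decompress(dataset: str, fmt: str, target: str) -> None:
    """Download a dataset, then run unzstd on each .zst file it contains."""
    subprocess.run(download_command(dataset, fmt, target), check=True)
    for path in zst_files(target):
        # unzstd writes the decompressed file next to the original.
        subprocess.run(["unzstd", str(path)], check=True)
```

For example, `download_and_decompress("2022-12-07-history-hosting", "compressed", "./target/path/")` would fetch the 1 TiB "history and hosting" graph described further down this page; nothing is downloaded until that function is called.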

Summary of dataset versions

Full graph:

Name               # Nodes          # Edges
2024-08-23  41,074,031,225  644,153,760,912
2024-05-16  38,977,225,252  604,179,689,399
2023-09-06  34,121,566,250  517,399,308,984
2022-12-07  27,397,574,122  416,565,871,870
2022-04-25  25,340,003,875  375,867,687,011
2021-03-23  20,667,308,808  232,748,148,441
2020-12-15  19,330,739,526  213,848,749,638
2020-05-20  17,075,708,289  203,351,589,619
2019-01-28  11,683,687,950  159,578,271,511
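The node and edge counts above can be compared across exports with a little arithmetic. The sketch below uses three rows copied from the full-graph figures to compute the average number of edges per node and the growth in node count between two exports; the function names are illustrative only.

```python
# (nodes, edges) copied from the full-graph figures above.
FULL_GRAPH = {
    "2024-08-23": (41_074_031_225, 644_153_760_912),
    "2023-09-06": (34_121_566_250, 517_399_308_984),
    "2019-01-28": (11_683_687_950, 159_578_271_511),
}


def edges_per_node(name: str) -> float:
    """Average number of edges per node in the given export."""
    nodes, edges = FULL_GRAPH[name]
    return edges / nodes


def node_growth(old: str, new: str) -> float:
    """Ratio of node counts between two exports (new divided by old)."""
    return FULL_GRAPH[new][0] / FULL_GRAPH[old][0]
```

With these figures, `edges_per_node("2024-08-23")` comes out at roughly 15.7, and `node_growth("2019-01-28", "2024-08-23")` shows the graph holding about three and a half times as many nodes as the first export.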

Teaser datasets:

Name                                 # Nodes         # Edges
2024-08-23-popular-500-python     60,286,526   1,630,768,493
2023-09-06-popular-1k            176,569,127  11,322,432,687
2021-03-23-popular-3k-python      45,691,499   1,221,283,907
2020-12-15-gitlab-all          1,083,011,764  27,919,670,049
2020-12-15-gitlab-100k           304,037,235   9,516,984,175
2019-01-28-popular-4k                      ?               ?
2019-01-28-popular-3k-python      27,363,226     346,413,337

Full graph datasets

Because of their size, some of the latest datasets are only available for download from Amazon S3.
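Given these sizes, it can be worth checking free disk space before starting a download. The following is a minimal sketch using only the Python standard library. The sizes are the "Total size" figures quoted on this page for the 2024-08-23 export; the function names and the `headroom` factor are arbitrary assumptions, the latter meant to leave some room for decompressed `.zst` files.

```python
import shutil

TIB = 2**40  # one tebibyte, in bytes

# "Total size" figures quoted on this page for the 2024-08-23 export, in TiB.
SIZES_TIB = {"orc": 19, "compressed": 11}


def required_bytes(fmt: str, headroom: float = 1.1) -> int:
    """Bytes to reserve for a format; `headroom` is an arbitrary safety margin."""
    return int(SIZES_TIB[fmt] * TIB * headroom)


def has_room(path: str, fmt: str, headroom: float = 1.1) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes(fmt, headroom)
```

Calling `has_room("./target/path/", "compressed")` before the download gives an early warning instead of a failed transfer several terabytes in; the target directory must already exist for `shutil.disk_usage` to inspect it.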

2024-08-23

A full export of the graph, dated August 2024.

  • Columnar tables (Apache ORC):

    • Total size: 19 TiB

    • S3: s3://softwareheritage/graph/2024-08-23/orc

  • Compressed graph:

    • Total size: 11 TiB

    • S3: s3://softwareheritage/graph/2024-08-23/compressed

    • This graph changed the MPH (minimal perfect hash function) from GOV/Cmph to PTHash; Rust code that hardcodes GOVMPH must replace it with DynMph or SwhidPthash. Reading this graph from Java is no longer supported.

2024-05-16

A full export of the graph, dated May 2024.

  • Columnar tables (Apache ORC):

    • Total size: 18 TiB

    • S3: s3://softwareheritage/graph/2024-05-16/orc

  • Compressed graph:

    • Total size: 11 TiB

    • S3: s3://softwareheritage/graph/2024-05-16/compressed

    • This graph export contains all files needed by the Rust implementation of swh-graph, so running swh-graph/tools/swh-graph-java2rust.sh is no longer necessary.

2023-09-06

A full export of the graph, dated September 2023.

  • Columnar tables (Apache ORC):

    • Total size: 15 TiB

    • S3: s3://softwareheritage/graph/2023-09-06/orc

  • Compressed graph:

    • Total size: 8.8 TiB

    • S3: s3://softwareheritage/graph/2023-09-06/compressed

2022-12-07

A full export of the graph, dated December 2022.

  • Columnar tables (Apache ORC):

    • Total size: 13 TiB

    • S3: s3://softwareheritage/graph/2022-12-07/orc

  • Compressed graph:

    • Total size: 7.1 TiB

    • S3: s3://softwareheritage/graph/2022-12-07/compressed

  • “History and hosting” Compressed graph:

    • This is a compressed graph containing only the “history and hosting” layer (origins, snapshots, releases, revisions) and the root directory (or, rarely, content) of every revision/release; most directories and contents are excluded.

    • Total size: 1 TiB

    • S3: s3://softwareheritage/graph/2022-12-07-history-hosting/compressed

  • Erratum:

2022-04-25

A full export of the graph, dated April 2022.

  • Columnar tables (Apache ORC):

    • Total size: 11 TiB

    • S3: s3://softwareheritage/graph/2022-04-25/orc

  • Compressed graph:

    • Total size: 6.5 TiB

    • S3: s3://softwareheritage/graph/2022-04-25/compressed

2021-03-23

A full export of the graph, dated March 2021.

  • Columnar tables (Apache ORC):

  • Compressed graph:

    • S3: s3://softwareheritage/graph/2021-03-23/compressed

2020-12-15

A full export of the graph, dated December 2020.

This export provides a CSV representation of nodes and edges instead of columnar tables:

  • edges as graph.edges.{cnt,ori,rel,rev,snp}.csv.zst and graph.edges.dir.{00..21}.csv.zst

  • nodes as graph.nodes.csv.zst

  • deduplicated labels as graph.labels.csv.zst

  • statistics as graph.edges.count.txt, graph.edges.stats.txt, graph.labels.count.txt, graph.nodes.count.txt, and graph.nodes.stats.txt

  • Compressed graph:

  • Edges:

    • S3: s3://softwareheritage/graph/2020-12-15/edges
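Assuming the edge file patterns listed in this section follow shell brace-expansion conventions (one file per node type, plus directory edges split into shards numbered 00 through 21 inclusive), the expected file names can be enumerated as follows. This is an illustrative sketch: the helper name is hypothetical, and the inclusive, zero-padded numbering is an assumption rather than something this page states.

```python
def edge_file_names() -> list[str]:
    """Expand the two edge-file patterns into concrete file names."""
    # One file for each of these node types.
    by_type = [f"graph.edges.{t}.csv.zst"
               for t in ("cnt", "ori", "rel", "rev", "snp")]
    # Directory edges: shards 00..21, assumed inclusive and zero-padded.
    dir_shards = [f"graph.edges.dir.{i:02d}.csv.zst" for i in range(22)]
    return by_type + dir_shards
```

Under those assumptions the list holds 27 names, which can be compared against a directory listing to confirm that a download is complete.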

2020-05-20

A full export of the graph dated from May 2020. Only available in compressed representation. (DEPRECATED: known issue with missing snapshot edges.)

2019-01-28

A full export of the graph dated from January 2019. The export was done in two phases, one of them called “2018-09-25” and the other “2019-01-28”. They both refer to the same dataset, but the different formats have various inconsistencies between them. (DEPRECATED: early export pipeline, various inconsistencies).

Teaser datasets

If the above datasets are too big, we also provide “teaser” datasets that have a smaller footprint and can get you started.

2020-12-15-gitlab-all

A teaser dataset containing the entirety of GitLab.com, exported in December 2020. Available in compressed graph format.

2020-12-15-gitlab-100k

A teaser dataset containing the 100k most popular GitLab.com repositories, exported in December 2020. Available in compressed graph format.