.. _swh-dataset-list:

Dataset
=======

We aim to provide regular exports of the Software Heritage graph in two
different formats:

- **Columnar data storage**: a set of relational tables stored in a columnar
  format such as `Apache ORC `_, which is particularly suited for scale-out
  analyses on data lakes and big data processing ecosystems such as the
  Hadoop environment.

- **Compressed graph**: a compact and highly efficient representation of the
  graph dataset, suited for scale-up analysis on high-end machines with large
  amounts of memory. The graph is compressed in *Boldi-Vigna representation*,
  designed to be loaded by the `WebGraph framework `_, specifically using our
  `swh-graph library `_.

See also :ref:`using-swh-data`.

.. admonition:: Terms of Use
   :name: remember-the-tos
   :class: important

   Usage of the datasets from the Software Heritage archive is covered by
   our `Ethical Charter`_ and the `Terms of use for bulk access`_.

.. _Ethical charter: https://www.softwareheritage.org/legal/users-ethical-charter/
.. _Terms of use for bulk access: https://www.softwareheritage.org/legal/bulk-access-terms-of-use/

Downloading the datasets
------------------------

All datasets below are available publicly and with no login required, subject
to the terms of use above.

After installing `awscli`_, datasets hosted on Amazon S3 can be downloaded
with this command::

    aws s3 cp s3://softwareheritage/graph/... ./target/path/ --recursive --no-sign-request

The latest **compressed graphs** contain some ``.zst`` files, which must be
decompressed with ``unzstd`` before they can be used with swh-graph.

.. _awscli: https://github.com/aws/aws-cli

Summary of dataset versions
---------------------------

**Full graph**:

.. list-table::
   :header-rows: 1

   * - Name
     - # Nodes
     - # Edges
     - Columnar
     - Compressed

   * - `2023-09-06`_
     - 34,121,566,250
     - 517,399,308,984
     - ✔
     - ✔

   * - `2022-12-07`_
     - 27,397,574,122
     - 416,565,871,870
     - ✔
     - ✔

   * - `2022-04-25`_
     - 25,340,003,875
     - 375,867,687,011
     - ✔
     - ✔

   * - `2021-03-23`_
     - 20,667,308,808
     - 232,748,148,441
     - ✔
     - ✔

   * - `2020-12-15`_
     - 19,330,739,526
     - 213,848,749,638
     - ✗
     - ✔

   * - `2020-05-20`_
     - 17,075,708,289
     - 203,351,589,619
     - ✗
     - ✔

   * - `2019-01-28`_
     - 11,683,687,950
     - 159,578,271,511
     - ✔
     - ✔

**Teaser datasets**:

.. list-table::
   :header-rows: 1

   * - Name
     - # Nodes
     - # Edges
     - Columnar
     - Compressed

   * - `2023-09-06-popular-1k`_
     - 176,569,127
     - 11,322,432,687
     - ✔
     - ✔

   * - `2021-03-23-popular-3k-python`_
     - 45,691,499
     - 1,221,283,907
     - ✔
     - ✔

   * - `2020-12-15-gitlab-all`_
     - 1,083,011,764
     - 27,919,670,049
     - ✗
     - ✔

   * - `2020-12-15-gitlab-100k`_
     - 304,037,235
     - 9,516,984,175
     - ✗
     - ✔

   * - `2019-01-28-popular-4k`_
     - ?
     - ?
     - ✔
     - ✗

   * - `2019-01-28-popular-3k-python`_
     - 27,363,226
     - 346,413,337
     - ✔
     - ✔

Full graph datasets
-------------------

Because of their size, some of the latest datasets are only available for
download from Amazon S3.

.. _graph-dataset-2023-09-06:

2023-09-06
~~~~~~~~~~

A full export of the graph dated from September 2023.

- **Columnar tables (Apache ORC)**:

  - **Total size**: 15 TiB
  - **S3**: ``s3://softwareheritage/graph/2023-09-06/orc``

- **Compressed graph**:

  - **Total size**: 8.8 TiB
  - **S3**: ``s3://softwareheritage/graph/2023-09-06/compressed``

.. _graph-dataset-2022-12-07:

2022-12-07
~~~~~~~~~~

A full export of the graph dated from December 2022.

- **Columnar tables (Apache ORC)**:

  - **Total size**: 13 TiB
  - **S3**: ``s3://softwareheritage/graph/2022-12-07/orc``

- **Compressed graph**:

  - **Total size**: 7.1 TiB
  - **S3**: ``s3://softwareheritage/graph/2022-12-07/compressed``

- **"History and hosting" Compressed graph**:

  - This is a compressed graph of only the "history and hosting" layer
    (origins, snapshots, releases, revisions) and the root directory (or
    rarely content) of every revision/release. Most directories and contents
    are excluded.
  - **Total size**: 1 TiB
  - **S3**: ``s3://softwareheritage/graph/2022-12-07-history-hosting/compressed``

- **Erratum**:

  - `author and committer timestamps were shifted back 1 or 2 hours, based on the Europe/Paris timezone `_

.. _graph-dataset-2022-04-25:

2022-04-25
~~~~~~~~~~

A full export of the graph dated from April 2022.

- **Columnar tables (Apache ORC)**:

  - **Total size**: 11 TiB
  - **S3**: ``s3://softwareheritage/graph/2022-04-25/orc``

- **Compressed graph**:

  - **Total size**: 6.5 TiB
  - **S3**: ``s3://softwareheritage/graph/2022-04-25/compressed``

.. _graph-dataset-2021-03-23:

2021-03-23
~~~~~~~~~~

A full export of the graph dated from March 2021.

- **Columnar tables (Apache ORC)**:

  - **Total size**: 8.4 TiB
  - **URL**: `/graph/2021-03-23/orc/ `_
  - **S3**: ``s3://softwareheritage/graph/2021-03-23/orc``

- **Compressed graph**:

  - **S3**: ``s3://softwareheritage/graph/2021-03-23/compressed``

.. _graph-dataset-2020-12-15:

2020-12-15
~~~~~~~~~~

A full export of the graph dated from December 2020.
This export has a CSV representation of nodes and edges instead of columnar
tables:

* edges as :file:`graph.edges.{cnt,ori,rel,rev,snp}.csv.zst` and
  :file:`graph.edges.dir.{00..21}.csv.zst`
* nodes as :file:`graph.nodes.csv.zst`
* deduplicated labels as :file:`graph.labels.csv.zst`
* statistics as :file:`graph.edges.count.txt`, :file:`graph.edges.stats.txt`,
  :file:`graph.labels.count.txt`, :file:`graph.nodes.count.txt`, and
  :file:`graph.nodes.stats.txt`

- **Compressed graph**:

  - **URL**: `/graph/2020-12-15/compressed/ `_
  - **S3**: ``s3://softwareheritage/graph/2020-12-15/compressed``

- **Edges**:

  - **S3**: ``s3://softwareheritage/graph/2020-12-15/edges``

.. _graph-dataset-2020-05-20:

2020-05-20
~~~~~~~~~~

A full export of the graph dated from May 2020. Only available in compressed
representation.
**(DEPRECATED: known issue with missing snapshot edges.)**

- **Compressed graph**:

  - **URL**: `/graph/2020-05-20/compressed/ `_

.. _graph-dataset-2019-01-28:

2019-01-28
~~~~~~~~~~

A full export of the graph dated from January 2019. The export was done in
two phases, one of them called "2018-09-25" and the other "2019-01-28".
They both refer to the same dataset, but the different formats have various
inconsistencies between them.
**(DEPRECATED: early export pipeline, various inconsistencies).**

- **Columnar tables (Apache Parquet)**:

  - **Total size**: 1.2 TiB
  - **URL**: `/graph/2019-01-28/parquet/ `_
  - **S3**: ``s3://softwareheritage/graph/2018-09-25/parquet``

- **Compressed graph**:

  - **URL**: `/graph/2019-01-28/compressed/ `_

Teaser datasets
---------------

If the above datasets are too big, we also provide "teaser" datasets that can
get you started and have a smaller footprint.

.. _graph-dataset-2023-09-06-popular-1k:

2023-09-06-popular-1k
~~~~~~~~~~~~~~~~~~~~~

The ``popular-1k`` teaser contains a subset of 1120 popular repositories
**tagged as being written in one of the 10 most popular languages**
(JavaScript, Python, Java, TypeScript, C#, C++, PHP, Shell, C, Ruby), from
GitHub, GitLab.com, Packagist, PyPI and Debian.

The selection criteria to pick the software origins for each language were
the following:

- the 50 most popular GitLab.com projects written in that language that have
  2 stars or more,
- for Python, the 50 most popular PyPI projects (by usage statistics,
  according to the `Top PyPI Packages `_ database),
- for PHP, the 50 most popular Packagist projects (by usage statistics,
  according to `Packagist's API `_),
- the 50 most popular Debian packages with the relevant ``implemented-in::``
  `debtag `_ (by "installs" according to the `Debian Popularity Contest `_
  database),
- most popular GitHub projects written in Python (by number of stars), until
  the total number of origins for that language reaches 200,
- removing origins not archived by |swh| by 2023-09-06.

- **Columnar (Apache ORC)**:

  - **Total size**: 280 GiB
  - **S3**: ``s3://softwareheritage/graph/2023-09-06-popular-1k/orc/``

- **Compressed graph**:

  - **Total size**: 42 GiB
  - **S3**: ``s3://softwareheritage/graph/2023-09-06-popular-1k/compressed/``

.. _graph-dataset-2021-03-23-popular-3k-python:

2021-03-23-popular-3k-python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``popular-3k-python`` teaser contains a subset of 2197 popular
repositories **tagged as being written in the Python language**, from GitHub,
GitLab.com, PyPI and Debian.
The selection criteria to pick the software origins were the following:

- the 580 most popular GitHub projects written in Python (by number of
  stars),
- the 135 GitLab.com projects written in Python that have 2 stars or more,
- the 827 most popular PyPI projects (by usage statistics, according to the
  `Top PyPI Packages `_ database),
- the 655 most popular Debian packages with the `debtag `_
  ``implemented-in::python`` (by "votes" according to the
  `Debian Popularity Contest `_ database).

- **Columnar (Apache ORC)**:

  - **Total size**: 36 GiB
  - **S3**: ``s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/``

- **Compressed graph**:

  - **Total size**: 15 GiB
  - **S3**: ``s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/``

.. _graph-dataset-2020-12-15-gitlab-all:

2020-12-15-gitlab-all
~~~~~~~~~~~~~~~~~~~~~

A teaser dataset containing the entirety of GitLab.com, exported in December
2020. Available in compressed graph format.

- **Compressed graph**:

  - **URL**: `/graph/2020-12-15-gitlab-all/compressed/ `_

.. _graph-dataset-2020-12-15-gitlab-100k:

2020-12-15-gitlab-100k
~~~~~~~~~~~~~~~~~~~~~~

A teaser dataset containing the 100k most popular GitLab.com repositories,
exported in December 2020. Available in compressed graph format.

- **Compressed graph**:

  - **URL**: `/graph/2020-12-15-gitlab-100k/compressed/ `_

.. _graph-dataset-2019-01-28-popular-4k:

2019-01-28-popular-4k
~~~~~~~~~~~~~~~~~~~~~

This teaser dataset contains a subset of 4000 popular repositories from
GitHub, GitLab.com, PyPI and Debian.
The selection criteria to pick the software origins were the following:

- the 1000 most popular GitHub projects (by number of stars),
- the 1000 most popular GitLab.com projects (by number of stars),
- the 1000 most popular PyPI projects (by usage statistics, according to the
  `Top PyPI Packages `_ database),
- the 1000 most popular Debian packages (by "votes" according to the
  `Debian Popularity Contest `_ database).

- **Columnar (Apache Parquet)**:

  - **Total size**: 27 GiB
  - **URL**: `/graph/2019-01-28-popular-4k/parquet/ `_
  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-4k/parquet/``

.. _graph-dataset-2019-01-28-popular-3k-python:

2019-01-28-popular-3k-python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``popular-3k-python`` teaser contains a subset of 3052 popular
repositories **tagged as being written in the Python language**, from GitHub,
GitLab.com, PyPI and Debian.

The selection criteria to pick the software origins were the following,
similar to ``popular-4k``:

- the 1000 most popular GitHub projects written in Python (by number of
  stars),
- the 131 GitLab.com projects written in Python that have 2 stars or more,
- the 1000 most popular PyPI projects (by usage statistics, according to the
  `Top PyPI Packages `_ database),
- the 1000 most popular Debian packages with the `debtag `_
  ``implemented-in::python`` (by "votes" according to the
  `Debian Popularity Contest `_ database).

- **Columnar (Apache Parquet)**:

  - **Total size**: 5.3 GiB
  - **URL**: `/graph/2019-01-28-popular-3k-python/parquet/ `_
  - **S3**: ``s3://softwareheritage/graph/2019-01-28-popular-3k-python/parquet/``
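
Most S3 locations listed on this page follow the pattern
``s3://softwareheritage/graph/<dataset>/<representation>``, so the ``aws s3 cp``
command from "Downloading the datasets" can be assembled mechanically. The
following is a minimal sketch, not an official tool: the function name
``download_command``, its parameters, and the trailing slash added to the source
prefix are illustrative assumptions. It only builds the command and does not run
it.

```python
import shlex

# Bucket prefix shared by the S3 locations listed on this page.
S3_ROOT = "s3://softwareheritage/graph"


def download_command(dataset, representation, target):
    """Build the argv of the ``aws s3 cp`` call for one dataset representation.

    ``dataset`` is a name from the summary tables, e.g. "2023-09-06-popular-1k".
    ``representation`` is the last component of the S3 location listed for it,
    e.g. "orc", "compressed", "parquet" or "edges".
    This helper is hypothetical; it does not check that the location exists.
    """
    source = f"{S3_ROOT}/{dataset}/{representation}/"
    return ["aws", "s3", "cp", source, target, "--recursive", "--no-sign-request"]


if __name__ == "__main__":
    argv = download_command(
        "2023-09-06-popular-1k", "compressed", "./popular-1k/compressed/"
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(argv))
```

Not every dataset offers every representation, and the 2019-01-28 Parquet tables
are stored under the ``2018-09-25`` prefix, so check the entry for the dataset
before relying on the pattern. Assuming ``awscli`` is installed, the returned
list can be passed to ``subprocess.run(argv, check=True)`` to start the download.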