Software Heritage Graph Dataset
This is the Software Heritage graph dataset: a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, recording when and where each archived source code artifact was observed in the wild.
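To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal Python sketch. It is a toy model, not the archive's actual identifier scheme: the `MerkleStore` class, the `node_id` helper, and the payload serialization are all hypothetical names and formats chosen for illustration. The idea it shows is that each node's identifier is a hash of its own payload, and a directory's or commit's payload embeds the identifiers of the nodes it points to, so identical files and directories collapse into a single node no matter how many repositories contain them.

```python
from __future__ import annotations

import hashlib


def node_id(kind: str, payload: bytes) -> str:
    """Hash the node kind and payload (illustrative format, not the real one)."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


class MerkleStore:
    """Toy content-addressed store: identical nodes are stored only once."""

    def __init__(self) -> None:
        self.nodes: dict[str, tuple[str, bytes]] = {}

    def add(self, kind: str, payload: bytes) -> str:
        ident = node_id(kind, payload)
        # setdefault keeps the existing node if this identifier is already known
        self.nodes.setdefault(ident, (kind, payload))
        return ident

    def add_content(self, data: bytes) -> str:
        return self.add("content", data)

    def add_directory(self, entries: dict[str, str]) -> str:
        # The payload lists (name, child id) pairs in sorted order, so the
        # directory's identifier depends on the identifiers of its children.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self.add("directory", "\n".join(lines).encode())

    def add_revision(self, directory: str, parents: list[str], message: str) -> str:
        # A commit points to one root directory and to its parent commits.
        lines = [f"directory {directory}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self.add("revision", "\n".join(lines).encode())


store = MerkleStore()
readme = store.add_content(b"hello\n")

# Two directories with identical contents resolve to the same node.
dir_a = store.add_directory({"README": readme})
dir_b = store.add_directory({"README": store.add_content(b"hello\n")})

rev1 = store.add_revision(dir_a, [], "initial commit")
main_py = store.add_content(b"print('hi')\n")
dir_c = store.add_directory({"README": readme, "main.py": main_py})
rev2 = store.add_revision(dir_c, [rev1], "add main.py")

print(dir_a == dir_b)    # same identifier for identical directories
print(len(store.nodes))  # number of distinct nodes actually stored
```

Because `rev2` embeds the identifier of `rev1`, the commit history forms a DAG in which every change to a file propagates upward into new directory and commit identifiers, while unchanged subtrees are shared.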
The Software Heritage graph dataset is available in multiple formats, including relational Apache ORC files for local use and a public instance on the Amazon Athena interactive query service for ready-to-use analytical processing.
By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, and the terms of use for bulk access.
If you use this dataset for research purposes, please cite the following paper: