Software Heritage - Development Documentation
Data Model and Specifications
SoftWare Heritage persistent IDentifiers (SWHIDs): Specifications of the SoftWare Heritage persistent IDentifiers (SWHID).
Data model: Documentation of the main Software Heritage archive data model.
Software Heritage Journal - Specifications: Documentation of the Kafka journal of the Software Heritage archive.
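As a small illustration of what a SWHID looks like in practice, the sketch below computes the identifier of a single file content. It assumes the content variant of the scheme, where the hash is the Git-compatible blob SHA-1 (the bytes prefixed with a `blob <length>\0` header). The function name `content_swhid` is made up for this example and is not part of any Software Heritage module; the SWHID specification remains the authoritative reference.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file content (hypothetical helper, for illustration).

    Assumption: content objects are hashed like Git blobs, i.e. SHA-1 over
    the header b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    # Scheme "swh", schema version 1, object type "cnt" (content), hex hash.
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_swhid(b""))
```

Because the hash depends only on the bytes, two identical files always get the same identifier, wherever they were found.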
Frequently Asked Questions
Prerequisites: Prerequisites to be a contributor
Getting Started: Starter kit to contribute
Running SWH instance locally: Starter kit to run your local swh instance
Dataset: Getting some sample dataset
Error and bugs: Stuck somewhere
Legal: Legal and contributions
Code Review: Related to Code review
API: General questions about the SWH API
System Administration: System Administration related questions
Here is a brief overview of the most relevant software components in the Software Heritage stack, in alphabetical order. For a better introduction to the architecture, see the Software Architecture Overview, which presents each of them in a didactic order.
Each component name is linked to the development documentation of the corresponding Python module.
low-level library used by modules needing keycloak authentication
low-level utilities and helpers used by almost all other modules in the stack
service providing efficient estimates of the number of objects in the SWH archive, using Redis’s HyperLogLog
public datasets and periodic data dumps of the archive released by Software Heritage
push-based deposit of software artifacts to the archive
developer documentation (used to generate this doc you are reading)
Virtual file system to browse the Software Heritage archive, based on FUSE
Fast, compressed, in-memory representation of the archive, with tooling to generate and query it.
tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it
persistent logger of changes to the archive, with publish-subscribe support
collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.)
low-level loading utilities and helpers used by all other loaders
loader for Git repositories
loader for Mercurial repositories
loader for Subversion repositories
implementation of the Data model to archive source code artifacts
content-addressable object storage
Object storage replication tool
source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage
task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package)
search engine for the archive
abstraction layer over the archive, giving access to all stored source code artifacts as well as their metadata
implementation of the vault service, which allows retrieving parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)
Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use
Python client for swh.web
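To make the idea of "content-addressable object storage" from the list above concrete, here is a toy in-memory sketch. It is not the API of the actual object storage module: the class `ToyObjStorage`, its methods, and the choice of SHA-256 as the key are all assumptions made for illustration only.

```python
import hashlib


class ToyObjStorage:
    """Toy content-addressable store (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        # The key is derived from the bytes themselves, so storing the
        # same content twice yields the same key and a single stored copy.
        obj_id = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        # Raises KeyError if no object with this identifier was stored.
        return self._objects[obj_id]

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = ToyObjStorage()
    first = store.add(b"int main(void) { return 0; }\n")
    second = store.add(b"int main(void) { return 0; }\n")
    print(first == second, len(store))
```

Addressing objects by their hash gives deduplication for free and lets a reader check integrity by re-hashing what it retrieves.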
The dependency relationships among the various modules are depicted below.