Software Heritage - Development Documentation
Getting started
- Run your own Software Heritage ← start here to get your own Software Heritage platform running in less than 5 minutes, or
- Developer setup ← here to hack on the Software Heritage software stack
Architecture
- Software Architecture ← go here for a glimpse of the Software Heritage software architecture
Components
Here is a brief overview of the most relevant software components in the Software Heritage stack. Each component name is linked to the development documentation of the corresponding Python module.
- swh.core
- low-level utilities and helpers used by almost all other modules in the stack
- swh.dataset
- public datasets and periodic data dumps of the archive released by Software Heritage
- swh.deposit
- push-based deposit of software artifacts to the archive
- swh.docs
- developer documentation (used to generate the documentation you are reading)
- swh.graph
- fast, compressed, in-memory representation of the archive, with tooling to generate and query it
- swh.indexer
- tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it
- swh.journal
- persistent logger of changes to the archive, with publish-subscribe support
- swh.lister
- collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.)
- swh.loader-core
- low-level loading utilities and helpers used by all other loaders
- swh.loader-debian
- loader for Debian source packages
- swh.loader-dir
- loader for source directories (e.g., expanded tarballs)
- swh.loader-git
- loader for Git repositories
- swh.loader-mercurial
- loader for Mercurial repositories
- swh.loader-pypi
- loader for PyPI source code releases
- swh.loader-svn
- loader for Subversion repositories
- swh.loader-tar
- loader for source tarballs (including Tar, ZIP and other archive formats)
- swh.model
- implementation of the Data model to archive source code artifacts
- swh.objstorage
- content-addressable object storage
- swh.scheduler
- task manager for asynchronous/delayed tasks, used for recurrent activities (e.g., listing a forge, loading new content from a Git repository) and one-off activities (e.g., loading a specific version of a source package)
- swh.storage
- abstraction layer over the archive, providing access to all stored source code artifacts as well as their metadata
- swh.vault
- implementation of the vault service, which allows retrieving parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)
- swh.web
- web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use
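To make the "content-addressable object storage" idea mentioned for swh.objstorage more concrete, here is a minimal sketch of the general technique. It is not the swh.objstorage API: the class name `InMemoryObjStorage`, its methods, and the choice of SHA-1 as the identifier are hypothetical and used purely for illustration.

```python
import hashlib


class InMemoryObjStorage:
    """Toy content-addressable store (hypothetical, not the swh.objstorage API).

    Objects are keyed by a hash of their bytes rather than by a name.
    """

    def __init__(self):
        self._objects = {}

    def add(self, content: bytes) -> str:
        # The identifier is derived from the content itself, so adding the
        # same bytes twice yields the same key and keeps a single copy.
        obj_id = hashlib.sha1(content).hexdigest()
        self._objects[obj_id] = content
        return obj_id

    def get(self, obj_id: str) -> bytes:
        content = self._objects[obj_id]
        # Re-hash on read: a mismatch means the stored bytes were corrupted.
        if hashlib.sha1(content).hexdigest() != obj_id:
            raise ValueError(f"corrupted object {obj_id}")
        return content

    def __contains__(self, obj_id: str) -> bool:
        return obj_id in self._objects

    def __len__(self) -> int:
        return len(self._objects)


store = InMemoryObjStorage()
first_id = store.add(b"hello")
second_id = store.add(b"hello")  # same bytes, same identifier
```

Because the key is a function of the content, identical files coming from different origins are stored only once, and integrity can be checked on every read.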
Dependencies
The dependency relationships among the various modules are depicted below.