Software Heritage - Development Documentation

Getting started

Contributing

Architecture

Data Model and Specifications

Tutorials

Roadmap

System Administration

  • Network Infrastructure

  • Description → learn what a Software Heritage mirror is and how to set one up

  • Keycloak → learn how to use Keycloak, the authentication system used by Software Heritage’s web interface and public APIs

Components

Here is a brief overview of the most relevant software components in the Software Heritage stack, in alphabetical order. For a better introduction to the architecture, see the Software Architecture Overview, which presents each of them in a didactic order.

Each component name is linked to the development documentation of the corresponding Python module.

swh.auth

low-level library used by modules needing Keycloak authentication

swh.core

low-level utilities and helpers used by almost all other modules in the stack

swh.counters

service providing efficient estimates of the number of objects in the SWH archive, using Redis’s HyperLogLog

swh.dataset

public datasets and periodic data dumps of the archive released by Software Heritage

swh.deposit

push-based deposit of software artifacts to the archive

swh.docs

developer documentation (used to generate the documentation you are reading)

swh.fuse

Virtual file system to browse the Software Heritage archive, based on FUSE

swh.graph

Fast, compressed, in-memory representation of the archive, with tooling to generate and query it.

swh.graphql

GraphQL API to query archive data, offering more precise and flexible queries than the REST API.

swh.indexer

tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it

swh.journal

persistent logger of changes to the archive, with publish-subscribe support

swh.lister

collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.)

swh.loader-core

low-level loading utilities and helpers used by all other loaders

swh.loader-bzr

loader for Bazaar and Breezy repositories

swh.loader-git

loader for Git repositories

swh.loader-mercurial

loader for Mercurial repositories

swh.loader-metadata

pseudo-loader, which fetches extrinsic metadata from forges instead of software artifacts

swh.loader-svn

loader for Subversion repositories

swh.loader-cvs

loader for CVS repositories

swh.model

implementation of the data model used to archive source code artifacts

swh.objstorage

content-addressable object storage

swh.objstorage.replayer

Object storage replication tool

swh.perfecthash

Low-level management for read-only content-addressable object storage indexed with a perfect hash table

swh.scanner

source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage

swh.scheduler

task manager for asynchronous/delayed tasks, used for recurrent activities (e.g., listing a forge, loading new content from a Git repository) and one-off activities (e.g., loading a specific version of a source package)

swh.scrubber

Tooling to check integrity of various data stores (swh.journal, swh.objstorage, swh.storage) and fix corrupt objects they contain.

swh.search

search engine for the archive

swh.storage

abstraction layer over the archive, giving access to all stored source code artifacts as well as their metadata

swh.vault

implementation of the vault service, which allows retrieving parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)

swh.web

Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use

swh.web.client

Python client for swh.web
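
Several entries above (swh.objstorage, swh.perfecthash) describe storage as "content-addressable". As a rough illustration of that idea only, here is a minimal sketch in which an object's identifier is computed from its bytes. It is not the swh.objstorage API: the class ToyObjStorage and its methods are hypothetical names, and the use of SHA-256 is an assumption made for the example.

```python
import hashlib


class ToyObjStorage:
    """Hypothetical in-memory store; NOT the swh.objstorage API."""

    def __init__(self):
        # Maps hex digest -> raw bytes.
        self._objects = {}

    def add(self, content: bytes) -> str:
        # The identifier is derived from the content itself, so adding
        # the same bytes twice yields the same key and stores one copy.
        obj_id = hashlib.sha256(content).hexdigest()
        self._objects.setdefault(obj_id, content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        # Raises KeyError if the identifier is unknown.
        return self._objects[obj_id]

    def check(self, obj_id: str) -> bool:
        # Integrity check: recompute the hash and compare it to the key.
        return hashlib.sha256(self._objects[obj_id]).hexdigest() == obj_id

    def __len__(self) -> int:
        return len(self._objects)
```

Because the key depends only on the bytes, deduplication and integrity checking come almost for free in this toy version; a real object storage has to deal with persistence, concurrency and multiple hash algorithms, which the sketch leaves out.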

Dependencies

The dependency relationships among the various modules are depicted below.

Figure: Dependencies among top-level Python modules (_images/py-deps-swh.svg).
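
For readers who want to inspect such relationships in their own environment, the following sketch builds a similar dependency map from the metadata of the packages that happen to be installed locally. It uses only the standard library (importlib.metadata). The function names are hypothetical, and it assumes that the components are installed as distributions whose names start with "swh."; it is not the tool used to produce the figure.

```python
import re
from importlib import metadata


def _normalize(name: str) -> str:
    # Treat "swh-core", "swh_core" and "swh.core" as the same name.
    return re.sub(r"[-_.]+", ".", name).lower()


def requirement_name(requirement: str) -> str:
    # Keep only the leading project name of a requirement string,
    # dropping extras, version specifiers and environment markers.
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return _normalize(match.group(1)) if match else ""


def swh_dependency_map() -> dict:
    """Map each installed swh.* distribution to the swh.* ones it requires."""
    deps = {}
    for dist in metadata.distributions():
        raw_name = dist.metadata["Name"]
        if not raw_name:
            continue
        name = _normalize(raw_name)
        if not name.startswith("swh."):
            continue
        # dist.requires is None when no requirements are declared.
        required = {requirement_name(r) for r in (dist.requires or [])}
        deps[name] = sorted(
            r for r in required if r.startswith("swh.") and r != name
        )
    return deps


if __name__ == "__main__":
    for module, requirements in sorted(swh_dependency_map().items()):
        print(module, "->", ", ".join(requirements) or "(no swh dependencies)")
```

On a machine with no swh packages installed, the map is simply empty. The result reflects declared requirements of the installed versions, so it may differ from the figure.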

Archive
