Software Architecture Overview
From an end-user point of view, the Software Heritage platform consists of the archive, which can be accessed using the web interface or its REST API. Behind the scenes (and the web app) are several components/services that expose different aspects of the Software Heritage archive as internal RPC APIs.
Each of these internal APIs has its own dedicated database, usually PostgreSQL.
A global (and incomplete) view of this architecture looks like:
The following components are the foundation of the entire Software Heritage architecture, as they fetch data, store it, and make it available to every other service.
The Storage provides an API to store and retrieve elements of the graph, such as directory structure, revision history, and their respective metadata. It relies on the Object Storage service to store the content of the source code files themselves.
Both the Storage and Object Storage are designed as abstractions over possible backends. The former supports both PostgreSQL (the current solution in production) and Cassandra (a more scalable option we are exploring). The latter supports a large variety of “cloud” object storage as backends, as well as a simple local filesystem.
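The idea of one interface over interchangeable backends can be illustrated with a minimal sketch. Every name here (ObjStorage, MemoryObjStorage, DirObjStorage, compute_id) and the choice of SHA-256 as key are assumptions made for the example; this is not the actual Object Storage API.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Protocol


class ObjStorage(Protocol):
    """Hypothetical minimal interface that every backend implements."""

    def add(self, data: bytes) -> str: ...
    def get(self, obj_id: str) -> bytes: ...


def compute_id(data: bytes) -> str:
    # Assumption for this sketch: objects are keyed by the SHA-256 of their content.
    return hashlib.sha256(data).hexdigest()


class MemoryObjStorage:
    """Backend keeping objects in a dict (useful for tests)."""

    def __init__(self):
        self._objects = {}

    def add(self, data: bytes) -> str:
        obj_id = compute_id(data)
        self._objects[obj_id] = data
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]


class DirObjStorage:
    """Backend storing each object as one file in a local directory."""

    def __init__(self, root: Path):
        self._root = root

    def add(self, data: bytes) -> str:
        obj_id = compute_id(data)
        (self._root / obj_id).write_bytes(data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return (self._root / obj_id).read_bytes()


def roundtrip(storage: ObjStorage, data: bytes) -> bytes:
    # Calling code depends only on the interface, not on the backend.
    return storage.get(storage.add(data))
```

Because roundtrip only relies on add and get, swapping the in-memory backend for the directory backend (or, in principle, a cloud one) requires no change in the calling code.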
The Scheduler manages the entire choreography of jobs/tasks in Software Heritage, from detecting and ingesting repositories, to extracting metadata from them, to repackaging repositories into small downloadable archives.
It does this by managing its own database of tasks that need to run (either periodically or only once), and passing them to Celery for execution on dedicated workers.
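The distinction between recurring and one-shot tasks can be sketched as follows. Task, TinyScheduler, and pop_due are hypothetical names, the task list is an in-memory stand-in for the Scheduler's database, and the hand-off to Celery is replaced by simply returning the names of the due tasks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    next_run: float                   # timestamp (seconds) of the next run
    interval: Optional[float] = None  # None means "run only once"


class TinyScheduler:
    def __init__(self):
        self.tasks = []

    def add(self, task):
        self.tasks.append(task)

    def pop_due(self, now):
        """Return names of tasks due at `now`; reschedule recurring ones."""
        due, remaining = [], []
        for task in self.tasks:
            if task.next_run <= now:
                due.append(task.name)
                if task.interval is not None:
                    # Recurring task: keep it and move its next run forward.
                    task.next_run = now + task.interval
                    remaining.append(task)
                # One-shot task: dropped once it has been handed out.
            else:
                remaining.append(task)
        self.tasks = remaining
        return due


scheduler = TinyScheduler()
scheduler.add(Task("list-forge", next_run=0, interval=60))
scheduler.add(Task("load-repo", next_run=0))
first = scheduler.pop_due(now=0)    # both tasks are due
second = scheduler.pop_due(now=30)  # nothing is due yet
third = scheduler.pop_due(now=60)   # only the recurring task comes back
```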
Listers are a type of task, run by the Scheduler, that scrape a web site, a forge, etc. to gather all the source code repositories they can find, also known as origins. For each source code repository found, a loader task is created.
The following sequence diagram shows the interactions between these components when a new forge needs to be archived. This example depicts the case of a GitLab forge, but any other supported source type would be very similar.
As this diagram shows, the lister does two things:
it asks the forge (a GitLab instance in this case) for the list of known repositories, and
it inserts one loader task for each source code repository; that task will be in charge of importing the content of that repository.
Note that most listers usually work in incremental mode, meaning they store the current state of the listing of the forge in a dedicated database. On a subsequent execution, the lister then asks only for new repositories.
Also note that if the lister inserts a new loading task for a repository for which a loading task already exists, the existing task will be updated (if needed) instead of creating a new task.
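The two notes above (incremental listing, and updating an existing loader task instead of duplicating it) can be sketched together. The function run_lister, the "last_id" state key, the task fields, and the forge.example URLs are all hypothetical; the dicts stand in for the lister's state database and the Scheduler's task table.

```python
def run_lister(forge_repos, state, tasks):
    """Run one incremental listing pass.

    forge_repos: list of (repo_id, url) pairs, as a forge might return them
    state: dict holding the highest repo_id seen so far (the lister's state)
    tasks: dict mapping origin URL -> loader task (the Scheduler's view)
    Returns the number of repositories processed in this pass.
    """
    last_seen = state.get("last_id", 0)
    new_repos = [(rid, url) for rid, url in forge_repos if rid > last_seen]
    for rid, url in new_repos:
        # Upsert: an existing task for this origin is updated, not duplicated.
        task = tasks.setdefault(url, {"type": "load-git", "url": url})
        task["last_listed_id"] = rid
        state["last_id"] = max(state.get("last_id", 0), rid)
    return len(new_repos)


state, tasks = {}, {}
repos = [(1, "https://forge.example/a"), (2, "https://forge.example/b")]
first = run_lister(repos, state, tasks)   # full listing: 2 repositories
repos.append((3, "https://forge.example/c"))
second = run_lister(repos, state, tasks)  # incremental: only the new one
# The same origin listed again updates its task rather than adding one.
third = run_lister([(4, "https://forge.example/a")], state, tasks)
```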
The sequence diagram below describes this second step of importing the content of a repository. Once again, we take the example of a git repository, but any other type of repository would be very similar.
The Storage writes to the Journal every time a new object is added to the archive, and many components read from it to be notified of these changes. For example, it allows the Scheduler to know how often software repositories are updated by their developers, and so to decide when to visit these repositories next.
It is also the foundation of the Mirroring infrastructure, as it allows mirrors to stay up to date.
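The write-once, read-by-many pattern described here can be sketched with an append-only log in which each reader keeps its own position. TinyJournal and its methods are hypothetical and hold everything in memory; the sketch only illustrates why a late-joining reader such as a mirror can still catch up on everything.

```python
class TinyJournal:
    """Append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, topic, obj):
        self._log.append((topic, obj))

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


journal = TinyJournal()
journal.publish("revision", "r1")
seen_by_scheduler = journal.poll("scheduler")   # gets r1
journal.publish("revision", "r2")
seen_by_scheduler_2 = journal.poll("scheduler") # gets only r2
seen_by_mirror = journal.poll("mirror")         # new reader: gets r1 and r2
```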
Other major components
All the components we saw above are critical to the Software Heritage archive, as they are in charge of archiving source code. But they are not enough to provide another important feature of Software Heritage: making this archive accessible and searchable by anyone.
Archive website and API
First of all, the archive website and API, also known as swh-web, is the main entry point of the archive.
This is the component that serves https://archive.softwareheritage.org/, which is the window into the entire archive, as it provides access to it through a web browser or the HTTP API.
It does so by querying most of the internal APIs of Software Heritage: the Data Storage (to display source code repositories and their content), the Scheduler (to allow manual scheduling of loader tasks through the Save Code Now feature), and many of the other services we will see below.
Internal data mining
Indexers perform this data mining, which ranges from detecting the MIME type or license of individual files, to reading all types of metadata files at the root of repositories and storing them together in a unified format, CodeMeta.
All results computed by Indexers are stored in a PostgreSQL database, the Indexer Storage.
The Vault is an internal API, in charge of cooking compressed archives (zip or tgz) of archived objects on request (via swh-web). These compressed objects are typically directories or repositories.
Since this can be a rather long process, it is delegated to an asynchronous (Celery) task, through the Scheduler.
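As a rough illustration of what "cooking" a directory into a tgz involves, here is a sketch using only the Python standard library. The functions cook_directory and list_bundle and the sample file names are hypothetical; the sketch runs synchronously and does not reflect how the Vault is actually implemented or scheduled.

```python
import io
import tarfile


def cook_directory(files):
    """Pack {path: bytes} into a gzip-compressed tar archive, in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(files):
            data = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)  # must be set before adding the member
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_bundle(bundle):
    """Return the member names of a bundle produced by cook_directory."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as archive:
        return archive.getnames()


bundle = cook_directory({"src/main.py": b"print('hi')\n", "README": b"demo\n"})
names = list_bundle(bundle)
```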
Finally, Software Heritage provides additional tools that, although not necessary to operate the archive, provide convenient interfaces or performance benefits.
It is therefore possible to have a fully-functioning archive without any of these services (our development Docker environment disables most of these by default).
The swh-search service complements both the Storage and the Indexer Storage, to provide efficient advanced reverse-index search queries, such as full-text search on origin URLs and metadata.
This service is a recent addition to the Software Heritage architecture based on ElasticSearch, and is currently in use only for URL search.
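The reverse-index idea behind URL search can be sketched in a few lines: each token points to the set of origins containing it, and a query intersects those sets. OriginIndex, tokenize, and the forge.example URLs are hypothetical, and this plain-Python index is not how swh-search or its search backend is implemented.

```python
import re
from collections import defaultdict


def tokenize(url):
    """Split a URL or query into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OriginIndex:
    def __init__(self):
        # token -> set of origin URLs containing that token
        self._postings = defaultdict(set)

    def add(self, url):
        for token in tokenize(url):
            self._postings[token].add(url)

    def search(self, query):
        """Return origins whose URL contains every token of the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        result = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(result)


index = OriginIndex()
index.add("https://forge.example/python/interpreter")
index.add("https://other.example/python-tools/ci")
index.add("https://forge.example/kernel/linux")
```

With this index, index.search("python forge") returns only the first origin, since it is the only URL containing both tokens.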
swh-graph is also a recent addition to the architecture, designed to complement the Storage using a specialized backend. It leverages WebGraph to store a compressed in-memory representation of the entire graph, and provides fast implementations of graph traversal algorithms.
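To make "graph traversal" concrete, here is a breadth-first traversal over a toy graph stored as a plain dict. The node names and the reachable function are hypothetical, and the sketch uses neither WebGraph nor any compression; it only shows the kind of query such a service answers (everything reachable from a given node).

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first traversal; returns nodes in the order they are visited."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for successor in graph.get(node, ()):
            if successor not in seen:  # shared nodes are visited only once
                seen.add(successor)
                queue.append(successor)
    return order


# Toy graph: a snapshot pointing to revisions, directories, and files.
graph = {
    "snapshot": ["rev2"],
    "rev2": ["rev1", "dir2"],
    "rev1": ["dir1"],
    "dir1": ["README"],
    "dir2": ["README", "main.py"],
}
visited = reachable(graph, "snapshot")
```

Note that README is referenced by both directories but appears only once in the result, which mirrors how shared objects are deduplicated in the archive's graph.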
The archive’s landing page features counts of the total number of files/directories/revisions/… in the archive. Perhaps surprisingly, counting unique objects at Software Heritage’s scale is hard, and a performance bottleneck when implemented purely in the Storage’s SQL database.
swh-counters provides an alternative design to solve this issue, by reading new objects from the Journal and counting them using Redis’ HyperLogLog feature; and keeps the history of these counters over time using Prometheus.
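HyperLogLog estimates the number of distinct items using a small, fixed amount of memory, which is why it suits this use case. The following is a minimal pure-Python sketch of the algorithm, written for illustration only: the HyperLogLog class, the use of SHA-1 as hash function, and the register count are assumptions, and it is neither Redis' implementation nor the swh-counters code.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item: bytes):
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")  # 64 bits
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(10_000):
    hll.add(b"object-%d" % i)
estimate = hll.count()  # approximate; close to 10_000 but not exact
```

Adding an object that was already seen never changes the registers, so duplicates do not inflate the count; this is the property that makes the approach suitable for counting unique objects from a stream.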
The Deposit is an alternative way to add content to the archive.
While listers and loaders, as we saw above, discover repositories and pull artifacts into the archive, the Deposit allows trusted partners to push the content of their repository directly to the archive, and is internally loaded by the
The Deposit is centered on the SWORDv2 protocol, which allows depositing archives (usually TAR or ZIP) along with metadata in XML.
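SWORDv2 builds on the Atom Publishing Protocol, so the XML metadata accompanying a deposit is typically an Atom entry. The sketch below builds such an entry with the standard library. The build_entry function and the choice of fields (title, author) are assumptions for illustration; the metadata actually required by the Deposit is not described here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_entry(title, author):
    """Serialize a minimal Atom entry with a title and one author."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    return ET.tostring(entry, encoding="unicode")


metadata_xml = build_entry("My project", "Jane Doe")
```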
The Deposit has its own HTTP interface, independent of swh-web. It also has its own SWORD client, which is specialized to interact with the Deposit server.
While the archive itself is public, Software Heritage reserves some features to authenticated clients, such as higher rate limits, access to experimental APIs (currently: the Graph service), or the Deposit.
This is managed centrally by swh-auth using KeyCloak.
Web Client, Fuse, Scanner
SWH provides a few tools to access the archive via the API:
Software Heritage - Web client, a command-line interface to authenticate with SWH and a library to access the API from Python programs
Software Heritage Filesystem (SwhFS), a Filesystem in USErspace implementation, that exposes the entire archive as a regular directory on your computer
Software Heritage - Code Scanner, a work-in-progress to check which of the files in a project are already in the archive, without submitting them
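One way to check whether a file is archived without submitting it is to send only an identifier derived from its content. The sketch below computes the SWHID of a file's content, which uses the same hash as a git blob. The scan function and the archived set are hypothetical stand-ins; how the Code Scanner actually queries the archive is not shown here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content (git blob hash of the bytes)."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def scan(files, archived):
    """Map each path to True if its content identifier is in `archived`.

    `archived` is a hypothetical set of known SWHIDs standing in for the archive.
    """
    return {path: content_swhid(data) in archived for path, data in files.items()}


archived = {content_swhid(b"")}  # pretend the archive only knows the empty file
report = scan({"empty.txt": b"", "new.py": b"print('new')\n"}, archived)
```

Only the identifiers leave the machine in such a scheme; the file contents themselves are never transmitted.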
Replayers and backfillers
As the Journal and various databases may be out of sync for various reasons (scrub of either of them, migration, database addition, …), and because some databases need to follow the content of the Journal (mirrors), some places of the Software Heritage codebase contain tools known as “replayers” and “backfillers”, designed to keep them in sync:
the Object Storage Replayer copies the content of one object storage to another. It first performs a full copy, then streams new objects using the Journal to stay up to date
the Storage Replayer loads the entire content of the Journal into a Storage database, and then keeps the two in sync. This is used for mirrors, and when creating a new database.
the Storage Backfiller does the opposite: it writes the content of a Storage database into the Journal. This was initially used to populate the Journal from the database, and is occasionally used when one needs to clear a topic in the Journal and recreate it.
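The two directions of synchronization can be summarized in a short sketch. Here a journal is modelled as a list of (object id, object) messages and a storage as a dict; replay and backfill are hypothetical function names, not the actual tools.

```python
def replay(journal, storage):
    """Replayer direction: apply every journal message to a storage."""
    for obj_id, obj in journal:
        storage[obj_id] = obj


def backfill(storage, journal):
    """Backfiller direction: re-emit every stored object into a journal."""
    for obj_id in sorted(storage):
        journal.append((obj_id, storage[obj_id]))


journal = [("a", "content of a"), ("b", "content of b")]
mirror = {}
replay(journal, mirror)            # mirror now holds everything in the journal

rebuilt_journal = []
backfill(mirror, rebuilt_journal)  # a cleared topic recreated from the storage
```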