Software Architecture Overview

From an end-user point of view, the Software Heritage platform consists of the archive, which can be accessed using the web interface or its REST API. Behind the scenes (and the web app) are several components/services that expose different aspects of the Software Heritage archive as internal RPC APIs.

Each of these internal APIs has a dedicated database, usually PostgreSQL.

A global (and incomplete) view of this architecture looks like:

Core components

The following components are the foundation of the entire Software Heritage architecture, as they fetch data, store it, and make it available to every other service.

Data storage

The Storage provides an API to store and retrieve elements of the graph, such as directory structure, revision history, and their respective metadata. It relies on the Object Storage service to store the content of the source code files themselves.

Both the Storage and Object Storage are designed as abstractions over possible backends. The former supports both PostgreSQL (the current solution in production) and Cassandra (a more scalable option we are exploring). The latter supports a large variety of “cloud” object storage as backends, as well as a simple local filesystem.
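
The idea of one interface with interchangeable backends can be illustrated with a minimal sketch. It is not the actual Object Storage API: the class names, the add/get methods, and the choice of a SHA-1 digest as object identifier are all assumptions made for this example.

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path


def compute_id(content: bytes) -> str:
    # Assumption for this sketch: a blob is identified by a digest of its bytes.
    return hashlib.sha1(content).hexdigest()


class ObjStorageBackend(ABC):
    """Hypothetical minimal interface: store and fetch blobs by identifier."""

    @abstractmethod
    def add(self, content: bytes) -> str: ...

    @abstractmethod
    def get(self, obj_id: str) -> bytes: ...


class InMemoryBackend(ObjStorageBackend):
    """Keeps blobs in a dictionary; convenient for tests."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def add(self, content: bytes) -> str:
        obj_id = compute_id(content)
        self._blobs[obj_id] = content
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return self._blobs[obj_id]


class FilesystemBackend(ObjStorageBackend):
    """Stores each blob as one file named after its identifier."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def add(self, content: bytes) -> str:
        obj_id = compute_id(content)
        (self._root / obj_id).write_bytes(content)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        return (self._root / obj_id).read_bytes()
```

Code written against ObjStorageBackend does not need to know which backend it talks to, which is what makes it possible to swap a local filesystem for a cloud object store.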

Task management

The Scheduler manages the entire choreography of jobs/tasks in Software Heritage, from detecting and ingesting repositories, to extracting metadata from them, to repackaging repositories into small downloadable archives.

It does this by managing its own database of tasks that need to run (either periodically or only once), and passing them to Celery for execution on dedicated workers.
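
The distinction between recurring and one-shot tasks can be sketched as follows. This is a simplified in-memory model, not the Scheduler's actual schema or API: Task, MiniScheduler, and pop_due are hypothetical names, and handing tasks to Celery is reduced to returning their names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    name: str
    next_run: datetime
    interval: Optional[timedelta] = None  # None means "run only once"


class MiniScheduler:
    """Hypothetical stand-in for the task database."""

    def __init__(self) -> None:
        self.tasks: list[Task] = []

    def add(self, task: Task) -> None:
        self.tasks.append(task)

    def pop_due(self, now: datetime) -> list[str]:
        """Return the names of due tasks.

        Recurring tasks are rescheduled; one-shot tasks are dropped.
        """
        due, remaining = [], []
        for task in self.tasks:
            if task.next_run <= now:
                due.append(task.name)
                if task.interval is not None:
                    task.next_run = now + task.interval
                    remaining.append(task)
            else:
                remaining.append(task)
        self.tasks = remaining
        return due
```

In a real deployment, the names returned by pop_due would be sent to a worker queue rather than returned to the caller.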

Listers

Listers are a type of task, run by the Scheduler, that scrape a web site, a forge, etc. to gather all the source code repositories they can find, also known as origins. For each source code repository found, a loader task is created.

The following sequence diagram shows the interactions between these components when a new forge needs to be archived. This example depicts the case of a GitLab forge, but any other supported source type would be very similar.

As one might observe in this diagram, the lister does two things:

  • it asks the forge (a GitLab instance in this case) for the list of known repositories, and

  • it inserts one loader task for each source code repository, which will be in charge of importing the content of that repository.

Note that most listers work in incremental mode: they store the current state of the forge listing in a dedicated database, so that a subsequent execution of the lister asks only for new repositories.

Also note that if the lister inserts a new loading task for a repository for which a loading task already exists, the existing task will be updated (if needed) instead of creating a new task.
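
The two behaviours described above, incremental listing and updating an existing task instead of duplicating it, can be sketched as follows. Everything here is hypothetical: FakeForge stands in for a forge API that exposes repositories with increasing numeric identifiers, and the "load-git" task type and dictionary layout are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeForge:
    """Stand-in for a forge API: repositories have increasing numeric ids."""
    repos: dict[int, str] = field(default_factory=dict)

    def list_repos(self, after_id: int) -> list[tuple[int, str]]:
        return sorted((i, url) for i, url in self.repos.items() if i > after_id)


@dataclass
class MiniLister:
    forge: FakeForge
    last_seen_id: int = 0  # listing state kept between runs
    tasks: dict[str, dict] = field(default_factory=dict)  # keyed by origin URL

    def run(self) -> int:
        """List repositories newer than the stored state; return how many were seen."""
        new_repos = self.forge.list_repos(after_id=self.last_seen_id)
        for repo_id, url in new_repos:
            # Upsert: reuse the task for this origin if one already exists.
            task = self.tasks.setdefault(url, {"type": "load-git", "url": url})
            task["last_listed_id"] = repo_id
            self.last_seen_id = max(self.last_seen_id, repo_id)
        return len(new_repos)
```

Keying tasks by origin URL is what makes the second run cheap and keeps one task per repository, even if the forge reports the same URL again.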

Loaders

Loaders are also a type of task, but aim at importing or updating a source code repository. They are the ones that insert blob objects in the object storage, and insert nodes and edges in the graph.

The sequence diagram below describes this second step of importing the content of a repository. Once again, we take the example of a git repository, but any other type of repository would be very similar.
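
As a rough illustration of what a loader writes, the sketch below stores file contents as blobs and records nodes and edges linking an origin to them. It is heavily simplified: real repositories also involve directories and revision history, and the function name, the plain dict and set containers, and the SHA-1 identifiers are assumptions made for the example.

```python
import hashlib


def load_repository(origin_url, files, objstorage, nodes, edges):
    """Hypothetical loader step: store blobs, then record graph nodes and edges.

    files      maps a path to its bytes
    objstorage is a dict of blob id -> bytes
    nodes      is a set of node ids
    edges      is a set of (source, label, target) triples
    """
    nodes.add(origin_url)
    for path, data in sorted(files.items()):
        blob_id = hashlib.sha1(data).hexdigest()
        objstorage.setdefault(blob_id, data)  # identical content is stored once
        nodes.add(blob_id)
        edges.add((origin_url, path, blob_id))
    return len(files)
```

In this sketch, loading two repositories that contain the same file adds two edges but only one blob, since the blob identifier depends only on the content.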

Journal

The last core component is the Journal, which is a persistent logger of every change in the archive, with publish-subscribe support, using Kafka.

The Storage writes to it every time a new object is added to the archive; and many components read from it to be notified of these changes. For example, it allows the Scheduler to know how often software repositories are updated by their developers, to decide when next to visit these repositories.
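
The publish-subscribe pattern at work here can be sketched without Kafka. The class below is an in-memory stand-in, not the Journal's API: MiniJournal, publish, and poll are hypothetical names, and each consumer simply remembers how far it has read in each topic.

```python
from collections import defaultdict


class MiniJournal:
    """In-memory stand-in for an append-only log with independent consumers."""

    def __init__(self) -> None:
        self._topics = defaultdict(list)   # topic -> list of messages
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index

    def publish(self, topic: str, message: dict) -> None:
        self._topics[topic].append(message)

    def poll(self, consumer: str, topic: str) -> list:
        """Return the messages on a topic that this consumer has not read yet."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]
```

Because messages are never removed and each consumer tracks its own position, a component that subscribes late (a new mirror, for instance) can still read the log from the beginning.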

It is also the foundation of the Mirroring infrastructure, as it allows mirrors to stay up to date.

Other major components

All the components we saw above are critical to the Software Heritage archive, as they are in charge of archiving source code. But they are not enough to provide another important feature of Software Heritage: making this archive accessible and searchable by anyone.

Archive website and API

First of all, the archive website and API, also known as swh-web, is the main entry point of the archive.

This is the component that serves https://archive.softwareheritage.org/, which is the window into the entire archive, as it provides access to it through a web browser or the HTTP API.

It does so by querying most of the internal APIs of Software Heritage: the Data Storage (to display source code repositories and their content), the Scheduler (to allow manual scheduling of loader tasks through the Save Code Now feature), and many of the other services we will see below.

Internal data mining

Indexers are a type of task aiming at crawling the content of the archive to extract derived information.

This ranges from detecting the MIME type or license of individual files, to reading all types of metadata files at the root of repositories and storing them together in a unified format, CodeMeta.

All results computed by Indexers are stored in a PostgreSQL database, the Indexer Storage.

Vault

The Vault is an internal API, in charge of cooking compressed archives (zip or tgz) of archived objects on request (via swh-web). These compressed objects are typically directories or repositories.

Since this can be a rather long process, it is delegated to an asynchronous (Celery) task, through the Scheduler.
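
What "cooking" a directory into a tgz amounts to can be sketched with the standard library alone. This is an illustration, not the Vault's code: cook_directory and list_bundle are hypothetical helpers, and the archived directory is modelled as a plain mapping of paths to bytes.

```python
import io
import tarfile


def cook_directory(files: dict[str, bytes]) -> bytes:
    """Bundle a mapping of path -> content into a gzip-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_bundle(bundle: bytes) -> list[str]:
    """Return the member names of a bundle produced by cook_directory."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as archive:
        return archive.getnames()
```

For a large directory or a whole repository, every file has to be fetched and compressed, which is why running this work asynchronously makes sense.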

Extra services

Finally, Software Heritage provides additional tools that, although not necessary to operate the archive, provide convenient interfaces or performance benefits.

It is therefore possible to have a fully-functioning archive without any of these services (our development Docker environment disables most of these by default).

Graph

swh-graph is a recent addition to the architecture, designed to complement the Storage using a specialized backend. It leverages WebGraph to store a compressed in-memory representation of the entire graph, and provides fast implementations of graph traversal algorithms.

Counters

The archive’s landing page features counts of the total number of files/directories/revisions/… in the archive. Perhaps surprisingly, counting unique objects at Software Heritage’s scale is hard, and a performance bottleneck when implemented purely in the Storage’s SQL database.

swh-counters provides an alternative design to solve this issue: it reads new objects from the Journal and counts them using Redis's HyperLogLog feature, and it keeps the history of these counters over time using Prometheus.
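
To see why HyperLogLog helps, here is a small pure-Python version of the algorithm. It is a teaching sketch, not Redis's implementation and not swh-counters code: the class name, the use of SHA-1 as hash function, and the default of 4096 registers are choices made for this example. The point is that memory use is fixed by the number of registers, regardless of how many objects are counted, at the cost of a small estimation error.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog: estimates a number of distinct items in fixed memory."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        # Take 64 bits of a hash of the item.
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        index = h >> (64 - self.p)                  # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:      # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Adding the same object twice never changes the registers, so duplicates read from the Journal do not inflate the count. With 4096 registers the expected relative error is on the order of 1 to 2 percent.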

Deposit

The Deposit is an alternative way to add content to the archive. While listers and loaders, as we saw above, discover repositories and pull artifacts into the archive, the Deposit allows trusted partners to push the content of their repository directly to the archive, where it is loaded internally by the Deposit Loader.

The Deposit is centered on the SWORDv2 protocol, which allows depositing archives (usually TAR or ZIP) along with metadata in XML.

The Deposit has its own HTTP interface, independent of swh-web. It also has its own SWORD client, which is specialized to interact with the Deposit server.

Authentication

While the archive itself is public, Software Heritage reserves some features to authenticated clients, such as higher rate limits, access to experimental APIs (currently: the Graph service), or the Deposit.

This is managed centrally by swh-auth using Keycloak.

Web Client, Fuse, Scanner

SWH provides a few tools to access the archive via the API: the Web Client, Fuse, and the Scanner.

Replayers and backfillers

As the Journal and various databases may be out of sync for various reasons (scrub of either of them, migration, database addition, …), and because some databases need to follow the content of the Journal (mirrors), some parts of the Software Heritage codebase contain tools known as “replayers” and “backfillers”, designed to keep them in sync:

  • the Object Storage Replayer copies the content of an object storage to another one. It first performs a full copy, then streams new objects using the Journal to stay up to date.

  • the Storage Replayer loads the entire content of the Journal into a Storage database, and also keeps them in sync. This is used for mirrors, and when creating a new database.

  • the Storage Backfiller does the opposite. This was initially used to populate the Journal from the database, and is occasionally used when one needs to clear a topic in the Journal and recreate it.
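
The "full copy, then stream" pattern used by the Object Storage Replayer can be sketched as follows. This is a single-process illustration with hypothetical names: both object storages are plain dictionaries, and the Journal is reduced to a list of object identifiers in insertion order.

```python
class ObjStorageReplayer:
    """Hypothetical replayer: copy everything once, then follow the journal."""

    def __init__(self, source: dict, destination: dict, journal: list) -> None:
        self.source = source
        self.destination = destination
        self.journal = journal   # object ids, in the order they were added
        self.offset = 0          # position of the next unread journal entry

    def full_copy(self) -> None:
        """Initial step: copy every object currently in the source."""
        self.destination.update(self.source)
        # Everything announced so far is now covered by the copy.
        self.offset = len(self.journal)

    def catch_up(self) -> int:
        """Streaming step: copy only the objects announced since the last call."""
        new_ids = self.journal[self.offset:]
        for obj_id in new_ids:
            self.destination[obj_id] = self.source[obj_id]
        self.offset = len(self.journal)
        return len(new_ids)
```

A backfiller runs in the other direction: it walks a database and appends each object to the Journal, so that the log can be rebuilt from the stored data.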