Software Architecture Overview#

From an end-user point of view, the Software Heritage platform consists of the archive, which can be accessed using the web interface or its REST API. Behind the scenes (and the web app) are several components/services that expose different aspects of the Software Heritage archive as internal RPC APIs.

Each of these internal APIs has a dedicated database, usually PostgreSQL.

A global (and incomplete) view of this architecture looks like the following diagram:

Core components#

The following components are the foundation of the entire Software Heritage architecture, as they fetch data, store it, and make it available to every other service.

Data storage#

The Storage provides an API to store and retrieve elements of the graph, such as directory structures, revision histories, and their respective metadata. It relies on the Object Storage service to store the contents of source code files themselves.

Both the Storage and the Object Storage are designed as abstractions over possible backends. The former supports both PostgreSQL (the current solution in production) and Cassandra (a more scalable option we are exploring). The latter supports a large variety of “cloud” object storages as backends, as well as a simple local filesystem.
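As a minimal sketch of how this abstraction is used, both services can be instantiated through factory functions driven by configuration; the backend names and keyword arguments below are assumptions to double-check against the swh.storage and swh.objstorage documentation.

```python
# Minimal sketch, not a definitive configuration: backend names and keyword
# arguments are assumptions to verify against the swh.storage / swh.objstorage docs.
from swh.storage import get_storage
from swh.objstorage.factory import get_objstorage

# In-memory backends, convenient for tests and local experiments (assumed names).
storage = get_storage(cls="memory")
objstorage = get_objstorage(cls="memory")

# Talking to a deployed Storage through its internal RPC API (placeholder URL).
remote_storage = get_storage(cls="remote", url="http://storage.internal:5002/")
```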

Journal#

The Journal is a persistent logger of every change in the archive, with publish-subscribe support, built on Kafka.

The Storage publishes a Kafka message to the Journal each time a new object is added to the archive, and many components consume these messages to be notified of changes. For example, this is how the Scheduler knows when an origin has been visited and what the resulting status of that visit was, which helps decide when to visit these repositories again.

It is also the foundation of the mirroring infrastructure, as it allows mirrors to stay up to date.
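As an illustration of the publish-subscribe side, any Kafka client can follow a journal topic; the broker address, consumer group and topic name below are placeholders, and in practice the swh.journal package provides a dedicated client on top of this.

```python
# Generic Kafka consumer sketch; broker, consumer group and topic name are
# placeholders. In practice swh.journal wraps this in a dedicated client.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "journal.internal:9092",  # placeholder broker address
    "group.id": "example-journal-follower",        # placeholder consumer group
    "auto.offset.reset": "earliest",
})
# Topic names follow the "swh.journal.objects.<object type>" convention (assumed).
consumer.subscribe(["swh.journal.objects.origin_visit_status"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # Each message value is a serialized archive object.
    print(msg.topic(), len(msg.value()), "bytes")
```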

Source code scraping#

The infrastructure that finds new source code origins (git, mercurial and other types of VCS, source packages, etc.) and regularly visits them is built around a few components based on a task scheduling scaffolding and a Celery-based asynchronous task execution framework. The scheduler itself consists of two parts: a generic asynchronous task management system, and a dedicated management database that gathers and keeps up to date liveness information about listed origins, which can be used to choose which of them should be visited first.

To summarize, the parts involved in this carousel are:

Listers:

tasks that scrape a web site (a forge, for instance) to gather all the source code repositories it can find, also known as origins. Lister tasks are triggered by the scheduler, via Celery, and fill the listed origins table of the listing and visit statistics database (see below).

Loaders:

tasks dedicated to importing source code from a source code repository (an origin). This is the component that inserts blob objects into the Object Storage, and nodes and edges into the graph.

Scheduler’s generic task management:

manages the choreography of listing tasks in Software Heritage, as well as a few other utility tasks (save code now, deposit, vault, indexers). Note that this component no longer handles the scheduling of loading tasks. It consists of a database and an API that allow defining task types and creating tasks to be scheduled (recurring or one-shot), plus a tool (the scheduler-runner) dedicated to spawning these tasks via the Celery asynchronous execution framework, and another tool (the scheduler-listener) dedicated to keeping the scheduler database in sync with executed tasks (task execution status, execution timestamps, etc.).

Scheduler’s listing and visit statistics:

a database and API that store information about the liveness of listed origins, as well as statistics about the loading of these origins. The visit statistics are updated from the main storage's Kafka journal.

Scheduler’s origin visit scheduling:

a tool that uses the statistics about listed origins and previous visits stored in the database to apply scheduling policies and select the next pool of origins to visit. This does not use the generic task management system, but instead directly spawns loading Celery tasks (a rough sketch of this mechanism follows the list).
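To make the Celery hand-off concrete, here is a rough sketch of how listing and loading tasks can be sent to workers by name; the broker URL, task names and arguments are placeholders rather than the exact production values.

```python
# Conceptual sketch of how the scheduler hands work to Celery workers.
# The broker URL, task names and arguments are placeholders, not production values.
from celery import Celery

app = Celery(broker="amqp://guest:guest@rabbitmq.internal//")

# Ask a worker to list a forge (assumed task name and arguments).
app.send_task(
    "swh.lister.gitlab.tasks.FullGitLabRelister",
    kwargs={"url": "https://gitlab.example.org/api/v4/"},
)

# Ask a worker to load (visit) a single origin (again, an assumed task name).
app.send_task(
    "swh.loader.git.tasks.UpdateGitRepository",
    kwargs={"url": "https://gitlab.example.org/someone/project.git"},
)
```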

The Scheduler#

The Scheduler manages the generic choreography of jobs/tasks in Software Heritage, namely listing origins of software source code, loading them, extracting metadata from loaded origins and repackaging repositories into small downloadable archives for the Vault.

It consists of a database where all the scheduling information is stored, an API allowing unified access to this database, and a set of services and tools to orchestrate the actual scheduling of tasks. Their execution is delegated to a Celery-based set of asynchronous workers.

While initially a single generic scheduling utility for all asynchronous task types, the scheduling of origin visits has now been extracted into a new, dedicated part of the Scheduler. Loading tasks used to be managed by the generic task scheduler as recurring tasks, but their sheer number became a problem to handle efficiently, and some of their specificities could not be taken into account to schedule origin visits better and more efficiently.

There are now two parts in the scheduler: the original SWH Task management system, and the new Origin Visit scheduling utility.

Both have a similar architecture at first sight: a database, an API, a Celery-based execution system. The main difference is that the new visit-centric system is dedicated to origin visits, and can thus use specific information and metadata about origins to optimise the scheduling policy; statistics about known origins resulting from the listing of a forge can be used as the entry point for scheduling origin visits according to policies that take several metrics into consideration, such as:

  • has the origin already been visited,

  • if not, how “old” is the origin (what is the timestamp of its first sign of activity, e.g. creation date, timestamp of the first revision, etc.),

  • how long it has been since the origin was last visited,

  • how active is the origin (and thus how often it should be visited),

  • etc.

For each new source code repository, a listed origin entry is added to the scheduler database, along with the timestamp of the last known activity for this origin as reported by the forge. For already-known origins, only this last-activity timestamp is updated, if need be.

It is then the responsibility of the schedule-recurrent scheduler service to check listed origins, as well as visit statistics (see below), in order to regularly select the next origins to visit. This service also uses live data from Celery to choose an appropriate number of visits to schedule (keeping the Celery queues filled at a constant and controlled level).
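As a purely hypothetical illustration of such a policy (the real ones live in swh.scheduler and are more elaborate), a selection function could prefer never-visited origins and origins that changed since their last visit:

```python
# Hypothetical scheduling-policy sketch; names, fields and the policy itself
# are illustrative, not the actual swh.scheduler implementation.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class ListedOrigin:
    url: str
    last_update: Optional[datetime]  # last activity reported by the forge
    last_visit: Optional[datetime]   # last successful visit, from visit statistics

def select_origins(origins: List[ListedOrigin], count: int) -> List[ListedOrigin]:
    """Prefer never-visited origins, then origins updated since their last visit."""
    never_visited = [o for o in origins if o.last_visit is None]
    stale = [
        o for o in origins
        if o.last_visit is not None
        and o.last_update is not None
        and o.last_update > o.last_visit
    ]
    # Longest-unvisited first among stale origins (timestamps assumed comparable).
    stale.sort(key=lambda o: o.last_visit)
    return (never_visited + stale)[:count]
```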

The following sequence diagram shows the interactions between these components when a new forge needs to be archived. This example depicts the case of a gitlab forge, but any other supported source type would be very similar.

As one might observe in this diagram, the lister does two things:

  • it asks the forge (a gitlab instance in this case) for the list of known repositories, as well as some metadata (especially the last update timestamp), and

  • it inserts one listed origin entry for each new source code repository found, or updates the last-update timestamp for origins that are already known.

The sequence diagram below describes the second step, importing the content of a repository. Once again, we take the example of a git repository, but any other type of repository would be very similar.

Other major components#

All the components we saw above are critical to the Software Heritage archive, as they are in charge of archiving source code. But they are not enough to provide another important feature of Software Heritage: making this archive accessible and searchable by anyone.

Archive website and API#

First of all, the archive website and API, also known as swh-web, is the main entry point of the archive.

This is the component that serves https://archive.softwareheritage.org/, which is the window into the entire archive, as it provides access to it through a web browser or the HTTP API.

It does so by querying most of the internal APIs of Software Heritage: the Data Storage (to display source code repositories and their content), the Scheduler (to allow manual scheduling of loader tasks through the Save Code Now feature), and many of the other services we will see below.
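As a quick sketch of using this public entry point, the HTTP API can be queried with any HTTP client; the search term below is just an example, and the endpoint should be double-checked against the swh-web API documentation.

```python
# Small sketch of querying the public archive API with the requests library;
# the search term is an example, and the endpoint layout should be verified
# against the swh-web API documentation.
import requests

resp = requests.get(
    "https://archive.softwareheritage.org/api/1/origin/search/hello-world/",
    params={"limit": 5},
)
resp.raise_for_status()
for origin in resp.json():
    print(origin["url"])
```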

Internal data mining#

Indexers are a type of task that crawl the content of the archive to extract derived information.

This ranges from detecting the MIME type or license of individual files, to reading the various kinds of metadata files at the root of repositories and storing them together in a unified format, CodeMeta.

All results computed by Indexers are stored in a PostgreSQL database, the Indexer Storage.
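As a hypothetical sketch of an indexer's shape (the method and field names below are illustrative, not the exact swh.indexer interfaces): fetch a blob from the Object Storage, derive a property, and record the result in the Indexer Storage.

```python
# Hypothetical indexer sketch; method names and result fields are illustrative,
# not the exact swh.indexer / swh.indexer.storage interfaces.
import magic  # libmagic bindings; MIME detection works along these lines

def index_mimetype(objstorage, indexer_storage, content_sha1: bytes) -> dict:
    data = objstorage.get(content_sha1)  # raw blob bytes (assumed method name)
    result = {
        "id": content_sha1,
        "mimetype": magic.from_buffer(data, mime=True),
    }
    indexer_storage.content_mimetype_add([result])  # assumed method name
    return result
```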

Vault#

The Vault is an internal API in charge of cooking compressed archives (zip or tgz) of archived objects on request (via swh-web). These compressed objects are typically directories or repositories.

Since this can be a rather long process, it is delegated to an asynchronous (Celery) task, through the Scheduler.
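From a client's point of view, a cooking is typically requested and then polled through the web API; the endpoint layout and the SWHID below are placeholders to verify against the vault API documentation.

```python
# Sketch of requesting a vault cooking through the public API; the endpoint
# path and the SWHID are placeholders to check against the vault documentation.
import requests

swhid = "swh:1:dir:0000000000000000000000000000000000000000"  # placeholder SWHID
url = f"https://archive.softwareheritage.org/api/1/vault/flat/{swhid}/"

requests.post(url).raise_for_status()  # ask the vault to start cooking
status = requests.get(url).json()      # poll until the reported status is "done"
print(status)
```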

Extra services#

Finally, Software Heritage provides additional tools that, although not necessary to operate the archive, provide convenient interfaces or performance benefits.

It is therefore possible to have a fully-functioning archive without any of these services (our development Docker environment disables most of these by default).

Graph#

swh-graph is a recent addition to the architecture, designed to complement the Storage using a specialized backend. It leverages WebGraph to store a compressed in-memory representation of the entire graph, and provides fast implementations of graph traversal algorithms.
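As a sketch of what a fast traversal looks like from a client, swh-graph exposes an HTTP API that streams node identifiers; the service URL, endpoint path and SWHID below are assumptions and placeholders.

```python
# Sketch of a traversal query against a (non-public) swh-graph instance; the
# service URL, endpoint path and SWHID are assumptions/placeholders.
import requests

graph_url = "http://graph.internal:5009"  # placeholder internal service address
swhid = "swh:1:rev:0000000000000000000000000000000000000000"  # placeholder

# Stream the identifiers of all nodes reachable from this revision (assumed endpoint).
with requests.get(f"{graph_url}/graph/visit/nodes/{swhid}", stream=True) as resp:
    for line in resp.iter_lines():
        print(line.decode())
```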

Counters#

The archive’s landing page features counts of the total number of files/directories/revisions/… in the archive. Perhaps surprisingly, counting unique objects at Software Heritage’s scale is hard, and a performance bottleneck when implemented purely in the Storage’s SQL database.

swh-counters provides an alternative design to solve this issue: it reads new objects from the Journal and counts them using Redis's HyperLogLog feature, and it keeps the history of these counters over time using Prometheus.
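The HyperLogLog idea can be illustrated with a few lines of Redis commands; the key names below are placeholders, not the ones used by swh-counters.

```python
# Minimal illustration of Redis's HyperLogLog commands; key names are
# placeholders, not the ones used by swh-counters.
import redis

r = redis.Redis(host="localhost", port=6379)

# Adding the same identifier twice does not increase the estimated cardinality.
r.pfadd("counters:content", b"sha1:aaaa", b"sha1:bbbb")
r.pfadd("counters:content", b"sha1:aaaa")

print(r.pfcount("counters:content"))  # ~2, with a small bounded error
```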

Deposit#

The Deposit is an alternative way to add content to the archive. While listers and loaders, as we saw above, discover repositories and pull artifacts into the archive, the Deposit allows trusted partners to push the content of their repositories directly to the archive; deposited content is then loaded internally by the Deposit Loader.

The Deposit is centered on the SWORDv2 protocol, which allows depositing archives (usually TAR or ZIP) along with metadata in XML.

The Deposit has its own HTTP interface, independent of swh-web. It also has its own SWORD client, which is specialized to interact with the Deposit server.
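For illustration, a SWORDv2 binary deposit boils down to an authenticated HTTP POST of an archive with a few SWORD headers; the server URL, collection and credentials below are placeholders, and the dedicated deposit client handles this (plus the metadata) in practice.

```python
# Rough SWORDv2 binary-deposit sketch with requests; the server URL, collection
# name and credentials are placeholders. The dedicated swh deposit client
# handles this, along with the metadata, for real use.
import requests

with open("project.zip", "rb") as f:
    resp = requests.post(
        "https://deposit.softwareheritage.org/1/example-collection/",  # placeholder
        auth=("partner-login", "partner-password"),                    # placeholder
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=project.zip",
            "In-Progress": "false",  # a complete deposit, ready to be loaded
        },
        data=f,
    )
print(resp.status_code, resp.text)
```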

Authentication#

While the archive itself is public, Software Heritage reserves some features for authenticated clients, such as higher rate limits, access to experimental APIs (currently: the Graph service), or the Deposit.

This is managed centrally by swh-auth using Keycloak.
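From an API client's perspective, authentication typically amounts to sending a bearer token obtained from one's archive account; the token below is a placeholder and the exact scheme should be checked against the authentication documentation.

```python
# Sketch of an authenticated API call; the token is a placeholder, and the
# bearer-token scheme should be verified against the swh-web authentication docs.
import requests

TOKEN = "..."  # API token generated from one's archive account (placeholder)

resp = requests.get(
    "https://archive.softwareheritage.org/api/1/origin/search/hello/",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(resp.status_code, resp.headers.get("X-RateLimit-Remaining"))
```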

Web Client, Fuse, Scanner#

SWH provides a few tools to access the archive via the API: the Web Client (a Python client for the Web API), the Fuse virtual filesystem (to browse the archive as if it were local files), and the Scanner (to check which source code files are already present in the archive).

Replayers and backfillers#

As the Journal and the various databases may get out of sync for various reasons (scrubbing of either of them, migration, addition of a new database, …), and because some databases need to follow the content of the Journal (mirrors), the Software Heritage codebase contains tools known as “replayers” and “backfillers”, designed to keep them in sync:

  • the Object Storage Replayer copies the content of an object storage to another one. It first performs a full copy, then streams new objects using the Journal to stay up to date (a conceptual sketch follows this list).

  • the Storage Replayer loads the entire content of the Journal into a Storage database, and then keeps it in sync. This is used for mirrors, and when creating a new database.

  • the Storage Backfiller, which does the opposite. This was initially used to populate the Journal from the database, and is occasionally used when one needs to clear a topic in the Journal and recreate it.
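As a conceptual sketch of the replaying loop (the topic name and the object storage method names are assumptions, not the actual swh.objstorage.replayer implementation):

```python
# Conceptual object storage replayer sketch; the topic name and object storage
# method names are assumptions, not the actual swh.objstorage.replayer code.
from confluent_kafka import Consumer

def replay(consumer: Consumer, src_objstorage, dst_objstorage) -> None:
    consumer.subscribe(["swh.journal.objects.content"])  # assumed topic name
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        obj_id = msg.key()                       # content identifier (assumed layout)
        data = src_objstorage.get(obj_id)        # assumed object storage interface
        dst_objstorage.add(data, obj_id=obj_id)  # assumed object storage interface
```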