Software Architecture

From an end-user point of view, the Software Heritage platform consists of the archive, which can be accessed through the web interface or its REST API. Behind the scenes (and behind the web app) are several components that expose different aspects of the Software Heritage archive as internal REST APIs.

Each of these internal APIs has a dedicated (PostgreSQL) database.

A global (and incomplete) view of this architecture looks like:

The front API components are:

Backstage, a Celery-based system of tasks and workers performs all the work required to fill, maintain and update the Software Heritage archive.

The main components involved in this choreography are:

  • Listers: a lister is a type of task that scrapes a web site, a forge, etc. to gather all the source code repositories it can find. For each source code repository found, a loader task is created.

  • Loaders: a loader is a type of task that imports or updates a source code repository. It is the component that inserts blob objects into the object storage, and inserts nodes and edges into the graph.

  • Indexers: an indexer is a type of task that crawls the content of the archive to extract derived information (mimetype, etc.).

  • Vault: this type of Celery task is responsible for cooking a compressed archive (zip or tgz) of an archived object (typically a directory or a repository). Since this can be a rather long process, it is delegated to an asynchronous (Celery) task.
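
To make the loader's role more concrete, here is a minimal sketch of what "inserting blobs in the object storage and nodes and edges in the graph" could look like. It is an illustration only: the ObjectStorage and Graph classes and the load_directory function are hypothetical names, everything is kept in memory, and a plain SHA-1 digest stands in for the archive's actual identifier scheme, which this section does not describe.

```python
import hashlib


class ObjectStorage:
    """Hypothetical in-memory blob store keyed by content hash."""

    def __init__(self):
        self.blobs = {}

    def add(self, data: bytes) -> str:
        # Assumption: a plain SHA-1 of the content is used as identifier.
        obj_id = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self.blobs.setdefault(obj_id, data)
        return obj_id


class Graph:
    """Hypothetical in-memory graph: a set of nodes and labelled edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_edge(self, src: str, dst: str, name: str) -> None:
        self.nodes.update((src, dst))
        self.edges.add((src, dst, name))


def load_directory(files: dict, storage: ObjectStorage, graph: Graph) -> str:
    """Import a flat directory given as {file name: content bytes}.

    Each file content becomes a blob; the directory becomes a node
    with one edge per entry pointing at the corresponding blob.
    """
    entries = sorted((name, storage.add(data)) for name, data in files.items())
    # The directory identifier is derived from its sorted entries, so
    # loading the same directory twice yields the same node.
    dir_id = hashlib.sha1(repr(entries).encode()).hexdigest()
    graph.nodes.add(dir_id)
    for name, blob_id in entries:
        graph.add_edge(dir_id, blob_id, name)
    return dir_id
```

With this sketch, loading a directory in which two files have the same content stores that content only once, while the graph still records one edge per file name. Re-running the load on an unchanged directory adds nothing new, which is the property that makes "importing or updating" the same operation.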



The following sequence diagram shows the interactions between these components when a new forge needs to be archived. This example depicts the case of a GitLab forge, but any other supported source type would be very similar.

As one might observe in this diagram, the lister does two things:

  • it asks the forge (a GitLab instance in this case) for the list of known repositories, and

  • it inserts one loader task for each source code repository; that task will be in charge of importing the content of the repository.
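
The two steps above can be sketched as follows. This is a simplified illustration, not the actual lister code: LoaderTask, run_lister and the "load-git" task type are hypothetical names, the forge is represented by any callable returning repository URLs, and a plain Python list stands in for the place where tasks are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoaderTask:
    task_type: str   # hypothetical task type name, e.g. "load-git"
    origin_url: str  # repository the loader will have to import


def run_lister(list_repositories, task_queue: list) -> int:
    """Ask the forge for its repositories and queue one loader task each.

    list_repositories is a callable returning an iterable of URLs; it
    stands in for the calls to the forge API.
    Returns the number of loader tasks created.
    """
    created = 0
    for url in list_repositories():
        task_queue.append(LoaderTask("load-git", url))
        created += 1
    return created


def fake_forge():
    """Stand-in for a forge that knows two repositories."""
    yield "https://forge.example/alice/project"
    yield "https://forge.example/bob/library"
```

Calling run_lister(fake_forge, queue) on an empty list leaves one LoaderTask per repository in queue. The lister never touches repository content itself; it only records what needs to be loaded.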

Note that most listers work in incremental mode: they store the current state of the listing of the forge in a dedicated database. On a subsequent execution, the lister then asks only for repositories that are new since that state was recorded.
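
A minimal sketch of incremental listing is given below. It rests on assumptions that the text does not state: the forge is assumed to give each repository an increasing numeric identifier, the stored state is reduced to the highest identifier seen so far, and a plain dict stands in for the dedicated database. The function name list_incrementally is hypothetical.

```python
def list_incrementally(repositories, state: dict) -> list:
    """Return only the repository URLs not seen in a previous run.

    repositories: iterable of (repo_id, url) pairs as the forge might
        return them (assumption: ids grow as repositories are created).
    state: dict persisted between runs, standing in for the database.
    """
    last_seen = state.get("last_id", 0)
    new = [(repo_id, url) for repo_id, url in repositories if repo_id > last_seen]
    if new:
        # Remember how far this run went, for the next execution.
        state["last_id"] = max(repo_id for repo_id, _ in new)
    return [url for _, url in new]
```

On the first run the state is empty, so every repository is returned. On later runs only repositories with an identifier above the stored one are returned, and an unchanged forge yields an empty list.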

Also note that if the lister inserts a new loading task for a repository for which a loading task already exists, the existing task will be updated (if needed) instead of creating a new task.
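
This "update instead of duplicate" behaviour can be sketched as an upsert keyed by repository. The sketch is an assumption about the general idea, not a description of the real scheduler: upsert_loading_task is a hypothetical name, tasks are kept in a dict keyed by repository URL, and a task is reduced to a dict of arguments.

```python
def upsert_loading_task(tasks: dict, origin_url: str, arguments: dict) -> str:
    """Create or update the loading task of a repository.

    tasks: existing loading tasks, keyed by repository URL.
    Returns "created", "updated" or "unchanged" to describe what happened.
    """
    existing = tasks.get(origin_url)
    if existing is None:
        tasks[origin_url] = dict(arguments)
        return "created"
    if existing != arguments:
        # A task already exists for this repository: refresh it in place
        # rather than adding a second one.
        tasks[origin_url] = dict(arguments)
        return "updated"
    return "unchanged"
```

However many times the lister reports the same repository, the dict holds a single task for it, and that task is rewritten only when its arguments actually differ.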


The sequence diagram below describes this second step of importing the content of a repository. Once again, we take the example of a git repository, but any other type of repository would be very similar.