Software Architecture

From an end-user point of view, the Software Heritage platform consists of the archive, which can be accessed using the web interface or its REST API. Behind the scenes (and the web app) are several components that expose different aspects of the Software Heritage archive as internal REST APIs.

Each of these internal APIs has a dedicated (PostgreSQL) database.

A global view of this architecture looks like:

The front API components are:

Backstage, a Celery-based set of tasks and workers performs all the work required to fill, maintain and update the Software Heritage archive.

The main components involved in this choreography are:

  • Listers: a lister is a type of task that scrapes a web site, a forge, etc. to gather all the source code repositories it can find. For each source code repository found, a loader task is created.
  • Loaders: a loader is a type of task aiming at importing or updating a source code repository. It is the one that inserts blob objects in the object storage, and inserts nodes and edges in the graph.
  • Indexers: an indexer is a type of task that crawls the content of the archive to extract derived information (MIME type, etc.).
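
To make the loader and indexer roles more concrete, here is a minimal sketch in plain Python. It is not the Software Heritage implementation: `ToyArchive`, `load_repository` and `index_archive` are hypothetical names, the storage and graph are in-memory containers, blobs are identified by a single SHA-1 of their content (an assumption made for brevity), and the indexer derives a crude "is this UTF-8 text" flag instead of a real MIME type.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyArchive:
    """Hypothetical in-memory stand-in for the object storage and the graph."""
    objects: dict = field(default_factory=dict)  # blob id -> raw bytes
    nodes: set = field(default_factory=set)      # graph node ids
    edges: set = field(default_factory=set)      # (parent id, child id) pairs


def load_repository(archive, origin_url, files):
    """Toy loader: store every file as a blob and link it to the origin node."""
    origin_id = "origin:" + origin_url
    archive.nodes.add(origin_id)
    for data in files.values():
        blob_id = hashlib.sha1(data).hexdigest()
        archive.objects[blob_id] = data  # identical content is stored only once
        archive.nodes.add(blob_id)
        archive.edges.add((origin_id, blob_id))
    return origin_id


def index_archive(archive):
    """Toy indexer: derive simple facts from each stored blob."""
    index = {}
    for blob_id, data in archive.objects.items():
        try:
            data.decode("utf-8")
            is_text = True
        except UnicodeDecodeError:
            is_text = False
        index[blob_id] = {"length": len(data), "is_text": is_text}
    return index
```

Because blobs are keyed by a hash of their content, loading two files with the same bytes adds only one entry to the object storage. The indexer reads from the storage and never modifies it, which mirrors the "derived information" wording above.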

Tasks

The following sequence diagram shows the interactions between these components when a new forge needs to be archived. This example depicts the case of a GitLab forge, but any other supported source type would be very similar.

As the diagram shows, the lister does two things:

  • it adds one origin object in the storage database for each source code repository, and
  • it inserts one loader task for each source code repository; that task is in charge of importing the content of that repository.
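
The two effects listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual lister code: `run_lister` is a hypothetical function, the set `origins` stands in for the storage database, the list `task_queue` stands in for the task scheduler, and the task type string and argument layout are made up for the example.

```python
def run_lister(repository_urls, origins, task_queue):
    """Toy lister: record one origin and queue one loader task per repository."""
    count = 0
    for url in repository_urls:
        origins.add(url)  # stand-in for adding an origin object to the storage
        # Stand-in for inserting a loader task; the dict layout is hypothetical.
        task_queue.append({"type": "load-git", "arguments": {"url": url}})
        count += 1
    return count


# Example run against a made-up forge hosting two repositories.
origins, task_queue = set(), []
run_lister(
    ["https://forge.example.org/a.git", "https://forge.example.org/b.git"],
    origins,
    task_queue,
)
```

After this run, `origins` holds two entries and `task_queue` holds two loader tasks, one per repository. The lister itself does not fetch any repository content; that work is left to the queued loader tasks.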

The sequence diagram below describes this second step of importing the content of a repository. Once again, we take the example of a Git repository, but any other type of repository would be very similar.