Mirroring

Description

A mirror is a full copy of the Software Heritage archive, operated independently from the Software Heritage initiative. A minimal mirror consists of two parts:

  • the graph storage (typically an instance of swh.storage), which contains the Merkle DAG structure of the archive, except for the actual content of source code files (also known as blobs),

  • the object storage (typically an instance of swh.objstorage), which contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible to others. A proper mirror should therefore also make it possible to:

  • navigate the archive copy using a Web browser and/or the Web API (typically using the web application),

  • retrieve data from the copy of the archive (typically using the vault service).

A mirror is initially populated, and then kept up to date, by consuming data from the Software Heritage Kafka-based journal and retrieving the blob objects (file content) from the Software Heritage object storage.

Note

A mirror does not have to be deployed using the Software Heritage software stack. Other technologies, including different storage methods, can be used. This documentation nevertheless focuses on mirror deployments that use the Software Heritage software stack.

Mirroring the Graph Storage

The replication of the graph is based on a journal that uses Kafka as its event streaming platform.

On the Software Heritage side, every addition made to the archive consists of the addition of a Data model object. The new object is also serialized as a msgpack bytestring, which is used as the value of a message added to a Kafka topic dedicated to the object type.

The main Kafka topics for the Software Heritage Data model are:

  • swh.journal.objects.content

  • swh.journal.objects.directory

  • swh.journal.objects.extid

  • swh.journal.objects.metadata_authority

  • swh.journal.objects.metadata_fetcher

  • swh.journal.objects.origin_visit_status

  • swh.journal.objects.origin_visit

  • swh.journal.objects.origin

  • swh.journal.objects.raw_extrinsic_metadata

  • swh.journal.objects.release

  • swh.journal.objects.revision

  • swh.journal.objects.skipped_content

  • swh.journal.objects.snapshot
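
The topic names listed above all share one prefix followed by the object type. As a minimal sketch (the helper name topic_names is hypothetical and not part of any Software Heritage package; only the prefix and the object types come from the list above), the set of topics a mirror has to consume could be derived like this:

```python
# Object types taken from the topic list above.
OBJECT_TYPES = [
    "content",
    "directory",
    "extid",
    "metadata_authority",
    "metadata_fetcher",
    "origin_visit_status",
    "origin_visit",
    "origin",
    "raw_extrinsic_metadata",
    "release",
    "revision",
    "skipped_content",
    "snapshot",
]


def topic_names(prefix="swh.journal.objects", object_types=OBJECT_TYPES):
    """Return one Kafka topic name per object type (hypothetical helper)."""
    return [f"{prefix}.{object_type}" for object_type in object_types]


if __name__ == "__main__":
    for topic in topic_names():
        print(topic)
```

Keeping the prefix as a parameter is only a convenience for this sketch; it makes it easy to restrict a test client to a subset of object types.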

In order to set up a mirror of the graph, one needs to deploy a stack capable of retrieving all these topics and storing their content reliably. For example, a Kafka cluster configured as a replica of the main Kafka broker hosted by Software Heritage would do the job (albeit not in a very useful manner by itself).

A more useful mirror can be set up using the storage component with the help of the special service named replayer provided by the swh.storage.replay module.
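
To illustrate the general idea of replaying journal messages into a storage, here is a minimal, self-contained sketch. It is not the swh.storage.replay implementation: the class InMemoryGraphStorage, the functions object_type_of and replay, and the "id" key used to identify objects are all hypothetical. Kafka consumption and msgpack decoding are left out, so messages are assumed to arrive as already decoded (topic, object) pairs.

```python
from collections import defaultdict


class InMemoryGraphStorage:
    """Hypothetical stand-in for a graph storage; keeps objects in dicts."""

    def __init__(self):
        self.objects = defaultdict(dict)

    def add(self, object_type, objects):
        """Insert objects idempotently and return how many were new."""
        added = 0
        for obj in objects:
            # "id" is an assumed key, used here only for illustration.
            if obj["id"] not in self.objects[object_type]:
                self.objects[object_type][obj["id"]] = obj
                added += 1
        return added


def object_type_of(topic, prefix="swh.journal.objects"):
    """Extract the object type from a topic name such as '<prefix>.revision'."""
    if not topic.startswith(prefix + "."):
        raise ValueError(f"unexpected topic: {topic}")
    return topic[len(prefix) + 1:]


def replay(messages, storage):
    """Group decoded messages by object type and insert them batch by batch."""
    batches = defaultdict(list)
    for topic, obj in messages:
        batches[object_type_of(topic)].append(obj)
    return {
        object_type: storage.add(object_type, objects)
        for object_type, objects in batches.items()
    }


if __name__ == "__main__":
    storage = InMemoryGraphStorage()
    messages = [
        ("swh.journal.objects.revision", {"id": "rev-1"}),
        ("swh.journal.objects.revision", {"id": "rev-1"}),  # duplicate delivery
        ("swh.journal.objects.release", {"id": "rel-1"}),
    ]
    print(replay(messages, storage))
```

The insertion is written to be idempotent because a journal client may receive the same message more than once, for example after a restart; replaying a message twice should leave the mirror unchanged.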

Mirroring the Object Storage

File contents (blobs) are not stored directly in the messages of the swh.journal.objects.content Kafka topic, which only contain metadata about them, such as various kinds of cryptographic hashes. A separate component is in charge of replicating blob objects from the archive and storing them in the local object storage instance.

This separate swh-journal client should subscribe to the swh.journal.objects.content topic to get the stream of blob object identifiers, then retrieve the corresponding blobs from the main Software Heritage object storage and store them in the local object storage.

A reference implementation of this component is available in the content replayer.
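
The blob replication flow described above can be sketched as follows. This is an illustration under stated assumptions, not the reference content replayer: the function replicate_blobs is hypothetical, both object storages are modeled as plain dicts mapping a SHA1 digest to the blob bytes, and each content message is assumed to be an already decoded dict carrying a "sha1" field.

```python
import hashlib


def replicate_blobs(content_messages, source_objstorage, local_objstorage):
    """Copy the blobs referenced by content messages into the local storage.

    Both storages are dict-like objects mapping a SHA1 digest (bytes) to the
    blob (bytes); this is an assumption made for the sketch.
    """
    stats = {"copied": 0, "skipped": 0, "missing": 0, "corrupt": 0}
    for message in content_messages:
        sha1 = message["sha1"]
        if sha1 in local_objstorage:
            # Already mirrored: nothing to do.
            stats["skipped"] += 1
            continue
        blob = source_objstorage.get(sha1)
        if blob is None:
            # The source does not (or no longer does) provide this blob.
            stats["missing"] += 1
            continue
        if hashlib.sha1(blob).digest() != sha1:
            # Do not store a blob whose checksum does not match its identifier.
            stats["corrupt"] += 1
            continue
        local_objstorage[sha1] = blob
        stats["copied"] += 1
    return stats


if __name__ == "__main__":
    blob = b"int main(void) { return 0; }\n"
    key = hashlib.sha1(blob).digest()
    source = {key: blob}
    local = {}
    print(replicate_blobs([{"sha1": key}], source, local))
```

Checking the hash before storing is a design choice of this sketch: the metadata from the journal gives the mirror a way to verify each blob independently of the storage it was fetched from.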

Installation

When using the Software Heritage software stack to deploy a mirror, a number of Software Heritage software components must be installed (cf. architecture diagram above):

A docker-swarm based deployment solution is provided as a working example of the mirror stack:

It is strongly recommended to start from there before planning a production-like deployment.

See the README file of the swh-docker repository for details.