Mirror Operations

Description

A mirror is a full copy of the Software Heritage archive, operated independently from the Software Heritage initiative. A minimal mirror consists of two parts:

  • the graph storage (typically an instance of swh.storage), which contains the Merkle DAG structure of the archive, except for the actual content of source code files (also known as blobs),

  • the object storage (typically an instance of swh.objstorage), which contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible to others. As such, a proper mirror should also let users:

  • navigate the archive copy using a Web browser and/or the Web API (typically using the web application),

  • have minimal search capabilities (typically using the swh-search service with an Elasticsearch backend),

  • retrieve data from the copy of the archive (typically using the vault service).

A mirror is initially populated, and then kept up to date, by consuming data from the Software Heritage Kafka-based journal and retrieving the blob objects (file content) from the Software Heritage object storage.

Note

It is not required that a mirror be deployed using the Software Heritage software stack. Other technologies, including different storage methods, can be used. This documentation, however, focuses on mirror deployments that use the Software Heritage software stack.

Note

This general view is very simplified and does not show all the services involved in hosting and operating a mirror.

See Hosting a mirror for a complete description of the requirements to host a mirror.

Note

Hosting a complete mirror is a complex task that involves deploying dozens of interrelated (micro-)services. It should be planned and operated carefully, using state-of-the-art ops practices (cloud-based deployment, container orchestration on an elastic execution platform such as Kubernetes or Docker Swarm, or tools such as Ansible or Salt Stack).

Important

It is strongly recommended to start with a simple Docker Swarm-based deployment (this can be done on a single machine), as described in How to deploy a mirror.

Mirroring the Graph Storage

Graph replication is based on a journal that uses Kafka as its event streaming platform.

On the Software Heritage side, every addition made to the archive consists of the addition of a Data model object. The new object is also serialized as a msgpack bytestring, which is used as the value of a message added to the Kafka topic dedicated to its object type.

The main Kafka topics for the Software Heritage Data model are:

  • swh.journal.objects.content

  • swh.journal.objects.directory

  • swh.journal.objects.extid

  • swh.journal.objects.metadata_authority

  • swh.journal.objects.metadata_fetcher

  • swh.journal.objects.origin_visit_status

  • swh.journal.objects.origin_visit

  • swh.journal.objects.origin

  • swh.journal.objects.raw_extrinsic_metadata

  • swh.journal.objects.release

  • swh.journal.objects.revision

  • swh.journal.objects.skipped_content

  • swh.journal.objects.snapshot
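
The topic names above all follow the same pattern: a common prefix followed by the object type. As a minimal sketch, the list of topics a mirror has to consume could be built as shown here. The names `TOPIC_PREFIX`, `OBJECT_TYPES`, `topic_for` and `MIRROR_TOPICS` are hypothetical and chosen for illustration; they are not part of any Software Heritage package.

```python
# Hypothetical helper for illustration only; not part of swh.journal.
TOPIC_PREFIX = "swh.journal.objects"

# Object types, copied from the topic list above.
OBJECT_TYPES = (
    "content",
    "directory",
    "extid",
    "metadata_authority",
    "metadata_fetcher",
    "origin_visit_status",
    "origin_visit",
    "origin",
    "raw_extrinsic_metadata",
    "release",
    "revision",
    "skipped_content",
    "snapshot",
)


def topic_for(object_type: str) -> str:
    """Return the name of the Kafka topic dedicated to one object type."""
    if object_type not in OBJECT_TYPES:
        raise ValueError(f"unknown object type: {object_type!r}")
    return f"{TOPIC_PREFIX}.{object_type}"


# Every topic a graph mirror needs to consume.
MIRROR_TOPICS = [topic_for(object_type) for object_type in OBJECT_TYPES]
```

Rejecting unknown object types early is a design choice of this sketch: a typo in a topic name would otherwise result in a subscription that silently never receives anything.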

In order to set up a mirror of the graph, one needs to deploy a stack capable of retrieving all these topics and storing their content reliably. For example, a Kafka cluster configured as a replica of the main Kafka broker hosted by Software Heritage would do the job (albeit not in a very useful manner by itself).

A more useful mirror can be set up using the storage component with the help of the special service named replayer provided by the swh.storage.replay module.
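
To illustrate the general idea behind such a replayer, here is a simplified sketch of a loop that reads journal messages and inserts the corresponding objects into a storage. It is not the code of swh.storage.replay: the `InMemoryGraphStorage` class and the `replay` function are hypothetical, messages are assumed to be already decoded from msgpack into dictionaries, and a plain Python iterable stands in for the Kafka consumer.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

TOPIC_PREFIX = "swh.journal.objects."


class InMemoryGraphStorage:
    """Hypothetical stand-in for a graph storage; keeps objects per type."""

    def __init__(self) -> None:
        self.objects: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    def add(self, object_type: str, obj: Dict[str, Any]) -> None:
        # Assumption of this sketch: a consumer may see the same message
        # more than once, so adding an object is made idempotent.
        if obj not in self.objects[object_type]:
            self.objects[object_type].append(obj)


def replay(
    messages: Iterable[Tuple[str, Dict[str, Any]]],
    storage: InMemoryGraphStorage,
) -> int:
    """Insert each (topic, object) message into the storage.

    Returns the number of messages that were handed to the storage.
    """
    processed = 0
    for topic, obj in messages:
        if not topic.startswith(TOPIC_PREFIX):
            continue  # not a data model topic, ignore it
        # The object type is the last component of the topic name.
        object_type = topic[len(TOPIC_PREFIX):]
        storage.add(object_type, obj)
        processed += 1
    return processed
```

For example, replaying the same origin message twice leaves a single origin in the storage:

```python
storage = InMemoryGraphStorage()
origin = {"url": "https://example.org/repo.git"}
replay(
    [
        ("swh.journal.objects.origin", origin),
        ("swh.journal.objects.origin", origin),
    ],
    storage,
)
```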

Mirroring the Object Storage

File contents (blobs) are not stored directly in the messages of the swh.journal.objects.content Kafka topic, which only contains metadata about them, such as various kinds of cryptographic hashes. A separate component is in charge of replicating blob objects from the archive and storing them in the local object storage instance.

A separate swh-journal client should subscribe to the swh.journal.objects.content topic to get the stream of blob object identifiers, then retrieve the corresponding blobs from the main Software Heritage object storage and store them in the local object storage.

A reference implementation of this component is available in the content replayer.
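
The copy loop described above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: `InMemoryObjStorage` and `replay_contents` are hypothetical names, each message is assumed to be a dictionary carrying the blob identifier under a `sha1` key, and in-memory dictionaries stand in for both the remote and the local object storage.

```python
import hashlib
from typing import Dict, Iterable


class InMemoryObjStorage:
    """Hypothetical stand-in for an object storage, keyed by SHA-1 digest."""

    def __init__(self) -> None:
        self._blobs: Dict[bytes, bytes] = {}

    def add(self, blob: bytes) -> bytes:
        """Store a blob and return its identifier."""
        obj_id = hashlib.sha1(blob).digest()
        self._blobs[obj_id] = blob
        return obj_id

    def get(self, obj_id: bytes) -> bytes:
        return self._blobs[obj_id]

    def __contains__(self, obj_id: bytes) -> bool:
        return obj_id in self._blobs


def replay_contents(
    messages: Iterable[Dict[str, bytes]],
    source: InMemoryObjStorage,
    destination: InMemoryObjStorage,
) -> int:
    """Copy every blob referenced by the messages from source to destination.

    Returns the number of blobs actually copied.
    """
    copied = 0
    for message in messages:
        obj_id = message["sha1"]  # assumed key name for the identifier
        if obj_id in destination:
            continue  # already mirrored, nothing to do
        blob = source.get(obj_id)
        # Refuse to store a blob that does not match its identifier.
        if hashlib.sha1(blob).digest() != obj_id:
            raise ValueError(f"hash mismatch for object {obj_id.hex()}")
        destination.add(blob)
        copied += 1
    return copied
```

Skipping blobs that are already present keeps the loop safe to restart, and checking the hash before storing is one way to avoid propagating a corrupted transfer into the mirror.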

Installation

When using the Software Heritage software stack to deploy a mirror, a number of Software Heritage software components must be installed (cf. architecture diagram above).

Note

It is not recommended to try to deploy each Software Heritage service individually. Instead, start from the example Docker-based deployment project described here.

A Docker Swarm-based deployment solution is provided as a working example of the mirror stack; see How to deploy a mirror.

It is strongly recommended to start from there before planning a production-like deployment.

You may want to read: