Hosting a mirror#

This section presents and discusses the technical requirements for hosting a Software Heritage mirror.

There are many options for hosting a mirror, but some overall requirements must be fulfilled in all cases.

Namely, hosting a mirror requires:

  • a dedicated infrastructure with enough computing power and storage

  • enough network bandwidth (both ingress and egress)

  • good IT tooling (monitoring, alerting)

  • a legal and operational structure to handle takedown requests

The mirror operator is not required to run the full Software Heritage software stack; it is, however, possible to do so.

Warning

The volumes given in this section are estimates based on figures from January 2022.

The overall raw hardware requirements are:

  • a database system for the main storage of the archive (the graph structure); the current volume is about 17TB, growing by about 280GB/month,

  • an object storage system for the objects (archived software source code files); the current volume is about 800TB, growing by about 21TB/month,

  • an elasticsearch engine; the current main index holds about 180M entries (origins) for an index size of 360GB, growing by about 2M entries/month,

  • a web/application server for the main web application and public API,

  • a few compute nodes for the application services.

A mirror should provision machines or cloud-based resources with these numbers in mind, including the usual robustness margins (RAID-like storage, replication, backups, etc.).
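For instance, the growth figures above can be turned into a rough capacity projection when provisioning. The sketch below is purely illustrative: the 12-month horizon and 20% margin are assumptions, not recommendations.

```python
# Rough capacity projection from the January 2022 figures above.
# The horizon and safety margin are illustrative assumptions.

MONTHS = 12   # provisioning horizon (assumption)
MARGIN = 1.2  # robustness margin for RAID/replication overhead (assumption)

# component: (current size in TB, monthly growth in TB)
volumes = {
    "graph database (main storage)": (17.0, 0.280),  # ~280GB/month
    "object storage":                (800.0, 21.0),  # ~21TB/month
}

for name, (current, growth) in volumes.items():
    projected = (current + growth * MONTHS) * MARGIN
    print(f"{name}: provision about {projected:.0f}TB")
```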

General hardware requirements#

When deploying a mirror based on the Software Heritage software stack, one will need:

Core services#

  • a database for the storage; this can be either a PostgreSQL database (single machine) or a Cassandra cluster (at least 3 nodes),

  • an object storage system; this can be any supported backend: a public cloud-based object storage (e.g. S3), any supported private object storage, an ad-hoc filesystem-based storage system, etc.,

  • an elasticsearch instance,

  • a few nodes for the backend applications (swh-storage, swh-objstorage),

  • the web frontend (swh-web) serving the main web app and the public API (a minimal client-side sketch follows this list).
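
The swh-storage and swh-objstorage services are typically exposed over HTTP RPC. Below is a minimal, hedged sketch of how a client addresses them; the URLs are placeholders for your own deployment, and the factory entry points should be checked against the swh.storage and swh.objstorage versions you install.

```python
from swh.storage import get_storage
from swh.objstorage.factory import get_objstorage

# RPC client for the swh-storage service (graph and metadata);
# the URL is a placeholder for your own deployment.
storage = get_storage(cls="remote", url="http://storage.mirror.example.org:5002")

# RPC client for the swh-objstorage service (source file contents);
# the URL is a placeholder for your own deployment.
objstorage = get_objstorage(cls="remote", url="http://objstorage.mirror.example.org:5003")
```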

Replaying services#
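
A mirror keeps itself up to date by subscribing to the Software Heritage journal, a Kafka-based event stream: graph replayers feed new archive objects into the mirror's storage, while content replayers copy the corresponding source files into its objstorage (the sizing tables below list both). The following is a hedged sketch of the underlying pattern, assuming the swh.journal client API; the broker address and group id are placeholders, and the production replayers wrap this same mechanism.

```python
from swh.journal.client import get_journal_client

# Kafka journal client; broker address and group id are placeholders.
client = get_journal_client(
    "kafka",
    brokers=["journal.mirror.example.org:9093"],
    group_id="example-mirror-graph-replayer",
    prefix="swh.journal.objects",
    object_types=["origin", "snapshot", "revision", "release", "directory"],
)

def process(all_objects):
    # all_objects maps an object type to the batch of journal messages;
    # a real graph replayer would write them into the mirror's storage.
    for object_type, objects in all_objects.items():
        print(f"received {len(objects)} {object_type} objects")

client.process(process)
```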

Vault service#
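
The vault service lets users request downloadable bundles (for example, a tarball of an archived directory) cooked from the archive's content. The cooking is performed by vault workers, with RabbitMQ as the task queue, as reflected in the sizing tables below.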

Sizing a mirror infrastructure#

Note

Solutions with a star (*) in the tables below are still under test or validation.

Common components#

SWH Service       Tool                     Instances  RAM   Storage Type  Storage Volume
----------------  -----------------------  ---------  ----  ------------  --------------
storage           swh-storage              16         16GB  regular       10GB
search            elasticsearch            3          32GB  fast / zfs    6TB
web               swh-web                  1          32GB  regular       100GB
graph replayer    swh-storage              32         4GB   regular       10GB
content replayer  swh-objstorage-replayer  32         4GB   regular       10GB
replayer          redis                    1          8GB   regular       100GB
vault             swh-vault                1          4GB   regular       10GB
vault worker      swh-vault                1          16GB  fast          1TB
vault             rabbitmq                 1          8GB   regular       10GB

Storage backend#

PostgreSQL (single node):

SWH Service  Tool        Instances  RAM    Storage Type    Storage Volume
-----------  ----------  ---------  -----  --------------  --------------
storage      postgresql  1          512GB  fast+zfs (lz4)  40TB

Cassandra (3-node cluster):

SWH Service  Tool       Instances  RAM   Storage Type  Storage Volume
-----------  ---------  ---------  ----  ------------  --------------
storage      cassandra  3          32GB  fast          30TB

Cassandra (6+ node cluster):

SWH Service  Tool       Instances  RAM   Storage Type  Storage Volume
-----------  ---------  ---------  ----  ------------  --------------
storage      cassandra  6+         32GB  fast          20TB
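
Either backend can be instantiated through the same swh.storage factory. The sketch below is hedged: class names and parameters may differ between swh.storage releases, and the connection settings (DSN, hosts, keyspace, URLs) are placeholders.

```python
from swh.storage import get_storage

# PostgreSQL-backed storage (single node); the DSN is a placeholder.
pg_storage = get_storage(
    cls="postgresql",
    db="host=db.mirror.example.org dbname=swh user=swh",
    objstorage={"cls": "remote", "url": "http://objstorage.mirror.example.org:5003"},
)

# Cassandra-backed storage (3+ node cluster); hosts and keyspace are placeholders.
cass_storage = get_storage(
    cls="cassandra",
    hosts=["cassandra1.mirror.example.org"],
    keyspace="swh",
    objstorage={"cls": "remote", "url": "http://objstorage.mirror.example.org:5003"},
)
```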

Objstorage backend#

Single-node ZFS filesystem:

SWH Service  Tool            Instances  RAM    Storage Type    Storage Volume
-----------  --------------  ---------  -----  --------------  --------------
objstorage   swh-objstorage  1 [1]      512GB  zfs (with lz4)  1PB
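
For this single-node option, the objstorage can be a plain filesystem-backed ("pathslicing") object storage on top of the ZFS pool. A minimal sketch, assuming the swh.objstorage factory API; the root path and slicing scheme are placeholders.

```python
from swh.objstorage.factory import get_objstorage

objstorage = get_objstorage(
    cls="pathslicing",
    root="/srv/softwareheritage/objects",  # ZFS dataset mountpoint (placeholder)
    slicing="0:2/2:4/4:6",                 # shard objects by hash prefix
)

content = b"hello world"
obj_id = objstorage.add(content)  # stored under its content hash
assert objstorage.get(obj_id) == content
```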

Winery (Ceph-based):

SWH Service  Tool            Instances  RAM    Storage Type  Storage Volume
-----------  --------------  ---------  -----  ------------  --------------
objstorage   swh-objstorage  2 [2]      32GB   standard      100GB
winery-db    postgresql      2 [2]      512GB  fast          10TB
ceph-mon     ceph            3          4GB    fast          60GB
ceph-osd     ceph            12+        4GB    mix fast+HDD  1PB (total)

SeaweedFS-based:

SWH Service     Tool            Instances  RAM   Storage Type  Storage Volume
--------------  --------------  ---------  ----  ------------  --------------
objstorage      swh-objstorage  3          32GB  standard      100GB
seaweed LB      nginx           1          32GB  fast          100GB
seaweed-master  seaweedfs       3          8GB   standard      10GB
seaweed-filer   seaweedfs       3          32GB  fast          1TB
seaweed-volume  seaweedfs       3+         32GB  standard      1PB (total)

Notes