Hosting a mirror#
This section presents and discusses the technical requirements for hosting a Software Heritage mirror.
There are many different ways to host a mirror, but a common set of overall requirements must be fulfilled.
Namely, hosting a mirror requires:
- a dedicated infrastructure with enough computing power and storage,
- enough network bandwidth (both ingress and egress),
- good IT tooling (supervision, alerting),
- a legal and operational structure to handle takedown requests.
The mirror operator is not required to run the full Software Heritage software stack; however, it is possible to do so.
Warning
The volumes given in this section are estimates, based on numbers from May 2024.
The overall raw hardware requirements are:

- a database system for the main storage of the archive (the graph structure); the current volume of the PostgreSQL database is about 42TB, growing by about 16TB/year; the Cassandra database is about 30TB, growing by about 7TB/year (multiply this by 3 for the standard x3 replication factor),
- an object storage system for the objects (archived software source code files); the current volume is about 3PB, growing by about 750TB/year,
- an Elasticsearch engine; the current main index holds about 390M entries (origins) for an index size of 380GB, growing by about 2M entries/month,
- a web/application server for the main web application and public API,
- a few compute nodes for the application services.
Using a ZFS-like filesystem with compression support for the objstorage and the PostgreSQL database can reduce the on-disk volume by a factor of about 1.5 and 2.5 respectively (for example, the ~3PB objstorage fits in roughly 2PB on disk).
A mirror should provision machines or cloud-based resources with these numbers in mind, including the usual robustness margins (RAID-like storage, replication, backups, etc.). The sketch below illustrates the arithmetic.
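To make the provisioning arithmetic concrete, here is a minimal Python sketch based on the May 2024 figures above. The 3-year horizon, the 20% safety margin, and where the compression factors are applied are assumptions to adapt, not recommendations from this document.

```python
# Rough capacity projection for a mirror, using the May 2024 figures
# quoted above. The horizon and margin are assumptions; adjust freely.

def provision_tb(current: float, growth_per_year: float, years: float,
                 replication: float = 1.0, compression: float = 1.0,
                 margin: float = 1.2) -> float:
    """Raw disk to provision, in TB."""
    logical = current + growth_per_year * years
    return logical * replication / compression * margin

YEARS = 3  # hypothetical planning horizon

# PostgreSQL main storage: 42TB today, +16TB/year, ~x2.5 with ZFS lz4
print(f"postgresql: {provision_tb(42, 16, YEARS, compression=2.5):7.0f} TB")
# Cassandra alternative: 30TB today, +7TB/year, x3 replication factor
print(f"cassandra:  {provision_tb(30, 7, YEARS, replication=3):7.0f} TB")
# Object storage: ~3PB today, +750TB/year, ~x1.5 with ZFS lz4
print(f"objstorage: {provision_tb(3000, 750, YEARS, compression=1.5):7.0f} TB")
```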
General hardware requirements#
When deploying a mirror based on the Software Heritage software stack, one will need:
Core services#
- a database for the storage; this can be either a PostgreSQL database (single machine) or a Cassandra cluster (at least 3 nodes),
- an object storage system; this can be any supported backend: a public cloud-based object storage (e.g. S3), any supported private object storage, an ad-hoc filesystem storage, etc.,
- an Elasticsearch instance,
- a few nodes for the backend applications (swh-storage, swh-objstorage); a configuration sketch follows this list,
- the web frontend (swh-web) serving the main web app and the public API.
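The backend services read YAML configuration files. As an illustration only, the sketch below generates a minimal swh-storage configuration backed by PostgreSQL and a path-slicing objstorage; the key names, connection string, and slicing scheme are assumptions drawn from the upstream documentation, to be verified against the swh-storage and swh-objstorage versions you actually deploy.

```python
# Hypothetical minimal configuration for a swh-storage instance.
# All key names and values are illustrative; check the swh-storage
# documentation for your deployed version before relying on them.
import yaml  # pip install pyyaml

config = {
    "storage": {
        "cls": "postgresql",
        "db": "host=db.internal dbname=softwareheritage user=swh",
        "objstorage": {
            "cls": "pathslicing",
            "root": "/srv/softwareheritage/objects",
            "slicing": "0:2/2:4/4:6",
        },
    },
}

with open("storage.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```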
Replaying services#
- graph replayers as mirroring workers (increase parallelism to increase speed); see the partition-count sketch after this list,
- content replayers as mirroring workers (likewise).
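Both kinds of replayers consume the Software Heritage Kafka journal, so useful parallelism is capped by the partition count of the journal topics. As a sketch (the broker address is a placeholder, and the swh.journal.objects.* topic prefix is an assumption to check against your mirror subscription), the partition counts can be inspected with the confluent-kafka admin client:

```python
# Sketch: inspect journal topic partition counts to size replayer
# parallelism. Broker address and topic prefix are placeholders.
from confluent_kafka.admin import AdminClient  # pip install confluent-kafka

admin = AdminClient({"bootstrap.servers": "broker.journal.example:9092"})
metadata = admin.list_topics(timeout=10)

for name, topic in sorted(metadata.topics.items()):
    if name.startswith("swh.journal.objects."):
        # Running more replayer processes than partitions brings no speedup.
        print(f"{name}: {len(topic.partitions)} partitions")
```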
Vault service#
Sizing a mirror infrastructure#
Note
Solutions marked with a star (*) in the tables below are still under test or validation.
Common components#
| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| storage | swh-storage | 16 | 16GB | regular | 10GB |
| search | elasticsearch | 3 | 24GB | fast / zfs | 1TB |
| web | swh-web | 2 | 8GB | regular | 100GB |
| graph replayer | swh-storage | 32 | 32GB | regular | 10GB |
| content replayer | swh-objstorage-replayer | 32 | 64GB | regular | 10GB |
| replayer | redis | 1 | 8GB | regular | 100GB |
| vault | swh-vault | 1 | 4GB | regular | 10GB |
| vault worker | swh-vault | 1 | 16GB | fast | 1TB |
| vault | rabbitmq | 1 | 8GB | regular | 10GB |
Storage backend#
PostgreSQL (single machine):

| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| storage | postgresql | 1 | 512GB | fast + zfs (lz4) | 20TB |

Cassandra (minimal cluster):

| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| storage | cassandra | 3 | 32GB | fast | 30TB |

Cassandra (larger cluster):

| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| storage | cassandra | 6+ | 32GB | fast | 90TB |
Objstorage backend#
Single-node filesystem-based objstorage:

| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| objstorage | swh-objstorage | 1 [1] | 512GB | zfs (with lz4) | 2PB |

Distributed objstorage based on seaweedfs:

| SWH Service | Tool | Instances | RAM | Storage Type | Storage Volume |
|---|---|---|---|---|---|
| objstorage | swh-objstorage | 3 | 32GB | standard | 100GB |
| seaweed LB | nginx | 1 | 32GB | fast | 100GB |
| seaweed-master | seaweedfs | 3 | 8GB | standard | 10GB |
| seaweed-filer | seaweedfs | 3 | 32GB | fast | 1TB |
| seaweed-volume | seaweedfs | 3+ | 32GB | standard | 1PB (total) |
Notes
Example of hardware used by Software Heritage#
These configurations are documented only as examples. Feel free to adapt the architecture to your own use of the mirror, or to use any other architecture type (cloud, VMs, …).
Database#
| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| PostgreSQL | 2 | >= 32 | 768GB | 30TB write intensive |

| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| Cassandra | 12 | >= 16 | 256GB | 12TB fast + 600GB write intensive |
Objstorage#
| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| FS | 1 | >= 16 | 384GB | 1.5PB (attached disk arrays) |

| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| api/pg | 2 | >= 32 | 768GB | 10TB fast |
| Ceph mon | 3 | >= 16 | 192GB | 500GB |
| Ceph osd | 26 | >= 16 | 192GB | 144TB SAS + 360GB fast |
Compute nodes#
| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| Kubernetes node | 3 | >= 32 | 256GB | 2TB fast |
Elasticsearch#
| Type | Instance(s) | Cores | Memory | Disk |
|---|---|---|---|---|
| Elasticsearch | 3 | >= 8 | 64GB | 6TB fast |