Using seaweedfs as object storage backend#

When deploying a mirror, one of the key technological choices you have to make is the backend used for the Software Heritage object storage.

The Software Heritage - Object storage module supports a variety of backends, one of which is seaweedfs.

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lakes, able to handle billions of files. It comes with many valuable features that make it useful as a Software Heritage - Object storage backend:

  • Distributed architecture, which makes it easy to grow the objstorage cluster when needed.

  • Packs small files into big (30GB by default) volumes, implementing the concepts behind the Haystack storage system developed by Facebook, which prevents space amplification problems.

  • Support for Erasure Coding, which provides data redundancy at a fraction of the cost in terms of storage space.

  • Supports asynchronous cross-DC replication.

For seaweedfs to be used as an objstorage backend, it needs:

  • one seaweedfs master (ideally at least 3 for HA),

  • one seaweedfs filer using a fast key-value store as backend (e.g. leveldb or redis)

  • a number of seaweedfs volume servers with enough capacity to store the whole Software Heritage Archive object storage (including the erasure coding replication overhead, if used).
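
As an illustration, the smallest possible arrangement of these components on a single test machine could look like the following (a minimal sketch only; directories, ports and addresses are illustrative, and a production deployment would spread these processes over several machines):

weed master -mdir=/srv/seaweedfs/master -port=9333
weed volume -dir=/srv/seaweedfs/volume1 -mserver=localhost:9333 -port=8080
weed filer -master=localhost:9333 -port=8888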

The key to achieving good replication performance is to have a good communication backbone behind the seaweedfs cluster. At the very least, a properly configured 10G network is mandatory.

For example, with a seaweedfs cluster consisting of 4 bare metal machines hosting all the seaweedfs components (3x master, 1x filer, 4x volume server), the content replication process can achieve a steady throughput of 2000 objects/s at about 200MB/s (giving a total replication ETA of about 3 months as of March 2023).

Master#

The master configuration is pretty straightforward. Typically, you will want to ensure the following options of the weed master command are set:

  • volumePreallocate: Preallocate disk space for volumes

  • metrics.address: Prometheus gateway address
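
For instance, a master could be started as follows (a sketch only; the data directory and the Prometheus push gateway address are placeholders):

weed master -mdir=/srv/seaweedfs/master -volumePreallocate -metrics.address=pushgateway.example.org:9091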

You will also want to have a fair number of writable volumes set up (for good concurrent write performance); this can be done in the master.toml configuration file:

[master.volume_growth]
copy_1 = 32

This ensures there are always at least 32 volumes available for writing.
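
If replicated volume types are also in use, the default master.toml exposes similar keys per replica count; a sketch (the values below are illustrative, not recommendations):

[master.volume_growth]
copy_1 = 32       # volumes grown at a time for non-replicated volumes
copy_2 = 6        # grow 6 logical volumes (2 x 6 = 12 actual volumes) for 2-copy volumes
copy_3 = 3        # grow 3 logical volumes (3 x 3 = 9 actual volumes) for 3-copy volumes
copy_other = 1    # fallback for other replication settings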

See the SeaweedFS optimization page for more details.

Having multiple masters is a good idea. Since the masters coordinate through the Raft protocol, their number should be odd (e.g. 3 or 5).
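
A 3-master HA setup is typically started by pointing every master at the full list of peers; a sketch assuming three hosts named master1, master2 and master3:

# run on each of master1, master2 and master3
weed master -mdir=/srv/seaweedfs/master -port=9333 -peers=master1:9333,master2:9333,master3:9333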

Replication and erasure coding#

Seaweedfs supports both replication (keeping multiple copies of each content object) and erasure coding. The erasure coding implemented in seaweedfs is 10.4 Reed-Solomon Erasure Coding (EC): the large volumes are split into chunks of 1GB, and every 10 data chunks are encoded into 4 additional parity chunks. A 30 GB data volume is thus encoded into 14 EC shards, each shard being 3 GB in size and holding 3 EC blocks. This allows the loss of any 4 shards at the cost of 1.4x the data size. Compared to replicating the data 5 times to achieve the same robustness, it saves 3.6x disk space.

Erasure Coding is only activated for warm storage, i.e. storage volumes that are sealed (almost full, not part of the active volumes in which new writes are done, and have not been modified for a while).

As such, the EC processing is an asynchronous background process initiated by the master. It is typically set up using this piece of master.toml configuration:

[master.maintenance]
# these scripts are run periodically, just as if they were executed from 'weed shell'
scripts = """
  ec.encode -fullPercent=95 -quietFor=1h
  ec.rebuild -force
  ec.balance -force
"""
sleep_minutes = 17          # sleep minutes between each script execution

For replication, you can choose between different replication strategies: number of copies, placement constraints for copies (server, rack, datacenter); see the documentation for more details.
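
As an example of how replication strategies are expressed, seaweedfs uses a three-digit replication code (data centers / racks / servers); a sketch of setting a cluster-wide default on the master (the value is illustrative):

# 010 = keep one extra copy on a different rack within the same data center
weed master -mdir=/srv/seaweedfs/master -defaultReplication=010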

Filer#

The filer is the component that makes the link between the way objects are identified in Software Heritage (using the hash of the object as identifier) and the location of the objects (as well as a few other metadata entries) in a seaweedfs volume.

This is a rather simple key/value store; the seaweedfs filer actually uses an existing k/v store as backend. By default, it will use a local leveldb store, putting all the objects in a flat namespace. This works fine up to a few billion objects, but it might be a good idea to organize the filer a bit.

When using leveldb as backend k/v store, the volume needed to index the whole Software Heritage archive objstorage is about 2TB (as of March 2023).
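
For reference, the default local leveldb filer store is configured in filer.toml with a stanza like the following (the directory is a placeholder):

[leveldb2]
enabled = true
dir = "/srv/seaweedfs/filerldb2"  # directory to store leveldb files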

Multi-filer deployment#

The seaweedfs filer does support running multiple filer instances. When using a shared/distributed k/v store as backend (redis, postgresql, cassandra, HBase, etc.), the filer itself is stateless, so it is easy to deploy several filer instances. Seaweedfs also supports parallel filers with an embedded filer store (e.g. leveldb), but in this case the asynchronous and eventually consistent nature of the replication process between the different filer instances makes it unsuitable for the Software Heritage objstorage mirroring workload.

It’s however possible to use several filer instances (for HA or load balancing) when using a shared or distributed k/v store as backend (typically redis).

Note that using multiple filers with an embedded filer store (leveldb) remains possible as an HA or backup solution; as long as all the content replayers target the same filer instance for writing objects, it should be fine. Doing so, the filer is (eventually) replicated and the behavior of the replication process remains consistent.

Note that a single leveldb filer can support up to 2000 object insertions / sec (with fast SSD storage).

Using Redis as filer store#

For a Software Heritage mirror deployment, it is probably a good idea to go with a solution based on a shared filer store (redis should be the best choice to ensure good performance during the replication process).

However there is a small catch: a redis filer (redis2) store won’t be able to handle having all the Software Heritage content entries in a single (flat) directory.

Seaweedfs provides an alternate redis backend implementation, redis3, that can circumvent this issue. This comes at the cost of slightly slower insertions.
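
A redis3 filer store is enabled with a filer.toml stanza along these lines (the redis address is a placeholder; other options such as password or sentinel settings may apply to your setup):

[redis3]
enabled = true
address = "redis.example.org:6379"
password = ""
database = 0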

Another approach is to configure the Software Heritage seaweedfs objstorage to use a slicer: this will organize the objects in a directory structure built from the object hash. For example, using this Software Heritage objstorage configuration:

objstorage_dst:
  cls: seaweedfs
  compression: none
  url: http://localhost:8088/swh/
  slicing: "0:2/2:4"

will organize the placement of objects in a tree structure with two levels of directories. For instance, a file with SHA1 34973274ccef6ab4dfaaf86599792fa9c3fe4689 will be located at http://localhost:8088/swh/34/97/34973274ccef6ab4dfaaf86599792fa9c3fe4689.

Note that using the slicer also comes with a slight performance hit.

See this page for more details on the redis3 filer backend, and this one for a more general discussion about handling super large directories in seaweedfs filer.

Filer backup#

The filer db is a key component of the seaweedfs backend; its content can be rebuilt from the volumes, but the time it takes to do so will probably be in the range of 1 to 2 months, so better not to lose it!

One can use the integrated filer backup tool to perform continuous backups of the filer metadata using the command weed filer.meta.backup. This command needs a backup_filer.toml file specifying the filer store that will be used as backup. For example, to back up an existing filer into a local leveldb store, the backup_filer.toml file would look like:

[leveldb2]
enabled = true
dir = "/tmp/filerrdb"  # directory to store leveldb files

then:

weed filer.meta.backup -config ./backup_filer.toml -filer localhost:8581
I0315 11:18:44 95975 leveldb2_store.go:42] filer store leveldb2 dir: /tmp/filerrdb
I0315 11:18:44 95975 file_util.go:23] Folder /tmp/filerrdb Permission: -rwxr-xr-x
I0315 11:18:45 95975 filer_meta_backup.go:112] configured filer store to leveldb2
I0315 11:18:45 95975 filer_meta_backup.go:80] traversing metadata tree...
I0315 11:18:45 95975 filer_meta_backup.go:86] metadata copied up to 2023-03-15 11:18:45.559363829 +0000 UTC m=+0.915180010
I0315 11:18:45 95975 filer_meta_backup.go:154] streaming from 2023-03-15 11:18:45.559363829 +0000 UTC

It can be interrupted and restarted. Using the -restart argument forces a full backup (rather than resuming the incremental backup).
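
For instance, forcing a full backup run would look like this (reusing the filer address and config file from the example above):

weed filer.meta.backup -config ./backup_filer.toml -filer localhost:8581 -restart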

This is however very similar to having multiple filers configured in the seaweedfs cluster. Since it acts as a backup component, it can be a good idea to keep this backup filer store out of the cluster entirely.

Volume#

seaweedfs volume servers are the backbone of a seaweedfs cluster: this is where objects are stored. Each volume server manages a number of volumes, each of which is by default about 30GB.

Content objects are packed as small (compressed) chunks into one or more volumes (so one content object can be spread across several volumes).

When using several disks on one volume server, you may use hardware or software RAID or similar redundancy techniques, but you may also prefer to run one volume server per disk. When using safety features like erasure coding or volume replication, seaweedfs has support for ensuring safe distribution among servers (depending on the number of servers and disks, make sure that losing one (or more) server does not lead to data loss).
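
For example, running one volume server per disk while telling seaweedfs about the physical placement could look like the following (a sketch; host names, rack names, directories and ports are made up for illustration):

# one volume server process per disk, each advertising its data center and rack
weed volume -dir=/srv/disk1 -port=8080 -mserver=master1:9333 -dataCenter=dc1 -rack=rack1
weed volume -dir=/srv/disk2 -port=8081 -mserver=master1:9333 -dataCenter=dc1 -rack=rack1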

By default, each volume server will use memory to store the indexes for volume content (to keep track of the position of each chunk in the local volumes). This can lead to a few problems when the number of volumes increases:

  • extended startup time of the volume server (it needs to load all volume file indexes at startup), and

  • consumed memory can grow pretty large.

In the context of a Software Heritage mirror, it's a good idea to have volume servers use a leveldb index. This can be done with the weed volume -index=leveldb command line argument (note there are 3 leveldb index variants: leveldb, leveldbMedium and leveldbLarge; see the seaweedfs documentation for more details).
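
Combining this with the per-disk layout sketched above, a volume server start command could then look like (still a sketch with illustrative paths and addresses):

weed volume -dir=/srv/disk1 -mserver=master1:9333 -port=8080 -index=leveldb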