.. _mirror_seaweedfs:

Using seaweedfs as object storage backend
=========================================

When deploying a mirror, one of the key technological choices you have to make
is the backend used for the Software Heritage object storage. The
:ref:`swh-objstorage` module supports a variety of backends, one of which is
seaweedfs_.

SeaweedFS_ is a fast distributed storage system for blobs, objects, files, and
data lakes, able to handle billions of files. It comes with many valuable
features that are useful for a :ref:`swh-objstorage` backend:

- Distributed architecture, which makes it easy to grow the objstorage cluster
  when needed.
- Packs small files into big (30GB by default) volumes, implementing the
  concepts behind the Haystack storage system developed by Facebook, thus
  preventing space amplification problems.
- Support for `Erasure Coding`_, providing data redundancy at a fraction of
  the storage space cost of plain replication.
- Support for asynchronous cross-DC replication.

For seaweedfs to be used as an objstorage backend, it needs:

- one seaweedfs master (ideally at least 3 for HA),
- one seaweedfs filer using a fast key-value store as backend (e.g. leveldb or
  redis),
- a number of seaweedfs volume servers with enough capacity to store the whole
  Software Heritage Archive object storage (including the erasure coding
  replication factor, if used).

The key to achieving good replication performance is a good communication
backbone behind the seaweedfs cluster. At least a properly configured 10G
network is mandatory. For example, with a seaweedfs cluster consisting of 4
bare metal machines hosting all the seaweedfs components (3x master, 1x filer,
4x volumes), the content replication process can achieve a steady throughput
of 2000 objects/s at about 200MB/s (giving a total replication ETA of about 3
months as of March 2023).

.. _`Erasure Coding`: https://en.wikipedia.org/wiki/Erasure_code

Master
------

The master configuration is pretty straightforward. Typically, you will want
to ensure the following options of the ``weed master`` command are set:

- ``volumePreallocate``: preallocate disk space for volumes,
- ``metrics.address``: Prometheus gateway address.

You will also want to have a fair number of writable volumes set up (for good
concurrent write performance); this can be done in the :file:`master.toml`
configuration file:

.. code-block:: ini

   [master.volume_growth]
   copy_1 = 32

This ensures there are always at least 32 volumes used for writing. See
`the SeaweedFS optimization page`_ for more details.

Having multiple masters is a good idea; note that the number of masters should
be an odd number (typically 3 or 5).

Replication and erasure coding
++++++++++++++++++++++++++++++

Seaweedfs supports both replication (keeping multiple copies of each content
object) and erasure coding. The erasure coding implemented in seaweedfs is
10.4 Reed-Solomon Erasure Coding (EC): large volumes are split into chunks of
1GB, and every 10 data chunks are encoded into 4 additional parity chunks. So
a 30GB data volume will be encoded into 14 EC shards, each shard being 3GB in
size and made of 3 EC blocks. This tolerates the loss of any 4 shards while
using only 1.4x the data size; compared to replicating the data 5 times to
achieve the same robustness, it saves 3.6x disk space.

Erasure Coding is only activated for warm storage, i.e. storage volumes that
are sealed: almost full, not part of the active volumes in which new writes
are done, and not modified for a while.
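As a back-of-the-envelope check of the storage overhead figures above, here is
a minimal sketch of the arithmetic (plain Python, nothing seaweedfs specific):

.. code-block:: python

   # 10+4 Reed-Solomon erasure coding of a 30GB volume, as described above
   volume_gb = 30
   data_shards, parity_shards = 10, 4

   shard_gb = volume_gb / data_shards                # 3GB per shard
   ec_gb = shard_gb * (data_shards + parity_shards)  # 42GB on disk
   print(ec_gb / volume_gb)                          # -> 1.4x the data size

   # surviving the loss of any 4 copies with plain replication would
   # require keeping 5 full copies of the data
   replication_gb = 5 * volume_gb                    # 150GB on disk
   print(replication_gb / ec_gb)                     # -> ~3.6x disk space saved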
Since only sealed volumes get erasure coded, the EC processing is an
asynchronous background process initiated by the master. It is typically set
up using this piece of :file:`master.toml` configuration:

.. code-block:: ini

   [master.maintenance]
   # periodically run these scripts are the same as running them from 'weed shell'
   scripts = """
     ec.encode -fullPercent=95 -quietFor=1h
     ec.rebuild -force
     ec.balance -force
   """
   sleep_minutes = 17  # sleep minutes between each script execution

For replication, you can choose between different replication strategies:
number of copies, placement constraints for the copies (server, rack,
datacenter); see `the documentation`_ for more details.

Filer
-----

The filer is the component that links the way objects are identified in
Software Heritage (using the hash of the object as identifier) to the location
of these objects in seaweedfs volumes (along with a few other metadata
entries). This is a rather simple key/value store; the seaweedfs filer
actually uses an existing k/v store as backend.

By default, it will use a local leveldb store, putting all the objects in a
flat namespace. This works OK up to a few billion objects, but it might be a
good idea to organize the filer a bit. When using leveldb as backend k/v
store, the disk space needed to index the whole |swh| archive objstorage is
about 2TB (as of March 2023).

Multi-filer deployment
++++++++++++++++++++++

The seaweedfs filer does support running multiple filer instances. When using
a shared/distributed k/v store as backend (redis, postgresql, cassandra,
HBase, etc.), the filer itself is stateless, so it is easy to deploy several
filer instances (for HA or load balancing).

Seaweedfs also supports parallel filers with an embedded filer store (e.g.
leveldb), but in this case the asynchronous and eventually consistent nature
of the replication process between the filer instances makes it unsuitable
for the |swh| objstorage mirroring workload. Using several filer instances is
thus only recommended with a shared or distributed k/v store as backend
(typically redis).

Note that using multiple filers with an embedded filer store (leveldb) remains
possible as an HA or backup solution: as long as all the content replayers
target the same filer instance for writing objects, the filer metadata is
(eventually) replicated to the other instances and the behavior of the
replication process remains consistent.

Note that a single leveldb filer can sustain up to 2000 object insertions per
second (with fast SSD storage).

Using Redis as filer store
++++++++++++++++++++++++++

For a |swh| mirror deployment, it is probably a good idea to go with a
solution based on a shared filer store (redis should be the best choice to
ensure good performance during the replication process). However, there is a
catch: a redis filer store (``redis2``) cannot handle having all the |swh|
content entries in a single (flat) directory. Seaweedfs provides an alternate
redis backend implementation, ``redis3``, that circumvents this issue, at the
cost of slightly slower insertions.

Another approach is to configure the |swh| ``seaweedfs`` objstorage to use a
slicer: this will organize the objects in a directory structure built from the
object hash. For example, using this |swh| objstorage configuration:

.. code-block:: yaml

   objstorage_dst:
     cls: seaweedfs
     compression: none
     url: http://localhost:8088/swh/
     slicing: "0:2/2:4"

will organize the placement of objects in a tree structure with 2 levels of
directories.
For instance, a file with SHA1 ``34973274ccef6ab4dfaaf86599792fa9c3fe4689``
will be located at
``http://localhost:8088/swh/34/97/34973274ccef6ab4dfaaf86599792fa9c3fe4689``.
Note that using the slicer also comes with a slight performance hit.

See `this page`_ for more details on the ``redis3`` filer backend, and
`this one`_ for a more general discussion about handling super large
directories in the seaweedfs filer.

Filer backup
++++++++++++

The filer db is a key component of the seaweedfs backend; its content can be
rebuilt from the volumes, but the time it takes to do so will probably be in
the range of 1 to 2 months, so better not to lose it!

One can use the integrated filer backup tool to perform continuous backups of
the filer metadata using the ``weed filer.meta.backup`` command. This command
needs a :file:`backup_filer.toml` file specifying the filer store that will be
used as backup. For example, to back up an existing filer into a local leveldb
store, the :file:`backup_filer.toml` file would look like:

.. code-block:: ini

   [leveldb2]
   enabled = true
   dir = "/tmp/filerrdb"  # directory to store leveldb files

then:

.. code-block:: bash

   $ weed filer.meta.backup -config ./backup_filer.toml -filer localhost:8581
   I0315 11:18:44 95975 leveldb2_store.go:42] filer store leveldb2 dir: /tmp/filerrdb
   I0315 11:18:44 95975 file_util.go:23] Folder /tmp/filerrdb Permission: -rwxr-xr-x
   I0315 11:18:45 95975 filer_meta_backup.go:112] configured filer store to leveldb2
   I0315 11:18:45 95975 filer_meta_backup.go:80] traversing metadata tree...
   I0315 11:18:45 95975 filer_meta_backup.go:86] metadata copied up to 2023-03-15 11:18:45.559363829 +0000 UTC m=+0.915180010
   I0315 11:18:45 95975 filer_meta_backup.go:154] streaming from 2023-03-15 11:18:45.559363829 +0000 UTC

The backup process can be interrupted and restarted; using the ``-restart``
argument forces a full backup (rather than resuming the incremental backup).

This is very similar to having multiple filers configured in the seaweedfs
cluster; however, since it is meant as a backup component, it can be a good
idea to keep this backup filer store out of the cluster entirely.

Volume
------

seaweedfs volume servers are the backbone of a seaweedfs cluster: this is
where objects are stored. Each volume server will manage a number of volumes,
each of which is by default about 30GB. Content objects are packed as small
(compressed) chunks in one or more volumes (so a single content object can be
spread over several volumes).

When using several disks on one volume server, you may use hardware or
software RAID or similar redundancy techniques, but you may also prefer to run
one volume server per disk. When using safety features like erasure coding or
volume replication, seaweedfs can ensure a safe distribution of the data among
servers; depending on the number of servers and disks, make sure that losing
one (or more) server does not lead to data loss.

By default, each volume server uses memory to store the indexes of its volume
content (to keep track of the position of each chunk in the local volumes).
This can lead to a few problems when the number of volumes increases:

- extended startup time of the volume server (it needs to load all volume file
  indexes at startup), and
- memory consumption that can grow pretty large.

In the context of a |swh| mirror, it's a good idea to have
`volume servers using a leveldb index`_.
This can be done using the ``weed volume -index=leveldb`` command line
argument (note there are 3 leveldb variants of the ``-index`` argument:
``leveldb``, ``leveldbMedium`` and ``leveldbLarge``; see the seaweedfs
documentation for more details).

.. _seaweedfs: https://github.com/seaweedfs/seaweedfs/
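For illustration, here is a sketch of what starting such a volume server,
dedicated to a single disk, could look like; the master addresses, port,
data center/rack names and data directory below are placeholders to adapt to
your deployment:

.. code-block:: bash

   # one volume server per disk, using a leveldb index instead of the
   # default in-memory one (hostnames, port and paths are hypothetical)
   weed volume \
       -mserver=master1:9333,master2:9333,master3:9333 \
       -dir=/srv/seaweedfs/disk1 \
       -port=8080 \
       -index=leveldb \
       -dataCenter=dc1 -rack=rack1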