Winery backend#
The Winery backend is a swh-objstorage backend that implements the Ceph-based object storage architecture.
Design specifications#
The idea of the Winery backend is to provide an object storage able to handle, with decent performance, the workload of the Software Heritage objstorage, that is (at the time of writing, 08/2025):
- about 2PB of total storage capacity
- 25 billion unique objects, of which:
  - 75% are smaller than 16KB
  - 50% are smaller than 4KB
- the need to accept many concurrent write operations
The desired design specs for winery are:
- Capable of handling 10PB of storage capacity with commodity hardware
- Capable of storing 100 billion (mostly small) objects
- At least 3,000 objects/s and 100MB/s of write capacity
- At least 3,000 objects/s and 100MB/s of read capacity
- Immune to space amplification
- Getting the first byte of any object never takes longer than 100ms
- Objects can be enumerated in bulk, at least one million at a time
- Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time
Architecture#
In a nutshell, objects are written to a number of dedicated tables in a database, used by the services/machines that make up the Write Storage; this number of tables can be adjusted to control the write throughput. When a threshold is reached on a table (e.g. 100GB), all objects in this table are put together in a container (a Shard) and moved to a read-only storage that keeps expanding over time.
After a successful write, a unique identifier (the Object ID, the sha256 sum of
the content of the object by default) is stored in a global index table
(signature2shard) with the unique identifier of the shard it has been added
to. Reads can scale out because this table is append-only and can easily be
replicated if necessary. Writes also scale out because the database table in
which an object is written is chosen randomly, and the number of such tables
can be adapted to the desired write throughput – as long as the
signature2shard table handles the write workload.
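For illustration, here is a minimal sketch of that index lookup, assuming hypothetical column names (signature, shard); the actual schema lives in swh.objstorage.backends.winery.sharedbase:

import psycopg

def lookup_shard(dsn: str, object_id: bytes) -> int | None:
    """Return the numeric id of the shard holding object_id, if any."""
    with psycopg.connect(dsn) as conn:
        row = conn.execute(
            "SELECT shard FROM signature2shard WHERE signature = %s",
            (object_id,),
        ).fetchone()
    return row[0] if row else None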
Writer storage#
This is the part of the winery objstorage backend responsible for writing new
objects. Each objstorage writer process will lock a rw-shard, picking an existing
standby shard if one is available, or creating a new shard table in the database.
If the writer storage is deployed with gunicorn configured with N worker
processes, there will be at least N shard tables in the database (each locked
by one of the worker processes), allowing N concurrent write requests.
Writer processes release their lock on the rw-shard if they have not processed
any writes for a given timeout (setting the shard to the standby state), which
allows active/active deployment of writer processes without leaving lingering
writing shards.
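As an illustration, here is a hedged sketch of that acquisition logic, not the actual implementation; the shards table layout and the state and locker columns are assumptions:

import uuid
import psycopg

def acquire_rw_shard(conn: psycopg.Connection, locker: str) -> str:
    """Lock an existing standby shard, or create a fresh one."""
    with conn.transaction():
        # Try to take over a standby shard, skipping rows locked by others.
        row = conn.execute(
            "UPDATE shards SET state = 'writing', locker = %s"
            " WHERE id = (SELECT id FROM shards WHERE state = 'standby'"
            "             LIMIT 1 FOR UPDATE SKIP LOCKED)"
            " RETURNING name",
            (locker,),
        ).fetchone()
        if row:
            return row[0]
        # No standby shard available: register a new one and create its table.
        name = f"shard_{uuid.uuid4().hex}"
        conn.execute(
            "INSERT INTO shards (name, state, locker)"
            " VALUES (%s, 'writing', %s)",
            (name, locker),
        )
        conn.execute(
            f"CREATE TABLE {name} (key bytea PRIMARY KEY, content bytea)"
        )
    return name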
When a write request is received, it is routed by gunicorn to one of the
writer workers, and thus to one of the open rw-shards. The data of the object
is inserted in the dedicated shard table, and an entry mapping the id of the
object to the id of the shard is added to the signature2shard index table.
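A corresponding write, sketched under the same schema assumptions as above (real column names differ), inserts the object and its index entry in a single transaction:

import psycopg

def write_object(conn: psycopg.Connection, shard_table: str,
                 shard_id: int, object_id: bytes, content: bytes) -> None:
    """Store content in the rw-shard and index it, atomically."""
    with conn.transaction():
        conn.execute(
            f"INSERT INTO {shard_table} (key, content) VALUES (%s, %s)",
            (object_id, content),
        )
        conn.execute(
            "INSERT INTO signature2shard (signature, shard)"
            " VALUES (%s, %s)",
            (object_id, shard_id),
        )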
When a rw-shard is considered full (that is, when the cumulated volume of objects stored
in the rw-shard goes over the shards.max_size limit – typically 100GiB), the
shard is marked as full and does not accept new objects.
When a shard is marked full, the packing process dumps all the object data from the rw-shard database
table into a shard file (the Shard File Format for the Software Heritage Object Storage) stored on the shard backend storage – typically a Ceph
cluster, either using RBD volumes directly, or saving shard files onto a shared filesystem.
The shard entry in the shards table is then marked as packed, meaning the dedicated
table can be destroyed, which in turn allows the shard to be marked as readonly.
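Schematically, packing amounts to the following sketch, assuming the ShardCreator interface from swh.shard and the simplified rw-shard schema used above:

import psycopg
from swh.shard import ShardCreator  # creation API assumed from swh.shard

def pack(dsn: str, shard_table: str, shard_path: str) -> None:
    """Dump every object of a full rw-shard into a shard file."""
    with psycopg.connect(dsn) as conn:
        count = conn.execute(f"SELECT count(*) FROM {shard_table}").fetchone()[0]
        with ShardCreator(shard_path, count) as creator:
            with conn.cursor(name="pack") as cur:  # server-side cursor: stream rows
                cur.execute(f"SELECT key, content FROM {shard_table}")
                for key, content in cur:
                    creator.write(bytes(key), bytes(content))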
Reader storage#
This is the part of the winery objstorage backend responsible for reading objects.
There are two possible cases: the requested object is available either in a ro-shard (a shard file stored in the shard file storage), or in a rw-shard (one of the open shard tables in the database) if the object has been written recently and has not yet made its way to a ro-shard file.
When an object is requested, the signature2shard index is queried to retrieve the
identifier of the shard in which this object is. This id is used to retrieve
the name and state of this shard. Depending on the state of the shard, the
object content will then be retrieved either from the rw-shard table (if the shard
is not yet marked as readonly), or retrieved from the ro-shard file otherwise.
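Sketched under the same schema assumptions as the earlier snippets (the shard-file branch is elided, as it goes through swh.shard):

import psycopg

def read_object(dsn: str, object_id: bytes) -> bytes | None:
    """Illustrative read dispatch; not the actual WineryReader code."""
    with psycopg.connect(dsn) as conn:
        row = conn.execute(
            "SELECT s.name, s.state FROM shards s"
            " JOIN signature2shard i ON i.shard = s.id"
            " WHERE i.signature = %s",
            (object_id,),
        ).fetchone()
        if row is None:
            return None  # unknown object
        name, state = row
        if state == "readonly":
            # Packed shard: the content lives in the ro-shard file and is
            # read through swh.shard (see ROShard); elided here.
            raise NotImplementedError("read from the ro-shard file")
        # Still in a rw-shard: read straight from its database table.
        inner = conn.execute(
            f"SELECT content FROM {name} WHERE key = %s", (object_id,)
        ).fetchone()
        return bytes(inner[0]) if inner else None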
Distributed mode#
Winery is usually deployed as a few separate components that synchronize with each
other using the shared database (aka in a distributed mode):
- read-only instances provide access, in read-only mode, to both read-only shards and shards that are currently being written to
- writer instances each hold one of the write tables locked, and write objects to them
- the shard packer swh objstorage winery packer handles the packing process asynchronously (outside of the WineryWriter process):
  - when a shard becomes full, the shard is marked as locked in the database (by the packer process), and is moved to the packing state
  - the shard file is created (when create_images is set) or waited for (if the management is delegated to the shard manager, aka swh objstorage winery rbd [1])
  - when the shard file is available, the shard gets packed
  - once the packing is done, the shard is moved to the packed state
  - if clean_immediately is set, the write shard is immediately removed and the shard moved to the readonly state [2]
- when using the RBD shard pool backend, the RBD shard manager swh objstorage winery rbd handles the management of RBD images:
  - all known readonly shards are mapped immediately (i.e. the RBD image is mapped as a block device on the host)
  - (if manage_rw_images is set) when a standby or writing shard appears, a new RBD image is provisioned in the Ceph cluster and mapped read-write
  - when a shard packing completes (and the shard status becomes one of packed, cleaning or readonly), the image is mapped (or remapped) read-only
  - every time a shard is mapped read-only on a given host, that fact is recorded in a database column
- the RW shard cleaner swh objstorage winery rw-shard-cleaner cleans up packed read-write shards as soon as they are recorded as mapped on enough (--min-mapped-hosts) hosts: they get locked in the cleaning state, the database cleanup is performed, then the shard is moved to the final readonly state
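As a recap, the shard lifecycle described above can be summarized as a small state machine. State names match swh.objstorage.backends.winery.sharedbase.ShardState; the enum values and the transition map below are a simplified reading of this section, not the actual code:

from enum import Enum

class ShardState(Enum):
    STANDBY = "standby"
    WRITING = "writing"
    FULL = "full"
    PACKING = "packing"
    PACKED = "packed"
    CLEANING = "cleaning"
    READONLY = "readonly"

# writing -> standby happens on rw_idle_timeout; the rest follows the
# packer / rw-shard-cleaner sequence described above.
TRANSITIONS = {
    ShardState.STANDBY: {ShardState.WRITING},
    ShardState.WRITING: {ShardState.STANDBY, ShardState.FULL},
    ShardState.FULL: {ShardState.PACKING},
    ShardState.PACKING: {ShardState.PACKED},
    ShardState.PACKED: {ShardState.CLEANING},
    ShardState.CLEANING: {ShardState.READONLY},
    ShardState.READONLY: set(),
}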
Configuration#
Winery uses a structured configuration schema.
Here is a typical configuration for a RBD shards pool backend:
objstorage:
cls: winery
# boolean (false (default): allow writes, true: only allow reads)
readonly: false
# Shards-related settings
shards:
# integer: threshold in bytes above which shards get packed. Can be
# overflowed by the max allowed object size.
max_size: 100_000_000_000
# float: timeout in seconds after which idle read-write shards get
# released by the winery writer process
rw_idle_timeout: 300
# Shared database settings
database:
# string: PostgreSQL connection string for the object index and read-write
# shards
db: winery
# string: PostgreSQL application name for connections (unset by default)
application_name: null
# Shards pool settings
shards_pool:
## Settings for the RBD shards pool
type: rbd
# Ceph pool name for RBD metadata (default: shards)
pool_name: shards
# Ceph pool name for RBD data (default: constructed as
# `{pool_name}-data`). This is the pool where erasure-coding should be set,
# if required.
data_pool_name: null
# Use sudo to perform image management (default: true. Can be set to false
# if packer.create_images is false and the rbd image manager is deployed
# as root)
use_sudo: true
# Options passed to `rbd image map` (default: empty string)
map_options: ""
# Image features unsupported by the RBD kernel module. E.g.
# exclusive-lock, object-map and fast-diff, for Linux kernels older than 5.3
image_features_unsupported: []
# Packer-related settings
packer:
# Whether the winery writer should start packing shards immediately, or
# defer to the standalone packer (default: true, the writer launches a
# background packer process)
pack_immediately: false
# Whether the packer should create shards in the shard pool, or defer to
# the pool manager (default: true, the packer creates images)
create_images: false
# Whether the packer should clean read-write shards from the database
# immediately, or defer to the rw shard cleaner (default: true, the packer
# cleans read-write shards immediately)
clean_immediately: false
# Optional throttler configuration, leave unset to disable throttling
throttler:
# string: PostgreSQL connection string for the throttler database. Can be
# shared with (and defaults to) the main database set in the `database`
# section. Must be read-write even for readonly instances.
db: winery
# integer: max read bytes per second
max_read_bps: 100_000_000
# integer: max write bytes per second
max_write_bps: 100_000_000
Here is a typical configuration for a directory shards pool backend:
objstorage:
cls: winery
# boolean (false (default): allow writes, true: only allow reads)
readonly: false
# Shards-related settings
shards:
# integer: threshold in bytes above which shards get packed. Can be
# overflowed by the max allowed object size.
max_size: 100_000_000_000
# float: timeout in seconds after which idle read-write shards get
# released by the winery writer process
rw_idle_timeout: 300
# Shared database settings
database:
# string: PostgreSQL connection string for the object index and read-write
# shards
db: winery
# string: PostgreSQL application name for connections (unset by default)
application_name: null
# Shards pool settings
shards_pool:
## Settings for the directory shards pool
# Shards are stored in `{base_directory}/{pool_name}`
type: directory
base_directory: /srv/winery/pool
pool_name: shards
# Packer-related settings
packer:
# Whether the winery writer should start packing shards immediately, or
# defer to the standalone packer (default: true, the writer launches a
# background packer process)
pack_immediately: false
# Whether the packer should create shards in the shard pool, or defer to
# the pool manager (default: true, the packer creates images)
create_images: true
# Whether the packer should clean read-write shards from the database
# immediately, or defer to the rw shard cleaner (default: true, the packer
# cleans read-write shards immediately)
clean_immediately: true
# Optional throttler configuration, leave unset to disable throttling
throttler:
# string: PostgreSQL connection string for the throttler database. Can be
# shared with (and defaults to) the main database set in the `database`
# section. Must be read-write even for readonly instances.
db: winery
# integer: max read bytes per second
max_read_bps: 100_000_000
# integer: max write bytes per second
max_write_bps: 100_000_000
IO Throttling#
Ceph (Pacific) implements IO QoS in librbd, but it is only effective within a single process, not cluster-wide. Preliminary benchmarks showed that the accumulated read and write throughput must be throttled client side to prevent performance degradation (lower throughput and increased latency).
Tables are created in a PostgreSQL database dedicated to throttling, so that independent processes performing I/O against the Ceph cluster can synchronize with each other and control their accumulated throughput for reads and writes. Each worker creates a row in the read and write tables and updates them every minute with its current read and write throughput, in bytes per second. It also queries all rows to figure out the current accumulated bandwidth.
If the current accumulated bandwidth is above the maximum desired speed for N active workers, the process reduces its throughput to use at most 1/N of the maximum desired speed. For instance, if the current accumulated usage is above 100MB/s and there are 10 workers, the process reduces its own speed to 10MB/s. Once the 10 workers have independently done the same, each of them uses at most 1/10 of the bandwidth.
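The 1/N rule amounts to the following (an illustrative sketch, not the actual throttler code):

def my_share_bps(max_bps: int, reported_bps: list[int]) -> int:
    """Target rate for one worker, given the rates every active worker
    (including this one) last reported to the throttler tables."""
    if sum(reported_bps) <= max_bps:
        return max_bps  # accumulated usage under budget: no throttling
    return max_bps // len(reported_bps)  # over budget: take a 1/N share

# e.g. 10 workers together exceeding a 100MB/s budget each settle at 10MB/s:
# my_share_bps(100_000_000, [15_000_000] * 10) == 10_000_000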
Implementation notes#
- swh.objstorage.backends.winery.sharedbase contains the global objstorage index implementation, which associates every object id (currently, the SHA256 of the content) with the shard that contains it. The list of shards is stored in a table, associating them with a numeric id (to save space) and their current swh.objstorage.backends.winery.sharedbase.ShardState. The name of the shard is used to create a table (for write shards) or a RBD image (for read shards).
- swh.objstorage.backends.winery.roshard handles read-only shard management: the classes handling the lifecycle of the shards pool, the swh.objstorage.backends.winery.roshard.ROShardCreator, as well as swh.objstorage.backends.winery.roshard.ROShard, a thin layer on top of swh.shard used to access the objects stored inside a read-only shard.
- swh.objstorage.backends.winery.rwshard handles the database-backed write shards throughout their whole lifecycle.
- swh.objstorage.backends.winery.objstorage.WineryObjStorage is the main entry point, compatible with the swh.objstorage interface. It is a thin layer backed by a swh.objstorage.backends.winery.objstorage.WineryWriter for writes, and a swh.objstorage.backends.winery.objstorage.WineryReader for read-only accesses.
- swh.objstorage.backends.winery.objstorage.WineryReader performs read-only actions on both read-only shards and write shards. It first determines the kind of shard an object belongs to by looking it up in the global index. If it is a read-only shard, it looks up the object using swh.objstorage.backends.winery.roshard.ROShard, backed by the RBD or directory-based shards pool. If it is a write shard, it looks up the object using swh.objstorage.backends.winery.rwshard.RWShard, ultimately reading from a PostgreSQL table.
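For reference, instantiating Winery through the generic swh.objstorage factory might look like this (a sketch matching the directory shards pool configuration above; argument details depend on the deployed swh.objstorage version):

from swh.objstorage.factory import get_objstorage

objstorage = get_objstorage(
    cls="winery",
    shards={"max_size": 100_000_000_000, "rw_idle_timeout": 300},
    database={"db": "winery"},
    shards_pool={
        "type": "directory",
        "base_directory": "/srv/winery/pool",
        "pool_name": "shards",
    },
    packer={
        "pack_immediately": False,
        "create_images": True,
        "clean_immediately": True,
    },
)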
All swh.objstorage.backends.winery.objstorage.WineryWriter
operations are idempotent, so they can be resumed if they fail. When a
swh.objstorage.backends.winery.objstorage.WineryWriter is
instantiated, it will either:
- find a write shard (i.e. a table) that is not locked by another instance, by looking up the list of shards, or
- create a new write shard by creating a new table
and it will lock the write shard and own it so that no other instance tries to
write to it. Locking is done transactionally by setting a locker id in the
shards index; if the
swh.objstorage.backends.winery.objstorage.WineryWriter process dies
unexpectedly, these entries need to be manually cleaned up.
Writing a new object records its identifier in the index table and its contents in the shard table, within the same transaction.
When the cumulative size of all objects within a Write Shard exceeds a
threshold, it is set to be in the full state. All objects it contains can be
read from it by any
swh.objstorage.backends.winery.objstorage.WineryReader but no new
object will be added to it. When pack_immediately is set, a process is
spawned and tasked with transforming the full shard into a Read Shard using the
swh.objstorage.backends.winery.objstorage.Packer class. Should the
packing process fail for any reason, a cron job will restart it when it finds
Write Shards that are both in the packing state and not locked by any
process. Packing is done by enumerating all the records from the Write Shard
database and writing them into a Read Shard of the same name. Incomplete Read
Shards will never be used by
swh.objstorage.backends.winery.objstorage.WineryReader because the
global index will direct it to use the Write Shard instead. Once the packing
completes, the state of the shard is modified to be packed, and from that
point on the swh.objstorage.backends.winery.objstorage.WineryReader
will only use the Read Shard to find the objects it contains. If
clean_immediately is set, the table containing the Write Shard is then
destroyed because it is no longer useful and the process terminates on success.
Footnotes