Winery backend#

The Winery backend is an swh-objstorage backend that implements the Ceph-based object storage architecture.

Design specifications#

The idea of the Winery backend is to provide an object storage capable of handling, with decent performance, the workload of the Software Heritage objstorage, that is (as of this writing, August 2025):

  • about 2PB of total storage capacity

  • 25B unique objects, of which:

    • 75% are smaller than 16KB

    • 50% are smaller than 4KB

  • capable of accepting many concurrent write operations

The desired design specs for winery are:

  • Capable of handling 10PB storage capacity with commodity hardware

  • Capable of storing 100 billion (mostly small) objects

  • At least 3,000 object/s and 100MB/s of write capacity

  • At least 3,000 object/s and 100MB/s of read capacity

  • Immune to space amplification

  • Getting the first byte of any object never takes longer than 100ms.

  • Objects can be enumerated in bulk, at least one million at a time.

  • Mirroring the content of the Software Heritage archive can be done in bulk, at least one million objects at a time.

Architecture#

In a nutshell, objects are written to a number of dedicated tables in a database used by a set of services/machines (the Write Storage) whose size can vary to control the write throughput. When a table reaches a threshold (e.g. 100GB), all objects in this table are packed together into a container (a Shard) and moved to a read-only storage that keeps expanding over time.

After a successful write, a unique identifier (the Object ID, by default the sha256 sum of the object's content) is stored in a global index table (signature2shard) together with the unique identifier of the shard it has been added to. Reads can scale out because this index table only ever receives inserts and can easily be scaled out if necessary. Writes also scale out because the database table in which the object is written is chosen randomly, and the number of such tables can be adapted to the desired write throughput, as long as the signature2shard table handles the write workload.
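
To make the write flow concrete, here is a minimal, hypothetical sketch using psycopg2; the table and column names are simplified for illustration and the real schema (in swh.objstorage.backends.winery.sharedbase) differs in detail:

import hashlib

import psycopg2


def write_object(conn, shard_table: str, shard_id: int, content: bytes) -> bytes:
    """Hypothetical sketch: store an object in one rw-shard table and record
    its signature in the global signature2shard index, in one transaction."""
    obj_id = hashlib.sha256(content).digest()  # default Object ID
    with conn:  # commits on success, rolls back on error
        with conn.cursor() as cur:
            # 1. store the object payload in the shard's dedicated table
            cur.execute(
                f"INSERT INTO {shard_table} (signature, content) VALUES (%s, %s)"
                " ON CONFLICT DO NOTHING",
                (obj_id, content),
            )
            # 2. record which shard now holds this object id
            cur.execute(
                "INSERT INTO signature2shard (signature, shard) VALUES (%s, %s)"
                " ON CONFLICT DO NOTHING",
                (obj_id, shard_id),
            )
    return obj_id


# usage (hypothetical connection string and shard name):
# conn = psycopg2.connect("dbname=winery")
# write_object(conn, "shard_0042", 42, b"hello winery")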

Writer storage#

This is the part of the winery objstorage backend responsible for writing new objects. Each objstorage writer process will lock a rw-shard, picking an existing standby shard if one is available, or creating a new shard table in the database. If the writer storage is deployed with gunicorn configured with N worker processes, there will be at least N shard tables in the database (each locked by one of the worker processes), allowing N concurrent write requests.

Writer processes release their lock on the rw-shard if they have not processed any writes for a given timeout (setting the shard back to the standby state), which allows active/active deployment of writer processes without leaving lingering writing shards.

When a write request is received, it is routed to one of the writer workers by gunicorn, thus towards one of the open rw shards. The data of the object is inserted in the dedicated shard table, and an entry mapping the id of the object to the id of the shard is added to the signature2shard index table.

When a rw-shard is considered full (that is, when the cumulated volume of objects stored in the rw-shard goes over the shards:max_size limit – typically 100GiB), the shard is marked as full and does not accept new objects.

When a shard is marked full, the packing process dumps all the object data from the rw-shard database table into a file using the Shard File Format for the Software Heritage Object Storage, stored on the shard backend storage – typically a Ceph cluster, either using RBD volumes directly or saving shard files onto a shared filesystem. The shard entry in the shards table is then marked as packed, indicating that the dedicated table can be destroyed, which in turn allows the shard to be marked as readonly.
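
The packing step can be pictured with the following hypothetical sketch; shard_writer stands in for the real swh.shard writer (whose exact interface is not reproduced here), and the table and column names are simplified:

import psycopg2


def pack_shard(conn, shard_table: str, shard_writer) -> int:
    """Hypothetical sketch: stream every object of a full rw-shard table into
    a shard file through shard_writer, a stand-in object exposing a
    write(key, content) method (the real code uses the swh.shard library)."""
    count = 0
    with conn:
        # server-side cursor so a ~100GiB shard is not loaded in RAM at once
        with conn.cursor(name="winery_pack") as cur:
            cur.execute(f"SELECT signature, content FROM {shard_table}")
            for signature, content in cur:
                shard_writer.write(bytes(signature), bytes(content))
                count += 1
    return count


# after a successful pack the shard row is moved to the packed state and,
# with clean_immediately set, the rw-shard table is dropped right away.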

Reader storage#

This is the part of the winery objstorage backend responsible for reading objects.

There are two possible cases: the requested object may be in a ro-shard (i.e. in a shard file stored in the shard file storage), or in a rw-shard (i.e. in one of the open shard tables in the database) if the object has been written recently and has not yet made its way to a ro-shard file.

When an object is requested, the signature2shard index is queried to retrieve the identifier of the shard in which this object is. This id is used to retrieve the name and state of this shard. Depending on the state of the shard, the object content will then be retrieved either from the rw-shard table (if the shard is not yet marked as readonly), or retrieved from the ro-shard file otherwise.
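
This lookup can be sketched as follows; ro_lookup stands in for reading a packed shard file via swh.shard, and the table and column names are simplified assumptions:

import psycopg2


def get_object(conn, obj_id: bytes, ro_lookup) -> bytes:
    """Hypothetical sketch of the read path: find the shard holding obj_id via
    the signature2shard index, then fetch the bytes either from the rw-shard
    table or, for readonly shards, from the packed shard file.

    ro_lookup(shard_name, obj_id) is a stand-in for reading a ro-shard file."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT s.name, s.state FROM signature2shard o"
            " JOIN shards s ON s.id = o.shard WHERE o.signature = %s",
            (obj_id,),
        )
        row = cur.fetchone()
        if row is None:
            raise KeyError(obj_id.hex())
        shard_name, state = row
        if state != "readonly":
            # recently written object: still in one of the database shard tables
            cur.execute(
                f"SELECT content FROM {shard_name} WHERE signature = %s", (obj_id,)
            )
            return bytes(cur.fetchone()[0])
    # already packed: read from the ro-shard file on the shards pool backend
    return ro_lookup(shard_name, obj_id)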

Shards pool backends#

Shard files use a custom but simple binary file format to pack together a number of objects. It uses a cmph-based index to allow constant-time access to individual objects from their id. The Shard File Format for the Software Heritage Object Storage library is used to create, read and manipulate these files.

In order to support the creation, storage and replication of 10k+ shard files, a clustered and safe storage solution must be used as backend.

Winery currently supports two types of pools to store read-only shard files:

  • Ceph RBD (rbd): this is the original design; it packs content objects directly into Ceph block devices (RBD), using the Shard File Format for the Software Heritage Object Storage. When an RBD volume has been created, or when a winery frontend node starts, RBD volumes are mapped on the winery frontend nodes so that objects can be accessed.

  • Regular files (directory): in this backend, regular files are created in a directory (the base_directory configuration entry under the shards_pool section). In a production-like deployment, this directory will typically be made available on all winery front-end nodes via a shared storage solution like NFS or CephFS.
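
For the directory backend, the on-disk layout follows the {base_directory}/{pool_name} convention described in the configuration below; the sketch here only illustrates that layout, and the exact shard file naming is an assumption:

from pathlib import Path


def shard_file_path(base_directory: str, pool_name: str, shard_name: str) -> Path:
    """Hypothetical sketch: where a packed shard file lives with the directory
    shards pool backend ({base_directory}/{pool_name} layout)."""
    return Path(base_directory) / pool_name / shard_name


# e.g. shard_file_path("/srv/winery/pool", "shards", "shard-0042")
#   -> PosixPath('/srv/winery/pool/shards/shard-0042')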

Distributed mode#

Winery is usually deployed as a few separate components that synchronize with one another through the shared database (aka in a distributed mode); the shard lifecycle that these components drive is sketched after this list:

  • read-only instances provide access, in read-only mode, to both read-only shards and shards that are currently being written to

  • writer instances each hold one of the write tables locked, and write objects to them

  • the shard packer swh objstorage winery packer handles the packing process asynchronously (outside of the WineryWriter process):

    • when a shard becomes full, the shard is marked as locked in the database (by the packer process), and is moved to the packing state

    • the shard file is created (when create_images is set) or waited for (if the management is delegated to the shard manager, aka swh objstorage winery rbd) [1]

    • when the shard file is available, the shard gets packed

    • once the packing is done, the shard is moved to the packed state

    • if clean_immediately is set, the write shard is immediately removed and the shard moved to the readonly state [2]

  • when using the RBD shard pool backend, the RBD shard manager swh objstorage winery rbd handles the management of RBD images:

    • all known readonly shards are mapped immediately (i.e. the RBD image is mapped as a block device on the host),

    • (if manage_rw_images is set) when a standby or writing shard appears, a new RBD image is provisioned in the Ceph cluster, and mapped read-write

    • when a shard packing completes (and a shard status becomes one of packed, cleaning or readonly), the image is mapped (or remapped) read-only.

    • every time a shard is mapped read-only on a given host, that fact is recorded in a database column

  • the RW shard cleaner swh objstorage winery rw-shard-cleaner performs cleanup of the packed read-write shards, as soon as they are recorded as mapped on enough (--min-mapped-hosts) hosts. They get locked in the cleaning state, the database cleanup is performed, and the shard is then moved to the final readonly state.
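
The shard states mentioned above form a simple lifecycle. The sketch below reconstructs it from this section; the authoritative definition is swh.objstorage.backends.winery.sharedbase.ShardState and may differ in detail:

from enum import Enum


class ShardLifecycle(Enum):
    """Shard states as described in this section (a reconstruction, not the
    real ShardState enum)."""

    STANDBY = "standby"    # unlocked rw-shard, waiting to be picked by a writer
    WRITING = "writing"    # locked by a writer process, accepting new objects
    FULL = "full"          # over shards:max_size, no longer accepting writes
    PACKING = "packing"    # locked by the packer, being dumped to a shard file
    PACKED = "packed"      # shard file complete, rw-shard table awaiting cleanup
    CLEANING = "cleaning"  # locked by the rw-shard cleaner, table being dropped
    READONLY = "readonly"  # final state: only the ro-shard file remains


# nominal progression driven by the writer, packer and rw-shard cleaner;
# a writing shard can also fall back to standby after rw_idle_timeout.
NOMINAL_ORDER = [state.value for state in ShardLifecycle]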

Configuration#

Winery uses a structured configuration schema.

Here is a typical configuration for a RBD shards pool backend:

objstorage:
  cls: winery

  # boolean (false (default): allow writes, true: only allow reads)
  readonly: false

  # Shards-related settings
  shards:
    # integer: threshold in bytes above which shards get packed. Can be
    # overflowed by the max allowed object size.
    max_size: 100_000_000_000

    # float: timeout in seconds after which idle read-write shards get
    # released by the winery writer process
    rw_idle_timeout: 300

  # Shared database settings
  database:
    # string: PostgreSQL connection string for the object index and read-write
    # shards
    db: winery

    # string: PostgreSQL application name for connections (unset by default)
    application_name: null

  # Shards pool settings
  shards_pool:
    ## Settings for the RBD shards pool
    type: rbd

    # Ceph pool name for RBD metadata (default: shards)
    pool_name: shards

    # Ceph pool name for RBD data (default: constructed as
    # `{pool_name}-data`). This is the pool where erasure-coding should be set,
    # if required.
    data_pool_name: null

    # Use sudo to perform image management (default: true. Can be set to false
    # if packer.create_images is false and the rbd image manager is deployed
    # as root)
    use_sudo: true

    # Options passed to `rbd image map` (default: empty string)
    map_options: ""

    # Image features unsupported by the RBD kernel module. E.g.
    # exclusive-lock, object-map and fast-diff, for Linux kernels older than 5.3
    image_features_unsupported: []

  # Packer-related settings
  packer:
    # Whether the winery writer should start packing shards immediately, or
    # defer to the standalone packer (default: true, the writer launches a
    # background packer process)
    pack_immediately: false

    # Whether the packer should create shards in the shard pool, or defer to
    # the pool manager (default: true, the packer creates images)
    create_images: false

    # Whether the packer should clean read-write shards from the database
    # immediately, or defer to the rw shard cleaner (default: true, the packer
    # cleans read-write shards immediately)
    clean_immediately: false

  # Optional throttler configuration, leave unset to disable throttling
  throttler:
    # string: PostgreSQL connection string for the throttler database. Can be
    # shared with (and defaults to) the main database set in the `database`
    # section. Must be read-write even for readonly instances.
    db: winery

    # integer: max read bytes per second
    max_read_bps: 100_000_000

    # integer: max write bytes per second
    max_write_bps: 100_000_000

Here is a typical configuration for a directory shards pool backend:

objstorage:
  cls: winery

  # boolean (false (default): allow writes, true: only allow reads)
  readonly: false

  # Shards-related settings
  shards:
    # integer: threshold in bytes above which shards get packed. Can be
    # overflowed by the max allowed object size.
    max_size: 100_000_000_000

    # float: timeout in seconds after which idle read-write shards get
    # released by the winery writer process
    rw_idle_timeout: 300

  # Shared database settings
  database:
    # string: PostgreSQL connection string for the object index and read-write
    # shards
    db: winery

    # string: PostgreSQL application name for connections (unset by default)
    application_name: null

  # Shards pool settings
  shards_pool:
    ## Settings for the directory shards pool
    # Shards are stored in `{base_directory}/{pool_name}`
    type: directory
    base_directory: /srv/winery/pool
    pool_name: shards

  # Packer-related settings
  packer:
    # Whether the winery writer should start packing shards immediately, or
    # defer to the standalone packer (default: true, the writer launches a
    # background packer process)
    pack_immediately: false

    # Whether the packer should create shards in the shard pool, or defer to
    # the pool manager (default: true, the packer creates images)
    create_images: true

    # Whether the packer should clean read-write shards from the database
    # immediately, or defer to the rw shard cleaner (default: true, the packer
    # cleans read-write shards immediately)
    clean_immediately: true

  # Optional throttler configuration, leave unset to disable throttling
  throttler:
    # string: PostgreSQL connection string for the throttler database. Can be
    # shared with (and defaults to) the main database set in the `database`
    # section. Must be read-write even for readonly instances.
    db: winery

    # integer: max read bytes per second
    max_read_bps: 100_000_000

    # integer: max write bytes per second
    max_write_bps: 100_000_000
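
Programmatically, such a configuration is usually handed to the swh.objstorage factory. Here is a minimal sketch, assuming one of the YAML documents above has been saved to a file (the path is only an example):

import yaml

from swh.objstorage.factory import get_objstorage

# load one of the configurations shown above (example path)
with open("/etc/softwareheritage/objstorage/winery.yml") as f:
    config = yaml.safe_load(f)

# build a winery object storage exposing the standard swh.objstorage interface
objstorage = get_objstorage(**config["objstorage"])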

IO Throttling#

Ceph (Pacific) implements IO QoS in librbd, but it is only effective within a single process, not cluster-wide. Preliminary benchmarks showed that the accumulated read and write throughput must be throttled client-side to prevent performance degradation (slower throughput and increased latency).

Tables are created in a PostgreSQL database dedicated to throttling, so that independent processes performing I/O against the Ceph cluster can synchronize with each other and control their accumulated throughput for reads and writes. Each worker creates a row in the read and write tables and updates it every minute with its current read and write throughput, in bytes per second. It also queries all rows to figure out the current accumulated bandwidth.

If the current accumulated bandwidth is above the maximum desired speed for N active workers, the process will reduce its throughput to use a maximum of 1/N of the maximum desired speed. For instance, if the current accumulated usage is above 100MB/s and there are 10 workers, the process will reduce its own speed to 10MB/s. After the 10 workers independently do the same, each of them will share 1/10 of the bandwidth.
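
The per-worker share computation amounts to the following sketch (the real throttler keeps the per-worker figures in the PostgreSQL tables described above; the behaviour below the cap is an assumption here):

def throttled_bps(current_total_bps: int, max_bps: int, active_workers: int) -> int:
    """Sketch of the throttling rule: when the accumulated bandwidth exceeds
    the configured maximum, each worker limits itself to a 1/N share of that
    maximum; below the cap, no local limit is applied."""
    if active_workers > 0 and current_total_bps > max_bps:
        return max_bps // active_workers
    return max_bps


# 10 workers collectively above 100MB/s: each limits itself to 10MB/s
assert throttled_bps(120_000_000, 100_000_000, 10) == 10_000_000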

Implementation notes#

swh.objstorage.backends.winery.sharedbase contains the global objstorage index implementation, which associates every object id (currently, the SHA256 of the content) with the shard that contains it. The list of shards is stored in a table, associating them with a numeric id to save space, and their current swh.objstorage.backends.winery.sharedbase.ShardState. The name of the shard is used to create a table (for write shards) or an RBD image (for read shards).

swh.objstorage.backends.winery.roshard handles read-only shard management: classes handling the lifecycle of the shards pool, the swh.objstorage.backends.winery.roshard.ROShardCreator, as well as swh.objstorage.backends.winery.roshard.ROShard, a thin layer on top of swh.shard used to access the objects stored inside a read-only shard.

swh.objstorage.backends.winery.rwshard handles the database-backed write shards throughout their lifecycle.

swh.objstorage.backends.winery.objstorage.WineryObjStorage is the main entry point compatible with the swh.objstorage interface. It is a thin layer backed by a swh.objstorage.backends.winery.objstorage.WineryWriter for writes, and a swh.objstorage.backends.winery.objstorage.WineryReader for read-only accesses.

swh.objstorage.backends.winery.objstorage.WineryReader performs read-only actions on both read-only shards and write shards. It first determines the kind of shard the object belongs to by looking it up in the global index. If it is a read-only Shard, it looks up the object using swh.objstorage.backends.winery.roshard.ROShard, backed by the RBD or directory-based shards pool. If it is a write shard, it looks up the object using swh.objstorage.backends.winery.rwshard.RWShard, ultimately using a PostgreSQL table.

All swh.objstorage.backends.winery.objstorage.WineryWriter operations are idempotent so they can be resumed in case they fail. When a swh.objstorage.backends.winery.objstorage.WineryWriter is instantiated, it will either:

  • Find a write shard (i.e. a table) that is not locked by another instance by looking up the list of shards or,

  • Create a new write shard by creating a new table

and it will lock the write Shard and own it so no other instance tries to write to it. Locking is done transactionally by setting a locker id in the shards index; when the swh.objstorage.backends.winery.objstorage.WineryWriter process dies unexpectedly, these entries need to be manually cleaned up.
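
The lock acquisition can be pictured as a single conditional UPDATE on the shards index; this is a hypothetical sketch with simplified table and column names, not the actual queries used by swh.objstorage.backends.winery.sharedbase:

import uuid

import psycopg2


def try_lock_standby_shard(conn, locker_id: uuid.UUID):
    """Hypothetical sketch: atomically claim one standby shard by setting its
    locker id, returning its name, or None if no standby shard is available
    (in which case the writer creates a brand new shard table instead)."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE shards SET locker = %s, state = 'writing'"
                " WHERE id = (SELECT id FROM shards"
                "             WHERE state = 'standby' AND locker IS NULL"
                "             LIMIT 1 FOR UPDATE SKIP LOCKED)"
                " RETURNING name",
                (str(locker_id),),
            )
            row = cur.fetchone()
    return row[0] if row else None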

Writing a new object writes its identifier in the index table, and its contents in the shard table, within the same transaction.

When the cumulative size of all objects within a Write Shard exceeds a threshold, it is set to be in the full state. All objects it contains can be read from it by any swh.objstorage.backends.winery.objstorage.WineryReader but no new object will be added to it.

When pack_immediately is set, a process is spawned and tasked to transform the full shard into a Read Shard using the swh.objstorage.backends.winery.objstorage.Packer class. Should the packing process fail for any reason, a cron job will restart it when it finds Write Shards that are both in the packing state and not locked by any process. Packing is done by enumerating all the records from the Write Shard database and writing them into a Read Shard by the same name. Incomplete Read Shards will never be used by swh.objstorage.backends.winery.objstorage.WineryReader because the global index will direct it to use the Write Shard instead.

Once the packing completes, the state of the shard is modified to be packed, and from that point on the swh.objstorage.backends.winery.objstorage.WineryReader will only use the Read Shard to find the objects it contains. If clean_immediately is set, the table containing the Write Shard is then destroyed because it is no longer useful, and the process terminates on success.

Footnotes