.. _swh-objstorage-replayer:

.. include:: README.rst


This Python module provides a command line tool to replicate content objects from a
source object storage to a destination one by listening to the ``content`` topic of a
:ref:`swh-journal` Kafka stream.

It is meant to be used as a building block of a mirror setup, dedicated to
replicating content objects.


Quick start
-----------

Once installed (using pip or Debian packages), the command
``swh objstorage replay`` should be available.

It needs a configuration file with 4 sections:

- ``objstorage``: the source objstorage to retrieve objects from,

- ``objstorage_dst``: the destination objstorage to put objects into,

- ``journal_client``: the journal client, i.e. the Kafka configuration from
  which the object hashes are consumed,

- ``replayer`` (optional): replayer-specific configuration options.


For example, with a configuration file like:

.. code-block:: yaml

   objstorage:
     cls: multiplexer
     objstorages:
       - cls: http
         url: https://softwareheritage.s3.amazonaws.com/content/
         compression: gzip
       - cls: remote
         url: https://login:password@objstorage.staging.swh.network

   objstorage_dst:
     cls: remote
     args:
       url: http://objstorage:5003

   journal_client:
     cls: kafka
     brokers:
     - broker1.journal.staging.swh.network:9093
     group_id: kafka-username-content-replayer-003
     sasl.username: kafka-username
     sasl.password: kafka-password
     security.protocol: sasl_ssl
     sasl.mechanism: SCRAM-SHA-512
     session.timeout.ms: 600000
     max.poll.interval.ms: 3600000
     message.max.bytes: 1000000000
     privileged: true
     batch_size: 2000

   replayer:
     error_reporter:
       host: redis
       port: 6379
       db: 0


you can start the content replayer with:

.. code-block:: bash

   $ swh objstorage -C replayer-config.yml replay


You would typically run this tool on several machines, using the same
``group_id``, to increase replication parallelism.

Also note that you may increase the default concurrency within one replayer
using the ``--concurrency`` command line option. The replayer then uses as many
replication threads as the value given, distributing the replication of
objects **within the same Kafka consumer** among these threads. This is
typically useful when replicating each object incurs a non-negligible
minimum latency (e.g. when consuming from public cloud-based objstorages).
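
The effect of ``--concurrency`` can be pictured with the minimal sketch below.
It is not the replayer's actual implementation: the in-memory dictionaries
standing in for the source and destination objstorages, and the ``copy_object``
and ``replay_batch`` helpers, are hypothetical names introduced only for
illustration. The sketch shows how the object ids received by a single consumer
can be spread over a pool of worker threads so that per-object latencies overlap.

.. code-block:: python

   import time
   from concurrent.futures import ThreadPoolExecutor

   # Hypothetical stand-ins for the source and destination objstorages.
   source = {f"obj-{i}": f"content {i}".encode() for i in range(20)}
   destination = {}


   def copy_object(obj_id):
       """Copy one object, simulating a fixed per-object latency."""
       time.sleep(0.01)  # simulated round trip to a remote source
       destination[obj_id] = source[obj_id]
       return obj_id


   def replay_batch(obj_ids, concurrency):
       """Replicate one batch of object ids using `concurrency` threads."""
       with ThreadPoolExecutor(max_workers=concurrency) as pool:
           # list() waits for all copies and re-raises any worker exception
           return list(pool.map(copy_object, obj_ids))


   copied = replay_batch(list(source), concurrency=4)

Because each simulated copy spends its time waiting rather than computing,
running four copies at once shortens the batch to roughly a quarter of the
sequential duration, which is the situation the option is meant for.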


Reference Documentation
-----------------------

.. toctree::
   :maxdepth: 2

   cli

.. only:: standalone_package_doc

   Indices and tables
   ------------------

   * :ref:`genindex`
   * :ref:`modindex`
   * :ref:`search`