Software Heritage - Object storage replayer
This Python module provides a command line tool to replicate content objects from a source object storage to a destination one by listening to the content topic of a swh.journal kafka stream.
It is meant to be used as a building block of a mirror setup dedicated to replicating content objects.
Quick start
Once installed (using pip or Debian packages), the command swh objstorage replay should be available.
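For instance, you can check that the command is installed by displaying its help text (the output is not reproduced here and depends on the installed version):

$ swh objstorage replay --help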
It needs a configuration file with 4 sections:

- objstorage: the source objstorage to retrieve objects from,
- objstorage_dst: the destination objstorage to put objects into,
- journal_client: the journal client (kafka configuration where the object hashes are consumed from),
- replayer (optional): some replayer-specific configuration options.
For example, with a configuration file like:
objstorage:
  cls: multiplexer
  objstorages:
    - cls: http
      url: https://softwareheritage.s3.amazonaws.com/content/
      compression: gzip
    - cls: remote
      url: https://login:password@objstorage.staging.swh.network
objstorage_dst:
  cls: remote
  args:
    url: http://objstorage:5003
journal_client:
  cls: kafka
  brokers:
    - broker1.journal.staging.swh.network:9093
  group_id: kafka-username-content-replayer-003
  sasl.username: kafka-username
  sasl.password: kafka-password
  security.protocol: sasl_ssl
  sasl.mechanism: SCRAM-SHA-512
  session.timeout.ms: 600000
  max.poll.interval.ms: 3600000
  message.max.bytes: 1000000000
  privileged: true
  batch_size: 2000
replayer:
  error_reporter:
    host: redis
    port: 6379
    db: 0
you can start the content replayer with:
$ swh objstorage -C replayer-config.yml replay
You would typically run this tool on several machines, using the same group_id, to increase replication parallelism.
Also note that you may increase the default concurrency within one replayer using the --concurrency command line option. This spawns as many replication threads as the given value, distributing the replication of the objects consumed by a single kafka consumer among these threads. This is typically useful when the replication of one object comes with a non-negligible minimal latency (e.g. when consuming from public cloud-based objstorages).
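For example, to run 16 replication threads in a single replayer process (16 is only an illustrative value, to be tuned to your setup):

$ swh objstorage -C replayer-config.yml replay --concurrency 16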