.. _swh-storage-replayer: Storage replayer ================ The storage replayer is a mechanism allowing to replicate a storage database using the :ref:`swh-journal` (based on Kafka_). This can be used to feed |swh| mirrors, to asynchronously replicate a given storage database using a given backend (e.g. Cassandra) to another backend (e.g. Postgresql), etc. The replayer itself is a cli tool `swh storage replay` that takes the usual |swh| configuration file and runs indefinitely, consuming Kafka_ topics from the kafka cluster and inserting |swh| model objects in the destination storage. A typical ``replayer-config.yml`` configuration file for a replayer would be: .. code-block:: yaml storage: cls: pipeline steps: - cls: tenacious - cls: remote url: http://my-storage:5002 journal_client: cls: kafka brokers: - kafka-broker1:9093 - kafka-broker2:9093 group_id: my-user-replayer-01 privileged: true session.timeout.ms: 600000 socket.timeout.ms: 120000 fetch.wait.max.ms: 30000 max.poll.interval.ms: 3600000 message.max.bytes: 1000000000 batch_size: 200 fetch.message.max.bytes: 1048570 replayer: error_reporter: host: redis port: 6379 db: 10 Then you can start one replayer worker: .. code-block:: shell (swh) :~$ swh storage -C replayer-config.yml Error reporting --------------- Objects retrieved from kafka may not be possible to ingest. This can happen for several reasons and is generally not an issue (there are generally good reasons not to ingest them if the local storage's validation says so). Depending on the type of error occurring while decoding and ingestion the object, it may be a known case, in which case it is OK to ignore it, but it may be some other situation in which case there are 2 possibilities: - you want the replaying to stop ingesting objects in order to identify the reason for this error to occur (which generally involve fixing the code before being able to resume the relaying process), in which case you should not use a ``tenacious`` proxy in your storage configuration, or - you may chose to ignore it and continue the replaying process, using the ``tenacious`` proxy storage. In this second situation, it is key to keep track of failures. They will be reported in the logging system, but this is not always the best way to keep track of these specific errors. For this, there is an error reporting mechanism provided by the replayer, typically using a Redis_ key/value DB to collect the errors. This is enabled by configuring the ``replayer/error_reporter`` section in the config file. Each time an object failed to be deserialized from kafka or ingested in the storage, it will be logged in Redis using the following format: - the key will be formatted like ``/[:]`` where: - ``timestamp`` is the timestamp, as an ISO date format, when the failure occurred - ``object_type`` is one of the possible ``ModelObject.object_type`` constant ("origin", "revision", etc.) - ``identifier`` is an optional identifier for the object, if available a report time. - the value will be a yaml-encoded dictionary with 2 or 3 keys: - ``obj`` a dict representation of the failed object (as produced by ``ModelObject.to_dict()``), if available, - ``exc`` a list of strings representing the exception at the root of the failure, if any (None otherwise), - ``kafka_message`` if present, the dump of the kafka message representing the object .. _Kafka: https://kafka.apache.org/ .. _Redis: https://redis.io/