Configuration reference#
Software Heritage components are all configured with a YAML file, made of multiple blocks, most of which describe how to connect to other components/services.
Most services are composable, so they can be either instantiated locally or accessed through Software Heritage’s HTTP-based RPC protocol (cls: remote).
For example, a possible configuration for swh-vault is:
graph:
  url: http://graph.internal.softwareheritage.org:5009/
storage:
  cls: pipeline
  steps:
  - cls: retry
  - cls: remote
    url: http://webapp.internal.staging.swh.network:5002/
objstorage:
  cls: s3
  compression: gzip
  container_name: softwareheritage
  path_prefix: content
All URLs in this document are examples; see Service urls for actual values.
celery#
The scheduler uses Celery for running some tasks. This configuration key is used for parameters passed directly to Celery, e.g. the URI of the RabbitMQ broker used for task distribution, for both scheduler commands and Celery workers.
The contents of this configuration key follow the “lowercase settings” schema from Celery upstream.
Some default values can be found in swh.scheduler.celery_backend.config.
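For instance, a minimal sketch of this block, assuming a RabbitMQ broker at a placeholder address (the keys shown are standard Celery lowercase settings; actual values and additional keys will differ per deployment):

celery:
  # Placeholder RabbitMQ broker URI (hypothetical host); any Celery
  # "lowercase setting" can be given under this key.
  broker_url: amqp://guest:guest@rabbitmq.example.org:5672//
  task_acks_late: true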
graph#
The graph can only be accessed as a remote service, and its configuration block is a single key: url, which is the URL to its HTTP endpoint; usually on port 5009 or at the path /graph/.
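For example, reusing the example URL from the introduction above:

graph:
  url: http://graph.internal.softwareheritage.org:5009/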
journal#
The journal can only be instantiated locally, consuming directly from Kafka:
journal:
  brokers:
  - broker1.journal.softwareheritage.org:9093
  - broker2.journal.softwareheritage.org:9093
  - broker3.journal.softwareheritage.org:9093
  - broker4.journal.softwareheritage.org:9093
  prefix: swh.journal.objects
  sasl.mechanism: "SCRAM-SHA-512"
  security.protocol: "sasl_ssl"
  sasl.username: "..."
  sasl.password: "..."
  privileged: false
  group_id: "..."
metadata_fetcher_credentials#
Nested dictionary of strings.
The first level identifies a metadata fetcher’s name (e.g. gitea or github), the second level the lister instance (e.g. codeberg.org or github). The final level is a list of dicts containing the expected API credentials for the given instance of that fetcher. For example:
metadata_fetcher_credentials:
  github:
    github:
    - username: ...
      password: ...
    - ...
scheduler#
The scheduler can only be accessed as a remote service, and its configuration block is a single key: url, which is the URL to its HTTP endpoint; usually on port 5008 or at the path /scheduler/:
scheduler:
  cls: remote
  url: http://saatchi.internal.softwareheritage.org:5008
storage#
Backends#
The storage has four possible classes:

cassandra, see swh.storage.cassandra.storage.CassandraStorage:

storage:
  cls: cassandra
  hosts: [...]
  keyspace: swh
  port: 9042
  journal_writer:
    # ...
  # ...

postgresql, which takes a libpq connection string:

storage:
  cls: postgresql
  db: service=swh
  journal_writer:
    # ...

For optional arguments, see swh.storage.postgresql.storage.Storage.

memory, which stores data in-memory instead of persisting it somewhere; this should only be used for debugging:

storage:
  cls: memory
  journal_writer:
    # ...

remote, which takes a URL to a remote service’s HTTP endpoint; usually on port 5002 or at the path /storage/:

storage:
  cls: remote
  url: http://webapp.internal.staging.swh.network:5002/
The journal_writer key is optional. If provided, it will be used to write all additions to some sort of log (usually Kafka) before any write to the main database. Its cls can be:

kafka, which writes to a Kafka journal:
cls: kafka
brokers:
- broker1.journal.softwareheritage.org:9093
- broker2.journal.softwareheritage.org:9093
- broker3.journal.softwareheritage.org:9093
- broker4.journal.softwareheritage.org:9093
prefix: swh.journal.objects
anonymize: true
client_id: ...
producer_config: ...
swh.journal.writer.stream, which writes directly to a file (or stdout if set to -):
cls: stream
output_stream: /tmp/messages.msgpack
swh.journal.writer.inmemory, which does not actually persist anywhere, and should only be used for tests:
cls: memory
anonymize: false
Proxies#
In addition to these backends, “storage proxies” can be used and chained in order to change the behavior of accesses to the storage. They usually do not change the semantics, but perform optimizations such as batching calls, stripping redundant operations, and retrying on error.

They are invoked through the special pipeline class, which takes a list of proxy configurations as a parameter, ending with a backend configuration as seen above:
storage:
  cls: pipeline
  steps:
  - cls: buffer
    min_batch_size:
      content: 10000
      directory: 5000
  - cls: filter
  - cls: retry
  - cls: remote
    url: http://webapp1.internal.softwareheritage.org:5002/
which is equivalent to this nested configuration:
storage:
  cls: buffer
  min_batch_size:
    content: 10000
    directory: 5000
  storage:
    cls: filter
    storage:
      cls: retry
      storage:
        cls: remote
        url: http://webapp1.internal.softwareheritage.org:5002/
See swh.storage.proxies for the list of proxies.