Configuration reference#

Software Heritage components are all configured with a YAML file, made of multiple blocks, most of which describe how to connect to other components/services.

Most services are composable, so they can be either instantiated locally or accessed through Software Heritage’s HTTP-based RPC protocol (cls: remote).

For example, a possible configuration for swh-vault is:

graph:
  url: http://graph.internal.softwareheritage.org:5009/

storage:
  cls: pipeline
  steps:
  - cls: retry
  - cls: remote
    url: http://webapp.internal.staging.swh.network:5002/

objstorage:
  cls: s3
  compression: gzip
  container_name: softwareheritage
  path_prefix: content

All URLs in this document are examples; see Service urls for actual values.

celery#

The scheduler uses Celery to run some tasks. This configuration key holds parameters passed directly to Celery, e.g. the URI of the RabbitMQ broker used to distribute tasks, for both scheduler commands and Celery workers.

The contents of this configuration key follow the “lowercase settings” schema from Celery upstream.

Some default values can be found in swh.scheduler.celery_backend.config.
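
For illustration only, a minimal celery block might look like the following sketch; the keys are standard Celery "lowercase settings", and the broker URL shown is a placeholder rather than an actual value:

celery:
  # Celery "lowercase settings"; the broker host below is a placeholder
  broker_url: amqp://guest:guest@broker.example.org:5672//
  task_acks_late: true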

graph#

The graph can only be accessed as a remote service, and its configuration block is a single key: url, which is the URL to its HTTP endpoint; usually on port 5009 or at the path /graph/.
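
For example (reusing the example URL from the introduction above):

graph:
  url: http://graph.internal.softwareheritage.org:5009/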

journal#

The journal can only be instantiated locally; it consumes directly from Kafka:

journal:
  brokers:
    - broker1.journal.softwareheritage.org:9093
    - broker2.journal.softwareheritage.org:9093
    - broker3.journal.softwareheritage.org:9093
    - broker4.journal.softwareheritage.org:9093
  prefix: swh.journal.objects
  sasl.mechanism: "SCRAM-SHA-512"
  security.protocol: "sasl_ssl"
  sasl.username: "..."
  sasl.password: "..."
  privileged: false
  group_id: "..."

metadata_fetcher_credentials#

Nested dictionary of strings.

The first level identifies a metadata fetcher’s name (e.g. gitea, github), the second level the lister instance (e.g. codeberg.org or github). The final level is a list of dicts containing the expected API credentials for the given instance of that fetcher. For example:

metadata_fetcher_credentials:
  github:
    github:
    - username: ...
      password: ...
    - ...

scheduler#

The scheduler can only be accessed as a remote service, and its configuration block is a single key: url, which is the URL to its HTTP endpoint; usually on port 5008 or at the path /scheduler/:

scheduler:
  cls: remote
  url: http://saatchi.internal.softwareheritage.org:5008

storage#

Backends#

The storage has four possible classes:

  • cassandra, see swh.storage.cassandra.storage.CassandraStorage:

    storage:
      cls: cassandra
      hosts: [...]
      keyspace: swh
      port: 9042
      journal_writer:
        # ...
      # ...
    
  • postgresql, which takes a libpq connection string:

    storage:
      cls: postgresql
      db: service=swh
      journal_writer:
        # ...
    

    For optional arguments, see swh.storage.postgresql.storage.Storage

  • memory, which stores data in-memory instead of persisting it somewhere; this should only be used for debugging:

    storage:
      cls: memory
      journal_writer:
        # ...
    
  • remote, which takes a URL to a remote service’s HTTP endpoint; usually on port 5002 or at the path /storage/:

    storage:
      cls: remote
      url: http://webapp.internal.staging.swh.network:5002/
    

The journal_writer key is optional. If provided, it will be used to write all additions to some sort of log (usually Kafka) before any write to the main database. It has three possible classes:

  • kafka, see swh.journal.writer.kafka:

    cls: kafka
    brokers:
      - broker1.journal.softwareheritage.org:9093
      - broker2.journal.softwareheritage.org:9093
      - broker3.journal.softwareheritage.org:9093
      - broker4.journal.softwareheritage.org:9093
    prefix: swh.journal.objects
    anonymize: true
    client_id: ...
    producer_config: ...

  • stream, see swh.journal.writer.stream, which writes directly to a file (or stdout if set to -):

    cls: stream
    output_stream: /tmp/messages.msgpack

  • memory, see swh.journal.writer.inmemory, which does not actually persist anywhere and should only be used for tests:

    cls: memory
    anonymize: false
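
Putting the two together, a postgresql storage that also records its additions in Kafka could look like the following sketch (reusing example values from this page):

storage:
  cls: postgresql
  db: service=swh
  journal_writer:
    cls: kafka
    brokers:
      - broker1.journal.softwareheritage.org:9093
    prefix: swh.journal.objects
    anonymize: true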

Proxies#

In addition to these four backends, “storage proxies” can be used and chained to change the behavior of storage accesses. They usually do not change the semantics, but perform optimizations such as batching calls, stripping redundant operations, and retrying on error. They are invoked through the special pipeline class, which takes as parameter a list of proxy configurations, ending with a backend configuration as seen above:

storage:
  cls: pipeline
  steps:
    - cls: buffer
      min_batch_size:
        content: 10000
        directory: 5000
    - cls: filter
    - cls: retry
    - cls: remote
      url: http://webapp1.internal.softwareheritage.org:5002/

which is equivalent to this nested configuration:

storage:
  cls: buffer
  min_batch_size:
    content: 10000
    directory: 5000
  storage:
    cls: filter
    storage:
      cls: retry
      storage:
        cls: remote
        url: http://webapp1.internal.softwareheritage.org:5002/

See swh.storage.proxies for the list of proxies.