swh.core package


swh.core.api_async module

swh.core.cli module

swh.core.config module


Check whether a file exists, and is accessible.

Returns: True if the file exists and is accessible, False if the file does not exist.
Raises: PermissionError if the file cannot be read.
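
A minimal sketch of the behaviour described above, using only the standard library. The helper name `file_exists_accessible` is hypothetical, and this is not necessarily how swh.core.config performs the check.

```python
import os


def file_exists_accessible(path):
    """Return True if path exists and is readable, False if it does not exist.

    Raises PermissionError if the file exists but cannot be read.
    """
    try:
        os.stat(path)
    except FileNotFoundError:
        return False
    if not os.access(path, os.R_OK):
        raise PermissionError("Permission denied: %r" % path)
    return True
```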

Return the base path of a configuration file


Read the raw config corresponding to base_config_path.

Can read yml or ini files.


Check whether the given config exists

swh.core.config.read(conf_file=None, default_conf=None)[source]

Read the user’s configuration file.

Fill in the gaps using default_conf. default_conf looks like this:

    {
        'a': ('str', '/tmp/swh-loader-git/log'),
        'b': ('str', 'dbname=swhloadergit'),
        'c': ('bool', True),
        'e': ('bool', None),
        'd': ('int', 10),
    }

If conf_file is None, return the default config.
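
The following sketch illustrates how a default_conf of this shape could be used to complete and type-convert raw values read from a file. The helper `fill_defaults` and its conversion rules (in particular which strings count as true) are assumptions made for illustration, not the library's implementation.

```python
def fill_defaults(raw_conf, default_conf):
    """Complete raw_conf (string values) using (type name, default) pairs."""
    converters = {
        'str': str,
        'int': int,
        # Assumption: these spellings count as true, anything else is false.
        'bool': lambda v: str(v).strip().lower() in ('true', '1', 'yes', 'on'),
    }
    conf = dict(raw_conf)
    for key, (type_name, default) in default_conf.items():
        if key in raw_conf:
            # Value present in the file: convert it to the declared type.
            conf[key] = converters[type_name](raw_conf[key])
        else:
            # Value missing: fall back to the default.
            conf[key] = default
    return conf


DEFAULT_CONF = {
    'a': ('str', '/tmp/swh-loader-git/log'),
    'c': ('bool', True),
    'd': ('int', 10),
}
conf = fill_defaults({'d': '42', 'c': 'false'}, DEFAULT_CONF)
```

Here `conf['a']` comes from the defaults, while `conf['c']` and `conf['d']` come from the raw values after conversion.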

swh.core.config.priority_read(conf_filenames, default_conf=None)[source]

Try reading the configuration files from conf_filenames, in order, and return the configuration from the first one that exists.

default_conf has the same specification as it does in read.
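
The "first one that exists" lookup can be sketched as follows. The helper name `first_existing` is hypothetical; what happens when no file exists (returning None here) is an assumption of this sketch.

```python
import os


def first_existing(conf_filenames):
    """Return the first filename that exists, or None if none of them does."""
    for filename in conf_filenames:
        if os.path.isfile(filename):
            return filename
    return None
```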

swh.core.config.merge_default_configs(base_config, *other_configs)[source]

Merge several default config dictionaries, from left to right

swh.core.config.merge_configs(base, other)[source]

Merge two config dictionaries

Config dicts are merged recursively. For every pair of values in the dicts (where ‘val’ is anything that is not a dict), the following rules apply:

  • None + type -> type
  • type + None -> None
  • dict + dict -> dict (merged)
  • val + dict -> TypeError
  • dict + val -> TypeError
  • val + val -> val (other)

so merging

    {
        'key1': {
            'skey1': value1,
            'skey2': {'sskey1': value2},
        },
        'key2': value3,
    }

with

    {
        'key1': {
            'skey1': value4,
            'skey2': {'sskey2': value5},
        },
        'key3': value6,
    }

will give:

    {
        'key1': {
            'skey1': value4,  # <-- note this
            'skey2': {
                'sskey1': value2,
                'sskey2': value5,
            },
        },
        'key2': value3,
        'key3': value6,
    }


Note that no type checking is done for anything but dicts.
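
The merge rules above can be sketched as a small recursive function. `merge_dicts` is a hypothetical name and this is an illustration of the documented rules, not the library's code; keys present on only one side are assumed to be kept as they are.

```python
def merge_dicts(base, other):
    """Merge two config dicts following the documented rules."""
    if not isinstance(base, dict) or not isinstance(other, dict):
        raise TypeError('Cannot merge %s with %s'
                        % (type(base).__name__, type(other).__name__))
    merged = {}
    keys = list(base) + [key for key in other if key not in base]
    for key in keys:
        if key not in other:
            merged[key] = base[key]
        elif key not in base:
            merged[key] = other[key]
        else:
            left, right = base[key], other[key]
            if left is None or right is None:
                # None + type -> type, type + None -> None
                merged[key] = right
            elif isinstance(left, dict) and isinstance(right, dict):
                # dict + dict -> dict (merged)
                merged[key] = merge_dicts(left, right)
            elif isinstance(left, dict) or isinstance(right, dict):
                # val + dict or dict + val -> TypeError
                raise TypeError('Cannot merge a dict with a non-dict value '
                                'for key %r' % key)
            else:
                # val + val -> val (other)
                merged[key] = right
    return merged


result = merge_dicts(
    {'key1': {'skey1': 1, 'skey2': {'sskey1': 2}}, 'key2': 3},
    {'key1': {'skey1': 4, 'skey2': {'sskey2': 5}}, 'key3': 6},
)
```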


Return the Software Heritage specific configuration paths for the given filename.

swh.core.config.prepare_folders(conf, *keys)[source]

Prepare the folder mentioned in config under keys.
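
One plausible reading of "prepare" is "create the folder if it is missing". The sketch below makes that assumption explicit; `make_folders` is a hypothetical helper, not the library function.

```python
import os


def make_folders(conf, *keys):
    """Create the directory stored under each given key of conf, if missing."""
    for key in keys:
        os.makedirs(conf[key], exist_ok=True)
```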


Load the global Software Heritage config

swh.core.config.load_named_config(name, default_conf=None, global_conf=True)[source]

Load the config named name from the Software Heritage configuration paths.

If global_conf is True (default), read the global configuration too.

class swh.core.config.SWHConfig[source]

Bases: object

Mixin to add configuration parsing abilities to classes

The class should override the class attributes:
  • DEFAULT_CONFIG (default configuration to be parsed)
  • CONFIG_BASE_FILENAME (the filename of the configuration to be used)

This class defines one classmethod, parse_config_file, which parses a configuration file using the default config as set in the class attribute.

classmethod parse_config_file(base_filename=None, config_filename=None, additional_configs=None, global_config=True)[source]

Parse the configuration file associated to the current class.

By default, parse_config_file will load the configuration cls.CONFIG_BASE_FILENAME from one of the Software Heritage configuration directories, in order, unless this is overridden by base_filename, or by config_filename (which bypasses the file lookup completely).

  • base_filename – overrides cls.CONFIG_BASE_FILENAME
  • config_filename – the file to parse instead of the defaults set from cls.CONFIG_BASE_FILENAME
  • additional_configs – overrides or extends the configuration set in cls.DEFAULT_CONFIG
  • global_config – whether to load the global configuration (default: True)

swh.core.logger module


Convert a log level of the logging module to a log level suitable for the logging Postgres DB.

swh.core.logger.get_extra_data(record, task_args=True)[source]

Get the extra data to insert to the database from the logging record

swh.core.logger.flatten(data, separator='_')[source]

Flatten the data dictionary into a flat structure
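
As an illustration of flattening with a separator, the sketch below joins nested keys with `separator`. The name `flatten_dict`, the dict return type, and the handling of non-dict values are assumptions; the library function may return a different structure.

```python
def flatten_dict(data, separator='_'):
    """Flatten nested dicts into one level, joining keys with separator."""
    flat = {}
    for key, value in data.items():
        if isinstance(value, dict):
            for subkey, subvalue in flatten_dict(value, separator).items():
                flat[key + separator + subkey] = subvalue
        else:
            flat[key] = value
    return flat


flat = flatten_dict({'task': {'name': 'load', 'args': {'url': 'x'}}, 'level': 3})
```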


Convert value to string

class swh.core.logger.PostgresHandler(connstring)[source]

Bases: logging.Handler

Log handler that stores messages in a Postgres DB.

See swh-core/swh/core/sql/log-schema.sql for the DB schema.

All logging methods can be used as usual. Additionally, arbitrary metadata can be passed to logging methods, to be stored in the DB as a single JSONB value. To do so, pass a dictionary to the ‘extra’ kwarg of any logging method; all keys in that dictionary that start with EXTRA_LOGDATA_PREFIX (currently: swh_) will be extracted to form the JSONB dictionary. The prefix will be stripped and not included in the DB.

Note: the logger name will be used to fill the ‘module’ DB column.

Sample usage:

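import logging
from swh.core.logger import PostgresHandler

logging.basicConfig(level=logging.INFO)
h = PostgresHandler('dbname=softwareheritage-log')
logger = logging.getLogger(__name__)
logger.addHandler(h)  # attach the handler so records reach the DB

logger.info('not so important notice',
            extra={'swh_type': 'swh_logging_test',
                   'swh_meditation': 'guru'})
logger.warning('something weird just happened, did you see that?')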
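
To show how prefixed ‘extra’ keys can be pulled out of a log record, here is a self-contained sketch that needs no database. `extract_extra` and `CapturingHandler` are hypothetical names; the real handler writes to Postgres, whereas this one only collects rows in memory.

```python
import logging

EXTRA_LOGDATA_PREFIX = 'swh_'


def extract_extra(record, prefix=EXTRA_LOGDATA_PREFIX):
    """Collect record attributes starting with prefix, stripping the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in vars(record).items()
        if key.startswith(prefix)
    }


class CapturingHandler(logging.Handler):
    """In-memory stand-in for a DB handler (assumption, for illustration)."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def emit(self, record):
        self.rows.append({
            'module': record.name,  # the logger name fills the 'module' column
            'level': record.levelname,
            'message': record.getMessage(),
            'data': extract_extra(record),
        })


handler = CapturingHandler()
logger = logging.getLogger('swh.sketch')
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.info('not so important notice',
            extra={'swh_type': 'swh_logging_test',
                   'swh_meditation': 'guru'})
```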

Do whatever it takes to actually log the specified logging record.

This version is intended to be implemented by subclasses and so raises a NotImplementedError.

class swh.core.logger.JournalHandler(level=0, sender_function=<function send>, **kwargs)[source]

Bases: systemd.journal.JournalHandler


Write record as a journal event.

MESSAGE is taken from the message provided by the user, and PRIORITY, LOGGER, THREAD_NAME, CODE_{FILE,LINE,FUNC} fields are appended automatically. In addition, record.MESSAGE_ID will be used if present.

swh.core.statsd module

class swh.core.statsd.TimedContextManagerDecorator(statsd, metric=None, tags=None, sample_rate=1)[source]

Bases: object

A context manager and a decorator which will report the elapsed time in the context OR in a function call.


the elapsed time at the point of completion


Start the timer


Stop the timer, send the metric value
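
The "context manager and decorator" pattern can be sketched as below. `TimedSketch` is a hypothetical class that reports to a plain callable instead of a StatsD client; the units reported and the exact fallback metric name are assumptions of this sketch.

```python
import functools
import time


class TimedSketch:
    """Report elapsed time, usable as a decorator or as a context manager."""

    def __init__(self, report, metric=None):
        self.report = report  # callable(metric, seconds), stands in for a client
        self.metric = metric
        self.elapsed = None

    def __call__(self, func):
        # Decorator use: fall back to "<module>.<function>" as the metric name.
        metric = self.metric or '%s.%s' % (func.__module__, func.__name__)

        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                self.elapsed = time.monotonic() - start
                self.report(metric, self.elapsed)
        return wrapped

    def __enter__(self):
        # Context manager use: there is no function to derive a name from.
        if not self.metric:
            raise TypeError('a metric name is required for the context manager')
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.monotonic() - self._start
        self.report(self.metric, self.elapsed)
        return False


reports = []


@TimedSketch(lambda metric, seconds: reports.append((metric, seconds)))
def get_user(user_id):
    return {'id': user_id}


get_user(1)
with TimedSketch(lambda metric, seconds: reports.append((metric, seconds)),
                 metric='user.query.time'):
    pass
```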

class swh.core.statsd.Statsd(host=None, port=None, max_buffer_size=50, namespace=None, constant_tags=None)[source]

Bases: object

Initialize a client to send metrics to a StatsD server.

  • host (str) – the host of the StatsD server. Defaults to localhost.
  • port (int) – the port of the StatsD server. Defaults to 8125.
  • max_buffer_size (int) – Maximum number of metrics to buffer before sending to the server if sending metrics in batch
  • namespace (str) – Namespace to prefix all metric names
  • constant_tags (Dict[str, str]) – Tags to attach to all metrics


This class also supports the following environment variables:

Override the default host of the statsd server
Override the default port of the statsd server

Tags to attach to every metric reported. Example value:


gauge(metric, value, tags=None, sample_rate=1)[source]

Record the value of a gauge, optionally setting a list of tags and a sample rate.

>>> statsd.gauge('users.online', 123)
>>> statsd.gauge('active.connections', 1001, tags={"protocol": "http"})

increment(metric, value=1, tags=None, sample_rate=1)[source]

Increment a counter, optionally setting a value, tags and a sample rate.

>>> statsd.increment('page.views')
>>> statsd.increment('files.transferred', 124)

decrement(metric, value=1, tags=None, sample_rate=1)[source]

Decrement a counter, optionally setting a value, tags and a sample rate.

>>> statsd.decrement('files.remaining')
>>> statsd.decrement('active.connections', 2)

histogram(metric, value, tags=None, sample_rate=1)[source]

Sample a histogram value, optionally setting tags and a sample rate.

>>> statsd.histogram('uploaded.file.size', 1445)
>>> statsd.histogram('file.count', 26, tags={"filetype": "python"})

timing(metric, value, tags=None, sample_rate=1)[source]

Record a timing, optionally setting tags and a sample rate.

>>> statsd.timing("query.response.time", 1234)

timed(metric=None, tags=None, sample_rate=1)[source]

A decorator or context manager that measures the distribution of a function’s or context’s run time. Optionally specify tags or a sample rate. When used as a decorator without a metric, the module name and function name are used as the metric name. When used as a context manager, the metric is required.

import time

@statsd.timed('user.query.time', sample_rate=0.5)
def get_user(user_id):
    # Do what you need to ...
    pass

# Is equivalent to ...
with statsd.timed('user.query.time', sample_rate=0.5):
    # Do what you need to ...
    pass

# Is equivalent to ...
start = time.monotonic()
try:
    get_user(user_id)
finally:
    statsd.timing('user.query.time', time.monotonic() - start)

set(metric, value, tags=None, sample_rate=1)[source]

Sample a set value.

>>> statsd.set('visitors.uniques', 999)

Return a connected socket.

Note: connect the socket before assigning it to the class instance to avoid bad thread race conditions.


Open a buffer to send a batch of metrics in one packet.

You can also use this as a context manager.

>>> with Statsd() as batch:
...     batch.gauge('users.online', 123)
...     batch.gauge('active.connections', 1001)

Flush the buffer and switch back to single metric packets.
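
The buffering behaviour described above can be sketched without a network socket. `BufferSketch` is a hypothetical class; joining buffered metrics with newlines is a common StatsD convention and is assumed here, as is flushing when max_buffer_size is reached.

```python
class BufferSketch:
    """Send metric lines one by one, or batched while a buffer is open."""

    def __init__(self, send_packet, max_buffer_size=50):
        self.send_packet = send_packet  # callable(str), stands in for a socket
        self.max_buffer_size = max_buffer_size
        self.buffer = None  # None means: one packet per metric

    def open_buffer(self):
        self.buffer = []

    def close_buffer(self):
        # Flush what is left and go back to single metric packets.
        self._flush()
        self.buffer = None

    def send(self, line):
        if self.buffer is None:
            self.send_packet(line)
            return
        self.buffer.append(line)
        if len(self.buffer) >= self.max_buffer_size:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.send_packet('\n'.join(self.buffer))
            self.buffer = []

    def __enter__(self):
        self.open_buffer()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close_buffer()
        return False


packets = []
sender = BufferSketch(packets.append, max_buffer_size=2)
with sender as batch:
    batch.send('users.online:123|g')
    batch.send('active.connections:1001|g')
    batch.send('page.views:1|c')
sender.send('files.remaining:-1|c')
```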


Closes connected socket if connected.

swh.core.statsd.random() → x in the interval [0, 1).

swh.core.tarball module


Given a filepath, determine if it represents an archive.

Parameters: filepath – file to test for tarball property
Returns: Bool, True if it’s a tarball, False otherwise
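
A check of this kind can be sketched with the standard library's `tarfile.is_tarfile` and `zipfile.is_zipfile`. The helper name `looks_like_archive` is hypothetical, and treating zip files as archives here is an assumption based on the "zip or tar" natures mentioned for uncompress.

```python
import tarfile
import zipfile


def looks_like_archive(filepath):
    """Return True if filepath is a tar (possibly compressed) or zip file."""
    return tarfile.is_tarfile(filepath) or zipfile.is_zipfile(filepath)
```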

swh.core.tarball.uncompress(tarpath, dest)[source]

Uncompress tarpath to dest folder if tarball is supported and safe.

Safe means that no file will be extracted outside of dest.

Note that this fixes permissions after successfully uncompressing the archive.

Parameters:
  • tarpath – path to the tarball to uncompress
  • dest – the destination folder in which to uncompress the tarball

Returns: The nature of the tarball, zip or tar.

Raises: ValueError when
  • an archive member would be extracted outside dest
  • the archive is not supported
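
The "no member outside the destination" check can be sketched for tar archives as follows. `check_members_inside` and `safe_untar` are hypothetical helpers; this sketch only looks at member names and does not cover symlink members, zip archives, or the permission fixing mentioned above.

```python
import os
import tarfile


def check_members_inside(tar, dest):
    """Raise ValueError if any member would be extracted outside dest."""
    dest_root = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest_root, member.name))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError('Archive member %r would be extracted outside %r'
                             % (member.name, dest))


def safe_untar(tarpath, dest):
    """Extract tarpath into dest after checking every member's target path."""
    with tarfile.open(tarpath) as tar:
        check_members_inside(tar, dest)
        tar.extractall(path=dest)
```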

swh.core.tarball.compress(tarpath, nature, dirpath_or_files)[source]

Create a tarball at tarpath with nature nature. The content of the tarball is either the content of dirpath_or_files (if it is a directory path) or the files yielded by dirpath_or_files (if it is an iterable).

The tarball is written to tarpath, and its format is determined by the nature argument.

swh.core.utils module


Context manager that temporarily changes the working directory, then returns to the original location on exit.
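
A minimal sketch of such a context manager, built on `contextlib.contextmanager` and `os.chdir`. The name `working_directory` is hypothetical; restoring the directory even when the body raises is a design choice of this sketch.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Change to path for the duration of the with block, then change back."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)
```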

swh.core.utils.grouper(iterable, n)[source]

Collect data into fixed-length iterables. The last block might contain fewer elements, as it holds only the remaining elements.

The invariant here is that the number of elements in the input iterable equals the total number of elements across all the iterables generated by this function.

Parameters:
  • iterable (Iterable) – an iterable
  • n (int) – size of the blocks to slice the iterable into

Returns: fixed-length blocks as iterables. As mentioned, the last iterable might be less populated.
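
The chunking behaviour and its invariant can be sketched with `itertools.islice`. `grouper_sketch` is a hypothetical name, and yielding lists (rather than some other kind of iterable) is a choice made for this illustration.

```python
import itertools


def grouper_sketch(iterable, n):
    """Yield successive blocks of at most n elements from iterable."""
    iterator = iter(iterable)
    while True:
        block = list(itertools.islice(iterator, n))
        if not block:
            return
        yield block


blocks = list(grouper_sketch(range(7), 3))
```

With seven input elements and n=3, the last block holds the single remaining element, and the block sizes add up to seven.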


Encode a unicode string containing \x<hex> backslash escapes


Decode a bytestring as utf-8, escaping the bytes of invalid utf-8 sequences as \x<hex value>. We also escape NUL bytes as they are invalid in JSON strings.
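
The two conversions can be sketched as a reversible pair using the standard `backslashreplace` error handler. `decode_escaped` and `encode_unescaped` are hypothetical names; doubling literal backslashes so that the round trip is lossless is an assumption of this sketch.

```python
import re

_ESCAPE = re.compile(r'\\(\\|x[0-9a-f]{2})')


def decode_escaped(value):
    """Decode bytes as utf-8; invalid bytes and NUL become \\x<hex> escapes."""
    value = value.replace(b'\\', b'\\\\')     # protect literal backslashes
    value = value.replace(b'\x00', b'\\x00')  # NUL is escaped explicitly
    return value.decode('utf-8', 'backslashreplace')


def encode_unescaped(value):
    """Encode a string to utf-8 bytes, turning escapes back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(value):
        out += value[pos:match.start()].encode('utf-8')
        token = match.group(1)
        if token == '\\':
            out += b'\\'
        else:
            out.append(int(token[1:], 16))
        pos = match.end()
    out += value[pos:].encode('utf-8')
    return bytes(out)


raw = b'caf\xc3\xa9 \xff\x00 a\\b'
text = decode_escaped(raw)
```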

swh.core.utils.commonname(path0, path1, as_str=False)[source]

Compute the commonname between path0 and path1.


Simple function to sort filenames of the form:


where nn is a number according to the numbers.

Typically used to sort sql/nn-swh-xxx.sql files.
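
A sort key of this kind can be sketched as below, assuming filenames shaped like the nn-swh-xxx.sql pattern mentioned above. `numbered_sort_key` and the sample filenames are hypothetical.

```python
def numbered_sort_key(filename):
    """Split 'nn-rest' into (int(nn), rest) so that numbers sort numerically."""
    number, rest = filename.split('-', 1)
    return int(number), rest


names = ['100-swh-indexes.sql', '9-swh-schema.sql', '30-swh-func.sql']
ordered = sorted(names, key=numbered_sort_key)
```

A plain string sort would place the file numbered 100 before the one numbered 30; the numeric key avoids that.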

Module contents