swh.core package


swh.core.api module

exception swh.core.api.RemoteException[source]

Bases: Exception

class swh.core.api.MetaSWHRemoteAPI[source]

Bases: type

Metaclass for SWHRemoteAPI, which adds a method for each endpoint of the database it is designed to access.

See for example swh.indexer.storage.api.client.RemoteStorage

class swh.core.api.SWHRemoteAPI(api_exception, url, timeout=None)[source]

Bases: object

Proxy to an internal SWH API

backend_class = None

For each method of backend_class decorated with remote_api_endpoint(), a method with the same prototype and docstring will be added to this class. Calls to this new method will be translated into HTTP requests to a remote server.

This backend class will never be instantiated; it only serves as a template.
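The mechanism described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the swh.core implementation: the decorator body, the `_endpoint_path` attribute, `MetaRemoteAPI`, `Backend` and `FakeClient` are all hypothetical names, and the fake `post` method only echoes what a real client would send over HTTP.

```python
import functools
import inspect


def remote_api_endpoint(path):
    """Mark a backend method as reachable at `path` (hypothetical stand-in)."""
    def decorator(func):
        func._endpoint_path = path
        return func
    return decorator


class MetaRemoteAPI(type):
    """Add one proxy method per decorated method of `backend_class`."""

    def __new__(mcs, name, bases, attrs):
        backend_class = attrs.get('backend_class')
        if backend_class is not None:
            for meth_name, meth in vars(backend_class).items():
                path = getattr(meth, '_endpoint_path', None)
                if path is not None and meth_name not in attrs:
                    attrs[meth_name] = mcs._make_proxy(meth, path)
        return super().__new__(mcs, name, bases, attrs)

    @staticmethod
    def _make_proxy(meth, path):
        sig = inspect.signature(meth)

        @functools.wraps(meth)  # keeps the name and docstring of the template
        def proxy(self, *args, **kwargs):
            # Bind arguments by name, drop `self`, and send the rest.
            payload = dict(sig.bind(self, *args, **kwargs).arguments)
            payload.pop('self', None)
            return self.post(path, payload)
        return proxy


class Backend:
    """Template class; never instantiated."""

    @remote_api_endpoint('content/add')
    def content_add(self, content):
        """Add content to the backend."""


class FakeClient(metaclass=MetaRemoteAPI):
    backend_class = Backend

    def post(self, endpoint, data):
        # A real client would issue an HTTP request here.
        return (endpoint, data)
```

Calling `FakeClient().content_add('x')` goes through the generated proxy, which forwards the endpoint path and the named arguments to `post`.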

raw_post(endpoint, data, **opts)[source]
raw_get(endpoint, params=None, **opts)[source]
post(endpoint, data, params=None)[source]
get(endpoint, params=None)[source]
post_stream(endpoint, data, params=None)[source]
get_stream(endpoint, params=None, chunk_size=4096)[source]
class swh.core.api.BytesRequest(environ, populate_request=True, shallow=False)[source]

Bases: flask.wrappers.Request

Request with proper escaping of arbitrary byte sequences.

encoding = 'utf-8'
encoding_errors = 'surrogateescape'
swh.core.api.error_handler(exception, encoder)[source]
class swh.core.api.SWHServerAPIApp(*args, backend_class=None, backend_factory=None, **kwargs)[source]

Bases: flask.app.Flask

For each endpoint of the given backend_class, tells app.route to call a function that decodes the request and sends it to the backend object provided by the factory.

  • backend_class (Any) – The class of the backend, which will be analyzed to look for API endpoints.
  • backend_factory (Callable[[], backend_class]) – A function with no argument that returns an instance of backend_class.

alias of BytesRequest

swh.core.api_async module

swh.core.api_async.encode_data_server(data, **kwargs)[source]
swh.core.api_async.error_middleware(app, handler)[source]
class swh.core.api_async.SWHRemoteAPI(*args, middlewares=(), **kwargs)[source]

Bases: aiohttp.web_app.Application

swh.core.cli module

swh.core.config module


Check whether a file exists, and is accessible.

Returns: True if the file exists and is accessible, False if the file does not exist.
Raises: PermissionError if the file cannot be read.

Return the base path of a configuration file


Read the raw config corresponding to base_config_path.

Can read yml or ini files.


Check whether the given config exists

swh.core.config.read(conf_file=None, default_conf=None)[source]

Read the user’s configuration file.

Fill in the gap using default_conf. default_conf is similar to this:

    default_conf = {
        'a': ('str', '/tmp/swh-loader-git/log'),
        'b': ('str', 'dbname=swhloadergit'),
        'c': ('bool', True),
        'e': ('bool', None),
        'd': ('int', 10),
    }

If conf_file is None, return the default config.
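As an illustration of how a default_conf of this shape could be applied, here is a minimal sketch. The function `apply_defaults` and its conversion rules (in particular how 'bool' strings are interpreted) are assumptions made for the example; they are not taken from swh.core.config.

```python
DEFAULT_CONF = {
    'a': ('str', '/tmp/swh-loader-git/log'),
    'b': ('str', 'dbname=swhloadergit'),
    'c': ('bool', True),
    'e': ('bool', None),
    'd': ('int', 10),
}

# Assumed converters, keyed by the type name used in default_conf.
CONVERTERS = {
    'str': str,
    'int': int,
    'bool': lambda value: str(value).lower() == 'true',
}


def apply_defaults(raw_conf, default_conf):
    """Convert values found in raw_conf and fill the gaps with defaults."""
    conf = {}
    for key, (type_name, default) in default_conf.items():
        if key in raw_conf:
            conf[key] = CONVERTERS[type_name](raw_conf[key])
        else:
            conf[key] = default
    # Keep keys that have no declared default untouched.
    for key, value in raw_conf.items():
        conf.setdefault(key, value)
    return conf
```

With an empty `raw_conf`, the result is simply the default value of every key, which mirrors the behaviour described for `conf_file=None`.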

swh.core.config.priority_read(conf_filenames, default_conf=None)[source]

Try reading the configuration files from conf_filenames, in order, and return the configuration from the first one that exists.

default_conf has the same specification as it does in read.

swh.core.config.merge_default_configs(base_config, *other_configs)[source]

Merge several default config dictionaries, from left to right


Return the Software Heritage specific configuration paths for the given filename.

swh.core.config.prepare_folders(conf, *keys)[source]

Prepare the folder mentioned in config under keys.


Load the global Software Heritage config

swh.core.config.load_named_config(name, default_conf=None, global_conf=True)[source]

Load the config named name from the Software Heritage configuration paths.

If global_conf is True (default), read the global configuration too.

class swh.core.config.SWHConfig[source]

Bases: object

Mixin to add configuration parsing abilities to classes

The class should override the class attributes:
  • DEFAULT_CONFIG (default configuration to be parsed)
  • CONFIG_BASE_FILENAME (the filename of the configuration to be used)

This class defines one classmethod, parse_config_file, which parses a configuration file using the default config as set in the class attribute.

classmethod parse_config_file(base_filename=None, config_filename=None, additional_configs=None, global_config=True)[source]

Parse the configuration file associated to the current class.

By default, parse_config_file will load the configuration cls.CONFIG_BASE_FILENAME from one of the Software Heritage configuration directories, in order, unless it is overridden by base_filename or config_filename (which shortcuts the file lookup completely).

  • base_filename – overrides cls.CONFIG_BASE_FILENAME
  • config_filename – the file to parse instead of the defaults set from cls.CONFIG_BASE_FILENAME
  • additional_configs – overrides or extends the configuration set in cls.DEFAULT_CONFIG
  • global_config – load the global configuration (default: True)

swh.core.logger module


Convert a log level of the logging module to a log level suitable for the logging Postgres DB.

swh.core.logger.get_extra_data(record, task_args=True)[source]

Get the extra data to insert into the database from the logging record.

swh.core.logger.flatten(data, separator='_')[source]

Flatten the data dictionary into a flat structure
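A minimal sketch of what flattening a nested dictionary with a separator can look like. The name `flatten_sketch` and the handling of lists (indexing by position) are assumptions for illustration; the actual function may treat some value types differently.

```python
def flatten_sketch(data, separator='_'):
    """Return a flat dict whose keys join the nested path with `separator`."""
    flat = {}

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, item in value.items():
                walk(prefix + [str(key)], item)
        elif isinstance(value, (list, tuple)):
            # Assumption: sequences are flattened using their index as key.
            for index, item in enumerate(value):
                walk(prefix + [str(index)], item)
        else:
            flat[separator.join(prefix)] = value

    walk([], data)
    return flat
```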


Convert value to string

class swh.core.logger.PostgresHandler(connstring)[source]

Bases: logging.Handler

Log handler that stores messages in a Postgres DB.

See swh-core/swh/core/sql/log-schema.sql for the DB schema.

All logging methods can be used as usual. Additionally, arbitrary metadata can be passed to logging methods, requesting that they be stored in the DB as a single JSONB value. To do so, pass a dictionary to the ‘extra’ kwarg of any logging method; all keys in that dictionary that start with EXTRA_LOGDATA_PREFIX (currently: swh_) will be extracted to form the JSONB dictionary. The prefix will be stripped and not included in the DB.

Note: the logger name will be used to fill the ‘module’ DB column.

Sample usage:

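import logging
from swh.core.logger import PostgresHandler

# Attach the handler to a standard logger
h = PostgresHandler('dbname=softwareheritage-log')
logger = logging.getLogger(__name__)
logger.addHandler(h)

logger.info('not so important notice',
            extra={'swh_type': 'swh_logging_test',
                   'swh_meditation': 'guru'})
logger.warning('something weird just happened, did you see that?')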
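The prefix convention can be tried out without a database. The sketch below uses only the standard logging module; `extract_prefixed` and `CapturingHandler` are hypothetical helpers written for this example, and the handler keeps rows in memory instead of writing them to Postgres.

```python
import logging

EXTRA_LOGDATA_PREFIX = 'swh_'


def extract_prefixed(record, prefix=EXTRA_LOGDATA_PREFIX):
    """Collect record attributes starting with `prefix`, prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in vars(record).items()
        if key.startswith(prefix)
    }


class CapturingHandler(logging.Handler):
    """Keep (logger name, message, extra data) tuples in memory."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def emit(self, record):
        self.rows.append(
            (record.name, record.getMessage(), extract_prefixed(record)))


logger = logging.getLogger('sketch.extra')
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example quiet on stderr
handler = CapturingHandler()
logger.addHandler(handler)

logger.info('not so important notice',
            extra={'swh_type': 'swh_logging_test',
                   'swh_meditation': 'guru'})
```

The logging module copies every key of `extra` onto the record as an attribute, which is why the handler can recover the prefixed keys from the record itself.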

Do whatever it takes to actually log the specified logging record.

This version is intended to be implemented by subclasses and so raises a NotImplementedError.

class swh.core.logger.JournalHandler(level=0, sender_function=<function send>, **kwargs)[source]

Bases: systemd.journal.JournalHandler


Write record as a journal event.

MESSAGE is taken from the message provided by the user, and PRIORITY, LOGGER, THREAD_NAME, CODE_{FILE,LINE,FUNC} fields are appended automatically. In addition, record.MESSAGE_ID will be used if present.

swh.core.serializers module

class swh.core.serializers.SWHJSONEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

JSON encoder for data structures generated by Software Heritage.

This JSON encoder extends the default Python JSON encoder and adds awareness for the following specific types:

  • bytes (get encoded as a Base85 string);
  • datetime.datetime (get encoded as an ISO8601 string).

Non-standard types get encoded as a dictionary with two keys:

  • swhtype with value ‘bytes’ or ‘datetime’;
  • d containing the encoded value.

SWHJSONEncoder also encodes arbitrary iterables as a list (allowing serialization of generators).

Caveats: Limitations in the JSONEncoder extension mechanism prevent us from “escaping” dictionaries that only contain the swhtype and d keys, and therefore arbitrary data structures can’t be round-tripped through SWHJSONEncoder and SWHJSONDecoder.


Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
class swh.core.serializers.SWHJSONDecoder(object_hook=None, parse_float=None, parse_int=None, parse_constant=None, strict=True, object_pairs_hook=None)[source]

Bases: json.decoder.JSONDecoder

JSON decoder for data structures encoded with SWHJSONEncoder.

This JSON decoder extends the default Python JSON decoder, allowing the decoding of:

  • bytes (encoded as a Base85 string);
  • datetime.datetime (encoded as an ISO8601 string).

Non-standard types must be encoded as a dictionary with exactly two keys:

  • swhtype with value ‘bytes’ or ‘datetime’;
  • d containing the encoded value.

To limit the impact of our encoding, if the swhtype key doesn’t contain a known value, the dictionary is decoded as-is.
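The swhtype/d convention shared by the encoder and the decoder can be sketched with the standard library alone. `SketchEncoder` and `sketch_object_hook` are hypothetical names, and the exact output of the real classes may differ; the point is only to show the shape of the encoded data and the round trip.

```python
import base64
import datetime
import json


class SketchEncoder(json.JSONEncoder):
    """Encode bytes, datetimes and arbitrary iterables."""

    def default(self, o):
        if isinstance(o, bytes):
            return {'swhtype': 'bytes',
                    'd': base64.b85encode(o).decode('ascii')}
        if isinstance(o, datetime.datetime):
            return {'swhtype': 'datetime', 'd': o.isoformat()}
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        return super().default(o)


def sketch_object_hook(obj):
    """Decode dictionaries that have exactly the keys swhtype and d."""
    if set(obj) == {'swhtype', 'd'}:
        if obj['swhtype'] == 'bytes':
            return base64.b85decode(obj['d'])
        if obj['swhtype'] == 'datetime':
            return datetime.datetime.fromisoformat(obj['d'])
    # Unknown swhtype or ordinary dictionary: returned as-is.
    return obj


data = {
    'raw': b'\x00\xff',
    'when': datetime.datetime(2018, 1, 1, 12, 0,
                              tzinfo=datetime.timezone.utc),
    'gen': (i for i in range(3)),
}
encoded = json.dumps(data, cls=SketchEncoder)
decoded = json.loads(encoded, object_hook=sketch_object_hook)
```

The sketch also exhibits the caveat mentioned for the encoder: an ordinary dictionary that happens to contain only swhtype and d with a known type would be decoded as bytes or datetime rather than returned unchanged.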

raw_decode(s, idx=0)[source]

Decode a JSON document from s (a str beginning with a JSON document) and return a 2-tuple of the Python representation and the index in s where the document ended.

This can be used to decode a JSON document from a string that may have extraneous data at the end.


Write data as a msgpack stream


Read data as a msgpack stream

swh.core.statsd module

class swh.core.statsd.TimedContextManagerDecorator(statsd, metric=None, tags=None, sample_rate=1)[source]

Bases: object

A context manager and a decorator which will report the elapsed time in the context OR in a function call.
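An object can serve both roles by implementing `__enter__`/`__exit__` and `__call__`. The following is a minimal sketch of that pattern, not the class documented here: `TimedSketch` and its `report` callable (standing in for the statsd client) are hypothetical, and elapsed times are reported in seconds.

```python
import functools
import time


class TimedSketch:
    """Report elapsed time, either around a block or around a function."""

    def __init__(self, report, metric=None):
        self.report = report  # callable(metric, elapsed_seconds)
        self.metric = metric
        self.elapsed = None

    def __enter__(self):
        if not self.metric:
            raise TypeError('metric is required for a context manager')
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.monotonic() - self._start
        self.report(self.metric, self.elapsed)
        return False  # never swallow exceptions

    def __call__(self, func):
        # As a decorator, fall back to module and function name.
        metric = self.metric or '%s.%s' % (func.__module__, func.__name__)

        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                self.report(metric, time.monotonic() - start)
        return wrapped
```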


the elapsed time at the point of completion


Start the timer


Stop the timer, send the metric value

class swh.core.statsd.Statsd(host=None, port=None, max_buffer_size=50, namespace=None, constant_tags=None)[source]

Bases: object

Initialize a client to send metrics to a StatsD server.

  • host (str) – the host of the StatsD server. Defaults to localhost.
  • port (int) – the port of the StatsD server. Defaults to 8125.
  • max_buffer_size (int) – Maximum number of metrics to buffer before sending to the server if sending metrics in batch
  • namespace (str) – Namespace to prefix all metric names
  • constant_tags (Dict[str, str]) – Tags to attach to all metrics


This class also supports the following environment variables:

Override the default host of the statsd server
Override the default port of the statsd server

Tags to attach to every metric reported. Example value:


gauge(metric, value, tags=None, sample_rate=1)[source]

Record the value of a gauge, optionally setting a list of tags and a sample rate.

>>> statsd.gauge('users.online', 123)
>>> statsd.gauge('active.connections', 1001, tags={"protocol": "http"})
increment(metric, value=1, tags=None, sample_rate=1)[source]

Increment a counter, optionally setting a value, tags and a sample rate.

>>> statsd.increment('page.views')
>>> statsd.increment('files.transferred', 124)
decrement(metric, value=1, tags=None, sample_rate=1)[source]

Decrement a counter, optionally setting a value, tags and a sample rate.

>>> statsd.decrement('files.remaining')
>>> statsd.decrement('active.connections', 2)
histogram(metric, value, tags=None, sample_rate=1)[source]

Sample a histogram value, optionally setting tags and a sample rate.

>>> statsd.histogram('uploaded.file.size', 1445)
>>> statsd.histogram('file.count', 26, tags={"filetype": "python"})
timing(metric, value, tags=None, sample_rate=1)[source]

Record a timing, optionally setting tags and a sample rate.

>>> statsd.timing("query.response.time", 1234)
timed(metric=None, tags=None, sample_rate=1)[source]

A decorator or context manager that will measure the distribution of a function’s/context’s run time. Optionally specify a list of tags or a sample rate. When used as a decorator without a metric, the module name and function name will be used. The metric is required when used as a context manager.

@statsd.timed('user.query.time', sample_rate=0.5)
def get_user(user_id):
    # Do what you need to ...
    pass

# Is equivalent to ...
with statsd.timed('user.query.time', sample_rate=0.5):
    # Do what you need to ...
    pass

# Is equivalent to ...
start = time.monotonic()
try:
    get_user(user_id)
finally:
    statsd.timing('user.query.time', time.monotonic() - start)
set(metric, value, tags=None, sample_rate=1)[source]

Sample a set value.

>>> statsd.set('visitors.uniques', 999)

Return a connected socket.

Note: connect the socket before assigning it to the class instance to avoid bad thread race conditions.


Open a buffer to send a batch of metrics in one packet.

You can also use this as a context manager.

>>> with Statsd() as batch:
...     batch.gauge('users.online', 123)
...     batch.gauge('active.connections', 1001)

Flush the buffer and switch back to single metric packets.


Closes connected socket if connected.

swh.core.statsd.random() → x in the interval [0, 1).

swh.core.tarball module


Given a filepath, determine if it represents an archive.

Parameters: filepath – file to test for tarball property
Returns: Bool, True if it’s a tarball, False otherwise
swh.core.tarball.uncompress(tarpath, dest)[source]
Uncompress tarpath to dest folder if tarball is supported and safe.

Safe means that no file will be extracted outside of dest.

Note that this fixes permissions after successfully uncompressing the archive.

  • tarpath – path to tarball to uncompress
  • dest – the destination folder where to uncompress the tarball

Returns: The nature of the tarball, zip or tar.

Raises:
  • ValueError when
    • an archive member would be extracted outside dest
    • the archive is not supported
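The safety condition (no member extracted outside the destination folder) can be checked per member name before extraction. The helper below is a sketch under that assumption; `is_safe_member` is a hypothetical name and this is not necessarily how the library performs the check.

```python
import os


def is_safe_member(dest, member_name):
    """Return True if extracting `member_name` stays inside `dest`."""
    dest = os.path.realpath(dest)
    # Joining with an absolute member name discards `dest`, so absolute
    # paths are rejected by the comparison below as well.
    target = os.path.realpath(os.path.join(dest, member_name))
    return os.path.commonpath([dest, target]) == dest
```

A caller would typically raise ValueError as soon as one member fails this check, before writing anything to disk.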
swh.core.tarball.compress(tarpath, nature, dirpath_or_files)[source]

Create a tarball at tarpath whose format is given by nature. The content of the tarball is either the content of dirpath_or_files (if it is a directory path) or the files it yields (if it is an iterable of file paths).

swh.core.utils module


Contextually change the working directory to do thy bidding, then return to the original location.

swh.core.utils.grouper(iterable, n)[source]
Collect data into fixed-length size iterables. The last block might contain fewer elements as it will hold only the remaining number of elements.

The invariant here is that the number of elements in the input iterable and the sum of the number of elements of all iterables generated from this function should be equal.

  • iterable (Iterable) – an iterable
  • n (int) – size of block to slice the iterable into

Returns: fixed-length blocks as iterables. As mentioned, the last iterable might be less populated.
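A minimal sketch of this chunking behaviour using itertools.islice. The name `grouper_sketch` is hypothetical and the blocks are returned as lists here, which may differ from the iterables yielded by the actual function.

```python
import itertools


def grouper_sketch(iterable, n):
    """Yield successive blocks of at most n elements from iterable."""
    iterator = iter(iterable)
    while True:
        block = list(itertools.islice(iterator, n))
        if not block:
            return
        yield block
```

The invariant stated above holds by construction: every input element lands in exactly one block, so the block sizes sum to the input length.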


Encode a unicode string containing \x<hex> backslash escapes


Decode a bytestring as utf-8, escaping the bytes of invalid utf-8 sequences as \x<hex value>. We also escape NUL bytes as they are invalid in JSON strings.
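The decoding side can be approximated with the standard 'backslashreplace' error handler. This is a sketch, not the library's code: `decode_with_escape_sketch` is a hypothetical name, and doubling literal backslashes first (so that escapes stay unambiguous) is an assumption of the example.

```python
def decode_with_escape_sketch(value):
    """Decode bytes as utf-8, escaping invalid bytes and NUL as \\x<hex>."""
    # Assumption: double literal backslashes so escapes are unambiguous.
    value = value.replace(b'\\', b'\\\\')
    # NUL is valid utf-8, so it has to be escaped explicitly.
    value = value.replace(b'\x00', b'\\x00')
    # Invalid utf-8 bytes become \x<hex> through the error handler.
    return value.decode('utf-8', 'backslashreplace')
```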

swh.core.utils.commonname(path0, path1, as_str=False)[source]

Compute the commonname between the path0 and path1.


Simple function to sort filenames of the form:


where nn is a number according to the numbers.

Typically used to sort sql/nn-swh-xxx.sql files.
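Sorting such files by their leading number rather than alphabetically can be sketched with a key function. `numeric_sort_key` is a hypothetical helper, and the file names in the usage example are made up to match the sql/nn-swh-xxx.sql pattern.

```python
import os
import re


def numeric_sort_key(path):
    """Sort key: the leading number of the file name, then the path."""
    match = re.match(r'(\d+)', os.path.basename(path))
    # Files without a leading number sort last.
    number = int(match.group(1)) if match else float('inf')
    return (number, path)


names = ['sql/10-swh-b.sql', 'sql/9-swh-a.sql', 'sql/100-swh-c.sql']
ordered = sorted(names, key=numeric_sort_key)
```

A plain alphabetical sort would place 10 and 100 before 9; comparing the numbers as integers avoids that.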

Module contents