swh.lister.core.lister_base module

swh.lister.core.lister_base.utcnow()[source]
exception swh.lister.core.lister_base.FetchError(response)[source]

Bases: RuntimeError

class swh.lister.core.lister_base.ListerBase(override_config=None)[source]

Bases: abc.ABC, swh.core.config.SWHConfig

Lister core base class. Generally a source code hosting service provides an API endpoint for listing the set of stored repositories. A Lister is the discovery service responsible for finding this list, all at once or sequentially by parts, and queueing local tasks to fetch and ingest the referenced repositories.

The core method in this class is ingest_data. Subclasses should call this method one or more times to fetch and ingest data from API endpoints. See swh.lister.core.indexing_lister.IndexingLister for example usage.

This class cannot be instantiated. Any instantiable Lister descending from ListerBase must provide at least the required overrides (see member docstrings for details).

Required Overrides:

  • MODEL
  • def transport_request
  • def transport_response_to_string
  • def transport_response_simplified
  • def transport_quota_check

Optional Overrides:

  • def filter_before_inject
  • def is_within_bounds

MODEL = <swh.lister.core.abstractattribute.AbstractAttribute object>
LISTER_NAME = <swh.lister.core.abstractattribute.AbstractAttribute object>
transport_request(identifier)[source]

Given a target endpoint identifier to query, try once to request it.

Implementation of this method determines the network request protocol.

Parameters

identifier (string) – unique identifier for an endpoint query. e.g. If the service indexes lists of repositories by date and time of creation, this might be that as a formatted string. Or it might be an integer UID. Or it might be nothing. It depends on what the service needs.

Returns

the entire request response

Raises
  • Will catch internal transport-dependent connection exceptions and raise swh.lister.core.lister_base.FetchError instead. Other non-connection exceptions should propagate unchanged.
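
As a rough illustration of the contract above, the following sketch wraps connection failures from Python's standard urllib in a FetchError and lets every other exception propagate. The FetchError class here is a local stand-in for swh.lister.core.lister_base.FetchError, and URL_TEMPLATE and the opener parameter are assumptions made for the example, not part of the documented API.

```python
import urllib.error
import urllib.request


class FetchError(RuntimeError):
    """Local stand-in for swh.lister.core.lister_base.FetchError."""

    def __init__(self, response):
        super().__init__(response)
        self.response = response


# Hypothetical endpoint template; a real lister would use its service's URL.
URL_TEMPLATE = "https://example.org/api/repositories?since={}"


def transport_request(identifier, opener=urllib.request.urlopen):
    """Try once to request the endpoint named by identifier."""
    url = URL_TEMPLATE.format(identifier)
    try:
        return opener(url)
    except urllib.error.URLError as exc:
        # Transport-dependent connection failure: report it as FetchError.
        # HTTPError subclasses URLError, so HTTP error statuses land here too.
        raise FetchError(exc) from exc
    # Any other exception type is deliberately left uncaught.
```

Passing the opener as a parameter is only a convenience for exercising the error paths without network access.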

transport_response_to_string(response)[source]

Convert the server response into a formatted string for logging.

Implementation of this method depends on the shape of the network response object returned by the transport_request method.

Parameters

response – the server response

Returns

a pretty string of the response

transport_response_simplified(response)[source]
Convert the server response into a list of dicts, one for each repo in the response, mapping columns in the lister's MODEL class to repo data.

Implementation of this method depends on the server API spec and the shape of the network response object returned by the transport_request method.

Parameters

response – response object from the server.

Returns

list of repo MODEL dicts

(e.g. [{'uid': r['id'], etc.} for r in response.json()])
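
A fuller version of that one-liner might look like the sketch below. FakeResponse is a hypothetical stand-in for whatever object transport_request returns, and the column names other than uid (name, html_url) as well as the JSON field names are assumptions chosen for illustration.

```python
import json


class FakeResponse:
    """Minimal stand-in for an HTTP response object exposing .json()."""

    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


def transport_response_simplified(response):
    """Map each repo in the response to a dict keyed by MODEL columns."""
    return [
        {
            "uid": repo["id"],        # column names are hypothetical
            "name": repo["name"],
            "html_url": repo["url"],
        }
        for repo in response.json()
    ]


response = FakeResponse('[{"id": 1, "name": "a", "url": "https://example.org/a"}]')
models_list = transport_response_simplified(response)
```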

transport_quota_check(response)[source]

Check server response to see if we’re hitting request rate limits.

Implementation of this method depends on the server communication protocol and API spec and the shape of the network response object returned by the transport_request method.

Parameters

response (session response) – complete API query response

Returns

  1. must retry request? True/False

  2. seconds to delay if True
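
For an HTTP-based service, the check could be sketched as follows. This assumes the server signals throttling with status 429 and a Retry-After header holding a number of seconds; FakeResponse and DEFAULT_DELAY are hypothetical names introduced for the example, and real services may signal rate limits differently.

```python
class FakeResponse:
    """Stand-in for a response with a status code and headers."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


DEFAULT_DELAY = 60  # hypothetical fallback delay, in seconds


def transport_quota_check(response):
    """Return (must_retry, delay_in_seconds) for a server response."""
    if response.status_code == 429:  # HTTP "Too Many Requests"
        # Assumes Retry-After carries a number of seconds, not an HTTP date.
        delay = int(response.headers.get("Retry-After", DEFAULT_DELAY))
        return True, delay
    return False, 0
```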

filter_before_inject(models_list)[source]

Filter models_list entries prior to injection in the db. This is run directly after transport_response_simplified.

Default implementation is to have no filtering.

Parameters

models_list – list of dicts returned by transport_response_simplified.

Returns

models_list with entries changed according to custom logic.

do_additional_checks(models_list)[source]

Execute some additional checks on the model list (after the filtering).

Default implementation is to run no check at all and to return the input as is.

Parameters

models_list – list of dicts returned by transport_response_simplified.

Returns

models_list with its entries if the checks pass, False otherwise

is_within_bounds(inner, lower=None, upper=None)[source]

See if a sortable value is inside the range [lower,upper].

MAY BE OVERRIDDEN, for example if the server indexable* key is technically sortable but not automatically so.

    • ( see: swh.lister.core.indexing_lister.IndexingLister )

Parameters
  • inner (sortable type) – the value being checked

  • lower (sortable type) – optional lower bound

  • upper (sortable type) – optional upper bound

Returns

whether inner is confined by the optional lower and upper bounds
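
A minimal sketch of such a check, assuming an inclusive range and values that support the usual comparison operators, is shown below. It is written as a free function for brevity and is not the library's own implementation.

```python
def is_within_bounds(inner, lower=None, upper=None):
    """Return True if inner lies in [lower, upper]; None means unbounded."""
    if lower is not None and inner < lower:
        return False
    if upper is not None and inner > upper:
        return False
    return True
```

An override would typically convert inner, lower and upper to a comparable form first (for example, parsing version-like strings into tuples) and then apply the same comparison.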

DEFAULT_CONFIG = {'lister': ('dict', {'cls': 'local', 'args': {'db': 'postgresql:///lister'}}), 'scheduler': ('dict', {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}})}
property CONFIG_BASE_FILENAME

property ADDITIONAL_CONFIG
INITIAL_BACKOFF = 10
MAX_RETRIES = 7
CONN_SLEEP = 10
reset_backoff()[source]

Reset exponential backoff timeout to initial level.

back_off()[source]

Get next exponential backoff timeout.
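
The pair reset_backoff / back_off can be pictured with the sketch below. The Backoff class and the multiplication FACTOR are assumptions for illustration; this page only states that the timeout grows exponentially from INITIAL_BACKOFF.

```python
class Backoff:
    INITIAL_BACKOFF = 10
    FACTOR = 2  # assumed growth factor, not taken from the library

    def __init__(self):
        self.reset_backoff()

    def reset_backoff(self):
        """Reset exponential backoff timeout to initial level."""
        self.backoff = self.INITIAL_BACKOFF

    def back_off(self):
        """Return the current timeout, then grow it for the next call."""
        current = self.backoff
        self.backoff *= self.FACTOR
        return current
```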

safely_issue_request(identifier)[source]

Make network request with retries, rate quotas, and response logs.

Protocol is handled by the implementation of the transport_request method.

Parameters

identifier – resource identifier

Returns

server response
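
One way to picture the retry loop is the sketch below, which takes the two transport hooks as plain callables. It uses the MAX_RETRIES and CONN_SLEEP values listed above; the local FetchError stand-in, the injectable sleep parameter, and the choice to raise FetchError once retries are exhausted are assumptions of this example rather than documented behaviour, and response logging is omitted.

```python
import time

MAX_RETRIES = 7
CONN_SLEEP = 10


class FetchError(RuntimeError):
    """Local stand-in for swh.lister.core.lister_base.FetchError."""


def safely_issue_request(identifier, transport_request,
                         transport_quota_check, sleep=time.sleep):
    """Request identifier, retrying on connection errors and throttling."""
    retries_left = MAX_RETRIES
    while retries_left > 0:
        try:
            response = transport_request(identifier)
        except FetchError:
            # Connection-level failure: wait a fixed time and try again.
            retries_left -= 1
            sleep(CONN_SLEEP)
            continue
        must_retry, delay = transport_quota_check(response)
        if not must_retry:
            return response
        # Rate limited: wait as long as the quota check asks for.
        retries_left -= 1
        sleep(delay)
    # Assumption of this sketch: give up loudly when retries run out.
    raise FetchError("too many retries for %r" % (identifier,))
```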

db_query_equal(key, value)[source]

Look in the db for a row with key == value

Parameters
  • key – column key to look at

  • value – value to look for in that column

Returns

sqlalchemy.ext.declarative.declarative_base object with the given key == value

winnow_models(mlist, key, to_remove)[source]
Given a list of models, remove any with <key> matching some member of a list of values.

Parameters
  • mlist (list of model rows) – the initial list of models

  • key (column) – the column to filter on

  • to_remove (list) – if anything in mlist has column <key> equal to one of the values in to_remove, it will be removed from the result

Returns

A list of model rows starting from mlist minus any matching rows
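
The filtering itself amounts to something like the sketch below. Here key is taken to be an attribute name given as a string and the rows are SimpleNamespace objects; both are simplifications assumed for the example, since the documented parameter is a model column.

```python
from types import SimpleNamespace


def winnow_models(mlist, key, to_remove):
    """Return the rows of mlist whose <key> value is not in to_remove."""
    unwanted = set(to_remove)
    return [m for m in mlist if getattr(m, key) not in unwanted]


rows = [SimpleNamespace(uid=1), SimpleNamespace(uid=2), SimpleNamespace(uid=3)]
kept = winnow_models(rows, "uid", [2, 3])
```

The sketch returns a new list and leaves mlist itself untouched.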

db_num_entries()[source]

Return the known number of entries in the lister db

db_inject_repo(model_dict)[source]

Add/update a new repo to the db and mark it last_seen now.

Parameters

model_dict – dictionary mapping model keys to values

Returns

new or updated sqlalchemy.ext.declarative.declarative_base object associated with the injection

task_dict(origin_type: str, origin_url: str, **kwargs) → Dict[str, Any][source]

Return special dict format for the tasks list

Parameters
  • origin_type (string) –

  • origin_url (string) –

Returns

the same information in a different form

string_pattern_check(a, b, c=None)[source]
When comparing indexable types in is_within_bounds, complex strings may not be allowed to differ in basic structure. If they do, it could be a sign of not understanding the data well. For instance, an ISO 8601 time string cannot be compared against its urlencoded equivalent, but this is an easy mistake to accidentally make. This method acts as a friendly sanity check.

Parameters
  • a (string) – inner component of the is_within_bounds method

  • b (string) – lower component of the is_within_bounds method

  • c (string) – upper component of the is_within_bounds method

Returns

nothing

Raises
  • TypeError if strings a, b, and c don't conform to the same basic pattern.
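
One possible reading of "same basic pattern" is that the strings share the same layout of punctuation once letters and digits are masked out. The sketch below implements that reading; the _shape helper and the exact notion of pattern are assumptions of this example, and the library's own check may be more lenient or stricter.

```python
import re


def _shape(value):
    """Mask every ASCII letter and digit, keeping punctuation as is."""
    return re.sub(r"[A-Za-z0-9]", "x", value)


def string_pattern_check(a, b, c=None):
    """Raise TypeError if string bounds b or c are shaped unlike a."""
    if not isinstance(a, str):
        return  # only strings are checked
    for other in (b, c):
        if isinstance(other, str) and _shape(other) != _shape(a):
            raise TypeError("incomparable string patterns detected")
```

With this definition, "2019-01-02T03:04:05Z" compares cleanly against another ISO 8601 timestamp of the same layout, while its urlencoded form "2019-01-02T03%3A04%3A05Z" triggers the TypeError.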

inject_repo_data_into_db(models_list: List[Dict]) → Dict[source]

Inject data into the db.

Parameters

models_list – list of dicts mapping keys from the db model for each repo to be injected

Returns

dict of uid:sql_repo pairs

schedule_missing_tasks(models_list: List[Dict], injected_repos: Dict) → None[source]
Schedule any newly created db entries that have not been scheduled yet.

Parameters
  • models_list – List of dicts mapping keys in the db model for each repo

  • injected_repos – Dict of uid:sql_repo pairs that have just been created

Returns

Nothing. (Note that it modifies injected_repos to set the new task_id.)

ingest_data(identifier, checks=False)[source]
The core data fetch sequence. Request server endpoint. Simplify and filter response list of repositories. Inject repo information into local db. Queue loader tasks for linked repositories.

Parameters
  • identifier – Resource identifier.

  • checks (bool) – Additional checks required
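
The sequence described above can be sketched as a chain of the methods documented on this page. StubLister is a hypothetical stand-in that only records the order of calls, the function takes the lister as an explicit argument instead of being a method, and the return values are assumptions, since this page does not document what ingest_data returns.

```python
class StubLister:
    """Hypothetical stand-in that records the order of pipeline steps."""

    def __init__(self):
        self.calls = []

    def safely_issue_request(self, identifier):
        self.calls.append("request")
        return [{"id": 1}, {"id": 2}]

    def transport_response_simplified(self, response):
        self.calls.append("simplify")
        return [{"uid": r["id"]} for r in response]

    def filter_before_inject(self, models_list):
        self.calls.append("filter")
        return models_list

    def do_additional_checks(self, models_list):
        self.calls.append("checks")
        return models_list

    def inject_repo_data_into_db(self, models_list):
        self.calls.append("inject")
        return {m["uid"]: dict(m) for m in models_list}

    def schedule_missing_tasks(self, models_list, injected_repos):
        self.calls.append("schedule")


def ingest_data(lister, identifier, checks=False):
    """Request, simplify, filter, optionally check, inject, schedule."""
    response = lister.safely_issue_request(identifier)
    models_list = lister.transport_response_simplified(response)
    models_list = lister.filter_before_inject(models_list)
    if checks:
        models_list = lister.do_additional_checks(models_list)
        if models_list is False:
            return response, {}  # assumed outcome when checks fail
    injected = lister.inject_repo_data_into_db(models_list)
    lister.schedule_missing_tasks(models_list, injected)
    return response, injected
```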

save_response(response)[source]

Log the response from a server request to a cache dir.

Parameters
  • response – full server response

  • cache_dir – system path for cache dir

Returns

nothing