swh.lister.utils module

swh.lister.utils.split_range(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]

Split total_pages into consecutive (start, end) ranges, bounds inclusive, of nb_pages elements each. In some cases, the last range can have one more element.

>>> list(split_range(19, 10))
[(0, 9), (10, 19)]
>>> list(split_range(20, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 20)]
>>> list(split_range(21, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 21)]
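
The outputs above are consistent with walking the range in strides of nb_pages and closing the final range at total_pages. The following is a minimal sketch under that assumption, not the library's actual implementation; the name split_range_sketch is hypothetical.

```python
from typing import Iterator, Tuple


def split_range_sketch(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) ranges of nb_pages elements (illustrative only)."""
    # Start index of every range: 0, nb_pages, 2 * nb_pages, ...
    starts = list(range(0, total_pages, nb_pages))
    # Every range except the last one ends right before the next start.
    for start, next_start in zip(starts, starts[1:]):
        yield start, next_start - 1
    # The last range is closed at total_pages itself, which is why it can
    # hold one more element than the others.
    if starts:
        yield starts[-1], total_pages


print(list(split_range_sketch(19, 10)))  # [(0, 9), (10, 19)]
```

Each yielded pair could then be handed to a separate worker or request batch.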

swh.lister.utils.is_throttling_exception(e: Exception) -> bool

Checks if an exception is a requests.exceptions.HTTPError for a response with status code 429 (too many requests).

swh.lister.utils.is_retryable_exception(e: Exception) -> bool

Checks if an exception is worth retrying (a connection error, a throttling error or a server error).
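
As an illustration of the two checks described above, here is a minimal sketch built on the requests exception classes. The function names ending in _sketch and the make_http_error helper are hypothetical, and treating "server error" as any status of 500 or above is an assumption; the library's own code may differ.

```python
import requests


def is_throttling_exception_sketch(e: Exception) -> bool:
    """True for an HTTPError whose response has status code 429."""
    # Compare against None explicitly: a Response with a 4xx/5xx status is falsy.
    return (
        isinstance(e, requests.exceptions.HTTPError)
        and e.response is not None
        and e.response.status_code == 429
    )


def is_retryable_exception_sketch(e: Exception) -> bool:
    """True for connection errors, throttling (429) and server errors (>= 500)."""
    if isinstance(e, requests.exceptions.ConnectionError):
        return True
    if is_throttling_exception_sketch(e):
        return True
    return (
        isinstance(e, requests.exceptions.HTTPError)
        and e.response is not None
        and e.response.status_code >= 500
    )


def make_http_error(status_code: int) -> requests.exceptions.HTTPError:
    """Build an HTTPError carrying a response with the given status (no network)."""
    response = requests.Response()
    response.status_code = status_code
    return requests.exceptions.HTTPError(f"status {status_code}", response=response)
```

With these definitions, make_http_error(429) and make_http_error(503) are retryable, while make_http_error(404) is not.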

swh.lister.utils.retry_if_exception(retry_state, predicate: Callable[[Exception], bool]) -> bool

Custom tenacity retry predicate that requests a retry when the call raised an exception accepted by the given predicate.
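
A tenacity retry callable receives a retry state whose outcome attribute is a future holding the result or exception of the last attempt. The sketch below shows one way such a predicate could be written. It is an assumption about the approach rather than the library's code, and it uses a standard-library Future with a SimpleNamespace as a stand-in for tenacity's retry state so that it runs without tenacity installed. All names are hypothetical.

```python
from concurrent.futures import Future
from types import SimpleNamespace
from typing import Callable, Optional


def retry_if_exception_sketch(retry_state, predicate: Callable[[Exception], bool]) -> bool:
    """Ask for a retry only if the last attempt raised an exception the predicate accepts."""
    outcome = retry_state.outcome
    if outcome is None:
        return False  # no attempt has completed yet
    exc = outcome.exception()  # None when the attempt returned normally
    return exc is not None and bool(predicate(exc))


def fake_retry_state(exc: Optional[Exception] = None) -> SimpleNamespace:
    """Stand-in for a tenacity retry state (illustration only)."""
    outcome: Future = Future()
    if exc is None:
        outcome.set_result("ok")
    else:
        outcome.set_exception(exc)
    return SimpleNamespace(outcome=outcome)


def is_timeout(e: Exception) -> bool:
    return isinstance(e, TimeoutError)


print(retry_if_exception_sketch(fake_retry_state(TimeoutError()), is_timeout))  # True
print(retry_if_exception_sketch(fake_retry_state(), is_timeout))  # False
```

A policy for a specific kind of error can then be obtained by binding the predicate, for example with functools.partial.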

swh.lister.utils.retry_policy_generic(retry_state) -> bool

Custom tenacity retry predicate for handling failed requests:

  • ConnectionError
  • Server errors (status >= 500)
  • Throttling errors (status == 429)

This does not handle 404, 403 or other status codes.

swh.lister.utils.http_retry(retry=<function retry_policy_generic>, wait=<tenacity.wait.wait_exponential object>, stop=<tenacity.stop.stop_after_attempt object>, **retry_args)

Decorator based on tenacity for retrying a function that may raise requests.exceptions.HTTPError for status code 429 (too many requests).

It provides a default configuration that should work properly in most cases, but all tenacity.retry parameters can also be overridden in client code.

When the maximum number of attempts is reached, the HTTPError exception is reraised.

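Parameters:
  • retry – retry predicate (default: retry_policy_generic)
  • wait – wait strategy between attempts (default: a tenacity.wait.wait_exponential object)
  • stop – stop condition (default: a tenacity.stop.stop_after_attempt object)
  • retry_args – other tenacity.retry parameters to override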

swh.lister.utils.is_valid_origin_url(url: Optional[str]) -> bool

Returns whether the given string is a valid origin URL. This excludes Git SSH URLs and pseudo-URLs (e.g. ssh://git@example.org:foo and git@example.org:foo), as they are not supported by the Git loader and usually require authentication.

All HTTP and HTTPS URLs are allowed:

>>> is_valid_origin_url("http://example.org/repo.git")
True
>>> is_valid_origin_url("http://example.org/repo")
True
>>> is_valid_origin_url("https://example.org/repo")
True
>>> is_valid_origin_url("https://foo:bar@example.org/repo")
True

Scheme-less URLs are rejected:

>>> is_valid_origin_url("example.org/repo")
False
>>> is_valid_origin_url("example.org:repo")
False

Git SSH URLs and pseudo-URLs are rejected:

>>> is_valid_origin_url("git@example.org:repo")
False
>>> is_valid_origin_url("ssh://git@example.org:repo")
False
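
The accepted and rejected cases above can be reproduced with urllib.parse from the standard library. The following is a minimal sketch consistent with those examples, not the library's actual implementation; the name is_valid_origin_url_sketch is hypothetical, and the rule "requires a network location and a scheme other than ssh" is an assumption inferred from the examples.

```python
from typing import Optional
from urllib.parse import urlparse


def is_valid_origin_url_sketch(url: Optional[str]) -> bool:
    """Accept URLs that have a network location and do not use the ssh scheme."""
    if not url:
        return False  # None or empty string
    parsed = urlparse(url)
    # Scheme-less strings and pseudo-URLs such as git@example.org:repo
    # have no "//host" part, so netloc is empty.
    if not parsed.netloc:
        return False
    # ssh:// URLs are rejected explicitly.
    if parsed.scheme == "ssh":
        return False
    return True


print(is_valid_origin_url_sketch("https://example.org/repo"))  # True
print(is_valid_origin_url_sketch("git@example.org:repo"))  # False
```

A lister could apply such a check to drop unusable origins before recording them.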