swh.lister.utils module
- swh.lister.utils.split_range(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]
Split total_pages into ranges of mostly nb_pages elements each, yielding inclusive (start, end) pairs. In some cases, the last range can have one more element.
>>> list(split_range(19, 10))
[(0, 9), (10, 19)]
>>> list(split_range(20, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 20)]
>>> list(split_range(21, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 21)]
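The behaviour shown in these examples can be reproduced with a short generator. The following is a minimal sketch, not the library's implementation: the name split_range_sketch is hypothetical, and it assumes nb_pages is a positive integer.

```python
from typing import Iterator, Tuple


def split_range_sketch(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) pairs covering 0..total_pages."""
    # Start index of every range: 0, nb_pages, 2 * nb_pages, ...
    starts = list(range(0, total_pages, nb_pages))
    # Every range except the last ends right before the next start.
    for start, next_start in zip(starts, starts[1:]):
        yield start, next_start - 1
    # The last range absorbs whatever remains, up to total_pages included.
    if starts:
        yield starts[-1], total_pages


for start, end in split_range_sketch(21, 3):
    print(f"pages {start} to {end}")
```

Because the last range always ends at total_pages itself, its length can differ from nb_pages, which matches the (18, 21) pair in the last example.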
- swh.lister.utils.is_throttling_exception(e: Exception) -> bool
Checks if an exception is a requests.exceptions.HTTPError for a response with status code 429 (too many requests).
- swh.lister.utils.is_retryable_exception(e: Exception) -> bool
Checks if an exception is worth retrying (connection error, throttling or server error).
- swh.lister.utils.retry_if_exception(retry_state, predicate: Callable[[Exception], bool]) -> bool
Custom tenacity retry predicate for handling exceptions with the given predicate.
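As an illustration of how such a predicate can work, here is a minimal sketch. It assumes the retry state exposes the last attempt as an `outcome` future, as tenacity's retry call state does. The names retry_if_exception_sketch and fake_retry_state are hypothetical, and the retry state is simulated with standard-library objects so the example runs without tenacity installed.

```python
from concurrent.futures import Future
from types import SimpleNamespace
from typing import Callable, Optional


def retry_if_exception_sketch(
    retry_state, predicate: Callable[[Exception], bool]
) -> bool:
    """Return True when the last attempt failed with an exception accepted by predicate."""
    # exception() returns None when the attempt completed without raising.
    exception = retry_state.outcome.exception()
    if exception is None:
        return False
    return predicate(exception)


def fake_retry_state(exc: Optional[Exception] = None) -> SimpleNamespace:
    """Hypothetical stand-in for a tenacity retry state (illustration only)."""
    outcome: Future = Future()
    if exc is None:
        outcome.set_result("ok")
    else:
        outcome.set_exception(exc)
    return SimpleNamespace(outcome=outcome)


def is_timeout(e: Exception) -> bool:
    return isinstance(e, TimeoutError)


print(retry_if_exception_sketch(fake_retry_state(TimeoutError()), is_timeout))
print(retry_if_exception_sketch(fake_retry_state(), is_timeout))
```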
- swh.lister.utils.retry_policy_generic(retry_state) -> bool
Custom tenacity retry predicate for handling failed requests:
  - ConnectionError
  - Server errors (status >= 500)
  - Throttling errors (status == 429)
This does not handle 404, 403 or other status codes.
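The classification above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's code: FakeHTTPError is a hypothetical stand-in for requests.exceptions.HTTPError (it only carries a response with a status_code), the built-in ConnectionError stands in for the requests connection error, and the function names are made up for the example.

```python
from types import SimpleNamespace
from typing import Optional


class FakeHTTPError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.response = SimpleNamespace(status_code=status_code)


def status_code_of(e: Exception) -> Optional[int]:
    # Exceptions without an attached response yield None.
    return getattr(getattr(e, "response", None), "status_code", None)


def is_throttling_sketch(e: Exception) -> bool:
    return status_code_of(e) == 429


def is_retryable_sketch(e: Exception) -> bool:
    # Connection problems are always worth another attempt.
    if isinstance(e, ConnectionError):
        return True
    status = status_code_of(e)
    # Retry on throttling (429) and server errors (>= 500) only.
    return status is not None and (status == 429 or status >= 500)


for error in (FakeHTTPError(429), FakeHTTPError(503), FakeHTTPError(404), ConnectionError()):
    print(type(error).__name__, error, is_retryable_sketch(error))
```

Client errors such as 404 and 403 fall through to False because a repeated request would most likely fail the same way.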
- swh.lister.utils.http_retry(retry=<function retry_policy_generic>, wait=<tenacity.wait.wait_exponential object>, stop=<tenacity.stop.stop_after_attempt object>, **retry_args)
Decorator based on tenacity for retrying a function possibly raising requests.exceptions.HTTPError for status code 429 (too many requests).
It provides a default configuration that should work properly in most cases, but all tenacity.retry parameters can also be overridden in client code.
When the maximum number of attempts is reached, the HTTPError exception is re-raised.
- Parameters:
retry – function defining the request retry condition (defaults to retry_policy_generic): https://tenacity.readthedocs.io/en/latest/#whether-to-retry
wait – function defining the wait strategy before retrying (defaults to exponential backoff): https://tenacity.readthedocs.io/en/latest/#waiting-before-retrying
stop – function defining when to stop retrying (defaults to stopping after 5 attempts): https://tenacity.readthedocs.io/en/latest/#stopping
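To make the interplay of the three parameters concrete, the sketch below reimplements the general idea in plain Python: a retry condition, an exponentially growing wait, and a maximum number of attempts after which the last exception is re-raised. It is not how tenacity or http_retry is implemented. The name retry_sketch, its parameters and the delay values are assumptions chosen for illustration, and the sleep function is injectable so the example runs without actually waiting.

```python
import functools
import time
from typing import Callable


def retry_sketch(
    should_retry: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry the decorated function while should_retry accepts the raised exception."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors.
                    if attempt == max_attempts or not should_retry(exc):
                        raise
                    # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


delays = []  # records the waits instead of sleeping
calls = []


@retry_sketch(lambda exc: isinstance(exc, ConnectionError), sleep=delays.append)
def flaky_fetch() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "page content"


result = flaky_fetch()
print(result, len(calls), delays)
```

Here flaky_fetch fails twice before succeeding, so the wrapper waits twice with doubling delays and then returns the result of the third call.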
- swh.lister.utils.is_valid_origin_url(url: Optional[str]) -> bool
Returns whether the given string is a valid origin URL. This excludes Git SSH URLs and pseudo-URLs (e.g. ssh://git@example.org:foo and git@example.org:foo), as they are not supported by the Git loader and usually require authentication.
All HTTP URLs are allowed:
>>> is_valid_origin_url("http://example.org/repo.git")
True
>>> is_valid_origin_url("http://example.org/repo")
True
>>> is_valid_origin_url("https://example.org/repo")
True
>>> is_valid_origin_url("https://foo:bar@example.org/repo")
True
Scheme-less URLs are rejected:
>>> is_valid_origin_url("example.org/repo")
False
>>> is_valid_origin_url("example.org:repo")
False
Git SSH URLs and pseudo-URLs are rejected:
>>> is_valid_origin_url("git@example.org:repo")
False
>>> is_valid_origin_url("ssh://git@example.org:repo")
False
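A check consistent with all of the examples above can be built on urllib.parse. The following is a minimal sketch that only reproduces the documented cases; the name is_valid_origin_url_sketch is hypothetical, and the real function may apply additional rules.

```python
from typing import Optional
from urllib.parse import urlparse


def is_valid_origin_url_sketch(url: Optional[str]) -> bool:
    """Accept URLs with a network location, except those using the ssh scheme."""
    if not url:
        return False
    parsed = urlparse(url)
    # Explicit SSH URLs are rejected outright.
    if parsed.scheme == "ssh":
        return False
    # Scheme-less strings and scp-like pseudo-URLs have no network location.
    return bool(parsed.netloc)


for candidate in (
    "https://example.org/repo",
    "example.org/repo",
    "git@example.org:repo",
    "ssh://git@example.org:repo",
):
    print(candidate, is_valid_origin_url_sketch(candidate))
```

Strings such as git@example.org:repo contain no `//` separator, so urlparse reports an empty network location and the sketch rejects them without a dedicated rule.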