swh.lister.utils module

swh.lister.utils.split_range(total_pages: int, nb_pages: int) → Iterator[Tuple[int, int]]

Split total_pages into consecutive (start, end) ranges of mostly nb_pages elements each, bounds included. In some cases, the last range can have one more element.

>>> list(split_range(19, 10))
[(0, 9), (10, 19)]
>>> list(split_range(20, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 20)]
>>> list(split_range(21, 3))
[(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 21)]
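
The behaviour shown in the doctests above can be approximated with the following minimal sketch. The name split_range_sketch is hypothetical and the body is only one way to reproduce the three documented outputs; it is not taken from the library's source.

```python
from typing import Iterator, Tuple


def split_range_sketch(total_pages: int, nb_pages: int) -> Iterator[Tuple[int, int]]:
    """Hypothetical re-implementation that matches the documented doctests."""
    # Start index of every range: 0, nb_pages, 2 * nb_pages, ...
    starts = list(range(0, total_pages, nb_pages))
    # Every range but the last one stops right before the next start.
    for start, next_start in zip(starts, starts[1:]):
        yield start, next_start - 1
    # The last range runs up to total_pages, bound included.
    if starts:
        yield starts[-1], total_pages


print(list(split_range_sketch(19, 10)))  # [(0, 9), (10, 19)]
```

Because the last range always ends at total_pages itself, it can hold one more element than the others, as in the (18, 21) range of the third doctest.
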
swh.lister.utils.is_valid_origin_url(url: str | None) → bool

Returns whether the given string is a valid origin URL. This excludes Git SSH URLs and pseudo-URLs (e.g. ssh://git@example.org:foo and git@example.org:foo), as they are not supported by the Git loader and usually require authentication.

All HTTP URLs are allowed:

>>> is_valid_origin_url("http://example.org/repo.git")
True
>>> is_valid_origin_url("http://example.org/repo")
True
>>> is_valid_origin_url("https://example.org/repo")
True
>>> is_valid_origin_url("https://foo:bar@example.org/repo")
True

Scheme-less URLs are rejected:

>>> is_valid_origin_url("example.org/repo")
False
>>> is_valid_origin_url("example.org:repo")
False

Git SSH URLs and pseudo-URLs are rejected:

>>> is_valid_origin_url("git@example.org:repo")
False
>>> is_valid_origin_url("ssh://git@example.org:repo")
False
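
The rules illustrated by the doctests above can be sketched with urllib.parse from the standard library. The function name looks_like_valid_origin_url is hypothetical, and the checks are an assumption chosen to reproduce the documented results. The sketch says nothing reliable about schemes the examples do not cover.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_valid_origin_url(url: Optional[str]) -> bool:
    """Hypothetical check reproducing the documented examples."""
    if not url:
        # None and the empty string are never valid origins.
        return False
    parsed = urlparse(url)
    if not parsed.netloc:
        # Scheme-less URLs and pseudo-URLs such as git@example.org:repo
        # have no network location once parsed.
        return False
    if parsed.scheme == "ssh":
        # Git SSH URLs are rejected.
        return False
    return True


for candidate in ("https://example.org/repo", "git@example.org:repo"):
    print(candidate, looks_like_valid_origin_url(candidate))
```
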
exception swh.lister.utils.ArtifactNatureUndetected

Bases: ValueError

Raised when a remote artifact’s nature (tarball, file) cannot be detected.

exception swh.lister.utils.ArtifactNatureMistyped

Bases: ValueError

Raised when a remote artifact is neither a tarball nor a file.

Errors of this type are probably caused by a misconfiguration in the manifest generation that mistyped a VCS repository.

exception swh.lister.utils.ArtifactWithoutExtension

Bases: ValueError

Raised when an artifact nature cannot be determined by its name.

swh.lister.utils.url_contains_tarball_filename(urlparsed, extensions: List[str], raise_when_no_extension: bool = True) → bool

Determine whether urlparsed contains a tarball filename ending with one of the extensions passed as parameter. Both path parts and query parameters are checked.

This also accounts for the edge case of a filename with only a version as name (so with no extension at the end).

Raises:
  • ArtifactWithoutExtension – in case no extension is available and raise_when_no_extension is True (the default)
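
As an illustration of checking both the path and the query parameters, here is a minimal sketch. The names contains_tarball_filename and NoExtensionFound are hypothetical stand-ins, extensions are assumed to be given without a leading dot (for example "tar.gz"), and the version-only edge case mentioned above is not reproduced.

```python
from pathlib import PurePosixPath
from typing import List
from urllib.parse import ParseResult, parse_qsl, urlparse


class NoExtensionFound(ValueError):
    """Hypothetical stand-in for ArtifactWithoutExtension."""


def contains_tarball_filename(
    urlparsed: ParseResult,
    extensions: List[str],
    raise_when_no_extension: bool = True,
) -> bool:
    # Candidates: the URL path plus the value of every query parameter.
    candidates = [urlparsed.path] + [value for _, value in parse_qsl(urlparsed.query)]
    names = [PurePosixPath(candidate).name for candidate in candidates]
    # A candidate matches when its name ends with one of the extensions.
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    if any(name.endswith(suffixes) for name in names):
        return True
    # Nothing matched: raise only if no candidate has any extension at all.
    has_extension = any(PurePosixPath(name).suffix for name in names)
    if raise_when_no_extension and not has_extension:
        raise NoExtensionFound(urlparsed.geturl())
    return False


url = urlparse("https://example.org/download?file=foo-1.0.zip")
print(contains_tarball_filename(url, ["tar.gz", "zip"]))  # True
```

In this sketch a URL such as https://example.org/README.txt returns False, because an extension exists but is not in the list, while https://example.org/download raises unless raise_when_no_extension is False.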

swh.lister.utils.is_tarball(urls: List[str], request: Any | None = None) → Tuple[bool, str]

Determine whether a list of files are actually tarballs or simple files.

This iterates over the list of urls provided to detect the artifact’s nature. When this cannot be answered from the url alone and request is provided, this executes an HTTP HEAD query on the url to determine the information. If request is not provided, this raises an ArtifactNatureUndetected exception.

If the artifact’s nature is still undetermined at the end of the iteration on the urls, this raises an ArtifactNatureUndetected exception.

Parameters:
  • urls – names of the remote files to check for artifact nature.

  • request – (Optional) Request object allowing HTTP calls. If not provided and the naive check cannot detect anything, this raises ArtifactNatureUndetected.

Raises:
  • ArtifactNatureUndetected – when the artifact’s nature cannot be detected out of its urls

  • ArtifactNatureMistyped – when the artifact is neither a tarball nor a file. It’s up to the caller to do what’s right with it.

Returns: A tuple (bool, url). The boolean represents whether the url is an archive or not. The second element is the actual url obtained once the HEAD request has been issued, as a fallback when the urls alone do not tell whether they are tarballs or not.
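
The detection flow described for is_tarball (naive check on the url first, HTTP fallback only when needed) can be sketched as follows. Everything here is an assumption made for illustration: the names detect_tarball, naive_check and NatureUndetected, the extension list, and the head callable that stands in for the request object and returns True, False or None. The sketch returns the url it checked rather than a url resolved by the server, and it does not model ArtifactNatureMistyped.

```python
from pathlib import PurePosixPath
from typing import Callable, List, Optional, Tuple
from urllib.parse import urlparse

# Assumed extension list, for illustration only.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


class NatureUndetected(ValueError):
    """Hypothetical stand-in for ArtifactNatureUndetected."""


def naive_check(url: str) -> Optional[bool]:
    """Guess from the url alone: True (tarball), False (file), None (unknown)."""
    name = PurePosixPath(urlparse(url).path).name
    if name.endswith(TARBALL_EXTENSIONS):
        return True
    # An alphabetic suffix such as ".txt" is taken as a plain file.
    if PurePosixPath(name).suffix[1:].isalpha():
        return False
    return None


def detect_tarball(
    urls: List[str],
    head: Optional[Callable[[str], Optional[bool]]] = None,
) -> Tuple[bool, str]:
    for url in urls:
        answer = naive_check(url)
        if answer is None:
            if head is None:
                # No way to ask the server: give up.
                raise NatureUndetected(url)
            # Fallback: ask the server through the injected callable.
            answer = head(url)
        if answer is not None:
            return answer, url
    # Every url was tried and none could be classified.
    raise NatureUndetected(urls)


print(detect_tarball(["https://example.org/foo-1.0.tar.gz"]))
```

Injecting the HTTP call as a callable keeps the sketch runnable without network access; a caller would wrap an actual HEAD request in it.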