swh.lister.cgit.lister module

class swh.lister.cgit.lister.CGitLister(scheduler: SchedulerInterface, url: str | None = None, instance: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, base_git_url: str | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True)

Bases: StatelessLister[List[Dict[str, Any]]]

Lister class for CGit repositories.

This lister will retrieve the list of published git repositories by parsing the HTML page(s) of the index retrieved at url.

The lister currently defines 2 listing behaviors:

  • If base_git_url is provided, each origin URL is computed from the base git URL and the repository link listed on the main index page (resulting in fewer HTTP queries than the second behavior below). This is expected to be the main deployed behavior.

  • Otherwise (with no base_git_url), for each git repository listed, one extra HTTP query is made to the repository URL found on the main listing page to gather the published “Clone” URLs, one of which is used as the origin URL for that git repository. If several “Clone” URLs are provided, the http/https one is preferred, if any; otherwise the lister falls back to the first one.
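The two behaviors above can be illustrated with a minimal sketch. This is not the lister's actual code: the helper names origin_from_base and pick_clone_url are hypothetical, and the rule used to combine base_git_url with the repository link (appending the link's path to the base) is an assumption made for illustration.

```python
from typing import List, Optional
from urllib.parse import urlparse


def origin_from_base(base_git_url: str, repo_link: str) -> str:
    """Behavior 1 (assumed rule): append the repository path taken
    from the index page link to base_git_url. No extra HTTP query."""
    repo_path = urlparse(repo_link).path.strip("/")
    return f"{base_git_url.rstrip('/')}/{repo_path}"


def pick_clone_url(clone_urls: List[str]) -> Optional[str]:
    """Behavior 2: among the published "Clone" URLs, prefer the first
    http/https one; otherwise fall back to the first URL listed."""
    for candidate in clone_urls:
        if urlparse(candidate).scheme in ("http", "https"):
            return candidate
    # No http/https URL was found; use the first one, if any.
    return clone_urls[0] if clone_urls else None
```

For example, pick_clone_url(["git://git.example.org/repo.git", "https://git.example.org/repo.git"]) selects the https URL, while a list containing only git:// and ssh:// URLs yields its first entry.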


Parameters:
  • url – (Optional) Root URL of the CGit instance, i.e. url of the index of published git repositories on this instance. Defaults to https://instance if unset.

  • instance – Name of cgit instance. Defaults to url’s network location if unset.

  • base_git_url – Optional base git URL from which the origin URLs are computed.
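The defaulting rules for url and instance described above can be sketched as follows. The function name resolve_url_and_instance is hypothetical, and raising ValueError when both arguments are unset is an assumption; the actual constructor may handle that case differently.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def resolve_url_and_instance(
    url: Optional[str] = None, instance: Optional[str] = None
) -> Tuple[str, str]:
    """Apply the documented defaults for url and instance."""
    if url is None and instance is None:
        # Assumption: at least one of the two must be provided.
        raise ValueError("either url or instance must be provided")
    if url is None:
        # url defaults to https://instance
        url = f"https://{instance}"
    if instance is None:
        # instance defaults to the network location of url
        instance = urlparse(url).netloc
    return url, instance
```

With only instance="git.example.org", the sketch returns the URL https://git.example.org; with only url="https://git.example.org/cgit/", the instance name becomes git.example.org.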

LISTER_NAME: str = 'cgit'

get_pages() → Iterator[List[Dict[str, Any]]]

Generate git ‘project’ URLs found on the current CGit server. The last_update date is retrieved from the repository list page to avoid computing it from the repository details page, which only gives a date per branch.

get_origins_from_page(repositories: List[Dict[str, Any]]) → Iterator[ListedOrigin]

Convert a page of cgit repositories into a list of ListedOrigins.
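As a rough illustration of this conversion step, the sketch below turns a page of repository dicts into origin records. It does not use the real ListedOrigin model: OriginSketch is a hypothetical stand-in, and the dict keys "url" and "last_update" are assumptions about the page contents.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class OriginSketch:
    """Hypothetical stand-in for ListedOrigin (not the real swh model)."""

    url: str
    visit_type: str
    last_update: Optional[datetime]


def origins_from_page(
    repositories: List[Dict[str, Any]],
) -> Iterator[OriginSketch]:
    """Yield one origin record per repository dict in the page."""
    for repo in repositories:
        # Assumed keys: "url" (required) and "last_update" (optional).
        yield OriginSketch(
            url=repo["url"],
            visit_type="git",
            last_update=repo.get("last_update"),
        )
```

Because the function is a generator, origins are produced lazily, one page at a time, which matches the Iterator return type in the signature above.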