swh.lister.gitweb.lister module

class swh.lister.gitweb.lister.GitwebLister(scheduler: SchedulerInterface, url: str | None = None, instance: str | None = None, base_git_url: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True)

Bases: StatelessLister[List[Dict[str, Any]]]

Lister class for Gitweb repositories.

This lister will retrieve the list of published git repositories by parsing the HTML page(s) of the index retrieved at url.

Parameters:
  • url – Root URL of the Gitweb instance, i.e. the URL of the index of published git repositories on this instance. Defaults to https://{instance} if unset, where {instance} is the value of the instance parameter.

  • instance – Name of the Gitweb instance. Defaults to the network location of url if unset.

  • base_git_url – Base URL used to clone a git project hosted on the Gitweb instance. It should only be used if the clone URLs cannot be found when scraping the project page and cannot be easily derived from the root URL of the instance.
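The way url and instance fall back on each other can be illustrated with a small standalone sketch. It does not use swh.lister itself: the helper name resolve_url_and_instance is hypothetical, and raising an error when both values are missing is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def resolve_url_and_instance(
    url: Optional[str] = None, instance: Optional[str] = None
) -> Tuple[str, str]:
    """Hypothetical helper mirroring the documented defaults (not swh code)."""
    if url is None and instance is None:
        # Assumption: at least one of the two values must be given.
        raise ValueError("either url or instance must be provided")
    if url is None:
        # Documented default: https://{instance}
        url = f"https://{instance}"
    if instance is None:
        # Documented default: the network location of the URL
        instance = urlparse(url).netloc
    return url, instance


print(resolve_url_and_instance(url="https://git.example.org/gitweb/"))
print(resolve_url_and_instance(instance="git.example.org"))
```

The host git.example.org is a placeholder, not a real Gitweb instance.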

LISTER_NAME: str = 'gitweb'

get_pages() → Iterator[List[Dict[str, Any]]]

Generate git ‘project’ URLs found on the current Gitweb server.
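As a rough illustration of what extracting project URLs from a Gitweb index page can look like, here is a minimal sketch built only on the standard library. It is not the lister's actual implementation: the class name ProjectLinkParser, the sample HTML, and the assumption that project links carry a=summary in their query string are all illustrative.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ProjectLinkParser(HTMLParser):
    """Collect project summary links (hypothetical helper, not swh code)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.project_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: project summary links contain "a=summary" in the query.
        if href and "a=summary" in href:
            url = urljoin(self.base_url, href)
            if url not in self.project_urls:  # skip duplicate links
                self.project_urls.append(url)


# Hand-written sample resembling a Gitweb project index (illustrative only).
SAMPLE_INDEX = """
<table class="project_list">
<tr><td><a class="list" href="?p=foo.git;a=summary">foo.git</a></td></tr>
<tr><td><a class="list" href="?p=bar/baz.git;a=summary">bar/baz.git</a></td></tr>
</table>
"""

parser = ProjectLinkParser("https://git.example.org/")
parser.feed(SAMPLE_INDEX)
print(parser.project_urls)
```

A real index page may be paginated or laid out differently, so the matching rule would need to be adapted per instance.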

get_origins_from_page(repositories: List[Dict[str, Any]]) → Iterator[ListedOrigin]

Convert a page of gitweb repositories into a list of ListedOrigins.

swh.lister.gitweb.lister.try_to_determine_git_repository(repository_url: str, base_git_url: str | None = None) → str | None

Some Gitweb instances do not advertise their git clone URLs.

This heuristic works on instances demonstrating this behavior.
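One plausible form of such a heuristic is sketched below, under stated assumptions: the project path is read from the p query parameter of the Gitweb page URL, and the clone URL is built from base_git_url when it is given, otherwise from a git:// URL on the same host. The function name guess_git_url and the git:// fallback are assumptions made for illustration, not a description of the library's code.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def guess_git_url(
    repository_url: str, base_git_url: Optional[str] = None
) -> Optional[str]:
    """Hypothetical heuristic: derive a clone URL from a Gitweb page URL."""
    parsed = urlparse(repository_url)
    project = None
    # Gitweb URLs commonly separate query parameters with ";" (or "&").
    for part in re.split(r"[;&]", parsed.query):
        key, _, value = part.partition("=")
        if key == "p" and value:
            project = value
            break
    if project is None:
        return None  # nothing to derive a clone URL from
    if base_git_url:
        return f"{base_git_url.rstrip('/')}/{project}"
    # Assumption: fall back to the git:// protocol on the same host.
    return f"git://{parsed.netloc}/{project}"


print(guess_git_url("https://git.example.org/?p=foo.git;a=summary"))
print(guess_git_url("https://git.example.org/?p=foo.git;a=summary",
                    base_git_url="https://git.example.org/git/"))
```

Returning None when no project path can be found mirrors the documented str | None return type.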

swh.lister.gitweb.lister.parse_last_update(last_update_interval: str | None) → datetime | None

Parse the last update string into a datetime.
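The parameter name suggests that the input is a relative interval such as "4 days ago" rather than an absolute date. Assuming that format, a minimal standalone sketch of the conversion could look as follows. The function name parse_relative_update, the accepted units, and the approximation of months as 30 days and years as 365 days are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Approximate length of each unit in seconds (months and years are rough).
_UNIT_SECONDS = {
    "sec": 1, "second": 1,
    "min": 60, "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}
_PATTERN = re.compile(
    r"\s*(\d+)\s+(second|sec|minute|min|hour|day|week|month|year)s?\s+ago\s*"
)


def parse_relative_update(
    text: Optional[str], now: Optional[datetime] = None
) -> Optional[datetime]:
    """Hypothetical parser for strings such as '4 days ago' (not swh code)."""
    if not text:
        return None
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None  # e.g. "No commits" or any unrecognized format
    amount, unit = int(match.group(1)), match.group(2)
    reference = now or datetime.now(timezone.utc)
    return reference - timedelta(seconds=amount * _UNIT_SECONDS[unit])


reference = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(parse_relative_update("4 days ago", now=reference))
print(parse_relative_update("No commits", now=reference))
```

Passing an explicit now value keeps the result reproducible, which is convenient in tests.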