swh.lister.maven.lister module

class swh.lister.maven.lister.MavenListerState(last_seen_doc: int = -1, last_seen_pom: int = -1)

Bases: object

State of the MavenLister

last_seen_doc: int = -1

Last doc ID ingested during an incremental pass

last_seen_pom: int = -1

Last doc ID related to a POM file ingested during an incremental pass

class swh.lister.maven.lister.MavenLister(scheduler: SchedulerInterface, url: str, index_url: str, instance: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, incremental: bool = True)

Bases: Lister[MavenListerState, Dict[str, Any]]

List origins from a Maven repository.

Maven Central provides artifacts for Java builds. It includes POM files and source archives, which we download to get the source code of artifacts and links to their SCM repositories.

This lister yields origins of the repository types declared by the artifacts (git, svn, hg, or any other type they use), plus maven origins for the maven loader (tgz, jar).

Lister class for Maven repositories.

Parameters:
  • url – main URL of the Maven repository, i.e. the URL of the base index used to fetch Maven artifacts. For Maven Central, use https://repo1.maven.org/maven2/

  • index_url – the URL from which to download the exported text indexes. This would typically be a local host running the export Docker image. See README.md in this directory for more information.

  • instance – name of the Maven instance. Defaults to the network location of url if unset.

  • incremental – bool, defaults to True. Whether incremental listing is enabled.
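
The default for instance can be illustrated with the standard library. This is a minimal sketch, assuming that "network location" means the netloc component returned by urllib.parse.urlparse; the helper name default_instance is hypothetical and is not part of swh.lister.

```python
from typing import Optional
from urllib.parse import urlparse


def default_instance(url: str, instance: Optional[str] = None) -> str:
    """Hypothetical helper: use the explicit name, else the URL's netloc."""
    if instance is not None:
        return instance
    return urlparse(url).netloc


# With no explicit instance, the host part of the repository URL is used.
print(default_instance("https://repo1.maven.org/maven2/"))  # repo1.maven.org
```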

LISTER_NAME: str = 'maven'

state_from_dict(d: Dict[str, Any]) → MavenListerState

Convert the state stored in the scheduler backend (as a dict) to the concrete StateType for this lister.

state_to_dict(state: MavenListerState) → Dict[str, Any]

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.
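
The round trip between the two methods above can be sketched as follows. This is an illustration only: StateSketch is a hypothetical stand-in that mirrors the two documented fields, and modelling it as a dataclass is an assumption, not something this page confirms about MavenListerState.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class StateSketch:
    """Hypothetical stand-in with the two documented state fields."""

    last_seen_doc: int = -1
    last_seen_pom: int = -1


def state_to_dict(state: StateSketch) -> Dict[str, Any]:
    # Plain ints only, so the result is JSON-compatible.
    return asdict(state)


def state_from_dict(d: Dict[str, Any]) -> StateSketch:
    return StateSketch(**d)


state = StateSketch(last_seen_doc=1200, last_seen_pom=1150)
serialized = json.dumps(state_to_dict(state))  # what a backend could store
restored = state_from_dict(json.loads(serialized))
assert restored == state
```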

get_pages() → Iterator[Dict[str, Any]]

Retrieve and parse the exported Maven indexes to identify all POM files and source archives.

get_scm(page: Dict[str, Any]) → ListedOrigin | None

Retrieve the SCM origin from the page information. Only called when the type of the page is scm.

Try to detect an SCM/VCS repository. The official format is scm:{type}:git://example.org/{user}/{repo}.git, but some projects put the repository URL directly (without the “scm:type” prefix), so the content must be inspected to extract the type and URL properly.

Raises:
  AssertionError when the type of the page is not ‘scm’

Returns:
  ListedOrigin with the proper canonical SCM URL (for GitHub) if any is found, None otherwise.
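
The two URL shapes described for get_scm can be told apart roughly as shown here. This is a minimal sketch, not the lister's actual logic: the function name split_scm is hypothetical, and the fallback rule for bare URLs (treating a ".git" suffix or "git://" scheme as git) is an assumption made for illustration.

```python
import re
from typing import Optional, Tuple

# Matches the documented "scm:{type}:{url}" form.
SCM_PREFIXED = re.compile(r"^scm:(?P<type>[^:]+):(?P<url>.+)$")


def split_scm(value: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (type, url), or None if undetected."""
    value = value.strip()
    match = SCM_PREFIXED.match(value)
    if match:
        return match.group("type"), match.group("url")
    # Assumed fallback for projects that give the repository URL directly.
    if value.endswith(".git") or value.startswith("git://"):
        return "git", value
    return None


print(split_scm("scm:git:git://example.org/user/repo.git"))
print(split_scm("https://example.org/user/repo.git"))
```

A real implementation would need more cases (svn and hg URLs given without a prefix, for example), which this sketch deliberately leaves out.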

get_origins_from_page(page: Dict[str, Any]) → Iterator[ListedOrigin]

Convert a page of Maven repositories into a list of ListedOrigins.

commit_page(page: Dict[str, Any]) → None

Update the currently stored state using the latest listed doc.

Note: this is a no-op in full listing mode.
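
The state update performed per page might look like the sketch below. It is an illustration under stated assumptions: the page keys "doc" and "type" are hypothetical, StateSketch is a stand-in for MavenListerState, and the rule that only scm pages advance last_seen_pom is inferred from the field descriptions rather than confirmed by this page.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class StateSketch:
    """Hypothetical stand-in with the two documented state fields."""

    last_seen_doc: int = -1
    last_seen_pom: int = -1


def commit_page_sketch(
    state: StateSketch, page: Dict[str, Any], incremental: bool = True
) -> None:
    if not incremental:
        return  # full listing mode: nothing is recorded
    doc_id = page["doc"]  # assumed key holding the index doc ID
    # Never move a counter backwards if pages arrive out of order.
    state.last_seen_doc = max(state.last_seen_doc, doc_id)
    if page["type"] == "scm":  # assumed: scm pages come from POM files
        state.last_seen_pom = max(state.last_seen_pom, doc_id)


state = StateSketch()
commit_page_sketch(state, {"type": "maven", "doc": 5})
commit_page_sketch(state, {"type": "scm", "doc": 7})
print(state)
```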

finalize() → None

Finalize the lister state and mark it as updated if any progress has been made.

Note: this is a no-op in full listing mode.