swh.lister.maven.lister module

class swh.lister.maven.lister.MavenListerState(last_seen_doc: int = -1, last_seen_pom: int = -1)

Bases: object

State of the MavenLister

last_seen_doc: int = -1

Last doc ID ingested during an incremental pass

last_seen_pom: int = -1

Last doc ID related to a POM file ingested during an incremental pass

class swh.lister.maven.lister.MavenLister(scheduler: SchedulerInterface, url: str, index_url: str | None = None, instance: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, incremental: bool = True, with_github_session=True, process_pom_files: bool = True)

Bases: Lister[MavenListerState, Dict[str, Any] | None]

List origins from a Maven repository.

Maven Central provides artifacts for Java builds. It includes POM files and source archives, which we download to get the source code of artifacts and links to their scm repository.

This lister yields origins of type git, svn, hg, or whatever repository type the artifacts declare, plus maven origins for the maven loader (tarball, source jar).

The lister relies on the maven index exporter tool, which converts the binary content of a Maven repository index to NDJSON format (https://gitlab.softwareheritage.org/swh/devel/fixtures/maven-index-exporter). To run the tool, a Java runtime environment >= 17 must be available in the lister execution environment.

Lister class for Maven repositories.

Parameters:
  • url – main URL of the Maven repository, i.e. the URL of the base index used to fetch Maven artifacts. For Maven Central, use https://repo1.maven.org/maven2/

  • instance – name of the Maven instance. Defaults to url’s network location if unset.

  • incremental – whether incremental listing is enabled. Defaults to True.

  • with_github_session – whether the canonical URL of an extracted GitHub repository should be retrieved with the GitHub REST API. Defaults to True.
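The default for instance can be illustrated with the standard library. This is a minimal sketch of the documented rule, not the lister's own code; the helper name default_instance is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def default_instance(url: str, instance: Optional[str] = None) -> str:
    """Hypothetical helper: return `instance` if given, otherwise the
    network location of `url`, as described for the `instance` parameter."""
    if instance is not None:
        return instance
    return urlparse(url).netloc


# With the Maven Central URL from the parameter list, the default
# instance name is the host part of the URL.
print(default_instance("https://repo1.maven.org/maven2/"))
print(default_instance("https://repo1.maven.org/maven2/", "central"))
```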

LISTER_NAME: str = 'maven'

state_from_dict(d: Dict[str, Any]) → MavenListerState

Convert the state stored in the scheduler backend (as a dict), to the concrete StateType for this lister.

state_to_dict(state: MavenListerState) → Dict[str, Any]

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.
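The round trip between the state object and its dict form can be sketched as follows. SketchState is a stand-in that mirrors only the two documented MavenListerState fields, and the use of dataclasses.asdict is an assumption about how such a conversion might be written, not a statement about the lister's implementation.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class SketchState:
    """Stand-in for MavenListerState (assumption: same two int fields)."""
    last_seen_doc: int = -1
    last_seen_pom: int = -1


def state_to_dict(state: SketchState) -> Dict[str, Any]:
    # Plain ints only, so the result is JSON-compatible.
    return asdict(state)


def state_from_dict(d: Dict[str, Any]) -> SketchState:
    # Missing keys fall back to the -1 defaults.
    return SketchState(**d)


state = SketchState(last_seen_doc=120, last_seen_pom=87)
# Simulate storage in a JSON-backed scheduler database.
stored = json.loads(json.dumps(state_to_dict(state)))
restored = state_from_dict(stored)
```

Passing the dict through json.dumps and json.loads checks the JSON-compatibility requirement stated above: only values that survive that round trip are safe to store.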

get_pages() → Iterator[Dict[str, Any] | None]

Retrieve and parse exported Maven indexes to identify all POM files and source archives.

get_scm(page: Dict[str, Any] | None) → ListedOrigin | None

Retrieve the SCM origin from the page information. Only called when the type of the page is scm.

Try to detect an SCM/VCS repository. The official format is scm:{type}:git://example.org/{user}/{repo}.git, but some projects put the repository URL directly (without the “scm:{type}” prefix), so the content has to be inspected to extract the type and URL properly.

Raises:
  AssertionError – when the type of the page is not ‘scm’

Returns:
  ListedOrigin with the proper canonical SCM URL (for GitHub) if any is found, None otherwise.
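The two input shapes described for get_scm (a prefixed scm:{type}:{url} value and a bare repository URL) can be told apart as in the sketch below. The function name parse_scm and the heuristics used for bare URLs (a .git suffix, or a git:// or svn:// scheme) are assumptions made for illustration; the lister's actual detection rules may differ.

```python
import re
from typing import Optional, Tuple

# Matches the official "scm:{type}:{url}" form.
SCM_PREFIX = re.compile(r"^scm:(?P<type>[^:]+):(?P<url>.+)$")


def parse_scm(value: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (type, url), or None if undetected."""
    value = value.strip()
    match = SCM_PREFIX.match(value)
    if match:
        return match.group("type"), match.group("url")
    # Bare URL without the "scm:{type}" prefix: guess from its shape.
    # These heuristics are illustrative assumptions only.
    if value.startswith("git://") or value.endswith(".git"):
        return "git", value
    if value.startswith("svn://"):
        return "svn", value
    return None


print(parse_scm("scm:git:git://example.org/user/repo.git"))
print(parse_scm("https://example.org/user/repo.git"))
print(parse_scm("https://example.org/project"))
```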

get_origins_from_page(page: Dict[str, Any] | None) → Iterator[ListedOrigin]

Convert a page of Maven repositories into a list of ListedOrigins.

commit_page(page: Dict[str, Any] | None) → None

Update the currently stored state using the latest listed doc.

Note: this is a no-op in full listing mode.
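The state update can be pictured as keeping the highest doc ID seen so far. In the sketch below, SketchState is a stand-in for MavenListerState, the "doc" page key is hypothetical, and the rule that scm pages also advance last_seen_pom is an assumption based on the field descriptions; none of this is taken from the lister's source.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchState:
    """Stand-in for MavenListerState (assumption: same two int fields)."""
    last_seen_doc: int = -1
    last_seen_pom: int = -1


def commit_page(
    state: SketchState, page: Optional[Dict[str, Any]], incremental: bool
) -> None:
    """Hypothetical update rule; does nothing in full listing mode."""
    if not incremental or page is None:
        return
    doc = page["doc"]  # hypothetical key holding the doc ID
    state.last_seen_doc = max(state.last_seen_doc, doc)
    if page["type"] == "scm":  # assumption: scm pages come from POM files
        state.last_seen_pom = max(state.last_seen_pom, doc)


state = SketchState()
commit_page(state, {"type": "maven", "doc": 5}, incremental=True)
commit_page(state, {"type": "scm", "doc": 9}, incremental=True)
commit_page(state, {"type": "scm", "doc": 3}, incremental=True)  # older doc

full = SketchState()
commit_page(full, {"type": "scm", "doc": 9}, incremental=False)  # no-op
```

Using max rather than plain assignment keeps the counters monotonic even if pages arrive out of order.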

finalize() → None

Finalize the lister state, marking it as updated if any progress has been made.

Note: this is a no-op in full listing mode.