swh.lister.maven.lister module
- class swh.lister.maven.lister.MavenListerState(last_seen_doc: int = -1, last_seen_pom: int = -1)
Bases: object
State of the MavenLister
- class swh.lister.maven.lister.MavenLister(scheduler: SchedulerInterface, url: str, index_url: str | None = None, instance: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, incremental: bool = True, with_github_session=True, process_pom_files: bool = True)
Bases: Lister[MavenListerState, Dict[str, Any] | None]
List origins from a Maven repository.
Maven Central provides artifacts for Java builds. It includes POM files and source archives, which we download to get the source code of artifacts and links to their scm repository.
This lister yields origins of type git, svn, hg, or whatever repository type the artifacts declare, plus maven origins for the maven loader (tarball, source jar).
The lister relies on the maven index exporter tool, which converts the binary content of a maven repository index to NDJSON format (https://gitlab.softwareheritage.org/swh/devel/fixtures/maven-index-exporter). To run the tool, a Java runtime environment >= 17 must be available in the lister execution environment.
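As a rough illustration of what consuming NDJSON output involves, the sketch below parses one JSON object per line and skips blank lines. The record keys (`doc`, `path`) and the sample values are hypothetical placeholders, not the exporter's documented schema.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            # NDJSON files may contain blank lines; ignore them.
            continue
        yield json.loads(line)


# Hypothetical sample records; the real exporter's field names may differ.
sample = [
    '{"doc": 1, "path": "org/example/demo/1.0/demo-1.0.pom"}',
    "",
    '{"doc": 2, "path": "org/example/demo/1.0/demo-1.0-sources.jar"}',
]
records = list(iter_ndjson(sample))
```

Because the function accepts any iterable of strings, an open file object could be passed in place of the list to process a large export one line at a time.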
Lister class for Maven repositories.
- Parameters:
url – main URL of the Maven repository, i.e. the URL of the base index used to fetch maven artifacts. For Maven Central, use https://repo1.maven.org/maven2/
instance – Name of maven instance. Defaults to url’s network location if unset.
incremental – defaults to True. Defines whether incremental listing is activated.
with_github_session – defaults to True. Defines whether the canonical URL of an extracted GitHub repository should be retrieved with the GitHub REST API.
- state_from_dict(d: Dict[str, Any]) -> MavenListerState
Convert the state stored in the scheduler backend (as a dict), to the concrete StateType for this lister.
- state_to_dict(state: MavenListerState) -> Dict[str, Any]
Convert the StateType for this lister to its serialization as dict for storage in the scheduler.
Values must be JSON-compatible as that’s what the backend database expects.
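The round trip between the state object and its dict form can be pictured with a small stand-in. This is a minimal sketch, assuming the state is a plain dataclass with the two integer fields shown in the signature above; `SketchState` and the two module-level functions are hypothetical and are not the lister's own code.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class SketchState:
    """Hypothetical stand-in with the two documented MavenListerState fields."""

    last_seen_doc: int = -1
    last_seen_pom: int = -1


def state_to_dict(state: SketchState) -> Dict[str, Any]:
    # Plain ints only, so the result is JSON-compatible.
    return asdict(state)


def state_from_dict(d: Dict[str, Any]) -> SketchState:
    # Missing keys fall back to the dataclass defaults (-1).
    return SketchState(**d)


original = SketchState(last_seen_doc=42, last_seen_pom=7)
# Simulate storage in a JSON-based backend and reading it back.
stored = json.dumps(state_to_dict(original))
restored = state_from_dict(json.loads(stored))
```

An empty dict maps back to the default state, which is a convenient starting point for a first, non-incremental run.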
- get_pages() -> Iterator[Dict[str, Any] | None]
Retrieve and parse exported maven indexes to identify all pom files and src archives.
- get_scm(page: Dict[str, Any] | None) -> ListedOrigin | None
Retrieve scm origin out of the page information. Only called when type of the page is scm.
Try to detect an scm/vcs repository. Note that the official format is of the form scm:{type}:git://example.org/{user}/{repo}.git, but some projects put the repository URL directly (without the “scm:type” prefix), so the content has to be checked to extract the type and URL properly.
- Raises
AssertionError when the type of the page is not ‘scm’
- Returns
ListedOrigin with proper canonical scm url (for github) if any is found, None otherwise.
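The two cases described above (a full `scm:{type}:{url}` value versus a bare repository URL) can be sketched as follows. This is only an illustration under stated assumptions: `split_scm_url` is a hypothetical helper, and the fallback heuristics for bare URLs are guesses, not the checks the lister actually performs.

```python
import re
from typing import Optional, Tuple

# Matches the official "scm:{type}:{url}" form.
SCM_RE = re.compile(r"^scm:(?P<type>[^:]+):(?P<url>.+)$")


def split_scm_url(value: str) -> Optional[Tuple[str, str]]:
    """Return (type, url) for an scm value, or None if nothing is recognised."""
    value = value.strip()
    match = SCM_RE.match(value)
    if match:
        return match.group("type"), match.group("url")
    # Bare URL without the "scm:type" prefix: guess the type from the URL.
    # These heuristics are assumptions made for this sketch.
    if value.startswith(("git://", "git@")) or value.endswith(".git"):
        return "git", value
    if value.startswith("svn://"):
        return "svn", value
    if value.startswith("hg://"):
        return "hg", value
    return None


official = split_scm_url("scm:git:git://example.org/user/repo.git")
bare = split_scm_url("https://example.org/user/repo.git")
unknown = split_scm_url("https://example.org/some/page")
```

Returning `None` for unrecognised values mirrors the documented behaviour of yielding no origin when no scm repository can be detected.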
- get_origins_from_page(page: Dict[str, Any] | None) -> Iterator[ListedOrigin]
Convert a page of Maven repositories into a list of ListedOrigins.