swh.lister.sourceforge.lister module

class swh.lister.sourceforge.lister.VcsNames(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Used to filter SourceForge tool names down to valid VCS types.

CVS = 'cvs'
GIT = 'git'
SUBVERSION = 'svn'
MERCURIAL = 'hg'
BAZAAR = 'bzr'
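
As a rough illustration of how such an enum can filter tool names, the sketch below redefines the documented members locally so that it runs without `swh.lister` installed. The helper `filter_vcs_tools` and the sample tool names are hypothetical and are not part of the module.

```python
from enum import Enum
from typing import Iterable, List


class VcsNames(Enum):
    """Local copy of the documented members, so the sketch runs standalone."""

    CVS = "cvs"
    GIT = "git"
    SUBVERSION = "svn"
    MERCURIAL = "hg"
    BAZAAR = "bzr"


# Raw string values, collected once for cheap membership tests.
VCS_VALUES = {member.value for member in VcsNames}


def filter_vcs_tools(tool_names: Iterable[str]) -> List[VcsNames]:
    """Keep only the tool names that match a VcsNames value (hypothetical helper)."""
    return [VcsNames(name) for name in tool_names if name in VCS_VALUES]


# Sample tool names for one project; the non-VCS ones are dropped.
tools = ["wiki", "git", "tickets", "svn", "discussion"]
vcs_tools = filter_vcs_tools(tools)
```

Looking values up through `VcsNames(name)` keeps the result typed as enum members rather than bare strings.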
class swh.lister.sourceforge.lister.SourceForgeListerEntry(vcs: swh.lister.sourceforge.lister.VcsNames, url: str, last_modified: datetime.date)

Bases: object

vcs: VcsNames
url: str
last_modified: date
class swh.lister.sourceforge.lister.SourceForgeListerState(subsitemap_last_modified: Dict[str, date] = <factory>, empty_projects: Dict[str, date] = <factory>)

Bases: object

Current state of the SourceForge lister in incremental runs

subsitemap_last_modified: Dict[str, date]
empty_projects: Dict[str, date]

Some projects (not the majority, but still meaningful) have no VCS for us to archive. We need to remember a mapping of their API URL to their “last modified” date so we don’t keep querying them needlessly every time.
class swh.lister.sourceforge.lister.SourceForgeLister(scheduler: SchedulerInterface, url: str = 'https://sourceforge.net', instance: str = 'main', incremental: bool = False, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True)

Bases: Lister[SourceForgeListerState, List[SourceForgeListerEntry]]

List origins from the “SourceForge” forge.

SOURCEFORGE_URL = 'https://sourceforge.net'
LISTER_NAME: str = 'sourceforge'
INSTANCE = 'main'
state_from_dict(d: Dict[str, Dict[str, Any]]) → SourceForgeListerState

Convert the state stored in the scheduler backend (as a dict) to the concrete StateType for this lister.

state_to_dict(state: SourceForgeListerState) → Dict[str, Any]

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.
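
The round trip between the state object and its dict form can be sketched as follows. This is a minimal illustration, not the lister's actual code: it assumes dates are stored as ISO 8601 strings, uses a local stand-in dataclass with the two documented fields, and the URLs used as keys are placeholders.

```python
import json
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict


@dataclass
class SourceForgeListerState:
    """Local stand-in mirroring the two documented fields."""

    subsitemap_last_modified: Dict[str, date] = field(default_factory=dict)
    empty_projects: Dict[str, date] = field(default_factory=dict)


FIELDS = ("subsitemap_last_modified", "empty_projects")


def state_to_dict(state: SourceForgeListerState) -> Dict[str, Any]:
    """Serialize dates as ISO strings (an assumption) so the result is JSON-compatible."""
    return {
        name: {key: day.isoformat() for key, day in getattr(state, name).items()}
        for name in FIELDS
    }


def state_from_dict(d: Dict[str, Dict[str, Any]]) -> SourceForgeListerState:
    """Rebuild the state, parsing the ISO strings back into date objects."""
    return SourceForgeListerState(
        **{
            name: {key: date.fromisoformat(day) for key, day in d.get(name, {}).items()}
            for name in FIELDS
        }
    )


# Placeholder keys, chosen only for illustration.
state = SourceForgeListerState(
    subsitemap_last_modified={"https://example.org/sitemap-0.xml": date(2021, 3, 1)},
    empty_projects={"https://example.org/rest/p/empty": date(2020, 11, 20)},
)
serialized = state_to_dict(state)
payload = json.dumps(serialized)  # succeeds because every value is a str
restored = state_from_dict(json.loads(payload))
```

Using `d.get(name, {})` lets an empty or partial dict deserialize to an empty state instead of raising.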

projects_last_modified() → Dict[Tuple[str, str], date]
get_pages() → Iterator[List[SourceForgeListerEntry]]

SourceForge has a main XML sitemap that lists its sharded sitemaps for all projects. Each XML sub-sitemap lists project pages, which are not unique per project: a project can have a wiki, a home page, a Git repository, an SVN repository, etc. For each unique project, we query an API endpoint that lists (among other things) the tools associated with that project, some of which are the VCS in use. Subprojects are considered separate projects. Finally, we use the list of VCS tools in use to build the predictable clone URL for each of them.
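
The deduplication and URL-building steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the lister's implementation: the sub-sitemap content is made up, page URLs are assumed to follow a `/p/<project>/<tool>/` layout, the clone URL template is hypothetical, and subprojects and the API query are left out.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Dict
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sub-sitemap: two pages for "example", one for "other".
SUBSITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://sourceforge.net/p/example/wiki/</loc><lastmod>2021-01-05</lastmod></url>
  <url><loc>https://sourceforge.net/p/example/git/</loc><lastmod>2021-03-01</lastmod></url>
  <url><loc>https://sourceforge.net/p/other/svn/</loc><lastmod>2020-11-20</lastmod></url>
</urlset>
"""


def unique_projects(xml_text: str) -> Dict[str, date]:
    """Map each project name to the most recent lastmod among its pages."""
    projects: Dict[str, date] = {}
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        parts = urlparse(loc).path.strip("/").split("/")
        # Assumed layout: /p/<project>/<tool>/ ; anything else is skipped.
        if len(parts) < 2 or parts[0] != "p" or not lastmod:
            continue
        name, modified = parts[1], date.fromisoformat(lastmod)
        if name not in projects or modified > projects[name]:
            projects[name] = modified
    return projects


def clone_url(vcs: str, project: str, mount_point: str) -> str:
    """Build a clone URL from a hypothetical, predictable template."""
    return f"https://{vcs}.code.sf.net/p/{project}/{mount_point}"


projects = unique_projects(SUBSITEMAP)
example_url = clone_url("git", "example", "code")
```

Keeping the latest `lastmod` per project is one plausible way to decide whether a project changed since the previous run; the real lister may track this differently.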

get_origins_from_page(page: List[SourceForgeListerEntry]) → Iterator[ListedOrigin]

Extract a list of model.ListedOrigin from a raw page of results.

Parameters:

page – a single page of results

Returns:

an iterator for the origins present on the given page of results
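
A generator of this shape might look like the sketch below. `OriginSketch` is a hypothetical stand-in for `ListedOrigin`, whose real fields may differ; the entry class is redefined locally with the three documented attributes, and the URLs are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Iterator, List


class VcsNames(Enum):
    """Subset of the documented members, enough for the example."""

    GIT = "git"
    SUBVERSION = "svn"


@dataclass
class SourceForgeListerEntry:
    """Local stand-in with the three documented attributes."""

    vcs: VcsNames
    url: str
    last_modified: date


@dataclass
class OriginSketch:
    """Hypothetical stand-in for ListedOrigin; real fields may differ."""

    url: str
    visit_type: str
    last_update: date


def origins_from_page(page: List[SourceForgeListerEntry]) -> Iterator[OriginSketch]:
    """Lazily yield one origin per entry of the page."""
    for entry in page:
        yield OriginSketch(
            url=entry.url,
            visit_type=entry.vcs.value,  # assumes the enum value names the visit type
            last_update=entry.last_modified,
        )


# Illustrative page with one Git and one Subversion entry.
page = [
    SourceForgeListerEntry(VcsNames.GIT, "https://example.org/p/demo/code", date(2021, 3, 1)),
    SourceForgeListerEntry(VcsNames.SUBVERSION, "https://example.org/p/demo/svn", date(2020, 11, 20)),
]
origins = list(origins_from_page(page))
```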