swh.lister.save_bulk.lister module

swh.lister.save_bulk.lister.is_valid_tarball_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL targets a tarball using a set of heuristics.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL targets a tarball and whose second member holds an optional error message if the check failed
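The (bool, error message) return convention shared by the is_valid_*_url functions can be illustrated with a stand-in check. This is only a sketch: the helper name looks_like_tarball, the ARCHIVE_SUFFIXES list, and the suffix-matching heuristic are assumptions made for illustration, since this page does not describe the heuristics that is_valid_tarball_url actually applies.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical suffix list, for illustration only.
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(origin_url: str) -> Tuple[bool, Optional[str]]:
    """Stand-in with the same return shape as is_valid_tarball_url."""
    # Only the path is inspected, so query strings do not affect the match.
    path = urlparse(origin_url).path.lower()
    if path.endswith(ARCHIVE_SUFFIXES):
        return True, None
    return False, f"{origin_url} does not look like a tarball URL"


ok, error = looks_like_tarball("https://example.org/downloads/project.tar.gz")
if not ok:
    print(f"rejected: {error}")
```

Callers are expected to test the first member and only read the second one when the check failed.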

swh.lister.save_bulk.lister.is_valid_git_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL targets a public Git repository by attempting to list its remote refs.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL targets a public Git repository and whose second member holds an error message if the check failed
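Listing remote refs can be sketched with the git command-line client. The following is a minimal sketch, not the implementation used by is_valid_git_url: the function name check_git_url, the injectable run parameter, and the 30-second timeout are assumptions introduced here, and the library's own approach may differ.

```python
import os
import subprocess
from typing import Callable, Optional, Tuple


def check_git_url(
    origin_url: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Tuple[bool, Optional[str]]:
    """Hypothetical check: succeed if `git ls-remote` can list the refs."""
    # Disable interactive credential prompts so private repositories fail fast.
    env = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}
    try:
        result = run(
            ["git", "ls-remote", origin_url],
            capture_output=True,
            text=True,
            timeout=30,  # assumed value, for illustration
            env=env,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # git is not installed, or the remote did not answer in time.
        return False, str(exc)
    if result.returncode != 0:
        return False, result.stderr.strip() or "git ls-remote failed"
    return True, None
```

Making run injectable keeps the sketch testable without network access; by default it calls subprocess.run.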

swh.lister.save_bulk.lister.is_valid_svn_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL targets a public Subversion repository by attempting to get repository information.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL targets a public Subversion repository and whose second member holds an error message if the check failed

swh.lister.save_bulk.lister.is_valid_hg_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL targets a public Mercurial repository by attempting to connect to the remote repository.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL targets a public Mercurial repository and whose second member holds an error message if the check failed

swh.lister.save_bulk.lister.is_valid_bzr_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL targets a public Bazaar repository by attempting to get repository information.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL targets a public Bazaar repository and whose second member holds an error message if the check failed

swh.lister.save_bulk.lister.is_valid_cvs_url(origin_url: str) → Tuple[bool, str | None]

Check whether a URL matches one of the formats expected by the CVS loader of Software Heritage.

Parameters:

origin_url – The URL to check

Returns:

a tuple whose first member indicates whether the URL matches one of the formats expected by the CVS loader and whose second member holds an error message if the check failed

class swh.lister.save_bulk.lister.SubmittedOrigin

Bases: TypedDict

origin_url: str

visit_type: str

class swh.lister.save_bulk.lister.RejectedOrigin(origin_url: str, visit_type: str, reason: str, exception: str | None)

Bases: object

origin_url: str

visit_type: str

reason: str

exception: str | None

class swh.lister.save_bulk.lister.SaveBulkListerState(rejected_origins: List[RejectedOrigin] = <factory>)

Bases: object

Stored lister state

rejected_origins: List[RejectedOrigin]

List of origins rejected by the lister.

class swh.lister.save_bulk.lister.SaveBulkLister(url: str, instance: str, scheduler: SchedulerInterface, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, per_page: int = 1000, max_workers: int = 4)

Bases: Lister[SaveBulkListerState, List[SubmittedOrigin]]

The save-bulk lister verifies a list of origins to archive, as provided by an HTTP endpoint. Its purpose is to avoid polluting the scheduler database with origins that cannot be loaded into the archive.

Each origin is identified by a URL and a visit type. For a given visit type, the lister checks whether the origin URL can be found and whether the visit type is valid.

The HTTP endpoint must return the list of origins in a paginated way, using two integer query parameters: page indicates the page to fetch and per_page is the number of origins per page. The endpoint must return a JSON list in the following format:

[
    {
        "origin_url": "https://git.example.org/user/project",
        "visit_type": "git"
    },
    {
        "origin_url": "https://example.org/downloads/project.tar.gz",
        "visit_type": "tarball-directory"
    }
]
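Consuming such an endpoint page by page can be sketched as follows. This is a sketch under stated assumptions rather than the lister's actual loop: the names iter_pages and fake_fetch are hypothetical, page numbering is assumed to start at 1, and an empty JSON list is assumed to mark the end of the listing. In real use, fetch would issue a GET request carrying the page and per_page query parameters and return the decoded JSON list.

```python
from typing import Callable, Dict, Iterator, List

Page = List[Dict[str, str]]


def iter_pages(
    fetch: Callable[[int, int], Page], per_page: int = 1000
) -> Iterator[Page]:
    """Yield pages of submitted origins until an empty page is returned.

    Assumptions: numbering starts at 1 and an empty page ends the listing.
    """
    page = 1
    while True:
        origins = fetch(page, per_page)
        if not origins:
            break
        yield origins
        page += 1


# In-memory stand-in for the HTTP endpoint, for demonstration only.
data = [
    {"origin_url": f"https://git.example.org/user/project{i}", "visit_type": "git"}
    for i in range(5)
]


def fake_fetch(page: int, per_page: int) -> Page:
    start = (page - 1) * per_page
    return data[start : start + per_page]


pages = list(iter_pages(fake_fetch, per_page=2))
```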

The supported visit types are those for VCS (bzr, cvs, hg, git and svn) plus the one for loading the content of a tarball into the archive (tarball-directory).

Accepted origins are inserted into the scheduler database, or updated there if already present.

Rejected origins are stored in the lister state.
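The split between accepted and rejected origins can be sketched as a dispatch on the visit type. This is an illustrative sketch only: the Rejected dataclass is a stand-in mirroring the fields of RejectedOrigin, and the triage function, the checkers mapping, and the reason strings are assumptions, not the lister's own code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Checker = Callable[[str], Tuple[bool, Optional[str]]]


@dataclass
class Rejected:
    """Stand-in mirroring the fields of RejectedOrigin."""

    origin_url: str
    visit_type: str
    reason: str
    exception: Optional[str] = None


def triage(
    origins: List[Dict[str, str]], checkers: Dict[str, Checker]
) -> Tuple[List[Dict[str, str]], List[Rejected]]:
    """Split submitted origins into accepted and rejected ones."""
    accepted: List[Dict[str, str]] = []
    rejected: List[Rejected] = []
    for origin in origins:
        url, visit_type = origin["origin_url"], origin["visit_type"]
        checker = checkers.get(visit_type)
        if checker is None:
            rejected.append(Rejected(url, visit_type, "unsupported visit type"))
            continue
        try:
            ok, error = checker(url)
        except Exception as exc:
            # Keep a textual trace of unexpected failures.
            rejected.append(Rejected(url, visit_type, "check raised", repr(exc)))
            continue
        if ok:
            accepted.append(origin)
        else:
            rejected.append(Rejected(url, visit_type, error or "check failed"))
    return accepted, rejected
```

In this sketch, checkers would map each supported visit type (for example "git" or "tarball-directory") to a function with the same return shape as the is_valid_*_url helpers.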

LISTER_NAME: str = 'save-bulk'

state_from_dict(d: Dict[str, Any]) → SaveBulkListerState

Convert the state stored in the scheduler backend (as a dict), to the concrete StateType for this lister.

state_to_dict(state: SaveBulkListerState) → Dict[str, Any]

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.
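A JSON-compatible round trip of such a state can be sketched with dataclasses. The Rejected and State classes below are stand-ins that mirror the documented fields of RejectedOrigin and SaveBulkListerState, on the assumption that both behave like dataclasses; to_dict and from_dict are hypothetical helpers, not the methods documented here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Rejected:
    """Stand-in mirroring the fields of RejectedOrigin."""

    origin_url: str
    visit_type: str
    reason: str
    exception: Optional[str]


@dataclass
class State:
    """Stand-in mirroring SaveBulkListerState."""

    rejected_origins: List[Rejected] = field(default_factory=list)


def to_dict(state: State) -> Dict[str, Any]:
    # asdict recurses into nested dataclasses, yielding plain dicts and lists.
    return asdict(state)


def from_dict(d: Dict[str, Any]) -> State:
    return State(
        rejected_origins=[Rejected(**r) for r in d.get("rejected_origins", [])]
    )


state = State([Rejected("https://example.org/x", "git", "not found", None)])
# Passing through JSON shows that the serialized form is JSON-compatible.
restored = from_dict(json.loads(json.dumps(to_dict(state))))
```

Only strings, None, lists, and dicts appear in the serialized form, which is what keeps it storable as JSON.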

get_pages() → Iterator[List[SubmittedOrigin]]

Retrieve a list of pages of listed results. This is the main loop of the lister.

Returns:

an iterator of raw pages fetched from the platform currently being listed.

check_origin(origin_url: str, visit_type: str) → ListedOrigin | RejectedOrigin

get_origins_from_page(origins: List[SubmittedOrigin]) → Iterator[ListedOrigin]

Extract a list of model.ListedOrigin from a raw page of results.

Parameters:

origins – a single page of results

Returns:

an iterator for the origins present on the given page of results