swh.lister.github.lister module

class swh.lister.github.lister.GitHubListerState(last_seen_id: int = 0)

Bases: object

State of the GitHub lister.

last_seen_id: int = 0

Numeric id of the last repository listed on an incremental pass.

class swh.lister.github.lister.GitHubLister(scheduler: SchedulerInterface, url: str = 'https://api.github.com/repositories', instance: str = 'github', credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, first_id: int | None = None, last_id: int | None = None)

Bases: Lister[GitHubListerState, List[Dict[str, Any]]]

List origins from GitHub.

By default, the lister runs in incremental mode: it lists all repositories, starting with the last_seen_id stored in the scheduler backend.

Providing the first_id and last_id arguments enables the “relisting” mode: in that mode, the lister lists the origins whose ids fall in the range that excludes first_id and includes last_id. In this mode, the lister can overrun last_id, because it always records every origin seen in a given page, including the page on which last_id is passed. As the lister is fully idempotent, this is not a practical problem. When relisting completes, the lister state in the scheduler backend is not updated.
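The overrun behaviour in relisting mode can be illustrated with a minimal sketch. It assumes that pages are lists of dicts carrying an ascending id key and that first_id was already applied when the pages were fetched; relist_pages is a hypothetical helper, not part of the lister's API.

```python
from typing import Any, Dict, Iterable, Iterator, List

Page = List[Dict[str, Any]]


def relist_pages(pages: Iterable[Page], last_id: int) -> Iterator[Page]:
    """Yield whole pages, stopping after the first page that passes last_id."""
    for page in pages:
        if not page:
            break
        # The full page is recorded, even if some of its ids exceed last_id.
        yield page
        if page[-1]["id"] > last_id:
            break


pages = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 7}], [{"id": 8}]]
recorded = [repo["id"] for page in relist_pages(pages, last_id=5) for repo in page]
```

In this sketch the repository with id 7 is still recorded although it lies past last_id, while the third page is never requested.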

When the config contains a set of credentials, the lister shuffles this list at the beginning of the listing. To follow GitHub’s abuse rate limit policy, it keeps using the same token until that token’s rate limit runs out, then switches to the next token in the shuffled list.

When a request fails with a rate limit exception for all tokens, the listing is paused until the largest X-Ratelimit-Reset value reported across all tokens.
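The token handling described above might be sketched as follows. This is an illustration under assumptions, not the lister's actual implementation: TokenPoolSketch and mark_rate_limited are hypothetical names, reset times are plain epoch seconds, and at least one token is assumed to be configured.

```python
import random
from typing import Dict, List, Optional


class TokenPoolSketch:
    """Hypothetical token pool: shuffle once, then rotate on rate limit."""

    def __init__(self, tokens: List[str], rng: Optional[random.Random] = None):
        self.tokens = list(tokens)
        # Shuffle once, at the beginning of the listing.
        (rng or random).shuffle(self.tokens)
        self.index = 0
        self.reset_at: Dict[str, float] = {}

    @property
    def current(self) -> str:
        return self.tokens[self.index]

    def mark_rate_limited(self, reset_at: float) -> Optional[float]:
        """Record the reset time of the current token and move to the next.

        Returns None while some token is still untried; once every token
        has been rate limited, returns the time to sleep until (the largest
        recorded reset time) and forgets the recorded values.
        """
        self.reset_at[self.current] = reset_at
        self.index = (self.index + 1) % len(self.tokens)
        if len(self.reset_at) == len(self.tokens):
            wake_up = max(self.reset_at.values())
            self.reset_at.clear()
            return wake_up
        return None


pool = TokenPoolSketch(["a", "b", "c"], rng=random.Random(0))
first = pool.current
```

A caller would keep sending requests with pool.current, call mark_rate_limited() with the X-Ratelimit-Reset value when a request is rejected, and sleep only when a wake-up time is returned.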

When no credentials are set in the lister config, the lister can also run in anonymous mode (e.g. for testing purposes).

Parameters:
  • first_id – the id after which relisting starts (the listed range excludes first_id itself)

  • last_id – stop listing after seeing a repo with an id higher than this value.

LISTER_NAME: str = 'github'
INSTANCE = 'github'
API_URL = 'https://api.github.com/repositories'
PAGE_SIZE = 1000
state_from_dict(d: Dict[str, Any]) → GitHubListerState

Convert the state stored in the scheduler backend (as a dict), to the concrete StateType for this lister.

state_to_dict(state: GitHubListerState) → Dict[str, Any]

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.
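A minimal sketch of the state round trip, using a stand-in dataclass rather than the real GitHubListerState. The json encoding step is only there to show that the dict is JSON-compatible; how the scheduler backend actually stores it is not shown here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class StateSketch:
    """Stand-in for GitHubListerState (a single integer field)."""

    last_seen_id: int = 0


def state_to_dict(state: StateSketch) -> Dict[str, Any]:
    # Only JSON-compatible values (here, one int) end up in the dict.
    return asdict(state)


def state_from_dict(d: Dict[str, Any]) -> StateSketch:
    return StateSketch(**d)


stored = json.dumps(state_to_dict(StateSketch(last_seen_id=1234)))
restored = state_from_dict(json.loads(stored))
```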

get_pages() → Iterator[List[Dict[str, Any]]]

Retrieve a list of pages of listed results. This is the main loop of the lister.

Returns:

an iterator of raw pages fetched from the platform currently being listed.
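The shape of such a main loop can be sketched with an injected fetch function in place of real HTTP requests. This assumes id-based pagination, where each request asks for repositories with an id greater than the last one seen (GitHub's since query parameter); iter_pages and fake_fetch are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, Any]]


def iter_pages(
    fetch_page: Callable[[int], Page],
    start_id: int = 0,
    max_pages: Optional[int] = None,
) -> Iterator[Page]:
    """Yield pages until the platform returns an empty one or max_pages is hit."""
    since = start_id
    fetched = 0
    while max_pages is None or fetched < max_pages:
        page = fetch_page(since)
        if not page:
            break
        yield page
        fetched += 1
        # The next request resumes after the last id of this page.
        since = page[-1]["id"]


FAKE_IDS = [1, 2, 3, 4, 5]


def fake_fetch(since: int, per_page: int = 2) -> Page:
    """Pretend API: return up to per_page repositories with id > since."""
    return [{"id": i} for i in FAKE_IDS if i > since][:per_page]


all_pages = list(iter_pages(fake_fetch))
```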

get_origins_from_page(page: List[Dict[str, Any]]) → Iterator[ListedOrigin]

Convert a page of GitHub repositories into an iterator of ListedOrigin objects.

This records the html_url, as well as the pushed_at value if it exists.
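As an illustration, the conversion could look like the sketch below. OriginSketch is a stand-in for ListedOrigin with only two assumed fields, and the timestamp parsing assumes the UTC format GitHub uses in its JSON responses (for example 2020-01-02T03:04:05Z); the real lister may parse dates differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class OriginSketch:
    """Stand-in for ListedOrigin; the field names are assumptions."""

    url: str
    last_update: Optional[datetime] = None


def origins_from_page(page: List[Dict[str, Any]]) -> Iterator[OriginSketch]:
    for repo in page:
        pushed_at = repo.get("pushed_at")
        last_update = (
            datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ").replace(
                tzinfo=timezone.utc
            )
            if pushed_at
            else None  # pushed_at is absent or null for some repositories
        )
        yield OriginSketch(url=repo["html_url"], last_update=last_update)


page = [
    {
        "id": 1,
        "html_url": "https://github.com/example/one",
        "pushed_at": "2020-01-02T03:04:05Z",
    },
    {"id": 2, "html_url": "https://github.com/example/two", "pushed_at": None},
]
origins = list(origins_from_page(page))
```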

commit_page(page: List[Dict[str, Any]])

Update the currently stored state using the latest listed page.
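One plausible reading of this update, sketched under the assumption that repositories in a page are ordered by ascending id and that the state only ever moves forward. StateSketch and commit_page_sketch are hypothetical stand-ins, not the lister's code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StateSketch:
    """Stand-in for GitHubListerState."""

    last_seen_id: int = 0


def commit_page_sketch(state: StateSketch, page: List[Dict[str, Any]]) -> None:
    """Advance last_seen_id to the last id of the page, never backwards."""
    if not page:
        return
    last_id = page[-1]["id"]
    if last_id > state.last_seen_id:
        state.last_seen_id = last_id


state = StateSketch(last_seen_id=10)
commit_page_sketch(state, [{"id": 11}, {"id": 15}])
after_newer_page = state.last_seen_id
commit_page_sketch(state, [{"id": 3}, {"id": 4}])
after_older_page = state.last_seen_id
```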

finalize()

Custom hook to finalize the lister state before returning from the main loop.

This method must set updated if the lister has done some work.

If relevant, this method can use get_state_from_scheduler() to merge the current lister state with the one from the scheduler backend, reducing the risk of race conditions if we’re running concurrent listings.

This method is called in a finally block, which means it will also run when the lister fails.
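The interplay between the main loop, the finally block, and the updated flag can be sketched as follows. RunSketch is a hypothetical stand-in for the lister, and its stored attribute plays the role of whatever get_state_from_scheduler() would return; the comparison rule (mark as updated only when the local state is ahead) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class StateSketch:
    """Stand-in for GitHubListerState."""

    last_seen_id: int = 0


class RunSketch:
    def __init__(self, stored: StateSketch):
        self.stored = stored  # state as currently held by the scheduler
        self.state = StateSketch(stored.last_seen_id)
        self.updated = False

    def finalize(self) -> None:
        # Only flag an update if this run got further than the stored state.
        if self.state.last_seen_id > self.stored.last_seen_id:
            self.updated = True

    def run(self, pages: Iterable[List[Dict[str, Any]]]) -> None:
        try:
            for page in pages:
                self.state.last_seen_id = max(
                    self.state.last_seen_id, page[-1]["id"]
                )
        finally:
            # Runs on success and on failure alike.
            self.finalize()


def failing_pages():
    yield [{"id": 10}]
    raise RuntimeError("simulated failure")


run = RunSketch(StateSketch(last_seen_id=5))
try:
    run.run(failing_pages())
except RuntimeError:
    pass
```

Even though the simulated listing fails after one page, finalize() still runs, so the progress made before the failure is flagged as an update.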