swh.lister.crates.lister module#

class swh.lister.crates.lister.CratesListerState(index_last_update: datetime | None = None)[source]#

Bases: object

Store lister state for incremental mode operations. ‘index_last_update’ represents the UTC time the crates.io database dump was started

index_last_update: datetime | None = None#
class swh.lister.crates.lister.CratesLister(scheduler: SchedulerInterface, url: str = 'https://crates.io', instance: str = 'crates', credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True)[source]#

Bases: Lister[CratesListerState, List[Dict[str, Any]]]

List origins from the “crates.io” forge.

It downloads a tar.gz archive which contains crates.io database table content as csv files which is automatically generated every 24 hours. Parsing two csv files we can list all Crates.io package names and their related versions.

In incremental mode, it check each entry comparing their ‘last_update’ value with self.state.index_last_update

LISTER_NAME: str = 'crates'#
VISIT_TYPE = 'crates'#
INSTANCE = 'crates'#
BASE_URL = 'https://crates.io'#
DB_DUMP_URL = 'https://static.crates.io/db-dump.tar.gz'#
CRATE_FILE_URL_PATTERN = 'https://static.crates.io/crates/{crate}/{crate}-{version}.crate'#
CRATE_URL_PATTERN = 'https://crates.io/crates/{crate}'#
state_from_dict(d: Dict[str, Any]) CratesListerState[source]#

Convert the state stored in the scheduler backend (as a dict), to the concrete StateType for this lister.

state_to_dict(state: CratesListerState) Dict[str, Any][source]#

Convert the StateType for this lister to its serialization as dict for storage in the scheduler.

Values must be JSON-compatible as that’s what the backend database expects.

is_new(dt_str: str)[source]#

Returns True when dt_str is greater than self.state.index_last_update

get_and_parse_db_dump() Dict[str, Any][source]#

Download and parse csv files from db_dump_path.

Returns a dict where each entry corresponds to a package name with its related versions.

page_entry_dict(entry: Dict[str, Any]) Dict[str, Any][source]#

Transform package version definition dict to a suitable page entry dict

get_pages() Iterator[List[Dict[str, Any]]][source]#

Each page is a list of crate versions with: - name: Name of the crate - version: Version - checksum: Checksum - yanked: Whether the package is yanked or not - crate_file: Url of the crate file - filename: File name of the crate file - last_update: Last update for that version

get_origins_from_page(page: List[Dict[str, Any]]) Iterator[ListedOrigin][source]#

Iterate on all crate pages and yield ListedOrigin instances.

finalize() None[source]#

Custom hook to finalize the lister state before returning from the main loop.

This method must set updated if the lister has done some work.

If relevant, this method can use :meth`get_state_from_scheduler` to merge the current lister state with the one from the scheduler backend, reducing the risk of race conditions if we’re running concurrent listings.

This method is called in a finally block, which means it will also run when the lister fails.