swh.lister.rubygems.lister module

class swh.lister.rubygems.lister.RubyGemsLister(scheduler: SchedulerInterface, url: str = 'https://s3-us-west-2.amazonaws.com/rubygems-dumps', instance: str = 'rubygems', credentials: Optional[Dict[str, Dict[str, List[Dict[str, str]]]]] = None, max_origins_per_page: Optional[int] = None, max_pages: Optional[int] = None, enable_origins: bool = True)

Bases: StatelessLister[Dict[str, Any]]

Lister for RubyGems.org, the Ruby community’s gem hosting service.

Instead of querying the rubygems.org Web API, it uses gem data from the daily PostgreSQL database dump of rubygems. This makes it possible to gather all relevant information about a gem and its release artifacts (version number, download URL, checksums, release date) efficiently, without flooding the rubygems Web API with numerous HTTP requests (more than 187000 gems were available as of 07/10/2022).
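As an illustration of the dump-based approach, the sketch below picks the most recent dump from an S3 bucket listing such as the one served at RUBY_GEMS_POSTGRES_DUMP_LIST_URL. It is not the lister's actual code: the helper name `latest_dump_key`, the sample listing, and the key layout are assumptions made for the example, and it further assumes that timestamped keys sort chronologically.

```python
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket listing (ListBucketResult) documents.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical, hand-written listing; real key names may differ.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>rubygems-dumps</Name>
  <Prefix>production/public_postgresql</Prefix>
  <Contents>
    <Key>production/public_postgresql/2022.10.05.21.21.12/public_postgresql.tar</Key>
  </Contents>
  <Contents>
    <Key>production/public_postgresql/2022.10.07.21.21.12/public_postgresql.tar</Key>
  </Contents>
  <Contents>
    <Key>production/public_postgresql/2022.10.06.21.21.12/public_postgresql.tar</Key>
  </Contents>
</ListBucketResult>
"""


def latest_dump_key(listing_xml: str) -> str:
    """Return the greatest key in the listing (hypothetical helper).

    Assumes keys embed a zero-padded timestamp, so that the
    lexicographic maximum is also the most recent dump.
    """
    root = ET.fromstring(listing_xml)
    keys = [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS) if el.text]
    if not keys:
        raise ValueError("no dump found in listing")
    return max(keys)


if __name__ == "__main__":
    print(latest_dump_key(SAMPLE_LISTING))
```

A single listing request followed by one archive download replaces the per-gem HTTP requests that the Web API would otherwise require.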

LISTER_NAME: str = 'rubygems'
VISIT_TYPE = 'rubygems'
INSTANCE = 'rubygems'
RUBY_GEMS_POSTGRES_DUMP_BASE_URL = 'https://s3-us-west-2.amazonaws.com/rubygems-dumps'
RUBY_GEMS_POSTGRES_DUMP_LIST_URL = 'https://s3-us-west-2.amazonaws.com/rubygems-dumps?prefix=production/public_postgresql'
RUBY_GEM_DOWNLOAD_URL_PATTERN = 'https://rubygems.org/downloads/{gem}-{version}.gem'
RUBY_GEM_ORIGIN_URL_PATTERN = 'https://rubygems.org/gems/{gem}'
RUBY_GEM_EXTRINSIC_METADATA_URL_PATTERN = 'https://rubygems.org/api/v2/rubygems/{gem}/versions/{version}.json'
DB_NAME = 'rubygems'
DUMP_SQL_PATH = 'public_postgresql/databases/PostgreSQL.sql.gz'
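The three URL patterns among the constants above are ordinary format strings with `{gem}` and `{version}` placeholders. The sketch below shows how they might be expanded for one release; the helper `gem_urls` and the example gem name and version are illustrative and not part of the module.

```python
from typing import Dict

# Pattern values copied from the class attributes documented above.
RUBY_GEM_DOWNLOAD_URL_PATTERN = "https://rubygems.org/downloads/{gem}-{version}.gem"
RUBY_GEM_ORIGIN_URL_PATTERN = "https://rubygems.org/gems/{gem}"
RUBY_GEM_EXTRINSIC_METADATA_URL_PATTERN = (
    "https://rubygems.org/api/v2/rubygems/{gem}/versions/{version}.json"
)


def gem_urls(gem: str, version: str) -> Dict[str, str]:
    """Expand the URL patterns for one gem release (hypothetical helper)."""
    return {
        "origin": RUBY_GEM_ORIGIN_URL_PATTERN.format(gem=gem),
        "download": RUBY_GEM_DOWNLOAD_URL_PATTERN.format(gem=gem, version=version),
        "metadata": RUBY_GEM_EXTRINSIC_METADATA_URL_PATTERN.format(
            gem=gem, version=version
        ),
    }


if __name__ == "__main__":
    for kind, url in gem_urls("rails", "7.0.4").items():
        print(kind, url)
```

Note that the origin URL depends only on the gem name, so all releases of a gem share one origin.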
get_latest_dump_file() -> str
create_rubygems_db(postgresql: Postgresql) -> Tuple[str, connection]
populate_rubygems_db(db_url: str)
get_pages() -> Iterator[Dict[str, Any]]

Retrieve a list of pages of listed results. This is the main loop of the lister.

Returns:

an iterator of raw pages fetched from the platform currently being listed.

get_origins_from_page(page: Dict[str, Any]) -> Iterator[ListedOrigin]

Extract a list of model.ListedOrigin from a raw page of results.

Parameters:

page – a single page of results

Returns:

an iterator for the origins present on the given page of results
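The two methods above form a two-stage pipeline: get_pages yields raw pages, and get_origins_from_page turns each page into origins. The following self-contained sketch mimics that flow with toy data. The `ToyOrigin` class is a hypothetical stand-in for model.ListedOrigin, the page shape (one dict per gem with `name` and `versions` keys) is an assumption, and the functions take their input as an argument instead of reading it from a database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List

ORIGIN_URL_PATTERN = "https://rubygems.org/gems/{gem}"


@dataclass
class ToyOrigin:
    """Hypothetical stand-in for model.ListedOrigin."""

    url: str
    visit_type: str
    extra: Dict[str, Any] = field(default_factory=dict)


def get_pages(rows: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one raw page per gem (assumed page shape)."""
    for row in rows:
        yield row


def get_origins_from_page(page: Dict[str, Any]) -> Iterator[ToyOrigin]:
    """Convert one raw page into origins, here exactly one per gem."""
    yield ToyOrigin(
        url=ORIGIN_URL_PATTERN.format(gem=page["name"]),
        visit_type="rubygems",
        extra={"versions": page["versions"]},
    )


def list_origins(rows: List[Dict[str, Any]]) -> Iterator[ToyOrigin]:
    """Chain the two stages, as a lister's main loop would."""
    for page in get_pages(rows):
        yield from get_origins_from_page(page)


# Made-up sample rows, for illustration only.
SAMPLE_ROWS = [
    {"name": "rails", "versions": ["7.0.3", "7.0.4"]},
    {"name": "rake", "versions": ["13.0.6"]},
]

if __name__ == "__main__":
    for origin in list_origins(SAMPLE_ROWS):
        print(origin.url, origin.extra["versions"])
```

Because both stages are generators, pages are processed one at a time and the full list of gems never has to be held in memory.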

url: str
recorded_origins: Set[str]