swh.lister.crates package#
Submodules#
- swh.lister.crates.lister module
CratesListerState
CratesLister
CratesLister.LISTER_NAME
CratesLister.VISIT_TYPE
CratesLister.INSTANCE
CratesLister.BASE_URL
CratesLister.DB_DUMP_URL
CratesLister.CRATE_FILE_URL_PATTERN
CratesLister.CRATE_URL_PATTERN
CratesLister.state_from_dict()
CratesLister.state_to_dict()
CratesLister.is_new()
CratesLister.get_and_parse_db_dump()
CratesLister.page_entry_dict()
CratesLister.get_pages()
CratesLister.get_origins_from_page()
CratesLister.finalize()
- swh.lister.crates.tasks module
Module contents#
Crates lister#
The Crates lister lists origins from Crates.io, the Rust community’s crate registry.
Origins are packages for the Rust language ecosystem. Packages follow a layout specification to be usable with the Cargo package manager and have a Cargo.toml manifest file, which consists of metadata to describe and build a specific package version.
As of August 2022, Crates.io lists 89013 package names for a total of 588215 released versions.
Origins retrieving strategy#
A JSON HTTP API to list packages from crates.io exists, but we chose a different strategy in order to keep the number of HTTP calls and the bandwidth used to a bare minimum.
We download a db-dump.tar.gz archive, which contains CSV files exported from the crates.io database. The crates.csv file lists package names, and the versions.csv file lists the versions related to each package name. It takes a few seconds to download the archive and parse the CSV files to build a full index of existing packages and their related versions.
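The indexing step can be sketched as follows. This is a minimal illustration, not the lister's actual code: the function name build_index is hypothetical, and the column names (id and name in crates.csv, crate_id and num in versions.csv) are assumptions about the dump layout. The archive is passed in as bytes so the sketch stays independent of how it was downloaded.

```python
import csv
import io
import tarfile
from collections import defaultdict
from typing import Dict, List


def build_index(archive_bytes: bytes) -> Dict[str, List[str]]:
    """Map each package name to the list of its version numbers.

    Assumed columns (not confirmed by the text above):
    crates.csv -> id, name; versions.csv -> crate_id, num.
    """
    names_by_id: Dict[str, str] = {}
    versions_by_id: Dict[str, List[str]] = defaultdict(list)

    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            # The CSV files may sit in a dated subdirectory, so match on basename.
            basename = member.name.rsplit("/", 1)[-1]
            if basename not in ("crates.csv", "versions.csv"):
                continue
            extracted = tar.extractfile(member)
            text = io.TextIOWrapper(extracted, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                if basename == "crates.csv":
                    names_by_id[row["id"]] = row["name"]
                else:
                    versions_by_id[row["crate_id"]].append(row["num"])

    # Join the two tables on the crate id.
    return {
        name: versions_by_id.get(crate_id, [])
        for crate_id, name in names_by_id.items()
    }
```

Collecting both tables first and joining afterwards means the result does not depend on the order of the files inside the archive.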
The archive also contains a metadata.json file with a timestamp corresponding to the date the database dump started. The database dump is automatically generated every 24 hours, around 02:00:00 UTC.
The lister is incremental: the first time, it downloads the db-dump.tar.gz archive as previously described and stores the last seen database dump timestamp. The next time, it downloads db-dump.tar.gz again but retrieves only the packages that are new or changed since the last seen timestamp, with all of their related versions.
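The incremental selection can be illustrated with a short sketch. The helper names parse_timestamp and select_changed are hypothetical, as is the updated_at key on each crate record; the real lister keeps its state in CratesListerState and may compare timestamps differently.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def select_changed(
    crates: Iterable[Dict[str, str]], last_seen: Optional[datetime]
) -> List[Dict[str, str]]:
    """Return the crates to list on this run.

    With no stored timestamp (first run) every crate is returned; otherwise
    only crates updated strictly after the last seen dump timestamp.
    """
    if last_seen is None:
        return list(crates)
    return [
        crate
        for crate in crates
        if parse_timestamp(crate["updated_at"]) > last_seen
    ]
```

After a successful run, the timestamp read from metadata.json would replace last_seen, so that the next run starts from that point.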
Page listing#
Each page is related to one package. Each line of a page corresponds to one version of this package.
The data schema for each line is:
name: Package name
version: Package version
crate_file: Package download URL
checksum: Package download checksum
yanked: Whether the package is yanked or not
last_update: ISO 8601 timestamp of the last update
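As an illustration of this schema, one line of a page could be built as below. The PageEntry type and the make_entry helper are hypothetical names, and the download URL pattern is an assumption modelled on the shape of crate file URLs on static.crates.io; the lister's own CRATE_FILE_URL_PATTERN attribute is the authoritative value.

```python
from typing import TypedDict

# Assumed pattern for illustration; see CratesLister.CRATE_FILE_URL_PATTERN.
CRATE_FILE_URL = "https://static.crates.io/crates/{crate}/{crate}-{version}.crate"


class PageEntry(TypedDict):
    name: str
    version: str
    crate_file: str
    checksum: str
    yanked: bool
    last_update: str


def make_entry(
    name: str, version: str, checksum: str, yanked: bool, last_update: str
) -> PageEntry:
    """Build one page line, deriving the download URL from name and version."""
    return PageEntry(
        name=name,
        version=version,
        crate_file=CRATE_FILE_URL.format(crate=name, version=version),
        checksum=checksum,
        yanked=yanked,
        last_update=last_update,
    )
```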
Origins from page#
The lister yields one origin per page. The origin URL corresponds to the HTTP API URL for a package, for example “https://crates.io/crates/{crate}”.
Additionally, we add some data for each version, stored in “extra_loader_arguments”:
- artifacts: Describes the crate files to download, one entry per version.
- crates_metadata: Stores all other interesting attributes that do not belong to artifacts. For now, it mainly indicates whether a version is yanked and gives the version's last_update timestamp.
Origin data example:
{
    "url": "https://crates.io/api/v1/crates/regex-syntax",
    "artifacts": [
        {
            "version": "0.1.0",
            "checksums": {
                "sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944",
            },
            "filename": "regex-syntax-0.1.0.crate",
            "url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate",
        },
    ],
    "crates_metadata": [
        {
            "version": "0.1.0",
            "last_update": "2017-11-30 03:37:17.449539",
            "yanked": False,
        },
    ],
}
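A structure of this shape can be assembled from a page in a few lines. The sketch below is not the lister's get_origins_from_page() implementation: origin_from_page is a hypothetical helper, page entries are assumed to be plain dicts using the schema from the "Page listing" section, and the origin URL pattern is a parameter whose default is taken from the prose of this section.

```python
from typing import Any, Dict, List


def origin_from_page(
    page: List[Dict[str, Any]],
    url_pattern: str = "https://crates.io/crates/{crate}",
) -> Dict[str, Any]:
    """Turn the version entries of one package into a single origin dict."""
    if not page:
        raise ValueError("a page needs at least one version entry")

    artifacts = []
    crates_metadata = []
    for entry in page:
        artifacts.append(
            {
                "version": entry["version"],
                "checksums": {"sha256": entry["checksum"]},
                # The file name is the last path segment of the download URL.
                "filename": entry["crate_file"].rsplit("/", 1)[-1],
                "url": entry["crate_file"],
            }
        )
        crates_metadata.append(
            {
                "version": entry["version"],
                "last_update": entry["last_update"],
                "yanked": entry["yanked"],
            }
        )

    return {
        # All entries of a page share the same package name.
        "url": url_pattern.format(crate=page[0]["name"]),
        "artifacts": artifacts,
        "crates_metadata": crates_metadata,
    }
```

Keeping artifacts and crates_metadata as parallel lists keyed by version mirrors the example: download details stay separate from the descriptive attributes.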
Running tests#
Activate the virtualenv and run from within the swh-lister directory:
pytest -s -vv --log-cli-level=DEBUG swh/lister/crates/tests
Testing with Docker#
Change directory to swh/docker then launch the docker environment:
docker compose up -d
Then schedule a crates listing task:
docker compose exec swh-scheduler swh scheduler task add -p oneshot list-crates
You can follow the lister execution by displaying the logs of the swh-lister service:
docker compose logs -f swh-lister