Source code for swh.lister.crates

# Copyright (C) 2022 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

"""
Crates lister
=============

The Crates lister list origins from `Crates.io`_, the Rust community’s crate registry.

Origins are `packages`_ for the `Rust language`_ ecosystem.
Package follow a `layout specifications`_ to be usable with the `Cargo`_ package manager
and have a `Cargo.toml`_ file manifest which consists in metadata to describe and build
a specific package version.

As of August 2022 `Crates.io`_ list 89013 packages name for a total of 588215 released
versions.

Origins retrieving strategy
---------------------------

A json http api to list packages from crates.io exists but we choose a
`different strategy`_ in order to reduce to its bare minimum the amount
of http call and bandwidth.

We download a `db-dump.tar.gz`_ archives which contains csv files as an export of
the crates.io database. Crates.csv list package names, versions.csv list versions
related to package names.
It takes a few seconds to download the archive and parse csv files to build a
full index of existing package and related versions.

The archive also contains a metadata.json file with a timestamp corresponding to
the date the database dump started. The database dump is automatically generated
every 24 hours, around 02:00:00 UTC.

The lister is incremental, so the first time it downloads the db-dump.tar.gz archive as
previously described and store the last seen database dump timestamp.
Next time, it downloads the db-dump.tar.gz but retrieves only the list of new and
changed packages since last seen timestamp with all of their related versions.

Page listing
------------

Each page is related to one package.
Each line of a page corresponds to different versions of this package.

The data schema for each line is:

* **name**: Package name
* **version**: Package version
* **crate_file**: Package download url
* **checksum**: Package download checksum
* **yanked**: Whether the package is yanked or not
* **last_update**: Iso8601 last update

Origins from page
-----------------

The lister yields one origin per page.
The origin url corresponds to the http api url for a package, for example
"https://crates.io/crates/{crate}".

Additionally we add some data for each version, set to "extra_loader_arguments":

* **artifacts**: Represent data about the Crates to download, following
    :ref:`original-artifacts-json specification <extrinsic-metadata-original-artifacts-json>`
* **crates_metadata**: To store all other interesting attributes that do not belongs
    to artifacts. For now it mainly indicate when a version is `yanked`_, and the version
    last_update timestamp.

Origin data example::

    {
        "url": "https://crates.io/api/v1/crates/regex-syntax",
        "artifacts": [
            {
                "version": "0.1.0",
                "checksums": {
                    "sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944",  # noqa: B950
                },
                "filename": "regex-syntax-0.1.0.crate",
                "url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate",  # noqa: B950
            },
        ],
        "crates_metadata": [
            {
                "version": "0.1.0",
                "last_update": "2017-11-30 03:37:17.449539",
                "yanked": False,
            },
        ],
    },

Running tests
-------------

Activate the virtualenv and run from within swh-lister directory:

   pytest -s -vv --log-cli-level=DEBUG swh/lister/crates/tests

Testing with Docker
-------------------

Change directory to swh/docker then launch the docker environment:

   docker compose up -d

Then schedule a crates listing task::

   docker compose exec swh-scheduler swh scheduler task add -p oneshot list-crates

You can follow lister execution by displaying logs of swh-lister service::

   docker compose logs -f swh-lister

.. _Crates.io: https://crates.io
.. _packages: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html
.. _Rust language: https://www.rust-lang.org/
.. _layout specifications: https://doc.rust-lang.org/cargo/guide/project-layout.html
.. _Cargo: https://doc.rust-lang.org/cargo/guide/why-cargo-exists.html#enter-cargo
.. _Cargo.toml: https://doc.rust-lang.org/cargo/reference/manifest.html
.. _different strategy: https://crates.io/data-access
.. _yanked: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-yank
.. _db-dump.tar.gz: https://static.crates.io/db-dump.tar.gz
"""


[docs] def register(): from .lister import CratesLister return { "lister": CratesLister, "task_modules": ["%s.tasks" % __name__], }