swh.lister.nixguix.lister module

NixGuix lister definition.

This lists artifacts out of Guix or Nixpkgs manifests.

Artifacts can be of types:

  • upstream git repository (NixOS/nixpkgs, Guix)

  • VCS repositories (svn, git, hg, …)

  • unique file

  • unique tarball

swh.lister.nixguix.lister.DEFAULT_EXTENSIONS_TO_IGNORE = ['.AppImage', '.bin', '.exe', '.iso', '.linux64', '.msi', '.png', '.dic', '.deb', '.rpm', '.nupkg']

By default, ignore binary files and archives containing binaries.

class swh.lister.nixguix.lister.ChecksumLayout(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

The possible checksum layouts for artifacts listed out of the manifest.

STANDARD = 'standard'

Standard “flat” checksums (e.g. sha1, sha256, …) on the tarball or file.

NAR = 'nar'

The checksum(s) are computed over the NAR dump of the output (e.g. an uncompressed directory). That uncompressed directory can come from a tarball or a (d)vcs. This mode is also called “recursive” in the “outputHashMode” key of the upstream dataset.

swh.lister.nixguix.lister.MAPPING_CHECKSUM_LAYOUT = {'flat': ChecksumLayout.STANDARD, 'recursive': ChecksumLayout.NAR}

Mapping between the outputHashMode from the manifest and how to compute checksums.
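As a rough illustration of how such a mapping might be consumed, the sketch below re-declares a local stand-in for the enum and the mapping (so it runs without swh.lister installed) and resolves an outputHashMode value. The helper name `layout_for` and the choice to return None for unknown modes are assumptions, not the lister's documented behaviour.

```python
from enum import Enum
from typing import Optional


class ChecksumLayout(Enum):
    """Local stand-in mirroring the documented enum values."""

    STANDARD = "standard"
    NAR = "nar"


# Same content as the documented MAPPING_CHECKSUM_LAYOUT.
MAPPING_CHECKSUM_LAYOUT = {
    "flat": ChecksumLayout.STANDARD,
    "recursive": ChecksumLayout.NAR,
}


def layout_for(output_hash_mode: str) -> Optional[ChecksumLayout]:
    """Hypothetical helper: layout for a manifest entry, None if the mode is unknown."""
    return MAPPING_CHECKSUM_LAYOUT.get(output_hash_mode)
```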

class swh.lister.nixguix.lister.Artifact(origin: str, visit_type: str, fallback_urls: List[str], checksums: Dict[str, str], checksum_layout: ChecksumLayout, ref: str | None, submodules: bool, svn_paths: List[str] | None, extrinsic_metadata: Dict[str, Any], last_update: datetime | None)

Bases: object

Metadata information on Remote Artifact with url (tarball or file).

origin: str

Canonical url to retrieve the tarball artifact.

visit_type: str

Either ‘tar’ or ‘file’

fallback_urls: List[str]

List of urls to retrieve the tarball artifact if the canonical url no longer works.

checksums: Dict[str, str]

Integrity hash converted into a checksum dict.

checksum_layout: ChecksumLayout

Checksum layout mode to provide to loaders (e.g. nar, standard, …)

ref: str | None

Optional reference on the artifact (git commit, branch, svn commit, tag, …)

submodules: bool

Indicates if submodules should be retrieved for a git-checkout visit type

svn_paths: List[str] | None

Optional list of paths for the svn-export loader; only those paths will be exported and loaded into the archive

extrinsic_metadata: Dict[str, Any]

Extrinsic metadata for the artifact as found in the JSON file consumed by the lister describing more precisely what is archived

last_update: datetime | None

Optional last update date for the artifact

class swh.lister.nixguix.lister.VCS(origin: str, type: str)

Bases: object

Metadata information on VCS.

origin: str

Origin url of the vcs

type: str

Type of (d)vcs, e.g. svn, git, hg, …

class swh.lister.nixguix.lister.ArtifactType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

The possible artifact types listed out of the manifest.

ARTIFACT = 'artifact'

VCS = 'vcs'

swh.lister.nixguix.lister.VCS_ARTIFACT_TYPE_TO_VISIT_TYPE = {'git': 'git-checkout', 'hg': 'hg-checkout', 'svn': 'svn-export'}

Mapping from the vcs artifact type to the loader’s visit type.

class swh.lister.nixguix.lister.NixGuixLister(scheduler, url: str, origin_upstream: str, instance: str | None = None, credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, max_origins_per_page: int | None = None, max_pages: int | None = None, enable_origins: bool = True, canonicalize: bool = True, extensions_to_ignore: List[str] = [], nixos_cache_url: str = 'https://cache.nixos.org', **kwargs: Any)

Bases: StatelessLister[Tuple[ArtifactType, Artifact | VCS]]

List Guix or Nix sources out of a public json manifest.

This lister can output:

  • unique tarball URLs (.tar.gz, .tbz2, …)

  • VCS repositories (e.g. git, hg, svn)

  • unique file URLs (.lisp, .py, …)

In the case of VCS repositories, if a reference is provided (git_ref, svn_revision or hg_changeset with a specific outputHashMode set to recursive), this provides one more origin to ingest as a directory. The swh.loader.git.directory.GitCheckoutLoader, swh.loader.mercurial.directory.HgCheckoutLoader and swh.loader.svn.directory.SvnExportLoader classes will then be in charge of ingesting the origin as a directory (checking the associated integrity field first).
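The rule described in this paragraph can be sketched as follows. The function `directory_visit_type` and its parameters are hypothetical names introduced for illustration; only the mapping content and the "reference plus recursive outputHashMode" condition come from this page.

```python
from typing import Optional

# Same content as the documented VCS_ARTIFACT_TYPE_TO_VISIT_TYPE mapping.
VCS_ARTIFACT_TYPE_TO_VISIT_TYPE = {
    "git": "git-checkout",
    "hg": "hg-checkout",
    "svn": "svn-export",
}


def directory_visit_type(
    vcs_type: str, ref: Optional[str], output_hash_mode: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: visit type of the extra directory origin, if any.

    Returns None when no reference is given or the hash mode is not
    "recursive", i.e. when no additional directory origin is expected.
    """
    if ref is None or output_hash_mode != "recursive":
        return None
    return VCS_ARTIFACT_TYPE_TO_VISIT_TYPE.get(vcs_type)
```

For instance, a git entry with a ref and a recursive hash mode would map to the git-checkout visit type, while the same entry without a ref would yield no extra origin.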

Note that no last_update is available in the Guix manifest, so listed origins do not have it set.

For URL type artifacts, this tries to determine the artifact’s nature, tarball or file. It first tries to infer it from the URL extension. If there is no extension, it falls back to sending a HEAD request to the URL to retrieve the origin out of the Location response header, and then checks the extension again. As a last resort, a few bytes are downloaded from the artifact URL to detect its nature from its mime type. The swh.loader.core.loader.ContentLoader and swh.loader.core.loader.TarballDirectoryLoader classes will then be in charge of ingesting the origin (checking the associated integrity field first).
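A minimal sketch of the first step only (guessing from the URL extension) is shown below. The suffix set and the function name `guess_nature_from_url` are assumptions made for illustration; the lister's actual extension list and logic may differ, and the HEAD request and mime-type fallbacks are deliberately left out so the example runs offline.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed, non-exhaustive set of archive suffixes (not taken from the lister).
TARBALL_SUFFIXES = {".tar", ".gz", ".tgz", ".bz2", ".tbz2", ".xz", ".zip"}


def guess_nature_from_url(url: str) -> Optional[str]:
    """Return 'tar' or 'file' based on the last URL path suffix.

    Returns None when the path has no extension, which is the case where
    a caller would move on to the HEAD request fallback.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if not suffix:
        return None
    return "tar" if suffix in TARBALL_SUFFIXES else "file"
```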

Optionally, when the extensions_to_ignore parameter is provided, the default extensions to ignore (DEFAULT_EXTENSIONS_TO_IGNORE) are extended with those passed. This can be used to filter out additional binary files detected in the wild.
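The extension filtering could look roughly like the sketch below. The helpers `effective_extensions` and `is_ignored` are hypothetical, and matching on a case-insensitive suffix of the URL path is an assumption rather than the lister's confirmed behaviour.

```python
from typing import Iterable, List
from urllib.parse import urlparse

# Same content as the documented DEFAULT_EXTENSIONS_TO_IGNORE.
DEFAULT_EXTENSIONS_TO_IGNORE = [
    ".AppImage", ".bin", ".exe", ".iso", ".linux64", ".msi",
    ".png", ".dic", ".deb", ".rpm", ".nupkg",
]


def effective_extensions(extensions_to_ignore: Iterable[str] = ()) -> List[str]:
    """Hypothetical helper: defaults extended with user-provided extensions."""
    merged = list(DEFAULT_EXTENSIONS_TO_IGNORE)
    for ext in extensions_to_ignore:
        if ext not in merged:
            merged.append(ext)
    return merged


def is_ignored(url: str, extensions: Iterable[str]) -> bool:
    """Assumed rule: ignore a URL whose path ends with one of the extensions."""
    path = urlparse(url).path.lower()
    return path.endswith(tuple(ext.lower() for ext in extensions))
```

With the defaults only, a `.jar` URL is kept; passing `[".jar"]` as extra extensions makes the same URL ignored.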

LISTER_NAME: str = 'nixguix'

build_artifact(artifact_url: str, artifact_type: str) → Tuple[ArtifactType, VCS] | None

Build a canonicalized vcs artifact when possible.

convert_integrity_to_checksums(integrity: str, failure_log: str) → Dict[str, str] | None

Determine the content checksum stored in the integrity field and convert it into a dict of checksums. This only parses the hash-expression (hash-<b64-encoded-checksum>) as defined in https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute
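To make the hash-expression format concrete, here is a minimal standalone sketch of such a conversion. It is not the lister's implementation: the function name `integrity_to_checksums`, the hex encoding of the resulting digest, and returning None on malformed input are all assumptions.

```python
import base64
import binascii
import hashlib
from typing import Dict, Optional


def integrity_to_checksums(integrity: str) -> Optional[Dict[str, str]]:
    """Parse '<algo>-<base64 digest>' into {algo: hex digest} (assumed output shape)."""
    algo, sep, b64_digest = integrity.partition("-")
    if not sep or not algo or not b64_digest:
        return None
    try:
        digest = base64.b64decode(b64_digest, validate=True)
    except (binascii.Error, ValueError):
        # Malformed base64 payload: report failure instead of raising.
        return None
    return {algo: digest.hex()}


# Example input built from the sha256 of empty content.
EMPTY_SHA256_INTEGRITY = "sha256-" + base64.b64encode(
    hashlib.sha256(b"").digest()
).decode("ascii")
```

Calling `integrity_to_checksums(EMPTY_SHA256_INTEGRITY)` yields a one-entry dict keyed by `sha256` whose value equals `hashlib.sha256(b"").hexdigest()`.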

get_pages() → Iterator[Tuple[ArtifactType, Artifact | VCS]]

Yield one page per “typed” origin referenced in manifest.

vcs_to_listed_origin(artifact: VCS) → Iterator[ListedOrigin]

Given a vcs repository, yield a ListedOrigin.

artifact_to_listed_origin(artifact: Artifact) → Iterator[ListedOrigin]

Given an artifact (tarball, file), yield one ListedOrigin.

get_origins_from_page(artifact_tuple: Tuple[ArtifactType, Artifact | VCS]) → Iterator[ListedOrigin]

Given an artifact tuple (type, artifact), yield a ListedOrigin.
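The dispatch on the tuple's first element might be sketched as below. The classes are trimmed local stand-ins (the real Artifact has more fields), the function `origins_from_page` is hypothetical, and it yields plain (url, visit_type) tuples instead of ListedOrigin objects so that it runs without the swh packages installed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Tuple, Union


class ArtifactType(Enum):
    """Local stand-in mirroring the documented enum."""

    ARTIFACT = "artifact"
    VCS = "vcs"


@dataclass
class VCS:
    """Stand-in with the two documented fields."""

    origin: str
    type: str


@dataclass
class Artifact:
    """Trimmed stand-in: only the fields needed for this illustration."""

    origin: str
    visit_type: str


def origins_from_page(
    page: Tuple[ArtifactType, Union[Artifact, VCS]]
) -> Iterator[Tuple[str, str]]:
    """Yield one (url, visit_type) pair per page, depending on the artifact type."""
    artifact_type, artifact = page
    if artifact_type is ArtifactType.VCS:
        # Assumption: the VCS type is used directly as the visit type.
        yield (artifact.origin, artifact.type)
    else:
        yield (artifact.origin, artifact.visit_type)
```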