swh.loader.package.loader module

swh.loader.package.loader.SWH_METADATA_AUTHORITY = MetadataAuthority(type=<MetadataAuthorityType.REGISTRY: 'registry'>, url='https://softwareheritage.org/', metadata=ImmutableDict({}))

Metadata authority for extrinsic metadata generated by Software Heritage. Used for metadata on “original artifacts”, i.e. length, filename, and checksums of downloaded archive files.


The extid_type and extid fields of an ExtID object.

alias of Tuple[str, bytes]

class swh.loader.package.loader.RawExtrinsicMetadataCore(format: str, metadata: bytes, discovery_date: Optional[datetime.datetime] = None)[source]

Bases: object

Contains the core of the metadata extracted by a loader, that will be used to build a full RawExtrinsicMetadata object by adding object identifier, context, and provenance information.

Method generated by attrs for class RawExtrinsicMetadataCore.


Defaults to the visit date.
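As a rough illustration of how a metadata core might be completed into a full object, here is a minimal sketch using only the standard library. The classes `MetadataCore` and `FullMetadata`, the helper `complete`, and the sample format string, identifier, and URLs are hypothetical stand-ins, not the swh.model API; only the three core fields and the "defaults to the visit date" rule come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MetadataCore:
    """Hypothetical stand-in for RawExtrinsicMetadataCore."""

    format: str
    metadata: bytes
    discovery_date: Optional[datetime] = None


@dataclass(frozen=True)
class FullMetadata:
    """Hypothetical stand-in for a full RawExtrinsicMetadata object."""

    target: str  # identifier of the object the metadata describes
    authority: str  # provenance: where the metadata comes from
    fetcher: str  # provenance: which tool retrieved it
    discovery_date: datetime
    format: str
    metadata: bytes


def complete(
    core: MetadataCore,
    target: str,
    authority: str,
    fetcher: str,
    visit_date: datetime,
) -> FullMetadata:
    # Fall back to the visit date when the loader recorded no discovery date.
    if core.discovery_date is not None:
        date = core.discovery_date
    else:
        date = visit_date
    return FullMetadata(target, authority, fetcher, date, core.format, core.metadata)


visit_date = datetime(2021, 1, 1, tzinfo=timezone.utc)
core = MetadataCore(format="example-json", metadata=b'{"name": "example"}')
full = complete(
    core,
    target="swh:1:dir:" + "0" * 40,  # made-up identifier
    authority="https://example.org/",
    fetcher="example-loader",
    visit_date=visit_date,
)
```

The core stays small and loader-specific, while identifier and provenance are added later by code that knows which object was loaded.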

class swh.loader.package.loader.BasePackageInfo(url: str, filename: Optional[str], *, directory_extrinsic_metadata: List[swh.loader.package.loader.RawExtrinsicMetadataCore] = [])[source]

Bases: object

Compute the primary key for a dict using the id_keys as primary key


  • d – A dict entry to compute the primary key on

  • id_keys – Sequence of keys to use as primary key


The identity for that dict entry

Method generated by attrs for class BasePackageInfo.

MANIFEST_FORMAT: Optional[string.Template] = None

If not None, used by the default extid() implementation to format a manifest, before hashing it to produce an ExtID.

EXTID_TYPE: str = 'package-manifest-sha256'

extrinsic metadata collected by the loader, that will be attached to the loaded directory and added to the Metadata storage.

extid() Optional[Tuple[str, bytes]][source]

Returns a unique intrinsic identifier of this package info, or None if this package info is not ‘deduplicatable’ (meaning that we will always load it, instead of checking the ExtID storage to see if we already did)
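The interplay between MANIFEST_FORMAT, EXTID_TYPE and extid() can be sketched as follows. This is an assumption-laden approximation built on standard-library dataclasses rather than attrs: the class names `SketchPackageInfo` and `TarballInfo` and the `"$url $filename"` template are invented for illustration, and the real default implementation may differ in detail.

```python
import hashlib
import string
from dataclasses import asdict, dataclass
from typing import ClassVar, Optional, Tuple


@dataclass
class SketchPackageInfo:
    """Hypothetical stand-in for BasePackageInfo."""

    url: str
    filename: Optional[str]

    MANIFEST_FORMAT: ClassVar[Optional[string.Template]] = None
    EXTID_TYPE: ClassVar[str] = "package-manifest-sha256"

    def extid(self) -> Optional[Tuple[str, bytes]]:
        # Without a manifest format the package is not deduplicatable.
        if self.MANIFEST_FORMAT is None:
            return None
        # Fill the template from the instance fields, then hash the result.
        manifest = self.MANIFEST_FORMAT.substitute(
            {key: str(value) for key, value in asdict(self).items()}
        )
        return (self.EXTID_TYPE, hashlib.sha256(manifest.encode()).digest())


@dataclass
class TarballInfo(SketchPackageInfo):
    # Illustrative template; a real loader would pick fields that identify
    # the artifact content as precisely as possible.
    MANIFEST_FORMAT: ClassVar[Optional[string.Template]] = string.Template(
        "$url $filename"
    )


base = SketchPackageInfo("https://example.org/pkg-1.0.tar.gz", "pkg-1.0.tar.gz")
tarball = TarballInfo("https://example.org/pkg-1.0.tar.gz", "pkg-1.0.tar.gz")
tarball_extid = tarball.extid()
```

Two package infos with the same field values produce the same identifier, which is what lets a loader recognise an artifact it has already processed.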

class swh.loader.package.loader.PackageLoader(storage: swh.storage.interface.StorageInterface, url: str, max_content_size: Optional[int] = None)[source]

Bases: swh.loader.core.loader.BaseLoader, Generic[swh.loader.package.loader.TPackageInfo]

Loader’s constructor. This raises an exception if the minimal required configuration is missing (cf. the check method).

  • storage – Storage instance

  • url – Origin url to load data from

visit_type: Optional[str] = ''
visit_date: datetime.datetime
get_versions() Sequence[str][source]

Return the list of all published package versions.


Raises swh.loader.exception.NotFound when failing to read the published package versions.


Sequence of published versions

get_package_info(version: str) Iterator[Tuple[str, swh.loader.package.loader.TPackageInfo]][source]
Given a release version of a package, retrieve the associated package information for that version.


version – Package version


(branch name, package metadata)
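To make the contract of get_versions and get_package_info more concrete, here is a minimal toy sketch. `ToyLoader` is hypothetical and does not subclass the real PackageLoader; the in-memory release index, the `releases/<version>` branch naming, and the plain dict used as package info are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Sequence, Tuple


class ToyLoader:
    """Hypothetical loader over an in-memory release index."""

    def __init__(self, releases: Dict[str, str]) -> None:
        # Maps a version string to the URL of its single artifact.
        self.releases = releases

    def get_versions(self) -> Sequence[str]:
        # All published versions, in a stable order.
        return sorted(self.releases)

    def get_package_info(self, version: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        # Yield one (branch name, package info) pair per artifact of the version.
        url = self.releases[version]
        filename = url.rsplit("/", 1)[-1]
        yield f"releases/{version}", {"url": url, "filename": filename}


loader = ToyLoader(
    {
        "1.0": "https://example.org/pkg-1.0.tar.gz",
        "1.1": "https://example.org/pkg-1.1.tar.gz",
    }
)
versions = loader.get_versions()
first_infos = list(loader.get_package_info("1.0"))
```

Returning an iterator leaves room for sources that publish several artifacts for a single version.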

build_revision(p_info: swh.loader.package.loader.TPackageInfo, uncompressed_path: str, directory: bytes) Optional[swh.model.model.Revision][source]

Build the revision from the archive metadata (extrinsic artifact metadata) and the intrinsic metadata.

  • p_info – Package information

  • uncompressed_path – Artifact uncompressed path on disk


Revision object

get_default_version() str[source]

Retrieve the latest release version if any.


Latest version

last_snapshot() Optional[swh.model.model.Snapshot][source]

Retrieve the last snapshot out of the last visit.

known_artifacts(snapshot: Optional[swh.model.model.Snapshot]) Dict[bytes, Optional[swh.model.collections.ImmutableDict[str, object]]][source]

Retrieve the known releases/artifact for the origin.


snapshot: snapshot for the visit


Dict mapping each revision id (bytes) to its metadata dict.

new_packageinfo_to_extid(p_info: swh.loader.package.loader.TPackageInfo) Optional[Tuple[str, bytes]][source]
resolve_revision_from_extids(known_extids: Dict[Tuple[str, bytes], List[swh.model.swhids.CoreSWHID]], p_info: swh.loader.package.loader.TPackageInfo, revision_whitelist: Set[bytes]) Optional[bytes][source]

Resolve the revision from known ExtIDs and a package info object.

If the artifact has already been downloaded, this will return the existing revision targeting that uncompressed artifact directory. Otherwise, this returns None.

  • known_extids – Dict built from a list of ExtID, with the target as value

  • p_info – Package information

  • revision_whitelist – Any ExtID with target not in this set is filtered out


None or revision identifier
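The lookup described above might be sketched like this. It is a simplification under stated assumptions: targets are plain revision ids (bytes) instead of CoreSWHID objects, the new ExtID is passed in directly rather than derived from p_info, and the function name `resolve_revision` is made up.

```python
from typing import Dict, List, Optional, Set, Tuple

ExtID = Tuple[str, bytes]


def resolve_revision(
    known_extids: Dict[ExtID, List[bytes]],
    new_extid: Optional[ExtID],
    revision_whitelist: Set[bytes],
) -> Optional[bytes]:
    # A package without an ExtID is never deduplicated.
    if new_extid is None:
        return None
    for target in known_extids.get(new_extid, []):
        # Ignore targets that are not in the whitelist.
        if target in revision_whitelist:
            return target
    return None


known = {("package-manifest-sha256", b"\x01" * 32): [b"rev-old", b"rev-new"]}
hit = resolve_revision(known, ("package-manifest-sha256", b"\x01" * 32), {b"rev-new"})
miss = resolve_revision(known, ("package-manifest-sha256", b"\x02" * 32), {b"rev-new"})
```

Here `hit` is the whitelisted revision id and `miss` is None, so only the second package would need to be downloaded.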

download_package(p_info: swh.loader.package.loader.TPackageInfo, tmpdir: str) List[Tuple[str, Mapping]][source]

Download artifacts for a specific package. All downloads happen in the tmpdir folder.

Default implementation expects the artifacts package info to be about one artifact per package.

Note that most implementations have one artifact per package, but some have multiple artifacts per package (Debian), and some have none because the package itself is the artifact (GNU).

  • p_info – Information on the package artifacts to download (url, filename, etc…)

  • tmpdir – Location to retrieve such artifacts


List of (path, computed hashes)

uncompress(dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str) str[source]

Uncompress the artifact(s) in the destination folder dest.

Some loaders may additionally need the p_info object for extra information (Debian).
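A generic extraction step could look roughly like the sketch below, which relies on `shutil.unpack_archive` from the standard library. The function name `uncompress_sketch`, the `src` sub-folder, and the demo tarball are assumptions for illustration; the actual loader may use a different extraction helper.

```python
import os
import shutil
import tarfile
import tempfile
from typing import Any, List, Mapping, Tuple


def uncompress_sketch(dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str) -> str:
    # Extract every downloaded artifact into one folder under dest.
    uncompressed_path = os.path.join(dest, "src")
    os.makedirs(uncompressed_path, exist_ok=True)
    for path, _hashes in dl_artifacts:
        # The archive format is inferred from the file extension.
        shutil.unpack_archive(path, extract_dir=uncompressed_path)
    return uncompressed_path


with tempfile.TemporaryDirectory() as tmpdir:
    # Build a tiny tarball to stand in for a downloaded artifact.
    source = os.path.join(tmpdir, "hello.txt")
    with open(source, "w") as f:
        f.write("hello\n")
    archive = os.path.join(tmpdir, "pkg-1.0.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="pkg-1.0/hello.txt")

    extracted = uncompress_sketch([(archive, {"length": os.path.getsize(archive)})], tmpdir)
    extracted_files = sorted(os.listdir(os.path.join(extracted, "pkg-1.0")))
```

Formats that `shutil` cannot handle, such as Debian source packages, are the case where a loader would override this step.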

extra_branches() Dict[bytes, Mapping[str, Any]][source]

Return an extra dict of branches that are used to update the set of branches.

finalize_visit(*, snapshot: Optional[swh.model.model.Snapshot], visit: swh.model.model.OriginVisit, status_visit: str, status_load: str, failed_branches: List[str], errors: Optional[List[str]] = None) Dict[str, Any][source]

Finalize the visit:

  • flush any data not yet written to storage

  • update origin visit’s status

  • return the task’s status

load() Dict[source]

Load the contents associated with a specific origin.

  1. Get the list of versions in an origin.

  2. Get the snapshot from the previous run of the loader, and filter out versions that were already loaded, if their extids match.

Then, for each remaining version in the origin:

  3. Fetch the files for one package version. By default, this can be implemented as a simple HTTP request. Loaders with more specific requirements can override this, e.g. the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.

  4. Extract the downloaded files. By default, this would be a universal archive/tarball extraction.

    Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).

  5. Convert the extracted directory to a set of Software Heritage objects, using swh.model.from_disk.

  6. Extract the metadata from the unpacked directories. This would only be applicable for “smart” loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control).

    On “minimal-metadata” sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.

  7. Generate the revision/release objects for the given version, from the data generated at steps 5 and 6.

end for each

  8. Generate and load the snapshot for the visit. Using the revisions/releases collected at step 7 and the branch information from step 2, generate a snapshot and load it into the Software Heritage archive.
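The overall control flow described above (list versions, skip those already loaded, process the rest, then assemble a snapshot) can be sketched with plain callables. Everything here is hypothetical: `load_sketch`, its parameters, the `releases/<version>` branch names, and the dict standing in for a snapshot are illustrative only, and the fetch, extract, convert and build steps are collapsed into a single `process` callback.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

ExtID = Tuple[str, bytes]


def load_sketch(
    versions: Sequence[str],
    extid_of: Callable[[str], Optional[ExtID]],
    already_loaded: Dict[ExtID, bytes],
    process: Callable[[str], bytes],
) -> Dict[str, bytes]:
    branches: Dict[str, bytes] = {}
    for version in versions:
        extid = extid_of(version)
        if extid is not None and extid in already_loaded:
            # Already loaded in a previous visit: reuse the known revision.
            branches[f"releases/{version}"] = already_loaded[extid]
            continue
        # Fetch, extract, convert and build a revision for a new version.
        branches[f"releases/{version}"] = process(version)
    # The branch mapping stands in for the snapshot of the visit.
    return branches


processed: List[str] = []


def fake_process(version: str) -> bytes:
    processed.append(version)
    return f"rev-{version}".encode()


snapshot = load_sketch(
    versions=["1.0", "1.1"],
    extid_of=lambda v: ("toy-extid", v.encode()),
    already_loaded={("toy-extid", b"1.0"): b"rev-1.0"},
    process=fake_process,
)
```

In this run only version 1.1 goes through `fake_process`, while the snapshot still lists both versions.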

get_loader_name() str[source]

Returns a fully qualified name of this loader.

get_loader_version() str[source]

Returns the version of the current loader.

get_metadata_fetcher() swh.model.model.MetadataFetcher[source]

Returns a MetadataFetcher instance representing this package loader, which is used for adding provenance information to extracted extrinsic metadata, if any.

get_metadata_authority() swh.model.model.MetadataAuthority[source]

For package loaders that get extrinsic metadata, returns the authority the metadata are coming from.

get_extrinsic_origin_metadata() List[swh.loader.package.loader.RawExtrinsicMetadataCore][source]

Returns metadata items, used by build_extrinsic_origin_metadata.

build_extrinsic_origin_metadata() List[swh.model.model.RawExtrinsicMetadata][source]

Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_origin_metadata.

get_extrinsic_snapshot_metadata() List[swh.loader.package.loader.RawExtrinsicMetadataCore][source]

Returns metadata items, used by build_extrinsic_snapshot_metadata.

build_extrinsic_snapshot_metadata(snapshot_id: bytes) List[swh.model.model.RawExtrinsicMetadata][source]

Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_snapshot_metadata.

build_extrinsic_directory_metadata(p_info: swh.loader.package.loader.TPackageInfo, revision_id: bytes, directory_id: bytes) List[swh.model.model.RawExtrinsicMetadata][source]