swh.loader.package.loader module
- swh.loader.package.loader.SWH_METADATA_AUTHORITY = MetadataAuthority(type=MetadataAuthorityType.REGISTRY, url='https://softwareheritage.org/', metadata=ImmutableDict({}))
Metadata authority for extrinsic metadata generated by Software Heritage. Used for metadata on “original artifacts”, i.e. the length, filename, and checksums of downloaded archive files.
- swh.loader.package.loader.PartialExtID
The extid_type and extid fields of an ExtID object.
- class swh.loader.package.loader.RawExtrinsicMetadataCore(format: str, metadata: bytes, discovery_date: datetime | None = None)
Bases: object
Contains the core of the metadata extracted by a loader, that will be used to build a full RawExtrinsicMetadata object by adding object identifier, context, and provenance information.
Method generated by attrs for class RawExtrinsicMetadataCore.
- discovery_date
Defaults to the visit date.
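As a rough illustration of how a core item could be completed into a full record, here is a minimal, self-contained sketch. It does not import swh: `MetadataCore`, `FullMetadata`, and `complete` are hypothetical stand-ins, and only the fallback of discovery_date to the visit date is taken from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MetadataCore:
    """Hypothetical stand-in for RawExtrinsicMetadataCore."""

    format: str
    metadata: bytes
    discovery_date: Optional[datetime] = None


@dataclass(frozen=True)
class FullMetadata:
    """Hypothetical stand-in for a full metadata record."""

    target: str
    format: str
    metadata: bytes
    discovery_date: datetime


def complete(core: MetadataCore, target: str, visit_date: datetime) -> FullMetadata:
    # Fall back to the visit date when no discovery date was recorded.
    date = core.discovery_date if core.discovery_date is not None else visit_date
    return FullMetadata(target, core.format, core.metadata, date)


visit = datetime(2024, 1, 1, tzinfo=timezone.utc)
item = complete(MetadataCore("json", b"{}"), "swh:1:dir:" + "0" * 40, visit)
```

The real class also needs context and provenance information, which this sketch leaves out.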
- class swh.loader.package.loader.BasePackageInfo(url: str, filename: str | None, version: str, *, directory_extrinsic_metadata: List[RawExtrinsicMetadataCore] = [], checksums: Dict[str, str] = {})
Bases: object
- Compute the primary key for a dict using the id_keys as primary key
composite.
- Parameters:
d – A dict entry to compute the primary key on
id_keys – Sequence of keys to use as primary key
- Returns:
The identity for that dict entry
Method generated by attrs for class BasePackageInfo.
- version
Version name/number.
- MANIFEST_FORMAT: Template | None = None
If not None, used by the default extid() implementation to format a manifest, before hashing it to produce an ExtID.
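The "format a manifest, then hash it" idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the loader's code: the manifest fields, the `manifest_hash` helper, and the choice of SHA-256 are all hypothetical here.

```python
import hashlib
from string import Template
from typing import Dict

# Hypothetical manifest format; a real loader defines its own fields.
MANIFEST_FORMAT = Template("name $name\nversion $version\nurl $url\n")


def manifest_hash(fields: Dict[str, object]) -> bytes:
    # Substitute the package fields into the template, then hash the result.
    # SHA-256 is an assumption of this sketch.
    manifest = MANIFEST_FORMAT.substitute({k: str(v) for k, v in fields.items()})
    return hashlib.sha256(manifest.encode()).digest()


info = {
    "name": "example",
    "version": "1.0",
    "url": "https://example.org/example-1.0.tar.gz",
}
digest = manifest_hash(info)
```

Because the digest depends only on the formatted fields, two visits that see the same package description produce the same identifier without downloading anything.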
- directory_extrinsic_metadata
Extrinsic metadata collected by the loader, which will be attached to the loaded directory and added to the metadata storage.
- checksums
Dictionary holding package tarball checksums for integrity checks after download. Keys are hash algorithm names and values are checksums in hexadecimal format. The supported algorithms are defined in the swh.model.hashutil.ALGORITHMS set.
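A post-download integrity check against such a dictionary might look like the following sketch. It assumes algorithm names that `hashlib` understands (for example `sha256`) and does not consult swh.model.hashutil.ALGORITHMS; `verify_checksums` is a hypothetical helper, not part of the loader.

```python
import hashlib
from typing import Dict


def verify_checksums(path: str, checksums: Dict[str, str], chunk_size: int = 65536) -> None:
    """Raise ValueError if any expected checksum does not match the file."""
    # One hasher per expected algorithm, all fed in a single pass over the file.
    hashers = {algo: hashlib.new(algo) for algo in checksums}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    for algo, expected in checksums.items():
        actual = hashers[algo].hexdigest()
        if actual != expected.lower():
            raise ValueError(f"{algo} mismatch: expected {expected}, got {actual}")
```

Reading the file once and updating every hasher per chunk keeps memory use constant even for large tarballs.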
- class swh.loader.package.loader.PackageLoader(storage: StorageInterface, url: str, **kwargs: Any)
Bases: BaseLoader, Generic[TPackageInfo]
Loader’s constructor. This raises an exception if the minimal required configuration is missing (cf. the check method).
- Parameters:
storage – Storage instance
url – Origin url to load data from
- get_versions() → Sequence[str]
Return the list of all published package versions.
- Raises:
swh.loader.exception.NotFound – when failing to read the published package versions.
- Returns:
Sequence of published versions
- get_package_info(version: str) → Iterator[Tuple[str, TPackageInfo]]
Given a release version of a package, retrieve the associated package information for that version.
- Parameters:
version – Package version
- Returns:
(branch name, package metadata)
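To show the shape of these two methods, here is a small stand-in that answers from an in-memory dictionary. `ToyLoader` and `REGISTRY` are hypothetical and do not subclass PackageLoader; the `releases/<version>` branch naming and the plain dict used as package info are assumptions of this sketch.

```python
from typing import Dict, Iterator, Sequence, Tuple

# Hypothetical in-memory registry standing in for a remote package index.
REGISTRY: Dict[str, str] = {
    "1.0.0": "https://example.org/pkg-1.0.0.tar.gz",
    "1.1.0": "https://example.org/pkg-1.1.0.tar.gz",
}


class ToyLoader:
    """Stand-in exposing the same method names as the loader."""

    def get_versions(self) -> Sequence[str]:
        # Plain string sorting; real version ordering is loader-specific.
        return sorted(REGISTRY)

    def get_package_info(self, version: str) -> Iterator[Tuple[str, dict]]:
        # Yield one (branch name, package info) pair per artifact.
        yield f"releases/{version}", {"url": REGISTRY[version], "version": version}


loader = ToyLoader()
pairs = [pair for v in loader.get_versions() for pair in loader.get_package_info(v)]
```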
- build_release(p_info: TPackageInfo, uncompressed_path: str, directory: bytes) → Release | None
Build the release from the archive metadata (extrinsic artifact metadata) and the intrinsic metadata.
- Parameters:
p_info – Package information
uncompressed_path – Artifact uncompressed path on disk
- get_default_version() → str
Retrieve the latest release version if any.
- Returns:
Latest version
- resolve_object_from_extids(known_extids: Dict[Tuple[str, int, bytes], List[CoreSWHID]], p_info: TPackageInfo, whitelist: Set[bytes]) → CoreSWHID | None
Resolve the revision/release from known ExtIDs and a package info object.
If the artifact has already been downloaded, this will return the existing release (or revision) targeting that uncompressed artifact directory. Otherwise, this returns None.
- Parameters:
known_extids – Dict built from a list of ExtID, with the target as value
p_info – Package information
whitelist – Any ExtID with target not in this set is filtered out
- Returns:
None or release/revision SWHID
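The lookup-and-filter behaviour described above can be sketched as follows. The `(object type, object id)` tuple is a hypothetical stand-in for CoreSWHID, `resolve` is not the library function, and returning the first surviving candidate is a simplification made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

ExtIDKey = Tuple[str, int, bytes]  # (extid_type, extid_version, extid)
Target = Tuple[str, bytes]  # hypothetical stand-in for CoreSWHID


def resolve(
    known_extids: Dict[ExtIDKey, List[Target]],
    key: Optional[ExtIDKey],
    whitelist: Set[bytes],
) -> Optional[Target]:
    if key is None:
        # The package info provides no ExtID, so nothing can be resolved.
        return None
    # Keep only targets whose object id is in the whitelist.
    candidates = [t for t in known_extids.get(key, []) if t[1] in whitelist]
    # Simplification: take the first candidate left, if any.
    return candidates[0] if candidates else None


known = {("sketch-type", 0, b"\x01"): [("release", b"a"), ("release", b"b")]}
hit = resolve(known, ("sketch-type", 0, b"\x01"), {b"b"})
```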
- select_extid_target(p_info: TPackageInfo, extid_targets: Set[CoreSWHID]) → CoreSWHID | None
Given a set of release ExtID targets, chooses one appropriate for the given package info.
Package loaders should implement this if their ExtIDs may map to multiple releases, so they can fetch releases from the storage and inspect their fields to select the right one for this p_info.
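One way such a selection could work is to compare a field of each candidate release with the package info. In this sketch `RELEASES` is a hypothetical stand-in for objects fetched from storage, targets are plain strings rather than CoreSWHID values, and matching on the release name is only one possible criterion.

```python
from typing import Dict, Optional, Set

# Hypothetical stand-in for releases fetched from storage, keyed by target.
RELEASES: Dict[str, Dict[str, str]] = {
    "rel-a": {"name": "1.0.0"},
    "rel-b": {"name": "1.0.0-1"},
}


def select_target(version: str, extid_targets: Set[str]) -> Optional[str]:
    # Iterate in sorted order so the choice is deterministic.
    for target in sorted(extid_targets):
        release = RELEASES.get(target)
        if release is not None and release["name"] == version:
            return target
    return None


chosen = select_target("1.0.0-1", {"rel-a", "rel-b"})
```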
- download_package(p_info: TPackageInfo, tmpdir: str) → List[Tuple[str, Mapping]]
Download artifacts for a specific package. All downloads happen in the tmpdir folder.
The default implementation expects the package info to describe one artifact per package.
Note that most implementations have one artifact per package, but some have multiple artifacts per package (Debian), and some have none because the package itself is the artifact (GNU).
- Parameters:
p_info – Information on the package artifacts to download (url, filename, etc.)
tmpdir – Directory in which the artifacts are downloaded
- Returns:
List of (path, computed hashes)
- uncompress(dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str) → str
Uncompress the artifact(s) in the destination folder dest.
Optionally, an implementation may need the p_info dict for additional information (Debian).
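A generic extraction step could be sketched with `shutil.unpack_archive`, which picks the archive format from the file extension. This is not the loader's implementation: the `src` subdirectory name is an assumption, and the sketch performs no validation of archive contents, which untrusted archives would require.

```python
import os
import shutil
from typing import Any, List, Mapping, Tuple


def uncompress(dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str) -> str:
    """Unpack every downloaded artifact under dest and return the extraction root."""
    # The "src" directory name is an assumption of this sketch.
    uncompressed_path = os.path.join(dest, "src")
    os.makedirs(uncompressed_path, exist_ok=True)
    for archive_path, _hashes in dl_artifacts:
        # Format (tar, zip, ...) is inferred from the file extension.
        shutil.unpack_archive(archive_path, uncompressed_path)
    return uncompressed_path
```

The input has the same shape as the list of (path, computed hashes) pairs returned by download_package, so the two steps can be chained directly.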
- extra_branches() → Dict[bytes, Mapping[str, Any]]
Return an extra dict of branches that are used to update the set of branches.
- finalize_visit(*, snapshot: Snapshot | None, visit: OriginVisit, status_visit: str, status_load: str, failed_branches: List[str], errors: List[str] | None = None) → Dict[str, Any]
Finalize the visit:
flush any remaining unflushed data to storage
update the origin visit’s status
return the task’s status
- load() → Dict
Load for a specific origin the associated contents.
1. Get the list of versions in an origin.
2. Get the snapshot from the previous run of the loader, and filter out versions that were already loaded, if their extids match.
Then, for each remaining version in the origin:
3. Fetch the files for one package version. By default, this can be implemented as a simple HTTP request. Loaders with more specific requirements can override this, e.g. the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.
4. Extract the downloaded files. By default, this would be a universal archive/tarball extraction. Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).
5. Convert the extracted directory to a set of Software Heritage objects, using swh.model.from_disk.
6. Extract the metadata from the unpacked directories. This would only be applicable for “smart” loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control). On “minimal-metadata” sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.
7. Generate the revision/release objects for the given version, from the data generated at steps 3 and 4.
End for each.
8. Generate and load the snapshot for the visit. Using the revisions/releases collected at step 7 and the branch information from step 2, generate a snapshot and load it into the Software Heritage archive.
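The control flow of the steps above can be outlined in a few lines. This is a sketch with stand-in callbacks, not the actual method: `load_sketch`, `known`, and `process_version` are hypothetical, versions are matched by name rather than by ExtID, and both the `releases/<version>` branch naming and the reuse of previously loaded releases in the new snapshot are assumptions.

```python
from typing import Callable, Dict, Iterable, List


def load_sketch(
    versions: Iterable[str],
    known: Dict[str, str],
    process_version: Callable[[str], str],
) -> Dict[str, object]:
    """Mirror the control flow of the steps above with stand-in callbacks."""
    branches: Dict[str, str] = {}
    failed_branches: List[str] = []
    for version in versions:  # step 1: versions published by the origin
        branch = f"releases/{version}"  # branch naming is an assumption
        if version in known:
            # Step 2: already loaded on a previous visit, so skip the download.
            # Reusing the old release in the new snapshot is an assumption.
            branches[branch] = known[version]
            continue
        try:
            # Steps 3 to 7: fetch, extract, convert, read metadata, build release.
            branches[branch] = process_version(version)
        except Exception:
            # One failing version does not abort the whole visit (assumption).
            failed_branches.append(branch)
    # Step 8: the collected branches would become the snapshot for this visit.
    return {"branches": branches, "failed_branches": failed_branches}
```

Passing the per-version work as a callback keeps the loop independent of any particular package format.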
- get_metadata_fetcher() → MetadataFetcher
Returns a MetadataFetcher instance representing this package loader, which is used to add provenance information to extracted extrinsic metadata, if any.
- get_metadata_authority() → MetadataAuthority
For package loaders that get extrinsic metadata, returns the authority the metadata are coming from.
- get_extrinsic_origin_metadata() → List[RawExtrinsicMetadataCore]
Returns metadata items, used by build_extrinsic_origin_metadata.
- build_extrinsic_origin_metadata() → List[RawExtrinsicMetadata]
Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_origin_metadata.
- get_extrinsic_snapshot_metadata() → List[RawExtrinsicMetadataCore]
Returns metadata items, used by build_extrinsic_snapshot_metadata.
- build_extrinsic_snapshot_metadata(snapshot_id: bytes) → List[RawExtrinsicMetadata]
Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_snapshot_metadata.