swh.loader.package.loader module¶
-
swh.loader.package.loader.
SWH_METADATA_AUTHORITY
= MetadataAuthority(type=<MetadataAuthorityType.REGISTRY: 'registry'>, url='https://softwareheritage.org/', metadata=ImmutableDict({}))¶ Metadata authority for extrinsic metadata generated by Software Heritage. Used for metadata on “original artifacts”, i.e. length, filename, and checksums of downloaded archive files.
-
class
swh.loader.package.loader.
RawExtrinsicMetadataCore
(format: str, metadata: bytes, discovery_date: Optional[datetime.datetime] = None)[source]¶ Bases:
object
Contains the core of the metadata extracted by a loader, which will be used to build a full RawExtrinsicMetadata object by adding the object identifier, context, and provenance information.
-
discovery_date
¶ Defaults to the visit date.
-
-
class
swh.loader.package.loader.
BasePackageInfo
(url: str, filename: Optional[str], *, directory_extrinsic_metadata: List[swh.loader.package.loader.RawExtrinsicMetadataCore] = [])[source]¶ Bases:
object
- Compute the primary key for a dict, using id_keys as the composite primary key.
- Parameters
d – A dict entry to compute the primary key on
id_keys – Sequence of keys to use as primary key
- Returns
The identity for that dict entry
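As an illustration of the composite-key idea described above, here is a minimal sketch. The function name `composite_key`, the tuple return type, and the handling of missing keys are assumptions made for this example; the actual implementation in swh.loader.package may differ.

```python
from typing import Any, Dict, Sequence, Tuple


def composite_key(d: Dict[str, Any], id_keys: Sequence[str]) -> Tuple[Any, ...]:
    """Hypothetical helper: build a composite key from selected dict entries.

    The values are taken in the order given by ``id_keys``; a key missing
    from ``d`` contributes ``None`` (an assumption of this sketch).
    """
    return tuple(d.get(k) for k in id_keys)


# Example artifact description (made-up values).
entry = {
    "url": "https://example.org/pkg-1.0.tar.gz",
    "filename": "pkg-1.0.tar.gz",
    "length": 1024,
}
key = composite_key(entry, ["filename", "length"])
```

Two dict entries with the same values for every key in `id_keys` produce equal tuples, so the result can serve as a dictionary key or be compared directly.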
-
property
ID_KEYS
¶
-
class
swh.loader.package.loader.
PackageLoader
(url)[source]¶ Bases:
Generic
[swh.loader.package.loader.TPackageInfo
]-
visit_type
= ''¶
-
get_versions
() → Sequence[str][source]¶ Return the list of all published package versions.
- Returns
Sequence of published versions
-
get_package_info
(version: str) → Iterator[Tuple[str, TPackageInfo]][source]¶ - Given a release version of a package, retrieve the package information associated with that version.
- Parameters
version – Package version
- Returns
(branch name, package metadata)
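The following sketch shows how get_versions and get_package_info might fit together in a concrete loader. It does not import swh.loader.package: `ExamplePackageInfo`, `ExampleLoader`, the `releases` mapping, and the `releases/<version>` branch naming are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ExamplePackageInfo:
    """Hypothetical stand-in for a BasePackageInfo subclass."""

    url: str
    filename: Optional[str]
    version: str


class ExampleLoader:
    """Hypothetical loader over an in-memory {version: artifact URL} mapping."""

    def __init__(self, url: str, releases: Dict[str, str]) -> None:
        self.url = url
        self._releases = releases

    def get_versions(self) -> Sequence[str]:
        # Plain lexicographic sort; real version ordering is more subtle.
        return sorted(self._releases)

    def get_package_info(
        self, version: str
    ) -> Iterator[Tuple[str, ExamplePackageInfo]]:
        artifact_url = self._releases[version]
        filename = artifact_url.rsplit("/", 1)[-1]
        # Yield (branch name, package info); the branch naming is an assumption.
        yield (
            f"releases/{version}",
            ExamplePackageInfo(url=artifact_url, filename=filename, version=version),
        )


loader = ExampleLoader(
    "https://example.org/pkg",
    {"1.0": "https://example.org/pkg/pkg-1.0.tar.gz"},
)
infos = list(loader.get_package_info("1.0"))
```

Returning an iterator lets a single version map to several (branch name, package info) pairs when a loader needs that.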
-
build_revision
(p_info: TPackageInfo, uncompressed_path: str, directory: bytes) → Optional[swh.model.model.Revision][source]¶ Build the revision from the archive metadata (extrinsic artifact metadata) and the intrinsic metadata.
- Parameters
p_info – Package information
uncompressed_path – Artifact uncompressed path on disk
- Returns
Revision object
-
get_default_version
() → str[source]¶ Retrieve the latest release version if any.
- Returns
Latest version
-
last_snapshot
() → Optional[swh.model.model.Snapshot][source]¶ Retrieve the last snapshot out of the last visit.
-
known_artifacts
(snapshot: Optional[swh.model.model.Snapshot]) → Dict[bytes, Optional[swh.model.collections.ImmutableDict[str, object]]][source]¶ Retrieve the known releases/artifacts for the origin.
- Parameters
snapshot – Snapshot for the visit
- Returns
Dict mapping revision ids (bytes) to metadata dicts.
-
resolve_revision_from
(known_artifacts: Dict, p_info: TPackageInfo) → Optional[bytes][source]¶ Resolve the revision from the known artifacts and the package information.
If the artifact has already been downloaded, this will return the existing revision targeting that uncompressed artifact directory. Otherwise, this returns None.
- Parameters
known_artifacts – Known artifacts for the origin, keyed by revision id
p_info – Package information
- Returns
None or revision identifier
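A minimal sketch of this lookup, under stated assumptions: the function name `resolve_revision`, the plain-dict representation of an artifact, and the rule of matching on a fixed set of keys are all hypothetical, and the real loader may compare artifacts differently.

```python
from typing import Any, Dict, Mapping, Optional, Sequence


def resolve_revision(
    known_artifacts: Mapping[bytes, Optional[Mapping[str, Any]]],
    artifact: Mapping[str, Any],
    id_keys: Sequence[str],
) -> Optional[bytes]:
    """Hypothetical helper: find the revision already built from ``artifact``.

    Two artifacts are considered identical here when they agree on every
    key listed in ``id_keys`` (an assumption of this sketch).
    """
    wanted = tuple(artifact.get(k) for k in id_keys)
    for revision_id, known in known_artifacts.items():
        if known is None:
            # No metadata recorded for this revision; nothing to compare.
            continue
        if tuple(known.get(k) for k in id_keys) == wanted:
            return revision_id
    return None


# Made-up revision ids and artifact descriptions.
known: Dict[bytes, Optional[Dict[str, Any]]] = {
    b"\x01" * 20: {"filename": "pkg-1.0.tar.gz", "length": 1024},
    b"\x02" * 20: None,
}
hit = resolve_revision(known, {"filename": "pkg-1.0.tar.gz", "length": 1024}, ["filename", "length"])
miss = resolve_revision(known, {"filename": "pkg-1.1.tar.gz", "length": 2048}, ["filename", "length"])
```

A non-None result means the download and extraction for that artifact can be skipped and the existing revision reused.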
-
download_package
(p_info: TPackageInfo, tmpdir: str) → List[Tuple[str, Mapping]][source]¶ Download artifacts for a specific package. All downloads happen in the tmpdir folder.
The default implementation expects the package info to describe one artifact per package.
Note that most implementations have one artifact per package, but some have multiple artifacts per package (debian), and some have none, the package itself being the artifact (gnu).
- Parameters
p_info – Information on the package artifacts to download (url, filename, etc.)
tmpdir – Directory in which the artifacts are downloaded
- Returns
List of (path, computed hashes)
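To illustrate the shape of the (path, computed hashes) pairs, the sketch below hashes a file that is already on disk, so it needs no network access. The helper name `hash_artifact`, the choice of sha1 and sha256, and the layout of the returned mapping are assumptions for this example, not the loader's actual output format.

```python
import hashlib
import os
from typing import Any, Dict, Tuple


def hash_artifact(path: str, chunk_size: int = 65536) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: return (path, hashes) for a downloaded file.

    The file is read in chunks so that large archives are not loaded
    into memory at once.
    """
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    length = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
            sha256.update(chunk)
            length += len(chunk)
    return path, {
        "filename": os.path.basename(path),
        "length": length,
        "checksums": {"sha1": sha1.hexdigest(), "sha256": sha256.hexdigest()},
    }
```

A download step for a one-artifact package could then return `[hash_artifact(path)]` once the file has been written under `tmpdir`.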
-
uncompress
(dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str) → str[source]¶ Uncompress the artifact(s) in the destination folder dest.
Some loaders (debian) may additionally need the p_info dict for more information.
-
extra_branches
() → Dict[bytes, Mapping[str, Any]][source]¶ Return an extra dict of branches that are used to update the set of branches.
-
load
() → Dict[source]¶ Load the contents associated with a specific origin.
For each package version of the origin:
1. Fetch the files for one package version. By default, this can be implemented as a simple HTTP request. Loaders with more specific requirements can override this, e.g.: the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.
2. Extract the downloaded files. By default, this would be a universal archive/tarball extraction. Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).
3. Convert the extracted directory to a set of Software Heritage objects, using swh.model.from_disk.
4. Extract the metadata from the unpacked directories. This would only be applicable for “smart” loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control). On “minimal-metadata” sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.
5. Generate the revision/release objects for the given version, from the data generated at steps 3 and 4.
End for each.
6. Generate and load the snapshot for the visit. Using the revisions/releases collected at step 5, and the branch information from step 0, generate a snapshot and load it into the Software Heritage archive.
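The control flow described above can be sketched as a plain loop. Everything here is a hypothetical stand-in: `load_sketch`, its callable parameters, and the `releases/<version>` branch names are chosen for illustration only, and the sketch omits storage, error handling, and visit bookkeeping that a real loader needs.

```python
from typing import Callable, Dict, Iterable, Optional


def load_sketch(
    versions: Iterable[str],
    fetch: Callable[[str], str],
    extract: Callable[[str], str],
    to_objects: Callable[[str], bytes],
    read_metadata: Callable[[str], Dict[str, str]],
    build_revision: Callable[[bytes, Dict[str, str]], Optional[bytes]],
) -> Dict[str, bytes]:
    """Hypothetical outline of the per-version loop; returns snapshot branches."""
    branches: Dict[str, bytes] = {}
    for version in versions:
        archive_path = fetch(version)             # fetch the files
        directory_path = extract(archive_path)    # extract the downloaded files
        directory_id = to_objects(directory_path)  # convert to archive objects
        metadata = read_metadata(directory_path)  # extract intrinsic metadata
        revision_id = build_revision(directory_id, metadata)
        if revision_id is not None:
            # Keep one branch per successfully built revision.
            branches[f"releases/{version}"] = revision_id
    # The collected branches are what the final snapshot would be built from.
    return branches


# Toy stand-ins so the loop can be exercised without network or disk access.
branches = load_sketch(
    versions=["1.0", "1.1"],
    fetch=lambda v: f"/tmp/pkg-{v}.tar.gz",
    extract=lambda archive: archive[: -len(".tar.gz")],
    to_objects=lambda directory: directory.encode(),
    read_metadata=lambda directory: {"name": "pkg"},
    build_revision=lambda dir_id, meta: None if dir_id.endswith(b"1.1") else b"rev:" + dir_id,
)
```

Passing each stage as a callable mirrors how specific loaders override individual steps while keeping the overall loop unchanged.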
-
get_metadata_fetcher
() → swh.model.model.MetadataFetcher[source]¶ Returns a MetadataFetcher instance representing this package loader, which is used to add provenance information to extracted extrinsic metadata, if any.
For package loaders that get extrinsic metadata, returns the authority the metadata are coming from.
-
get_extrinsic_origin_metadata
() → List[swh.loader.package.loader.RawExtrinsicMetadataCore][source]¶ Returns metadata items, used by build_extrinsic_origin_metadata.
-
build_extrinsic_origin_metadata
() → List[swh.model.model.RawExtrinsicMetadata][source]¶ Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_origin_metadata.
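The relationship between a metadata "core" and a full metadata object can be sketched as follows. `MetadataCore`, `FullMetadata`, `build_full_metadata`, and the string-typed `target`, `authority`, and `fetcher` fields are simplified, hypothetical stand-ins for the swh.model classes; only the idea of adding the target and provenance, and of defaulting the discovery date to the visit date, comes from the descriptions on this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MetadataCore:
    """Hypothetical stand-in for RawExtrinsicMetadataCore."""

    format: str
    metadata: bytes
    discovery_date: Optional[datetime] = None


@dataclass(frozen=True)
class FullMetadata:
    """Hypothetical, simplified stand-in for RawExtrinsicMetadata."""

    target: str
    authority: str
    fetcher: str
    discovery_date: datetime
    format: str
    metadata: bytes


def build_full_metadata(
    cores: Sequence[MetadataCore],
    target: str,
    authority: str,
    fetcher: str,
    visit_date: datetime,
) -> List[FullMetadata]:
    """Attach target and provenance to each core item."""
    result = []
    for core in cores:
        # A missing discovery date falls back to the visit date.
        date = core.discovery_date if core.discovery_date is not None else visit_date
        result.append(
            FullMetadata(
                target=target,
                authority=authority,
                fetcher=fetcher,
                discovery_date=date,
                format=core.format,
                metadata=core.metadata,
            )
        )
    return result


visit = datetime(2021, 1, 1, tzinfo=timezone.utc)
items = build_full_metadata(
    [MetadataCore(format="example-json", metadata=b"{}")],
    target="https://example.org/pkg",
    authority="https://example.org/",
    fetcher="example-loader",
    visit_date=visit,
)
```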
-
get_extrinsic_snapshot_metadata
() → List[swh.loader.package.loader.RawExtrinsicMetadataCore][source]¶ Returns metadata items, used by build_extrinsic_snapshot_metadata.
-
build_extrinsic_snapshot_metadata
(snapshot_id: bytes) → List[swh.model.model.RawExtrinsicMetadata][source]¶ Builds a list of full RawExtrinsicMetadata objects, using metadata returned by get_extrinsic_snapshot_metadata.
-
build_extrinsic_directory_metadata
(p_info: TPackageInfo, revision_id: bytes, directory_id: bytes) → List[swh.model.model.RawExtrinsicMetadata][source]¶
-