swh.loader.core.loader module#
- class swh.loader.core.loader.BaseLoader(storage: StorageInterface, origin_url: str, logging_class: str | None = None, save_data_path: str | None = None, max_content_size: int | None = None, lister_name: str | None = None, lister_instance_name: str | None = None, metadata_fetcher_credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, create_partial_snapshot: bool = False)[source]#
Bases:
object
Base class for (D)VCS loaders (e.g. Svn, Git, Mercurial, …) or PackageLoader (e.g. PyPI, Npm, CRAN, …)
A loader retrieves origin information (git/mercurial/svn repositories, pypi/npm/… package artifacts), ingests the contents/directories/revisions/releases/snapshot read from those artifacts, and sends them to the archive through the storage backend.
The main entry point for the loader is the load() function. Two static methods (from_config(), from_configfile()) centralize and ease loader instantiation from either a configuration dict or a configuration file.
Some class examples:
SvnLoader
GitLoader
PyPILoader
NpmLoader
- Parameters:
lister_name – Name of the lister which triggered this load. If provided, the loader will try to use the forge’s API to retrieve extrinsic metadata
lister_instance_name – Name of the lister instance which triggered this load. Must be None iff lister_name is, but it may be the empty string for listers with a single instance.
- parent_origins: List[Origin] | None#
If the given origin is a “forge fork” (ie. created with the “Fork” button of GitHub-like forges),
build_extrinsic_origin_metadata()
sets this to a list of origins it was forked from; closest parent first.
- classmethod from_config(storage: Dict[str, Any], overrides: Dict[str, Any] | None = None, **extra_kwargs: Any)[source]#
Instantiate a loader from a configuration dict.
This is basically a backwards-compatibility shim for the CLI.
- Parameters:
storage – instantiation config for the storage
overrides – A dict of extra configuration for loaders. Maps fully qualified class names (e.g. "swh.loader.git.loader.GitLoader") to a dict of extra keyword arguments to pass to this (and only this) loader.
extra_kwargs – all extra keyword arguments are passed to all loaders
- Returns:
the instantiated loader
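The per-class overrides mechanism can be illustrated with a minimal, self-contained sketch. The merging shown mirrors the documented behaviour (shared extra kwargs for all loaders, per-class overrides taking precedence for one loader only); the function name and config keys are illustrative, not the actual swh implementation.

```python
# Minimal sketch of how per-loader overrides and shared extra kwargs
# could be merged (illustrative; not the actual swh implementation).
from typing import Any, Dict, Optional


def merge_loader_config(
    cls_name: str,
    storage: Dict[str, Any],
    overrides: Optional[Dict[str, Dict[str, Any]]] = None,
    **extra_kwargs: Any,
) -> Dict[str, Any]:
    """Build the kwargs a loader class would be instantiated with."""
    kwargs: Dict[str, Any] = {"storage": storage, **extra_kwargs}
    # Overrides are keyed by fully qualified class name and apply
    # to this (and only this) loader, taking precedence.
    kwargs.update((overrides or {}).get(cls_name, {}))
    return kwargs


config = merge_loader_config(
    "swh.loader.git.loader.GitLoader",
    storage={"cls": "memory"},
    overrides={"swh.loader.git.loader.GitLoader": {"save_data_path": "/tmp/git"}},
    max_content_size=10_000,
)
```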
- classmethod from_configfile(**kwargs: Any)[source]#
Instantiate a loader from the configuration loaded from the SWH_CONFIG_FILENAME envvar, with potential extra keyword arguments if their value is not None.
- Parameters:
kwargs – kwargs passed to the loader instantiation
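The "extra keyword arguments if their value is not None" behaviour can be sketched as follows. JSON is used here for a stdlib-only illustration (the real loader reads a YAML config file); the function name is hypothetical.

```python
# Sketch of the documented behaviour: read a config file pointed to by the
# SWH_CONFIG_FILENAME environment variable, then let non-None keyword
# arguments override its contents. JSON stands in for YAML to keep this
# illustration stdlib-only.
import json
import os
import tempfile
from typing import Any, Dict


def config_from_envfile(**kwargs: Any) -> Dict[str, Any]:
    with open(os.environ["SWH_CONFIG_FILENAME"]) as f:
        config = json.load(f)
    # Only kwargs with a non-None value override the file contents.
    config.update({k: v for k, v in kwargs.items() if v is not None})
    return config


with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"origin_url": "https://example.org/repo", "max_content_size": 1}, f)
os.environ["SWH_CONFIG_FILENAME"] = f.name

merged = config_from_envfile(max_content_size=100, lister_name=None)
```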
- flush() Dict[str, int] [source]#
Flush any potential buffered data not sent to swh-storage. Returns the same value as
swh.storage.interface.StorageInterface.flush()
.
- prepare() None [source]#
Second step executed by the loader to prepare some state it needs.
- Raises:
NotFound if the origin to ingest is not found.
- get_origin() Origin [source]#
Get the origin that is currently being loaded. self.origin should be set in prepare_origin().
- Returns:
an origin ready to be sent to storage by origin_add().
- fetch_data() bool [source]#
- Fetch the data from the source the loader is currently loading
(ex: git/hg/svn/… repository).
- Returns:
a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
- process_data() bool [source]#
Run any additional processing between fetching and storing the data
- Returns:
a value that is interpreted as a boolean. If True, fetch_data() needs to be called again to complete loading. Ignored if fetch_data() already returned False.
- store_data() None [source]#
Store fetched and processed data in the storage.
This should call the storage.<object>_add methods, which handle the objects to store in the storage.
- load_status() Dict[str, str] [source]#
Detailed loading status.
Defaults to logging an eventful load.
- Returns: a dictionary that is eventually passed back as the task’s
result to the scheduler, allowing tuning of the task recurrence mechanism.
- post_load(success: bool = True) None [source]#
Allows the loader to perform additional actions after loading is done, depending on its status. The success flag indicates the loading's status.
Defaults to doing nothing.
It is up to the implementer of this method to make sure it does not break.
- Parameters:
success (bool) – the success status of the loading
- pre_cleanup() None [source]#
As a first step, tries to check for dangling data to clean up. This should do its best to avoid raising issues.
- build_partial_snapshot() Snapshot | None [source]#
When the loader is configured to serialize partial snapshots, this allows the loader to provide an implementation that builds a partial snapshot. This is used when ingestion takes multiple calls to fetch_data() and store_data(). Ignored when the loader is not configured to serialize partial snapshots.
- load() Dict[str, str] [source]#
Loading logic for the loader to follow:
- Store the actual origin_visit to storage
- Call prepare() to prepare any eventual state
- Call get_origin() to get the origin we work with and store it
- while True:
  - Call fetch_data() to fetch the data to store
  - Call process_data() to optionally run processing between fetch_data() and store_data()
  - Call store_data() to store the data
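The control flow described above can be sketched with a toy loader. Method names follow the documentation; everything else (visit statuses, snapshots, error handling) is omitted, so this is an illustration of the loop, not the real implementation.

```python
# Toy illustration of the documented load() control flow: fetch_data() is
# called until it returns False, with process_data() and store_data() run
# on each iteration.
from typing import List


class ToyLoader:
    def __init__(self, chunks: List[str]) -> None:
        self.chunks = chunks
        self.stored: List[str] = []

    def prepare(self) -> None:
        self.index = 0

    def fetch_data(self) -> bool:
        self.current = self.chunks[self.index]
        self.index += 1
        # True means: call fetch_data() again to complete loading
        return self.index < len(self.chunks)

    def process_data(self) -> bool:
        self.current = self.current.upper()
        return True

    def store_data(self) -> None:
        self.stored.append(self.current)

    def load(self) -> dict:
        self.prepare()
        more_data = True
        while more_data:
            more_data = self.fetch_data()
            self.process_data()
            self.store_data()
        return {"status": "eventful"}


loader = ToyLoader(["a", "b", "c"])
result = loader.load()
```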
- load_metadata_objects(metadata_objects: List[RawExtrinsicMetadata]) None [source]#
- build_extrinsic_origin_metadata() List[RawExtrinsicMetadata] [source]#
Builds a list of full RawExtrinsicMetadata objects, using a metadata fetcher returned by
get_fetcher_classes()
.
- statsd_timed(name: str, tags: Dict[str, Any] = {}) ContextManager [source]#
Wrapper for
swh.core.statsd.Statsd.timed()
, which uses the standard metric name and tags for loaders.
- class swh.loader.core.loader.NodeLoader(storage: StorageInterface, url: str, checksums: Dict[str, str], checksums_computation: str | None = None, checksum_layout: str | None = None, fallback_urls: List[str] | None = None, **kwargs)[source]#
Bases:
BaseLoader, ABC
Common abstract class for ContentLoader and DirectoryLoader.
The "checksums" field is a dictionary of hex hashes of the retrieved object (content or directory). When "checksum_layout" is "standard", the checksums are computed on the content of the remote file itself (as the unix cli tools "sha1sum", "sha256sum", … do). When "checksum_layout" is "nar", the check is delegated to the Nar class (which does a hash computation equivalent to the nix-store --dump cli); the checksums are then computed on the content of the retrieved remote artifact (be it a file or an archive). Any other "checksum_layout" raises UnsupportedChecksumLayout.
The multiple "fallback" urls received are mirror urls, only used to fetch the object if the main origin is no longer available. They are not stored.
Ingestion is considered eventful on the first ingestion. Subsequent loads of the same object should end up as uneventful visits (matching snapshot).
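What a "standard" checksum check amounts to can be sketched with hashlib: the hex digests in the checksums dict are compared against hashes of the raw file content, exactly as sha1sum/sha256sum would report them. The function below is illustrative, not the actual swh implementation.

```python
# Sketch of a "standard" checksum_layout check: each hex digest in the
# checksums dict must match a hash of the raw file content, as computed
# by the corresponding unix tool (sha1sum, sha256sum, ...).
import hashlib
from typing import Dict


def check_standard_checksums(data: bytes, checksums: Dict[str, str]) -> None:
    for algo, expected in checksums.items():
        actual = hashlib.new(algo, data).hexdigest()
        if actual != expected:
            raise ValueError(f"{algo} mismatch: {actual} != {expected}")


payload = b"hello world\n"
# Passes silently when the digest matches:
check_standard_checksums(
    payload,
    {"sha256": hashlib.sha256(payload).hexdigest()},
)
```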
- extid_version = 1#
- prepare() None [source]#
Second step executed by the loader to prepare some state it needs.
- Raises:
NotFound if the origin to ingest is not found.
- load_status() Dict[str, Any] [source]#
Detailed loading status.
Defaults to logging an eventful load.
- Returns: a dictionary that is eventually passed back as the task’s
result to the scheduler, allowing tuning of the task recurrence mechanism.
- abstract fetch_artifact() Iterator[Path] [source]#
This fetches an artifact representation and yields its local representation (as a Path). Depending on the implementation, this may yield contents coming from a remote location, or directories coming from a tarball, svn tree, git tree, hg tree, …
- Raises
NotFound if nothing is found; ValueError in case of mismatched checksums
- abstract process_artifact(artifact_path: Path) None [source]#
Build the DAG objects out of the locally retrieved artifact.
- fetch_data() bool [source]#
Fetch the artifact (e.g. content, directory), check it, and ingest the DAG objects coming from the artifact.
This iterates over the fetch_artifact() generator to retrieve artifacts. As soon as one is retrieved and passes the checks (e.g. nar checks if the "checksum_layout" is "nar"), the method proceeds with the DAG ingestion as usual. If the artifact does not pass the checks, this tries to retrieve the next mirrored artifact. If no artifact is retrievable, this raises.
- Raises:
NotFound if no artifact is found; ValueError in case of mismatched checksums
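The documented fallback behaviour (try each candidate in turn, keep the first that passes its check, raise if none does) can be sketched with a generic helper. The NotFound class here is a stand-in for swh's exception, and the check callback stands in for the checksum verification.

```python
# Sketch of the documented fallback behaviour: iterate over candidate
# artifacts (main url first, then mirrors), return the first one that
# passes its check, and raise if none is retrievable.
from typing import Callable, Iterator, List


class NotFound(Exception):
    """Stand-in for swh's NotFound exception."""


def first_valid_artifact(
    candidates: Iterator[str], check: Callable[[str], bool]
) -> str:
    rejected: List[str] = []
    for candidate in candidates:
        if check(candidate):
            return candidate
        rejected.append(candidate)
    raise NotFound(f"no valid artifact among {rejected!r}")


# The main url fails the (hypothetical) check, so the mirror is used:
found = first_valid_artifact(
    iter(["https://main.example/a", "https://mirror.example/a"]),
    check=lambda url: url.startswith("https://mirror."),
)
```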
- class swh.loader.core.loader.ContentLoader(*args, **kwargs)[source]#
Bases:
NodeLoader
Basic loader for the edge case of ingesting a url resolving to a bare 'content' file.
A visit ends up as a full visit with a snapshot when the artifact is retrieved successfully, matches the provided checksums, and is successfully ingested in the archive.
An extid mapping entry is recorded in the extid table. The extid_type depends on the type of checksums provided (see the NodeLoader docstring).
The output snapshot has the following structure:
id: <bytes>
branches:
  HEAD:
    target_type: content
    target: <content-id>
- fetch_artifact() Iterator[Path] [source]#
Iterates over the mirror urls to find a content.
- Raises
NotFound if nothing is found; ValueError in case of any error when fetching/computing (length, checksums mismatched…)
- class swh.loader.core.loader.BaseDirectoryLoader(*args, path_filter: ~typing.Callable[[bytes, bytes, ~typing.Iterable[bytes] | None], bool] = <function accept_all_paths>, **kwargs)[source]#
Bases:
NodeLoader
Abstract base Directory Loader for ‘tree’ ingestion (through any media).
Implementations should inherit from this class and provide:
- the required fetch_artifact() method to retrieve the Directory (from the proper media protocol, e.g. git, svn, hg, …)
- the optional build_snapshot() method to build the Snapshot with the proper structure if the default is not enough.
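The subclassing contract above can be sketched with simplified stand-in classes: a concrete directory loader must provide fetch_artifact(), and may override build_snapshot(). These classes and the dict-shaped snapshot are illustrative only, not the real swh.loader.core.loader API.

```python
# Stand-in sketch of the documented subclassing contract for directory
# loaders: fetch_artifact() is required, build_snapshot() is optional.
import abc
from pathlib import Path
from typing import Iterator


class SketchDirectoryLoader(abc.ABC):
    @abc.abstractmethod
    def fetch_artifact(self) -> Iterator[Path]:
        """Yield local paths to candidate directory artifacts."""

    def build_snapshot(self) -> dict:
        # Default structure: a single HEAD branch targeting the directory.
        return {"branches": {"HEAD": {"target_type": "directory"}}}


class LocalTreeLoader(SketchDirectoryLoader):
    def __init__(self, path: Path) -> None:
        self.path = path

    def fetch_artifact(self) -> Iterator[Path]:
        yield self.path  # e.g. a checked-out git/svn/hg tree


loader = LocalTreeLoader(Path("/tmp"))
snapshot = loader.build_snapshot()
```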
- process_artifact(artifact_path: Path) None [source]#
Build the Directory and other DAG objects out of the remote artifact retrieved (self.artifact_path).
This needs to happen in this method because it’s within a context manager block.
- build_snapshot() Snapshot [source]#
Build and return the snapshot to store in the archive.
By default, this builds the snapshot with the structure:
id: <bytes>
branches:
  HEAD:
    target_type: directory
    target: <directory-id>
Other directory loader implementations could override this method to build a more specific snapshot.
- class swh.loader.core.loader.TarballDirectoryLoader(*args, path_filter: ~typing.Callable[[bytes, bytes, ~typing.Iterable[bytes] | None], bool] = <function accept_all_paths>, **kwargs)[source]#
Bases:
BaseDirectoryLoader
TarballDirectoryLoader for ingestion of a url resolving to a tarball. The tarball is uncompressed and checked against its provided checksums (either standard checksums or Nar checksums).
A visit ends up as a full visit with a snapshot when the artifact is retrieved successfully, matches the provided checksums, and is successfully ingested in the archive.
An extid mapping entry is recorded in the extid table. The extid_type depends on the type of checksums provided (see the NodeLoader docstring).
The output snapshot has the following structure:
id: <bytes>
branches:
  HEAD:
    target_type: directory
    target: <directory-id>
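The uncompress-and-check step described above amounts to verifying the tarball's digest and extracting it, which can be sketched with the stdlib tarfile module. This is illustrative stdlib-only code under the "standard" checksum layout assumption, not the actual loader implementation.

```python
# Sketch of the uncompress-and-check step: build a small tarball, verify
# its standard sha256 checksum, then extract it to a temporary directory.
import hashlib
import io
import tarfile
import tempfile
from pathlib import Path


def make_tarball() -> bytes:
    """Build an in-memory .tar.gz containing one file (test fixture)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"print('hello')\n"
        info = tarfile.TarInfo("pkg/main.py")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def extract_checked(tarball: bytes, sha256: str) -> Path:
    """Verify the tarball's checksum, then uncompress it."""
    if hashlib.sha256(tarball).hexdigest() != sha256:
        raise ValueError("checksum mismatch")
    dest = Path(tempfile.mkdtemp())
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
        tar.extractall(dest)
    return dest


tarball = make_tarball()
tree = extract_checked(tarball, hashlib.sha256(tarball).hexdigest())
```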