swh.loader.core.loader module

class swh.loader.core.loader.BaseLoader(storage: swh.storage.interface.StorageInterface, logging_class: Optional[str] = None, save_data_path: Optional[str] = None, max_content_size: Optional[int] = None)[source]

Bases: object

Base class for (D)VCS loaders (e.g. Svn, Git, Mercurial, …) or PackageLoader (e.g. PyPI, Npm, CRAN, …).

A loader retrieves origin information (git/mercurial/svn repositories, pypi/npm/… package artifacts), ingests the contents/directories/revisions/releases/snapshot read from those artifacts, and sends them to the archive through the storage backend.

The main entry point for the loader is the load() function.

Two class methods (from_config(), from_configfile()) centralize and ease loader instantiation from either a configuration dict or a configuration file.

Some class examples:

  • SvnLoader

  • GitLoader

  • PyPILoader

  • NpmLoader

visit_date: Optional[datetime.datetime]
origin: Optional[swh.model.model.Origin]
origin_metadata: Dict[str, Any]
loaded_snapshot_id: Optional[bytes]
classmethod from_config(storage: Dict[str, Any], **config: Any)[source]

Instantiate a loader from a configuration dict.

This is basically a backwards-compatibility shim for the CLI.

Parameters
  • storage – instantiation config for the storage

  • config – the configuration dict for the loader, with the following keys:

      • credentials (optional): credentials list for the scheduler

      • any other kwargs passed to the loader.

Returns

the instantiated loader
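
A minimal sketch of the pattern described above, written against stand-in classes rather than the real swh modules. FakeStorage and SketchLoader are hypothetical names introduced only for illustration, and the way the storage config is turned into a storage object is an assumption.

```python
from typing import Any, Dict, Optional


class FakeStorage:
    """Hypothetical stand-in for a storage backend (not swh.storage)."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class SketchLoader:
    """Hypothetical loader used for illustration; not the real BaseLoader."""

    def __init__(
        self, storage: FakeStorage, max_content_size: Optional[int] = None
    ) -> None:
        self.storage = storage
        self.max_content_size = max_content_size

    @classmethod
    def from_config(cls, storage: Dict[str, Any], **config: Any) -> "SketchLoader":
        # Assumption: the storage dict is an instantiation config, so it is
        # turned into a storage object before the loader is built.
        storage_instance = FakeStorage(**storage)
        # Every remaining key is forwarded to the constructor as a kwarg.
        return cls(storage=storage_instance, **config)


loader = SketchLoader.from_config(
    storage={"cls": "memory"},
    max_content_size=1024,
)
```

Keeping the storage configuration separate from the remaining keys lets callers hand over a plain dict without constructing a storage object themselves.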

classmethod from_configfile(**kwargs: Any)[source]

Instantiate a loader from the configuration loaded from the SWH_CONFIG_FILENAME envvar, with potential extra keyword arguments if their value is not None.

Parameters

kwargs – kwargs passed to the loader instantiation
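
The "if their value is not None" rule can be sketched as a small merge step. This is an illustration under stated assumptions: merge_loader_kwargs is a hypothetical helper, file_config stands in for whatever was parsed from the file named by SWH_CONFIG_FILENAME, and giving keyword arguments precedence over file values is assumed rather than taken from the source.

```python
from typing import Any, Dict


def merge_loader_kwargs(file_config: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Overlay keyword arguments on a config dict, skipping None values."""
    merged = dict(file_config)  # copy, so the caller's dict is left untouched
    merged.update({key: value for key, value in kwargs.items() if value is not None})
    return merged


# Illustrative keys only; the real configuration file layout may differ.
file_config = {"storage": {"cls": "memory"}, "max_content_size": 1024}
merged = merge_loader_kwargs(
    file_config,
    max_content_size=None,      # None: the value from the file is kept
    save_data_path="/tmp/raw",  # not None: added to the final kwargs
)
```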

save_data() None[source]

Save the data associated with the current load.

get_save_data_path() str[source]

The path to which we archive the loader’s raw data

flush() None[source]

Flush any potential buffered data not sent to swh-storage.

cleanup() None[source]

Last step executed by the loader.

prepare_origin_visit() None[source]

First step executed by the loader to prepare origin and visit references. Set/update self.origin, and optionally self.origin_url, self.visit_date.

prepare() None[source]

Second step executed by the loader to prepare some state needed by the loader.

Raises

NotFound exception if the origin to ingest is not found.

get_origin() swh.model.model.Origin[source]

Get the origin that is currently being loaded. self.origin should be set in prepare_origin_visit().

Returns

an origin ready to be sent to storage by origin_add().

Return type

swh.model.model.Origin

fetch_data() bool[source]

Fetch the data from the source the loader is currently loading (e.g. a git/hg/svn/… repository).

Returns

a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
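
The return-value contract can be illustrated with a driver loop that keeps calling fetch_data() until it returns a falsy value. This is a sketch using a hypothetical ChunkedLoader. Calling store_data() after each fetch is an assumption about ordering, not something this page confirms.

```python
from typing import List


class ChunkedLoader:
    """Hypothetical loader that fetches its source in several batches."""

    def __init__(self, batches: List[List[str]]) -> None:
        self._pending = list(batches)
        self._current: List[str] = []
        self.stored: List[str] = []

    def fetch_data(self) -> bool:
        # Pull the next batch; return True while more batches remain.
        self._current = self._pending.pop(0) if self._pending else []
        return bool(self._pending)

    def store_data(self) -> None:
        # Stand-in for sending the fetched objects to storage.
        self.stored.extend(self._current)
        self._current = []


def fetch_and_store(loader: ChunkedLoader) -> None:
    """Call fetch_data() repeatedly until it reports nothing is left."""
    while True:
        more_to_fetch = loader.fetch_data()
        loader.store_data()
        if not more_to_fetch:
            break


loader = ChunkedLoader([["rev1", "rev2"], ["rev3"]])
fetch_and_store(loader)
```

Storing after every fetch keeps memory use bounded by a single batch, which is the main reason to let fetch_data() signal that it needs another call.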

store_data()[source]

Store fetched data in the database.

Should call the maybe_load_xyz() methods, which handle the bundles sent to storage, rather than sending data directly.

store_metadata() None[source]

Store fetched metadata in the database.

For more information, see implementation in DepositLoader.

load_status() Dict[str, str][source]

Detailed loading status.

Defaults to logging an eventful load.

Returns

a dictionary that is eventually passed back as the task’s result to the scheduler, allowing tuning of the task recurrence mechanism.
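
A subclass might override this method along the following lines. This is a sketch only: StatusSketch is a hypothetical class, and the "status" key with its "eventful" / "uneventful" values are assumptions made for illustration, since this page only specifies a Dict[str, str] return type.

```python
from typing import Dict


class StatusSketch:
    """Hypothetical loader fragment showing a load_status() override."""

    def __init__(self, eventful: bool) -> None:
        self._eventful = eventful

    def load_status(self) -> Dict[str, str]:
        # Assumed key and values; the result must be a flat str-to-str dict
        # so it can be handed back as the task result.
        return {"status": "eventful" if self._eventful else "uneventful"}


changed = StatusSketch(eventful=True).load_status()
unchanged = StatusSketch(eventful=False).load_status()
```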

post_load(success: bool = True) None[source]

Permit the loader to do some additional actions according to status after the loading is done. The flag success indicates the loading’s status.

Defaults to doing nothing.

It is up to the implementer of this method to make sure it does not break.

Parameters

success (bool) – the success status of the loading

visit_status() str[source]

Detailed visit status.

Defaults to logging a full visit.

pre_cleanup() None[source]

As a first step, try to check for dangling data to clean up. This should do its best to avoid raising issues.

load() Dict[str, str][source]

Loading logic for the loader to follow:

class swh.loader.core.loader.DVCSLoader(storage: swh.storage.interface.StorageInterface, logging_class: Optional[str] = None, save_data_path: Optional[str] = None, max_content_size: Optional[int] = None)[source]

Bases: swh.loader.core.loader.BaseLoader

This base class is a pattern for dvcs loaders (e.g. git, mercurial).

Those loaders are able to load all the data in one go. For example, the BulkUpdater loader defined in swh-loader-git.

For other loaders (stateful ones, e.g. SWHSvnLoader), inherit directly from BaseLoader.

cleanup() None[source]

Clean up any state installed for computations.

has_contents() bool[source]

Checks whether we need to load contents

get_contents() Iterable[swh.model.model.BaseContent][source]

Get the contents that need to be loaded

has_directories() bool[source]

Checks whether we need to load directories

get_directories() Iterable[swh.model.model.Directory][source]

Get the directories that need to be loaded

has_revisions() bool[source]

Checks whether we need to load revisions

get_revisions() Iterable[swh.model.model.Revision][source]

Get the revisions that need to be loaded

has_releases() bool[source]

Checks whether we need to load releases

get_releases() Iterable[swh.model.model.Release][source]

Get the releases that need to be loaded

visit_date: Optional[datetime.datetime]
origin: Optional[swh.model.model.Origin]
origin_metadata: Dict[str, Any]
loaded_snapshot_id: Optional[bytes]
visit_type: Optional[str]
get_snapshot() swh.model.model.Snapshot[source]

Get the snapshot that needs to be loaded

eventful() bool[source]

Whether the load was eventful

store_data() None[source]

Store fetched data in the database.

Should call the maybe_load_xyz() methods, which handle the bundles sent to storage, rather than sending data directly.
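
The has_*() / get_*() pairs above suggest a store_data() that checks each object type before sending it. The sketch below shows that pattern with stand-ins: RecordingStorage and OneShotLoader are hypothetical, objects are plain strings rather than swh.model instances, and the content, directory, revision, release, snapshot ordering is an assumption.

```python
from typing import Dict, Iterable, List


class RecordingStorage:
    """Hypothetical storage stand-in that records what it receives."""

    def __init__(self) -> None:
        self.received: Dict[str, List[str]] = {}

    def add(self, object_type: str, objects: Iterable[str]) -> None:
        self.received.setdefault(object_type, []).extend(objects)


class OneShotLoader:
    """Hypothetical loader that has contents and directories only."""

    def __init__(self, storage: RecordingStorage) -> None:
        self.storage = storage

    def has_contents(self) -> bool:
        return True

    def get_contents(self) -> Iterable[str]:
        return ["content-a", "content-b"]

    def has_directories(self) -> bool:
        return True

    def get_directories(self) -> Iterable[str]:
        return ["dir-a"]

    def has_revisions(self) -> bool:
        return False

    def get_revisions(self) -> Iterable[str]:
        return []

    def has_releases(self) -> bool:
        return False

    def get_releases(self) -> Iterable[str]:
        return []

    def get_snapshot(self) -> str:
        return "snapshot-1"

    def store_data(self) -> None:
        # Only fetch and send an object type when has_*() says it is needed.
        if self.has_contents():
            self.storage.add("content", self.get_contents())
        if self.has_directories():
            self.storage.add("directory", self.get_directories())
        if self.has_revisions():
            self.storage.add("revision", self.get_revisions())
        if self.has_releases():
            self.storage.add("release", self.get_releases())
        # The snapshot is sent last, once the objects it refers to are stored.
        self.storage.add("snapshot", [self.get_snapshot()])


storage = RecordingStorage()
OneShotLoader(storage).store_data()
```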