swh.loader.svn.loader module

Loader in charge of injecting either new or existing svn mirrors into swh-storage.

class swh.loader.svn.loader.SvnLoader(storage: swh.storage.interface.StorageInterface, url: str, origin_url: Optional[str] = None, visit_date: Optional[datetime.datetime] = None, incremental: bool = True, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)[source]

Bases: swh.loader.core.loader.BaseLoader

SVN loader. The repository is either remote or local. The loader also handles updates to a previously loaded repository.

Load a svn repository (either remote or local).

Parameters
  • url – The default origin url

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, the default, starts from the last snapshot (if any). Otherwise, starts from the initial commit of the repository.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence

visit_type: str = 'svn'
pre_cleanup()[source]

Cleanup potential dangling files from prior runs (e.g. OOM killed tasks)
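
As an illustration of the idea, here is a minimal sketch of removing dangling working directories, assuming they live under temp_directory and share a recognizable name prefix. The helper name `remove_dangling_dirs` and the prefix `swh.loader.svn.` are hypothetical; the actual loader may apply further checks before deleting anything.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def remove_dangling_dirs(temp_directory: str, prefix: str) -> List[str]:
    """Delete leftover working directories whose name starts with `prefix`.

    Returns the sorted names of the directories that were removed.
    """
    removed = []
    for entry in Path(temp_directory).iterdir():
        if entry.is_dir() and entry.name.startswith(prefix):
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry.name)
    return sorted(removed)


def demo() -> List[str]:
    # Simulate a previous run that left one working directory behind.
    with tempfile.TemporaryDirectory() as root:
        (Path(root) / "swh.loader.svn.1234").mkdir()
        (Path(root) / "unrelated").mkdir()
        # The prefix is an assumption made for this sketch.
        return remove_dangling_dirs(root, prefix="swh.loader.svn.")
```

Only directories matching the prefix are touched, so unrelated content in a shared temporary directory is left alone.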

cleanup()[source]

Clean up the svn repository’s working representation on disk.

swh_revision_hash_tree_at_svn_revision(revision: int) → bytes[source]

Compute and return the hash tree at a given svn revision.

Parameters

revision – the svn revision we want to check

Returns

The hash tree directory as bytes.

build_swh_revision(rev: int, commit: Dict, dir_id: bytes, parents: Sequence[bytes]) → swh.model.model.Revision[source]

Build the swh revision dictionary.

This adds:

  • the ‘synthetic’ flag, set to true

  • the ‘extra_headers’ containing the repository’s uuid and the svn revision number.

Parameters
  • rev – the svn revision number

  • commit – the commit data: revision id, date, author, and message

  • dir_id – the upper tree’s hash identifier

  • parents – the parents’ identifiers

Returns

The swh revision corresponding to the svn revision.
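
The shape of the resulting revision can be sketched with a plain dictionary. This is only an illustration: the function name `build_revision_dict`, the commit keys (`message`, `author`, `date`) and the header names are assumptions made for the example, and the real method returns a swh.model.model.Revision object rather than a dict.

```python
from typing import Any, Dict, Sequence


def build_revision_dict(
    rev: int,
    commit: Dict[str, Any],
    repo_uuid: bytes,
    dir_id: bytes,
    parents: Sequence[bytes],
) -> Dict[str, Any]:
    """Assemble a revision-like dictionary for one svn revision."""
    return {
        # Commit keys below are hypothetical names for this sketch.
        "message": commit["message"],
        "author": commit["author"],
        "date": commit["date"],
        "directory": dir_id,
        "parents": tuple(parents),
        # The revision is flagged as synthetic.
        "synthetic": True,
        # Header names are assumptions; they carry the repository's uuid
        # and the svn revision number.
        "extra_headers": (
            (b"svn_repo_uuid", repo_uuid),
            (b"svn_revision", str(rev).encode()),
        ),
    }
```

Keeping the svn revision number in the extra headers is what lets a later visit recover which svn revision a stored revision corresponds to.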

check_history_not_altered(revision_start: int, swh_rev: swh.model.model.Revision) → bool[source]

Given a svn repository, check if the history was modified in between visits.

start_from() → Tuple[int, int][source]

Determine from where to start the loading.

Returns

tuple (revision_start, revision_end)
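
A simplified sketch of how such a range could be chosen is shown below. It assumes the head revision and the last loaded revision are already known, and it resumes right after the last loaded revision; whether the real loader resumes at or after that revision is not stated here. The parameters `head_revision` and `last_loaded_revision` are hypothetical, since the actual method takes no arguments and reads the loader's state.

```python
from typing import Optional, Tuple


def start_from(
    head_revision: int,
    last_loaded_revision: Optional[int],
    incremental: bool = True,
) -> Tuple[int, int]:
    """Return (revision_start, revision_end) for the next load."""
    if incremental and last_loaded_revision is not None:
        # Resume right after the revision recorded by the previous visit
        # (an assumption of this sketch).
        revision_start = last_loaded_revision + 1
    else:
        # No usable previous visit: start from the first svn revision.
        revision_start = 1
    return revision_start, head_revision
```

For example, `start_from(10, 4)` gives `(5, 10)`, while `start_from(10, 4, incremental=False)` gives `(1, 10)`.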

Raises
process_svn_revisions(svnrepo, revision_start, revision_end) → Iterator[Tuple[List[swh.model.model.Content], List[swh.model.model.SkippedContent], List[swh.model.model.Directory], swh.model.model.Revision]][source]

Process svn revisions from revision_start to revision_end.

At each svn revision, apply the new diffs and simultaneously compute the swh hashes. The computed swh objects are yielded as a tuple (contents, skipped_contents, directories, revision).

Note that every self.check_revision revisions, a supplementary check looks for hash-tree divergence (related to T570).

Yields

tuple (contents, skipped_contents, directories, revision) of swh model objects carrying the computed hashes (sha1_git, sha1, etc.)

Raises

ValueError in case of a hash divergence detection
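
The periodic divergence check can be sketched as a generator. This is a simplified stand-in, not the loader's code: `compute_hash` and `recompute_hash` are hypothetical callables standing for the incremental hash computation and the from-scratch check, and the sketch assumes that check_revision set to 0 disables the check and that the interval is counted from the first processed revision.

```python
from typing import Callable, Iterator, Tuple


def process_revisions(
    revision_start: int,
    revision_end: int,
    check_revision: int,
    compute_hash: Callable[[int], bytes],
    recompute_hash: Callable[[int], bytes],
) -> Iterator[Tuple[int, bytes]]:
    """Yield (svn revision, directory hash), checking for divergence periodically."""
    revisions = range(revision_start, revision_end + 1)
    for count, rev in enumerate(revisions, start=1):
        dir_id = compute_hash(rev)
        # Assumption: check_revision == 0 means "never run the extra check".
        if check_revision and count % check_revision == 0:
            if recompute_hash(rev) != dir_id:
                raise ValueError(f"Hash tree divergence detected at revision {rev}")
        yield rev, dir_id
```

A larger check_revision value trades earlier detection of divergence for fewer expensive recomputations.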

svn_repo(*args, **kwargs)[source]

Wraps the creation of SvnRepo object and handles not found repository errors.

prepare()[source]

Second step executed by the loader to prepare some state needed by the loader.

Raises

NotFound exception if the origin to ingest is not found.

fetch_data()[source]

Fetch svn revision information.

This applies each svn revision as a patch on disk and, at the same time, computes the swh hashes.

In effect, fetch_data fetches that data and computes the necessary swh objects, which are then stored in the internal state instance variables (initialized in _prepare_state).

It is up to store_data to actually communicate with the storage to store those objects.

Returns

True to continue fetching data (next svn revision), False to stop.

Return type

bool

store_data()[source]

Store the data accumulated in the internal instance variables. If the iteration over the svn revisions is done, create the snapshot and flush the data to storage.

This also resets the internal instance variable state.
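
The interplay between fetch_data and store_data can be sketched with a toy loader. The class `ToyLoader` and the driver `run` are hypothetical stand-ins that use plain lists instead of a storage backend; they only illustrate the fetch, store, reset cycle described above, and the real driving loop lives in the base loader.

```python
from typing import List


class ToyLoader:
    """Stand-in loader that accumulates items, then flushes them to a list."""

    def __init__(self, revisions: List[int]) -> None:
        self._pending = iter(revisions)
        self._buffer: List[int] = []
        self.stored: List[int] = []

    def fetch_data(self) -> bool:
        """Fetch one revision; return False once the source is exhausted."""
        rev = next(self._pending, None)
        if rev is None:
            return False
        self._buffer.append(rev)
        return True

    def store_data(self) -> None:
        """Flush the accumulated data, then reset the internal state."""
        self.stored.extend(self._buffer)
        self._buffer = []


def run(loader: ToyLoader) -> None:
    """Alternate fetching and storing until fetch_data reports it is done."""
    while True:
        more = loader.fetch_data()
        loader.store_data()
        if not more:
            break
```

Calling store_data after every fetch keeps the amount of data held in memory bounded by one fetch step.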

generate_and_load_snapshot(revision: Optional[swh.model.model.Revision] = None, snapshot: Optional[swh.model.model.Snapshot] = None) → swh.model.model.Snapshot[source]

Create the snapshot from either an existing revision or an existing snapshot.

The revision (supposedly new) has priority over the snapshot (supposedly the existing one).

Parameters
  • revision (Optional[Revision]) – Last revision seen if any (None by default)

  • snapshot (Optional[Snapshot]) – Snapshot to use if any (None by default)

Returns

Optional[Snapshot] The newly created snapshot
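
The priority rule can be sketched with plain dictionaries. The function name `pick_snapshot`, the `HEAD` branch name, the dictionary layout and the empty-snapshot fallback when neither argument is given are all assumptions made for this example; the real method works with swh.model objects and also loads the snapshot into storage.

```python
from typing import Any, Dict, Optional


def pick_snapshot(
    revision_id: Optional[bytes] = None,
    snapshot: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a snapshot-like dict, preferring a new revision over an old snapshot."""
    if revision_id is not None:
        # A (supposedly new) revision wins: point a single branch at it.
        return {
            "branches": {
                b"HEAD": {"target": revision_id, "target_type": "revision"},
            }
        }
    if snapshot is not None:
        # Nothing new was seen: reuse the existing snapshot as is.
        return snapshot
    # Assumption: with neither input, fall back to an empty snapshot.
    return {"branches": {}}
```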

load_status()[source]

Detailed loading status.

Defaults to logging an eventful load.

Returns: a dictionary that is eventually passed back as the task’s result to the scheduler, allowing tuning of the task recurrence mechanism.

visit_status()[source]

Detailed visit status.

Defaults to logging a full visit.

post_load(success: bool = True) → None[source]

Permit the loader to do some additional actions according to status after the loading is done. The flag success indicates the loading’s status.

Defaults to doing nothing.

It is up to the implementer of this method to make sure it does not break.

Parameters

success (bool) – the success status of the loading

origin: swh.model.model.Origin
loaded_snapshot_id: Optional[bytes]
parent_origins: Optional[List[swh.model.model.Origin]]

If the given origin is a “forge fork” (i.e. created with the “Fork” button of GitHub-like forges), build_extrinsic_origin_metadata() sets this to a list of origins it was forked from; closest parent first.

class swh.loader.svn.loader.SvnLoaderFromDumpArchive(storage: swh.storage.interface.StorageInterface, url: str, archive_path: str, origin_url: Optional[str] = None, incremental: bool = False, visit_date: Optional[datetime.datetime] = None, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)[source]

Bases: swh.loader.svn.loader.SvnLoader

Uncompress an archive containing an svn dump, mount the svn dump as a local svn repository and load that repository.

Load a svn repository (either remote or local).

Parameters
  • url – The default origin url

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, starts from the last snapshot (if any). Otherwise, the default for this class, starts from the initial commit of the repository.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence

prepare()[source]

Second step executed by the loader to prepare some state needed by the loader.

Raises

NotFound exception if the origin to ingest is not found.

cleanup()[source]

Clean up the svn repository’s working representation on disk.

origin: swh.model.model.Origin
loaded_snapshot_id: Optional[bytes]
parent_origins: Optional[List[swh.model.model.Origin]]

If the given origin is a “forge fork” (i.e. created with the “Fork” button of GitHub-like forges), build_extrinsic_origin_metadata() sets this to a list of origins it was forked from; closest parent first.

snapshot: Optional[swh.model.model.Snapshot]
latest_revision: Optional[swh.model.model.Revision]
class swh.loader.svn.loader.SvnLoaderFromRemoteDump(storage: swh.storage.interface.StorageInterface, url: str, origin_url: Optional[str] = None, incremental: bool = True, visit_date: Optional[datetime.datetime] = None, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)[source]

Bases: swh.loader.svn.loader.SvnLoader

Create a subversion repository dump out of a remote svn repository (using the svnrdump utility). Then, mount the repository locally and load that repository.

Load a svn repository (either remote or local).

Parameters
  • url – The default origin url

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, the default, starts from the last snapshot (if any). Otherwise, starts from the initial commit of the repository.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence

get_last_loaded_svn_rev(svn_url: str) → int[source]

Check if the svn repository has already been visited and return the last loaded svn revision number or -1 otherwise.
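
The -1 convention can be sketched as follows, assuming the last loaded revision is available as a dictionary whose extra headers record the svn revision number. The function name `last_loaded_svn_rev`, the dictionary layout and the `svn_revision` header name are assumptions for this example; the real method takes the svn url and looks the information up itself.

```python
from typing import Any, Dict, Optional


def last_loaded_svn_rev(latest_revision: Optional[Dict[str, Any]]) -> int:
    """Return the svn revision number of the last loaded revision, or -1."""
    if latest_revision is None:
        # The repository was never visited (or nothing was loaded).
        return -1
    # Header name is an assumption made for this sketch.
    headers = dict(latest_revision.get("extra_headers", ()))
    raw = headers.get(b"svn_revision")
    return int(raw.decode()) if raw is not None else -1
```

Returning -1 rather than None keeps the result usable in integer comparisons without a separate presence check.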

dump_svn_revisions(svn_url: str, last_loaded_svn_rev: int = -1) → str[source]

Generate a subversion dump file using the svnrdump tool. If the svnrdump command fails, the produced dump file is analyzed to determine whether a partial loading is still feasible.

Raises

NotFound when the repository is no longer found at url

Returns

The dump_path of the repository mounted
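
A minimal sketch of building such an svnrdump invocation is shown below. It only assembles the argument list and does not run anything. The helper name `svnrdump_command` and the choice to request revisions after the last loaded one are assumptions; the exact options used by the loader are not stated here.

```python
from typing import List


def svnrdump_command(svn_url: str, last_loaded_svn_rev: int = -1) -> List[str]:
    """Build an svnrdump invocation covering the revisions not yet loaded."""
    cmd = ["svnrdump", "dump", "--quiet"]
    if last_loaded_svn_rev >= 0:
        # Assumption: only request revisions after the last one already loaded.
        cmd += ["--revision", f"{last_loaded_svn_rev + 1}:HEAD"]
    cmd.append(svn_url)
    return cmd
```

Since svnrdump writes the dump to standard output, a caller would typically pass this list to `subprocess.run` with `stdout` redirected to the dump file, then inspect the return code to decide whether the dump is complete.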

prepare()[source]

Second step executed by the loader to prepare some state needed by the loader.

Raises

NotFound exception if the origin to ingest is not found.

cleanup()[source]

Clean up the svn repository’s working representation on disk.

visit_status()[source]

Detailed visit status.

Defaults to logging a full visit.

origin: swh.model.model.Origin
loaded_snapshot_id: Optional[bytes]
parent_origins: Optional[List[swh.model.model.Origin]]

If the given origin is a “forge fork” (i.e. created with the “Fork” button of GitHub-like forges), build_extrinsic_origin_metadata() sets this to a list of origins it was forked from; closest parent first.

snapshot: Optional[swh.model.model.Snapshot]
latest_revision: Optional[swh.model.model.Revision]