swh.loader.svn.loader module

Loader in charge of injecting either new or existing svn mirrors into swh-storage.

class swh.loader.svn.loader.SvnLoader(storage: StorageInterface, url: str, origin_url: str | None = None, visit_date: datetime | None = None, incremental: bool = True, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)

Bases: BaseLoader

SVN loader. The repository is either remote or local. The loader also handles updates to a previously loaded repository.

Load a svn repository (either remote or local).

Parameters:
  • url – The default origin url

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, the default, starts from the last snapshot (if any). Otherwise, starts from the initial commit of the repository.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence
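
The origin_url fallback described in the parameter list can be sketched as follows. This is a minimal stand-alone illustration, not the loader's own code; `resolve_origin` is a hypothetical helper name that does not exist in swh.loader.svn.

```python
from typing import Optional


def resolve_origin(url: str, origin_url: Optional[str] = None) -> str:
    """Return the URL recorded as origin in the archive.

    Hypothetical helper: mirrors the documented rule that ``url`` is used
    as origin when no ``origin_url`` override is given.
    """
    return origin_url if origin_url is not None else url
```

For example, a repository loaded from a local mirror path could still be recorded under its public URL by passing that public URL as origin_url.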

visit_type: str = 'svn'

pre_cleanup()

Clean up potential dangling files from prior runs (e.g. OOM-killed tasks).

cleanup()

Clean up the svn repository’s working representation on disk.
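
How the debug flag interacts with cleanup can be sketched as below. This is an illustration under stated assumptions, not the loader's implementation: both function names and the directory prefix are hypothetical, and only the documented behavior (the working directory is kept when debug is true) is taken from the text above.

```python
import os
import shutil
import tempfile


def make_working_dir(temp_directory: str = "/tmp") -> str:
    """Create a working directory under ``temp_directory``.

    The prefix is an assumption chosen for the example.
    """
    return tempfile.mkdtemp(prefix="swh.loader.svn.", dir=temp_directory)


def cleanup_working_dir(path: str, debug: bool = False) -> None:
    """Remove the working directory unless debug mode asks to keep it."""
    if debug:
        # Keep the files on disk to ease inspection after the load.
        return
    shutil.rmtree(path, ignore_errors=True)
```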

swh_revision_hash_tree_at_svn_revision(revision: int) → Directory

Compute and return the hash tree at a given svn revision.

Parameters:

revision – the svn revision we want to check

Returns:

The hash tree at that revision, as a Directory.

build_swh_revision(rev: int, commit: Dict, dir_id: bytes, parents: Sequence[bytes]) → Revision

Build the swh revision dictionary.

This adds:

  • the ‘synthetic’ flag, set to true

  • the ‘extra_headers’, containing the repository’s uuid and the svn revision number.

Parameters:
  • rev – the svn revision number

  • commit – the commit data: revision id, date, author, and message

  • dir_id – the upper tree’s hash identifier

  • parents – the parents’ identifiers

Returns:

The swh revision corresponding to the svn revision.
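
The two additions listed above can be pictured with a plain dictionary. This is a rough sketch only: the real method returns a Revision object, and the dictionary keys, the commit keys and the header names (`svn_repo_uuid`, `svn_revision`) used here are assumptions made for the example, as is the `repo_uuid` argument.

```python
from typing import Any, Dict, Sequence


def sketch_revision(
    rev: int,
    commit: Dict[str, Any],
    repo_uuid: bytes,
    dir_id: bytes,
    parents: Sequence[bytes],
) -> Dict[str, Any]:
    """Assemble a revision-like dictionary (illustrative field names)."""
    return {
        "message": commit["message"],
        "author": commit["author"],
        "date": commit["date"],
        "directory": dir_id,
        "parents": tuple(parents),
        # The revision is derived from svn data, hence flagged as synthetic.
        "synthetic": True,
        # Assumed header names carrying the svn provenance.
        "extra_headers": (
            (b"svn_repo_uuid", repo_uuid),
            (b"svn_revision", str(rev).encode()),
        ),
    }
```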

check_history_not_altered(revision_start: int, swh_rev: Revision) → bool

Given a svn repository, check if the history was modified in between visits.

start_from() → Tuple[int, int]

Determine from where to start the loading.

Returns:

tuple (revision_start, revision_end)

Raises:

process_svn_revisions(svnrepo, revision_start, revision_end) → Iterator[Tuple[List[Content], List[SkippedContent], List[Directory], Revision]]

Process svn revisions from revision_start to revision_end.

At each svn revision, apply new diffs and simultaneously compute swh hashes. This yields the computed swh objects as a tuple (contents, skipped_contents, directories, revision).

Note that every self.check_revision revisions, an additional check for hash-tree divergence takes place (related to T570).

Yields:

A tuple (contents, skipped_contents, directories, revision): lists of Content, SkippedContent and Directory objects, followed by the Revision.

Raises:

ValueError in case of a hash divergence detection
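
The periodic divergence check can be sketched as a generator. This is a simplified model, not the loader's code: the two hash callables are hypothetical stand-ins, the revision range is assumed to be inclusive, and treating check_revision = 0 as "never check" is an assumption based on the default value.

```python
from typing import Callable, Iterator, Tuple


def iter_checked_revisions(
    revision_start: int,
    revision_end: int,
    check_revision: int,
    computed_hash: Callable[[int], bytes],
    reference_hash: Callable[[int], bytes],
) -> Iterator[Tuple[int, bytes]]:
    """Yield (revision, tree hash), verifying every ``check_revision`` revisions."""
    revisions = range(revision_start, revision_end + 1)
    for count, rev in enumerate(revisions, start=1):
        tree_hash = computed_hash(rev)
        # Only pay for the supplementary check once per check_revision commits.
        if check_revision and count % check_revision == 0:
            if tree_hash != reference_hash(rev):
                raise ValueError(
                    f"Hash tree divergence detected at svn revision {rev}"
                )
        yield rev, tree_hash
```

With check_revision set to 2, revisions 1 to 5 would be verified at the second and fourth processed revisions only.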

prepare()

Second step executed by the loader to prepare some state needed by the loader.

Raises:

NotFound exception if the origin to ingest is not found.

fetch_data()

Fetch svn revision information.

This applies each svn revision as a patch on disk and, at the same time, computes the swh hashes.

In effect, fetch_data fetches the data and computes the necessary swh objects, which are then stored in the internal state instance variables (initialized in _prepare_state).

It is up to store_data to actually communicate with the storage to store those objects.

Returns:

True to continue fetching data (next svn revision), False to stop.

Return type:

bool

store_data()

Store the data accumulated in the internal instance variables. If the iteration over the svn revisions is done, create the snapshot and flush the data to storage.

This also resets the state of the internal instance variables.
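
The interplay between fetch_data and store_data can be modelled with a toy class. This is a sketch under assumptions: `ToyLoader` is a hypothetical stand-in rather than the swh BaseLoader, and the driving loop in `load` is an assumed shape of the caller, included only to show how the boolean returned by fetch_data is consumed.

```python
from typing import List


class ToyLoader:
    """Hypothetical stand-in for a loader; not the swh BaseLoader."""

    def __init__(self, revisions: List[int]) -> None:
        self._pending = list(revisions)
        self._buffer: List[int] = []  # internal state filled by fetch_data
        self._done = False
        self.stored: List[int] = []
        self.snapshot_created = False

    def fetch_data(self) -> bool:
        """Process one revision; return True while more remain."""
        if self._pending:
            self._buffer.append(self._pending.pop(0))
        self._done = not self._pending
        return not self._done

    def store_data(self) -> None:
        """Flush the buffered objects and reset the internal state."""
        self.stored.extend(self._buffer)
        self._buffer = []
        if self._done:
            # Iteration over the revisions is over: create the snapshot.
            self.snapshot_created = True

    def load(self) -> None:
        """Assumed driving loop: alternate fetching and storing."""
        while True:
            more = self.fetch_data()
            self.store_data()
            if not more:
                break
```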

generate_and_load_snapshot(revision: Revision | None = None, snapshot: Snapshot | None = None) → Snapshot

Create the snapshot either from existing revision or snapshot.

Revision (supposedly new) has priority over the snapshot (supposedly existing one).

Parameters:
  • revision (Revision | None) – Last revision seen, if any (None by default)

  • snapshot (Snapshot | None) – Snapshot to use, if any (None by default)

Returns:

The newly created snapshot.
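
The priority rule (a revision wins over an existing snapshot) can be illustrated with plain dictionaries. This is a sketch only: the real method works on Revision and Snapshot objects and also loads the result into storage, and the function name, the branch name `HEAD` and the dictionary layout here are assumptions.

```python
from typing import Any, Dict, Optional


def pick_snapshot_source(
    revision: Optional[Dict[str, Any]] = None,
    snapshot: Optional[Dict[str, Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Build a snapshot from the revision if given, else reuse the snapshot."""
    if revision is not None:
        # A (supposedly new) revision has priority over an existing snapshot.
        return {
            "branches": {
                b"HEAD": {"target": revision["id"], "target_type": "revision"}
            }
        }
    return snapshot
```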

load_status()

Detailed loading status.

Defaults to logging an eventful load.

Returns:

A dictionary that is eventually passed back as the task’s result to the scheduler, allowing tuning of the task recurrence mechanism.

visit_status()

Detailed visit status.

Defaults to logging a full visit.

post_load(success: bool = True) → None

Lets the loader perform additional actions, depending on the status, once the loading is done. The success flag indicates the loading’s status.

Defaults to doing nothing.

It is up to the implementer of this method to make sure it does not break.

Parameters:

success (bool) – the success status of the loading

class swh.loader.svn.loader.SvnLoaderFromDumpArchive(storage: StorageInterface, url: str, archive_path: str, origin_url: str | None = None, incremental: bool = False, visit_date: datetime | None = None, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)

Bases: SvnLoader

Uncompress an archive containing an svn dump, mount the svn dump as a local svn repository and load that repository.

Load a svn repository (either remote or local).

Parameters:
  • url – The default origin url

  • archive_path – Path to the archive containing the svn dump

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, starts from the last snapshot (if any). Otherwise, starts from the initial commit of the repository. Defaults to False for this class.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence

prepare()

Second step executed by the loader to prepare some state needed by the loader.

Raises:

NotFound exception if the origin to ingest is not found.

cleanup()

Clean up the svn repository’s working representation on disk.

class swh.loader.svn.loader.SvnLoaderFromRemoteDump(storage: StorageInterface, url: str, origin_url: str | None = None, incremental: bool = True, visit_date: datetime | None = None, temp_directory: str = '/tmp', debug: bool = False, check_revision: int = 0, **kwargs: Any)

Bases: SvnLoader

Create a subversion repository dump out of a remote svn repository (using the svnrdump utility). Then, mount the repository locally and load that repository.

Load a svn repository (either remote or local).

Parameters:
  • url – The default origin url

  • origin_url – Optional original url override to use as origin reference in the archive. If not provided, “url” is used as origin.

  • visit_date – Optional date to override the visit date

  • incremental – If True, the default, starts from the last snapshot (if any). Otherwise, starts from the initial commit of the repository.

  • temp_directory – The temporary directory to use as root directory for working directory computations

  • debug – If true, run the loader in debug mode. At the end of the loading, the temporary working directory is not cleaned up to ease inspection. Defaults to false.

  • check_revision – The number of svn commits between checks for hash divergence

get_last_loaded_svn_rev(svn_url: str) → int

Check if the svn repository has already been visited and return the last loaded svn revision number or -1 otherwise.

dump_svn_revisions(svn_url: str, last_loaded_svn_rev: int = -1) → Tuple[str, int]

Generate a compressed subversion dump file using the svnrdump tool and gzip. If the svnrdump command fails, the produced dump file is analyzed to determine whether a partial loading is still feasible.

Raises:

NotFound when the repository is no longer found at url

Returns:

The dump_path of the repository mounted and the max dumped revision number (-1 if all revisions were dumped)
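
Producing a gzip-compressed dump with svnrdump can be sketched as follows. This is an illustration, not the loader's actual command line: both function names are hypothetical, and resuming at the revision right after last_loaded_svn_rev is an assumption. Only the `dump` subcommand and the `-q` and `-r` options of svnrdump are used.

```python
import gzip
import shutil
import subprocess
from typing import List


def build_svnrdump_command(svn_url: str, last_loaded_svn_rev: int = -1) -> List[str]:
    """Build an svnrdump command line (illustrative flag selection)."""
    cmd = ["svnrdump", "dump", "-q"]
    if last_loaded_svn_rev >= 0:
        # Assumption: resume right after the last revision already loaded.
        cmd += ["-r", f"{last_loaded_svn_rev + 1}:HEAD"]
    cmd.append(svn_url)
    return cmd


def dump_to_gzip(svn_url: str, dump_path: str, last_loaded_svn_rev: int = -1) -> int:
    """Stream the svnrdump output into a gzip file; return the exit code."""
    cmd = build_svnrdump_command(svn_url, last_loaded_svn_rev)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        assert proc.stdout is not None
        with gzip.open(dump_path, "wb") as out:
            shutil.copyfileobj(proc.stdout, out)
    return proc.returncode
```

A non-zero exit code corresponds to the failure case described above, in which the partially written dump file would have to be inspected before deciding whether a partial load is possible. That inspection is not part of this sketch.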

prepare()

Second step executed by the loader to prepare some state needed by the loader.

Raises:

NotFound exception if the origin to ingest is not found.

cleanup()

Clean up the svn repository’s working representation on disk.

visit_status()

Detailed visit status.

Defaults to logging a full visit.