VCS Loader Overview

In this overview, we will see how to write a loader for Software Heritage that loads artifacts from a Version Control System, such as Git, Mercurial, or Subversion

First, you should be familiar with Python, unit-testing, Software Heritage’s Data model and Software Architecture, and go through the Developer setup.

As seen in the swh-loader-core homepage, SWH loaders can be sorted into two large categories: Package Loaders and VCS loaders.

This page is an overview of how to write a VCS loader. This is not a tutorial, because VCS loaders are hooked deeply into their respective VCS’ internals; unlike Package Loaders which are somewhat uniform (list tarballs, download tarballs, load content of tarball, done).

Architecture

A loader is a Python package, usually a subpackage of swh.loader but in its own directory (eg. swh-loader-git/swh/loader/git, as swh.loader is a namespace package), based on the swh-py-template repository.

It has at least one entrypoint, declared in setup.py to be recognized by swh-loader-core:

entry_points="""
    [swh.workers]
    loader.newloader=swh.loader.newloader:register
""",

This entrypoint declares the task name (to be run by SWH Celery workers) and the loader class. For example, for the Subversion loader:

from typing import Any, Dict

def register() -> Dict[str, Any]:
    from swh.loader.svn.loader import SvnLoader

    return {
        "task_modules": ["%s.tasks" % __name__],
        "loader": SvnLoader,
    }

The bulk of the work is done by the returned loader class: it loads artifacts from the upstream VCS and writes them to the Software Heritage archive. Because of the heterogeneity of VCS loaders, it has a lot of freedom in how to achieve this. Once the initial setup is done (see the next section), its load method is called, and it is expected to do all this work as a black box.

Base classes

All loaders inherit from swh.loader.core.loader.BaseLoader, which takes care of all the SWH-specific setup and finalization:

It also provides a default implementation of the load method, which takes care of:

  • calling its fetch_data (from the VCS) and store_data (to SWH) in a loop

  • on error, notifies swh-storage the loading failed, reports the error to the monitoring infrastructure (Sentry), and cleanup

  • on success, cleanup and notify swh-storage the loading succeeded

See its documentation for details.

Distributed VCS loaders will usually want to inherit from its child, swh.loader.core.DVCSLoader, which takes care of implementing store_data. Classes inheriting from DVCSLoader only need to implement fetch_data, and a method for each object type: get_contents, get_directories, get_revisions, get_releases, and get_snapshot, each returning an iterable of the corresponding object from swh.model.model (except get_snapshot, which returns a single one).

If you are writing a DVCS loader, this allows your loader to fetch all the objects locally, then return them lazily on demand.

Incremental loading

Loading a repository from scratch can be costly, so swh-storage provides ways to remember what objects in the repository were already loaded, through extids. They are represented by swh.model.model.ExtID, which is essentially a 3-tuple that contains a SWHID, an id internal to the VCS type, (which is the actual “extid” itself), and the type of this id (eg. hg-nodeid).

When your loader is done loading, it can store extids for some of its objects (eg. the heads/tips of each branch of the snapshot and some intermediate revisions in the history), with swh.storage.interface.StorageInterface.extid_add().

And when it starts loading a known repository, fetches the previous snapshot using swh.storage.algos.snapshot.snapshot_get_latest(), then the extids it stores using swh.storage.interface.StorageInterface.extid_get_from_target() for each of the branch targets. This way, it can find which objects from the origin were already loaded, without having to download them first.

Note

For legacy reasons, the Subversion loader uses an alternative to ExtID, which is to encode the repository UUID and the revision ID (an incremental integer) directly in swh.model.model.Revision.extra_headers.

This is discouraged because it prevents deduplication across repositories, and extra_headers does not have a well-defined schema.

Integrity

Loaders may be interrupted at any point, for various reasons (unhandled crash, out of memory, hardware failure, blocking IO, system or daemon restart, etc.)

Therefore, they must take great care that if a load was interrupted, the next load will finish loading all objects. If they don’t, this may happen:

  1. loader loads revision R, pointing to directory D

  2. loader starts loading D, but crashes before it does

  3. [loader restarts]

  4. loader sees R is already loaded, so it doesn’t load its children

And D will never be loaded.

The solution to this is to load objects in topological order of the DAG.

Another reason to load objects in topological order is that it avoid having “holes” in the graph (aka. dangling references), even temporarily. Holes in the graph cause bad user experiences, when users click a link from an existing object and get a “not found” error.