VCS Loader Overview#
In this overview, we will see how to write a loader for Software Heritage that loads artifacts from a Version Control System, such as Git, Mercurial, or Subversion
First, you should be familiar with Python, unit-testing, Software Heritage’s Data model and Software Architecture, and go through the Developer setup.
As seen in the swh-loader-core homepage, SWH loaders can be sorted into two large categories: Package Loaders and VCS loaders.
This page is an overview of how to write a VCS loader. This is not a tutorial, because VCS loaders are hooked deeply into their respective VCS’ internals; unlike Package Loaders which are somewhat uniform (list tarballs, download tarballs, load content of tarball, done).
Architecture#
A loader is a Python package, usually a subpackage of swh.loader
but in its own directory (eg. swh-loader-git/swh/loader/git
, as swh.loader
is a namespace package), based on the swh-py-template repository.
It has at least one entrypoint, declared in setup.py
to be recognized
by swh-loader-core
:
entry_points="""
[swh.workers]
loader.newloader=swh.loader.newloader:register
""",
This entrypoint declares the task name (to be run by SWH Celery workers) and the loader class. For example, for the Subversion loader:
from typing import Any, Dict
def register() -> Dict[str, Any]:
from swh.loader.svn.loader import SvnLoader
return {
"task_modules": ["%s.tasks" % __name__],
"loader": SvnLoader,
}
The bulk of the work is done by the returned loader
class: it loads
artifacts from the upstream VCS and writes them to the Software Heritage archive.
Because of the heterogeneity of VCS loaders, it has a lot of freedom in how to
achieve this. Once the initial setup is done (see the next section), its load
method is called, and it is expected to do all this work as a black box.
Base classes#
All loaders inherit from swh.loader.core.loader.BaseLoader
, which takes care of
all the SWH-specific setup and finalization:
Reading the configuration
Connecting to the storage database
It also provides a default implementation of the load
method, which takes care of:
calling its
fetch_data
(from the VCS) andstore_data
(to SWH) in a loopon error, notifies swh-storage the loading failed, reports the error to the monitoring infrastructure (Sentry), and cleanup
on success, cleanup and notify swh-storage the loading succeeded
See its documentation
for details.
Distributed VCS loaders will usually want to inherit from its child,
swh.loader.core.DVCSLoader
, which takes care of implementing store_data
.
Classes inheriting from DVCSLoader
only need to implement fetch_data
, and
a method for each object type: get_contents
, get_directories
, get_revisions
,
get_releases
, and get_snapshot
, each returning an iterable of the corresponding
object from swh.model.model
(except get_snapshot
, which returns a single one).
If you are writing a DVCS loader, this allows your loader to fetch all the objects locally, then return them lazily on demand.
Incremental loading#
Loading a repository from scratch can be costly, so swh-storage
provides
ways to remember what objects in the repository were already loaded,
through extids.
They are represented by swh.model.model.ExtID
,
which is essentially a 3-tuple that contains a SWHID, an id internal to the VCS type,
(which is the actual “extid” itself), and the type of this id (eg. hg-nodeid
).
When your loader is done loading, it can store extids for some of its objects
(eg. the heads/tips of each branch of the snapshot and some intermediate
revisions in the history),
with swh.storage.interface.StorageInterface.extid_add()
.
And when it starts loading a known repository, fetches the previous snapshot
using swh.storage.algos.snapshot.snapshot_get_latest()
, then the extids
it stores using swh.storage.interface.StorageInterface.extid_get_from_target()
for each of the branch targets.
This way, it can find which objects from the origin were already loaded,
without having to download them first.
Note
For legacy reasons, the Subversion loader uses an alternative to ExtID,
which is to encode the repository UUID and the revision ID (an incremental integer)
directly in swh.model.model.Revision.extra_headers
.
This is discouraged because it prevents deduplication across repositories,
and extra_headers
does not have a well-defined schema.
Integrity#
Loaders may be interrupted at any point, for various reasons (unhandled crash, out of memory, hardware failure, blocking IO, system or daemon restart, etc.)
Therefore, they must take great care that if a load was interrupted, the next load will finish loading all objects. If they don’t, this may happen:
loader loads revision
R
, pointing to directoryD
loader starts loading
D
, but crashes before it does[loader restarts]
loader sees
R
is already loaded, so it doesn’t load its children
And D
will never be loaded.
The solution to this is to load objects in topological order of the DAG.
Another reason to load objects in topological order is that it avoid having “holes” in the graph (aka. dangling references), even temporarily. Holes in the graph cause bad user experiences, when users click a link from an existing object and get a “not found” error.