swh.loader.git.loader module

class swh.loader.git.loader.RepoRepresentation(storage, base_snapshot: Optional[swh.model.model.Snapshot] = None, ignore_history=False)[source]

Bases: object

Repository representation for a Software Heritage origin.

get_parents(commit: bytes) → List[bytes][source]

This method should return the list of known parents

graph_walker() → dulwich.object_store.ObjectStoreGraphWalker[source]
determine_wants(refs: Dict[bytes, bytes]) → List[bytes][source]

Get the list of bytehex sha1s that the git loader should fetch.

This compares the remote refs sent by the server with the base snapshot provided by the loader.
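As an illustration of the comparison described above, here is a minimal sketch, not the loader's actual implementation. The function name `determine_wants_sketch` and the `known_targets` set (standing in for the targets of the base snapshot) are assumptions made for this example.

```python
from typing import Dict, List, Set


def determine_wants_sketch(
    refs: Dict[bytes, bytes], known_targets: Set[bytes]
) -> List[bytes]:
    """Return the hex sha1s advertised in ``refs`` that are not yet known.

    ``refs`` maps ref names to hex-encoded sha1s, as advertised by the
    remote. ``known_targets`` is a hypothetical set of hex sha1s already
    referenced by the base snapshot.
    """
    wanted: List[bytes] = []
    seen: Set[bytes] = set()
    # Iterate in sorted ref-name order so the result is deterministic.
    for ref_name in sorted(refs):
        target = refs[ref_name]
        if target in known_targets or target in seen:
            # Already archived, or already requested through another ref.
            continue
        seen.add(target)
        wanted.append(target)
    return wanted
```

With an empty `known_targets` (for example when no base snapshot is available), every distinct remote target is returned.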

class swh.loader.git.loader.FetchPackReturn(remote_refs: Dict[bytes, bytes], symbolic_refs: Dict[bytes, bytes], pack_buffer: _io.BytesIO, pack_size: int)[source]

Bases: object

remote_refs: Dict[bytes, bytes]
symbolic_refs: Dict[bytes, bytes]
pack_buffer: BytesIO
pack_size: int
class swh.loader.git.loader.GitLoader(url: str, base_url: Optional[str] = None, ignore_history: bool = False, repo_representation: Type[swh.loader.git.loader.RepoRepresentation] = <class 'swh.loader.git.loader.RepoRepresentation'>, config: Optional[Dict[str, Any]] = None)[source]

Bases: swh.loader.core.loader.DVCSLoader

A bulk loader for a git repository

ADDITIONAL_CONFIG = {'pack_size_bytes': ('int', 4294967296)}
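The default of 4294967296 bytes corresponds to 4 GiB. The sketch below shows one way such a limit could be enforced while pack data is written into an in-memory buffer. The helper `make_bounded_writer` is hypothetical and is not part of the loader; it only mirrors the `Callable[[bytes], None]` shape of a data callback.

```python
from io import BytesIO
from typing import Callable

# Same value as the documented default for 'pack_size_bytes'.
DEFAULT_PACK_SIZE_BYTES = 4 * 1024 ** 3


def make_bounded_writer(buffer: BytesIO, limit: int) -> Callable[[bytes], None]:
    """Return a callback that appends chunks to ``buffer`` up to ``limit`` bytes."""

    def write_chunk(data: bytes) -> None:
        # Refuse the chunk before writing it if it would exceed the limit.
        if buffer.tell() + len(data) > limit:
            raise IOError(f"pack larger than the configured limit of {limit} bytes")
        buffer.write(data)

    return write_chunk
```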
visit_type = 'git'
fetch_pack_from_origin(origin_url: str, base_snapshot: Optional[swh.model.model.Snapshot], do_activity: Callable[[bytes], None]) → swh.loader.git.loader.FetchPackReturn[source]

Fetch a pack from the origin

list_pack(pack_data, pack_size) → Tuple[Dict[bytes, bytes], Dict[bytes, Set[bytes]]][source]
prepare_origin_visit(*args, **kwargs) → None[source]

First step executed by the loader to prepare origin and visit references. Set/update self.origin, and optionally self.origin_url, self.visit_date.

get_full_snapshot(origin_url) → Optional[swh.model.model.Snapshot][source]
prepare(*args, **kwargs) → None[source]

Second step executed by the loader to prepare some state needed by the loader.

fetch_data() → bool[source]

Fetch the data from the source the loader is currently loading (ex: git/hg/svn/… repository).

Returns: a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
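The return-value contract can be illustrated with a toy caller. `ToyLoader` and `run_fetch_loop` are hypothetical names introduced for this sketch; the real driver loop lives in the core loader and does more work between calls.

```python
from typing import List


class ToyLoader:
    """Stand-in loader that hands out its data one batch per call."""

    def __init__(self, batches: List[List[str]]) -> None:
        self._pending = list(batches)
        self.loaded: List[str] = []

    def fetch_data(self) -> bool:
        # Consume one batch, if any is left.
        if self._pending:
            self.loaded.extend(self._pending.pop(0))
        # True means "call me again"; False means loading is complete.
        return bool(self._pending)


def run_fetch_loop(loader: ToyLoader) -> int:
    """Call fetch_data() until it returns a falsy value; return the call count."""
    calls = 1
    while loader.fetch_data():
        calls += 1
    return calls
```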

save_data() → None[source]

Store a pack for archival

get_inflater() → dulwich.pack.PackInflater[source]

Reset the pack buffer and get an object inflater from it

has_contents() → bool[source]

Checks whether we need to load contents

get_content_ids() → Iterable[Dict[str, Any]][source]

Get the content identifiers from the git repository

get_contents() → Iterable[swh.model.model.BaseContent][source]

Format the blobs from the git repository as swh contents

has_directories() → bool[source]

Checks whether we need to load directories

get_directory_ids() → Iterable[bytes][source]

Get the directory identifiers from the git repository

get_directories() → Iterable[swh.model.model.Directory][source]

Format the trees as swh directories

has_revisions() → bool[source]

Checks whether we need to load revisions

get_revision_ids() → Iterable[bytes][source]

Get the revision identifiers from the git repository

get_revisions() → Iterable[swh.model.model.Revision][source]

Format commits as swh revisions

has_releases() → bool[source]

Checks whether we need to load releases

get_release_ids() → Iterable[bytes][source]

Get the release identifiers from the git repository

get_releases() → Iterable[swh.model.model.Release][source]

Retrieve all the release objects from the git repository

get_snapshot() → swh.model.model.Snapshot[source]

Get the snapshot for the current visit.

The main complexity of this function is mapping target objects to their types, as the refs dictionaries returned by the git server only give us the identifiers for the target objects, and not their types.

The loader itself only knows the types of the objects that it has fetched from the server (as it has parsed them while loading them to the archive). As we only fetched an increment between the previous snapshot and the current state of the server, we are missing the type information for the objects that would already have been referenced by the previous snapshot, and that the git server didn’t send us. We infer the type of these objects from the previous snapshot.
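The fallback described above can be sketched as a two-step lookup. This is a simplified illustration under stated assumptions: `infer_target_types`, `fetched_types` and `previous_types` are hypothetical names, and target types are represented as plain strings rather than model objects.

```python
from typing import Dict, Optional


def infer_target_types(
    refs: Dict[bytes, bytes],
    fetched_types: Dict[bytes, str],
    previous_types: Dict[bytes, str],
) -> Dict[bytes, Optional[str]]:
    """Map each ref name to the type of its target object.

    ``fetched_types`` holds the types of objects parsed during this visit;
    ``previous_types`` holds the types recorded in the previous snapshot.
    A ref whose target is found in neither mapping resolves to None.
    """
    resolved: Dict[bytes, Optional[str]] = {}
    for ref_name, target in refs.items():
        # Prefer what was seen in the freshly fetched pack.
        target_type = fetched_types.get(target)
        if target_type is None:
            # Not part of the increment: fall back to the previous snapshot.
            target_type = previous_types.get(target)
        resolved[ref_name] = target_type
    return resolved
```

How a target that remains unresolved should be handled is left open here; the sketch only reports it as None.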

get_fetch_history_result() → Dict[str, int][source]
load_status() → Dict[str, Any][source]

The load was eventful if the current snapshot is different to the one we retrieved at the beginning of the run
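A minimal sketch of that comparison follows. The function name, the use of snapshot identifiers as bytes, and the `"status"` key with its `"eventful"` / `"uneventful"` values are assumptions made for illustration.

```python
from typing import Dict, Optional


def load_status_sketch(
    previous_snapshot_id: Optional[bytes], current_snapshot_id: bytes
) -> Dict[str, str]:
    """Report whether this visit produced a snapshot different from the previous one."""
    # No previous snapshot (None) always counts as a change.
    eventful = previous_snapshot_id != current_snapshot_id
    return {"status": "eventful" if eventful else "uneventful"}
```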