- class swh.loader.git.loader.RepoRepresentation(storage, base_snapshot: Optional[swh.model.model.Snapshot] = None, ignore_history=False)
Repository representation for a Software Heritage origin.
- graph_walker() dulwich.object_store.ObjectStoreGraphWalker
- class swh.loader.git.loader.FetchPackReturn(remote_refs: Dict[bytes, HexBytes], symbolic_refs: Dict[bytes, HexBytes], pack_buffer: tempfile.SpooledTemporaryFile, pack_size: int)
- pack_buffer: tempfile.SpooledTemporaryFile
- class swh.loader.git.loader.GitLoader(storage: swh.storage.interface.StorageInterface, url: str, base_url: Optional[str] = None, ignore_history: bool = False, repo_representation: Type[swh.loader.git.loader.RepoRepresentation] = <class 'swh.loader.git.loader.RepoRepresentation'>, pack_size_bytes: int = 4294967296, temp_file_cutoff: int = 104857600, save_data_path: Optional[str] = None, max_content_size: Optional[int] = None)
A bulk loader for a git repository
Initialize the bulk updater.
repo_representation – swh’s repository representation, which is in charge of filtering between known and remote data.
- fetch_pack_from_origin(origin_url: str, base_repo: swh.loader.git.loader.RepoRepresentation, do_activity: Callable[[bytes], None]) swh.loader.git.loader.FetchPackReturn
Fetch a pack from the origin
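The following is a minimal sketch of how a pack could be buffered under the two size defaults shown in the GitLoader signature (pack_size_bytes = 4294967296, i.e. 4 GiB, and temp_file_cutoff = 104857600, i.e. 100 MiB). It assumes that the cutoff is used as the in-memory threshold of the SpooledTemporaryFile and that the pack size acts as a hard limit; the function name buffer_pack and the ValueError are hypothetical and not part of the documented API.

```python
import tempfile
from typing import Iterable, Tuple

# Defaults copied from the GitLoader signature.
PACK_SIZE_BYTES = 4 * 1024**3      # 4294967296 bytes (4 GiB)
TEMP_FILE_CUTOFF = 100 * 1024**2   # 104857600 bytes (100 MiB)


def buffer_pack(
    chunks: Iterable[bytes],
    cutoff: int = TEMP_FILE_CUTOFF,
    limit: int = PACK_SIZE_BYTES,
) -> Tuple[tempfile.SpooledTemporaryFile, int]:
    """Hypothetical helper: accumulate pack chunks and track their total size.

    Data stays in memory until more than `cutoff` bytes are written, after
    which SpooledTemporaryFile spills to disk. Exceeding `limit` is treated
    as an error (an assumption about how the limit is enforced).
    """
    buf = tempfile.SpooledTemporaryFile(max_size=cutoff)
    size = 0
    for chunk in chunks:
        size += len(chunk)
        if size > limit:
            buf.close()
            raise ValueError(f"pack exceeds the {limit} byte limit")
        buf.write(chunk)
    buf.seek(0)  # rewind so that the caller can read the pack from the start
    return buf, size
```

The returned pair corresponds loosely to the pack_buffer and pack_size fields of FetchPackReturn.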
- prepare_origin_visit() None
First step executed by the loader to prepare origin and visit references. Set/update self.origin, and optionally self.origin_url, self.visit_date.
- prepare() None
Second step executed by the loader to prepare some state needed by the loader.
Raises a NotFound exception if the origin to ingest is not found.
- fetch_data() bool
Fetch the data from the source the loader is currently loading (e.g. a git/hg/svn/… repository).
Returns a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
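The return-value contract of fetch_data can be illustrated with a small driver loop. This is only a sketch: the helper name fetch_until_complete and the max_rounds safety cap are hypothetical, and the real loader may interleave other steps between calls.

```python
from typing import Callable


def fetch_until_complete(fetch_data: Callable[[], bool], max_rounds: int = 1000) -> int:
    """Call fetch_data until it returns a falsy value.

    Returns the number of calls made. max_rounds is a hypothetical cap that
    guards against a source that never reports completion.
    """
    rounds = 0
    while True:
        rounds += 1
        if not fetch_data():
            # A falsy value means loading is complete.
            return rounds
        if rounds >= max_rounds:
            raise RuntimeError("fetch_data kept requesting more rounds")
```

For example, a fetch_data stand-in that returns True twice and then False is called three times in total.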
- iter_objects(object_type: bytes) Iterator[dulwich.objects.ShaFile]
Read all the objects of type object_type from the packfile
- get_contents() Iterable[swh.model.model.BaseContent]
Format the blobs from the git repository as swh contents
- get_releases() Iterable[swh.model.model.Release]
Retrieve all the release objects from the git repository
- get_snapshot() swh.model.model.Snapshot
Get the snapshot for the current visit.
The main complexity of this function is mapping target objects to their types, as the refs dictionaries returned by the git server only give us the identifiers for the target objects, and not their types.
The loader itself only knows the types of the objects that it has fetched from the server (as it has parsed them while loading them to the archive). As we only fetched an increment between the previous snapshot and the current state of the server, we are missing the type information for the objects that would already have been referenced by the previous snapshot, and that the git server didn’t send us. We infer the type of these objects from the previous snapshot.
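The type-inference fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the loader's actual code: the function infer_target_types, the plain-string type labels, and the two lookup dictionaries are hypothetical stand-ins for the loader's internal state and for the previous snapshot.

```python
from typing import Dict, Optional


def infer_target_types(
    remote_refs: Dict[bytes, bytes],
    fetched_types: Dict[bytes, str],
    previous_types: Dict[bytes, str],
) -> Dict[bytes, Optional[str]]:
    """Map each ref name to the type of the object it points to.

    fetched_types: types of objects parsed during this visit.
    previous_types: types recorded for targets of the previous snapshot.
    A ref whose target is found in neither mapping resolves to None.
    """
    resolved: Dict[bytes, Optional[str]] = {}
    for ref_name, target_id in remote_refs.items():
        # Prefer what was learned while parsing the fetched pack.
        target_type = fetched_types.get(target_id)
        if target_type is None:
            # The server did not send this object again, so fall back to
            # the type recorded in the previous snapshot.
            target_type = previous_types.get(target_id)
        resolved[ref_name] = target_type
    return resolved
```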
- load_status() Dict[str, Any]
The load was eventful if the current snapshot is different from the one we retrieved at the beginning of the run.
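A minimal sketch of that comparison, assuming snapshots are compared by identifier and that the result is reported under a "status" key; both the key name and the function load_status_sketch are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def load_status_sketch(
    previous_snapshot_id: Optional[bytes],
    current_snapshot_id: bytes,
) -> Dict[str, Any]:
    """Report whether the visit changed the snapshot (hypothetical shape)."""
    # With no previous snapshot (None), any current snapshot counts as a change.
    eventful = previous_snapshot_id != current_snapshot_id
    return {"status": "eventful" if eventful else "uneventful"}
```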