swh.loader.git.loader module

swh.loader.git.loader.split_lines_and_remainder(buf: bytes) → Tuple[List[bytes], bytes]

Get newline-terminated (b"\r" or b"\n") lines from buf, and the beginning of the last line if it isn’t terminated.
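The contract described above can be illustrated with a minimal sketch; it is not the module's actual implementation. It assumes that terminators are kept on the returned lines and that b"\r\n" counts as a single terminator, neither of which is stated here. The name `split_lines_sketch` is hypothetical.

```python
from typing import List, Tuple


def split_lines_sketch(buf: bytes) -> Tuple[List[bytes], bytes]:
    """Hypothetical stand-in: split buf into terminated lines plus a remainder."""
    # bytes.splitlines splits on b"\n", b"\r" and b"\r\n"; keepends=True keeps
    # the terminator on each line (an assumption of this sketch).
    lines = buf.splitlines(keepends=True)
    if not lines:
        return [], b""
    if lines[-1].endswith((b"\r", b"\n")):
        # Every line is terminated, so nothing is left over.
        return lines, b""
    # The last chunk has no terminator: hand it back as the remainder.
    return lines[:-1], lines[-1]


complete, rest = split_lines_sketch(b"Counting: 1\rCounting: 2\rdone.\npartial")
```

A caller would typically prepend the remainder to the next chunk it receives before splitting again.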

class swh.loader.git.loader.RepoRepresentation(storage, base_snapshots: List[Snapshot] | None = None, incremental: bool = True, statsd: Statsd | None = None)

Bases: object

Repository representation for a Software Heritage origin.

graph_walker() → ObjectStoreGraphWalker
determine_wants(refs: Dict[bytes, HexBytes]) → List[HexBytes]

Get the list of hex-encoded sha1s (HexBytes) that the git loader should fetch.

This compares the remote refs sent by the server with the base snapshot provided by the loader.
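The comparison described above might look roughly like the following sketch. It is an illustration under assumptions, not the class's real logic: `determine_wants_sketch` and `known_targets` are hypothetical names, plain `bytes` stand in for `HexBytes`, and any filtering of ignored refs is left out.

```python
from typing import Dict, List, Set


def determine_wants_sketch(
    refs: Dict[bytes, bytes], known_targets: Set[bytes]
) -> List[bytes]:
    """Return remote targets that are not known locally, without duplicates."""
    wanted: List[bytes] = []
    seen: Set[bytes] = set()
    for target in refs.values():
        # Skip targets reachable from the base snapshot, or already requested.
        if target in known_targets or target in seen:
            continue
        seen.add(target)
        wanted.append(target)
    return wanted


remote_refs = {
    b"refs/heads/main": b"a" * 40,
    b"refs/heads/dev": b"b" * 40,
    b"refs/tags/v1.0": b"a" * 40,
}
wants = determine_wants_sketch(remote_refs, known_targets={b"b" * 40})
```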

class swh.loader.git.loader.FetchPackReturn(remote_refs: Dict[bytes, swh.loader.git.utils.HexBytes], symbolic_refs: Dict[bytes, swh.loader.git.utils.HexBytes], pack_buffer: tempfile.SpooledTemporaryFile, pack_size: int)

Bases: object

remote_refs: Dict[bytes, HexBytes]
symbolic_refs: Dict[bytes, HexBytes]
pack_buffer: SpooledTemporaryFile
pack_size: int
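To make the shape of this record concrete, here is a small sketch with the same four fields. `FetchPackReturnSketch` and `buffer_pack` are hypothetical names, plain `bytes` stand in for `HexBytes`, and the way the real loader fills the buffer is an assumption.

```python
import tempfile
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class FetchPackReturnSketch:
    remote_refs: Dict[bytes, bytes]
    symbolic_refs: Dict[bytes, bytes]
    pack_buffer: tempfile.SpooledTemporaryFile
    pack_size: int


def buffer_pack(
    chunks: Iterable[bytes], max_in_memory: int = 1024
) -> Tuple[tempfile.SpooledTemporaryFile, int]:
    """Accumulate pack chunks; data spills to disk once max_in_memory is exceeded."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    size = buf.tell()  # position after the last write equals the total size
    buf.seek(0)        # rewind so a reader starts at the beginning of the pack
    return buf, size


pack_buffer, pack_size = buffer_pack([b"PACK", b"data"])
result = FetchPackReturnSketch(
    remote_refs={b"refs/heads/main": b"a" * 40},
    symbolic_refs={b"HEAD": b"refs/heads/main"},
    pack_buffer=pack_buffer,
    pack_size=pack_size,
)
```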
class swh.loader.git.loader.GitLoader(storage: StorageInterface, url: str, incremental: bool = True, repo_representation: Type[RepoRepresentation] = RepoRepresentation, pack_size_bytes: int = 4294967296, temp_file_cutoff: int = 104857600, connect_timeout: float = 120, read_timeout: float = 60, verify_certs: bool = True, urllib3_extra_kwargs: Dict[str, Any] = {}, requests_extra_kwargs: Dict[str, Any] = {}, **kwargs: Any)

Bases: BaseGitLoader

A bulk loader for a git repository

Emits the following statsd stats:

  • increments swh_loader_git

  • histogram swh_loader_git_ignored_refs_percent is the ratio of refs ignored over all refs of the remote repository

  • histogram swh_loader_git_known_refs_percent is the ratio of (non-ignored) remote heads that are already local over all non-ignored remote heads

All three are tagged with {"incremental": "<incremental_mode>"} where incremental_mode is one of:

  • from_same_origin when the origin was already loaded

  • from_parent_origin when the origin was not already loaded, but it was detected as a forge-fork of an origin that was already loaded

  • no_previous_snapshot when the origin was not already loaded, and it was detected as a forge-fork of origins that were not already loaded either

  • no_parent_origin when the origin was not already loaded, and it was not detected as a forge-fork of any other origin

  • disabled when incremental loading is disabled by configuration
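The five tag values listed above can be read as a decision table. The sketch below encodes that reading; the function name and its boolean parameters are hypothetical, and the order in which the real loader evaluates these conditions is an assumption.

```python
from typing import Dict


def incremental_mode_sketch(
    incremental_enabled: bool,
    origin_already_loaded: bool,
    has_parent_origins: bool,
    parent_already_loaded: bool,
) -> str:
    """Pick the incremental_mode tag from the conditions described in the list."""
    if not incremental_enabled:
        return "disabled"
    if origin_already_loaded:
        return "from_same_origin"
    if not has_parent_origins:
        return "no_parent_origin"
    if parent_already_loaded:
        return "from_parent_origin"
    return "no_previous_snapshot"


def statsd_tags_sketch(mode: str) -> Dict[str, str]:
    # Tag dictionary in the shape shown above.
    return {"incremental": mode}


tags = statsd_tags_sketch(
    incremental_mode_sketch(
        incremental_enabled=True,
        origin_already_loaded=False,
        has_parent_origins=True,
        parent_already_loaded=True,
    )
)
```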

Initialize the bulk updater.

Parameters:
  • repo_representation – swh’s repository representation, which is in charge of filtering between known and remote data.

  • ...

  • incremental – If True (the default), loading starts from the references of the last known snapshot, if any. Otherwise, the full repository is loaded.

visit_type: str = 'git'
fetch_pack_from_origin(origin_url: str, base_repo: RepoRepresentation, do_activity: Callable[[bytes], None]) → FetchPackReturn

Fetch a pack from the origin

get_full_snapshot(origin_url) → Snapshot | None
load_metadata_objects(metadata_objects: List[RawExtrinsicMetadata]) → None
prepare() → None

Second step executed by the loader to prepare some state needed by the loader.

Raises:

NotFound exception if the origin to ingest is not found.

fetch_data() → bool

Fetch the data from the source the loader is currently loading (e.g. a git/hg/svn/… repository).

Returns:

a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
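The return-value contract above implies a calling loop along these lines. This is only a sketch: `FakeLoader` and `fetch_until_done` are hypothetical, and what the real base loader does between calls is not described here.

```python
class FakeLoader:
    """Hypothetical loader whose data arrives in a fixed number of rounds."""

    def __init__(self, rounds: int) -> None:
        self.remaining = rounds
        self.calls = 0

    def fetch_data(self) -> bool:
        self.calls += 1
        self.remaining -= 1
        # A truthy value means "call me again to complete loading".
        return self.remaining > 0


def fetch_until_done(loader: FakeLoader) -> int:
    # Keep calling fetch_data until it returns a falsy value.
    while loader.fetch_data():
        pass
    return loader.calls


calls_needed = fetch_until_done(FakeLoader(rounds=3))
```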

save_data() → None

Store a pack for archival

iter_objects(object_type: bytes) → Iterator[ShaFile]

Read all the objects of type object_type from the packfile

get_contents() → Iterable[BaseContent]

Format the blobs from the git repository as swh contents

get_directories() → Iterable[Directory]

Format the trees as swh directories

get_revisions() → Iterable[Revision]

Format commits as swh revisions

get_releases() → Iterable[Release]

Retrieve all the release objects from the git repository
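The four getters above pair each git object kind with an swh object kind. The lookup table below summarizes that pairing as a sketch; the byte-string keys are git's own type names and are assumed, not confirmed here, to be the values passed as `object_type` to `iter_objects`. The names `GIT_TYPE_TO_SWH` and `group_by_swh_type` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# git object type -> swh object kind, following the getters described above.
GIT_TYPE_TO_SWH: Dict[bytes, str] = {
    b"blob": "content",
    b"tree": "directory",
    b"commit": "revision",
    b"tag": "release",
}


def group_by_swh_type(objects: Iterable[Tuple[bytes, bytes]]) -> Dict[str, List[bytes]]:
    """Group (git_type, object_id) pairs under the matching swh kind."""
    grouped: Dict[str, List[bytes]] = defaultdict(list)
    for git_type, object_id in objects:
        grouped[GIT_TYPE_TO_SWH[git_type]].append(object_id)
    return dict(grouped)


grouped = group_by_swh_type(
    [(b"blob", b"01"), (b"commit", b"02"), (b"blob", b"03"), (b"tag", b"04")]
)
```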

get_snapshot() → Snapshot

Get the snapshot for the current visit.

The main complexity of this function is mapping target objects to their types, as the refs dictionaries returned by the git server only give us the identifiers for the target objects, and not their types.

The loader itself only knows the types of the objects that it has fetched from the server (as it has parsed them while loading them to the archive). As we only fetched an increment between the previous snapshot and the current state of the server, we are missing the type information for the objects that would already have been referenced by the previous snapshot, and that the git server didn’t send us. We infer the type of these objects from the previous snapshot.
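The fallback described above can be sketched as a two-level lookup: first the types of objects fetched during this visit, then the types recorded in the previous snapshot. All names below are hypothetical, plain strings stand in for swh target types, and leaving an unknown target as `None` is a choice of this sketch rather than documented behavior.

```python
from typing import Dict, Optional, Tuple


def infer_target_types(
    remote_refs: Dict[bytes, bytes],
    fetched_types: Dict[bytes, str],
    previous_snapshot_types: Dict[bytes, str],
) -> Dict[bytes, Tuple[bytes, Optional[str]]]:
    """Map each ref to (target id, target type), preferring freshly fetched data."""
    resolved: Dict[bytes, Tuple[bytes, Optional[str]]] = {}
    for ref_name, target in remote_refs.items():
        target_type = fetched_types.get(target)
        if target_type is None:
            # Not sent by the server this time: fall back to the previous snapshot.
            target_type = previous_snapshot_types.get(target)
        resolved[ref_name] = (target, target_type)
    return resolved


branches = infer_target_types(
    remote_refs={b"refs/heads/main": b"new1", b"refs/tags/v1.0": b"old1"},
    fetched_types={b"new1": "revision"},
    previous_snapshot_types={b"old1": "release"},
)
```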

load_status() → Dict[str, Any]

The load is eventful if the current snapshot differs from the one retrieved at the beginning of the run.
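A minimal sketch of that comparison follows. The function name, the use of raw snapshot identifiers, and the "status" key with "eventful" / "uneventful" values are assumptions made for illustration; the exact dictionary returned by `load_status` is not given here.

```python
from typing import Any, Dict, Optional


def load_status_sketch(
    previous_snapshot_id: Optional[bytes], current_snapshot_id: bytes
) -> Dict[str, Any]:
    """Report whether the visit produced a snapshot different from the prior one."""
    eventful = previous_snapshot_id != current_snapshot_id
    # Key and values are assumed for this sketch.
    return {"status": "eventful" if eventful else "uneventful"}


first_visit = load_status_sketch(None, b"snap-1")
unchanged = load_status_sketch(b"snap-1", b"snap-1")
```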