swh.loader.git package

Submodules

swh.loader.git.converters module

Convert dulwich objects to dictionaries suitable for swh.storage

swh.loader.git.converters.origin_url_to_origin(origin_url)[source]

Format an origin URL as an origin suitable for swh.storage

swh.loader.git.converters.dulwich_blob_to_content_id(blob)[source]

Convert a dulwich blob to a Software Heritage content id
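
As a rough illustration of one identifier such a content id can carry, the sketch below computes the git object id of a blob using only hashlib. It assumes the standard git blob hashing scheme (SHA-1 over a "blob <size>\0" header followed by the data). The helper name git_blob_id is hypothetical and not part of swh.loader.git, and the real converter may record other hashes as well.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the hex git object id of a blob holding `data` (hypothetical helper)."""
    # Git hashes a header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello world\n")
```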

swh.loader.git.converters.dulwich_blob_to_content(blob, log=None, max_content_size=None, origin_id=None)[source]

Convert a dulwich blob to a Software Heritage content

swh.loader.git.converters.dulwich_tree_to_directory(tree, log=None)[source]

Format a tree as a directory

swh.loader.git.converters.parse_author(name_email)[source]

Parse an author line
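
A minimal sketch of what parsing an author line could look like, assuming the usual git form b"Name <email>". The function name parse_author_sketch and the returned keys (fullname, name, email) are assumptions made for illustration; the actual return structure is not documented here.

```python
def parse_author_sketch(name_email: bytes) -> dict:
    """Split b"Name <email>" into its parts (hypothetical helper)."""
    start = name_email.find(b"<")
    if start == -1:
        # No email delimiter: treat the whole line as the name.
        return {
            "fullname": name_email,
            "name": name_email.strip() or None,
            "email": None,
        }
    end = name_email.find(b">", start)
    # Tolerate a missing closing bracket by reading to the end of the line.
    email = name_email[start + 1:end] if end != -1 else name_email[start + 1:]
    name = name_email[:start].strip()
    return {"fullname": name_email, "name": name or None, "email": email or None}


author = parse_author_sketch(b"Jane Doe <jane@example.org>")
```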

swh.loader.git.converters.dulwich_tsinfo_to_timestamp(timestamp, timezone, timezone_neg_utc)[source]

Convert the dulwich timestamp information to a structure compatible with Software Heritage
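
The conversion can be pictured with the sketch below. It assumes that dulwich reports the timezone as an offset in seconds and that the target structure stores the offset in minutes; the function name and the dictionary keys are illustrative assumptions, not the documented output format.

```python
def tsinfo_to_timestamp_sketch(timestamp: int, timezone: int, timezone_neg_utc: bool) -> dict:
    """Build a timestamp dictionary from dulwich-style fields (hypothetical helper)."""
    return {
        # Whole seconds since the epoch; git does not record sub-second precision.
        "timestamp": {"seconds": int(timestamp), "microseconds": 0},
        # Assumption: `timezone` is in seconds, the target offset is in minutes.
        "offset": timezone // 60,
        # Keeps the "-0000" case distinguishable from "+0000".
        "negative_utc": bool(timezone_neg_utc),
    }


ts = tsinfo_to_timestamp_sketch(1500000000, -18000, False)
```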

swh.loader.git.converters.dulwich_commit_to_revision(commit, log=None)[source]
swh.loader.git.converters.dulwich_tag_to_release(tag, log=None)[source]
swh.loader.git.converters.branches_to_snapshot(branches)[source]

swh.loader.git.from_disk module

class swh.loader.git.from_disk.GitLoaderFromDisk(config=None)[source]

Bases: swh.loader.core.loader.UnbufferedLoader

Load a git repository from a directory.

CONFIG_BASE_FILENAME = 'loader/git-disk'
__init__(config=None)[source]

Initialize self. See help(type(self)) for accurate signature.

_prepare_origin_visit(origin_url, visit_date)[source]
prepare_origin_visit(origin_url, directory, visit_date)[source]

First step executed by the loader to prepare origin and visit references. Set/update self.origin, self.origin_id and optionally self.origin_url, self.visit_date.

prepare(origin_url, directory, visit_date)[source]

Second step executed by the loader to prepare some state needed by the loader.

iter_objects()[source]
_check(obj)[source]

Check the object’s repository representation.

If the check finds any error, an ObjectFormatException is raised.

Parameters: obj (object) – Dulwich object read from the repository.

get_object(oid)[source]

Given an object id, return the object if it is found and not malformed in some way.

Parameters: oid (bytes) – the object’s identifier
Returns: The object if found without malformation

fetch_data()[source]

Fetch the data from the data source

has_contents()[source]

Checks whether we need to load contents

get_content_ids()[source]

Get the content identifiers from the git repository

get_contents()[source]

Get the contents that need to be loaded

has_directories()[source]

Checks whether we need to load directories

get_directory_ids()[source]

Get the directory identifiers from the git repository

get_directories()[source]

Get the directories that need to be loaded

has_revisions()[source]

Checks whether we need to load revisions

get_revision_ids()[source]

Get the revision identifiers from the git repository

get_revisions()[source]

Get the revisions that need to be loaded

has_releases()[source]

Checks whether we need to load releases

get_release_ids()[source]

Get the release identifiers from the git repository

get_releases()[source]

Get the releases that need to be loaded

get_snapshot()[source]

Turn the list of branches into a snapshot to load

get_fetch_history_result()[source]

Return the data to store in fetch_history for the current loader

save_data()[source]

We already have the data locally, no need to save it

load_status()[source]

The load was eventful if the current occurrences are different to the ones we retrieved at the beginning of the run

class swh.loader.git.from_disk.GitLoaderFromArchive(*args, **kwargs)[source]

Bases: swh.loader.git.from_disk.GitLoaderFromDisk

Load a git repository from an archive.

This loader ingests a git repository compressed into an archive. The supported archive formats are .zip and .tar.gz.

From an input archive named my-git-repo.zip, the following layout is expected in it:

my-git-repo/
├── .git
│   ├── branches
│   ├── COMMIT_EDITMSG
│   ├── config
│   ├── description
│   ├── HEAD
...

Nevertheless, the loader is able to ingest archives with the following layouts too:

.
├── .git
│   ├── branches
│   ├── COMMIT_EDITMSG
│   ├── config
│   ├── description
│   ├── HEAD
...

or:

other-repo-name/
├── .git
│   ├── branches
│   ├── COMMIT_EDITMSG
│   ├── config
│   ├── description
│   ├── HEAD
...
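
One way to accept all three layouts is to look for the .git directory first at the root of the extracted archive and then one level down. The sketch below illustrates that idea; find_repo_root is a hypothetical helper and does not claim to reproduce how the loader resolves the repository path.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_repo_root(extracted_dir: Path) -> Optional[Path]:
    """Return the directory that holds `.git`, or None (hypothetical helper)."""
    # Second layout: `.git` sits directly at the root of the extracted archive.
    if (extracted_dir / ".git").is_dir():
        return extracted_dir
    # First and third layouts: `.git` sits one level down, whatever the folder name.
    for child in sorted(extracted_dir.iterdir()):
        if child.is_dir() and (child / ".git").is_dir():
            return child
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "other-repo-name" / ".git").mkdir(parents=True)
    found = find_repo_root(root)
    found_name = found.name if found else None
```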
__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

project_name_from_archive(archive_path)[source]

Compute the project name from the archive’s path.
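
As an illustration, the project name could be derived by stripping the directory and the archive extension from the path. The sketch assumes only the two supported formats listed above (.zip and .tar.gz); project_name_sketch is a hypothetical name and the real method may behave differently.

```python
import os

# Assumption: only the archive formats listed as supported are stripped.
SUPPORTED_SUFFIXES = (".tar.gz", ".zip")


def project_name_sketch(archive_path: str) -> str:
    """Derive a project name from an archive path (hypothetical helper)."""
    name = os.path.basename(archive_path)
    for suffix in SUPPORTED_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    # Unknown extension: return the file name unchanged.
    return name


zip_name = project_name_sketch("/srv/archives/my-git-repo.zip")
tar_name = project_name_sketch("/srv/archives/my-git-repo.tar.gz")
```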

prepare_origin_visit(origin_url, archive_path, visit_date)[source]

First step executed by the loader to prepare origin and visit references. Set/update self.origin, self.origin_id and optionally self.origin_url, self.visit_date.

prepare(origin_url, archive_path, visit_date)[source]

  1. Uncompress the archive in a temporary location.
  2. Prepare as GitLoaderFromDisk does.
  3. Load as GitLoaderFromDisk does.

cleanup()[source]

Cleanup the temporary location (if it exists).

swh.loader.git.loader module

class swh.loader.git.loader.RepoRepresentation(storage, origin_id, base_snapshot=None, ignore_history=False)[source]

Bases: object

Repository representation for a Software Heritage origin.

__init__(storage, origin_id, base_snapshot=None, ignore_history=False)[source]

Initialize self. See help(type(self)) for accurate signature.

_fill_parents_cache(commits)[source]

When querying for a commit’s parents, we fill the cache to a depth of 1000 commits.

_cache_heads(origin_id, base_snapshot)[source]

Return all the known head commits for origin_id

get_parents(commit)[source]

Bogus method to prevent expensive recursion, at the expense of less efficient downloading

get_heads()[source]
static _encode_for_storage(objects)[source]
static _decode_from_storage(objects)[source]
graph_walker()[source]
static filter_unwanted_refs(refs)[source]

Filter the unwanted references from refs
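
The filtering criteria are not spelled out here. The sketch below assumes two plausible ones: peeled tag references (names ending in ^{}) and automatically generated pull-request merge references (refs/pull/…/merge). Both criteria and the function name are assumptions made for illustration.

```python
def filter_unwanted_refs_sketch(refs: dict) -> dict:
    """Return a copy of `refs` without the references assumed to be unwanted."""
    kept = {}
    for ref, target in refs.items():
        if ref.endswith(b"^{}"):
            # Assumption: peeled tag entries duplicate information and are skipped.
            continue
        if ref.startswith(b"refs/pull/") and ref.endswith(b"/merge"):
            # Assumption: auto-generated pull-request merge refs are skipped.
            continue
        kept[ref] = target
    return kept


remote_refs = {
    b"refs/heads/master": b"aaaa",
    b"refs/tags/v1.0": b"bbbb",
    b"refs/tags/v1.0^{}": b"cccc",
    b"refs/pull/12/merge": b"dddd",
    b"refs/pull/12/head": b"eeee",
}
wanted_refs = filter_unwanted_refs_sketch(remote_refs)
```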

determine_wants(refs)[source]

Filter the remote references to figure out which ones Software Heritage needs.
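
Conceptually, this amounts to keeping the remote targets that the archive does not hold yet. The sketch below shows that set difference under the assumption that known object ids are available as a plain set; the function name and the known_ids parameter are hypothetical.

```python
def determine_wants_sketch(refs: dict, known_ids: set) -> list:
    """List remote targets that are not already known (hypothetical helper)."""
    wanted = []
    seen = set()
    for target in refs.values():
        # Skip objects already archived and targets shared by several refs.
        if target in known_ids or target in seen:
            continue
        seen.add(target)
        wanted.append(target)
    return wanted


refs = {
    b"refs/heads/master": b"aaaa",
    b"refs/heads/dev": b"bbbb",
    b"refs/tags/v1.0": b"aaaa",
    b"refs/tags/v0.9": b"cccc",
}
wants = determine_wants_sketch(refs, known_ids={b"cccc"})
```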

get_stored_objects(objects)[source]

Find which of these objects were stored in the archive.

Do the request in packets to avoid a server timeout.
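
Splitting a large lookup into packets can be sketched as follows. The packet size, the helper names, and the lookup callable standing in for the storage query are all assumptions; only the batching idea comes from the description above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Set


def in_packets(items: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        packet = list(islice(iterator, size))
        if not packet:
            return
        yield packet


def stored_objects_sketch(
    object_ids: Iterable[bytes],
    lookup: Callable[[List[bytes]], Iterable[bytes]],
    packet_size: int = 1000,
) -> Set[bytes]:
    """Collect the ids that `lookup` reports as stored, one packet at a time."""
    stored: Set[bytes] = set()
    for packet in in_packets(object_ids, packet_size):
        stored.update(lookup(packet))
    return stored


# Hypothetical stand-in for the archive query, recording each request size.
archive = {b"a", b"c"}
calls = []


def fake_lookup(packet):
    calls.append(len(packet))
    return [oid for oid in packet if oid in archive]


result = stored_objects_sketch([b"a", b"b", b"c", b"d", b"e"], fake_lookup, packet_size=2)
```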

find_remote_ref_types_in_swh(remote_refs)[source]

Parse the remote refs information and list the objects that exist in Software Heritage.

class swh.loader.git.loader.GitLoader(repo_representation=<class 'swh.loader.git.loader.RepoRepresentation'>, config=None)[source]

Bases: swh.loader.core.loader.UnbufferedLoader

A bulk loader for a git repository

CONFIG_BASE_FILENAME = 'loader/git'
ADDITIONAL_CONFIG = {'pack_size_bytes': ('int', 4294967296)}
__init__(repo_representation=<class 'swh.loader.git.loader.RepoRepresentation'>, config=None)[source]

Initialize the bulk updater.

Parameters:
  • repo_representation – swh’s repository representation, which is in charge of filtering between known and remote data.

fetch_pack_from_origin(origin_url, base_origin_id, base_snapshot, do_activity)[source]

Fetch a pack from the origin

list_pack(pack_data, pack_size)[source]
prepare_origin_visit(origin_url, **kwargs)[source]

First step executed by the loader to prepare origin and visit references. Set/update self.origin, self.origin_id and optionally self.origin_url, self.visit_date.

get_full_snapshot(origin_id)[source]
prepare(origin_url, base_url=None, ignore_history=False)[source]

Second step executed by the loader to prepare some state needed by the loader.

fetch_data()[source]

Fetch the data from the source the loader is currently loading (e.g. git/hg/svn/… repository).

Returns: a value that is interpreted as a boolean. If True, fetch_data needs to be called again to complete loading.
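
The return-value contract suggests a driver loop along the following lines. PagedSource is a hypothetical stand-in rather than the real loader, and store_data is an assumed name for whatever step persists the fetched data between two calls.

```python
class PagedSource:
    """Hypothetical stand-in for a loader that fetches its data in several rounds."""

    def __init__(self, pages):
        self._pages = list(pages)
        self.current = None
        self.stored = []

    def fetch_data(self) -> bool:
        # Fetch one round of data; a true result means "call me again".
        self.current = self._pages.pop(0)
        return bool(self._pages)

    def store_data(self) -> None:
        # Assumed persistence step, run after every fetch.
        self.stored.append(self.current)


def run(loader) -> None:
    """Call fetch_data until it reports that loading is complete."""
    while True:
        more_to_fetch = loader.fetch_data()
        loader.store_data()
        if not more_to_fetch:
            break


source = PagedSource(["pack-1", "pack-2", "pack-3"])
run(source)
```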
save_data()[source]

Store a pack for archival

get_inflater()[source]

Reset the pack buffer and get an object inflater from it

has_contents()[source]

Checks whether we need to load contents

get_content_ids()[source]

Get the content identifiers from the git repository

get_contents()[source]

Format the blobs from the git repository as swh contents

has_directories()[source]

Checks whether we need to load directories

get_directory_ids()[source]

Get the directory identifiers from the git repository

get_directories()[source]

Format the trees as swh directories

has_revisions()[source]

Checks whether we need to load revisions

get_revision_ids()[source]

Get the revision identifiers from the git repository

get_revisions()[source]

Format commits as swh revisions

has_releases()[source]

Checks whether we need to load releases

get_release_ids()[source]

Get the release identifiers from the git repository

get_releases()[source]

Retrieve all the release objects from the git repository

get_snapshot()[source]

Get the snapshot that needs to be loaded

get_fetch_history_result()[source]

Return the data to store in fetch_history for the current loader

load_status()[source]

The load was eventful if the current snapshot is different to the one we retrieved at the beginning of the run
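
The comparison can be pictured as below. The function name, its parameters, and the shape of the returned dictionary are assumptions made for illustration.

```python
def load_status_sketch(previous_snapshot_id, current_snapshot_id) -> dict:
    """Report whether a visit changed the snapshot (hypothetical helper)."""
    eventful = previous_snapshot_id != current_snapshot_id
    return {"status": "eventful" if eventful else "uneventful"}


changed = load_status_sketch(b"\x01" * 20, b"\x02" * 20)
unchanged = load_status_sketch(b"\x01" * 20, b"\x01" * 20)
```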

swh.loader.git.tasks module

swh.loader.git.utils module

Utility helper functions

swh.loader.git.utils.init_git_repo_from_archive(project_name, archive_path, root_temp_dir='/tmp')[source]

Given a path to an archive containing a git repository, uncompress that archive to a temporary location and return the path.

If any error occurs, clean up the temporary location.

Parameters:
  • project_name (str) – Project’s name
  • archive_path (str) – Full path to the archive
  • root_temp_dir (str) – Optional temporary directory mount point (defaults to /tmp)

Returns: A tuple:
  • temporary folder: containing the mounted repository
  • repo_path: path to the mounted repository inside the temporary folder

Raises: ValueError in case of failure to run the command to uncompress
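
The behaviour described above (uncompress, return both paths, clean up on failure) can be sketched with the standard library alone. This version uses shutil.unpack_archive in place of the external uncompress command the description mentions, and assumes the archive contains a top-level folder named after the project; the function name is hypothetical.

```python
import os
import shutil
import tempfile


def init_repo_from_archive_sketch(project_name, archive_path, root_temp_dir=None):
    """Unpack `archive_path` and return (temp_dir, repo_path) (hypothetical helper)."""
    temp_dir = tempfile.mkdtemp(prefix="swh.loader.git-", dir=root_temp_dir)
    try:
        shutil.unpack_archive(archive_path, temp_dir)
    except Exception as exc:
        # Any failure: remove the temporary location before reporting the error.
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise ValueError(f"Failed to uncompress {archive_path}") from exc
    # Assumption: the archive holds a top-level folder named after the project.
    return temp_dir, os.path.join(temp_dir, project_name)


# Build a small archive in a scratch directory, then unpack it again.
with tempfile.TemporaryDirectory() as workspace:
    source = os.path.join(workspace, "src")
    os.makedirs(os.path.join(source, "my-git-repo", ".git"))
    with open(os.path.join(source, "my-git-repo", ".git", "HEAD"), "w") as fh:
        fh.write("ref: refs/heads/master\n")
    archive = shutil.make_archive(
        os.path.join(workspace, "my-git-repo"), "zip", root_dir=source
    )
    temp_dir, repo_path = init_repo_from_archive_sketch(
        "my-git-repo", archive, root_temp_dir=workspace
    )
    head_exists = os.path.isfile(os.path.join(repo_path, ".git", "HEAD"))
```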
swh.loader.git.utils.check_date_time(timestamp)[source]

Check date time for overflow errors.

Parameters: timestamp (timestamp) – Timestamp in seconds
Raises: Any error raised by the datetime.fromtimestamp conversion.
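
A check of this kind can be as small as attempting the conversion and letting any exception propagate. The sketch below assumes the conversion is done in UTC; the function name is hypothetical, and the exact exception type raised for out-of-range values depends on the platform.

```python
from datetime import datetime, timezone


def check_date_time_sketch(timestamp) -> None:
    """Raise if `timestamp` (in seconds) cannot be converted to a datetime."""
    # The result is discarded: only the absence of an exception matters.
    datetime.fromtimestamp(timestamp, tz=timezone.utc)


check_date_time_sketch(1500000000)

try:
    check_date_time_sketch(10 ** 18)
    overflowed = False
except (OverflowError, ValueError, OSError):
    overflowed = True
```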

Module contents