swh.loader.tar package

Submodules

swh.loader.tar.build module

swh.loader.tar.build._time_from_last_modified(last_modified)[source]

Compute the modification time from the last_modified string.

Parameters:last_modified (str) – Last modification time
Returns:dict representing a timestamp with keys {seconds, microseconds}
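
A minimal sketch of such a conversion, assuming last_modified is an ISO 8601 string (the parsing done by the actual helper may differ):

    from datetime import datetime, timezone

    def _time_from_last_modified(last_modified):
        """Convert an ISO 8601 string into a {seconds, microseconds} dict."""
        dt = datetime.fromisoformat(last_modified)
        if dt.tzinfo is None:
            # assumption: treat naive dates as UTC
            dt = dt.replace(tzinfo=timezone.utc)
        return {
            'seconds': int(dt.timestamp()),
            'microseconds': dt.microsecond,
        }

    _time_from_last_modified('2016-04-22T16:35:00+00:00')
    # {'seconds': 1461342900, 'microseconds': 0}
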
swh.loader.tar.build.compute_revision(tarpath, last_modified)[source]

Compute a revision.

Parameters:
  • tarpath (str) – absolute path to the tarball
  • last_modified (str) – Time of last modification read from the source remote (most probably by the lister)
Returns:

  • date (dict): the modification timestamp as returned by the
    _time_from_last_modified function
  • committer_date: the modification timestamp as returned by the
    _time_from_last_modified function
  • author: cf. SWH_PERSON
  • committer: cf. SWH_PERSON
  • type: cf. REVISION_TYPE
  • message: cf. REVISION_MESSAGE

Return type:

Revision as dict
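
A hedged illustration of the documented return value; the tarball path and date below are made up for the example:

    from swh.loader.tar.build import compute_revision

    revision = compute_revision(
        '/srv/downloads/8sync-0.1.0.tar.gz',   # hypothetical absolute tarball path
        '2016-04-22T16:35:00+00:00',           # last_modified as read by the lister
    )
    # Per the documentation above, the result carries at least:
    #   revision['date']            modification timestamp
    #   revision['committer_date']  same modification timestamp
    #   revision['author']          SWH_PERSON
    #   revision['committer']       SWH_PERSON
    #   revision['type']            REVISION_TYPE
    #   revision['message']         REVISION_MESSAGE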

swh.loader.tar.build.set_original_artifact(*, revision, filepath, nature, hashes)[source]

Set the original artifact data on the given revision for the tarball currently being loaded.

swh.loader.tar.loader module

class swh.loader.tar.loader.LocalResponse(path)[source]

Bases: object

Local response class with an iter_content API

__init__(path)[source]

Initialize self. See help(type(self)) for accurate signature.

iter_content(chunk_size=None)[source]
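
A minimal sketch of such a class, assuming iter_content mirrors the requests.Response API by yielding byte chunks read from the file at path:

    class LocalResponse:
        """File-backed response exposing an iter_content API."""

        def __init__(self, path):
            self.path = path

        def iter_content(self, chunk_size=None):
            chunk_size = chunk_size or 4096  # arbitrary default for this sketch
            with open(self.path, 'rb') as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk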

class swh.loader.tar.loader.ArchiveFetcher(temp_directory=None)[source]

Bases: object

HTTP/local client in charge of downloading archives from a remote or local server.

Parameters:temp_directory (str) – Path to the temporary disk location used for downloading the release artifacts
__init__(temp_directory=None)[source]

Initialize self. See help(type(self)) for accurate signature.

download(url)[source]

Download the remote tarball url locally.

Parameters:url (str) – Url (file or http*)
Raises:ValueError – in case querying the url fails
Returns:Tuple of (local filepath, hashes of filepath)
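
A hedged sketch of the download step using requests and hashlib; the real implementation also handles file urls and may compute a different hash set:

    import hashlib
    import os

    import requests

    def download(url, temp_directory='/tmp'):
        """Stream url into temp_directory and return (filepath, hashes)."""
        filepath = os.path.join(temp_directory, os.path.basename(url))
        response = requests.get(url, stream=True)
        if response.status_code != 200:
            raise ValueError('Failed to query %s (status %s)' % (
                url, response.status_code))
        sha256 = hashlib.sha256()
        with open(filepath, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                sha256.update(chunk)
                f.write(chunk)
        return filepath, {'sha256': sha256.hexdigest(),
                          'length': os.path.getsize(filepath)}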

class swh.loader.tar.loader.BaseTarLoader(logging_class='swh.loader.tar.TarLoader', config=None)[source]

Bases: swh.loader.core.loader.BufferedLoader

Base Tarball Loader class.

This factorizes multiple loader implementations:

  • RemoteTarLoader: new implementation able to deal with
    remote archives.
  • TarLoader: old implementation which only dealt with local
    archives and only passed along the objects to persist (revision, etc…)
CONFIG_BASE_FILENAME = 'loader/tar'
ADDITIONAL_CONFIG = {'debug': ('bool', False), 'working_dir': ('string', '/tmp')}
visit_type = 'tar'
__init__(logging_class='swh.loader.tar.TarLoader', config=None)[source]

Initialize self. See help(type(self)) for accurate signature.

cleanup()[source]

Clean up temporary disk folders used.

prepare_origin_visit(*, origin, visit_date=None, **kwargs)[source]

Prepare the origin visit information.

Parameters:
  • origin (dict) – Dict with keys {url, type}
  • visit_date (str) – Date representing the date of the visit. None by default will make it the current time during the loading process.
get_tarball_url_to_retrieve()[source]

Compute the tarball url to allow retrieval

fetch_data()[source]

Retrieve and uncompress the archive, then fetch the objects from the tarball. The actual ingestion takes place in the store_data() implementation below.

store_data()[source]

Store the objects in the swh archive.

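The methods above are hooks that the concrete loaders below override. A hedged skeleton of such a subclass; the method bodies are placeholders, not the real implementations:

    from swh.loader.tar.loader import BaseTarLoader

    class MyTarLoader(BaseTarLoader):
        """Illustrative subclass showing how the hooks fit together."""

        def prepare(self, **kwargs):
            # stash whatever the caller passed (last_modified, revision, ...)
            self.kwargs = kwargs

        def get_tarball_url_to_retrieve(self):
            # tell fetch_data() where the archive lives; assumes
            # prepare_origin_visit() stored the origin dict on self.origin
            return self.origin['url']

        def build_revision(self, filepath, nature, hashes):
            # return the revision dict that store_data() will persist
            raise NotImplementedError

        def build_snapshot(self, revision):
            # return the snapshot dict targeting that revision
            raise NotImplementedError
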
class swh.loader.tar.loader.RemoteTarLoader(logging_class='swh.loader.tar.TarLoader', config=None)[source]

Bases: swh.loader.tar.loader.BaseTarLoader

This loader is able to load a remote or local archive into the swh
archive.

This will:

  • create an origin (if it does not exist) and a visit
  • fetch the tarball in a temporary location
  • uncompress it locally in a temporary location
  • process the content of the tarball to persist on swh storage
  • clean up the temporary location
prepare(*, last_modified, **kwargs)[source]

last_modified is the time of last modification of the tarball.

E.g. from the index page at https://ftp.gnu.org/gnu/8sync/:

  8sync-0.1.0.tar.gz      2016-04-22 16:35  217K
  8sync-0.1.0.tar.gz.sig  2016-04-22 16:35   543
  …
Parameters:
  • origin (dict) – Dict with keys {url, type}
  • last_modified (str) – The date of last modification of the archive to ingest.
  • visit_date (str) – Date representing the date of the visit. None by default will make it the current time during the loading process.
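
A hedged call example, assuming the load() entry point inherited from the core loader drives prepare_origin_visit() and prepare(); the values are made up:

    from swh.loader.tar.loader import RemoteTarLoader

    loader = RemoteTarLoader()
    loader.load(
        origin={'url': 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz',
                'type': 'tar'},
        last_modified='2016-04-22T16:35:00+00:00',
        visit_date=None,  # default: current time at load
    )
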
get_tarball_url_to_retrieve()[source]

Compute the tarball url to allow retrieval

build_revision(filepath, nature, hashes)[source]

Build the revision with identifier

We use the last_modified date provided by the caller to build the revision.

build_snapshot(revision)[source]

Build the snapshot targeting the revision.
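
A hedged illustration of the snapshot shape this is expected to produce; the branch name and the identifier computation are assumptions, not taken from the source:

    snapshot = {
        'branches': {
            b'HEAD': {
                'target': revision['id'],   # the revision built above
                'target_type': 'revision',
            },
        },
    }
    # the loader is then expected to compute the snapshot identifier
    # before handing it over for storage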

class swh.loader.tar.loader.LegacyLocalTarLoader(logging_class='swh.loader.tar.TarLoader', config=None)[source]

Bases: swh.loader.tar.loader.BaseTarLoader

This loads a local tarball into the swh archive. It uses the revision and branch provided by the caller as scaffolding to create the full revision and snapshot (with identifiers).

This is what has been used to ingest our 2015 rsync copy of gnu.org, and it is still used by the deposit loader.

This will:

  • create an origin (if it does not exist) and a visit
  • uncompress a tarball in a local and temporary location
  • process the content of the tarball to persist on swh storage
  • associate it to a passed revision and snapshot
  • clean up the temporary location
prepare(*, tar_path, revision, branch_name, **kwargs)[source]

Prepare the data prior to ingest it in SWH archive.

Parameters:
  • tar_path (str) – Path to the archive to ingest
  • revision (dict) – The synthetic revision to associate the archive with (no identifiers within)
  • branch_name (str) – The branch name to use for the snapshot.
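
A hedged call example with made-up values; the synthetic revision is a plain dict without identifiers, as documented above:

    from swh.loader.tar.loader import LegacyLocalTarLoader

    loader = LegacyLocalTarLoader()
    loader.load(
        tar_path='/srv/storage/gnu/8sync-0.1.0.tar.gz',  # hypothetical local path
        origin={'url': 'rsync://ftp.gnu.org/gnu/8sync', 'type': 'tar'},
        revision={
            'type': 'tar',
            'message': 'synthetic revision message',
            'author': {'name': b'', 'email': b'', 'fullname': b''},
            'committer': {'name': b'', 'email': b'', 'fullname': b''},
            'date': None,
            'committer_date': None,
        },
        branch_name='8sync-0.1.0.tar.gz',
    )
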
get_tarball_url_to_retrieve()[source]

Compute the tarball url to allow retrieval

build_revision(filepath, nature, hashes)[source]

Build the revision with identifier

We use the revision provided by the caller as a scaffolding revision.

build_snapshot(revision)[source]

Build the snapshot targeting the revision.

We use the branch_name provided by the caller as a scaffolding as well.

swh.loader.tar.tasks module

swh.loader.tar.utils module

swh.loader.tar.utils.random_blocks(iterable, block=100)[source]

Shuffle the elements of an iterable within consecutive blocks of size block.

Given an iterable:

  • slice the iterable into blocks of block elements
  • shuffle the elements within each block
  • yield each element of the shuffled block
  • continue with the next block
Parameters:
  • iterable (Iterable) – an iterable
  • block (int) – number of elements per block
Yields:

each element of the iterable, in an order randomized within each block
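
A minimal sketch of the described behaviour; the real helper may differ in details such as how a final partial block is handled:

    import random
    from itertools import islice

    def random_blocks(iterable, block=100):
        """Yield the elements of iterable, shuffled within consecutive blocks."""
        iterator = iter(iterable)
        while True:
            chunk = list(islice(iterator, block))   # slice off the next block
            if not chunk:
                return
            random.shuffle(chunk)                   # shuffle within the block
            yield from chunk                        # yield the shuffled elements

    list(random_blocks(range(10), block=5))
    # e.g. [3, 0, 4, 1, 2, 8, 6, 9, 5, 7]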

Module contents