swh.loader.mercurial.objects module

This module contains helper classes used to convert Mercurial bundle files into SWH Contents, Directories, etc.

class swh.loader.mercurial.objects.SimpleBlob(file_hash, is_symlink, file_perms)[source]

Bases: object

Stores basic metadata of a blob object when constructing deep trees from commit file manifests.

Parameters
  • file_hash – unique hash of the file contents

  • is_symlink – (bool) is this file a symlink?

  • file_perms – (string) 3-digit permission code as a string or bytestring, e.g. '755' or b'755'

kind = 'file'
size()[source]

Return the size in bytes.

class swh.loader.mercurial.objects.SimpleTree[source]

Bases: dict

Stores data for a nested directory object. Uses shallow cloning to stay compact after forking and change monitoring for efficient re-hashing.

kind = 'dir'
perms = 16384
remove_tree_node_for_path(path)[source]

Deletes a SimpleBlob or SimpleTree from inside nested SimpleTrees according to the given relative file path, and then recursively removes any newly depopulated SimpleTrees. It keeps the old history by doing a shallow clone before any change.

Parameters

path – bytestring containing a relative path from self to a nested file or directory, e.g. b'foodir/bardir/bazdir/quxfile.txt'

Returns

the new root node
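The clone-then-prune behaviour described above can be illustrated with plain dicts standing in for SimpleTree nodes. This is a minimal sketch, not the module's implementation: the function name remove_path_sketch is hypothetical, blobs are represented by plain strings, and returning the tree unchanged when the path is missing is an assumption.

```python
def remove_path_sketch(root, path):
    """Return a new root without `path`; the original tree is not modified.

    Hypothetical stand-in: plain dicts play the role of SimpleTree nodes.
    """
    parts = path.split(b"/")

    def _remove(node, idx):
        name = parts[idx]
        if name not in node:
            return node                      # assumption: missing path is a no-op
        clone = dict(node)                   # shallow clone keeps the old version intact
        if idx == len(parts) - 1:
            del clone[name]                  # drop the target file or directory
            return clone
        child = node[name]
        if not isinstance(child, dict):
            return node                      # path runs through a file: nothing to do
        new_child = _remove(child, idx + 1)
        if new_child is child:
            return node                      # nothing changed further down
        if new_child:
            clone[name] = new_child
        else:
            del clone[name]                  # prune a directory that became empty
        return clone

    return _remove(root, 0)


old_root = {
    b"foodir": {b"bardir": {b"quxfile.txt": "hash-1"}},
    b"keep.txt": "hash-2",
}
new_root = remove_path_sketch(old_root, b"foodir/bardir/quxfile.txt")
```

Because only the nodes along the removed path are cloned, old_root still describes the previous revision while new_root describes the next one.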

add_blob(file_path, file_hash, is_symlink, file_perms)[source]

Shallow clones the root node and then deeply nests a SimpleBlob inside nested SimpleTrees according to the given file path, shallow cloning all intermediate nodes and marking them as changed and in need of new hashes.

Parameters
  • file_path – bytestring containing the relative path from self to a nested file

  • file_hash – primary identifying hash computed from the blob contents

  • is_symlink – True/False whether this item is a symbolic link

  • file_perms – int or string representation of file permissions

Returns

the new root node
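The shallow-cloning insertion described above is a form of path copying. The sketch below illustrates the idea with plain dicts in place of SimpleTree and a string in place of SimpleBlob; add_blob_sketch is a hypothetical name, and the change-marking used for re-hashing is left out.

```python
def add_blob_sketch(root, file_path, blob):
    """Return a new root with `blob` nested at `file_path`.

    Hypothetical stand-in: only the nodes along the path are cloned,
    every other subtree is shared with the original root.
    """
    parts = file_path.split(b"/")
    new_root = dict(root)                    # shallow clone of the root
    node = new_root
    for name in parts[:-1]:
        child = node.get(name)
        # Clone an existing directory, or create a missing one.
        child = dict(child) if isinstance(child, dict) else {}
        node[name] = child
        node = child
    node[parts[-1]] = blob
    return new_root


old_root = {b"a": {b"x.txt": "hash-1"}, b"b": {b"y.txt": "hash-2"}}
new_root = add_blob_sketch(old_root, b"a/sub/z.txt", "hash-3")
```

Untouched subtrees such as b"b" are the same objects in both roots, which is what keeps forked trees compact.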

yield_swh_directories()[source]

Converts nested SimpleTrees into a stream of SWH Directories.

Yields

an SWH Directory for every node in the tree

hash_changed(new_dirs=None)[source]

Computes and sets primary identifier hashes for unhashed subtrees.

Parameters

new_dirs (optional) – an empty list to be populated with the SWH Directories for all of the new (not previously hashed) nodes

Returns

the top level hash of the whole tree

flatten(_curpath=None, _files=None)[source]

Converts nested sub-SimpleTrees and SimpleBlobs into a list of file paths. Useful for counting the number of files in a manifest.

Returns

a flat list of all of the contained file paths
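As an illustration of the flattening described above, the following sketch walks nested dicts (standing in for SimpleTrees) and collects the path of every non-dict leaf (standing in for SimpleBlobs). The name flatten_sketch and the b"/" separator handling are assumptions made for this example.

```python
def flatten_sketch(tree, prefix=b""):
    """Return the relative paths of all files below `tree` as a flat list."""
    paths = []
    for name, node in tree.items():
        full_path = prefix + b"/" + name if prefix else name
        if isinstance(node, dict):
            # Directory: recurse and collect the paths of its contents.
            paths.extend(flatten_sketch(node, full_path))
        else:
            # File: record its full relative path.
            paths.append(full_path)
    return paths


manifest = {
    b"README": "hash-1",
    b"src": {b"main.py": "hash-2", b"lib": {b"util.py": "hash-3"}},
}
file_paths = flatten_sketch(manifest)
file_count = len(file_paths)
```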

size()[source]

Return the (approximate?) memory utilization in bytes of the nested structure.

class swh.loader.mercurial.objects.SelectiveCache(max_size=None, cache_hints=None, size_function=None, filename=None)[source]

Bases: collections.OrderedDict

Special cache for storing past data upon which new data is known to be dependent. Optional hinting of how many instances of which keys will be needed down the line makes utilization more efficient. And, because the distance between related data can be arbitrarily long and the data fragments can be arbitrarily large, a disk-based secondary storage is used if the primary RAM-based storage area is filled to the designated capacity.

Storage is occupied in three phases:

1) The most recent key/value pair is always held, regardless of other factors, until the next entry replaces it.

2) Stored key/value pairs are pushed into a randomly accessible expanding buffer in memory with a stored size function, maximum size value, and special hinting about which keys to store for how long optionally declared at instantiation.

3) The in-memory buffer pickles into a randomly accessible disk-backed secondary buffer when it becomes full.
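The hand-off from phase 2 to phase 3 can be sketched as a size-tracked dict that pickles its contents to disk once a limit is exceeded. This is only an illustration under stated assumptions: the class SpillingBufferSketch, its one-file-per-entry layout, and the choice to spill the whole buffer at once are all hypothetical and are not taken from the module.

```python
import os
import pickle
import tempfile


class SpillingBufferSketch:
    """In-memory buffer that moves its entries to disk when it grows too large."""

    def __init__(self, max_size, size_function=len, dirname=None):
        self.max_size = max_size
        self.size_function = size_function
        self.memory = {}                     # phase 2: in-memory buffer
        self.on_disk = {}                    # phase 3: key -> pickle file path
        self.used = 0
        self.dirname = dirname or tempfile.mkdtemp()
        self._counter = 0

    def store(self, key, data):
        if key in self.memory:
            self.used -= self.size_function(self.memory[key])
        self.memory[key] = data
        self.used += self.size_function(data)
        if self.used > self.max_size:
            self._spill()

    def _spill(self):
        # Pickle every buffered value to its own file, then empty the buffer.
        for key, data in self.memory.items():
            path = os.path.join(self.dirname, "entry-%d.pickle" % self._counter)
            self._counter += 1
            with open(path, "wb") as handle:
                pickle.dump(data, handle)
            self.on_disk[key] = path
        self.memory.clear()
        self.used = 0

    def fetch(self, key):
        if key in self.memory:
            return self.memory[key]
        path = self.on_disk.get(key)
        if path is None:
            return None
        with open(path, "rb") as handle:
            return pickle.load(handle)


buffer = SpillingBufferSketch(max_size=10)
buffer.store("a", b"12345")
buffer.store("b", b"1234567")                # total size 12 exceeds 10: spill to disk
buffer.store("c", b"xy")                     # stays in memory
```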

Occupied space is calculated by default as whatever the len() function returns on the values being stored. This can be changed by passing in a new size_function at instantiation.

The cache_hints parameter is a dict of key/int pairs recording how many subsequent fetches that particular key’s value should stay in storage for before being erased. If you provide a set of hints and then try to store a key that is not in that set of hints, the cache will store it only while it is the most recent entry, and will bypass storage phases 2 and 3.
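The hint-counting behaviour can be illustrated with a small in-memory model. The class HintedCacheSketch below is hypothetical and leaves out size limits and disk storage. It only models what is described here: the most recent entry is always kept, unhinted keys skip the longer-lived buffer, and each fetch not served by the most recent entry uses up one hint.

```python
class HintedCacheSketch:
    """Simplified model of hint-driven retention (phases 1 and 2 only)."""

    def __init__(self, cache_hints=None):
        # Copy the hints so the caller's dict is not mutated.
        self.hints = dict(cache_hints) if cache_hints is not None else None
        self.latest = None                   # phase 1: most recent (key, data)
        self.buffer = {}                     # phase 2: longer-lived entries

    def store(self, key, data):
        self.latest = (key, data)
        # Without hints everything is buffered; with hints only hinted keys are.
        if self.hints is None or self.hints.get(key, 0) > 0:
            self.buffer[key] = data

    def fetch(self, key):
        if self.latest is not None and self.latest[0] == key:
            return self.latest[1]            # served by phase 1, no hint consumed
        data = self.buffer.get(key)
        self.dereference(key)
        return data

    def dereference(self, key):
        if self.hints is None or key not in self.hints:
            return
        self.hints[key] -= 1
        if self.hints[key] <= 0:
            # No further fetches expected: erase the stored value.
            self.buffer.pop(key, None)
            del self.hints[key]


cache = HintedCacheSketch(cache_hints={"a": 2})
cache.store("a", "A")
cache.store("b", "B")                        # "b" is unhinted: held only while most recent
first = cache.fetch("a")                     # served from the buffer, one hint used
cache.store("c", "C")                        # replaces "b" as the most recent entry
lost = cache.fetch("b")
second = cache.fetch("a")                    # last hinted fetch, value is erased afterwards
third = cache.fetch("a")
```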

DEFAULT_SIZE = 838860800
store(key, data)[source]

Primary method for putting data into the cache.

Parameters
  • key – any hashable value

  • data – any python object (preferably one that is measurable)

has(key)[source]

Tests whether the data for the provided key is being stored.

Parameters

key – the key of the data whose storage membership property you wish to discover

Returns

True or False

fetch(key)[source]

Pulls a value out of storage and decrements the hint counter for the given key.

Parameters

key – the key of the data that you want to retrieve

Returns

the retrieved value or None

dereference(key)[source]

Remove one instance of expected future retrieval of the data for the given key. This is called automatically by fetch requests that aren’t satisfied by phase 1 of storage.

Parameters
  • key – the key of the data for which the future retrievals hint is to be decremented

keys() → a set-like object providing a view on D’s keys[source]
values() → an object providing a view on D’s values[source]
items() → a set-like object providing a view on D’s items[source]