swh.objstorage.backends.pathslicing module

class swh.objstorage.backends.pathslicing.PathSlicingObjStorage(root, slicing, compression='gzip', **kwargs)[source]

Bases: swh.objstorage.objstorage.ObjStorage

Implementation of the ObjStorage API based on the hash of the content.

On disk, an object storage is a directory tree containing files named after their object IDs. An object ID is a checksum of its content, depending on the value of the ID_HASH_ALGO constant (see swh.model.hashutil for its meaning).

To avoid directories that contain too many files, the object storage has a given slicing. Each slice corresponds to one directory level, named after the matching range of characters of the object's hash.

For instance, a file with SHA1 34973274ccef6ab4dfaaf86599792fa9c3fe4689 will be stored at the following paths, depending on the slicing:

  • 0:2/2:4/4:6 : 34/97/32/34973274ccef6ab4dfaaf86599792fa9c3fe4689

  • 0:1/0:5/ : 3/34973/34973274ccef6ab4dfaaf86599792fa9c3fe4689
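The mapping from a slicing specification to a path can be illustrated with a minimal sketch. The helper names parse_slicing and sliced_path are hypothetical and are not part of the swh.objstorage API; the sketch only reproduces the two examples above.

```python
def parse_slicing(slicing):
    """Turn a spec such as "0:2/2:4/4:6" into a list of slice objects."""
    bounds = []
    for part in slicing.strip("/").split("/"):
        start, _, stop = part.partition(":")
        bounds.append(slice(int(start) if start else None,
                            int(stop) if stop else None))
    return bounds


def sliced_path(hex_obj_id, slicing):
    """Return the relative path of an object for the given slicing spec."""
    # One directory level per slice, then the full hex id as the file name.
    directories = [hex_obj_id[bound] for bound in parse_slicing(slicing)]
    return "/".join(directories + [hex_obj_id])


obj = "34973274ccef6ab4dfaaf86599792fa9c3fe4689"
print(sliced_path(obj, "0:2/2:4/4:6"))
print(sliced_path(obj, "0:1/0:5/"))
```

Deeper or wider slicings trade the number of directory levels against the number of entries per directory.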

The files in the storage are stored in gzip-compressed format.

root

path to the root directory of the storage on the disk.

Type

string

bounds

list of tuples that indicate the beginning and the end of the part of the object ID used to name each subdirectory.

check_config(*, check_write)[source]

Check whether this object storage is properly configured.

add(content, obj_id=None, check_presence=True)[source]

Add a new object to the object storage.

Parameters
  • content (bytes) – object’s raw content to add in storage.

  • obj_id (bytes) – checksum of content using the ID_HASH_ALGO algorithm. When given, obj_id is trusted to match the content. If missing, obj_id is computed on the fly.

  • check_presence (bool) – indicate if the presence of the content should be verified before adding the file.

Returns

the id (bytes) of the object in the storage.
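The parameter semantics of add() can be sketched with a toy in-memory store. This is not the backend's implementation: the dict-based store is hypothetical, SHA1 is assumed for ID_HASH_ALGO (as in the path example), and skipping the write when the object is already present is an assumption about what check_presence does.

```python
import hashlib


def add(store, content, obj_id=None, check_presence=True):
    """Toy version of add() working on a plain dict instead of the disk."""
    if obj_id is None:
        # Assumption: ID_HASH_ALGO is SHA1; the id is computed on the fly.
        obj_id = hashlib.sha1(content).digest()
    # A caller-supplied obj_id is trusted and never re-verified here.
    if check_presence and obj_id in store:
        return obj_id  # assumed behaviour: do not write the object again
    store[obj_id] = content
    return obj_id


store = {}
first_id = add(store, b"hello world")
second_id = add(store, b"hello world")  # already present, nothing rewritten
print(first_id.hex(), first_id == second_id, len(store))
```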

get(obj_id)[source]

Retrieve the content of a given object.

Parameters

obj_id (bytes) – object id.

Returns

the content of the requested object as bytes.

Raises

ObjNotFoundError – if the requested object is missing.

check(obj_id)[source]

Perform an integrity check for a given object.

Verify that the file object is in place and that the content matches the object id.
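The verification described above can be sketched with the standard library alone. The function name check_file is hypothetical, SHA1 and gzip compression are assumed, and the built-in exceptions stand in for whatever errors the real backend raises.

```python
import gzip
import hashlib
import os
import tempfile


def check_file(path, obj_id):
    """Raise if the file is missing or its content does not hash to obj_id."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    hasher = hashlib.sha1()  # assumption: ID_HASH_ALGO is SHA1
    with gzip.open(path, "rb") as f:
        # Hash the decompressed content piece by piece to bound memory use.
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    if hasher.digest() != obj_id:
        raise ValueError("content does not match object id " + obj_id.hex())


with tempfile.TemporaryDirectory() as root:
    content = b"some content"
    obj_id = hashlib.sha1(content).digest()
    path = os.path.join(root, obj_id.hex())
    with gzip.open(path, "wb") as f:
        f.write(content)
    check_file(path, obj_id)  # returns silently when the object is intact
```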

Parameters

obj_id (bytes) – object identifier.

Raises

delete(obj_id)[source]

Delete an object.

Parameters

obj_id (bytes) – object identifier.

Raises

ObjNotFoundError – if the requested object is missing.

get_random(batch_size)[source]

Get random ids of existing contents.

This method is used in order to get random ids to perform content integrity verifications on random contents.

Parameters

batch_size (int) – Number of ids that will be given

Yields

An iterable of ids (bytes) of contents that are in the current object storage.

chunk_writer(obj_id)[source]

add_stream(content_iter, obj_id, check_presence=True)[source]

Add a new object to the object storage using streaming.

This function is identical to add() except it takes a generator that yields the chunked content instead of the whole content at once.

Parameters
  • content_iter (iterable of bytes) – generator that yields the object’s raw content in chunks, to add in storage.

  • obj_id (bytes) – object identifier

  • check_presence (bool) – indicate if the presence of the content should be verified before adding the file.

Returns

the id (bytes) of the object in the storage.
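A minimal sketch of the streaming idea follows: the content arrives as a generator of chunks and is written to a gzip file without ever being held in memory as a whole. The helpers write_stream and chunked are hypothetical names, and the file layout is simplified to a single file rather than the sliced tree.

```python
import gzip
import os
import tempfile


def chunked(data, size):
    """Yield successive pieces of `data`, each at most `size` bytes long."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def write_stream(path, content_iter):
    """Write every chunk from `content_iter` into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        for chunk in content_iter:
            f.write(chunk)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "object")
    write_stream(target, chunked(b"0123456789" * 1000, 4096))
    with gzip.open(target, "rb") as f:
        stored = f.read()
    print(len(stored))
```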

get_stream(obj_id, chunk_size=2097152)[source]

Retrieve the content of a given object as a chunked iterator.

Parameters

obj_id (bytes) – object id.

Returns

an iterator over the content of the requested object, yielded as chunks (bytes) of at most chunk_size bytes.

Raises

ObjNotFoundError – if the requested object is missing.
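Chunked retrieval can be sketched as a generator that reads a gzip file piece by piece. The name read_stream is hypothetical, and a missing file surfaces here as the built-in FileNotFoundError rather than ObjNotFoundError.

```python
import gzip
import os
import tempfile


def read_stream(path, chunk_size=2097152):
    """Yield the decompressed content in chunks of at most `chunk_size` bytes."""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "object")
    with gzip.open(source, "wb") as f:
        f.write(b"x" * 10000)
    # Consume the generator while the temporary file still exists.
    pieces = list(read_stream(source, chunk_size=4096))
    print(len(pieces), sum(len(piece) for piece in pieces))
```

The default of 2097152 bytes in the signature corresponds to 2 MiB per chunk.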

list_content(last_obj_id=None, limit=10000)[source]

Generates known object ids.

Parameters
  • last_obj_id (bytes) – object id after which to start iterating (excluded).

  • limit (int) – max number of object ids to generate.

Yields

obj_id (bytes) – object ids.
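The pair last_obj_id and limit lends itself to pagination. The sketch below models that behaviour over a sorted in-memory list; list_ids and iter_all are hypothetical helpers, and the assumption that ids come back in sorted order is mine, not a statement about the backend.

```python
import bisect


def list_ids(sorted_ids, last_obj_id=None, limit=10000):
    """Yield up to `limit` ids strictly greater than `last_obj_id`."""
    if last_obj_id is None:
        start = 0
    else:
        # bisect_right skips last_obj_id itself, so it is excluded.
        start = bisect.bisect_right(sorted_ids, last_obj_id)
    for obj_id in sorted_ids[start:start + limit]:
        yield obj_id


def iter_all(sorted_ids, page_size):
    """Walk the whole listing page by page, resuming after the last id seen."""
    last = None
    while True:
        page = list(list_ids(sorted_ids, last, page_size))
        if not page:
            break
        yield from page
        last = page[-1]


ids = [bytes([value]) for value in range(5)]
print(list(list_ids(ids, last_obj_id=ids[1], limit=2)))
print(list(iter_all(ids, page_size=2)) == ids)
```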

iter_from(obj_id, n_leaf=False)[source]