swh.graph.pid module

class swh.graph.pid.PidType(value)[source]

Bases: enum.Enum

Types of existing PIDs, used to serialize the PID type as a (char) integer.

Note that the order of the members matters, because it drives the binary search in PID-indexed maps. The integer values matter too, for compatibility with the Java layer.

content = 0
directory = 1
origin = 2
release = 3
revision = 4
snapshot = 5
swh.graph.pid.str_to_bytes(pid_str: str) → bytes[source]

Convert a PID to a byte sequence

PIDs are represented in binary as 22-byte sequences, laid out as follows:

  • 1 byte for the namespace version represented as a C unsigned char

  • 1 byte for the object type, as the int value of PidType enums, represented as a C unsigned char

  • 20 bytes for the SHA1 digest as a byte sequence

Parameters

pid_str – persistent identifier, in textual form

Returns

byte sequence representation of pid

Return type

bytes
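The 22-byte layout described above can be sketched with the standard struct module. This is an illustrative sketch, not the module's actual implementation: the helper names pid_to_bytes_sketch and bytes_to_pid_sketch are hypothetical, and the textual form swh:<version>:<tag>:<hex digest> with the three-letter tags below is an assumption about the PID syntax.

```python
import struct
from enum import Enum


class PidType(Enum):
    # Integer values as documented for swh.graph.pid.PidType
    content = 0
    directory = 1
    origin = 2
    release = 3
    revision = 4
    snapshot = 5


# Assumption: three-letter type tags used in the textual PID form
TAG_TO_TYPE = {
    "cnt": PidType.content,
    "dir": PidType.directory,
    "ori": PidType.origin,
    "rel": PidType.release,
    "rev": PidType.revision,
    "snp": PidType.snapshot,
}

# 1 byte version + 1 byte type + 20 bytes SHA1 digest = 22 bytes
PID_BIN_FMT = "BB20s"


def pid_to_bytes_sketch(pid_str: str) -> bytes:
    """Hypothetical helper: pack a textual PID into 22 bytes."""
    scheme, version, tag, hex_digest = pid_str.split(":")
    if scheme != "swh":
        raise ValueError(f"unexpected scheme: {scheme!r}")
    digest = bytes.fromhex(hex_digest)
    if len(digest) != 20:
        raise ValueError("expected a 20-byte SHA1 digest")
    return struct.pack(PID_BIN_FMT, int(version), TAG_TO_TYPE[tag].value, digest)


def bytes_to_pid_sketch(raw: bytes) -> str:
    """Hypothetical helper: inverse of pid_to_bytes_sketch()."""
    version, type_value, digest = struct.unpack(PID_BIN_FMT, raw)
    tag = next(t for t, ty in TAG_TO_TYPE.items() if ty.value == type_value)
    return f"swh:{version}:{tag}:{digest.hex()}"
```

Because every field is either a single byte or a raw byte string, the format has no padding and no byte-order concerns, so the packed size is always exactly 22 bytes.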

swh.graph.pid.bytes_to_str(bytes: bytes) → str[source]

Inverse function of str_to_bytes()

See str_to_bytes() for a description of the binary PID format.

Parameters

bytes – byte sequence representation of pid

Returns

persistent identifier

Return type

str

class swh.graph.pid.PidToNodeMap(fname: str, mode: str = 'rb', length: int = None)[source]

Bases: swh.graph.pid._OnDiskMap, collections.abc.MutableMapping

Memory-mapped map from SWHIDs to a contiguous range 0..N of (8-byte long) integers.

This is the converse mapping of NodeToPidMap.

The on-disk serialization format is a sequence of fixed length (30 bytes) records with the following fields:

  • PID (22 bytes): binary PID representation as per str_to_bytes()

  • long (8 bytes): the integer associated with the PID, as a big-endian long integer

The records are sorted lexicographically by PID type and checksum, where type is the integer value of PidType. PID lookup in the map is performed via binary search. Hence a huge map with, say, 11 billion entries will require about 33 disk seeks per lookup (log2 of 11 × 10^9).

Note that, because records have a fixed size and must stay sorted, these maps cannot be created by random writes. Hence, __setitem__ can only be used to update the value associated with an existing key, not to add a missing item. To create an entire map from scratch, write it sequentially using the class method write_record() (or, at your own risk, by hand via the mmap mm).
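The lookup strategy described above can be sketched as a binary search over fixed-size records held in a bytes-like buffer (which could be an mmap). This is a minimal illustration rather than the class's actual code: the function name lookup_sketch is hypothetical, and it assumes all keys share the same version byte, so that comparing the raw 22-byte keys matches the documented ordering by type and checksum.

```python
import struct
from typing import Optional

RECORD_BIN_FMT = ">BB20sq"                      # version, type, SHA1, integer
RECORD_SIZE = struct.calcsize(RECORD_BIN_FMT)   # 30 bytes
KEY_SIZE = 22                                   # binary PID at the record start


def lookup_sketch(buf, key: bytes) -> Optional[int]:
    """Hypothetical helper: find the integer stored for a 22-byte binary PID.

    buf is any bytes-like object holding sorted 30-byte records.
    Returns None if the key is absent.
    """
    lo, hi = 0, len(buf) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        offset = mid * RECORD_SIZE
        record_key = bytes(buf[offset:offset + KEY_SIZE])
        if record_key < key:
            lo = mid + 1
        elif record_key > key:
            hi = mid
        else:
            # The 8-byte big-endian integer follows the key
            (value,) = struct.unpack_from(">q", buf, offset + KEY_SIZE)
            return value
    return None
```

Each loop iteration touches one record, which is where the logarithmic number of seeks comes from.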

RECORD_BIN_FMT = '>BB20sq'
RECORD_SIZE = 30
classmethod write_record(f: BinaryIO, pid: str, int: int) → None[source]

write a logical record to a file-like object

Parameters
  • f – file-like object to write the record to

  • pid – textual PID

  • int – integer identifier associated with the PID
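Sequential creation of a map can be sketched as follows. This is an assumption-laden illustration that works on already-parsed key fields rather than textual PIDs: write_record_sketch and build_map_sketch are hypothetical names, not part of swh.graph.pid, and only the documented record format '>BB20sq' is taken from the class.

```python
import io
import struct

RECORD_BIN_FMT = ">BB20sq"  # version, type, SHA1 digest, big-endian integer


def write_record_sketch(f, version: int, type_value: int,
                        digest: bytes, node_id: int) -> None:
    """Hypothetical helper: append one 30-byte record to a binary file."""
    f.write(struct.pack(RECORD_BIN_FMT, version, type_value, digest, node_id))


def build_map_sketch(entries) -> bytes:
    """Hypothetical helper: serialize entries in sorted key order.

    entries is an iterable of ((version, type_value, digest), node_id).
    Sorting first is what makes later binary search possible.
    """
    out = io.BytesIO()
    for (version, type_value, digest), node_id in sorted(entries):
        write_record_sketch(out, version, type_value, digest, node_id)
    return out.getvalue()
```

Writing in sorted order up front avoids any need to insert records in the middle of the file afterwards.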

iter_prefix(prefix: str)[source]
iter_type(pid_type: str) → Iterator[Tuple[str, int]][source]
class swh.graph.pid.NodeToPidMap(fname: str, mode: str = 'rb', length: int = None)[source]

Bases: swh.graph.pid._OnDiskMap, collections.abc.MutableMapping

Memory-mapped map from a contiguous range 0..N of (8-byte long) integers to SWHIDs.

This is the converse mapping of PidToNodeMap.

The on-disk serialization format is a sequence of fixed length records (22 bytes), each being the binary representation of a PID as per str_to_bytes().

The records are sorted by long integer, so that integer lookup is possible via a fixed-offset seek: the record for integer i starts at byte offset i × 22.
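The fixed-offset lookup can be sketched as below. The function name record_at_sketch is hypothetical, and the buffer stands in for the memory-mapped file; only the record format 'BB20s' and the 22-byte record size come from the class.

```python
import struct

RECORD_BIN_FMT = "BB20s"                        # version, type, SHA1 digest
RECORD_SIZE = struct.calcsize(RECORD_BIN_FMT)   # 22 bytes


def record_at_sketch(buf, node_id: int):
    """Hypothetical helper: return (version, type_value, digest) for node_id.

    No search is needed: the integer itself determines the byte offset.
    """
    offset = node_id * RECORD_SIZE
    if node_id < 0 or offset + RECORD_SIZE > len(buf):
        raise IndexError(f"node id out of range: {node_id}")
    return struct.unpack_from(RECORD_BIN_FMT, buf, offset)
```

Unlike the PID-to-integer direction, this lookup costs a single seek regardless of map size.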

RECORD_BIN_FMT = 'BB20s'
RECORD_SIZE = 22
classmethod write_record(f: BinaryIO, pid: str) → None[source]

write a PID to a file-like object

Parameters
  • f – file-like object to write the record to

  • pid – textual PID