swh.model.from_disk module

class swh.model.from_disk.DiskBackedContent(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', ctime: Optional[datetime.datetime] = None, path: Optional[bytes] = None)[source]

Bases: swh.model.model.BaseContent

Content-like class, which allows lazy-loading data from the disk.

object_type: typing_extensions.Final = 'content_file'

classmethod from_dict(d)[source]

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

with_data() → swh.model.model.Content[source]
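
The lazy-loading idea can be illustrated with a minimal standard-library sketch. The class LazyFileContent and its fields are hypothetical stand-ins, not part of swh.model: the object carries only metadata and a path, and the bytes are read from disk when with_data() is called.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LazyFileContent:
    """Hypothetical stand-in: metadata plus a path, no data held in memory."""

    length: int
    path: Optional[bytes] = None

    def with_data(self) -> bytes:
        # Read the file only when the caller actually needs the bytes.
        if self.path is None:
            raise ValueError("cannot load data: no path was recorded")
        with open(self.path, "rb") as fd:
            return fd.read()


# Usage: write a small file, then load its data on demand.
with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "hello.txt")
    with open(file_path, "wb") as fd:
        fd.write(b"hello\n")
    lazy = LazyFileContent(length=6, path=os.fsencode(file_path))
    loaded = lazy.with_data()
```

Keeping the data out of the object means a large tree can be described in memory without holding every file's bytes at once.
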
class swh.model.from_disk.DentryPerms(value)[source]

Bases: enum.IntEnum

Admissible permissions for directory entries.

content = 33188

Content

executable_content = 33261

Executable content (e.g. executable script)

symlink = 40960

Symbolic link

directory = 16384

Directory

revision = 57344

Revision (e.g. submodule)
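
The decimal values above are easier to recognise in octal, where they match the entry modes used in Git trees. A quick check using only built-in formatting (the symlink value 40960, octal 120000, is included on the assumption that it follows the same Git convention):

```python
# Decimal values of the directory entry permissions, keyed by member name.
perms = {
    "content": 33188,
    "executable_content": 33261,
    "symlink": 40960,  # assumed: Git's symlink mode, 0o120000
    "directory": 16384,
    "revision": 57344,
}

# Render each value as a six-digit octal mode.
octal_modes = {name: format(value, "06o") for name, value in perms.items()}

for name, mode in octal_modes.items():
    print(f"{name:20} {mode}")
```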

swh.model.from_disk.mode_to_perms(mode)[source]

Convert a file mode to a permission compatible with Software Heritage directory entries

Parameters

mode (int) – a file mode as returned by os.stat() in os.stat_result.st_mode

Returns

one of the following values:

  • DentryPerms.content: plain file

  • DentryPerms.executable_content: executable file

  • DentryPerms.symlink: symbolic link

  • DentryPerms.directory: directory

Return type

DentryPerms
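
A minimal sketch of such a mapping, written with the standard stat module. The enum Perms is a local stand-in mirroring the documented values, and mode_to_perms_sketch is a hypothetical name; the library's own implementation may differ in detail.

```python
import enum
import stat


class Perms(enum.IntEnum):
    """Local stand-in mirroring the documented DentryPerms values."""

    content = 0o100644
    executable_content = 0o100755
    symlink = 0o120000  # assumed: Git's symlink mode
    directory = 0o040000


def mode_to_perms_sketch(mode: int) -> Perms:
    """Map an st_mode value to one of the four permissions above."""
    if stat.S_ISLNK(mode):
        return Perms.symlink
    if stat.S_ISDIR(mode):
        return Perms.directory
    # Any executable bit (user, group or other) marks the file as executable.
    if mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return Perms.executable_content
    return Perms.content


# Usage with a few representative modes.
plain = mode_to_perms_sketch(0o100600)
script = mode_to_perms_sketch(0o100700)
folder = mode_to_perms_sketch(0o040755)
link = mode_to_perms_sketch(0o120777)
```

Note that the exact permission bits are discarded: a file with mode 0o100600 and one with mode 0o100644 both map to the same plain content permission.
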

class swh.model.from_disk.Content(data=None)[source]

Bases: swh.model.merkle.MerkleLeaf

Representation of a Software Heritage content as a node in a Merkle tree.

The current Merkle hash for the Content nodes is the sha1_git, which makes it consistent with what Directory uses for its own hash computation.
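
For reference, sha1_git is the identifier Git assigns to a blob: a SHA-1 computed over a short header followed by the raw bytes. A minimal sketch using hashlib (the function name is chosen for illustration):

```python
import hashlib


def sha1_git(data: bytes) -> bytes:
    """Hash data the way Git hashes a blob: sha1 over a header plus the data."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).digest()


# The empty blob has a well-known identifier in every Git repository.
empty_blob_id = sha1_git(b"").hex()
print(empty_blob_id)
```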

object_type: typing_extensions.Final = 'content'

classmethod from_bytes(*, mode, data)[source]

Convert data (raw bytes) to a Software Heritage content entry

Parameters
  • mode (int) – a file mode (passed to mode_to_perms())

  • data (bytes) – raw contents of the file

classmethod from_symlink(*, path, mode)[source]

Convert a symbolic link to a Software Heritage content entry

classmethod from_file(*, path, max_content_length=None)[source]

Compute the Software Heritage content entry corresponding to an on-disk file.

The returned dictionary contains keys useful for both:

  • loading the content in the archive (hashes, length)

  • using the content as a directory entry in a directory

Parameters
  • path (bytes) – path of the file on disk

  • max_content_length (Optional[int]) – if given, all contents larger than this will be skipped.
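
The behaviour described above can be approximated with the standard library alone. In this sketch, content_entry_from_file is a hypothetical helper, and the "absent" status used for skipped files is an assumption; the hash names follow the fields of DiskBackedContent.

```python
import hashlib
import os
import tempfile
from typing import Optional


def content_entry_from_file(path: str, max_content_length: Optional[int] = None) -> dict:
    """Hypothetical helper: hash an on-disk file, or skip it when too large."""
    length = os.lstat(path).st_size
    if max_content_length is not None and length > max_content_length:
        # Too large: record the length only and do not read the file.
        return {"length": length, "status": "absent"}  # status name assumed

    hashers = {
        "sha1": hashlib.sha1(),
        "sha1_git": hashlib.sha1(b"blob %d\0" % length),
        "sha256": hashlib.sha256(),
        "blake2s256": hashlib.blake2s(digest_size=32),
    }
    with open(path, "rb") as fd:
        # Stream in chunks so large files are never held fully in memory.
        for chunk in iter(lambda: fd.read(64 * 1024), b""):
            for hasher in hashers.values():
                hasher.update(chunk)

    entry = {name: hasher.digest() for name, hasher in hashers.items()}
    entry.update({"length": length, "status": "visible"})
    return entry


# Usage: hash a five-byte file, then skip it with a three-byte limit.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as fd:
        fd.write(b"hello")
    entry = content_entry_from_file(sample)
    skipped = content_entry_from_file(sample, max_content_length=3)
```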

compute_hash()[source]

Compute the hash of the current node.

The hash should depend on the data of the node, as well as on hashes of the children nodes.

to_model() → swh.model.model.BaseContent[source]

Builds a model.BaseContent object based on this leaf.

swh.model.from_disk.accept_all_directories(dirpath: str, dirname: str, entries: Iterable[Any]) → bool[source]

Default filter for Directory.from_disk() accepting all directories

Parameters
  • dirname (bytes) – directory name

  • entries (list) – directory entries

swh.model.from_disk.ignore_empty_directories(dirpath: str, dirname: str, entries: Iterable[Any]) → bool[source]

Filter for Directory.from_disk() ignoring empty directories

Parameters
  • dirname (bytes) – directory name

  • entries (list) – directory entries

Returns

True if the directory is not empty, False if the directory is empty
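
A directory filter is any callable with the three-argument signature shown above that returns True to keep a directory. A minimal sketch of an "ignore empty directories" filter (keep_non_empty is a hypothetical name, not the library function):

```python
def keep_non_empty(dirpath, dirname, entries):
    """Return True to keep the directory, False to ignore it."""
    # entries may be any iterable, so materialise it before counting.
    return len(list(entries)) > 0


# Usage: an empty directory is rejected, a populated one is kept.
empty_kept = keep_non_empty(b"/repo/empty", b"empty", [])
src_kept = keep_non_empty(b"/repo/src", b"src", [b"main.c"])
```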

swh.model.from_disk.ignore_named_directories(names, *, case_sensitive=True)[source]

Filter for Directory.from_disk() to ignore directories named one of names.

Parameters
  • names (list of bytes) – names to ignore

  • case_sensitive (bool) – whether to do the filtering in a case sensitive way

Returns

a directory filter for Directory.from_disk()
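
Because the function returns a filter rather than acting as one, it is naturally written as a closure. The following sketch (ignore_named_sketch is a hypothetical name) shows one way the documented behaviour could be implemented:

```python
def ignore_named_sketch(names, *, case_sensitive=True):
    """Build a filter that rejects directories whose name is in names."""
    if not case_sensitive:
        names = [name.lower() for name in names]
    rejected = frozenset(names)

    def named_filter(dirpath, dirname, entries):
        # Normalise the candidate the same way the rejected names were.
        candidate = dirname if case_sensitive else dirname.lower()
        return candidate not in rejected

    return named_filter


# Usage: skip version-control directories.
skip_vcs = ignore_named_sketch([b".git", b".hg"])
skip_vcs_any_case = ignore_named_sketch([b".GIT"], case_sensitive=False)
```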

swh.model.from_disk.extract_regex_objs(root_path: bytes, patterns: Iterable[bytes]) → Iterator[Pattern[bytes]][source]

Generates a regex object for each pattern given in input and checks if the path is a subdirectory or relative to the root path.

Parameters
  • root_path (bytes) – path to the root directory

  • patterns – patterns to match

swh.model.from_disk.ignore_directories_patterns(root_path: bytes, patterns: Iterable[bytes])[source]

Filter for Directory.from_disk() to ignore directories matching certain patterns.

Parameters
  • root_path (bytes) – path of the root directory

  • patterns (list of bytes) – patterns to ignore

Returns

a directory filter for Directory.from_disk()
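
A rough sketch of pattern-based filtering, assuming shell-style patterns matched against the directory path relative to the root. Both function names are hypothetical, and unlike extract_regex_objs() this sketch does not check that matched paths lie under the root.

```python
import fnmatch
import os
import re


def patterns_to_regexes(patterns):
    """Translate shell-style byte patterns into compiled bytes regexes."""
    for pattern in patterns:
        regex = fnmatch.translate(os.fsdecode(pattern))
        yield re.compile(os.fsencode(regex))


def ignore_patterns_sketch(root_path, patterns):
    """Build a filter rejecting directories whose path under root matches."""
    regexes = list(patterns_to_regexes(patterns))

    def pattern_filter(dirpath, dirname, entries):
        # Match against the path relative to the root, not the bare name.
        relative = os.path.relpath(dirpath, root_path)
        return not any(regex.match(relative) for regex in regexes)

    return pattern_filter


# Usage: ignore a top-level build directory and any *.egg-info directory.
keep = ignore_patterns_sketch(b"/repo", [b"build", b"*.egg-info"])
```

Since the pattern is matched against the relative path from its start, the pattern build rejects /repo/build but keeps /repo/src/build.
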

swh.model.from_disk.iter_directory(directory) → Tuple[List[swh.model.model.Content], List[swh.model.model.SkippedContent], List[swh.model.model.Directory]][source]

Return the directory listing from a disk-memory directory instance.

Raises

TypeError in case an unexpected object type is listed.

Returns

A tuple of three lists containing, respectively, the contents, the skipped contents, and the directories.
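
The shape of this operation, walking a tree and sorting each node into one of three lists, can be sketched with stand-in node classes. FileNode, DirNode and partition_tree are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileNode:
    name: str
    skipped: bool = False


@dataclass
class DirNode:
    name: str
    children: Dict[str, object] = field(default_factory=dict)


def partition_tree(root: DirNode) -> Tuple[List[FileNode], List[FileNode], List[DirNode]]:
    """Walk the tree and sort every node into one of three lists."""
    contents: List[FileNode] = []
    skipped: List[FileNode] = []
    directories: List[DirNode] = []
    stack: List[object] = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, DirNode):
            directories.append(node)
            stack.extend(node.children.values())
        elif isinstance(node, FileNode):
            (skipped if node.skipped else contents).append(node)
        else:
            raise TypeError(f"unexpected object type: {type(node).__name__}")
    return contents, skipped, directories


# Usage: one skipped file, two regular files, two directories.
tree = DirNode("root", {
    "a.txt": FileNode("a.txt"),
    "big.bin": FileNode("big.bin", skipped=True),
    "sub": DirNode("sub", {"b.txt": FileNode("b.txt")}),
})
contents, skipped, directories = partition_tree(tree)
```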

class swh.model.from_disk.Directory(data=None)[source]

Bases: swh.model.merkle.MerkleNode

Representation of a Software Heritage directory as a node in a Merkle Tree.

This class can be used to generate, from an on-disk directory, all the objects that need to be sent to the Software Heritage archive.

The from_disk() constructor allows you to generate the data structure from a directory on disk. The resulting Directory can then be manipulated as a dictionary, using the path as key.

The collect() method is used to retrieve all the objects that need to be added to the Software Heritage archive since the last collection, by class (contents and directories).

When using the dict-like methods to update the contents of the directory, the affected levels of hierarchy are reset and can be collected again using the same method. This enables the efficient collection of updated nodes, for instance when the client is applying diffs.
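
The interplay between dict-like updates, hash invalidation and collect() can be illustrated with a toy Merkle node. This is a simplified, hypothetical model (the Node class, its hashing scheme and its collected flag are not the library's), meant only to show why an update resets the affected levels of the hierarchy.

```python
import hashlib
from typing import Dict, List, Optional


class Node:
    """Hypothetical Merkle node: dict-like children, cached hash, collect()."""

    def __init__(self, data: bytes = b"") -> None:
        self.data = data
        self.children: Dict[bytes, "Node"] = {}
        self.parent: Optional["Node"] = None
        self._hash: Optional[bytes] = None
        self.collected = False

    @property
    def hash(self) -> bytes:
        # Computed lazily from the node data and the children's hashes.
        if self._hash is None:
            digest = hashlib.sha1(self.data)
            for name in sorted(self.children):
                digest.update(name + b"\0" + self.children[name].hash)
            self._hash = digest.digest()
        return self._hash

    def _reset(self) -> None:
        # Drop the cached hash and collection mark here and in every ancestor.
        node: Optional["Node"] = self
        while node is not None:
            node._hash = None
            node.collected = False
            node = node.parent

    def __setitem__(self, name: bytes, child: "Node") -> None:
        child.parent = self
        self.children[name] = child
        self._reset()

    def __getitem__(self, name: bytes) -> "Node":
        return self.children[name]

    def collect(self) -> List["Node"]:
        """Return the nodes changed since the last collection, then mark them."""
        if self.collected:
            return []
        self.collected = True
        found = [self]
        for child in self.children.values():
            found.extend(child.collect())
        return found


root = Node()
root[b"src"] = Node()
root[b"src"][b"main.c"] = Node(b"int main(void) { return 0; }")
root[b"README"] = Node(b"docs")

first = root.collect()    # every node is new
hash_before = root.hash
second = root.collect()   # nothing changed since the first collection

root[b"src"][b"util.c"] = Node(b"/* helpers */")
third = root.collect()    # only the new node and its ancestors
```

After the update, only util.c, src and the root are collected again; README and main.c are untouched, which is what makes incremental collection cheap.
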

object_type: typing_extensions.Final = 'directory'

classmethod from_disk(*, path, dir_filter=<function accept_all_directories>, max_content_length=None)[source]

Compute the Software Heritage objects for a given directory tree

Parameters
  • path (bytes) – the directory to traverse

  • data (bool) – whether to add the data to the content objects

  • save_path (bool) – whether to add the path to the content objects

  • dir_filter (function) – a filter to ignore some directories by name or contents. Takes three arguments: dirpath, dirname and entries, and returns True if the directory should be added, False if the directory should be ignored.

  • max_content_length (Optional[int]) – if given, all contents larger than this will be skipped.

invalidate_hash()[source]

Invalidate the cached hash of the current node.

static child_to_directory_entry(name, child)[source]

get_data(**kwargs)[source]

Retrieve and format the collected data for the current node, for use by collect().

Can be overridden, for instance when you want the collected data to contain information about the child nodes.

Parameters

kwargs – allow subclasses to alter behaviour depending on how collect() is called.

Returns

data formatted for collect()

property entries

Child nodes, sorted by name in the same way directory_identifier does.
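
Assuming directory_identifier follows Git's tree ordering, entries are compared as byte strings, with directory names treated as if they ended in a slash. The sketch below (entry_sort_key and the dictionary layout are illustrative) shows how this differs from a plain sort by name:

```python
def entry_sort_key(entry: dict) -> bytes:
    """Sort key treating directory names as if they ended with a slash."""
    if entry["type"] == "dir":
        return entry["name"] + b"/"
    return entry["name"]


entries = [
    {"name": b"foo", "type": "dir"},
    {"name": b"foo.c", "type": "file"},
    {"name": b"foo-bar", "type": "file"},
]

by_plain_name = [e["name"] for e in sorted(entries, key=lambda e: e["name"])]
by_git_order = [e["name"] for e in sorted(entries, key=entry_sort_key)]

print(by_plain_name)  # [b'foo', b'foo-bar', b'foo.c']
print(by_git_order)   # [b'foo-bar', b'foo.c', b'foo']
```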

compute_hash()[source]

Compute the hash of the current node.

The hash should depend on the data of the node, as well as on hashes of the children nodes.
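
For a directory, a hash that depends on the children can be computed the way Git hashes a tree object: each entry contributes its mode, name and target identifier, and the concatenation is hashed with a header. This is a sketch under the assumption that the directory hash is Git-compatible; tree_hash and the Entry tuple are hypothetical names.

```python
import hashlib
from typing import List, Tuple

# (mode, name, 20-byte target identifier)
Entry = Tuple[int, bytes, bytes]


def tree_hash(entries: List[Entry]) -> bytes:
    """Hash a list of entries in the format Git uses for tree objects."""

    def sort_key(entry: Entry) -> bytes:
        mode, name, _ = entry
        # Directories sort as if their name ended with a slash.
        return name + b"/" if mode == 0o040000 else name

    body = b"".join(
        b"%o %s\0" % (mode, name) + target
        for mode, name, target in sorted(entries, key=sort_key)
    )
    return hashlib.sha1(b"tree %d\0" % len(body) + body).digest()


# The empty tree has a well-known identifier in Git.
empty_tree_id = tree_hash([]).hex()

# A child's identifier feeds into the parent's hash.
empty_blob = hashlib.sha1(b"blob 0\0").digest()
readme = (0o100644, b"README", empty_blob)
script = (0o100755, b"run.sh", empty_blob)
one_file = tree_hash([readme])
two_files = tree_hash([readme, script])
```

Because entries are sorted before hashing, the result does not depend on the order in which children were added.
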

to_model() → swh.model.model.Directory[source]

Builds a model.Directory object based on this node, ignoring its children.