swh.model.from_disk module#

Conversion from filesystem tree to SWH objects.

This module reads a tree of directories and files from a local filesystem and converts them to in-memory data structures, which can then be exported to SWH data model objects, as defined in swh.model.model.

class swh.model.from_disk.DiskBackedContent(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', ctime: datetime | None = None, path: bytes | None = None)[source]#

Bases: BaseContent

Content-like class, which allows lazy-loading data from the disk.

Method generated by attrs for class DiskBackedContent.

object_type: Final = 'content_file'#
sha1#
sha1_git#
sha256#
blake2s256#
length#
status#
ctime#
path#
classmethod from_dict(d)[source]#

Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects.

with_data() → Content[source]#

class swh.model.from_disk.DentryPerms(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: IntEnum

Admissible permissions for directory entries.

content = 33188#

Content

executable_content = 33261#

Executable content (e.g. executable script)

symlink = 40960#

Symbolic link

directory = 16384#

Directory

revision = 57344#

Revision (e.g. submodule)
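The decimal values are easier to read in octal, where they line up with the file modes Git stores in tree entries. The following is a minimal standard-library check, not code from swh.model; the `PERMS` dictionary is a hypothetical stand-in for the enum, and the symlink mode uses Git's conventional 0o120000, which is 40960 in decimal.

```python
import stat

# Hypothetical stand-in for DentryPerms, written as Git-style octal modes.
PERMS = {
    "content": 0o100644,             # regular file, rw-r--r--
    "executable_content": 0o100755,  # regular file, rwxr-xr-x
    "symlink": 0o120000,             # assumption: Git's symlink mode
    "directory": 0o040000,
    "revision": 0o160000,            # Git's "gitlink" mode, used for submodules
}

# The file-type bits agree with what the stat module reports.
assert stat.S_ISREG(PERMS["content"])
assert stat.S_ISLNK(PERMS["symlink"])
assert stat.S_ISDIR(PERMS["directory"])

for name, mode in PERMS.items():
    # Print each permission in decimal and in octal.
    print(f"{name:20} {mode:6d} {mode:#o}")
```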

swh.model.from_disk.mode_to_perms(mode)[source]#

Convert a file mode to a permission compatible with Software Heritage directory entries

Parameters:

mode (int) – a file mode as returned by os.stat() in os.stat_result.st_mode

Returns:

one of the following values:

  • DentryPerms.content: plain file

  • DentryPerms.executable_content: executable file

  • DentryPerms.symlink: symbolic link

  • DentryPerms.directory: directory

Return type:

DentryPerms
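The mapping described above can be sketched with the standard `stat` module. This is an illustration under stated assumptions, not the library's implementation: `SketchPerms` and `sketch_mode_to_perms` are hypothetical names, and the rule that any execute bit marks a file as executable is an assumption.

```python
import stat
from enum import IntEnum


class SketchPerms(IntEnum):
    """Hypothetical stand-in for DentryPerms, limited to the values returned here."""

    content = 0o100644
    executable_content = 0o100755
    symlink = 0o120000
    directory = 0o040000


def sketch_mode_to_perms(mode: int) -> SketchPerms:
    """Classify an st_mode value; an illustration, not the library's code."""
    if stat.S_ISLNK(mode):
        return SketchPerms.symlink
    if stat.S_ISDIR(mode):
        return SketchPerms.directory
    # Assumption: any execute bit (user, group or other) marks the file executable.
    if mode & 0o111:
        return SketchPerms.executable_content
    return SketchPerms.content
```

When classifying a real path, the mode would typically come from `os.lstat(path).st_mode` rather than `os.stat()`, since `os.stat()` follows symbolic links and would never report one.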

class swh.model.from_disk.Content(data=None)[source]#

Bases: MerkleLeaf

Representation of a Software Heritage content as a node in a Merkle tree.

The current Merkle hash for the Content nodes is the sha1_git, which makes it consistent with what Directory uses for its own hash computation.
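For reference, sha1_git is the hash Git assigns to a blob: the SHA-1 of a short header followed by the raw bytes. The sketch below recomputes it with `hashlib` only; `sha1_git_of_blob` is a hypothetical helper name, not part of swh.model.

```python
import hashlib


def sha1_git_of_blob(data: bytes) -> str:
    """Hash raw bytes the way Git hashes a blob: header, NUL byte, then the data."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


print(sha1_git_of_blob(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(sha1_git_of_blob(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Prefixing such a hex digest with `swh:1:cnt:` gives the textual form of a content SWHID.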

object_type: Final = 'content'#
classmethod from_bytes(*, mode, data)[source]#

Convert data (raw bytes) to a Software Heritage content entry

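Parameters:
  • mode (int) – a file mode (passed to mode_to_perms())

  • data (bytes) – raw contents of the file

classmethod from_symlink(*, path, mode)[source]#

Convert a symbolic link to a Software Heritage content entry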

classmethod from_file(*, path, max_content_length=None)[source]#

Compute the Software Heritage content entry corresponding to an on-disk file.

The returned dictionary contains keys useful for both:
  • loading the content in the archive (hashes, length)

  • using the content as a directory entry in a directory

Parameters:
  • save_path (bool) – add the file path to the entry

  • max_content_length (Optional[int]) – if given, all contents larger than this will be skipped.

swhid() → CoreSWHID[source]#

Return node identifier as a SWHID

compute_hash()[source]#

Compute the hash of the current node.

The hash should depend on the data of the node, as well as on hashes of the children nodes.

to_model() → BaseContent[source]#

Builds a model.BaseContent object based on this leaf.

swh.model.from_disk.accept_all_directories(dirpath: bytes, dirname: bytes, entries: Iterable[Any] | None) → bool[source]#

Default filter for Directory.from_disk() accepting all directories

Parameters:
  • dirname (bytes) – directory name

  • entries (list) – directory entries

swh.model.from_disk.accept_all_paths(path: bytes, name: bytes, entries: Iterable[Any] | None) → bool[source]#

Default filter for Directory.from_disk() accepting all paths

swh.model.from_disk.ignore_empty_directories(dirpath: bytes, dirname: bytes, entries: Iterable[Any] | None) → bool[source]#

Filter for directory_to_objects() ignoring empty directories

Parameters:
  • dirname (bytes) – directory name

  • entries (list) – directory entries

Returns:

True if the directory is not empty, False if the directory is empty

swh.model.from_disk.ignore_named_directories(names, *, case_sensitive=True)[source]#

Filter for directory_to_objects() to ignore directories named one of names.

Parameters:
  • names (list of bytes) – names to ignore

  • case_sensitive (bool) – whether to do the filtering in a case sensitive way

Returns:

a directory filter for directory_to_objects()
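A filter of this kind is a closure over the set of names to reject. The sketch below shows one way such a factory could look; `sketch_ignore_named` is a hypothetical re-implementation written for illustration, using the three-argument filter shape (directory path, directory name, entries) shown in the signatures above.

```python
from typing import Callable, Iterable, Optional

# Shape of a directory filter: (dirpath, dirname, entries) -> keep?
Filter = Callable[[bytes, bytes, Optional[Iterable[bytes]]], bool]


def sketch_ignore_named(names: Iterable[bytes], *, case_sensitive: bool = True) -> Filter:
    """Build a filter rejecting directories whose name is in `names` (hypothetical)."""

    def normalise(name: bytes) -> bytes:
        # Compare lowercased names when the match is case-insensitive.
        return name if case_sensitive else name.lower()

    ignored = frozenset(normalise(name) for name in names)

    def named_filter(dirpath: bytes, dirname: bytes, entries: Optional[Iterable[bytes]]) -> bool:
        # True means "keep this directory", False means "skip it".
        return normalise(dirname) not in ignored

    return named_filter


skip_vcs = sketch_ignore_named([b".git", b".hg"])
print(skip_vcs(b"/repo/.git", b".git", []))  # False: the directory is skipped
print(skip_vcs(b"/repo/src", b"src", []))    # True: the directory is kept
```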

swh.model.from_disk.extract_regex_objs(root_path: bytes, patterns: Iterable[bytes]) → Iterator[Pattern[bytes]][source]#

Generates a regex object for each pattern given in input and checks if the path is a subdirectory or relative to the root path.

Parameters:
  • root_path (bytes) – path to the root directory

  • patterns – shell patterns to match
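Turning a shell pattern into a compiled bytes regex can be done with `fnmatch.translate()`. The sketch below covers only that conversion step and leaves out the root-path check described above; `sketch_patterns_to_regexes` is a hypothetical name, and whether the library converts patterns this way is an assumption.

```python
import fnmatch
import re
from typing import Iterable, Iterator


def sketch_patterns_to_regexes(patterns: Iterable[bytes]) -> Iterator["re.Pattern[bytes]"]:
    """Yield one compiled bytes regex per shell pattern (illustrative only)."""
    for pattern in patterns:
        # fnmatch.translate() works on str, so decode, translate, then re-encode.
        regex = fnmatch.translate(pattern.decode())
        yield re.compile(regex.encode())


regexes = list(sketch_patterns_to_regexes([b"*.pyc", b"build/*"]))
# For each regex, report whether it matches the sample path.
print([bool(r.match(b"module.pyc")) for r in regexes])
```

Note that in `fnmatch` patterns `*` also matches path separators, so `b"build/*"` matches nested paths such as `b"build/a/b"`.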

swh.model.from_disk.ignore_directories_patterns(root_path: bytes, patterns: Iterable[bytes])[source]#

Filter for directory_to_objects() to ignore directories matching certain patterns.

Parameters:
  • root_path (bytes) – path of the root directory

  • patterns (list of bytes) – patterns to ignore

Returns:

a directory filter for directory_to_objects()

swh.model.from_disk.iter_directory(directory) → Tuple[List[Content], List[SkippedContent], List[Directory]][source]#

Return the directory listing from a disk-memory directory instance.

Raises:

TypeError in case an unexpected object type is listed.

Returns:

A tuple of three lists: contents, skipped contents and directories, in that order.

class swh.model.from_disk.Directory(data=None)[source]#

Bases: MerkleNode

Representation of a Software Heritage directory as a node in a Merkle Tree.

This class can be used to generate, from an on-disk directory, all the objects that need to be sent to the Software Heritage archive.

The from_disk() constructor allows you to generate the data structure from a directory on disk. The resulting Directory can then be manipulated as a dictionary, using the path as key.

The collect() method is used to retrieve all the objects that need to be added to the Software Heritage archive since the last collection, by class (contents and directories).

When using the dict-like methods to update the contents of the directory, the affected levels of hierarchy are reset and can be collected again using the same method. This enables the efficient collection of updated nodes, for instance when the client is applying diffs.

object_type: Final = 'directory'#
classmethod from_disk(*, path: bytes, path_filter: Callable[[bytes, bytes, List[bytes] | None], bool] = accept_all_paths, dir_filter: Callable[[bytes, bytes, List[bytes] | None], bool] | None = None, max_content_length: int | None = None, progress_callback: Callable[[int], None] | None = None) → Directory[source]#

Compute the Software Heritage objects for a given directory tree

Parameters:
  • path (bytes) – the directory to traverse

  • data (bool) – whether to add the data to the content objects

  • save_path (bool) – whether to add the path to the content objects

  • path_filter (function) – a filter to ignore some paths. Takes three arguments: path, name and entries. entries is None for files, and a (possibly empty) list of names for directories. Returns True if the path should be added, False if the path should be ignored.

  • dir_filter (DEPRECATED, function) – a filter to ignore some directories by name or contents. Takes two arguments: dirname and entries, and returns True if the directory should be added, False if the directory should be ignored.

  • max_content_length (Optional[int]) – if given, all contents larger than this will be skipped.

  • progress_callback (Optional function) – if given, called for each non-empty directory traversed with the number of computed entries.
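The path_filter contract described in the parameter list (entries is None for files and a list of names for directories) can be illustrated with a small filter. Both functions below are hypothetical examples written for this page, not part of swh.model.

```python
from typing import List, Optional


def sketch_path_filter(path: bytes, name: bytes, entries: Optional[List[bytes]]) -> bool:
    """Hypothetical filter: skip *.pyc files and empty directories, keep the rest."""
    if entries is None:
        # Files are passed with entries=None.
        return not name.endswith(b".pyc")
    # Directories are passed with a (possibly empty) list of entry names.
    return len(entries) > 0


def sketch_progress(count: int) -> None:
    """Hypothetical progress callback receiving a number of computed entries."""
    print(f"{count} entries computed")


print(sketch_path_filter(b"/tree/cache.pyc", b"cache.pyc", None))  # False
print(sketch_path_filter(b"/tree/src", b"src", [b"main.py"]))      # True
```

Assuming swh.model is installed, these would be passed by keyword, along the lines of `Directory.from_disk(path=b"/some/tree", path_filter=sketch_path_filter, progress_callback=sketch_progress)`, where the path is a placeholder.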

invalidate_hash()[source]#

Invalidate the cached hash of the current node.

static child_to_directory_entry(name, child)[source]#
get_data(**kwargs)[source]#

Retrieve and format the collected data for the current node, for use by collect().

Can be overridden, for instance when you want the collected data to contain information about the child nodes.

Parameters:

kwargs – allow subclasses to alter behaviour depending on how collect() is called.

Returns:

data formatted for collect()

property entries#

Child nodes, sorted by name in the same way swh.model.git_objects.directory_git_object() does.
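Git does not sort tree entries by plain name: a directory name compares as if it ended with a slash. Assuming that is the ordering referred to here, the sketch below shows its effect on a simplified entry list; `git_sort_key` and the `(name, is_directory)` tuples are hypothetical stand-ins for the real entry objects.

```python
from typing import List, Tuple

# (name, is_directory) pairs; a simplified stand-in for real directory entries.
Entry = Tuple[bytes, bool]


def git_sort_key(entry: Entry) -> bytes:
    """Sort key for Git trees: directory names compare as if they ended with '/'."""
    name, is_directory = entry
    return name + b"/" if is_directory else name


entries: List[Entry] = [(b"foo", True), (b"foo.txt", False), (b"bar", False)]
ordered = sorted(entries, key=git_sort_key)
print([name for name, _ in ordered])  # [b'bar', b'foo.txt', b'foo']
```

The file `foo.txt` sorts before the directory `foo` because `.` (0x2e) is smaller than `/` (0x2f), which differs from a plain sort on the names.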

swhid() → CoreSWHID[source]#

Return node identifier as a SWHID

compute_hash()[source]#

Compute the hash of the current node.

The hash should depend on the data of the node, as well as on hashes of the children nodes.
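To make the Merkle property concrete, here is a sketch of a directory hash computed the way Git hashes a tree object: each entry contributes its mode, name and child hash, so any change in a child changes the parent. It is an illustration using `hashlib` only; `sketch_tree_hash` and the tuple layout are hypothetical, and the library's own computation may differ in detail.

```python
import hashlib
from typing import Iterable, Tuple

# (mode, name, 20-byte child hash, is_directory); a simplified entry record.
Entry = Tuple[int, bytes, bytes, bool]


def sketch_tree_hash(entries: Iterable[Entry]) -> bytes:
    """Hash a directory the way Git hashes a tree object."""
    # Directories sort as if their name ended with '/'.
    ordered = sorted(entries, key=lambda e: e[1] + b"/" if e[3] else e[1])
    body = b"".join(
        b"%o %s\x00" % (mode, name) + child_hash
        for mode, name, child_hash, _ in ordered
    )
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).digest()


def blob_hash(data: bytes) -> bytes:
    """Git blob hash of raw bytes, used here as the child hash of a file."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).digest()


print(sketch_tree_hash([]).hex())  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
before = sketch_tree_hash([(0o100644, b"README", blob_hash(b"v1\n"), False)])
after = sketch_tree_hash([(0o100644, b"README", blob_hash(b"v2\n"), False)])
print(before != after)  # True: editing the file changes the directory hash
```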

to_model() → Directory[source]#

Builds a model.Directory object based on this node, ignoring its children.