swh.model.from_disk module
Conversion from filesystem tree to SWH objects.
This module reads a tree of directories and files from a local
filesystem and converts it to in-memory data structures, which can then
be exported to SWH data model objects, as defined in swh.model.model.
- class swh.model.from_disk.FromDiskType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: _StringCompatibleEnum
Possible object types for a “from disk” object.
- CONTENT = 'content'
- DIRECTORY = 'directory'
- class swh.model.from_disk.DiskBackedData(path: bytes)
Bases: object
Method generated by attrs for class DiskBackedData.
- path
- class swh.model.from_disk.DentryPerms(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: IntEnum
Admissible permissions for directory entries.
- content = 33188
Content
- executable_content = 33261
Executable content (e.g. executable script)
- symlink = 40960
Symbolic link
- directory = 16384
Directory
- revision = 57344
Revision (e.g. submodule)
- swh.model.from_disk.mode_to_perms(mode)
Convert a file mode to a permission compatible with Software Heritage directory entries.
- Parameters:
mode (int) – a file mode as returned by os.stat() in os.stat_result.st_mode
- Returns:
one of the following values:
  - DentryPerms.content: plain file
  - DentryPerms.executable_content: executable file
  - DentryPerms.symlink: symbolic link
  - DentryPerms.directory: directory
- Return type:
DentryPerms
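The numeric values of DentryPerms are easier to read in octal, where they look like ordinary file modes. The sketch below restates the enum with the values listed above and shows one way an st_mode value could be mapped onto it. `mode_to_perms_sketch` is a hypothetical stand-in written for illustration, and the rule that any execute bit yields executable_content is an assumption rather than a statement about the library's implementation.

```python
import stat
from enum import IntEnum


class DentryPerms(IntEnum):
    """Same values as documented above, written in octal."""

    content = 0o100644             # 33188
    executable_content = 0o100755  # 33261
    symlink = 0o120000             # 40960
    directory = 0o040000           # 16384
    revision = 0o160000            # 57344


def mode_to_perms_sketch(mode: int) -> DentryPerms:
    """Hypothetical mapping from an st_mode value to a DentryPerms member."""
    if stat.S_ISLNK(mode):
        return DentryPerms.symlink
    if stat.S_ISDIR(mode):
        return DentryPerms.directory
    # Assumption: any execute bit (user, group or other) marks the file executable.
    if mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return DentryPerms.executable_content
    return DentryPerms.content
```

With this sketch, a regular file with mode 0o100600 maps to DentryPerms.content, so the original permission bits are normalized rather than preserved.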
- class swh.model.from_disk.Content(data=None)
Bases: MerkleLeaf
Representation of a Software Heritage content as a node in a Merkle tree.
The current Merkle hash for the Content nodes is the sha1_git, which makes it consistent with what Directory uses for its own hash computation.
- classmethod from_bytes(*, mode, data)
Convert data (raw bytes) to a Software Heritage content entry.
- Parameters:
mode (int) – a file mode (passed to mode_to_perms())
data (bytes) – raw contents of the file
- classmethod from_symlink(*, path, mode)
Convert a symbolic link to a Software Heritage content entry.
- classmethod from_file(*, path, max_content_length=None)
Compute the Software Heritage content entry corresponding to an on-disk file.
The returned dictionary contains keys useful for both:
  - loading the content in the archive (hashes, length)
  - using the content as a directory entry in a directory
- compute_hash()
Compute the hash of the current node.
The hash should depend on the data of the node, as well as on hashes of the children nodes.
- to_model() → BaseContent
Builds a model.BaseContent object based on this leaf.
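To make the Content hash concrete, here is a minimal sketch of a sha1_git computation. It assumes sha1_git denotes the hash git assigns to a blob, that is, SHA-1 over a `blob <length>\0` header followed by the raw bytes. The function name `sha1_git_of_blob` is hypothetical and not part of swh.model.

```python
import hashlib


def sha1_git_of_blob(data: bytes) -> str:
    """Hash raw bytes the way git hashes a blob object (hypothetical helper)."""
    # The header carries the object type and the payload length in bytes.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known git identifier.
empty_blob_id = sha1_git_of_blob(b"")
```

Because the length is part of the hashed input, this value differs from a plain SHA-1 of the same bytes.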
- swh.model.from_disk.accept_all_directories(dirpath: bytes, dirname: bytes, entries: Iterable[Any] | None) → bool
Default filter for Directory.from_disk() accepting all directories.
- swh.model.from_disk.accept_all_paths(path: bytes, name: bytes, entries: Iterable[Any] | None) → bool
Default filter for Directory.from_disk() accepting all paths.
- swh.model.from_disk.ignore_empty_directories(dirpath: bytes, dirname: bytes, entries: Iterable[Any] | None) → bool
Filter for directory_to_objects() ignoring empty directories.
- swh.model.from_disk.ignore_named_directories(names, *, case_sensitive=True)
Filter for directory_to_objects() to ignore directories named one of names.
- swh.model.from_disk.extract_regex_objs(root_path: bytes, patterns: Iterable[bytes]) → Iterator[Pattern[bytes]]
Generates a regex object for each pattern given in input and checks if the path is a subdirectory of, or relative to, the root path.
- Parameters:
root_path (bytes) – path to the root directory
patterns – shell patterns to match
- swh.model.from_disk.ignore_directories_patterns(root_path: bytes, patterns: Iterable[bytes])
Filter for directory_to_objects() to ignore directories matching certain patterns.
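The following sketch illustrates how shell patterns given as bytes can be turned into compiled regular expressions and used to decide whether a directory should be ignored. It relies on the standard library's fnmatch.translate() and is only an approximation of the idea. The helper names `shell_patterns_to_regexes` and `matches_any` are hypothetical, and the root path check described above is left out.

```python
import fnmatch
import os
import re
from typing import Iterable, Iterator, Pattern


def shell_patterns_to_regexes(patterns: Iterable[bytes]) -> Iterator[Pattern[bytes]]:
    """Hypothetical helper: compile each shell pattern into a bytes regex."""
    for pattern in patterns:
        # fnmatch.translate() works on str, so round-trip through the
        # filesystem encoding before compiling a bytes pattern.
        regex = fnmatch.translate(os.fsdecode(pattern))
        yield re.compile(os.fsencode(regex))


def matches_any(relative_path: bytes, regexes: Iterable[Pattern[bytes]]) -> bool:
    """Return True when at least one compiled pattern matches the whole path."""
    return any(regex.match(relative_path) for regex in regexes)


regexes = list(shell_patterns_to_regexes([b"*.egg-info", b"build"]))
```

A filter built on this idea would return False (ignore) for a directory whose path relative to the root matches one of the patterns.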
- swh.model.from_disk.iter_directory(directory: Directory) → Tuple[List[Content], List[SkippedContent], List[Directory]]
Return the directory listing from an in-memory (from-disk) Directory instance.
- Raises:
TypeError – in case an unexpected object type is listed.
- Returns:
A tuple of three lists, in order: contents, skipped contents, and directories.
- class swh.model.from_disk.Directory(data=None)
Bases: MerkleNode
Representation of a Software Heritage directory as a node in a Merkle Tree.
This class can be used to generate, from an on-disk directory, all the objects that need to be sent to the Software Heritage archive.
The from_disk() constructor allows you to generate the data structure from a directory on disk. The resulting Directory can then be manipulated as a dictionary, using the path as key.
The collect() method is used to retrieve all the objects that need to be added to the Software Heritage archive since the last collection, by class (contents and directories).
When using the dict-like methods to update the contents of the directory, the affected levels of hierarchy are reset and can be collected again using the same method. This enables the efficient collection of updated nodes, for instance when the client is applying diffs.
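The reset-and-collect behaviour described above can be illustrated with a toy Merkle tree. This is a simplified model written for this page, not the library's MerkleNode: the class `ToyNode`, its attributes, and its hashing scheme are all hypothetical. It shows only the principle that updating one entry invalidates that node and its ancestors, so a later collection returns just the affected nodes.

```python
import hashlib
from typing import Dict, List, Optional


class ToyNode:
    """Toy Merkle node: holds data and named children (hypothetical class)."""

    def __init__(self, data: bytes = b"") -> None:
        self.data = data
        self.children: Dict[bytes, "ToyNode"] = {}
        self.parent: Optional["ToyNode"] = None
        self._hash: Optional[bytes] = None
        self.collected = False

    @property
    def hash(self) -> bytes:
        # Computed lazily from the node data and the children hashes,
        # then cached until the node is invalidated.
        if self._hash is None:
            digest = hashlib.sha1(self.data)
            for name in sorted(self.children):
                digest.update(name + b"\0" + self.children[name].hash)
            self._hash = digest.digest()
        return self._hash

    def __setitem__(self, name: bytes, child: "ToyNode") -> None:
        child.parent = self
        self.children[name] = child
        self._invalidate()

    def _invalidate(self) -> None:
        # Reset this node and all of its ancestors; siblings are untouched.
        node: Optional["ToyNode"] = self
        while node is not None:
            node._hash = None
            node.collected = False
            node = node.parent

    def collect(self) -> List["ToyNode"]:
        """Return the nodes not collected yet, then mark them as collected."""
        if self.collected:
            return []
        nodes = [self]
        for child in self.children.values():
            nodes.extend(child.collect())
        self.collected = True
        return nodes
```

In this toy model, a first collect() on the root returns every node, a second one returns nothing, and replacing a single leaf makes the next collect() return only the new leaf and its ancestors.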
- classmethod from_disk(*, path: bytes, path_filter: Callable[[bytes, bytes, List[bytes] | None], bool] = accept_all_paths, max_content_length: int | None = None, progress_callback: Callable[[int], None] | None = None) → Directory
Compute the Software Heritage objects for a given directory tree.
- Parameters:
path (bytes) – the directory to traverse
data (bool) – whether to add the data to the content objects
save_path (bool) – whether to add the path to the content objects
path_filter (function) – a filter to ignore some paths. Takes three arguments: path, name and entries. entries is None for files, and a (possibly empty) list of names for directories. Returns True if the path should be added, False if the path should be ignored.
max_content_length (Optional[int]) – if given, all contents larger than this will be skipped.
progress_callback (Optional function) – if given, called for each non-empty directory traversed with the number of computed entries.
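As an illustration of the path_filter and progress_callback contracts described above, the sketch below defines a filter that skips version-control directories and a callback that reports progress. The names `VCS_DIRS`, `skip_vcs_directories` and `report_progress` are hypothetical, and the example path is a placeholder.

```python
from typing import List, Optional

# Hypothetical set of directory names to leave out of the traversal.
VCS_DIRS = {b".git", b".hg", b".svn"}


def skip_vcs_directories(
    path: bytes, name: bytes, entries: Optional[List[bytes]]
) -> bool:
    """Return False for version-control directories, True for everything else."""
    is_directory = entries is not None  # entries is None for files
    return not (is_directory and name in VCS_DIRS)


def report_progress(count: int) -> None:
    """Receive the number of computed entries for a traversed directory."""
    print(f"computed {count} entries")


# Intended use, following the signature documented above (requires swh.model):
# Directory.from_disk(
#     path=b"/path/to/checkout",
#     path_filter=skip_vcs_directories,
#     progress_callback=report_progress,
# )
```

Because entries is None for files, a regular file that happens to be named .git is still accepted by this filter.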
- iter_tree(dedup=True) → Iterator[Directory | Content]
Yields all child nodes, recursively. Common nodes are deduplicated by default (deduplication can be turned off by setting the dedup argument to False).
- get_data(**kwargs)
Retrieve and format the collected data for the current node, for use by collect().
Can be overridden, for instance when you want the collected data to contain information about the child nodes.
- Parameters:
kwargs – allow subclasses to alter behaviour depending on how collect() is called.
- Returns:
data formatted for collect()
- property entries
Child nodes, sorted by name in the same way swh.model.git_objects.directory_git_object() does.
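The ordering matters because it feeds into the directory hash. The sketch below assumes that directory_git_object() follows git's tree ordering, where a directory name is compared as if it ended with a slash. The helper `git_entry_sort_key` and the sample entries are hypothetical.

```python
from typing import List, Tuple


def git_entry_sort_key(entry: Tuple[bytes, bool]) -> bytes:
    """Sort key under the assumed git rule: directories compare as name + b'/'."""
    name, is_directory = entry
    return name + b"/" if is_directory else name


# (name, is_directory) pairs; b"foo" is the only directory.
entries: List[Tuple[bytes, bool]] = [
    (b"foo.c", False),
    (b"foo", True),
    (b"foo-bar", False),
]

git_order = [name for name, _ in sorted(entries, key=git_entry_sort_key)]
plain_order = sorted(name for name, _ in entries)
```

Under this assumption the directory foo sorts after foo-bar and foo.c, whereas a plain byte-wise sort of the names would put it first.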