swh.loader.mercurial.bundle20_reader module
This document contains code for extracting all of the data from a Mercurial version 2 bundle file. It is referenced by bundle20_loader.py.

swh.loader.mercurial.bundle20_reader.unpack(fmt_str, source)[source]
Utility function for fetching the right number of bytes from a stream to satisfy a struct.unpack pattern.

Parameters
    fmt_str – a struct.unpack string pattern (e.g. ‘>I’ for 4 bytes big-endian)
    source – any IO object that has a read(<size>) method which returns an appropriate sequence of bytes
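
As a rough illustration, a helper with this contract can be written in a few lines; the single-value convenience return shown here is an assumption about the real function's behaviour, not a documented guarantee:

    import struct

    def unpack(fmt_str, source):
        """Read exactly struct.calcsize(fmt_str) bytes from source and unpack them."""
        size = struct.calcsize(fmt_str)
        data = source.read(size)
        fields = struct.unpack(fmt_str, data)
        # Assumed convenience: return a bare value when the pattern describes
        # a single field, e.g. unpack('>I', f) -> int
        if len(fields) == 1:
            return fields[0]
        return fields

For example, unpack('>I', bundle_file) would read the next four bytes of the bundle as one big-endian unsigned integer.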

class swh.loader.mercurial.bundle20_reader.Bundle20Reader(bundlefile, cache_filename, cache_size=None)[source]
Bases: object

Parser for extracting data from Mercurial Bundle20 files. NOTE: Currently only works on uncompressed HG20 bundles, but checking for COMPRESSION=<2chars> and loading the appropriate stream decompressor at that point would be trivial to add if necessary.

Parameters
    bundlefile (str) – name of the binary repository bundle file
    cache_filename (str) – path to the disk cache used (passed on to the SelectiveCache instance)
    cache_size (int) – tuning parameter for the upper RAM limit used by historical data caches. The default is defined in the SelectiveCache class.
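
A minimal usage sketch, assuming an uncompressed HG20 bundle on disk; the file paths and the hg invocation below are illustrative, not part of this API:

    # One way to produce an uncompressed bundle2 file is typically:
    #   hg bundle --all --type none-v2 repo.hg20.bundle
    from swh.loader.mercurial.bundle20_reader import Bundle20Reader

    reader = Bundle20Reader(
        bundlefile='repo.hg20.bundle',           # uncompressed HG20 bundle
        cache_filename='/tmp/hg-loader-cache',   # disk cache handed to SelectiveCache
    )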

NAUGHT_NODE = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

read_bundle_header(bfile)[source]
Parse the file header which describes the format and parameters. See the structure diagram at the top of the file for more insight.

Parameters
    bfile – bundle file handle with the cursor at the start offset of the content header (the 9th byte in the file)
Returns
    dict of decoded bundle parameters

revdata_iterator(bytes_to_read)[source]
A chunk’s revdata section is a series of start/end/length/data_delta content updates called RevDiffs that indicate components of a text diff applied to the node’s basenode. The sum length of all diffs is the length indicated at the beginning of the chunk at the start of the header. See the structure diagram at the top of the file for more insight.

Parameters
    bytes_to_read – int total number of bytes in the chunk’s revdata
Yields
    (int, int, read iterator) representing a single text diff component
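
As a sketch of the decoding loop, assuming each RevDiff is encoded as three big-endian 32-bit integers (start, end, data length) followed by the delta data; the field widths are an assumption here, and the delta bytes are yielded directly rather than as a read iterator:

    import struct

    def revdata_iterator_sketch(source, bytes_to_read):
        # Each iteration consumes one RevDiff: 12 bytes of offsets/length,
        # then the replacement data itself.
        while bytes_to_read > 0:
            start, end, data_len = struct.unpack('>III', source.read(12))
            data = source.read(data_len)
            bytes_to_read -= 12 + data_len
            yield (start, end, data)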

read_chunk_header()[source]
The header of a RevChunk describes the id (‘node’) for the current change, the commit id (‘linknode’) associated with this change, the parental heritage (‘p1’ and ‘p2’), and the node to which the revdata updates will apply (‘basenode’). ‘linknode’ is the same as ‘node’ when reading the commit log, because a commit links to itself. ‘basenode’ for a changeset will be NAUGHT_NODE, because changeset chunks include complete information and not diffs. See the structure diagram at the top of the file for more insight.

Returns
    dict of the next delta header
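
A conceptual sketch of the decoding, assuming a changegroup-2 style layout of five consecutive 20-byte ids; the exact field order is an assumption, not taken from the module:

    def read_chunk_header_sketch(source):
        node, p1, p2, basenode, linknode = (source.read(20) for _ in range(5))
        return {
            'node': node,          # id of the current change
            'p1': p1,              # first parent
            'p2': p2,              # second parent
            'basenode': basenode,  # NAUGHT_NODE for changesets (full content, not a diff)
            'linknode': linknode,  # commit this change belongs to
        }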

read_revchunk()[source]
Fetch a complete RevChunk. A RevChunk contains the collection of line changes made in a particular update. header[‘node’] identifies which update. Commits, manifests, and files all have these. Each chunk contains an indicator of the whole chunk size, an update header, and then the body of the update as a series of text diff components. See the structure diagram at the top of the file for more insight.

Returns
    tuple(dict, iterator) of (header, chunk data) if there is another chunk in the group, else None

extract_commit_metadata(data)[source]
Converts the binary commit metadata format into a dict.

Parameters
    data – bytestring of encoded commit information
Returns
    dict of decoded commit information
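
A minimal sketch of such a decoder, assuming the standard Mercurial changeset layout (manifest node hex on line 1, committer on line 2, 'timestamp timezone [extras]' on line 3, then one changed file per line, a blank line, and the message); the key names in the returned dict are illustrative, not necessarily the module's own:

    def extract_commit_metadata_sketch(data):
        header, _, message = data.partition(b'\n\n')
        lines = header.split(b'\n')
        time_parts = lines[2].split(b' ', 2)
        return {
            'manifest': lines[0],
            'user': lines[1],
            'time': float(time_parts[0].decode()),
            'time_offset': int(time_parts[1].decode()),
            'changed_files': lines[3:],
            'message': message,
        }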

skip_sections(num_sections=1)[source]
Skip past <num_sections> sections quickly.

Parameters
    num_sections – int number of sections to skip

apply_revdata(revdata_it, prev_state)[source]
Compose the complete text body for a change from component deltas.

Parameters
    revdata_it – output from the revdata_iterator method
    prev_state – bytestring of the base complete text on which the new deltas will be applied
Returns
    (bytestring, list, list) of the new complete string and lists of added and removed components (used in manifest processing)
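
The core of the delta application is a classic patch-composition loop. A minimal sketch, assuming each item gives (start, end, new_bytes) offsets into the base text; the real method consumes read iterators and additionally tracks the added/removed components:

    def apply_revdata_sketch(revdata_it, prev_state):
        out = []
        cursor = 0
        for start, end, data in revdata_it:
            out.append(prev_state[cursor:start])   # unchanged span of the base text
            out.append(data)                       # replacement bytes for [start:end)
            cursor = end
        out.append(prev_state[cursor:])            # tail after the last diff
        return b''.join(out)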

skim_headers()[source]
Get all header data from a change group but bypass processing of the contained delta components.

Yields
    output of read_chunk_header method for all chunks in the group

group_iterator()[source]
Bundle sections are called groups. These are composed of one or more revision chunks of delta components. Iterate over all the chunks in a group and hand each one back.

Yields
    see output of read_revchunk method

yield_group_objects(cache_hints=None, group_offset=None)[source]
Bundles are sectioned into groups: the log of all commits, the log of all manifest changes, and a series of logs of blob changes (one for each file). All groups are structured the same way, as a series of revisions each with a series of delta components. Iterate over the current group and return the completed object data for the current update by applying all of the internal delta components to each prior revision.

Parameters
    cache_hints – see build_cache_hints (this will be built automatically if not pre-built and passed in)
    group_offset – int file position of the start of the desired group
Yields
    (dict, bytestring, list, list) the output from read_chunk_header followed by the output from apply_revdata
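
An illustrative loop over one group, with a reader constructed as sketched above; the tuple shape follows the Yields description, the 'node' key follows the read_chunk_header description, and the rest is assumed:

    for header, full_text, added, removed in reader.yield_group_objects():
        # header comes from read_chunk_header, the rest from apply_revdata
        print(header['node'].hex(), len(full_text))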

extract_meta_from_blob(data)[source]
File revision data sometimes begins with a metadata section of dubious value. Strip it off and maybe decode it. It seems to be mostly useless. Why indicate that a file node is a copy of another node? You can already get that information from the delta header.

Parameters
    data – bytestring of one revision of a file, possibly with metadata embedded at the start
Returns
    (bytestring, dict) of (the blob data, the meta information)
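
A minimal sketch of the stripping step, assuming Mercurial's usual convention that filelog metadata sits between two b'\x01\n' markers at the start of the revision text, one 'key: value' pair per line:

    def extract_meta_from_blob_sketch(data):
        meta = {}
        if data.startswith(b'\x01\n'):
            end = data.index(b'\x01\n', 2)           # closing marker
            for line in data[2:end].split(b'\n'):
                if line:
                    key, _, value = line.partition(b': ')
                    meta[key] = value
            data = data[end + 2:]                    # blob proper starts here
        return data, meta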

yield_all_blobs()[source]
Gets blob data from the bundle.

Yields
    (bytestring, (bytestring, int, dict)) of (blob data, (file name, start offset of the file within the bundle, node header))
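
For example, given a reader built as sketched above, iterating file contents could look like this; decoding the file name assumes it is a bytestring:

    for blob, (file_name, offset, node_header) in reader.yield_all_blobs():
        print(file_name.decode('utf-8', 'replace'), len(blob), offset)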

yield_all_changesets()[source]
Gets commit data from the bundle.

Yields
    (dict, dict) of (read_chunk_header output, extract_commit_metadata output)

yield_all_manifest_deltas(cache_hints=None)[source]
Gets manifest data from the bundle. In order to process the manifests in a reasonable amount of time, we want to use only the deltas and not the entire manifest at each change, because if we’re processing them in sequential order (we are) then we already have the previous state, so we only need the changes.

Parameters
    cache_hints – see build_cache_hints method
Yields
    (dict, dict, dict) of (read_chunk_header output, extract_manifest_elements output on added/modified files, extract_manifest_elements output on removed files)

build_manifest_hints()[source]
Just a minor abstraction shortcut for the build_cache_hints method.

Returns
    see build_cache_hints method

build_cache_hints()[source]
The SelectiveCache class that we use in building nodes can accept a set of key counters that makes its memory usage much more efficient.

Returns
    dict of key=a node id, value=the number of times we will need data from that node when building subsequent nodes
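
Conceptually this amounts to counting how many later chunks will need each node as their delta base. A sketch with collections.Counter, assuming the header skimming offered by skim_headers and ignoring the need to reposition the file cursor afterwards:

    from collections import Counter

    def build_cache_hints_sketch(reader):
        hints = Counter()
        for header in reader.skim_headers():
            base = header['basenode']
            if base != reader.NAUGHT_NODE:   # full-content chunks need no base
                hints[base] += 1
        return hints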

extract_manifest_elements(data)[source]
Parses data that looks like a manifest. In practice we only pass in the bits extracted from the application of a manifest delta describing which files were added/modified or which ones were removed.

Parameters
    data – either a string or a list of strings that, when joined, embodies the composition of a manifest. This takes the form of repetitions of (without the brackets):
        b'<file_path><file_node>[flag]\n' ...repeat...
    where [flag] may or may not be there depending on whether the file is specially flagged as executable or something
Returns
    {file_path: (file_node, permissions), ...} where permissions is given according to the flag that optionally exists in the data
Return type
    dict
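
A minimal parsing sketch, under the assumption (standard for Mercurial manifests, though not stated above) that the path and the 40-character hex node are separated by a NUL byte and that the optional flag character ('x' for executable, 'l' for symlink) follows the node:

    def extract_manifest_elements_sketch(data):
        if isinstance(data, list):
            data = b''.join(data)
        elements = {}
        for line in data.split(b'\n'):
            if not line:
                continue
            path, _, rest = line.partition(b'\x00')
            node, flag = rest[:40], rest[40:]    # flag is b'', b'x' or b'l'
            elements[path] = (node, flag)
        return elements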