swh.loader.mercurial.bundle20_reader module

This module contains code for extracting all of the data from a Mercurial version 2 bundle file. It is referenced by bundle20_loader.py.

swh.loader.mercurial.bundle20_reader.unpack(fmt_str, source)

Utility function for fetching the right number of bytes from a stream to satisfy a struct.unpack pattern.

  • fmt_str – a struct.unpack string pattern (e.g. ‘>I’ for 4 bytes big-endian)

  • source – any IO object that has a read(<size>) method which returns an appropriate sequence of bytes
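
As a rough illustration of the behaviour described above, the following sketch sizes the read with struct.calcsize and passes the bytes to struct.unpack. The name unpack_sketch is hypothetical, and returning the raw tuple is an assumption: this page does not say whether the real function unwraps single values.

```python
import io
import struct


def unpack_sketch(fmt_str, source):
    """Read exactly as many bytes as fmt_str needs, then decode them."""
    size = struct.calcsize(fmt_str)   # e.g. 4 for '>I'
    data = source.read(size)          # source only needs a read(size) method
    return struct.unpack(fmt_str, data)


stream = io.BytesIO(b'\x00\x00\x00\x2a' + b'rest')
value = unpack_sketch('>I', stream)   # one big-endian unsigned 32-bit int
remaining = stream.read()             # the cursor has advanced by 4 bytes only
```

Because the read size is derived from the format string, the caller never has to count bytes by hand.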

class swh.loader.mercurial.bundle20_reader.Bundle20Reader(bundlefile, cache_filename, cache_size=None)

Bases: object

Parser for extracting data from Mercurial Bundle20 files. NOTE: Currently only works on uncompressed HG20 bundles, but checking for COMPRESSION=<2chars> and loading the appropriate stream decompressor at that point would be trivial to add if necessary.

  • bundlefile (str) – name of the binary repository bundle file

  • cache_filename (str) – path to the disk cache used (passed on to the SelectiveCache instance)

  • cache_size (int) – tuning parameter for the upper RAM limit used by historical data caches. The default is defined in the SelectiveCache class.

NAUGHT_NODE = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Parse the file header, which describes the format and parameters. See the structure diagram at the top of the source file for more insight.


bfile – bundle file handle with the cursor at the start offset of the content header (the 9th byte in the file)


dict of decoded bundle parameters


A chunk’s revdata section is a series of content updates called RevDiffs, each made up of start, end, length, and data_delta fields. Each RevDiff is one component of a text diff applied to the node’s basenode. The total length of all diffs is the length indicated at the beginning of the chunk, at the start of the header. See the structure diagram at the top of the source file for more insight.


bytes_to_read – int total number of bytes in the chunk’s revdata


(int, int, read iterator) representing a single text diff component
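
The start/end/length/data_delta layout described above can be sketched as follows. This is only an illustration: iter_rev_diffs is a hypothetical name, the three integers are assumed to be 4-byte big-endian values, and the data is read eagerly instead of being handed back as a read iterator.

```python
import io
import struct


def iter_rev_diffs(stream, bytes_to_read):
    """Yield (start, end, data) for each diff component in a revdata section."""
    while bytes_to_read > 0:
        # Assumption: three 4-byte big-endian unsigned integers per component.
        start, end, length = struct.unpack('>III', stream.read(12))
        data = stream.read(length)
        bytes_to_read -= 12 + length
        yield start, end, data


# Two components: replace bytes 0..5 with b'HELLO', then insert b'!' at 11.
raw = (struct.pack('>III', 0, 5, 5) + b'HELLO'
       + struct.pack('>III', 11, 11, 1) + b'!')
diffs = list(iter_rev_diffs(io.BytesIO(raw), len(raw)))
```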


The header of a RevChunk describes the id (‘node’) of the current change, the commit id (‘linknode’) associated with this change, the parents (‘p1’ and ‘p2’), and the node to which the revdata updates will apply (‘basenode’). When reading the commit log, ‘linknode’ is the same as ‘node’, because a commit is its own associated commit. ‘basenode’ for a changeset will be NAUGHT_NODE, because changeset chunks include complete information and not diffs. See the structure diagram at the top of the source file for more insight.


dict of the next delta header
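
A minimal sketch of decoding such a header, assuming five consecutive 20-byte ids in the order node, p1, p2, basenode, linknode. The field order and the name read_chunk_header_sketch are assumptions made for illustration; the structure diagram in the source file is the authority.

```python
import io
import struct

NAUGHT_NODE = b'\x00' * 20
# Assumed field order; check the module's structure diagram before relying on it.
HEADER_FIELDS = ('node', 'p1', 'p2', 'basenode', 'linknode')


def read_chunk_header_sketch(stream):
    """Decode five 20-byte node ids into a dict keyed by field name."""
    values = struct.unpack('>20s20s20s20s20s', stream.read(100))
    return dict(zip(HEADER_FIELDS, values))


# A changeset-like header: no parents, no base, and linknode equal to node.
node = b'\x11' * 20
raw = node + NAUGHT_NODE + NAUGHT_NODE + NAUGHT_NODE + node
header = read_chunk_header_sketch(io.BytesIO(raw))
```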


Fetch a complete RevChunk. A RevChunk contains the collection of line changes made in a particular update, and header[‘node’] identifies that update. Commits, manifests, and files all have these. Each chunk contains an indicator of the whole chunk size, an update header, and then the body of the update as a series of text diff components. See the structure diagram at the top of the source file for more insight.


tuple(dict, iterator) of (header, chunk data) if there is another chunk in the group, else None
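
The "another chunk, else None" behaviour can be sketched at the framing level. The sketch assumes a 4-byte big-endian length prefix that counts its own four bytes, with a length of zero ending the group. That is an assumption about the framing, not something stated on this page, and read_chunk_payload is a hypothetical helper that returns the raw payload without splitting it into header and revdata.

```python
import io
import struct


def read_chunk_payload(stream):
    """Return the next chunk's raw bytes, or None at the end of the group."""
    (size,) = struct.unpack('>I', stream.read(4))
    if size <= 4:
        # Assumed convention: an empty chunk terminates the group.
        return None
    return stream.read(size - 4)  # the prefix is assumed to count itself


stream = io.BytesIO(struct.pack('>I', 4 + 3) + b'abc' + struct.pack('>I', 0))
first = read_chunk_payload(stream)
second = read_chunk_payload(stream)
```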


Converts the binary commit metadata format into a dict.


data – bytestring of encoded commit information


dict of decoded commit information
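
For illustration, a sketch of decoding one commit entry. It assumes the usual Mercurial changelog text layout: manifest id, user, and a "time offset [extra]" line, followed by one changed file per line, a blank line, and the commit message. The dictionary keys and the name parse_commit_sketch are hypothetical and need not match what the real method returns.

```python
def parse_commit_sketch(data):
    """Split a changelog entry into its header fields and message."""
    header, _, message = data.partition(b'\n\n')  # first blank line ends the header
    lines = header.split(b'\n')
    manifest, user, date_line = lines[0], lines[1], lines[2]
    time_parts = date_line.split(b' ', 2)
    return {
        'manifest': manifest,
        'user': user,
        'time': int(time_parts[0]),
        'time_offset_seconds': int(time_parts[1]),
        'extra': time_parts[2] if len(time_parts) > 2 else b'',
        'changed_files': lines[3:],
        'message': message,
    }


raw = (b'a' * 40 + b'\nAlice <alice@example.org>\n1500000000 -7200\n'
       b'README\nsrc/main.py\n\nInitial commit\n\nMore detail')
commit = parse_commit_sketch(raw)
```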


Skip past <num_sections> sections quickly.


num_sections – int number of sections to skip

apply_revdata(revdata_it, prev_state)

Compose the complete text body for a change from component deltas.

  • revdata_it – output from the revdata_iterator method

  • prev_state – bytestring: the complete base text to which the new deltas will be applied


(bytestring, list, list) the new complete string and lists of added and removed components (used in manifest processing)
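
A minimal sketch of this composition step, assuming each delta is a (start, end, data) triple whose offsets refer to the base text, and that the deltas arrive in order without overlapping. The function name and the exact contents of the added and removed lists are illustrative assumptions.

```python
def apply_deltas_sketch(prev_state, diffs):
    """Rebuild a full text from a base text and ordered (start, end, data) deltas."""
    parts, added, removed = [], [], []
    cursor = 0
    for start, end, data in diffs:
        parts.append(prev_state[cursor:start])  # unchanged bytes before the delta
        parts.append(data)                      # replacement bytes
        if data:
            added.append(data)
        if end > start:
            removed.append(prev_state[start:end])
        cursor = end
    parts.append(prev_state[cursor:])           # unchanged tail
    return b''.join(parts), added, removed


new_state, added, removed = apply_deltas_sketch(
    b'hello world',
    [(0, 5, b'HELLO'), (6, 11, b'there')],
)
```

Collecting the pieces in a list and joining once avoids repeatedly copying a growing bytestring.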


Get all header data from a change group but bypass processing of the contained delta components.


output of read_chunk_header method for all chunks in the group


Bundle sections are called groups. These are composed of one or more revision chunks of delta components. Iterate over all the chunks in a group and hand each one back.


see output of read_revchunk method

yield_group_objects(cache_hints=None, group_offset=None)

Bundles are sectioned into groups: the log of all commits, the log of all manifest changes, and a series of logs of blob changes (one for each file). All groups are structured the same way, as a series of revisions each with a series of delta components. Iterate over the current group and return the completed object data for the current update by applying all of the internal delta components to each prior revision.

  • cache_hints – see build_cache_hints (this will be built automatically if not pre-built and passed in)

  • group_offset – int file position of the start of the desired group

(dict, bytestring, list, list) the output from read_chunk_header followed by the output from apply_revdata


File revision data sometimes begins with a metadata section of dubious value. Strip it off and maybe decode it. It seems to be mostly useless. Why indicate that a file node is a copy of another node? You can already get that information from the delta header.


data – bytestring of one revision of a file, possibly with metadata embedded at the start


(bytestring, dict) of (the blob data, the meta information)
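
As a sketch of the stripping step, the example below assumes the metadata block is delimited by the two-byte marker b'\x01\n' at both ends and holds "key: value" lines, as in Mercurial file revisions that record copies. The marker, the key names, and the function name are assumptions for illustration.

```python
META_MARKER = b'\x01\n'  # assumed delimiter around the metadata block


def split_file_metadata(data):
    """Return (blob data, metadata dict); the dict is empty without a block."""
    if not data.startswith(META_MARKER):
        return data, {}
    end = data.index(META_MARKER, len(META_MARKER))
    meta = {}
    for line in data[len(META_MARKER):end].split(b'\n'):
        if line:
            key, value = line.split(b': ', 1)
            meta[key] = value
    return data[end + len(META_MARKER):], meta


raw = b'\x01\ncopy: old.txt\ncopyrev: ' + b'a' * 40 + b'\n\x01\nfile body'
blob, meta = split_file_metadata(raw)
```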


Seek to the beginning of the change logs section.


Seek to the beginning of the manifests section.


Seek to the beginning of the file changes section.


Gets blob data from the bundle.

(bytestring, (bytestring, int, dict)) of (blob data, (file name, start offset of the file within the bundle, node header))


Gets commit data from the bundle.

(dict, dict) of (read_chunk_header output, extract_commit_metadata output)


Gets manifest data from the bundle. To process the manifests in a reasonable amount of time, we use only the deltas and not the entire manifest at each change. Because the manifests are processed in sequential order, the previous state is already known, so only the changes are needed.


cache_hints – see build_cache_hints method

(dict, dict, dict) of (read_chunk_header output, extract_manifest_elements output on added/modified files, extract_manifest_elements output on removed files)


Just a minor abstraction shortcut for the build_cache_hints method.


see build_cache_hints method


The SelectiveCache class that we use in building nodes can accept a set of key counters that makes its memory usage much more efficient.


dict mapping each node id to the number of times we will need data from that node when building subsequent nodes
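
One way such counters could be derived is by tallying how often each node appears as the ‘basenode’ of a later chunk. The sketch below assumes the input is an iterable of header dicts with a ‘basenode’ key; count_base_uses is a hypothetical name, and the real method may gather its counts differently.

```python
from collections import Counter


def count_base_uses(headers):
    """Count how many chunks name each node as their basenode."""
    hints = Counter()
    for header in headers:
        hints[header['basenode']] += 1
    return dict(hints)


a, b, c = b'\xaa' * 20, b'\xbb' * 20, b'\xcc' * 20
headers = [
    {'node': b, 'basenode': a},
    {'node': c, 'basenode': a},
    {'node': b'\xdd' * 20, 'basenode': c},
]
hints = count_base_uses(headers)
```

With counts like these, a cache can drop an entry as soon as its remaining use count reaches zero.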


Parses data that looks like a manifest. In practice we only pass in the bits extracted from the application of a manifest delta describing which files were added/modified or which ones were removed.



either a string or a list of strings that, when joined, embodies the composition of a manifest.

This takes the form of repetitions of (without the brackets):

b'<file_path>\x00<file_node>[flag]\n' ...repeat...

where [flag] is present only when the file is specially flagged, for example as executable


{file_path: (file_node, permissions), ...} where permissions is given according to the flag that optionally exists in the data
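
A sketch of parsing data in this shape. It assumes the path and node are separated by a NUL byte and that the node is a 40-character hex id, as in Mercurial manifests. It returns the raw flag bytes, because the mapping from flags to permissions is not spelled out here. The function name is hypothetical.

```python
def parse_manifest_sketch(data):
    """Map each file path to (file_node, raw_flag) from manifest-style lines."""
    if isinstance(data, (list, tuple)):
        data = b''.join(data)  # accept a list of strings as well as one string
    elements = {}
    for line in data.split(b'\n'):
        if not line:
            continue
        path, rest = line.split(b'\x00', 1)  # assumed NUL separator
        node, flag = rest[:40], rest[40:]    # assumed 40 hex characters
        elements[path] = (node, flag)
    return elements


lines = [
    b'README\x00' + b'a' * 40 + b'\n',
    b'bin/run\x00' + b'b' * 40 + b'x\n',
]
manifest = parse_manifest_sketch(lines)
```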

Return type