swh.loader.mercurial.bundle20_reader module

This module contains code for extracting all of the data from a Mercurial version 2 bundle file. It is referenced by bundle20_loader.py.

swh.loader.mercurial.bundle20_reader.unpack(fmt_str, source)[source]

Utility function for fetching the right number of bytes from a stream to satisfy a struct.unpack pattern.

Parameters
  • fmt_str – a struct.unpack string pattern (e.g. ‘>I’ for 4 bytes big-endian)

  • source – any IO object that has a read(<size>) method which returns an appropriate sequence of bytes
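
For orientation, a minimal sketch of what such a helper typically looks like, built only on the standard struct module (the real implementation may differ, for example in how it returns single values):

    import struct

    def unpack(fmt_str, source):
        # struct.calcsize gives the exact number of bytes the pattern
        # consumes, so read that many and decode them in one step.
        data = source.read(struct.calcsize(fmt_str))
        return struct.unpack(fmt_str, data)

For example, unpack('>I', io.BytesIO(b'\x00\x00\x00*')) would return (42,).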

class swh.loader.mercurial.bundle20_reader.Bundle20Reader(bundlefile, cache_filename, cache_size=None)[source]

Bases: object

Parser for extracting data from Mercurial Bundle20 files. NOTE: Currently only works on uncompressed HG20 bundles, but checking for COMPRESSION=<2chars> and loading the appropriate stream decompressor at that point would be trivial to add if necessary.

Parameters
  • bundlefile (str) – name of the binary repository bundle file

  • cache_filename (str) – path to the disk cache used (passed on to the SelectiveCache instance)

  • cache_size (int) – tuning parameter for the upper RAM limit used by historical data caches. The default is defined in the SelectiveCache class.
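
A hypothetical construction call; the paths below are placeholders, not values taken from the source:

    from swh.loader.mercurial.bundle20_reader import Bundle20Reader

    reader = Bundle20Reader(
        bundlefile='repo.bundle',        # illustrative path to an uncompressed HG20 bundle
        cache_filename='/tmp/hg-cache',  # illustrative disk cache location
        cache_size=None,                 # fall back to the SelectiveCache default
    )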

NAUGHT_NODE = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
read_bundle_header(bfile)[source]

Parse the file header which describes the format and parameters. See the structure diagram at the top of the file for more insight.

Parameters

bfile – bundle file handle with the cursor at the start offset of the content header (the 9th byte in the file)

Returns

dict of decoded bundle parameters
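
As a rough, hedged sketch of the kind of parsing involved (the exact layout is an assumption based on the bundle2 convention of a big-endian length followed by space-separated key=value pairs, and may not match the real reader):

    import struct

    def parse_bundle_params(bfile):
        # Assumed layout: a big-endian uint32 giving the length of the
        # parameter block, then space-separated 'key=value' pairs.
        (length,) = struct.unpack('>I', bfile.read(4))
        raw = bfile.read(length).decode('utf-8')
        params = {}
        for pair in raw.split(' '):
            if pair:
                key, _, value = pair.partition('=')
                params[key] = value
        return params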

revdata_iterator(bytes_to_read)[source]

A chunk’s revdata section is a series of start/end/length/data_delta content updates called RevDiffs that indicate components of a text diff applied to the node’s basenode. The total length of all the diffs is the length indicated at the beginning of the chunk, at the start of the header. See the structure diagram at the top of the file for more insight.

Parameters

bytes_to_read – int total number of bytes in the chunk’s revdata

Yields

(int, int, read iterator) representing a single text diff component
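
A sketch of walking such a revdata area, assuming each RevDiff is a 12-byte big-endian (start, end, data_length) header followed by data_length bytes of replacement text (the real method yields a lazy read iterator rather than the materialized bytes):

    import struct

    def iter_revdiffs(stream, bytes_to_read):
        # Consume RevDiffs until the advertised revdata length is exhausted.
        while bytes_to_read > 0:
            start, end, length = struct.unpack('>III', stream.read(12))
            data = stream.read(length)
            bytes_to_read -= 12 + length
            yield start, end, data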

read_chunk_header()[source]

The header of a RevChunk describes the id (‘node’) for the current change, the commit id (‘linknode’) associated with this change, the parental heritage (‘p1’ and ‘p2’), and the node to which the revdata updates will apply (‘basenode’). ‘linknode’ is the same as ‘node’ when reading the commit log, because each commit already links to itself. ‘basenode’ for a changeset will be NAUGHT_NODE, because changeset chunks include complete information and not diffs. See the structure diagram at the top of the file for more insight.

Returns

dict of the next delta header
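
A hedged sketch of reading such a header, assuming five consecutive 20-byte binary node ids; the field order shown is an assumption based on the changegroup2 format:

    import struct

    def read_delta_header(stream):
        # Each id is a raw 20-byte hash, matching the width of NAUGHT_NODE.
        node, p1, p2, basenode, linknode = struct.unpack(
            '>20s20s20s20s20s', stream.read(100))
        return {'node': node, 'p1': p1, 'p2': p2,
                'basenode': basenode, 'linknode': linknode}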

read_revchunk()[source]

Fetch a complete RevChunk. A RevChunk contains the collection of line changes made in a particular update. header[‘node’] identifies which update. Commits, manifests, and files all have these. Each chunk contains an indicator of the whole chunk size, an update header, and then the body of the update as a series of text diff components. See the structure diagram at the top of the file for more insight.

Returns

tuple(dict, iterator) of (header, chunk data) if there is another chunk in the group, else None
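
Putting the two previous sketches together, reading one chunk might look roughly like this; whether the size prefix counts itself and the header is an assumption:

    import struct

    def read_revchunk(stream):
        # A zero size is assumed to mark the end of the current group.
        (size,) = struct.unpack('>I', stream.read(4))
        if size == 0:
            return None
        header = read_delta_header(stream)   # sketch above
        revdata_size = size - 4 - 100        # assumes size includes prefix and header
        return header, iter_revdiffs(stream, revdata_size)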

extract_commit_metadata(data)[source]

Converts the binary commit metadata format into a dict.

Parameters

data – bytestring of encoded commit information

Returns

dict of decoded commit information
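
A hedged sketch of the decoding, assuming the usual Mercurial changelog text layout (manifest node, user, ‘time timezone [extra]’, the list of changed files, a blank line, then the message); the key names below are illustrative, not necessarily those returned by the real method:

    def parse_changeset_text(data):
        header, _, message = data.partition(b'\n\n')
        lines = header.split(b'\n')
        time_parts = lines[2].split(b' ')
        return {
            'manifest': lines[0],        # hex node of the manifest
            'user': lines[1],            # author string
            'time': time_parts[0],       # seconds since the epoch
            'time_offset': time_parts[1],
            'changed_files': lines[3:],
            'message': message,
        }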

skip_sections(num_sections=1)[source]

Skip past <num_sections> sections quickly.

Parameters

num_sections – int number of sections to skip

apply_revdata(revdata_it, prev_state)[source]

Compose the complete text body for a change from component deltas.

Parameters
  • revdata_it – output from the revdata_iterator method

  • prev_state – bytestring of the complete base text to which the new deltas will be applied

Returns

(bytestring, list, list) the new complete string and lists of added and removed components (used in manifest processing)
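
A minimal sketch of the delta-application step, assuming each component replaces prev_state[start:end] with its data (the added/removed bookkeeping used for manifests is omitted here):

    def apply_deltas(prev_state, deltas):
        pieces = []
        cursor = 0
        for start, end, data in deltas:
            pieces.append(prev_state[cursor:start])  # unchanged span before the edit
            pieces.append(data)                      # replacement text
            cursor = end                             # skip past the replaced span
        pieces.append(prev_state[cursor:])           # unchanged tail
        return b''.join(pieces)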

skim_headers()[source]

Get all header data from a change group but bypass processing of the contained delta components.

Yields

output of read_chunk_header method for all chunks in the group

group_iterator()[source]

Bundle sections are called groups. These are composed of one or more revision chunks of delta components. Iterate over all the chunks in a group and hand each one back.

Yields

see output of read_revchunk method
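
In effect this is little more than the following loop, sketched here for clarity:

    def iter_group(reader):
        # Pull chunks until read_revchunk signals the end of the group
        # by returning None.
        while True:
            chunk = reader.read_revchunk()
            if chunk is None:
                break
            yield chunk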

yield_group_objects(cache_hints=None, group_offset=None)[source]

Bundles are sectioned into groups: the log of all commits, the log of all manifest changes, and a series of logs of blob changes (one for each file). All groups are structured the same way, as a series of revisions each with a series of delta components. Iterate over the current group and return the completed object data for the current update by applying all of the internal delta components to each prior revision.

Parameters
  • cache_hints – see build_cache_hints (this will be built automatically if not pre-built and passed in)

  • group_offset – int file position of the start of the desired group

Yields

(dict, bytestring, list, list) the output from read_chunk_header followed by the output from apply_revdata
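
An illustrative consumption loop; ‘reader’ is assumed to be a Bundle20Reader positioned at the start of a group, as in the construction sketch above:

    for header, data, added, removed in reader.yield_group_objects():
        # header comes from read_chunk_header; data/added/removed from apply_revdata.
        print(header['node'].hex(), len(data), len(added), len(removed))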

extract_meta_from_blob(data)[source]

File revision data sometimes begins with a metadata section of dubious value. Strip it off and maybe decode it. It seems to be mostly useless. Why indicate that a file node is a copy of another node? You can already get that information from the delta header.

Parameters

data – bytestring of one revision of a file, possibly with metadata embedded at the start

Returns

(bytestring, dict) of (the blob data, the meta information)
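
A hedged sketch of the stripping step, assuming the usual filelog convention that a revision starting with b'\x01\n' carries a block of 'key: value' lines terminated by another b'\x01\n':

    def split_blob_metadata(data):
        if not data.startswith(b'\x01\n'):
            return data, {}
        end = data.index(b'\x01\n', 2)       # closing marker of the metadata block
        meta = {}
        for line in data[2:end].split(b'\n'):
            if line:
                key, _, value = line.partition(b': ')
                meta[key] = value
        return data[end + 2:], meta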

seek_changelog()[source]

Seek to the beginning of the change logs section.

seek_manifests()[source]

Seek to the beginning of the manifests section.

seek_filelist()[source]

Seek to the beginning of the file changes section.

yield_all_blobs()[source]

Gets blob data from the bundle.

Yields

(bytestring, (bytestring, int, dict)) of (blob data, (file name, start offset of the file within the bundle, node header))
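
An illustrative loop over every file revision; ‘reader’ is assumed to be a Bundle20Reader as in the construction sketch above:

    for blob, (file_name, offset, header) in reader.yield_all_blobs():
        print(file_name, offset, len(blob))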

yield_all_changesets()[source]

Gets commit data from the bundle.

Yields

(dict, dict) of (read_chunk_header output, extract_commit_metadata output)
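
An illustrative loop over every commit; ‘reader’ is assumed as above:

    for header, commit in reader.yield_all_changesets():
        print(header['node'].hex(), commit)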

yield_all_manifest_deltas(cache_hints=None)[source]

Gets manifest data from the bundle. In order to process the manifests in a reasonable amount of time, we want to use only the deltas and not the entire manifest at each change, because if we’re processing them in sequential order (we are) then we already have the previous state so we only need the changes.

Parameters

cache_hints – see build_cache_hints method

Yields

(dict, dict, dict) of (read_chunk_header output, extract_manifest_elements output on added/modified files, extract_manifest_elements output on removed files)

build_manifest_hints()[source]

Just a minor abstraction shortcut for the build_cache_hints method.

Returns

see build_cache_hints method

build_cache_hints()[source]

The SelectiveCache class that we use in building nodes can accept a set of key counters that makes its memory usage much more efficient.

Returns

dict mapping each node id to the number of times we will need data from that node when building subsequent nodes
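
A hedged sketch of how such hints could be assembled from the chunk headers alone, e.g. the output of skim_headers (which node references the real method actually counts is not shown here):

    from collections import Counter

    def count_base_nodes(headers):
        # Count how many times each node serves as a delta base, so the
        # cache knows how long to keep it around.
        hints = Counter(header['basenode'] for header in headers)
        return dict(hints)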

extract_manifest_elements(data)[source]

Parses data that looks like a manifest. In practice we only pass in the bits extracted from the application of a manifest delta describing which files were added/modified or which ones were removed.

Parameters

data

either a string or a list of strings that, when joined, embodies the composition of a manifest.

This takes the form of repetitions of (without the brackets):

b'<file_path><file_node>[flag]\n' ...repeat...

where [flag] may or may not be there depending on whether the file is specially flagged as executable or something

Returns

{file_path: (file_node, permissions), ...} where permissions is given according to the flag that optionally exists in the data

Return type

dict
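
A hedged parsing sketch; it assumes Mercurial's usual manifest line layout of a file path, a NUL separator, a 40-character hex node, and an optional one-character flag ('x' for executable, 'l' for symlink), and it returns the raw flag rather than the permissions the real method derives from it:

    def parse_manifest_lines(data):
        if isinstance(data, list):
            data = b''.join(data)
        result = {}
        for line in data.split(b'\n'):
            if not line:
                continue
            path, _, node_and_flag = line.partition(b'\x00')
            node, flag = node_and_flag[:40], node_and_flag[40:]
            result[path] = (node, flag)
        return result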