Shard File Format for the Software Heritage Object Storage

This module implements the support and tooling to manipulate SWH Shard files based on a perfect hash table, typically used by the Software Heritage object storage.

It is both a Python extension that can be used as a library to manipulate SWH shard files, and a set of command line tools.

Quick Start

This package uses pybind11 to build the wrapper around the cmph minimal perfect hashing library. To build the binary extension, in addition to the Python development tools, you will need cmph, gtest and valgrind. On a Debian system, you can install these using:

sudo apt install build-essential python3-dev libcmph-dev libgtest-dev valgrind lcov

Command Line Tool

You may use several methods to install swh-shard, e.g. using uv or pip.

For example:

$ uv tool install swh-shard
[...]
Installed 1 executable: swh-shard

$ swh-shard
Usage: swh-shard [OPTIONS] COMMAND [ARGS]...

  Software Heritage Shard tools.

Options:
  -C, --config-file FILE  Configuration file.
  -h, --help              Show this message and exit.

Commands:
  create  Create a shard file from given files
  get     List objects in a shard file
  info    Display shard file information
  ls      List objects in a shard file

Then you can create a shard file from local files:

$ swh-shard create volume.shard *.py
There are 3 entries
Checking files to add  [####################################]  100%
after deduplication: 3 entries
Adding files to the shard  [####################################]  100%
Done

Each file given as argument is stored in the shard file under a key that is the SHA256 checksum of the file's content.
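As an illustration of how such keys are derived, the following Python sketch computes the SHA256 checksum of a file's content and groups paths by key, which mirrors the deduplication step reported in the output above. The helper names `content_key` and `dedup_by_key` are hypothetical and are not part of the swh-shard API; only the standard library is used.

```python
import hashlib
from pathlib import Path


def content_key(path):
    """Return the SHA256 checksum (hex) of the content of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_by_key(paths):
    """Map each distinct key to the first path that produced it."""
    by_key = {}
    for path in paths:
        # setdefault keeps the first path seen for a given key.
        by_key.setdefault(content_key(path), Path(path))
    return by_key
```

Two files with identical content yield the same key, so they count as a single entry after deduplication.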

Then you can check the header of the shard file:

$ swh-shard info volume.shard
Shard volume.shard
├─version:    1
├─objects:    3
│ ├─position: 512
│ └─size:     5633
├─index
│ ├─position: 6145
│ └─size:     440
└─hash
  └─position: 6585
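
The positions and sizes reported above can be cross-checked with a little arithmetic. The Python sketch below hard-codes the values from this example output and checks that each section starts where the previous one ends. That the sections are laid out back to back is an observation about these particular numbers, not a statement taken from the format specification.

```python
# Values copied from the `swh-shard info volume.shard` output above.
objects_position, objects_size = 512, 5633
index_position, index_size = 6145, 440
hash_position = 6585

# In this example, the index starts right after the objects section...
assert objects_position + objects_size == index_position
# ...and the hash section starts right after the index.
assert index_position + index_size == hash_position
```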

List the content of a shard:

$ swh-shard ls volume.shard
8bb71bce4885c526bb4114295f5b2b9a23a50e4a8d554c17418d1874b1a233ac: 834 bytes
06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f: 4210 bytes
d39790a3af51286d2d10d73e72e2447cf97b149ff2d8e275b200a1ee33e4a3c5: 565 bytes
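
The `ls` output is line-oriented and simple to post-process. As a sketch, the Python snippet below parses lines of the form `<key>: <size> bytes`, as shown above, into a dictionary and sums the object sizes. The function name `parse_ls` is hypothetical, and the line format is assumed from this example output rather than from a documented contract.

```python
def parse_ls(output):
    """Parse `swh-shard ls` style output into a {key: size_in_bytes} dict."""
    sizes = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is assumed to look like "<hex key>: <size> bytes".
        key, _, rest = line.partition(": ")
        size, unit = rest.split()
        if unit != "bytes":
            raise ValueError(f"unexpected line: {line!r}")
        sizes[key] = int(size)
    return sizes


# The listing shown above, copied verbatim.
LS_OUTPUT = """\
8bb71bce4885c526bb4114295f5b2b9a23a50e4a8d554c17418d1874b1a233ac: 834 bytes
06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f: 4210 bytes
d39790a3af51286d2d10d73e72e2447cf97b149ff2d8e275b200a1ee33e4a3c5: 565 bytes
"""

sizes = parse_ls(LS_OUTPUT)
total = sum(sizes.values())  # 834 + 4210 + 565 = 5609
```

For this shard the object contents add up to 5609 bytes, 24 bytes less than the 5633 bytes that `swh-shard info` reports for the objects section. The difference presumably corresponds to per-object bookkeeping, but this is an inference from the example numbers only.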

Retrieve an object from a shard:

$ swh-shard get volume.shard 06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f | sha256sum
06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f  -
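
The same integrity check can be scripted. The Python sketch below reads an object from a binary stream and compares its SHA256 checksum with the expected key. The function name `matches_key` is hypothetical, and the demonstration uses an in-memory stream rather than a real shard.

```python
import hashlib
import io


def matches_key(stream, expected_key):
    """Return True if the SHA256 of the bytes read from `stream` equals `expected_key`."""
    digest = hashlib.sha256()
    # Consume the stream chunk by chunk until it is exhausted.
    for chunk in iter(lambda: stream.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest() == expected_key.lower()


# Self-contained demonstration on an in-memory stream.
payload = b"hello shard"
key = hashlib.sha256(payload).hexdigest()
assert matches_key(io.BytesIO(payload), key)
assert not matches_key(io.BytesIO(payload + b"!"), key)
```

In a script fed by a pipe from `swh-shard get`, the stream could be `sys.stdin.buffer`.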

And delete one or more objects from a shard:

$ swh-shard delete volume.shard 06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f
About to remove these objects from the shard file misc/volume.shard
06340a7a5fa9e18d72a587a69e4dc7e79f4d6a56632ea6900c22575dc207b07f (4210 bytes)
Proceed? [y/N]: y
Deleting objects from the shard  [####################################]  100%
Done

The package provides low-level management for read-only, content-addressable object storage indexed with a perfect hash table.

Reference Documentation