swh.mosaic: MOdular Storage of Archived and Indexed Contents from Software Heritage

MOSAIC is a file format designed to store contents archived by Software Heritage efficiently and to read them back in random order. The target content is source code, so objects are small (median size: 3 kB). Objects are indexed by one or more of the possible objstorage keys.

The motivations and design of this file format are thoroughly explained in SWH Enhancement Proposal #5. Format evolutions are described in CHANGELOG.md in the package’s sources.

The Python module provides high-level classes (implemented in Rust) to read/write MOSAIC files.

Quick Start

Basic Usage Example:

from swh.mosaic import MosaicCreator, MosaicReader, IdxDescription
from pathlib import Path

# Open a new MOSAIC file for writing
creator = MosaicCreator(
    Path("example.mosaic"),
    indexes=[IdxDescription.SHA1FMPHGO, IdxDescription.SHA256FMPHGO],
    comments=["Example MOSAIC file"]
)

# Add a sample object. Keys must match what was provided as `indexes` above.
obj1 = b"Hello World"
obj1_sha1 = b"1" * 20  # fake SHA1 hash
obj1_sha256 = b"1" * 32  # fake SHA256 hash
creator.add([obj1_sha1, obj1_sha256], obj1)

# Finalize the file (write its indexes)
creator.close()

# Open it for reading
reader = MosaicReader(Path("example.mosaic"))
print(f"Objects: {reader.objects_counter}")
print(f"Comments: {reader.comments}")

# Load an index to enable lookups
reader.load_index(IdxDescription.SHA1FMPHGO)
retrieved = reader.lookup(obj1_sha1)
print(f"Retrieved: {retrieved}")

# Loading an index also enables iteration
reader.load_index(IdxDescription.SHA256FMPHGO)
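for (obj_sha256, obj_content) in reader:
    # Name each file after the hex digest: raw hash bytes can contain
    # NUL or "/" and are not a safe filename. Assumes keys come back as
    # bytes, as they were passed to add().
    with open(obj_sha256.hex(), "wb") as f:
        f.write(obj_content)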

Note that:

  • MosaicReader is optimized for random access. If you need fast sequential iteration over the objects in a MOSAIC file, use the Rust crate directly.

  • The MosaicReader constructor only reads the file header. You must call load_index() before performing lookups or iteration.

  • The tile threshold (default: 32 MB) determines when a new tile is started, and it affects how the maximum object size is calculated.

Available Index Types

The following indexes are supported through the IdxDescription enum. Currently, all indexes rely on an FMPHGO minimal perfect hash (MPH) function and differ only in the semantics of their keys:

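  • SHA1FMPHGO: keys are the objects’ SHA1 digests

  • SHA1GITFMPHGO: keys are the objects’ Git-style SHA1 digests

  • SHA256FMPHGO: keys are the objects’ SHA256 digests

  • BLAKE2FMPHGO: keys are the objects’ BLAKE2 digests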
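
As a rough illustration of what these key types mean, the sketch below computes candidate keys for one object with Python's standard hashlib instead of the fake hashes used in the quick start. The helper name `compute_keys` and its dictionary labels are hypothetical and not part of swh.mosaic. The Git-style digest follows Git's blob convention (SHA1 over a `blob <size>\0` header followed by the content). Mapping BLAKE2 to the 32-byte BLAKE2s digest is an assumption that should be checked against the package's own documentation.

```python
import hashlib


def compute_keys(content: bytes) -> dict[str, bytes]:
    """Return raw digests of `content`, keyed by an informal algorithm name.

    Hypothetical helper for illustration; it is not part of swh.mosaic.
    """
    # Git hashes a blob as SHA1 over a "blob <size>\0" header plus the content.
    git_header = b"blob %d\x00" % len(content)
    return {
        "sha1": hashlib.sha1(content).digest(),
        "sha1_git": hashlib.sha1(git_header + content).digest(),
        "sha256": hashlib.sha256(content).digest(),
        # Assumption: BLAKE2 here means BLAKE2s with a 32-byte digest.
        "blake2s256": hashlib.blake2s(content, digest_size=32).digest(),
    }


keys = compute_keys(b"Hello World")
for name, digest in keys.items():
    print(f"{name}: {len(digest)} bytes, {digest.hex()}")
```

The raw digests, not their hexadecimal form, have the same shape as the 20-byte and 32-byte placeholder keys used in the quick start.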

Context Manager Support

MosaicCreator supports the context manager protocol:

# Writing with context manager (automatically closes)
with MosaicCreator(
    Path("example.mosaic"),
    indexes=[IdxDescription.SHA1GITFMPHGO]
) as creator:
    creator.add([b"1"*20], b"data")

In that case, the file is finalized when the context exits.
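
To make the finalization step concrete, here is a minimal sketch of the context manager protocol using a stand-in class. `DemoCreator` is hypothetical and only mimics the shape of the calls shown above. It says nothing about how MosaicCreator is actually implemented, and whether the real class also finalizes the file when the block raises an exception is not stated here.

```python
class DemoCreator:
    """Hypothetical stand-in: collects objects and 'finalizes' on close()."""

    def __init__(self) -> None:
        self.objects: list[tuple[list[bytes], bytes]] = []
        self.closed = False

    def add(self, keys: list[bytes], content: bytes) -> None:
        if self.closed:
            raise ValueError("cannot add to a closed creator")
        self.objects.append((keys, content))

    def close(self) -> None:
        # A real writer would write its indexes at this point.
        self.closed = True

    def __enter__(self) -> "DemoCreator":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Runs when the `with` block ends; this sketch always closes.
        self.close()
        return False  # do not swallow exceptions raised in the block


with DemoCreator() as creator:
    creator.add([b"1" * 20], b"data")

print(creator.closed)
```

With a class of this shape, the `with` form is equivalent to calling `close()` explicitly in a `try`/`finally` block.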