swh.mosaic: MOdular Storage of Archived and Indexed Contents from Software Heritage
MOSAIC is a file format designed to efficiently store and randomly read contents archived by Software Heritage. The target content is source code, which means small objects (median size: 3 kB), indexed by one or more of the possible objstorage keys.
The motivations and design of this file format are thoroughly explained in
SWH Enhancement Proposal #5.
Format evolutions are described in CHANGELOG.md in the package’s sources.
The Python module provides high-level classes (implemented in Rust) to read/write MOSAIC files.
Quick Start
Basic Usage Example:
from swh.mosaic import MosaicCreator, MosaicReader, IdxDescription
from pathlib import Path
# Open a new MOSAIC file for writing
creator = MosaicCreator(
    Path("example.mosaic"),
    indexes=[IdxDescription.SHA1FMPHGO, IdxDescription.SHA256FMPHGO],
    comments=["Example MOSAIC file"],
)
# Add a sample object. Keys must match what was provided as `indexes` above.
obj1 = b"Hello World"
obj1_sha1 = b"1" * 20 # fake SHA1 hash
obj1_sha256 = b"1" * 32 # fake SHA256 hash
creator.add([obj1_sha1, obj1_sha256], obj1)
# Finalize the file (write its indexes)
creator.close()
# Open it for reading
reader = MosaicReader(Path("example.mosaic"))
print(f"Objects: {reader.objects_counter}")
print(f"Comments: {reader.comments}")
# Load an index to enable lookups
reader.load_index(IdxDescription.SHA1FMPHGO)
retrieved = reader.lookup(obj1_sha1)
print(f"Retrieved: {retrieved}")
# Loading an index also enables iteration
reader.load_index(IdxDescription.SHA256FMPHGO)
for (obj_sha256, obj_content) in reader:
    # Hex-encode the key: raw hash bytes are not always a valid file name
    with open(obj_sha256.hex(), 'wb') as f:
        f.write(obj_content)
Note that:
- MosaicReader is optimized for random accesses. If you need really fast iterations on objects in a MOSAIC, please use the Rust crate directly.
- The MosaicReader constructor only reads the file header. You must call load_index() before performing lookups or iteration.
- Tile threshold (default 32 MB) determines when new tiles are created and affects the maximum object size calculation.
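
The iteration loop in the quick start writes each object to a file named after its key. Below is a minimal sketch of that step as a reusable helper. The name dump_objects is hypothetical and not part of swh.mosaic, and the sketch assumes that iteration yields (key, content) pairs of bytes, as in the quick start. Keys are hex-encoded because raw hash bytes may contain characters that are invalid in file names.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def dump_objects(pairs: Iterable[Tuple[bytes, bytes]], out_dir: Path) -> int:
    """Write each (key, content) pair to out_dir/<hex key>; return the count.

    Hypothetical helper: `pairs` can be any iterable of (key, content)
    byte pairs, for example a reader whose index has been loaded.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for key, content in pairs:
        # Raw hash bytes may contain "/" or NUL, so use the hex form as name.
        (out_dir / key.hex()).write_bytes(content)
        count += 1
    return count


# Stand-in data; a reader with a loaded index would be passed instead.
sample = [(b"\x01" * 32, b"Hello World")]
with tempfile.TemporaryDirectory() as tmp:
    written = dump_objects(sample, Path(tmp))
    names = sorted(p.name for p in Path(tmp).iterdir())
```

Because the helper only relies on the iteration protocol, it can be exercised with plain lists in tests and handed a reader in real use.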
Available Index Types
The following indexes are supported through the IdxDescription enum.
Currently all indexes rely on an FMPHGO MPH, and differ by their keys' semantics:
- SHA1FMPHGO: keys are objects' SHA1
- SHA1GITFMPHGO: Git-style SHA1
- SHA256FMPHGO: SHA256
- BLAKE2FMPHGO: BLAKE2
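
The quick start uses fake hashes as keys. As a rough sketch of how real keys for these index types could be computed with the standard library, the helper below maps each index name to a hash function. The names sha1_git, KEY_FUNCTIONS and keys_for are hypothetical and not part of swh.mosaic. The Git-style SHA1 is the usual Git blob hash (SHA1 over a "blob <length>\0" header plus the content). The BLAKE2 variant (blake2s with a 32-byte digest) is an assumption based on the blake2s256 hash commonly used by Software Heritage; this document does not state it.

```python
import hashlib


def sha1_git(content: bytes) -> bytes:
    """Git blob hash: SHA1 over a "blob <length>" + NUL header, then the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).digest()


# Assumed mapping from index name to key function. The BLAKE2 entry is a
# guess (blake2s, 32-byte digest), not something confirmed by the document.
KEY_FUNCTIONS = {
    "SHA1FMPHGO": lambda c: hashlib.sha1(c).digest(),
    "SHA1GITFMPHGO": sha1_git,
    "SHA256FMPHGO": lambda c: hashlib.sha256(c).digest(),
    "BLAKE2FMPHGO": lambda c: hashlib.blake2s(c, digest_size=32).digest(),
}


def keys_for(content: bytes, index_names: list[str]) -> list[bytes]:
    """Return one raw-digest key per index name, in the order given."""
    return [KEY_FUNCTIONS[name](content) for name in index_names]


keys = keys_for(b"Hello World", ["SHA1FMPHGO", "SHA256FMPHGO"])
print([k.hex() for k in keys])
```

In the quick start, the list returned by keys_for(obj1, ...) could then be passed as the first argument of creator.add(), provided the names are listed in the same order as the indexes given to MosaicCreator.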
Context Manager Support
MosaicCreator supports the context manager protocol:
# Writing with context manager (automatically closes)
with MosaicCreator(
    Path("example.mosaic"),
    indexes=[IdxDescription.SHA1GITFMPHGO],
) as creator:
    creator.add([b"1" * 20], b"data")
In that setting, the file is finalized when the context exits.