.. _swh-graph-example-dataset:

Example dataset
===============

:mod:`swh.graph.example_dataset` contains a synthetic dataset made of only a
very small number of objects. Any real-world input for the Software Heritage
software stack is likely to be orders of magnitude bigger. However, this
dataset is small enough to be easily comprehensible to human beings. This
makes it useful for exploring the behavior of ``swh.graph`` or writing tests.

.. warning::

    Except for origins, :ref:`SWHIDs` for these objects are synthetic and
    incorrect: they are arbitrary instead of having been computed from the
    content of the objects. While these are easier for humans to comprehend,
    this quirk might introduce discrepancies or assertion errors whenever the
    value of the object :ref:`SWHIDs` is checked against their data.

Using the dataset
-----------------

Running a local server using this dataset can be done with:

.. code:: console

    $ swh graph rpc-serve -g swh/graph/example_dataset/compressed/example &
    $ curl http://localhost:5009/graph/leaves/swh:1:dir:0000000000000000000000000000000000000002
    swh:1:cnt:0000000000000000000000000000000000000001

.. _regenerate_swh-graph_example_dataset:

Regenerating the dataset
------------------------

The package already contains files suitable for consumption by ``swh.graph``
made from this dataset. If needed, they can be regenerated by running:

.. code:: console

    $ python -m swh.graph.example_dataset.generate_dataset \
        --compress \
        swh/graph/example_dataset

The optional ``--compress`` flag performs a graph compression step.

.. warning::

    While semantically equivalent, the graph compression output is not
    reproducible from one run to the next. Given the current implementation,
    creating a new compressed graph would require updating many constants in
    the ``swh.graph`` test suite.

Regenerating the dataset for the Rust implementation
----------------------------------------------------

The Rust implementation needs different files to be generated. The following
sections describe how to generate them.

.cmph file
~~~~~~~~~~

The older Java version serialized the MPH structure using Java serialization,
which stores integers in big-endian order. Because we do not need to depend on
the Java serialization format, we moved to a new ``.cmph`` format, which stores
data in little-endian order. This allows the file to be read from C, Rust, or
any other language without worrying about Java object deserialization. It also
allows the file to be mmapped on little-endian machines.

To convert from ``.mph`` to the new ``.cmph`` file with the swh-graph utility:

.. code:: console

    $ java -classpath ~/src/swh-graph/java/target/swh-graph-3.0.1.jar ~/src/swh-graph/java/src/main/java/org/softwareheritage/graph/utils/Mph2Cmph.java graph.mph graph.cmph

Alternatively, with just ``webgraph-big``, you can use ``jshell`` to call the
``dump`` method:

.. code:: console

    $ echo '((it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction)it.unimi.dsi.fastutil.io.BinIO.loadObject("test.mph")).dump("test.cmph");' | jshell -classpath /path/to/webgraph-big.jar

.ef file
~~~~~~~~

The older Java version used the ``.offsets`` file to build the Elias-Fano
structure at runtime. The offsets are just a contiguous big-endian bitstream of
the gaps between successive offsets, written as Elias-gamma codes. To avoid
rebuilding this structure every time, we added the ``.ef`` file, which can be
memory-mapped with little parsing, at the cost of being endianness-dependent.

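
To make the gap encoding described above more concrete, here is a minimal
Python sketch of a gamma-coded gap stream read most-significant bit first. It
is an illustration only, not the code used by ``swh.graph``. It assumes the
common convention where a value ``x >= 0`` is stored as ``x + 1``: first the
bit length of ``x + 1`` minus one as a run of zeros closed by a one, then the
remaining low bits. The names ``BitReader``, ``encode_gamma`` and
``decode_offsets`` are hypothetical helpers introduced for this example, and
the exact layout of a real ``.offsets`` file should be checked against the
WebGraph documentation.

```python
from typing import Iterable, List


class BitReader:
    """Reads bits from a bytes object, most-significant bit first."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position, in bits

    def read_bit(self) -> int:
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value

    def read_gamma(self) -> int:
        # Unary part: count the zeros before the first one.
        width = 0
        while self.read_bit() == 0:
            width += 1
        # Binary part: `width` low bits of (x + 1); the top bit is implicit.
        return ((1 << width) | self.read_bits(width)) - 1


def encode_gamma(values: Iterable[int]) -> bytes:
    """Encodes non-negative integers with the assumed gamma convention."""
    bits: List[int] = []
    for value in values:
        x = value + 1
        width = x.bit_length() - 1
        bits.extend([0] * width + [1])
        bits.extend((x >> i) & 1 for i in range(width - 1, -1, -1))
    bits.extend([0] * (-len(bits) % 8))  # pad the last byte with zeros
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def decode_offsets(data: bytes, count: int) -> List[int]:
    """Rebuilds `count` absolute offsets from their gamma-coded gaps."""
    reader = BitReader(data)
    offsets: List[int] = []
    current = 0
    for _ in range(count):
        current += reader.read_gamma()
        offsets.append(current)
    return offsets


if __name__ == "__main__":
    gaps = [0, 5, 3, 12]
    print(decode_offsets(encode_gamma(gaps), len(gaps)))
```

Decoding has to walk the whole stream sequentially, because each offset is the
running sum of all previous gaps. This is the work that a precomputed,
memory-mappable structure avoids at load time.
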
The ``.ef`` file is little-endian. To generate the ``.ef`` file from either a
``.offsets`` file or a ``.graph`` file, you can use the ``webgraph-rs`` binary
utility:

.. code:: console

    $ cargo run --release --bin build_eliasfano -- $BASENAME

This will create a ``$BASENAME.ef`` file in the same directory.

Content
-------

The example dataset mostly mimics the development of a tiny project. It has
been released once, and then its development has been picked up in a fork.

.. figure:: images/example-dataset.svg
   :alt: A representation of the example dataset directed graph with nodes for origins, snapshots, releases, revisions, directories and contents.

   Dataset visualization

The :mod:`swh.model` objects that make up this dataset are available in
:mod:`swh.graph.example_dataset`.