swh.graph.example_dataset contains a synthetic dataset made of only a
very small number of objects. Any real-world input for the Software Heritage
software stack is likely to be orders of magnitude bigger. However, this dataset
is small enough to be easily comprehensible to human beings, which makes it
useful for exploring the behavior of swh.graph or writing tests.
Except for origins, the SoftWare Heritage persistent IDentifiers (SWHIDs) of these objects are synthetic and incorrect: they are arbitrary values rather than hashes computed from the content of the objects. While this makes them easier for humans to read, it may cause discrepancies or assertion errors whenever an object's SWHID is checked against its data.
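To illustrate the difference between a synthetic and a computed identifier, here is a minimal sketch that derives the SWHID of a content object from its bytes. It assumes the standard scheme for content objects, a SHA-1 over a git-style blob header followed by the raw data. The function name content_swhid is hypothetical and not part of swh.graph.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a content object from its raw bytes.

    Hypothetical helper for illustration: it hashes a git-style blob
    header ("blob <length>\\0") followed by the data with SHA-1.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# A synthetic content identifier of the kind used by the example dataset.
SYNTHETIC = "swh:1:cnt:0000000000000000000000000000000000000001"

computed = content_swhid(b"")
print(computed)
# A computed identifier looks nothing like the arbitrary synthetic one.
print(computed == SYNTHETIC)
```

Any code that recomputes identifiers this way and compares them with those stored in the example dataset will see a mismatch, which is the kind of discrepancy described above.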
Using the dataset
Running a local server using this dataset can be done with:
$ swh graph rpc-serve -g swh/graph/example_dataset/compressed/example &
$ curl http://localhost:5009/graph/leaves/swh:1:dir:0000000000000000000000000000000000000002
swh:1:cnt:0000000000000000000000000000000000000001
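The same query can be issued from Python using only the standard library. This is a sketch, not an official client: the helper names leaves_url, parse_swhids and fetch_leaves are hypothetical, the base URL is taken from the curl example, and the response is assumed to be plain text with one SWHID per line, as the curl output suggests.

```python
from typing import List
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the curl example; adjust it if the server runs elsewhere.
BASE_URL = "http://localhost:5009/graph"


def leaves_url(swhid: str, base_url: str = BASE_URL) -> str:
    """Build the URL of the leaves endpoint for the given SWHID."""
    return f"{base_url}/leaves/{quote(swhid, safe=':')}"


def parse_swhids(body: str) -> List[str]:
    """Split a response body into SWHIDs, assuming one per line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def fetch_leaves(swhid: str, base_url: str = BASE_URL) -> List[str]:
    """Query a running server; requires `swh graph rpc-serve` to be up."""
    with urlopen(leaves_url(swhid, base_url)) as response:
        return parse_swhids(response.read().decode("utf-8"))


# Building the URL needs no running server.
print(leaves_url("swh:1:dir:0000000000000000000000000000000000000002"))
```

With the local server started as shown above, calling fetch_leaves("swh:1:dir:0000000000000000000000000000000000000002") performs the same request as the curl command.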
Regenerating the dataset
The package already contains ready-to-consume files made from this dataset.
If needed, they can be regenerated by running:
$ python -m swh.graph.example_dataset.generate_dataset \
    --compress \
    swh/graph/example_dataset
--compress optionally performs a graph compression step.
While semantically equivalent, the graph compression output is not reproducible
from one run to the next. Given the current implementation, creating a new
compressed graph would require updating many constants in the swh.graph test suite.
The example dataset mostly mimics the development of a tiny project. It has been released once, and then its development has been picked up in a fork.