# Software Heritage - Storage
Abstraction layer over the archive that provides access to all stored source code artifacts as well as their metadata.
## Quick start
The Python tests for this module include tests that cannot be run without a local PostgreSQL database, so you need the PostgreSQL server executable on your machine (a running PostgreSQL server is not required). They also expect a Cassandra server.
#### Debian-like host
    $ sudo apt install libpq-dev postgresql-11 cassandra
#### Non Debian-like host
The tests look for the Cassandra executable at the path given by the SWH_CASSANDRA_BIN environment variable; if it is unset, they fall back to /usr/sbin/cassandra.
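The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the test suite's actual code: the function name `cassandra_bin` and the choice to treat an empty variable as unset are assumptions.

```python
import os
from typing import Mapping, Optional

# Default location mentioned in the text above.
DEFAULT_CASSANDRA_BIN = "/usr/sbin/cassandra"


def cassandra_bin(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the Cassandra executable path (hypothetical helper).

    SWH_CASSANDRA_BIN wins when set; otherwise the default path is used.
    Treating an empty value as "unset" is an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get("SWH_CASSANDRA_BIN") or DEFAULT_CASSANDRA_BIN


print(cassandra_bin({}))
print(cassandra_bin({"SWH_CASSANDRA_BIN": "/opt/cassandra/bin/cassandra"}))
```

Passing the environment as an argument keeps the sketch easy to exercise without touching the real process environment.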
Optionally, you can avoid running the Cassandra tests:

    (swh) :~/swh-storage$ tox -- -m 'not cassandra'
It is strongly recommended to use a virtualenv. In the following, we assume you work in a virtualenv named swh. See the developer setup guide for more details on how to set up a working environment.
You can install the package directly from PyPI:

    (swh) :~$ pip install swh.storage
    [...]
Or from sources:

    (swh) :~$ git clone https://forge.softwareheritage.org/source/swh-storage.git
    [...]
    (swh) :~$ cd swh-storage
    (swh) :~/swh-storage$ pip install .
    [...]
Then you can check it’s properly installed:

    (swh) :~$ swh storage --help
    Usage: swh storage [OPTIONS] COMMAND [ARGS]...

      Software Heritage Storage tools.

    Options:
      -h, --help  Show this message and exit.

    Commands:
      rpc-serve  Software Heritage Storage RPC server.
The best way of running Python tests for this module is to use tox.

    (swh) :~$ pip install tox
From the sources directory, simply use tox:

    (swh) :~/swh-storage$ tox
    [...]
    ========= 315 passed, 6 skipped, 15 warnings in 40.86 seconds ==========
    _______________________________ summary ________________________________
      flake8: commands succeeded
      py3: commands succeeded
      congratulations :)
Note: it is possible to set the JAVA_HOME environment variable to specify the version of the JVM to be used by Cassandra. For example, at the time of writing this, Cassandra is meant to be run with Java 11. On Debian bookworm, one needs to manually install openjdk-11-jre-headless from bullseye or unstable and set the appropriate environment variable:

    (swh) :~/swh-storage$ export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (swh) :~/swh-storage$ tox
    [...]
The storage server can be started locally. It requires a configuration file and a running PostgreSQL database.
### Sample configuration
A typical storage.yml configuration file is:

    storage:
      cls: postgresql
      db: "dbname=softwareheritage-dev user=<user> password=<pwd>"
      objstorage:
        cls: pathslicing
        root: /tmp/swh-storage/
        slicing: 0:2/2:4/4:6
which means that this uses:

- a local storage instance whose database connection points to the local softwareheritage-dev instance,
- a local objstorage instance whose:
  - root path is /tmp/swh-storage,
  - slicing scheme is 0:2/2:4/4:6. This means that a content is stored on disk under nested directories named after slices of its identifier (sha1): the first level uses the first 2 hex characters, the second level the next 2 hex characters, and the third level the next 2 hex characters. The file holding the raw content is named after the complete hash. For example, 00062f8bd330715c4f819373653d97b3cd34394c will be stored at 00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c.
Note that the root path should exist on disk before starting the server.
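To make the slicing scheme concrete, here is a minimal Python sketch that computes the on-disk path from a slicing specification such as 0:2/2:4/4:6. It only mirrors the rule described above; the helper names `parse_slicing` and `sliced_path` are hypothetical and this is not the code used by the pathslicing objstorage itself.

```python
from pathlib import PurePosixPath
from typing import List


def parse_slicing(spec: str) -> List[slice]:
    """Turn "0:2/2:4/4:6" into [slice(0, 2), slice(2, 4), slice(4, 6)]."""
    slices = []
    for part in spec.split("/"):
        start, stop = part.split(":")
        slices.append(slice(int(start), int(stop)))
    return slices


def sliced_path(root: str, hex_id: str, spec: str = "0:2/2:4/4:6") -> PurePosixPath:
    """Build root/<level 1>/<level 2>/.../<full hash> for a content identifier."""
    levels = [hex_id[s] for s in parse_slicing(spec)]
    return PurePosixPath(root, *levels, hex_id)


print(sliced_path("/tmp/swh-storage", "00062f8bd330715c4f819373653d97b3cd34394c"))
# prints /tmp/swh-storage/00/06/2f/00062f8bd330715c4f819373653d97b3cd34394c
```

Spreading files over nested directories this way keeps any single directory from holding too many entries.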
### Starting the storage server
If the Python package has been properly installed (e.g. in a virtual env), you should be able to use the command:

    (swh) :~/swh-storage$ swh storage -C storage.yml rpc-serve
This runs a local swh-storage API on port 5002:

    (swh) :~/swh-storage$ curl http://127.0.0.1:5002
    <html>
    <head><title>Software Heritage storage server</title></head>
    <body>
    <p>You have reached the
    <a href="https://www.softwareheritage.org/">Software Heritage</a>
    storage server.<br />
    See its
    <a href="https://docs.softwareheritage.org/devel/swh-storage/">documentation
    and API</a> for more information</p>
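The same check can be scripted. The following sketch uses only the Python standard library to fetch the landing page and look for the page title shown in the curl output; the function name `storage_is_up` and the default timeout are assumptions made for illustration.

```python
from urllib.request import urlopen


def storage_is_up(url: str = "http://127.0.0.1:5002", timeout: float = 2.0) -> bool:
    """Return True if the landing page at `url` looks like a swh-storage server."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # urllib's URLError and HTTPError derive from OSError, so this
        # covers refused connections, timeouts and HTTP error statuses.
        return False
    # Title taken from the HTML returned by the server's landing page.
    return "Software Heritage storage server" in body
```

Calling `storage_is_up()` after starting `rpc-serve` should return True once the server accepts connections, and False when nothing is listening.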
### And then what?
Other components can then use this server by defining a remote storage with the following snippet:

    storage:
      cls: remote
      url: http://localhost:5002/
You can also directly define a PostgreSQL storage with the following snippet:

    storage:
      cls: postgresql
      db: service=swh-dev
      objstorage:
        cls: pathslicing
        root: /home/storage/swh-storage/
        slicing: 0:2/2:4/4:6
As an alternative to PostgreSQL, swh-storage can use Cassandra as a database backend. It can be used like this:

    storage:
      cls: cassandra
      hosts:
        - localhost
      keyspace: swh
      objstorage:
        cls: pathslicing
        root: /home/storage/swh-storage/
        slicing: 0:2/2:4/4:6
The Cassandra swh-storage implementation supports both Cassandra >= 4.0-alpha2 and ScyllaDB >= 4.4 (and possibly earlier versions, but this is untested).
While the main code supports both transparently, running tests or configuring the schema requires specific code when using ScyllaDB, enabled by setting the SWH_USE_SCYLLADB=1 environment variable.
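For code that builds its configuration programmatically, the remote, postgresql and cassandra YAML snippets from this section map onto plain Python dictionaries. The sketch below is illustrative only: `storage_config` is a hypothetical helper, not part of swh.storage, and the values are copied from the snippets above.

```python
# Object storage settings shared by the postgresql and cassandra snippets.
OBJSTORAGE = {
    "cls": "pathslicing",
    "root": "/home/storage/swh-storage/",
    "slicing": "0:2/2:4/4:6",
}


def storage_config(backend: str) -> dict:
    """Return the `storage:` mapping for one backend (hypothetical helper)."""
    if backend == "remote":
        return {"cls": "remote", "url": "http://localhost:5002/"}
    if backend == "postgresql":
        return {"cls": "postgresql", "db": "service=swh-dev", "objstorage": dict(OBJSTORAGE)}
    if backend == "cassandra":
        return {
            "cls": "cassandra",
            "hosts": ["localhost"],
            "keyspace": "swh",
            "objstorage": dict(OBJSTORAGE),
        }
    raise ValueError(f"unknown backend: {backend!r}")


print(storage_config("cassandra")["hosts"])
```

With swh.storage installed, such a dictionary could presumably be unpacked into the factory, e.g. `get_storage(**storage_config("remote"))`, since `get_storage` takes the backend type followed by keyword arguments.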
The Software Heritage storage consists of a high-level storage layer (swh.storage) that exposes a client/server API (swh.storage.api). The API is exposed by a server (swh.storage.api.server) and accessible via a client.
The low-level implementation of the storage is split between an object storage (swh.objstorage), which stores all “blobs” (i.e., the leaves of the Data model), and a SQL representation of the rest.
First, note that swh-storage is an internal API of Software Heritage, which is only available to software running on the SWH infrastructure and to developers running their own Software Heritage. If you want to access the Software Heritage archive without running your own, you should use the Web API instead.
swh-storage has multiple backends. It is instantiated via the swh.storage.get_storage() function, which takes as argument the backend type (usually remote, if you already have access to a running swh-storage instance). It returns an instance of a class implementing swh.storage.interface.StorageInterface, which is mostly a set of key-value stores, one for each object type.
Many of the arguments and return types are “model objects”, i.e. immutable
objects that are instances of the classes defined in
Methods returning long lists of results are paginated: they return both a list of results and an opaque token to get the next page of results. For example, to list all the visits of an origin using origin_visit_get, one page of visits at a time, you can do:

    from swh.storage import get_storage

    storage = get_storage("remote", url="http://localhost:5002")
    page_token = None
    while True:
        # Pass the token from the previous page to fetch the next one
        page = storage.origin_visit_get(
            origin="https://github.com/torvalds/linux", page_token=page_token
        )
        for visit in page.results:
            print(visit)
        page_token = page.next_page_token
        if page_token is None:
            break
Or, using swh.core.api.classes.stream_results() for convenience:

    from swh.core.api.classes import stream_results
    from swh.storage import get_storage

    storage = get_storage("remote", url="http://localhost:5002")
    visits = stream_results(
        storage.origin_visit_get, origin="https://github.com/torvalds/linux"
    )
    for visit in visits:
        print(visit)
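To show the token-based pagination pattern without a running server, here is a self-contained sketch. `Page`, `fake_visit_get` and `stream` are stand-ins invented for this example. They only imitate the shape of a paged result (a `results` list plus a `next_page_token`) and are not the swh.core or swh.storage implementations.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class Page:
    """Stand-in for a paged result: some results plus an opaque token."""
    results: List[int]
    next_page_token: Optional[str]


def fake_visit_get(origin: str, page_token: Optional[str] = None, limit: int = 2) -> Page:
    """Pretend backend holding five visits, returned `limit` at a time."""
    visits = list(range(1, 6))
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    # The token is None once the last page has been served.
    token = str(end) if end < len(visits) else None
    return Page(results=visits[start:end], next_page_token=token)


def stream(fetch: Callable[..., Page], **kwargs) -> Iterator[int]:
    """Yield every result, following next_page_token until it is None."""
    page_token = None
    while True:
        page = fetch(page_token=page_token, **kwargs)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


print(list(stream(fake_visit_get, origin="https://example.org/repo")))
# prints [1, 2, 3, 4, 5]
```

The caller never interprets the token; it only hands it back to the same method, which lets the backend change its encoding freely.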
- Command-line interface
- swh.storage package
- swh.storage.algos package
- swh.storage.api package
- swh.storage.cassandra package
- swh.storage.postgresql package
- swh.storage.proxies package
- swh.storage.backfill module
- swh.storage.cli module
- swh.storage.common module
- swh.storage.exc module
- swh.storage.fixer module
- swh.storage.in_memory module
- swh.storage.interface module
- swh.storage.metrics module
- swh.storage.migrate_extrinsic_metadata module
- swh.storage.objstorage module
- swh.storage.pytest_plugin module
- swh.storage.replay module
- swh.storage.utils module
- swh.storage.writer module