Software Heritage - Storage

Abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata

The Software Heritage storage consist of a high-level storage layer (swh.storage) that exposes a client/server API (swh.storage.api). The API is exposed by a server (swh.storage.api.server) and accessible via a client (swh.storage.api.client).

The low-level implementation of the storage is split between an object storage (swh.objstorage), which stores all “blobs” (i.e., the leaves of the Data model) and a SQL representation of the rest of the graph (swh.storage.storage).

Using swh-storage

First, note that swh-storage is an internal API of Software Heritage, that is only available to software running on the SWH infrastructure and developers running their own Software Heritage. If you want to access the Software Heritage archive without running your own, you should use the Web API instead.

As swh-storage has multiple backends, it is instantiated via the swh.storage.get_storage() function, which takes as argument the backend type (usually remote, if you already have access to a running swh-storage).

It returns an instance of a class implementing swh.storage.interface.StorageInterface; which is mostly a set of key-value stores, one for each object type.

Many of the arguments and return types are “model objects”, ie. immutable objects that are instances of the classes defined in swh.model.model.

Methods returning long lists of arguments are paginated; by returning both a list of results and an opaque token to get the next page of results. For example, to list all the visits of an origin using origin_visit_get ten visits at a time, you can do:

storage = get_storage("remote", url="http://localhost:5002")
while True:
    page = storage.origin_visit_get(origin="https://github.com/torvalds/linux")
    for visit in page.results:
        print(visit)
    if page.next_page_token is None:
        break

Or, using swh.core.api.classes.stream_results() for convenience:

storage = get_storage("remote", url="http://localhost:5002")
visits = stream_results(
   storage.origin_visit_get, origin="https://github.com/torvalds/linux"
)
for visit in visits:
     print(visit)

Database schema

Archive copies