Software Heritage - Storage#
Abstraction layer over the archive, giving access to all stored source code artifacts as well as their metadata.
The Software Heritage storage consists of a high-level storage layer (swh.storage) that exposes a client/server API (swh.storage.api). The API is exposed by a server (swh.storage.api.server) and accessible via a client (swh.storage.api.client).
The low-level implementation of the storage is split between an object storage (swh.objstorage), which stores all “blobs” (i.e., the leaves of the Data model), and a SQL representation of the rest of the graph (swh.storage.storage).
Using swh-storage#
First, note that swh-storage is an internal API of Software Heritage, which is only available to software running on the SWH infrastructure and to developers running their own Software Heritage.
If you want to access the Software Heritage archive without running your own,
you should use the Web API instead.
As swh-storage has multiple backends, it is instantiated via the swh.storage.get_storage() function, which takes the backend type as its argument (usually remote, if you already have access to a running swh-storage). It returns an instance of a class implementing swh.storage.interface.StorageInterface, which is mostly a set of key-value stores, one for each object type.
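The pattern described above can be sketched with a toy example: a factory picks a backend class by name and returns an object that keeps one key-value store per object type. All names here (`ToyStorageInterface`, `ToyMemoryStorage`, `get_toy_storage`, and the method names) are hypothetical and only mirror the idea; this is not the swh.storage API.

```python
from typing import Dict, Optional, Protocol


class ToyStorageInterface(Protocol):
    """Hypothetical interface: one add/get pair per object type."""

    def content_add(self, key: str, value: bytes) -> None: ...
    def content_get(self, key: str) -> Optional[bytes]: ...


class ToyMemoryStorage:
    """Hypothetical backend keeping each object type in its own dict."""

    def __init__(self) -> None:
        self._stores: Dict[str, Dict[str, bytes]] = {"content": {}}

    def content_add(self, key: str, value: bytes) -> None:
        self._stores["content"][key] = value

    def content_get(self, key: str) -> Optional[bytes]:
        return self._stores["content"].get(key)


# Registry mapping a backend name to the class implementing it
_BACKENDS = {"memory": ToyMemoryStorage}


def get_toy_storage(backend_name: str, **kwargs) -> ToyStorageInterface:
    """Instantiate the backend registered under `backend_name`."""
    try:
        backend = _BACKENDS[backend_name]
    except KeyError:
        raise ValueError(f"Unknown storage backend {backend_name!r}") from None
    return backend(**kwargs)


storage = get_toy_storage("memory")
storage.content_add("abc", b"hello")
print(storage.content_get("abc"))
```

Callers only depend on the interface, so swapping the backend name is enough to target a different implementation.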
Many of the arguments and return types are “model objects”, i.e., immutable objects that are instances of the classes defined in swh.model.model.
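As an illustration of what immutability means in practice, here is a minimal stand-in built with the standard library's frozen dataclasses. `ToyOrigin` is a hypothetical class, and the real classes in swh.model.model may be implemented differently; the sketch only shows that such objects reject in-place changes and are "modified" by building a new object.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyOrigin:
    """Hypothetical immutable object, not the real swh.model.model class."""

    url: str


origin = ToyOrigin(url="https://github.com/torvalds/linux")

try:
    origin.url = "https://example.org/other"  # in-place mutation is rejected
except FrozenInstanceError:
    print("model objects cannot be modified in place")

# To "change" a field, build a new object instead
other = replace(origin, url="https://example.org/other")
print(origin.url, other.url)
```

Because such objects never change after creation, they can safely be shared, compared, and used as dictionary keys.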
Methods returning long lists of results are paginated: they return both a list of results and an opaque token to get the next page of results.
For example, to list all the visits of an origin using origin_visit_get, ten visits at a time, you can do:
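    from swh.storage import get_storage

    storage = get_storage("remote", url="http://localhost:5002")
    origin = "https://github.com/torvalds/linux"
    page_token = None
    while True:
        # Fetch the next page of at most ten visits
        page = storage.origin_visit_get(origin=origin, page_token=page_token, limit=10)
        for visit in page.results:
            print(visit)
        # Pass the token back on the next call; None means this was the last page
        page_token = page.next_page_token
        if page_token is None:
            break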
Or, using swh.core.api.classes.stream_results() for convenience:
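    from swh.core.api.classes import stream_results
    from swh.storage import get_storage

    storage = get_storage("remote", url="http://localhost:5002")
    # stream_results follows the page tokens and yields visits one by one
    visits = stream_results(
        storage.origin_visit_get, origin="https://github.com/torvalds/linux"
    )
    for visit in visits:
        print(visit)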
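To see how the page-token pattern works without a running swh-storage, the following self-contained sketch paginates over an in-memory list. `ToyPage`, `toy_visit_get`, and `toy_stream_results` are hypothetical stand-ins, not the swh implementations. The token here happens to encode an offset, which is an assumption made for simplicity; callers should treat real tokens as opaque.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class ToyPage:
    """Hypothetical page: some results plus a token for the next page."""

    results: List[int]
    next_page_token: Optional[str] = None


VISITS = list(range(1, 26))  # pretend an origin has 25 visits


def toy_visit_get(page_token: Optional[str] = None, limit: int = 10) -> ToyPage:
    """Return up to `limit` items starting at the offset encoded in the token."""
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    # Only hand out a token when there is something left to fetch
    token = str(end) if end < len(VISITS) else None
    return ToyPage(results=VISITS[start:end], next_page_token=token)


def toy_stream_results(fetch: Callable[..., ToyPage], **kwargs) -> Iterator[int]:
    """Follow page tokens until exhausted, yielding results one by one."""
    page_token = None
    while True:
        page = fetch(page_token=page_token, **kwargs)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


visits = list(toy_stream_results(toy_visit_get, limit=10))
print(len(visits))
```

The helper hides the token bookkeeping, so the caller iterates over results as if they came from a single list.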
Database schema#
Archive copies#
Specifications#
Reference Documentation#
- Command-line interface
- swh.storage package
- swh.storage.algos package
- swh.storage.api package
- swh.storage.cassandra package
- swh.storage.postgresql package
- swh.storage.proxies package
- swh.storage.backfill module
- swh.storage.cli module
- swh.storage.common module
- swh.storage.exc module
- swh.storage.fixer module
- swh.storage.in_memory module
- swh.storage.interface module
- swh.storage.metrics module
- swh.storage.migrate_extrinsic_metadata module
- swh.storage.objstorage module
- swh.storage.pytest_plugin module
- swh.storage.replay module
- swh.storage.utils module
- swh.storage.writer module