Alterations of the Software Heritage Archive#
The main objective of an archive is to store facts forever. As such, it can be viewed as an append-only infrastructure. However, it may be necessary to alter the content of the archive to account for removal or alteration requests that may happen for several reasons.
We currently consider 2 types of alterations that may have to be applied to the archive:
content removal: some objects stored in the archive should not be visible any more; these can be either removed entirely or masked, depending on the situation.
personal identity modification: some personal information (namely the name and email of a person) must no longer be visible.
Note
We will not discuss in this section the administrative process of receiving, handling and processing an alteration request of the Software Heritage Archive. We will only focus on the technical aspects of the processes involved, and their impact on the architectural design.
Types of alteration#
Content removal#
A content removal request starts from one or more origins. The entire removal handling process is based on origins.
When dealing with a content removal request that needs to be applied to the archive, the following steps must be carried out:
identify all the objects in the archive (mostly in the Merkle DAG) that need to be removed,
build a properly encrypted recovery bundle with all the objects listed previously,
store and identify this bundle in a dedicated storage,
remove all the identified Content objects from all the objstorages under the legal and technical responsibility of Software Heritage,
remove all the identified objects from all the storages under the legal and technical responsibility of Software Heritage,
remove all the identified objects from all the secondary data silos, namely the Kafka journal, the search index, the compressed graph and the vault cache,
possibly: ensure the origins the removal request is targeting are excluded from any future archival.
Note that handling a content removal request can also involve masking objects (temporarily or permanently); for example, while a suppression request is being examined, it might be necessary to hide all the impacted objects until a decision is made for each of them.
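The order of the steps above can be sketched as follows. This is a minimal illustration, not the actual Software Heritage tooling: `ToyArchive`, `remove_objects` and all field names are hypothetical, the stores are plain in-memory dictionaries, the set of objects to remove is assumed to be already identified, and the recovery bundle is only copied here, whereas a real bundle must be encrypted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArchive:
    """Hypothetical in-memory stand-in for the archive and its data silos."""

    objstorage: dict = field(default_factory=dict)  # content id -> raw bytes
    storage: dict = field(default_factory=dict)     # object id -> metadata
    silos: dict = field(default_factory=dict)       # silo name -> set of object ids
    recovery_bundles: dict = field(default_factory=dict)
    blocked_origins: set = field(default_factory=set)


def remove_objects(archive, bundle_id, origin_urls, object_ids, block_origins=False):
    """Apply the removal steps, in order, to an already identified set of objects."""
    # Build and store the recovery bundle before deleting anything.
    # (Assumption: encryption is omitted from this sketch.)
    archive.recovery_bundles[bundle_id] = {
        "origins": list(origin_urls),
        "contents": {
            i: archive.objstorage[i] for i in object_ids if i in archive.objstorage
        },
        "objects": {i: archive.storage[i] for i in object_ids if i in archive.storage},
    }
    for obj_id in object_ids:
        archive.objstorage.pop(obj_id, None)  # Content objects in the objstorages
        archive.storage.pop(obj_id, None)     # objects in the storages
        for ids in archive.silos.values():    # journal, search, graph, vault cache
            ids.discard(obj_id)
    # Optional last step: exclude the origins from any future archival.
    if block_origins:
        archive.blocked_origins.update(origin_urls)
```

Writing the bundle first means that a failure in a later step never leaves objects deleted without a recoverable copy.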
Name change#
A person may ask for their former identity not to be published any more. When this request has been handled and accepted, any occurrence of the former identity of the person associated with archived version control system objects (such as commits) will be replaced by the new one when using the public endpoints of the archive (namely, browsing the archive, using public APIs, using the vault).
Note that currently, only Revision and Release objects are affected by the process.
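As an illustration of the replacement described above, the following sketch applies a display-name mapping to a revision at read time. The `Person` and `Revision` classes and the `apply_display_names` function are hypothetical simplifications, not the archive's data model; the mapping is assumed to be keyed by email, and only the name is replaced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Person:
    name: str
    email: str


@dataclass(frozen=True)
class Revision:
    id: str
    author: Person
    committer: Person
    message: str


def apply_display_names(revision, display_names):
    """Return a copy of the revision with current display names applied.

    `display_names` maps an email to the name that should be shown for it.
    The stored revision is left untouched; only the returned copy differs.
    """

    def fix(person):
        new_name = display_names.get(person.email)
        return replace(person, name=new_name) if new_name else person

    return replace(
        revision, author=fix(revision.author), committer=fix(revision.committer)
    )
```

A Release object, which carries an author in the same way, could be handled with the same helper.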
Read Access - Altering results#
The Software Heritage component responsible for altering returned objects is the
MaskingProxyStorage. It handles both the masking of content that is still
present in the archive but must not be published, and the application of active
name change requests. It stores in a dedicated database a map from email to
current display name, used to alter returned Revision and Release objects, and a
series of tables dedicated to handling masking requests. These make it possible
to withhold an object from the archive entirely if it is under a currently
active masking request.
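The masking side of such a proxy can be sketched as a thin wrapper around a backend. This is a toy model under stated assumptions: `ToyMaskingProxy` and `MaskedObjectError` are hypothetical names, the backend is a dictionary rather than a storage instance, the masked identifiers are held in memory rather than in database tables, and signalling a masked object with an exception is a design choice of this sketch.

```python
class MaskedObjectError(Exception):
    """Raised when a requested object is under an active masking request."""


class ToyMaskingProxy:
    """Read-path proxy that refuses to return masked objects."""

    def __init__(self, backend, masked_ids):
        self.backend = backend              # dict: object id -> object
        self.masked_ids = set(masked_ids)   # ids under an active masking request

    def get(self, obj_id):
        # The object stays in the backend; it is only hidden from readers.
        if obj_id in self.masked_ids:
            raise MaskedObjectError(obj_id)
        return self.backend.get(obj_id)
```

Because masking is enforced only in the proxy, lifting a masking request amounts to removing the identifier from the masked set; nothing has to be restored in the backend.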
As such, all the publicly accessible storage instances, whether reached from
the web frontend, the public API (REST and GraphQL) or the vault service, use
an access path that passes through the MaskingProxyStorage.
Note that for services like the vault, this can cause the requested cooking to fail in some cases (especially for git history cooking, where altering objects breaks the cryptographic integrity of the generated git content and thus makes it invalid).
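One way to see why altered objects break git history cooking: git identifies a commit by the SHA-1 of its serialized content, which includes the author and committer lines. The sketch below computes that identifier for a hand-written commit body; `git_commit_id` and `commit_body` are illustrative helpers, and the timestamp and names are made up.

```python
import hashlib


def git_commit_id(body: bytes) -> str:
    """Compute the SHA-1 object id git assigns to a commit with this body."""
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def commit_body(author: str) -> bytes:
    """Serialize a minimal root commit pointing at git's empty tree."""
    return (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        + f"author {author} 1700000000 +0000\n".encode()
        + f"committer {author} 1700000000 +0000\n".encode()
        + b"\nInitial commit\n"
    )


original_id = git_commit_id(commit_body("Former Name <dev@example.org>"))
altered_id = git_commit_id(commit_body("Current Name <dev@example.org>"))
# The two ids differ, so any child commit that references the original id
# as its parent no longer matches the altered commit.
```

Since each commit embeds the identifiers of its parents, changing one author name changes that commit's identifier and invalidates the references held by all of its descendants.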
Write Access - Preventing ingesting origins#
When an origin has been identified as forbidden for any future archiving, we
use a dedicated storage proxy in the writing path to the archive to ensure this
cannot happen. The corresponding Software Heritage component is the
BlockingProxyStorage. It is a simple proxy storage keeping a list of forbidden
origin URLs in a dedicated database, and preventing any matching origin URL from
being ingested in the archive.
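A write-path proxy of this kind can be sketched in a few lines. The names `ToyBlockingProxy` and `BlockedOriginError` are hypothetical, the backend is a plain list, the forbidden URLs are held in memory rather than in a database, and matching is assumed to be exact string equality; a real implementation may use a different matching rule.

```python
class BlockedOriginError(Exception):
    """Raised when an ingestion targets a forbidden origin URL."""


class ToyBlockingProxy:
    """Write-path proxy that rejects forbidden origins before they are stored."""

    def __init__(self, backend, blocked_urls):
        self.backend = backend                 # list of archived origin URLs
        self.blocked_urls = set(blocked_urls)  # assumption: exact-match URLs

    def origin_add(self, urls):
        blocked = [url for url in urls if url in self.blocked_urls]
        if blocked:
            # Reject the whole batch so nothing forbidden reaches the backend.
            raise BlockedOriginError(blocked)
        self.backend.extend(urls)
```

Checking the batch before writing anything keeps the backend unchanged whenever a forbidden URL is present.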