When taking down an origin that should not have been archived, we want to remove all the objects that have been added as a result of archiving this origin, without affecting objects that have been referenced from other origins. For example, we would not want to remove the content containing the text of the GPL v3 (which is shared by many origins), or other contents or directories included verbatim from other projects.
Objects in Software Heritage have an intrinsic identifier in the form of a SHWID. This means that it is not possible to remove a single directory entry without changing the identifier for the directory. Which in turn would change the identifier for the source revision… which would change the identifier of one or more releases, and the relevant snapshots then.
Therefore, we only support removal of origins and snapshots. In the future, adding the ability to remove content would be possible with the current object model. Removing other objects would need a serious amount of preliminary work to handle missing DAG nodes.
Listing objects to be removed#
The algorithm that lists the objects that can be removed in response to a takedown request works in two stages:
Get the list of candidates for removal, which is a complete, up-to-date subgraph of the Software Heritage archive rooted at the requested object.
For all the candidates for removal, check if they are referenced by any other object in the archive.
Getting the candidates for removal#
We first use Software Heritage - graph service to construct a subgraph using a breadth-first traversal starting from the given SWHID.
Retrieving data from Software Heritage - graph service is an optimization. It is fast but its content it only accurate to the point when the latest export was generated. The SWHID we are looking for could have been added later. Or we are looking at creating a subgraph from an origin and more recent visits might have been made.
This step of the algorithm is implemented in the module
Marking candidates used elsewhere#
When removing objects, we must be careful not to create dangling references. This means we can only remove objects if they are not currently referenced elsewhere. References to an object take the form of inbound edges in our directed graph.
But it is not enough to just skip a node because it has inbound edges. If we are trying to remove a source repository, the exact same file will likely be present in many revisions. Therefore, we need to be sure that the same content is only present in this particular project. In graph terminology: we can only remove a node if all its inbound edges are coming from nodes present in our subgraph, and there it at least one such edge.
To do that, we traverse our subgraph, For each node, we look at their inbound edges through Software Heritage - graph service and then Software Heritage - Storage. If any of them is not in the subgraph, that means the object must not be removed.
We can avoid unneeded lookups by traversing our subgraph in topological order (looking at all releases before all revisions before all directories before all contents). Then, when examining a node, we can look at its predecessors. If any of them must be kept, then we know it must be kept as well.
Example of removing an origin#
To create this example, we will re-use the dataset offered in the
Lets imagine that we receive a legitimate takedown requests for the forked
origin. We start our inventory by querying Software Heritage - graph service and
Software Heritage - Storage in turn until all objects pertaining to this origin have
been found. As most of the additions to the forked origin are recent, they are
not present in
swh-graph. Many round-trips to
Now we can look for any references coming from outside our inventory, and mark the relevant objects as removable or unremovable if they are any.