Removal algorithm#

When taking down an origin that should not have been archived, we want to remove all the objects that have been added as a result of archiving this origin, without affecting objects that have been referenced from other origins. For example, we would not want to remove the content containing the text of the GPL v3 (which is shared by many origins), or other contents or directories included verbatim from other projects.

Note

Objects in Software Heritage have an intrinsic identifier in the form of a SHWID. This means that it is not possible to remove a single directory entry without changing the identifier for the directory. Which in turn would change the identifier for the source revision… which would change the identifier of one or more releases, and the relevant snapshots then.

Therefore, we only support removal of origins and snapshots. In the future, adding the ability to remove content would be possible with the current object model. Removing other objects would need a serious amount of preliminary work to handle missing DAG nodes.

Listing objects to be removed#

The algorithm that lists the objects that can be removed in response to a takedown request works in two stages:

  1. Get the list of candidates for removal, which is a complete, up-to-date subgraph of the Software Heritage archive rooted at the requested object.

  2. For all the candidates for removal, check if they are referenced by any other object in the archive.

Getting the candidates for removal#

We first use Software Heritage - graph service to construct a subgraph using a breadth-first traversal starting from the given SWHID.

If we have not found the SWHID through Software Heritage - graph service, or if we are starting from an origin, we complete the list of candidates using Software Heritage - Storage.

Note

Retrieving data from Software Heritage - graph service is an optimization. It is fast but its content it only accurate to the point when the latest export was generated. The SWHID we are looking for could have been added later. Or we are looking at creating a subgraph from an origin and more recent visits might have been made.

This step of the algorithm is implemented in the module swh.alter.inventory.

Marking candidates used elsewhere#

When removing objects, we must be careful not to create dangling references. This means we can only remove objects if they are not currently referenced elsewhere. References to an object take the form of inbound edges in our directed graph.

But it is not enough to just skip a node because it has inbound edges. If we are trying to remove a source repository, the exact same file will likely be present in many revisions. Therefore, we need to be sure that the same content is only present in this particular project. In graph terminology: we can only remove a node if all its inbound edges are coming from nodes present in our subgraph, and there it at least one such edge.

To do that, we traverse our subgraph, For each node, we look at their inbound edges through Software Heritage - graph service and then Software Heritage - Storage. If any of them is not in the subgraph, that means the object must not be removed.

We can avoid unneeded lookups by traversing our subgraph in topological order (looking at all releases before all revisions before all directories before all contents). Then, when examining a node, we can look at its predecessors. If any of them must be kept, then we know it must be kept as well.

Example of removing an origin#

To create this example, we will re-use the dataset offered in the swh.graph.example_dataset module.

The dataset used for our examples

A dataset used to demonstrate how the algorithm works#

It shows an initial origin that has then been forked, all their snapshots, releases, revisions and contents. Objects with a bold background are old enough to be present in the export available through the swh-graph API. The others are only available through swh-storage.

Inventory candidates#

Lets imagine that we receive a legitimate takedown requests for the forked origin. We start our inventory by querying Software Heritage - graph service and Software Heritage - Storage in turn until all objects pertaining to this origin have been found. As most of the additions to the forked origin are recent, they are not present in swh-graph. Many round-trips to swh-storage are therefore needed.

Mark removable#

Now we can look for any references coming from outside our inventory, and mark the relevant objects as removable or unremovable if they are any.

A representation of our subgraph after tagging which objects can be removed and which need to be kept. We can see one reference pointing to an object from outside the inventory subgraph.

Our subgraph after looking up which nodes can safely be removed#

In red, nodes that have been deemed removable. In white, those that must be kept. They start by swh:rel:…010 for which a reference outside our inventory has been found.