archive

An instance of the Software Heritage data store.


ARK

An Archival Resource Key (ARK) is a Uniform Resource Locator (URL) that serves as a multi-purpose persistent identifier for information objects of any type.

software artifact

An artifact is one of many kinds of tangible by-products produced during the development of software.


content

A (specific version of a) file stored in the archive, identified by its cryptographic hashes (SHA1, “git-like” SHA1, SHA256) and its size. Also known as: blob. Note: it is incorrect to refer to Contents as “files”, because files are usually considered to be named, whereas Contents are nameless. It is only in the context of specific directories that contents acquire (local) names.
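As a rough illustration of the identification scheme described above, the following Python sketch computes the three hashes and the size for a byte string. The function name `content_identifiers` and the dictionary keys are illustrative assumptions, not the archive's actual API. The “git-like” SHA1 is assumed here to be the hash git computes for a blob, that is, SHA1 over a `blob <size>\0` header followed by the data.

```python
import hashlib


def content_identifiers(data: bytes) -> dict:
    """Return the hashes and size that together identify a content.

    Hypothetical helper for illustration; names are not taken from the archive's code.
    """
    # git hashes a blob as SHA1("blob <size>\0" + data)
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
    }


if __name__ == "__main__":
    # No file name is involved: only the bytes determine the identifiers.
    for key, value in content_identifiers(b"").items():
        print(key, value)
```

Note that the function takes only bytes as input, which mirrors the point that contents are nameless.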


deposit

A software artifact that was pushed to the Software Heritage archive (unlike loaders, which pull artifacts). A deposit is useful when you want to ensure a software release’s source code is archived in SWH even if it is not published anywhere else.

See also: the Software Heritage - Deposit component, which implements a deposit client and server.


directory

A set of named pointers to contents (file entries), directories (directory entries) and revisions (revision entries). Each entry is associated with its local name (i.e., a relative path without any path separator) and permission metadata (e.g., chmod value or equivalent).


DOI

A Digital Object Identifier or DOI is a persistent identifier or handle used to uniquely identify objects, standardized by the International Organization for Standardization (ISO).

external identifier

An identifier used by a system that does not fit the Software Heritage data model, such as Mercurial’s nodeid, or the hash of a tarball from a package manager. Such identifiers may be stored in the Software Heritage archive independently of the identified object, to quickly match an external object (a changeset or tarball) to an object in the archive without downloading it.

extrinsic metadata

Metadata about software that is not shipped as part of the software source code, but is available instead via out-of-band means. For example, homepage, maintainer contact information, and popularity information (“stars”) as listed on GitHub/GitLab repository pages.

See also: intrinsic metadata, Metadata workflow and architecture.


journal

The journal is the persistent logger of the Software Heritage architecture, in charge of logging changes to the archive, with publish-subscribe support.
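To make the publish-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is only a stand-in under stated assumptions: the class name `ToyJournal`, its methods, and the topic names are hypothetical, and a real journal would persist its log rather than keep it in a Python list.

```python
from typing import Callable, Dict, List, Tuple


class ToyJournal:
    """Append-only change log with publish-subscribe delivery (illustrative only)."""

    def __init__(self) -> None:
        # Every published change is kept, in order, so it can be replayed later.
        self._log: List[Tuple[str, dict]] = []
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Register a callback to be invoked for each new message on a topic."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, message: dict) -> None:
        """Log the change first, then notify the topic's subscribers."""
        self._log.append((topic, message))
        for callback in self._subscribers.get(topic, []):
            callback(message)

    def replay(self, topic: str) -> List[dict]:
        """Return every message ever logged for a topic, oldest first."""
        return [message for logged_topic, message in self._log if logged_topic == topic]


if __name__ == "__main__":
    journal = ToyJournal()
    journal.subscribe("revision", lambda msg: print("new revision:", msg))
    journal.publish("revision", {"id": "example-id"})
```

Keeping the log separate from delivery means a consumer that subscribes late can still catch up by replaying earlier changes.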


lister

A lister is a component of the Software Heritage architecture that is in charge of enumerating the software origins (e.g., VCS repositories, packages, etc.) available at a source code distribution place.


loader

A loader is a component of the Software Heritage architecture responsible for reading a source code origin (typically a git repository) and importing or updating its content in the archive (i.e., adding new file contents into the object storage and the repository structure into the storage database).

cryptographic hash

A fixed-size “summary” of a stream of bytes that is easy to compute and hard to reverse (see the Cryptographic hash function Wikipedia article). Also known as: checksum, digest.


indexer

A component of the Software Heritage architecture dedicated to producing metadata linked to the known blobs in the archive.

intrinsic identifier

A short character string, usually a cryptographic hash, that uniquely identifies an object and can be generated deterministically using only the content of the object. This excludes any reliance on network interaction or a central authority.

Examples of intrinsic identifiers are: checksums (for files/strings only), git hashes, and SWHIDs.
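The sketch below illustrates what "deterministic, using only the content" means in practice, by deriving a SWHID for a content object from its bytes alone. It assumes the `swh:1:cnt:<hex digest>` form, with the digest computed the way git hashes a blob; the function name `content_swhid` is hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Derive a content SWHID from the bytes alone (illustrative sketch).

    Assumes the digest is the git blob hash: SHA1("blob <size>\0" + data).
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Two independent computations on the same bytes always agree,
    # with no registry or network lookup involved.
    print(content_swhid(b"hello"))
    print(content_swhid(b"hello") == content_swhid(bytes("hello", "ascii")))
```

Because the identifier is a pure function of the bytes, anyone holding a copy of the object can recompute it and check that it matches.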

intrinsic metadata

Metadata about software that is shipped as part of the source code of the software itself or as part of related artifacts (e.g., revisions, releases, etc.). For example, metadata that is shipped in PKG-INFO files for Python packages, pom.xml for Maven-based Java projects, debian/control for Debian packages, package.json for NPM, etc.

See also: extrinsic metadata, Metadata workflow and architecture.

object store
object storage

Content-addressable object storage. It is the place where the actual object blobs are stored.
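A minimal sketch of content-addressable storage, in Python, may help here: the key under which a blob is stored is derived from the blob itself. The class `InMemoryObjectStore` and the choice of SHA256 as the key are assumptions made for illustration, not a description of the actual object storage implementation.

```python
import hashlib
from typing import Dict


class InMemoryObjectStore:
    """Toy content-addressable store: blobs are keyed by the SHA256 of their bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        """Store a blob and return its identifier; identical blobs are stored once."""
        obj_id = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Return the blob for an identifier, raising KeyError if it is unknown."""
        return self._blobs[obj_id]

    def __len__(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    store = InMemoryObjectStore()
    first = store.add(b"int main(void) { return 0; }\n")
    second = store.add(b"int main(void) { return 0; }\n")
    print(first == second, len(store))
```

Addressing by content gives deduplication for free: adding the same bytes twice yields the same identifier and a single stored copy.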

software origin
data source

A location from which a coherent set of sources has been obtained, like a git repository, a directory containing tarballs, etc.


person

An entity referenced by a revision as either the author or the committer of the corresponding change. A person is associated with a full name and/or an email address.


release

A revision that has been marked as noteworthy with a specific name (e.g., a version number), together with associated development metadata (e.g., author, timestamp, etc.).


revision

A point-in-time snapshot of the content of a directory, together with associated development metadata (e.g., author, timestamp, log message, etc.).


scheduler

The component of the Software Heritage architecture dedicated to managing and prioritizing the many tasks (e.g., listing and loading origins).


snapshot

The state of all visible branches during a specific visit of an origin.

storage database

The main database of the Software Heritage platform, in which all the elements of the data model except the contents are stored as a Merkle DAG.
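For readers unfamiliar with Merkle DAGs, the Python sketch below shows the core idea: a node's identifier is a hash computed over the identifiers of the nodes it points to, so any change below a node changes that node's identifier too. The functions `leaf_id` and `node_id` and the serialization they use are simplified assumptions for illustration; they are not the hashing format used by the archive.

```python
import hashlib
from typing import Dict


def leaf_id(data: bytes) -> str:
    """Identifier of a leaf node, computed from its bytes (toy format)."""
    return hashlib.sha256(b"leaf:" + data).hexdigest()


def node_id(children: Dict[str, str]) -> str:
    """Identifier of an inner node, computed from its named children's identifiers.

    Entries are hashed in sorted name order so the result does not depend
    on the order in which the mapping was built.
    """
    digest = hashlib.sha256(b"node:")
    for name in sorted(children):
        digest.update(name.encode("utf-8") + b"\x00" + children[name].encode("ascii") + b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    readme = leaf_id(b"hello\n")
    src = node_id({"main.c": leaf_id(b"int main(void) { return 0; }\n")})
    root = node_id({"README": readme, "src": src})
    print(root)
```

Two consequences follow from this construction: identical subtrees get identical identifiers and can be stored once, and comparing two root identifiers is enough to tell whether anything beneath them differs.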

type of origin

Information about the kind of hosting, e.g., whether it is a forge, a collection of repositories, a homepage publishing tarballs, or a one-shot source code repository. For all kinds of repositories, specify which VCS is in use (Git, SVN, CVS, etc.).

vault service

User-facing service that allows retrieving parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.).


visit

The passage of Software Heritage on a given origin, to retrieve all source code and metadata available there at the time. A visit object stores the state of all visible branches (if any) available at the origin at visit time; each of them points to a revision object in the archive. Future visits of the same origin will create new visit objects, without removing previous ones.
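The relationship described above can be sketched with a small Python data structure: each visit records the branches seen at an origin at a given time, and new visits are appended without touching earlier ones. The names `Visit` and `VisitLog`, their fields, and the example origin URL are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Visit:
    origin: str                # where the source code was obtained from
    date: datetime             # when the visit took place
    branches: Dict[str, str]   # branch name -> revision identifier


@dataclass
class VisitLog:
    visits: List[Visit] = field(default_factory=list)

    def record(self, origin: str, branches: Dict[str, str],
               date: Optional[datetime] = None) -> Visit:
        """Append a new visit; previous visits of the same origin are kept."""
        visit = Visit(origin, date or datetime.now(timezone.utc), dict(branches))
        self.visits.append(visit)
        return visit

    def for_origin(self, origin: str) -> List[Visit]:
        """Return all recorded visits of an origin, oldest first."""
        return [visit for visit in self.visits if visit.origin == origin]


if __name__ == "__main__":
    log = VisitLog()
    url = "https://example.org/repo.git"  # placeholder origin
    log.record(url, {"refs/heads/main": "rev-1"})
    log.record(url, {"refs/heads/main": "rev-2", "refs/heads/dev": "rev-3"})
    print(len(log.for_origin(url)))
```

Keeping every visit makes it possible to see how the branches of an origin evolved between successive passes.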