Relational schema#
The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables.
This page documents the relational schema of the latest version of the graph dataset.
Note: To limit abuse, some columns containing personal information are pseudonimized in the dataset using a hash algorithm. Individual authors may be retrieved by querying the Software Heritage API.
content: contains information on the contents stored in the archive.
sha1(string): the SHA-1 of the content (hexadecimal)sha1_git(string): the Git SHA-1 of the content (hexadecimal)sha256(string): the SHA-256 of the content (hexadecimal)blake2s256(bytes): the BLAKE2s-256 of the content (hexadecimal)length(integer): the length of the contentstatus(string): the visibility status of the content
skipped_content: contains information on the contents that were not archived for various reasons.
sha1(string): the SHA-1 of the skipped content (hexadecimal)sha1_git(string): the Git SHA-1 of the skipped content (hexadecimal)sha256(string): the SHA-256 of the skipped content (hexadecimal)blake2s256(bytes): the BLAKE2s-256 of the skipped content (hexadecimal)length(integer): the length of the skipped contentstatus(string): the visibility status of the skipped contentreason(string): the reason why the content was skipped
directory: contains the directories stored in the archive.
id(string): the intrinsic hash of the directory (hexadecimal), recursively computed with the Git SHA-1 algorithm
directory_entry: contains the entries in directories.
directory_id(string): the Git SHA-1 of the directory containing the entry (hexadecimal).name(bytes): the name of the file (basename of its path)type(string): the type of object the branch points to (eitherrev(revision),dir(directory) orfile(content)).target(string): the Git SHA-1 of the object this entry points to (hexadecimal).perms(integer): the permissions of the object
revision: contains the revisions stored in the archive.
id(string): the intrinsic hash of the revision (hexadecimal), recursively computed with the Git SHA-1 algorithm. For Git repositories, this corresponds to the commit hash.message(bytes): the revision messageauthor(string): an anonymized hash of the author of the revision.date(timestamp): the date the revision was authoreddate_offset(integer): the offset of the timezone ofdatecommitter(string): an anonymized hash of the committer of the revision.committer_date(timestamp): the date the revision was committedcommitter_offset(integer): the offset of the timezone ofcommitter_date, known ascommitter_date_offsetin swh-storagedirectory(string): the Git SHA-1 of the directory the revision points to (hexadecimal). Every revision points to the root directory of the project source tree to which it corresponds.
revision_history: contains the ordered set of parents of each revision. Each revision has an ordered set of parents (0 for the initial commit of a repository, 1 for a regular commit, 2 for a regular merge commit and 3 or more for octopus-style merge commits).
id(string): the Git SHA-1 identifier of the revision (hexadecimal)parent_id(string): the Git SHA-1 identifier of the parent (hexadecimal)parent_rank(integer): the rank of the parent, which defines the ordering between the parents of the revision
release: contains the releases stored in the archive.
id(string): the intrinsic hash of the release (hexadecimal), recursively computed with the Git SHA-1 algorithmtarget(string): the Git SHA-1 of the object the release points to (hexadecimal)date(timestamp): the date the release was createdauthor(integer): the author of the revisionname(bytes): the release namemessage(bytes): the release message
snapshot: contains the list of snapshots stored in the archive.
id(string): the intrinsic hash of the snapshot (hexadecimal), recursively computed with the Git SHA-1 algorithm.
snapshot_branch: contains the list of branches associated with each snapshot.
snapshot_id(string): the intrinsic hash of the snapshot (hexadecimal)name(bytes): the name of the branchtarget(string): the intrinsic hash of the object the branch points to (hexadecimal)target_type(string): the type of object the branch points to (eitherrelease,revision,directoryorcontent).
origin: the software origins from which the projects in the dataset were archived.
url(bytes): the URL of the origin
origin_visit: the different visits of each origin. Since Software Heritage archives software continuously, software origins are crawled more than once. Each of these “visits” is an entry in this table.
origin: (string) the URL of the origin visitedvisit: (integer) an integer identifier of the visitdate: (timestamp) the date at which the origin was visitedtype(string): the type of origin visited (e.ggit,pypi,hg,svn,git,ftp,deb, …)
origin_visit_status: the status of each visit.
origin: (string) the URL of the origin visitedvisit: (integer) an integer identifier of the visitdate: (timestamp) the date at which the origin was visitedtype(string): the type of origin visited (e.ggit,pypi,hg,svn,git,ftp,deb, …)snapshot_id(string): the intrinsic hash of the snapshot archived in this visit (hexadecimal).status(string): the integer identifier of the snapshot archived in this visit, eitherpartialfor partial visits orfullfor full visits.
person: the full names of authors and committers. It contains a deduplicated list of each release’s and revision’s author’s (and committer’s) full name. Full names over 32 kB are truncated at 32 kB. Due to the sensitive nature of the data in it, this table is not publicly available.
fullname(bytes): the full name of a person, usually in the formatName <email>sha256_fullname(bytes): the SHA 256 of the person’s full name