Relational schema#

The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables.

This page documents the relational schema of the latest version of the graph dataset.

Note: To limit abuse, some columns containing personal information are pseudonimized in the dataset using a hash algorithm. Individual authors may be retrieved by querying the Software Heritage API.

  • content: contains information on the contents stored in the archive.

    • sha1 (string): the SHA-1 of the content (hexadecimal)

    • sha1_git (string): the Git SHA-1 of the content (hexadecimal)

    • sha256 (string): the SHA-256 of the content (hexadecimal)

    • blake2s256 (bytes): the BLAKE2s-256 of the content (hexadecimal)

    • length (integer): the length of the content

    • status (string): the visibility status of the content

  • skipped_content: contains information on the contents that were not archived for various reasons.

    • sha1 (string): the SHA-1 of the skipped content (hexadecimal)

    • sha1_git (string): the Git SHA-1 of the skipped content (hexadecimal)

    • sha256 (string): the SHA-256 of the skipped content (hexadecimal)

    • blake2s256 (bytes): the BLAKE2s-256 of the skipped content (hexadecimal)

    • length (integer): the length of the skipped content

    • status (string): the visibility status of the skipped content

    • reason (string): the reason why the content was skipped

  • directory: contains the directories stored in the archive.

    • id (string): the intrinsic hash of the directory (hexadecimal), recursively computed with the Git SHA-1 algorithm

  • directory_entry: contains the entries in directories.

    • directory_id (string): the Git SHA-1 of the directory containing the entry (hexadecimal).

    • name (bytes): the name of the file (basename of its path)

    • type (string): the type of object the branch points to (either rev (revision), dir (directory) or file (content)).

    • target (string): the Git SHA-1 of the object this entry points to (hexadecimal).

    • perms (integer): the permissions of the object

  • revision: contains the revisions stored in the archive.

    • id (string): the intrinsic hash of the revision (hexadecimal), recursively computed with the Git SHA-1 algorithm. For Git repositories, this corresponds to the commit hash.

    • message (bytes): the revision message

    • author (string): an anonymized hash of the author of the revision.

    • date (timestamp): the date the revision was authored

    • date_offset (integer): the offset of the timezone of date

    • committer (string): an anonymized hash of the committer of the revision.

    • committer_date (timestamp): the date the revision was committed

    • committer_offset (integer): the offset of the timezone of committer_date, known as committer_date_offset in swh-storage

    • directory (string): the Git SHA-1 of the directory the revision points to (hexadecimal). Every revision points to the root directory of the project source tree to which it corresponds.

  • revision_history: contains the ordered set of parents of each revision. Each revision has an ordered set of parents (0 for the initial commit of a repository, 1 for a regular commit, 2 for a regular merge commit and 3 or more for octopus-style merge commits).

    • id (string): the Git SHA-1 identifier of the revision (hexadecimal)

    • parent_id (string): the Git SHA-1 identifier of the parent (hexadecimal)

    • parent_rank (integer): the rank of the parent, which defines the ordering between the parents of the revision

  • release: contains the releases stored in the archive.

    • id (string): the intrinsic hash of the release (hexadecimal), recursively computed with the Git SHA-1 algorithm

    • target (string): the Git SHA-1 of the object the release points to (hexadecimal)

    • date (timestamp): the date the release was created

    • author (integer): the author of the revision

    • name (bytes): the release name

    • message (bytes): the release message

  • snapshot: contains the list of snapshots stored in the archive.

    • id (string): the intrinsic hash of the snapshot (hexadecimal), recursively computed with the Git SHA-1 algorithm.

  • snapshot_branch: contains the list of branches associated with each snapshot.

    • snapshot_id (string): the intrinsic hash of the snapshot (hexadecimal)

    • name (bytes): the name of the branch

    • target (string): the intrinsic hash of the object the branch points to (hexadecimal)

    • target_type (string): the type of object the branch points to (either release, revision, directory or content).

  • origin: the software origins from which the projects in the dataset were archived.

    • url (bytes): the URL of the origin

  • origin_visit: the different visits of each origin. Since Software Heritage archives software continuously, software origins are crawled more than once. Each of these “visits” is an entry in this table.

    • origin: (string) the URL of the origin visited

    • visit: (integer) an integer identifier of the visit

    • date: (timestamp) the date at which the origin was visited

    • type (string): the type of origin visited (e.g git, pypi, hg, svn, git, ftp, deb, …)

  • origin_visit_status: the status of each visit.

    • origin: (string) the URL of the origin visited

    • visit: (integer) an integer identifier of the visit

    • date: (timestamp) the date at which the origin was visited

    • type (string): the type of origin visited (e.g git, pypi, hg, svn, git, ftp, deb, …)

    • snapshot_id (string): the intrinsic hash of the snapshot archived in this visit (hexadecimal).

    • status (string): the integer identifier of the snapshot archived in this visit, either partial for partial visits or full for full visits.

  • person: the full names of authors and committers. It contains a deduplicated list of each release’s and revision’s author’s (and committer’s) full name. Full names over 32 kB are truncated at 32 kB. Due to the sensitive nature of the data in it, this table is not publicly available.

    • fullname (bytes): the full name of a person, usually in the format Name <email>

    • sha256_fullname (bytes): the SHA 256 of the person’s full name