Extrinsic metadata specification#
Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information available at source code archival time.
Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed.
This specification assumes the reader is familiar with Software Heritage’s Software Architecture and Data model.
Metadata sources#
Metadata fetchers#
Metadata fetchers are software components used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive.
A metadata fetcher is uniquely defined by these properties:
its type
its version
Examples:
loaders, which may either discover metadata as a side-effect of loading source code, or be dedicated to fetching metadata.
listers, which may discover metadata as a side-effect of discovering origins.
deposit submitters, which push metadata to SWH from a third-party; usually at the same time as a software artifact
crawlers, which fetch metadata from an authority in a way that is none of the above (eg. by querying a specific API of the origin’s forge).
Storage API#
Artifact metadata#
Data model#
The storage database stores metadata on origins, and all software artifacts supported by the data model. They are represented using this structure (simplified Python code):
class RawExtrinsicMetadata(HashableObject, BaseModel):
object_type = "raw_extrinsic_metadata"
# target object
target: ExtendedSWHID
# source
discovery_date: datetime.datetime
authority: MetadataAuthority
fetcher: MetadataFetcher
# the metadata itself
format: str
metadata: bytes
# context
origin: Optional[str] = None
visit: Optional[int] = None
snapshot: Optional[CoreSWHID] = None
release: Optional[CoreSWHID] = None
revision: Optional[CoreSWHID] = None
path: Optional[bytes] = None
directory: Optional[CoreSWHID] = None
id: Sha1Git
The target
may be:
a regular core SWHID,
a SWHID-like string with type
ori
and the SHA1 of an origin URLa SWHID-like string with type
emd
and the SHA1 of an otherRawExtrinsicMetadata
object (to represent metadata on metadata objects)
id
is a sha1 hash of the RawExtrinsicMetadata
object itself;
it may be used in other RawExtrinsicMetadata
as target.
discovery_date
is a Python datetime.
authority
must be a dict containing keys type
and url
, and
fetcher
a dict containing keys name
and version
.
The authority and fetcher must be known to the storage before using this
endpoint.
format
is a text field indicating the format of the content of the
metadata
byte string, see extrinsic-metadata-formats.
metadata
is a byte array.
Its format is specific to each authority; and is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
Finally, the remaining fields allow metadata can be given on a specific artifact within a specified context (for example: a directory in a specific revision from a specific visit on a specific origin) which will be stored along the metadata itself.
For example, two origins may develop the same file independently; the information about authorship, licensing or even description may vary about the same artifact in a different context. This is why it is important to qualify the metadata with the complete context for which it is intended, if any.
The allowed context fields for each target
type are:
for
emd
(extrinsic metadata) andori
(origin): nonefor
snp
(snapshot):origin
(a URL) andvisit
(an integer)for
rel
(release): those above, plussnapshot
(the core SWHID of a snapshot)for
rev
(revision): all those above, plusrelease
(the core SWHID of a release)for
dir
(directory): all those above, plusrevision
(the core SWHID of a revision) andpath
(a byte string), representing the path to this directory from the root of therevision
for
cnt
(content): all those above, plusdirectory
(the core SWHID of a directory)
All keys are optional, but should be provided whenever possible. The dictionary may be empty, if metadata is fully independent from context.
In all cases, visit
should only be provided if origin
is
(as visit ids are only unique with respect to an origin).
Storage API#
The storage API offers three endpoints to manipulate origin metadata:
Adding metadata:
raw_extrinsic_metadata_add(metadata: List[RawExtrinsicMetadata])
which adds a list of
RawExtrinsicMetadata
objects, whosemetadata
field is a byte string obtained from a given authority and associated to thetarget
.Getting all metadata:
raw_extrinsic_metadata_get( target: ExtendedSWHID, authority: MetadataAuthority, after: Optional[datetime.datetime] = None, page_token: Optional[bytes] = None, limit: int = 1000, ) -> PagedResult[RawExtrinsicMetadata]:
returns a list of
RawExtrinsicMetadata
with the giventarget
and from the givenauthority
. Ifafter
is provided, only objects whose discovery date is more recent are returnered.PagedResult
is a structure containing the results themselves, and anext_page_token
used to fetch the next list of results, if any
Extrinsic metadata formats#
Formats are identified by an opaque string. When possible, it should be the MIME type already in use to describe the metadata format outside Software Heritage. Otherwise it should be unambiguous, printable ASCII without spaces, and human-readable.
Here is a list of all the metadata format stored:
application/vnd.github.v3+json
The metadata is the response of an API call to GitHub.
cpan-release-json
The metadata is the response of an API call to CPAN’s /release/{author}/{release} endpoint.
gitlab-project-json
The metadata is the response of an API call to Gitlab’s /api/v4/projects/:id endpoint.
gitea-project-json
The metadata is the response of an API call to Gitea’s /api/v1/repos/{owner}/{repo} endpoint.
gogs-project-json
The metadata is the response of an API call to Gogs’s /api/v1/repos/:owner/:repo endpoint.
pypi-project-json
The metadata is a release entry from a PyPI project’s JSON file, extracted and re-serialized.
replicate-npm-package-json
ditto, but from a replicate.npmjs.com project
nixguix-sources-json
ditto, but from https://nix-community.github.io/nixpkgs-swh/
original-artifacts-json
tarball data, see below
sword-v2-atom-codemeta
XML Atom document, with Codemeta metadata, as sent by a deposit client, see the Deposit protocol reference.
sword-v2-atom-codemeta-v2
Deprecated alias of
sword-v2-atom-codemeta
sword-v2-atom-codemeta-v2-in-json
Deprecated, JSON serialization of a
sword-v2-atom-codemeta
document.xml-deposit-info
Information about a deposit, to identify the provenance of a metadata object sent via swh-deposit, see below
Details on some of these formats:
original-artifacts-json#
This is a loosely defined format, originally used as a metadata
column
on the revision
table that changed over the years.
It is a JSON array, and each entry is a JSON object representing an archive (tarball, zipball, …) that was unpackaged by the SWH loader before loading its content in Software Heritage.
When writing this specification, it was stabilized to this format:
[
{
"length": <int>,
"filename": "<original filename>",
"checksums": {
"sha1": "<hex-encoded string>",
"sha256": "<hex-encoded string>",
},
"url": "<URL the archive was downloaded from>"
},
...
]
Older original-artifacts-json
were migrated to use this format,
but may be missing some of the keys.
xml-deposit-info#
Deposits with code objects are loaded as their own origin, so we can look them up in the deposit database from their metadata (which hold the origin as a context).
This is not true for metadata-only deposits, because we don’t create an origin for them; so we need to store this information somewhere. The naive solution would be to insert them in the Atom entry provided by the client, but it means altering a document before we archive it, which potentially corrupts it or loses part of the data.
Therefore, on each metadata-only deposit, the deposit creates an extra “metametadata” object, with the original metadata object as target, and using this format:
<deposit xmlns="https://www.softwareheritage.org/schema/2018/deposit">
<deposit_id>{{ deposit.id }}</deposit_id>
<deposit_client>{{ deposit.client.provider_url }}</deposit_client>
<deposit_collection>{{ deposit.collection.name }}</deposit_collection>
</deposit>