Provenance Overview

Intended audience: staff members

The provenance application is a standalone gRPC server. It does not depend on other swh (Software Heritage) services.

The application runs on Kubernetes in both production and staging. It is reached through the web API, and staff members can also query it directly over the VPN.

The service is read-only: it answers queries and performs no writes.

Its backend relies on Parquet files.

Authentication

Authentication goes through the standard web API authentication mechanism.

Web API users have access to the provenance API when they are assigned the swh.web.api.provenance permission.

Staff members have direct access to the provenance API.

Datasets

The provenance server needs two datasets:

  • “versioned” Parquet files: the main database queried by the provenance server

  • complemented by the following “versioned” graph files:

    • graph.pthash

    • graph.pthash.order

    • graph.node2swhid.bin

    • graph.node2type.bin

    • graph.nodes.count.txt

    • graph.property.message.bin

    • graph.property.message.offset.bin

    • graph.property.tag_name.bin

    • graph.property.tag_name.offset.bin

Note: The graph files should have the same version as the graph that generated the provenance Parquet files.
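The requirements above can be checked before starting the server. The following is a minimal sketch, not the deployment's actual tooling: it assumes all graph files sit in a single directory and that the graph and Parquet dataset versions are available as plain strings. The function names and the version labels are hypothetical; only the file names come from the list above.

```python
from pathlib import Path

# Graph files listed in this section; all of them are expected to be present.
REQUIRED_GRAPH_FILES = [
    "graph.pthash",
    "graph.pthash.order",
    "graph.node2swhid.bin",
    "graph.node2type.bin",
    "graph.nodes.count.txt",
    "graph.property.message.bin",
    "graph.property.message.offset.bin",
    "graph.property.tag_name.bin",
    "graph.property.tag_name.offset.bin",
]


def missing_graph_files(graph_dir):
    """Return the required graph files that are absent from graph_dir."""
    root = Path(graph_dir)
    return [name for name in REQUIRED_GRAPH_FILES if not (root / name).is_file()]


def check_datasets(graph_dir, graph_version, parquet_version):
    """Raise ValueError if the versions differ or a graph file is missing.

    The version arguments are hypothetical labels (assumption): how versions
    are actually recorded is not specified in this document.
    """
    if graph_version != parquet_version:
        raise ValueError(
            f"graph version {graph_version!r} does not match "
            f"Parquet version {parquet_version!r}"
        )
    missing = missing_graph_files(graph_dir)
    if missing:
        raise ValueError("missing graph files: " + ", ".join(missing))
```

Calling `check_datasets(path, graph_version, parquet_version)` returns silently when everything is consistent and raises `ValueError` with an explanatory message otherwise.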

Internal Domains

Because the provenance service is used through the web API, it has no public domain, only internal ones.

For each environment, the hostnames will be:

  • staging: provenance.internal.staging.swh.network

  • production: provenance.internal.softwareheritage.org
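A client script might select the hostname by environment name. The sketch below is illustrative only: the helper `provenance_hostname` is hypothetical, and no port is included because this document does not specify one.

```python
# Internal hostnames as listed in this section.
INTERNAL_HOSTNAMES = {
    "staging": "provenance.internal.staging.swh.network",
    "production": "provenance.internal.softwareheritage.org",
}


def provenance_hostname(environment):
    """Return the internal provenance hostname for the given environment."""
    try:
        return INTERNAL_HOSTNAMES[environment]
    except KeyError:
        known = ", ".join(sorted(INTERNAL_HOSTNAMES))
        raise ValueError(
            f"unknown environment {environment!r}; expected one of: {known}"
        ) from None
```

Failing loudly on an unknown environment avoids silently falling back to production from a misconfigured script.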

Architecture

### Staging

Provenance setup in staging

### Production

Provenance setup in production