Roadmap 2023#

(Version 1.0, last modified 2023-03-13)

This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2023.

Live tracking of the roadmap implementation progress during the year is available from a dedicated GitLab board.

Collect#

Add support for write APIs features in GraphQL#

Description:

Add support for write APIs in GraphQL (e.g., an API for save code now) in order to cover 100% of the REST API features in the GraphQL API.

Includes work:

  • Implement write APIs

  • Enforce authorization configuration for restricted access features

KPIs:

  • GraphQL coverage of 100% of the REST API in production
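
As a concrete illustration of what such a write API could look like, here is a hedged sketch of submitting a save code now request through a GraphQL mutation from Python. The endpoint URL, the mutation name (saveCodeNow) and its fields are assumptions made for illustration, not the final schema.

```python
# Hypothetical sketch only: the mutation name, its fields and the endpoint URL
# are assumptions, not the actual SWH GraphQL schema.
from typing import Optional

import requests

GRAPHQL_ENDPOINT = "https://archive.softwareheritage.org/graphql/"  # assumed URL

SAVE_CODE_NOW_MUTATION = """
mutation SaveCodeNow($visitType: String!, $originUrl: String!) {
  saveCodeNow(visitType: $visitType, originUrl: $originUrl) {
    id
    status
  }
}
"""

def save_code_now(origin_url: str, visit_type: str = "git",
                  token: Optional[str] = None) -> dict:
    """Submit a save code now request through the (assumed) GraphQL mutation."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    response = requests.post(
        GRAPHQL_ENDPOINT,
        json={
            "query": SAVE_CODE_NOW_MUTATION,
            "variables": {"visitType": visit_type, "originUrl": origin_url},
        },
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```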

Tooling for takedown notices#

Description:

Set up a workflow to handle takedown requests and improve automation capabilities of the sysadmin tools for takedown notices processing.

Includes work:

  • Set up a specification for workflow integration in swh-web

  • Implement workflow integration

  • Set up technical specification for sysadmin tooling

  • Implement missing sysadmin tools (verification and automation)

  • Create sysadmin documentation for takedown notices

KPIs:

  • Takedown notice handling integrated into swh-web

  • Automated sysadmin tools for takedown notices processing

Automate add forge now#

Description:

Set up automation capabilities for Add forge now to ease the handling of Add forge now requests.

Includes work:

  • Automate ingestion process

  • Automate add forge now workflow

  • Set up and deploy the automation process in staging

  • Deploy automation process in production

KPIs:

  • Automated Add forge now processing tools and workflow in production

Minimize archival lag w.r.t. upstream code hosting platforms#

Description:

Improve ingestion efficiency and make lag monitoring dashboards easy to find (for decision makers).

Includes work:

  • Implement git protocol V2 for Dulwich

  • Optimize scheduling policies

  • Optimize loaders

KPIs:

  • Number of out of date repos (absolute and per platform)

  • Total archive lag (e.g., in days)

Extend archive coverage#

Description:

Add listers and loaders for not-yet-supported forges, package managers, and VCS. Listers and loaders can be developed in house or contributed by external partners, e.g., via dedicated grants.

Includes work:

  • Validate, publicly review, and deploy the listers and loaders pending in staging (Arch, AUR, Crates, Packagist, Rubygems, Fedora, Puppet, Hackage, Golang, Bower, Nix/Guix, CVS, pub.dev)

  • Implement new listers and loaders

KPIs:

  • Number of deployed listers

  • Number of deployed loaders

Preserve#

Explore possibility of replacing SHA1 with SHA1-DC#

Description:

Mainstream platforms like GitHub now use SHA1-DC (SHA-1 with collision detection); study whether Software Heritage should follow suit.

Includes work:

  • Study implications of aligning with the SHA1-DC adoption

KPIs:

  • Decision on whether to move to SHA1-DC, or identified blockers

Regularly scrub journal, storage, and objstorage#

Description:

Set up background jobs to regularly check - and repair when necessary - data validity in all SWH data stores. This includes both blobs (swh-objstorage) and other graph objects (swh-storage), on all the copies (in-house, Kafka, Azure, upcoming mirrors, etc.).

Includes work:

  • Implement storage scrubber for Cassandra

  • Add scrubbing for the object storage

  • Add metrics and Grafana dashboard for scrubbing process

  • Automatically repair and recover objects found to be invalid

KPIs:

  • List of scrubbers deployed in production

  • Monitoring tools deployed in production

  • Rolling report of operations per datastore including errors found and fixed at each iteration
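
For illustration, the core of a blob scrubbing pass boils down to re-reading each object and recomputing its checksum; below is a minimal sketch, assuming an iterable of (expected checksum, raw bytes) pairs provided by a hypothetical objstorage wrapper. The real scrubbers also record results per datastore and trigger recovery from other copies.

```python
import hashlib
from typing import Iterable, List, Tuple

def scrub_blobs(blobs: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Return the ids (hex SHA1) of blobs whose content no longer matches their checksum.

    `blobs` is a hypothetical iterator over (expected_sha1, raw_content) pairs.
    """
    corrupted = []
    for expected_sha1, data in blobs:
        if hashlib.sha1(data).hexdigest() != expected_sha1:
            corrupted.append(expected_sha1)
    return corrupted
```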

Publicly available standard for SWHID version 1#

Description:

Publish a stable version of the SWHID version 1 specification, approved by a standard organization body.

Includes work:

  • Publish a publicly available standard

  • Start the ISO standardization process for SWHID v1

KPIs:

  • Published standard for SWHID version 1
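
As background, the SWHID v1 specification defines intrinsic identifiers computed from the object itself; here is a minimal sketch for a content object ("cnt"), whose identifier is the SHA1 of a Git-style blob header followed by the raw bytes.

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute the core SWHID of a content object, per the SWHID v1 specification."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(content_swhid(b"Hello, World!\n"))
```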

SWH Mirror at GRNET#

Description:

Collaborate with GRNET to create a SWH Mirror

Includes work:

  • Guidance and contribution to GRNET architecture and infrastructure choices

  • Specific developments if necessary (to be determined according to the chosen technical solutions)

  • Help with deployment

KPIs:

  • Validated architecture and first POC

SWH Mirror at Duisburg-Essen university#

Description:

Collaborate with Duisburg-Essen university to create a SWH Mirror

Includes work:

  • Guidance and contribution to UniDue architecture and infrastructure choices

  • Specific developments if necessary (to be determined according to the chosen technical solutions)

  • Development of tools for Winery replication (for the Ceph-based object storage)

  • Help with deployment

KPIs:

  • Validated architecture and first POC

SWH Mirror at ENEA#

Description:

Collaborate with ENEA to create a SWH Mirror

Includes work:

  • Finalize object storage copy

  • Configure the stack for the mirror public deployment

KPIs:

  • SWH Mirror deployed on ENEA infrastructure and publicly available

Mirrors tooling#

Description:

Provide common features required by the SWH mirrors

Includes work:

  • Set up feature flags on the web app and test module activation/deactivation (sketched below)

  • Implement fallback mechanism for objstorage

  • Dedicated CI for the mirroring stack

KPIs:

  • Common features available for specific mirror instances
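
As an example of the feature flag work item above, here is a minimal sketch of a settings-driven flag in the Django-based web app, letting a mirror operator deactivate modules that do not apply to their instance; the setting name and flag keys are hypothetical.

```python
from django.conf import settings

def feature_enabled(name: str) -> bool:
    """Check a hypothetical SWH_FEATURE_FLAGS mapping; features default to enabled."""
    return getattr(settings, "SWH_FEATURE_FLAGS", {}).get(name, True)

# In a view, a mirror could then hide e.g. the "save code now" module:
# if not feature_enabled("save_code_now"):
#     return HttpResponseNotFound()
```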

Archive cold-copy at CINES via Vitam#

Description:

Perform a first complete copy of the archive, stored in Vitam at CINES, and keep the copy up to date periodically (period TBD).

Includes work:

  • Validate the implementation of the ORC format in Vitam

  • Run a Proof of Concept

  • Run the complete copy at CINES

  • Configure/schedule the copy update process

KPIs:

  • First copy stored in Vitam

  • Updates calendar defined
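
For illustration, writing a batch of export rows in the ORC columnar format can be done with pyarrow, as in the sketch below; the column layout is a made-up example, not the actual export schema agreed with Vitam.

```python
import pyarrow as pa
import pyarrow.orc as orc

# Hypothetical two-column batch; the real export schema is defined with Vitam.
batch = pa.table({
    "swhid": ["swh:1:cnt:..."],   # placeholder identifier
    "length": [1234],
})
orc.write_table(batch, "contents-000.orc")
```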

Support archiving repositories containing SHA1 hash conflicts on blobs#

Description:

Enable the use of multiple hash types for object checksums, in order to remove the limitations imposed by having SHA1 as the internal primary key of the object storage.

Includes work:

  • Implement the remaining low-level layers (model and API are ready)

KPIs:

  • Multiple hash storage facility in production

  • Ability to archive git repos that contain the sample SHAttered collision blobs (they are currently detected and refused)
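
As an illustration of why multiple hashes remove the SHA1 bottleneck: the SWH data model already defines four checksums per content object, so two blobs colliding on SHA1 (such as the SHAttered PDFs) still differ on the other digests. A minimal sketch of computing that hash set with hashlib:

```python
import hashlib
from typing import Dict

def multihash(data: bytes) -> Dict[str, str]:
    """Compute the checksums used by the SWH data model for a content object."""
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
    }
```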

Share#

Propose Web UI sections for dedicated partner collections#

Description:

Design and test the creation of dedicated collection pages (lists of origins associated with or provided by a partner)

Includes work:

  • Design a web UI feature for specific software collections (lists of origins) based on custom criteria (intrinsic and/or extrinsic metadata)

KPIs:

  • Specification and mockup for this feature

Create a cost-calculator in the Vault#

Description:

Implement a cost-calculator feature in swh-vault in order to estimate the cost of computing before cooking an artifact. The purpose of this feature is to prevent overload in some edge cases and possibly establish a rate-limiting system to avoid abusive usage of the vault.

Includes work:

  • Design calculation rules

  • Implement the cost-calculator

  • Make it configurable according to the user profile

KPIs:

  • Cost-calculation activated on swh-vault in production
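
Below is a purely illustrative sketch of the kind of estimate involved, where the cost grows with the number of objects to retrieve and the total size of the artifact; the weights and quota are made-up parameters, not actual vault settings.

```python
from dataclasses import dataclass

@dataclass
class CookingCostEstimate:
    n_objects: int     # objects to retrieve from storage
    total_bytes: int   # uncompressed size of the artifact

    def score(self) -> float:
        # hypothetical weighting: 1 unit per 10k objects + 1 unit per GiB
        return self.n_objects / 10_000 + self.total_bytes / 2**30

def allow_cooking(estimate: CookingCostEstimate, user_quota: float = 100.0) -> bool:
    """Reject or defer cooking requests whose estimated cost exceeds the user's quota."""
    return estimate.score() <= user_quota
```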

Publish derived datasets#

Description:

Set up tools to automate the publication of derived datasets, and generate specific datasets for research purposes throughout the year, on request by rdicosmo and zack.

Includes work:

  • Finalize and maintain the automation pipeline (Luigi) for dataset generation

  • Build new datasets when requested

KPIs:

  • Generation pipeline available in production

  • Scheduled and regularly published derived datasets

Collect and index forge metadata#

Description:

Collect and index metadata from more forges and package managers in order to expand metadata coverage.

Includes work:

  • Provide a prioritized list of forges/package managers to process

  • Improve the performance of indexers to reduce the lag between metadata collection and indexing

  • Implement and deploy indexers for not-yet-supported forges/package managers

KPIs:

  • Number of new forges supported / % indexed for each

  • Number of new package managers supported / % indexed for each

Evaluate the storage of indexed metadata in a triple-store#

Description:

Evaluate the opportunity of storing indexed metadata in a triple store, instead of the current ElasticSearch architecture, in order to prevent crashes due to embedded JSON-LD documents being treated as regular JSON, and to add support for relations between documents.

A proper triple store will be evaluated for this purpose; Virtuoso (https://virtuoso.openlinksw.com) in particular looks promising, as it supports both SPARQL and full-text search.

Includes work:

  • Try and evaluate a proper triple-store (Virtuoso) on a testing infrastructure

  • According to the conclusions of the evaluation, decide whether to choose this triple-store solution

KPIs:

  • Decision to switch to a triple-store for indexed metadata storage
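
For the evaluation, querying indexed metadata through the triple store's SPARQL endpoint could look like the sketch below (using the SPARQLWrapper library); the endpoint URL and graph layout are assumptions made for testing purposes.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://triplestore.internal:8890/sparql")  # assumed test endpoint
sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT ?origin WHERE {
  ?origin schema:programmingLanguage "Python" .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["origin"]["value"])
```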

Release a first version of the swh-scanner product#

Description:

Industrialize and improve the swh-scanner CLI to provide a full-featured product ready for regular use.

Includes work:

  • Improve the concurrency model on edge cases

  • Set up an enhanced result dashboard

  • Implement advanced filtering capabilities

  • Provide exhaustive documentation

  • Add provenance information (depending on provenance progress)

KPIs:

  • Release and announce a first version of swh-scanner

Webhook-based notification for long-running user tasks#

Description:

Create a reusable event-based webhook architecture and implement it for the relevant SWH features

Includes work:

  • Identify technical issues and design options

  • Specification and implementation of a standard core

  • Implementation for origin visit

  • Implementation for add forge now

  • Implementation for save code now

  • Implementation for vault cooking

  • Implementation for deposit

KPIs:

  • Number of services that support webhook-based notifications
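
As a hedged sketch of the core mechanism, a webhook delivery typically signs the payload with a per-subscriber secret so receivers can authenticate the sender; the header names and event naming below are assumptions, not the final SWH design.

```python
import hashlib
import hmac
import json

import requests

def deliver_webhook(url: str, secret: bytes, event_type: str, payload: dict) -> int:
    """POST a signed event notification to a subscriber endpoint (illustrative only)."""
    body = json.dumps({"event": event_type, "data": payload}).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    response = requests.post(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Webhook-Event": event_type,        # hypothetical header names
            "X-Webhook-Signature": signature,
        },
        timeout=10,
    )
    return response.status_code

# e.g. notify a subscriber that an origin visit has completed:
# deliver_webhook("https://example.org/hooks/swh", b"s3cret",
#                 "origin.visit.finished", {"origin_url": "https://..."})
```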

Self-host Software Stories software stack#

Description:

Deploy a Software Stories instance hosted on the SWH infrastructure

Includes work:

  • Define and document the infrastructure requirements

  • Deploy and document (Operations / backups / …)

  • Migrate the current stories to the SWH instance

  • Establish the migration plan / redirection plan

KPIs:

  • SWH stories site available

  • Documentation written

  • Current stories migrated to the SWH instance

  • Public software stories instance migrated to the SWH instance

Design presentation of Metadata on Web UI#

Description:

Design the presentation of intrinsic and extrinsic metadata for any artifact in the web UI and add linked data capabilities (Semantic Web solutions)

Includes work:

  • Specify the expected use cases

  • Design metadata view for Web UI

  • Allow export of metadata in multiple formats (APA, BibTeX, CodeMeta, CFF)

  • Assistance and contribution to CodeMeta

  • Add linked data capabilities

KPIs:

  • Specification and POC
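
As a toy illustration of one export path, the sketch below converts a CodeMeta document (JSON-LD, schema.org-based terms) into a BibTeX @software entry; the field mapping is deliberately simplified.

```python
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-project",
    "author": [{"@type": "Person", "givenName": "Ada", "familyName": "Lovelace"}],
    "datePublished": "2023-01-01",
    "codeRepository": "https://forge.example.org/example-project",
}

def codemeta_to_bibtex(doc: dict) -> str:
    """Very simplified CodeMeta -> BibTeX conversion, for illustration only."""
    authors = " and ".join(
        "%s, %s" % (a.get("familyName", ""), a.get("givenName", ""))
        for a in doc.get("author", [])
    )
    fields = {
        "title": doc.get("name", ""),
        "author": authors,
        "date": doc.get("datePublished", ""),
        "url": doc.get("codeRepository", ""),
    }
    body = ",\n".join("  %s = {%s}" % (key, value) for key, value in fields.items())
    return "@software{%s,\n%s\n}" % (doc.get("name", "software"), body)

print(codemeta_to_bibtex(codemeta))
```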

Documentation#

Provide a landing page for docs.s.o#

Description:

Provide a user-friendly landing page for all documentation at docs.s.o, providing guidelines for each user type.

Includes work:

  • Finalize and publish the landing page content

  • Improve the organization of the left-column menus

KPIs:

  • Landing page in production

Technical debt#

Setup efficient and consistent swh-storage pagination#

Description:

Define and implement an efficient structure for pagination in the data sources for swh-storage.

Pagination in the data sources (e.g., storage) is not very consistent or client-friendly. Defining and implementing an efficient structure will be a good improvement. This will also involve refactoring some clients.

Includes work:

  • Design an efficient pagination architecture

  • Refactor obj-storage to implement the pagination

  • Identify and refactor existing clients that use swh-storage pagination

KPIs:

  • New pagination solution in production for swh-storage

  • Existing clients updated to use this solution
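
As a sketch of the kind of interface under discussion, cursor-based pagination returns a page of results plus an opaque token to fetch the next page, which clients can drain transparently; class and parameter names here are illustrative, not the final swh-storage API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")

@dataclass
class PagedResult(Generic[T]):
    results: List[T]
    next_page_token: Optional[str] = None  # None means there is no further page

def iter_all(fetch_page: Callable[..., "PagedResult[T]"],
             page_size: int = 1000) -> Iterator[T]:
    """Drain a paginated endpoint so callers never handle tokens themselves."""
    token = None
    while True:
        page = fetch_page(page_token=token, limit=page_size)
        yield from page.results
        token = page.next_page_token
        if token is None:
            break
```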

Improve support for malformed git commits#

Description:

Improve the git loader so that it can handle edge-case commits that cause Dulwich to crash due to unnecessary data validation.

Includes work:

  • Fix all crashes of the git loader caused by malformed git objects

  • Support commits whose “author” or “committer” field is missing

KPIs:

  • Ratio of crashes during commit ingestion by the git loader (before/after)
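
To illustrate the kind of leniency required: a raw git commit object may lack an "author" or "committer" header, which strict parsers reject. Below is a toy parser that simply records missing fields as None; it is unrelated to the actual Dulwich internals.

```python
def parse_commit_headers(raw: bytes) -> dict:
    """Tolerantly parse the header section of a raw git commit object."""
    headers = {"tree": None, "parent": [], "author": None, "committer": None}
    header_section = raw.split(b"\n\n", 1)[0]  # headers end at the first blank line
    for line in header_section.split(b"\n"):
        key_bytes, _, value = line.partition(b" ")
        key = key_bytes.decode("ascii", "replace")
        if key == "parent":
            headers["parent"].append(value)
        elif key in headers:
            headers[key] = value
    return headers
```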

Tooling and infrastructure#

Dynamic infrastructure#

Description:

Set up a dynamically scalable infrastructure for Software Heritage services

Includes work:

  • Set up an elastic worker infrastructure

  • Configure Kubernetes clusters

  • Monitoring/Alerting solution for container-based services

  • Ingest the logs of the dynamic components into the current ELK infrastructure

KPIs:

  • Dashboard displaying the status of the dynamic components: number of listers running, number of loaders running, RPC services status

  • Logs ingested and correctly parsed in Kibana

  • Clusters fully backed up

Use a common workflow management tool for swh-web#

Description:

Find and integrate a common workflow management tool in swh-web for future modules that will require a workflow logic (takedown notices process, user support, etc.)

Includes work:

  • Investigate the existing tools, measuring advantages and drawbacks for each

  • Integrate the most relevant tool in swh-web

  • Document the usage with a sample module

KPIs:

  • Integrated workflow tool, ready to use, in swh-web

Provide a management-friendly monitoring dashboard of services#

Description:

Provide a high-level, easy-to-find dashboard of running services with documented key indicators.

Includes work:

  • Gather public site metrics

  • Publish and document a dedicated dashboard

  • Add links to it on common web applications (web app and docs.s.o)

KPIs:

  • Indicators available for public sites status

  • Indicators for archive workers status

  • Indicators for archive behavior

  • Main dashboard that aggregates the indicators

  • Dashboard referenced in common web applications

Provenance in production#

Description:

Publish swh-provenance services in production, including revision and origin layers.

Includes work:

  • Build and deploy a content index based on a winnowing algorithm (sketched below)

  • Filter provenance pipeline to process only tags and releases

  • Set up a production infrastructure for the Kafka-based revision layer (including monitoring)

  • Refactor and process the origin layer

  • Release provenance documentation

KPIs:

  • Provenance services available in production

  • % of archive covered
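
As background on the content index work item above, winnowing (Schleimer et al., SIGMOD 2003) fingerprints a byte stream by hashing every k-gram and keeping the minimum hash of each sliding window; below is a minimal sketch with illustrative parameters, not the production indexer.

```python
import hashlib
from typing import Set

def winnow(data: bytes, k: int = 16, w: int = 8) -> Set[int]:
    """Return a winnowing fingerprint: the minimum k-gram hash of each window."""
    kgram_hashes = [
        int.from_bytes(hashlib.sha1(data[i:i + k]).digest()[:8], "big")
        for i in range(len(data) - k + 1)
    ]
    fingerprints = set()
    for i in range(max(len(kgram_hashes) - w + 1, 0)):
        fingerprints.add(min(kgram_hashes[i:i + w]))
    return fingerprints
```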

Scale-out objstorage in production as primary objstorage#

Description:

Have the Ceph-based objstorage for SWH (Winery) in production as primary storage and set up an equivalent MVP in staging (possibly using the same Ceph cluster for this)

Includes work:

  • Deploy Ceph objstorage/Winery on CEA infrastructure

  • Benchmark Ceph-based objstorage

  • Switch to Ceph-based objstorage as primary storage

  • Handle mirroring

KPIs:

  • Ceph-based obj-storage in production

Cassandra in production as primary storage#

Description:

Use Cassandra as primary storage in production, replacing PostgreSQL

Includes work:

  • Finalize and validate the replayed data

  • Install the new bare metal servers for staging and production

  • Deploy a Cassandra-based production instance for tests

  • Benchmark the Cassandra infrastructure

  • Switch to Cassandra in production for primary storage

KPIs:

  • Replayed data validated

  • Live staging archive instance running in parallel with the legacy PostgreSQL instance

  • Live production archive instance running in parallel with the legacy PostgreSQL instance

  • Cassandra primary storage in staging

  • Cassandra primary storage in production

Design and test a Continuous Deployment infrastructure#

Description:

Set up a Continuous Deployment infrastructure in order to improve bug detection and validate the future elastic infrastructure components

Includes work:

  • Migrate away from Debian packaging for deployment (to PyPI packages?)

  • Build a docker image per deployable service

  • Build the deployment tooling

  • Reset and redeploy the stack after commits

  • Execute acceptance tests

  • Identify whether a deployment can be done by the CI or needs human interaction (mostly, detect whether a migration is present)

  • Integration tests

KPIs:

  • Docker image build triggered by a new version deployed on PyPI

  • Docker image built by the CI

  • Component versions updated by the CI

  • Staging automatically redeployed on new releases

  • Testing in staging (or another environment) before pushing to production

Design and test next generation CI Automation#

Description:

Design and test solutions in order to improve the current Continuous Integration tools, to match the evolution of the infrastructure and provide more features

Includes work:

  • Current CI state of the art and requirements specification

  • Evaluation of a migration from Jenkins to GitLab CI (and effective migration if relevant)

  • Code audit tools integration (static and/or dynamic analysis)

KPIs:

  • GitLab CI used or tested in one or more sysadmin projects

  • Evaluation matrix (pros/cons) for a migration from Jenkins to GitLab CI or another tool

  • Pros/cons of deploying a code audit tool

Graph export and graph compression in production#

Description:

Have the graph compression pipeline running in production with less than a month of lag. This covers deployment, hosting, and pipeline tooling.

Includes work:

  • Add JVM monitoring

  • Finish automation scripts

  • Deploy on a dedicated machine

KPIs:

  • Graph compression pipeline in production

  • Last update date / number of updates per year