Roadmap 2023#
(Version 1.0, last modified 2023-03-13)
This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2023.
Live tracking of the roadmap implementation progress during the year is available from a dedicated GitLab board.
Collect#
Add support for write APIs features in GraphQL#
Lead: jayesh
Priority: low
Description:
Add support for write APIs in GraphQL (e.g., an API for Save Code Now) in order to cover 100% of the REST API features in the GraphQL API.
Includes work:
Implement write APIs
Enforce authorization configuration for restricted access features
KPIs:
GraphQL coverage of 100% of the REST API in production
Tooling for takedown notices#
Lead: lunar
Priority: high
Description:
Set up a workflow to handle takedown requests and improve the automation capabilities of the sysadmin tools for takedown notice processing.
Includes work:
Set up a specification for workflow integration in swh-web
Implement workflow integration
Set up technical specification for sysadmin tooling
Implement missing sysadmin tools (verification and automation)
Create sysadmin documentation for takedown notices
KPIs:
Takedown notice handling integrated into swh-web
Automated sysadmin tools for takedown notice processing
Automate add forge now#
Lead: vsellier
Priority: low
Description:
Add automation to Add forge now to ease the handling of Add forge now requests.
Includes work:
Automate ingestion process
Automate add forge now workflow
Set up and deploy the automation process in staging
Deploy automation process in production
KPIs:
Automated Add forge now processing tools and workflow in production
Minimize archival lag w.r.t. upstream code hosting platforms#
Lead: olasd
Priority: medium
Description:
Improve ingestion efficiency.
Make lag monitoring dashboards easy to find (for decision makers).
Includes work:
Implement git protocol V2 for Dulwich
Optimize scheduling policies
Optimize loaders
KPIs:
Number of out of date repos (absolute and per platform)
Total archive lag (e.g., in days)
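The two KPIs above could be computed along the following lines. This is a minimal sketch, assuming each tracked origin records when upstream last changed and when it was last archived; the `OriginStatus` type, its fields, and the `lag_report` helper are hypothetical names used for illustration and are not part of the Software Heritage code base.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class OriginStatus:
    """Hypothetical record describing one tracked repository."""

    url: str
    platform: str
    last_upstream_update: datetime
    last_archived: Optional[datetime]  # None if the origin was never archived


def origin_lag(origin: OriginStatus, now: datetime) -> timedelta:
    """Time during which the archive has been behind upstream (zero if up to date)."""
    archived = origin.last_archived
    if archived is not None and archived >= origin.last_upstream_update:
        return timedelta(0)
    return max(now - origin.last_upstream_update, timedelta(0))


def lag_report(origins: Iterable[OriginStatus], now: datetime) -> Dict[str, object]:
    """Aggregate the number of out-of-date origins and the total lag in days."""
    outdated_per_platform: Dict[str, int] = {}
    total = timedelta(0)
    for origin in origins:
        lag = origin_lag(origin, now)
        if lag > timedelta(0):
            count = outdated_per_platform.get(origin.platform, 0)
            outdated_per_platform[origin.platform] = count + 1
            total += lag
    return {
        "outdated_total": sum(outdated_per_platform.values()),
        "outdated_per_platform": outdated_per_platform,
        "total_lag_days": total / timedelta(days=1),
    }
```

Measuring lag from the last upstream change, rather than from the last visit, is one possible definition among several; the roadmap itself does not fix one.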
Extend archive coverage#
Lead: ardumont
Priority: medium
Description:
Add listers and loaders for not-yet-supported forges, package managers, and VCSs.
Listers and loaders can be developed in house or contributed by external partners, e.g., via dedicated grants.
Includes work:
Validate, publicly review, and deploy the listers and loaders pending in staging (Arch, AUR, Crates, Packagist, Rubygems, Fedora, Puppet, Hackage, Golang, Bower, Nix/Guix, CVS, pub.dev)
Implement new listers and loaders
KPIs:
Number of deployed listers
Number of deployed loaders
Preserve#
Explore possibility of replacing SHA1 with SHA1-DC#
Lead: olasd
Priority: high
Description:
Mainstream platforms like GitHub now use SHA1-DC
Includes work:
Study implications of aligning with the SHA1-DC adoption
KPIs:
Decision on whether to move to SHA1-DC, including any identified blockers
Regularly scrub journal, storage, and objstorage#
Lead: vlorentz
Priority: medium
Description:
Set up background jobs to regularly check, and repair when necessary, data validity in all SWH data stores. This includes both blobs (swh-objstorage) and other graph objects (swh-storage) on all the copies (in-house, Kafka, Azure, upcoming mirrors, etc.).
Includes work:
Implement storage scrubber for Cassandra
Add scrubbing for the object storage
Add metrics and Grafana dashboard for scrubbing process
Automatically repair and recover objects found to be invalid
KPIs:
List of scrubbers deployed in production
Monitoring tools deployed in production
Rolling report of operations per datastore including errors found and fixed at each iteration
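The check-and-repair loop described in this item can be illustrated with a small sketch. It assumes a content-addressed store modeled as a plain dictionary keyed by the SHA-256 hex digest of each object; the `object_id`, `scrub`, and `repair` functions are hypothetical and do not reflect the actual swh-scrubber implementation or its key scheme.

```python
import hashlib
from typing import Dict, Iterable, List


def object_id(data: bytes) -> str:
    """Identifier assumed for this sketch: SHA-256 hex digest of the content."""
    return hashlib.sha256(data).hexdigest()


def scrub(store: Dict[str, bytes]) -> List[str]:
    """Return the keys whose stored bytes no longer hash to the key."""
    return sorted(key for key, data in store.items() if object_id(data) != key)


def repair(
    store: Dict[str, bytes], replica: Dict[str, bytes], corrupted: Iterable[str]
) -> List[str]:
    """Overwrite corrupted objects with a replica copy, if that copy is itself valid.

    Returns the keys that were actually fixed.
    """
    fixed = []
    for key in corrupted:
        candidate = replica.get(key)
        if candidate is not None and object_id(candidate) == key:
            store[key] = candidate
            fixed.append(key)
    return fixed
```

Counting the keys returned by `scrub` and `repair` at each run would give the "errors found and fixed at each iteration" figures mentioned in the KPIs.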
Publicly available standard for SWHID version 1#
Lead: rdicosmo
Priority: high
Description:
Publish a stable version of the SWHID version 1 specification, approved by a standards body.
Includes work:
Publish publicly available standard
Start ISO standardization for SWHID V1
KPIs:
Published standard for SWHID version 1
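For readers unfamiliar with the identifier being standardized: a version 1 SWHID for a content object (a file) is the prefix `swh:1:cnt:` followed by the hexadecimal SHA-1 that Git computes for the same blob. The sketch below shows that computation; the function name `content_swhid` is chosen for illustration, and other object types (directories, revisions, releases, snapshots) need their own serialization, which is not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the version 1 SWHID of a file's content.

    The hash is taken over a Git-style blob: the header "blob <length>\\0"
    followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has the well-known Git blob hash e69de29b...
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```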
SWH Mirror at GRNET#
Lead: douardda
Priority: medium
Description:
Collaborate with GRNET to create a SWH Mirror
Includes work:
Guidance and contribution to GRNET architecture and infrastructure choices
Specific developments if necessary (to be determined according to the chosen technical solutions)
Help with deployment
KPIs:
Validated architecture and first POC
SWH Mirror at Duisburg-Essen university#
Lead: douardda
Priority: low
Description:
Collaborate with Duisburg-Essen university to create a SWH Mirror
Includes work:
Guidance and contribution to UniDue architecture and infrastructure choices
Specific developments if necessary (to be determined according to the chosen technical solutions)
Developments of tools for Winery replication (for Ceph-based object storage)
Help with deployment
KPIs:
Validated architecture and first POC
SWH Mirror at ENEA#
Lead: douardda
Priority: high
Description:
Collaborate with ENEA to create a SWH Mirror
Includes work:
Finalize object storage copy
Configure the stack for the mirror public deployment
KPIs:
SWH Mirror deployed on ENEA infrastructure and publicly available
Mirrors tooling#
Lead: douardda
Priority: high
Description:
Provide the common features required by the SWH mirrors
Includes work:
Set up feature flags on the web app and test modules activation/deactivation
Implement fallback mechanism for objstorage
Dedicated CI for the mirroring stack
KPIs:
Common features available for specific mirror instances
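One way to picture the objstorage fallback mechanism listed above: a mirror first reads from its own object storage and, if the object is missing there, tries one or more other backends. The sketch below is an assumption about the general pattern only; `DictBackend`, `FallbackObjStorage`, and `ObjNotFound` are hypothetical names and do not correspond to the swh-objstorage API.

```python
from typing import Dict, Optional, Sequence


class ObjNotFound(KeyError):
    """Raised when no backend holds the requested object (hypothetical)."""


class DictBackend:
    """In-memory stand-in for a real object storage backend."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self._objects = dict(objects or {})

    def get(self, obj_id: str) -> bytes:
        try:
            return self._objects[obj_id]
        except KeyError:
            raise ObjNotFound(obj_id) from None


class FallbackObjStorage:
    """Read from the primary backend first, then from each fallback in order."""

    def __init__(self, primary: DictBackend, fallbacks: Sequence[DictBackend]) -> None:
        self._backends = (primary, *fallbacks)

    def get(self, obj_id: str) -> bytes:
        for backend in self._backends:
            try:
                return backend.get(obj_id)
            except ObjNotFound:
                continue  # try the next backend
        raise ObjNotFound(obj_id)
```

A real implementation would also have to decide whether objects fetched from a fallback are copied back into the primary store; this sketch is read-only.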
Archive cold-copy at CINES via Vitam#
Lead: douardda
Priority: medium
Description:
Perform a first complete copy of the archive stored in Vitam @ CINES.
Keep the copy up to date periodically (period TBD).
Includes work:
Validate implementation of ORC format in Vitam
Run a Proof of Concept
Run the complete copy @ CINES
Configure/schedule the copy update process
KPIs:
First copy stored in Vitam
Update schedule defined
Support archiving repositories containing SHA1 hash conflicts on blobs#
Lead: olasd
Priority: high
Description:
Enable the use of multiple hash types for object checksums, in order to remove the limitations imposed by having SHA1 as a primary key for the object storage internally.
Includes work:
Implement the remaining low-level layers (model and API are ready)
KPIs:
Multiple hash storage facility in production
Ability to archive git repos that contain sample SHAttered collision blobs (they are currently detected and refused)
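The idea of keying objects on several checksums at once can be sketched as follows. The four algorithms and the `MultiHashStore` class are assumptions made for illustration; the actual low-level design is the subject of this work item and is not described in the roadmap.

```python
import hashlib
from typing import Dict, List, Tuple

# Assumed set of checksums for this sketch.
ALGORITHMS = ("sha1", "sha1_git", "sha256", "blake2s256")


def multi_hash(data: bytes) -> Dict[str, str]:
    """Compute every checksum in ALGORITHMS for one blob."""
    git_blob = b"blob %d\x00" % len(data) + data
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_blob).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
    }


def composite_key(hashes: Dict[str, str]) -> Tuple[str, ...]:
    """Primary key built from all checksums, so one colliding hash is not enough to clash."""
    return tuple(hashes[name] for name in ALGORITHMS)


class MultiHashStore:
    """Toy store in which SHA1 is a lookup index, not the primary key."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, ...], bytes] = {}

    def add(self, data: bytes) -> Tuple[str, ...]:
        key = composite_key(multi_hash(data))
        self._objects[key] = data
        return key

    def get_by_sha1(self, sha1: str) -> List[bytes]:
        """Return every object with this SHA1 (more than one if blobs collide)."""
        return [data for key, data in self._objects.items() if key[0] == sha1]
```

With such a key, two blobs that share a SHA1 but differ in SHA256 map to distinct entries, which is the property needed to store the SHAttered sample files side by side.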
Documentation#
Provide a landing page for docs.s.o#
Lead: lunar
Priority: high
Description:
Provide a user-friendly landing page for all documentation at docs.s.o, providing guidelines for each user type.
Includes work:
Finalize and publish the landing page content
Improve the organization of the left-column menus
KPIs:
Landing page in production
Technical debt#
Set up efficient and consistent swh-storage pagination#
Lead: jayesh
Priority: high
Description:
Define and implement an efficient structure for pagination in the data sources for swh-storage.
Pagination in the data sources (e.g., storage) is currently neither consistent nor client-friendly. This work will also involve refactoring some clients.
Includes work:
Design an efficient pagination architecture
Refactor obj-storage to implement the pagination
Identify and refactor existing clients that use swh-storage pagination
KPIs:
New pagination solution in production for swh-storage
Existing clients updated to use this solution
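As an illustration of what a consistent, client-friendly pagination contract might look like, the sketch below uses keyset (cursor-based) pagination: each page carries an opaque token, and the client passes it back to get the next page. The `Page` type, `list_items`, and `iter_all` are hypothetical names for this example and are not the design chosen for swh-storage.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Page(Generic[T]):
    """One page of results plus the token needed to fetch the next one."""

    results: List[T]
    next_page_token: Optional[str]  # None means there are no more pages


def list_items(
    sorted_items: List[str], page_token: Optional[str] = None, limit: int = 2
) -> Page[str]:
    """Return up to `limit` items strictly greater than `page_token`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    start = 0 if page_token is None else bisect.bisect_right(sorted_items, page_token)
    chunk = sorted_items[start : start + limit]
    has_more = start + limit < len(sorted_items)
    return Page(chunk, chunk[-1] if has_more else None)


def iter_all(fetch: Callable[[Optional[str]], Page[T]]) -> Iterator[T]:
    """Client-side helper that follows tokens until the last page."""
    token: Optional[str] = None
    while True:
        page = fetch(token)
        yield from page.results
        token = page.next_page_token
        if token is None:
            return
```

Because the token is the last key seen rather than a numeric offset, the cost of fetching a page does not grow with how deep the client has paged, which is one common reason to prefer this scheme.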
Improve support for malformed git commits#
Lead: vlorentz
Priority: high
Description:
Improve the git loader to make it able to deal with edge-case commits that cause Dulwich to crash due to unnecessary data validation.
Includes work:
Fix all crashes of the git loader caused by malformed git objects
Support commits whose “author” or “committer” field is missing
KPIs:
Ratio of crashes on commit ingestion by the git loader (before/after)
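To make the "missing author or committer" case concrete, here is a minimal sketch of a lenient parser for a raw git commit object. It treats every header as optional instead of rejecting the commit. It is a standalone illustration and uses neither Dulwich nor the git loader; `ParsedCommit` and `parse_commit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ParsedCommit:
    tree: Optional[bytes]
    parents: List[bytes]
    author: Optional[bytes]  # None when the header is absent
    committer: Optional[bytes]  # None when the header is absent
    message: bytes


def parse_commit(raw: bytes) -> ParsedCommit:
    """Parse a raw commit body without assuming any header is present."""
    head, _, message = raw.partition(b"\n\n")
    headers: Dict[bytes, List[bytes]] = {}
    current: Optional[List[bytes]] = None
    for line in head.split(b"\n"):
        if line.startswith(b" ") and current is not None:
            # Continuation line (e.g. inside a gpgsig header).
            current[-1] += b"\n" + line[1:]
        elif line:
            key, _, value = line.partition(b" ")
            current = headers.setdefault(key, [])
            current.append(value)

    def first(key: bytes) -> Optional[bytes]:
        values = headers.get(key)
        return values[0] if values else None

    return ParsedCommit(
        tree=first(b"tree"),
        parents=headers.get(b"parent", []),
        author=first(b"author"),
        committer=first(b"committer"),
        message=message,
    )
```

Keeping header values as raw bytes avoids a second class of failures, since author lines in old repositories are not always valid UTF-8.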
Tooling and infrastructure#
Dynamic infrastructure#
Lead: vsellier
Priority: high
Description:
Set up a dynamically scalable infrastructure for Software Heritage services
Includes work:
Set up an elastic workers infrastructure
Configure Kubernetes clusters
Monitoring/Alerting solution for container-based services
Ingest the logs of the dynamic components into the current ELK infrastructure
KPIs:
Dashboard displaying the status of the dynamic components:
- Number of listers running
- Number of loaders running
- RPC services status
Logs ingested and correctly parsed in Kibana
Clusters fully backed up
Use a common workflow management tool for swh-web#
Lead: lunar
Priority: medium
Description:
Find and integrate a common workflow management tool in swh-web for future modules that will require a workflow logic (takedown notices process, user support, etc.)
Includes work:
Investigate the existing tools, measuring advantages and drawbacks for each
Integrate the most relevant tool in swh-web
Document the usage with a sample module
KPIs:
Integrated workflow tool, ready to use, in swh-web
Provide a management-friendly monitoring dashboard of services#
Lead: vsellier
Priority: high
Description:
Provide a high-level, easy-to-find dashboard of running services with documented key indicators.
Includes work:
Gather public site metrics
Publish and document a dedicated dashboard
Add links to it on common web applications (web app and docs.s.o)
KPIs:
Indicators available for public sites status
Indicators for archive workers status
Indicators for archive behavior
Main dashboard that aggregates the indicators
Dashboard referenced in common web applications
Provenance in production#
Lead: douardda
Priority: high
Description:
Publish swh-provenance services in production, including revision and origin layers.
Includes work:
Build and deploy content index based on a winnowing algorithm
Filter provenance pipeline to process only tags and releases
Set up a production infrastructure for the Kafka-based revision layer (including monitoring)
Refactor and process the origin layer
Release provenance documentation
KPIs:
Provenance services available in production
% of archive covered
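The work list above mentions a content index based on a winnowing algorithm. As background, the sketch below shows the classic winnowing fingerprint selection: hash every k-gram, slide a window over the hashes, and keep the minimum of each window. The parameters `k` and `w`, the use of CRC32 as the hash, and the function names are assumptions made for this example; the roadmap does not specify how swh-provenance implements its index.

```python
import zlib
from typing import List, Set, Tuple


def kgram_hashes(data: bytes, k: int) -> List[int]:
    """Hash every run of k consecutive bytes (CRC32 is an arbitrary choice here)."""
    return [zlib.crc32(data[i : i + k]) for i in range(len(data) - k + 1)]


def winnow(data: bytes, k: int = 5, w: int = 4) -> Set[Tuple[int, int]]:
    """Return the selected fingerprints as (hash, position) pairs.

    In each window of w consecutive k-gram hashes, the minimum is kept,
    taking the rightmost occurrence on ties. Inputs too short to fill
    one window produce no fingerprint.
    """
    hashes = kgram_hashes(data, k)
    fingerprints: Set[Tuple[int, int]] = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start : start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints
```

The useful property is that any two inputs sharing a run of at least `w + k - 1` bytes are guaranteed to share at least one fingerprint hash, so matching content can be detected from a small sample of hashes.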
Scale-out objstorage in production as primary objstorage#
Lead: olasd
Priority: high
Description:
Have the Ceph-based objstorage for SWH (Winery) in production as primary storage and set up an equivalent MVP in staging (maybe using the same Ceph cluster for this)
Includes work:
Deploy Ceph objstorage/Winery on CEA infrastructure
Benchmark Ceph-based objstorage
Switch to Ceph-based objstorage as primary storage
Handle Mirroring
KPIs:
Ceph-based obj-storage in production
Cassandra in production as primary storage#
Lead: vsellier
Priority: high
Description:
Use Cassandra as primary storage in production, replacing PostgreSQL
Includes work:
Finalize and validate the replayed data
Install the new bare metal servers for staging and production
Deploy a Cassandra-based production instance for tests
Benchmark the Cassandra infrastructure
Switch to Cassandra in production for primary storage
KPIs:
Replayed data validated
Live staging archive instance in parallel with the legacy PostgreSQL instance
Live production archive instance in parallel with the legacy PostgreSQL instance
Cassandra primary storage in staging
Cassandra primary storage in production
Design and test a Continuous Deployment infrastructure#
Lead: vsellier
Priority: medium
Description:
Set up a Continuous Deployment infrastructure in order to improve bug detection and validate the future elastic infrastructure components
Includes work:
Migrate away from Debian packaging for deployment (to PyPI packages?)
Build a Docker image per deployable service
Build the deployment tooling
Reset and redeploy the stack after commits
Execute acceptance tests
Identify whether a deployment can be done by the CI or needs human interaction (mostly by detecting whether a migration is present)
Integration tests
KPIs:
Docker image build triggered by a new version deployed on PyPI
Docker image built by the CI
Component versions updated by the CI
Automatically redeployed staging on new release
Testing in staging (or another environment) before pushing to production
Design and test next generation CI Automation#
Lead: olasd
Priority: low
Description:
Design and test solutions to improve the current Continuous Integration tools so that they match the infrastructure evolutions and provide more features
Includes work:
Current CI state of the art and requirements specification
Evaluation of a migration from Jenkins to GitLab CI (and effective migration if relevant)
Code audit tools integration (static and/or dynamic analysis)
KPIs:
GitLab CI used or tested in one or more sysadmin projects
Evaluation matrix (pros/cons) for a migration from Jenkins to GitLab CI or another tool
Pros/cons of deploying a code audit tool
Graph export and graph compression in production#
Lead: vlorentz
Priority: high
Description:
Have the graph compression pipeline running in production with less than a month of lag.
Deployment, hosting and pipeline tooling.
Includes work:
Add JVM monitoring
Finish automation scripts
Deploy on a dedicated machine
KPIs:
Graph compression pipeline in production
Last update date / number of updates per year