Roadmap 2022#
(Version 1.0, last modified 2022-04-01)
This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2022.
Live tracking of the roadmap implementation progress during the year is available from a dedicated Kanban board.
Collect#
Extend archive coverage (2+2 loaders/listers)#
Lead: ardumont
Tags: coverage
Task: T4079
Effort: variable, depending on the chosen listers/loaders (4 PM?)
Priority: Medium
Deploy at least 2 additional loaders (of currently unsupported VCS/package formats) and 2 additional listers (of currently unsupported hosting platforms), expanding the coverage of the Software Heritage archive. Listers and loaders can be developed in house or contributed by external partners, e.g., via dedicated grants.
KPIs:
Number of new loaders/listers deployed
Number of origins archived/listed
Minimize archival lag w.r.t. upstream code hosting platforms#
Lead: olasd
Tags: performance, coverage
Task: T4080
Effort: 3 PM
Priority: High
Includes work:
Quantify and monitor in real-time the lag, especially for major platforms (GitHub, GitLab.com, etc.)
Improve ingestion efficiency (optimize loaders, especially the Git loader, optimize scheduling policies) - T2207
Make lag monitoring dashboards easy to find (for decision makers)
KPIs:
Number of out of date repos (absolute and per platform)
Total archive lag (e.g., in days)
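As an illustration of the two KPIs above, here is a minimal sketch of how they could be computed from per-origin timestamps. The `OriginStatus` record, its field names, and the sample data are hypothetical; they are not taken from the Software Heritage scheduler or storage schemas.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OriginStatus:
    """Hypothetical record: one origin, as seen upstream and in the archive."""
    platform: str
    last_upstream_update: datetime
    last_archived_visit: datetime


def lag(origin: OriginStatus) -> timedelta:
    """Lag is zero when the last visit happened after the last upstream update."""
    delta = origin.last_upstream_update - origin.last_archived_visit
    return max(delta, timedelta(0))


def lag_kpis(origins):
    """Return (out-of-date origin count per platform, total lag in days)."""
    out_of_date = Counter(o.platform for o in origins if lag(o) > timedelta(0))
    total_days = sum(lag(o).total_seconds() for o in origins) / 86400
    return out_of_date, total_days


# Made-up sample data: two origins lag behind upstream, one is up to date.
now = datetime(2022, 4, 1)
origins = [
    OriginStatus("github", now, now - timedelta(days=3)),
    OriginStatus("github", now - timedelta(days=10), now - timedelta(days=1)),
    OriginStatus("gitlab", now, now - timedelta(days=2)),
]
per_platform, total_days = lag_kpis(origins)
```

Summing per-origin lag is only one possible definition of "total archive lag"; a median or a maximum per platform would be equally easy to derive from the same records.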
Add forge now#
Lead: ardumont
Tags: coverage
Task: T1538
Effort: 3 PM
Priority: High
Includes work:
Make it user-driven, simple, and efficient to fully and recurrently archive a new instance of an already supported code hosting platform.
User-facing web form allowing any user to propose the archival of a new forge instance, and moderation web UI to validate archival requests before ingestion. T4047
Admin tooling and UI to deal with received submissions. T4058
Include a free-form suggestion box for forges that are not supported yet (to replace the currently poorly maintained wiki page). Possibly to be integrated with the user support system described elsewhere in the roadmap.
KPIs:
Number of forges/instances added
Integrate deposit with InvenioRDM#
Lead: moranegg
Tags: 2021, coverage, deposit
Task: T2344
Effort: 1-2 PM
Priority: Medium
Includes work:
Deploy in production support for receiving source code deposits from InvenioRDM instances, and in particular the Zenodo instance.
Extend CodeMeta vocabulary to qualify author relationships - T2329
Generalize usage of SWHID for referencing SWH archive objects - T3034
Analyze deposit-client on InvenioRDM compatibility - T3549
KPIs:
Complete on paper spec
Number of deposits from an InvenioRDM instance (can be staging instance)
Support deployed in InvenioRDM LTS
Admin tooling for takedown notices#
Lead: douardda
Tags: 2021, legal
Task: T3087
Effort: 3 PM
Priority: High
Includes work:
Admin interface, private and public journal of operations.
Low level support for blacklisting specified contents (not only URLs, also SWHIDs), with support for regexps
Admin interface to add/remove entries from the blacklist
A journal of these operations (what was added/removed, when and why, from the blacklist)
A public webpage that maintains the list of accepted takedown notices
KPIs:
Takedown tools deployed in production
Number of processed takedown notices
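The blacklisting and journaling items above can be sketched as follows. This is a minimal in-memory illustration, not the planned implementation: the `Blocklist` class, its method names, and the journal layout are assumptions made for the example, and the notice references are made up.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Blocklist:
    """Hypothetical blocklist of URL regexps and SWHIDs, with a journal."""
    url_patterns: dict = field(default_factory=dict)  # pattern -> compiled regexp
    swhids: set = field(default_factory=set)
    journal: list = field(default_factory=list)  # append-only list of operations

    def _log(self, action, entry, reason):
        # Record what was added or removed, when, and why.
        self.journal.append((datetime.now(timezone.utc), action, entry, reason))

    def add_url_pattern(self, pattern, reason):
        self.url_patterns[pattern] = re.compile(pattern)
        self._log("add", pattern, reason)

    def add_swhid(self, swhid, reason):
        self.swhids.add(swhid)
        self._log("add", swhid, reason)

    def remove(self, entry, reason):
        self.url_patterns.pop(entry, None)
        self.swhids.discard(entry)
        self._log("remove", entry, reason)

    def is_blocked(self, url=None, swhid=None):
        if swhid is not None and swhid in self.swhids:
            return True
        return url is not None and any(
            p.fullmatch(url) for p in self.url_patterns.values()
        )


blocklist = Blocklist()
blocklist.add_url_pattern(r"https://example\.org/alice/.*", "made-up notice 1")
blocklist.add_swhid(
    "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "made-up notice 2"
)
```

Keeping the journal separate from the active entries means a removal never erases the record of the original addition, which is what a private or public journal of operations needs.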
Preserve#
Continuous data validation of all the data stores of SWH#
Lead: vlorentz
Tags: integrity, monitoring
Task: T3841
Effort: 2 PM
Priority: Medium
Includes work:
Set up background jobs to regularly check data validity in all SWH data stores.
This includes both blobs (swh-objstorage) and other graph objects (swh-storage) on all the copies (in-house, Kafka, Azure, upcoming mirrors, etc.).
Estimate ETA for scrubbing of the entire archive.
KPIs:
Scrubbers deployed in production
Monitoring tools deployed in production
% of the archive scrubbed
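To make the scrubbing idea concrete, here is a minimal sketch of a blob check and of an ETA estimate. It assumes a hypothetical store that maps a hex SHA1 to the stored bytes; the real swh-objstorage and swh-storage layouts and APIs are not shown here.

```python
import hashlib


def scrub(blobs):
    """Yield the IDs whose stored bytes no longer hash to their ID.

    `blobs` maps a hex SHA1 to the stored bytes (hypothetical layout).
    """
    for blob_id, data in blobs.items():
        if hashlib.sha1(data).hexdigest() != blob_id:
            yield blob_id


def eta_seconds(total_objects, scrubbed_objects, elapsed_seconds):
    """Linear extrapolation of the remaining time at the observed rate."""
    rate = scrubbed_objects / elapsed_seconds
    return (total_objects - scrubbed_objects) / rate


# Made-up store: the second entry was altered after being written.
good = b"hello"
store = {
    hashlib.sha1(good).hexdigest(): good,
    hashlib.sha1(b"original").hexdigest(): b"corrupted",
}
corrupted = list(scrub(store))
```

The "% of the archive scrubbed" KPI falls out of the same counters used by `eta_seconds`: it is `scrubbed_objects / total_objects`.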
Support archiving repositories containing SHA1 hash conflicts on blobs#
Lead: olasd
Tags: crypto
Task: T3775
Effort: 1.5 PM
Priority: High
Includes work:
This involves getting rid of the limitations imposed by having SHA1 as a primary key for the object storage internally.
KPIs:
Ability to archive Git repos that contain sample SHAttered collision blobs (they are currently detected and refused)
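The limitation can be illustrated with a toy example. The roadmap does not specify the target design, so the composite key below is an assumption, and a deliberately weak stand-in hash replaces SHA1 so that a collision is easy to produce without embedding the SHAttered files.

```python
import hashlib


def weak_hash(data: bytes) -> str:
    """Toy stand-in for SHA1: collides for any two inputs of equal length."""
    return f"len-{len(data)}"


def strong_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class SingleKeyStore:
    """Keyed by the weak hash alone: a colliding blob has to be refused."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        key = weak_hash(data)
        if self._blobs.setdefault(key, data) != data:
            raise ValueError(f"hash collision on {key}")
        return key


class CompositeKeyStore:
    """Keyed by (weak, strong): colliding blobs get distinct keys."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> tuple:
        key = (weak_hash(data), strong_hash(data))
        self._blobs[key] = data
        return key

    def lookup_weak(self, weak: str) -> list:
        """All blobs sharing a weak hash (more than one after a collision)."""
        return [d for (w, _), d in self._blobs.items() if w == weak]
```

With the composite key, a lookup by the weak hash alone can return several blobs, so callers that previously assumed a one-to-one mapping need to handle that case.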
Up-to-date anonymized archive copy on Amazon S3 (except blobs)#
Lead: vlorentz (originally seirl)
Tags: 2021, archivecopy
Task: T3085
Effort: 3 PM
Priority: Low
Includes work:
Periodic dumps of the (anonymized) Merkle graph on the Amazon public cloud.
Fully automate export of the graph dataset
Document how to export the graph edge dataset
Define a scheduling periodicity
KPIs:
Automatic exports scheduled
S3 copy up to date w/ last scheduled export
Archive cold-copy at CINES via Vitam#
Lead: douardda
Tags: 2021, archivecopy
Task: T3414
Effort: 2 PM
Priority: Medium
Includes work:
Perform a first complete copy of the archive stored in Vitam @ CINES
Keep the copy up to date periodically (on a period TBD)
KPIs:
First copy stored in Vitam
Updates calendar defined
Mirrors#
Lead: douardda
Tags: 2021, mirror
Task: T3116
Effort: 2 PM
Priority: High
Includes work:
Deploy in production at least 2 mirrors.
Finalize ENEA Mirror deployment
Launch Snyk mirror project
Handle takedown notice synchronization?
Add feature flags on web UI
KPIs:
ENEA Mirror in production
Snyk mirror in production
Publicly available standard for SWHID version 1#
Lead: zack
Tags: 2021, standard, swhid
Task: T3960
Effort: 1 PM
Priority: High
Includes work:
Publish a stable version of the SWHID version 1 specification, approved by a standards body.
KPIs:
Published standard for SWHID version 1
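For readers unfamiliar with the identifier format, here is a minimal sketch of a version 1 core SWHID for a content object, which is computed like a Git blob identifier. The helper names `content_swhid` and `parse_core_swhid` are made up for this example and are not the swh-model API; qualifiers are not handled.

```python
import hashlib
import re

# Core SWHID v1: scheme, version, object type, 40 hex digits.
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")


def content_swhid(data: bytes) -> str:
    """Core SWHID of a content: SHA1 over a Git-style blob header plus the bytes."""
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def parse_core_swhid(swhid: str):
    """Split a core SWHID (no qualifiers) into (object type, hex identifier)."""
    match = SWHID_RE.match(swhid)
    if match is None:
        raise ValueError(f"not a core SWHID v1: {swhid!r}")
    return match.group(1), match.group(2)


empty = content_swhid(b"")
```

For the empty content, the result matches the well-known Git empty-blob identifier, which gives a quick sanity check for any implementation.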
SWHID version 2#
Lead: zack
Tags: 2021, swhid, crypto
Task: T3134
Effort: 4 PM
Priority: Low
Includes work:
Complete on paper specification for SWHID version 2, including migrating to a stronger hash than SHA1.
Complete on paper spec
Aligned with work done on new git hashes
Migration plan from/cohabitation with v1 (N.B.: we need to maintain SWHID v1 support forever anyway)
Understand impact on internal microservice architecture (related to T1805, in particular use SWHIDs everywhere (core SWHIDs, without qualifiers))
Keep correspondence with v1 (there may be multiple v2 for one v1)
Reviewed by crypto experts
KPIs:
Written SWHID version 2 specification
Documentation#
docs.s.o: provide a landing page, dispatching to devel/user/sysadmin/mirrors#
Lead: bchauvet
Tags: docs, sys-admin
Task: T3867
Effort: 0.5 PM
Priority: Medium
Includes work:
Provide a nice landing page for all documentation at docs.s.o, dispatching by user type.
Drop the redirection docs.s.o -> docs.s.o/devel.
Depends on populating the /sysadm, /user and /mirrors parts.
KPIs:
Landing page in production (https://docs.softwareheritage.org)
docs.s.o/sysadm: improve sysadmin documentation website#
Lead: vsellier
Tags: docs, sys-admin
Task: T4082
Effort: 1 PM
Priority: Medium
Includes work:
General goal: onboarding material + transparency about how we run the archive.
Target users: team members, partners (e.g., mirror operators), or contributors who need a clear view of the infrastructure architecture.
This task will be completed when it:
Documents the configuration system of each component.
Documents hardware architecture.
Documents CI architecture (and other major services currently not documented).
KPIs:
List of minimum documented items
Number of available documented items
docs.s.o/user: bootstrap user documentation website#
Lead: moranegg
Tags: docs, user
Task: T3972
Effort: 2 PM
Priority: Medium
Includes work:
The currently available user documentation only provides a FAQ. It should contain at least:
An overall non-technical description of the archive and the core elements of its architecture
A set of howto/getting started pages on main subjects (search, browse, push code in the archive, retrieve code and artifacts from the archive, metadata)
Link to existing documentation on the main w.s.o. site as appropriate.
KPIs:
List of minimum documented items
Number of available documented items
High-level overview of available listers/loaders#
Lead: anlambert
Tags: 2021, docs, sys-admin
Task: T3117
Effort: 0.5 PM
Priority: High
Includes work:
Publish a web page (under docs.s.o somewhere) providing a high-level overview of which listers/loaders are available (implemented, deployed, running, etc.) with pointers to the corresponding modules/implementations.
KPIs:
Online web page
Technical Debt#
Refactor swh-web code#
Lead: anlambert
Tags: webapp, refactoring
Task: T3949
Effort: 3 PM
Priority: Medium
Includes work:
Have a smaller, more modular code base
Split the public API code from the frontend code base
Reduce code duplication (e.g., between API and frontend)
Externalize conversion utilities towards swh-core
KPIs:
Separate repositories for frontend and web API
New public API (GraphQL + thin layer)#
Lead: jayesh
Tags: api, refactoring
Task: T4083
Effort: 4 PM
Priority: Medium
Includes work:
Provide a common unified (GraphQL based) public API
Create a GraphQL based API
Integrate the existing API with GraphQL
KPIs:
GraphQL API in production
Organize 4+ short peer programming code-audit sprints#
Lead: bchauvet
Tags: refactoring
Task: T3956
Effort: 2.5 PM
Priority: n/a (one 2-day sprint every 2 months)
Includes work:
Go through the entire codebase and identify dead code and changes that should be made
Correct identified issues or, failing that, document them with dedicated tasks
Identify one theme per sprint
KPIs:
Sprints done
Organize 4+ sentry-cleaning sprints#
Lead: bchauvet
Tags: project-management, monitoring
Task: T3957
Effort: 2.5 PM
Priority: n/a (one 2-day sprint every 2 months)
Includes work:
We currently have a lot of open Sentry issues, but this is very raw data that isn’t very usable or visible. They should be cleaned up so that under normal conditions, the number of reported issues stays “minimal”.
KPIs:
Sprints done
Number of sentry issues (before/after)
Tooling and Infrastructure#
GitLab migration#
Lead: olasd
Tags: 2021
Task: T2225
Effort: 3 PM
Priority: Medium
Includes work:
Review the current workflow for the migration
Prepare new team workflows for some “sample” projects
Drive the migration to completion
Sysadmin projects migration (iteration #1)
Remaining projects migration (iteration #2)
KPIs:
Number of migrated projects
Phabricator switched to read-only
Polish developer-facing CI automation#
Lead: olasd
Tags: development environment, CI
Task: T4084
Effort: 3 PM
Priority: Low
Includes work:
More automation to keep all linting / testing tools (black, flake8, tox, …) up to date and consistent
CI support for multiple Python versions (and possibly some dependency versions)
Faster CI for diffs (e.g., consider use of testmon to only run tests affected by changes)
Investigation of more linters or flake8 plugins
Cypress performance (parallel testing)
KPIs:
To be defined
Continuous Deployment#
Lead: vsellier
Tags: CI, CD, packaging
Task: T2231
Effort: 6 PM
Priority: Low
Includes work:
Improve bug detection
Validate the future elastic infrastructure components
Migrate away from Debian packaging for deployment
Build a Docker image per deployable service
Build the deployment tooling
Reset and redeploy the stack after commits
Execute acceptance tests
KPIs:
Operational CD platform
CD integrated with GitLab
Continuous Integration for sysadmin tools#
Lead: vsellier
Tags: sysadmin, CI, tooling
Task: T3834
Effort: 2 PM
Priority: Low
Includes work:
Add CI for sysadmin tasks:
Puppet configuration
Vagrant projects
Terraform plans
Container (docker) image production
Create sustainable plan for hardware provisioning/rotation#
Lead: olasd
Tags: sysadmin, hardware
Task: T3959
Effort: 0.5 PM
Priority: High
Write a policy for hardware procurement with the following in mind:
Make sure that we properly track our current pool of hardware, and its warranty status
Make sure we don’t get surprised by lapsing warranties
Make sure that we don’t end up having to renew a bunch of machines all at once
Allow better budget forecasting
KPIs:
Shared documented policy
Elastic loaders and listers#
Lead: ardumont
Tags: sysadmin, performance, elasticity
Task: T3592
Effort: 3 PM
Priority: High
Includes work:
Deploy the listers and loaders in containers
Deploy on a couple of bare metal servers (?)
Easily adapt the load to the resources and the waiting tasks
KPIs:
Running elastic infrastructure in production for loaders and listers
Cluster / elastic workers monitoring (number of running workers, statsd, …)
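As a rough illustration of adapting the load to both the waiting tasks and the available resources, the sketch below derives a worker count from the queue length and clamps it to fixed bounds. The function name, its parameters, and the numbers in the usage note are hypothetical; the roadmap does not prescribe a scaling policy.

```python
import math


def desired_workers(waiting_tasks, tasks_per_worker, min_workers, max_workers):
    """Scale with the queue length, within the bounds the hardware allows."""
    wanted = math.ceil(waiting_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

For example, with 100 tasks per worker and bounds of 1 to 20 workers, a queue of 950 waiting tasks would call for 10 workers, while an empty queue keeps the minimum of 1 running.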
Cassandra in production as primary storage#
Lead: vsellier
Tags: 2021, storage, sysadmin
Task: T2214
Effort: 3 PM
Priority: High
Includes work:
Have the Cassandra storage in production as primary storage
Set up equivalent MVP in staging
KPIs:
Cassandra primary storage in production
Scale-out objstorage in production as primary objstorage#
Lead: olasd
Tags: 2021, objstorage, sysadmin
Task: T3054
Effort: 2 PM
Priority: High
Includes work:
Have the Ceph-based objstorage in production as primary storage
Set up equivalent MVP in staging (maybe use the same Ceph cluster for this)
KPIs:
Ceph-based obj-storage in production
Provenance in production#
Lead: douardda
Tags: 2021, provenance
Task: T3112
Effort: 3 PM
Priority: High
Includes work:
Have the provenance index in production with less than a month of lag
Set up equivalent MVP in staging
Produce documentation
Finalize revisions layer processing
Investigate/solve revisions performance issues
Process origins layer
Flatten directories
Production setup (deployment / scripts)
Implement a querying API
KPIs:
Revisions processed per second
% of archive covered
Published documentation
Graph compression in production#
Lead: vlorentz (originally seirl)
Tags: 2021, graph compression
Task: T2220
Effort: 2 PM
Priority: High
Includes work:
Have the graph compression pipeline running in production with less than a month of lag
Deployment, hosting and pipeline tooling
Handle the situation for staging
KPIs:
Graph compression pipeline in production
Last update date / number of updates per year
Mirror tooling in production#
Lead: douardda
Tags: 2021, mirror
Task: T4085
Effort: 2 PM
Priority: High
Includes work:
Document the setup, the administration and the maintenance of a mirror (sprint + maintenance)
Handle the situation for staging
Organize the mirror operators community
KPIs:
Mirror on staging
Organized community
User support ticket system and process#
Lead: bchauvet
Tags: support, user
Task: T3730
Effort: 1 PM
Priority: Medium
Includes work:
Create a user-facing ticket system to support user requests and bug reports (e.g., a support@ address that automatically creates support tasks that we can triage and follow)
Define the process to:
Ensure some basic quality of service (e.g., time to first answer)
Ensure pending tasks are not forgotten
KPIs:
User support feature available on web UI
Reliable user-level monitoring of services#
Lead: vsellier
Tags: 2021, support, user
Task: T3129
Effort: 1 PM
Priority: High
Includes work:
High-level view of which services are running or not, and integration with status.softwareheritage.org
KPIs:
Services dashboard in production