Roadmap 2021

(Version 1.0, last modified 05/04/2021)

This document provides an overview of the technical roadmap of Software Heritage for 2021.

The Kanban board can be viewed on our forge.

Collect

Faster and more reliable save code now

  • tags: openscience

  • task: T3082

  • lead: ardumont

  • effort: 1 PM

Includes work:

  • set up dedicated fast track pipeline for save code now

  • improve save code now monitoring (user and admin)

Improve deposit integration, management and display

  • tags: openscience

  • task: T3128

  • lead: moranegg

  • effort: 3 PM

Includes work:

  • full invenioRDM integration T2344

  • metadata only deposit T2540

Save forge now

  • tags: expand

  • task: T1538

  • lead: ardumont

  • effort: 1 PM - tooling & process

Admin tooling for takedown notices (URLs)

  • tags: contract, compliance

  • task: T3087

  • lead: anlambert

  • effort: 2 PM

Includes work:

  • admin interface

  • journal of operations

  • web page with list of accepted TDN

Preserve

Complete and up-to-date archive copy on S3

  • tags: stability

  • task: T3085

  • lead: douardda

  • effort: 1 PM

Includes work:

  • live update of the objects

  • regular dumps of the (anonymized) Merkle graph

Scale-out graph storage in production

  • tags: scalability

  • task: T2214

  • lead: vlorentz

  • effort: 3 PM

Includes work:

  • Cassandra: T1892 (maybe with external help)

Scale-out object storage prototype

  • tags: stability, scalability, externalized

  • task: T3054

  • lead: dachary

  • effort: 3 PM

Cold storage archive in Vitam instance at CINES

  • tags: contract

  • task: T3113

  • lead: douardda

  • effort: 4 PM

Mirrors

  • tags: stability, scalability

  • depends: scale-out object storage

  • task: T3116

  • lead: douardda

  • effort: 3 PM

Includes work:

  • get at least one mirror up and running

SWHID v2

  • tags: stability, evolution, datamodel

  • task: T3134

  • lead: zack

  • effort: 6 PM

Includes work:

  • complete on paper spec

  • align with new git hashes

  • including migration plan from v1

  • understand impact on internal microservice architecture

  • keep correspondence with v1 (there may be multiple v2 for one v1!)

  • reviewed by crypto experts

Integrity

  • tags: stability, reliability

  • task: T3135

  • lead: olasd

  • effort: 2 PM

Includes work:

  • make sure objects aren’t corrupted before insertion (T399)

  • … and that existing ones are not corrupted either (part of T75)

  • run corruption checks periodically
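To illustrate the kind of check this item covers, here is a minimal sketch, not the archive's actual implementation. It assumes objects are file contents identified by their git-style blob SHA-1; the function and class names (`git_blob_sha1`, `check_before_insertion`, `find_corrupted`, `CorruptedObjectError`) are hypothetical and chosen only for this example.

```python
import hashlib
from typing import Dict, List


class CorruptedObjectError(ValueError):
    """Raised when content does not hash to its claimed identifier."""


def git_blob_sha1(data: bytes) -> str:
    """Hash content the way git hashes a blob: header, NUL byte, then data."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def check_before_insertion(claimed_id: str, data: bytes) -> None:
    """Refuse to insert an object whose bytes do not match its identifier."""
    actual = git_blob_sha1(data)
    if actual != claimed_id.lower():
        raise CorruptedObjectError(
            "expected %s, computed %s" % (claimed_id, actual)
        )


def find_corrupted(objects: Dict[str, bytes]) -> List[str]:
    """Periodic check: list identifiers whose stored bytes no longer match."""
    return sorted(
        oid for oid, data in objects.items() if git_blob_sha1(data) != oid
    )
```

The same hashing function serves both the pre-insertion check and the periodic scan, so an object accepted at insertion time can later be re-verified against exactly the same rule.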

Share

swh-graph in production

  • tags: scalability

  • task: T2220

  • lead: zack

  • effort: 2 PM

Efficient and reliable Vault download

  • tags: stability

  • task: T3096

  • lead: vlorentz

  • effort: 3 PM

Includes work:

  • swh-graph may speed up operations considerably

Web API 2.0

  • tags: reliability, interoperability

  • task: T2194

  • lead: anlambert

  • effort: 4 PM

Includes work:

  • OpenAPI specification

  • implementation

Expose metadata and make them searchable

  • tags: openscience

  • task: T3097

  • lead: vlorentz

  • effort: 3 PM

Includes work:

  • index extrinsic metadata in swh-search/Elasticsearch from the journal T2073

  • create API endpoint to access raw_extrinsic_metadata T2938

  • show metadata in the web UI T2088

Full text search prototype

  • tags: feature, wishlist

  • task: T2204

  • lead: anlambert

  • effort: 3 PM

Includes work:

  • requires integration with swh-graph and/or provenance index

Organize

Collect extrinsic metadata

  • tags: compliance

  • task: T2202

  • lead: vlorentz

  • effort: 3 PM

Includes work:

  • working pipeline

  • at least 1 instance running ClearlyDefined

  • forge metadata (info on the main page, etc.)

Provenance in production

  • tags: contract, feature

  • task: T3112

  • lead: zack

  • effort: 6 PM

Prior art

  • tags: compliance

  • depends: provenance | swh-graph in production

  • task: T3136

  • lead: zack

  • effort: 3 PM

Includes work:

  • pinpoint origin of selected source code artifacts

  • possibly integrated with swh-scanner

Measurement

Efficient archive counters (HyperLogLog)

  • tags: measure, comm

  • task: T2912

  • lead: vsellier

  • effort: 1 PM
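For readers unfamiliar with the technique named in this item, the following is a minimal, illustrative HyperLogLog sketch. It is not the counter implementation planned in T2912: the class name, the choice of SHA-1 as the hash, and the default precision are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        # Derive a 64-bit hash; the first p bits select a register.
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank is the 1-based position of the leftmost 1-bit in the rest.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # The union of two sketches is the register-wise maximum.
        if other.p != self.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

With `p = 12` the sketch keeps 4096 registers and has a typical relative error of about 1.6%, whatever the number of objects counted. Because sketches merge by taking the register-wise maximum, counts computed separately can be combined without recounting.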

Distribution of origins by forge

  • tags: measure, comm

  • task: T3127

  • lead: anlambert

  • effort: 1 PM

Stats on regular crawling by forge

  • tags: measure, comm

  • task: T1363

  • lead: olasd

  • effort: 1 PM

Includes work:

  • lag, periodicity, # of changes since last visit, etc.

View deposits per user (admin and user)

  • tags: measure, support

  • task: T3128

  • lead: ardumont

  • effort: 1 PM

Reliable user-level monitoring of services

  • tags: stability

  • task: T3129

  • lead: vsellier

  • effort: 2 PM

Includes work:

  • status.softwareheritage.org

Documentation

Write use case-specific documentation

  • tags: comm, web, doc

  • task: T2234

  • lead: moranegg

  • effort: 2 PM

Includes FAQ for:

  • users

  • ambassadors

Improve quality of code documentation

  • tags: doc, externalized

  • task: TODO

  • lead: TBD

  • effort: 2 PM

Includes work:

  • doc(string) audit

  • team training about doc writing

Documentation strategy

  • tags: doc

  • task: T2624

  • lead: moranegg

  • effort: 1 PM

Includes work:

  • respective role of docs.s.o, wiki, www.s.o, etc.

Community

Tooling for fundraising campaigns

  • tags: web

  • task: T3077

  • lead: anlambert

  • effort: 1 PM

Dedicated page to list status of supported listers/loaders

  • tags: web, doc

  • task: T3117

  • lead: anlambert

  • effort: 1 PM

Includes work:

  • T1870

  • design web page

  • process to keep it up to date

  • make clearly visible and link to Sloan subgrants

Tooling

Migration to GitLab

  • tags: forge, development

  • task: T2225

  • lead: olasd

  • effort: 1 PM