How to process add-forge-now requests#

Intended audience

sysadm staff members

For the moment, the processing is only semi-automated. Listing the steps here is a first step towards full automation.

Introduction#

A forge ticket (see for example the git.afpy.org ticket) should have been opened by a moderator. This means the moderation process is ongoing and the upstream forge (to be ingested) has been notified that we will start the ingestion soon.

Testing on staging#

To ensure we can ingest that forge, we start by testing a subset of that forge’s listing on staging. It’s a pre-flight check to determine that we have enough information to ingest it.

On a staging node (usually the scheduling node of the domain), run:

swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ \
  add-forge-now --preset staging \
    register-lister <lister-name> \
      url=<url>

For example, for the forge git.afpy.org, which is a gitea instance, we’d run:

swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ \
  add-forge-now --preset staging \
    register-lister gitea \
      url=https://git.afpy.org/api/v1/

INFO:swh.lister.pattern:Max origins per page set, truncated 36 page results down to 30
INFO:swh.lister.pattern:Disabling origins before sending them to the scheduler
INFO:swh.lister.pattern:Reached page limit of 3, terminating
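
Since the staging and production invocations differ only in the scheduler URL and the preset, the registration step could be wrapped in a small helper. The following is a minimal sketch, not an existing swh tool: the function names afn_scheduler_url and afn_register and the AFN_RUN variable are hypothetical, while the URLs and sub-commands are the ones used on this page. By default it only prints the command so it can be reviewed before running.

```shell
# Hypothetical helpers (not part of swh): build the register-lister command
# for a given preset, using the scheduler URLs shown on this page.

afn_scheduler_url() {
  case "$1" in
    staging)    echo "http://scheduler0.internal.staging.swh.network:5008/" ;;
    production) echo "http://saatchi.internal.softwareheritage.org:5008/" ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

afn_register() {
  # usage: afn_register <preset> <lister-name> <url>
  local preset="$1" lister="$2" url="$3" sched
  sched="$(afn_scheduler_url "$preset")" || return 1
  set -- swh scheduler --url "$sched" \
    add-forge-now --preset "$preset" \
    register-lister "$lister" "url=$url"
  if [ "${AFN_RUN:-0}" = "1" ]; then
    "$@"          # actually run the command
  else
    echo "$*"     # dry run: only print the command
  fi
}
```

Calling afn_register staging gitea https://git.afpy.org/api/v1/ prints the command from the example above on a single line; setting AFN_RUN=1 executes it instead.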

Ensure the lister got registered in the staging scheduler db (see “Check the lister is registered” below).

After a bit of time, check that the origins from that forge got listed in the scheduler db (see “Check origins got listed” below).

Still on a staging node, we then trigger the first ingestion for those origins:

swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ \
  add-forge-now --preset staging \
  schedule-first-visits \
    --visit-type <visit-type> \
    --visit-type <another-visit-type> \
    --lister-name <lister> \
    --lister-instance-name <lister-instance-name>

For our particular instance:

swh scheduler --url http://scheduler0.internal.staging.swh.network:5008/ \
  add-forge-now --preset staging \
  schedule-first-visits \
    --visit-type git \
    --lister-name gitea \
    --lister-instance-name git.afpy.org

100 slots available in celery queue
15 visits to send to celery
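
When a forge exposes several visit types, the --visit-type option is repeated once per type, as in the generic command above. The sketch below shows one way to assemble that command line from a list of visit types. The function name afn_first_visits is hypothetical, and it only prints the command rather than running it.

```shell
# Hypothetical helper (not part of swh): print the schedule-first-visits
# command with one --visit-type option per visit type given.
afn_first_visits() {
  # usage: afn_first_visits <scheduler-url> <preset> <lister-name> \
  #                         <lister-instance-name> <visit-type>...
  if [ "$#" -lt 5 ]; then
    echo "at least one visit type is required" >&2
    return 2
  fi
  local sched="$1" preset="$2" lister="$3" instance="$4" vt flags=""
  shift 4
  for vt in "$@"; do
    flags="$flags --visit-type $vt"
  done
  echo "swh scheduler --url $sched add-forge-now --preset $preset schedule-first-visits$flags --lister-name $lister --lister-instance-name $instance"
}
```

For the example above, this would be called as afn_first_visits http://scheduler0.internal.staging.swh.network:5008/ staging gitea git.afpy.org git.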

After some time, check that those origins got ingested, at least in part (see “Check origins got ingested” below).
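
Rather than re-running a check by hand, the waiting could be scripted with a generic retry loop. This is only a sketch under assumptions: afn_wait_for is a hypothetical name, and the check command passed to it (for instance a psql call wrapping one of the queries from the “Usual checks” section) is left to the operator.

```shell
# Hypothetical helper: re-run a check command until it succeeds,
# waiting <delay> seconds between attempts, for at most <tries> attempts.
afn_wait_for() {
  # usage: afn_wait_for <tries> <delay-seconds> <command> [args...]
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0                # the check passed
    fi
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then
      sleep "$delay"          # wait before the next attempt
    fi
  done
  return 1                    # the check never passed
}
```

For example, afn_wait_for 10 60 my_check would try a (hypothetical) my_check command up to ten times, one minute apart, and return non-zero if it never succeeds.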

If everything is fine, let’s schedule that forge in production.

Deploying on production#

After successfully testing the forge ingestion in staging, it’s time to deploy the full and recurring listing for that forge.

Let’s start by registering the lister for that forge as usual (the --preset production option is optional, production being the default preset):

swh scheduler --url http://saatchi.internal.softwareheritage.org:5008/ \
  add-forge-now --preset production \
  register-lister <lister-name> \
    url=<url>

For example:

swh scheduler --url http://saatchi.internal.softwareheritage.org:5008/ \
  add-forge-now --preset production \
  register-lister gitea \
    url=https://git.afpy.org/api/v1/

Ensure the lister got registered in the production scheduler db (see “Check the lister is registered” below).

After a bit of time, check that the origins from that forge got listed in the scheduler db (see “Check origins got listed” below).

Once the listing is through, we trigger the add-forge-now scheduling to make a first pass on that forge:

swh scheduler --url http://saatchi.internal.softwareheritage.org:5008/ \
  add-forge-now --preset production \
    schedule-first-visits \
      --visit-type <visit-type> \
      --lister-name <lister-name> \
      --lister-instance-name <lister-instance-name>

For example:

swh scheduler --url http://saatchi.internal.softwareheritage.org:5008/ \
  add-forge-now --preset production \
    schedule-first-visits \
      --visit-type git \
      --lister-name gitea \
      --lister-instance-name git.afpy.org

10000 slots available in celery queue
37 visits to send to celery

After a while, check that those origins have been ingested, at least in part (see “Check origins got ingested” below). You can then notify the moderator in the ticket that the first ingestion is done.

Usual checks#

The following sections describe the usual checks run against the scheduler db. Each one gives the generic query to execute, followed by an actual execution with sample output.

Check the lister is registered#

select * from listers
where name='<lister-name>' and
instance_name='<lister-instance-name>';

Example:

2022-12-06 11:50:17 swh-scheduler@db1:5432 λ \
    select * from listers
    where name='gitea' and
    instance_name='git.afpy.org';

+--------------------------------------+-------+---------------+-------------------------------+
|                  id                  | name  | instance_name |            created            | ...
+--------------------------------------+-------+---------------+-------------------------------+
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | gitea | git.afpy.org  | 2022-12-06 10:47:46.975571+00 |
+--------------------------------------+-------+---------------+-------------------------------+
(1 row)

Time: 4.109 ms
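
To avoid retyping the query for each forge, the generic form above can be filled in by a small shell function and piped to psql. This is a sketch only: afn_sql_lister is a hypothetical name, and the way to connect to the scheduler db (shown in the usage note as a SCHEDULER_DSN variable) is an assumption, not something defined on this page.

```shell
# Hypothetical helper: print the lister check query for a given
# lister name and lister instance name.
afn_sql_lister() {
  # usage: afn_sql_lister <lister-name> <lister-instance-name>
  # Refuse values containing a single quote so the SQL literals stay well-formed.
  case "$1$2" in
    *"'"*) echo "single quotes are not allowed" >&2; return 1 ;;
  esac
  printf "select * from listers\nwhere name='%s' and\ninstance_name='%s';\n" "$1" "$2"
}
```

It could then be used as afn_sql_lister gitea git.afpy.org | psql "$SCHEDULER_DSN", where SCHEDULER_DSN would hold a connection string for the relevant scheduler db.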

Check origins got listed#

select lister_id, url, visit_type from listed_origins
where lister_id = (select id from listers
                   where name='<lister-name>'
                   and instance_name='<lister-instance-name>');

Example:

2022-12-06 11:50:24 swh-scheduler@db1:5432 λ \
    select lister_id, url, visit_type from listed_origins
    where lister_id = (select id from listers
                       where name='gitea' and
                       instance_name='git.afpy.org');

+--------------------------------------+-----------------------------------------------------------+------------+
|              lister_id               |                            url                            | visit_type |
+--------------------------------------+-----------------------------------------------------------+------------+
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/AFPy/afpy.org.git                    | git        |
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/foxmask/baeuda.git                   | git        |
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/fcode/boilerplate-python.git         | git        |
...
+--------------------------------------+-----------------------------------------------------------+------------+
(15 rows)

Time: 1225.399 ms (00:01.225)

Check origins got ingested#

Either of the following queries works:

select visit_type, url, last_visit_status from origin_visit_stats
where visit_type='<visit-type>'
  and url like 'https://<lister-instance-name>%';

Example:

2022-12-12 12:08:58 softwareheritage-scheduler@belvedere:5432 λ \
    select visit_type, url, last_visit_status from origin_visit_stats
    where visit_type='git' and
    url like 'https://git.afpy.org%';

+------------+-----------------------------------------------------------+-------------------+
| visit_type |                            url                            | last_visit_status |
+------------+-----------------------------------------------------------+-------------------+
| git        | https://git.afpy.org/mdk/infra.git                        | successful        |
| git        | https://git.afpy.org/ChristopheNan/python-docs-fr.git     | successful        |
| git        | https://git.afpy.org/fcode/delarte.git                    | successful        |
...
+------------+-----------------------------------------------------------+-------------------+
(37 rows)

Time: 95171.399 ms (01:35.171)

Or this one, though it takes longer to execute:

select last_visit_status, count(ovs.url)
from origin_visit_stats ovs
join listed_origins lo USING(url, visit_type)
where lister_id = (select id from listers where name='<lister-name>'
                   and instance_name='<lister-instance-name>')
and visit_type='<visit-type>'
group by last_visit_status;

Example:

2022-12-12 11:56:57 softwareheritage-scheduler@belvedere:5432 λ \
    select last_visit_status, count(ovs.url)
    from origin_visit_stats ovs
    join listed_origins lo USING(url, visit_type)
    where lister_id = (select id from listers
                       where name='gitea' and
                       instance_name='git.afpy.org')
    and visit_type='git'
    group by last_visit_status;

+-------------------+-------+
| last_visit_status | count |
+-------------------+-------+
| successful        |    37 |
+-------------------+-------+
(1 row)

Time: 149774.756 ms (02:29.775)
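
To judge whether the origins were ingested “at least in part”, the per-status counts returned by the last query can be reduced to a single line. The sketch below assumes the query was run with psql in unaligned, tuples-only mode (the -A and -t options), so each row arrives as status|count. The function name afn_success_ratio is hypothetical, and any status other than successful is simply counted towards the total.

```shell
# Hypothetical helper: read "status|count" rows on stdin and print how many
# visits were successful out of the total.
afn_success_ratio() {
  awk -F'|' '
    { total += $2 }                       # every status counts towards the total
    $1 == "successful" { ok += $2 }       # only successful visits count as ok
    END {
      if (total == 0) { print "no visits recorded"; exit 1 }
      printf "%d/%d successful\n", ok, total
    }'
}
```

With the sample output above, printf 'successful|37\n' | afn_success_ratio prints 37/37 successful.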