How to process add-forge-now requests#
Intended audience
sysadm staff members
The processing is automatic but may encounter errors. In this case, the operations must be performed manually (see, Add-forge-now automation).
Introduction#
A forge ticket (see for example the forge.inrae.fr ticket) should have been created.
Meaning the moderation process is ongoing and the upstream forge (to be ingested) has been notified we will start the ingestion soon.
Note that there exists roughly 2 kinds of forges, either the technology used by the forge exists is mono-instance (e.g. github, bitbucket, …), either the technology is the same across multiple forges (e.g. gitlab, cgit, gitea, gogs).
All processing operations are performed from the Kubernetes toolbox pod.
kubectl --context archive-staging-rke2 exec -ti -n swh-cassandra -c swh-toolbox deployment/swh-toolbox -- bash
Use the same command with archive-production-rke2 context to access production toolbox.
SWH_CONFIG_FILENAME variable
Once connected to the toolbox, you must export SWH_CONFIG_FILENAME with the scheduler configuration.
The output of the toolbox mentions the right export command.
ᐅ kubectl --context archive-staging-rke2 exec -ti -n swh-cassandra -c swh-toolbox deployment/swh-toolbox -- bash
SWH_CONFIG_FILENAME variable is not set!
This variable must be defined according to your use case (e.g. .
scheduler, storage, vault, ...). You must define it by yourself.
For example, use one of the following:
export SWH_CONFIG_FILENAME=/etc/swh/config-web.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-masking.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-scrubber-objstorage-storage1.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-webhooks.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-cassandra-storage-rw.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-indexer-storage.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-vault.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-scrubber-objstorage-db1.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-scheduler.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-deposit.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-scrubber-storage.yml
export SWH_CONFIG_FILENAME=/etc/swh/config-blocking.yml
swh@swh-toolbox-57d6b657d-tqn4m:~$ export SWH_CONFIG_FILENAME=/etc/swh/config-scheduler.yml
Testing on staging#
To ensure we can ingest that forge, we start by testing out a subset of that forge listing on staging. It’s a pre-check flight to determine we have the right amount of information.
Registering the new lister task#
swh scheduler \
add-forge-now --preset staging \
register-lister <lister-type> \
instance=<instance>
For example, forge forge.inrae.fr which is a gitlab instance, we’d run:
swh@swh-toolbox-57d6b657d-tqn4m:~$ swh scheduler \
add-forge-now --preset staging \
register-lister gitlab \
instance=forge.inrae.fr
WARNING:swh.core.sentry:Sentry DSN not provided, events will not be sent.
Created 1 tasks
Task 33438839
Next run: today (2025-07-23T15:44:45.811986+00:00)
Interval: 90 days, 0:00:00
Type: list-gitlab-full
Policy: oneshot
Args:
Keyword args:
enable_origins: False
instance: 'forge.inrae.fr'
max_origins_per_page: 5
max_pages: 2
instance is the parameter used in the pipeline.
It may be necessary to use url parameter instead of the instance one:
for forges which support only http protocol;
swh@swh-toolbox-798fd68874-zx4wp:~$ swh scheduler \ add-forge-now --preset production \ register-lister gitea \ url=http://vcc-gnd.cn/api/v1/for forges reachable by a subpath.
swh@swh-toolbox-76f4dcdb79-ncrvt:~$ swh scheduler \ add-forge-now --preset staging \ register-lister gitlab \ url=https://microfluidics.utoronto.ca/gitlab/api/v4/
Use base_git_url to specify the origins url:
swh@swh-toolbox-648b4bd4dd-tjh4c:~$ swh scheduler \
add-forge-now --preset staging \
register-lister cgit \
instance=git.koszko.org \
base_git_url=https://git.koszko.org
Or use url and base_git_url:
swh@swh-toolbox-76b76c5565-spw77:~$ swh scheduler \
add-forge-now --preset staging \
register-lister gitweb \
url=http://git.1wt.eu/web \
base_git_url=http://git.1wt.eu/git
Ensure the lister got registered in the staging scheduler db.
Checking the listed origins#
swh scheduler origin check-listed-origins <lister-type> <instance-name> -l
For our example, forge.inrae.fr:
swh@swh-toolbox-57d6b657d-tqn4m:~$ swh scheduler origin check-listed-origins gitlab forge.inrae.fr -l
url last_seen last_update
------------------------------------------------------------ -------------------------------- --------------------------------
https://forge.inrae.fr/QTL/spell-qtl.git 2025-07-23 15:45:48.892705+00:00 2020-02-27 20:56:28.539000+00:00
https://forge.inrae.fr/adminforgemia/doc-public.git 2025-07-23 15:45:48.892705+00:00 2024-09-09 12:53:34.058000+00:00
https://forge.inrae.fr/bioger/django-custom-user.git 2025-07-23 15:45:49.655780+00:00 2023-11-08 14:53:09.962000+00:00
https://forge.inrae.fr/gauthier.quesnel/red-slides.git 2025-07-23 15:45:49.655780+00:00 2019-07-03 06:53:00.720000+00:00
https://forge.inrae.fr/genotoul-bioinfo/d-genies/dgenies.git 2025-07-23 15:45:48.892705+00:00 2025-02-06 14:49:33.746000+00:00
https://forge.inrae.fr/genotoul-bioinfo/jflow.git 2025-07-23 15:45:48.892705+00:00 2020-02-14 16:08:06.932000+00:00
https://forge.inrae.fr/katharina-birgit.budde/testgit.git 2025-07-23 15:45:49.655780+00:00 2019-07-05 09:21:53.092000+00:00
https://forge.inrae.fr/olivier.bonnefon/selommes.git 2025-07-23 15:45:49.655780+00:00 2019-07-25 12:48:39.151000+00:00
https://forge.inrae.fr/svdetection/popsim.git 2025-07-23 15:45:48.892705+00:00 2020-02-28 07:17:22.123000+00:00
https://forge.inrae.fr/umr-gdec/magatt.git 2025-07-23 15:45:49.655780+00:00 2025-07-18 12:15:54.773000+00:00
Forge forge.inrae.fr (gitlab) has 10 listed origins in the scheduler database.
Scheduling the first visit#
After the previous lister registration, we now need to trigger the first ingestion for those origins:
swh scheduler \
add-forge-now --preset staging \
schedule-first-visits \
--type-name <visit-type> \
--lister-name <lister> \
--lister-instance-name <lister-instance-name>
For our example, forge.inrae.fr:
swh scheduler \
add-forge-now --preset staging \
schedule-first-visits \
--type-name git \
--lister-name gitlab \
--lister-instance-name forge.inrae.fr
WARNING:swh.core.sentry:Sentry DSN not provided, events will not be sent.
INFO:swh.scheduler.celery_backend.utils:1000 slots available in celery queue add_forge_now:swh.loader.git.tasks.UpdateGitRepository
INFO:swh.scheduler.celery_backend.utils:10 visits of type git to send to celery
AFN loaders logs
Get the add-forge-now loaders logs:
kubectl --context archive-staging-rke2 logs -n swh-cassandra -l app=loader-add-forge-now -f
stern --context archive-staging-rke2 -n swh-cassandra -l app=loader-add-forge-now --only-log-lines
Use the same commands with archive-production-rke2 context for production environment.
Checking the ingested origins#
swh scheduler origin check-ingested-origins <lister-type> <instance-name>
For our example, forge.inrae.fr:
swh@swh-toolbox-57d6b657d-tqn4m:~$ swh scheduler origin check-ingested-origins gitlab forge.inrae.fr
Forge forge.inrae.fr (gitlab) has 10 scheduled ingests in the scheduler.
failed : 0
None : 0
not_found : 1
successful : 9
total : 10
success rate: 90.00%
After some time, check those origins were ingested at least partially.
If everything is fine, update the add-forge-now request status to Scheduled
with a comment containing a link to the GitLab Issue. Then, let’s schedule that forge in production.
Deploying on production#
After testing with success the forge ingestion in staging, it’s time to deploy the full and recurrent listing for that forge.
Production environment
Use the same commands as for staging, replacing the value of the --preset option with production.
After some time, you can check those origins have been ingested.
If everything is fine, update the add-forge-now request status to First origin loaded
with a comment containing a link to the GitLab Issue.
Usual checks#
In the following, we will demonstrate the usual checks happening in the scheduler db. The format will be the generic query to execute followed by an actual execution (with a sampled output).
Check the lister is registered#
select * from listers
where name='<lister-name>' and
instance_name='<lister-instance>';
Example:
2022-12-06 11:50:17 swh-scheduler@db1:5432 λ \
select * from listers
where name='gitea' and
instance_name='git.afpy.org';
+--------------------------------------+-------+---------------+-------------------------------+
| id | name | instance_name | created | ...
+--------------------------------------+-------+---------------+-------------------------------+
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | gitea | git.afpy.org | 2022-12-06 10:47:46.975571+00 |
+--------------------------------------+-------+---------------+-------------------------------+
(1 row)
Time: 4.109 ms
Check origins got listed#
select lister_id, url, visit_type from listed_origins
where lister_id = (select id from listers
where name='<lister-name>'
and instance_name='<lister-instance-name>');
Example:
2022-12-06 11:50:24 swh-scheduler@db1:5432 λ \
select lister_id, url, visit_type from listed_origins
where lister_id = (select id from listers
where name='gitea' and
instance_name='git.afpy.org');
+--------------------------------------+-----------------------------------------------------------+------------+
| lister_id | url | visit_type |
+--------------------------------------+-----------------------------------------------------------+------------+
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/AFPy/afpy.org.git | git |
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/foxmask/baeuda.git | git |
| d07d1c90-5016-4ab6-91ac-3300f8eb4fc6 | https://git.afpy.org/fcode/boilerplate-python.git | git |
...
+--------------------------------------+-----------------------------------------------------------+------------+
(15 rows)
Time: 1225.399 ms (00:01.225)
Check origins got ingested#
Either one of the query is fine:
select visit_type, url, last_visit_status from origin_visit_stats
where visit_type='<visit-type>'
and url like 'https://<lister-instance-name>%';
Example:
2022-12-12 12:08:58 softwareheritage-scheduler@belvedere:5432 λ \
select visit_type, url, last_visit_status from origin_visit_stats
where visit_type='git' and
url like 'https://git.afpy.org%';
+------------+-----------------------------------------------------------+-------------------+
| visit_type | url | last_visit_status |
+------------+-----------------------------------------------------------+-------------------+
| git | https://git.afpy.org/mdk/infra.git | successful |
| git | https://git.afpy.org/ChristopheNan/python-docs-fr.git | successful |
| git | https://git.afpy.org/fcode/delarte.git | successful |
...
+------------+-----------------------------------------------------------+-------------------+
(37 rows)
Time: 95171.399 ms (01:35.171)
or this one, though this will take longer to execute:
select last_visit_status, count(ovs.url)
from origin_visit_stats ovs
join listed_origins lo USING(url, visit_type)
where lister_id = (select id from listers where name='<lister-name>'
and instance_name='<lister-instance-name>')
Example:
2022-12-12 11:56:57 softwareheritage-scheduler@belvedere:5432 λ \
select last_visit_status, count(ovs.url)
from origin_visit_stats ovs
join listed_origins lo USING(url, visit_type)
where lister_id = (select id from listers
where name='gitea' and
instance_name='git.afpy.org')
and visit_type='git'
group by last_visit_status;
+-------------------+-------+
| last_visit_status | count |
+-------------------+-------+
| successful | 37 |
+-------------------+-------+
(1 row)
Time: 149774.756 ms (02:29.775)
Check duplicated tasks#
select id, arguments, status from task
where arguments -> 'kwargs' ->> 'instance' like '%<domain_name>%'
or arguments -> 'kwargs' ->> 'url' like '%<domain_name>%'
and policy = 'recurring';
Example:
softwareheritage-scheduler=> select id, arguments, status from task
where arguments -> 'kwargs' ->> 'instance' like '%codeberg.org%'
or arguments -> 'kwargs' ->> 'url' like '%codeberg.org%'
and policy = 'recurring';
id | arguments | status
-----------+-----------------------------------------------------------------+------------------------
415431745 | {"args": [], "kwargs": {"instance": "codeberg.org"}} | next_run_not_scheduled
337306005 | {"args": [], "kwargs": {"url": "https://codeberg.org/api/v1/"}} | next_run_not_scheduled
(2 rows)