How to bulk ingest a list of origins#

Intended audience

sysadm staff members

The scheduler provides a CLI to send a list of origins directly into a RabbitMQ queue. If a loading stack is configured to listen to this queue, these origins will be loaded by the loaders in the usual way.
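In essence, the operation boils down to piping the list of origins into the scheduler CLI (a simplified preview, assuming the scheduler configuration shipped in the toolbox; the full, detached invocation is detailed in the manual section below):

# preview only: send one-shot load-git tasks to the oneshot-prefixed queue
cat git_origins.lst | swh scheduler -C /etc/swh/config-scheduler.yml origin \
  send-origins-from-file-to-celery load-git --queue-name-prefix oneshot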

Warning

Only a one-shot loading will be performed; a recurring task is not created.

The automated way#

swh-charts includes a script to automate the bulk ingestion of a list of repositories based on a file downloaded from the Internet (usually, a paste on our GitLab instance).

The actions performed are exactly the same as in the How to do it manually section below, but embedded in a Kubernetes job.

The bulk ingest job is based on the toolbox configuration, to avoid duplicating the scheduler configuration. The job config is added as a new subsection of the main config file.

Declare a job#

In the proper environment, edit the Helm values file, locate the toolbox: section and add the new bulkLoad job:

toolbox:
  enabled: true
  configs:
    ...
    scheduler:
      schedulerDbConfigurationRef: postgresqlSchedulerConfiguration
      celeryConfigurationRef: producerCeleryConfiguration
    ...
  bulkLoad:
    schedulerConfigurationRef: scheduler
    jobs:
      jobName:
        originListUrl: https://gitlab.softwareheritage.org/...
        taskType: load-git
        maxTasks: 10000
        queuePrefix: oneshot

The toolbox.configs section must already exist.

schedulerConfigurationRef references the scheduler configuration declared in the toolbox.configs section.

The key under jobs: (jobName in the example above) is an informational name identifying the job, typically the name of the forge being ingested.

The job will be named by concatenating several fields: toolbox-bulk-load-<queuePrefix>-<jobName>
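For instance, with the example values above (queuePrefix: oneshot and a job key of jobName), the resulting Kubernetes job would be named:

toolbox-bulk-load-oneshot-jobName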

Once the job has completed, the configuration can be removed from the values file. Helm will automatically clean up the job when the change is applied by ArgoCD.

If an error occurs during the scheduling, the job is flagged as failing by Kubernetes and an alert is raised by the monitoring system. The resources (pods, etc.) of a failing job are not automatically removed, to allow easier diagnostics. An operator must manually remove the job to clean up its resources.
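For example, to remove a failed job by hand (the context, namespace and job name here follow the examples used in this page):

kubectl --context archive-staging-rke2 --namespace swh-cassandra \
  delete job toolbox-bulk-load-oneshot-jobName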

An example of an issue tracking a bulk ingestion: MBed forge ingestion

How to do it manually#

The following example explains how to launch an ingestion from a raw list of origins.

The toolbox deployed in Kubernetes comes with all the configuration pre-installed to simplify interactions with the scheduler; the following example relies on it. You must have the kubectl command installed in your local environment, along with the configuration needed to access the staging and production clusters.
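A quick way to verify that your local setup can reach a cluster, for instance staging:

kubectl --context archive-staging-rke2 get nodes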

  • Deploy the loader stack with a queue configuration of the form <prefix>:<usual queue name> (see the sketch below)
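For instance, a git loader listening on the oneshot-prefixed queue would carry a Celery configuration along these lines (a sketch using the standard swh worker configuration format; the broker URL is a placeholder):

celery:
  # placeholder broker URL, adapt to the environment
  task_broker: amqp://<user>:<password>@rabbitmq:5672/
  task_queues:
    # queue name = <prefix>:<usual queue name>
    - oneshot:swh.loader.git.tasks.UpdateGitRepository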

  • If the list is not already split by type, sort the origins per loader type (git/svn/hg/cvs/…), one file per type (see the example below)
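The input file is expected to contain one origin URL per line; for example, a git origins list (the URLs here are purely illustrative):

https://git.example.com/project1.git
https://git.example.com/project2.git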

  • Prepare your local environment; the following commands target staging, adapt them to your needs

CONTEXT=archive-staging-rke2
NAMESPACE=swh-cassandra
TOOLBOX=toolbox-oneshot-loading
ORIGINS=git_origins.lst
  • Create a dedicated toolbox pod:

kubectl debug --context $CONTEXT -n $NAMESPACE \
  $(kubectl --context $CONTEXT -n $NAMESPACE get pods -l app=swh-toolbox -o name | head -1) \
  --container=swh-toolbox --copy-to=$TOOLBOX -- sleep infinity

Cloning the pod ensures the loading is not interrupted if a deployment happens before it finishes.

  • Copy the file containing the list of origins into the pod

kubectl --context $CONTEXT cp $ORIGINS $NAMESPACE/$TOOLBOX:$ORIGINS -c swh-toolbox
  • Connect to the pod via kubectl or k9s

kubectl --context $CONTEXT exec --namespace $NAMESPACE -ti $TOOLBOX -c swh-toolbox -- bash
  • Populate the Celery queue

export SWH_CONFIG_FILENAME=/etc/swh/config-scheduler.yml
ORIGINS=git_origins.lst
TASK_TYPE=load-git
MAX_TASKS=10000

nohup bash -c "cat $ORIGINS | swh scheduler -C $SWH_CONFIG_FILENAME origin \
  send-origins-from-file-to-celery $TASK_TYPE --threshold=$MAX_TASKS \
  --queue-name-prefix oneshot" | tee -a $ORIGINS.output &

The process is detached from the terminal, so you can exit the pod without stopping it. It will run to completion unless the pod is restarted by maintenance or a crash.

  • Check the output in the log file and the queue in RabbitMQ (see the example below)
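For example, the log file can be followed from your local machine (variables as set earlier; the queue depth itself can be watched in the RabbitMQ management interface):

kubectl --context $CONTEXT exec --namespace $NAMESPACE -ti $TOOLBOX \
  -c swh-toolbox -- tail -f $ORIGINS.output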

  • When the loading is done, remove the temporary toolbox pod

kubectl --context $CONTEXT --namespace $NAMESPACE delete pods $TOOLBOX