Upgrade swh service#

Intended audience

sysadm staff members

This document describes the deployment process for most of our swh services (rpc services, loaders, listers, indexers, …).

There are currently two ways to deploy (we are transitioning from the first to the second):

  • static: From git tag to deployment through debian packaging

  • elastic: From git tag to deployment through kubernetes.

The following first describes the common part of the deployment: python packaging built out of a git tag and pushed to PyPI and to our swh debian repositories.

Then follows the actual deployment with debian packaging. It concludes with the deployment with kubernetes chapter.

Distinct Services#

Three kinds of services run on our nodes:

  • worker services (loaders, listers, cookers, …)

  • rpc services (scheduler, objstorage, storage, web, …)

  • journal client services (search, scheduler, indexer)

Code and publish a release#

It’s usually up to the developers.

Code an evolution or a bugfix in the impacted git repository (usually on the master branch). Open a diff for review, land it when accepted, and then release it by following the tag and push section.

Tag and push#

When ready, git tag and git push the new tag of the module. Then let jenkins publish the artifact.

$ git tag -a vA.B.C  # (optionally) `git tag -a -s` to sign the tag too
$ git push origin --follow-tags

Publish artifacts#

Jenkins is in charge of publishing the new release to PyPI (out of the tag just pushed). It then builds the debian package and pushes it to our swh debian repositories.

Troubleshoot#

If jenkins fails for some reason, fix the module, be it the python code or the debian packaging.

Deployment with debian packaging#

This mostly involves deploying new versions of debian packages to static nodes.

Upgrade services#

When a new version is released, we need to upgrade the package(s) and restart services.

worker services (production):

  • swh-worker@loader_{git, hg, svn, npm, …}

  • swh-worker@lister

  • swh-worker@vault_cooker

journal clients (production):

  • swh-indexer-journal-client@{origin_intrinsic_metadata_,extrinsic_metadata_,…}

rpc services (both environments):

  • gunicorn-swh-{scheduler, objstorage, storage, web, …}

From the pergamon node, which is configured for clush, one can act on multiple nodes through the following group names:

  • @swh-workers for the production workers (listers, loaders, …)

  • @azure-workers for the production ones running on azure (indexers, cookers)
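
For instance, a minimal sketch of restarting the git loader units across the production workers after an upgrade (swap in the unit matching the service you upgraded):

$ sudo clush -b -w @swh-workers 'systemctl restart swh-worker@loader_git'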

See How to deploy a new lister for a practical example.

Debian package troubleshoot#

Update and checkout the debian/unstable-swh branch (in the impacted git repository), then fix whatever is outdated or broken due to the change.

It’s usually a missing new package dependency to fix in debian/control. Add a new entry in debian/changelog. Make sure gbp builds fine locally. Then tag it and push. Jenkins will build the package anew.
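
To check that the package builds fine locally before tagging, something along these lines can be used (a sketch; adjust the flags to your local setup):

$ gbp buildpackage --git-ignore-new -us -uc  # local unsigned test build, tolerating uncommitted changes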

$ gbp buildpackage --git-tag-only --git-sign-tag  # tag it
$ git push origin --follow-tags                   # trigger the build

Lather, rinse, repeat until it’s all green!

Deploy#

Nominal case#

Update the machine's dependencies and restart the service. That usually means, as the sudo user:

$ apt-get update
$ apt-get dist-upgrade -y
$ systemctl restart $service

Note that this is for one machine you ssh into.

We usually wrap those commands from the sysadmin machine pergamon [3] with the clush command, something like:

$ sudo clush -b -w @swh-workers 'apt-get update; env DEBIAN_FRONTEND=noninteractive \
    apt-get -o Dpkg::Options::="--force-confdef" \
    -o Dpkg::Options::="--force-confold" -y dist-upgrade'

[3] pergamon is already clush configured to allow multiple ssh connections in parallel on our managed infrastructure nodes.

Configuration change required#

Either wait for puppet to actually deploy the changes first and then go back to the nominal case.

Or force a puppet run:

$ sudo clush -b -w $nodes 'puppet agent -t'

Note: -t is not optional

Long-standing upgrade#

In this case, you may need to stop the impacted services, for example during a long-standing data model migration which could take some time.

You need to momentarily stop puppet (which by default runs every 30 min to apply manifest changes) and the cron service (which restarts down services) on the worker nodes.

Refer to the storage database migration for a concrete example of such a migration.

$ sudo clush -b -w @swh-workers 'systemctl stop cron.service; puppet agent --disable'

Then:

  • Execute the long-standing upgrade.

  • Go back to the nominal case.

  • Re-enable puppet and restart the cron service on the workers:

$ sudo clush -b -w @swh-workers 'systemctl start cron.service; puppet agent --enable'

Deployment with Kubernetes#

This newer deployment relies on docker images exposing scripts/services that run in a frozen python virtual environment. Those versioned images are referenced in a dedicated helm chart which is deployed in a kubernetes rancher cluster.

That cluster runs on machine nodes (carrying specific labels) onto which pods are scheduled. The containers inside those pods run the docker images as the actual applications.

Those docker images are built out of a declared Dockerfile in the swh-apps repository.

You can either add a new swh application or update an existing one, as described in the next two sections.

Add new swh application#

From the repository swh-apps, create a new Dockerfile.

Depending on the service to package, other existing applications can serve as templates.

It's then time to build and publish a docker image. This is a multi-step process that can be executed locally, starting with generating the frozen set of requirements.

Update swh application#

If you need to update the swh application, edit its swh-apps/apps/$app/Dockerfile or swh-apps/apps/$app/entrypoint.sh and adapt it to the change.

Note: if a new requirement is necessary, update swh-apps/apps/$app/requirements.txt (the source of the generated requirements-frozen.txt). Those requirements should be kept to a minimum; such a change may belong upstream in the swh modules instead.

Once your update is done, commit and push the change, then build and publish the new docker image.

Update impacted chart#

In the swh-chart repository, update the values file with the new image version.

Check that the nodes are properly labelled to receive the application. Then ArgoCD will be in charge of deploying the changes in a rolling upgrade fashion.
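
To follow the rolling upgrade, one can watch the pods being recycled; a rough sketch ($cluster and $IMAGE_NAME are the placeholders used elsewhere in this document, and the exact namespace and pod names depend on the chart):

$ kubectl --context $cluster get pods --all-namespaces | grep "$IMAGE_NAME"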

Update app’s frozen requirements#

We first need an "app-manager" container with some dependencies installed (due to some limitations in our stack):

$ cd swh-apps/scripts
$ docker build -t app-manager .

With this container, we can generate the frozen requirements for $APP_NAME (e.g. loader_{git, svn, cvs, …}, lister, indexer, …):

$ cd swh-apps
$ docker run --rm -v $PWD:/src app-manager generate-frozen-requirements $APP_NAME

The frozen requirements are now built and can be committed. Next, we will generate the image updated with that frozen environment.
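
Before committing, it can be useful to review what actually changed in the frozen file, for instance:

$ git diff apps/$APP_NAME/requirements-frozen.txt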

Generate image#

Build the docker image with the frozen environment and then publish it:

$ IMAGE_NAME=<application>  # e.g. loader_git, loader_svn, ...
$ IMAGE_VERSION=YYYYMMDD.1  # date of the day plus a counter, e.g. "$(date '+%Y%m%d').1"
$ REGISTRY=container-registry.softwareheritage.org/swh/infra/swh-apps
$ FULL_IMAGE_VERSION=$REGISTRY/$IMAGE_NAME:$IMAGE_VERSION
$ FULL_IMAGE_LATEST=$REGISTRY/$IMAGE_NAME:latest
$ cd swh-apps/apps/$IMAGE_NAME/
# This will create the versioned image locally
$ docker build -t $FULL_IMAGE_VERSION .
# Tag it with the latest version as well
$ docker tag $FULL_IMAGE_VERSION $FULL_IMAGE_LATEST
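
As a quick sanity check before pushing, one can list the tags that were just built locally:

$ docker images "$REGISTRY/$IMAGE_NAME"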

Gitlab registry#

You must have a gitlab account and generate a personal access token with at least write access to the gitlab registry.
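
For non-interactive use, docker can also read the token from stdin; a sketch assuming the token and username are stored in the (hypothetical) $GITLAB_TOKEN and $GITLAB_USER variables:

$ echo "$GITLAB_TOKEN" | docker login container-registry.softwareheritage.org \
    --username "$GITLAB_USER" --password-stdin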

Publish image#

You must first log your docker client into the swh gitlab registry and then push the images:

$ docker login container-registry.softwareheritage.org  # login to the gitlab registry (prompted for the personal access token)
passwd: **********
$ docker push $FULL_IMAGE_VERSION
$ docker push $FULL_IMAGE_LATEST

Do not forget to commit the changes and tag.

Finally, let’s update the impacted chart with the new docker image version.

Commit and tag#

Commit and tag the changes.
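
A minimal sketch, assuming the only change is the updated frozen requirements for $APP_NAME (the commit message and tag name are illustrative; follow the repository's conventions):

$ cd swh-apps
$ git add apps/$APP_NAME/
$ git commit -m "$APP_NAME: update frozen requirements"  # illustrative message
$ git tag -a <new-tag>                                   # adjust to the repository's tag scheme
$ git push origin --follow-tags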

Labels on nodes#

For now, we are using dedicated labels on nodes to run specific applications:

  • swh/rpc=true: rpc services, e.g. graphql

  • swh/cooker=true: cooker worker

  • swh/indexer=true: indexer journal client

  • swh/lister=true: lister worker

  • swh/loader=true: loader worker

  • swh/loader-metadata=true: loader-metadata worker

In the following example:

  • $cluster is one of {archive-staging-rke2, archive-production-rke2}

  • $node is an actual node hostname e.g. rancher-node-staging-rke2-worker[1, …] or rancher-node-metal0{1,2} (for production)

  • $new-label is a label of the form: swh/$service=true

To check the actual list of labels:

kubectl --context $cluster get nodes --show-labels

To install a label on a node:

kubectl --context $cluster label --overwrite node $node $new-label
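
To list only the nodes carrying a given label (e.g. the loader workers), a label selector can be used:

kubectl --context $cluster get nodes -l swh/loader=true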