Upgrade Procedure for Debian Nodes in a Cassandra Cluster#
Intended audience#
sysadm staff members
Purpose#
This page documents the steps to upgrade Debian nodes running in a Cassandra cluster. The upgrade process involves various commands and checks before and after rebooting the node.
Prerequisites#
Familiarity with SSH and CLI-based command execution
Out-of-band access to the node (iDRAC/iLO) for the reboot
Access to the node through SSH (requires the VPN)
Step 0: Initial Steps#
Ensure the out-of-band access to the machine works before starting. This helps greatly when something goes wrong during a reboot (disk order or names change, network issues, …).
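This check can be scripted as a TCP probe of the management interface; the `check_oob` helper, the BMC hostname, and the port below are hypothetical, so adapt them to the local naming scheme:

```shell
# check_oob HOST [PORT] — succeed if the BMC answers on the given TCP port.
# ICMP is often filtered on management networks, so probing the web UI
# port tends to be more reliable than ping.
check_oob() {
    host="$1"
    port="${2:-443}"
    nc -z -w 5 "$host" "$port"
}

# Hypothetical BMC hostname, adjust to the local convention:
# check_oob node01-mgmt.internal.example.org && echo "OOB access OK"
```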
Step 1: Migrate to the next Debian suite#
Update the Debian version of the node (e.g. bullseye to bookworm) using the following command:
root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh
Note: the script should already be present on the machine (installed through Puppet).
Step 2: Run Puppet Agent#
Once the migration script has completed, run the Puppet agent to apply any necessary configuration changes (e.g. updates to /etc/apt/sources.list):
root@node:~# puppet agent -t
Step 3: Disable the Puppet Agent#
As we will stop the cassandra service, we don't want the agent to start it back up:
root@node:~# puppet agent --disable "Ongoing debian upgrade"
Step 4: Autoremove and Purge#
Run autoremove to remove unnecessary packages left over from the migration:
root@node:~# apt autoremove
Step 5: Stop the cassandra service#
The cluster can tolerate one unresponsive node, so it is safe to stop the service. First drain the node, which flushes its memtables and stops it from accepting new writes:
$ nodetool drain
Look for the ‘- DRAINED’ pattern in the service log to confirm the drain is complete:
$ journalctl -e -u cassandra@instance1 | grep DRAINED
Nov 27 14:09:06 cassandra01 cassandra[769383]: INFO [RMI TCP Connection(20949)-192.168.100.181] 2024-11-27 14:09:06,084 StorageService.java:1635 - DRAINED
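Rather than checking the log by hand, the wait can be scripted. A minimal sketch, assuming the same `instance1` service name (the `wait_for_drain` helper is hypothetical, not an existing tool):

```shell
# wait_for_drain INSTANCE [TIMEOUT_S] — poll the instance journal until the
# '- DRAINED' marker shows up, giving up after TIMEOUT_S seconds.
wait_for_drain() {
    instance="$1"
    timeout="${2:-120}"
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # '--' keeps grep from parsing '- DRAINED' as an option
        if journalctl -u "cassandra@${instance}" --since '-15min' \
                | grep -q -- '- DRAINED'; then
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    return 1
}

# wait_for_drain instance1 && systemctl stop cassandra@instance1
```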
Then stop the cassandra service.
$ systemctl stop cassandra@instance1
In the output of nodetool status, the node whose service is stopped should be marked as DN (Down and Normal):
$ nodetool -h cassandra02 status -r | grep DN
DN  cassandra01.internal.softwareheritage.org  8.63 TiB  16  22.7%  cb0695ee-b7f1-4b31-ba5e-9ed7a068d993  rack1
Step 6: Reboot the Node#
We are finally ready to reboot the node, so just do it:
root@node:~# reboot
You can connect to the serial console of the machine to follow the reboot.
Step 7: Clean up some more#
Once the machine has restarted, some more cleanup might be necessary. autopurge also removes the configuration files of the removed packages:
root@node:~# apt autopurge
Step 8: Re-enable the Puppet Agent#
Re-enable the Puppet agent and trigger a run. This will start the cassandra service again:
root@node:~# puppet agent --enable && puppet agent --test
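Once the agent run finishes, the node should come back as UN (Up and Normal) in nodetool status. A small sketch of that check from a peer's point of view (the `node_is_un` helper is hypothetical):

```shell
# node_is_un NODE — read `nodetool status -r` output on stdin and succeed
# if NODE is listed in state UN (Up and Normal).
node_is_un() {
    grep -F "$1" | grep -q '^UN'
}

# Usage, with the hostnames from the example above:
# nodetool -h cassandra02 status -r | node_is_un cassandra01.internal.softwareheritage.org
```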
Post cluster migration#
Once all the nodes of the cluster have been migrated:
Remove the ArgoCD sync window so the cluster is back to its nominal state.
Re-enable the Rancher etcd snapshots.
Check the holderIdentity value in the rke2 and rke2-lease leases and configmaps.
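The lease check can be sketched with kubectl; the `show_holder` helper and the kube-system namespace are assumptions, so point them at wherever the rke2 leases actually live:

```shell
# show_holder KIND NAME — print the current holderIdentity of a Lease
# object. The kube-system namespace is an assumption; adjust as needed.
show_holder() {
    kubectl -n kube-system get "$1" "$2" \
        -o jsonpath='{.spec.holderIdentity}{"\n"}'
}

# show_holder lease rke2
# show_holder lease rke2-lease
```

The configmaps mentioned above can be inspected the same way with `kubectl get configmap -o yaml`.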