How to upgrade a cassandra cluster#

Intended audience

sysadm staff members

This page documents the actions needed to upgrade an online cassandra cluster. The overall plan is to upgrade each node of the cluster one at a time, in a rolling-upgrade fashion.

There are two ways to manage this upgrade procedure, either manually or automatically.

Our (static) cassandra clusters are managed through puppet, so some adaptations must be made in the swh-site repository. Since our puppet manifest does not manage the restart of the service, it is safe to let puppet apply the changes in advance.

Then identify the desired new version and retrieve its sha512 hash.

https://archive.apache.org/dist/cassandra/4.0.15/apache-cassandra-4.0.15-bin.tar.gz
https://archive.apache.org/dist/cassandra/4.0.15/apache-cassandra-4.0.15-bin.tar.gz.sha512

Read the changelog just in case some extra actions are required for the upgrade.
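As a sketch, the checksum can be verified locally before putting it in the manifest (version 4.0.15 is used here as an example; adjust as needed):

```shell
# Download the tarball and its published sha512 checksum.
version=4.0.15
wget "https://archive.apache.org/dist/cassandra/${version}/apache-cassandra-${version}-bin.tar.gz"
wget "https://archive.apache.org/dist/cassandra/${version}/apache-cassandra-${version}-bin.tar.gz.sha512"

# The .sha512 file may contain either the bare hash or "hash  filename";
# awk '{print $1}' handles both forms.
expected=$(awk '{print $1}' "apache-cassandra-${version}-bin.tar.gz.sha512")
actual=$(sha512sum "apache-cassandra-${version}-bin.tar.gz" | awk '{print $1}')
[ "$expected" = "$actual" ] && echo "checksum OK" || echo "checksum MISMATCH"
```

The value printed by `sha512sum` is the one to paste into `cassandra::version_checksum`.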

In the swh-site repository, adapt the environment’s common.yaml file with those values:

$ echo $environment
staging
$ grep "cassandra::" .../swh-site/data/deployments/$environment/common.yaml
cassandra::version: 4.0.15
cassandra::version_checksum: 9368639fe07613995fec2d50de13ba5b4a2d02e3da628daa1a3165aa009e356295d7f7aefde0dedaab385e9752755af8385679dd5f919902454df29114a3fcc0

Commit and push the changes.

Connect to pergamon and deploy those changes.

Stop all repair jobs before upgrading

All scheduled jobs must be paused, and all running jobs must be stopped and aborted.
These actions can be performed from the reaper web UI.
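The same pause/abort actions can also be scripted against reaper's REST API. A minimal sketch, assuming reaper is reachable on localhost:8080 and that the schedule and run ids have already been looked up (host, port, and ids are placeholders; check the cassandra-reaper API documentation for the exact endpoints of the deployed version):

```shell
# Placeholder reaper endpoint; adjust to the local deployment.
REAPER=http://localhost:8080

# List repair schedules for the cluster to find their ids.
curl -s "$REAPER/repair_schedule?clusterName=archive_staging"

# Pause a scheduled repair (replace $schedule_id with a real id).
curl -s -X PUT "$REAPER/repair_schedule/$schedule_id?state=PAUSED"

# Abort a running repair (replace $run_id with a real id).
curl -s -X PUT "$REAPER/repair_run/$run_id/state/ABORTED"
```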

Grafana tag

Set a Grafana tag to mark the start of the upgrade.
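This can be done from the Grafana UI, or via Grafana's annotations HTTP API. A sketch, where the Grafana URL, the API token, and the tag names are placeholder assumptions to adapt to the local instance:

```shell
# Placeholder Grafana instance and token.
GRAFANA=https://grafana.example.org

# POST /api/annotations creates an annotation at the current time
# carrying the given tags and text.
curl -s -X POST "$GRAFANA/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["cassandra", "upgrade"],
       "text": "Start of cassandra upgrade (staging)"}'
```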

Manual procedure#

Then connect to each machine of the cluster, in any order (lexicographic order works fine).

We’ll need nodetool access, so here is a simple alias to shorten the commands used in the rest of this document.

$ USER=$(awk '{print $1}' /etc/cassandra/jmxremote.password)
$ PASS=$(awk '{print $2}' /etc/cassandra/jmxremote.password)
$ alias nodetool="/opt/cassandra/bin/nodetool --username $USER --password $PASS"

From another node in the cluster, run the following loop to check that the cluster status remains healthy during the migration.

$ period=10; while true; do \
    date; nodetool status -r; echo; nodetool netstats; sleep $period; \
  done

Let’s do a drain call first so the commitlog is flushed to on-disk sstables. This is recommended before an upgrade to avoid leaving pending data in the commit log.

$ nodetool drain

Look for the ‘- DRAINED’ pattern in the service log to confirm the drain is complete.

$ journalctl -e -u cassandra@instance1 | grep DRAINED
Nov 27 14:09:06 cassandra01 cassandra[769383]: INFO  [RMI TCP Connection(20949)-192.168.100.181] 2024-11-27 14:09:06,084 StorageService.java:1635 - DRAINED

We stop the cassandra service.

$ systemctl stop cassandra@instance1

In the output of the nodetool status, the node whose service is stopped should be marked as DN (Down and Normal):

$ nodetool -h cassandra02 status -r | grep DN
DN  cassandra01.internal.softwareheritage.org  8.63 TiB  16  22.7%  cb0695ee-b7f1-4b31-ba5e-9ed7a068d993  rack1

Finally, upgrade the cassandra version on the node through puppet:

$ puppet agent --enable && puppet agent --test

Let’s check that the correct version is installed in /opt:

$ ls -lah /opt/ | grep cassandra-$version
lrwxrwxrwx  1 root root   21 Nov 27 14:13 cassandra -> /opt/cassandra-$version
drwxr-xr-x  8 root root 4.0K Nov 27 14:13 cassandra-$version

Now start the cassandra service again.

$ systemctl start cassandra@instance1

Once the service is started again, nodetool status should display a UN (Up and Normal) status again for the upgraded node.

$ nodetool status -r
…
UN  cassandra01.internal.softwareheritage.org  8.63 TiB  16  22.7%  cb0695ee-b7f1-4b31-ba5e-9ed7a068d993  rack1

Automatic procedure#

This is the same procedure as described above, but it only requires a single call to a script on pergamon.

With environment in {staging, production}:

root@pergamon:~# /usr/local/bin/cassandra-restart-cluster.sh $environment

Note that you can also run the monitoring loop described above from a cluster node to follow the upgrade’s progress.

Final Checks#

Finally, check that the installed version is the expected one.

$ nodetool version
ReleaseVersion: $version

$ nodetool describecluster
Cluster Information:
        Name: archive_staging
        Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                583470c4-6dae-372d-bdab-f0bcbd679c74: [192.168.130.181, 192.168.130.182, 192.168.130.183]

Stats for all nodes:
        Live: 3
        Joining: 0
        Moving: 0
        Leaving: 0
        Unreachable: 0

Data Centers:
        sesi_rocquencourt_staging #Nodes: 3 #Down: 0

Database versions:
        5.0.2: [192.168.130.181:7000, 192.168.130.182:7000, 192.168.130.183:7000]

Keyspaces:
        swh -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
        system_distributed -> Replication class: NetworkTopologyStrategy {replication_factor=3}
        provenance_test -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
        reaper_db -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
        system_traces -> Replication class: SimpleStrategy {replication_factor=2}
        system_auth -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
        system_schema -> Replication class: LocalStrategy {}
        system -> Replication class: LocalStrategy {}

Upgrading to a major version#

When upgrading to a major version, you also need to run nodetool upgradesstables so the sstables are rewritten in the new on-disk format.
You can run this command manually on each node, or use a script from pergamon.

With environment in {staging, production}:

root@pergamon:~# /usr/local/bin/cassandra-upgradesstables.sh $environment
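Done manually, the per-node command is nodetool upgradesstables. A sketch iterating over the cluster, where the node names are placeholders and nodetool is the credentialed alias defined earlier in this document:

```shell
# Placeholder node list; run upgradesstables on each node in turn.
# upgradesstables only rewrites sstables that are not already in the
# current format, so re-running it is harmless.
for node in cassandra01 cassandra02 cassandra03; do
  echo "upgrading sstables on $node"
  nodetool -h "$node" upgradesstables
done
```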