How to upgrade a Cassandra cluster
Intended audience
sysadm staff members
This page documents the steps to upgrade an online Cassandra cluster. The overall plan is to upgrade each node of the cluster one at a time, in a rolling-upgrade fashion.
There are two ways to run this upgrade procedure: manually or automatically.
Our (static) Cassandra clusters are managed through Puppet, so the version bump has to be made in the swh-site repository. Since our Puppet manifest does not manage the restart of the service, it is safe to let Puppet apply the changes in advance.
Then identify the desired new version and retrieve its sha512 checksum, e.g. for 4.0.15:
https://archive.apache.org/dist/cassandra/4.0.15/apache-cassandra-4.0.15-bin.tar.gz
https://archive.apache.org/dist/cassandra/4.0.15/apache-cassandra-4.0.15-bin.tar.gz.sha512
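To double-check the tarball locally, something along these lines works (a sketch; it assumes the published .sha512 file contains the bare digest, so we rebuild the "digest  filename" line that sha512sum -c expects):
$ version=4.0.15
$ wget -q "https://archive.apache.org/dist/cassandra/$version/apache-cassandra-$version-bin.tar.gz"
$ wget -q "https://archive.apache.org/dist/cassandra/$version/apache-cassandra-$version-bin.tar.gz.sha512"
$ sha512sum -c <(echo "$(cat apache-cassandra-$version-bin.tar.gz.sha512)  apache-cassandra-$version-bin.tar.gz")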
Read the changelog just in case some extra actions are required for the upgrade.
In the swh-site repository, adapt the environment’s common.yaml file with those values:
$ echo $environment
staging
$ grep "cassandra::" .../swh-site/data/deployments/$environment/common.yaml
cassandra::version: 4.0.15
cassandra::version_checksum: 9368639fe07613995fec2d50de13ba5b4a2d02e3da628daa1a3165aa009e356295d7f7aefde0dedaab385e9752755af8385679dd5f919902454df29114a3fcc0
Commit and push the changes.
Connect to pergamon and deploy those changes.
Stop all repair jobs before upgrading.
Set a Grafana tag to mark the start of the upgrade.
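This can be done through the Grafana annotations HTTP API, for example (a sketch: the Grafana URL and API token are placeholders to adapt):
$ curl -sS -X POST "https://grafana.example.org/api/annotations" \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"text": "cassandra upgrade start", "tags": ["cassandra", "upgrade", "'"$environment"'"]}'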
Manual procedure
Then connect to each machine of the cluster, in any order (lexicographic order is fine).
We'll need nodetool access, so here is a simple alias to simplify the commands (used in the remainder of this doc):
$ USER=$(awk '{print $1}' /etc/cassandra/jmxremote.password)
$ PASS=$(awk '{print $2}' /etc/cassandra/jmxremote.password)
$ alias nodetool="/opt/cassandra/bin/nodetool --username $USER --password $PASS"
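Before touching anything, it's worth checking that the alias works and that every node is currently reported as UN (Up and Normal):
$ nodetool status -r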
From another node in the cluster, connect and check that the status of the cluster stays healthy during the migration:
$ period=10; while true; do
    date; nodetool status -r; echo; nodetool netstats
    sleep $period
done
Let's first drain the node so the commitlog is flushed to the on-disk SSTables. This is recommended before an upgrade to avoid leaving pending data in the commitlog:
$ nodetool drain
Look for the DRAINED pattern in the service log to know when it is done:
$ journalctl -e -u cassandra@instance1 | grep DRAINED
Nov 27 14:09:06 cassandra01 cassandra[769383]: INFO [RMI TCP Connection(20949)-192.168.100.181] 2024-11-27 14:09:06,084 StorageService.java:1635 - DRAINED
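When scripting this step, a small loop can block until the message shows up (a sketch; the 10-minute journal window is an arbitrary choice):
$ until journalctl -u cassandra@instance1 --since '10 min ago' | grep -q DRAINED; do
    sleep 5
done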
Now stop the cassandra service:
$ systemctl stop cassandra@instance1
In the output of nodetool status, the node whose service is stopped should be marked as DN (Down and Normal):
$ nodetool -h cassandra02 status -r | grep DN
DN  cassandra01.internal.softwareheritage.org  8.63 TiB  16  22.7%  cb0695ee-b7f1-4b31-ba5e-9ed7a068d993  rack1
Finally, we upgrade the Cassandra version on the node (through Puppet):
$ puppet agent --enable && puppet agent --test
Let's check that the correct version is installed in /opt:
$ ls -lah /opt/ | grep cassandra-$version
lrwxrwxrwx 1 root root 21 Nov 27 14:13 cassandra -> /opt/cassandra-$version
drwxr-xr-x 8 root root 4.0K Nov 27 14:13 cassandra-$version
Now start the cassandra service back up:
$ systemctl start cassandra@instance1
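If needed, the startup sequence can be followed in the service log:
$ journalctl -f -u cassandra@instance1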
Once the service is started again, nodetool status should display an UN (Up and Normal) status for the upgraded node:
$ nodetool status -r
…
UN  cassandra01.internal.softwareheritage.org  8.63 TiB  16  22.7%  cb0695ee-b7f1-4b31-ba5e-9ed7a068d993  rack1
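When scripting, a small polling loop can wait for that state (a sketch, with the node name as a variable to adapt):
$ node=cassandra01.internal.softwareheritage.org
$ until nodetool status -r | grep "$node" | grep -q '^UN'; do sleep 10; done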
Automatic procedure
It's the same procedure as described above, but only one call to a script on pergamon is required.
With environment in {staging, production}:
root@pergamon:~# /usr/local/bin/cassandra-restart-cluster.sh $environment
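For reference, here is a minimal sketch of what such a rolling restart boils down to; this is not the actual script (the node list, instance name and error handling are simplified assumptions):
#!/usr/bin/env bash
# Hypothetical sketch of a rolling upgrade driver; the real
# /usr/local/bin/cassandra-restart-cluster.sh is authoritative.
# Assumes a 3-node staging cluster and a nodetool wrapper (with the JMX
# credentials) available in PATH on each node.
set -euo pipefail

for node in cassandra01 cassandra02 cassandra03; do
    echo "=== upgrading $node"
    ssh "$node" 'nodetool drain'                       # flush commitlog to sstables
    ssh "$node" 'systemctl stop cassandra@instance1'
    # puppet agent --test exits 2 when changes were applied: tolerate it
    ssh "$node" 'puppet agent --enable; puppet agent --test' || [ $? -eq 2 ]
    ssh "$node" 'systemctl start cassandra@instance1'
    # wait until the node is back Up/Normal before moving to the next one
    until ssh "$node" 'nodetool status -r' | grep "$node" | grep -q '^UN'; do
        sleep 10
    done
done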
Note that you can also run the monitoring loop described above from a cluster node to follow the upgrade.
Final Checks
Finally, check that the version is the expected one:
$ nodetool version
ReleaseVersion: $version
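To compare all nodes at once, nodetool can query each host directly (a sketch, assuming the staging node names):
$ for node in cassandra01 cassandra02 cassandra03; do
    echo -n "$node: "; nodetool -h "$node" version
done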
$ nodetool describecluster
Cluster Information:
Name: archive_staging
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
583470c4-6dae-372d-bdab-f0bcbd679c74: [192.168.130.181, 192.168.130.182, 192.168.130.183]
Stats for all nodes:
Live: 3
Joining: 0
Moving: 0
Leaving: 0
Unreachable: 0
Data Centers:
sesi_rocquencourt_staging #Nodes: 3 #Down: 0
Database versions:
5.0.2: [192.168.130.181:7000, 192.168.130.182:7000, 192.168.130.183:7000]
Keyspaces:
swh -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
system_distributed -> Replication class: NetworkTopologyStrategy {replication_factor=3}
provenance_test -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
reaper_db -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
system_traces -> Replication class: SimpleStrategy {replication_factor=2}
system_auth -> Replication class: NetworkTopologyStrategy {sesi_rocquencourt_staging=3}
system_schema -> Replication class: LocalStrategy {}
system -> Replication class: LocalStrategy {}
Upgrading to a major version
When upgrading to a new major version, the data files must also be rewritten to the new on-disk format with nodetool upgradesstables. A wrapper script on pergamon runs it for the whole cluster:
root@pergamon:~# /usr/local/bin/cassandra-upgradesstables.sh $environment
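The rewrite runs as regular background compactions, so its progress can be followed on each node with:
$ nodetool compactionstats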