Upgrade Procedure for Debian Nodes in an Elasticsearch Cluster#
Intended audience
sysadm staff members
Purpose#
This page documents the steps to upgrade the Debian release of nodes running in an Elasticsearch cluster. The process involves commands and checks to run before and after rebooting each node.
Prerequisites#
Familiarity with SSH and CLI-based command execution
Out-of-band access to the node (iDRAC/iLO) for reboots
Access to the node through SSH (requires the VPN)
Step 0: Initial Steps#
The Elasticsearch nodes run only on bare-metal machines, so make sure out-of-band access to the machine works before starting. This helps a lot when something goes wrong during a reboot (disk order or names changing, network issues, …).
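For example, a quick check that the out-of-band interface answers before touching the node (the BMC hostname and user below are hypothetical, adapt them to the local inventory):
user@laptop:~$ ping -c 1 node-bmc.internal.softwareheritage.org  # hypothetical BMC hostname
user@laptop:~$ ssh admin@node-bmc.internal.softwareheritage.org  # iDRAC/iLO SSH console, hypothetical user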
Step 1: Migrate to the next debian suite#
Update the Debian release of the node (e.g. from bullseye to bookworm) using the following command:
root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh
Note: The script should be present on the machine (installed through puppet).
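For instance, for a bullseye to bookworm migration:
root@node:~# NEXT_CODENAME=bookworm
root@node:~# /usr/local/bin/migrate-to-${NEXT_CODENAME}.sh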
Step 2: Run Puppet Agent#
Once the migration has completed, run the puppet agent to apply any pending configuration changes (e.g. updating /etc/apt/sources.list):
root@node:~# puppet agent -t
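To double-check that the APT configuration now points to the new suite (assuming a bookworm target):
root@node:~# grep -r bookworm /etc/apt/sources.list /etc/apt/sources.list.d/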
Step 3: Stop Puppet Agent#
As we are about to stop the Elasticsearch service, we don't want the agent to start it again behind our back:
root@node:~# puppet agent --disable "Ongoing debian upgrade"
Step 4: Autoremove#
Run autoremove to remove unnecessary packages left over from the migration:
root@node:~# apt autoremove
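If in doubt, the removals can be previewed first with a simulation run:
root@node:~# apt autoremove --simulate  # lists what would be removed, changes nothing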
Step 5: Stop the elasticsearch service#
The cluster can tolerate one unavailable node, so it is safe to stop the service.
We can check the cluster's status, which should stay green after the Elasticsearch service is stopped:
root@node:~# systemctl stop elasticsearch
root@node:~# curl -s $server/_cluster/health | jq .status
"green"
Note: $server is of the form hostname:9200 (with hostname a cluster node other than the one we are currently upgrading).
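For example, from the node being upgraded (assuming esnode2 is another member of the cluster):
root@node:~# server=esnode2.internal.softwareheritage.org:9200
root@node:~# curl -s $server/_cluster/health | jq .status
"green"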
Step 6: Reboot the node#
We are ready to reboot the node:
root@node:~# reboot
You can connect to the serial console of the machine to follow the reboot.
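One way to reach the serial console, assuming the BMC exposes IPMI Serial-over-LAN (hostname and user are hypothetical):
user@laptop:~$ ipmitool -I lanplus -H node-bmc.internal.softwareheritage.org -U admin -a sol activate  # -a prompts for the BMC password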
Step 7: Clean up some more#
Once the machine has restarted, some more cleanup might be necessary:
root@node:~# apt autopurge
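A quick way to check whether any removed packages still have configuration files left behind:
root@node:~# dpkg -l | awk '/^rc/ {print $2}'  # packages in "removed, config-files remain" state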
Step 8: Re-enable the puppet agent#
Re-enable the puppet agent and trigger a run. This will start the Elasticsearch service again:
root@node:~# puppet agent --enable && puppet agent --test
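Before moving on, confirm the service actually came back:
root@node:~# systemctl is-active elasticsearch
active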
Step 9: Rejoin the cluster#
After the service has restarted, check that the node has rejoined the cluster:
root@node:~# curl -s $server/_cat/allocation?v\&s=node
root@node:~# curl -s $server/_cluster/health | jq .number_of_nodes
For example:
root@esnode1:~# server=http://esnode1.internal.softwareheritage.org:9200; date; \
curl -s $server/_cat/allocation?v\&s=node; echo; \
curl -s $server/_cluster/health | jq
Wed Jan 29 09:57:01 UTC 2025
shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host ip node node.role
638 0 0.0 5.6tb 5.6tb 5.6tb 1.1tb 6.8tb 82 192.168.100.61 192.168.100.61 esnode1 cdfhilmrstw
634 0 0.0 5.7tb 5.7tb 5.7tb 1tb 6.8tb 84 192.168.100.62 192.168.100.62 esnode2 cdfhilmrstw
639 0 0.0 5.6tb 5.6tb 5.6tb 1.1tb 6.8tb 82 192.168.100.63 192.168.100.63 esnode3 cdfhilmrstw
644 0 0.0 5.6tb 5.6tb 5.6tb 8.2tb 13.8tb 40 192.168.100.64 192.168.100.64 esnode7 cdfhilmrstw
645 0 0.0 5.5tb 5.5tb 5.5tb 5.9tb 11.4tb 48 192.168.100.65 192.168.100.65 esnode8 cdfhilmrstw
666 0 0.0 5.1tb 5.1tb 5.1tb 6.3tb 11.4tb 44 192.168.100.66 192.168.100.66 esnode9 cdfhilmrstw
{
"cluster_name": "swh-logging-prod",
"status": "green",
"timed_out": false,
"number_of_nodes": 6,
"number_of_data_nodes": 6,
"active_primary_shards": 1933,
"active_shards": 3866,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}
Post cluster migration#
As the cluster should stay green throughout the migration, there is nothing more to check (we already verified this after each node).
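As a final sanity check, one can verify that every node ended up on the new release, e.g. (node list taken from the allocation output above; reading /etc/debian_version is an assumed convention):
user@laptop:~$ for n in esnode1 esnode2 esnode3 esnode7 esnode8 esnode9; do \
    echo -n "$n: "; ssh root@$n.internal.softwareheritage.org cat /etc/debian_version; done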