How to reset the next-version environment
Intended audience
sysadm members who want to reset this environment
As mentioned, the next-version environment is a sandboxed, scratchable environment dedicated to running the swh stack with the next version of the swh modules.
It is currently configured as a stateful environment, but we may need to reset it once in a while to avoid consuming too many resources.
We will need a local checkout of the swh-charts repository.
1. First, we close the ‘swh-next-version’ merge request <https://gitlab.softwareheritage.org/swh/infra/ci-cd/swh-charts/-/merge_requests> if any. An open merge request is what triggers the creation of the next-version applications, so closing it stops the various dependent services (which avoids them crashing and raising alerts once their backends are stopped).
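If you prefer the command line, the GitLab CLI can close the merge request as well. A minimal sketch, assuming glab is installed and authenticated, and that the merge request was opened from the swh-next-version branch:
$ glab mr close swh-next-version --repo swh/infra/ci-cd/swh-charts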
Now, on with the actual reset.
2. Edit cluster-components/values/archive-staging-rke2.yaml <https://gitlab.softwareheritage.org/swh/infra/ci-cd/swh-charts/-/blob/staging/cluster-components/values/archive-staging-rke2.yaml?ref_type=heads> and set enabled: false on the various next-version-specific backend instance configuration entries: rabbitmq, kafka, cassandra, elasticsearch, cloudnativePg, redis, … Commit and push the changes (the expected diff and a commit sketch follow).
$ git diff
diff --git c/cluster-components/values/archive-staging-rke2.yaml w/cluster-components/values/archive-staging-rke2.yaml
index bd0b5ea0..5deaa8ca 100644
--- c/cluster-components/values/archive-staging-rke2.yaml
+++ w/cluster-components/values/archive-staging-rke2.yaml
@@ -61,7 +61,7 @@ svix:
enabled: false
dsn: db-swh-svix-rw.cnpg
svix-next-version:
- enabled: true
+ enabled: false
ingress:
host: svix-next-version.internal.staging.swh.network
createTLS: true
@@ -263,7 +263,7 @@ cloudnativePg:
##################
next-version-db-swh:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -278,7 +278,7 @@ cloudnativePg:
enabled: true
next-version-db-swh-deposit:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -293,7 +293,7 @@ cloudnativePg:
enabled: true
next-version-db-swh-indexer:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -308,7 +308,7 @@ cloudnativePg:
enabled: true
next-version-db-swh-masking:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
externalClustersRef: externalPostgresqlClusters
@@ -327,7 +327,7 @@ cloudnativePg:
# Specific swh-scheduler database mounted from a backup from swh-scheduler-k8s
# (minimalistic swh-scheduler db)
next-version-db-swh-scheduler:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -345,7 +345,7 @@ cloudnativePg:
source: next-version-db-swh-scheduler
next-version-db-swh-svix:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -360,7 +360,7 @@ cloudnativePg:
enabled: true
next-version-db-swh-vault:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -375,7 +375,7 @@ cloudnativePg:
enabled: true
next-version-db-swh-web:
- enabled: true
+ enabled: false
namespace: cnpg-next-version
testing: true
instances: 1
@@ -718,7 +718,7 @@ cloudnativePg:
schedule: "0 15 0 * * *"
rabbitmq:
- enabled: true
+ enabled: false
namespace: swh-cassandra-next-version
replicas: 1
deployments:
@@ -727,7 +727,7 @@ rabbitmq:
namespace: swh-cassandra-next-version
cassandra:
- enabled: true
+ enabled: false
storageConfig:
cassandraDataVolumeClaimSpec:
storageClassName: local-persistent
@@ -752,7 +752,7 @@ cassandra:
enabled: false
kafka:
- enabled: true
+ enabled: false
namespace: swh-cassandra-next-version
replicas: 1
replicationFactor: 1
@@ -850,7 +850,7 @@ elasticsearch:
storageClassName: local-persistent
deployments:
search-next-version:
- enabled: true
+ enabled: false
namespace: swh-cassandra-next-version
metricsEnabled: true
replicas: 3
@@ -875,7 +875,7 @@ redis:
enabled: true
deployments:
counters:
- enabled: true
+ enabled: false
namespace: swh-cassandra-next-version
storage:
volumeClaimTemplate:
@@ -919,7 +919,7 @@ redis:
appendonly yes
appendfsync everysec
svix-next-version:
- enabled: true
+ enabled: false
namespace: swh-cassandra-next-version
priorityClassName: swh-storages
serviceMonitor:
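Once the file matches the diff above, commit and push. A minimal sketch, assuming the local swh-charts checkout is on the staging branch (the branch the file lives on); the commit message is only illustrative:
$ git add cluster-components/values/archive-staging-rke2.yaml
$ git commit -m "cluster-components: disable next-version backends for reset"
$ git push origin staging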
3. In the argocd application ‘archive-staging-rke2-cluster-components’, hit the ‘Refresh’ button and then the ‘Sync’ button. Check the “prune” checkbox so the sync effectively removes the deactivated backend instances. Pruning also removes the persistent volumes (pv) configured with a ‘Delete’ retention policy; the few with a ‘Retain’ policy will stay.
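The same refresh and sync can be done with the argocd CLI. A sketch, assuming you are logged in against the Argo CD instance managing this cluster:
$ argocd app get archive-staging-rke2-cluster-components --refresh
$ argocd app sync archive-staging-rke2-cluster-components --prune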
4. Wait for the backend deactivation to be effective (check that the pods are stopped and that their associated pvc are cleaned up).
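One way to check, assuming your kubeconfig points at the archive-staging-rke2 cluster (the namespaces are the ones from the values file above):
$ kubectl get pods,pvc -n swh-cassandra-next-version
$ kubectl get pods,pvc -n cnpg-next-version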
5. Revert the commit from step 2 and push. The goal is to start the stopped backends again (now that they are empty).
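A sketch of the revert, assuming the commit from step 2 is still the tip of the staging branch (otherwise pass its sha to git revert):
$ git revert HEAD
$ git push origin staging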
6. Go back to argocd and sync (or wait for the sync to happen). That will restart the backends.
7. The backends will be started from scratch. Note that some postgresql backends can depend on a dump (e.g. the scheduler).
8. Wait for the backends to be running.
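To watch the backends come back up, under the same kubeconfig assumption as in step 4:
$ kubectl get pods -n swh-cassandra-next-version -w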
9. Reopen the ‘swh-next-version’ merge request and wait for the services to be up again. When starting, the various services will initialize their empty backends appropriately (e.g. rpc, …) so they can actually run properly.
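As in step 1, the GitLab CLI can reopen the merge request, under the same assumptions:
$ glab mr reopen swh-next-version --repo swh/infra/ci-cd/swh-charts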
10. (optional) Since some pvs have a ‘Retain’ retention policy, they will remain after the environment reset. If you want to clean them up, you can use ‘k9s’ in the ‘pv’ view (hit ‘:’ then enter ‘pv’), filter on ‘Released’ pvs (hit ‘/’ then enter ‘Released’), and remove the pvs claimed by ‘swh-cassandra-next-version/’.
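Without k9s, plain kubectl works as well; <pv-name> below is a placeholder for each ‘Released’ pv whose claim starts with ‘swh-cassandra-next-version/’:
$ kubectl get pv | grep Released
$ kubectl delete pv <pv-name>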