Cluster rebuilds are rarely elegant when you are doing them live. They are usually a mix of cleanup, registration, runtime repair, and then a second wave of follow-up work once the nodes are finally back.

This note came from one of those moments.

The Immediate Problem

The original note focused on worker recovery in a development cluster. Much of the friction came from NVIDIA-related package holds and container runtime behavior, which is a good example of how cluster issues are often not purely “Kubernetes problems”.

They are frequently host-level package and runtime problems that only surface inside Kubernetes later.

Step 1: Remove Package Holds

The first cleanup step was to inspect and then remove a long list of held NVIDIA-related packages:

sudo apt-mark showhold
sudo apt-mark unhold libnvidia-*
sudo apt-mark unhold nvidia-*
sudo apt-mark unhold screen-resolution-extra
sudo apt-mark unhold xserver-xorg-video-nvidia-515

That alone tells you a lot about the environment. This was not a clean worker rebuild on a plain CPU node. It was a cluster where GPU support and container runtime details mattered.
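As an aside, the four explicit unhold commands above can be collapsed into a single pass by feeding `apt-mark showhold` back into `apt-mark unhold`. This is a sketch of that pattern, not what the note recorded, and it assumes everything currently held is something you actually want unheld:

```shell
# Unhold every package apt currently reports as held, in one pass.
# xargs -r skips the apt-mark call entirely when nothing is held.
sudo apt-mark showhold | xargs -r sudo apt-mark unhold
```

The per-pattern commands in the note are safer when other, unrelated holds exist on the host.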

Step 2: Repair the Docker Runtime

The note also captured a reinstall and runtime change for NVIDIA Docker support:

sudo apt install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker

And the Docker runtime configuration needed to define the nvidia runtime explicitly.
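The note did not preserve the exact file, but with nvidia-docker2 the runtime is conventionally declared in /etc/docker/daemon.json along these lines. This is a sketch of the standard layout, not the note's verbatim config:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Setting "default-runtime" is optional; without it, workloads must request the nvidia runtime explicitly.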

That is the kind of detail I always want in a work log because it explains why a supposedly “healthy” node may still be unusable for actual workloads.

Step 3: Re-Register the Node

The re-registration flow used the Rancher agent pattern:

docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes \
  -v /var/run:/var/run \
  rancher/rancher-agent:v2.6.7 \
  --server https://cluster.example.com \
  --token <registration-token> \
  --etcd --controlplane --worker

I replaced the live server and token values here, but the structure is the important part.

Step 4: Clean Up the Old Kubernetes State

Before the node could be re-registered cleanly, the note used a heavy cleanup pass:

docker stop $(docker ps -aq)
docker system prune -f
docker volume rm $(docker volume ls -q)
docker image rm $(docker image ls -q)
sudo rm -rf /etc/ceph \
  /etc/cni \
  /etc/kubernetes \
  /opt/cni \
  /opt/rke \
  /run/secrets/kubernetes.io \
  /run/calico \
  /run/flannel \
  /var/lib/calico \
  /var/lib/etcd \
  /var/lib/cni \
  /var/lib/kubelet \
  /var/lib/rancher/rke/log \
  /var/log/containers \
  /var/log/pods \
  /var/run/calico

That is not a casual command set. It is a rebuild move, not a tune-up.
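After a wipe like that, it is worth confirming the state directories are actually gone before re-registering. This is a sketch I would add, not something from the original note, and `check_gone` is a hypothetical helper name:

```shell
# Hypothetical helper (not in the original note): report any state
# directories that survived the wipe; returns nonzero if any remain.
check_gone() {
  leftover=0
  for d in "$@"; do
    if [ -e "$d" ]; then
      echo "still present: $d"
      leftover=1
    fi
  done
  return $leftover
}

# Spot-check a few of the paths from the cleanup list above.
check_gone /etc/kubernetes /var/lib/kubelet /var/lib/etcd /var/lib/cni \
  || echo "re-run the cleanup for the paths listed above"
```

A leftover /var/lib/kubelet or /var/lib/etcd is a common reason a re-registered node comes back in a confused state.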

Step 5: Recreate the Cluster Services Around It

What I like about this note is that it did not stop at “node rejoined”. It also listed the real cluster rebuild work that followed:

  • integrate Ceph storage
  • set up MetalLB and allocate an IP pool
  • restore or plan GPU operator support
  • re-establish cloudflared access

The MetalLB note in particular included a real IP pool and L2 advertisement definition, which made it clear that recovery here meant rebuilding the surrounding platform, not just the node itself.
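I am not reproducing the real pool here, but in the CRD-based MetalLB configuration style an IP pool plus L2 advertisement looks roughly like this. The names and the address range are placeholders, not the note's actual values:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool        # placeholder name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # placeholder range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2          # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```

Without the L2Advertisement, the pool exists but no addresses are actually announced on the network.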

Why This Kind of Note Matters

This is a useful example of how cluster recovery is layered:

  • host package state
  • container runtime state
  • Rancher registration state
  • Kubernetes component state
  • service ecosystem around the cluster

If you only fix the Rancher registration line and ignore the rest, the node may reappear in the UI but still be operationally wrong.

Closing Thought

This is one of those posts that reminds me why “recreate cluster” is never one step. It is really a stack of partial recoveries, and the work log is what keeps the sequence from turning into guesswork.