Cluster rebuilds are rarely elegant when you are doing them live. They are usually a mix of cleanup, registration, runtime repair, and then a second wave of follow-up work once the nodes are finally back.
This note came from one of those moments.
The Immediate Problem
The original note focused on worker recovery in a development cluster. Much of the friction came from NVIDIA-related package holds and container runtime behavior, which is a good example of how cluster issues are often not purely “Kubernetes problems”.
They are frequently host-level package and runtime problems that only surface inside Kubernetes later.
Step 1: Remove Package Holds
The first cleanup step was to inspect and then remove a long list of held NVIDIA-related packages:
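The exact command list did not survive formatting, so here is a representative sketch. The package names are illustrative assumptions; substitute whatever `apt-mark showhold` actually reports on the node:

```shell
# List currently held packages to see what is pinned on this host
apt-mark showhold

# Release the NVIDIA-related holds so apt can manage them again.
# These package names are examples, not the original note's list.
sudo apt-mark unhold nvidia-docker2 nvidia-container-toolkit \
  libnvidia-container1 libnvidia-container-tools
```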
That alone tells you a lot about the environment. This was not a clean worker rebuild on a plain CPU node. It was a cluster where GPU support and container runtime details mattered.
Step 2: Repair the Docker Runtime
The note also captured a reinstall and runtime change for NVIDIA Docker support:
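The original commands were lost, but the reinstall step usually looks something like this sketch, assuming an apt-based host with the NVIDIA container packages already in a configured repository:

```shell
# Reinstall the NVIDIA container runtime packages and restart Docker.
# Package names and the decision to --reinstall are assumptions here;
# pin versions to whatever your installed driver supports.
sudo apt-get update
sudo apt-get install --reinstall -y nvidia-docker2 nvidia-container-toolkit
sudo systemctl restart docker
```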
And the Docker runtime configuration needed to define the nvidia runtime explicitly.
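A minimal sketch of what that runtime definition typically looks like in `/etc/docker/daemon.json`; making `nvidia` the default runtime is a common choice on GPU workers, but it is an assumption here, not a detail from the original note:

```shell
# Define the nvidia runtime explicitly so Docker can launch GPU containers.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker
```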
That is the kind of detail I always want in a work log because it explains why a supposedly “healthy” node may still be unusable for actual workloads.
Step 3: Re-Register the Node
The re-registration flow used the Rancher agent pattern:
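The standard Rancher v2 agent registration command has this shape. `SERVER_URL`, `TOKEN`, and the agent image tag are placeholders, not the live values; in practice you copy the exact command from the Rancher UI's cluster registration screen:

```shell
# Rancher agent registration pattern for a worker node (values redacted).
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes \
  -v /var/run:/var/run \
  rancher/rancher-agent:v2.6.9 \
  --server https://SERVER_URL \
  --token TOKEN \
  --worker
```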
I replaced the live server and token values here, but the structure is the important part.
Step 4: Clean Up the Old Kubernetes State
Before the node could register cleanly, the note applied a heavy cleanup pass:
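The original command set is gone, but a typical Rancher/RKE node reset follows this pattern. Treat it as a sketch: the paths assume the usual RKE layout, and everything here is destructive to local Kubernetes state:

```shell
# Stop the kubelet and remove every container before wiping state.
sudo systemctl stop kubelet 2>/dev/null || true
sudo docker rm -f $(sudo docker ps -aq) 2>/dev/null || true

# Remove Kubernetes, etcd, Rancher, and CNI state directories.
sudo rm -rf /etc/kubernetes /var/lib/etcd /var/lib/kubelet \
  /var/lib/rancher /opt/rke /var/lib/cni /etc/cni

# Flush leftover iptables rules installed by kube-proxy and the CNI.
sudo iptables -F
sudo iptables -t nat -F
sudo iptables -t mangle -F
```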
That is not a casual command set. It is a rebuild move, not a tune-up.
Step 5: Recreate the Cluster Services Around It
What I like about this note is that it did not stop at “node rejoined”. It also listed the real cluster rebuild work that followed:
- integrate Ceph storage
- set up MetalLB and allocate an IP pool
- restore or plan GPU operator support
- re-establish cloudflared access
The MetalLB note in particular included a real IP pool and L2 advertisement definition, which made it clear that recovery here meant rebuilding the surrounding platform, not just the node itself.
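For reference, a MetalLB pool plus L2 advertisement looks like the following sketch. The address range is a placeholder standing in for the real pool from the note, and the resource names are assumptions:

```shell
# Create an address pool and advertise it over L2 (MetalLB v1beta1 API).
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250
EOF

kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - default-pool
EOF
```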
Why This Kind of Note Matters
This is a useful example of how cluster recovery is layered:
- host package state
- container runtime state
- Rancher registration state
- Kubernetes component state
- service ecosystem around the cluster
If you only fix the Rancher registration line and ignore the rest, the node may reappear in the UI but still be operationally wrong.
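One way to check each of those layers after a recovery is a quick pass like this sketch (the commands are generic checks, not from the original note):

```shell
# Host package state: the hold list should be empty after cleanup.
apt-mark showhold

# Container runtime state: the nvidia runtime should be listed.
docker info | grep -i runtime

# Registration and component state: the node should be Ready.
kubectl get nodes -o wide

# Service ecosystem: MetalLB should be assigning external IPs.
kubectl get svc -A | grep LoadBalancer
```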
Closing Thought
This is one of those notes that reminds me why “recreate cluster” is never one step. It is really a stack of partial recoveries, and the work log is what keeps the sequence from turning into guesswork.