Building an HPC Kubernetes Cluster with MAAS and DeepOps

This was not a “click a managed cluster into existence” job. It was a bare-metal build where MAAS handled the hardware lifecycle and DeepOps handled the Kubernetes deployment.

That combination is powerful, but it also means there are more moving parts worth writing down.

1. Use MAAS for Discovery and Deployment

The first half of the job belonged to MAAS:

discover or commission the physical servers
correct any BMC or networking details that MAAS could not infer cleanly
deploy the operating system
verify that the nodes are reachable by SSH

That is not glamorous work, but it matters. If the node naming, BMC details, or network assumptions are wrong here, the Kubernetes layer inherits the confusion.
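The last item in that list, SSH reachability, is easy to script. The following is a minimal sketch rather than the procedure from the original notes: the `check_ssh` function, the `SSH_USER` variable, and the example addresses are all placeholders chosen for illustration.

```shell
# Report which hosts do not answer a non-interactive SSH login.
# SSH_USER is a placeholder default, not a value from the original notes.
SSH_USER="${SSH_USER:-ubuntu}"

check_ssh() {
  local failed=0 host
  for host in "$@"; do
    # BatchMode makes ssh fail instead of prompting for a password;
    # -n keeps ssh from consuming the script's stdin.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${host}" true 2>/dev/null; then
      echo "ok   ${host}"
    else
      echo "FAIL ${host}"
      failed=1
    fi
  done
  # Non-zero exit status if any host was unreachable.
  return "$failed"
}

# Example: check_ssh 192.0.2.11 192.0.2.12 192.0.2.13
```

Once an Ansible inventory exists, `ansible all -m ping` covers the same ground and also confirms that Python is usable on each node.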

2. Prepare the DeepOps Inventory

Once the nodes were deployed, the next step was to map them into a DeepOps inventory.

That means:

define all hosts under [all]
choose the control-plane nodes
keep an odd number of etcd members (an even count adds no extra failure tolerance)
decide which nodes also run workloads

The real note contained environment-specific hostnames and addresses, so I am generalizing the shape here:

[all]
k8s-node01 ansible_host=192.0.2.11 ansible_ssh_user=ubuntu
k8s-node02 ansible_host=192.0.2.12 ansible_ssh_user=ubuntu
k8s-node03 ansible_host=192.0.2.13 ansible_ssh_user=ubuntu

[kube-master]
k8s-node01
k8s-node02
k8s-node03

[etcd]
k8s-node01
k8s-node02
k8s-node03

[kube-node]
k8s-node01
k8s-node02
k8s-node03

[k8s-cluster:children]
kube-master
kube-node
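
The odd-number rule for etcd is easy to break when an inventory gets edited by hand, so it can be worth checking before a build. This is a rough sketch under the assumption that the inventory is a simple INI file like the one above, with one host per line under each group header; the function names are my own and not part of DeepOps.

```shell
# Count the hosts listed under one group header in an INI-style inventory.
count_group_hosts() {
  local inventory="$1" group="$2"
  awk -v header="[${group}]" '
    /^\[/ { in_group = ($0 == header); next }     # any header starts or ends a group
    in_group && NF && $1 !~ /^[#;]/ { count++ }   # skip blank lines and comments
    END { print count + 0 }
  ' "$inventory"
}

# Succeed only when the etcd group has an odd, non-zero member count.
etcd_count_is_odd() {
  local n
  n="$(count_group_hosts "$1" etcd)"
  [ "$n" -gt 0 ] && [ $((n % 2)) -eq 1 ]
}

# Example (the path is an assumption about where the inventory lives):
# etcd_count_is_odd config/inventory || echo "etcd member count is not odd"
```

The parser is deliberately naive: it does not understand `:children` groups or host ranges, so it only suits flat group definitions like the ones shown here.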

3. Tune the Cluster Variables for the Environment

The note also captured the kind of cluster-specific adjustments that generic tutorials usually skip:

extra addresses in the Kubernetes API certificate
Calico backend and encapsulation mode
resource tuning for Calico
workload labels for scheduling decisions

That is the part that makes the cluster fit the hardware and network you actually have instead of the one the defaults assume.
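
Because these overrides live in a variables file that is easy to forget, a small check that the expected keys are actually set can save a failed run. The sketch below assumes the overrides are written as top-level keys in a YAML file. The file path and the variable names in the usage comment follow Kubespray naming as I understand it and should be verified against the DeepOps and Kubespray versions in use.

```shell
# Print each variable name that has no top-level "name:" entry in a YAML vars file.
# Commented-out entries do not count as set.
missing_vars() {
  local file="$1" name
  shift
  for name in "$@"; do
    grep -Eq "^${name}[[:space:]]*:" "$file" || echo "$name"
  done
}

# Hypothetical usage; the path and variable names are assumptions to verify:
# missing_vars config/group_vars/k8s-cluster.yml \
#   supplementary_addresses_in_ssl_keys calico_network_backend calico_ipip_mode
```

An empty result means every listed key is present. It says nothing about whether the values are right for the network.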

4. Remove the Things You Do Not Want Fighting You

One of the operational choices in the note was to remove unattended-upgrades before the build:

ansible all -m shell -a "apt -y remove unattended-upgrades"
ansible all -m shell -a "apt -y purge unattended-upgrades"

That is not a universal recommendation, but it does reflect a common ops instinct on tightly managed clusters: avoid surprise package activity while you are still stabilizing the platform.

5. Run the Cluster Build

The main deployment step was:

ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

At this point the work stops being theory. Either the inventory, network, certificates, and base configuration were prepared well enough, or the playbook tells you where the assumptions were wrong.
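
After the playbook finishes, the quickest confirmation is whether every node reports Ready. The helper below is a small sketch of my own, not something from the original notes: it reads the output of `kubectl get nodes --no-headers` on standard input and prints the names of nodes whose status does not start with `Ready`.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the nodes
# whose STATUS column does not begin with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) still counts as ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}

# Example: kubectl get nodes --no-headers | not_ready_nodes
```

No output means all nodes have joined and are healthy from the API server's point of view.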

6. Add Monitoring Early

The original note followed the cluster deployment with monitoring almost immediately. I think that is the right instinct.

A new cluster is much easier to trust when you can see:

Prometheus targets
node health
service exposure
alert flow

If you postpone that work too long, the first real problem becomes harder to diagnose.
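
As one concrete example of "seeing" Prometheus targets, the HTTP API exposes target health at `/api/v1/targets`. The sketch below counts targets reported as down. It assumes the compact JSON that Prometheus normally returns and uses plain `grep` to stay dependency-free, where a real JSON tool would be more robust. `PROM_URL` is a placeholder for wherever Prometheus is exposed in your cluster.

```shell
# Count targets reported as down in the JSON returned by the Prometheus
# /api/v1/targets endpoint (read from stdin).
# Relies on compact JSON: the pattern will not match if there is a space
# between the key and the value.
count_down_targets() {
  { grep -o '"health":"down"' || true; } | wc -l | tr -d '[:space:]'
}

# Example (PROM_URL is a placeholder):
# curl -s "${PROM_URL}/api/v1/targets" | count_down_targets
```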

Closing Thought

The useful thing about MAAS plus DeepOps is not that it removes the work. It is that it gives the work a repeatable structure.

That structure is what makes a bare-metal cluster build survivable the second time, not just the first.
