Capacity Planning for Kubernetes Clusters: Beyond Request Limits
Resource requests and limits are the first thing every Kubernetes tutorial teaches. They are also where most teams stop thinking about capacity. In production, the gap between "it schedules" and "it runs reliably" is where outages hide.
The Bin-Packing Problem
Kubernetes schedules pods based on resource requests, not actual usage. This creates two failure modes:
Over-requesting: Teams set requests high "just in case." Nodes appear full to the scheduler while actual CPU utilization sits at 15%, which means you are paying for roughly six times the infrastructure you actually use. This is the most common failure mode and the easiest to fix with right-sizing tools like Goldilocks or VPA in recommendation mode.
Under-requesting: Teams set requests low to avoid scheduling failures. Pods land on the same node and compete for resources. CPU throttling increases latency. Memory pressure triggers OOM kills. The cluster looks healthy in the scheduler's view while applications degrade.
The right approach is to set requests based on measured p99 usage and limits based on the maximum burst you expect. Use VPA recommendations as a starting point, then adjust based on load testing data.
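In practice, this often starts with a VerticalPodAutoscaler running in recommendation-only mode. A minimal sketch, assuming a Deployment named api-gateway (the name is illustrative):

```yaml
# Sketch: VPA in recommendation-only mode.
# The target Deployment name "api-gateway" is an illustrative placeholder.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-gateway-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  updatePolicy:
    updateMode: "Off"   # publish recommendations only; never evict or mutate pods
```

With updateMode set to "Off", the VPA writes its recommendations into the object's status without touching running pods, so you can compare them against your load-test data before committing to new request values.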
Scheduling Headroom
A cluster at 85% allocation is not 85% full — it is effectively 100% full for any pod that needs more than the largest contiguous free block. Fragmentation is the silent killer of Kubernetes capacity.
We maintain a 20% scheduling headroom target. This means the cluster should have at least 20% of total allocatable resources unallocated at any time. Below this threshold, we trigger scale-up. This sounds wasteful until you compare it to the cost of an incident caused by a pod that cannot schedule during a traffic spike.
# Prometheus query for scheduling headroom
1 - (
  sum(kube_pod_container_resource_requests{resource="cpu"})
  /
  sum(kube_node_status_allocatable{resource="cpu"})
)
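To act on the 20% target automatically, this query can be wrapped in a Prometheus alerting rule. A sketch, with illustrative group and alert names and a threshold matching the target above:

```yaml
# Sketch: alert when CPU scheduling headroom drops below the 20% target.
# Group name, alert name, and "for" duration are illustrative choices.
groups:
  - name: capacity
    rules:
      - alert: SchedulingHeadroomLow
        expr: |
          1 - (
            sum(kube_pod_container_resource_requests{resource="cpu"})
            /
            sum(kube_node_status_allocatable{resource="cpu"})
          ) < 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU scheduling headroom below 20%; trigger scale-up"
```

The "for: 10m" clause keeps brief allocation spikes (rolling deployments, batch jobs) from paging anyone; tune it to how fast your scale-up path actually is.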
Node Failure Domains and N+2 Planning
If you run 10 nodes and can tolerate losing one, you need 11. If you run across 3 availability zones and need to survive an AZ failure, you need to provision so that any two zones can handle the full workload. This is N+2 planning for a 3-AZ setup.
The math is straightforward but often ignored. For a workload requiring 30 CPU cores distributed across 3 AZs, each AZ has 10 cores. To survive one AZ failure, the remaining two AZs need 15 cores each. Total required: 45 cores, not 30. That is a 50% overhead for AZ resilience.
Pod Topology Spread Constraints enforce even distribution:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: api-gateway
Burst Capacity vs Sustained Load
The cluster autoscaler adds nodes when pods go pending, but node provisioning takes 2-5 minutes (longer with custom AMIs). If your traffic can spike 3x in 30 seconds, the autoscaler alone is not enough.
We use a combination of approaches:
- Overprovisioning pods: Low-priority pause pods that hold capacity. When real pods need resources, the pause pods are evicted and real pods schedule immediately. The autoscaler then replaces the evicted capacity.
- Karpenter provisioners: Faster than cluster autoscaler because they bypass node groups and provision directly via the cloud API. Typical provisioning time: 60-90 seconds.
- HPA with custom metrics: Scale horizontally before CPU saturation. We use queue depth and request latency as scaling signals, which lead load by 30-60 seconds compared to CPU utilization.
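The overprovisioning pattern above can be sketched as a negative-priority PriorityClass plus a deployment of pause pods; the replica count and resource sizes are illustrative placeholders to size against your own burst profile:

```yaml
# Sketch: capacity-holding pause pods at negative priority.
# Replicas and resource requests are illustrative; size them to your burst needs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10          # lower than any real workload, so these pods evict first
globalDefault: false
description: "Placeholder pods that hold warm capacity for bursts"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-pause
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioning-pause
  template:
    metadata:
      labels:
        app: overprovisioning-pause
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

When a real pod needs room, the scheduler preempts these pause pods immediately, and the autoscaler backfills nodes for the now-pending placeholders in the background.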
Monitoring Utilization vs Allocation
Track both, but understand the difference. Allocation is what the scheduler sees — the sum of resource requests. Utilization is what the nodes actually experience — measured CPU and memory consumption.
The gap between them is your optimization opportunity. A healthy cluster has:
- CPU allocation: 60-80% of allocatable
- CPU utilization: 30-50% of allocatable (sustained), up to 70% during peaks
- Memory allocation: 70-85% of allocatable
- Memory utilization: 60-75% of allocatable
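Both ratios can be tracked continuously as Prometheus recording rules. A sketch for the CPU pair, assuming kube-state-metrics and cAdvisor metric names (rule names and label filters are illustrative; adjust to your setup):

```yaml
# Sketch: recording rules for CPU allocation vs utilization ratios.
# Assumes kube-state-metrics and cAdvisor default metric names.
groups:
  - name: capacity-ratios
    rules:
      - record: cluster:cpu_allocation:ratio
        expr: |
          sum(kube_pod_container_resource_requests{resource="cpu"})
          / sum(kube_node_status_allocatable{resource="cpu"})
      - record: cluster:cpu_utilization:ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          / sum(kube_node_status_allocatable{resource="cpu"})
```

Graphing the two recorded series side by side makes the over- or under-requesting gap visible at a glance, rather than something you reconstruct during an incident.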
If allocation and utilization are both high, you are running hot and need more capacity. If allocation is high but utilization is low, you are over-requesting and wasting money. If utilization is high but allocation is low, you are under-requesting and risking stability.
Cost Optimization Without Sacrificing Reliability
Spot instances can reduce compute costs by 60-80%, but only for workloads that tolerate interruption. We run batch jobs, development environments, and stateless workers on spot. Stateful workloads, databases, and anything with SLOs stay on on-demand instances.
The key insight: cost optimization and reliability are not opposing forces. Right-sizing reduces cost and improves performance by reducing noisy-neighbor effects. Proper headroom costs money but prevents expensive incidents. The cheapest infrastructure is the kind that does not wake you up at 3 AM.