How to Tune the Kubernetes Horizontal Pod Autoscaler Without Scaling Flaps
Prevent replica flapping by tuning HPA metrics, resource requests, and scaling behavior.
Horizontal Pod Autoscaler (HPA) tuning usually happens after a service has already caused pain: latency rises, pods scale up, traffic drops, then replicas scale down too quickly and the cycle repeats. The pressure is understandable. You want the cluster to react fast enough to protect users, but not so fast that it keeps changing replica counts because of short metric spikes.
Scaling flaps are usually caused by noisy metrics, missing resource requests, aggressive thresholds, short observation windows, or workloads that take time to become useful after they start. You can avoid most of them by treating HPA as a control loop that needs damping, not as a simple “CPU above X means add pods” rule.
The Horizontal Pod Autoscaler watches one or more metrics, calculates a desired replica count, and updates the target workload, such as a Deployment, StatefulSet, or ReplicaSet. For CPU-based scaling, the calculation is based on CPU usage as a percentage of each container’s CPU request.
That detail matters. If a container has no CPU request, the HPA cannot compute utilization for that pod, and CPU-based scaling will not work. If the CPU request is too low, the HPA may think the pod is overloaded even when the node has plenty of spare CPU. If the request is too high, the HPA may scale too late.
A simplified version of the logic looks like this:
desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
For example, if you run 4 replicas at 80% average CPU utilization and the HPA target is 50%, Kubernetes may calculate:
ceil(4 * 80 / 50) = ceil(6.4) = 7 replicas
That calculation is useful, but it can be too reactive if your metric jumps for only 30 seconds. The goal is to make the HPA respond to sustained demand while ignoring noise.
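The formula is easy to check by hand, but a small script makes it quicker to try other values. The sketch below is a simplified model, not the controller's actual code: the function name `desired_replicas` is made up for illustration, and the 10% tolerance is the controller's default setting, which a cluster may override.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation for a single metric."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 80, 50))  # the worked example: 7
print(desired_replicas(4, 52, 50))  # 4% over target, inside the tolerance: 4
```

The tolerance band is the first layer of damping: small deviations from the target do not change the replica count at all.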
Do not tune HPA first. Start with the workload spec. HPA depends on meaningful requests, readiness behavior, and startup characteristics.
Here is a basic Deployment foundation for CPU-based autoscaling:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /startup
              port: 8080
            periodSeconds: 5
            failureThreshold: 12
Before you add HPA, check these items:
- Compare CPU requests against real usage from kubectl top pod, Prometheus, or your metrics backend. If a pod normally uses 400m CPU under healthy load, a 50m request will make HPA overreact.
If you manage Kubernetes manifests with infrastructure as code, keep the Deployment and HPA changes reviewed together. The same applies if you deploy Kubernetes resources using Terraform, Helm, Kustomize, or a GitOps controller.
The autoscaling/v2 API lets you control scale-up and scale-down behavior. This is where you prevent most flapping.
A safer HPA for a latency-sensitive API might look like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
This configuration makes scale-up faster than scale-down, which is usually what you want for user-facing services. Adding pods too slowly can hurt availability. Removing pods too quickly can cause repeated scale-up after every small traffic rebound.
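To see how the two policies in each direction combine, the following sketch computes the replica limit for a single policy period. It is a simplified model under stated assumptions: the function names are hypothetical, replica changes already made earlier in the same period are ignored, and the rounding (up for scale-up, down for scale-down) is an assumption that may not match the controller exactly.

```python
import math


def scale_up_limit(replicas: int, percent: int, pods: int) -> int:
    """Highest replica count allowed in one period with selectPolicy: Max."""
    by_percent = math.ceil(replicas * (1 + percent / 100))
    by_pods = replicas + pods
    return max(by_percent, by_pods)


def scale_down_limit(replicas: int, percent: int, pods: int) -> int:
    """Lowest replica count allowed in one period with selectPolicy: Min."""
    by_percent = math.floor(replicas * (1 - percent / 100))
    by_pods = replicas - pods
    # Min means "smallest change", so keep the higher of the two floors.
    return max(by_percent, by_pods)


print(scale_up_limit(3, percent=100, pods=4))    # 7: the Pods policy wins
print(scale_up_limit(10, percent=100, pods=4))   # 20: the Percent policy wins
print(scale_down_limit(20, percent=25, pods=2))  # 18: only 2 pods removed
```

With these values, small deployments can still add 4 pods per minute, large ones can double, and scale-down never removes more than 2 pods per minute.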
The important fields are:
- minReplicas: The baseline capacity you always keep. Set this high enough to handle normal traffic, node disruption, and rolling updates.
- maxReplicas: The safety limit. Set this based on downstream capacity, database limits, queue pressure, and cluster capacity.
- averageUtilization: The target CPU utilization relative to CPU requests. A target between 60 and 75 is common for CPU-bound services, but measure your own workload.
- scaleUp.stabilizationWindowSeconds: A short delay that avoids scaling on tiny spikes. Keep it lower for latency-sensitive paths.
- scaleDown.stabilizationWindowSeconds: A longer delay that keeps capacity around after demand falls. This is one of the most useful anti-flap settings.
- policies: Rate limits for replica changes. These stop HPA from jumping from 4 pods to 40 pods in one decision unless you explicitly allow that.
- selectPolicy: Which policy to use when more than one policy applies. Max allows the largest change. Min allows the smallest change.
A practical starting point:
- Use a minReplicas of at least 2 for production services, often 3 if you need safer rolling updates and zone spread.
- Set scaleDown.stabilizationWindowSeconds to 300 seconds for most web services.
- Keep maxReplicas below the point where your database, cache, or external API gets overloaded.
CPU is a good first metric for CPU-bound workloads. It is a poor signal for workloads blocked on I/O, database calls, locks, external APIs, or queue depth. If you scale on the wrong signal, the HPA will look unstable because it is reacting to symptoms instead of pressure.
Use CPU utilization when higher traffic directly increases CPU usage. Typical examples include request parsing, JSON serialization, image resizing, compression, encryption, or compute-heavy API handlers.
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
Gotcha: CPU utilization is based on CPU requests. If you change requests, you changed the scaling math.
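A short calculation shows why the request matters so much. The sketch below is illustrative only: the helper names and the 325m usage figure are assumptions, and the tolerance band is left out to keep the arithmetic visible.

```python
import math


def cpu_utilization(usage_millicores: int, request_millicores: int) -> float:
    """CPU utilization as HPA sees it: usage as a percentage of the request."""
    return usage_millicores * 100 / request_millicores


def desired_replicas(current_replicas: int, utilization: float, target: float) -> int:
    return math.ceil(current_replicas * utilization / target)


usage = 325  # average millicores per pod, an illustrative value
for request in (500, 250):
    util = cpu_utilization(usage, request)
    print(request, util, desired_replicas(4, util, target=65))
# 500 65.0 4   -> exactly on target, no change
# 250 130.0 8  -> same real usage, but HPA now wants twice the replicas
```

Nothing about the workload changed between the two lines of output. Only the request did.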
Memory is rarely a good scale-down metric. Many runtimes do not release memory quickly after a traffic spike. Java, Go, Node.js, and Python services may keep heap, caches, or allocator memory after load drops. HPA can then keep replicas high for longer than expected.
Use memory scaling only when memory growth clearly correlates with load and adding replicas reduces per-pod memory pressure.
Use custom metrics when they represent work better than CPU. Examples include queue depth per replica for workers and in-flight requests per pod for APIs.
For queue workers, a better target is often “items waiting per replica” rather than CPU. A worker might be idle while waiting on a slow downstream service, but the queue is still growing. CPU-based HPA would miss that.
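The "items waiting per replica" idea reduces to a simple division. This is a minimal sketch with hypothetical names and numbers, assuming the total queue length is exposed as a metric and the target is an average value per replica.

```python
import math


def workers_for_queue(queue_length: int, target_per_replica: int,
                      min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each worker has at most target_per_replica items waiting."""
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds, as minReplicas and maxReplicas would.
    return max(min_replicas, min(max_replicas, wanted))


print(workers_for_queue(1200, target_per_replica=100, min_replicas=2, max_replicas=20))  # 12
print(workers_for_queue(50, target_per_replica=100, min_replicas=2, max_replicas=20))    # 2
```

The result depends only on the backlog, so it keeps growing while workers sit idle waiting on a slow dependency.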
When using multiple metrics, Kubernetes calculates desired replicas for each metric and uses the highest value. This is useful for safety, but it can surprise you. One noisy metric can keep replicas high even when CPU is calm.
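The "highest value wins" rule can be sketched in a few lines. The metric names and numbers here are invented for illustration, and the per-metric formula is the simplified one without tolerance.

```python
import math


def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    return math.ceil(current_replicas * current / target)


def desired_across_metrics(current_replicas: int,
                           metrics: dict[str, tuple[float, float]]) -> int:
    """Evaluate each (current, target) pair and keep the largest proposal."""
    return max(desired_for_metric(current_replicas, cur, tgt)
               for cur, tgt in metrics.values())


metrics = {
    "cpu_utilization": (40, 65),    # calm: suggests 4 replicas
    "inflight_requests": (30, 20),  # busy: suggests 9 replicas
}
print(desired_across_metrics(6, metrics))  # 9
```

CPU alone would allow a scale-down from 6 to 4, but the second metric holds the deployment at 9.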
Apply the HPA and watch it under controlled load before trusting it in production. You need to verify scale direction, timing, and replica limits.
kubectl top pods -n production
kubectl top pods -n production -l app=api
kubectl apply -f hpa.yaml
kubectl get hpa api -n production --watch
kubectl describe hpa api -n production
hey -z 10m -c 100 https://api.example.com/health-or-test-endpoint
kubectl get pods -n production -l app=api --watch
kubectl rollout status deployment/api -n production
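During the test, record when each scale-up and scale-down happens, how long new pods take to become ready, and whether the replica count stays within the configured limits.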
If the HPA scales up repeatedly before pods become ready, your scale-up policy may be too aggressive for the application startup time. If it scales down and immediately scales back up, increase the scale-down stabilization window or reduce the scale-down rate.
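One rough way to quantify flapping from a recorded test is to count how often the replica count reverses direction. The helper below is a hypothetical sketch. It assumes you sampled the replica count at a fixed interval, for example from the kubectl get hpa --watch output.

```python
def count_reversals(replica_history: list[int]) -> int:
    """Count how often the replica count changes direction."""
    reversals = 0
    last_direction = 0
    for previous, current in zip(replica_history, replica_history[1:]):
        if current == previous:
            continue
        direction = 1 if current > previous else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


steady = [3, 3, 5, 8, 8, 8, 6, 4, 3]  # one ramp up, one ramp down
flapping = [3, 6, 4, 7, 4, 6, 3]      # direction changes at every sample
print(count_reversals(steady))    # 1
print(count_reversals(flapping))  # 5
```

A single load test with one ramp up and one ramp down should produce about one reversal. Noticeably more suggests the behavior settings need more damping.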
For larger platform setups, keep autoscaling tests close to the way workloads are deployed. If your team provisions Kubernetes applications and cloud dependencies together, patterns from deploying AWS resources using Crossplane on Kubernetes can help keep application and infrastructure changes versioned, reviewed, and repeatable.
Symptom: HPA scales up quickly even under moderate traffic.
Fix: Compare actual CPU usage to requests. If a pod normally uses 300m CPU and the request is 100m, a 65% target means the pod is considered overloaded at 65m. Raise the request to match realistic operating needs, then retune the HPA target.
Symptom: replicas drop soon after traffic falls, then scale back up minutes later.
Fix: Increase scaleDown.stabilizationWindowSeconds to 300 or 600 seconds. Limit scale-down to a small percentage per minute.
scaleDown:
  stabilizationWindowSeconds: 600
  selectPolicy: Min
  policies:
    - type: Percent
      value: 10
      periodSeconds: 60
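The effect of a longer scale-down window can be sketched as a rolling maximum: during scale-down, the controller acts on the highest recommendation it has seen within the window. The model below is deliberately simplified. It assumes one recommendation per minute, measures the window in samples instead of seconds, and uses made-up values.

```python
def stabilized_scale_down(recommendations: list[int], window: int) -> list[int]:
    """For each step, apply the highest recommendation in the last `window` samples."""
    applied = []
    for i in range(len(recommendations)):
        start = max(0, i - window + 1)
        applied.append(max(recommendations[start:i + 1]))
    return applied


# One recommendation per minute while traffic dips and rebounds.
raw = [10, 6, 5, 9, 10, 6, 5, 5, 5, 5]
print(stabilized_scale_down(raw, window=1))  # follows every dip
print(stabilized_scale_down(raw, window=5))  # holds 10 replicas until the last sample
```

With the longer window, the short dip and rebound never reach the workload, and replicas only drop once demand has stayed low for the whole window.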
Symptom: HPA adds replicas, but latency stays high for several minutes.
Fix: Measure startup time. Add startup probes, fix slow initialization, preload only what is required, and use a scale-up policy that accounts for readiness delay. For slow-starting services, raising minReplicas may be better than relying on fast autoscaling.
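To judge whether autoscaling can keep up with a slow-starting service, it helps to estimate how long a full ramp takes. The sketch below counts policy periods under selectPolicy: Max and adds a startup delay. The function name, the 90-second startup time, and the way the total is added up are illustrative assumptions, not measurements.

```python
import math


def scale_up_periods(start: int, target: int, percent: int, pods: int) -> int:
    """Policy periods needed to grow from start to target replicas (selectPolicy: Max)."""
    if percent <= 0 and pods <= 0:
        raise ValueError("at least one policy must allow growth")
    replicas, periods = start, 0
    while replicas < target:
        by_percent = math.ceil(replicas * (1 + percent / 100))
        replicas = min(target, max(by_percent, replicas + pods))
        periods += 1
    return periods


periods = scale_up_periods(start=3, target=20, percent=100, pods=4)
period_seconds = 60
startup_seconds = 90  # hypothetical time for a new pod to become ready
print(periods)                                     # 3 (3 -> 7 -> 14 -> 20)
print(periods * period_seconds + startup_seconds)  # rough upper estimate: 270
```

If that estimate is longer than a traffic spike typically lasts, a higher minReplicas will help more than a faster policy.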
Symptom: CPU is low, but users see latency or queue depth grows.
Fix: Use a custom or external metric that tracks work waiting to be processed. For example, queue workers should often scale on queue depth per replica. APIs may scale better on in-flight requests per pod if CPU is not the bottleneck.
Symptom: HPA adds pods, but errors increase because the database, cache, or external dependency gets overloaded.
Fix: Set maxReplicas based on downstream capacity. Add connection pool limits. Use backpressure, request timeouts, and queue limits. Autoscaling application pods cannot fix a saturated database.
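As one example of deriving maxReplicas from downstream capacity, the sketch below uses a database connection limit. All names and numbers are hypothetical. Substitute your own connection limit, reserved headroom, and per-pod pool size.

```python
def max_replicas_for_database(max_connections: int, reserved_connections: int,
                              pool_size_per_pod: int) -> int:
    """Largest replica count whose connection pools still fit in the database limit."""
    usable = max_connections - reserved_connections
    return max(0, usable // pool_size_per_pod)


# Hypothetical numbers: 500 connections, 50 reserved for admin tasks and
# migrations, and each pod opens a pool of up to 20 connections.
print(max_replicas_for_database(500, 50, 20))  # 22
```

If other services share the same database, subtract their pools as well before settling on a limit.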
Symptom: replica counts, node counts, and scheduling all change at once, making behavior hard to explain.
Fix: Review HPA, Vertical Pod Autoscaler (VPA), Cluster Autoscaler, Karpenter, and any event-driven autoscaler settings together. Avoid enabling VPA to change CPU requests automatically on the same workloads where HPA uses CPU utilization, unless you have deliberately designed for that interaction.
Use this process when adding or fixing HPA for a production workload:
For example, if an API normally runs at 3 replicas and handles peak traffic at 8 replicas, do not start with minReplicas: 1 and maxReplicas: 100. A more controlled first version might use minReplicas: 3, maxReplicas: 15, CPU target 65%, scale-up of 100% per minute, and scale-down of 25% per minute with a 5-minute stabilization window.
If your workload is more complex, such as data orchestration or scheduled workers, be careful about what you autoscale. Some components need fixed leadership, stable scheduling, or queue-aware scaling. That distinction matters in platforms such as Apache Airflow on Amazon Elastic Kubernetes Service, where workers, schedulers, and web components have different scaling profiles.
Stable HPA behavior comes from good inputs and conservative control settings. Set realistic resource requests, choose metrics that match the bottleneck, keep enough baseline replicas, scale up faster than you scale down, and use stabilization windows to ignore short-lived noise.
Your next step should be to inspect one existing HPA with kubectl describe hpa, compare its target metric to the workload’s real bottleneck, and add explicit behavior settings if they are missing. Most scaling flaps are fixable with a better metric, a slower scale-down policy, or a more honest CPU request.