How to Configure Kubernetes HPA Without Causing Scaling Flaps

Tune HPA thresholds and stabilization windows to prevent unstable Kubernetes scaling.

Arthur Azrieli

Horizontal Pod Autoscaler (HPA) tuning usually happens after a service has already caused pain: latency rises, pods scale up, traffic drops, then replicas scale down too quickly and the cycle repeats. The pressure is understandable. You want the cluster to react fast enough to protect users, but not so fast that it keeps changing replica counts because of short metric spikes.

Scaling flaps are usually caused by noisy metrics, missing resource requests, aggressive thresholds, short observation windows, or workloads that take time to become useful after they start. You can avoid most of them by treating HPA as a control loop that needs damping, not as a simple “CPU above X means add pods” rule.

Understand what the HPA is actually doing

The Horizontal Pod Autoscaler watches one or more metrics, calculates a desired replica count, and updates the target workload, such as a Deployment, StatefulSet, or ReplicaSet. For CPU-based scaling, the calculation is based on CPU usage as a percentage of each container’s CPU request.

That detail matters. If any container in a pod has no CPU request, the HPA cannot compute CPU utilization for that pod and will not scale on that metric. If the CPU request is too low, the HPA may think the pod is overloaded even when the node has plenty of spare CPU. If the request is too high, the HPA may scale too late.

A simplified version of the logic looks like this:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)

For example, if you run 4 replicas at 80% average CPU utilization and the HPA target is 50%, Kubernetes may calculate:

ceil(4 * 80 / 50) = ceil(6.4) = 7 replicas

That calculation is useful, but it can be too reactive if your metric jumps for only 30 seconds. The goal is to make the HPA respond to sustained demand while ignoring noise.
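
The formula and the worked example above can be checked with a few lines of Python. This is a minimal sketch, not the controller's code: the function name desired_replicas is hypothetical, and the tolerance parameter models the controller's default 10% tolerance band, inside which the HPA leaves the replica count unchanged.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Simplified HPA replica calculation (hypothetical helper, not the real controller)."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA skips scaling entirely.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 80, 50))  # 4 replicas at 80% CPU with a 50% target -> 7
print(desired_replicas(4, 53, 50))  # ratio 1.06 is inside the band -> stays at 4
```

The tolerance band already filters very small deviations. It does not filter a short, large spike, which is why the behavior settings described later still matter.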

Start with a sane Deployment before tuning the HPA

Do not tune HPA first. Start with the workload spec. HPA depends on meaningful requests, readiness behavior, and startup characteristics.

Here is a basic Deployment foundation for CPU-based autoscaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /startup
              port: 8080
            periodSeconds: 5
            failureThreshold: 12

Before you add HPA, check these items:

  • CPU requests are realistic. Measure normal and peak usage with kubectl top pod, Prometheus, or your metrics backend. If a pod normally uses 400m CPU under healthy load, a 50m request will make HPA overreact.
  • Readiness means the pod can serve traffic. A pod that reports ready before warming caches or opening connections can make scale-up look successful while users still see latency.
  • Startup spikes are accounted for. Some applications use high CPU during boot. A startup probe holds back readiness and liveness checks until the application has booted, which helps keep traffic away from a pod that is still starting, but the HPA may still see startup CPU depending on timing and metrics availability.
  • Limits are not causing CPU throttling. A throttled pod can show high CPU demand and poor latency, which may push HPA to add replicas without fixing the real bottleneck.
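
To see why the request size in the first item matters so much, here is a small sketch of how the same real usage turns into very different utilization figures. The helper name cpu_utilization_percent is hypothetical. The 400m and 50m values come from the example above, and the 500m value is the request used in the Deployment.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as the HPA sees it: usage relative to the CPU request."""
    if request_millicores <= 0:
        raise ValueError("a CPU request is required for utilization-based scaling")
    return 100 * usage_millicores / request_millicores


# Same 400m of real usage, two different requests.
print(cpu_utilization_percent(400, 50))   # 800.0, far above any sensible target
print(cpu_utilization_percent(400, 500))  # 80.0, moderately above a 65% target
```

With the 50m request, the HPA would try to add many replicas for a pod that is behaving normally.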

If you manage Kubernetes manifests with infrastructure as code, keep the Deployment and HPA changes reviewed together. The same applies if you deploy Kubernetes resources using Terraform, Helm, Kustomize, or a GitOps controller.

Use HPA behavior settings to dampen scaling

The autoscaling/v2 API lets you control scale-up and scale-down behavior. This is where you prevent most flapping.

A safer HPA for a latency-sensitive API might look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      selectPolicy: Max
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60

This configuration makes scale-up faster than scale-down, which is usually what you want for user-facing services. Adding pods too slowly can hurt availability. Removing pods too quickly can cause repeated scale-up after every small traffic rebound.

The important fields are:

  • minReplicas: The baseline capacity you always keep. Set this high enough to handle normal traffic, node disruption, and rolling updates.
  • maxReplicas: The safety limit. Set this based on downstream capacity, database limits, queue pressure, and cluster capacity.
  • averageUtilization: The target CPU utilization relative to CPU requests. A target between 60% and 75% is common for CPU-bound services, but measure your own workload.
  • scaleUp.stabilizationWindowSeconds: A short delay that avoids scaling on tiny spikes. Keep it lower for latency-sensitive paths.
  • scaleDown.stabilizationWindowSeconds: A longer delay that keeps capacity around after demand falls. This is one of the most useful anti-flap settings.
  • policies: Rate limits for replica changes. These stop HPA from jumping from 4 pods to 40 pods in one decision unless you explicitly allow that.
  • selectPolicy: Which policy to use when more than one policy applies. Max allows the largest change. Min allows the smallest change.
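
The interaction between policies and selectPolicy is easier to see with numbers. The sketch below is a simplified model using the policy values from the manifest above. The function names are hypothetical, the rounding is an assumption, and the model ignores replica changes already made earlier in the same period as well as the minReplicas and maxReplicas bounds.

```python
import math


def scale_up_limit(current, policies, select="Max"):
    """Highest replica count one period may reach (simplified model)."""
    proposals = []
    for kind, value in policies:
        if kind == "Pods":
            proposals.append(current + value)
        elif kind == "Percent":
            proposals.append(math.ceil(current * (100 + value) / 100))
        else:
            raise ValueError(f"unknown policy type: {kind}")
    return max(proposals) if select == "Max" else min(proposals)


def scale_down_limit(current, policies, select="Max"):
    """Lowest replica count one period may reach (simplified model)."""
    proposals = []
    for kind, value in policies:
        if kind == "Pods":
            proposals.append(current - value)
        elif kind == "Percent":
            proposals.append(math.floor(current * (100 - value) / 100))
        else:
            raise ValueError(f"unknown policy type: {kind}")
    # "Min" means the smallest change, which leaves the highest replica count.
    return max(proposals) if select == "Min" else min(proposals)


up = [("Percent", 100), ("Pods", 4)]
down = [("Percent", 25), ("Pods", 2)]
print(scale_up_limit(10, up, "Max"))      # 20: the 100% policy allows more than +4 pods
print(scale_down_limit(20, down, "Min"))  # 18: the 2-pod policy is the smaller change
```

At low replica counts the Pods policy dominates scale-up, and at higher counts the Percent policy takes over. With selectPolicy: Min on scale-down, the workload never loses more than 2 pods per minute until it is small enough that 25% is the smaller step.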

A practical starting point:

  • Use minReplicas of at least 2 for production services, often 3 if you need safer rolling updates and zone spread.
  • Set scaleDown.stabilizationWindowSeconds to 300 seconds for most web services.
  • Allow scale-up to add 50% to 100% more pods per minute, depending on startup time and downstream capacity.
  • Limit scale-down to 10% to 25% per minute when traffic is bursty.
  • Keep maxReplicas below the point where your database, cache, or external API gets overloaded.

Pick metrics that match the bottleneck

CPU is a good first metric for CPU-bound workloads. It is a poor signal for workloads blocked on I/O, database calls, locks, external APIs, or queue depth. If you scale on the wrong signal, the HPA will look unstable because it is reacting to symptoms instead of pressure.

CPU utilization

Use CPU utilization when higher traffic directly increases CPU usage. Typical examples include request parsing, JSON serialization, image resizing, compression, encryption, or compute-heavy API handlers.

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65

Gotcha: CPU utilization is based on CPU requests. If you change requests, you changed the scaling math.

Memory utilization

Memory is rarely a good scale-down metric. Many runtimes do not release memory quickly after a traffic spike. Java, Go, Node.js, and Python services may keep heap, caches, or allocator memory after load drops. HPA can then keep replicas high for longer than expected.

Use memory scaling only when memory growth clearly correlates with load and adding replicas reduces per-pod memory pressure.

External or custom metrics

Use custom metrics when they represent work better than CPU. Examples include:

  • Queue depth per consumer.
  • Requests per second per pod.
  • In-flight requests per pod.
  • Pending jobs per worker.
  • Kafka consumer lag per consumer group.

For queue workers, a better target is often “items waiting per replica” rather than CPU. A worker might be idle while waiting on a slow downstream service, but the queue is still growing. CPU-based HPA would miss that.

When using multiple metrics, Kubernetes calculates desired replicas for each metric and uses the highest value. This is useful for safety, but it can surprise you. One noisy metric can keep replicas high even when CPU is calm.
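
A rough sketch of that "highest value wins" rule, assuming one CPU utilization metric and one queue metric with a per-replica target. All names and numbers here are hypothetical, and the tolerance band is left out to keep the arithmetic visible.

```python
import math


def replicas_for_utilization(current_replicas, current_pct, target_pct):
    """Replica proposal from a utilization-style metric."""
    return math.ceil(current_replicas * current_pct / target_pct)


def replicas_for_average_value(total_value, target_per_replica):
    """Replica proposal from a total (such as queue depth) and a per-replica target."""
    return math.ceil(total_value / target_per_replica)


current = 6
by_cpu = replicas_for_utilization(current, current_pct=40, target_pct=65)
by_queue = replicas_for_average_value(total_value=180, target_per_replica=10)

# The HPA acts on the largest proposal across all configured metrics.
desired = max(by_cpu, by_queue)
print(by_cpu, by_queue, desired)  # 4 18 18
```

Here CPU alone would allow a scale-down to 4 replicas, but the queue metric holds the workload at 18. If the queue metric is noisy, that is the metric to smooth or retune.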

Validate the HPA in a controlled rollout

Apply the HPA and watch it under controlled load before trusting it in production. You need to verify scale direction, timing, and replica limits.

  1. Confirm metrics are available.
    kubectl top pods -n production -l app=api
    kubectl get apiservice v1beta1.metrics.k8s.io
  2. Apply the HPA.
    kubectl apply -f hpa.yaml
  3. Watch HPA decisions.
    kubectl get hpa api -n production --watch
  4. Inspect events and calculated metrics.
    kubectl describe hpa api -n production
  5. Generate load from outside the pod network when possible.
    hey -z 10m -c 100 https://api.example.com/health-or-test-endpoint
  6. Watch pods and rollout state.
    kubectl get pods -n production -l app=api --watch
    kubectl rollout status deployment/api -n production

During the test, record:

  • How long it takes before the first scale-up.
  • How many replicas HPA adds per minute.
  • Whether new pods become ready before more scale-up decisions happen.
  • Whether latency improves after new pods become ready.
  • How long replicas stay elevated after load stops.
  • Whether the workload scales down in small steps or drops too sharply.

If the HPA scales up repeatedly before pods become ready, your scale-up policy may be too aggressive for the application startup time. If it scales down and immediately scales back up, increase the scale-down stabilization window or reduce the scale-down rate.
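
One way to turn the recorded replica counts into a simple flapping signal is to count direction reversals. This is a hypothetical helper for analyzing your own test notes, not something the HPA exposes, and the sample histories are made up.

```python
def count_reversals(replica_history):
    """Count how often the replica count changes direction (up then down, or down then up)."""
    reversals = 0
    last_direction = 0
    for prev, curr in zip(replica_history, replica_history[1:]):
        if curr == prev:
            continue  # no change, so no direction information
        direction = 1 if curr > prev else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


# Replica counts sampled once per minute during a load test (made-up data).
steady = [3, 3, 6, 8, 8, 8, 7, 6, 5, 4, 3]
flapping = [3, 6, 4, 7, 4, 8, 5, 3]
print(count_reversals(steady))    # 1: one ramp up, one ramp down
print(count_reversals(flapping))  # 5: repeated up/down cycles
```

A single reversal per load cycle is expected. Several reversals while load is roughly constant suggest the scale-down side needs more damping.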

For larger platform setups, keep autoscaling tests close to the way workloads are deployed. If your team provisions Kubernetes applications and cloud dependencies together, patterns from deploying AWS resources using Crossplane on Kubernetes can help keep application and infrastructure changes versioned, reviewed, and repeatable.

Common causes of HPA flapping and how to fix them

CPU requests are too low

Symptom: HPA scales up quickly even under moderate traffic.

Fix: Compare actual CPU usage to requests. If a pod normally uses 300m CPU and the request is 100m, a 65% target means the pod is considered overloaded at 65m. Raise the request to match realistic operating needs, then retune the HPA target.

Scale-down is too aggressive

Symptom: replicas drop soon after traffic falls, then scale back up minutes later.

Fix: Check scaleDown.stabilizationWindowSeconds. If it was lowered below the 300-second default, restore it, and consider raising it to 600 seconds for bursty traffic. Limit scale-down to a small percentage per minute.

scaleDown:
  stabilizationWindowSeconds: 600
  selectPolicy: Min
  policies:
    - type: Percent
      value: 10
      periodSeconds: 60
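
To illustrate why a longer window damps flapping: during the scale-down stabilization window, the HPA acts on the highest replica recommendation it computed within that window. The sketch below models only that rule. The function name and the sample history are hypothetical.

```python
def stabilized_scale_down(recommendations, now, window_seconds):
    """Return the highest recommendation seen within the window ending at `now`.

    recommendations: list of (timestamp_seconds, desired_replicas) tuples.
    """
    recent = [r for t, r in recommendations if now - window_seconds <= t <= now]
    return max(recent)


# Traffic dips for two minutes at t=120, then rebounds at t=240 (made-up data).
history = [(0, 10), (60, 10), (120, 4), (180, 4), (240, 9)]

print(stabilized_scale_down(history, now=180, window_seconds=60))   # 4: drops during the dip
print(stabilized_scale_down(history, now=180, window_seconds=600))  # 10: capacity is kept
```

With the 60-second window the workload shrinks to 4 replicas and has to grow back to 9 a minute later. With the 600-second window it never leaves 10 during the dip.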

Pods take too long to become useful

Symptom: HPA adds replicas, but latency stays high for several minutes.

Fix: Measure startup time. Add startup probes, fix slow initialization, preload only what is required, and use a scale-up policy that accounts for readiness delay. For slow-starting services, raising minReplicas may be better than relying on fast autoscaling.

The metric does not represent real pressure

Symptom: CPU is low, but users see latency or queue depth grows.

Fix: Use a custom or external metric that tracks work waiting to be processed. For example, queue workers should often scale on queue depth per replica. APIs may scale better on in-flight requests per pod if CPU is not the bottleneck.

Downstream systems cannot handle the scaled workload

Symptom: HPA adds pods, but errors increase because the database, cache, or external dependency gets overloaded.

Fix: Set maxReplicas based on downstream capacity. Add connection pool limits. Use backpressure, request timeouts, and queue limits. Autoscaling application pods cannot fix a saturated database.
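
As a rough illustration of sizing maxReplicas from a downstream limit, the sketch below derives a ceiling from database connections. Every number and name is an assumption for illustration: a database that accepts 500 connections, 50 reserved for other clients, 20 pooled connections per pod, and 80% headroom.

```python
def max_replicas_for_database(db_max_connections, pool_size_per_pod,
                              reserved_connections=0, headroom_percent=80):
    """Upper bound on replicas so pooled connections stay under the database limit."""
    usable = (db_max_connections - reserved_connections) * headroom_percent // 100
    return max(1, usable // pool_size_per_pod)


print(max_replicas_for_database(500, 20, reserved_connections=50))  # 18
```

If the result is lower than the replica count you need at peak, the fix is on the database or pooling side, not a higher maxReplicas.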

Multiple autoscalers fight each other

Symptom: replica counts, node counts, and scheduling all change at once, making behavior hard to explain.

Fix: Review HPA, Vertical Pod Autoscaler (VPA), Cluster Autoscaler, Karpenter, and any event-driven autoscaler settings together. Avoid enabling VPA to change CPU requests automatically on the same workloads where HPA uses CPU utilization, unless you have deliberately designed for that interaction.

A practical HPA tuning workflow

Use this process when adding or fixing HPA for a production workload:

  1. Classify the workload. Decide whether it is CPU-bound, memory-bound, queue-based, I/O-bound, or limited by another dependency.
  2. Set resource requests from observed usage. Use recent metrics under normal and peak traffic. Do not copy requests from another service.
  3. Choose one primary scaling signal. Start simple. Add more metrics only when you can explain what each one protects.
  4. Set a safe minimum replica count. Cover normal traffic, rollout capacity, and disruption tolerance.
  5. Set a realistic maximum replica count. Protect downstream systems and stay within cluster capacity.
  6. Add scale-up and scale-down behavior. Scale up faster than you scale down. Use a longer stabilization window for scale-down.
  7. Test with controlled load. Watch HPA events, pod readiness, latency, and downstream saturation.
  8. Review after real traffic. Compare desired replicas, actual replicas, and user-facing metrics. Adjust one variable at a time.

For example, if an API normally runs at 3 replicas and handles peak traffic at 8 replicas, do not start with minReplicas: 1 and maxReplicas: 100. A more controlled first version might use minReplicas: 3, maxReplicas: 15, CPU target 65%, scale-up of 100% per minute, and scale-down of 25% per minute with a 5-minute stabilization window.

If your workload is more complex, such as data orchestration or scheduled workers, be careful about what you autoscale. Some components need fixed leadership, stable scheduling, or queue-aware scaling. That distinction matters in platforms such as Apache Airflow on Amazon Elastic Kubernetes Service, where workers, schedulers, and web components have different scaling profiles.

Takeaway

Stable HPA behavior comes from good inputs and conservative control settings. Set realistic resource requests, choose metrics that match the bottleneck, keep enough baseline replicas, scale up faster than you scale down, and use stabilization windows to ignore short-lived noise.

Your next step should be to inspect one existing HPA with kubectl describe hpa, compare its target metric to the workload’s real bottleneck, and add explicit behavior settings if they are missing. Most scaling flaps are fixable with a better metric, a slower scale-down policy, or a more honest CPU request.