The Kubernetes HorizontalPodAutoscaler (HPA) is a controller that automatically changes the number of pod replicas for a workload based on demand. In practical terms, HPA adds pods when traffic or resource usage rises and removes pods when demand drops.
What HPA does
HPA scales workloads horizontally, meaning it changes replica count rather than changing the CPU or memory assigned to each pod.
It commonly scales Kubernetes objects such as:
- Deployments
- ReplicaSets
- StatefulSets, when the application tolerates replicas being added and removed
- Any resource that exposes the Kubernetes scale subresource
For example, you might configure an API Deployment to run at least 3 replicas and at most 20 replicas, targeting 60% average CPU utilization. If CPU usage rises above the target, HPA increases replicas. If usage stays low, it scales the Deployment down.
How HPA works
HPA runs as a control loop inside the Kubernetes control plane, as part of kube-controller-manager. On each sync (every 15 seconds by default), it reads metrics, compares the current value with the configured target, and updates the workload's replica count.
A typical HPA needs:
- A scalable workload, such as a Deployment.
- Minimum and maximum replica counts, for example minReplicas: 2 and maxReplicas: 10.
- A target metric, such as 70% CPU utilization.
- A metrics source, often Kubernetes Metrics Server for CPU and memory metrics.
At a high level, HPA calculates the desired replica count from the ratio between the current metric value and the target metric value. If average CPU utilization is 90% and the target is 60%, the ratio is 1.5, so HPA tries to run about 1.5 times as many replicas, up to maxReplicas.
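The Kubernetes documentation states this ratio as a formula. The worked case below plugs in the 90% and 60% figures from the paragraph above; the starting count of 4 replicas is an assumption chosen only for illustration.

```latex
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
\[
\left\lceil 4 \times \frac{90}{60} \right\rceil = \lceil 6 \rceil = 6
\]
```

The result is then clamped to the minReplicas and maxReplicas bounds. By default, the controller also skips scaling when the ratio is within about 10% of 1.0, which avoids churn from small fluctuations.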
Common metrics used by HPA
HPA can scale based on several metric types, depending on your Kubernetes version and metrics setup:
- CPU utilization: The most common HPA metric. It depends on CPU requests being set on containers.
- Memory utilization: Useful for some workloads, but memory often does not drop quickly after traffic falls.
- Custom metrics: Application or service metrics, such as requests per second per pod.
- External metrics: Metrics outside the cluster, such as queue depth from a managed message queue.
For custom or external metrics, teams often use adapters such as Prometheus Adapter or event-driven autoscalers such as KEDA, depending on the use case.
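As a rough sketch of how custom and external metrics appear in an autoscaling/v2 manifest, the example below scales a hypothetical worker Deployment on two signals. The metric names (http_requests_per_second, queue_messages_ready), the queue label, the Deployment name, and the target values are all assumptions for illustration; the real names depend on what your metrics adapter exposes.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-worker            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-worker          # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Custom (per-pod) metric: target an average of 100 requests/s per pod.
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed metric name
        target:
          type: AverageValue
          averageValue: "100"
    # External metric: target about 30 queued messages per replica.
    - type: External
      external:
        metric:
          name: queue_messages_ready       # assumed metric name
          selector:
            matchLabels:
              queue: checkout-jobs         # assumed label
        target:
          type: AverageValue
          averageValue: "30"
```

When several metrics are listed, HPA computes a desired replica count for each one and uses the largest.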
Simple real-world example
Assume you run a checkout API on Kubernetes:
- Normal traffic needs 4 pods.
- Peak traffic during a sale needs 15 pods.
- You configure HPA with minReplicas: 4, maxReplicas: 20, and a CPU target of 65%.
As traffic grows, CPU usage rises across the pods. HPA increases the Deployment replica count. Kubernetes schedules the new pods, and the Service starts sending traffic to them once they are ready. When traffic drops and metrics stay below target, HPA gradually reduces replicas.
Common use cases
- Web APIs with variable request volume.
- Worker services that process jobs, especially when scaling by queue depth.
- Internal platforms where teams need predictable scaling without manually changing replica counts.
- Batch-adjacent services that receive periodic spikes.
- Data and workflow platforms, such as Airflow components running on Kubernetes. For example, teams deploying Apache Airflow on AWS EKS may use autoscaling for webserver, scheduler, or worker components depending on architecture.
Key configuration fields
- scaleTargetRef: The workload HPA should scale, such as a specific Deployment.
- minReplicas: The lowest number of replicas HPA should keep running.
- maxReplicas: The highest number of replicas HPA is allowed to create.
- metrics: The CPU, memory, custom, or external metrics used for scaling decisions.
- behavior: Optional scaling rules, such as stabilization windows and scale-up or scale-down rate limits.
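To make the behavior field more concrete, here is a minimal sketch that caps scale-up and slows scale-down. The specific numbers are assumptions chosen for illustration, not recommendations, and the checkout-api name is reused from the examples in this article.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      # Act on scale-up recommendations immediately.
      stabilizationWindowSeconds: 0
      policies:
        # Add at most 4 pods per 60-second period.
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      # Use the highest recommendation from the last 5 minutes
      # before removing pods.
      stabilizationWindowSeconds: 300
      policies:
        # Remove at most 50% of current replicas per 60-second period.
        - type: Percent
          value: 50
          periodSeconds: 60
```

A longer scale-down window trades some idle capacity for fewer replica changes when traffic is bursty.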
Benefits
- Improves availability under load by adding pods when demand increases.
- Reduces manual operations because engineers do not need to change replica counts for normal traffic patterns.
- Controls waste by scaling down when demand falls, within the configured minimum.
- Works with standard Kubernetes workflows, including manifests, Helm, GitOps, and Terraform. If you manage manifests through infrastructure code, the same pattern applies when you deploy Kubernetes resources using Terraform.
Tradeoffs and limitations
- HPA does not add cluster nodes by itself. If the cluster has no spare capacity, new pods may stay Pending. You usually pair HPA with Cluster Autoscaler, Karpenter, or another node scaling tool.
- CPU-based scaling requires CPU requests. Without requests, Kubernetes cannot calculate CPU utilization as a percentage of requested CPU.
- Scaling is not instant. HPA needs metrics, Kubernetes needs to create pods, containers need to start, and readiness probes must pass.
- Bad targets can cause unstable scaling. A target that is too low may create too many pods. A target that is too high may react too late.
- Memory can be a poor downscale signal. Some applications hold memory after traffic drops, so memory-based HPA may not scale down as expected.
- It does not fix slow startup times. If your app takes 3 minutes to become ready, HPA cannot make that startup time disappear.
HPA vs VPA vs Cluster Autoscaler
- HPA changes the number of pod replicas.
- Vertical Pod Autoscaler, or VPA, adjusts or recommends CPU and memory requests for pods.
- Cluster Autoscaler changes the number of worker nodes in the cluster.
These tools solve different problems. A common production setup uses HPA for application replica count and a node autoscaler for cluster capacity. VPA may be used for recommendations, but teams should be careful when combining VPA and HPA on the same CPU or memory signals.
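One way to use VPA for recommendations only, so that it does not compete with an HPA that scales on CPU, is to turn off its updates. The sketch below assumes the VPA custom resource definitions and recommender are installed in the cluster (they are not part of core Kubernetes) and reuses the checkout-api Deployment name as an illustration.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa           # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    # "Off" publishes recommendations in the object's status
    # but never evicts or resizes pods.
    updateMode: "Off"
```

The recommended requests can then be reviewed and applied to the Deployment manually.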
Operational tips
- Set realistic resource requests for containers before using CPU utilization targets.
- Choose minReplicas based on availability needs, cold start time, and normal traffic.
- Set maxReplicas to protect downstream services, databases, and third-party APIs.
- Use readiness probes so traffic only reaches pods that can serve requests.
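- Review HPA behavior before cluster upgrades, especially when manifests use older autoscaling API versions (for example, autoscaling/v2beta2 is no longer served as of Kubernetes 1.26) or depend on custom metrics adapters. This fits naturally into broader Kubernetes upgrade planning for startups.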
- Watch events with kubectl describe hpa when scaling does not behave as expected.
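The tips on resource requests and readiness probes can be sketched in a Deployment like the one below. The image, port, probe path, and resource values are placeholders assumed for illustration; appropriate values depend on the application.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  # replicas is intentionally omitted so that re-applying the manifest
  # does not override the count managed by HPA.
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: api
          image: registry.example.com/checkout-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m             # basis for CPU utilization percentages
              memory: 256Mi
            limits:
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /healthz        # assumed health endpoint
              port: 8080
            periodSeconds: 5
```

With a 250m CPU request, a 65% utilization target corresponds to about 162m of average CPU usage per pod.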
Example HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
This HPA keeps at least 3 checkout API pods running, allows up to 20 pods, and targets 65% average CPU utilization across the pods.
Summary
Kubernetes HPA is the standard way to scale pod replicas automatically based on workload demand. It is most effective when your applications have clear resource requests, reliable metrics, sensible replica limits, and enough cluster capacity for new pods. For DevOps and platform teams, HPA is a core building block for resilient Kubernetes workloads.