The Horizontal Pod Autoscaler (HPA) is a Kubernetes controller that automatically changes the number of pod replicas for a workload based on observed metrics. In practical terms, it adds pods when demand rises and removes pods when demand falls.
HPA usually scales workloads such as Deployments, ReplicaSets, or StatefulSets. It helps teams run enough capacity for traffic spikes without manually editing replica counts.
What HPA does
HPA watches metrics for a target workload and adjusts its desired replica count. Common scaling signals include:
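- CPU utilization, measured as a percentage of each pod's CPU request, such as scaling when average utilization across pods exceeds 70%.
- Memory usage, as a resource metric expressed either as utilization relative to the memory request or as an average value per pod.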
- Custom metrics, such as requests per second, queue depth, or active sessions.
- External metrics, such as a cloud queue length or message backlog.
For example, if an API Deployment has 3 replicas and average CPU stays above the configured target, HPA may increase it to 5 or 8 replicas, depending on the metric value and scaling policy.
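The jump from 3 replicas to 5 or 8 follows from the ratio that the Kubernetes documentation gives for HPA: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below is a minimal illustration of that ratio only. The function name is hypothetical, the 70% target and the utilization readings are assumed for the example, and the tolerance band, pod readiness handling, and scaling policies are left out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA ratio: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return math.ceil(current_replicas * current_value / target_value)


# Assumed readings: 3 replicas, 70% CPU utilization target.
print(desired_replicas(3, 100.0, 70.0))  # average utilization at 100% of requests
print(desired_replicas(3, 180.0, 70.0))  # average utilization at 180% of requests
```

With these assumed inputs the first call asks for 5 replicas and the second for 8, which is why the result depends on how far the metric sits above its target.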
How HPA works
HPA runs as a control loop in the Kubernetes control plane, inside kube-controller-manager. It periodically checks metrics (every 15 seconds by default), calculates the desired number of replicas, and updates the workload’s scale subresource.
A typical HPA setup includes:
- A scalable workload, usually a Deployment.
- Minimum and maximum replica counts, such as 2 minimum and 20 maximum.
- Metric targets, such as average CPU utilization of 60%.
- A metrics source, such as Metrics Server for CPU and memory.
- Optional scaling behavior, such as limits on how fast pods scale up or down.
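The pieces in this list map onto fields of an autoscaling/v2 HorizontalPodAutoscaler object. As a rough sketch, the Python below builds such a manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The names api-hpa and api are hypothetical, and the scale-down policy values are illustrative assumptions rather than recommendations.

```python
import json

# Hypothetical HPA for a Deployment named "api" (names are assumptions).
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "api-hpa"},
    "spec": {
        # The scalable workload
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "api"},
        # Minimum and maximum replica counts
        "minReplicas": 2,
        "maxReplicas": 20,
        # Metric target: average CPU utilization of 60% of requests
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 60},
                },
            }
        ],
        # Optional behavior: remove at most 2 pods per minute when scaling down
        "behavior": {
            "scaleDown": {
                "stabilizationWindowSeconds": 300,
                "policies": [{"type": "Pods", "value": 2, "periodSeconds": 60}],
            }
        },
    },
}

print(json.dumps(hpa, indent=2))
```

The metrics source does not appear in the manifest: resource metrics such as CPU are read from the cluster's metrics API, which Metrics Server typically provides.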
With the autoscaling/v2 API, HPA can use multiple metrics and more detailed scaling behavior. When several metrics are configured, HPA calculates a replica count for each one and uses the largest, so that every configured target is satisfied.
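When more than one metric is configured, HPA evaluates each metric separately and takes the highest proposed replica count. The sketch below illustrates that selection under simplifying assumptions: the function names and metric names are hypothetical, each metric is given as a (current, target) pair, and tolerance and scaling policies are ignored.

```python
import math


def replicas_for_metric(current_replicas: int, current_value: float, target_value: float) -> int:
    """Replica count one metric would ask for on its own."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_replicas(current_replicas: int, metrics: dict, min_replicas: int, max_replicas: int) -> int:
    """Take the largest per-metric proposal, then clamp to the min/max bounds."""
    proposals = [
        replicas_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Assumed readings for a workload currently at 4 replicas.
metrics = {
    "cpu_utilization_percent": (45.0, 60.0),         # below target: proposes fewer pods
    "requests_per_second_per_pod": (150.0, 100.0),   # above target: proposes more pods
}
print(desired_replicas(4, metrics, min_replicas=2, max_replicas=20))
```

Here CPU alone would allow scaling down to 3 replicas, but the request-rate metric asks for 6, so the combined result is 6.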
Common use cases
- Web APIs that receive variable request traffic during business hours or product launches.
- Background workers that process queues and need more replicas when backlog grows.
- Internal services with bursty traffic from scheduled jobs or batch workflows.
- Cost control by reducing pod replicas during quiet periods.
- Basic resilience by keeping extra replicas available when load increases.
Simple example
A team runs a checkout service on Kubernetes with this scaling goal:
- Minimum replicas: 3
- Maximum replicas: 30
- Target CPU utilization: 65%
During normal traffic, the service runs 3 to 5 pods. During a sale, CPU usage rises across the pods. HPA detects that the average is above 65% and increases the replica count. When traffic drops and utilization stays low, HPA scales the Deployment back down, by default only after a five-minute stabilization window so that brief dips do not remove pods too early.
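The checkout scenario can be walked through with a small simulation. This is a toy model under stated assumptions, not how the controller is implemented: total load is expressed in hypothetical "CPU percent" units spread evenly across pods, the load series is invented for illustration, and the scale-down stabilization window is ignored, so replicas drop immediately when load falls.

```python
import math

MIN_REPLICAS = 3
MAX_REPLICAS = 30
TARGET_UTILIZATION = 65.0  # percent of requested CPU


def next_replicas(current_replicas: int, avg_utilization: float) -> int:
    """Apply the HPA ratio, then clamp to the configured bounds."""
    raw = math.ceil(current_replicas * avg_utilization / TARGET_UTILIZATION)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


def simulate(total_loads: list, start_replicas: int = MIN_REPLICAS) -> list:
    """Return the replica count chosen after each load observation."""
    replicas = start_replicas
    history = []
    for load in total_loads:
        avg_utilization = load / replicas  # load shared evenly across pods
        replicas = next_replicas(replicas, avg_utilization)
        history.append(replicas)
    return history


# Quiet period, sale ramp-up, extreme peak, then traffic falling off.
print(simulate([150, 400, 1200, 2600, 300, 150]))
```

In this run the replica count goes 3, 7, 19, 30, 5, 3: the extreme peak would need about 40 pods but is capped at the maximum of 30, and the quiet periods are held at the minimum of 3.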
Key benefits
- Automatic scaling: Teams do not need to manually change replica counts for routine traffic changes.
- Better resource use: Clusters can run fewer pods during low demand and more pods during peak demand.
- Works with standard Kubernetes workloads: HPA integrates with Deployments, ReplicaSets, and other scalable resources.
- Supports business-specific signals: Custom metrics let teams scale on queue length, request rate, or other workload-specific indicators.
Limitations and tradeoffs
- HPA does not create nodes: If the cluster has no available capacity, new pods may stay Pending. Cluster Autoscaler or another node autoscaling tool handles node capacity.
- Metrics must be reliable: Missing or delayed metrics can cause poor scaling decisions.
- Scaling is not instant: New pods need time to schedule, start, pass readiness checks, and receive traffic.
- CPU-based scaling can be misleading: Some services bottleneck on I/O, database calls, locks, or external APIs instead of CPU.
- Applications must tolerate more replicas: Stateful behavior, connection limits, rate limits, and shared locks can make horizontal scaling harder.
HPA vs related Kubernetes scaling tools
- HPA: Changes the number of pod replicas for a workload.
- Vertical Pod Autoscaler (VPA): Adjusts pod CPU and memory requests, and may restart pods depending on configuration.
- Cluster Autoscaler: Adds or removes cluster nodes based on pending pods and node utilization.
- KEDA: Adds event-driven autoscaling on top of HPA, with scalers for sources such as Kafka, RabbitMQ, Prometheus, cloud queues, and databases.
In many production clusters, HPA and Cluster Autoscaler work together. HPA requests more pods when a workload needs capacity. If the current nodes cannot fit those pods, Cluster Autoscaler can add nodes so Kubernetes can schedule them.
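The hand-off between the two autoscalers can be pictured with some simple arithmetic. The sketch below is a deliberately simplified model: it assumes every node fits the same fixed number of pods, whereas real scheduling depends on resource requests, affinity rules, and other constraints. The function names and numbers are hypothetical.

```python
def pending_pods(desired_replicas: int, node_count: int, pods_per_node: int) -> int:
    """Pods that cannot be scheduled on the current nodes (simplified capacity model)."""
    capacity = node_count * pods_per_node
    return max(0, desired_replicas - capacity)


def extra_nodes_needed(pending: int, pods_per_node: int) -> int:
    """Nodes a node autoscaler would need to add to fit the pending pods."""
    return -(-pending // pods_per_node)  # ceiling division


# Assumed cluster: 4 nodes, each with room for 4 pods of this workload.
pending = pending_pods(desired_replicas=19, node_count=4, pods_per_node=4)
print(pending)                         # pods left Pending after HPA scales up
print(extra_nodes_needed(pending, 4))  # nodes needed to schedule them
```

With these assumed numbers, HPA asks for 19 pods, 16 fit, 3 stay Pending, and one additional node is enough to schedule them.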