Horizontal Pod Autoscaler (HPA)
[!NOTE] This module explores the core principles of Horizontal Pod Autoscaler (HPA), deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
The Horizontal Pod Autoscaler (HPA) is Kubernetes’ primary mechanism for scaling applications out (adding replicas) and in (removing replicas) based on demand.
It answers the question: “How many copies of this application do I need right now to handle the current load?”
Analogy: Think of HPA like a thermostat in a large building. The thermostat constantly checks the current temperature (Current Metric) against the target temperature you set (Desired Metric). If the building is too cold, it doesn’t just turn on one heater; it calculates how many heaters are needed to reach the target efficiently without overshooting.
1. The Control Loop: First Principles
HPA is a control loop that runs inside the kube-controller-manager, by default every 15 seconds (configurable with the --horizontal-pod-autoscaler-sync-period flag). On each pass it compares the current metric value against your desired target.
The Formula
The number of replicas is calculated using this formula:
desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]
Example:
- Current Replicas: 2
- Current CPU Load: 100% (Avg per pod)
- Target CPU Load: 50%
Since ceil[2 × (100 / 50)] = 4, the HPA will scale up to 4 replicas.
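The arithmetic above can be sketched in a few lines of Python. This is an illustrative model, not the controller's actual code: the function name and the `min_replicas`/`max_replicas` values are assumptions, and the 0.1 tolerance reflects the controller's documented default for ignoring small deviations from the target.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count from the HPA ratio formula, clamped to [min, max]."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(2, 100, 50))  # 4: the example above
print(desired_replicas(4, 52, 50))   # 4: ratio 1.04 is within tolerance
print(desired_replicas(4, 20, 50))   # 2: ceil(4 * 0.4)
```

Rounding up with `ceil` means the HPA errs on the side of one extra pod rather than running slightly above target.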
2. Interactive: The Scaling Simulator
[Interactive simulator omitted.] The simulator shows how HPA reacts to traffic spikes: scaling up is fast, but scaling down is slow (to prevent “thrashing”). In its initial state, a traffic load of 200 RPS against a target of 100 RPS per pod is served by 2 replicas, each at 100% of its target.
3. Stabilization Windows: Preventing “Flapping”
HPA faces a problem called Flapping (or Thrashing).
- Load spikes → Scale Up.
- Load drops slightly → Scale Down.
- Load spikes again → Scale Up.
This causes pods to be created and destroyed rapidly, wasting CPU on startup costs.
The Solution: Behavior Policy
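By default, HPA has a 5-minute scale-down stabilization window. This means: “I see that load is low, but I will wait 5 minutes before deleting pods to make sure it’s not a temporary dip.” In practice, the HPA keeps every replica recommendation from the last 5 minutes and scales down only to the highest of them, so a short dip never removes pods.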
You can configure this in the behavior section:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
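To see why the window prevents flapping, here is a simplified Python model of scale-down stabilization. It assumes the controller acts on the highest recommendation seen within the window. The class and method names are hypothetical, and the model ignores the `policies` rate limits.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommended):
        self.history.append((now, recommended))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] <= now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)

stabilizer = ScaleDownStabilizer(window_seconds=300)
timeline = []
for t in range(0, 315, 15):          # one evaluation every 15 seconds
    raw = 4 if t == 0 else 2         # spike at t=0, then the load halves
    timeline.append((t, stabilizer.stabilize(t, raw)))

print(timeline[1])    # (15, 4): still holding the spike-era recommendation
print(timeline[-1])   # (300, 2): the spike has left the window
```

Because a higher recommendation immediately becomes the new maximum, scale-up is not delayed in this model. Only scale-down waits for the window to clear.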
4. Implementation Guide
1. Prerequisites: Metrics Server
For CPU and memory scaling, HPA cannot function without the Metrics Server (or another provider of the resource metrics API). It provides the currentMetricValue.
# Verify metrics server is running
kubectl top nodes
kubectl top pods
2. Deployment Manifest
Your Deployment MUST have resource requests defined. HPA uses requests to calculate utilization percentage.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m # Critical for HPA calculation
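The link between the request and the utilization percentage can be made concrete with a short Python sketch. The helper names are hypothetical, the usage figures are made-up sample values, and the parser handles only plain cores and the `m` (millicore) suffix.

```python
def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as '200m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))

def cpu_utilization_percent(usage, request):
    """Utilization as HPA sees it: usage relative to the request, not the limit."""
    return 100 * parse_cpu_millicores(usage) / parse_cpu_millicores(request)

print(cpu_utilization_percent("100m", "200m"))  # 50.0
print(cpu_utilization_percent("300m", "200m"))  # 150.0
```

Because the denominator is the request, utilization can exceed 100% when a container is allowed to burst above what it requested.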
3. HPA Manifest (v2)
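Use autoscaling/v2 (stable since Kubernetes 1.23). Unlike autoscaling/v1, which only supports a CPU utilization target, it supports memory, custom metrics, and the behavior section.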
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Why 50% Utilization?
A target of 100% is dangerous. If traffic spikes, you have zero buffer while new pods are booting up. A target of 50-70% leaves headroom for spikes during the scale-up lag.
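A rough way to quantify that headroom is shown below. It rests on two simplifying assumptions: CPU usage grows linearly with traffic, and pods saturate at 100% of their requested CPU. The function name is hypothetical.

```python
def max_absorbable_spike(target_utilization_percent):
    """Largest traffic multiplier existing pods can take before reaching 100%."""
    return 100 / target_utilization_percent

for target in (50, 70, 90, 100):
    print(target, round(max_absorbable_spike(target), 2))
# 50 2.0
# 70 1.43
# 90 1.11
# 100 1.0
```

Under these assumptions, a 50% target lets the existing pods absorb a doubling of traffic while new pods start, whereas a 90% target tolerates only about an 11% increase.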
5. Scaling on Custom Metrics
Sometimes CPU/Memory isn’t enough. You might want to scale on:
- Requests Per Second (RPS) (from Ingress)
- Queue Length (from RabbitMQ/SQS)
This requires a custom metrics adapter such as the Prometheus Adapter. It translates Prometheus metrics into the Kubernetes Custom Metrics API so HPA can read them.
metrics:
- type: Pods
  pods:
    metric:
      name: packets-per-second
    target:
      type: AverageValue
      averageValue: 1k
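For an AverageValue target, the same ratio formula reduces to dividing the total metric across all pods by the per-pod target. The sketch below illustrates this with made-up per-pod readings. The function names are hypothetical, and the quantity parser covers only a small subset of Kubernetes suffixes.

```python
import math

def parse_quantity(quantity):
    """Parse a small subset of Kubernetes quantities, e.g. '1k' -> 1000.0."""
    suffixes = {"m": 0.001, "k": 1_000, "M": 1_000_000}
    if quantity[-1] in suffixes:
        return float(quantity[:-1]) * suffixes[quantity[-1]]
    return float(quantity)

def replicas_for_average_value(per_pod_values, target_average):
    """Replicas needed so the per-pod average falls to the target or below."""
    total = sum(per_pod_values)
    return math.ceil(total / parse_quantity(target_average))

# Three pods currently handling 1500, 1700 and 1600 packets per second.
print(replicas_for_average_value([1500, 1700, 1600], "1k"))  # 5
```

This matches the general formula: 3 replicas × (1600 average / 1000 target) = 4.8, rounded up to 5.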
6. Common Gotchas
Missing Requests
If your containers don’t have resources.requests defined, HPA cannot scale on CPU/Memory utilization because it has no request to calculate a percentage against.
Cold Starts
HPA reacts to current load. If your app takes 60 seconds to boot (Java/Spring), your users will see errors during that scale-up window. Use over-provisioning or lower utilization targets to mitigate this.

War Story: A major e-commerce site failed during Black Friday because their payment service HPA target was 90% CPU. When traffic spiked, the HPA scaled up, but the new Java pods took 90 seconds to boot. The existing pods hit 100% CPU and crashed, causing a cascading failure before any new pods were ready. Changing the target to 60% provided enough buffer for the boot time.