Horizontal Pod Autoscaler (HPA)

[!NOTE] This module explores the core principles of Horizontal Pod Autoscaler (HPA), deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

The Horizontal Pod Autoscaler (HPA) is Kubernetes’ primary mechanism for scaling applications out (adding replicas) and in (removing replicas) based on demand.

It answers the question: “How many copies of this application do I need right now to handle the current load?”

Analogy: Think of HPA like a thermostat in a large building. The thermostat constantly checks the current temperature (Current Metric) against the target temperature you set (Desired Metric). If the building is too cold, it doesn’t just turn on one heater; it calculates how many heaters are needed to reach the target efficiently without overshooting.

1. The Control Loop: First Principles

HPA is a control loop that runs inside the kube-controller-manager. By default it runs every 15 seconds (set by the --horizontal-pod-autoscaler-sync-period flag). On each pass it compares the current metric value against your desired target.

The Formula

The number of replicas is calculated using this formula:

DesiredReplicas = ceil[ CurrentReplicas * ( CurrentMetricValue / DesiredMetricValue ) ]

Example:

Current Replicas: 2
Current CPU Load: 100% (Avg per pod)
Target CPU Load: 50%

DesiredReplicas = ceil[ 2 * ( 100 / 50 ) ] = ceil[ 2 * 2 ] = 4

The HPA will scale up to 4 replicas.
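
The formula above can be sketched in a few lines of Python. This is an illustrative helper, not the controller's actual code; the function name `desired_replicas` is an assumption made for this example.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, desired_metric: float) -> int:
    """Apply DesiredReplicas = ceil[CurrentReplicas * (CurrentMetricValue / DesiredMetricValue)]."""
    if desired_metric <= 0:
        raise ValueError("desired_metric must be positive")
    ratio = current_metric / desired_metric
    return math.ceil(current_replicas * ratio)


# The worked example from the text: 2 replicas at 100% CPU with a 50% target.
print(desired_replicas(2, 100, 50))  # 4
# Load below the target scales in: 4 replicas at 20% CPU with a 50% target.
print(desired_replicas(4, 20, 50))   # 2
```

Because the result is rounded up, the calculation errs on the side of one replica too many rather than one too few.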

2. Interactive: The Scaling Simulator

Scaling up is fast, but scaling down is deliberately slow (to prevent “thrashing”).

Example state: a traffic load of 200 RPS against a target of 100 RPS per pod keeps 2 replicas at 100% utilization, exactly on target.

3. Stabilization Windows: Preventing “Flapping”

HPA faces a problem called Flapping (or Thrashing).

Load spikes → Scale Up.
Load drops slightly → Scale Down.
Load spikes again → Scale Up.

This causes pods to be created and destroyed rapidly, wasting CPU on startup costs.

The Solution: Behavior Policy

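By default, HPA has a 5-minute (300-second) scale-down stabilization window. The controller remembers the replica counts it recommended during that window and scales down only to the highest of them. In effect it says: “I see that load is low, but I will wait 5 minutes before deleting pods to make sure it’s not a temporary dip.”

You can configure this in the behavior section (the values shown here are the defaults):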

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
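
To make the stabilization window concrete, here is a minimal Python sketch that keeps recent recommendations and scales down only to the highest one still inside the window. The class `ScaleDownStabilizer` and the helper `replay` are hypothetical names for illustration, and the sketch does not model the rate-limiting policies (the Percent/periodSeconds entries).

```python
from collections import deque


class ScaleDownStabilizer:
    """Remember recent replica recommendations and only scale down to the
    highest one seen inside the stabilization window (a simplified model)."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self.history = deque()  # entries are (timestamp_seconds, recommended_replicas)

    def stabilize(self, now: float, current: int, recommended: int) -> int:
        self.history.append((now, recommended))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] <= now - self.window_seconds:
            self.history.popleft()
        if recommended >= current:
            return recommended  # scale-up (or no change) is applied immediately
        highest_recent = max(replicas for _, replicas in self.history)
        return min(current, highest_recent)


def replay(recommendations, start_replicas, window_seconds=300):
    """Feed (timestamp, raw_recommendation) pairs through the stabilizer."""
    stabilizer = ScaleDownStabilizer(window_seconds)
    replicas = start_replicas
    applied = []
    for now, recommended in recommendations:
        replicas = stabilizer.stabilize(now, replicas, recommended)
        applied.append(replicas)
    return applied


# Load drops at t=60s; the scale-down is only applied once 300s have passed.
steady_drop = [(0, 4), (60, 2), (120, 2), (180, 2), (240, 2), (300, 2), (360, 2)]
print(replay(steady_drop, start_replicas=4))  # [4, 4, 4, 4, 4, 2, 2]

# A flapping load never scales down, because a high recommendation is always recent.
flapping = [(0, 4), (60, 2), (120, 4), (180, 2)]
print(replay(flapping, start_replicas=4))  # [4, 4, 4, 4]
```

Scale-up is deliberately left unstabilized in this sketch, which mirrors the "fast up, slow down" behavior described above.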

4. Implementation Guide

1. Prerequisites: Metrics Server

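For CPU and memory scaling, HPA cannot function without the resource metrics API (metrics.k8s.io), which is normally served by the Metrics Server. It provides the CurrentMetricValue used in the formula.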

# Verify metrics server is running
kubectl top nodes
kubectl top pods

2. Deployment Manifest

Your Deployment MUST have resource requests defined. HPA uses requests to calculate utilization percentage.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m # Critical for HPA calculation
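
The following Python sketch shows why the request matters: utilization is a ratio of measured usage to the requested amount. It assumes utilization is computed as total usage across pods divided by total requests, truncated to a whole percentage. The helper names are hypothetical, and the parser handles only the `m` suffix and plain decimal CPU values.

```python
from typing import Sequence


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '200m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def average_utilization(usage_millicores: Sequence[int], request: str) -> int:
    """Total usage divided by total requests, as a whole percentage.

    Assumes every pod has the same CPU request, as in the Deployment above.
    """
    total_request = parse_cpu_millicores(request) * len(usage_millicores)
    return (sum(usage_millicores) * 100) // total_request


# Three pods using 150m, 250m and 200m against a 200m request each.
print(average_utilization([150, 250, 200], "200m"))  # 100
# Two pods using 100m and 120m against a 200m request each.
print(average_utilization([100, 120], "200m"))  # 55
```

Without a request there is no denominator, so a utilization percentage cannot be computed at all.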

3. HPA Manifest (v2)

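Use autoscaling/v2 (stable since Kubernetes 1.23). Unlike autoscaling/v1, which only supports a CPU utilization target, it supports multiple metrics, custom metrics, and the behavior section.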

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
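
As a rough illustration of how minReplicas and maxReplicas bound the result, the sketch below applies the replica formula and then clamps it to the limits from this manifest. The function name `bounded_replicas` is hypothetical, and the defaults simply mirror the values above.

```python
import math


def bounded_replicas(current: int, utilization: float, target: float = 50,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Apply the replica formula, then clamp to [min_replicas, max_replicas]."""
    raw = math.ceil(current * (utilization / target))
    return max(min_replicas, min(max_replicas, raw))


print(bounded_replicas(2, 100))  # 4: within bounds, so the formula result is used
print(bounded_replicas(4, 200))  # 10: the formula asks for 16, capped by maxReplicas
print(bounded_replicas(1, 0))    # 1: the formula asks for 0, raised to minReplicas
```

A maxReplicas that is too low therefore caps the autoscaler silently: the formula keeps asking for more pods, but none are added.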

Why 50% Utilization?

A target of 100% is dangerous. If traffic spikes, you have zero buffer while new pods are booting up. A target of 50-70% leaves headroom for spikes during the scale-up lag.
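
The headroom argument can be put into numbers with a small Python sketch. It makes simplifying assumptions that are not from the original text: load is spread evenly across pods, CPU use grows linearly with traffic, and 100% of the CPU request is treated as the saturation point. The function name `surge_headroom` is hypothetical.

```python
def surge_headroom(target_utilization_percent: float) -> float:
    """Factor by which traffic can grow before the existing pods reach 100%
    of their CPU request, under the simplifying assumptions stated above."""
    if not 0 < target_utilization_percent <= 100:
        raise ValueError("target must be in the range (0, 100]")
    return 100 / target_utilization_percent


for target in (50, 70, 90, 100):
    factor = surge_headroom(target)
    print(f"target {target}% -> existing pods absorb {factor:.2f}x traffic")
```

Under these assumptions a 50% target tolerates a doubling of traffic while new pods boot, a 90% target tolerates only about an 11% increase, and a 100% target tolerates none.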

5. Scaling on Custom Metrics

Sometimes CPU/Memory isn’t enough. You might want to scale on:

Requests Per Second (RPS) (from Ingress)
Queue Length (from RabbitMQ/SQS)

This requires a metrics adapter such as the Prometheus Adapter. It translates Prometheus metrics into the Kubernetes Custom Metrics API so HPA can read them.

  metrics:
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
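
For an AverageValue target like the one above, the same replica formula applies to the per-pod average of the metric. The Python sketch below assumes that interpretation. The helper names are hypothetical, and the quantity parser handles only the `k` and `M` suffixes, not the full Kubernetes quantity syntax.

```python
import math
from typing import Sequence

# Only the suffixes this sketch needs; real Kubernetes quantities support more.
SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_quantity(quantity: str) -> float:
    """Convert a quantity such as '1k' or '250' to a plain number."""
    if quantity[-1] in SUFFIXES:
        return float(quantity[:-1]) * SUFFIXES[quantity[-1]]
    return float(quantity)


def replicas_for_average_value(per_pod_values: Sequence[float], target: str) -> int:
    """Replica formula applied to the per-pod average of a custom metric."""
    target_value = parse_quantity(target)
    average = sum(per_pod_values) / len(per_pod_values)
    return math.ceil(len(per_pod_values) * (average / target_value))


# Three pods handling 1500, 1800 and 1200 packets per second, target 1k per pod.
print(replicas_for_average_value([1500, 1800, 1200], "1k"))  # 5
# Two pods well under the target scale in to one.
print(replicas_for_average_value([400, 600], "1k"))  # 1
```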

6. Common Gotchas

Missing Requests

If your containers don’t have resources.requests defined, utilization-based CPU/Memory scaling will not work, because HPA has nothing to calculate a percentage against.

Cold Starts

HPA reacts to current load. If your app takes 60 seconds to boot (Java/Spring), your users will see errors during that scale-up window. Use Over-provisioning or lower utilization targets to mitigate this.

War Story: A major e-commerce site failed during Black Friday because their payment service HPA target was 90% CPU. When traffic spiked, the HPA scaled up, but the new Java pods took 90 seconds to boot. The existing pods hit 100% CPU and crashed, causing a cascading failure before any new pods were ready. Changing the target to 60% provided enough buffer for the boot time.
