Cluster Autoscaling: HPA, Cluster Autoscaler, and KEDA

Cluster Autoscaling#

Kubernetes autoscaling operates at two levels: pod-level (HPA adds or removes pod replicas) and node-level (Cluster Autoscaler adds or removes nodes). Getting them to work together requires understanding how each makes decisions.

Horizontal Pod Autoscaler (HPA)#

HPA adjusts the replica count of a Deployment, StatefulSet, or ReplicaSet based on observed metrics. The metrics-server must be running in your cluster for CPU and memory metrics.
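
Utilization-type targets are expressed as a percentage of each container's CPU request, so the workload you target must have requests set or the HPA cannot compute utilization for that metric. A minimal excerpt of what the target Deployment's pod template might contain (the name my-app matches the example below; image and values are illustrative):

# Excerpt from the target Deployment's pod template (values are illustrative)
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # placeholder image
          resources:
            requests:
              cpu: 500m        # utilization percentages are computed against this request
              memory: 256Mi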

Basic HPA on CPU#

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: default
spec:
  scaleTargetRef:            # the workload whose replica count the HPA controls
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2             # never scale below this floor
  maxReplicas: 10            # never scale above this ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # target average CPU, as a percentage of requested CPU

This scales my-app between 2 and 10 replicas, targeting 70% average CPU utilization across all pods. The HPA checks metrics every 15 seconds (default) and computes the desired replica count as:
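
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

For example, if 4 replicas are averaging 90% CPU against the 70% target, the HPA computes ceil(4 * 90 / 70) = 6 and scales the Deployment to 6 replicas.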

Infrastructure Capacity Planning: Measurement, Projection, and Scaling

What Capacity Planning Solves#

Running out of capacity during a traffic spike causes outages. Over-provisioning wastes money continuously. Capacity planning is the process of measuring what you use now, projecting what you will need, and ensuring resources are available before demand arrives. Without it, you are either constantly firefighting resource exhaustion or explaining to finance why your cloud bill doubled.

Capacity planning is not a one-time exercise. It is a recurring process – monthly for fast-growing services, quarterly for stable ones.

Scenario: Preparing for and Handling a Traffic Spike#

You are in this scenario when someone says: “we have a big launch next week,” “Black Friday is coming,” or “traffic is suddenly 3x normal and climbing.” These are two distinct problems – proactive preparation for a known event and reactive response to an unexpected surge – but they share the same infrastructure mechanics.

The key principle: Kubernetes autoscaling has latency. HPA takes 15-30 seconds to detect increased load and scale pods. Cluster Autoscaler takes 3-7 minutes to provision new nodes. If your traffic spike is faster than your scaling speed, users hit errors during the gap. Proactive preparation eliminates this gap. Reactive response minimizes it.
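
For a known event, the simplest way to close the gap is to pre-scale: raise the HPA floor (and, if your node groups have a minimum size, raise that too) far enough ahead of the launch that the extra pods and the nodes to hold them already exist when traffic arrives. A sketch reusing the my-app HPA from above, with an assumed floor of 8 replicas for the launch window:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 8       # raised from 2 before the event; revert once traffic normalizes
  maxReplicas: 30      # extra headroom for the spike (illustrative)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

The same change can be applied in place with kubectl patch hpa my-app --type merge -p '{"spec":{"minReplicas":8}}'. What matters is that the scale-up happens before the spike, when the 3-7 minute node provisioning delay is harmless.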

Monitoring Prometheus Itself: Capacity Planning, Self-Monitoring, and Scaling

Why Monitor Your Monitoring#

If Prometheus runs out of memory and crashes, you lose all alerting. If its disk fills up, it stops ingesting and you have a blind spot that may last hours before anyone notices. If scrapes start timing out, metrics go stale and alerts based on rate() produce no data (which means they silently stop firing rather than triggering). Prometheus must be the most reliably monitored component in your stack.
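
A practical starting point is a small rule group that alerts when Prometheus itself degrades, evaluated by Prometheus and, ideally, mirrored on a second Prometheus or an external checker. A sketch with illustrative names and thresholds (the metrics used, prometheus_tsdb_head_samples_appended_total and up, are standard):

groups:
  - name: prometheus-self-monitoring   # group name is illustrative
    rules:
      - alert: PrometheusNotIngestingSamples
        # No samples appended to the TSDB head: ingestion has stalled or stopped.
        expr: rate(prometheus_tsdb_head_samples_appended_total[10m]) <= 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus has stopped ingesting samples"
      - alert: PrometheusScrapesFailing
        # More than 10% of targets in a job are down or timing out (threshold is illustrative).
        expr: avg by (job) (up) < 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Over 10% of targets in {{ $labels.job }} are failing their scrapes"

These rules catch stalled ingestion and failing scrapes while Prometheus is still running; the out-of-memory crash described above can only be caught from outside, which is why the external check matters.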