---
title: "Scenario: Preparing for and Handling a Traffic Spike"
description: "Guide for proactively preparing for known traffic events and reactively handling unexpected traffic surges in Kubernetes, covering HPA tuning, node pre-scaling, load testing, and graceful degradation."
url: https://agent-zone.ai/knowledge/kubernetes/scenarios-scaling-for-traffic-spike/
section: knowledge
date: 2026-02-22
categories: ["kubernetes"]
tags: ["scaling","hpa","traffic","capacity-planning","load-testing","cluster-autoscaler","rate-limiting"]
skills: ["capacity-planning","autoscaling-configuration","load-testing","incident-response"]
tools: ["kubectl","k6","helm"]
levels: ["intermediate"]
word_count: 2051
formats:
  json: https://agent-zone.ai/knowledge/kubernetes/scenarios-scaling-for-traffic-spike/index.json
  html: https://agent-zone.ai/knowledge/kubernetes/scenarios-scaling-for-traffic-spike/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Scenario%3A+Preparing+for+and+Handling+a+Traffic+Spike
---


# Scenario: Preparing for and Handling a Traffic Spike

You are helping when someone says: "we have a big launch next week," "Black Friday is coming," or "traffic is suddenly 3x normal and climbing." These are two distinct problems -- proactive preparation for a known event and reactive response to an unexpected surge -- but they share the same infrastructure mechanics.

The key principle: Kubernetes autoscaling has latency. HPA takes 15-30 seconds to detect increased load and scale pods. Cluster Autoscaler takes 3-7 minutes to provision new nodes. If your traffic spike is faster than your scaling speed, users hit errors during the gap. Proactive preparation eliminates this gap. Reactive response minimizes it.

---

## Part A -- Proactive Preparation (Known Upcoming Spike)

You know the traffic is coming. Maybe it is a product launch, a marketing campaign, a seasonal event, or a planned load test. You have days or weeks to prepare.

### Step 1 -- Capacity Assessment

Start by understanding where you are and where you need to be.

```bash
# Current pod counts and resource usage
kubectl top pods -n production --sort-by=cpu
kubectl get hpa -n production

# Current node capacity and usage
kubectl top nodes
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_CAPACITY:.status.capacity.cpu,\
MEM_CAPACITY:.status.capacity.memory,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory
```

```promql
# Current requests per second (baseline)
sum(rate(http_requests_total[5m]))

# Current p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# CPU headroom: fraction of requested CPU that is not actually being used
1 - (
  sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))
  /
  sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"})
)
```

Now estimate the spike. A 2x spike is manageable with autoscaling alone. A 5x spike needs pre-scaling. A 10x spike needs pre-scaling plus architectural changes (caching, CDN, read replicas).
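
To turn that multiplier into a concrete pre-scale target, a rough calculation helps. The numbers below are hypothetical -- substitute a per-pod capacity you measured in a load test, not a guess.

```bash
# Hypothetical worked example: convert a spike estimate into a replica target.
BASELINE_RPS=200        # current steady-state traffic
PER_POD_RPS=50          # what one pod handles comfortably (measure this, do not guess)
SPIKE_MULTIPLIER=5
# 30% headroom for burstiness, rounded up: 200 * 5 * 1.3 / 50 = 26
TARGET_REPLICAS=$(( (BASELINE_RPS * SPIKE_MULTIPLIER * 13 / 10 + PER_POD_RPS - 1) / PER_POD_RPS ))
echo "Pre-scale target: ${TARGET_REPLICAS} replicas"
```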

### Identify Bottlenecks

The application tier is rarely the only bottleneck. Walk through the full request path:

- **Ingress controller**: does it have enough replicas? NGINX workers per pod?
- **Application pods**: can they scale fast enough? What is the startup time?
- **Database**: connection pool exhaustion is the most common failure point. If you have 10 pods with a pool of 20 connections each and you scale to 50 pods, that is 1,000 connections against a database configured for 200 (see the pool-sizing sketch after this list).
- **External APIs**: do they have rate limits you will hit at higher traffic?
- **Cache**: if cache is cold after scaling, the database takes the full load during warm-up.
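
A minimal sketch of that connection-pool math, sizing the per-pod pool for the HPA maximum rather than the current replica count. `DB_POOL_SIZE` is the same illustrative variable used later in this guide; your application may read a different name.

```bash
# Hedged sketch: keep total connections under the database limit even at maxReplicas.
DB_MAX_CONNECTIONS=200    # what the database is actually configured for
HPA_MAX_REPLICAS=50
POOL_PER_POD=$(( DB_MAX_CONNECTIONS / HPA_MAX_REPLICAS ))   # 200 / 50 = 4
# In practice, leave a margin for migrations, admin sessions, and background workers.
kubectl set env deployment/my-app -n production DB_POOL_SIZE=${POOL_PER_POD}
```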

### Step 2 -- Pre-Scale

Do not rely on autoscaling to respond in time. Pre-warm the infrastructure before the event.

```bash
# Increase HPA minimums to pre-warm pod count
# Current: minReplicas=3, you expect 5x traffic, so set minReplicas to 15
kubectl patch hpa my-app -n production \
  -p '{"spec":{"minReplicas":15}}'

# Pre-scale the ingress controller too
kubectl patch hpa ingress-nginx-controller -n ingress-system \
  -p '{"spec":{"minReplicas":4}}'

# Pre-scale the node pool (cloud-specific)
# AWS EKS:
aws eks update-nodegroup-config \
  --cluster-name production \
  --nodegroup-name main-pool \
  --scaling-config minSize=10,maxSize=30,desiredSize=15

# GKE:
gcloud container clusters resize production \
  --node-pool main-pool \
  --num-nodes 15

# AKS:
az aks nodepool scale \
  --resource-group my-rg \
  --cluster-name production \
  --name mainpool \
  --node-count 15
```

Pre-scale dependencies too:

```bash
# Database: increase connection pool or add read replicas
# Redis: scale up if using Redis for sessions or caching
kubectl scale statefulset redis -n production --replicas=3

# If using a managed database, increase instance size or add read replicas
# AWS RDS example:
aws rds create-db-instance-read-replica \
  --db-instance-identifier production-read-1 \
  --source-db-instance-identifier production-primary
```

### Step 3 -- Verify Scaling Works

Never trust autoscaling configuration without testing it. Run a load test at the expected traffic level.

```bash
# Using k6 for load testing
cat > loadtest.js << 'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp up
    { duration: '5m', target: 500 },   // hold at 5x
    { duration: '2m', target: 1000 },  // push to 10x
    { duration: '5m', target: 1000 },  // hold at 10x
    { duration: '3m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<500'],   // 99th percentile under 500ms
    http_req_failed: ['rate<0.01'],     // less than 1% errors
  },
};

export default function () {
  const res = http.get('https://my-app.example.com/api/health');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(0.1);
}
EOF

k6 run loadtest.js
```

While the test runs, watch scaling behavior in a separate terminal:

```bash
# Watch pod scaling
kubectl get pods -l app=my-app -n production -w

# Watch HPA decisions
kubectl describe hpa my-app -n production

# Watch node provisioning
kubectl get nodes -w

# Watch for scheduling failures
kubectl get events -n production --field-selector reason=FailedScheduling -w
```

What to look for:

- **Pod startup time**: if pods take 60 seconds to start and pass readiness, there is a 60-second window where scaling cannot keep up with a sudden spike (a quick way to measure this is sketched after this list).
- **Node provisioning time**: Cluster Autoscaler typically takes 3-7 minutes to provision new nodes. During this window, pods sit in Pending state.
- **Cascade failures**: scaling the app may overload the database, which causes app errors, which causes retries, which makes everything worse.
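
A quick, hedged way to check the startup-time concern above: compare each pod's creation timestamp with the time its Ready condition flipped. This assumes `jq` is installed and uses the same label and namespace as the other examples.

```bash
# Rough creation-to-Ready time per pod (includes scheduling, image pull, startup, readiness)
kubectl get pods -n production -l app=my-app -o json | jq -r '
  .items[] as $p
  | ($p.status.conditions[]? | select(.type == "Ready") | .lastTransitionTime) as $ready
  | [$p.metadata.name,
     (($ready | fromdateiso8601) - ($p.metadata.creationTimestamp | fromdateiso8601) | tostring) + "s"]
  | @tsv'
```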

### Step 4 -- During the Spike

Even with preparation, monitor actively during the event.

```bash
# Key metrics to watch
# Pod count and HPA status
kubectl get hpa -n production -w

# Error rate (should stay below threshold)
# Prometheus: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))

# Latency (should stay within SLA)
# Prometheus: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))

# Node utilization
kubectl top nodes
```

Be ready for manual intervention:

```bash
# If HPA is scaling too slowly, manually increase replicas
kubectl scale deployment my-app -n production --replicas=30

# If nodes are full and autoscaler is slow, add nodes manually
# AWS:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-main-pool-xxxx \
  --desired-capacity 20

# If database connections are running out, scale down non-critical workloads
kubectl scale deployment analytics-worker -n production --replicas=0
```

Watch for these failure modes during the spike:

- **OOMKilled pods**: memory limit is too low for the traffic volume (more connections = more memory). Increase the limit.
- **Database connection exhaustion**: more pods than the database can handle. Solution: reduce per-pod connection pool size, add PgBouncer or ProxySQL, or add read replicas.
- **Rate limiting from external services**: third-party APIs returning 429s. Solution: implement circuit breakers, add caching, or request rate limit increases in advance.

### Step 5 -- After the Spike

Do not scale down immediately. Traffic often has a long tail, and aggressive scale-down followed by another surge causes worse instability than maintaining extra capacity for a few hours.

```bash
# Reduce HPA minimums gradually over hours/days
# Day of event: minReplicas=15
# Day after: minReplicas=8
# Two days after: minReplicas=3 (back to normal)
kubectl patch hpa my-app -n production \
  -p '{"spec":{"minReplicas":8}}'

# Reduce node pool minimum
aws eks update-nodegroup-config \
  --cluster-name production \
  --nodegroup-name main-pool \
  --scaling-config minSize=5,maxSize=30,desiredSize=8
```

Post-event analysis:

- What was the actual peak traffic? How did it compare to the estimate?
- Did autoscaling engage? At what point?
- Were there any errors during the ramp-up period?
- What was the bottleneck? (usually the database)
- How much did the event cost in extra infrastructure?

---

## Part B -- Reactive Response (Unexpected Spike)

Traffic is surging right now and you did not plan for it. Time is critical.

### Step 1 -- Immediate Assessment (First 2 Minutes)

```bash
# Is HPA already scaling?
kubectl describe hpa my-app -n production
# Look for "ScalingActive" condition and recent events
# "New size: 8; reason: cpu resource utilization above target" = HPA is working
# "ScalingLimited" = HPA hit maxReplicas and cannot scale further

# Are pods in Pending state (node capacity exhausted)?
kubectl get pods -n production --field-selector=status.phase=Pending

# Spot-check recent logs for errors (use your error-rate dashboard for the actual number)
kubectl logs deployment/my-app -n production --tail=20
```

### Step 2 -- Immediate Scaling Actions

```bash
# If HPA exists but maxReplicas is too low, increase it
kubectl patch hpa my-app -n production \
  -p '{"spec":{"maxReplicas":50}}'

# If no HPA exists, manually scale
kubectl scale deployment my-app -n production --replicas=20

# If nodes are full, increase node pool size
# AWS:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-main-pool-xxxx \
  --desired-capacity 15

# GKE:
gcloud container clusters resize production \
  --node-pool main-pool \
  --num-nodes 15 --quiet
```

### Step 3 -- Protect the System (Graceful Degradation)

If the system cannot handle the full load even after scaling, shed non-critical load to protect core functionality.

```bash
# Enable rate limiting at the ingress level
# With NGINX Ingress, limit-rps and limit-connections apply per client IP
kubectl annotate ingress my-app -n production \
  nginx.ingress.kubernetes.io/limit-rps="100" \
  nginx.ingress.kubernetes.io/limit-connections="50" \
  --overwrite

# Scale down non-critical workloads to free resources
kubectl scale deployment analytics-worker -n production --replicas=0
kubectl scale deployment report-generator -n production --replicas=0
kubectl scale deployment email-sender -n production --replicas=1
```
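
If this situation recurs, a PriorityClass on customer-facing deployments lets the scheduler preempt low-priority batch pods automatically instead of relying on manual scale-downs. This is a hedged sketch -- the class name and value are illustrative, and because changing `priorityClassName` triggers a rollout, it is best applied ahead of time rather than mid-incident.

```bash
# Illustrative sketch: let customer-facing pods preempt batch workloads under pressure
kubectl create priorityclass customer-facing --value=100000 \
  --description="Customer-facing workloads; may preempt batch jobs during spikes"

# Note: this patch triggers a rolling restart of the deployment
kubectl patch deployment my-app -n production \
  -p '{"spec":{"template":{"spec":{"priorityClassName":"customer-facing"}}}}'
```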

If the database is the bottleneck (the most common case during unexpected spikes):

```bash
# Reduce per-pod connection pool size to stop connection exhaustion
# This requires a config change or environment variable update
kubectl set env deployment/my-app -n production \
  DB_POOL_SIZE=5  # Down from default of 20

# If you have a connection pooler like PgBouncer, scale it
kubectl scale deployment pgbouncer -n production --replicas=3

# Enable read replicas if available
kubectl set env deployment/my-app -n production \
  DB_READ_HOST=db-read-replica.internal
```

### Step 4 -- Scale Infrastructure Layer

If the Cluster Autoscaler is not fast enough (it takes 3-7 minutes), manually provision capacity:

```bash
# Check Cluster Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# If autoscaler is active but slow, increase node pool desired size directly
# This bypasses the autoscaler's deliberation time

# Check pending pods to understand how much capacity is needed
kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o json | \
  jq '{
    pending_pods: (.items | length),
    requested_cpu_millicores: ([.items[].spec.containers[].resources.requests.cpu]
      | map(. // "0")
      | map(if test("m$") then (rtrimstr("m") | tonumber) else (tonumber * 1000) end)
      | add // 0)
  }'
```

### HPA Configuration for Spike Readiness

For services that regularly face traffic spikes, configure HPA to scale up aggressively and scale down conservatively:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Lower target = earlier scaling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately, no waiting
      policies:
      - type: Percent
        value: 100                     # Double pod count per 60s if needed
        periodSeconds: 60
      - type: Pods
        value: 10                      # Or add up to 10 pods per 60s
        periodSeconds: 60
      selectPolicy: Max                # Use whichever allows faster scaling
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
      policies:
      - type: Percent
        value: 10                      # Remove at most 10% per 60s
        periodSeconds: 60
      selectPolicy: Min                # Use whichever is more conservative
```

Key choices explained:

- **60% CPU target instead of 70-80%**: leaves headroom for traffic bursts within the HPA check interval. If you target 80% and traffic doubles between checks, pods hit 160% (throttling) before HPA reacts.
- **stabilizationWindowSeconds: 0 for scale-up**: react immediately to increased load. Never delay scaling up.
- **stabilizationWindowSeconds: 600 for scale-down**: avoid thrashing. A 10-minute window means the HPA waits until the load has been consistently lower for 10 minutes before removing pods. This handles bursty traffic patterns.
- **selectPolicy: Max for scale-up, Min for scale-down**: asymmetric aggressiveness. Scale up as fast as possible, scale down as slowly as reasonable.
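
To get a feel for how fast this configuration reacts, the HPA's core formula (from the Kubernetes documentation) is desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). The numbers below are hypothetical.

```bash
# Hypothetical worked example: 10 pods suddenly at 150% of requested CPU, target 60%
CURRENT=10; UTILIZATION=150; TARGET=60
DESIRED=$(( (CURRENT * UTILIZATION + TARGET - 1) / TARGET ))   # ceil(10 * 150 / 60) = 25
echo "desired=${DESIRED}"
# The scaleUp policies cap each step: Percent 100% of 10 = +10 pods, Pods policy = +10 pods.
# selectPolicy: Max takes the larger, so replicas go 10 -> 20 in the first 60s window
# and reach the desired count on the next evaluation if load stays high.
```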

---

## Monitoring Queries for Traffic Spikes

These Prometheus queries give you the critical signals during a traffic event:

```promql
# Current requests per second vs 1 hour ago
sum(rate(http_requests_total[1m]))
/
sum(rate(http_requests_total[1m] offset 1h))
# Result > 2 means traffic has more than doubled

# Error rate (should stay below 1%)
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))

# p99 latency trend
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))

# Pod count vs HPA max (how close to ceiling)
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="my-app"}
/
kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="my-app"}
# Result > 0.8 means you are approaching the HPA ceiling

# Node allocatable vs requested (cluster headroom)
1 - (
  sum(kube_pod_container_resource_requests{resource="cpu"})
  /
  sum(kube_node_status_allocatable{resource="cpu"})
)
# Result < 0.1 means less than 10% cluster headroom -- new pods may go Pending
```

Set up alerts on these metrics before any planned event:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: traffic-spike-alerts
  namespace: monitoring
spec:
  groups:
  - name: traffic-spike
    rules:
    - alert: HPANearMaxReplicas
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas
        /
        kube_horizontalpodautoscaler_spec_max_replicas > 0.8
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.horizontalpodautoscaler }} is at {{ $value | humanizePercentage }} of max replicas"
    - alert: HighErrorRateDuringSpike
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[2m]))
        /
        sum(rate(http_requests_total[2m])) > 0.02
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Error rate is {{ $value | humanizePercentage }} during traffic spike"
    - alert: ClusterCapacityLow
      expr: |
        1 - (
          sum(kube_pod_container_resource_requests{resource="cpu"})
          /
          sum(kube_node_status_allocatable{resource="cpu"})
        ) < 0.15
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Cluster CPU headroom is only {{ $value | humanizePercentage }}"
```

The difference between surviving a traffic spike and going down during one usually comes down to preparation. Load test your scaling behavior before you need it. Know your bottleneck before traffic finds it for you.

