---
title: "Rate Limiting Implementation Patterns"
description: "Reference for rate limiting algorithms and implementation patterns. Covers token bucket, sliding window, and fixed window algorithms, Redis-based distributed rate limiting, API gateway rate limiting, application-level rate limiting, RateLimit headers, and graceful degradation strategies."
url: https://agent-zone.ai/knowledge/microservices/rate-limiting-patterns/
section: knowledge
date: 2026-02-22
categories: ["microservices"]
tags: ["rate-limiting","token-bucket","sliding-window","redis","api-gateway","throttling","backpressure","rate-limit-headers","distributed-systems"]
skills: ["rate-limiter-implementation","distributed-rate-limiting","api-protection","traffic-management"]
tools: ["redis","kong","nginx","envoy","express","go","python"]
levels: ["intermediate","advanced"]
word_count: 1821
formats:
  json: https://agent-zone.ai/knowledge/microservices/rate-limiting-patterns/index.json
  html: https://agent-zone.ai/knowledge/microservices/rate-limiting-patterns/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Rate+Limiting+Implementation+Patterns
---


# Rate Limiting Implementation Patterns

Rate limiting controls how many requests a client can make within a time period. It protects services from overload, ensures fair usage across clients, prevents abuse, and provides a mechanism for graceful degradation under load. Every production API needs rate limiting at some layer.

## Algorithm Comparison

### Fixed Window

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window. When the count exceeds the limit, reject requests until the next window starts.

```
Window: 12:00:00 - 12:00:59 | Limit: 100 requests
Request at 12:00:30: counter = 57, allowed
Request at 12:00:45: counter = 101, rejected (429)
Window resets at 12:01:00: counter = 0
```

**Implementation with Redis:**

```python
import redis
import time

r = redis.Redis()

def fixed_window_check(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}:{int(time.time()) // window_seconds}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()
    return count <= limit
```

**Problem:** The boundary burst. A client can send 100 requests at 12:00:59 and 100 more at 12:01:00, effectively making 200 requests in 2 seconds while staying within the 100/minute limit in both windows. This is the primary weakness of fixed window.

### Sliding Window Log

Maintain a log of every request timestamp. When a new request arrives, remove entries older than the window duration and count the remaining entries. If the count exceeds the limit, reject the request.

```python
import uuid

def sliding_window_log_check(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}"
    now = time.time()
    window_start = now - window_seconds

    pipe = r.pipeline()
    # Remove entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add current request; a unique member prevents two requests with the
    # same timestamp from overwriting each other in the sorted set
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    # Count entries in window
    pipe.zcard(key)
    # Set expiry on the whole key
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()

    return count <= limit
```

**Advantage:** No boundary burst problem. The window slides smoothly with time.

**Disadvantage:** Memory usage scales with request volume. For a high-traffic API, storing every timestamp is expensive. A client making 10,000 requests per minute requires 10,000 sorted set entries per client.

### Sliding Window Counter

A compromise between fixed window and sliding window log. Use two fixed windows and weight the counts based on how far into the current window the request arrives.

```
Previous window (12:00 - 12:01): 85 requests
Current window  (12:01 - 12:02): 20 requests so far
Current time: 12:01:15 (25% into current window)

Estimated count = (85 * 0.75) + 20 = 83.75
Limit: 100 -> allowed
```

```python
def sliding_window_counter_check(
    client_id: str, limit: int, window_seconds: int
) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = (now % window_seconds) / window_seconds

    prev_key = f"ratelimit:{client_id}:{previous_window}"
    curr_key = f"ratelimit:{client_id}:{current_window}"

    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(curr_key)
    pipe.expire(curr_key, window_seconds * 2)
    prev_count, curr_count, _ = pipe.execute()

    prev_count = int(prev_count or 0)
    weighted = prev_count * (1 - elapsed) + curr_count

    return weighted <= limit
```

**Advantage:** Smooths the boundary burst, uses only two counters per client (constant memory), and is simple to implement.

**Disadvantage:** The count is an approximation: it assumes requests in the previous window were evenly distributed across it. In practice the error is small, and the approximation is close enough for rate limiting purposes.

### Token Bucket

The token bucket adds tokens at a fixed rate up to a maximum capacity. Each request consumes one token. If no tokens are available, the request is rejected. This allows bursts up to the bucket capacity while enforcing an average rate.

```
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

t=0: bucket=10, request costs 1 -> bucket=9, allowed
t=0: 9 more requests -> bucket=0, allowed
t=0: request -> bucket=0, rejected (no tokens)
t=1: 2 tokens added -> bucket=2
t=1: request -> bucket=1, allowed
```

```python
def token_bucket_check(
    client_id: str, capacity: int, refill_rate: float
) -> bool:
    key = f"ratelimit:tb:{client_id}"
    now = time.time()

    # Atomic Lua script for the token bucket. In production, register it
    # once with r.register_script() to avoid resending the source on
    # every call.
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('hmget', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now

    -- Add tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)

    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end

    redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
    redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
    return allowed
    """
    result = r.eval(lua_script, 1, key, capacity, refill_rate, now)
    return result == 1
```

**Advantage:** Allows controlled bursts. A client that has been idle accumulates tokens and can burst up to the bucket capacity. This matches real-world traffic patterns better than strict windowed counting.

**Disadvantage:** Slightly more complex to implement correctly, especially in distributed systems. The Lua script approach ensures atomicity in Redis.

### Algorithm Selection Guide

| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor (boundary burst) | O(1) per client | Approximate | Simple internal rate limiting |
| Sliding Window Log | Excellent | O(n) per client | Exact | Low-volume, high-accuracy needs |
| Sliding Window Counter | Good | O(1) per client | Approximate | Production APIs (recommended default) |
| Token Bucket | Configurable burst | O(1) per client | Exact | APIs with burst allowance |

**For most production APIs:** Use the sliding window counter or token bucket. The sliding window counter is simpler and works well when you want a strict rate. The token bucket is better when you want to allow bursts.
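
Whichever algorithm you pick, the check function can be wired into application handlers with a thin decorator. A minimal sketch (the `check` parameter accepts any function with the `(client_id, limit, window) -> bool` signature used above; the stub check and the 429 payload shape are illustrative only):

```python
import functools
from typing import Callable

def rate_limited(check: Callable[[str, int, int], bool], limit: int, window: int):
    """Run a rate limit check before invoking the handler."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(client_id: str, *args, **kwargs):
            if not check(client_id, limit, window):
                # Caller maps this to an HTTP 429 response
                return {'status': 429, 'error': 'rate limit exceeded'}
            return fn(client_id, *args, **kwargs)
        return wrapper
    return decorator

# Stub check for illustration: allows the first `limit` calls per client
_counts: dict = {}

def stub_check(client_id: str, limit: int, window: int) -> bool:
    _counts[client_id] = _counts.get(client_id, 0) + 1
    return _counts[client_id] <= limit

@rate_limited(stub_check, limit=2, window=60)
def handler(client_id: str):
    return {'status': 200}
```

Swapping `stub_check` for one of the Redis-backed functions above changes nothing in the handler code.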

## Distributed Rate Limiting with Redis

In a distributed system with multiple API server instances, rate limiting must be centralized. Redis is the standard choice because it is fast (sub-millisecond operations), supports atomic operations (Lua scripts), and is widely deployed.

### Redis Considerations

**Latency:** Each rate limit check adds a Redis round-trip (typically 0.5-2ms on the same network). This is acceptable for most APIs but can be significant for ultra-low-latency paths.
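
Where that round-trip matters, one mitigation is a per-instance in-process token bucket as a first-pass filter, with the caveat that limits then apply per replica rather than globally. A sketch (class and parameter names are illustrative):

```python
import threading
import time

class LocalTokenBucket:
    """Per-process token bucket; no network round-trip.

    Limits apply per instance, so with N replicas the effective
    global limit is up to N times the configured rate.
    """
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```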

**Failure mode:** When Redis is unavailable, you have two options:
- **Fail open:** Allow all requests through. Protects availability but loses rate limiting during Redis outages.
- **Fail closed:** Reject all requests. Protects against abuse but causes total outage during Redis failures.

For most services, fail open is the right choice. Rate limiting is a protection mechanism, not a core business function.

```python
import logging

log = logging.getLogger(__name__)

def rate_limit_with_fallback(client_id: str, limit: int) -> bool:
    try:
        return sliding_window_counter_check(client_id, limit, 60)
    except (redis.ConnectionError, redis.TimeoutError):
        # Fail open: allow the request
        log.warning("Redis unavailable, skipping rate limit for %s", client_id)
        return True
```

**Redis Cluster:** For high availability, use Redis Sentinel or Redis Cluster. With Redis Cluster, ensure your rate limit keys for the same client hash to the same shard by using hash tags: `{client_id}:ratelimit`.
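
The hash tag is just part of the key string: Redis Cluster hashes only the text between `{` and `}` when assigning a slot. A sketch of the key construction (the `suffix` parameter is illustrative):

```python
def cluster_key(client_id: str, suffix: str) -> str:
    # Only the {client_id} portion is hashed for slot assignment, so all
    # of this client's rate limit keys land on the same shard
    return f"{{{client_id}}}:ratelimit:{suffix}"
```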

## API Gateway Rate Limiting

API gateways handle rate limiting before requests reach your application, reducing load on application servers.

### NGINX

```nginx
# Define rate limiting zones
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
limit_req_zone $http_x_api_key    zone=per_key:10m rate=100r/s;

server {
    location /api/ {
        # Allow burst of 20, process excess with delay
        limit_req zone=per_ip burst=20 delay=10;
        limit_req zone=per_key burst=50 nodelay;

        limit_req_status 429;

        proxy_pass http://backend;
    }
}
```

The `burst` parameter defines a queue. With `delay=10`, the first 10 excess requests are processed immediately, the next 10 are delayed, and beyond 20 excess requests are rejected. With `nodelay`, all burst requests are processed immediately but the burst bucket must refill at the base rate.
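
These semantics can be approximated with a toy leaky-bucket model (an intuition aid, not nginx's actual implementation): excess above the base rate accumulates per request, drains at the configured rate, and a request is rejected once the excess would exceed the burst.

```python
class LeakyBucket:
    """Toy model of nginx limit_req with nodelay semantics."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate      # requests per second
        self.burst = burst
        self.excess = 0.0
        self.last = None

    def allow(self, now: float) -> bool:
        if self.last is not None:
            # Excess drains at the configured base rate
            self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess + 1 > self.burst + 1:
            return False
        self.excess += 1
        return True

# 10 r/s with burst=20: 21 back-to-back requests pass, the 22nd is rejected
bucket = LeakyBucket(rate=10, burst=20)
results = [bucket.allow(0.0) for _ in range(22)]
```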

### Kong

```yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limit
plugin: rate-limiting
config:
  minute: 100
  hour: 5000
  policy: redis
  redis_host: redis.default.svc
  redis_port: 6379
  redis_database: 0
  fault_tolerant: true    # fail open on Redis errors
  hide_client_headers: false
```

### Envoy (Istio)

Envoy supports local rate limiting (per-pod) and global rate limiting (via an external rate limit service).

```yaml
# Istio EnvoyFilter for local rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-ratelimit
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 100
                fill_interval: 60s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value:
                  numerator: 100
                  denominator: HUNDRED
```

## Rate Limit Response Headers

The IETF RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) standardize how servers communicate rate limit status to clients. Implement these headers on every rate-limited endpoint.

```
HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 57
RateLimit-Reset: 1708632000

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1708632000
Retry-After: 30
```

- **RateLimit-Limit**: The maximum number of requests allowed in the window.
- **RateLimit-Remaining**: How many requests the client can still make in this window.
- **RateLimit-Reset**: When the window resets. The IETF draft specifies seconds until reset, though many deployed APIs send a Unix timestamp instead (as in the example above).
- **Retry-After**: Included on 429 responses. Tells the client how many seconds to wait before retrying.

```python
import time

from flask import Flask, jsonify, make_response, request

app = Flask(__name__)

@app.route('/api/data')
def get_data():
    client_id = request.headers.get('X-API-Key', request.remote_addr)
    limit = 100
    window = 60

    # check_rate_limit (not shown) returns (allowed, remaining, reset_at)
    allowed, remaining, reset_at = check_rate_limit(client_id, limit, window)

    if not allowed:
        resp = make_response(jsonify({'error': 'Rate limit exceeded'}), 429)
        resp.headers['Retry-After'] = str(int(reset_at - time.time()))
    else:
        resp = make_response(jsonify({'data': 'value'}))

    resp.headers['RateLimit-Limit'] = str(limit)
    resp.headers['RateLimit-Remaining'] = str(max(0, remaining))
    resp.headers['RateLimit-Reset'] = str(int(reset_at))
    return resp
```
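
On the client side, a well-behaved caller uses these headers to pace itself. A sketch, assuming `RateLimit-Reset` carries a Unix timestamp as in the example above (response headers modeled as a plain dict):

```python
import time

def backoff_seconds(status: int, headers: dict) -> float:
    """How long a client should wait before its next request.

    Prefers Retry-After on 429s; otherwise spreads the remaining quota
    evenly across the time left in the window.
    """
    if status == 429 and 'Retry-After' in headers:
        return float(headers['Retry-After'])
    remaining = int(headers.get('RateLimit-Remaining', 1))
    reset = int(headers.get('RateLimit-Reset', 0))
    window_left = max(0.0, reset - time.time())
    if remaining <= 0:
        # Quota exhausted: wait out the rest of the window
        return window_left
    return window_left / remaining
```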

## Graceful Degradation

Rate limiting should not be a binary allow/deny. Implement tiered degradation that maintains service quality for well-behaved clients while protecting the system.

### Priority-Based Rate Limiting

Assign different limits based on client tier:

```python
RATE_LIMITS = {
    'enterprise': {'requests_per_minute': 10000, 'burst': 500},
    'pro':        {'requests_per_minute': 1000,  'burst': 100},
    'free':       {'requests_per_minute': 60,    'burst': 10},
    'anonymous':  {'requests_per_minute': 10,    'burst': 5},
}
```

### Endpoint-Specific Limits

Not all endpoints cost the same. A search endpoint that hits the database is more expensive than a cached GET endpoint.

```python
ENDPOINT_COSTS = {
    'GET /api/users/{id}':   1,   # cheap: cached
    'GET /api/search':       5,   # expensive: database query
    'POST /api/reports':     20,  # very expensive: background job
}
```

Deduct the endpoint cost from the token bucket instead of always deducting 1. This prevents expensive endpoints from being called as frequently as cheap ones.
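
A sketch of cost-weighted deduction (in-process and without refill, to keep the focus on the weighting; the Lua token bucket above could take the cost as an extra `ARGV` in the same way):

```python
ENDPOINT_COSTS = {
    'GET /api/users/{id}':   1,   # cheap: cached
    'GET /api/search':       5,   # expensive: database query
    'POST /api/reports':     20,  # very expensive: background job
}

def consume(bucket: dict, endpoint: str, capacity: int = 100) -> bool:
    """Deduct the endpoint's cost from a simple token store.

    `bucket` is a mutable dict holding the remaining tokens; refill
    logic is omitted to keep the sketch focused on cost weighting.
    """
    cost = ENDPOINT_COSTS.get(endpoint, 1)  # unknown endpoints cost 1
    tokens = bucket.setdefault('tokens', capacity)
    if tokens >= cost:
        bucket['tokens'] = tokens - cost
        return True
    return False
```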

### Shedding Strategy Under Global Overload

When the system is under global load (not just one client exceeding limits), implement progressive shedding:

1. **Level 1 (80% capacity):** Reduce limits for anonymous/free-tier clients by 50%.
2. **Level 2 (90% capacity):** Reject all anonymous requests. Reduce free-tier to 10% of normal.
3. **Level 3 (95% capacity):** Reject all free-tier requests. Reduce pro-tier by 50%.
4. **Level 4 (99% capacity):** Serve only enterprise clients and health checks.

```python
def adjusted_rate_limit(client_tier: str, system_load: float) -> int:
    base_limit = RATE_LIMITS[client_tier]['requests_per_minute']

    if system_load < 0.8:
        return base_limit
    elif system_load < 0.9:
        if client_tier == 'anonymous':
            return base_limit // 2
        return base_limit
    elif system_load < 0.95:
        if client_tier == 'anonymous':
            return 0
        if client_tier == 'free':
            return base_limit // 10
        return base_limit
    else:
        if client_tier in ('anonymous', 'free'):
            return 0
        if client_tier == 'pro':
            return base_limit // 2
        return base_limit
```

## Practical Checklist for Agents

When implementing rate limiting for a service:

1. **Choose the algorithm.** Sliding window counter for strict rate enforcement, token bucket for burst-friendly APIs.
2. **Choose the enforcement layer.** API gateway for external-facing rate limits, application-level for business logic rate limits (e.g., per-user actions).
3. **Use Redis** for distributed rate limiting with Lua scripts for atomicity.
4. **Always return RateLimit headers** on every response, not just 429s. Clients need to know their remaining quota.
5. **Include Retry-After** on 429 responses so well-behaved clients back off correctly.
6. **Fail open** when Redis is unavailable unless the service handles payments or security-critical operations.
7. **Log rate limit events** with client ID, endpoint, and current count for debugging and capacity planning.
8. **Test with load generators** (k6, wrk, vegeta) to verify limits work under actual concurrent traffic.

