---
title: "DNS Failover Patterns: TTL Tradeoffs, Health Check Design, and Real-World Failover Timing"
description: "Practical guide to DNS failover covering TTL tradeoffs between propagation speed and DNS load, health check design principles, weighted DNS for blue-green deployments, Route53 and Cloudflare failover configuration, client-side DNS caching gotchas, and why real-world failover is never as fast as you think."
url: https://agent-zone.ai/knowledge/infrastructure/dns-failover-patterns/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["dns","failover","ttl","health-checks","route53","cloudflare","blue-green","disaster-recovery","dns-caching"]
skills: ["dns-failover-design","health-check-configuration","traffic-management","disaster-recovery"]
tools: ["dig","route53","cloudflare","terraform","curl"]
levels: ["intermediate","advanced"]
word_count: 1516
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/dns-failover-patterns/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/dns-failover-patterns/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=DNS+Failover+Patterns%3A+TTL+Tradeoffs%2C+Health+Check+Design%2C+and+Real-World+Failover+Timing
---


## DNS Is Not a Load Balancer

This needs to be said upfront: DNS was designed for name resolution, not traffic management. Using DNS for failover is a pragmatic hack that works well enough for most use cases, but it has fundamental limitations.

DNS responses are cached at multiple levels (recursive resolvers, OS caches, application caches, browser caches). You cannot force a client to re-resolve. You can set a TTL, but clients and resolvers are free to ignore it (and some do). Java applications, for example, cache DNS indefinitely by default in some JVM versions unless you explicitly set `networkaddress.cache.ttl`.

When a region fails and you update DNS, the actual failover timeline looks like this:

```
0:00  Region A fails
0:30  Health check detects failure (3 checks x 10s interval)
0:31  DNS record updated to point to Region B
0:31 - 5:31  Clients with cached DNS continue hitting dead Region A
             New clients resolve to Region B immediately
5:31  Most clients have refreshed DNS (assuming 300s TTL)
~10:00  Stragglers with aggressive caching finally switch

Real-world failover: 30 seconds detection + up to 5 minutes propagation
                     = 5-6 minutes total for most users
```
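The arithmetic behind that timeline is worth keeping handy. A minimal sketch (the function name is illustrative, not from any library):

```python
def failover_window(check_interval_s, failure_threshold, ttl_s):
    """Worst-case seconds until (failure detected, most clients moved).

    Detection waits for consecutive failed health checks; after the DNS
    update, cached answers persist for up to one full TTL.
    """
    detection = check_interval_s * failure_threshold
    return detection, detection + ttl_s

# 10s checks, 3 failures to trip, 300s TTL: detected at 30s,
# most clients on the new region within ~330s (about 5.5 minutes)
print(failover_window(10, 3, 300))
```

Note what this leaves out: stragglers with aggressive client-side caching, which can push the tail well past the TTL.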

## The TTL Tradeoff

TTL controls how long DNS responses are cached. Lower TTL means faster failover but more DNS queries.

**300 seconds (5 minutes)** -- the standard default. A good balance for most services. Failover takes up to 5 minutes after DNS is updated. DNS query volume is moderate. Every resolver caches for 5 minutes, so your authoritative servers handle roughly `(total users / resolver sharing ratio) / 300` queries per second.

**60 seconds (1 minute)** -- aggressive but practical. Used when you need faster failover. Increases DNS query volume by 5x compared to 300s. Most managed DNS providers (Route53, Cloudflare, Cloud DNS) handle this without issue. Some older enterprise resolvers may clamp TTLs to a minimum of 300s regardless of what you set.

**30 seconds** -- the practical floor. Below this, DNS query volume becomes significant, and many resolvers and clients ignore TTLs this low. You are also approaching the health check detection interval -- if your health check takes 30 seconds to declare failure, a 30-second TTL does not help because the DNS has not been updated yet.

**5-10 seconds** -- theoretically possible, practically dangerous. The query volume is enormous, resolver behavior is unpredictable, and the improvement over 30s is marginal since health check detection is the bottleneck, not DNS propagation.

```
TTL    Failover Time     Queries/hour     DNS Cost (Route53)
       (after detection)  (per 1M users)
───────────────────────────────────────────────────────────
300s   up to 5 min       ~12,000          ~$0.005/hour
60s    up to 1 min       ~60,000          ~$0.024/hour
30s    up to 30 sec      ~120,000         ~$0.048/hour
10s    up to 10 sec      ~360,000         ~$0.144/hour
```

Route53 charges $0.40 per million queries. For 1 million users at 60s TTL, that is roughly $0.024/hour or ~$17/month just for DNS. At 300s TTL, it is ~$3.50/month. The cost difference is real but rarely the deciding factor.
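As a sanity check on those numbers, here is the cost arithmetic as a sketch, assuming roughly 1,000 shared resolvers per million users (the ratio implied by the table above):

```python
ROUTE53_PRICE_PER_MILLION = 0.40  # USD, standard query pricing

def dns_query_cost(resolvers, ttl_s, hours=730):
    """Approximate monthly Route53 query cost for one record.

    Each resolver re-queries once per TTL window, so query volume
    scales as 3600 / ttl per resolver per hour.
    """
    queries_per_hour = resolvers * 3600 / ttl_s
    hourly = queries_per_hour / 1_000_000 * ROUTE53_PRICE_PER_MILLION
    return hourly * hours

# ~1,000 resolvers (1M users) at 60s TTL -> roughly $17.50/month
print(round(dns_query_cost(1000, 60), 2))
```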

**Pre-failover TTL reduction**: a common pattern is to run with a 300s TTL normally and reduce it to 60s before planned maintenance or deployments. Lower the TTL, wait at least one full TTL period (5 minutes for the old records to expire), then perform the change. This gives you faster failover during the risky period without paying the DNS query cost all the time.
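On Route53, the TTL reduction is a single UPSERT of the existing record with a lower TTL. A minimal boto3 sketch, with placeholder zone ID and IP; note this applies to plain records only, since the alias records used later in this article carry no TTL of their own:

```python
def ttl_change_batch(name, value, ttl=60):
    """Build a Route53 ChangeBatch that UPSERTs an A record with a new TTL."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": value}],
            },
        }]
    }

# Applying it (zone ID and IP are placeholders):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z123EXAMPLE",
#     ChangeBatch=ttl_change_batch("app.example.com.", "203.0.113.10"),
# )
```

Remember to raise the TTL back to 300s once the maintenance window closes.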

## Health Check Design

The health check determines when failover triggers. A poorly designed health check causes false positives (unnecessary failovers) or false negatives (failure not detected).

**What to check**: the health check should validate that the application can serve real user requests. A TCP port check only proves the port is open. An HTTP check to `/healthz` that returns 200 without checking dependencies only proves the web server is running. A meaningful health check verifies the full request path:

```python
from flask import Flask, jsonify

app = Flask(__name__)
# db and redis_client are assumed to be initialized elsewhere in the app

@app.route('/healthz')
def health():
    checks = {}
    try:
        db.execute("SELECT 1")   # verifies database connectivity end to end
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "fail"
        return jsonify(checks), 503

    try:
        redis_client.ping()      # verifies cache connectivity
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "fail"
        return jsonify(checks), 503

    return jsonify(checks), 200
```

**Where to check from**: health checks should originate from multiple locations, not just one. Route53 health checks come from checkers in multiple AWS regions. Cloudflare checks from multiple PoPs. If you check from only one location, a network partition between the checker and your service looks like a failure even though users in other locations can still reach you.

**How often**: every 10-30 seconds is typical. Route53 supports 10s or 30s intervals. More frequent checking detects failures faster but generates more load on your health check endpoint. For a health check that queries the database, every 10 seconds from 8 Route53 health checker regions means 48 database queries per minute just for health checks.

**Failure threshold**: how many consecutive failures constitute a real outage. Route53 defaults to 3 failures. At 10-second intervals, that is 30 seconds to detect a failure. At 30-second intervals, 90 seconds. Setting the threshold to 1 causes flapping on transient network blips.

```
Detection time = interval * failure_threshold
Route53 (10s interval, 3 failures) = 30 seconds
Route53 (30s interval, 3 failures) = 90 seconds
Cloudflare (60s interval, 2 failures) = 120 seconds
```

## Route53 Failover Routing

Route53 failover routing uses a primary/secondary record pair. Traffic goes to primary when healthy, secondary when primary fails.

```hcl
resource "aws_route53_health_check" "primary" {
  fqdn               = "primary.app.example.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/healthz"
  failure_threshold  = 3
  request_interval   = 10
  regions            = ["us-east-1", "eu-west-1", "ap-southeast-1"]
}

resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
```

Note: the secondary record does not need its own health check unless you have a third failover target. If the secondary is also down, Route53 returns the primary record regardless (better to try a potentially failing endpoint than return nothing).
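The fail-open behavior is easy to mis-remember, so here it is spelled out as a few lines of Python (a model of the behavior described above, not Route53's actual implementation):

```python
def failover_answer(primary_healthy, secondary_healthy):
    """Which record a primary/secondary failover policy serves."""
    if primary_healthy:
        return "primary"
    if secondary_healthy:
        return "secondary"
    # Fail open: with no healthy record, serving the (possibly dead)
    # primary beats returning nothing at all.
    return "primary"

print(failover_answer(False, True))   # secondary takes over
print(failover_answer(False, False))  # fail open to primary
```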

## Cloudflare Failover Pools

Cloudflare uses origin pools with priority-based failover. The load balancer routes to the highest-priority healthy pool.

```
Pool configuration:
  Pool 1 (primary):   origin-us-east.app.example.com  (priority 1)
  Pool 2 (secondary): origin-eu-west.app.example.com  (priority 2)

Monitor:
  Type: HTTPS
  Path: /healthz
  Interval: 60 seconds
  Retries: 2
  Expected codes: 200
  Regions: ["WNAM", "ENAM", "WEU"]  # check from 3 regions

Steering: "off" (priority-based failover)
```

Cloudflare's advantage is that failover is handled at the edge, inside their anycast network. The Cloudflare PoP resolving the DNS query knows the health status immediately -- there is no separate DNS update step. When a pool fails health checks, the next DNS query from any PoP returns the secondary pool. This is faster than traditional DNS failover because there is no propagation delay between health check failure detection and DNS response change.

## Weighted DNS for Blue-Green Deployments

Weighted routing enables gradual traffic shifting between two deployments. Unlike hard cutover, you can shift 1% of traffic to the new version, verify, then increase.

```hcl
resource "aws_route53_record" "blue" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "blue"

  weighted_routing_policy {
    weight = 90
  }

  alias {
    name    = aws_lb.blue.dns_name
    zone_id = aws_lb.blue.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "green" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "green"

  weighted_routing_policy {
    weight = 10
  }

  alias {
    name    = aws_lb.green.dns_name
    zone_id = aws_lb.green.zone_id
    evaluate_target_health = true
  }
}
```

The catch: weighted DNS does not give you precise traffic splitting. Because of DNS caching, a single resolver serving thousands of users caches one answer and sends all its users to that backend until the TTL expires. The weight ratio holds statistically over many resolvers, but any individual resolver sends 100% of its traffic to one backend. For precise traffic splitting, use an application-level load balancer (ALB weighted target groups, Istio, Envoy).
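A toy simulation makes the caveat concrete, assuming each resolver caches exactly one weighted answer per TTL window (names and numbers are illustrative):

```python
import random

def realized_blue_share(resolvers, blue_weight=90, green_weight=10, seed=42):
    """Fraction of resolvers (and thus all of their users) that landed on blue.

    Each resolver caches a single answer, so its users follow that one pick;
    the split is all-or-nothing per resolver and only approaches the
    configured 90/10 ratio across many independent resolvers.
    """
    rng = random.Random(seed)
    total = blue_weight + green_weight
    blue_picks = sum(1 for _ in range(resolvers)
                     if rng.random() < blue_weight / total)
    return blue_picks / resolvers

# Few resolvers: the realized split can land far from the configured weights.
# Many resolvers: it converges toward 90/10.
print(realized_blue_share(10), realized_blue_share(10_000))
```

This is why a 1% canary weight behind a handful of corporate resolvers can receive 0% or several percent of actual traffic.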

## Client-Side DNS Caching Gotchas

The DNS TTL you set is a suggestion, not a command. Actual caching behavior varies:

**Java**: the JVM caches successful DNS lookups forever by default when a SecurityManager is installed (common in enterprise environments). Set `networkaddress.cache.ttl=60` in `java.security` or via the `-Dsun.net.inetaddr.ttl=60` system property.
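A minimal `java.security` fragment for the cache-TTL fix above:

```properties
# java.security -- cache successful lookups for 60s instead of forever
networkaddress.cache.ttl=60
# bound caching of failed (negative) lookups as well; the JDK default is 10
networkaddress.cache.negative.ttl=10
```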

**Browsers**: Chrome caches DNS for up to 60 seconds regardless of TTL. Firefox follows TTL but has its own minimum. Safari follows the OS resolver.

**Operating systems**: macOS and Windows cache DNS at the OS level. `dscacheutil -flushcache` on macOS, `ipconfig /flushdns` on Windows. Linux generally does not have an OS-level DNS cache unless systemd-resolved is running.

**Corporate resolvers**: enterprise DNS resolvers sometimes enforce minimum TTLs (often 300s or higher) regardless of the authoritative TTL. You cannot control this.

**CDN and proxy layers**: if your application sits behind a CDN that caches DNS for the origin, the CDN's resolver behavior matters, not the end user's. CloudFront resolves origins every 60 seconds regardless of TTL.

The practical implication: even with a 60-second TTL, plan for 5-10% of traffic to take 5+ minutes to fail over due to aggressive client-side caching. Design your application to handle requests arriving at a failed region gracefully -- return a redirect, a static error page, or a "try again" message rather than a connection timeout.

