Database High Availability Patterns

Database High Availability Patterns#

Every database HA decision starts with two numbers: RPO (Recovery Point Objective – how much data you can afford to lose) and RTO (Recovery Time Objective – how long the database can be unavailable). These numbers dictate the pattern, and each pattern carries specific operational tradeoffs.

Core Concepts#

RPO = 0 means zero data loss. Every committed transaction must survive a failure. This requires synchronous replication, which adds latency to every write.

HAProxy Configuration and Operations

Configuration File Structure#

HAProxy uses a single configuration file, typically /etc/haproxy/haproxy.cfg. The configuration is divided into four sections: global (process-level settings), defaults (default values for all proxies), frontend (client-facing listeners), and backend (server pools).

global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 50000
    user haproxy
    group haproxy
    daemon
    stats socket /run/haproxy/admin.sock mode 660 level admin
    tune.ssl.default-dh-param 2048

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    timeout http-request 10s
    timeout http-keep-alive 10s
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http

The maxconn in the global section sets the hard limit on total simultaneous connections. The timeout values in defaults apply to all frontends and backends unless overridden. The stats socket directive enables the runtime API, which is essential for operational management.

PostgreSQL Replication

PostgreSQL Replication#

Streaming replication gives you a full binary copy for high availability and read scaling. Logical replication gives you selective table-level syncing between databases that can run different PostgreSQL versions.

Streaming Replication Setup#

Configure the Primary#

# postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB

Create a replication role and allow connections:

CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'repl-secret';
# pg_hba.conf
host  replication  replicator  10.0.0.0/8  scram-sha-256

Initialize the Standby#

sudo systemctl stop postgresql-16
sudo rm -rf /var/lib/postgresql/16/main/*
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/16/main \
  --checkpoint=fast --wal-method=stream -R -P
sudo chown -R postgres:postgres /var/lib/postgresql/16/main
sudo systemctl start postgresql-16

The -R flag creates standby.signal and writes connection info to postgresql.auto.conf. The standby now continuously receives and replays WAL from the primary, accepting read-only queries by default.

Monitoring Prometheus Itself: Capacity Planning, Self-Monitoring, and Scaling

Why Monitor Your Monitoring#

If Prometheus runs out of memory and crashes, you lose all alerting. If its disk fills up, it stops ingesting and you have a blind spot that may last hours before anyone notices. If scrapes start timing out, metrics go stale and alerts based on rate() produce no data (which means they silently stop firing rather than triggering). Prometheus must be the most reliably monitored component in your stack.

Pod Affinity and Anti-Affinity: Co-locating and Spreading Workloads

Pod Affinity and Anti-Affinity#

Node affinity controls which nodes a pod can run on. Pod affinity and anti-affinity go further – they control whether a pod should run near or away from other specific pods. This is how you co-locate a frontend with its cache for low latency, or spread database replicas across failure domains for high availability.

Pod Affinity: Schedule Near Other Pods#

Pod affinity tells the scheduler “place this pod in the same topology domain as pods matching a label selector.” The topology domain is defined by topologyKey – it could be the same node, the same zone, or any other node label.

Pod Topology Spread Constraints: Even Distribution Across Failure Domains

Pod Topology Spread Constraints#

Pod anti-affinity gives you binary control: either a pod avoids another pod’s topology domain or it does not. But it does not give you even distribution. If you have 6 replicas and 3 zones, anti-affinity cannot express “put exactly 2 in each zone.” Topology spread constraints solve this by letting you specify the maximum allowed imbalance between any two topology domains.

How Topology Spread Works#

A topology spread constraint defines: