Temporal High Availability: Multi-Component Cluster on Kubernetes

Temporal High Availability#

A single-replica Temporal deployment works for development, but any pod going down takes the workflow engine offline. This guide configures a multi-replica cluster with proper resource allocation, Elasticsearch visibility, and health monitoring.

For the single-replica setup this builds on, see Running Temporal Server on Minikube.

Why HA Matters#

ComponentWhat Breaks When It Goes Down
FrontendNo client can start, signal, query, or cancel workflows. Workers cannot poll.
HistoryRunning workflows stall. No state transitions. Timers do not fire.
MatchingTasks queue up but never dispatch. Workflows appear frozen.
WorkerInternal system workflows stop (archival, replication). Application workflows unaffected.

With multiple replicas, losing a pod triggers a brief rebalance (seconds), not an outage.

Choosing a Log Aggregation Stack: Loki vs Elasticsearch vs CloudWatch Logs vs Vector+ClickHouse

Choosing a Log Aggregation Stack#

Logs are the most fundamental observability signal. Every application produces them, every incident investigation starts with them, and every compliance framework requires retaining them. The challenge is not collecting logs – it is storing, indexing, querying, and retaining them at scale without spending a fortune.

The choice of log aggregation stack determines your query speed, operational burden, storage costs, and how effectively you can correlate logs with metrics and traces during incident response.

Log Analysis and Management Strategies: Structured Logging, Aggregation, Retention, and Correlation

The Decision Landscape#

Log management is deceptively simple on the surface – applications write text, you store it, you search it later. In practice, every decision in the log pipeline involves tradeoffs between cost, query speed, retention depth, operational complexity, and correlation with other observability signals. This guide provides a framework for making those decisions based on your actual requirements rather than defaults or trends.

Structured Logging: The Foundation#

Before choosing any aggregation tool, standardize on structured logging. Unstructured logs are human-readable but machine-hostile. Structured logs are both.

Elasticsearch and OpenSearch: Indexing, Queries, Cluster Management, and Performance

Elasticsearch and OpenSearch: Indexing, Queries, Cluster Management, and Performance#

Elasticsearch and OpenSearch are distributed search and analytics engines built on Apache Lucene. They excel at full-text search, log aggregation, metrics storage, and any workload that benefits from inverted indices. Understanding index design, mappings, query mechanics, and cluster management separates a working setup from one that collapses under production load.

Elasticsearch vs OpenSearch#

OpenSearch is the AWS-maintained fork of Elasticsearch, created after Elastic changed its license from Apache 2.0 to the Server Side Public License (SSPL) in early 2021. For the vast majority of use cases, the two are interchangeable. APIs are compatible, concepts are identical, and most tooling works with both. OpenSearch Dashboards replaces Kibana. This guide applies to both unless explicitly noted.