---
title: "Temporal Cross-Cluster Communication: Architecture and Patterns"
description: "Evaluate three approaches for cross-cluster Temporal communication: namespace replication, worker bridges, and workflow-level coordination via signals or external brokers."
url: https://agent-zone.ai/knowledge/workflow-orchestration/temporal-cross-cluster-communication/
section: knowledge
date: 2026-02-22
categories: ["workflow-orchestration"]
tags: ["temporal","cross-cluster","multi-region","namespace-replication","worker-bridge","distributed-workflows","architecture"]
skills: ["cross-cluster-architecture","multi-region-workflow-design","namespace-replication","bridge-pattern-design"]
tools: ["temporal","go"]
levels: ["advanced"]
word_count: 2011
formats:
  json: https://agent-zone.ai/knowledge/workflow-orchestration/temporal-cross-cluster-communication/index.json
  html: https://agent-zone.ai/knowledge/workflow-orchestration/temporal-cross-cluster-communication/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Temporal+Cross-Cluster+Communication%3A+Architecture+and+Patterns
---


# Temporal Cross-Cluster Communication

When you operate multiple Temporal clusters -- whether for regional deployment, compliance isolation, or blast radius reduction -- workflows in one cluster eventually need to trigger work in another. This article examines three architectural approaches for cross-cluster communication, their tradeoffs, and guidance on choosing the right one for your situation.

This is an architecture guide. It establishes the concepts and patterns. The next article, [Building a Worker Bridge](../temporal-cross-cluster-worker-bridge/), provides the full implementation.

All diagrams and code snippets reference the companion repository at [github.com/statherm/temporal-examples](https://github.com/statherm/temporal-examples) in the `cross-cluster/` directory.

## The Cross-Cluster Problem

Consider a concrete scenario. Your organization runs two Temporal clusters:

- **Cluster A** (us-east): Handles order processing workflows. Has access to the payment gateway and order database.
- **Cluster B** (eu-west): Handles fulfillment workflows. Has access to the EU warehouse systems and shipping APIs.

An order placed by an EU customer must be processed in Cluster A (payment) and fulfilled in Cluster B (warehouse). The order workflow in Cluster A needs to trigger fulfillment work in Cluster B and eventually receive the shipping confirmation.

This is not a hypothetical edge case. Any organization with data residency requirements, regional deployments, or separate infrastructure teams will encounter this problem.

The three approaches differ in where they place the complexity: inside Temporal, in a bridge component, or in the workflow logic itself.

## Approach 1: Namespace Replication

Temporal's built-in multi-cluster replication synchronizes namespace data across clusters. One cluster is active (accepts writes), and one or more are standby (read-only replicas). Failover promotes a standby to active.

### How It Works

Namespace replication continuously streams workflow events from the active cluster to standby clusters. When you start a workflow on the active cluster, its history events are replicated asynchronously to standby clusters. If the active cluster goes down, a standby can be promoted to active, and workflows resume from the last replicated state.

```
# Configure a namespace for multi-cluster replication.
# This is done with the Temporal CLI or admin API, not in application
# code. Exact flags vary by CLI version; check
# `temporal operator namespace create --help` against your cluster.
temporal operator namespace create \
  --namespace orders \
  --global \
  --cluster cluster-a \
  --cluster cluster-b \
  --active-cluster cluster-a
```

### Pros

- **Built into Temporal.** No additional infrastructure to build or maintain.
- **Transparent to workflows.** Workflow code does not change. Replication happens at the infrastructure level.
- **Automatic failover.** Temporal handles promoting standby clusters when the active goes down.

### Cons

- **Operationally heavy without a managed offering.** Open-source Temporal supports multi-cluster replication, but configuring and operating it yourself is significantly more complex than on Temporal Cloud.
- **All-or-nothing per namespace.** You replicate an entire namespace or nothing. You cannot selectively replicate specific workflows.
- **Active-passive model.** Only one cluster accepts writes for a given namespace at a time. This is not active-active.
- **Replication lag.** Events take time to propagate. If the active cluster fails, some recent events may be lost (the replication lag window).
- **Same Temporal version required.** All clusters in a replication group must run compatible Temporal Server versions.

### When to Use

Namespace replication is the right choice when your primary goal is disaster recovery or regional failover. It is not designed for the "run part of the workflow here, part there" scenario: it replicates workflows; it does not distribute them.

## Approach 2: Worker Bridge

A worker bridge is a component that connects two independent clusters. It polls one cluster for tasks and executes them using the other cluster's resources.

### Architecture

```
Cluster A (us-east)              Cluster B (eu-west)
┌──────────────────────┐        ┌──────────────────────┐
│ Temporal Server A    │        │ Temporal Server B    │
│                      │        │                      │
│ OrderWorkflow        │        │ FulfillmentWorkflow  │
│ ┌──────────────────┐ │        │                      │
│ │ Task Queue:      │ │ polls  │ ┌──────────────────┐ │
│ │ bridge-to-eu     │◄─────────│ │ Bridge Worker    │ │
│ │                  │ │        │ │                  │ │
│ │ Activities:      │ │        │ │ Connects to:     │ │
│ │ - StartRemote    │ │        │ │ - Server A       │ │
│ │ - WaitRemote     │ │        │ │   (poll tasks)   │ │
│ │ - QueryRemote    │ │        │ │ - Server B       │ │
│ └──────────────────┘ │        │ │   (execute work) │ │
│                      │        │ └──────────────────┘ │
└──────────────────────┘        └──────────────────────┘
```

The bridge worker lives in Cluster B's environment. It holds two Temporal client connections:

1. **Client A**: Connected to Cluster A's Temporal Server. The bridge registers as a worker on Cluster A's `bridge-to-eu` task queue.
2. **Client B**: Connected to Cluster B's Temporal Server. The bridge uses this client to start workflows, run activities, or access local resources in Cluster B.

When the OrderWorkflow in Cluster A needs fulfillment, it dispatches an activity on the `bridge-to-eu` task queue. The bridge worker picks it up, starts a FulfillmentWorkflow on Cluster B using Client B, waits for it to complete, and returns the result to Cluster A.

### Pros

- **Works with open-source Temporal.** No enterprise features required.
- **Fine-grained control.** You decide exactly which activities cross the cluster boundary.
- **Different resource pools.** Each cluster can have different hardware, network access, and security posture.
- **Independent scaling.** Clusters and the bridge scale independently.

### Cons

- **Additional infrastructure.** The bridge is another component to deploy, monitor, and maintain.
- **The bridge is a potential single point of failure.** If the bridge worker goes down, cross-cluster work stops. Mitigate with multiple replicas.
- **Added latency.** Cross-cluster calls add network round-trip time plus the overhead of starting/monitoring a remote workflow.
- **Operational complexity.** You must manage connections to two clusters, handle version skew, and monitor bridge health.

### Key Design Decisions

When building a worker bridge, you need to decide:

**Activity vs. workflow on the remote side.** Does the bridge start a full workflow on Cluster B, or does it execute a single activity? Full workflows give Cluster B's Temporal the durability guarantees. Single activities are simpler but lose durability for complex operations.

**Synchronous vs. asynchronous bridging.** Does the bridge wait for the remote work to complete (blocking the activity), or does it fire-and-forget and use a callback? Synchronous is simpler. Asynchronous allows the calling workflow to do other work while waiting.

**Error propagation.** How do errors in Cluster B surface in Cluster A? The bridge must translate remote errors into something the calling workflow can understand and handle.

### When to Use

The worker bridge is the right choice when you need workflows in one cluster to execute specific work in another cluster using open-source Temporal. It is the most flexible approach and the one we implement in detail in the [next article](../temporal-cross-cluster-worker-bridge/).

## Approach 3: Workflow-Level Coordination

Instead of a dedicated bridge component, each cluster runs independent workflows that communicate through an external channel -- signals, a message broker (Kafka, NATS), or a shared database.

### Signal-Based Coordination

If the clusters can reach each other's Temporal Frontend services (see [Docker Network Bridging](../temporal-multi-cluster-minikube/)), an activity can hold a client dialed against the target cluster and call `SignalWorkflow` on it directly. Note that `workflow.SignalExternalWorkflow` only reaches workflows on the same cluster, so the cross-cluster call must happen in activity code. When direct connectivity is not available, route the request through a broker and deliver the reply as a signal, as the example below does:

```go
// In Cluster A's workflow: coordinate with Cluster B via external broker
func OrderWorkflowWithExternalCoordination(ctx workflow.Context, order Order) error {
    // Activities require a timeout; set options once for the workflow.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 5 * time.Minute,
    })

    // Process payment locally
    err := workflow.ExecuteActivity(ctx, ProcessPayment, order).Get(ctx, nil)
    if err != nil {
        return err
    }

    // Publish fulfillment request to shared broker
    err = workflow.ExecuteActivity(ctx, PublishFulfillmentRequest, FulfillmentRequest{
        OrderID:     order.ID,
        CallbackKey: workflow.GetInfo(ctx).WorkflowExecution.ID,
        Items:       order.Items,
        Destination: order.ShippingAddress,
    }).Get(ctx, nil)
    if err != nil {
        return err
    }

    // Wait for fulfillment completion signal (from Cluster B's workflow)
    ch := workflow.GetSignalChannel(ctx, "fulfillment-complete")
    var result FulfillmentResult

    // Timeout: fulfillment should complete within 48 hours
    timerCtx, cancel := workflow.WithCancel(ctx)
    selector := workflow.NewSelector(ctx)
    selector.AddReceive(ch, func(c workflow.ReceiveChannel, more bool) {
        c.Receive(ctx, &result)
        cancel()
    })
    selector.AddFuture(workflow.NewTimer(timerCtx, 48*time.Hour), func(f workflow.Future) {
        result.Status = "timeout"
    })
    selector.Select(ctx)

    if result.Status == "timeout" {
        return workflow.ExecuteActivity(ctx, EscalateFulfillmentTimeout, order.ID).Get(ctx, nil)
    }

    return workflow.ExecuteActivity(ctx, SendShippingConfirmation, result).Get(ctx, nil)
}
```

```go
// In Cluster B: a consumer workflow picks up the fulfillment request
func FulfillmentWorkflow(ctx workflow.Context, req FulfillmentRequest) error {
    // Activities require a timeout; set options once for the workflow.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute,
    })

    var result FulfillmentResult

    err := workflow.ExecuteActivity(ctx, PickAndPack, req).Get(ctx, &result)
    if err != nil {
        return err
    }

    err = workflow.ExecuteActivity(ctx, ShipOrder, req).Get(ctx, &result.ShippingInfo)
    if err != nil {
        return err
    }

    // Notify Cluster A via external broker
    result.Status = "shipped"
    return workflow.ExecuteActivity(ctx, PublishFulfillmentComplete, PublishRequest{
        CallbackKey: req.CallbackKey,
        Result:      result,
    }).Get(ctx, nil)
}
```

### Pros

- **Loosely coupled.** Each cluster is fully independent. They do not need to know about each other's Temporal Server addresses.
- **Independent scaling and deployment.** Each cluster's workflows evolve independently.
- **Technology agnostic.** The coordination channel can be anything -- Kafka, NATS, Redis, a database table, an HTTP webhook.

### Cons

- **Complex.** Two workflows, an external broker, and retry logic on both sides.
- **Eventual consistency.** There is always a window where one side has completed but the other does not yet know.
- **Harder to debug.** Tracing a request across clusters, through a broker, and into another workflow requires distributed tracing infrastructure.
- **No transactional guarantee.** If the broker message is lost, the coordination breaks. You need idempotent consumers and dead-letter queues.

### When to Use

Workflow-level coordination is the right choice when clusters are truly independent and loosely coupled -- different teams, different deployment cadences, minimal shared logic. It is also the right choice when you cannot establish direct network connectivity between clusters.

## Choosing an Approach

Use this decision matrix to select the right cross-cluster pattern:

| Factor | Namespace Replication | Worker Bridge | Workflow Coordination |
|---|---|---|---|
| Temporal edition | Cloud / Enterprise | Open-source | Open-source |
| Primary use case | DR / failover | Distributed execution | Loose coupling |
| Network requirement | Cluster-to-cluster | Cluster-to-cluster | Via broker only |
| Latency impact | Replication lag | Activity round-trip | Broker + consumer lag |
| Workflow code changes | None | Minimal (new task queue) | Significant |
| Operational complexity | Low (managed) | Medium | High |
| Coupling | Tight (same namespace) | Medium (shared task queue) | Loose |
| Durability guarantee | Temporal-managed | Bridge-managed | Application-managed |

For most teams starting with cross-cluster patterns, the **worker bridge** is the best balance of flexibility, simplicity, and compatibility with open-source Temporal.

## Network and Security

Cross-cluster communication introduces network paths that must be secured.

**mTLS between clusters.** Both the worker bridge and namespace replication require TLS connections between Temporal clients and servers. Use mutual TLS with per-cluster certificates. Temporal's client SDK supports TLS configuration:

```go
tlsConfig := &tls.Config{
    Certificates: []tls.Certificate{cert},
    RootCAs:      caCertPool,
    ServerName:   "temporal-frontend.cluster-b.example.com",
}

clientB, err := client.Dial(client.Options{
    HostPort: "temporal-cluster-b:7233",
    ConnectionOptions: client.ConnectionOptions{
        TLS: tlsConfig,
    },
})
if err != nil {
    log.Fatalf("dial cluster B: %v", err)
}
```

**Network policies.** In Kubernetes, use NetworkPolicy resources to restrict which pods can connect to the bridge and which can reach external clusters. The bridge worker should be the only pod with cross-cluster network access.

**Firewall rules.** Open only the Temporal Frontend gRPC port (7233) between clusters. Do not expose the internal inter-node ports (7234, 7235, 7239).

## Conflict Resolution

When work spans clusters, you must handle the case where the same logical operation is triggered more than once -- network retries, bridge restarts, or duplicate broker messages.

### Idempotency Keys

Include the source cluster ID in all idempotency keys:

```go
func GenerateCrossClusterKey(sourceCluster, workflowID, activityName string) string {
    return fmt.Sprintf("%s:%s:%s", sourceCluster, workflowID, activityName)
}
```

Use this key as the workflow ID on the remote cluster. Temporal's "workflow already started" error provides natural deduplication when you use deterministic workflow IDs.
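A quick check of the helper (repeated here so the snippet runs standalone): the key is deterministic, so a bridge retry computes the same remote workflow ID.

```go
package main

import "fmt"

// GenerateCrossClusterKey repeated from above so this runs standalone.
func GenerateCrossClusterKey(sourceCluster, workflowID, activityName string) string {
	return fmt.Sprintf("%s:%s:%s", sourceCluster, workflowID, activityName)
}

func main() {
	// Same inputs always yield the same remote workflow ID, so a
	// retried bridge call targets the existing execution.
	key := GenerateCrossClusterKey("cluster-a", "order-1234", "StartFulfillment")
	fmt.Println(key) // cluster-a:order-1234:StartFulfillment
}
```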

### Handling Duplicate Starts

When the bridge starts a workflow on Cluster B with a deterministic ID and a running workflow with that ID already exists, the start call can fail with `WorkflowExecutionAlreadyStarted`. (In the Go SDK this error is only returned when `WorkflowExecutionErrorWhenAlreadyStarted` is set in the start options; by default `ExecuteWorkflow` attaches to the running execution without an error.) Either way, the bridge should treat a duplicate start as success and attach to the existing execution:

```go
func (b *BridgeActivities) StartRemoteWorkflow(ctx context.Context, req RemoteWorkflowRequest) (string, error) {
    workflowID := GenerateCrossClusterKey(req.SourceCluster, req.SourceWorkflowID, req.ActivityName)

    opts := client.StartWorkflowOptions{
        ID:        workflowID,
        TaskQueue: req.RemoteTaskQueue,
    }

    we, err := b.remoteClient.ExecuteWorkflow(ctx, opts, req.WorkflowType, req.Input)
    if err != nil {
        // Duplicate starts are expected after bridge retries; treat
        // "already started" as success and attach to the existing run.
        var alreadyStarted *serviceerror.WorkflowExecutionAlreadyStarted
        if errors.As(err, &alreadyStarted) {
            return workflowID, nil
        }
        return "", fmt.Errorf("failed to start remote workflow: %w", err)
    }

    return we.GetID(), nil
}
```

### Last-Writer-Wins vs. Merge

For read-only operations (queries, data retrieval), last-writer-wins is fine. For mutations, consider whether your use case tolerates overwrites or requires merge logic. In most infrastructure workflows, operations are naturally idempotent (creating a resource that already exists is a no-op), making conflict resolution straightforward.

## What We Build Next

The [next article](../temporal-cross-cluster-worker-bridge/) implements the worker bridge pattern end-to-end:

1. A bridge worker binary with two Temporal clients
2. Bridge activities that start and monitor workflows on the remote cluster
3. Cross-cluster idempotency handling
4. A Kubernetes Deployment for running the bridge
5. An end-to-end test across two minikube clusters

If you have not yet set up the two-cluster minikube environment, do that first: [Multiple Temporal Servers on Minikube](../temporal-multi-cluster-minikube/).

For the signal-based coordination patterns referenced in Approach 3, see [Temporal Signals for Automated Coordination](../temporal-signals-automated/).

