AWS CodePipeline and CodeBuild: Pipeline Structure, ECR Integration, ECS/EKS Deployments, and Cross-Account Patterns

AWS CodePipeline and CodeBuild#

AWS CodePipeline orchestrates CI/CD workflows as a series of stages. CodeBuild executes the actual build and test commands. Together they provide a fully managed pipeline that integrates natively with S3, ECR, ECS, EKS, Lambda, and CloudFormation. No servers to manage, no agents to maintain – but the trade-off is less flexibility than self-hosted systems and tighter coupling to the AWS ecosystem.

Pipeline Structure#

A pipeline consists of stages, and each stage contains one or more actions. Stages run in sequence; the actions within a stage can run in parallel or sequentially. The most common pattern is Source -> Build -> Deploy.
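
As a rough illustration of that shape, the sketch below models a pipeline as plain Python data and derives the order in which actions would run. The field names loosely echo CodePipeline's stage and action structure, but the pipeline contents, the `run_order` key, and the `execution_plan` helper are assumptions made for this example, not an AWS API. Actions that share a run order within a stage are treated as parallel.

```python
from itertools import groupby

# Hypothetical pipeline definition; every name here is illustrative.
pipeline = {
    "name": "example-pipeline",
    "stages": [
        {"name": "Source", "actions": [{"name": "Checkout", "run_order": 1}]},
        {"name": "Build", "actions": [
            {"name": "UnitTests", "run_order": 1},
            {"name": "Lint", "run_order": 1},
            {"name": "BuildImage", "run_order": 2},
        ]},
        {"name": "Deploy", "actions": [{"name": "DeployService", "run_order": 1}]},
    ],
}


def execution_plan(pipeline):
    """Return [(stage name, [groups of parallel action names])] in order."""
    plan = []
    for stage in pipeline["stages"]:
        # sorted() is stable, so actions keep their listed order within a group.
        ordered = sorted(stage["actions"], key=lambda a: a["run_order"])
        groups = [
            [action["name"] for action in group]
            for _, group in groupby(ordered, key=lambda a: a["run_order"])
        ]
        plan.append((stage["name"], groups))
    return plan


for stage_name, groups in execution_plan(pipeline):
    print(stage_name, groups)
```

In this sketch the Build stage runs `UnitTests` and `Lint` side by side, then `BuildImage` once both have finished.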

Blue-Green Deployments: Traffic Switching, Database Compatibility, and Rollback Strategies

Blue-Green Deployments#

A blue-green deployment runs two identical production environments. One (blue) serves live traffic. The other (green) is idle or running the new version. When the green environment passes validation, you switch traffic from blue to green. If something goes wrong, you switch back. The old environment stays running until you are confident the new version is stable.

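The fundamental advantage over rolling updates is atomicity. Traffic switches from 100% old to 100% new in a single operation, so there is no extended period where some users see the old version and others see the new one. This holds when the switch happens at a load balancer or router; a DNS-based switch only completes as cached records expire.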
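
The switch-and-rollback idea can be sketched in a few lines. This is a toy in-process model, not a real load balancer: the `BlueGreenRouter` class, its method names, and the `healthy` check are all hypothetical. The point it shows is that the cutover is a single pointer change, and rollback is the same change in reverse.

```python
class BlueGreenRouter:
    """Toy router that sends all traffic to exactly one environment."""

    def __init__(self, blue, green, live="blue"):
        self.backends = {"blue": blue, "green": green}
        self.live = live

    def idle(self):
        # The environment that is not currently serving traffic.
        return "green" if self.live == "blue" else "blue"

    def route(self, request):
        return self.backends[self.live](request)

    def switch(self, validate):
        """Cut over to the idle environment only if it passes validation."""
        candidate = self.idle()
        if not validate(self.backends[candidate]):
            return False
        self.live = candidate  # the whole cutover is this one assignment
        return True

    def rollback(self):
        # The previous environment is still running, so just point back.
        self.live = self.idle()


def healthy(backend):
    # Hypothetical validation: the backend answers a health request.
    return backend("health").endswith("health")


router = BlueGreenRouter(blue=lambda r: f"v1:{r}", green=lambda r: f"v2:{r}")
print(router.route("checkout"))
router.switch(healthy)
print(router.route("checkout"))
router.rollback()
print(router.route("checkout"))
```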

PostgreSQL Disaster Recovery

PostgreSQL Disaster Recovery#

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for full backups and retention management. Each layer covers a different failure mode – replication for server crashes, WAL archiving for data corruption that has already replicated to the standby, and full backups for when everything else goes wrong.

Streaming Replication for DR#

Synchronous vs Asynchronous – The Core Tradeoff#

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before committing. This keeps the primary fast, but the standby can lag seconds behind. If the primary dies, any transactions that committed on the primary but had not yet reached the standby are lost.
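
The difference in data loss can be seen in a small simulation. This is a simplified model, not PostgreSQL's implementation: the `Primary` and `Standby` classes and the `lost_on_failover` helper are hypothetical, and WAL shipping is reduced to moving items between lists. In synchronous mode the commit waits for the standby; in asynchronous mode it does not.

```python
class Standby:
    def __init__(self):
        self.received = []


class Primary:
    def __init__(self, standby, synchronous=False):
        self.standby = standby
        self.synchronous = synchronous
        self.committed = []   # transactions acknowledged to clients
        self.unshipped = []   # WAL not yet delivered to the standby

    def ship(self):
        """Deliver all pending WAL to the standby."""
        self.standby.received.extend(self.unshipped)
        self.unshipped.clear()

    def commit(self, txn):
        self.unshipped.append(txn)
        if self.synchronous:
            self.ship()  # wait for the standby before acknowledging
        self.committed.append(txn)


def lost_on_failover(primary):
    """Transactions clients saw commit that the standby never received."""
    received = set(primary.standby.received)
    return [t for t in primary.committed if t not in received]


async_primary = Primary(Standby())
async_primary.commit("t1")
async_primary.ship()          # the standby catches up once
async_primary.commit("t2")
async_primary.commit("t3")    # the primary dies before shipping these
print(lost_on_failover(async_primary))

sync_primary = Primary(Standby(), synchronous=True)
for txn in ("t1", "t2", "t3"):
    sync_primary.commit(txn)
print(lost_on_failover(sync_primary))
```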

Cloud Managed Database Disaster Recovery

Cloud Managed Database Disaster Recovery#

Every cloud provider offers managed database DR, but the actual behavior during a failure rarely matches the marketing. The documented failover time is the best case. The real failover time includes detection delay, DNS propagation, and connection draining. This guide covers what actually happens.

AWS: RDS and Aurora#

RDS Multi-AZ#

RDS Multi-AZ maintains a synchronous standby in a different availability zone. When the primary fails, RDS flips the DNS CNAME to the standby.
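
Because the endpoint is repointed through DNS, clients generally have to reconnect before they reach the new primary. A minimal sketch of a reconnect loop with exponential backoff follows. The `connect` argument is a hypothetical zero-argument callable that opens a fresh connection (and so resolves the endpoint hostname again); the attempt count and delays are arbitrary illustrative values, not RDS recommendations.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delays of 1s, 2s, 4s, ... with the default base_delay.
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demo with a fake endpoint that is unreachable for the first two attempts.
state = {"calls": 0}


def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("endpoint still points at the failed primary")
    return "connected"


recorded_waits = []
print(connect_with_retry(fake_connect, sleep=recorded_waits.append))
print(recorded_waits)
```

Passing `sleep` as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.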

Backup Verification and Restore Testing: Proving Your Backups Actually Work

Backup Verification and Restore Testing#

An untested backup is not a backup. It is a file that might contain your data and might be restorable. Teams discover the difference during an actual incident, when the database backup turns out to be corrupted, the restore takes 6 hours instead of the expected 30 minutes, or the backup process silently stopped running three weeks ago.

Backup verification is the practice of regularly proving that your backups contain valid data and can be restored within your required RTO.
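
A verification job can check for each of the failure modes described above: a stale backup, a corrupted backup, and a restore that fails or overruns the RTO. The sketch below is one possible shape under stated assumptions: `verify_backup`, its parameters, and the `restore` callable are hypothetical, and a real job would restore into a scratch database and run queries against it rather than call a Python function.

```python
import hashlib
import time
from datetime import datetime, timedelta, timezone


def verify_backup(backup_bytes, expected_sha256, taken_at, restore,
                  rto_seconds, max_age=timedelta(days=1), now=None):
    """Return a list of problems; an empty list means the backup passed."""
    now = now or datetime.now(timezone.utc)
    problems = []

    # Has the backup process silently stopped running?
    if now - taken_at > max_age:
        problems.append("backup is stale")

    # Is the file the one that was written?
    if hashlib.sha256(backup_bytes).hexdigest() != expected_sha256:
        problems.append("checksum mismatch")
        return problems  # no point restoring data known to be corrupted

    # Can it be restored, and within the required RTO?
    started = time.monotonic()
    restored_ok = restore(backup_bytes)
    elapsed = time.monotonic() - started
    if not restored_ok:
        problems.append("restore failed")
    elif elapsed > rto_seconds:
        problems.append("restore exceeded RTO")
    return problems


data = b"example backup contents"
digest = hashlib.sha256(data).hexdigest()
taken = datetime.now(timezone.utc) - timedelta(hours=2)
print(verify_backup(data, digest, taken, restore=lambda b: True, rto_seconds=1800))
```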

DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees

DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees#

A DR runbook is used during the worst moments of an engineer’s career: systems are down, customers are impacted, leadership is asking for updates, and decisions carry real consequences. The runbook must be clear enough that someone running on adrenaline and three hours of sleep can execute it correctly.

This means: short sentences, numbered steps, explicit commands (copy-paste ready), no ambiguity about who does what, and timing estimates for each phase so the incident commander knows if things are taking too long.
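
One way to keep those properties honest is to store the runbook as structured data, so every step is forced to carry a number, an owner, a command, and a time estimate. The sketch below assumes that approach; the `Step` fields, the script paths, and the estimates are hypothetical placeholders, not a recommended procedure.

```python
from dataclasses import dataclass


@dataclass
class Step:
    number: int
    owner: str             # who executes this step
    command: str           # copy-paste ready command
    expected_minutes: int  # timing estimate for the incident commander


# Hypothetical failover runbook; the scripts named here do not exist.
RUNBOOK = [
    Step(1, "on-call DBA", "./scripts/confirm-primary-down.sh", 5),
    Step(2, "on-call DBA", "./scripts/promote-standby.sh", 10),
    Step(3, "incident commander", "./scripts/post-status-update.sh", 5),
]


def render(steps):
    """Format the runbook as short numbered lines."""
    return "\n".join(
        f"{s.number}. [{s.owner}, ~{s.expected_minutes} min] {s.command}"
        for s in steps
    )


def overdue_steps(steps, actual_minutes):
    """Return the numbers of steps that ran longer than their estimate."""
    return [
        s.number for s in steps
        if actual_minutes.get(s.number, 0) > s.expected_minutes
    ]


print(render(RUNBOOK))
print(overdue_steps(RUNBOOK, {1: 4, 2: 25}))
```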

Active-Active Architecture Patterns: Multi-Region, Data Replication, and Split-Brain Resolution

What Active-Active Actually Means#

Active-active means both (or all) regions are serving production traffic simultaneously. Not standing by. Not warmed up and waiting. Actually processing real user requests right now. A user in Frankfurt hits the EU region; a user in Virginia hits the US-East region. Both regions are authoritative. Both can read and write.

This is fundamentally different from active-passive, where the secondary region exists but does not serve traffic until failover. The distinction matters because active-active introduces a class of problems that active-passive avoids entirely – primarily, what happens when two regions modify the same data at the same time.
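
To make the concurrent-write problem concrete, the sketch below uses version vectors, one common technique for detecting that two replicas changed the same record independently. It is offered as an illustration of the problem, not as the approach this article goes on to recommend; the `compare` function and the region names are assumptions.

```python
def compare(a, b):
    """Compare two version vectors (dicts mapping region -> write counter)."""
    regions = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in regions)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in regions)
    if a_ahead and b_ahead:
        return "conflict"  # each side has writes the other has not seen
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# Both regions start from the same version of a record.
base = {"eu": 1, "us": 1}

# Frankfurt and Virginia each accept a write before hearing from the other.
eu_copy = {**base, "eu": 2}
us_copy = {**base, "us": 2}

print(compare(eu_copy, us_copy))
print(compare(eu_copy, base))
```

Detecting the conflict is only the first half; deciding which write wins, or how to merge them, is a separate policy choice.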

Active-Passive vs Active-Active: Decision Framework for Multi-Region Architecture

The Core Difference#

Active-passive: one region handles all traffic, a second region stands ready to take over. Failover is an event – something triggers it, traffic shifts, and there is a gap between detection and recovery.

Active-active: both regions handle production traffic simultaneously. There is no failover event for regional traffic – if one region fails, the other is already serving users. The complexity is in keeping data consistent across regions, not in switching traffic.

Global Load Balancing and Geo-Routing: DNS GSLB, Anycast, and Cloud Provider Configurations

DNS-Based Global Server Load Balancing#

Global server load balancing (GSLB) directs users to the nearest or healthiest regional deployment. The most common approach is DNS-based: the authoritative DNS server returns different IP addresses depending on the querying client’s location, the health of backend regions, or configured routing policies.

When a user resolves app.example.com, the GSLB-aware DNS server considers the user’s location (inferred from the resolver’s IP or EDNS Client Subnet), the health of each regional endpoint, and the configured routing policy. It returns the IP address of the best region for that user.
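
The decision described above can be sketched as a small function: filter out unhealthy regions, then pick the one with the lowest estimated latency for the client's location. The `Region` type, the latency table, and the `resolve` helper are hypothetical, and the IP addresses come from the reserved documentation ranges. Real GSLB services also support weighted and failover policies that this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    ip: str
    healthy: bool
    latency_ms: dict  # client location -> estimated latency in ms


def resolve(regions, client_location):
    """Return the IP of the healthy region closest to the client."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise LookupError("no healthy region available")
    best = min(
        candidates,
        key=lambda r: r.latency_ms.get(client_location, float("inf")),
    )
    return best.ip


regions = [
    Region("eu-central", "192.0.2.10", True, {"frankfurt": 8, "virginia": 95}),
    Region("us-east", "198.51.100.10", True, {"frankfurt": 92, "virginia": 6}),
]

print(resolve(regions, "frankfurt"))

# If the EU region fails its health check, Frankfurt users go to US-East.
regions[0].healthy = False
print(resolve(regions, "frankfurt"))
```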

AWS Fundamentals for Agents

IAM: Identity and Access Management#

IAM controls who can do what in your AWS account. Everything in AWS is an API call, and IAM decides which API calls are allowed. There are three concepts an agent must understand: users, roles, and policies.

Users are long-lived identities for humans or service accounts, typically with long-lived credentials. Roles are identities with no permanent credentials of their own; users, services, or other AWS accounts assume them to obtain temporary credentials. Policies are JSON documents that define permissions. Roles are preferred over users for programmatic access because they issue short-lived credentials through STS (Security Token Service).
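
For illustration, here are the two policy documents a role typically carries, written as Python dictionaries and serialized to JSON. The bucket name and the choice of ECS tasks as the trusted service are assumptions for this example; the document structure (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Principal`) follows the standard IAM policy format.

```python
import json

# Permissions policy: what the role may do.
# "example-bucket" is a hypothetical bucket name.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

# Trust policy: who may assume the role and receive short-lived
# credentials from STS. Here, the ECS tasks service (an assumed choice).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
print(json.dumps(trust_policy, indent=2))
```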