Dockerfile Best Practices: Secure, Efficient Container Images

A Dockerfile is a security boundary. Every decision – base image, installed package, file copied in, user the process runs as – determines the attack surface of your running container. Most Dockerfiles in the wild are bloated, run as root, and ship debug tools an attacker can use. Here is how to fix that.

Choose the Right Base Image

Your base image choice is the single biggest factor in image size and vulnerability count.

EKS IAM and Security

EKS bridges two identity systems: AWS IAM and Kubernetes RBAC. Understanding how they connect is essential for both granting pods access to AWS services and controlling who can access the cluster.

IAM Roles for Service Accounts (IRSA)

IRSA lets Kubernetes pods assume IAM roles without using node-level credentials. Each pod gets exactly the AWS permissions it needs, not the broad permissions attached to the node role.
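On the Kubernetes side, the link between a pod and its IAM role is an annotation on the service account. The sketch below writes such a manifest; the service account name `app-sa`, the namespace, the account ID, and the role name are all placeholders, and it assumes the IAM role already has a trust policy for the cluster's OIDC provider.

```shell
# Write a ServiceAccount manifest carrying the IRSA role annotation.
# All names and the ARN below are placeholders for illustration.
cat > app-serviceaccount.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-role
EOF

# Apply it to the cluster (requires kubectl access):
#   kubectl apply -f app-serviceaccount.yaml
```

Pods that set `serviceAccountName: app-sa` would then receive credentials for that role rather than for the node role.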

EKS Networking and Load Balancing

EKS networking differs fundamentally from generic Kubernetes networking. Pods get real VPC IP addresses, load balancers are AWS-native resources, and networking decisions have direct cost and IP capacity implications.

VPC CNI: How Pod Networking Works

The AWS VPC CNI plugin assigns each pod an IP address from your VPC CIDR. Unlike overlay networks (Calico, Flannel), pods are directly routable within the VPC. This means security groups, NACLs, and VPC flow logs all work with pod traffic natively.

EKS Setup and Configuration

Amazon EKS runs the Kubernetes control plane for you – managed etcd, API server, and controller manager across multiple AZs. You are responsible for the worker nodes, networking configuration, and add-ons.

Cluster Creation Methods

eksctl is the fastest path for a working cluster. It creates the VPC, subnets, NAT gateway, IAM roles, node groups, and kubeconfig in one command:

eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --version 1.31 \
  --nodegroup-name workers \
  --node-type m6i.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed

For repeatable setups, use a ClusterConfig file:
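A ClusterConfig roughly equivalent to the command above might look like the following sketch. The values mirror the flags shown earlier; the file name `cluster.yaml` is an arbitrary choice for illustration.

```shell
# Write a ClusterConfig that mirrors the eksctl flags used above.
# The file name cluster.yaml is an assumption for this sketch.
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-east-1
  version: "1.31"

managedNodeGroups:
  - name: workers
    instanceType: m6i.large
    desiredCapacity: 3
    minSize: 2
    maxSize: 10
EOF

# Create the cluster from the file (requires eksctl and AWS credentials):
#   eksctl create cluster -f cluster.yaml
```

Keeping the file in version control gives a reviewable record of how the cluster was defined.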

EKS Troubleshooting

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster

Symptoms: kubectl get nodes shows fewer nodes than expected. The Auto Scaling group (ASG) shows instances running, but they never register with the cluster.

aws-auth ConfigMap Missing Node Role

The most common cause. Worker nodes authenticate via aws-auth. If the node IAM role is not mapped, nodes are rejected silently.
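One way to check the mapping is to search the ConfigMap for the node role ARN. This is a minimal sketch: the helper name `role_is_mapped` and the `NODE_ROLE_ARN` variable are hypothetical, and the match is a plain substring check rather than a YAML parse.

```shell
# role_is_mapped ROLE_ARN
# Succeeds if ROLE_ARN appears as a rolearn entry in the aws-auth YAML
# read from stdin. Substring match only; it does not parse the YAML.
role_is_mapped() {
  grep -Fq "rolearn: $1"
}

# Usage against a live cluster (NODE_ROLE_ARN is a placeholder):
#   kubectl -n kube-system get configmap aws-auth -o yaml \
#     | role_is_mapped "$NODE_ROLE_ARN" || echo "node role is not mapped"
```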

etcd Maintenance for Self-Managed Clusters

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
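A typical set of variables is sketched below. etcdctl reads flags from `ETCDCTL_`-prefixed environment variables; the endpoint and certificate paths shown are kubeadm's defaults and are an assumption here, so adjust them to wherever your cluster keeps its etcd PKI.

```shell
# etcdctl picks up these variables in place of the matching
# --endpoints / --cacert / --cert / --key flags.
# Paths are kubeadm defaults (an assumption); adjust for your cluster.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With these set, commands need no TLS flags, for example:
#   etcdctl endpoint health
```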

From Empty Cluster to Production-Ready: The Complete Setup Sequence

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.

GKE Networking

GKE networking centers on VPC-native clusters, where pods and services get IP addresses from VPC subnet ranges. This integrates Kubernetes networking directly into Google Cloud’s VPC, enabling native routing, firewall rules, and load balancing without extra overlays.

VPC-Native Clusters and Alias IP Ranges

VPC-native clusters use alias IP ranges on the subnet. You allocate two secondary ranges: one for pods, one for services.

# Create subnet with secondary ranges
gcloud compute networks subnets create gke-subnet \
  --network my-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# Create cluster using those ranges
gcloud container clusters create my-cluster \
  --region us-central1 \
  --network my-vpc \
  --subnetwork gke-subnet \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services \
  --enable-ip-alias

The pod range needs to be large. A /14 gives about 262,000 pod IPs. Each node reserves a /24 from the pod range (256 IPs, 110 usable pods per node). If you have 100 nodes, that consumes 100 /24 blocks. Undersizing the pod range is a common cause of IP exhaustion – the cluster cannot add nodes even though VMs are available.
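The sizing arithmetic can be checked with a few lines of shell. This sketch assumes every node reserves a /24, as described above; the variable names are illustrative only.

```shell
# Capacity of a pod range when every node reserves a fixed-size block.
pod_range_prefix=14    # secondary range for pods, e.g. 10.4.0.0/14
node_block_prefix=24   # per-node reservation at 110 pods per node

# Total addresses in the pod range: 2^(32 - prefix)
total_pod_ips=$(( 1 << (32 - pod_range_prefix) ))

# Number of per-node blocks that fit in the range: 2^(24 - 14)
max_nodes=$(( 1 << (node_block_prefix - pod_range_prefix) ))

echo "pod IPs in range: ${total_pod_ips}"   # 262144
echo "max nodes:        ${max_nodes}"       # 1024
```

So a /14 pod range supports at most 1,024 nodes at a /24 per node, and a /16 would cap the cluster at 256 nodes no matter how many VMs are available.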

GKE Security and Identity

GKE security covers identity (who can do what), workload isolation (sandboxing untrusted code), supply chain integrity (ensuring only trusted images run), and data protection (encryption at rest). These features layer on top of standard Kubernetes RBAC and network policies.

Workload Identity Federation

Workload Identity Federation for GKE is the current name for the feature originally called Workload Identity. Autopilot clusters have it enabled by default, while Standard clusters still enable it by setting a workload pool on the cluster. It uses the standard GCP IAM federation model. The concept is the same: bind a Kubernetes service account to a Google Cloud service account so pods get GCP credentials without exported keys.

GKE Setup and Configuration

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default and the only option for new clusters). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.

Standard vs Autopilot

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.