---
title: "Kubernetes Production Readiness Checklist: Everything to Verify Before Going Live"
description: "Comprehensive checklist for auditing a Kubernetes cluster before running production workloads, with specific verification commands and expected outcomes for each item."
url: https://agent-zone.ai/knowledge/kubernetes/ops-production-readiness-checklist/
section: knowledge
date: 2026-02-22
categories: ["kubernetes"]
tags: ["production","checklist","audit","security","reliability","observability","operations"]
skills: ["cluster-auditing","production-readiness-assessment","pre-launch-verification"]
tools: ["kubectl","helm","trivy","kube-bench"]
levels: ["intermediate"]
word_count: 1530
formats:
  json: https://agent-zone.ai/knowledge/kubernetes/ops-production-readiness-checklist/index.json
  html: https://agent-zone.ai/knowledge/kubernetes/ops-production-readiness-checklist/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Kubernetes+Production+Readiness+Checklist%3A+Everything+to+Verify+Before+Going+Live
---


# Kubernetes Production Readiness Checklist

This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.

---

## Cluster Health

These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.

### All nodes in Ready state

```bash
kubectl get nodes -o wide
```

**Pass**: Every node shows `STATUS: Ready`. No `NotReady`, `SchedulingDisabled`, or `Unknown`.

**If failing**: Check kubelet logs on the affected node (`journalctl -u kubelet -n 50`). Common causes: expired certificates, disk pressure, memory pressure.

### System pods healthy

```bash
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
```

**Pass**: Empty output. Every pod in every namespace is Running or Succeeded (the latter for completed Jobs).

**If failing**: `kubectl describe pod <pod> -n kube-system` to check events. CoreDNS and kube-proxy failures are critical blockers.

### DNS resolution working

```bash
# Pod-to-service resolution
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local

# Pod-to-external resolution
kubectl run dns-test-ext --image=busybox:1.36 --restart=Never --rm -it -- nslookup google.com
```

**Pass**: Both resolve with valid IP addresses. Internal resolution returns the cluster service IP. External resolution returns a public IP.

**If failing**: Check CoreDNS pods and ConfigMap. See the DNS debugging knowledge article for detailed troubleshooting.

### Cluster version matches target

```bash
kubectl version
```

**Pass**: Server version matches your target release (e.g., v1.29.x). Not running an alpha, beta, or end-of-life version. The version is within the supported window (N-2 minor releases from latest stable).

### etcd health verified

```bash
# On managed clusters (EKS, GKE, AKS), etcd is managed by the provider -- skip this
# On self-managed clusters:
kubectl exec -n kube-system etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
```

**Pass**: `127.0.0.1:2379 is healthy: successfully committed proposal`.

---

## Workload Configuration

Check every Deployment, StatefulSet, and DaemonSet that will run production traffic.

### All containers have resource requests AND limits

```bash
# Find containers without requests or limits
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.resources.requests == null or .resources.limits == null) |
  "\($ns)/\($pod)/\(.name): missing requests or limits"'
```

**Pass**: Empty output. Every container sets both a `requests` and a `limits` block. Spot-check that each block covers both `cpu` and `memory`, since the query only confirms the blocks exist.

**Why it matters**: Without requests, the scheduler cannot make placement decisions. Without limits, a single pod can consume all node resources.

### Liveness and readiness probes configured

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.readinessProbe == null or .livenessProbe == null) |
  "\($ns)/\($pod)/\(.name): missing probes"'
```

**Pass**: No application containers listed. System containers (kube-proxy, CNI agents) may legitimately lack probes.

**Common mistake**: Setting `livenessProbe` and `readinessProbe` to the same endpoint and timing. The liveness probe should be more lenient (higher `failureThreshold`) because a liveness failure restarts the container.
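A sketch of that asymmetry, using hypothetical endpoints, ports, and timings:

```yaml
# Illustrative values -- tune path, port, and thresholds to your service
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3       # drops out of Service endpoints after ~15s
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 6       # restarts the container only after ~60s of sustained failure
```

Distinct endpoints also let readiness reflect dependency health (database reachable, caches warm) while liveness checks only that the process itself is responsive.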

### Pod anti-affinity for multi-replica deployments

```bash
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  select(.spec.template.spec.affinity.podAntiAffinity == null) |
  "\(.metadata.namespace)/\(.metadata.name): replicas=\(.spec.replicas) but no pod anti-affinity"'
```

**Pass**: No multi-replica deployments without anti-affinity. Note that `topologySpreadConstraints` is a valid alternative that the query above does not detect. All replicas should spread across nodes.
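A typical soft anti-affinity stanza for the pod template (the `app: api` label is illustrative):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api
          topologyKey: kubernetes.io/hostname
```

`preferred` still schedules when nodes are scarce; use `requiredDuringSchedulingIgnoredDuringExecution` only when co-location is never acceptable.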

### PodDisruptionBudgets for critical services

```bash
# List deployments with no corresponding PDB
kubectl get pdb -A -o json | jq -r '.items[].spec.selector.matchLabels' > /tmp/pdb-selectors.json
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.replicas > 1) |
  "\(.metadata.namespace)/\(.metadata.name)"'
# Cross-reference manually: every multi-replica deployment should have a PDB
```

**Pass**: Every critical multi-replica deployment has a PDB with `minAvailable` or `maxUnavailable` set.
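A minimal PDB for a hypothetical three-replica deployment labeled `app: api`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: app-production
spec:
  minAvailable: 2           # node drains may evict at most one replica at a time
  selector:
    matchLabels:
      app: api
```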

### Graceful shutdown handling

```bash
kubectl get deployments -A -o json | jq -r '
  .items[] |
  select(.spec.template.spec.terminationGracePeriodSeconds == null or
         .spec.template.spec.terminationGracePeriodSeconds == 30) |
  "\(.metadata.namespace)/\(.metadata.name): using default terminationGracePeriodSeconds (30s)"'
```

**Pass**: Critical services have a `terminationGracePeriodSeconds` appropriate for their shutdown behavior. Services with long-running connections or background jobs need longer than the 30s default.
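One common pattern, with illustrative values: a `preStop` sleep gives the endpoints controller time to deregister the pod before SIGTERM arrives, and the grace period is sized to cover the sleep plus the app's own shutdown:

```yaml
spec:
  terminationGracePeriodSeconds: 60   # must exceed preStop sleep + app shutdown time
  containers:
    - name: worker
      image: registry.company.com/worker:2.1.0
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]
```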

### Image tags are pinned (no :latest)

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.image | test(":latest$") or (test(":") | not)) |
  "\($ns)/\($pod): \(.image)"'
```

**Pass**: Empty output. Every image uses a specific tag or digest (e.g., `nginx:1.25.3` or `nginx@sha256:...`). Never `:latest` in production.

### Images from trusted registries only

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  .spec.containers[] |
  .image' | sort -u | grep -v -E '^(registry\.company\.com|gcr\.io/my-project|[0-9]+\.dkr\.ecr)'
```

**Pass**: All images come from your organization's approved registries. No images from Docker Hub public repositories in production.

---

## Security

### RBAC configured (no unnecessary cluster-admin)

```bash
kubectl get clusterrolebindings -o json | jq -r '
  .items[] |
  select(.roleRef.name == "cluster-admin") |
  "\(.metadata.name): \(.subjects // [] | map(.name) | join(", "))"'
```

**Pass**: Only system accounts and a single ops-team binding have cluster-admin. No individual user accounts, no CI/CD service accounts with cluster-admin.

### Pod Security Standards enforced

```bash
kubectl get namespaces -o json | jq -r '
  .items[] |
  select(.metadata.labels["pod-security.kubernetes.io/enforce"] == null) |
  .metadata.name'
```

**Pass**: Only system namespaces (e.g., `kube-system`, `kube-public`, `kube-node-lease`) are listed. Every application namespace enforces at least `baseline`, with `restricted` applied as `warn` or `audit` on production namespaces.
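Labels that would satisfy this check on a production namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-production
  labels:
    pod-security.kubernetes.io/enforce: baseline   # hard rejection below this level
    pod-security.kubernetes.io/warn: restricted    # surfaces violations without blocking
    pod-security.kubernetes.io/audit: restricted   # records violations in audit logs
```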

### Network policies in place (default deny)

```bash
kubectl get networkpolicy -A -o json | jq -r '
  .items[] |
  select(.spec.podSelector == {} or .spec.podSelector.matchLabels == null) |
  "\(.metadata.namespace)/\(.metadata.name)"'
```

**Pass**: Every application namespace has a default-deny NetworkPolicy with an empty `podSelector`.
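A default-deny policy matching this check (namespace name is illustrative); allow rules are then layered on as additional, more specific policies:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app-production
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```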

### Service accounts not using default SA

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.metadata.namespace != "kube-system") |
  select(.spec.serviceAccountName == "default" or .spec.serviceAccountName == null) |
  "\(.metadata.namespace)/\(.metadata.name): using default ServiceAccount"'
```

**Pass**: No application pods use the default ServiceAccount. Each workload has its own SA with minimal permissions.

### Containers run as non-root

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.securityContext.runAsNonRoot != true) |
  select(.securityContext.runAsUser == null or .securityContext.runAsUser == 0) |
  "\($ns)/\($pod)/\(.name): may run as root"'
```

**Pass**: All application containers have `runAsNonRoot: true` or an explicit non-zero `runAsUser`. Either can also be set at the pod-level `securityContext`, which the query above does not inspect, so review pod-level settings before flagging a container.
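A container `securityContext` that passes (the UID/GID values are illustrative and must be compatible with the image):

```yaml
securityContext:
  runAsNonRoot: true          # kubelet refuses to start the container as UID 0
  runAsUser: 10001
  runAsGroup: 10001
  allowPrivilegeEscalation: false
```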

### Read-only root filesystem where possible

```bash
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.securityContext.readOnlyRootFilesystem != true) |
  "\($ns)/\($pod)/\(.name): writable root filesystem"'
```

**Pass**: Most application containers use `readOnlyRootFilesystem: true` with `emptyDir` mounts for any paths that need writes (e.g., `/tmp`).
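A common shape for this, with an `emptyDir` backing the one path the app writes to (names are illustrative):

```yaml
containers:
  - name: api
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: tmp
        mountPath: /tmp       # only writable path in the container
volumes:
  - name: tmp
    emptyDir: {}
```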

---

## Networking

### Ingress TLS configured

```bash
kubectl get ingress -A -o json | jq -r '
  .items[] |
  select(.spec.tls == null or (.spec.tls | length) == 0) |
  "\(.metadata.namespace)/\(.metadata.name): no TLS configured"'
```

**Pass**: Every Ingress resource has a `tls` section with valid secret references.
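The `tls` stanza an Ingress needs (host and secret name are illustrative):

```yaml
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls   # managed by cert-manager or provisioned manually
```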

### cert-manager auto-renewal working

```bash
kubectl get certificates -A -o json | jq -r '
  .items[] |
  "\(.metadata.namespace)/\(.metadata.name): ready=\((.status.conditions // []) | map(select(.type=="Ready")) | .[0].status) renewal=\(.status.renewalTime)"'
```

**Pass**: All certificates show `Ready=True` and `renewalTime` is in the future.

### Load balancer health checks configured

```bash
kubectl get svc -A -o json | jq -r '
  .items[] |
  select(.spec.type == "LoadBalancer") |
  "\(.metadata.namespace)/\(.metadata.name): externalTrafficPolicy=\(.spec.externalTrafficPolicy)"'
```

**Pass**: Load balancer services exist and have appropriate `externalTrafficPolicy` (usually `Local` for preserving source IP).

---

## Observability

### Metrics collection working

```bash
kubectl top nodes
kubectl top pods -n app-production
```

**Pass**: Both commands return current CPU and memory usage data. If metrics-server is not installed, both commands will fail.

### Logging pipeline functional

```bash
# Check that log collector DaemonSet is running on all nodes
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=promtail
# Or for fluent-bit:
kubectl get daemonset -n monitoring -l app.kubernetes.io/name=fluent-bit

# Verify logs are queryable in Loki/your logging backend
# Through Grafana Explore: query {namespace="app-production"} and confirm results appear
```

**Pass**: DaemonSet has desired=available on all nodes. Recent logs are queryable.

### Alerting configured and tested

```bash
# Check Alertmanager has receivers configured
kubectl get secret alertmanager-kube-prometheus-alertmanager -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d | head -20

# Check for active alerts
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring &
sleep 2  # give the port-forward a moment to bind before querying
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname' | head
```

**Pass**: Alertmanager config shows at least one non-default receiver (Slack, PagerDuty, etc.). Test alert was received by the team.

---

## Reliability

### Backup strategy implemented and tested

```bash
# Check Velero status
velero backup get
velero schedule get
```

**Pass**: At least one scheduled backup exists. The most recent backup status is `Completed`. A restore test has been performed in the last 30 days.

### Horizontal auto-scaling configured

```bash
kubectl get hpa -A
```

**Pass**: Critical workloads have HPA configured with appropriate min/max replicas and target metrics.
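A representative `autoscaling/v2` HPA (names, replica bounds, and the 70% CPU target are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: app-production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```

The target is a percentage of the container's CPU *request*, which is another reason the resource-requests check above is a hard prerequisite.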

### Node auto-scaling configured

```bash
# Check Cluster Autoscaler or Karpenter
kubectl get pods -n kube-system | grep -E 'cluster-autoscaler|karpenter'
```

**Pass**: Node auto-scaler is running and configured for the node groups that host application workloads.

---

## Operations

### GitOps or deployment pipeline working

```bash
kubectl get applications -n argocd    # ArgoCD
# or
flux get kustomizations               # Flux
```

**Pass**: Applications are synced and healthy. Last sync was recent (within expected interval).

### Rollback procedure tested

```bash
# Verify rollback capability
kubectl rollout history deployment/<name> -n app-production
```

**Pass**: Deployment revision history exists with at least 2 entries. Team has documented and tested the rollback procedure. `revisionHistoryLimit` is set to a reasonable number (10 is the default).

---

## Scoring

Count passing items from the 29 checks above:

| Score | Assessment |
|-------|------------|
| 27-29 items pass | Production ready |
| 23-26 items pass | Near ready -- address gaps before launch |
| 18-22 items pass | Significant gaps -- schedule a hardening sprint |
| Below 18 | Not production ready -- major work required |

Generate a report listing every failing item, its risk level (critical/high/medium/low), and the specific remediation step. Critical items (cluster health, RBAC, network policies) must be fixed before go-live. High items should be fixed within the first week. Medium and low items can be tracked in a backlog.

