PodDisruptionBudgets Deep Dive

PodDisruptionBudgets Deep Dive#

A PodDisruptionBudget (PDB) limits how many pods from a set can be simultaneously down during voluntary disruptions – node drains, cluster upgrades, autoscaler scale-down. PDBs do not protect against involuntary disruptions like node crashes or OOM kills. They are the mechanism by which you tell Kubernetes “this service needs at least N healthy pods at all times during maintenance.”

minAvailable vs maxUnavailable#

PDBs support two fields. Use one or the other, not both.

Advanced Kubernetes Debugging: CrashLoopBackOff, ImagePullBackOff, OOMKilled, and Stuck Pods

Advanced Kubernetes Debugging#

Every Kubernetes failure follows a pattern, and every pattern has a diagnostic sequence. This guide covers the most common failure modes you will encounter in production, with the exact commands and thought process to move from symptom to resolution.

Systematic Debugging Methodology#

Before diving into specific scenarios, internalize this sequence. It applies to nearly every pod issue:

# Step 1: What state is the pod in?
kubectl get pod <pod> -n <ns> -o wide

# Step 2: What does the full pod spec and event history show?
kubectl describe pod <pod> -n <ns>

# Step 3: What did the application log before it failed?
kubectl logs <pod> -n <ns> --previous --all-containers

# Step 4: Can you get inside the container?
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Step 5: Is the node healthy?
kubectl describe node <node-name>
kubectl top node <node-name>

Each failure mode below follows this pattern, with specific things to look for at each step.

Kubernetes Resource Management: QoS Classes, Eviction, OOM Scoring, and Capacity Planning

Kubernetes Resource Management Deep Dive#

Resource management in Kubernetes is the mechanism that decides which pods get scheduled, which pods get killed when the node runs low, and how much CPU and memory each container is actually allowed to use. The surface-level concept of requests and limits is straightforward. The underlying mechanics – QoS classification, CFS CPU quotas, kernel OOM scoring, kubelet eviction thresholds – are where misconfigurations cause production outages.