Detecting Infrastructure Knowledge Gaps: What Agents Don't Know They Don't Know

Detecting Infrastructure Knowledge Gaps#

The most dangerous thing an agent can do is confidently produce a deliverable based on wrong assumptions. An agent that assumes x86_64 when the target is ARM64, that assumes PostgreSQL 14 behavior when the target runs 15, or that assumes AWS IAM patterns when the target is Azure – that agent produces a runbook that will fail in ways the human did not expect and may not understand.

Diagnosing Common Terraform Problems

Stuck State Lock#

A CI job was cancelled, a laptop lost network, or a process crashed mid-apply. Terraform refuses to run:

Error acquiring the state lock
Lock Info:
  ID:        f8e7d6c5-b4a3-2109-8765-43210fedcba9
  Operation: OperationTypeApply
  Who:       deploy@ci-runner
  Created:   2026-02-20 09:15:22 +0000 UTC

Verify the lock holder is truly dead. Check CI job status, then:

terraform force-unlock f8e7d6c5-b4a3-2109-8765-43210fedcba9

If the lock was from a crashed apply, the state may be partially updated. Run terraform plan immediately after unlocking to see the current situation.

EKS Troubleshooting

EKS Troubleshooting#

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster#

Symptoms: kubectl get nodes shows fewer nodes than expected. ASG shows instances running, but they never register.

aws-auth ConfigMap Missing Node Role#

The most common cause. Worker nodes authenticate via aws-auth. If the node IAM role is not mapped, nodes are rejected silently.

GKE Troubleshooting

GKE Troubleshooting#

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment#

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (you pay per resource request) and can cause unexpected OOMKills if the limits were set relative to the original request.

Jenkins Debugging: Diagnosing Stuck Builds, Pipeline Failures, Performance Issues, and Kubernetes Agent Problems

Jenkins Debugging#

Jenkins failures fall into a few categories: builds stuck waiting, cryptic pipeline errors, performance degradation, and Kubernetes agent pods that refuse to launch.

Builds Stuck in Queue#

When a build sits in the queue and never starts, check the queue tooltip in the UI – it tells you why. Common causes:

No agents with matching labels. The pipeline requests agent { label 'docker-arm64' } but no agent has that label. Check Manage Jenkins > Nodes to see available labels.

kubectl Debugging: A Practical Command Reference

kubectl Debugging#

When something breaks in Kubernetes, you need to move through a specific sequence of commands. Here is every debugging command you will reach for, plus a step-by-step workflow for a pod that will not start.

Logs#

kubectl logs <pod-name> -n <namespace>                           # basic
kubectl logs <pod-name> -c <container-name> -n <namespace>       # specific container
kubectl logs <pod-name> --previous -n <namespace>                # previous crash (essential for CrashLoopBackOff)
kubectl logs -f <pod-name> -n <namespace>                        # stream in real-time
kubectl logs --since=5m <pod-name> -n <namespace>                # last 5 minutes
kubectl logs -l app=payments-api -n payments-prod --all-containers  # all pods matching label

The --previous flag is critical for crash-looping pods where the current container has no logs yet. The --all-containers flag captures init containers and sidecars.

Kubernetes API Server: Architecture, Authentication, Authorization, and Debugging

Kubernetes API Server: Architecture, Authentication, Authorization, and Debugging#

The API server (kube-apiserver) is the front door to your Kubernetes cluster. Every interaction – kubectl commands, controller reconciliation loops, kubelet status updates, admission webhooks – goes through the API server. It is the only component that reads from and writes to etcd. If the API server is down, the cluster is unmanageable. Everything else (scheduler, controllers, kubelets) can tolerate brief API server outages because they cache state locally, but no mutations happen until the API server is back.

Kubernetes DNS Deep Dive: CoreDNS, ndots, and Debugging Resolution Failures

Kubernetes DNS Deep Dive: CoreDNS, ndots, and Debugging Resolution Failures#

DNS problems are responsible for a disproportionate number of Kubernetes debugging sessions. The symptoms are always vague – timeouts, connection refused, “could not resolve host” – and the root causes range from CoreDNS being down to a misunderstood setting called ndots.

How Pod DNS Resolution Works#

When a pod makes a DNS query, it goes through the following chain:

  1. The application calls getaddrinfo() or equivalent.
  2. The system resolver reads /etc/resolv.conf inside the pod.
  3. The query goes to the nameserver specified in resolv.conf, which is CoreDNS (reachable via the kube-dns Service in kube-system).
  4. CoreDNS resolves the name – either from its internal zone (for cluster services) or by forwarding to upstream DNS.

Every pod’s /etc/resolv.conf looks something like this:

Kubernetes Events Debugging: Patterns, Filtering, and Alerting

Kubernetes Events Debugging#

Kubernetes events are the cluster’s built-in audit trail for what is happening to resources. When a pod fails to schedule, a container crashes, a node runs out of disk, or a volume fails to mount, the system records an event. Events are the first place to look when something goes wrong, and learning to read them efficiently separates quick diagnosis from hours of guessing.

Event Structure#

Every Kubernetes event has these fields:

Linux Debugging Essentials for Infrastructure

Debugging Workflow#

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, network, kernel logs.

Services: systemctl and journalctl#

When a service is misbehaving, start with its status:

systemctl status nginx

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.