GKE Troubleshooting

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (you pay per resource request) and can cause unexpected OOMKills if the limits were set relative to the original request.
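One way to check whether Autopilot changed a pod's requests is to read the stored values back from the API server. The sketch below assumes `kubectl` is pointed at the Autopilot cluster. The helper name `show_effective_resources` is hypothetical, and the `autopilot.gke.io/resource-adjustment` annotation is assumed to be where Autopilot records its mutation, so verify it on your cluster version.

```shell
# Print the resources the API server actually stored for a pod, plus the
# annotation where Autopilot is assumed to record any adjustment it made.
# show_effective_resources is a hypothetical helper name.
show_effective_resources() {
  local pod="$1" namespace="${2:-default}"

  # Effective requests and limits for every container in the pod.
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'

  # Dots inside the annotation key must be escaped in a jsonpath expression.
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.autopilot\.gke\.io/resource-adjustment}{"\n"}'
}
```

Calling `show_effective_resources my-pod my-namespace` and comparing the output with the manifest you applied shows whether the requests you are billed for match the ones you wrote.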

GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring

GPU and ML Workloads on Kubernetes

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive – an NVIDIA A100 node costs $3-12/hour on cloud providers – so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
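Once the device plugin is running, the `nvidia.com/gpu` resource should appear in each GPU node's allocatable resources, and pods can request it. The sketch below is illustrative: the helper names, the pod name, and the CUDA image tag are assumptions, not part of the plugin itself.

```shell
# List each node with the number of GPUs it advertises as allocatable.
# Nodes without GPUs (or without the device plugin) show <none>.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Write a one-shot pod manifest that requests a single GPU and runs nvidia-smi.
# The pod name and image tag are placeholders; pick a tag matching your driver.
write_gpu_test_pod() {
  local out="${1:-gpu-smoke-test.yaml}"
  cat > "$out" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # GPUs are requested in whole units under limits
EOF
}
```

After `write_gpu_test_pod`, applying the file with `kubectl apply -f gpu-smoke-test.yaml` and reading the pod's logs is a quick way to confirm that scheduling and the driver both work.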

HashiCorp Vault on Kubernetes: Secrets Management Done Right

HashiCorp Vault on Kubernetes

Vault centralizes secret management with dynamic credentials, encryption as a service, and fine-grained access control. On Kubernetes, workloads authenticate using service accounts and pull secrets without hardcoding anything.

Installation with Helm

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

Dev Mode (Single Pod, In-Memory)

Dev mode is automatically initialized and unsealed, stores everything in memory, and loses all data on restart. The root token is the literal string root. Never use this in production.

helm upgrade --install vault hashicorp/vault \
  --namespace vault --create-namespace \
  --set server.dev.enabled=true \
  --set injector.enabled=true

Production Mode (HA with Integrated Raft Storage)

Run Vault in HA mode with Raft consensus – a 3-node StatefulSet with persistent storage.
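A minimal sketch of such an install, reusing the release name and namespace from the dev-mode example. The function names are hypothetical, and the `server.ha.*` keys should be checked against `helm show values hashicorp/vault` for the chart version you install.

```shell
# Install Vault in HA mode with integrated Raft storage and three replicas.
install_vault_ha() {
  helm upgrade --install vault hashicorp/vault \
    --namespace vault --create-namespace \
    --set server.ha.enabled=true \
    --set server.ha.raft.enabled=true \
    --set server.ha.replicas=3
}

# Unlike dev mode, an HA install starts uninitialized and sealed.
# Initialize once on the first pod; the output contains the unseal keys
# and the initial root token, so store it securely.
init_vault() {
  kubectl exec -n vault vault-0 -- vault operator init
}
```

The remaining pods still have to join the Raft cluster and be unsealed before the cluster can serve requests.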

Helm Chart Development: Templates, Helpers, and Testing

Helm Chart Development

Writing your own Helm charts turns static YAML into reusable, configurable packages. The learning curve is in Go’s template syntax and Helm’s conventions, but once you internalize the patterns, chart development is fast.

Chart Structure

Create a new chart scaffold:

helm create my-app

This generates:

my-app/
  Chart.yaml              # chart metadata (name, version, dependencies)
  values.yaml             # default configuration values
  charts/                 # dependency charts (populated by helm dependency update)
  templates/              # Kubernetes manifest templates
    deployment.yaml
    service.yaml
    ingress.yaml
    serviceaccount.yaml
    hpa.yaml
    NOTES.txt             # post-install instructions (printed after helm install)
    _helpers.tpl           # named template definitions
    tests/
      test-connection.yaml # helm test pod

Chart.yaml

The Chart.yaml defines your chart’s identity and dependencies:
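As an illustration, the helper below writes a small Chart.yaml for a Helm 3 chart. Every value in it is a placeholder: the chart name, the versions, and the dependency (including its repository URL) are hypothetical.

```shell
# Write a minimal Chart.yaml into the given chart directory.
# write_chart_yaml is a hypothetical helper; all values are placeholders.
write_chart_yaml() {
  local chart_dir="${1:-my-app}"
  mkdir -p "$chart_dir"
  cat > "$chart_dir/Chart.yaml" <<'EOF'
apiVersion: v2              # v2 is the Helm 3 chart format
name: my-app
description: A Helm chart for my-app
type: application
version: 0.1.0              # chart version (SemVer), bump on every chart change
appVersion: "1.0.0"         # version of the application the chart deploys
dependencies:
  - name: example-db        # hypothetical dependency chart
    version: "1.x.x"
    repository: https://charts.example.com
    condition: example-db.enabled
EOF
}
```

Running `helm dependency update my-app` afterwards would try to download the listed dependencies into charts/, which only works once the repository points at a real chart repo.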

Helm Values and Overrides: Precedence, Inspection, and Environment Patterns

Helm Values and Overrides

Every Helm chart has a values.yaml file that defines defaults. When you install or upgrade a release, you override those defaults through values files (-f) and inline flags (--set). Getting the precedence wrong leads to silent misconfigurations where you think you set something but the chart used a different value.
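The precedence rules can be sketched with a single command. The release name, chart path, file names, and image tag below are hypothetical; the ordering behavior is Helm's own.

```shell
# Precedence, lowest to highest:
#   chart values.yaml  <  -f files (left to right)  <  --set flags
# values-prod.yaml overrides values-base.yaml, and --set overrides both files.
deploy_prod() {
  helm upgrade --install my-app ./my-app \
    -f values-base.yaml \
    -f values-prod.yaml \
    --set image.tag=1.4.2
}

# Show the overrides supplied for the release.
# Pass --all to include the chart defaults and see the final computed values.
show_release_values() {
  helm get values my-app "$@"
}
```

Checking `show_release_values --all` after a deploy is a cheap way to catch the silent misconfigurations described above.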

Inspecting Chart Defaults

Before overriding anything, look at what the chart provides. helm show values dumps the full default values.yaml for any chart:
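A small wrapper, as a sketch, that saves the defaults to a file so they can be searched or diffed against your overrides. The function name and the default output path are assumptions.

```shell
# Save a chart's default values to a file for review.
# Usage: dump_chart_defaults <repo/chart> [output-file]
dump_chart_defaults() {
  local chart="$1" out="${2:-defaults.yaml}"
  helm show values "$chart" > "$out"
}
```

For example, `dump_chart_defaults ingress-nginx/ingress-nginx` writes that chart's defaults to defaults.yaml, provided the ingress-nginx repo has been added with `helm repo add`.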

Image Patching and Lifecycle: Keeping Container Images Current

Image Patching and Lifecycle

Building a container image and deploying it is the easy part. Keeping it patched over weeks, months, and years is where most teams fail. A container image deployed today with zero known vulnerabilities will accumulate CVEs as new vulnerabilities are disclosed against its OS packages, language runtime, and dependencies. You need an automated system that detects stale base images, triggers rebuilds, and rolls out updates safely.

Ingress Controllers and Routing Patterns

An Ingress resource defines HTTP routing rules – which hostnames and paths map to which backend Services. But an Ingress resource does nothing on its own. You need an Ingress controller running in the cluster to watch for Ingress resources and configure the actual reverse proxy.
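To make the split concrete, here is a sketch of an Ingress resource that routes one hostname to one Service. The hostname, resource name, and backend Service are hypothetical, and `ingressClassName: nginx` assumes an ingress-nginx controller installed with its default class name.

```shell
# Write a minimal Ingress manifest: all traffic for app.example.com goes to
# the Service named "web" on port 80. Names and hostname are placeholders.
write_ingress_manifest() {
  local out="${1:-ingress.yaml}"
  cat > "$out" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx     # must match the class of an installed controller
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
EOF
}
```

Applying the file with `kubectl apply -f ingress.yaml` only stores the rules; requests are routed once a controller that owns the class picks them up.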

Ingress Controllers

The two most common controllers are nginx-ingress and Traefik.

nginx-ingress (ingress-nginx):

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx --namespace ingress-nginx --create-namespace

Note: there are two different NGINX ingress projects: kubernetes/ingress-nginx (community-maintained) and nginxinc/kubernetes-ingress (maintained by NGINX Inc). The community version is far more common. Make sure you install from https://kubernetes.github.io/ingress-nginx, not the NGINX Inc chart.

Istio Service Mesh: Traffic Management, Security, and Observability

Istio Service Mesh

Istio adds a proxy sidecar (Envoy) to every pod in the mesh. These proxies handle traffic routing, mutual TLS, retries, circuit breaking, and telemetry without changing application code. The control plane (istiod) pushes configuration to all sidecars.
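In sidecar mode, injection is usually switched on per namespace with a label. The sketch below assumes Istio is already installed with the default (non-revisioned) injector; the function name is hypothetical.

```shell
# Enable automatic sidecar injection for a namespace, then restart its
# Deployments so existing pods are recreated with the Envoy sidecar.
enable_sidecar_injection() {
  local namespace="$1"

  # The injector webhook only acts on namespaces carrying this label.
  kubectl label namespace "$namespace" istio-injection=enabled --overwrite

  # Injection happens at pod creation, so running pods need a restart.
  kubectl rollout restart deployment -n "$namespace"
}
```

After `enable_sidecar_injection payments`, new pods in that namespace should list an extra `istio-proxy` container in `kubectl describe pod`.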

When You Actually Need a Service Mesh

You need Istio when you have multiple services requiring mTLS, fine-grained traffic control (canary releases, fault injection), or consistent observability across service-to-service communication. If you have fewer than five services, standard Kubernetes Services and NetworkPolicies are usually sufficient. A service mesh adds operational complexity – more moving parts, higher memory usage per sidecar, and a learning curve for proxy-level debugging.

kind Validation Templates: Cluster Configs and Lifecycle Scripts

kind Validation Templates

kind (Kubernetes IN Docker) runs Kubernetes clusters using Docker containers as nodes. It was designed for testing Kubernetes itself, which makes it an excellent tool for validating infrastructure changes. It starts fast, uses fewer resources than minikube, and is disposable by design.

This article provides copy-paste cluster configurations and complete lifecycle scripts for common validation scenarios.

Cluster Configuration Templates

Basic Single-Node

The simplest configuration. One container acts as both control plane and worker. Sufficient for validating that deployments, services, ConfigMaps, and Secrets work correctly.
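A minimal sketch of such a configuration, wrapped in small lifecycle helpers. The function names, file name, and the 60-second wait are assumptions chosen for illustration.

```shell
# Write a single-node kind config: one container acts as control plane
# and worker.
write_kind_config() {
  local out="${1:-kind-single-node.yaml}"
  cat > "$out" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
EOF
}

# Create a named cluster from a config file and wait for the control plane
# to become ready.
create_cluster() {
  kind create cluster --name "$1" --config "$2" --wait 60s
}

# Delete the cluster when validation is done; kind clusters are disposable.
delete_cluster() {
  kind delete cluster --name "$1"
}
```

A typical run is `write_kind_config`, then `create_cluster validation kind-single-node.yaml`, the checks themselves, and finally `delete_cluster validation`.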

kubectl Debugging: A Practical Command Reference

kubectl Debugging

When something breaks in Kubernetes, you need to move through a specific sequence of commands. Here is every debugging command you will reach for, plus a step-by-step workflow for a pod that will not start.

Logs

kubectl logs <pod-name> -n <namespace>                           # basic
kubectl logs <pod-name> -c <container-name> -n <namespace>       # specific container
kubectl logs <pod-name> --previous -n <namespace>                # previous crash (essential for CrashLoopBackOff)
kubectl logs -f <pod-name> -n <namespace>                        # stream in real-time
kubectl logs --since=5m <pod-name> -n <namespace>                # last 5 minutes
kubectl logs -l app=payments-api -n payments-prod --all-containers  # all pods matching label

The --previous flag is critical for crash-looping pods where the current container has no logs yet. The --all-containers flag captures init containers and sidecars.