Jobs and CronJobs: Batch Workloads, Retry Logic, and Scheduling

Jobs and CronJobs#

Deployments manage long-running processes. Jobs manage work that finishes. A Job creates one or more pods, runs them to completion, and tracks whether they succeeded. CronJobs run Jobs on a schedule. Both are essential for database migrations, report generation, data pipelines, and any workload that is not a continuously running server.

Job Basics#

A Job runs a pod until it exits successfully (exit code 0). The simplest case is a single pod that runs once:

kubectl debug and Ephemeral Containers: Non-Invasive Production Debugging

kubectl debug and Ephemeral Containers#

Production containers should be minimal. Distroless images, scratch-based Go binaries, and hardened base images strip out shells, package managers, and debugging tools. This is good for security and image size, but it means kubectl exec gives you nothing to work with. Ephemeral containers solve this problem.

The Problem#

A typical distroless container has no shell:

$ kubectl exec -it payments-api-7f8b9c6d4-x2k9m -- /bin/sh
OCI runtime exec failed: exec failed: unable to start container process:
exec: "/bin/sh": stat /bin/sh: no such file or directory

You cannot install tools, you cannot inspect files, and you cannot run any diagnostic commands. The application is returning 500 errors and you have nothing but logs.

Kubernetes Cost Optimization: Rightsizing, Resource Efficiency, and Waste Reduction

Kubernetes Cost Optimization#

Most Kubernetes clusters run at 15-30% actual CPU utilization but are billed for the full provisioned capacity. The gap between what you reserve and what you use is pure waste. This article covers the practical workflow for finding and eliminating that waste.

The Cost Problem: Requests vs Actual Usage#

Kubernetes resource requests are the foundation of cost. When a pod requests 4 CPUs, the scheduler reserves 4 CPUs on a node regardless of whether the pod ever uses more than 0.1 CPU. The node is sized (and billed) based on what is reserved, not what is consumed.

Kubernetes Disaster Recovery: Runbooks for Common Incidents

Kubernetes Disaster Recovery Runbooks#

These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.

Incident Response Framework#

Every incident follows the same cycle:

  1. Detect – monitoring alert, user report, or kubectl showing unhealthy state
  2. Assess – determine scope and severity. Is it one pod, one node, or the entire cluster?
  3. Contain – stop the bleeding. Prevent the issue from spreading
  4. Recover – restore normal operation
  5. Post-mortem – document what happened, why, and how to prevent it

Runbook 1: Node Goes NotReady#

Detection: Node condition changes to Ready=False. Pods on the node are rescheduled (if using Deployments). Monitoring alerts on node status.

Kubernetes Operator Development: Patterns, Frameworks, and Best Practices

Kubernetes Operator Development#

Operators are custom controllers that manage CRDs. They encode operational knowledge – the kind of tasks a human operator would perform – into software that runs inside the cluster. An operator watches for changes to its custom resources and reconciles the actual state to match the desired state, creating, updating, or deleting child resources as needed.

Operator Maturity Model#

The Operator Framework defines five maturity levels:

LevelCapabilityExample
1Basic installHelm operator deploys the application
2Seamless upgradesOperator handles version migrations
3Full lifecycleBackup, restore, failure recovery
4Deep insightsExposes metrics, fires alerts, generates dashboards
5Auto-pilotAuto-scaling, auto-healing, auto-tuning without human input

Most custom operators target Level 2-3. Levels 4-5 are typically reached by mature projects like the Prometheus Operator or Rook/Ceph.

Kubernetes Troubleshooting Decision Trees: Symptom to Diagnosis to Fix

Kubernetes Troubleshooting Decision Trees#

Troubleshooting Kubernetes in production is about eliminating possibilities in the right order. Every symptom maps to a finite set of causes, and each cause has a specific diagnostic command. The decision trees below encode that mapping. Start at the symptom, follow the branches, run the commands, and the output tells you which branch to take next.

These trees are designed to be followed mechanically. No intuition required – just execute the commands and interpret the results.

Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.

Load Balancer Patterns: L4 vs L7, Health Checks, Session Affinity, and Cloud LB Selection

L4 vs L7 Load Balancing#

The distinction between Layer 4 and Layer 7 load balancing determines what the load balancer can see and what routing decisions it can make.

Layer 4 (Transport) load balancers work at the TCP/UDP level. They see source/destination IPs and ports but not the content of the traffic. They forward raw TCP connections to backends. This makes them fast (no protocol parsing), protocol-agnostic (works for HTTP, gRPC, database connections, custom protocols), and transparent (the backend sees the original packets, mostly). Use L4 for database connections, raw TCP services, and when you need maximum throughput with minimum latency.

Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics

The Retention Problem#

Prometheus stores metrics on local disk with a default retention of 15 days. Most production teams extend this to 30 or 90 days, but local storage has hard limits. A single Prometheus instance cannot scale disk beyond the node it runs on. It provides no high availability – if the instance goes down, you lose scraping and query access. And each Prometheus instance only sees its own targets, so there is no unified view across clusters or regions.

Managed Kubernetes vs Self-Managed: EKS/AKS/GKE vs kubeadm vs k3s vs RKE

Managed Kubernetes vs Self-Managed#

The fundamental tradeoff is straightforward: managed Kubernetes trades control for reduced operational burden, while self-managed Kubernetes gives you full control at the cost of owning everything – etcd, certificates, upgrades, high availability, and recovery.

This decision has cascading effects on team structure, hiring, on-call burden, and long-term maintenance cost. Choose deliberately.

Managed Kubernetes (EKS, AKS, GKE)#

The cloud provider runs the control plane: API server, etcd, controller manager, scheduler. They handle patching, scaling, and high availability for these components. You manage worker nodes and workloads.