Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures

“I’m Not Seeing Metrics” – Systematic Diagnosis#

This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.

Step 1: Is the Target Being Scraped?#

Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.

Status: UP    Last Scrape: 3s ago    Duration: 12ms    Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms     Error: connection refused

If the target does not appear at all, Prometheus does not know about it. This means the scrape configuration (or ServiceMonitor) is not matching the target. Jump to the ServiceMonitor checklist at the end of this guide.

Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native

Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, GPU memory allocation, and exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:

OpenTelemetry for Kubernetes

What OpenTelemetry Is#

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

  • Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
  • Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
  • Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.

The OTel Collector Pipeline#

The Collector is the central hub. It has three pipeline stages:

Planning and Executing Database Migrations: Schema Changes, Data Migrations, and Zero-Downtime Patterns

Planning and Executing Database Migrations#

Database migrations are the highest-risk routine operations most teams perform. A bad migration can cause downtime, data loss, or application errors that cascade across every service that touches the affected tables. This operational sequence walks through the assessment, planning, execution, and rollback of database migrations from simple column additions to full platform changes.

Phase 1 – Assessment#

Step 1: Classify the Migration#

Every migration falls into one of three categories, each with a different risk profile:

Pod Lifecycle and Probes: Init Containers, Hooks, and Health Checks

Pod Lifecycle and Probes#

Understanding how Kubernetes starts, monitors, and stops pods is essential for running reliable services. Misconfigurations here cause cascading failures, dropped requests, and restart loops that are difficult to diagnose.

Pod Startup Sequence#

When a pod is scheduled, this is the exact order of operations:

  1. Init containers run sequentially. Each must exit 0 before the next starts.
  2. All regular containers start simultaneously.
  3. postStart hooks fire (in parallel with the container’s main process).
  4. Startup probe begins checking (if defined).
  5. Once the startup probe passes, liveness and readiness probes begin.

Init Containers#

Init containers run before your application containers and are used for setup tasks: waiting for a dependency, running database migrations, cloning config from a remote source.

Pod Security Standards: Admission Control and Secure Pod Configuration

Pod Security Standards#

Kubernetes Pod Security Standards define three security profiles that control what pods are allowed to do. Pod Security Admission (PSA) enforces these standards at the namespace level. This is the replacement for PodSecurityPolicy, which was removed in Kubernetes 1.25.

The Three Levels#

Privileged – Unrestricted. No security controls applied. Used for system-level workloads like CNI plugins, storage drivers, and logging agents that genuinely need host access.

Baseline – Prevents known privilege escalations. Blocks hostNetwork, hostPID, hostIPC, privileged containers, and most host path mounts. Allows most workloads to run without modification.

PodDisruptionBudgets Deep Dive

PodDisruptionBudgets Deep Dive#

A PodDisruptionBudget (PDB) limits how many pods from a set can be simultaneously down during voluntary disruptions – node drains, cluster upgrades, autoscaler scale-down. PDBs do not protect against involuntary disruptions like node crashes or OOM kills. They are the mechanism by which you tell Kubernetes “this service needs at least N healthy pods at all times during maintenance.”

minAvailable vs maxUnavailable#

PDBs support two fields. Use one or the other, not both.

Post-Mortem Action Item Tracking

The Action Item Problem#

Post-mortem reviews produce action items. Teams agree on what needs to change. Then weeks pass, priorities shift, and items quietly decay into a backlog nobody checks. The next incident hits the same root cause, and the post-mortem produces the same action items again.

Studies of recurring incidents consistently show the root cause was identified in a previous post-mortem, and the corresponding action item was never completed. Action item tracking is the mechanism by which incidents make systems more reliable instead of just more documented.

PostgreSQL Backup and Recovery

PostgreSQL Backup and Recovery#

A backup you have never tested restoring is not a backup. This covers the main backup tools, when to use each, point-in-time recovery, and automation.

Logical Backups: pg_dump and pg_dumpall#

pg_dump exports a single database as SQL or a compressed binary format. It takes a consistent snapshot without blocking writes.

# Custom format (compressed, supports parallel restore)
pg_dump -U postgres -Fc -d myapp -f myapp.dump

# Directory format (parallel dump)
pg_dump -U postgres -Fd -j 4 -d myapp -f myapp_dir/

pg_dumpall exports every database plus cluster-wide objects. In practice, dump roles separately and per-database for flexibility:

PostgreSQL Debugging

PostgreSQL Debugging#

When PostgreSQL breaks, it usually falls into a handful of patterns. This is a reference for diagnosing each one with specific queries and commands.

Connection Refused#

Work through these in order:

1. Is PostgreSQL running?

sudo systemctl status postgresql-16

2. Is it listening on the right address?

ss -tlnp | grep 5432

If it shows 127.0.0.1:5432 but you need remote access, set listen_addresses = '*' in postgresql.conf.

3. Does pg_hba.conf allow the connection? Check logs for no pg_hba.conf entry for host: