Grafana Loki for Log Aggregation

Loki Architecture

Loki is a log aggregation system designed by Grafana Labs. Unlike Elasticsearch, Loki does not index log content. It indexes only metadata labels, then stores compressed log chunks in object storage. This makes it cheaper to operate and simpler to scale, at the cost of slower full-text search across massive datasets.

The core components are:

  • Distributor: Receives incoming log streams from agents, validates labels, and forwards to ingesters via consistent hashing.
  • Ingester: Buffers log data in memory, builds compressed chunks, and flushes them to long-term storage (S3, GCS, filesystem).
  • Querier: Executes LogQL queries by fetching chunk references from the index and reading chunk data from storage.
  • Compactor: Runs periodic compaction on the index (especially for boltdb-shipper) and handles retention enforcement by deleting old data.
  • Query Frontend (optional): Splits large queries into smaller ones, caches results, and distributes work across queriers.
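The distributor's routing step can be sketched as follows. This is a toy hash ring to illustrate consistent hashing over a label set, not Loki's actual ring implementation; the ingester names, the token count, and the helpers `Ring` and `stream_key` are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


class Ring:
    """Toy hash ring: each ingester owns several tokens on a 32-bit circle."""

    def __init__(self, ingesters, tokens_per_ingester=64):
        self._tokens = sorted(
            (_hash(f"{name}-{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def owner(self, key: str) -> str:
        """Return the ingester owning the first token at or after the key's hash."""
        idx = bisect.bisect_left(self._positions, _hash(key))
        return self._tokens[idx % len(self._tokens)][1]  # wrap around the circle


def stream_key(tenant: str, labels: dict) -> str:
    """Build a stable key from the tenant and the sorted label set.

    Only labels identify a stream; the log line content is never hashed.
    """
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{tenant}/{{{pairs}}}"


ring = Ring(["ingester-0", "ingester-1", "ingester-2"])  # hypothetical names
key_a = stream_key("tenant-a", {"app": "api", "env": "prod"})
key_b = stream_key("tenant-a", {"env": "prod", "app": "api"})  # same labels, other order
print(key_a, "->", ring.owner(key_a))
```

Because the key depends only on the tenant and labels, every line of a given stream lands on the same ingester, which is what lets the ingester build contiguous chunks per stream.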

Deployment Modes

Loki supports three deployment modes, each suited to different scales.

Grafana Mimir for Long-Term Prometheus Storage

Prometheus stores metrics on local disk with a practical retention limit of weeks to a few months. Beyond that, you need a long-term storage solution. Grafana Mimir is a horizontally scalable, multi-tenant time series database designed for exactly this purpose. It is API-compatible with Prometheus – Grafana queries Mimir using the same PromQL, and Prometheus pushes data to Mimir via remote_write.

Mimir is the successor to Cortex. Grafana Labs forked Cortex, rewrote significant portions for performance, and released Mimir under the AGPLv3 license. If you see references to Cortex architecture, the concepts map directly to Mimir with improvements.

Grafana Organization: Folders, Permissions, Provisioning, and Dashboard Lifecycle

Folder Structure Strategy

Grafana folders organize dashboards and control access through permissions. The folder structure you choose determines how teams find dashboards and who can edit them. Three patterns work in practice, each suited to a different organizational shape.

By Team

When teams own distinct services and rarely need cross-team dashboards:

Platform/
  Node Overview
  Kubernetes Cluster
  Networking
Backend/
  API Gateway
  User Service
  Payment Service
Frontend/
  Web Vitals
  CDN Performance
Data/
  Kafka Pipelines
  ETL Jobs
  Data Quality

Each team gets Editor access to their folder and Viewer access to everything else. This works well when ownership boundaries are clear.
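One way to express this access model is as a permission payload per folder. The sketch below assumes Grafana's folder permissions HTTP API (`POST /api/folders/<uid>/permissions`, where permission level 1 is View and 2 is Edit); the team IDs and folder UIDs are hypothetical, and the code only builds the request bodies without sending them.

```python
import json

VIEW, EDIT = 1, 2  # Grafana permission levels (4 would be Admin)

# Hypothetical mapping of folder UID to the ID of the team that owns it.
FOLDER_OWNERS = {"platform": 1, "backend": 2, "frontend": 3, "data": 4}


def folder_permissions(owner_team_id: int) -> dict:
    """Owning team may edit; everyone with the Viewer role may view."""
    return {
        "items": [
            {"teamId": owner_team_id, "permission": EDIT},
            {"role": "Viewer", "permission": VIEW},
        ]
    }


# Map each API path to the body that would be POSTed to it.
plan = {
    f"/api/folders/{uid}/permissions": folder_permissions(team_id)
    for uid, team_id in FOLDER_OWNERS.items()
}

for path, body in plan.items():
    print(path, json.dumps(body))
```

Keeping the plan as data makes it easy to review in version control before anything is applied to a live Grafana instance.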

Log Analysis and Management Strategies: Structured Logging, Aggregation, Retention, and Correlation

The Decision Landscape

Log management is deceptively simple on the surface – applications write text, you store it, you search it later. In practice, every decision in the log pipeline involves tradeoffs between cost, query speed, retention depth, operational complexity, and correlation with other observability signals. This guide provides a framework for making those decisions based on your actual requirements rather than defaults or trends.

Structured Logging: The Foundation

Before choosing any aggregation tool, standardize on structured logging. Unstructured logs are human-readable but machine-hostile. Structured logs are both.
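A minimal sketch of what structured logging can look like, using only Python's standard library. The field names (`ts`, `level`, `msg`), the `fields` convention for extra attributes, and the service name are illustrative assumptions, not a required schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("checkout")  # hypothetical service name
log.info("request handled", extra={"fields": {"status": 200, "duration_ms": 12}})
```

Each line is still readable by a human, but a pipeline can now filter on `status` or `level` without regular expressions.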

Logging Patterns in Kubernetes

How Kubernetes Captures Logs

Containers write to stdout and stderr. The container runtime (containerd, CRI-O) captures these streams and writes them to files on the node. The kubelet manages these files at /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/ with symlinks from /var/log/containers/.

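The format depends on the runtime. Containerd writes each line with four space-separated fields: a timestamp, the stream name (stdout or stderr), a tag marking the line as full (F) or partial (P), and the log line itself: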

2026-02-22T10:15:32.123456789Z stdout F {"level":"info","msg":"request handled","status":200}
2026-02-22T10:15:32.456789012Z stderr F error: connection refused to database
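
A log collector has to split these lines before it can do anything with the payload. The following is a minimal sketch of that step, assuming the four-field layout shown above; `CriLogLine` and `parse_cri_line` are hypothetical names, and reassembling partial (P) lines is left out.

```python
from dataclasses import dataclass


@dataclass
class CriLogLine:
    timestamp: str  # RFC 3339 timestamp with nanoseconds, kept as a string
    stream: str     # "stdout" or "stderr"
    tag: str        # "F" = full line, "P" = partial line
    message: str    # the application's original output


def parse_cri_line(line: str) -> CriLogLine:
    """Split a log line into timestamp, stream, tag, and message."""
    # Split on the first three spaces only, so spaces in the message survive.
    timestamp, stream, tag, message = line.rstrip("\n").split(" ", 3)
    if stream not in ("stdout", "stderr"):
        raise ValueError(f"unexpected stream: {stream!r}")
    return CriLogLine(timestamp, stream, tag, message)


sample = (
    "2026-02-22T10:15:32.123456789Z stdout F "
    '{"level":"info","msg":"request handled","status":200}'
)
entry = parse_cri_line(sample)
print(entry.stream, entry.tag, entry.message)
```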

kubectl logs reads these files through the kubelet on the pod's node. It only works while the pod exists – once a pod is deleted, its log files are eventually cleaned up. This is why centralized log collection is essential.

Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures

“I’m Not Seeing Metrics” – Systematic Diagnosis

This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.

Step 1: Is the Target Being Scraped?

Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.

Status: UP    Last Scrape: 3s ago    Duration: 12ms    Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms     Error: connection refused

If the target does not appear at all, Prometheus does not know about it. This means the scrape configuration (or ServiceMonitor) is not matching the target. Jump to the ServiceMonitor checklist at the end of this guide.
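The same check can be scripted against the JSON that Prometheus serves at /api/v1/targets. The sketch below works on an already-fetched response and assumes the `activeTargets` entries carry `labels`, `scrapeUrl`, `health`, and `lastError`; the sample payload and the helper name `diagnose_job` are made up for illustration.

```python
def diagnose_job(targets_response: dict, job: str) -> list:
    """Return one finding per active target of the job, or a hint if none exist."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    matches = [t for t in active if t.get("labels", {}).get("job") == job]
    if not matches:
        return [f"{job}: not in active targets, check the scrape config or ServiceMonitor"]
    findings = []
    for target in matches:
        url = target.get("scrapeUrl", "<unknown>")
        health = str(target.get("health", "unknown")).upper()
        error = target.get("lastError") or "(none)"
        findings.append(f"{url}: {health}, error: {error}")
    return findings


# Hand-written sample shaped like a /api/v1/targets response.
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "api", "instance": "10.0.0.5:8080"},
             "scrapeUrl": "http://10.0.0.5:8080/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "api", "instance": "10.0.0.6:8080"},
             "scrapeUrl": "http://10.0.0.6:8080/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for finding in diagnose_job(sample_response, "api"):
    print(finding)
```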

OpenTelemetry for Kubernetes

What OpenTelemetry Is

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

  • Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
  • Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
  • Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.
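
To make the "tree of spans" concrete, here is a small sketch that models spans as plain data and prints the tree implied by their parent references. It deliberately does not use the OpenTelemetry SDK; the `Span` class, the span names, and the durations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    duration_ms: float
    parent_id: Optional[str] = None  # None marks the root span
    attributes: dict = field(default_factory=dict)


def render_tree(spans) -> list:
    """Return indented lines showing each span beneath its parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("a1", "GET /checkout", 120.0, attributes={"http.method": "GET"}),
    Span("b2", "auth.verify", 15.0, parent_id="a1"),
    Span("c3", "db.query", 80.0, parent_id="a1"),
]
print("\n".join(render_tree(trace)))
```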

The OTel Collector Pipeline

The Collector is the central hub. It has three pipeline stages:

Prometheus Architecture Deep Dive

Pull-Based Scraping Model

Prometheus pulls metrics from targets rather than having targets push metrics to it. Every scrape interval (1m by default, though the global config commonly sets it to 15s), Prometheus sends an HTTP GET to each target’s metrics endpoint. The target responds with all its current metric values in the Prometheus exposition format.

This pull model has concrete advantages. Prometheus controls the scrape rate, so a misbehaving target cannot flood the system. You can scrape a target from your laptop with curl http://target:8080/metrics to see exactly what Prometheus sees. Targets that go down are detected at the next scrape, because the scrape fails and Prometheus records the up metric as 0 for that target.
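
To show what a scrape response contains, the sketch below parses a small piece of exposition-format text into (name, labels, value) tuples. It is a simplified reader for illustration only: it skips comment lines, ignores optional timestamps, and does not unescape label values. The sample metrics are invented.

```python
import re

# metric_name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> list:
    """Return (name, labels, value) for every sample line in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


scrape_body = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
process_open_fds 42
"""

for name, labels, value in parse_exposition(scrape_body):
    print(name, labels, value)
```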

PromQL Essentials: Practical Query Patterns

Instant Vectors vs Range Vectors

An instant vector contains one sample per time series, all at a single point in time. A range vector contains multiple samples per time series, covering a time window.

# Instant vector: current value of each series
http_requests_total{job="api"}

# Range vector: last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

You cannot graph a range vector directly. Functions like rate() and increase() consume a range vector and return an instant vector, which Grafana can then plot.
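
The reduction from a range of samples to a single value can be sketched in plain Python. This is a simplified approximation of what rate() does for one series: it handles counter resets but omits the extrapolation to the window boundaries that Prometheus applies, so results will differ slightly from the real function. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase over a window of (unix_seconds, counter_value) pairs.

    Samples must be sorted by time. Returns None when a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted, so the new value counts from zero.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# One series scraped every 15s; the counter resets between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_rate(window))  # total increase of 100 over 60 seconds
```

The output is one number per series for the evaluation time, which is why the result of rate() can be plotted while the raw range vector cannot.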

Real User Monitoring (RUM) and Frontend Observability: Core Web Vitals, Error Tracking, and Session Replay

What Real User Monitoring Measures

Real User Monitoring (RUM) collects performance and behavior data from actual users interacting with your application in their real browsers, on their real networks, with their real hardware. Unlike synthetic monitoring, which tests a controlled scenario from a known location, RUM captures the full spectrum of user experience – including the user on a slow 3G connection in rural Brazil using a 4-year-old phone.

RUM answers questions that no amount of server-side monitoring can: How fast does the page actually load for users? Which JavaScript errors are users hitting in production? Where do users abandon a workflow? Which geographic regions experience worse performance?