Grafana Mimir for Long-Term Prometheus Storage

Prometheus stores metrics on local disk with a practical retention limit of weeks to a few months. Beyond that, you need a long-term storage solution. Grafana Mimir is a horizontally scalable, multi-tenant time series database designed for exactly this purpose. It is API-compatible with Prometheus – Grafana queries Mimir using the same PromQL, and Prometheus pushes data to Mimir via remote_write.
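
To illustrate the integration, here is a minimal remote_write sketch. The hostname and tenant ID are placeholders for your own deployment; /api/v1/push and the X-Scope-OrgID header are Mimir's documented push endpoint and tenant header:

```yaml
# prometheus.yml -- ship samples to Mimir as they are scraped.
remote_write:
  - url: http://mimir.example.internal:8080/api/v1/push  # placeholder host
    headers:
      X-Scope-OrgID: team-platform   # tenant ID (placeholder) in multi-tenant setups
    queue_config:
      max_samples_per_send: 2000     # batch size per request
      capacity: 10000                # samples buffered per shard
```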

Mimir is the successor to Cortex. Grafana Labs forked Cortex, rewrote significant portions for performance, and released Mimir under the AGPLv3 license. If you see references to Cortex architecture, the concepts map directly to Mimir with improvements.

gRPC for Service-to-Service Communication

gRPC is a high-performance RPC framework that uses HTTP/2 for transport and Protocol Buffers (protobuf) for serialization. For service-to-service communication within a microservices architecture, gRPC offers significant advantages over REST: strongly typed contracts, efficient binary serialization, streaming support, and code generation in every major language.

Why gRPC for Internal Services

REST with JSON is the standard for public APIs. For internal service-to-service calls, gRPC is often the better choice.
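
To make "strongly typed contracts" concrete, here is a small hypothetical service definition. None of these names come from a real system, but protoc would generate client and server stubs from a file like this in every supported language:

```proto
// orders.proto -- a hypothetical internal service contract.
syntax = "proto3";

package orders.v1;

service OrderService {
  // Unary call: one request, one response.
  rpc GetOrder(GetOrderRequest) returns (Order);
  // Server streaming: the caller receives a stream of updates.
  rpc WatchOrder(GetOrderRequest) returns (stream Order);
}

message GetOrderRequest {
  string order_id = 1;
}

message Order {
  string order_id    = 1;
  string status      = 2;
  int64  total_cents = 3;
}
```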

Incident Management Lifecycle

Incident Lifecycle Overview

An incident is an unplanned disruption to a service that requires a coordinated response. The lifecycle has six phases: detection, triage, communication, mitigation, resolution, and review. Each phase has defined actions, owners, and exit criteria.

Phase 1: Detection

Incidents are detected through three channels. Automated monitoring is best – alerts fire on SLO violations or error thresholds before users notice. Internal reports come from other teams noticing issues with their dependencies. Customer reports are the worst case – if users detect your incidents first, your observability has gaps.
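
For the first channel, here is a sketch of what an automated detection rule might look like as a Prometheus alerting rule. The job label, metric name, and thresholds are placeholders:

```yaml
# alert-rules.yml -- fire on an error-rate threshold before users notice,
# assuming standard request counters (placeholder names).
groups:
  - name: availability
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "checkout error rate above 1% for 5 minutes"
```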

Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures

Infrastructure Disaster Recovery with Terraform

Application disaster recovery is well-understood: replicate data, failover traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?

This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.
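
State protection starts with a versioned remote backend. A minimal sketch using the S3 backend follows; the bucket and table names are placeholders, and the versioning and replication called out in the comments are what make point-in-time state recovery possible:

```hcl
# backend.tf -- remote state with locking and encryption (names are placeholders).
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # enable bucket versioning + cross-region replication
    key            = "platform/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # state locking
    encrypt        = true
  }
}
```

With bucket versioning enabled, recovering from a corrupted or deleted state file means restoring the previous object version rather than rebuilding state from scratch.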

Infrastructure Knowledge Scoping for Agents

An agent working on infrastructure tasks needs to operate at the right level of specificity. Giving generic Kubernetes advice when the user runs EKS with IRSA is unhelpful – the agent misses the IAM integration that will make or break the deployment. Giving EKS-specific advice when the user runs minikube on a laptop is equally unhelpful – the agent references services and configurations that do not exist.

Kubernetes API Server: Architecture, Authentication, Authorization, and Debugging

The API server (kube-apiserver) is the front door to your Kubernetes cluster. Every interaction – kubectl commands, controller reconciliation loops, kubelet status updates, admission webhooks – goes through the API server. It is the only component that reads from and writes to etcd. If the API server is down, the cluster is unmanageable. Everything else (scheduler, controllers, kubelets) can tolerate brief API server outages because they cache state locally, but no mutations happen until the API server is back.
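
When debugging a misbehaving API server, its built-in health endpoints are the first stop. The /livez and /readyz endpoints (and their per-check sub-paths) are standard kube-apiserver endpoints, queried here through kubectl:

```sh
# Aggregate liveness and readiness, with per-check detail.
kubectl get --raw='/livez?verbose'
kubectl get --raw='/readyz?verbose'
# Individual checks can be queried on their own, e.g. etcd connectivity.
kubectl get --raw='/readyz/etcd'
```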

Kubernetes Audit Logging: Tracking API Activity for Security and Compliance

Audit logging records every request to the Kubernetes API server. Every kubectl command, every controller reconciliation, every kubelet heartbeat, every admission webhook call – all of it can be captured with the requester’s identity, the target resource, the timestamp, and optionally the full request and response bodies. Without audit logging, you have no record of who did what in your cluster. With it, you can trace security incidents, satisfy compliance requirements, and debug access control issues.
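
Capture is controlled by a policy file passed to the API server via --audit-policy-file. The Policy kind below is the standard audit.k8s.io/v1 API; the specific rules are illustrative:

```yaml
# audit-policy.yaml -- what to record, and at what level of detail.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request/response bodies for secret access: high value, high volume.
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
  # Drop noisy, low-value heartbeat traffic.
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
  # Everything else: record who did what, but not the payloads.
  - level: Metadata
```

Rule order matters: the first matching rule determines the level, so the catch-all Metadata rule goes last.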

Kubernetes Controllers: Reconciliation Loops, the Controller Manager, and Custom Controllers

Kubernetes is a declarative system. You tell it what you want (a Deployment with 3 replicas), and controllers make it happen. Controllers are the engines that continuously reconcile desired state with actual state. Without controllers, your YAML manifests would be inert data in etcd.

The Controller Pattern

Every controller follows the same loop:

1. Watch the API server for changes to a specific resource type
2. For each change, compare desired state (spec) to actual state (status)
3. Take action to bring actual state closer to desired state
4. Update status to reflect current actual state
5. Repeat

This is a level-triggered model, not edge-triggered. A controller does not just react to changes – it reconciles the entire state on each pass. If a controller crashes and restarts, it re-reads all objects and converges to the correct state without needing to replay missed events. This makes controllers resilient to transient failures.
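
Here is a sketch of that loop in Go, with a hypothetical Client interface standing in for the real API machinery. Production controllers use informers and workqueues, but the level-triggered shape is the same:

```go
package controller

import (
	"context"
	"fmt"
	"time"
)

// Object is a hypothetical stand-in for a Kubernetes resource:
// Spec* fields are desired state, Status* fields are observed state.
type Object struct {
	Name           string
	SpecReplicas   int
	StatusReplicas int
}

// Client is a hypothetical stand-in for an API server client.
type Client interface {
	List(ctx context.Context) ([]Object, error)
	Scale(ctx context.Context, name string, replicas int) error
	UpdateStatus(ctx context.Context, obj Object) error
}

// reconcile walks every object on each pass. This is the level-triggered
// property: a restarted controller simply re-lists and converges, with no
// missed-event replay.
func reconcile(ctx context.Context, c Client) error {
	objs, err := c.List(ctx)
	if err != nil {
		return err
	}
	for _, obj := range objs {
		if obj.StatusReplicas != obj.SpecReplicas {
			// Step 3: drive actual state toward desired state.
			if err := c.Scale(ctx, obj.Name, obj.SpecReplicas); err != nil {
				return fmt.Errorf("scale %s: %w", obj.Name, err)
			}
			obj.StatusReplicas = obj.SpecReplicas
		}
		// Step 4: record what was observed, whether or not we acted.
		if err := c.UpdateStatus(ctx, obj); err != nil {
			return err
		}
	}
	return nil
}

// Run repeats the loop until the context is cancelled (step 5).
func Run(ctx context.Context, c Client) {
	ticker := time.NewTicker(30 * time.Second) // periodic resync
	defer ticker.Stop()
	for {
		if err := reconcile(ctx, c); err != nil {
			fmt.Println("reconcile error:", err) // log and retry next tick
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```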

Kubernetes Scheduler: How Pods Get Placed on Nodes

The scheduler (kube-scheduler) watches for newly created pods that have no node assignment. For each unscheduled pod, the scheduler selects the best node and writes a binding back to the API server. The kubelet on that node then starts the pod. If no node is suitable, the pod stays Pending until conditions change.

The scheduler is the reason pods run where they do. Understanding its internals is essential for diagnosing Pending pods, designing placement constraints, and managing cluster utilization.
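
To ground this, here is a hypothetical pod spec showing the kinds of constraints the scheduler evaluates. A nodeSelector that matches no node's labels, or a resource request no node can satisfy, are two of the most common reasons a pod stays Pending:

```yaml
# pod.yaml -- placement constraints evaluated at scheduling time (names are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: example-worker
spec:
  nodeSelector:
    disktype: ssd             # hard requirement: node label must match
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule      # allows placement onto tainted batch nodes
  containers:
    - name: worker
      image: example/worker:1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"         # the scheduler filters nodes on requests,
          memory: 256Mi       # not limits
```

When a pod cannot be placed, kubectl describe pod shows the scheduler's filtering results in the Events section.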

Long-Running Workflow Orchestration: State Machines, Checkpointing, and Resumable Multi-Agent Execution

Long-Running Workflow Orchestration

Most agent examples show single-turn or single-session tasks: answer a question, write a function, debug an error. Real projects are different. Building a feature, migrating a database, setting up a monitoring stack – these take hours, span multiple sessions, involve parallel work streams, and must survive context window resets, session timeouts, and partial failures.

This article covers the architecture for workflows that last hours or days: how to model progress as a state machine, how to checkpoint for reliable resumption, how to delegate to parallel sub-agents without losing coherence, and how to recover when things fail partway through.
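
As a minimal sketch of the checkpointing idea, here is a file-backed state machine in Go. The phase names and file path are hypothetical, standing in for whatever durable store and workflow model the orchestrator actually uses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Phase is one node in the workflow state machine (names are hypothetical).
type Phase string

const (
	PhasePlan    Phase = "plan"
	PhaseExecute Phase = "execute"
	PhaseVerify  Phase = "verify"
	PhaseDone    Phase = "done"
)

// Checkpoint is everything needed to resume after a crash, a session
// timeout, or a context window reset.
type Checkpoint struct {
	Phase     Phase             `json:"phase"`
	Completed []string          `json:"completed_steps"`
	Artifacts map[string]string `json:"artifacts"` // e.g. file paths, PR URLs
}

// save writes atomically: write a temp file, then rename, so a crash
// mid-write never leaves a corrupt checkpoint behind.
func save(path string, cp Checkpoint) error {
	data, err := json.MarshalIndent(cp, "", "  ")
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// load resumes from an existing checkpoint, or starts fresh if none exists.
func load(path string) (Checkpoint, error) {
	cp := Checkpoint{Phase: PhasePlan, Artifacts: map[string]string{}}
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return cp, nil // fresh run: begin at the first phase
	}
	if err != nil {
		return cp, err
	}
	if err := json.Unmarshal(data, &cp); err != nil {
		return cp, err
	}
	return cp, nil
}

func main() {
	cp, err := load("workflow.json")
	if err != nil {
		panic(err)
	}
	fmt.Println("resuming from phase:", cp.Phase)
	// Run the current phase, then advance and checkpoint before moving on.
	cp.Phase = PhaseExecute
	cp.Completed = append(cp.Completed, "wrote migration plan")
	if err := save("workflow.json", cp); err != nil {
		panic(err)
	}
}
```

The atomic write-then-rename is the key design choice: a resuming session either sees the previous complete checkpoint or the new one, never a half-written file.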