Prometheus and Grafana on Minikube: Production-Like Monitoring Without the Cost

Why Monitor a POC Cluster#

Monitoring on minikube serves two purposes. First, it catches resource problems early – your app might work in tests but OOM-kill under load, and you will not know without metrics. Second, it validates that your monitoring configuration works before you deploy it to production. If your ServiceMonitors, dashboards, and alert rules work on minikube, they will work on EKS or GKE.

The Right Chart: kube-prometheus-stack#

There are multiple Prometheus-related Helm charts. Use the right one:

ArgoCD on Minikube: GitOps Deployments from Day One

Why GitOps on a POC Cluster#

Setting up ArgoCD on minikube is not about automating deployments for a local cluster – you could just run kubectl apply. The point is to prove the deployment workflow before production. If your Git repo structure, Helm values, and sync policies work on minikube, they will work on EKS or GKE. If you skip this and bolt on GitOps later, you will spend days restructuring your repo and debugging sync failures under production pressure.

Golden Paths and Paved Roads

What Golden Paths Are#

A golden path is a pre-built, opinionated workflow that gets a developer from zero to a production-ready artifact with minimal decisions. The term comes from Spotify’s internal platform work. Netflix calls them “paved roads.” The idea is the same: provide a well-maintained, well-tested default path that handles 80% of use cases, while allowing teams to go off-road when they have legitimate reasons.

A golden path is not a mandate. It is a recommendation backed by automation. Create a new Go microservice using the golden path and you get a repository with CI/CD, Kubernetes manifests, observability, and a Backstage catalog entry — working in minutes. The golden path removes the 40+ decisions a developer would otherwise need to make.

Azure DevOps Pipelines: YAML Pipelines, Templates, Service Connections, and AKS Integration

Azure DevOps Pipelines#

Azure DevOps Pipelines uses YAML files stored in your repository to define build and deployment workflows. The pipeline model has three levels: stages contain jobs, jobs contain steps. This hierarchy maps directly to how you think about CI/CD – build stage, test stage, deploy-to-staging stage, deploy-to-production stage – with each stage containing one or more parallel jobs.

Pipeline Structure#

A complete pipeline in azure-pipelines.yml:

trigger:
  branches:
    include:
      - main
      - release/*
  paths:
    exclude:
      - docs/**
      - README.md

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: common-vars
  - name: buildConfiguration
    value: 'Release'

stages:
  - stage: Build
    jobs:
      - job: BuildApp
        steps:
          - task: GoTool@0
            inputs:
              version: '1.22'
          - script: |
              go build -o $(Build.ArtifactStagingDirectory)/myapp ./cmd/myapp
            displayName: 'Build binary'
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: drop

  - stage: Test
    dependsOn: Build
    jobs:
      - job: UnitTests
        steps:
          - task: GoTool@0
            inputs:
              version: '1.22'
          - script: go test ./... -v -coverprofile=coverage.out
            displayName: 'Run tests'
          - task: PublishCodeCoverageResults@2
            inputs:
              summaryFileLocation: coverage.out
              codecoverageTool: 'Cobertura'

  - stage: DeployStaging
    dependsOn: Test
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToStaging
        environment: staging
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: drop
                - script: echo "Deploying to staging"

trigger controls which branches and paths trigger the pipeline. dependsOn creates stage ordering. condition adds logic – succeeded() checks the previous stage passed, and you can combine it with variable checks to restrict certain stages to specific branches.

Crossplane for Platform Abstractions

What Crossplane Does#

Crossplane extends Kubernetes to provision and manage cloud infrastructure using the Kubernetes API. Instead of writing Terraform and running apply, you write Kubernetes manifests and kubectl apply them. Crossplane controllers reconcile the desired state with the actual cloud resources.

The real value is not replacing Terraform — it is building abstractions. Platform teams define custom resource types (like DatabaseClaim) that developers consume without knowing whether they are getting RDS, CloudSQL, or Azure Database. The composition layer maps the simple claim to the actual cloud resources.

Self-Hosted CI Runners at Scale: GitHub Actions Runner Controller, GitLab Runners on K8s, and Autoscaling

Self-Hosted CI Runners at Scale#

GitHub-hosted and GitLab SaaS runners work until they do not. You hit limits when you need private network access to deploy to internal infrastructure, specific hardware like GPUs or ARM64 machines, compliance requirements that prohibit running code on shared infrastructure, or cost control when you are burning thousands of dollars per month on hosted runner minutes.

Self-hosted runners solve these problems but introduce operational complexity: you now own runner provisioning, scaling, security, image updates, and cost management.

Kubernetes Cluster Disaster Recovery: etcd Backup, Velero, and GitOps Recovery

Kubernetes Cluster Disaster Recovery#

Your cluster will fail. The question is whether you can rebuild it in hours or weeks. Kubernetes DR is not a single tool – it is a layered strategy combining etcd snapshots, resource-level backups, GitOps state, and tested recovery procedures.

The three layers of Kubernetes DR: etcd gives you raw cluster state, Velero gives you portable resource and volume backups, and GitOps gives you declarative rebuild capability. You need at least two of these.

Multi-Region Kubernetes: Service Mesh Federation, Cross-Cluster Networking, and GitOps

Multi-Region Kubernetes#

Running Kubernetes in a single region is a single point of failure at the infrastructure level. Region outages are rare but real – AWS us-east-1 has gone down multiple times, taking entire companies offline. Multi-region Kubernetes addresses this, but it introduces complexity in networking, state management, and deployment coordination that you must handle deliberately.

Independent Clusters with Shared GitOps#

The simplest multi-region pattern: run completely independent clusters in each region, deploy the same applications to all of them using GitOps, and route traffic with DNS or a global load balancer.

Stateful Workload Disaster Recovery: Storage Replication, Database Operators, and Restore Ordering

Stateful Workload Disaster Recovery#

Stateless workloads are easy to recover – redeploy from Git and they are running. Stateful workloads carry data that cannot be regenerated. Databases, message queues, object stores, and anything with a PersistentVolume needs a deliberate DR strategy that goes beyond “we have Velero.”

The fundamental challenge: you must capture data at a point in time where the application state is consistent, replicate that data to a recovery site, and restore it in the correct order. Get any of these wrong and you recover corrupted data or a broken dependency chain.

Agent Runbook Generation: Producing Verified Infrastructure Deliverables

Agent Runbook Generation#

An agent that says “you should probably add a readiness probe to your deployment” is giving advice. An agent that hands you a tested manifest with the readiness probe configured, verified against a real cluster, with rollback steps if the probe misconfigures – that agent is producing a deliverable. The difference matters.

The core thesis of infrastructure agent work is that the output is always a deliverable – a runbook, playbook, tested manifest, or validated configuration – never a direct action on someone else’s systems. This article covers the complete workflow for generating those deliverables: understanding requirements, planning steps, executing in a sandbox, capturing what worked, and packaging the result.