Toil Measurement and Reduction

SRE

What Toil Actually Is

Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth. Not all operational work is toil. Capacity planning requires judgment. Postmortem analysis produces lasting improvements. Writing automation code is engineering. Toil is the opposite: work that a machine could do but that a human currently does, over and over, without making the system any better.

CircleCI Pipeline Patterns: Orbs, Executors, Workspaces, Parallelism, and Approval Workflows

CircleCI Pipeline Patterns

CircleCI pipelines are defined in .circleci/config.yml. The configuration model uses workflows to orchestrate jobs, jobs to define execution units, and steps to define commands within a job. Every job runs inside an executor – a Docker container, Linux VM, macOS VM, or Windows VM.

Config Structure and Executors

A minimal config defines a job and a workflow:

version: 2.1

executors:
  go-executor:
    docker:
      - image: cimg/go:1.22
    resource_class: medium
    working_directory: ~/project

jobs:
  build:
    executor: go-executor
    steps:
      - checkout
      - run:
          name: Build application
          command: go build -o myapp ./cmd/myapp

workflows:
  main:
    jobs:
      - build

Named executors let you reuse environment definitions across jobs. The resource_class controls CPU and memory. For the Docker executor the common classes are small (1 vCPU/2GB), medium (2 vCPU/4GB), large (4 vCPU/8GB), and xlarge (8 vCPU/16GB). Choose the smallest class that avoids OOM kills to keep costs down.
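
As a rough illustration of "choose the smallest class that avoids OOM kills", the sketch below picks the smallest of the four classes listed above whose memory covers a job's observed peak usage plus a safety margin. The function name smallest_class, the 20% headroom default, and the peak-memory input are assumptions made for this example; they are not part of CircleCI's tooling.

```python
# Docker-executor resource classes as listed above: name -> (vCPUs, memory in GB).
RESOURCE_CLASSES = {
    "small": (1, 2),
    "medium": (2, 4),
    "large": (4, 8),
    "xlarge": (8, 16),
}


def smallest_class(peak_memory_gb: float, headroom: float = 0.2) -> str:
    """Return the smallest class whose memory covers the peak plus headroom.

    `headroom` is a hypothetical 20% default that keeps the job safely below
    the container's memory limit.
    """
    required = peak_memory_gb * (1 + headroom)
    # Walk the classes from the smallest memory size upward.
    for name, (_vcpus, memory_gb) in sorted(
        RESOURCE_CLASSES.items(), key=lambda item: item[1][1]
    ):
        if memory_gb >= required:
            return name
    raise ValueError(f"no listed class has {required:.1f} GB of memory")
```

A job that peaks at 3 GB fits in medium under these assumptions, while one that peaks at 3.5 GB is pushed to large because the headroom takes it past 4 GB.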

Golden Paths and Paved Roads

What Golden Paths Are

A golden path is a pre-built, opinionated workflow that gets a developer from zero to a production-ready artifact with minimal decisions. The term comes from Spotify’s internal platform work. Netflix calls them “paved roads.” The idea is the same: provide a well-maintained, well-tested default path that handles 80% of use cases, while allowing teams to go off-road when they have legitimate reasons.

A golden path is not a mandate. It is a recommendation backed by automation. Create a new Go microservice using the golden path and you get a repository with CI/CD, Kubernetes manifests, observability, and a Backstage catalog entry, all working in minutes. The golden path removes the 40+ decisions a developer would otherwise need to make.

Saga Pattern: Choreography, Orchestration, and Compensating Transactions

Saga Pattern

In a monolith, a single database transaction can span multiple operations atomically. In microservices, each service owns its database. There is no distributed transaction that works reliably across services. The saga pattern solves this by breaking a transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes its work if a later step fails.

The Problem: No Distributed ACID

Consider an order placement that must: (1) reserve inventory, (2) charge payment, (3) create shipment. In a monolith, this is one transaction. In microservices, these are three services with three databases. Two-phase commit (2PC) across them is fragile and slow, and most message brokers and modern databases do not support it across service boundaries.
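
The order example can be sketched as an orchestrated saga: each step pairs a local transaction with a compensating transaction, and a failure triggers the compensations of the already-completed steps in reverse order. This is a minimal in-memory sketch, not a production orchestrator. The names Step, run_saga, and fail_shipment are hypothetical, the service calls are stand-ins that append to a log, and the shipment failure is simulated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # local transaction inside one service
    compensate: Callable[[], None]  # undoes the action if a later step fails


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step made no change, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []  # stands in for calls to the three services


def fail_shipment() -> None:
    raise RuntimeError("shipment service unavailable")  # simulated failure


order_saga = [
    Step("inventory",
         lambda: log.append("reserve inventory"),
         lambda: log.append("release inventory")),
    Step("payment",
         lambda: log.append("charge payment"),
         lambda: log.append("refund payment")),
    Step("shipment",
         fail_shipment,
         lambda: log.append("cancel shipment")),
]

succeeded = run_saga(order_saga)
```

After this run, succeeded is False and the log shows the payment refunded before the inventory is released. A real orchestrator would also need to persist saga state and retry compensations that themselves fail, which this sketch leaves out.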

On-Call Rotation Design

SRE

On-Call Is a System, Not a Schedule

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types

Weekly Rotation

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.
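
A weekly rotation is simple enough to generate mechanically. The sketch below assigns one engineer per Monday-to-Monday shift in round-robin order. The function name weekly_rotation and its parameters are illustrative assumptions, and it ignores holidays, swaps, and secondary on-call.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_rotation(
    engineers: List[str], start: date, weeks: int
) -> List[Tuple[date, date, str]]:
    """Return one (shift_start, shift_end, engineer) tuple per week."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    if start.weekday() != 0:  # date.weekday() returns 0 for Monday
        raise ValueError("rotation must start on a Monday")
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        shift_end = shift_start + timedelta(weeks=1)  # handoff the next Monday
        engineer = engineers[week % len(engineers)]   # round-robin assignment
        schedule.append((shift_start, shift_end, engineer))
    return schedule
```

With a team of five, each engineer is primary once every five weeks, and the sixth week returns to the first engineer.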

Buildkite Pipeline Patterns: Dynamic Pipelines, Agents, Plugins, and Parallel Builds

Buildkite Pipeline Patterns

Buildkite splits CI/CD into two parts: a hosted web service that manages pipelines, builds, and the UI, and self-hosted agents that execute the actual work. This architecture means your source code and secrets stay off Buildkite’s infrastructure, and build artifacts can too if you point artifact uploads at your own storage. The agents run on your machines – EC2 instances, Kubernetes pods, bare metal, laptops.

Why Teams Choose Buildkite

The question usually comes up against Jenkins and GitHub Actions.

Over Jenkins: Buildkite eliminates the Jenkins controller as a single point of failure. There is no plugin compatibility hell, no Groovy DSL, no Java memory tuning. Agents are stateless binaries that poll for work. Scaling is adding more agents. Jenkins requires careful capacity planning of the controller itself.

CQRS and Event Sourcing

CQRS and Event Sourcing

CQRS (Command Query Responsibility Segregation) and event sourcing are frequently discussed together but are independent patterns. You can use either one without the other. Understanding each separately is essential before combining them.

CQRS: Separate Read and Write Models

Most applications use the same data model for reads and writes. The same orders table serves the API that creates orders and the API that lists them. This works until read and write requirements diverge significantly.
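
To make the split concrete, here is a minimal in-memory sketch in which the command side stores full orders and the query side keeps a denormalized per-customer summary for listing. The class and method names are hypothetical. The read model is updated synchronously in the same call, which is a simplification; real systems often update it asynchronously.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Order:
    order_id: str
    customer: str
    items: Dict[str, int]  # SKU -> quantity


class OrderSummaryReadModel:
    """Query side: denormalized summaries shaped for the listing API."""

    def __init__(self) -> None:
        self._by_customer: Dict[str, List[dict]] = {}

    def apply_order_created(self, order: Order) -> None:
        summary = {
            "order_id": order.order_id,
            "item_count": sum(order.items.values()),
        }
        self._by_customer.setdefault(order.customer, []).append(summary)

    def list_orders(self, customer: str) -> List[dict]:
        return list(self._by_customer.get(customer, []))


class OrderWriteModel:
    """Command side: validates and stores complete orders."""

    def __init__(self, read_model: OrderSummaryReadModel) -> None:
        self._orders: Dict[str, Order] = {}
        self._read_model = read_model

    def create_order(self, order: Order) -> None:
        if not order.items:
            raise ValueError("order must contain at least one item")
        self._orders[order.order_id] = order
        # Synchronous projection into the read model, for simplicity only.
        self._read_model.apply_order_created(order)
```

The listing API reads only from OrderSummaryReadModel, so its shape can change to suit queries without touching how orders are validated and stored.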

Platform Team Structure and Operating Model

Why the Operating Model Matters

The platform team’s operating model determines whether the platform becomes a force multiplier or a bottleneck. A ticket-driven, gatekeeper-oriented team produces a platform developers route around. A product-oriented, self-service team produces a platform developers adopt voluntarily. Organizational structure shapes developer experience more than technology choices.

Team Topologies and Interaction Modes

The Team Topologies framework (Skelton & Pais) defines four team types relevant to platform engineering:

Production Readiness Reviews

SRE

Why Services Need a Gate Before Production

Every production outage caused by a service that launched without monitoring, without runbooks, without capacity planning, without anyone knowing who owns it at 3 AM – every one of those was preventable. A production readiness review is the gate between “it works on my machine” and “it is ready for real users.” Google formalized this as the PRR process. You do not need Google-scale infrastructure to benefit from it.

Azure DevOps Pipelines: YAML Pipelines, Templates, Service Connections, and AKS Integration

Azure DevOps Pipelines

Azure DevOps Pipelines uses YAML files stored in your repository to define build and deployment workflows. The pipeline model has three levels: stages contain jobs, jobs contain steps. This hierarchy maps directly to how you think about CI/CD – build stage, test stage, deploy-to-staging stage, deploy-to-production stage – with each stage containing one or more parallel jobs.

Pipeline Structure

A complete pipeline in azure-pipelines.yml:

trigger:
  branches:
    include:
      - main
      - release/*
  paths:
    exclude:
      - docs/**
      - README.md

pool:
  vmImage: 'ubuntu-latest'

variables:
  - group: common-vars
  - name: buildConfiguration
    value: 'Release'

stages:
  - stage: Build
    jobs:
      - job: BuildApp
        steps:
          - task: GoTool@0
            inputs:
              version: '1.22'
          - script: |
              go build -o $(Build.ArtifactStagingDirectory)/myapp ./cmd/myapp
            displayName: 'Build binary'
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: drop

  - stage: Test
    dependsOn: Build
    jobs:
      - job: UnitTests
        steps:
          - task: GoTool@0
            inputs:
              version: '1.22'
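          - script: go test ./... -v -coverprofile=coverage.out
            displayName: 'Run tests'
          # Go's coverprofile is not Cobertura XML, so convert it before publishing.
          # Assumption: the third-party gocover-cobertura tool is acceptable here.
          - script: go run github.com/boumenot/gocover-cobertura@latest < coverage.out > coverage.xml
            displayName: 'Convert coverage to Cobertura'
          - task: PublishCodeCoverageResults@2
            inputs:
              summaryFileLocation: coverage.xml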

  - stage: DeployStaging
    dependsOn: Test
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployToStaging
        environment: staging
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: drop
                - script: echo "Deploying to staging"

trigger controls which branches and paths start the pipeline. dependsOn sets stage ordering. condition adds logic – succeeded() checks that the stages this one depends on passed, and combining it with a variable check, as DeployStaging does with Build.SourceBranch, restricts a stage to specific branches.