Toil Measurement and Reduction

Sre

What Toil Actually Is#

Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth. Not all operational work is toil. Capacity planning requires judgment. Postmortem analysis produces lasting improvements. Writing automation code is engineering. Toil is the opposite: it is the work that a machine could do but currently a human is doing, over and over, without making the system any better.

ArgoCD Image Updater: Automatic Image Tag Updates Without Git Commits

ArgoCD Image Updater#

ArgoCD Image Updater watches container registries for new image tags and automatically updates ArgoCD Applications to use them. In a standard GitOps workflow, updating an image tag requires a Git commit that changes the tag in a values file or manifest. Image Updater automates that step.

The Problem It Solves#

Standard GitOps image update flow:

CI builds image → pushes myapp:v1.2.3 to registry
    → Developer (or CI) commits "update image tag to v1.2.3" to Git
    → ArgoCD detects Git change
    → ArgoCD syncs new tag to cluster

That middle step – committing the tag update – is friction. CI pipelines need Git write access, commit messages are noise (“bump image to v1.2.4”, “bump image to v1.2.5”), and the delay between image push and deployment depends on how fast the commit pipeline runs.

Automating Operational Runbooks

Sre

The Manual-to-Automated Progression#

Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.

Level 0 – Tribal Knowledge: The procedure exists only in someone’s head. Invisible risk.

Level 1 – Documented Runbook: Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.

Level 2 – Scripted Runbook: Manual steps encoded in a script that a human triggers and monitors. The script handles tedious parts; the human handles judgment calls.

GitHub Actions Fundamentals: Workflows, Triggers, Jobs, and Data Passing

GitHub Actions Fundamentals#

GitHub Actions is CI/CD built into GitHub. Workflows are YAML files in .github/workflows/. They run on GitHub-hosted or self-hosted machines in response to repository events. No external CI server required.

Workflow File Structure#

Every workflow has three levels: workflow (triggers and config), jobs (parallel units of work), and steps (sequential commands within a job).

name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.23'
      - run: go test ./...

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v6

Jobs run in parallel by default. Steps within a job run sequentially. Each job gets a fresh runner – no state carries over between jobs unless you explicitly pass it via artifacts or outputs.

Ansible Role Structure and Patterns

Standard Role Directory Structure#

An Ansible role is a directory with a fixed layout. Each subdirectory serves a specific purpose:

roles/
  webserver/
    tasks/
      main.yml          # Entry point — task list
    handlers/
      main.yml          # Service restart/reload triggers
    templates/
      nginx.conf.j2     # Jinja2 templates
    files/
      index.html        # Static files copied as-is
    vars/
      main.yml          # Internal variables (high precedence)
    defaults/
      main.yml          # Default variables (low precedence, meant to be overridden)
    meta/
      main.yml          # Role metadata, dependencies

Generate a skeleton with: