Toil Measurement and Reduction

SRE

What Toil Actually Is

Toil is work tied to running a production service that is manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly with service growth. Not all operational work is toil. Capacity planning requires judgment. Postmortem analysis produces lasting improvements. Writing automation code is engineering. Toil is the opposite: work that a machine could do but that a human currently does, over and over, without making the system any better.
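To make the definition concrete, here is a minimal sketch of how a team might tag a week of operational work and estimate the share that counts as toil. The `Task` fields, the `is_toil` rule, and the sample data are illustrative assumptions, not a standard schema; the tactical and linear-scaling criteria are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One unit of operational work logged for the week (hypothetical schema)."""
    name: str
    hours: float
    manual: bool
    repetitive: bool
    automatable: bool
    produces_lasting_value: bool


def is_toil(task: Task) -> bool:
    # Assumed rule: a task is toil only if it is manual, repetitive,
    # and automatable, and leaves no lasting improvement behind.
    return (
        task.manual
        and task.repetitive
        and task.automatable
        and not task.produces_lasting_value
    )


def toil_fraction(tasks: list[Task]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing logged)."""
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.hours for t in tasks if is_toil(t)) / total


# Illustrative data only.
week = [
    Task("restart stuck workers", 6.0, True, True, True, False),
    Task("capacity planning", 4.0, True, False, False, True),
    Task("write restart automation", 10.0, True, False, False, True),
]

print(f"toil share: {toil_fraction(week):.0%}")
```

In this sample, only the worker restarts qualify: capacity planning needs judgment, and writing the automation is engineering work that leaves the system better.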

Post-Mortem Action Item Tracking

SRE

The Action Item Problem

Post-mortem reviews produce action items. Teams agree on what needs to change. Then weeks pass, priorities shift, and items quietly decay into a backlog nobody checks. The next incident hits the same root cause, and the post-mortem produces the same action items again.

Reviews of recurring incidents often show that the root cause had already been identified in a previous post-mortem and that the corresponding action item was never completed. Action item tracking is the mechanism by which incidents make systems more reliable instead of just more documented.
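As a rough illustration of what tracking can mean in practice, the sketch below records each action item with an owner, a due date, and the root cause it addresses, then surfaces items that are overdue or still open for a given cause. The `ActionItem` fields, the function names, and the sample records are hypothetical; a real team would more likely pull this data from its issue tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A post-mortem action item (hypothetical schema)."""
    title: str
    root_cause: str
    owner: str
    due: date
    completed_on: Optional[date] = None


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if i.completed_on is None and i.due < today]


def open_for_cause(items: list[ActionItem], root_cause: str) -> list[ActionItem]:
    """Open items tied to a root cause, e.g. one seen again in a new incident."""
    return [
        i for i in items
        if i.root_cause == root_cause and i.completed_on is None
    ]


# Illustrative records only.
items = [
    ActionItem("Add disk usage alert", "disk full", "alice", date(2024, 3, 1)),
    ActionItem("Rotate logs daily", "disk full", "bob", date(2024, 3, 15),
               completed_on=date(2024, 3, 10)),
    ActionItem("Add retry budget", "retry storm", "carol", date(2024, 6, 1)),
]

today = date(2024, 4, 1)
for item in overdue(items, today):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```

Running `open_for_cause` at the start of a new post-mortem is one way to check whether the same root cause already has an unfinished action item before a duplicate is written.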