Scenario: Recovering from a Failed Deployment#
You are helping when someone reports: “we deployed a new version and it is causing errors,” “pods are not starting,” or “the service is down after a deploy.” The goal is to restore service as quickly as possible, then prevent recurrence.
Time matters here. Every minute of diagnosis while the service is degraded is a minute of user impact. The bias should be toward fast rollback first, then root cause analysis second.