Build an Incident Response SOP for Fast, Reliable Recovery
Create a Seilers incident response SOP covering detection, escalation, and post-mortem steps so your team handles any production outage with confidence.
When a production incident strikes at 2 AM, the last thing your on-call engineer should be doing is figuring out what to do next. The cost of an undocumented incident response process isn’t measured only in downtime — it’s measured in cascading communication failures, missed escalations, and the organizational damage of watching your team flail in front of customers. A Seilers incident response SOP gives every person in your on-call rotation the same playbook to follow: how to assess severity, who to notify, how often to communicate status updates, and what to do when the dust settles. Documentation doesn’t slow down incident response — it makes it faster.
A complete incident response SOP covers every phase from first alert to post-mortem completion:
Incident Severity Levels (SEV1 / SEV2 / SEV3)
Define your severity scale with objective, unambiguous criteria:
SEV1 — Critical: Full service outage, data loss risk, or security breach. All customers affected. Requires immediate all-hands response.
SEV2 — Major: Significant functionality degraded or a subset of customers affected. Core workflows impaired.
SEV3 — Minor: Non-critical functionality affected or a workaround is available. Limited customer impact.
Include examples for each level to remove ambiguity under pressure.
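One way to check that your criteria really are objective is to write them as a small function before putting them in the SOP. The Python sketch below is one possible encoding of the three levels above. The input flags (`full_outage`, `data_loss_risk`, `security_breach`, `core_workflow_impaired`) are hypothetical names chosen for illustration; they are not Seilers fields.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"


def assess_severity(
    full_outage: bool,
    data_loss_risk: bool,
    security_breach: bool,
    core_workflow_impaired: bool,
) -> Severity:
    """Return the highest severity whose criteria are met.

    The flag names are illustrative assumptions; replace them with
    whatever objective signals your own definitions rely on.
    """
    if full_outage or data_loss_risk or security_breach:
        return Severity.SEV1
    if core_workflow_impaired:
        return Severity.SEV2
    # Anything else (non-critical impact or a workaround exists) is minor.
    return Severity.SEV3
```

Checking the most severe criteria first means an incident that matches several levels is always classified at the highest one.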
Detection and Alert Criteria
Define what constitutes a detectable incident: which monitoring alerts fire, what error rate thresholds or latency spikes trigger a page, and how customer-reported issues are escalated to incident status. Include the expected detection-to-declaration time.
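To make the paging criteria concrete, you could write them down as explicit thresholds. The sketch below assumes two signals, an error rate and a 99th-percentile latency; the class name, field names, and any numbers you plug in are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    max_error_rate: float       # fraction of failed requests, e.g. 0.05 = 5%
    max_p99_latency_ms: float   # ceiling for 99th-percentile latency


def should_page(error_rate: float, p99_latency_ms: float,
                thresholds: AlertThresholds) -> bool:
    """Page on-call when either signal strictly exceeds its threshold."""
    return (
        error_rate > thresholds.max_error_rate
        or p99_latency_ms > thresholds.max_p99_latency_ms
    )
```

For example, with `AlertThresholds(max_error_rate=0.05, max_p99_latency_ms=2000.0)`, a 10% error rate triggers a page even when latency looks healthy.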
Incident Commander Role and Responsibilities
Define the Incident Commander (IC) role: who fills it (typically the on-call engineer who first picks up the page), what authority they have, and what their responsibilities are — declaring the incident, coordinating the response, and driving toward resolution. The IC does not necessarily fix the problem; they manage the process.
Communication Cadence (Internal and External)
Specify how often the Incident Commander posts updates, both internally (Slack incident channel, incident bridge) and externally (status page, customer communications). For SEV1s, a 15-minute internal cadence and a 30-minute external cadence are a common starting point. Define a template for each type of update.
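A cadence is easier to follow when the next deadline is computed rather than remembered. The sketch below is a minimal illustration that uses only the SEV1 starting point mentioned above; the `UPDATE_CADENCE` table and the `next_update_due` helper are hypothetical names, and entries for other severities are left for you to define.

```python
from datetime import datetime, timedelta

# Only the SEV1 values come from the guidance above; add your own
# SEV2 and SEV3 entries once your team agrees on them.
UPDATE_CADENCE = {
    "SEV1": {
        "internal": timedelta(minutes=15),
        "external": timedelta(minutes=30),
    },
}


def next_update_due(severity: str, audience: str,
                    last_update: datetime) -> datetime:
    """Return when the next update for this audience is due."""
    try:
        interval = UPDATE_CADENCE[severity][audience]
    except KeyError:
        raise ValueError(
            f"No cadence defined for {severity}/{audience}"
        ) from None
    return last_update + interval
```

Raising an error for an undefined severity, instead of silently falling back to a default, surfaces gaps in the cadence table during a drill rather than during an outage.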
Escalation Contacts
Maintain a current, role-by-role list of who to contact for each type of incident: on-call engineering, on-call security, VP Engineering for SEV1 declarations, Customer Success Lead for customer communications, and Legal for data breach scenarios. Keep this list up to date; stale escalation contacts are among the most common failure modes in incident response.
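If you keep the contact list in a structured form, a periodic staleness check becomes straightforward. The following sketch assumes each entry records when it was last verified; the `EscalationContact` type and the 90-day default are illustrative assumptions, not a Seilers feature.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EscalationContact:
    role: str                        # e.g. "on-call security"
    incident_types: tuple[str, ...]  # incident types this role handles
    last_verified: date              # when someone last confirmed the entry


def stale_contacts(contacts: list[EscalationContact], today: date,
                   max_age: timedelta = timedelta(days=90)) -> list[EscalationContact]:
    """Return contacts that have not been verified within max_age."""
    return [c for c in contacts if today - c.last_verified > max_age]
```

Running a check like this on a schedule gives the SOP owner a short list of entries to re-confirm, so nobody has to audit the whole list by hand.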
Mitigation and Resolution Steps
Document the standard investigative steps your team follows: identifying the error source in logs, isolating the affected system, applying a mitigation (rollback, feature flag, traffic redirect), confirming stability, and transitioning from mitigation to a full resolution. Mitigation and resolution are distinct, so document both: mitigation stops the customer impact, while resolution fixes the underlying cause.
Post-Mortem Process
Specify when a post-mortem is required (typically any SEV1 or repeat SEV2), who is responsible for scheduling and leading it, the timeline for completion (commonly within 5 business days), the required sections of the post-mortem document, and how action items are tracked to completion.
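The trigger rule and the deadline are both simple enough to express in code, which removes any debate about whether a review is due. This sketch encodes the typical rule described above (any SEV1 or a repeat SEV2) and counts business days as Monday through Friday. It ignores public holidays, and both function names are hypothetical.

```python
from datetime import date, timedelta


def postmortem_required(severity: str, is_repeat: bool) -> bool:
    """Any SEV1, or a SEV2 that has happened before, needs a post-mortem."""
    return severity == "SEV1" or (severity == "SEV2" and is_repeat)


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Weekends are skipped; public holidays are not accounted for.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

For an incident resolved on Friday, 5 January 2024, `add_business_days(date(2024, 1, 5), 5)` returns Friday, 12 January 2024.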
In Seilers, go to SOPs → New SOP and name it Incident Response. Assign it to the Engineering or Operations category. In the description field, note that this SOP applies to all production incidents and is required reading for every member of the on-call rotation.
2
Define severity levels in a Decision step
Add a Decision step at the top of the SOP titled “Assess Severity.” Include your SEV1, SEV2, and SEV3 definitions with the branching actions for each — who to page, which Slack channel to open, and whether to post an immediate status page update. Decision steps make severity assessment a structured choice, not a judgment call made under pressure.
3
Add steps for each phase: Detect, Assess, Respond, Communicate, Resolve
Create a step for each phase of the incident lifecycle. Keep each step action-oriented — “Post an initial status update to the external status page” rather than “Handle communications.” Concrete actions are followable at 2 AM; vague guidance is not.
4
Embed a Post-Mortem checklist
Create a separate Post-Mortem Checklist SOP and link it inside the Resolution step using Link SOP. The checklist should fire automatically at the close of any SEV1 so the post-mortem process begins before the adrenaline fades and the details blur.
5
Assign an Engineering Lead as owner
Use Assign Owner to tag your Engineering Lead, VP of Engineering, or Head of Site Reliability as the SOP owner. They are responsible for keeping escalation contacts current, updating severity definitions as your architecture evolves, and approving any changes before publication.
6
Publish and share with the on-call rotation
Click Publish and distribute the SOP link to every member of your on-call rotation. Pin it in your #on-call or #incidents Slack channel. Consider adding it as a bookmark in your incident management tool so it’s one click away when an alert fires.
A post-mortem is only valuable if it’s systematic and blameless. Use the following checklist inside your Post-Mortem SOP to structure every review:
Timeline Reconstruction
Build a precise, chronological timeline of the incident: when the first alert fired, when the IC was paged, when key decisions were made, when customer impact began, when mitigation was applied, and when full resolution was confirmed. Use timestamps from your monitoring tools, Slack, and the incident bridge — not memory.
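Reconstructing the timeline mostly means merging timestamped events from several tools into one ordered list. The sketch below shows one way to do that, assuming you have already exported events from each source; the `TimelineEvent` type and `build_timeline` helper are hypothetical, and all timestamps are assumed to be timezone-aware and normalized to UTC so they compare correctly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime   # timezone-aware, normalized to UTC
    source: str           # e.g. "monitoring", "slack", "bridge"
    description: str


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from every source into one chronological list."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep input order.
    return sorted(merged, key=lambda event: event.timestamp)
```

Keeping the `source` on each event lets reviewers trace every entry back to the tool it came from.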
Root Cause Analysis
Identify the proximate cause (what broke) and the root cause (why the conditions existed for it to break). Use a structured method like the 5 Whys to avoid stopping at the surface level. The goal is to find the systemic failure, not the individual mistake.
Action Items
Document every action item with a clear owner, a due date, and a priority level. Action items that live in a post-mortem document with no owner or deadline are intentions, not commitments. Track action items in your project management tool and link them back to the post-mortem for traceability.
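A lightweight completeness check can catch action items that are missing an owner, a due date, or a priority before the post-mortem is closed. The following is a minimal sketch under the assumption that action items are available as simple records; the `ActionItem` fields and the `missing_fields` helper are illustrative names only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due_date: Optional[date] = None
    priority: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """List the required fields that are empty on this action item."""
    required = ("owner", "due_date", "priority")
    return [name for name in required if not getattr(item, name)]
```

An item is ready to track only when `missing_fields` returns an empty list.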
Post-Mortem Review Meeting
Schedule a 30-to-60-minute review meeting within 5 business days of incident resolution. Include the Incident Commander, the engineers who were involved, and relevant stakeholders. Run the meeting blamelessly: the focus is on what the system and process allowed to happen, not on individual fault.
Review and test this SOP at least quarterly — ideally through a tabletop exercise or a game day where you simulate an incident and walk through the SOP in real time. An untested incident response plan provides false confidence. The first time you discover a gap in your process should never be during a real outage.
Customer Support SOP
Align your support team’s response process with your incident severity levels.
SOPs Overview
Learn how to create, organize, and maintain SOPs across your entire organization in Seilers.