Incident Lifecycle
Understand the complete incident lifecycle from detection to resolution. Learn best practices for incident management and response.
Lifecycle Stages
Detection → Acknowledgment → Investigation → Resolution → Postmortem
    ↑                                            ↓
    └──────────── Reopen (if needed) ────────────┘

1. Detection
Incidents are automatically detected when monitors fail:
- Trigger: Monitor exceeds failure threshold (e.g., 3 consecutive failures)
- Auto-Created: Incident created immediately
- Alert Sent: Notifications sent to configured channels
- Status: open
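The threshold logic above can be modeled as a small failure counter per monitor. This is a minimal sketch of the idea, not the product's actual implementation; `Monitor` and `FAILURE_THRESHOLD` are hypothetical names.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failures before an incident opens (hypothetical default)

@dataclass
class Monitor:
    name: str
    consecutive_failures: int = 0
    incident_open: bool = False

    def record_check(self, success: bool):
        """Record one health check; return 'open' if a new incident is created."""
        if success:
            # Any success resets the streak
            self.consecutive_failures = 0
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD and not self.incident_open:
            self.incident_open = True
            return "open"  # incident auto-created; alerts would fan out here
        return None  # below threshold, or incident already open (no duplicates)
```

Note that once an incident is open, further failed checks do not create duplicates; they would instead attach to the existing incident.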
Detection Example
Time: 10:30:00 - Monitor fails (check 1/3)
Time: 10:31:00 - Monitor fails (check 2/3)
Time: 10:32:00 - Monitor fails (check 3/3)
Action: Incident created automatically
Incident ID: inc_abc123
Status: open
Severity: critical (based on alert rule)
Alerts: Sent to Slack, Email, PagerDuty
2. Acknowledgment
Team member acknowledges they are investigating:
- Action: Click "Acknowledge" in dashboard or PagerDuty
- Status Change: open → acknowledged
- Notification: Team notified via Slack/Email
- SLA Timer: Acknowledgment time recorded (target: <3 minutes)
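The acknowledgment step is a status transition plus an SLA timestamp. A rough sketch, assuming acknowledgment time is measured from first detection (the `acknowledge` helper is illustrative, not a real API):

```python
from datetime import datetime

ACK_SLA_SECONDS = 180  # 3-minute acknowledgment target

def acknowledge(opened_at: datetime, acked_at: datetime, user: str) -> dict:
    """Transition open -> acknowledged and record the time-to-acknowledge SLA timer."""
    elapsed = (acked_at - opened_at).total_seconds()
    return {
        "status": "acknowledged",
        "acknowledged_by": user,
        "time_to_acknowledge_s": int(elapsed),
        "within_sla": elapsed < ACK_SLA_SECONDS,
    }
```

With the example below (first failure at 10:30:00, acknowledged at 10:32:15), this yields 135 seconds, inside the 3-minute target.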
Acknowledgment Process
Time: 10:32:15 - John Doe acknowledges incident
Incident Status: acknowledged
Acknowledged By: John Doe
Time to Acknowledge: 2 minutes 15 seconds
Notification: "John Doe is investigating Production API incident"
Channels: Slack (#incidents), Team Email
3. Investigation
Engineer investigates root cause and adds notes:
- Add Notes: Document findings, commands run, logs checked
- Timeline Updates: All notes added to immutable timeline
- Collaborate: Team members can view notes in real-time
- External Tools: Link to Datadog, Sentry, CloudWatch logs
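An immutable timeline can be modeled as an append-only log: notes can be added but never edited or removed. A hypothetical sketch:

```python
from datetime import datetime, timezone

class Timeline:
    """Append-only incident timeline; history cannot be edited or removed."""

    def __init__(self) -> None:
        self._events: list = []

    def add_note(self, author: str, text: str) -> None:
        """Append a timestamped note to the timeline."""
        self._events.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "note": text,
        })

    @property
    def events(self) -> tuple:
        # Expose a read-only view so callers cannot mutate history
        return tuple(self._events)
```

Exposing events as a tuple is one simple way to keep the log immutable from the caller's side; a real system would enforce this at the storage layer.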
Investigation Example
10:33:00 - John: Checking service logs
Note: "API service returning 500 errors on /health endpoint"
10:34:00 - John: Database connection check
Note: "Database connection pool exhausted (50/50 active)"
10:35:00 - John: Root cause identified
Note: "Long-running queries blocking connection pool"
10:36:00 - Jane joins investigation
Note: "Restarting service to clear connection pool"
4. Resolution
Issue fixed and service recovered:
- Auto-Resolve: Monitor recovers → incident auto-resolves
- Manual Resolve: Engineer marks as resolved with note
- Status Change: acknowledged → resolved
- Metrics Recorded: Downtime, MTTR, resolution time
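The metrics recorded at resolution are simple timestamp arithmetic: downtime runs from detection to recovery, MTTR from acknowledgment to recovery. A sketch (the `resolve` helper is illustrative, not the product's API):

```python
from datetime import datetime

def resolve(opened_at: datetime, acked_at: datetime, recovered_at: datetime,
            note: str, resolved_by: str = "System") -> dict:
    """Close an incident, recording downtime (detect -> recover) and MTTR (ack -> recover)."""
    return {
        "status": "resolved",
        "resolved_by": resolved_by,
        "downtime_s": int((recovered_at - opened_at).total_seconds()),
        "mttr_s": int((recovered_at - acked_at).total_seconds()),
        "resolution_note": note,
    }
```

Plugging in the example below (detected 10:30:00, acknowledged 10:32:15, recovered 10:38:42) gives 522 s of downtime (8m 42s) and an MTTR of 387 s (6m 27s).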
Resolution Example
10:38:42 - Monitor recovers (3 consecutive successful checks)
Action: Incident auto-resolved
Resolved By: System (auto-recovery)
Downtime: 8 minutes 42 seconds
MTTR: 6 minutes 27 seconds (from ack to resolve)
Resolution Note: "Service restarted, connection pool cleared"
Monitor Status: up
Manual Resolution
# When monitor still down but issue fixed differently:
Action: Engineer clicks "Resolve Incident"
Resolved By: John Doe
Resolution Note: "Scaled database, increased connection pool limit"
Status: resolved (manual)
Note: Monitor may still show "down" if checks haven't run yet
5. Postmortem (Optional)
After resolution, document learnings:
- Root Cause: What caused the incident?
- Impact: How many users affected? Revenue lost?
- Response: What went well? What could improve?
- Action Items: Prevent recurrence
Postmortem Template
Incident: Production API Down (inc_abc123)
Date: 2026-02-13
Duration: 8 minutes 42 seconds
Root Cause:
- Database connection pool exhausted (50/50 connections)
- Long-running analytical queries blocking pool
Impact:
- 100% API downtime for 8.7 minutes
- ~500 users affected
- 0 revenue impact (free tier users)
Timeline:
- 10:30:00: First failure detected
- 10:32:15: Acknowledged by John
- 10:35:00: Root cause identified
- 10:38:42: Service recovered
What Went Well:
- Fast detection (3 min threshold appropriate)
- Quick acknowledgment (2m 15s)
- Clear investigation notes
- Auto-recovery worked
What Could Improve:
- Connection pool monitoring (add alert)
- Separate read-only connection pool for analytics
- Query timeout enforcement (max 30s)
Action Items:
1. Add connection pool utilization monitor (owner: John, due: 2026-02-20)
2. Implement query timeout (owner: Jane, due: 2026-02-27)
3. Split analytics to read replica (owner: DevOps, due: 2026-03-15)
Reopening Incidents
If issue recurs after resolution:
- Auto-Reopen: Monitor fails again within 24h → same incident reopened
- Manual Reopen: Engineer clicks "Reopen" with reason
- Status Change: resolved → open
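The 24-hour auto-reopen rule could be sketched as follows; `handle_new_failure` and the dict shape are hypothetical, shown only to make the window logic concrete:

```python
from datetime import datetime, timedelta

REOPEN_WINDOW = timedelta(hours=24)  # failures within this window reopen the incident

def handle_new_failure(last_incident: dict, failed_at: datetime) -> dict:
    """Reopen a recently resolved incident instead of creating a duplicate."""
    if (last_incident["status"] == "resolved"
            and failed_at - last_incident["resolved_at"] <= REOPEN_WINDOW):
        # Same underlying problem: reopen rather than open a second incident
        last_incident["status"] = "open"
        last_incident["reopened_by"] = "System"
        last_incident["reopen_reason"] = "Monitor failed again after recovery"
        return last_incident
    # Outside the window: treat as a fresh incident
    return {"status": "open", "opened_at": failed_at}
```

Reopening keeps the original timeline and notes attached, which is usually what you want when the first fix turns out to be ineffective.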
Reopen Example
11:00:00 - Monitor fails again (within 24h window)
Action: Incident reopened automatically
Status: open (reopened)
Reopened By: System
Reason: "Monitor failed again after recovery"
Note: Previous resolution was ineffective; deeper investigation needed
Status Transitions
| From | To | Trigger | Who |
|---|---|---|---|
| - | open | Monitor failure threshold reached | System |
| open | acknowledged | Engineer acknowledges | User |
| acknowledged | resolved | Monitor recovers or manual resolve | System/User |
| open | resolved | Monitor recovers (skip ack) | System |
| resolved | open | Monitor fails again or manual reopen | System/User |
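The table above defines a small state machine. One way to enforce it (illustrative only; `ALLOWED` mirrors the table, with `None` standing for "no incident yet"):

```python
# Allowed incident state transitions, taken from the table above
ALLOWED = {
    (None, "open"),                 # monitor failure threshold reached
    ("open", "acknowledged"),       # engineer acknowledges
    ("acknowledged", "resolved"),   # monitor recovers or manual resolve
    ("open", "resolved"),           # monitor recovers before anyone acks
    ("resolved", "open"),           # monitor fails again or manual reopen
}

def transition(current, target: str) -> str:
    """Return the new status, rejecting any transition not in the table."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target
```

Rejecting illegal transitions up front (e.g. `acknowledged → open`) keeps incident history consistent even when the API is called from multiple integrations.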
SLA Metrics
| Metric | Target | Description |
|---|---|---|
| MTTD | <2 min | Mean Time To Detect |
| MTTA | <3 min | Mean Time To Acknowledge |
| MTTR | <15 min | Mean Time To Recovery |
| MTBF | >30 days | Mean Time Between Failures |
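MTTA and MTTR are averages over per-incident durations. A hypothetical sketch computing two of the table's metrics from incident timestamps (field names are assumptions, not the product's schema):

```python
def sla_report(incidents: list) -> dict:
    """Compute MTTA and MTTR (seconds) from incident records with
    opened_s / acked_s / recovered_s epoch timestamps."""
    if not incidents:
        return {}
    mtta = sum(i["acked_s"] - i["opened_s"] for i in incidents) / len(incidents)
    mttr = sum(i["recovered_s"] - i["acked_s"] for i in incidents) / len(incidents)
    return {
        "mtta_s": mtta,
        "mttr_s": mttr,
        "mtta_ok": mtta < 180,  # <3 min target
        "mttr_ok": mttr < 900,  # <15 min target
    }
```

For the single incident in this page's example (acknowledged 135 s after detection, recovered 387 s after acknowledgment), both targets are met.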
Best Practices
1. Acknowledge Quickly
- Target: <3 minutes for critical incidents
- Use mobile app for faster acknowledgment
- Acknowledge even if you can't fix immediately
2. Document Everything
- Add notes throughout investigation
- Include commands run, logs checked, findings
- Link to external tools (Datadog, Sentry)
- Makes postmortem writing easier
3. Communicate Proactively
- Update status page with incident details
- Post updates to Slack every 15 minutes
- Notify when resolved
- Thank team for patience
4. Learn from Incidents
- Write postmortems for critical incidents
- Blameless culture (focus on systems, not people)
- Track action items to completion
- Review metrics monthly
Next Steps
- Incident Timeline: Immutable event log
- Incident Management: Overview and features
- Alert Rules: Configure incident triggers
- Best Practices: Incident response tips