Incident Lifecycle

Understand the complete incident lifecycle from detection to resolution. Learn best practices for incident management and response.

Lifecycle Stages

Detection → Acknowledgment → Investigation → Resolution → Postmortem
    ↑                                                              ↓
    └──────────────────── Reopen (if needed) ────────────────────┘

1. Detection

Incidents are automatically detected when monitors fail:

  • Trigger: Monitor exceeds failure threshold (e.g., 3 consecutive failures)
  • Auto-Created: Incident created immediately
  • Alert Sent: Notifications sent to configured channels
  • Status: open

Detection Example

Time: 10:30:00 - Monitor fails (check 1/3)
Time: 10:31:00 - Monitor fails (check 2/3)
Time: 10:32:00 - Monitor fails (check 3/3)

Action: Incident created automatically
Incident ID: inc_abc123
Status: open
Severity: critical (based on alert rule)
Alerts: Sent to Slack, Email, PagerDuty
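The threshold logic above (open an incident only after N consecutive failures, reset on any success) can be sketched as follows. This is a minimal illustration, not the product's actual implementation; the `Monitor` class and its fields are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Monitor:
    """Tracks consecutive check failures and opens an incident at the threshold.

    Hypothetical sketch of the detection rule described above; not the real API.
    """
    failure_threshold: int = 3       # e.g. 3 consecutive failures
    consecutive_failures: int = 0
    incident_open: bool = False

    def record_check(self, ok: bool) -> bool:
        """Record one check result; return True if a new incident was created."""
        if ok:
            # Any success resets the streak -- failures must be consecutive.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.incident_open:
            self.incident_open = True    # incident auto-created, status: open
            return True
        return False
```

With a threshold of 3, the first two failed checks return `False` and the third returns `True`, matching the 10:30:00 to 10:32:00 sequence in the example.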

2. Acknowledgment

A team member acknowledges the incident, signaling that they are investigating:

  • Action: Click "Acknowledge" in dashboard or PagerDuty
  • Status Change: open → acknowledged
  • Notification: Team notified via Slack/Email
  • SLA Timer: Acknowledgment time recorded (target: <3 minutes)

Acknowledgment Process

Time: 10:32:15 - John Doe acknowledges incident

Incident Status: acknowledged
Acknowledged By: John Doe
Time to Acknowledge: 2 minutes 15 seconds

Notification: "John Doe is investigating Production API incident"
Channels: Slack (#incidents), Team Email
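The "Time to Acknowledge" figure in the example is measured from the first failed check (10:30:00) to the acknowledgment (10:32:15). A small sketch of that computation, assuming ISO-format timestamps (the function name is illustrative, not part of the product API):

```python
from datetime import datetime

def time_to_acknowledge(first_failure: str, acked_at: str) -> str:
    """Return the acknowledgment delay as 'Xm Ys' from two ISO timestamps.

    Measured from the first failed check, as in the example above.
    """
    delta = datetime.fromisoformat(acked_at) - datetime.fromisoformat(first_failure)
    total = int(delta.total_seconds())
    return f"{total // 60}m {total % 60}s"

# 10:30:00 first failure, 10:32:15 acknowledged -> "2m 15s"
print(time_to_acknowledge("2026-02-13T10:30:00", "2026-02-13T10:32:15"))
```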

3. Investigation

Engineer investigates root cause and adds notes:

  • Add Notes: Document findings, commands run, logs checked
  • Timeline Updates: All notes added to immutable timeline
  • Collaborate: Team members can view notes in real-time
  • External Tools: Link to Datadog, Sentry, CloudWatch logs

Investigation Example

10:33:00 - John: Checking service logs
Note: "API service returning 500 errors on /health endpoint"

10:34:00 - John: Database connection check
Note: "Database connection pool exhausted (50/50 active)"

10:35:00 - John: Root cause identified
Note: "Long-running queries blocking connection pool"

10:36:00 - Jane joins investigation
Note: "Restarting service to clear connection pool"

4. Resolution

Issue fixed and service recovered:

  • Auto-Resolve: Monitor recovers → incident auto-resolves
  • Manual Resolve: Engineer marks as resolved with note
  • Status Change: acknowledged → resolved
  • Metrics Recorded: Downtime, MTTR, resolution time

Resolution Example

10:38:42 - Monitor recovers (3 consecutive successful checks)

Action: Incident auto-resolved
Resolved By: System (auto-recovery)
Downtime: 8 minutes 42 seconds
MTTR: 6 minutes 27 seconds (from ack to resolve)

Resolution Note: "Service restarted, connection pool cleared"
Monitor Status: up
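The two figures above are measured over different spans: downtime runs from the first failed check to recovery, while MTTR here runs from acknowledgment to recovery. A sketch of that arithmetic, assuming ISO timestamps (function and key names are illustrative):

```python
from datetime import datetime

def incident_metrics(first_failure: str, acked_at: str, recovered_at: str) -> dict:
    """Compute downtime (first failure -> recovery) and MTTR (ack -> recovery)."""
    def span(start: str, end: str) -> str:
        secs = int((datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds())
        return f"{secs // 60}m {secs % 60}s"

    return {
        "downtime": span(first_failure, recovered_at),   # user-facing outage
        "mttr": span(acked_at, recovered_at),            # responder's repair time
    }

# Matches the example: 8m 42s downtime, 6m 27s MTTR
print(incident_metrics("2026-02-13T10:30:00",
                       "2026-02-13T10:32:15",
                       "2026-02-13T10:38:42"))
```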

Manual Resolution

# When monitor still down but issue fixed differently:

Action: Engineer clicks "Resolve Incident"
Resolved By: John Doe
Resolution Note: "Scaled database, increased connection pool limit"
Status: resolved (manual)

Note: Monitor may still show "down" if checks haven't run yet

5. Postmortem (Optional)

After resolution, document learnings:

  • Root Cause: What caused the incident?
  • Impact: How many users affected? Revenue lost?
  • Response: What went well? What could improve?
  • Action Items: Prevent recurrence

Postmortem Template

Incident: Production API Down (inc_abc123)
Date: 2026-02-13
Duration: 8 minutes 42 seconds

Root Cause:
- Database connection pool exhausted (50/50 connections)
- Long-running analytical queries blocking pool

Impact:
- 100% API downtime for 8.7 minutes
- ~500 users affected
- 0 revenue impact (free tier users)

Timeline:
- 10:30:00: First failure detected
- 10:32:15: Acknowledged by John
- 10:35:00: Root cause identified
- 10:38:42: Service recovered

What Went Well:
- Fast detection (3 min threshold appropriate)
- Quick acknowledgment (2m 15s)
- Clear investigation notes
- Auto-recovery worked

What Could Improve:
- Connection pool monitoring (add alert)
- Separate read-only connection pool for analytics
- Query timeout enforcement (max 30s)

Action Items:
1. Add connection pool utilization monitor (owner: John, due: 2026-02-20)
2. Implement query timeout (owner: Jane, due: 2026-02-27)
3. Split analytics to read replica (owner: DevOps, due: 2026-03-15)

Reopening Incidents

If the issue recurs after resolution:

  • Auto-Reopen: Monitor fails again within 24h → same incident reopened
  • Manual Reopen: Engineer clicks "Reopen" with reason
  • Status Change: resolved → open

Reopen Example

11:00:00 - Monitor fails again (within 24h window)

Action: Incident reopened automatically
Status: open (reopened)
Reopened By: System
Reason: "Monitor failed again after recovery"

Note: Previous resolution was ineffective, deeper investigation needed
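The auto-reopen rule above hinges on one check: did the monitor fail again within 24 hours of the previous resolution? A minimal sketch of that decision (function name and return values are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Optional

REOPEN_WINDOW = timedelta(hours=24)

def on_monitor_failure(resolved_at: Optional[str], failed_at: str) -> str:
    """Decide whether a new failure reopens the last incident or creates a new one.

    Reopen if the previous incident resolved within the last 24h; otherwise
    treat the failure as a fresh incident.
    """
    if resolved_at is not None:
        gap = datetime.fromisoformat(failed_at) - datetime.fromisoformat(resolved_at)
        if gap <= REOPEN_WINDOW:
            return "reopen"
    return "create_new"
```

In the example, the 11:00:00 failure lands 21 minutes after the 10:38:42 resolution, well inside the window, so the same incident is reopened.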

Status Transitions

| From | To | Trigger | Who |
| --- | --- | --- | --- |
| (none) | open | Monitor failure threshold reached | System |
| open | acknowledged | Engineer acknowledges | User |
| acknowledged | resolved | Monitor recovers or manual resolve | System/User |
| open | resolved | Monitor recovers (skip ack) | System |
| resolved | open | Monitor fails again or manual reopen | System/User |
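The transition table can be enforced as a small state machine that rejects any move not listed, e.g. jumping from acknowledged back to open without a resolution in between. A hypothetical sketch:

```python
# Allowed (from, to) pairs, mirroring the transition table above.
# None means "no prior incident" (the initial creation transition).
ALLOWED_TRANSITIONS = {
    (None, "open"),
    ("open", "acknowledged"),
    ("acknowledged", "resolved"),
    ("open", "resolved"),        # monitor recovers before anyone acknowledges
    ("resolved", "open"),        # reopen (automatic or manual)
}

def transition(current, new):
    """Apply a status transition, raising on any pair the table does not allow."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```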

SLA Metrics

| Metric | Target | Description |
| --- | --- | --- |
| MTTD | <2 min | Mean Time To Detect |
| MTTA | <3 min | Mean Time To Acknowledge |
| MTTR | <15 min | Mean Time To Recovery |
| MTBF | >30 days | Mean Time Between Failures |

Best Practices

1. Acknowledge Quickly

  • Target: <3 minutes for critical incidents
  • Use mobile app for faster acknowledgment
  • Acknowledge even if you can't fix immediately

2. Document Everything

  • Add notes throughout investigation
  • Include commands run, logs checked, findings
  • Link to external tools (Datadog, Sentry)
  • Makes postmortem writing easier

3. Communicate Proactively

  • Update status page with incident details
  • Post updates to Slack every 15 minutes
  • Notify when resolved
  • Thank team for patience

4. Learn from Incidents

  • Write postmortems for critical incidents
  • Blameless culture (focus on systems, not people)
  • Track action items to completion
  • Review metrics monthly

Next Steps