Incident Lifecycle

Understand the complete incident lifecycle from detection to resolution. Learn best practices for incident management and response.

Lifecycle Stages

Detection → Acknowledgment → Investigation → Resolution → Postmortem
    ↑                                                              ↓
    └──────────────────── Reopen (if needed) ────────────────────┘

1. Detection

Incidents are automatically detected when monitors fail:

  • Trigger: Monitor exceeds failure threshold (e.g., 3 consecutive failures)
  • Auto-Created: Incident created immediately
  • Alert Sent: Notifications sent to configured channels
  • Status: open

Detection Example

Time: 10:30:00 - Monitor fails (check 1/3)
Time: 10:31:00 - Monitor fails (check 2/3)
Time: 10:32:00 - Monitor fails (check 3/3)

Action: Incident created automatically
Incident ID: inc_abc123
Status: open
Severity: critical (based on alert rule)
Alerts: Sent to Slack, Email, PagerDuty
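The threshold logic above (open an incident only after N consecutive failures, reset on any success) can be sketched as follows. This is a minimal illustration, not the product's actual implementation; the `Monitor` class and its fields are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Monitor:
    """Tracks consecutive check failures and opens an incident at the threshold.

    Hypothetical sketch of the detection rule described above; not the real API.
    """
    failure_threshold: int = 3       # e.g. 3 consecutive failures
    consecutive_failures: int = 0
    incident_open: bool = False

    def record_check(self, ok: bool) -> bool:
        """Record one check result; return True if a new incident was created."""
        if ok:
            # Any success resets the streak -- failures must be consecutive.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.incident_open:
            self.incident_open = True    # incident auto-created, status: open
            return True
        return False
```

With a threshold of 3, the first two failed checks return `False` and the third returns `True`, matching the 10:30:00 to 10:32:00 sequence in the example.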

2. Acknowledgment

A team member acknowledges the incident, signaling that they are investigating:

  • Action: Click "Acknowledge" in dashboard or PagerDuty
  • Status Change: open → acknowledged
  • Notification: Team notified via Slack/Email
  • SLA Timer: Acknowledgment time recorded (target: <3 minutes)

Acknowledgment Process

Time: 10:32:15 - John Doe acknowledges incident

Incident Status: acknowledged
Acknowledged By: John Doe
Time to Acknowledge: 2 minutes 15 seconds

Notification: "John Doe is investigating Production API incident"
Channels: Slack (#incidents), Team Email
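The "Time to Acknowledge" figure in the example is measured from the first failed check (10:30:00) to the acknowledgment (10:32:15). A small sketch of that computation, assuming ISO-format timestamps (the function name is illustrative, not part of the product API):

```python
from datetime import datetime

def time_to_acknowledge(first_failure: str, acked_at: str) -> str:
    """Return the acknowledgment delay as 'Xm Ys' from two ISO timestamps.

    Measured from the first failed check, as in the example above.
    """
    delta = datetime.fromisoformat(acked_at) - datetime.fromisoformat(first_failure)
    total = int(delta.total_seconds())
    return f"{total // 60}m {total % 60}s"

# 10:30:00 first failure, 10:32:15 acknowledged -> "2m 15s"
print(time_to_acknowledge("2026-02-13T10:30:00", "2026-02-13T10:32:15"))
```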

3. Investigation

Engineer investigates root cause and adds notes:

  • Add Notes: Document findings, commands run, logs checked
  • Timeline Updates: All notes added to immutable timeline
  • Collaborate: Team members can view notes in real-time
  • External Tools: Link to Datadog, Sentry, CloudWatch logs

Investigation Example

10:33:00 - John: Checking service logs
Note: "API service returning 500 errors on /health endpoint"

10:34:00 - John: Database connection check
Note: "Database connection pool exhausted (50/50 active)"

10:35:00 - John: Root cause identified
Note: "Long-running queries blocking connection pool"

10:36:00 - Jane joins investigation
Note: "Restarting service to clear connection pool"

4. Resolution

Issue fixed and service recovered:

  • Auto-Resolve: Monitor recovers → incident auto-resolves
  • Manual Resolve: Engineer marks as resolved with note
  • Status Change: acknowledged → resolved
  • Metrics Recorded: Downtime, MTTR, resolution time

Resolution Example

10:38:42 - Monitor recovers (3 consecutive successful checks)

Action: Incident auto-resolved
Resolved By: System (auto-recovery)
Downtime: 8 minutes 42 seconds
MTTR: 6 minutes 27 seconds (from ack to resolve)

Resolution Note: "Service restarted, connection pool cleared"
Monitor Status: up
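The two figures above are measured over different spans: downtime runs from the first failed check to recovery, while MTTR here runs from acknowledgment to recovery. A sketch of that arithmetic, assuming ISO timestamps (function and key names are illustrative):

```python
from datetime import datetime

def incident_metrics(first_failure: str, acked_at: str, recovered_at: str) -> dict:
    """Compute downtime (first failure -> recovery) and MTTR (ack -> recovery)."""
    def span(start: str, end: str) -> str:
        secs = int((datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds())
        return f"{secs // 60}m {secs % 60}s"

    return {
        "downtime": span(first_failure, recovered_at),   # user-facing outage
        "mttr": span(acked_at, recovered_at),            # responder's repair time
    }

# Matches the example: 8m 42s downtime, 6m 27s MTTR
print(incident_metrics("2026-02-13T10:30:00",
                       "2026-02-13T10:32:15",
                       "2026-02-13T10:38:42"))
```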

Manual Resolution

# When monitor still down but issue fixed differently:

Action: Engineer clicks "Resolve Incident"
Resolved By: John Doe
Resolution Note: "Scaled database, increased connection pool limit"
Status: resolved (manual)

Note: Monitor may still show "down" if checks haven't run yet

5. Postmortem (Optional)

After resolution, document learnings:

  • Root Cause: What caused the incident?
  • Impact: How many users affected? Revenue lost?
  • Response: What went well? What could improve?
  • Action Items: Prevent recurrence

Postmortem Template

Incident: Production API Down (inc_abc123)
Date: 2026-02-13
Duration: 8 minutes 42 seconds

Root Cause:
- Database connection pool exhausted (50/50 connections)
- Long-running analytical queries blocking pool

Impact:
- 100% API downtime for 8.7 minutes
- ~500 users affected
- 0 revenue impact (free tier users)

Timeline:
- 10:30:00: First failure detected
- 10:32:15: Acknowledged by John
- 10:35:00: Root cause identified
- 10:38:42: Service recovered

What Went Well:
- Fast detection (3 min threshold appropriate)
- Quick acknowledgment (2m 15s)
- Clear investigation notes
- Auto-recovery worked

What Could Improve:
- Connection pool monitoring (add alert)
- Separate read-only connection pool for analytics
- Query timeout enforcement (max 30s)

Action Items:
1. Add connection pool utilization monitor (owner: John, due: 2026-02-20)
2. Implement query timeout (owner: Jane, due: 2026-02-27)
3. Split analytics to read replica (owner: DevOps, due: 2026-03-15)

Reopening Incidents

If the issue recurs after resolution:

  • Auto-Reopen: Monitor fails again within 24h → same incident reopened
  • Manual Reopen: Engineer clicks "Reopen" with reason
  • Status Change: resolved → open

Reopen Example

11:00:00 - Monitor fails again (within 24h window)

Action: Incident reopened automatically
Status: open (reopened)
Reopened By: System
Reason: "Monitor failed again after recovery"

Note: Previous resolution was ineffective, deeper investigation needed
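The auto-reopen rule above hinges on one check: did the monitor fail again within 24 hours of the previous resolution? A minimal sketch of that decision (function name and return values are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Optional

REOPEN_WINDOW = timedelta(hours=24)

def on_monitor_failure(resolved_at: Optional[str], failed_at: str) -> str:
    """Decide whether a new failure reopens the last incident or creates a new one.

    Reopen if the previous incident resolved within the last 24h; otherwise
    treat the failure as a fresh incident.
    """
    if resolved_at is not None:
        gap = datetime.fromisoformat(failed_at) - datetime.fromisoformat(resolved_at)
        if gap <= REOPEN_WINDOW:
            return "reopen"
    return "create_new"
```

In the example, the 11:00:00 failure lands 21 minutes after the 10:38:42 resolution, well inside the window, so the same incident is reopened.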

Status Transitions

| From | To | Trigger | Who |
| --- | --- | --- | --- |
| (none) | open | Monitor failure threshold reached | System |
| open | acknowledged | Engineer acknowledges | User |
| acknowledged | resolved | Monitor recovers or manual resolve | System/User |
| open | resolved | Monitor recovers (skip ack) | System |
| resolved | open | Monitor fails again or manual reopen | System/User |
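The transition table can be enforced as a small state machine that rejects any move not listed, e.g. jumping from acknowledged back to open without a resolution in between. A hypothetical sketch:

```python
# Allowed (from, to) pairs, mirroring the transition table above.
# None means "no prior incident" (the initial creation transition).
ALLOWED_TRANSITIONS = {
    (None, "open"),
    ("open", "acknowledged"),
    ("acknowledged", "resolved"),
    ("open", "resolved"),        # monitor recovers before anyone acknowledges
    ("resolved", "open"),        # reopen (automatic or manual)
}

def transition(current, new):
    """Apply a status transition, raising on any pair the table does not allow."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```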

SLA Metrics

| Metric | Target | Description |
| --- | --- | --- |
| MTTD | <2 min | Mean Time To Detect |
| MTTA | <3 min | Mean Time To Acknowledge |
| MTTR | <15 min | Mean Time To Recovery |
| MTBF | >30 days | Mean Time Between Failures |

Best Practices

1. Acknowledge Quickly

  • Target: <3 minutes for critical incidents
  • Use mobile app for faster acknowledgment
  • Acknowledge even if you can't fix immediately

2. Document Everything

  • Add notes throughout investigation
  • Include commands run, logs checked, findings
  • Link to external tools (Datadog, Sentry)
  • Makes postmortem writing easier

3. Communicate Proactively

  • Update status page with incident details
  • Post updates to Slack every 15 minutes
  • Notify when resolved
  • Thank team for patience

4. Learn from Incidents

  • Write postmortems for critical incidents
  • Blameless culture (focus on systems, not people)
  • Track action items to completion
  • Review metrics monthly

Next Steps