Monitoring Best Practices

Learn proven strategies for effective monitoring. Follow the 3-3-3 rule, avoid alert fatigue, and implement monitoring patterns that scale.

The 3-3-3 Rule

The 3-3-3 Rule is a simple framework for balanced monitoring:

3 Consecutive Failures

Wait for 3 consecutive check failures before alerting.

Why?

  • Reduces false positives from transient network blips
  • Confirms the issue is real, not a one-time fluke
  • Gives time for automatic recovery (e.g., container restart)
Alert Rule:
consecutiveFailures: 3
interval: 60 seconds

Timeline:
00:00 - Check 1: ❌ Failed
01:00 - Check 2: ❌ Failed
02:00 - Check 3: ❌ Failed → 🚨 ALERT TRIGGERED
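The debounce logic above can be sketched in a few lines; `should_alert` is a hypothetical helper, assuming each check reduces to a pass/fail boolean:

```python
from collections import deque

def should_alert(results, threshold=3):
    """Return True once `threshold` checks in a row have failed.

    `results` is an iterable of booleans (True = check passed),
    ordered oldest to newest.
    """
    recent = deque(maxlen=threshold)  # sliding window of recent checks
    for passed in results:
        recent.append(passed)
        # Alert only when the window is full and every entry failed
        if len(recent) == threshold and not any(recent):
            return True
    return False

# A transient blip between failures never triggers; three in a row do.
print(should_alert([False, True, False, False]))  # False
print(should_alert([True, False, False, False]))  # True
```

A single success resets the streak, which is exactly what filters out transient network blips.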

3 Minutes to Acknowledge

Teams should acknowledge alerts within 3 minutes of receiving them.

Why?

  • Confirms someone is aware and investigating
  • Prevents duplicate escalation
  • Establishes accountability

3 Channels for Redundancy

Configure 3 alert channels for critical services:

  • Channel 1 (Primary): Slack/Discord for team visibility
  • Channel 2 (Backup): Email for offline notifications
  • Channel 3 (Escalation): PagerDuty for on-call rotation
Alert Channels:
├─ Slack (#alerts) - Immediate team notification
├─ Email (team@example.com) - Backup, reaches everyone
└─ PagerDuty (oncall-team) - Escalation after 5 min
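One way to sketch this fan-out: notify the first two channels immediately and escalate to the third only after 5 unacknowledged minutes. The channel names and `channels_for` helper are illustrative, not part of any real API:

```python
def channels_for(elapsed_seconds, acknowledged):
    """Return which channels should receive the alert right now.

    Slack and email fire immediately; PagerDuty escalates only
    after 5 minutes (300 s) with no acknowledgement.
    """
    channels = ["slack:#alerts", "email:team@example.com"]
    if not acknowledged and elapsed_seconds >= 300:
        channels.append("pagerduty:oncall-team")
    return channels

print(channels_for(0, acknowledged=False))    # primary + backup only
print(channels_for(360, acknowledged=False))  # escalation added
```

Acknowledging the alert within the window suppresses the page, which is what makes the 3-minute acknowledgement target above meaningful.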

Avoiding Alert Fatigue

What is Alert Fatigue?

Alert Fatigue occurs when teams receive too many alerts, causing them to:

  • Ignore or dismiss alerts without reading
  • Miss critical issues buried in noise
  • Become desensitized to alerts

Symptoms of Alert Fatigue

  • 🚩 Receiving 50+ alerts per day
  • 🚩 Most alerts don't require action
  • 🚩 Team stops checking alert channels
  • 🚩 Critical alerts get ignored

Solutions

1. Increase Thresholds

Be conservative with alert conditions:

  • āŒ Bad: Alert on 1 failed check
  • āœ… Good: Alert on 3 consecutive failures
  • āŒ Bad: Alert on latency > 100ms
  • āœ… Good: Alert on latency > 2000ms

2. Use Alert Severity Levels

Not everything is P0 critical:

Severity        Definition                        Channel             Response Time
P0 (Critical)   Production down, revenue impact   PagerDuty + Slack   Immediate
P1 (High)       Degraded performance              Slack + Email       <15 min
P2 (Medium)     Non-critical issue                Slack only          <1 hour
P3 (Low)        Informational                     Email only          Next business day
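The severity table maps naturally onto a routing lookup. A minimal sketch, assuming a dict keyed by severity level (the `route` helper and SLA field names are made up for illustration):

```python
SEVERITY_ROUTES = {
    "P0": {"channels": ["pagerduty", "slack"], "respond_within_min": 0},
    "P1": {"channels": ["slack", "email"],     "respond_within_min": 15},
    "P2": {"channels": ["slack"],              "respond_within_min": 60},
    "P3": {"channels": ["email"],              "respond_within_min": 24 * 60},
}

def route(severity):
    """Look up the channels and response SLA for a severity level."""
    return SEVERITY_ROUTES[severity]

print(route("P1")["channels"])  # ['slack', 'email']
print(route("P0")["respond_within_min"])  # 0 (immediate)
```

Keeping the mapping in one table-like structure makes the policy easy to review and change alongside the documentation above.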

3. Group Related Alerts

Instead of 10 individual alerts, send 1 grouped alert:

āŒ Bad: Individual alerts
- Database server 1 down
- Database server 2 down
- Database server 3 down
- API endpoint /users failing
- API endpoint /posts failing

✅ Good: Grouped alert
Database Cluster Down (3 servers)
→ Cascading failure: 2 API endpoints affected
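Grouping can be sketched as a simple reduce over the alert stream, bucketed by service; the alert shape here is an assumption, not a real payload format:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse per-resource alerts into one summary line per service."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["service"]].append(alert["resource"])
    return [
        f"{service} failing ({len(resources)} resources)"
        for service, resources in groups.items()
    ]

alerts = [
    {"service": "database", "resource": "db-1"},
    {"service": "database", "resource": "db-2"},
    {"service": "database", "resource": "db-3"},
    {"service": "api", "resource": "/users"},
    {"service": "api", "resource": "/posts"},
]
print(group_alerts(alerts))
# ['database failing (3 resources)', 'api failing (2 resources)']
```

Five raw alerts become two summaries, so the on-call reader sees the shape of the incident instead of its noise.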

4. Silence During Maintenance

Always silence monitors during planned maintenance:

  • Deployments
  • Database migrations
  • Infrastructure updates

5. Weekly Alert Review

Analyze alert patterns every week:

  • Which alerts triggered most often?
  • How many required action?
  • Were any critical issues missed?
  • Adjust thresholds based on data
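The weekly review boils down to two ratios: alert accuracy and false-positive rate (defined under "Metrics to Track" below). A sketch, assuming each logged alert records whether it required action:

```python
def review_stats(alerts):
    """Summarise a week of alerts: total, actionable %, and noise %."""
    total = len(alerts)
    actionable = sum(1 for a in alerts if a["required_action"])
    return {
        "total": total,
        "accuracy_pct": round(100 * actionable / total, 1),
        "false_positive_pct": round(100 * (total - actionable) / total, 1),
    }

# 18 of 20 alerts needed action this week
week = [{"required_action": True}] * 18 + [{"required_action": False}] * 2
print(review_stats(week))
# {'total': 20, 'accuracy_pct': 90.0, 'false_positive_pct': 10.0}
```

If accuracy drifts below target, that is the data-driven cue to raise thresholds or delete the offending rules.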

Monitor Layering Strategy

Layer 1: Network (ICMP)

Basic reachability check:

ICMP Monitor: server.example.com
Purpose: Is server online?
Interval: 1 minute
Alert: Yes (critical if server unreachable)

Layer 2: Port (TCP)

Verify service port is open:

TCP Monitor: server.example.com:5432
Purpose: Is PostgreSQL port listening?
Interval: 1 minute
Alert: Yes (critical if port closed)

Layer 3: Application (HTTP)

Check application health:

HTTP Monitor: https://api.example.com/health
Purpose: Is application responding?
Validation: JSONPath $.status = "healthy"
Interval: 30 seconds
Alert: Yes (critical if unhealthy)

Layer 4: Business Logic (Custom)

Verify end-to-end workflows:

HTTP Monitor: POST /api/orders
Purpose: Can users place orders?
Validation: $.orderId exists
Interval: 5 minutes
Alert: Yes (critical if orders failing)

Layering Example

Problem: API not responding

Layer 1 (ICMP): ✅ Server reachable
Layer 2 (TCP): ✅ Port 443 open
Layer 3 (HTTP): ❌ /health returns 503
Layer 4 (Business): ❌ Orders failing

→ Conclusion: Application issue, not infrastructure
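Because the layers are ordered from infrastructure up to business logic, diagnosis is just "find the first failing layer". A minimal sketch of that walk (the `diagnose` helper is hypothetical):

```python
def diagnose(layers):
    """Given ordered (layer, passed) results, name the first failing layer.

    Layers run bottom-up (network → port → application → business),
    so the first failure localises the fault.
    """
    for name, passed in layers:
        if not passed:
            return f"Failure at layer: {name}"
    return "All layers healthy"

results = [("ICMP", True), ("TCP", True), ("HTTP", False), ("Business", False)]
print(diagnose(results))  # Failure at layer: HTTP
```

Here the network and port layers pass, so the HTTP failure points at the application rather than the infrastructure, matching the conclusion above.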

Check Interval Guidelines

Traditional Monitors

Service Type       Interval       Reasoning
Critical API       15-30s         Fast detection of issues
Database           1 minute       Balance speed vs load
Background Jobs    5-15 minutes   Jobs run infrequently
SSL Certificates   24 hours       Certificates rarely change

Web3 Monitors

Monitor Type       Interval       Reasoning
Gas Price          30s - 1m       Volatile, changes rapidly
Whale Wallet       1-5 minutes    Infrequent large transactions
Contract Events    1 minute       Block time ~12-15s
Liquidation Risk   5 minutes      Health factor changes slowly
DeFi Protocol      5-15 minutes   TVL/APY update gradually

Naming Conventions

Monitor Names

Use descriptive, consistent naming:

  • ✅ Production API - Health Check
  • ✅ PostgreSQL Primary - Port 5432
  • ✅ SSL Certificate - api.example.com
  • ❌ Monitor 1
  • ❌ test
  • ❌ check-api

Alert Rule Names

Include severity and service:

  • ✅ [P0] Production API Down
  • ✅ [P1] Database Slow Queries
  • ✅ [P2] Cache Hit Rate Low

Tags

Use tags for organization and filtering:

  • Environment: production, staging, development
  • Severity: critical, important, monitoring
  • Service: api, database, frontend
  • Team: backend-team, devops, sre

Documentation Standards

Runbooks

Every critical monitor should have a runbook:

# API Down Runbook

## Symptoms
- HTTP health check returns 503
- Users cannot access application

## Diagnosis
1. Check application logs: `kubectl logs -f api-deployment`
2. Verify database connectivity
3. Check recent deployments

## Resolution
1. If deployment issue: Rollback to previous version
2. If database issue: Check connection pool
3. If resource issue: Scale up pods

## Escalation
If not resolved in 15 minutes, page on-call engineer

Alert Descriptions

Include context in alert messages:

āŒ Bad: "API is down"

āœ… Good:
Production API Down - 3 consecutive failures

URL: https://api.example.com/health
Expected: 200 OK, got: 503 Service Unavailable
Uptime: 99.95% (last 30 days)
Last successful check: 2 minutes ago

Runbook: https://wiki.example.com/runbooks/api-down
Dashboard: https://blacktide.xyz/monitors/mon_abc123
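A context-rich message like the one above can be assembled from the monitor's state at alert time. A sketch with a hypothetical `format_alert` helper; the field set is illustrative:

```python
def format_alert(name, url, expected, actual, uptime, runbook):
    """Build an alert message with enough context to start diagnosis."""
    return (
        f"{name} - 3 consecutive failures\n\n"
        f"URL: {url}\n"
        f"Expected: {expected}, got: {actual}\n"
        f"Uptime: {uptime} (last 30 days)\n\n"
        f"Runbook: {runbook}"
    )

msg = format_alert(
    "Production API Down",
    "https://api.example.com/health",
    "200 OK",
    "503 Service Unavailable",
    "99.95%",
    "https://wiki.example.com/runbooks/api-down",
)
print(msg)
```

The runbook link is the key field: it turns the alert from a statement into the first step of the response.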

Testing Your Monitoring

Chaos Engineering

Regularly test that your alerts work:

  • Intentionally stop a service
  • Verify alert is triggered
  • Check alert reaches all channels
  • Measure time to detection
  • Practice incident response

Monthly Drills

Monthly Monitoring Drill:
1. Simulate outage in staging
2. Verify alerts trigger correctly
3. Practice runbook procedures
4. Time the incident response
5. Document improvements needed

Common Pitfalls

1. Monitoring Everything

āŒ Don't: Monitor every metric and log line

āœ… Do: Focus on user-impacting metrics

2. Ignoring False Positives

āŒ Don't: Leave noisy alerts in place

āœ… Do: Fix or delete alerts that cry wolf

3. No Recovery Notifications

āŒ Don't: Only alert on failures

āœ… Do: Enable notifyOnRecovery

4. Single Point of Failure

āŒ Don't: Rely on one alert channel

āœ… Do: Configure backup channels

5. No Maintenance Windows

āŒ Don't: Deploy without silencing alerts

āœ… Do: Use silence feature during deployments

Metrics to Track

Monitoring Health Metrics

  • Mean Time to Detect (MTTD): How fast do you detect issues?
  • Mean Time to Resolve (MTTR): How fast do you fix issues?
  • Alert Accuracy: % of alerts that require action
  • False Positive Rate: % of alerts that are noise
  • Coverage: % of services monitored
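MTTD and MTTR are both averages over per-incident time deltas. A sketch of the MTTD calculation, assuming each incident records when it started and when monitoring detected it:

```python
from datetime import datetime, timedelta

def mttd_minutes(incidents):
    """Mean time to detect, in minutes, over incidents with
    `started` and `detected` datetime fields."""
    deltas = [
        (i["detected"] - i["started"]).total_seconds() / 60
        for i in incidents
    ]
    return sum(deltas) / len(deltas)

t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=1)},
    {"started": t0, "detected": t0 + timedelta(minutes=3)},
]
print(mttd_minutes(incidents))  # 2.0
```

MTTR is the same calculation with a `resolved` timestamp in place of `detected`; tracking both over time shows whether threshold changes are helping or hurting.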

Target KPIs

Metric                 Target
MTTD                   < 2 minutes
MTTR                   < 15 minutes
Alert Accuracy         > 90%
False Positive Rate    < 5%
Service Coverage       100% critical services

Next Steps