Monitoring Best Practices
Learn proven strategies for effective monitoring. Follow the 3-3-3 rule, avoid alert fatigue, and implement monitoring patterns that scale.
The 3-3-3 Rule
The 3-3-3 Rule is a simple framework for balanced monitoring:
3 Consecutive Failures
Wait for 3 consecutive check failures before alerting.
Why?
- Reduces false positives from transient network blips
- Confirms the issue is real, not a one-time fluke
- Gives time for automatic recovery (e.g., container restart)
Alert Rule:
consecutiveFailures: 3
interval: 60 seconds
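The rule above can be sketched as a small failure tracker; `FailureTracker` here is purely illustrative, not part of any monitoring SDK:

```python
class FailureTracker:
    """Tracks consecutive check failures and decides when to alert."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly on the threshold so the alert is sent once, not on every
        # subsequent failing check
        return self.consecutive_failures == self.threshold

tracker = FailureTracker(threshold=3)
results = [tracker.record(passed) for passed in (False, False, False)]
print(results)  # [False, False, True] -- only the 3rd failure alerts
```

Note that a single passing check resets the streak, which is exactly what filters out transient blips.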
Timeline:
00:00 - Check 1: ❌ Failed
01:00 - Check 2: ❌ Failed
02:00 - Check 3: ❌ Failed → 🚨 ALERT TRIGGERED
3 Minutes to Acknowledge
Team should acknowledge alerts within 3 minutes of receiving them.
Why?
- Confirms someone is aware and investigating
- Prevents duplicate escalation
- Establishes accountability
3 Channels for Redundancy
Configure 3 alert channels for critical services:
- Channel 1 (Primary): Slack/Discord for team visibility
- Channel 2 (Backup): Email for offline notifications
- Channel 3 (Escalation): PagerDuty for on-call rotation
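Channel fan-out only adds redundancy if one channel's outage cannot block the others. A minimal sketch of that dispatch pattern, with placeholder sender functions standing in for real Slack/email/PagerDuty integrations:

```python
from typing import Callable

def dispatch_alert(message: str,
                   channels: list[tuple[str, Callable[[str], None]]]) -> list[str]:
    """Send an alert to every configured channel, tolerating per-channel failures."""
    delivered = []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # A dead channel must not block the backups -- skip and continue
            continue
    return delivered

# Placeholder senders; real integrations would call webhook / SMTP / API clients
def slack(msg: str) -> None: pass
def email(msg: str) -> None: raise OSError("mail server unreachable")  # simulated outage
def pagerduty(msg: str) -> None: pass

print(dispatch_alert("Production API down",
                     [("slack", slack), ("email", email), ("pagerduty", pagerduty)]))
# ['slack', 'pagerduty'] -- the email outage did not stop escalation
```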
Alert Channels:
├─ Slack (#alerts) - Immediate team notification
├─ Email (team@example.com) - Backup, reaches everyone
└─ PagerDuty (oncall-team) - Escalation after 5 min
Avoiding Alert Fatigue
What is Alert Fatigue?
Alert Fatigue occurs when teams receive too many alerts, causing them to:
- Ignore or dismiss alerts without reading
- Miss critical issues buried in noise
- Become desensitized to alerts
Symptoms of Alert Fatigue
- 🚩 Receiving 50+ alerts per day
- 🚩 Most alerts don't require action
- 🚩 Team stops checking alert channels
- 🚩 Critical alerts get ignored
Solutions
1. Increase Thresholds
Be conservative with alert conditions:
- ❌ Bad: Alert on 1 failed check
- ✅ Good: Alert on 3 consecutive failures
- ❌ Bad: Alert on latency > 100ms
- ✅ Good: Alert on latency > 2000ms
2. Use Alert Severity Levels
Not everything is P0 critical:
| Severity | Definition | Channel | Response Time |
|---|---|---|---|
| P0 (Critical) | Production down, revenue impact | PagerDuty + Slack | Immediate |
| P1 (High) | Degraded performance | Slack + Email | <15 min |
| P2 (Medium) | Non-critical issue | Slack only | <1 hour |
| P3 (Low) | Informational | Email only | Next business day |
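The severity matrix above maps naturally onto a routing table. A minimal sketch, assuming hypothetical channel names:

```python
# Hypothetical routing table mirroring the severity matrix above
SEVERITY_CHANNELS = {
    "P0": ["pagerduty", "slack"],   # critical: page on-call and notify the team
    "P1": ["slack", "email"],       # high: team channel plus backup
    "P2": ["slack"],                # medium: team channel only
    "P3": ["email"],                # low: informational digest
}

def route(severity: str) -> list[str]:
    """Pick channels for a severity; unknown severities fall back to email."""
    return SEVERITY_CHANNELS.get(severity, ["email"])

print(route("P0"))  # ['pagerduty', 'slack']
```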
3. Group Related Alerts
Instead of 10 individual alerts, send 1 grouped alert:
❌ Bad: Individual alerts
- Database server 1 down
- Database server 2 down
- Database server 3 down
- API endpoint /users failing
- API endpoint /posts failing
✅ Good: Grouped alert
Database Cluster Down (3 servers)
→ Cascading failure: 2 API endpoints affected
4. Silence During Maintenance
Always silence monitors during planned maintenance:
- Deployments
- Database migrations
- Infrastructure updates
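A silence check is just an interval test against the current time. A minimal sketch, with a made-up deployment window for illustration:

```python
from datetime import datetime, timezone

def is_silenced(now: datetime,
                windows: list[tuple[datetime, datetime]]) -> bool:
    """Return True if `now` falls inside any planned maintenance window."""
    return any(start <= now < end for start, end in windows)

# A one-hour deployment window (UTC); alerts inside it should be suppressed
deploy = (datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
          datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc))

print(is_silenced(datetime(2024, 6, 1, 2, 30, tzinfo=timezone.utc), [deploy]))  # True
```

Using a half-open interval (`start <= now < end`) means alerting resumes the instant the window closes, with no overlap ambiguity at the boundary.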
5. Weekly Alert Review
Analyze alert patterns every week:
- Which alerts triggered most often?
- How many required action?
- Were any critical issues missed?
- Adjust thresholds based on data
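A weekly review can start from two numbers per monitor: how often it fired and how often it needed action. A rough sketch over hypothetical alert records:

```python
from collections import Counter

def review(alerts: list[dict]) -> dict:
    """Summarize a week of alerts: noisiest monitors and actionable rate."""
    by_monitor = Counter(a["monitor"] for a in alerts)
    actionable = sum(1 for a in alerts if a["required_action"])
    return {
        "noisiest": by_monitor.most_common(3),
        "actionable_pct": round(100 * actionable / len(alerts), 1) if alerts else 0.0,
    }

# Hypothetical records; a real review would pull these from alert history
week = [
    {"monitor": "api-health", "required_action": False},
    {"monitor": "api-health", "required_action": False},
    {"monitor": "api-health", "required_action": True},
    {"monitor": "db-port", "required_action": True},
]
print(review(week))
# {'noisiest': [('api-health', 3), ('db-port', 1)], 'actionable_pct': 50.0}
```

A monitor that fires often but rarely needs action is a candidate for a higher threshold or deletion.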
Monitor Layering Strategy
Layer 1: Network (ICMP)
Basic reachability check:
ICMP Monitor: server.example.com
Purpose: Is server online?
Interval: 1 minute
Alert: Yes (critical if server unreachable)
Layer 2: Port (TCP)
Verify service port is open:
TCP Monitor: server.example.com:5432
Purpose: Is PostgreSQL port listening?
Interval: 1 minute
Alert: Yes (critical if port closed)
Layer 3: Application (HTTP)
Check application health:
HTTP Monitor: https://api.example.com/health
Purpose: Is application responding?
Validation: JSONPath $.status = "healthy"
Interval: 30 seconds
Alert: Yes (critical if unhealthy)
Layer 4: Business Logic (Custom)
Verify end-to-end workflows:
HTTP Monitor: POST /api/orders
Purpose: Can users place orders?
Validation: $.orderId exists
Interval: 5 minutes
Alert: Yes (critical if orders failing)
Layering Example
Problem: API not responding
Layer 1 (ICMP): ✅ Server reachable
Layer 2 (TCP): ✅ Port 443 open
Layer 3 (HTTP): ❌ /health returns 503
Layer 4 (Business): ❌ Orders failing
→ Conclusion: Application issue, not infrastructure
Check Interval Guidelines
Traditional Monitors
| Service Type | Interval | Reasoning |
|---|---|---|
| Critical API | 15-30s | Fast detection of issues |
| Database | 1 minute | Balance speed vs load |
| Background Jobs | 5-15 minutes | Jobs run infrequently |
| SSL Certificates | 24 hours | Certificates rarely change |
Web3 Monitors
| Monitor Type | Interval | Reasoning |
|---|---|---|
| Gas Price | 30s - 1m | Volatile, changes rapidly |
| Whale Wallet | 1-5 minutes | Infrequent large transactions |
| Contract Events | 1 minute | Block time ~12-15s |
| Liquidation Risk | 5 minutes | Health factor changes slowly |
| DeFi Protocol | 5-15 minutes | TVL/APY update gradually |
Naming Conventions
Monitor Names
Use descriptive, consistent naming:
- ✅ Production API - Health Check
- ✅ PostgreSQL Primary - Port 5432
- ✅ SSL Certificate - api.example.com
- ❌ Monitor 1
- ❌ test
- ❌ check-api
Alert Rule Names
Include severity and service:
- ✅ [P0] Production API Down
- ✅ [P1] Database Slow Queries
- ✅ [P2] Cache Hit Rate Low
Tags
Use tags for organization and filtering:
- Environment: `production`, `staging`, `development`
- Severity: `critical`, `important`, `monitoring`
- Service: `api`, `database`, `frontend`
- Team: `backend-team`, `devops`, `sre`
Documentation Standards
Runbooks
Every critical monitor should have a runbook:
# API Down Runbook
## Symptoms
- HTTP health check returns 503
- Users cannot access application
## Diagnosis
1. Check application logs: `kubectl logs -f api-deployment`
2. Verify database connectivity
3. Check recent deployments
## Resolution
1. If deployment issue: Rollback to previous version
2. If database issue: Check connection pool
3. If resource issue: Scale up pods
## Escalation
If not resolved in 15 minutes, page on-call engineer
Alert Descriptions
Include context in alert messages:
❌ Bad: "API is down"
✅ Good:
Production API Down - 3 consecutive failures
URL: https://api.example.com/health
Expected: 200 OK, got: 503 Service Unavailable
Uptime: 99.95% (last 30 days)
Last successful check: 2 minutes ago
Runbook: https://wiki.example.com/runbooks/api-down
Dashboard: https://blacktide.xyz/monitors/mon_abc123
Testing Your Monitoring
Chaos Engineering
Regularly test that your alerts work:
- Intentionally stop a service
- Verify alert is triggered
- Check alert reaches all channels
- Measure time to detection
- Practice incident response
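When timing a drill, it helps to know the theoretical floor implied by the check interval and the consecutive-failure threshold; detection can never beat it. A rough sketch (the worst-case timing reasoning is an assumption for illustration, not a product guarantee):

```python
def worst_case_mttd(check_interval_s: int, failures_to_alert: int) -> int:
    """Worst-case seconds from outage start to alert.

    Worst case: the outage begins just after a successful check, so the first
    failing check runs a full interval later, and the alert only fires on the
    Nth consecutive failure.
    """
    return check_interval_s * failures_to_alert

# With 60s checks and a 3-failure threshold, worst-case detection is 3 minutes
print(worst_case_mttd(60, 3))  # 180
```

If a drill's measured time-to-detection is far above this floor, the gap is in alert delivery or acknowledgment, not in the checks themselves.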
Monthly Drills
Monthly Monitoring Drill:
1. Simulate outage in staging
2. Verify alerts trigger correctly
3. Practice runbook procedures
4. Time the incident response
5. Document improvements needed
Common Pitfalls
1. Monitoring Everything
❌ Don't: Monitor every metric and log line
✅ Do: Focus on user-impacting metrics
2. Ignoring False Positives
❌ Don't: Leave noisy alerts in place
✅ Do: Fix or delete alerts that cry wolf
3. No Recovery Notifications
❌ Don't: Only alert on failures
✅ Do: Enable `notifyOnRecovery`
4. Single Point of Failure
❌ Don't: Rely on one alert channel
✅ Do: Configure backup channels
5. No Maintenance Windows
❌ Don't: Deploy without silencing alerts
✅ Do: Use the silence feature during deployments
Metrics to Track
Monitoring Health Metrics
- Mean Time to Detect (MTTD): How fast do you detect issues?
- Mean Time to Resolve (MTTR): How fast do you fix issues?
- Alert Accuracy: % of alerts that require action
- False Positive Rate: % of alerts that are noise
- Coverage: % of services monitored
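MTTD, MTTR, and alert accuracy fall straight out of incident timestamps. A minimal sketch over hypothetical incident records:

```python
from datetime import datetime, timedelta

def monitoring_kpis(incidents: list[dict]) -> dict:
    """Compute MTTD/MTTR in minutes and alert accuracy from incident records."""
    detect = [(i["detected"] - i["started"]).total_seconds() / 60 for i in incidents]
    resolve = [(i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents]
    actionable = sum(1 for i in incidents if i["required_action"])
    return {
        "mttd_min": sum(detect) / len(detect),
        "mttr_min": sum(resolve) / len(resolve),
        "alert_accuracy_pct": 100 * actionable / len(incidents),
    }

# Two made-up incidents for illustration
t0 = datetime(2024, 6, 1, 12, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=2),
     "resolved": t0 + timedelta(minutes=12), "required_action": True},
    {"started": t0, "detected": t0 + timedelta(minutes=4),
     "resolved": t0 + timedelta(minutes=24), "required_action": True},
]
print(monitoring_kpis(incidents))
# {'mttd_min': 3.0, 'mttr_min': 15.0, 'alert_accuracy_pct': 100.0}
```

Tracking these week over week shows whether threshold changes are actually moving you toward the target KPIs below.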
Target KPIs
| Metric | Target |
|---|---|
| MTTD | < 2 minutes |
| MTTR | < 15 minutes |
| Alert Accuracy | > 90% |
| False Positive Rate | < 5% |
| Service Coverage | 100% critical services |
Next Steps
- Quick Start: Apply these practices to your first monitor
- Alert Rules: Configure intelligent thresholds
- Monitor Types: Choose the right monitors
- Troubleshooting: Solve common issues