Monitoring Best Practices
Learn proven strategies for effective monitoring. Follow the 3-3-3 rule, avoid alert fatigue, and implement monitoring patterns that scale.
The 3-3-3 Rule
The 3-3-3 Rule is a simple framework for balanced monitoring:
3 Consecutive Failures
Wait for 3 consecutive check failures before alerting.
Why?
- Reduces false positives from transient network blips
- Confirms the issue is real, not a one-time fluke
- Gives time for automatic recovery (e.g., container restart)
Alert Rule:
consecutiveFailures: 3
interval: 60 seconds
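The rule above can be sketched as a small failure tracker; `FailureTracker` here is purely illustrative, not part of any monitoring SDK:

```python
class FailureTracker:
    """Tracks consecutive check failures and decides when to alert."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        # Fire exactly on the threshold so the alert is sent once, not on every
        # subsequent failing check
        return self.consecutive_failures == self.threshold

tracker = FailureTracker(threshold=3)
results = [tracker.record(passed) for passed in (False, False, False)]
print(results)  # [False, False, True] -- only the 3rd failure alerts
```

Note that a single passing check resets the streak, which is exactly what filters out transient blips.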
Timeline:
00:00 - Check 1: ❌ Failed
01:00 - Check 2: ❌ Failed
02:00 - Check 3: ❌ Failed → 🚨 ALERT TRIGGERED
3 Minutes to Acknowledge
Team should acknowledge alerts within 3 minutes of receiving them.
Why?
- Confirms someone is aware and investigating
- Prevents duplicate escalation
- Establishes accountability
3 Channels for Redundancy
Configure 3 alert channels for critical services:
- Channel 1 (Primary): Slack/Discord for team visibility
- Channel 2 (Backup): Email for offline notifications
- Channel 3 (Escalation): PagerDuty for on-call rotation
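Channel fan-out only adds redundancy if one channel's outage cannot block the others. A minimal sketch of that dispatch pattern, with placeholder sender functions standing in for real Slack/email/PagerDuty integrations:

```python
from typing import Callable

def dispatch_alert(message: str,
                   channels: list[tuple[str, Callable[[str], None]]]) -> list[str]:
    """Send an alert to every configured channel, tolerating per-channel failures."""
    delivered = []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # A dead channel must not block the backups -- skip and continue
            continue
    return delivered

# Placeholder senders; real integrations would call webhook / SMTP / API clients
def slack(msg: str) -> None: pass
def email(msg: str) -> None: raise OSError("mail server unreachable")  # simulated outage
def pagerduty(msg: str) -> None: pass

print(dispatch_alert("Production API down",
                     [("slack", slack), ("email", email), ("pagerduty", pagerduty)]))
# ['slack', 'pagerduty'] -- the email outage did not stop escalation
```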
Alert Channels:
├─ Slack (#alerts) - Immediate team notification
├─ Email (team@example.com) - Backup, reaches everyone
└─ PagerDuty (oncall-team) - Escalation after 5 min
Avoiding Alert Fatigue
What is Alert Fatigue?
Alert Fatigue occurs when teams receive too many alerts, causing them to:
- Ignore or dismiss alerts without reading
- Miss critical issues buried in noise
- Become desensitized to alerts
Symptoms of Alert Fatigue
- 🚩 Receiving 50+ alerts per day
- 🚩 Most alerts don't require action
- 🚩 Team stops checking alert channels
- 🚩 Critical alerts get ignored
Solutions
1. Increase Thresholds
Be conservative with alert conditions:
- ❌ Bad: Alert on 1 failed check
- ✅ Good: Alert on 3 consecutive failures
- ❌ Bad: Alert on latency > 100ms
- ✅ Good: Alert on latency > 2000ms
2. Use Alert Severity Levels
Not everything is P0 critical:
| Severity | Definition | Channel | Response Time |
|---|---|---|---|
| P0 (Critical) | Production down, revenue impact | PagerDuty + Slack | Immediate |
| P1 (High) | Degraded performance | Slack + Email | <15 min |
| P2 (Medium) | Non-critical issue | Slack only | <1 hour |
| P3 (Low) | Informational | Email only | Next business day |
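The severity matrix above maps naturally onto a routing table. A minimal sketch, assuming hypothetical channel names:

```python
# Hypothetical routing table mirroring the severity matrix above
SEVERITY_CHANNELS = {
    "P0": ["pagerduty", "slack"],   # critical: page on-call and notify the team
    "P1": ["slack", "email"],       # high: team channel plus backup
    "P2": ["slack"],                # medium: team channel only
    "P3": ["email"],                # low: informational digest
}

def route(severity: str) -> list[str]:
    """Pick channels for a severity; unknown severities fall back to email."""
    return SEVERITY_CHANNELS.get(severity, ["email"])

print(route("P0"))  # ['pagerduty', 'slack']
```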
3. Group Related Alerts
Instead of 10 individual alerts, send 1 grouped alert:
❌ Bad: Individual alerts
- Database server 1 down
- Database server 2 down
- Database server 3 down
- API endpoint /users failing
- API endpoint /posts failing
✅ Good: Grouped alert
Database Cluster Down (3 servers)
→ Cascading failure: 2 API endpoints affected
4. Silence During Maintenance
Always silence monitors during planned maintenance:
- Deployments
- Database migrations
- Infrastructure updates
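A silence check is just an interval test against the current time. A minimal sketch, with a made-up deployment window for illustration:

```python
from datetime import datetime, timezone

def is_silenced(now: datetime,
                windows: list[tuple[datetime, datetime]]) -> bool:
    """Return True if `now` falls inside any planned maintenance window."""
    return any(start <= now < end for start, end in windows)

# A one-hour deployment window (UTC); alerts inside it should be suppressed
deploy = (datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
          datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc))

print(is_silenced(datetime(2024, 6, 1, 2, 30, tzinfo=timezone.utc), [deploy]))  # True
```

Using a half-open interval (`start <= now < end`) means alerting resumes the instant the window closes, with no overlap ambiguity at the boundary.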
5. Weekly Alert Review
Analyze alert patterns every week:
- Which alerts triggered most often?
- How many required action?
- Were any critical issues missed?
- Adjust thresholds based on data
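A weekly review can start from two numbers per monitor: how often it fired and how often it needed action. A rough sketch over hypothetical alert records:

```python
from collections import Counter

def review(alerts: list[dict]) -> dict:
    """Summarize a week of alerts: noisiest monitors and actionable rate."""
    by_monitor = Counter(a["monitor"] for a in alerts)
    actionable = sum(1 for a in alerts if a["required_action"])
    return {
        "noisiest": by_monitor.most_common(3),
        "actionable_pct": round(100 * actionable / len(alerts), 1) if alerts else 0.0,
    }

# Hypothetical records; a real review would pull these from alert history
week = [
    {"monitor": "api-health", "required_action": False},
    {"monitor": "api-health", "required_action": False},
    {"monitor": "api-health", "required_action": True},
    {"monitor": "db-port", "required_action": True},
]
print(review(week))
# {'noisiest': [('api-health', 3), ('db-port', 1)], 'actionable_pct': 50.0}
```

A monitor that fires often but rarely needs action is a candidate for a higher threshold or deletion.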
Monitor Layering Strategy
Layer 1: Network (ICMP)
Basic reachability check:
ICMP Monitor: server.example.com
Purpose: Is server online?
Interval: 1 minute
Alert: Yes (critical if server unreachable)
Layer 2: Port (TCP)
Verify service port is open:
TCP Monitor: server.example.com:5432
Purpose: Is PostgreSQL port listening?
Interval: 1 minute
Alert: Yes (critical if port closed)
Layer 3: Application (HTTP)
Check application health:
HTTP Monitor: https://api.example.com/health
Purpose: Is application responding?
Validation: JSONPath $.status = "healthy"
Interval: 30 seconds
Alert: Yes (critical if unhealthy)
Layer 4: Business Logic (Custom)
Verify end-to-end workflows:
HTTP Monitor: POST /api/orders
Purpose: Can users place orders?
Validation: $.orderId exists
Interval: 5 minutes
Alert: Yes (critical if orders failing)
Layering Example
Problem: API not responding
Layer 1 (ICMP): ✅ Server reachable
Layer 2 (TCP): ✅ Port 443 open
Layer 3 (HTTP): ❌ /health returns 503
Layer 4 (Business): ❌ Orders failing
→ Conclusion: Application issue, not infrastructure
Check Interval Guidelines
Traditional Monitors
| Service Type | Interval | Reasoning |
|---|---|---|
| Critical API | 15-30s | Fast detection of issues |
| Database | 1 minute | Balance speed vs load |
| Background Jobs | 5-15 minutes | Jobs run infrequently |
| SSL Certificates | 24 hours | Certificates rarely change |
Web3 Monitors
| Monitor Type | Interval | Reasoning |
|---|---|---|
| Gas Price | 30s - 1m | Volatile, changes rapidly |
| Whale Wallet | 1-5 minutes | Infrequent large transactions |
| Contract Events | 1 minute | Block time ~12-15s |
| Liquidation Risk | 5 minutes | Health factor changes slowly |
| DeFi Protocol | 5-15 minutes | TVL/APY update gradually |
Naming Conventions
Monitor Names
Use descriptive, consistent naming:
- ✅ Production API - Health Check
- ✅ PostgreSQL Primary - Port 5432
- ✅ SSL Certificate - api.example.com
- ❌ Monitor 1
- ❌ test
- ❌ check-api
Alert Rule Names
Include severity and service:
- ✅ [P0] Production API Down
- ✅ [P1] Database Slow Queries
- ✅ [P2] Cache Hit Rate Low
Tags
Use tags for organization and filtering:
- Environment: `production`, `staging`, `development`
- Severity: `critical`, `important`, `monitoring`
- Service: `api`, `database`, `frontend`
- Team: `backend-team`, `devops`, `sre`
Documentation Standards
Runbooks
Every critical monitor should have a runbook:
# API Down Runbook
## Symptoms
- HTTP health check returns 503
- Users cannot access application
## Diagnosis
1. Check application logs: `kubectl logs -f api-deployment`
2. Verify database connectivity
3. Check recent deployments
## Resolution
1. If deployment issue: Rollback to previous version
2. If database issue: Check connection pool
3. If resource issue: Scale up pods
## Escalation
If not resolved in 15 minutes, page on-call engineer
Alert Descriptions
Include context in alert messages:
❌ Bad: "API is down"
✅ Good:
Production API Down - 3 consecutive failures
URL: https://api.example.com/health
Expected: 200 OK, got: 503 Service Unavailable
Uptime: 99.95% (last 30 days)
Last successful check: 2 minutes ago
Runbook: https://wiki.example.com/runbooks/api-down
Dashboard: https://blacktide.xyz/monitors/mon_abc123
Testing Your Monitoring
Chaos Engineering
Regularly test that your alerts work:
- Intentionally stop a service
- Verify alert is triggered
- Check alert reaches all channels
- Measure time to detection
- Practice incident response
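When timing a drill, it helps to know the theoretical floor implied by the check interval and the consecutive-failure threshold; detection can never beat it. A rough sketch (the worst-case timing reasoning is an assumption for illustration, not a product guarantee):

```python
def worst_case_mttd(check_interval_s: int, failures_to_alert: int) -> int:
    """Worst-case seconds from outage start to alert.

    Worst case: the outage begins just after a successful check, so the first
    failing check runs a full interval later, and the alert only fires on the
    Nth consecutive failure.
    """
    return check_interval_s * failures_to_alert

# With 60s checks and a 3-failure threshold, worst-case detection is 3 minutes
print(worst_case_mttd(60, 3))  # 180
```

If a drill's measured time-to-detection is far above this floor, the gap is in alert delivery or acknowledgment, not in the checks themselves.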
Monthly Drills
Monthly Monitoring Drill:
1. Simulate outage in staging
2. Verify alerts trigger correctly
3. Practice runbook procedures
4. Time the incident response
5. Document improvements needed
Common Pitfalls
1. Monitoring Everything
❌ Don't: Monitor every metric and log line
✅ Do: Focus on user-impacting metrics
2. Ignoring False Positives
❌ Don't: Leave noisy alerts in place
✅ Do: Fix or delete alerts that cry wolf
3. No Recovery Notifications
❌ Don't: Only alert on failures
✅ Do: Enable `notifyOnRecovery`
4. Single Point of Failure
❌ Don't: Rely on one alert channel
✅ Do: Configure backup channels
5. No Maintenance Windows
❌ Don't: Deploy without silencing alerts
✅ Do: Use the silence feature during deployments
Metrics to Track
Monitoring Health Metrics
- Mean Time to Detect (MTTD): How fast do you detect issues?
- Mean Time to Resolve (MTTR): How fast do you fix issues?
- Alert Accuracy: % of alerts that require action
- False Positive Rate: % of alerts that are noise
- Coverage: % of services monitored
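MTTD, MTTR, and alert accuracy fall straight out of incident timestamps. A minimal sketch over hypothetical incident records:

```python
from datetime import datetime, timedelta

def monitoring_kpis(incidents: list[dict]) -> dict:
    """Compute MTTD/MTTR in minutes and alert accuracy from incident records."""
    detect = [(i["detected"] - i["started"]).total_seconds() / 60 for i in incidents]
    resolve = [(i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents]
    actionable = sum(1 for i in incidents if i["required_action"])
    return {
        "mttd_min": sum(detect) / len(detect),
        "mttr_min": sum(resolve) / len(resolve),
        "alert_accuracy_pct": 100 * actionable / len(incidents),
    }

# Two made-up incidents for illustration
t0 = datetime(2024, 6, 1, 12, 0)
incidents = [
    {"started": t0, "detected": t0 + timedelta(minutes=2),
     "resolved": t0 + timedelta(minutes=12), "required_action": True},
    {"started": t0, "detected": t0 + timedelta(minutes=4),
     "resolved": t0 + timedelta(minutes=24), "required_action": True},
]
print(monitoring_kpis(incidents))
# {'mttd_min': 3.0, 'mttr_min': 15.0, 'alert_accuracy_pct': 100.0}
```

Tracking these week over week shows whether threshold changes are actually moving you toward the target KPIs below.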
Target KPIs
| Metric | Target |
|---|---|
| MTTD | < 2 minutes |
| MTTR | < 15 minutes |
| Alert Accuracy | > 90% |
| False Positive Rate | < 5% |
| Service Coverage | 100% critical services |
Next Steps
- Quick Start: Apply these practices to your first monitor
- Alert Rules: Configure intelligent thresholds
- Monitor Types: Choose the right monitors
- Troubleshooting: Solve common issues