
Alert Rules

Alert Rules define when and how you should be notified about monitor failures. Configure conditions, thresholds, and notification channels.

Creating an Alert Rule

1. Navigate to Alert Rules

Go to Alerts → Alert Rules in the sidebar and click New Rule.

2. Select Monitors

Choose which monitors this rule applies to. You can select:

  • Specific monitors (e.g., "Production API")
  • All monitors with a tag (e.g., tag:production)
  • All monitors in the team

3. Configure Conditions

Set the conditions that trigger the alert. Common conditions include:

  • Consecutive failures threshold
  • Response time threshold
  • SSL certificate expiry days

4. Select Alert Channels

Choose where notifications should be sent (Email, Slack, PagerDuty, etc.). You must have at least one alert channel configured.

Rule Conditions

| Condition | Type | Description |
| --- | --- | --- |
| `consecutiveFailures` | Number | Alert after N failed checks in a row (recommended: 3) |
| `latencyThresholdMs` | Number | Alert if response time exceeds the threshold (milliseconds) |
| `notifyOnRecovery` | Boolean | Send a notification when the monitor recovers |
| `sslExpiryDays` | Number | Alert N days before the SSL certificate expires |
| `muteUntil` | Timestamp | Silence alerts until a specific time (maintenance window) |

Example Configurations

Basic Alert Rule

Simple rule that alerts after 3 consecutive failures:

Basic Configuration
json
{
  "name": "Production API Down",
  "monitorIds": ["550e8400-e29b-41d4-a716-446655440000"],
  "conditions": {
    "consecutiveFailures": 3,
    "notifyOnRecovery": true
  },
  "channels": ["email-channel-id", "slack-channel-id"]
}

Performance Degradation Alert

Alert when response times are slow, even if the service is still up:

Latency Alert
json
{
  "name": "API Slow Response",
  "monitorIds": ["550e8400-e29b-41d4-a716-446655440000"],
  "conditions": {
    "latencyThresholdMs": 2000,
    "consecutiveFailures": 2
  },
  "channels": ["slack-channel-id"]
}

SSL Certificate Expiry

Get notified before your SSL certificate expires:

SSL Expiry Alert
json
{
  "name": "SSL Certificate Expiring",
  "monitorIds": ["all-https-monitors"],
  "conditions": {
    "sslExpiryDays": 30
  },
  "channels": ["email-channel-id"]
}

Critical Service with Escalation

Route critical alerts to PagerDuty with on-call escalation:

Critical Alert with Escalation
json
{
  "name": "Payment API Critical",
  "monitorIds": ["payment-api-id"],
  "conditions": {
    "consecutiveFailures": 2,
    "notifyOnRecovery": true
  },
  "channels": ["pagerduty-oncall-id", "slack-incidents-id"],
  "priority": "critical"
}

Consecutive Failures

The consecutiveFailures condition is the most important for preventing false alarms. It ensures your service has truly failed, not just experienced a transient network issue.

How It Works

  1. Monitor check fails (e.g., timeout or 500 error)
  2. Counter increments: 1 failure
  3. Monitor check fails again (e.g., 60 seconds later)
  4. Counter increments: 2 failures
  5. Monitor check fails again
  6. Counter reaches 3 → Alert triggered

If any check succeeds, the counter resets to 0. This means the service must fail N times in a row before an alert is sent.
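The counter-and-reset behavior above can be sketched in a few lines. This is an illustrative model, not the service's actual implementation; the function name and `threshold` parameter (mirroring `consecutiveFailures`) are assumptions.

```python
def should_alert(check_results, threshold=3):
    """Return True once `threshold` check failures occur in a row.

    `check_results` is a chronological list of booleans
    (True = check passed, False = check failed).
    """
    streak = 0
    for ok in check_results:
        if ok:
            streak = 0          # any success resets the counter
        else:
            streak += 1
            if streak == threshold:
                return True     # streak reached N -> alert fires
    return False

# A transient blip never alerts, but three failures in a row do:
print(should_alert([True, False, True, False, False, False]))  # True
print(should_alert([False, False, True, False, False]))        # False
```

Note that the reset on any success is what filters out one-off network blips while still catching sustained outages.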

Notify on Recovery

The notifyOnRecovery option sends a notification when your monitor recovers after being down. This is useful for:

  • Knowing when an issue is resolved without checking manually
  • Measuring Mean Time To Recovery (MTTR)
  • Confirming deployments didn't break anything
  • Closing incident loops

Recovery Notification
json
{
  "event": "monitor.up",
  "monitor": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Production API",
    "status": "up"
  },
  "downtime": "5m 32s",
  "timestamp": "2024-02-14T10:05:32Z"
}

Rule Priority

Assign priority levels to rules for better organization and routing:

| Priority | Use Case | Example |
| --- | --- | --- |
| Critical | Revenue-impacting services | Payment API, Checkout flow |
| High | Core product functionality | User authentication, Core API |
| Medium | Important but not critical | Analytics, Background jobs |
| Low | Non-critical services | Marketing site, Blog |

Tag-Based Rules

Instead of selecting specific monitors, you can create rules based on tags. This is powerful for managing many monitors:

Tag-Based Rule
json
{
  "name": "All Production Services",
  "tags": ["production", "critical"],
  "conditions": {
    "consecutiveFailures": 2,
    "notifyOnRecovery": true
  },
  "channels": ["pagerduty-oncall-id"]
}

When you add a new monitor with the production tag, this rule automatically applies to it. No need to update the rule!
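Tag matching can be modeled as a set check. The sketch below assumes AND semantics (a monitor must carry every tag the rule lists, as the `["production", "critical"]` example suggests); whether the service also supports OR matching is not stated here.

```python
def rule_applies(monitor_tags, rule_tags):
    """A rule applies when the monitor carries every tag the rule lists."""
    return set(rule_tags) <= set(monitor_tags)  # subset test = AND semantics

# A monitor tagged production + critical + api matches the rule:
print(rule_applies(["production", "critical", "api"], ["production", "critical"]))  # True
# A staging-only monitor does not:
print(rule_applies(["staging"], ["production", "critical"]))                        # False
```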

Maintenance Windows

Use the muteUntil condition to temporarily silence alerts during planned maintenance:

Muted Rule
json
{
  "name": "Database Maintenance",
  "monitorIds": ["database-monitor-id"],
  "conditions": {
    "consecutiveFailures": 3,
    "muteUntil": "2024-02-15T02:00:00Z"
  }
}

Alerts will be silenced until the specified timestamp. After that, alerting automatically resumes.
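The mute check itself is a simple timestamp comparison. This is a hypothetical sketch of the behavior, parsing the ISO 8601 `muteUntil` value and suppressing alerts while the current time is still before it.

```python
from datetime import datetime, timezone

def is_muted(mute_until, now=None):
    """True while the current time is before the rule's muteUntil timestamp."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat in older Pythons doesn't accept a trailing 'Z', so map it to +00:00.
    deadline = datetime.fromisoformat(mute_until.replace("Z", "+00:00"))
    return now < deadline

# During the maintenance window the alert is suppressed:
print(is_muted("2024-02-15T02:00:00Z",
               now=datetime(2024, 2, 15, 1, 0, tzinfo=timezone.utc)))   # True
# After the timestamp passes, alerting resumes automatically:
print(is_muted("2024-02-15T02:00:00Z",
               now=datetime(2024, 2, 15, 3, 0, tzinfo=timezone.utc)))   # False
```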

Rule Testing

Before deploying an alert rule, test it to ensure notifications are working:

  1. Create the alert rule with your desired conditions
  2. Click Send Test Alert in the rule details
  3. Verify you receive the test notification in all configured channels
  4. Check that the message format and content are correct

Best Practices

Start Conservative

Begin with higher consecutive failure thresholds (5-10) and gradually reduce them as you gain confidence in your monitoring setup. It's better to miss an alert initially than to wake up the on-call engineer at 3 AM for a false alarm.

Use Multiple Rules

Create different rules for different scenarios:

  • Rule 1: Critical failures (2 consecutive) → PagerDuty
  • Rule 2: Performance degradation (5s latency) → Slack
  • Rule 3: SSL expiry (30 days) → Email

Review Alert History

Regularly check your alert history to identify:

  • False positives (adjust consecutive failures threshold)
  • Missing alerts (lower thresholds or add more rules)
  • Noisy monitors (silence or adjust intervals)

API Access

You can manage alert rules programmatically via the API:

Create Alert Rule via API
bash
curl -X POST "https://api.blacktide.xyz/v1/alert-rules" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production API Down",
    "monitorIds": ["550e8400-e29b-41d4-a716-446655440000"],
    "conditions": {
      "consecutiveFailures": 3,
      "notifyOnRecovery": true
    },
    "channels": ["email-channel-id"]
  }'

See the Alert Rules API Reference for full documentation.
