Alerting System Design: Getting the Right Signal to the Right Person at the Right Time
April 12, 2026
Introduction
Alerting bridges the gap between monitoring data and human action. A good alerting system wakes you up for real problems and stays silent for everything else. A bad one produces so much noise that engineers ignore pages, mute channels, and miss the one alert that actually matters. The goal isn't more alerts — it's fewer, better ones that are actionable, timely, and routed to the right person.
Alert Quality: The Signal-to-Noise Problem
Every alert should answer YES to ALL of these:
1. Is this actionable? (Can someone do something right now?)
2. Does it require immediate attention? (Will waiting cause harm?)
3. Is the condition real? (Not a transient blip or flapping metric?)
If any answer is NO, it should NOT be a page/alert. It should be a:
- Dashboard (visible but not urgent)
- Weekly report (trends, not incidents)
- Auto-remediation (system fixes itself, logs the action)
Alert fatigue math:
Team receives 100 alerts/week
10 are actionable → 90% noise
Engineers learn to ignore alerts → miss the critical one
This is how outages happen.
Target: < 5 alerts/week per on-call engineer, > 80% actionable
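The noise arithmetic above can be sketched as a quick self-check to run against your own paging data (the function name and the input numbers are illustrative, not from any real tool):

```python
def alert_noise_report(total_alerts: int, actionable: int) -> dict:
    """Summarize alert quality for a review period (e.g., one week)."""
    noise = total_alerts - actionable
    return {
        "noise_pct": 100 * noise / total_alerts,
        "actionable_pct": 100 * actionable / total_alerts,
        # Targets from this article: < 5 alerts/week, > 80% actionable
        "meets_volume_target": total_alerts < 5,
        "meets_quality_target": actionable / total_alerts > 0.80,
    }

report = alert_noise_report(total_alerts=100, actionable=10)
print(report["noise_pct"])  # 90.0 — the 90% noise case described above
```

Running this regularly (per team, per week) makes alert-quality regressions visible before engineers tune out.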
What to Alert On
Alert on Symptoms, Not Causes
BAD (alerting on causes):
"CPU usage > 80%"
→ CPU at 85% might be totally normal for this service
→ Or it might be a runaway process
→ The alert doesn't tell you if users are affected
GOOD (alerting on symptoms):
"Error rate > 1% for 5 minutes"
→ Users are being affected RIGHT NOW
→ Investigate the cause (maybe CPU, maybe a bad deploy, maybe a dependency)
"P99 latency > 2 seconds for 5 minutes"
→ Users are experiencing slowness
→ Investigate why (CPU, memory, database, network)
Cause-based alerts are useful as CONTEXT on dashboards, not as pages.
Alert Tiers
P1 — Page (wake someone up):
- Service is DOWN (error rate > 50%)
- SLO error budget is exhausted
- Data loss or corruption detected
- Security breach indicators
P2 — Urgent notification (Slack, don't page):
- Error rate elevated (1-10%)
- Latency degraded (P99 > SLO threshold)
- Dependency partially failing
- Resource trending toward exhaustion (disk 85%, connections 90%)
P3 — Informational (ticket/dashboard):
- Certificate expiring in 14 days
- Capacity trending toward limits
- Deployment completed
- Test failure in CI
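The tier policy above can be sketched as a classification function. This is a simplified illustration using only the thresholds named in this article; a real policy would consider more signals:

```python
def classify_alert(error_rate: float, budget_exhausted: bool = False,
                   data_loss: bool = False) -> str:
    """Map observed symptoms to the P1/P2/P3 tiers described above."""
    if error_rate > 0.50 or budget_exhausted or data_loss:
        return "P1"  # page: service is effectively down
    if error_rate > 0.01:
        return "P2"  # urgent notification (Slack), no page
    return "P3"      # informational: ticket or dashboard
```

For example, a 5% error rate lands in P2 (urgent but not page-worthy), while an exhausted error budget is a P1 regardless of the instantaneous error rate.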
Alerting Architecture
┌──────────────┐     ┌──────────────────┐     ┌───────────────┐
│ Metric Store │────▶│ Alert Evaluator  │────▶│  Alert Router │
│ (Prometheus) │     │  (Rule engine)   │     │ (Notification)│
└──────────────┘     └──────────────────┘     └───────┬───────┘
                                                      │
                             ┌────────────────────────┼────────────┐
                             │                        │            │
                        ┌────▼────┐              ┌─────▼────┐  ┌───▼────┐
                        │PagerDuty│              │  Slack   │  │ Email  │
                        │ (page)  │              │ (urgent) │  │ (info) │
                        └─────────┘              └──────────┘  └────────┘
Prometheus Alerting Rules
# alerting_rules.yml
groups:
  - name: service-health
    rules:
      # P1: Service is failing for a meaningful fraction of users
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m  # Must be true for 5 continuous minutes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      # P2: Latency degraded
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"

      # P1: SLO error budget burn rate (14.4x is the page-worthy threshold)
      - alert: SLOBudgetBurning
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h])) /
              sum(rate(http_requests_total[1h]))
            )
          ) > 14.4 * (1 - 0.999)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget burning 14.4x faster than allowed"
          description: "At this rate, the monthly error budget will be exhausted in about 2 days"
Alertmanager Configuration
# alertmanager.yml
route:
  receiver: team-slack          # Default receiver for anything unmatched below
  group_by: ['alertname', 'service']
  group_wait: 30s               # Wait 30s to batch related alerts
  group_interval: 5m            # Don't re-notify for the same group within 5m
  repeat_interval: 4h           # Repeat unresolved alerts every 4 hours
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 30m
    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 2h
    # Info alerts → Email digest
    - match:
        severity: info
      receiver: email-digest
      group_wait: 1h            # Batch for 1 hour
      repeat_interval: 24h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<PAGERDUTY_KEY>'
        severity: '{{ .CommonLabels.severity }}'
  - name: team-slack
    slack_configs:
      - channel: '#alerts-team'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: email-digest
    email_configs:
      - to: 'team@example.com'

# Inhibition: suppress low-severity alerts when a high-severity one is firing.
# If a critical alert is firing for order-service,
# suppress warning alerts for order-service.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service']
SLO-Based Alerting (Multi-Window, Multi-Burn-Rate)
The most sophisticated alerting approach. Instead of threshold-based alerts, alert when the error budget is being consumed too fast:
Concept:
SLO: 99.9% success rate over 30 days
Error budget: 0.1% of requests can fail = ~43 minutes of downtime
Instead of alerting on "error rate > X%",
alert on "error budget is being consumed at Y× the sustainable rate"
Multi-window burn rates:
┌───────────────────┬──────────────┬──────────────────────────────┐
│ Severity          │ Burn Rate    │ Detection windows            │
├───────────────────┼──────────────┼──────────────────────────────┤
│ P1 (page)         │ 14.4× budget │ 1h long window + 5m short    │
│ P2 (ticket)       │ 6× budget    │ 6h long window + 30m short   │
│ P3 (low priority) │ 3× budget    │ 3d long window + 6h short    │
└───────────────────┴──────────────┴──────────────────────────────┘
14.4× burn rate: exhausts 30-day budget in 2 days → page immediately
6× burn rate: exhausts budget in 5 days → urgent ticket
3× burn rate: exhausts budget in 10 days → investigate when convenient
Why two windows (long + short):
Long window (1h): Confirms the problem is sustained, not a blip
Short window (5m): Confirms the problem is still happening right now
Both must be true → alert fires
Prevents alerting on issues that already resolved themselves.
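The multi-window check can be sketched in a few lines. This is a minimal illustration, assuming you already have error rates measured over the long and short windows (in practice these come from your metric store as PromQL queries):

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail over the 30-day window

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return error_rate / BUDGET

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold:
    the long window proves the problem is sustained,
    the short window proves it is still happening right now."""
    return (burn_rate(long_window_error_rate) > threshold and
            burn_rate(short_window_error_rate) > threshold)

def days_to_exhaustion(rate_multiple: float, window_days: int = 30) -> float:
    """At N× the sustainable burn rate, the budget lasts window/N days."""
    return window_days / rate_multiple

print(round(days_to_exhaustion(14.4), 1))  # 2.1 — page immediately
```

Note how a 2% error rate sustained over the last hour pages (burn rate 20×), but the same 1-hour average with a clean last 5 minutes does not: the incident already resolved itself.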
Reducing Alert Noise
Grouping
Without grouping:
Alert: Pod-1 of order-service is down
Alert: Pod-2 of order-service is down
Alert: Pod-3 of order-service is down
Alert: Pod-4 of order-service is down
→ 4 pages for the same incident
With grouping (group_by: ['service']):
Alert: order-service is down (4 pods affected)
→ 1 page with full context
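The collapse from four pages to one can be sketched as follows. Alertmanager does this for you via `group_by`; the alert dicts and message format here are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], group_by: str = "service") -> dict:
    """Collapse per-pod alerts into one notification per group key."""
    groups: dict[str, list] = defaultdict(list)
    for alert in alerts:
        groups[alert[group_by]].append(alert)
    return {
        key: f"{key} is down ({len(members)} pods affected)"
        for key, members in groups.items()
    }

alerts = [{"service": "order-service", "pod": f"pod-{i}"} for i in range(1, 5)]
print(group_alerts(alerts))
# {'order-service': 'order-service is down (4 pods affected)'}
```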
Inhibition
If the database is completely down:
→ Don't also alert for every service that depends on the database
→ One alert: "Database is down"
→ All "order-service error rate high" alerts are suppressed
(they're symptoms of the database being down)
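The cross-service suppression in the database example can be sketched like this. The dependency map and alert structure are hypothetical; Alertmanager's `inhibit_rules` implement the same idea declaratively:

```python
# Hypothetical dependency map: service -> the services it hard-depends on
DEPENDS_ON = {
    "order-service": {"database"},
    "cart-service": {"database"},
}

def apply_inhibition(alerts: list[dict]) -> list[dict]:
    """Drop warning alerts whose service depends on a critically-failing one."""
    down = {a["service"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if not (a["severity"] == "warning"
                and DEPENDS_ON.get(a["service"], set()) & down)
    ]
```

With the database critical, the warning alerts for order-service and cart-service are filtered out, leaving one page that points at the actual root cause.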
Dead Man's Switch
Problem: If your alerting system breaks, you get NO alerts.
You don't know that you don't know.
Solution: A "dead man's switch" alert that ALWAYS fires.
If it STOPS firing, the alerting pipeline is broken.
- alert: DeadMansSwitch
  expr: vector(1)  # Always true
  labels:
    severity: none
  annotations:
    summary: "Alerting pipeline is healthy"
Route this to a service (e.g., Healthchecks.io) that pages you
if the heartbeat STOPS arriving.
Runbooks: Making Alerts Actionable
Every alert must link to a runbook:
## Runbook: HighErrorRate
**Alert**: Error rate above 5% for order-service
### Triage (first 5 minutes)
1. Check if there was a recent deployment: [Deploy Dashboard](link)
2. Check dependency health: [Dependency Dashboard](link)
3. Check for widespread infrastructure issues: [Status Page](link)
### Common causes and fixes
| Cause | How to verify | Fix |
|-------|--------------|-----|
| Bad deployment | Errors started at deploy time | Rollback: `kubectl rollout undo` |
| Database overload | DB dashboard shows high CPU/connections | Scale DB read replicas |
| Downstream service down | Circuit breaker is open | Wait for recovery or enable fallback |
### Escalation
- If not resolved in 15 minutes: page the service owner
- If not resolved in 30 minutes: page the engineering manager
Key Takeaways
- Alert on symptoms, not causes — "error rate > 5%" tells you users are affected; "CPU > 80%" doesn't; use cause metrics as diagnostic context, not alert triggers
- SLO-based alerting is the gold standard — burn rate alerts tell you "you're consuming error budget too fast" rather than arbitrary thresholds
- Every alert must be actionable and link to a runbook — if an engineer can't do anything about it, it shouldn't be an alert
- Group, inhibit, and deduplicate — 10 alerts about the same incident is noise; one grouped alert with full context is signal
- Target < 5 pages per on-call shift — more than that and engineers become desensitized; audit and eliminate noisy alerts relentlessly
- Use `for` durations to prevent flapping — require the condition to be true for 5-10 minutes before firing; transient blips aren't incidents
- Monitor your alerting system — a dead man's switch ensures you know if the alerting pipeline itself is broken