Alerting System Design: Getting the Right Signal to the Right Person at the Right Time
April 12, 2026
Introduction
Alerting bridges the gap between monitoring data and human action. A good alerting system wakes you up for real problems and stays silent for everything else. A bad one produces so much noise that engineers ignore pages, mute channels, and miss the one alert that actually matters. The goal isn't more alerts — it's fewer, better ones that are actionable, timely, and routed to the right person.
Alert Quality: The Signal-to-Noise Problem
Every alert should answer YES to ALL of these:
1. Is this actionable? (Can someone do something right now?)
2. Does it require immediate attention? (Will waiting cause harm?)
3. Is the condition real? (Not a transient blip or flapping metric?)
If any answer is NO, it should NOT be a page/alert. It should be a:
- Dashboard (visible but not urgent)
- Weekly report (trends, not incidents)
- Auto-remediation (system fixes itself, logs the action)
Alert fatigue math:
Team receives 100 alerts/week
10 are actionable → 90% noise
Engineers learn to ignore alerts → miss the critical one
This is how outages happen.
Target: < 5 alerts/week per on-call engineer, > 80% actionable
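The noise arithmetic above can be sketched as a quick self-check to run against your own paging data (the function name and the input numbers are illustrative, not from any real tool):

```python
def alert_noise_report(total_alerts: int, actionable: int) -> dict:
    """Summarize alert quality for a review period (e.g., one week)."""
    noise = total_alerts - actionable
    return {
        "noise_pct": 100 * noise / total_alerts,
        "actionable_pct": 100 * actionable / total_alerts,
        # Targets from this article: < 5 alerts/week, > 80% actionable
        "meets_volume_target": total_alerts < 5,
        "meets_quality_target": actionable / total_alerts > 0.80,
    }

report = alert_noise_report(total_alerts=100, actionable=10)
print(report["noise_pct"])  # 90.0 — the 90% noise case described above
```

Running this regularly (per team, per week) makes alert-quality regressions visible before engineers tune out.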
What to Alert On
Alert on Symptoms, Not Causes
BAD (alerting on causes):
"CPU usage > 80%"
→ CPU at 85% might be totally normal for this service
→ Or it might be a runaway process
→ The alert doesn't tell you if users are affected
GOOD (alerting on symptoms):
"Error rate > 1% for 5 minutes"
→ Users are being affected RIGHT NOW
→ Investigate the cause (maybe CPU, maybe a bad deploy, maybe a dependency)
"P99 latency > 2 seconds for 5 minutes"
→ Users are experiencing slowness
→ Investigate why (CPU, memory, database, network)
Cause-based alerts are useful as CONTEXT on dashboards, not as pages.
Alert Tiers
P1 — Page (wake someone up):
- Service is DOWN (error rate > 50%)
- SLO error budget is exhausted
- Data loss or corruption detected
- Security breach indicators
P2 — Urgent notification (Slack, don't page):
- Error rate elevated (1-10%)
- Latency degraded (P99 > SLO threshold)
- Dependency partially failing
- Resource trending toward exhaustion (disk 85%, connections 90%)
P3 — Informational (ticket/dashboard):
- Certificate expiring in 14 days
- Capacity trending toward limits
- Deployment completed
- Test failure in CI
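The tier policy above can be sketched as a classification function. This is a simplified illustration using only the thresholds named in this article; a real policy would consider more signals:

```python
def classify_alert(error_rate: float, budget_exhausted: bool = False,
                   data_loss: bool = False) -> str:
    """Map observed symptoms to the P1/P2/P3 tiers described above."""
    if error_rate > 0.50 or budget_exhausted or data_loss:
        return "P1"  # page: service is effectively down
    if error_rate > 0.01:
        return "P2"  # urgent notification (Slack), no page
    return "P3"      # informational: ticket or dashboard
```

For example, a 5% error rate lands in P2 (urgent but not page-worthy), while an exhausted error budget is a P1 regardless of the instantaneous error rate.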
Alerting Architecture
┌──────────────┐     ┌──────────────────┐     ┌───────────────┐
│ Metric Store │────▶│ Alert Evaluator  │────▶│  Alert Router │
│ (Prometheus) │     │  (Rule engine)   │     │ (Notification)│
└──────────────┘     └──────────────────┘     └───────┬───────┘
                                                      │
                             ┌────────────────────────┼────────────┐
                             │                        │            │
                        ┌────▼────┐              ┌─────▼────┐  ┌───▼────┐
                        │PagerDuty│              │  Slack   │  │ Email  │
                        │ (page)  │              │ (urgent) │  │ (info) │
                        └─────────┘              └──────────┘  └────────┘
Prometheus Alerting Rules
# alerting_rules.yml
groups:
  - name: service-health
    rules:
      # P1: Service is failing for a meaningful fraction of users
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) /
          sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 5m  # Must be true for 5 continuous minutes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Current error rate: {{ $value | humanizePercentage }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      # P2: Latency degraded
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"

      # P1: SLO error budget burn rate (14.4x is the page-worthy threshold)
      - alert: SLOBudgetBurning
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h])) /
              sum(rate(http_requests_total[1h]))
            )
          ) > 14.4 * (1 - 0.999)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget burning 14.4x faster than allowed"
          description: "At this rate, the monthly error budget will be exhausted in about 2 days"
Alertmanager Configuration
# alertmanager.yml
route:
  receiver: team-slack          # Default receiver for anything unmatched below
  group_by: ['alertname', 'service']
  group_wait: 30s               # Wait 30s to batch related alerts
  group_interval: 5m            # Don't re-notify for the same group within 5m
  repeat_interval: 4h           # Repeat unresolved alerts every 4 hours
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty-oncall
      repeat_interval: 30m
    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: team-slack
      repeat_interval: 2h
    # Info alerts → Email digest
    - match:
        severity: info
      receiver: email-digest
      group_wait: 1h            # Batch for 1 hour
      repeat_interval: 24h

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<PAGERDUTY_KEY>'
        severity: '{{ .CommonLabels.severity }}'
  - name: team-slack
    slack_configs:
      - channel: '#alerts-team'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: email-digest
    email_configs:
      - to: 'team@example.com'

# Inhibition: suppress low-severity alerts when a high-severity one is firing.
# If a critical alert is firing for order-service,
# suppress warning alerts for order-service.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['service']
SLO-Based Alerting (Multi-Window, Multi-Burn-Rate)
The most sophisticated alerting approach. Instead of threshold-based alerts, alert when the error budget is being consumed too fast:
Concept:
SLO: 99.9% success rate over 30 days
Error budget: 0.1% of requests can fail = ~43 minutes of downtime
Instead of alerting on "error rate > X%",
alert on "error budget is being consumed at Y× the sustainable rate"
Multi-window burn rates:
┌───────────────────┬──────────────┬──────────────────────────────┐
│ Severity          │ Burn Rate    │ Detection windows            │
├───────────────────┼──────────────┼──────────────────────────────┤
│ P1 (page)         │ 14.4× budget │ 1h long window + 5m short    │
│ P2 (ticket)       │ 6× budget    │ 6h long window + 30m short   │
│ P3 (low priority) │ 3× budget    │ 3d long window + 6h short    │
└───────────────────┴──────────────┴──────────────────────────────┘
14.4× burn rate: exhausts 30-day budget in 2 days → page immediately
6× burn rate: exhausts budget in 5 days → urgent ticket
3× burn rate: exhausts budget in 10 days → investigate when convenient
Why two windows (long + short):
Long window (1h): Confirms the problem is sustained, not a blip
Short window (5m): Confirms the problem is still happening right now
Both must be true → alert fires
Prevents alerting on issues that already resolved themselves.
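The multi-window check can be sketched in a few lines. This is a minimal illustration, assuming you already have error rates measured over the long and short windows (in practice these come from your metric store as PromQL queries):

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail over the 30-day window

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return error_rate / BUDGET

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold:
    the long window proves the problem is sustained,
    the short window proves it is still happening right now."""
    return (burn_rate(long_window_error_rate) > threshold and
            burn_rate(short_window_error_rate) > threshold)

def days_to_exhaustion(rate_multiple: float, window_days: int = 30) -> float:
    """At N× the sustainable burn rate, the budget lasts window/N days."""
    return window_days / rate_multiple

print(round(days_to_exhaustion(14.4), 1))  # 2.1 — page immediately
```

Note how a 2% error rate sustained over the last hour pages (burn rate 20×), but the same 1-hour average with a clean last 5 minutes does not: the incident already resolved itself.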
Reducing Alert Noise
Grouping
Without grouping:
Alert: Pod-1 of order-service is down
Alert: Pod-2 of order-service is down
Alert: Pod-3 of order-service is down
Alert: Pod-4 of order-service is down
→ 4 pages for the same incident
With grouping (group_by: ['service']):
Alert: order-service is down (4 pods affected)
→ 1 page with full context
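The collapse from four pages to one can be sketched as follows. Alertmanager does this for you via `group_by`; the alert dicts and message format here are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts: list[dict], group_by: str = "service") -> dict:
    """Collapse per-pod alerts into one notification per group key."""
    groups: dict[str, list] = defaultdict(list)
    for alert in alerts:
        groups[alert[group_by]].append(alert)
    return {
        key: f"{key} is down ({len(members)} pods affected)"
        for key, members in groups.items()
    }

alerts = [{"service": "order-service", "pod": f"pod-{i}"} for i in range(1, 5)]
print(group_alerts(alerts))
# {'order-service': 'order-service is down (4 pods affected)'}
```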
Inhibition
If the database is completely down:
→ Don't also alert for every service that depends on the database
→ One alert: "Database is down"
→ All "order-service error rate high" alerts are suppressed
(they're symptoms of the database being down)
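The cross-service suppression in the database example can be sketched like this. The dependency map and alert structure are hypothetical; Alertmanager's `inhibit_rules` implement the same idea declaratively:

```python
# Hypothetical dependency map: service -> the services it hard-depends on
DEPENDS_ON = {
    "order-service": {"database"},
    "cart-service": {"database"},
}

def apply_inhibition(alerts: list[dict]) -> list[dict]:
    """Drop warning alerts whose service depends on a critically-failing one."""
    down = {a["service"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if not (a["severity"] == "warning"
                and DEPENDS_ON.get(a["service"], set()) & down)
    ]
```

With the database critical, the warning alerts for order-service and cart-service are filtered out, leaving one page that points at the actual root cause.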
Dead Man's Switch
Problem: If your alerting system breaks, you get NO alerts.
You don't know that you don't know.
Solution: A "dead man's switch" alert that ALWAYS fires.
If it STOPS firing, the alerting pipeline is broken.
- alert: DeadMansSwitch
  expr: vector(1)  # Always true
  labels:
    severity: none
  annotations:
    summary: "Alerting pipeline is healthy"
Route this to a service (e.g., Healthchecks.io) that pages you
if the heartbeat STOPS arriving.
Runbooks: Making Alerts Actionable
Every alert must link to a runbook:
## Runbook: HighErrorRate
**Alert**: Error rate above 5% for order-service
### Triage (first 5 minutes)
1. Check if there was a recent deployment: [Deploy Dashboard](link)
2. Check dependency health: [Dependency Dashboard](link)
3. Check for widespread infrastructure issues: [Status Page](link)
### Common causes and fixes
| Cause | How to verify | Fix |
|-------|--------------|-----|
| Bad deployment | Errors started at deploy time | Rollback: `kubectl rollout undo` |
| Database overload | DB dashboard shows high CPU/connections | Scale DB read replicas |
| Downstream service down | Circuit breaker is open | Wait for recovery or enable fallback |
### Escalation
- If not resolved in 15 minutes: page the service owner
- If not resolved in 30 minutes: page the engineering manager
Key Takeaways
- Alert on symptoms, not causes — "error rate > 5%" tells you users are affected; "CPU > 80%" doesn't; use cause metrics as diagnostic context, not alert triggers
- SLO-based alerting is the gold standard — burn rate alerts tell you "you're consuming error budget too fast" rather than arbitrary thresholds
- Every alert must be actionable and link to a runbook — if an engineer can't do anything about it, it shouldn't be an alert
- Group, inhibit, and deduplicate — 10 alerts about the same incident is noise; one grouped alert with full context is signal
- Target < 5 pages per on-call shift — more than that and engineers become desensitized; audit and eliminate noisy alerts relentlessly
- Use `for` durations to prevent flapping — require the condition to be true for 5-10 minutes before firing; transient blips aren't incidents
- Monitor your alerting system — a dead man's switch ensures you know if the alerting pipeline itself is broken