Designing Observability Into Your System From Day One
Introduction
"We'll add monitoring later."
These four words have launched a thousand on-call nightmares. The system grows, complexity compounds, and suddenly you're debugging production issues by adding print statements and redeploying—the software equivalent of performing surgery in the dark.
Observability isn't monitoring. It's not dashboards. It's not alerts. Observability is the property of a system that allows you to understand its internal state by examining its outputs. And like security or performance, it's dramatically easier to build in from the start than to retrofit later.
This guide covers how to design observability into your system from day one—before you have production traffic, before you have incidents, before you desperately need it.
Observability vs Monitoring: The Distinction That Matters
┌─────────────────────────────────────────────────────────────────────┐
│ MONITORING VS OBSERVABILITY │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ MONITORING OBSERVABILITY │
│ ══════════ ═════════════ │
│ │
│ "Is the system working?" "Why isn't it working?" │
│ │
│ Answers KNOWN questions Answers UNKNOWN questions │
│ - Is CPU above 80%? - Why are these specific │
│ - Is API latency > 200ms? requests slow? │
│ - Are there 5xx errors? - What changed between │
│ yesterday and today? │
│ │
│ Predefined dashboards Ad-hoc exploration │
│ - Static metrics - Slice and dice any dimension │
│ - Known failure modes - Discover unknown unknowns │
│ │
│ Alerts when thresholds cross Investigate why thresholds │
│ crossed │
│ │
│ Works for simple systems Required for complex systems │
│ │
└─────────────────────────────────────────────────────────────────────┘
The key insight:
MONITORING tells you THAT something is wrong.
OBSERVABILITY helps you understand WHY.
You need both. But observability is what enables you to:
- Debug issues you've never seen before
- Understand behavior you didn't anticipate
- Answer questions you didn't know to ask
The Three Pillars (And Why They're Not Enough)
THE CLASSIC THREE PILLARS:
════════════════════════════════════════════════════════════════════
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ LOGS │ │ METRICS │ │ TRACES │
├─────────────────┤ ├─────────────────┤ ├─────────────────┤
│ │ │ │ │ │
│ What happened │ │ Aggregated │ │ Request flow │
│ (events) │ │ measurements │ │ across │
│ │ │ over time │ │ services │
│ │ │ │ │ │
│ "User 123 │ │ "p99 latency │ │ "Request ABC │
│ logged in │ │ is 250ms" │ │ took 50ms in │
│ at 10:30" │ │ │ │ service A, │
│ │ │ │ │ 200ms in B" │
│ │ │ │ │ │
│ High volume │ │ Low volume │ │ Medium volume │
│ High cost │ │ Low cost │ │ Medium cost │
│ High detail │ │ Low detail │ │ High detail │
│ │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
BUT THE PILLARS ALONE AREN'T ENOUGH:
════════════════════════════════════════════════════════════════════
What's missing:
1. CORRELATION
Logs, metrics, and traces must be connected.
"Show me the logs for this slow trace."
"Show me the trace that caused this error metric spike."
2. CONTEXT
Raw data without context is noise.
"This error happened" → "This error happened for user X,
in region Y, on version Z"
3. CARDINALITY
High-cardinality data (user IDs, request IDs) is essential
for debugging but expensive to store.
4. EVENTS
Structured events that combine log-like detail with
metric-like queryability.
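To make the fourth point concrete, here is a sketch of one such "wide event": a single structured record per unit of work that carries log-level detail alongside metric-queryable dimensions. The field names here are illustrative, not a fixed schema.

```typescript
// One wide event per unit of work: detailed like a log line,
// sliceable like a metric (every field is a query dimension).
const event = {
  timestamp: new Date().toISOString(),
  name: 'checkout.completed',
  trace_id: 'abc123def456', // correlation: join with traces
  user_id: 'user_123',      // high-cardinality, essential for debugging
  region: 'us-east-1',      // context
  version: '2.3.1',
  duration_ms: 184,
  cart_items: 3,
  payment_provider: 'stripe',
};

// Emitted as one JSON record to your event store
console.log(JSON.stringify(event));
```

A backend that indexes every field lets you ask "p99 of duration_ms where region=us-east-1 and cart_items > 2" without having predefined that metric.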
The Day-One Observability Stack
Foundational Architecture
DAY-ONE OBSERVABILITY ARCHITECTURE:
════════════════════════════════════════════════════════════════════
Your Application
│
│ Instrumentation Layer
│ (SDK, auto-instrumentation)
▼
┌─────────────────────────────────────────────────────────────────┐
│ TELEMETRY PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │
│ │ │ │ │ │ │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └───────────────┼───────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Collector │ (OpenTelemetry Collector, │
│ │ (processing, │ Vector, Fluent Bit) │
│ │ routing) │ │
│ └────────┬─────────┘ │
│ │ │
└─────────────────────┼───────────────────────────────────────────┘
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌───────────┐ ┌───────────┐
│ Log │ │ Metrics │ │ Trace │
│ Storage │ │ Storage │ │ Storage │
│ │ │ │ │ │
│ (Loki, │ │(Prometheus│ │ (Jaeger, │
│ Elastic, │ │ Mimir, │ │ Tempo, │
│ CloudWatch)│ │ Datadog) │ │ Zipkin) │
└──────┬──────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────┼─────────────┘
│
▼
┌──────────────────┐
│ Unified UI │
│ (Grafana, │
│ Datadog, │
│ Honeycomb) │
└──────────────────┘
Why OpenTelemetry From Day One
OPENTELEMETRY: THE VENDOR-NEUTRAL CHOICE
════════════════════════════════════════════════════════════════════
Before OpenTelemetry:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ App instrumented with: │
│ - Datadog SDK for metrics │
│ - Jaeger SDK for traces │
│ - Custom logging library │
│ │
│ Want to switch vendors? Re-instrument EVERYTHING. │
│ │
└─────────────────────────────────────────────────────────────────┘
With OpenTelemetry:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ App instrumented with: │
│ - OpenTelemetry SDK (logs, metrics, traces) │
│ │ │
│ ▼ │
│ OTel Collector │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ Datadog Honeycomb Jaeger │
│ │
│ Switch vendors? Change collector config. Zero code changes. │
│ │
└─────────────────────────────────────────────────────────────────┘
Day-One Decision: Use OpenTelemetry
Why:
• Vendor neutral - switch backends without code changes
• Industry standard - wide ecosystem support
• Single API - consistent instrumentation
• Auto-instrumentation - instrument libraries automatically
• Future-proof - CNCF project, active development
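The "change collector config, zero code changes" claim looks like this in practice. The sketch below is a minimal OpenTelemetry Collector config that fans the same OTLP input out to two backends; endpoint and API-key values are placeholders.

```yaml
# Sketch: one OTLP input, multiple backends. Swapping vendors means
# editing the exporters list - the application is untouched.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb, datadog]
```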
Structured Logging: The Foundation
Unstructured logs are almost useless at scale. Structured logging is non-negotiable.
The Structured Logging Contract
UNSTRUCTURED (Bad):
════════════════════════════════════════════════════════════════════
2024-01-15 10:30:45 INFO User john@example.com logged in from 192.168.1.1
2024-01-15 10:30:46 ERROR Failed to process payment for order 12345: timeout
2024-01-15 10:30:47 DEBUG Request to /api/users took 234ms
Problems:
• Can't filter by user email without regex
• Can't aggregate errors by order
• Can't correlate with traces
• Different formats = parsing nightmare
STRUCTURED (Good):
════════════════════════════════════════════════════════════════════
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "info",
"message": "User logged in",
"service": "auth-service",
"version": "2.3.1",
"environment": "production",
"trace_id": "abc123def456",
"span_id": "789xyz",
"user": {
"id": "user_123",
"email": "john@example.com"
},
"request": {
"ip": "192.168.1.1",
"user_agent": "Mozilla/5.0...",
"method": "POST",
"path": "/api/auth/login"
},
"duration_ms": 45
}
Benefits:
• Query: user.email = "john@example.com"
• Aggregate: COUNT(*) GROUP BY error.code
• Correlate: Join with traces via trace_id
• Consistent: Same schema everywhere
Logging Implementation
// logger.ts - Day One Logging Setup
import { createLogger, format, transports } from 'winston';
import { trace } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
// Standard fields that appear on EVERY log entry
interface BaseLogContext {
service: string;
version: string;
environment: string;
instance: string;
}
// Request-scoped context
interface RequestContext {
traceId?: string;
spanId?: string;
requestId?: string;
userId?: string;
sessionId?: string;
}
const baseContext: BaseLogContext = {
service: process.env.SERVICE_NAME || 'unknown',
version: process.env.APP_VERSION || 'unknown',
environment: process.env.NODE_ENV || 'development',
instance: process.env.HOSTNAME || 'unknown',
};
// Extract trace context from OpenTelemetry
const getTraceContext = (): Partial<RequestContext> => {
const span = trace.getActiveSpan();
if (!span) return {};
const spanContext = span.spanContext();
return {
traceId: spanContext.traceId,
spanId: spanContext.spanId,
};
};
// Custom format that merges all context
const contextFormat = format((info) => {
return {
...info,
...baseContext,
...getTraceContext(),
timestamp: new Date().toISOString(),
};
});
// The logger
export const logger = createLogger({
level: process.env.LOG_LEVEL || 'info',
format: format.combine(
contextFormat(),
format.errors({ stack: true }),
format.json()
),
transports: [
new transports.Console(),
// Add file transport, cloud transport, etc.
],
});
// Request-scoped logger with additional context
export class RequestLogger {
private requestContext: RequestContext;
constructor(requestContext: RequestContext) {
this.requestContext = requestContext;
}
private log(level: string, message: string, data?: Record<string, any>) {
logger.log(level, message, {
...this.requestContext,
...data,
});
}
info(message: string, data?: Record<string, any>) {
this.log('info', message, data);
}
warn(message: string, data?: Record<string, any>) {
this.log('warn', message, data);
}
error(message: string, error?: Error, data?: Record<string, any>) {
this.log('error', message, {
...data,
error: error ? {
name: error.name,
message: error.message,
stack: error.stack,
} : undefined,
});
}
// Create child logger with additional context
child(additionalContext: Record<string, any>): RequestLogger {
return new RequestLogger({
...this.requestContext,
...additionalContext,
});
}
}
// Middleware to create request-scoped logger
export function loggingMiddleware(req: Request, res: Response, next: NextFunction) {
const requestId = (req.headers['x-request-id'] as string) || crypto.randomUUID(); // crypto is a global in Node 19+
const requestLogger = new RequestLogger({
requestId,
userId: req.user?.id,
sessionId: req.session?.id,
...getTraceContext(),
});
// Attach to request for use in handlers
req.log = requestLogger;
// Log request start
requestLogger.info('Request started', {
request: {
method: req.method,
path: req.path,
query: req.query,
userAgent: req.headers['user-agent'],
ip: req.ip,
},
});
// Log request end
const startTime = Date.now();
res.on('finish', () => {
const duration = Date.now() - startTime;
requestLogger.info('Request completed', {
response: {
statusCode: res.statusCode,
duration_ms: duration,
},
});
});
next();
}
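The child-logger pattern above is worth isolating, because it is what keeps context flowing as a request moves through layers of your code. Here is a standalone distillation of the idea (class and field names are illustrative): each child merges extra fields into every entry it emits.

```typescript
// Minimal sketch of the child-context pattern: context accumulates
// as the request descends through layers, and every entry carries it.
type Fields = Record<string, unknown>;

class ContextLogger {
  constructor(private ctx: Fields = {}) {}

  // Child loggers inherit and extend the parent's context
  child(extra: Fields): ContextLogger {
    return new ContextLogger({ ...this.ctx, ...extra });
  }

  // Returns the serialized entry so the sketch is easy to inspect
  info(message: string, data: Fields = {}): string {
    return JSON.stringify({ level: 'info', message, ...this.ctx, ...data });
  }
}

const reqLog = new ContextLogger({ requestId: 'req-1', userId: 'user-456' });
const orderLog = reqLog.child({ orderId: 'ord-9' });
console.log(orderLog.info('Order created'));
// {"level":"info","message":"Order created","requestId":"req-1","userId":"user-456","orderId":"ord-9"}
```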
Log Levels: A Practical Guide
┌─────────────────────────────────────────────────────────────────────┐
│ LOG LEVEL GUIDELINES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL WHEN TO USE EXAMPLES │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ ERROR Something failed that • Unhandled exception │
│ shouldn't have. • External service down │
│ Requires attention. • Data corruption │
│ Alerts may fire. • Payment failed │
│ │
│ WARN Something unexpected but • Retry succeeded │
│ handled. Worth noting. • Deprecated API used │
│ No immediate action. • High latency detected │
│ • Rate limit approaching │
│ │
│ INFO Normal operations worth • User logged in │
│ recording. Business events. • Order placed │
│ Audit trail. • Config loaded │
│ • Server started │
│ │
│ DEBUG Detailed diagnostic info. • Cache hit/miss │
│ Off in production usually. • Query executed │
│ Used for troubleshooting. • Function entry/exit │
│ • Variable values │
│ │
│ TRACE Very detailed tracing. • Loop iterations │
│ Rarely used. Very verbose. • Byte-level data │
│ Performance impact. • Protocol details │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PRODUCTION DEFAULT: INFO │
│ • All INFO and above logged │
│ • DEBUG enabled per-service for troubleshooting │
│ • TRACE almost never enabled in production │
│ │
│ DON'T LOG: │
│ • Passwords, tokens, API keys │
│ • Full credit card numbers │
│ • Personal data (PII) unless required │
│ • High-frequency events that provide no value │
│ │
└─────────────────────────────────────────────────────────────────────┘
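The "don't log" list is easiest to enforce mechanically rather than by code review. A minimal recursive redaction sketch, assuming a denylist of key names (extend the set for your own schema):

```typescript
// Redact sensitive keys anywhere in a log payload before it is emitted.
// The key list is an assumption - adapt it to your domain.
const SENSITIVE = new Set(['password', 'token', 'api_key', 'card_number', 'secret']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE.has(k.toLowerCase()) ? [k, '[REDACTED]'] : [k, redact(v)]
      )
    );
  }
  return value;
}

console.log(JSON.stringify(redact({ user: 'a', password: 'hunter2', nested: { token: 't' } })));
// {"user":"a","password":"[REDACTED]","nested":{"token":"[REDACTED]"}}
```

Wired into the logger as a format step, this makes the safe path the default path.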
Metrics: Measuring What Matters
The Four Golden Signals
Google's Site Reliability Engineering book introduced the four golden signals. Start here.
THE FOUR GOLDEN SIGNALS:
════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────┐
│ 1. LATENCY │
│ ═══════════ │
│ The time it takes to service a request. │
│ │
│ Key insight: Measure SUCCESSFUL and FAILED requests separately │
│ A fast error is different from a slow success. │
│ │
│ Metrics: │
│ • http_request_duration_seconds (histogram) │
│ • Percentiles: p50, p90, p95, p99 │
│ • Segment by: endpoint, status_code, method │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 2. TRAFFIC │
│ ══════════ │
│ How much demand is being placed on your system. │
│ │
│ Key insight: Context for other metrics. High latency during │
│ peak traffic is different from high latency during low traffic.│
│ │
│ Metrics: │
│ • http_requests_total (counter) │
│ • requests_per_second (rate of above) │
│ • active_connections │
│ • Segment by: endpoint, method, customer_tier │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 3. ERRORS │
│ ══════════ │
│ The rate of failed requests. │
│ │
│ Key insight: Define "error" clearly. Is a 404 an error? │
│ Is a timeout an error? Is a business logic rejection? │
│ │
│ Metrics: │
│ • http_requests_total{status="5xx"} (counter) │
│ • error_rate = errors / total_requests │
│ • errors_by_type (counter with error_code label) │
│ • Segment by: error_type, endpoint, customer │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 4. SATURATION │
│ ═════════════ │
│ How "full" your service is. Utilization of constrained │
│ resources. │
│ │
│ Key insight: Leading indicator. High saturation predicts │
│ future problems before they cause errors. │
│ │
│ Metrics: │
│ • CPU utilization (%) │
│ • Memory utilization (%) │
│ • Disk I/O utilization │
│ • Database connection pool usage │
│ • Thread pool usage │
│ • Queue depth / backlog │
└─────────────────────────────────────────────────────────────────┘
RED and USE Methods
RED METHOD (For Services):
════════════════════════════════════════════════════════════════════
R - Rate: Requests per second
E - Errors: Failed requests per second
D - Duration: Time per request (distribution)
Best for: Request-driven services (APIs, web servers)
USE METHOD (For Resources):
════════════════════════════════════════════════════════════════════
U - Utilization: % time resource is busy
S - Saturation: Work queued waiting for resource
E - Errors: Error events for this resource
Best for: Infrastructure resources (CPU, disk, network)
COMBINED APPROACH FOR DAY ONE:
════════════════════════════════════════════════════════════════════
For each SERVICE, measure RED:
┌──────────────────────────────────────────────────────────────┐
│ api-gateway: │
│ Rate: 2,500 req/s │
│ Errors: 0.1% (2.5/s) │
│ Duration: p50=25ms, p99=180ms │
└──────────────────────────────────────────────────────────────┘
For each RESOURCE, measure USE:
┌──────────────────────────────────────────────────────────────┐
│ postgres-primary: │
│ Utilization: 45% CPU, 70% memory │
│ Saturation: 5 queries waiting (connection pool) │
│ Errors: 0 connection failures │
└──────────────────────────────────────────────────────────────┘
Metrics Implementation
// metrics.ts - Day One Metrics Setup
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import type { Request, Response, NextFunction } from 'express';
// Initialize meter provider - without a reader, nothing is ever exported
const meterProvider = new MeterProvider({
resource: new Resource({
'service.name': process.env.SERVICE_NAME ?? 'unknown',
'service.version': process.env.APP_VERSION ?? 'unknown',
'deployment.environment': process.env.NODE_ENV ?? 'development',
}),
readers: [
new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter(),
exportIntervalMillis: 15_000,
}),
],
});
const meter = meterProvider.getMeter('app-metrics');
// ════════════════════════════════════════════════════════════════
// HTTP METRICS (RED Method)
// ════════════════════════════════════════════════════════════════
// Rate: Request counter
const httpRequestsTotal = meter.createCounter('http_requests_total', {
description: 'Total number of HTTP requests',
});
// Duration: Request latency histogram
const httpRequestDuration = meter.createHistogram('http_request_duration_seconds', {
description: 'HTTP request duration in seconds',
unit: 's',
// Bucket boundaries for percentile calculation ('advice' in recent OTel JS)
advice: { explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] },
});
// Errors: Tracked via labels on httpRequestsTotal
// ════════════════════════════════════════════════════════════════
// BUSINESS METRICS
// ════════════════════════════════════════════════════════════════
const ordersTotal = meter.createCounter('orders_total', {
description: 'Total orders placed',
});
const orderValue = meter.createHistogram('order_value_dollars', {
description: 'Order value in dollars',
advice: { explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000] },
});
const activeUsers = meter.createObservableGauge('active_users', {
description: 'Currently active users',
});
activeUsers.addCallback((observableResult) => {
// Fetch from session store, Redis, etc.
const count = getActiveUserCount();
observableResult.observe(count);
});
// ════════════════════════════════════════════════════════════════
// RESOURCE METRICS (USE Method)
// ════════════════════════════════════════════════════════════════
const dbConnectionPoolSize = meter.createObservableGauge('db_connection_pool_size', {
description: 'Database connection pool size',
});
const dbConnectionPoolUsed = meter.createObservableGauge('db_connection_pool_used', {
description: 'Database connections currently in use',
});
const dbQueryDuration = meter.createHistogram('db_query_duration_seconds', {
description: 'Database query duration',
unit: 's',
advice: { explicitBucketBoundaries: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5] },
});
// ════════════════════════════════════════════════════════════════
// EXPRESS MIDDLEWARE
// ════════════════════════════════════════════════════════════════
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
const startTime = process.hrtime.bigint();
res.on('finish', () => {
const endTime = process.hrtime.bigint();
const durationSeconds = Number(endTime - startTime) / 1e9;
// Common labels
const labels = {
method: req.method,
path: normalizeRoute(req.route?.path || req.path), // Prevent cardinality explosion
status_code: String(res.statusCode),
status_class: `${Math.floor(res.statusCode / 100)}xx`,
};
// Record request count
httpRequestsTotal.add(1, labels);
// Record duration
httpRequestDuration.record(durationSeconds, labels);
});
next();
}
// CRITICAL: Normalize routes to prevent cardinality explosion
function normalizeRoute(path: string): string {
// /users/123/orders/456 → /users/:id/orders/:id
// Match UUIDs and ObjectIds BEFORE bare digits, otherwise an ID that
// starts with digits gets partially rewritten by the /:id rule
return path
.replace(/\/[a-f0-9-]{36}/g, '/:uuid') // UUIDs
.replace(/\/[a-f0-9]{24}/g, '/:objectId') // MongoDB ObjectIds
.replace(/\/\d+/g, '/:id'); // Numeric IDs
}
// ════════════════════════════════════════════════════════════════
// DATABASE INSTRUMENTATION
// ════════════════════════════════════════════════════════════════
export function instrumentQuery<T>(
operation: string,
table: string,
query: () => Promise<T>
): Promise<T> {
const startTime = process.hrtime.bigint();
return query()
.then((result) => {
recordQueryMetrics(operation, table, 'success', startTime);
return result;
})
.catch((error) => {
recordQueryMetrics(operation, table, 'error', startTime);
throw error;
});
}
function recordQueryMetrics(
operation: string,
table: string,
status: string,
startTime: bigint
) {
const durationSeconds = Number(process.hrtime.bigint() - startTime) / 1e9;
dbQueryDuration.record(durationSeconds, {
operation, // SELECT, INSERT, UPDATE, DELETE
table,
status,
});
}
Metric Naming Conventions
METRIC NAMING BEST PRACTICES:
════════════════════════════════════════════════════════════════════
FORMAT: <namespace>_<name>_<unit>
Good Examples:
http_requests_total (counter, no unit needed)
http_request_duration_seconds (histogram, unit in name)
process_memory_bytes (gauge, unit in name)
db_connections_active (gauge)
orders_total (counter)
order_processing_duration_seconds (histogram)
Bad Examples:
httpRequests (camelCase, no namespace)
request_latency (ambiguous unit - ms? s?)
numOrders (ambiguous type - total? rate?)
db_connection_count (is this total or active?)
LABEL NAMING:
════════════════════════════════════════════════════════════════════
Good Labels:
method="POST"
status_code="200"
service="user-api"
environment="production"
region="us-east-1"
Bad Labels (HIGH CARDINALITY - AVOID):
user_id="12345" # Millions of unique values!
request_id="abc-123" # Unique per request!
email="user@example.com" # PII + high cardinality
timestamp="..." # Infinite values
THE CARDINALITY RULE:
════════════════════════════════════════════════════════════════════
Cardinality = Product of all unique label value combinations
http_requests_total with labels:
method: 5 values (GET, POST, PUT, DELETE, PATCH)
path: 50 endpoints
status_code: 20 codes
region: 3 regions
─────────────────────────
Cardinality: 5 × 50 × 20 × 3 = 15,000 time series
Add user_id with 1M users?
Cardinality: 15,000 × 1,000,000 = 15 BILLION time series
→ System crashes, bills explode
RULE: Total cardinality per metric should stay under 10,000
Label values should be bounded and known ahead of time
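The multiplication above is worth doing before you ship any new label. A back-of-envelope estimator (a sketch; the function name is illustrative) makes the check a one-liner in a design review:

```typescript
// Worst-case series count = product of unique values per label.
// Feed it your label dimensions before adding a metric.
function estimateCardinality(labelValueCounts: Record<string, number>): number {
  return Object.values(labelValueCounts).reduce((acc, n) => acc * n, 1);
}

// The example from the text: 5 methods x 50 paths x 20 codes x 3 regions
console.log(estimateCardinality({ method: 5, path: 50, status_code: 20, region: 3 })); // 15000

// Adding user_id with 1M users blows past any reasonable budget
console.log(estimateCardinality({ method: 5, path: 50, status_code: 20, region: 3, user_id: 1_000_000 })); // 15000000000
```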
Distributed Tracing: Following the Request
Tracing Fundamentals
WHAT IS A TRACE?
════════════════════════════════════════════════════════════════════
A trace represents the entire journey of a request through your system.
User Request: "Create Order"
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ TRACE ID: abc-123-def-456 │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ SPAN: api-gateway (25ms) │ │
│ │ span_id: 001 │ │
│ │ parent_span_id: null (root span) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ SPAN: order-service (120ms) │ │ │
│ │ │ span_id: 002 │ │ │
│ │ │ parent_span_id: 001 │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────────────────────────────────────────┐ │ │ │
│ │ │ │ SPAN: validate-inventory (40ms) │ │ │ │
│ │ │ │ span_id: 003 │ │ │ │
│ │ │ │ parent_span_id: 002 │ │ │ │
│ │ │ └──────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────────────────────────────────────────┐ │ │ │
│ │ │ │ SPAN: charge-payment (60ms) │ │ │ │
│ │ │ │ span_id: 004 │ │ │ │
│ │ │ │ parent_span_id: 002 │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ ┌────────────────────────────────────────┐ │ │ │ │
│ │ │ │ │ SPAN: stripe-api (50ms) │ │ │ │ │
│ │ │ │ │ span_id: 005 │ │ │ │ │
│ │ │ │ │ parent_span_id: 004 │ │ │ │ │
│ │ │ │ │ span.kind: client │ │ │ │ │
│ │ │ │ └────────────────────────────────────────┘ │ │ │ │
│ │ │ └──────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────────────────────────────────────────┐ │ │ │
│ │ │ │ SPAN: db-insert-order (15ms) │ │ │ │
│ │ │ │ span_id: 006 │ │ │ │
│ │ │ │ parent_span_id: 002 │ │ │ │
│ │ │ └──────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Total Duration: 145ms │
└─────────────────────────────────────────────────────────────────┘
KEY CONCEPTS:
• Trace: Entire journey, identified by trace_id
• Span: Single operation, has start time and duration
• Parent Span: The span that called this span
• Root Span: The first span (no parent)
• Span Context: trace_id + span_id, propagated across services
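The parent/child relationships above are all a trace backend needs to rebuild the waterfall. A sketch of that reconstruction, grouping flat span records by their parent (types and names here are illustrative, not an OTel API):

```typescript
// Rebuild the span tree from flat records: index children by parent_span_id.
// This is essentially what a trace UI does before rendering the waterfall.
interface SpanRecord {
  spanId: string;
  parentSpanId: string | null; // null = root span
  name: string;
}

function buildTree(spans: SpanRecord[]): Map<string | null, SpanRecord[]> {
  const children = new Map<string | null, SpanRecord[]>();
  for (const s of spans) {
    const list = children.get(s.parentSpanId) ?? [];
    list.push(s);
    children.set(s.parentSpanId, list);
  }
  return children;
}

const spans: SpanRecord[] = [
  { spanId: '001', parentSpanId: null, name: 'api-gateway' },
  { spanId: '002', parentSpanId: '001', name: 'order-service' },
  { spanId: '003', parentSpanId: '002', name: 'validate-inventory' },
  { spanId: '004', parentSpanId: '002', name: 'charge-payment' },
];
const tree = buildTree(spans);
console.log(tree.get(null)![0].name); // api-gateway
console.log(tree.get('002')!.length); // 2
```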
Tracing Implementation
// tracing.ts - Day One Tracing Setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { trace, SpanStatusCode, Span } from '@opentelemetry/api';
// Initialize SDK BEFORE importing application code
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME,
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instrument HTTP, Express, database clients, etc.
'@opentelemetry/instrumentation-http': {
ignoreIncomingPaths: ['/health', '/metrics'], // Don't trace health checks
},
'@opentelemetry/instrumentation-express': {},
'@opentelemetry/instrumentation-pg': {}, // PostgreSQL
'@opentelemetry/instrumentation-redis': {}, // Redis
'@opentelemetry/instrumentation-mongodb': {}, // MongoDB
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
// ════════════════════════════════════════════════════════════════
// MANUAL INSTRUMENTATION HELPERS
// ════════════════════════════════════════════════════════════════
const tracer = trace.getTracer('app-tracer');
// Create a span for a function
export async function withSpan<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>
): Promise<T> {
return tracer.startActiveSpan(name, async (span) => {
try {
// Set initial attributes
span.setAttributes(attributes);
// Execute function
const result = await fn(span);
// Mark success
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
// Record error (narrow the unknown catch value for TypeScript)
const err = error instanceof Error ? error : new Error(String(error));
span.setStatus({
code: SpanStatusCode.ERROR,
message: err.message,
});
span.recordException(err);
throw error;
} finally {
span.end();
}
});
}
// Example usage: Custom business logic span
async function processOrder(order: Order): Promise<ProcessedOrder> {
return withSpan('process_order', {
'order.id': order.id,
'order.total': order.total,
'customer.id': order.customerId,
}, async (span) => {
// Validate inventory
const inventoryValid = await withSpan('validate_inventory', {
'items.count': order.items.length,
}, async () => {
return inventoryService.checkAvailability(order.items);
});
if (!inventoryValid) {
span.addEvent('inventory_validation_failed');
throw new Error('Insufficient inventory');
}
// Charge payment
const payment = await withSpan('charge_payment', {
'payment.amount': order.total,
'payment.currency': order.currency,
}, async () => {
return paymentService.charge(order);
});
span.addEvent('payment_completed', {
'payment.id': payment.id,
});
// Create order record
const savedOrder = await withSpan('save_order', {
'db.operation': 'INSERT',
'db.table': 'orders',
}, async () => {
return orderRepository.create(order);
});
return savedOrder;
});
}
// ════════════════════════════════════════════════════════════════
// SPAN CONTEXT PROPAGATION
// ════════════════════════════════════════════════════════════════
import { propagation, context } from '@opentelemetry/api';
// When making HTTP calls to other services, context is auto-propagated
// by the HTTP instrumentation. For other transports (queues, etc.):
// Inject context when sending
function injectTraceContext(carrier: Record<string, string>) {
propagation.inject(context.active(), carrier);
return carrier;
}
// Extract context when receiving
function extractTraceContext(carrier: Record<string, string>) {
return propagation.extract(context.active(), carrier);
}
// Example: Publishing to a queue with trace context
async function publishToQueue(queueName: string, message: any) {
const headers: Record<string, string> = {};
injectTraceContext(headers); // Inject trace context into headers
await queue.publish(queueName, {
body: message,
headers, // Headers contain traceparent, tracestate
});
}
// Example: Consuming from queue with trace context
async function consumeFromQueue(queueName: string) {
const message = await queue.consume(queueName);
// Extract and activate trace context
const extractedContext = extractTraceContext(message.headers);
// Run handler within extracted context
context.with(extractedContext, async () => {
await withSpan('process_queue_message', {
'queue.name': queueName,
'message.id': message.id,
}, async () => {
await handleMessage(message.body);
});
});
}
What to Trace
┌─────────────────────────────────────────────────────────────────────┐
│ WHAT TO TRACE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ALWAYS TRACE (Critical Path): │
│ ───────────────────────────────────────────────────────────────── │
│ • Incoming HTTP requests (usually auto-instrumented) │
│ • Outgoing HTTP calls to other services │
│ • Database queries │
│ • Cache operations (Redis, Memcached) │
│ • Message queue publish/consume │
│ • External API calls (Stripe, Twilio, etc.) │
│ │
│ USUALLY TRACE (Business Logic): │
│ ───────────────────────────────────────────────────────────────── │
│ • Core business operations (createOrder, processPayment) │
│ • Authentication/authorization checks │
│ • Data transformations that might be slow │
│ • Background job processing │
│ │
│ DON'T TRACE (Noise): │
│ ───────────────────────────────────────────────────────────────── │
│ • Health check endpoints │
│ • Metrics endpoints │
│ • Simple utility functions │
│ • Tight loops (trace the loop, not each iteration) │
│ • Static asset serving │
│ │
│ SPAN ATTRIBUTES TO CAPTURE: │
│ ───────────────────────────────────────────────────────────────── │
│ • User ID (who made this request?) │
│ • Tenant ID (which customer?) │
│ • Resource IDs (which order? which product?) │
│ • Operation results (success/failure, count, etc.) │
│ • Error details (error code, message) │
│ │
│ DON'T CAPTURE: │
│ ───────────────────────────────────────────────────────────────── │
│ • Passwords or secrets │
│ • Full request/response bodies (too large) │
│ • PII unless necessary and compliant │
│ │
└─────────────────────────────────────────────────────────────────────┘
Sampling Strategies
At scale, tracing everything is expensive. Sampling is essential.
SAMPLING STRATEGIES:
════════════════════════════════════════════════════════════════════
1. HEAD-BASED SAMPLING (Probabilistic)
───────────────────────────────────────
Decision made at trace START.
// Sample 10% of all traces
const sampler = new TraceIdRatioBasedSampler(0.1); // from @opentelemetry/sdk-trace-base
Pros:
✓ Simple to implement
✓ Low overhead
✓ Consistent across services
Cons:
✗ Might miss interesting traces
✗ Errors are sampled out with same probability
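One property worth understanding: the decision is a pure function of the trace ID, so every service that sees the same ID independently makes the same choice. The sketch below illustrates that idea; it is a simplification, not OpenTelemetry's exact algorithm.

```typescript
// Consistent head-based sampling (simplified sketch): interpret part of
// the trace ID as a number and keep the trace if it falls under the ratio.
// No coordination needed - the same ID yields the same decision everywhere.
function shouldSample(traceId: string, ratio: number): boolean {
  // Last 13 hex chars = 52 bits, safely representable as a JS number
  const value = parseInt(traceId.slice(-13), 16);
  return value < ratio * 2 ** 52;
}

console.log(shouldSample('0'.repeat(32), 0.5)); // true
console.log(shouldSample('f'.repeat(32), 0.5)); // false
```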
2. TAIL-BASED SAMPLING
──────────────────────
Decision made at trace END, after seeing all spans.
// Keep traces that:
// - Have errors
// - Have high latency
// - Match certain patterns
// (Illustrative pseudo-API: in practice, tail sampling runs in the
// OTel Collector, not in the application SDK)
const sampler = new TailSampler({
alwaysKeep: [
{ attribute: 'http.status_code', operator: '>=', value: 500 },
{ attribute: 'duration_ms', operator: '>', value: 1000 },
{ attribute: 'user.tier', operator: '==', value: 'enterprise' },
],
defaultSampleRate: 0.05, // 5% of everything else
});
Pros:
✓ Keep all interesting traces
✓ Error traces never lost
✓ High-latency traces captured
Cons:
✗ Higher complexity
✗ Requires collector buffer
✗ Higher resource usage
3. ADAPTIVE SAMPLING
────────────────────
Adjust sampling rate based on traffic volume.
// (Illustrative pseudo-API: adaptive sampling is typically a collector/vendor feature)
const sampler = new AdaptiveSampler({
targetTracesPerMinute: 1000,
minSampleRate: 0.01, // At least 1%
maxSampleRate: 1.0, // Up to 100%
});
At 100 req/min: 100% sampled (1000 target > 100 actual)
At 50K req/min: 2% sampled (1000 / 50000)
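The adjustment logic behind such a sampler is just a clamp: aim the rate at the target volume and bound it. A sketch (the `AdaptiveSampler` above is illustrative, and so is this helper):

```typescript
// Sample rate an adaptive sampler would pick for the current traffic level:
// target/actual, clamped to the configured [min, max] range.
function adaptiveRate(
  targetTracesPerMinute: number,
  actualRequestsPerMinute: number,
  minRate = 0.01,
  maxRate = 1.0
): number {
  if (actualRequestsPerMinute <= 0) return maxRate;
  const ideal = targetTracesPerMinute / actualRequestsPerMinute;
  return Math.min(maxRate, Math.max(minRate, ideal));
}

adaptiveRate(1000, 100);       // → 1.0  (low traffic: sample everything)
adaptiveRate(1000, 50_000);    // → 0.02 (2% keeps ~1000 traces/min)
adaptiveRate(1000, 1_000_000); // → 0.01 (floor: never below 1%)
```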
RECOMMENDED DAY-ONE STRATEGY:
════════════════════════════════════════════════════════════════════
Start with head-based sampling:
• 100% in development/staging
• 10-20% in production for most services
• 100% for critical paths (checkout, payment)
As traffic grows, add tail-based sampling:
• Always keep errors
• Always keep slow requests (>p99)
• Always keep specific users/tenants
// OpenTelemetry Collector config for tail sampling
processors:
tail_sampling:
decision_wait: 10s
policies:
- name: errors
type: status_code
status_code: {status_codes: [ERROR]}
- name: slow-traces
type: latency
latency: {threshold_ms: 1000}
- name: everything-else
type: probabilistic
probabilistic: {sampling_percentage: 5}
Context Propagation and Correlation
The magic of observability comes from connecting the pieces.
The Correlation ID Pattern
REQUEST FLOW WITH CORRELATION:
════════════════════════════════════════════════════════════════════
User Request
│
│ request_id: "req-abc-123"
│ trace_id: "trace-xyz-789"
│ user_id: "user-456"
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway │
│ ───────────────────────────────────────────────────────────── │
│ LOG: {"request_id":"req-abc-123","trace_id":"trace-xyz-789", │
│ "user_id":"user-456","msg":"Request received"} │
│ │
│ METRIC: http_requests_total{path="/orders"} │
│ │
│ SPAN: api-gateway (trace_id: trace-xyz-789) │
└────────────────────────────────┬────────────────────────────────┘
│
│ HTTP Headers:
│ X-Request-ID: req-abc-123
│ traceparent: 00-trace-xyz-789-...
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Order Service │
│ ───────────────────────────────────────────────────────────── │
│ LOG: {"request_id":"req-abc-123","trace_id":"trace-xyz-789", │
│ "user_id":"user-456","msg":"Creating order"} │
│ │
│ SPAN: order-service (trace_id: trace-xyz-789) │
│ └── span: validate-inventory │
│ └── span: db-insert │
└────────────────────────────────┬────────────────────────────────┘
│
│ Queue Message Headers:
│ request_id: req-abc-123
│ traceparent: 00-trace-xyz-789-...
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Notification Worker │
│ ───────────────────────────────────────────────────────────── │
│ LOG: {"request_id":"req-abc-123","trace_id":"trace-xyz-789", │
│ "user_id":"user-456","msg":"Sending confirmation"} │
│ │
│ SPAN: notification-worker (trace_id: trace-xyz-789) │
│ └── span: send-email │
└─────────────────────────────────────────────────────────────────┘
NOW YOU CAN:
════════════════════════════════════════════════════════════════════
1. Find all logs for a request:
query: request_id = "req-abc-123"
2. Find the trace for a request:
query: trace_id = "trace-xyz-789"
3. Find all activity for a user:
query: user_id = "user-456"
4. Jump from log → trace:
Click trace_id in log viewer → opens trace viewer
5. Jump from trace → logs:
Click span → shows logs from that time window with same trace_id
Implementation: Request Context
// context.ts - Request Context Management
import { AsyncLocalStorage } from 'async_hooks';
import { v4 as uuidv4 } from 'uuid';
import { trace } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
// (req.user and req.session below assume earlier auth/session middleware
// has augmented Express's Request type)
// Request context that flows through async operations
interface RequestContext {
requestId: string;
traceId: string;
spanId: string;
userId?: string;
tenantId?: string;
sessionId?: string;
userAgent?: string;
ip?: string;
// Add more as needed
}
// AsyncLocalStorage provides request-scoped storage
const requestContextStorage = new AsyncLocalStorage<RequestContext>();
// Get current request context
export function getRequestContext(): RequestContext | undefined {
return requestContextStorage.getStore();
}
// Run code with request context
export function runWithContext<T>(
context: RequestContext,
fn: () => T
): T {
return requestContextStorage.run(context, fn);
}
// Middleware to establish request context
export function requestContextMiddleware(
req: Request,
res: Response,
next: NextFunction
) {
// Get or create request ID
const requestId = (req.headers['x-request-id'] as string) || uuidv4();
// Get trace context from OpenTelemetry
const span = trace.getActiveSpan();
const spanContext = span?.spanContext();
// Build request context
const context: RequestContext = {
requestId,
traceId: spanContext?.traceId || uuidv4(),
spanId: spanContext?.spanId || '',
userId: req.user?.id,
tenantId: req.user?.tenantId,
sessionId: req.session?.id,
userAgent: req.headers['user-agent'],
ip: req.ip,
};
// Set request ID in response header for client correlation
res.setHeader('X-Request-ID', requestId);
// Run handler within context
runWithContext(context, () => {
next();
});
}
// ════════════════════════════════════════════════════════════════
// CONTEXT-AWARE LOGGER
// ════════════════════════════════════════════════════════════════
export function log(level: string, message: string, data?: Record<string, any>) {
const context = getRequestContext();
const logEntry = {
timestamp: new Date().toISOString(),
level,
message,
// Always include correlation IDs
request_id: context?.requestId,
trace_id: context?.traceId,
span_id: context?.spanId,
user_id: context?.userId,
tenant_id: context?.tenantId,
// Additional data
...data,
};
// Output structured JSON
console.log(JSON.stringify(logEntry));
}
// Convenience methods
export const logger = {
info: (message: string, data?: Record<string, any>) => log('info', message, data),
warn: (message: string, data?: Record<string, any>) => log('warn', message, data),
error: (message: string, data?: Record<string, any>) => log('error', message, data),
debug: (message: string, data?: Record<string, any>) => log('debug', message, data),
};
// ════════════════════════════════════════════════════════════════
// CONTEXT PROPAGATION TO ASYNC OPERATIONS
// ════════════════════════════════════════════════════════════════
// When publishing to queues, include context
export async function publishMessage(
queue: string,
payload: any
): Promise<void> {
const context = getRequestContext();
const message = {
payload,
metadata: {
request_id: context?.requestId,
trace_id: context?.traceId,
user_id: context?.userId,
tenant_id: context?.tenantId,
published_at: new Date().toISOString(),
},
};
await messageQueue.publish(queue, message);
}
// When consuming from queues, restore context
export async function processMessage(message: QueueMessage): Promise<void> {
const context: RequestContext = {
requestId: message.metadata.request_id || uuidv4(),
traceId: message.metadata.trace_id || uuidv4(),
spanId: '',
userId: message.metadata.user_id,
tenantId: message.metadata.tenant_id,
};
await runWithContext(context, async () => {
logger.info('Processing message', { queue: message.queue });
await handleMessage(message.payload);
});
}
// ════════════════════════════════════════════════════════════════
// HTTP CLIENT WITH CONTEXT PROPAGATION
// ════════════════════════════════════════════════════════════════
import axios, { AxiosRequestConfig } from 'axios';
export const httpClient = axios.create();
// Interceptor to add context headers
httpClient.interceptors.request.use((config: AxiosRequestConfig) => {
const context = getRequestContext();
if (context) {
config.headers = config.headers || {};
config.headers['X-Request-ID'] = context.requestId;
// traceparent is added automatically by OTel instrumentation
}
return config;
});
SLIs, SLOs, and Error Budgets
Defining Service Level Indicators
SLI (Service Level Indicator):
═══════════════════════════════════════════════════════════════════
A quantitative measure of some aspect of the level of service.
SLO (Service Level Objective):
═══════════════════════════════════════════════════════════════════
A target value or range for an SLI.
SLA (Service Level Agreement):
═══════════════════════════════════════════════════════════════════
A contract specifying consequences of meeting/missing SLOs.
ERROR BUDGET:
═══════════════════════════════════════════════════════════════════
The allowed failure rate: 100% - SLO target
EXAMPLE:
───────────────────────────────────────────────────────────────────
SLI: Percentage of successful HTTP requests (status < 500)
successful_requests
─────────────────── × 100%
total_requests
SLO: 99.9% of requests should succeed
Error Budget: 100% - 99.9% = 0.1% of requests can fail
In a month with 10 million requests:
• Error budget = 10,000 failed requests allowed
• If you've used 8,000, you have 2,000 remaining
• If you've used 12,000, you've exceeded budget
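Encoding that arithmetic once keeps every dashboard and report consistent. A minimal sketch:

```typescript
// Allowed failures in a window: total events × (1 − SLO target).
function allowedFailures(sloTarget: number, totalEvents: number): number {
  return Math.round(totalEvents * (1 - sloTarget));
}

allowedFailures(0.999, 10_000_000); // → 10000 (0.1% of 10M requests)
allowedFailures(0.995, 200_000);    // → 1000  (0.5% of 200k requests)
```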
Choosing Good SLIs
┌─────────────────────────────────────────────────────────────────────┐
│ SLI SELECTION GUIDE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ FOR REQUEST-DRIVEN SERVICES (APIs): │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ AVAILABILITY │
│ • Proportion of successful requests │
│ • (requests - 5xx errors) / requests │
│ │
│ LATENCY │
│ • Proportion of requests faster than threshold │
│ • requests_under_200ms / total_requests │
│ • Usually use p50, p90, p99 │
│ │
│ QUALITY (for degraded responses) │
│ • Proportion of requests with full (not degraded) response │
│ • full_responses / total_responses │
│ │
│ │
│ FOR DATA PROCESSING SYSTEMS (Pipelines): │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ FRESHNESS │
│ • Proportion of data updated within threshold │
│ • records_updated_in_1h / total_records │
│ │
│ CORRECTNESS │
│ • Proportion of data that is correct │
│ • correct_records / total_records │
│ │
│ THROUGHPUT │
│ • Proportion of time processing at expected rate │
│ • time_at_target_rate / total_time │
│ │
│ │
│ FOR STORAGE SYSTEMS: │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ DURABILITY │
│ • Proportion of data not lost │
│ • (stored_objects - lost_objects) / stored_objects │
│ │
│ AVAILABILITY │
│ • Proportion of time data is accessible │
│ │
│ LATENCY │
│ • Read/write latency distribution │
│ │
└─────────────────────────────────────────────────────────────────────┘
SLO Implementation
// slo.ts - SLO Tracking Implementation
interface SLODefinition {
name: string;
description: string;
target: number; // e.g., 0.999 for 99.9%
window: '7d' | '28d' | '30d';
sliQuery: string; // Prometheus query or similar
}
const sloDefinitions: SLODefinition[] = [
{
name: 'api_availability',
description: 'API requests returning non-5xx responses',
target: 0.999, // 99.9%
window: '30d',
sliQuery: `
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
`,
},
{
name: 'api_latency_p99',
description: 'API requests completing under 500ms',
target: 0.99, // 99%
window: '30d',
sliQuery: `
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
`,
},
{
name: 'checkout_success',
description: 'Checkout attempts that complete successfully',
target: 0.995, // 99.5%
window: '7d',
sliQuery: `
sum(rate(checkout_completed_total[5m]))
/
sum(rate(checkout_started_total[5m]))
`,
},
];
// Error budget calculation
interface ErrorBudget {
sloName: string;
target: number;
current: number; // Current SLI value
budgetTotal: number; // Total error budget (1 - target)
budgetUsed: number; // How much used
budgetRemaining: number; // How much left
budgetUsedPercent: number;
status: 'healthy' | 'warning' | 'critical' | 'exhausted';
}
function calculateErrorBudget(
slo: SLODefinition,
currentSLI: number,
totalEvents: number
): ErrorBudget {
const budgetTotal = 1 - slo.target;
const errorRate = 1 - currentSLI;
const budgetUsed = errorRate;
const budgetRemaining = Math.max(0, budgetTotal - errorRate);
const budgetUsedPercent = (budgetUsed / budgetTotal) * 100;
let status: ErrorBudget['status'];
if (budgetUsedPercent >= 100) {
status = 'exhausted';
} else if (budgetUsedPercent >= 80) {
status = 'critical';
} else if (budgetUsedPercent >= 50) {
status = 'warning';
} else {
status = 'healthy';
}
return {
sloName: slo.name,
target: slo.target,
current: currentSLI,
budgetTotal,
budgetUsed,
budgetRemaining,
budgetUsedPercent,
status,
};
}
// Example usage
// Current SLI: 99.85% (measured over 30 days)
// SLO target: 99.9%
const budget = calculateErrorBudget(
sloDefinitions[0], // api_availability
0.9985, // 99.85% current
10_000_000 // 10M requests in window
);
// Result:
// {
// sloName: 'api_availability',
// target: 0.999,
// current: 0.9985,
// budgetTotal: 0.001 (0.1%),
// budgetUsed: 0.0015 (0.15%),
// budgetRemaining: 0 (exhausted!),
// budgetUsedPercent: 150,
// status: 'exhausted'
// }
SLO Dashboard
┌─────────────────────────────────────────────────────────────────────┐
│ SLO DASHBOARD │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ API AVAILABILITY (30-day window) │
│ ═══════════════════════════════════════════════════════════════ │
│ Target: 99.9% Current: 99.85% Status: ⚠ CRITICAL │
│ │
│ Error Budget: │
│ ████████████████████████████████████████░░░░░░ 85% consumed │
│ │
│ Remaining: 1,500 errors of 10,000 budget (this month) │
│ Burn rate: 1.5x (consuming budget 50% faster than sustainable) │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ API LATENCY P99 (30-day window) │
│ ═══════════════════════════════════════════════════════════════ │
│ Target: 99% < 500ms Current: 99.2% Status: ✓ HEALTHY │
│ │
│ Error Budget: │
│ ████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░ 40% consumed │
│ │
│ Remaining: 60,000 slow requests of 100,000 budget │
│ Burn rate: 0.8x (sustainable) │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ CHECKOUT SUCCESS (7-day window) │
│ ═══════════════════════════════════════════════════════════════ │
│ Target: 99.5% Current: 99.7% Status: ✓ HEALTHY │
│ │
│ Error Budget: │
│ ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 12% consumed │
│ │
│ Remaining: 4,400 failed checkouts of 5,000 budget │
│ Burn rate: 0.6x (well under budget) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Alerting Philosophy
Alert Design Principles
┌─────────────────────────────────────────────────────────────────────┐
│ ALERTING PRINCIPLES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ALERT ON SYMPTOMS, NOT CAUSES │
│ ───────────────────────────────────────────────────────────────── │
│ Bad: "CPU > 80%" │
│ CPU might be high but service is fine │
│ │
│ Good: "Error rate > 1%" │
│ Users are actually affected │
│ │
│ 2. ALERT ON USER IMPACT │
│ ───────────────────────────────────────────────────────────────── │
│ Bad: "Database replica lag > 5s" │
│ Might not affect users at all │
│ │
│ Good: "Read queries returning stale data" │
│ Direct user impact │
│ │
│ 3. EVERY ALERT NEEDS AN ACTION │
│ ───────────────────────────────────────────────────────────────── │
│ If you can't do anything about it, don't alert. │
│ If it can wait until morning, don't page. │
│ │
│ 4. PREFER ERROR BUDGET-BASED ALERTS │
│ ───────────────────────────────────────────────────────────────── │
│ Bad: "Error rate > 0.1%" │
│ Arbitrary threshold, might be normal │
│ │
│ Good: "Error budget burn rate > 2x" │
│ Will exhaust budget if not addressed │
│ │
│ 5. TIME-BASED SEVERITY │
│ ───────────────────────────────────────────────────────────────── │
│ Severity 1 (Page immediately): Will exhaust error budget in <1hr │
│ Severity 2 (Page during hours): Will exhaust in <6hr │
│ Severity 3 (Ticket): Will exhaust in <3 days │
│ Severity 4 (Dashboard): Worth watching, no action needed │
│ │
└─────────────────────────────────────────────────────────────────────┘
Multi-Window Burn Rate Alerts
BURN RATE ALERTING:
════════════════════════════════════════════════════════════════════
Instead of: "Error rate > 0.1%"
Use: "Burning error budget at unsustainable rate"
BURN RATE CALCULATION:
──────────────────────
                  observed error rate
   Burn Rate  =  ─────────────────────     (error budget rate = 1 − SLO target)
                  error budget rate

   Example: SLO 99.9% → budget rate 0.1%. If 0.2% of requests are
   failing, the burn rate is 0.2% / 0.1% = 2x.
If SLO is 99.9% (0.1% error budget) over 30 days:
• Sustainable burn rate: 1x (use all budget in 30 days)
• Burn rate 2x: Would exhaust budget in 15 days
• Burn rate 10x: Would exhaust budget in 3 days
• Burn rate 36x: Would exhaust budget in 20 hours
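The number those burn rates translate into is time-to-exhaustion: window ÷ burn rate. A quick sketch:

```typescript
// Hours until the error budget is fully consumed, assuming the
// current burn rate holds for the rest of the window.
function hoursToExhaustion(windowDays: number, burnRate: number): number {
  return (windowDays * 24) / burnRate;
}

hoursToExhaustion(30, 1);    // → 720 hours (exactly the 30-day window)
hoursToExhaustion(30, 36);   // → 20 hours
hoursToExhaustion(30, 14.4); // ≈ 50 hours (~2 days)
```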
MULTI-WINDOW ALERT STRATEGY:
────────────────────────────
Short window + Long window = Fewer false positives
┌─────────────────────────────────────────────────────────────────┐
│ │
│ PAGE (Severity 1): │
│ • 5-minute burn rate > 14.4x AND │
│ • 1-hour burn rate > 14.4x │
│ = Will exhaust monthly budget in ~2 days │
│ │
│ Why both? A 5-minute spike alone might be noise. │
│ Combined with 1-hour confirms real problem. │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PAGE (Severity 2): │
│ • 30-minute burn rate > 6x AND │
│ • 6-hour burn rate > 6x │
│ = Will exhaust monthly budget in 5 days │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TICKET (Severity 3): │
│ • 2-hour burn rate > 3x AND │
│ • 1-day burn rate > 3x │
│ = Will exhaust monthly budget in 10 days │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ REVIEW (Severity 4): │
│ • 1-day burn rate > 1x │
│ = Consuming budget faster than planned │
│ (Not urgent, but should investigate) │
│ │
└─────────────────────────────────────────────────────────────────┘
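In code, the multi-window condition is a single AND over the two measured burn rates. A sketch of the paging decision (names are illustrative):

```typescript
// Page only when BOTH windows exceed the burn-rate threshold: the short
// window detects quickly, the long window filters out transient spikes.
function shouldPage(
  shortWindowBurnRate: number,
  longWindowBurnRate: number,
  threshold: number
): boolean {
  return shortWindowBurnRate > threshold && longWindowBurnRate > threshold;
}

shouldPage(20, 16, 14.4); // → true  (sustained burn: real incident)
shouldPage(20, 3, 14.4);  // → false (brief spike, long window is calm)
```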
Alert Configuration
# prometheus-alerts.yaml
groups:
- name: slo-alerts
rules:
# ═══════════════════════════════════════════════════════════════
# CRITICAL: Page immediately - ~2-day budget exhaustion
# ═══════════════════════════════════════════════════════════════
- alert: APIHighErrorBurnRate-Critical
# 5-minute burn rate > 14.4 AND 1-hour burn rate > 14.4
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001) # 14.4x the 0.1% error budget
AND
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate burning through SLO budget"
description: |
Error rate is {{ $value | humanizePercentage }}
At this rate, monthly error budget will be exhausted in ~2 days.
Runbook: https://runbooks.example.com/api-high-errors
Dashboard: https://grafana.example.com/d/api-overview
# ═══════════════════════════════════════════════════════════════
# WARNING: Page during business hours - ~5-day exhaustion
# ═══════════════════════════════════════════════════════════════
- alert: APIHighErrorBurnRate-Warning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/ sum(rate(http_requests_total[30m]))
) > (6 * 0.001)
AND
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/ sum(rate(http_requests_total[6h]))
) > (6 * 0.001)
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Elevated error rate impacting SLO budget"
description: |
Error rate is elevated at {{ $value | humanizePercentage }}
At this rate, monthly error budget will be exhausted in ~5 days.
# ═══════════════════════════════════════════════════════════════
# LATENCY ALERTS
# ═══════════════════════════════════════════════════════════════
- alert: APIHighLatency-Critical
expr: |
(
1 - (
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))
)
) > (14.4 * 0.01) # 14.4x the 1% error budget for latency
AND
(
1 - (
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
/ sum(rate(http_request_duration_seconds_count[1h]))
)
) > (14.4 * 0.01)
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "High latency burning through SLO budget"
description: |
{{ $value | humanizePercentage }} of requests exceeding 500ms latency target.
At this rate, monthly latency budget will be exhausted in ~2 days.
Day-One Checklist
┌─────────────────────────────────────────────────────────────────────┐
│ DAY-ONE OBSERVABILITY CHECKLIST │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ FOUNDATION │
│ □ OpenTelemetry SDK integrated │
│ □ Collector deployed and configured │
│ □ Storage backends chosen and provisioned │
│ □ Visualization tool set up (Grafana, etc.) │
│ │
│ LOGGING │
│ □ Structured JSON logging implemented │
│ □ Log levels defined and documented │
│ □ Request ID / trace ID in all logs │
│ □ Sensitive data redaction in place │
│ □ Log retention policy defined │
│ │
│ METRICS │
│ □ HTTP request metrics (rate, errors, duration) │
│ □ Database query metrics │
│ □ External dependency metrics │
│ □ Business metrics (orders, signups, etc.) │
│ □ Resource metrics (CPU, memory, connections) │
│ □ Metric naming convention documented │
│ □ Cardinality guidelines established │
│ │
│ TRACING │
│ □ Auto-instrumentation configured │
│ □ Custom spans for business operations │
│ □ Context propagation across services │
│ □ Context propagation through queues │
│ □ Sampling strategy defined │
│ │
│ CORRELATION │
│ □ Request ID flows through all services │
│ □ Logs ↔ Traces linked via trace_id │
│ □ Traces ↔ Metrics linked via exemplars │
│ □ User/tenant context in all telemetry │
│ │
│ SLOs & ALERTING │
│ □ SLIs defined for critical user journeys │
│ □ SLO targets set │
│ □ Error budget tracking implemented │
│ □ Burn rate alerts configured │
│ □ Runbooks for each alert │
│ □ On-call rotation established │
│ │
│ DASHBOARDS │
│ □ Service overview dashboard │
│ □ SLO status dashboard │
│ □ Infrastructure dashboard │
│ □ On-call dashboard (active alerts, recent deploys) │
│ │
│ DOCUMENTATION │
│ □ Observability architecture documented │
│ □ Runbooks for common issues │
│ □ On-call procedures documented │
│ □ Escalation paths defined │
│ │
└─────────────────────────────────────────────────────────────────────┘
Cost Considerations
Observability at scale gets expensive. Plan for this from day one.
COST DRIVERS:
════════════════════════════════════════════════════════════════════
┌───────────────┬────────────────────┬─────────────────────────────┐
│ Signal │ Cost Driver │ Mitigation │
├───────────────┼────────────────────┼─────────────────────────────┤
│ │ │ │
│ Logs │ Volume (GB) │ • Log levels (INFO in prod)│
│ │ Retention (days) │ • Structured → queryable │
│ │ │ • Sample debug logs │
│ │ │ • Short retention + archive│
│ │ │ │
│ Metrics │ Cardinality │ • Limit label values │
│ │ (time series) │ • Aggregate at edge │
│ │ │ • Recording rules │
│ │ │ • Drop unused metrics │
│ │ │ │
│ Traces │ Span count │ • Head-based sampling │
│ │ Retention │ • Tail-based sampling │
│ │ │ • Short retention │
│ │ │ • Only trace what matters │
│ │ │ │
└───────────────┴────────────────────┴─────────────────────────────┘
COST ESTIMATION (Rough):
════════════════════════════════════════════════════════════════════
For a mid-size application (1000 req/s, 10 services):
LOGS:
• ~100 GB/day at moderate verbosity
• Cloud logging: $0.50/GB ingest + $0.01/GB storage
• Monthly: ~$1,500/month
METRICS:
• ~50,000 active time series (30s scrape ≈ 4.3B samples/month)
• Managed Prometheus: ~$0.10/million samples
• Monthly: ~$500/month
TRACES:
• At 10% sampling: ~8.6M traces/day (~86M spans at ~10 spans/trace)
• Managed tracing: ~$0.20/million spans
• Monthly: ~$500/month
TOTAL: ~$2,500/month for moderate scale
At 10x scale (10,000 req/s):
Logs: $15,000/month (linear)
Metrics: $1,500/month (sublinear if cardinality controlled)
Traces: $2,000/month (sampling keeps it reasonable)
TOTAL: ~$18,500/month
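These estimates are easy to keep honest with a few lines of code. A sketch using the same illustrative unit prices as above (real vendor pricing varies widely; plug in your own numbers):

```typescript
// Rough monthly observability cost from daily volumes and unit prices.
// All prices here are illustrative assumptions, not any vendor's rates.
function monthlyCost(
  logGBPerDay: number,
  spansPerDay: number,
  logPricePerGB = 0.5,       // $/GB ingested (assumed)
  tracePricePerMSpans = 0.2  // $/million spans (assumed)
): { logs: number; traces: number; total: number } {
  const logs = logGBPerDay * 30 * logPricePerGB;
  const traces = (spansPerDay / 1_000_000) * 30 * tracePricePerMSpans;
  return { logs, traces, total: logs + traces };
}

monthlyCost(100, 86_000_000);
// logs ≈ $1,500, traces ≈ $516 — log volume dominates the bill
```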
COST OPTIMIZATION STRATEGIES:
════════════════════════════════════════════════════════════════════
1. SAMPLE INTELLIGENTLY
• 100% of errors, 10% of success
• 100% of slow requests
• Lower rates for high-volume endpoints
2. AGGREGATE AT THE EDGE
• Compute percentiles in collector, not storage
• Pre-aggregate metrics before sending
3. TIERED RETENTION
• Hot: 7 days (fast queries)
• Warm: 30 days (slower queries)
• Cold: 1 year (archive, cheap storage)
4. DROP THE NOISE
• Don't log health checks
• Don't trace static assets
• Drop debug logs in production
5. USE THE RIGHT TOOL
• Logs for events (search)
• Metrics for aggregates (dashboards)
• Traces for request flow (debugging)
• Don't duplicate across all three
Organizational Practices
Observability as a Team Practice
┌─────────────────────────────────────────────────────────────────────┐
│ TEAM OBSERVABILITY PRACTICES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ DEVELOPMENT │
│ ───────────────────────────────────────────────────────────────── │
│ • Every PR that changes behavior adds/updates metrics │
│ • Every new endpoint gets traced │
│ • Every error path has structured logging │
│ • Review dashboards in PR reviews │
│ │
│ CODE REVIEW CHECKLIST │
│ ───────────────────────────────────────────────────────────────── │
│ □ Are new endpoints instrumented? │
│ □ Are error paths logged with context? │
│ □ Are business metrics captured? │
│ □ Is sensitive data redacted? │
│ □ Will this cause cardinality explosion? │
│ │
│ DEPLOYMENT │
│ ───────────────────────────────────────────────────────────────── │
│ • Watch key metrics during deploy │
│ • Compare error rates before/after │
│ • Monitor for 15 minutes post-deploy │
│ • Rollback if SLI degrades significantly │
│ │
│ ON-CALL │
│ ───────────────────────────────────────────────────────────────── │
│ • Start with SLO dashboard │
│ • Use traces to find root cause │
│ • Document findings in incident log │
│ • Update runbooks with new learnings │
│ │
│ WEEKLY REVIEW │
│ ───────────────────────────────────────────────────────────────── │
│ • Review SLO status │
│ • Review error budget consumption │
│ • Discuss recent incidents │
│ • Identify observability gaps │
│ • Plan improvements │
│ │
└─────────────────────────────────────────────────────────────────────┘
Observability Maturity Model
MATURITY LEVELS:
════════════════════════════════════════════════════════════════════
LEVEL 0: REACTIVE
─────────────────
• Users report issues
• SSH into servers to debug
• grep through log files
• No metrics, no traces
"We find out about problems from Twitter"
LEVEL 1: BASIC MONITORING
─────────────────────────
• Centralized logging
• Basic metrics (CPU, memory)
• Up/down health checks
• Manual correlation
"We know when something is down"
LEVEL 2: PROACTIVE MONITORING
─────────────────────────────
• Structured logging
• Application metrics (RED)
• Basic distributed tracing
• Correlation IDs
• Dashboards for key services
• Threshold-based alerts
"We know when something is wrong before users complain"
LEVEL 3: OBSERVABILITY
──────────────────────
• Full OpenTelemetry adoption
• SLIs and SLOs defined
• Error budget tracking
• Burn rate alerting
• Traces linked to logs and metrics
• Context propagation everywhere
• Runbooks for all alerts
• Regular reviews
"We can answer any question about our system's behavior"
LEVEL 4: ADVANCED OBSERVABILITY
───────────────────────────────
• Automated anomaly detection
• Correlation across all signals
• AI-assisted root cause analysis
• Continuous profiling
• Chaos engineering with observability
• Observability-driven development
"Our system tells us what's wrong and often why"
AIM FOR LEVEL 3 BY END OF YEAR ONE.
Conclusion: The Day-One Mindset
Observability isn't a project you complete—it's a capability you build and maintain. Starting on day one means:
┌─────────────────────────────────────────────────────────────────────┐
│ THE DAY-ONE OBSERVABILITY MINDSET │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. INSTRUMENT FIRST, OPTIMIZE LATER │
│ Start with more instrumentation than you need. │
│ It's easier to reduce than to add after an incident. │
│ │
│ 2. CORRELATION IS EVERYTHING │
│ Request IDs, trace IDs, user IDs—flow them everywhere. │
│ The value multiplies when signals connect. │
│ │
│ 3. MEASURE WHAT USERS EXPERIENCE │
│ Not what your servers experience. │
│ SLIs should reflect user impact. │
│ │
│ 4. ALERTS ARE FOR HUMANS │
│ Every alert should be actionable. │
│ If you can't do anything, don't page. │
│ │
│ 5. OBSERVABILITY IS CULTURAL │
│ It's not just tools—it's how your team works. │
│ Make it part of development, review, and operations. │
│ │
│ 6. PLAN FOR COST │
│ Observability at scale is expensive. │
│ Design sampling and retention from the start. │
│ │
│ 7. ITERATE │
│ Your first SLOs will be wrong. │
│ Your first dashboards will be incomplete. │
│ Keep improving. │
│ │
└─────────────────────────────────────────────────────────────────────┘
The best time to add observability was when you wrote the first line of code. The second best time is now. The worst time is during your first major outage, when you're flying blind and users are angry.
Start on day one. Your future on-call self will thank you.