Backend Observability: Distributed Tracing, Metrics Pipelines, Structured Logging & SLO Monitoring
Why Observability Matters
Monitoring tells you what is broken. Observability tells you why. In a monolith, you grep a log file. In a distributed system with 50 microservices, a single user request touches 12 services, 4 databases, 3 caches, and 2 message queues. Without observability, debugging is guessing.
Request Flow in Microservices:
User → API Gateway → Auth Service → User Service → Postgres
→ Order Service → Redis Cache
→ Payment Service → Stripe API
→ Inventory Service → DynamoDB
→ Notification Service → SQS → Email
Where is the 2-second latency?
Without tracing: ¯\_(ツ)_/¯ check every service by hand
With tracing: Payment Service → Stripe API (1.8s timeout)
The Three Pillars + Events
┌─────────────────────────────────────────────────────────────────┐
│ Observability Pillars │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │ Events │ │
│ │ │ │ │ │ │ │ │ │
│ │ What │ │ How much │ │ Where │ │ What │ │
│ │ happened │ │ / how fast │ │ / how long │ │ changed │ │
│ │ │ │ │ │ │ │ │ │
│ │ JSON lines │ │ Counters │ │ Spans │ │ Deploys │ │
│ │ Error stacks│ │ Gauges │ │ Parent/ │ │ Configs │ │
│ │ Audit trail │ │ Histograms │ │ child │ │ Alerts │ │
│ │ │ │ Summaries │ │ Waterfall │ │ Rollbacks│ │
│ │ │ │ │ │ │ │ │ │
│ │ Volume: │ │ Volume: │ │ Volume: │ │ Volume: │ │
│ │ HIGH │ │ LOW │ │ MEDIUM │ │ LOW │ │
│ │ Cost: $$$ │ │ Cost: $ │ │ Cost: $$ │ │ Cost: $ │ │
│ └─────┬───────┘ └──────┬─────┘ └──────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────┬────────┴───────┬───────┴──────────────┘ │
│ │ │ │
│ ┌──────▼────────┐ ┌───▼──────────────┐ │
│ │ Correlation │ │ Unified Query │ │
│ │ (Trace ID) │ │ (Grafana, etc.) │ │
│ └───────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Structured Logging
Unstructured logs are strings you grep. Structured logs are JSON you query. The difference matters at scale — 1TB/day of logs needs indexing, filtering, and aggregation, not grep.
Unstructured (bad):
[2024-03-15 10:23:45] ERROR - Failed to process order #12345 for user john@example.com
Structured (good):
{
"timestamp": "2024-03-15T10:23:45.123Z",
"level": "error",
"service": "order-service",
"trace_id": "abc123def456",
"span_id": "789ghi",
"message": "Failed to process order",
"order_id": "12345",
"user_id": "usr_001",
"error_code": "PAYMENT_DECLINED",
"duration_ms": 234,
"retry_count": 2
}
Why structured?
1. Query: SELECT * FROM logs WHERE error_code = 'PAYMENT_DECLINED' AND duration_ms > 200
2. Aggregate: COUNT(*) GROUP BY error_code, service
3. Correlate: JOIN with traces ON trace_id
4. Alert: WHEN count(error_code='PAYMENT_DECLINED') > 100/min
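None of these four operations work on free-text strings. As a toy illustration (field names borrowed from the structured example entry above; the data is made up), the same filter and aggregation expressed over parsed log objects:

```typescript
// Hypothetical in-process version of the query and aggregation above.
interface ParsedLog {
  service: string;
  error_code?: string;
  duration_ms?: number;
}

const logs: ParsedLog[] = [
  { service: 'order-service', error_code: 'PAYMENT_DECLINED', duration_ms: 234 },
  { service: 'order-service', error_code: 'PAYMENT_DECLINED', duration_ms: 150 },
  { service: 'auth-service', duration_ms: 12 },
];

// Query: declined payments slower than 200ms
const slowDeclines = logs.filter(
  (l) => l.error_code === 'PAYMENT_DECLINED' && (l.duration_ms ?? 0) > 200,
);

// Aggregate: COUNT(*) GROUP BY error_code
const byErrorCode = new Map<string, number>();
for (const l of logs) {
  if (l.error_code) {
    byErrorCode.set(l.error_code, (byErrorCode.get(l.error_code) ?? 0) + 1);
  }
}
```

In production this filtering happens in the log backend (Loki, Elasticsearch, ClickHouse), not in application code — but the query shape is the same.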
// --- Structured Logger ---
enum LogLevel {
DEBUG = 0,
INFO = 1,
WARN = 2,
ERROR = 3,
FATAL = 4,
}
interface LogEntry {
timestamp: string;
level: string;
service: string;
message: string;
traceId?: string;
spanId?: string;
[key: string]: any;
}
interface LogTransport {
write(entry: LogEntry): void;
}
class StructuredLogger {
private baseFields: Record<string, any> = {};
private transports: LogTransport[] = [];
private minLevel: LogLevel = LogLevel.INFO;
private samplingRate = 1.0; // 1.0 = log everything
constructor(
private serviceName: string,
options?: {
level?: LogLevel;
transports?: LogTransport[];
samplingRate?: number;
},
) {
if (options?.level !== undefined) this.minLevel = options.level;
if (options?.transports) this.transports = options.transports;
if (options?.samplingRate !== undefined) this.samplingRate = options.samplingRate;
}
// Create a child logger with additional base fields
child(fields: Record<string, any>): StructuredLogger {
const child = new StructuredLogger(this.serviceName, {
level: this.minLevel,
transports: this.transports,
samplingRate: this.samplingRate,
});
child.baseFields = { ...this.baseFields, ...fields };
return child;
}
debug(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.DEBUG, message, fields);
}
info(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.INFO, message, fields);
}
warn(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.WARN, message, fields);
}
error(message: string, error?: Error, fields?: Record<string, any>): void {
this.log(LogLevel.ERROR, message, {
...fields,
error_name: error?.name,
error_message: error?.message,
stack_trace: error?.stack,
});
}
fatal(message: string, error?: Error, fields?: Record<string, any>): void {
this.log(LogLevel.FATAL, message, {
...fields,
error_name: error?.name,
error_message: error?.message,
stack_trace: error?.stack,
});
}
private log(level: LogLevel, message: string, fields?: Record<string, any>): void {
if (level < this.minLevel) return;
// Sampling for high-volume debug/info logs
if (level <= LogLevel.INFO && Math.random() > this.samplingRate) return;
const entry: LogEntry = {
timestamp: new Date().toISOString(),
level: LogLevel[level].toLowerCase(),
service: this.serviceName,
message,
...this.baseFields,
...fields,
};
// Remove undefined values
for (const key of Object.keys(entry)) {
if (entry[key] === undefined) delete entry[key];
}
for (const transport of this.transports) {
transport.write(entry);
}
}
addTransport(transport: LogTransport): void {
this.transports.push(transport);
}
}
// --- Log Transports ---
class ConsoleTransport implements LogTransport {
write(entry: LogEntry): void {
const output = JSON.stringify(entry);
switch (entry.level) {
case 'error':
case 'fatal':
console.error(output);
break;
case 'warn':
console.warn(output);
break;
default:
console.log(output);
}
}
}
// Buffered transport: batch writes to reduce I/O
class BufferedTransport implements LogTransport {
private buffer: LogEntry[] = [];
private flushTimer?: ReturnType<typeof setInterval>;
constructor(
private downstream: LogTransport,
private bufferSize: number = 100,
private flushInterval: number = 5000,
) {
this.flushTimer = setInterval(() => this.flush(), flushInterval);
}
write(entry: LogEntry): void {
this.buffer.push(entry);
if (this.buffer.length >= this.bufferSize) {
this.flush();
}
}
flush(): void {
if (this.buffer.length === 0) return;
const batch = this.buffer.splice(0);
for (const entry of batch) {
this.downstream.write(entry);
}
}
destroy(): void {
if (this.flushTimer) clearInterval(this.flushTimer);
this.flush();
}
}
// Async transport with backpressure
class AsyncTransport implements LogTransport {
private queue: LogEntry[] = [];
private processing = false;
private maxQueueSize: number;
constructor(
private downstream: {
writeBatch(entries: LogEntry[]): Promise<void>;
},
maxQueueSize = 10000,
) {
this.maxQueueSize = maxQueueSize;
}
write(entry: LogEntry): void {
if (this.queue.length >= this.maxQueueSize) {
// Backpressure: drop oldest entries
this.queue.shift();
}
this.queue.push(entry);
if (!this.processing) {
this.processQueue();
}
}
private async processQueue(): Promise<void> {
this.processing = true;
while (this.queue.length > 0) {
const batch = this.queue.splice(0, 100);
try {
await this.downstream.writeBatch(batch);
} catch (error) {
// Failed to send — re-queue at front
this.queue.unshift(...batch);
// Backoff
await new Promise((r) => setTimeout(r, 1000));
}
}
this.processing = false;
}
}
// --- Log Context Propagation ---
// NOTE: a simplified synchronous sketch. This Map-based version loses context
// across async boundaries; production Node.js code should use AsyncLocalStorage
// (node:async_hooks) instead.
class LogContext {
private static storage = new Map<string, Record<string, any>>();
private static currentContextId = '';
static run<T>(fields: Record<string, any>, fn: () => T): T {
const contextId = Math.random().toString(36).slice(2);
const parentFields = this.storage.get(this.currentContextId) ?? {};
this.storage.set(contextId, { ...parentFields, ...fields });
const prevContextId = this.currentContextId;
this.currentContextId = contextId;
try {
return fn();
} finally {
this.currentContextId = prevContextId;
this.storage.delete(contextId);
}
}
static getFields(): Record<string, any> {
return this.storage.get(this.currentContextId) ?? {};
}
}
// Usage:
// LogContext.run({ traceId: 'abc123', userId: 'usr_001' }, () => {
// logger.info('Processing request', LogContext.getFields());
// // All logs in this scope automatically include traceId and userId
// });
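The Map-based LogContext above only holds up for synchronous code. In Node.js, the same API is typically built on AsyncLocalStorage, which carries the store across awaits and callbacks (a sketch, assuming Node 16+):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Context survives awaits and callbacks, unlike a synchronous Map.
const logContext = new AsyncLocalStorage<Record<string, any>>();

function runWithContext<T>(fields: Record<string, any>, fn: () => T): T {
  // Merge the parent store so nested scopes inherit fields
  const parent = logContext.getStore() ?? {};
  return logContext.run({ ...parent, ...fields }, fn);
}

function contextFields(): Record<string, any> {
  return logContext.getStore() ?? {};
}

// Usage: every log call in the scope sees traceId, even after an await.
// runWithContext({ traceId: 'abc123' }, async () => {
//   await processOrder(); // hypothetical async work
//   logger.info('done', contextFields()); // includes traceId
// });
```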
Distributed Tracing
A trace follows a single request across all services it touches. Each service creates a span — a timed operation with context.
Trace: trace_id = abc123
│
├── Span A: API Gateway (0ms - 250ms)
│ ├── http.method: POST
│ ├── http.url: /api/orders
│ ├── http.status_code: 200
│ │
│ ├── Span B: Auth Service (5ms - 25ms)
│ │ ├── auth.method: JWT
│ │ └── auth.user_id: usr_001
│ │
│ ├── Span C: Order Service (30ms - 240ms)
│ │ ├── db.system: postgresql
│ │ │
│ │ ├── Span D: DB Query (35ms - 45ms)
│ │ │ ├── db.statement: INSERT INTO orders...
│ │ │ └── db.rows_affected: 1
│ │ │
│ │ ├── Span E: Payment Service (50ms - 200ms)
│ │ │ ├── payment.provider: stripe
│ │ │ ├── payment.amount: 99.99
│ │ │ │
│ │ │ └── Span F: Stripe API (55ms - 195ms) ← BOTTLENECK
│ │ │ ├── http.url: https://api.stripe.com/charges
│ │ │ └── http.status_code: 200
│ │ │
│ │ └── Span G: Notification (205ms - 235ms)
│ │ ├── notification.type: email
│ │ └── notification.queued: true
│ │
│ └── total: 250ms (Payment=150ms is 60% of total)
Waterfall View:
0ms 50ms 100ms 150ms 200ms 250ms
│ │ │ │ │ │
├─ A: Gateway ───────────────────────────────────────────────────┤
│ ├─ B: Auth ──┤ │
│ ├─ C: Order ──────────────────────────────────────────────┤ │
│ │ ├─ D: DB ─┤ │ │
│ │ ├─ E: Payment ─────────────────────────────────┤ │ │
│ │ │ └─ F: Stripe ──────────────────────────────┤ │ │
│ │ └─ G: Notify ────┤ │ │
Tracing Implementation (OpenTelemetry-compatible)
// --- Trace & Span Types ---
interface TraceContext {
  traceId: string; // 128-bit ID (32 hex chars)
  spanId: string; // 64-bit ID (16 hex chars)
traceFlags: number; // 1 byte (sampled = 0x01)
traceState?: string; // Vendor-specific key-value pairs
}
interface SpanData {
traceId: string;
spanId: string;
parentSpanId?: string;
name: string;
kind: SpanKind;
startTime: number; // Unix timestamp (microseconds)
endTime?: number;
status: SpanStatus;
attributes: Record<string, string | number | boolean>;
events: SpanEvent[];
links: SpanLink[];
}
enum SpanKind {
INTERNAL = 0, // Default — internal operation
SERVER = 1, // Handles incoming request
CLIENT = 2, // Makes outgoing request
PRODUCER = 3, // Creates a message (async)
CONSUMER = 4, // Processes a message (async)
}
interface SpanStatus {
code: StatusCode;
message?: string;
}
enum StatusCode {
UNSET = 0,
OK = 1,
ERROR = 2,
}
interface SpanEvent {
name: string;
timestamp: number;
attributes?: Record<string, string | number | boolean>;
}
interface SpanLink {
traceId: string;
spanId: string;
attributes?: Record<string, string | number | boolean>;
}
// --- Span Implementation ---
class Span {
private data: SpanData;
private ended = false;
constructor(
name: string,
traceId: string,
parentSpanId?: string,
kind: SpanKind = SpanKind.INTERNAL,
) {
this.data = {
traceId,
spanId: this.generateId(8),
parentSpanId,
name,
kind,
      startTime: (performance.timeOrigin + performance.now()) * 1000, // Unix epoch, microseconds
status: { code: StatusCode.UNSET },
attributes: {},
events: [],
links: [],
};
}
setAttribute(key: string, value: string | number | boolean): this {
this.data.attributes[key] = value;
return this;
}
setAttributes(attrs: Record<string, string | number | boolean>): this {
Object.assign(this.data.attributes, attrs);
return this;
}
addEvent(name: string, attributes?: Record<string, string | number | boolean>): this {
this.data.events.push({
name,
      timestamp: (performance.timeOrigin + performance.now()) * 1000,
attributes,
});
return this;
}
setStatus(code: StatusCode, message?: string): this {
this.data.status = { code, message };
return this;
}
recordException(error: Error): this {
this.addEvent('exception', {
'exception.type': error.name,
'exception.message': error.message,
'exception.stacktrace': error.stack ?? '',
});
this.setStatus(StatusCode.ERROR, error.message);
return this;
}
end(): void {
if (this.ended) return;
this.ended = true;
    this.data.endTime = (performance.timeOrigin + performance.now()) * 1000;
}
get spanId(): string {
return this.data.spanId;
}
get traceId(): string {
return this.data.traceId;
}
get spanData(): SpanData {
return { ...this.data };
}
get duration(): number {
if (!this.data.endTime) return 0;
return (this.data.endTime - this.data.startTime) / 1000; // milliseconds
}
private generateId(bytes: number): string {
const array = new Uint8Array(bytes);
crypto.getRandomValues(array);
return Array.from(array, (b) => b.toString(16).padStart(2, '0')).join('');
}
}
// --- Tracer ---
interface SpanExporter {
export(spans: SpanData[]): Promise<void>;
}
interface Sampler {
shouldSample(traceId: string, name: string): boolean;
}
class Tracer {
private activeSpans = new Map<string, Span>();
private completedSpans: SpanData[] = [];
private currentSpan: Span | null = null;
private exporters: SpanExporter[] = [];
private sampler: Sampler;
private batchSize: number;
private flushInterval: number;
private flushTimer?: ReturnType<typeof setInterval>;
constructor(
private serviceName: string,
options?: {
sampler?: Sampler;
exporters?: SpanExporter[];
batchSize?: number;
flushInterval?: number;
},
) {
this.sampler = options?.sampler ?? new AlwaysSampler();
this.exporters = options?.exporters ?? [];
this.batchSize = options?.batchSize ?? 100;
this.flushInterval = options?.flushInterval ?? 5000;
this.flushTimer = setInterval(() => this.flush(), this.flushInterval);
}
// Start a new root span (new trace)
startSpan(name: string, kind?: SpanKind): Span {
const traceId = this.generateTraceId();
if (!this.sampler.shouldSample(traceId, name)) {
// Return a no-op span that doesn't export
return new Span(name, traceId, undefined, kind);
}
const span = new Span(name, traceId, undefined, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
// Start a child span under the current active span
startChildSpan(name: string, kind?: SpanKind): Span {
const parent = this.currentSpan;
const traceId = parent?.traceId ?? this.generateTraceId();
const span = new Span(name, traceId, parent?.spanId, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
// Start a span from a propagated context (incoming request)
startSpanFromContext(name: string, context: TraceContext, kind?: SpanKind): Span {
const span = new Span(name, context.traceId, context.spanId, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
  // End a span and queue for export
  endSpan(span: Span): void {
    span.end();
    // Only spans the sampler accepted were registered in activeSpans;
    // unsampled spans from startSpan() are silently dropped here.
    const wasTracked = this.activeSpans.delete(span.spanId);
    if (wasTracked) this.completedSpans.push(span.spanData);
// Restore parent as current
if (span.spanData.parentSpanId) {
this.currentSpan = this.activeSpans.get(span.spanData.parentSpanId) ?? null;
} else {
this.currentSpan = null;
}
if (this.completedSpans.length >= this.batchSize) {
this.flush();
}
}
// Instrument an async function with a span
async trace<T>(
name: string,
fn: (span: Span) => Promise<T>,
kind?: SpanKind,
): Promise<T> {
const span = this.currentSpan
? this.startChildSpan(name, kind)
: this.startSpan(name, kind);
try {
const result = await fn(span);
span.setStatus(StatusCode.OK);
return result;
} catch (error) {
span.recordException(error as Error);
throw error;
} finally {
this.endSpan(span);
}
}
private async flush(): Promise<void> {
if (this.completedSpans.length === 0) return;
const batch = this.completedSpans.splice(0);
for (const exporter of this.exporters) {
try {
await exporter.export(batch);
} catch (error) {
console.error(`Failed to export spans: ${error}`);
}
}
}
private generateTraceId(): string {
const array = new Uint8Array(16);
crypto.getRandomValues(array);
return Array.from(array, (b) => b.toString(16).padStart(2, '0')).join('');
}
destroy(): void {
if (this.flushTimer) clearInterval(this.flushTimer);
this.flush();
}
}
// --- Samplers ---
class AlwaysSampler implements Sampler {
shouldSample(): boolean {
return true;
}
}
class RateSampler implements Sampler {
constructor(private rate: number) {} // 0.0 to 1.0
shouldSample(): boolean {
return Math.random() < this.rate;
}
}
// Sample based on trace ID — deterministic across services
// Same trace ID always gets the same sampling decision
class TraceIdRatioSampler implements Sampler {
private threshold: number;
constructor(ratio: number) {
this.threshold = Math.floor(ratio * 0xFFFFFFFF);
}
shouldSample(traceId: string): boolean {
// Hash the last 8 chars of trace ID to get a deterministic number
const hash = parseInt(traceId.slice(-8), 16);
return hash < this.threshold;
}
}
// Head-based + tail-based hybrid
class AdaptiveSampler implements Sampler {
private errorTraces = new Set<string>();
private baseSampler: Sampler;
constructor(baseRate: number) {
this.baseSampler = new TraceIdRatioSampler(baseRate);
}
shouldSample(traceId: string, name: string): boolean {
// Always sample traces that had errors
if (this.errorTraces.has(traceId)) return true;
    // Always sample business-critical operations
    if (name.includes('payment') || name.includes('checkout')) return true;
return this.baseSampler.shouldSample(traceId, name);
}
markError(traceId: string): void {
this.errorTraces.add(traceId);
// Clean up after 5 minutes
setTimeout(() => this.errorTraces.delete(traceId), 300000);
}
}
Context Propagation (W3C Trace Context)
// --- W3C Trace Context Propagation ---
// Header: traceparent: 00-{traceId}-{spanId}-{flags}
// Header: tracestate: vendor1=value1,vendor2=value2
class W3CTraceContextPropagator {
static readonly TRACEPARENT = 'traceparent';
static readonly TRACESTATE = 'tracestate';
// Inject trace context into outgoing request headers
static inject(span: Span): Record<string, string> {
const headers: Record<string, string> = {};
// Format: {version}-{trace-id}-{span-id}-{trace-flags}
headers[this.TRACEPARENT] =
`00-${span.traceId}-${span.spanId}-01`;
return headers;
}
// Extract trace context from incoming request headers
static extract(headers: Record<string, string>): TraceContext | null {
const traceparent = headers[this.TRACEPARENT];
if (!traceparent) return null;
// Parse: 00-{traceId(32hex)}-{spanId(16hex)}-{flags(2hex)}
const parts = traceparent.split('-');
if (parts.length !== 4) return null;
const [version, traceId, spanId, flags] = parts;
if (version !== '00') return null;
if (traceId.length !== 32) return null;
if (spanId.length !== 16) return null;
return {
traceId,
spanId,
traceFlags: parseInt(flags, 16),
traceState: headers[this.TRACESTATE],
};
}
}
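Parsing by hand shows how little state actually travels between services. The sample value below is the example traceparent from the W3C Trace Context specification:

```typescript
// traceparent format: {version}-{trace-id}-{span-id}-{trace-flags}
const header = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const [version, traceId, spanId, flags] = header.split('-');

// version '00', 32-hex trace ID, 16-hex span ID
const sampled = (parseInt(flags, 16) & 0x01) === 1; // bit 0 = sampled flag
```

Four dash-separated fields are all a downstream service needs to attach its spans to the same trace.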
// --- HTTP Middleware for Tracing ---
class TracingMiddleware {
constructor(private tracer: Tracer) {}
// Server-side: extract context from incoming request, create span
handleIncoming(
method: string,
url: string,
headers: Record<string, string>,
): Span {
const parentContext = W3CTraceContextPropagator.extract(headers);
let span: Span;
if (parentContext) {
span = this.tracer.startSpanFromContext(
`${method} ${url}`,
parentContext,
SpanKind.SERVER,
);
} else {
span = this.tracer.startSpan(`${method} ${url}`, SpanKind.SERVER);
}
span.setAttributes({
'http.method': method,
'http.url': url,
'http.flavor': '1.1',
});
return span;
}
// Client-side: inject context into outgoing request
handleOutgoing(
span: Span,
method: string,
url: string,
): Record<string, string> {
const childSpan = new Span(
`${method} ${url}`,
span.traceId,
span.spanId,
SpanKind.CLIENT,
);
childSpan.setAttributes({
'http.method': method,
'http.url': url,
});
return W3CTraceContextPropagator.inject(childSpan);
}
}
Metrics Pipeline
Metrics are numeric measurements collected over time. They're cheap to store, fast to query, and the foundation of dashboards and alerts.
Metric Types:
1. Counter (monotonically increasing)
http_requests_total{method="GET", status="200"} = 15234
Only goes up. Rate = requests/second.
2. Gauge (current value, goes up or down)
db_connections_active{pool="primary"} = 42
Current state. Used for queue depth, memory usage.
3. Histogram (distribution of values)
http_request_duration_seconds_bucket{le="0.1"} = 8000
http_request_duration_seconds_bucket{le="0.5"} = 9500
http_request_duration_seconds_bucket{le="1.0"} = 9900
http_request_duration_seconds_bucket{le="+Inf"} = 10000
Bucket counts. Compute percentiles: p50, p95, p99.
4. Summary (pre-computed percentiles)
http_request_duration_seconds{quantile="0.99"} = 0.85
Computed client-side. Not aggregatable across instances.
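Histogram buckets are cumulative — each bucket counts observations at or below its boundary — which is what makes server-side percentile estimation possible. A sketch using the example bucket counts listed above:

```typescript
// Percentile estimation from the cumulative bucket counts shown above.
const boundaries = [0.1, 0.5, 1.0]; // le="0.1", le="0.5", le="1.0"
const cumulative = [8000, 9500, 9900]; // observations at or below each boundary
const total = 10000; // le="+Inf"

function estimatePercentile(p: number): number {
  const target = Math.ceil(p * total);
  for (let i = 0; i < boundaries.length; i++) {
    if (cumulative[i] >= target) {
      // Linear interpolation within the bucket that contains the target rank
      const prevCum = i > 0 ? cumulative[i - 1] : 0;
      const prevBound = i > 0 ? boundaries[i - 1] : 0;
      const inBucket = cumulative[i] - prevCum;
      return prevBound + ((boundaries[i] - prevBound) * (target - prevCum)) / inBucket;
    }
  }
  return boundaries[boundaries.length - 1];
}
```

With these numbers, p50 lands inside the first bucket (interpolated to 0.0625s) and p95 lands exactly at the 0.5s boundary — an estimate, not an exact value, which is the trade-off histograms make for being aggregatable.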
Pipeline:
┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ App │──→ │ Collector │──→ │ TSDB │──→ │ Query / │
│ (client) │ │ (OTel/ │ │ (Prom/ │ │ Dashboard│
│ │ │ StatsD) │ │ Mimir) │ │ (Grafana)│
│ counter │ │ │ │ │ │ │
│ gauge │ │ aggregate │ │ store │ │ PromQL │
│ histogram│ │ batch │ │ compress │ │ alert │
│ │ │ export │ │ downsamp │ │ visualize│
└──────────┘ └───────────┘ └──────────┘ └──────────┘
// --- Metrics Collection Library ---
type Labels = Record<string, string>;
// Counter: monotonically increasing value
class Counter {
private values = new Map<string, number>();
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
) {}
inc(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) + value);
}
get(labels: Labels = {}): number {
return this.values.get(this.labelsToKey(labels)) ?? 0;
}
// Prometheus exposition format
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
for (const [key, value] of this.values) {
const labelStr = key ? `{${key}}` : '';
lines.push(`${this.name}${labelStr} ${value}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// Gauge: value that goes up and down
class Gauge {
private values = new Map<string, number>();
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
) {}
set(labels: Labels, value: number): void {
this.values.set(this.labelsToKey(labels), value);
}
inc(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) + value);
}
dec(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) - value);
}
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} gauge`];
for (const [key, value] of this.values) {
const labelStr = key ? `{${key}}` : '';
lines.push(`${this.name}${labelStr} ${value}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// Histogram: distribution of values in configurable buckets
class Histogram {
private buckets: Map<string, number[]>; // label key → bucket counts
private sums = new Map<string, number>();
private counts = new Map<string, number>();
private boundaries: number[];
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
boundaries: number[] = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
) {
this.boundaries = boundaries.sort((a, b) => a - b);
this.buckets = new Map();
}
observe(labels: Labels, value: number): void {
const key = this.labelsToKey(labels);
// Initialize buckets if needed
if (!this.buckets.has(key)) {
this.buckets.set(key, new Array(this.boundaries.length + 1).fill(0));
this.sums.set(key, 0);
this.counts.set(key, 0);
}
const bucketCounts = this.buckets.get(key)!;
// Increment all buckets where value <= boundary
for (let i = 0; i < this.boundaries.length; i++) {
if (value <= this.boundaries[i]) {
bucketCounts[i]++;
}
}
bucketCounts[this.boundaries.length]++; // +Inf bucket always increments
this.sums.set(key, this.sums.get(key)! + value);
this.counts.set(key, this.counts.get(key)! + 1);
}
// Timer helper
startTimer(labels: Labels): () => void {
const start = performance.now();
return () => {
const duration = (performance.now() - start) / 1000; // seconds
this.observe(labels, duration);
};
}
// Estimate percentile from histogram buckets
percentile(labels: Labels, p: number): number {
const key = this.labelsToKey(labels);
const bucketCounts = this.buckets.get(key);
const total = this.counts.get(key);
if (!bucketCounts || !total || total === 0) return 0;
const target = Math.ceil(p * total);
let cumulative = 0;
for (let i = 0; i < this.boundaries.length; i++) {
cumulative = bucketCounts[i];
if (cumulative >= target) {
// Linear interpolation within this bucket
const prevCumulative = i > 0 ? bucketCounts[i - 1] : 0;
const prevBoundary = i > 0 ? this.boundaries[i - 1] : 0;
const bucketWidth = this.boundaries[i] - prevBoundary;
const bucketCount = cumulative - prevCumulative;
if (bucketCount === 0) return this.boundaries[i];
const ratio = (target - prevCumulative) / bucketCount;
return prevBoundary + bucketWidth * ratio;
}
}
return this.boundaries[this.boundaries.length - 1];
}
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} histogram`];
for (const [key, bucketCounts] of this.buckets) {
const labelStr = key ? `${key},` : '';
for (let i = 0; i < this.boundaries.length; i++) {
lines.push(`${this.name}_bucket{${labelStr}le="${this.boundaries[i]}"} ${bucketCounts[i]}`);
}
lines.push(`${this.name}_bucket{${labelStr}le="+Inf"} ${bucketCounts[this.boundaries.length]}`);
      const plainLabels = key ? `{${key}}` : '';
      lines.push(`${this.name}_sum${plainLabels} ${this.sums.get(key)}`);
      lines.push(`${this.name}_count${plainLabels} ${this.counts.get(key)}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// --- Metrics Registry ---
class MetricsRegistry {
private metrics = new Map<string, Counter | Gauge | Histogram>();
createCounter(name: string, help: string, labelNames?: string[]): Counter {
const counter = new Counter(name, help, labelNames);
this.metrics.set(name, counter);
return counter;
}
createGauge(name: string, help: string, labelNames?: string[]): Gauge {
const gauge = new Gauge(name, help, labelNames);
this.metrics.set(name, gauge);
return gauge;
}
createHistogram(
name: string,
help: string,
labelNames?: string[],
boundaries?: number[],
): Histogram {
const histogram = new Histogram(name, help, labelNames, boundaries);
this.metrics.set(name, histogram);
return histogram;
}
// Expose /metrics endpoint (Prometheus format)
toPrometheus(): string {
const sections: string[] = [];
for (const metric of this.metrics.values()) {
sections.push(metric.toPrometheus());
}
return sections.join('\n\n') + '\n';
}
}
// --- Standard Backend Metrics ---
function createStandardMetrics(registry: MetricsRegistry) {
return {
// HTTP metrics
httpRequestsTotal: registry.createCounter(
'http_requests_total',
'Total HTTP requests',
['method', 'path', 'status'],
),
httpRequestDuration: registry.createHistogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'path'],
[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
),
httpRequestSize: registry.createHistogram(
'http_request_size_bytes',
'HTTP request body size in bytes',
['method', 'path'],
),
httpResponseSize: registry.createHistogram(
'http_response_size_bytes',
'HTTP response body size in bytes',
['method', 'path'],
),
// Database metrics
dbQueryDuration: registry.createHistogram(
'db_query_duration_seconds',
'Database query duration',
['operation', 'table'],
[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
),
dbConnectionsActive: registry.createGauge(
'db_connections_active',
'Active database connections',
['pool'],
),
dbConnectionsIdle: registry.createGauge(
'db_connections_idle',
'Idle database connections',
['pool'],
),
dbQueryErrors: registry.createCounter(
'db_query_errors_total',
'Database query errors',
['operation', 'error_type'],
),
// Cache metrics
cacheHits: registry.createCounter(
'cache_hits_total',
'Cache hit count',
['cache', 'operation'],
),
cacheMisses: registry.createCounter(
'cache_misses_total',
'Cache miss count',
['cache', 'operation'],
),
cacheLatency: registry.createHistogram(
'cache_operation_duration_seconds',
'Cache operation duration',
['cache', 'operation'],
[0.0001, 0.0005, 0.001, 0.005, 0.01],
),
// Queue metrics
queueDepth: registry.createGauge(
'queue_depth',
'Current queue depth',
['queue'],
),
queueProcessed: registry.createCounter(
'queue_messages_processed_total',
'Messages processed from queue',
['queue', 'status'],
),
queueProcessingDuration: registry.createHistogram(
'queue_processing_duration_seconds',
'Queue message processing duration',
['queue'],
),
// Runtime metrics
processMemoryBytes: registry.createGauge(
'process_memory_bytes',
'Process memory usage',
['type'],
),
processUptime: registry.createGauge(
'process_uptime_seconds',
'Process uptime in seconds',
),
eventLoopLag: registry.createHistogram(
'nodejs_event_loop_lag_seconds',
'Node.js event loop lag',
[],
[0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
),
};
}
SLO Monitoring & Error Budgets
SLI → SLO → Error Budget → Alerts
SLI (Service Level Indicator):
The measurement. Example: "99.2% of requests completed in < 500ms"
SLO (Service Level Objective):
The target. Example: "99.9% of requests must complete in < 500ms"
Error Budget:
The slack. 99.9% SLO = 0.1% error budget = 43.2 minutes/month
If you've used 30 minutes this month → 13 minutes remain
If budget exhausted → freeze deployments, focus on reliability
┌─────────────────────────────────────────────────────────────┐
│ SLO: 99.9% availability (43.2 min downtime/month budget) │
│ │
│ Day 1-10: ████████████░░░░░░░░░░░░░░░░░░ 85% budget left │
│ Day 10-20: ████████████████████░░░░░░░░░░ 50% budget left │
│ Day 20-25: █████████████████████████████░ 10% budget left │
│ Day 25-30: ██████████████████████████████ BUDGET EXHAUSTED │
│ ↑ │
│ Freeze deploys │
│ Page on-call │
│ │
│ Multi-window alerts: │
│ ┌────────────────┬──────────┬──────────┐ │
│ │ Window │ Burn Rate│ Action │ │
│ ├────────────────┼──────────┼──────────┤ │
│ │ 1 hour │ > 14.4x │ Page NOW │ │
│ │ 6 hours │ > 6x │ Page │ │
│ │ 3 days │ > 1x │ Ticket │ │
│ └────────────────┴──────────┴──────────┘ │
└─────────────────────────────────────────────────────────────┘
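The burn-rate thresholds in the table follow from simple arithmetic: 14.4x is the rate that spends 2% of a 30-day budget in a single hour, exhausting the whole budget in roughly two days. A sketch of the calculation:

```typescript
// Error-budget arithmetic for a 99.9% SLO over a 30-day window.
const target = 0.999;
const windowMinutes = 30 * 24 * 60; // 43200
const budgetMinutes = (1 - target) * windowMinutes; // 43.2 min per 30 days

// Burn rate = actual budget consumption / sustainable consumption.
// At 1.0 the budget lasts exactly 30 days; at 14.4x it lasts ~2.08 days.
function burnRate(badMinutes: number, windowHours: number): number {
  const sustainablePerHour = budgetMinutes / (30 * 24); // 0.06 min/hour
  return badMinutes / windowHours / sustainablePerHour;
}

// 0.864 bad minutes in 1 hour = 2% of the monthly budget = 14.4x burn.
// A full outage (60 bad minutes in 1 hour) burns at 1000x.
```

This is why the 1-hour window pages immediately: at 14.4x, waiting even a day would consume half the month's budget.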
// --- SLO Monitor ---
interface SLODefinition {
name: string;
description: string;
target: number; // e.g., 0.999 (99.9%)
windowDays: number; // e.g., 30 (rolling 30-day window)
sliType: 'availability' | 'latency' | 'throughput' | 'correctness';
// For latency SLOs
latencyThresholdMs?: number;
}
interface SLOStatus {
slo: SLODefinition;
currentSLI: number; // Current SLI value (e.g., 0.9985)
errorBudgetTotal: number; // Total budget (minutes or requests)
errorBudgetConsumed: number; // Used budget
errorBudgetRemaining: number;// Remaining budget
burnRate: number; // Current burn rate (1.0 = on track)
isViolating: boolean; // Are we currently below the SLO?
projectedSLI: number; // Where will we be at end of window?
}
class SLOMonitor {
private slos = new Map<string, SLODefinition>();
private measurements = new Map<string, {
totalEvents: number;
goodEvents: number;
windowStart: number;
buckets: Array<{ timestamp: number; total: number; good: number }>;
}>();
private alertCallbacks: Array<(alert: SLOAlert) => void> = [];
registerSLO(slo: SLODefinition): void {
this.slos.set(slo.name, slo);
this.measurements.set(slo.name, {
totalEvents: 0,
goodEvents: 0,
windowStart: Date.now(),
buckets: [],
});
}
// Record an event (request, operation, etc.)
record(sloName: string, isGood: boolean): void {
const measurement = this.measurements.get(sloName);
if (!measurement) return;
measurement.totalEvents++;
if (isGood) measurement.goodEvents++;
// Store in time buckets (1-minute granularity)
const now = Date.now();
const bucketKey = Math.floor(now / 60000) * 60000;
let bucket = measurement.buckets.find((b) => b.timestamp === bucketKey);
if (!bucket) {
bucket = { timestamp: bucketKey, total: 0, good: 0 };
measurement.buckets.push(bucket);
}
bucket.total++;
if (isGood) bucket.good++;
// Prune old buckets
const windowMs = (this.slos.get(sloName)?.windowDays ?? 30) * 86400000;
measurement.buckets = measurement.buckets.filter(
(b) => now - b.timestamp < windowMs,
);
// Check alerts
this.checkAlerts(sloName);
}
getStatus(sloName: string): SLOStatus | null {
const slo = this.slos.get(sloName);
const measurement = this.measurements.get(sloName);
if (!slo || !measurement) return null;
// Calculate SLI over the rolling window
const windowBuckets = this.getWindowBuckets(sloName);
const totalInWindow = windowBuckets.reduce((sum, b) => sum + b.total, 0);
const goodInWindow = windowBuckets.reduce((sum, b) => sum + b.good, 0);
const currentSLI = totalInWindow > 0 ? goodInWindow / totalInWindow : 1.0;
// Error budget
const errorBudgetTotal = (1 - slo.target) * totalInWindow;
const badEvents = totalInWindow - goodInWindow;
const errorBudgetRemaining = Math.max(0, errorBudgetTotal - badEvents);
// Burn rate: how fast are we consuming the budget?
// 1.0 = consuming at exactly the sustainable rate
// 2.0 = consuming twice as fast as sustainable
const elapsedFraction = this.getWindowElapsedFraction(slo);
const expectedBudgetConsumed = errorBudgetTotal * elapsedFraction;
const burnRate = expectedBudgetConsumed > 0
? badEvents / expectedBudgetConsumed
: 0;
// Project where we'll be at end of window
const projectedBadEvents = badEvents / Math.max(elapsedFraction, 0.01);
const projectedTotal = totalInWindow / Math.max(elapsedFraction, 0.01);
const projectedSLI = projectedTotal > 0
? (projectedTotal - projectedBadEvents) / projectedTotal
: 1.0;
return {
slo,
currentSLI,
errorBudgetTotal,
errorBudgetConsumed: badEvents,
errorBudgetRemaining,
burnRate,
isViolating: currentSLI < slo.target,
projectedSLI,
};
}
private getWindowBuckets(sloName: string): Array<{ total: number; good: number }> {
const slo = this.slos.get(sloName)!;
const measurement = this.measurements.get(sloName)!;
const windowMs = slo.windowDays * 86400000;
const now = Date.now();
return measurement.buckets.filter((b) => now - b.timestamp < windowMs);
}
private getWindowElapsedFraction(slo: SLODefinition): number {
const measurement = this.measurements.get(slo.name)!;
const windowMs = slo.windowDays * 86400000;
const elapsed = Date.now() - measurement.windowStart;
return Math.min(1, elapsed / windowMs);
}
// Multi-window, multi-burn-rate alerting (Google SRE book)
private checkAlerts(sloName: string): void {
const status = this.getStatus(sloName);
if (!status) return;
// Fast burn: 14.4x rate over 1 hour → exhausts budget in 2 days
// Checked with 5-minute short window
const burn1h = this.getBurnRateForWindow(sloName, 60);
const burn5m = this.getBurnRateForWindow(sloName, 5);
if (burn1h > 14.4 && burn5m > 14.4) {
this.emitAlert({
sloName,
severity: 'critical',
message: `SLO ${sloName}: burn rate ${burn1h.toFixed(1)}x (1h window). Only ${status.errorBudgetRemaining.toFixed(0)} error-budget events remaining.`,
burnRate: burn1h,
windowMinutes: 60,
budgetRemaining: status.errorBudgetRemaining,
});
}
// Medium burn: 6x rate over 6 hours
const burn6h = this.getBurnRateForWindow(sloName, 360);
const burn30m = this.getBurnRateForWindow(sloName, 30);
if (burn6h > 6 && burn30m > 6) {
this.emitAlert({
sloName,
severity: 'warning',
message: `SLO ${sloName}: burn rate ${burn6h.toFixed(1)}x (6h window). Budget at risk.`,
burnRate: burn6h,
windowMinutes: 360,
budgetRemaining: status.errorBudgetRemaining,
});
}
// Slow burn: 1x rate sustained over 3 days
const burn3d = this.getBurnRateForWindow(sloName, 4320);
if (burn3d > 1 && status.errorBudgetRemaining < status.errorBudgetTotal * 0.1) {
this.emitAlert({
sloName,
severity: 'info',
message: `SLO ${sloName}: <10% error budget remaining. Slow degradation.`,
burnRate: burn3d,
windowMinutes: 4320,
budgetRemaining: status.errorBudgetRemaining,
});
}
}
private getBurnRateForWindow(sloName: string, windowMinutes: number): number {
const slo = this.slos.get(sloName)!;
const measurement = this.measurements.get(sloName)!;
const windowMs = windowMinutes * 60000;
const now = Date.now();
const windowBuckets = measurement.buckets.filter(
(b) => now - b.timestamp < windowMs,
);
const total = windowBuckets.reduce((sum, b) => sum + b.total, 0);
const good = windowBuckets.reduce((sum, b) => sum + b.good, 0);
const bad = total - good;
if (total === 0) return 0;
const expectedBudgetPerMinute = (1 - slo.target) * total / windowMinutes;
const actualBudgetPerMinute = bad / windowMinutes;
return expectedBudgetPerMinute > 0
? actualBudgetPerMinute / expectedBudgetPerMinute
: 0;
}
private emitAlert(alert: SLOAlert): void {
for (const callback of this.alertCallbacks) {
callback(alert);
}
}
onAlert(callback: (alert: SLOAlert) => void): void {
this.alertCallbacks.push(callback);
}
}
interface SLOAlert {
sloName: string;
severity: 'critical' | 'warning' | 'info';
message: string;
burnRate: number;
windowMinutes: number;
budgetRemaining: number;
}
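To make the budget math in getStatus() concrete, here is its core isolated as a pure function — a sketch with illustrative names, not part of the class above:

```typescript
// The error-budget math from getStatus(), isolated as a pure function.
interface BudgetSnapshot {
  sli: number;
  budgetTotal: number;       // allowed bad events over the window
  budgetRemaining: number;
  burnRate: number;          // 1.0 = on track to exactly exhaust at window end
}

function computeBudget(
  target: number,                // e.g. 0.999
  totalEvents: number,
  badEvents: number,
  windowElapsedFraction: number, // 0..1 of the rolling window elapsed
): BudgetSnapshot {
  const sli = totalEvents > 0 ? (totalEvents - badEvents) / totalEvents : 1;
  const budgetTotal = (1 - target) * totalEvents;
  const expectedConsumed = budgetTotal * windowElapsedFraction;
  return {
    sli,
    budgetTotal,
    budgetRemaining: Math.max(0, budgetTotal - badEvents),
    burnRate: expectedConsumed > 0 ? badEvents / expectedConsumed : 0,
  };
}

// 1M events at a 99.9% target, halfway through the window, 800 bad events:
// budgetTotal ≈ 1000, budgetRemaining ≈ 200, burnRate ≈ 1.6 (ahead of schedule)
const snap = computeBudget(0.999, 1_000_000, 800, 0.5);
```

A burn rate of 1.6 halfway through the window projects to ~1600 bad events by window end — over budget, even though the SLO is not yet violated.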
Correlating Signals
The real power of observability is connecting logs, traces, and metrics through shared identifiers.
Correlation Strategy:
Every request gets a trace_id at the edge (API gateway).
This trace_id flows through:
→ Logs: every log line includes trace_id
→ Traces: the trace is identified by trace_id
→ Metrics: exemplars link metrics to trace_ids
User reports "my order was slow" → Support finds order_id
→ Logs: search trace_id from order_id log
→ Trace: view full waterfall for that trace_id
→ Metrics: check p99 latency spike at that timestamp
→ Alert: was there an active alert?
→ Deploy: was there a recent deployment?
┌─────────────────────────┐
│ Grafana Dashboard │
│ │
│ ┌─────────────────────┐ │
│ │ Metrics Panel │ │ Click on spike → jump to traces
│ │ p99 latency = 2.3s │─┼──→ Tempo: traces at t=14:32
│ │ ▲ │ │ │
│ │ │ ╱╲ │ │ ▼
│ │ │ ╱ ╲ │ │ Span: Payment API (1.8s)
│ │ │─╱────╲──── time │ │ │
│ └─────────────────────┘ │ ▼
│ │ Loki: logs for trace_id xyz
│ ┌─────────────────────┐ │ {"trace_id":"xyz",
│ │ Log Panel │ │ "error":"timeout",
│ │ ERROR: timeout │─┼──→ "service":"payment"}
│ │ trace_id: xyz │ │
│ └─────────────────────┘ │
└─────────────────────────┘
// --- Unified Observability Context ---
class ObservabilityContext {
private logger: StructuredLogger;
private tracer: Tracer;
private metrics: ReturnType<typeof createStandardMetrics>;
private sloMonitor: SLOMonitor;
constructor(
serviceName: string,
config: {
logLevel: LogLevel;
traceSamplingRate: number;
metricsRegistry: MetricsRegistry;
},
) {
this.logger = new StructuredLogger(serviceName, {
level: config.logLevel,
transports: [new ConsoleTransport()],
});
this.tracer = new Tracer(serviceName, {
sampler: new TraceIdRatioSampler(config.traceSamplingRate),
});
this.metrics = createStandardMetrics(config.metricsRegistry);
this.sloMonitor = new SLOMonitor();
}
// Instrument an HTTP handler with all three pillars
async instrumentHTTP<T>(
method: string,
path: string,
headers: Record<string, string>,
handler: (ctx: RequestContext) => Promise<T>,
): Promise<{ result: T; statusCode: number }> {
// 1. Extract or create trace context
const parentContext = W3CTraceContextPropagator.extract(headers);
const span = parentContext
? this.tracer.startSpanFromContext(`${method} ${path}`, parentContext, SpanKind.SERVER)
: this.tracer.startSpan(`${method} ${path}`, SpanKind.SERVER);
span.setAttributes({
'http.method': method,
'http.url': path,
});
// 2. Create correlated logger
const requestLogger = this.logger.child({
traceId: span.traceId,
spanId: span.spanId,
method,
path,
});
// 3. Start metrics timer
const timer = this.metrics.httpRequestDuration.startTimer({ method, path });
// 4. Build request context
const ctx: RequestContext = {
traceId: span.traceId,
spanId: span.spanId,
logger: requestLogger,
span,
};
let statusCode = 200;
try {
requestLogger.info('Request started');
const result = await handler(ctx);
span.setStatus(StatusCode.OK);
requestLogger.info('Request completed', { statusCode });
// Record SLO
this.sloMonitor.record('availability', true);
return { result, statusCode };
} catch (error) {
statusCode = 500;
span.recordException(error as Error);
requestLogger.error('Request failed', error as Error, { statusCode });
// Record SLO failure
this.sloMonitor.record('availability', false);
throw error;
} finally {
// Finalize metrics
timer(); // Records duration
this.metrics.httpRequestsTotal.inc({ method, path, status: String(statusCode) });
// End span
this.tracer.endSpan(span);
}
}
// Instrument a database query
async instrumentDB<T>(
ctx: RequestContext,
operation: string,
table: string,
query: () => Promise<T>,
): Promise<T> {
const childSpan = new Span(
`DB ${operation} ${table}`,
ctx.traceId,
ctx.spanId,
SpanKind.CLIENT,
);
childSpan.setAttributes({
'db.system': 'postgresql',
'db.operation': operation,
'db.sql.table': table,
});
const timer = this.metrics.dbQueryDuration.startTimer({ operation, table });
try {
const result = await query();
childSpan.setStatus(StatusCode.OK);
return result;
} catch (error) {
childSpan.recordException(error as Error);
this.metrics.dbQueryErrors.inc({ operation, error_type: (error as Error).name });
throw error;
} finally {
timer();
childSpan.end();
}
}
// Instrument a cache operation
async instrumentCache<T>(
ctx: RequestContext,
cacheName: string,
operation: string,
key: string,
fn: () => Promise<T | null>,
): Promise<T | null> {
const timer = this.metrics.cacheLatency.startTimer({ cache: cacheName, operation });
try {
const result = await fn();
if (result !== null) {
this.metrics.cacheHits.inc({ cache: cacheName, operation });
} else {
this.metrics.cacheMisses.inc({ cache: cacheName, operation });
}
return result;
} finally {
timer();
}
}
getSLOMonitor(): SLOMonitor {
return this.sloMonitor;
}
}
interface RequestContext {
traceId: string;
spanId: string;
logger: StructuredLogger;
span: Span;
}
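instrumentHTTP above depends on extracting W3C trace context from incoming headers. As a rough sketch of what such an extractor does internally (the W3CTraceContextPropagator class is assumed from earlier in the chapter; this standalone parser follows the traceparent format from the W3C spec):

```typescript
// Minimal W3C traceparent parser. Format: "00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>".
interface TraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
// → traceId "4bf92...", spanId "00f06...", sampled: true
```

In production, prefer the propagator shipped with your tracing SDK over a hand-rolled parser — it also handles the companion tracestate header.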
Alerting Strategy
Alert Quality Pyramid:
┌─────────┐
│ SLO-based│ Best: Alert on error budget burn rate
│ alerts │ "We're consuming budget 14x faster than sustainable"
├──────────┤
│ Symptom │ Good: Alert on user-facing symptoms
│ alerts │ "p99 latency > 2s for 5 minutes"
├──────────┤
│ Cause │ OK: Alert on known failure modes
│ alerts │ "Disk usage > 90%"
├──────────┤
│ Threshold│ Bad: Static thresholds on raw metrics
│ alerts │ "CPU > 80%" (could be fine)
└──────────┘
Alert Rules (best practices):
DO:
✓ Alert on SLO burn rate (multi-window)
✓ Alert on user-facing errors (5xx rate > 1%)
✓ Alert on latency degradation relative to baseline
✓ Page only for things that need immediate human action
✓ Include runbook links in every alert
✓ Auto-resolve when condition clears
DON'T:
✗ Alert on raw CPU/memory (causes, not user-facing symptoms)
✗ Alert per-instance (use aggregate metrics)
✗ Create non-actionable alerts
✗ Page for things that self-heal
✗ Set arbitrary static thresholds
// --- Alert Rule Engine ---
interface AlertRule {
name: string;
description: string;
query: string; // PromQL-like expression
condition: AlertCondition;
severity: 'critical' | 'warning' | 'info';
for: number; // Duration condition must hold (seconds)
runbookUrl?: string;
labels: Record<string, string>;
annotations: Record<string, string>;
}
interface AlertCondition {
type: 'threshold' | 'burn_rate' | 'anomaly';
operator: '>' | '<' | '>=' | '<=' | '==' | '!=';
value: number;
}
enum AlertState {
INACTIVE = 'inactive',
PENDING = 'pending', // Condition true, waiting for `for` duration
FIRING = 'firing',
RESOLVED = 'resolved',
}
interface ActiveAlert {
rule: AlertRule;
state: AlertState;
value: number;
startedAt: number;
firedAt?: number;
resolvedAt?: number;
fingerprint: string;
}
class AlertEngine {
private rules: AlertRule[] = [];
private activeAlerts = new Map<string, ActiveAlert>();
private notifiers: AlertNotifier[] = [];
private evaluationInterval = 15000; // 15 seconds
private evaluationTimer?: ReturnType<typeof setInterval>;
addRule(rule: AlertRule): void {
this.rules.push(rule);
}
addNotifier(notifier: AlertNotifier): void {
this.notifiers.push(notifier);
}
start(): void {
this.evaluationTimer = setInterval(() => this.evaluate(), this.evaluationInterval);
}
stop(): void {
if (this.evaluationTimer) clearInterval(this.evaluationTimer);
}
private async evaluate(): Promise<void> {
for (const rule of this.rules) {
const currentValue = this.evaluateQuery(rule.query);
const conditionMet = this.checkCondition(currentValue, rule.condition);
const fingerprint = this.computeFingerprint(rule);
const existing = this.activeAlerts.get(fingerprint);
if (conditionMet) {
if (!existing) {
// New alert — start pending
this.activeAlerts.set(fingerprint, {
rule,
state: AlertState.PENDING,
value: currentValue,
startedAt: Date.now(),
fingerprint,
});
} else if (
existing.state === AlertState.PENDING &&
Date.now() - existing.startedAt >= rule.for * 1000
) {
// Pending long enough — fire
existing.state = AlertState.FIRING;
existing.firedAt = Date.now();
existing.value = currentValue;
await this.notify(existing, 'firing');
} else if (existing.state === AlertState.RESOLVED) {
// Was resolved, now firing again
existing.state = AlertState.PENDING;
existing.startedAt = Date.now();
existing.resolvedAt = undefined;
}
// Update value
if (existing) existing.value = currentValue;
} else {
if (existing && existing.state === AlertState.FIRING) {
existing.state = AlertState.RESOLVED;
existing.resolvedAt = Date.now();
await this.notify(existing, 'resolved');
} else if (existing) {
// Condition cleared while pending (or after resolution) — remove
this.activeAlerts.delete(fingerprint);
}
}
}
}
private evaluateQuery(_query: string): number {
// In production: evaluate PromQL against metric store
// Simplified: return a random value for demonstration
return Math.random() * 100;
}
private checkCondition(value: number, condition: AlertCondition): boolean {
switch (condition.operator) {
case '>': return value > condition.value;
case '<': return value < condition.value;
case '>=': return value >= condition.value;
case '<=': return value <= condition.value;
case '==': return value === condition.value;
case '!=': return value !== condition.value;
}
}
private async notify(alert: ActiveAlert, action: 'firing' | 'resolved'): Promise<void> {
const notification = {
alertName: alert.rule.name,
description: alert.rule.description,
severity: alert.rule.severity,
state: action,
value: alert.value,
startedAt: new Date(alert.startedAt).toISOString(),
firedAt: alert.firedAt ? new Date(alert.firedAt).toISOString() : undefined,
resolvedAt: alert.resolvedAt ? new Date(alert.resolvedAt).toISOString() : undefined,
labels: alert.rule.labels,
annotations: alert.rule.annotations,
runbookUrl: alert.rule.runbookUrl,
};
for (const notifier of this.notifiers) {
try {
await notifier.send(notification);
} catch (error) {
console.error(`Failed to send alert notification: ${error}`);
}
}
}
private computeFingerprint(rule: AlertRule): string {
return `${rule.name}:${JSON.stringify(rule.labels)}`;
}
getActiveAlerts(): ActiveAlert[] {
return Array.from(this.activeAlerts.values())
.filter((a) => a.state === AlertState.FIRING);
}
}
interface AlertNotifier {
send(notification: Record<string, any>): Promise<void>;
}
// PagerDuty notifier
class PagerDutyNotifier implements AlertNotifier {
constructor(private apiKey: string) {}
async send(notification: Record<string, any>): Promise<void> {
// POST to PagerDuty Events API v2
if (notification.severity === 'critical' && notification.state === 'firing') {
console.log(`[PagerDuty] TRIGGER: ${notification.alertName}`);
} else if (notification.state === 'resolved') {
console.log(`[PagerDuty] RESOLVE: ${notification.alertName}`);
}
}
}
// Slack notifier
class SlackNotifier implements AlertNotifier {
constructor(private webhookUrl: string) {}
async send(notification: Record<string, any>): Promise<void> {
const emoji = notification.state === 'resolved' ? '✅' :
notification.severity === 'critical' ? '🔴' : '🟡';
console.log(
`[Slack] ${emoji} ${notification.alertName}: ${notification.description} ` +
`(${notification.state}, value=${notification.value.toFixed(2)})`,
);
}
}
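The pending/firing/resolved lifecycle inside evaluate() can be distilled into a pure transition function — illustrative names, same semantics as the engine above:

```typescript
// The alert lifecycle from evaluate(), as a pure state transition.
// State names mirror the AlertState enum above.
type State = "inactive" | "pending" | "firing" | "resolved";

function nextState(
  current: State,
  conditionMet: boolean,
  pendingLongEnough: boolean, // has the rule's `for` duration elapsed?
): State {
  if (conditionMet) {
    if (current === "inactive" || current === "resolved") return "pending";
    if (current === "pending") return pendingLongEnough ? "firing" : "pending";
    return "firing"; // already firing, stay firing
  }
  // Condition cleared: a firing alert resolves; anything else resets
  return current === "firing" ? "resolved" : "inactive";
}
```

The `for` duration is what prevents a single noisy scrape from paging anyone — the condition must hold continuously before an alert fires.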
Observability Stack Comparison
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Thanos/Mimir | Datadog, New Relic |
| Logs | Loki, ELK (Elastic) | Splunk, Datadog Logs |
| Traces | Jaeger, Tempo | Honeycomb, Lightstep |
| Dashboards | Grafana | Datadog, New Relic |
| Alerting | Alertmanager | PagerDuty, OpsGenie |
| Collection | OpenTelemetry | Datadog Agent |
| APM | SigNoz, Uptrace | Dynatrace, AppDynamics |

| Approach | Monthly Cost (100 services) | Pros | Cons |
|---|---|---|---|
| Self-hosted OSS | $5-15K (infra) | Full control, no data egress, unlimited cardinality | Operational burden, needs dedicated team |
| Managed OSS | $10-30K | Lower ops burden, scalable | Still complex, multi-tool |
| Full SaaS | $30-100K+ | Zero ops, integrated, easy onboard | Expensive at scale, vendor lock-in |
Interview Questions & Answers
Q1: What is the difference between monitoring and observability?
A: Monitoring is predefined — you decide in advance what metrics and thresholds to watch. It answers "is the system healthy?" with yes/no. Observability is exploratory — it lets you ask arbitrary questions about the system's internal state from its external outputs. It answers "why is the system unhealthy?" without needing to have predicted the failure mode in advance. A system is observable when you can understand any internal state just by examining telemetry (logs, metrics, traces). Example: monitoring catches "error rate > 5%." Observability lets you drill down: "errors are 90% from the payment service, specifically the Stripe timeout path, only for requests from the EU region, starting 10 minutes after deploy v2.4.1." The distinction matters because in distributed systems, most failures are novel — you can't pre-define alerts for every possible failure mode.
Q2: How do you implement distributed tracing without significant performance overhead?
A: Four strategies: (1) Sampling — don't trace every request. Head-based sampling (decide at the edge, propagate the decision) with 1-10% sampling rate captures enough data for insight while reducing overhead 10-100x. Use deterministic sampling (trace ID hash) so all services agree on the same decision. (2) Async export — buffer spans in memory and export in batches to the collector. Don't block the request path for telemetry I/O. Use a bounded buffer with backpressure (drop oldest spans when full, never slow down requests). (3) Tail-based sampling — trace everything, but only keep interesting traces (errors, high latency, specific paths). This catches 100% of problems while storing <5% of data. Trade-off: requires a centralized collector to make sampling decisions after the trace completes. (4) Context propagation is cheap — passing a 16-byte (32 hex character) trace ID in headers adds <0.01ms. The cost is in serializing and exporting span data. Keep span attributes minimal and export asynchronously.
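The deterministic sampling from (1) fits in a few lines — no hash needed here, simply reusing the low bits of the hex trace ID as a uniform value (an illustrative sketch):

```typescript
// Deterministic head-based sampling: every service derives the same
// keep/drop decision from the trace ID itself, so traces are never
// half-sampled across service boundaries.
function shouldSample(traceId: string, rate: number): boolean {
  // Treat the low 32 bits of the 128-bit hex trace ID as a uniform [0, 1) value
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < rate;
}

const id = "4bf92f3577b34da6a3ce929d0e0e4736";
shouldSample(id, 0.1); // same answer on every service, every time
```

Because the decision is a pure function of the trace ID, no sampling flag needs to be trusted across services — though in practice the W3C sampled flag is still propagated to avoid recomputation.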
Q3: How do you design effective SLOs and error budgets?
A: Start from user expectations, not system capabilities. (1) Choose SLIs — for a web service: availability (% of non-5xx responses), latency (% of requests under 500ms), correctness (% of correct results). (2) Set SLO targets — based on what users actually need, not "as high as possible." 99.9% (43 min/month downtime) is appropriate for most internal services. 99.95% for customer-facing. 99.99% only for critical infrastructure (auth, payments). (3) Calculate error budget — 99.9% SLO = 0.1% budget. For 1M requests/month, that's 1000 failed requests. (4) Alert on burn rate, not threshold — use multi-window burn rate alerting (Google SRE). A 14.4x burn rate over 1 hour means "budget will exhaust in 2 days" — page immediately. A 1x burn rate sustained over 3 days means "we'll barely miss" — create a ticket. (5) Policy — when budget is exhausted, freeze deployments and focus on reliability. When budget is healthy, deploy features faster. This aligns engineering velocity with reliability quantitatively.
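The budget arithmetic from (3) in code form (an illustrative helper):

```typescript
// Error budget expressed in time: (1 - target) × window length.
// Matches the figures quoted above for monthly windows.
function errorBudgetMinutes(target: number, windowDays = 30): number {
  return (1 - target) * windowDays * 24 * 60;
}

console.log(errorBudgetMinutes(0.999));  // ≈ 43.2 minutes/month
console.log(errorBudgetMinutes(0.9999)); // ≈ 4.3 minutes/month
```

Each extra nine cuts the budget by 10x — which is why 99.99% is reserved for infrastructure where a few minutes of downtime is genuinely unacceptable.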
Q4: How do you handle high-cardinality data in observability systems?
A: High cardinality means many unique label combinations (user_id, trace_id, request_id). Prometheus struggles past ~10M active time series. Solutions: (1) Don't use high-cardinality labels in metrics — user_id should be a log/trace attribute, not a metric label. Metrics are for aggregates (by endpoint, status code, region). (2) Use exemplars — attach a sample trace_id to a metric series. When you see a p99 spike, the exemplar links to an actual trace. You get cardinality in traces, not metrics. (3) Logs for unique identifiers — structured logs with user_id, order_id, etc. Indexed for search but not aggregated as time series. (4) Column-oriented storage — systems like ClickHouse handle high cardinality better than Prometheus (inverted indices + column compression). (5) Pre-aggregation — in the collector, combine per-instance metrics into per-service aggregates before storage. Reduces cardinality by 10-100x.
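A minimal sketch of the pre-aggregation idea from (5), collapsing per-instance samples into per-service totals (names are illustrative; a real collector would do this per scrape interval):

```typescript
// Pre-aggregation: drop the instance label before storage,
// cutting cardinality by the number of instances per service.
interface Sample { service: string; instance: string; value: number }

function aggregateByService(samples: Sample[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const s of samples) {
    totals.set(s.service, (totals.get(s.service) ?? 0) + s.value);
  }
  return totals;
}
```

The trade-off: once aggregated, you can no longer ask per-instance questions of the stored metrics — keep instance-level detail in logs or traces if you need it.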
Q5: How do you correlate logs, metrics, and traces?
A: The key is a shared trace_id that flows through all signals. (1) Inject trace_id at the edge — API gateway generates or extracts a W3C traceparent header. Every downstream service propagates it. (2) Logs include trace_id — every structured log line has trace_id and span_id fields. In Grafana: click a trace → view all logs with that trace_id in Loki. (3) Metrics use exemplars — when recording a histogram observation, attach the trace_id as an exemplar. In Grafana: click a histogram bucket → see the exemplar trace_id → jump to Tempo trace. (4) Service graph — traces inherently contain the service call graph. Aggregate traces to build a live service topology showing request flow and error rates. (5) Unified timestamps — ensure all signals use the same time source (NTP-synced). Logs, traces, and metrics must align temporally for correlation to work. OpenTelemetry is the standard for generating all three signals from a single SDK with automatic correlation.
Real-World Problems & How to Solve Them
Problem 1: Prometheus memory explodes after a release
Symptom: Metrics backend OOMs, scrape latency increases, and dashboards time out.
Root cause: High-cardinality labels (user_id, request_id, trace_id) create massive unique time series. Metrics are aggregate signals, not per-entity records.
Fix — Enforce an allowlist of low-cardinality labels:
type Labels = Record<string, string>;
const ALLOWED_LABELS = new Set(["service", "route", "method", "status_code", "region"]);
function sanitizeMetricLabels(input: Labels): Labels {
const output: Labels = {};
for (const [key, value] of Object.entries(input)) {
if (!ALLOWED_LABELS.has(key)) continue;
output[key] = value;
}
return output;
}
function recordHttpRequest(durationMs: number, labels: Labels): void {
const safe = sanitizeMetricLabels(labels);
httpRequestDurationHistogram.observe(safe, durationMs / 1000);
}
Problem 2: Traces are fragmented across services
Symptom: A user request appears as multiple unrelated traces in your tracing UI.
Root cause: W3C Trace Context (traceparent) is not propagated over HTTP and async boundaries (queues/events).
Fix — Inject and extract context on every boundary:
import { context, propagation } from "@opentelemetry/api";
async function publishOrderEvent(topic: string, payload: unknown): Promise<void> {
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await messageBus.publish(topic, {
payload,
headers: carrier,
});
}
async function handleOrderEvent(message: { payload: unknown; headers: Record<string, string> }): Promise<void> {
const parent = propagation.extract(context.active(), message.headers);
await context.with(parent, async () => {
await orderConsumer.process(message.payload);
});
}
Problem 3: Logs are impossible to query during incidents
Symptom: Engineers grep plain text logs for hours and still can’t isolate failing requests.
Root cause: Unstructured log strings don’t preserve typed fields (trace_id, error_code, duration_ms) needed for indexing and correlation.
Fix — Emit JSON logs with required fields and context inheritance:
type LogLevel = "info" | "warn" | "error";
function log(level: LogLevel, message: string, fields: Record<string, unknown> = {}): void {
const entry = {
timestamp: new Date().toISOString(),
level,
service: "checkout-service",
trace_id: requestContext.get("traceId"),
span_id: requestContext.get("spanId"),
message,
...fields,
};
process.stdout.write(`${JSON.stringify(entry)}\n`);
}
log("error", "Payment declined", {
error_code: "PAYMENT_DECLINED",
order_id: "ord_123",
duration_ms: 843,
});
Problem 4: p99 latency is invisible until customers complain
Symptom: Average latency looks stable, but tail latency spikes hurt real user experience.
Root cause: Latency was instrumented as a gauge or average counter instead of a histogram, so percentile math is impossible.
Fix — Use histogram buckets designed for your SLO range:
import { Histogram } from "prom-client";
const httpRequestDurationSeconds = new Histogram({
name: "http_request_duration_seconds",
help: "HTTP request latency",
labelNames: ["service", "route", "method", "status_code"],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});
async function timedRequest(labels: Record<string, string>, fn: () => Promise<void>): Promise<void> {
const start = process.hrtime.bigint();
try {
await fn();
} finally {
const elapsedSeconds = Number(process.hrtime.bigint() - start) / 1e9;
httpRequestDurationSeconds.observe(labels, elapsedSeconds);
}
}
Problem 5: Alert fatigue from noisy threshold pages
Symptom: On-call gets paged repeatedly for transient spikes that self-heal.
Root cause: Static threshold alerts ignore error budget consumption rate and time window context.
Fix — Alert on multi-window burn rate, not raw error count:
type BurnRateWindow = { lookbackMinutes: number; threshold: number; severity: "page" | "ticket" };
const windows: BurnRateWindow[] = [
{ lookbackMinutes: 60, threshold: 14.4, severity: "page" },
{ lookbackMinutes: 360, threshold: 6, severity: "page" },
{ lookbackMinutes: 1440, threshold: 2, severity: "ticket" },
];
function evaluateBurnRate(currentBurnRate: number, lookbackMinutes: number): "page" | "ticket" | "none" {
const match = windows.find((w) => w.lookbackMinutes === lookbackMinutes);
if (!match) return "none";
if (currentBurnRate >= match.threshold) return match.severity;
return "none";
}
Problem 6: You can see a bad metric point but can’t inspect the request
Symptom: Dashboards show latency spikes, but engineers can’t jump to any representative trace.
Root cause: Histogram observations are recorded without exemplars, so metrics and traces are disconnected at query time.
Fix — Attach trace IDs as exemplars for sampled requests:
import { trace } from "@opentelemetry/api";
import { Histogram } from "prom-client";
// prom-client requires enableExemplars: true at histogram creation time
const httpRequestDurationHistogram = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency",
  labelNames: ["service", "route"],
  enableExemplars: true,
});
function recordLatencyWithExemplar(latencySeconds: number, labels: Record<string, string>): void {
  const traceId = trace.getActiveSpan()?.spanContext().traceId;
  // Exemplar-enabled histograms take a single object argument in prom-client
  httpRequestDurationHistogram.observe({
    labels,
    value: latencySeconds,
    exemplarLabels: traceId ? { trace_id: traceId } : {},
  });
}
Key Takeaways
- Structured logging (JSON) enables querying and aggregation — unstructured string logs are useless at scale; every log entry needs trace_id, service name, and typed fields
- Distributed tracing follows a single request across all services — each service creates a span with parent-child relationships forming a waterfall view
- W3C Trace Context (traceparent header) is the standard for propagating trace IDs across service boundaries — deterministic sampling ensures all services agree
- Metrics are counters, gauges, and histograms — counters for totals, gauges for current values, histograms for distributions (p50/p95/p99 latency)
- SLOs define reliability targets quantitatively — 99.9% availability = 43 minutes/month error budget; burn rate alerts are better than threshold alerts
- Multi-window burn rate alerting catches both fast and slow degradation — 14.4x burn over 1 hour pages immediately, 1x over 3 days creates a ticket
- Correlation via trace_id is the glue — the same trace_id appears in logs, traces, and metric exemplars, enabling drill-down from any signal
- Sample traces, not metrics — sampling at 1-10% reduces trace storage and overhead while maintaining statistical significance for debugging
- High-cardinality labels (user_id, request_id) belong in logs and traces, not metrics — metrics should use low-cardinality labels (method, status, service)
- OpenTelemetry is the convergence point — a single SDK for logs, metrics, and traces with automatic correlation, replacing vendor-specific agents