Backend Observability: Distributed Tracing, Metrics Pipelines, Structured Logging & SLO Monitoring
Why Observability Matters
Monitoring tells you what is broken. Observability tells you why. In a monolith, you grep a log file. In a distributed system with 50 microservices, a single user request touches 12 services, 4 databases, 3 caches, and 2 message queues. Without observability, debugging is guessing.
Request Flow in Microservices:
User → API Gateway → Auth Service → User Service → Postgres
→ Order Service → Redis Cache
→ Payment Service → Stripe API
→ Inventory Service → DynamoDB
→ Notification Service → SQS → Email
Where is the 2-second latency?
Without tracing: ¯\_(ツ)_/¯ check every service by hand
With tracing: Payment Service → Stripe API (1.8s timeout)
The Three Pillars + Events
┌─────────────────────────────────────────────────────────────────┐
│ Observability Pillars │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │ Events │ │
│ │ │ │ │ │ │ │ │ │
│ │ What │ │ How much │ │ Where │ │ What │ │
│ │ happened │ │ / how fast │ │ / how long │ │ changed │ │
│ │ │ │ │ │ │ │ │ │
│ │ JSON lines │ │ Counters │ │ Spans │ │ Deploys │ │
│ │ Error stacks│ │ Gauges │ │ Parent/ │ │ Configs │ │
│ │ Audit trail │ │ Histograms │ │ child │ │ Alerts │ │
│ │ │ │ Summaries │ │ Waterfall │ │ Rollbacks│ │
│ │ │ │ │ │ │ │ │ │
│ │ Volume: │ │ Volume: │ │ Volume: │ │ Volume: │ │
│ │ HIGH │ │ LOW │ │ MEDIUM │ │ LOW │ │
│ │ Cost: $$$ │ │ Cost: $ │ │ Cost: $$ │ │ Cost: $ │ │
│ └─────┬───────┘ └──────┬─────┘ └──────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────┬────────┴───────┬───────┴──────────────┘ │
│ │ │ │
│ ┌──────▼────────┐ ┌───▼──────────────┐ │
│ │ Correlation │ │ Unified Query │ │
│ │ (Trace ID) │ │ (Grafana, etc.) │ │
│ └───────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Structured Logging
Unstructured logs are strings you grep. Structured logs are JSON you query. The difference matters at scale — 1TB/day of logs needs indexing, filtering, and aggregation, not grep.
Unstructured (bad):
[2024-03-15 10:23:45] ERROR - Failed to process order #12345 for user john@example.com
Structured (good):
{
"timestamp": "2024-03-15T10:23:45.123Z",
"level": "error",
"service": "order-service",
"trace_id": "abc123def456",
"span_id": "789ghi",
"message": "Failed to process order",
"order_id": "12345",
"user_id": "usr_001",
"error_code": "PAYMENT_DECLINED",
"duration_ms": 234,
"retry_count": 2
}
Why structured?
1. Query: SELECT * FROM logs WHERE error_code = 'PAYMENT_DECLINED' AND duration_ms > 200
2. Aggregate: COUNT(*) GROUP BY error_code, service
3. Correlate: JOIN with traces ON trace_id
4. Alert: WHEN count(error_code='PAYMENT_DECLINED') > 100/min
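None of these four operations work on free-text strings. As a toy illustration (field names borrowed from the structured example entry above; the data is made up), the same filter and aggregation expressed over parsed log objects:

```typescript
// Hypothetical in-process version of the query and aggregation above.
interface ParsedLog {
  service: string;
  error_code?: string;
  duration_ms?: number;
}

const logs: ParsedLog[] = [
  { service: 'order-service', error_code: 'PAYMENT_DECLINED', duration_ms: 234 },
  { service: 'order-service', error_code: 'PAYMENT_DECLINED', duration_ms: 150 },
  { service: 'auth-service', duration_ms: 12 },
];

// Query: declined payments slower than 200ms
const slowDeclines = logs.filter(
  (l) => l.error_code === 'PAYMENT_DECLINED' && (l.duration_ms ?? 0) > 200,
);

// Aggregate: COUNT(*) GROUP BY error_code
const byErrorCode = new Map<string, number>();
for (const l of logs) {
  if (l.error_code) {
    byErrorCode.set(l.error_code, (byErrorCode.get(l.error_code) ?? 0) + 1);
  }
}
```

In production this filtering happens in the log backend (Loki, Elasticsearch, ClickHouse), not in application code — but the query shape is the same.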
// --- Structured Logger ---
enum LogLevel {
DEBUG = 0,
INFO = 1,
WARN = 2,
ERROR = 3,
FATAL = 4,
}
interface LogEntry {
timestamp: string;
level: string;
service: string;
message: string;
traceId?: string;
spanId?: string;
[key: string]: any;
}
interface LogTransport {
write(entry: LogEntry): void;
}
class StructuredLogger {
private baseFields: Record<string, any> = {};
private transports: LogTransport[] = [];
private minLevel: LogLevel = LogLevel.INFO;
private samplingRate = 1.0; // 1.0 = log everything
constructor(
private serviceName: string,
options?: {
level?: LogLevel;
transports?: LogTransport[];
samplingRate?: number;
},
) {
if (options?.level !== undefined) this.minLevel = options.level;
if (options?.transports) this.transports = options.transports;
if (options?.samplingRate !== undefined) this.samplingRate = options.samplingRate;
}
// Create a child logger with additional base fields
child(fields: Record<string, any>): StructuredLogger {
const child = new StructuredLogger(this.serviceName, {
level: this.minLevel,
transports: this.transports,
samplingRate: this.samplingRate,
});
child.baseFields = { ...this.baseFields, ...fields };
return child;
}
debug(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.DEBUG, message, fields);
}
info(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.INFO, message, fields);
}
warn(message: string, fields?: Record<string, any>): void {
this.log(LogLevel.WARN, message, fields);
}
error(message: string, error?: Error, fields?: Record<string, any>): void {
this.log(LogLevel.ERROR, message, {
...fields,
error_name: error?.name,
error_message: error?.message,
stack_trace: error?.stack,
});
}
fatal(message: string, error?: Error, fields?: Record<string, any>): void {
this.log(LogLevel.FATAL, message, {
...fields,
error_name: error?.name,
error_message: error?.message,
stack_trace: error?.stack,
});
}
private log(level: LogLevel, message: string, fields?: Record<string, any>): void {
if (level < this.minLevel) return;
// Sampling for high-volume debug/info logs
if (level <= LogLevel.INFO && Math.random() > this.samplingRate) return;
const entry: LogEntry = {
timestamp: new Date().toISOString(),
level: LogLevel[level].toLowerCase(),
service: this.serviceName,
message,
...this.baseFields,
...fields,
};
// Remove undefined values
for (const key of Object.keys(entry)) {
if (entry[key] === undefined) delete entry[key];
}
for (const transport of this.transports) {
transport.write(entry);
}
}
addTransport(transport: LogTransport): void {
this.transports.push(transport);
}
}
// --- Log Transports ---
class ConsoleTransport implements LogTransport {
write(entry: LogEntry): void {
const output = JSON.stringify(entry);
switch (entry.level) {
case 'error':
case 'fatal':
console.error(output);
break;
case 'warn':
console.warn(output);
break;
default:
console.log(output);
}
}
}
// Buffered transport: batch writes to reduce I/O
class BufferedTransport implements LogTransport {
private buffer: LogEntry[] = [];
private flushTimer?: ReturnType<typeof setInterval>;
constructor(
private downstream: LogTransport,
private bufferSize: number = 100,
private flushInterval: number = 5000,
) {
this.flushTimer = setInterval(() => this.flush(), flushInterval);
}
write(entry: LogEntry): void {
this.buffer.push(entry);
if (this.buffer.length >= this.bufferSize) {
this.flush();
}
}
flush(): void {
if (this.buffer.length === 0) return;
const batch = this.buffer.splice(0);
for (const entry of batch) {
this.downstream.write(entry);
}
}
destroy(): void {
if (this.flushTimer) clearInterval(this.flushTimer);
this.flush();
}
}
// Async transport with backpressure
class AsyncTransport implements LogTransport {
private queue: LogEntry[] = [];
private processing = false;
private maxQueueSize: number;
constructor(
private downstream: {
writeBatch(entries: LogEntry[]): Promise<void>;
},
maxQueueSize = 10000,
) {
this.maxQueueSize = maxQueueSize;
}
write(entry: LogEntry): void {
if (this.queue.length >= this.maxQueueSize) {
// Backpressure: drop oldest entries
this.queue.shift();
}
this.queue.push(entry);
if (!this.processing) {
this.processQueue();
}
}
private async processQueue(): Promise<void> {
this.processing = true;
while (this.queue.length > 0) {
const batch = this.queue.splice(0, 100);
try {
await this.downstream.writeBatch(batch);
} catch (error) {
// Failed to send — re-queue at front
this.queue.unshift(...batch);
// Backoff
await new Promise((r) => setTimeout(r, 1000));
}
}
this.processing = false;
}
}
// --- Log Context Propagation ---
// NOTE: a simplified synchronous sketch. This Map-based version loses context
// across async boundaries; production Node.js code should use AsyncLocalStorage
// (node:async_hooks) instead.
class LogContext {
private static storage = new Map<string, Record<string, any>>();
private static currentContextId = '';
static run<T>(fields: Record<string, any>, fn: () => T): T {
const contextId = Math.random().toString(36).slice(2);
const parentFields = this.storage.get(this.currentContextId) ?? {};
this.storage.set(contextId, { ...parentFields, ...fields });
const prevContextId = this.currentContextId;
this.currentContextId = contextId;
try {
return fn();
} finally {
this.currentContextId = prevContextId;
this.storage.delete(contextId);
}
}
static getFields(): Record<string, any> {
return this.storage.get(this.currentContextId) ?? {};
}
}
// Usage:
// LogContext.run({ traceId: 'abc123', userId: 'usr_001' }, () => {
// logger.info('Processing request', LogContext.getFields());
// // All logs in this scope automatically include traceId and userId
// });
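The Map-based LogContext above only holds up for synchronous code. In Node.js, the same API is typically built on AsyncLocalStorage, which carries the store across awaits and callbacks (a sketch, assuming Node 16+):

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Context survives awaits and callbacks, unlike a synchronous Map.
const logContext = new AsyncLocalStorage<Record<string, any>>();

function runWithContext<T>(fields: Record<string, any>, fn: () => T): T {
  // Merge the parent store so nested scopes inherit fields
  const parent = logContext.getStore() ?? {};
  return logContext.run({ ...parent, ...fields }, fn);
}

function contextFields(): Record<string, any> {
  return logContext.getStore() ?? {};
}

// Usage: every log call in the scope sees traceId, even after an await.
// runWithContext({ traceId: 'abc123' }, async () => {
//   await processOrder(); // hypothetical async work
//   logger.info('done', contextFields()); // includes traceId
// });
```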
Distributed Tracing
A trace follows a single request across all services it touches. Each service creates a span — a timed operation with context.
Trace: trace_id = abc123
│
├── Span A: API Gateway (0ms - 250ms)
│ ├── http.method: POST
│ ├── http.url: /api/orders
│ ├── http.status_code: 200
│ │
│ ├── Span B: Auth Service (5ms - 25ms)
│ │ ├── auth.method: JWT
│ │ └── auth.user_id: usr_001
│ │
│ ├── Span C: Order Service (30ms - 240ms)
│ │ ├── db.system: postgresql
│ │ │
│ │ ├── Span D: DB Query (35ms - 45ms)
│ │ │ ├── db.statement: INSERT INTO orders...
│ │ │ └── db.rows_affected: 1
│ │ │
│ │ ├── Span E: Payment Service (50ms - 200ms)
│ │ │ ├── payment.provider: stripe
│ │ │ ├── payment.amount: 99.99
│ │ │ │
│ │ │ └── Span F: Stripe API (55ms - 195ms) ← BOTTLENECK
│ │ │ ├── http.url: https://api.stripe.com/charges
│ │ │ └── http.status_code: 200
│ │ │
│ │ └── Span G: Notification (205ms - 235ms)
│ │ ├── notification.type: email
│ │ └── notification.queued: true
│ │
│ └── total: 250ms (Payment=150ms is 60% of total)
Waterfall View:
0ms 50ms 100ms 150ms 200ms 250ms
│ │ │ │ │ │
├─ A: Gateway ───────────────────────────────────────────────────┤
│ ├─ B: Auth ──┤ │
│ ├─ C: Order ──────────────────────────────────────────────┤ │
│ │ ├─ D: DB ─┤ │ │
│ │ ├─ E: Payment ─────────────────────────────────┤ │ │
│ │ │ └─ F: Stripe ──────────────────────────────┤ │ │
│ │ └─ G: Notify ────┤ │ │
Tracing Implementation (OpenTelemetry-compatible)
// --- Trace & Span Types ---
interface TraceContext {
  traceId: string; // 128-bit ID (32 hex chars)
  spanId: string; // 64-bit ID (16 hex chars)
traceFlags: number; // 1 byte (sampled = 0x01)
traceState?: string; // Vendor-specific key-value pairs
}
interface SpanData {
traceId: string;
spanId: string;
parentSpanId?: string;
name: string;
kind: SpanKind;
startTime: number; // Unix timestamp (microseconds)
endTime?: number;
status: SpanStatus;
attributes: Record<string, string | number | boolean>;
events: SpanEvent[];
links: SpanLink[];
}
enum SpanKind {
INTERNAL = 0, // Default — internal operation
SERVER = 1, // Handles incoming request
CLIENT = 2, // Makes outgoing request
PRODUCER = 3, // Creates a message (async)
CONSUMER = 4, // Processes a message (async)
}
interface SpanStatus {
code: StatusCode;
message?: string;
}
enum StatusCode {
UNSET = 0,
OK = 1,
ERROR = 2,
}
interface SpanEvent {
name: string;
timestamp: number;
attributes?: Record<string, string | number | boolean>;
}
interface SpanLink {
traceId: string;
spanId: string;
attributes?: Record<string, string | number | boolean>;
}
// --- Span Implementation ---
class Span {
private data: SpanData;
private ended = false;
constructor(
name: string,
traceId: string,
parentSpanId?: string,
kind: SpanKind = SpanKind.INTERNAL,
) {
this.data = {
traceId,
spanId: this.generateId(8),
parentSpanId,
name,
kind,
      startTime: (performance.timeOrigin + performance.now()) * 1000, // Unix epoch, microseconds
status: { code: StatusCode.UNSET },
attributes: {},
events: [],
links: [],
};
}
setAttribute(key: string, value: string | number | boolean): this {
this.data.attributes[key] = value;
return this;
}
setAttributes(attrs: Record<string, string | number | boolean>): this {
Object.assign(this.data.attributes, attrs);
return this;
}
addEvent(name: string, attributes?: Record<string, string | number | boolean>): this {
this.data.events.push({
name,
      timestamp: (performance.timeOrigin + performance.now()) * 1000,
attributes,
});
return this;
}
setStatus(code: StatusCode, message?: string): this {
this.data.status = { code, message };
return this;
}
recordException(error: Error): this {
this.addEvent('exception', {
'exception.type': error.name,
'exception.message': error.message,
'exception.stacktrace': error.stack ?? '',
});
this.setStatus(StatusCode.ERROR, error.message);
return this;
}
end(): void {
if (this.ended) return;
this.ended = true;
    this.data.endTime = (performance.timeOrigin + performance.now()) * 1000;
}
get spanId(): string {
return this.data.spanId;
}
get traceId(): string {
return this.data.traceId;
}
get spanData(): SpanData {
return { ...this.data };
}
get duration(): number {
if (!this.data.endTime) return 0;
return (this.data.endTime - this.data.startTime) / 1000; // milliseconds
}
private generateId(bytes: number): string {
const array = new Uint8Array(bytes);
crypto.getRandomValues(array);
return Array.from(array, (b) => b.toString(16).padStart(2, '0')).join('');
}
}
// --- Tracer ---
interface SpanExporter {
export(spans: SpanData[]): Promise<void>;
}
interface Sampler {
shouldSample(traceId: string, name: string): boolean;
}
class Tracer {
private activeSpans = new Map<string, Span>();
private completedSpans: SpanData[] = [];
private currentSpan: Span | null = null;
private exporters: SpanExporter[] = [];
private sampler: Sampler;
private batchSize: number;
private flushInterval: number;
private flushTimer?: ReturnType<typeof setInterval>;
constructor(
private serviceName: string,
options?: {
sampler?: Sampler;
exporters?: SpanExporter[];
batchSize?: number;
flushInterval?: number;
},
) {
this.sampler = options?.sampler ?? new AlwaysSampler();
this.exporters = options?.exporters ?? [];
this.batchSize = options?.batchSize ?? 100;
this.flushInterval = options?.flushInterval ?? 5000;
this.flushTimer = setInterval(() => this.flush(), this.flushInterval);
}
// Start a new root span (new trace)
startSpan(name: string, kind?: SpanKind): Span {
const traceId = this.generateTraceId();
if (!this.sampler.shouldSample(traceId, name)) {
// Return a no-op span that doesn't export
return new Span(name, traceId, undefined, kind);
}
const span = new Span(name, traceId, undefined, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
// Start a child span under the current active span
startChildSpan(name: string, kind?: SpanKind): Span {
const parent = this.currentSpan;
const traceId = parent?.traceId ?? this.generateTraceId();
const span = new Span(name, traceId, parent?.spanId, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
// Start a span from a propagated context (incoming request)
startSpanFromContext(name: string, context: TraceContext, kind?: SpanKind): Span {
const span = new Span(name, context.traceId, context.spanId, kind);
span.setAttribute('service.name', this.serviceName);
this.activeSpans.set(span.spanId, span);
this.currentSpan = span;
return span;
}
  // End a span and queue for export
  endSpan(span: Span): void {
    span.end();
    // Only spans the sampler accepted were registered in activeSpans;
    // unsampled spans from startSpan() are silently dropped here.
    const wasTracked = this.activeSpans.delete(span.spanId);
    if (wasTracked) this.completedSpans.push(span.spanData);
// Restore parent as current
if (span.spanData.parentSpanId) {
this.currentSpan = this.activeSpans.get(span.spanData.parentSpanId) ?? null;
} else {
this.currentSpan = null;
}
if (this.completedSpans.length >= this.batchSize) {
this.flush();
}
}
// Instrument an async function with a span
async trace<T>(
name: string,
fn: (span: Span) => Promise<T>,
kind?: SpanKind,
): Promise<T> {
const span = this.currentSpan
? this.startChildSpan(name, kind)
: this.startSpan(name, kind);
try {
const result = await fn(span);
span.setStatus(StatusCode.OK);
return result;
} catch (error) {
span.recordException(error as Error);
throw error;
} finally {
this.endSpan(span);
}
}
private async flush(): Promise<void> {
if (this.completedSpans.length === 0) return;
const batch = this.completedSpans.splice(0);
for (const exporter of this.exporters) {
try {
await exporter.export(batch);
} catch (error) {
console.error(`Failed to export spans: ${error}`);
}
}
}
private generateTraceId(): string {
const array = new Uint8Array(16);
crypto.getRandomValues(array);
return Array.from(array, (b) => b.toString(16).padStart(2, '0')).join('');
}
destroy(): void {
if (this.flushTimer) clearInterval(this.flushTimer);
this.flush();
}
}
// --- Samplers ---
class AlwaysSampler implements Sampler {
shouldSample(): boolean {
return true;
}
}
class RateSampler implements Sampler {
constructor(private rate: number) {} // 0.0 to 1.0
shouldSample(): boolean {
return Math.random() < this.rate;
}
}
// Sample based on trace ID — deterministic across services
// Same trace ID always gets the same sampling decision
class TraceIdRatioSampler implements Sampler {
private threshold: number;
constructor(ratio: number) {
this.threshold = Math.floor(ratio * 0xFFFFFFFF);
}
shouldSample(traceId: string): boolean {
// Hash the last 8 chars of trace ID to get a deterministic number
const hash = parseInt(traceId.slice(-8), 16);
return hash < this.threshold;
}
}
// Head-based + tail-based hybrid
class AdaptiveSampler implements Sampler {
private errorTraces = new Set<string>();
private baseSampler: Sampler;
constructor(baseRate: number) {
this.baseSampler = new TraceIdRatioSampler(baseRate);
}
shouldSample(traceId: string, name: string): boolean {
// Always sample traces that had errors
if (this.errorTraces.has(traceId)) return true;
    // Always sample business-critical operations
    if (name.includes('payment') || name.includes('checkout')) return true;
return this.baseSampler.shouldSample(traceId, name);
}
markError(traceId: string): void {
this.errorTraces.add(traceId);
// Clean up after 5 minutes
setTimeout(() => this.errorTraces.delete(traceId), 300000);
}
}
Context Propagation (W3C Trace Context)
// --- W3C Trace Context Propagation ---
// Header: traceparent: 00-{traceId}-{spanId}-{flags}
// Header: tracestate: vendor1=value1,vendor2=value2
class W3CTraceContextPropagator {
static readonly TRACEPARENT = 'traceparent';
static readonly TRACESTATE = 'tracestate';
// Inject trace context into outgoing request headers
static inject(span: Span): Record<string, string> {
const headers: Record<string, string> = {};
// Format: {version}-{trace-id}-{span-id}-{trace-flags}
headers[this.TRACEPARENT] =
`00-${span.traceId}-${span.spanId}-01`;
return headers;
}
// Extract trace context from incoming request headers
static extract(headers: Record<string, string>): TraceContext | null {
const traceparent = headers[this.TRACEPARENT];
if (!traceparent) return null;
// Parse: 00-{traceId(32hex)}-{spanId(16hex)}-{flags(2hex)}
const parts = traceparent.split('-');
if (parts.length !== 4) return null;
const [version, traceId, spanId, flags] = parts;
if (version !== '00') return null;
if (traceId.length !== 32) return null;
if (spanId.length !== 16) return null;
return {
traceId,
spanId,
traceFlags: parseInt(flags, 16),
traceState: headers[this.TRACESTATE],
};
}
}
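Parsing by hand shows how little state actually travels between services. The sample value below is the example traceparent from the W3C Trace Context specification:

```typescript
// traceparent format: {version}-{trace-id}-{span-id}-{trace-flags}
const header = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const [version, traceId, spanId, flags] = header.split('-');

// version '00', 32-hex trace ID, 16-hex span ID
const sampled = (parseInt(flags, 16) & 0x01) === 1; // bit 0 = sampled flag
```

Four dash-separated fields are all a downstream service needs to attach its spans to the same trace.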
// --- HTTP Middleware for Tracing ---
class TracingMiddleware {
constructor(private tracer: Tracer) {}
// Server-side: extract context from incoming request, create span
handleIncoming(
method: string,
url: string,
headers: Record<string, string>,
): Span {
const parentContext = W3CTraceContextPropagator.extract(headers);
let span: Span;
if (parentContext) {
span = this.tracer.startSpanFromContext(
`${method} ${url}`,
parentContext,
SpanKind.SERVER,
);
} else {
span = this.tracer.startSpan(`${method} ${url}`, SpanKind.SERVER);
}
span.setAttributes({
'http.method': method,
'http.url': url,
'http.flavor': '1.1',
});
return span;
}
// Client-side: inject context into outgoing request
handleOutgoing(
span: Span,
method: string,
url: string,
): Record<string, string> {
const childSpan = new Span(
`${method} ${url}`,
span.traceId,
span.spanId,
SpanKind.CLIENT,
);
childSpan.setAttributes({
'http.method': method,
'http.url': url,
});
return W3CTraceContextPropagator.inject(childSpan);
}
}
Metrics Pipeline
Metrics are numeric measurements collected over time. They're cheap to store, fast to query, and the foundation of dashboards and alerts.
Metric Types:
1. Counter (monotonically increasing)
http_requests_total{method="GET", status="200"} = 15234
Only goes up. Rate = requests/second.
2. Gauge (current value, goes up or down)
db_connections_active{pool="primary"} = 42
Current state. Used for queue depth, memory usage.
3. Histogram (distribution of values)
http_request_duration_seconds_bucket{le="0.1"} = 8000
http_request_duration_seconds_bucket{le="0.5"} = 9500
http_request_duration_seconds_bucket{le="1.0"} = 9900
http_request_duration_seconds_bucket{le="+Inf"} = 10000
Bucket counts. Compute percentiles: p50, p95, p99.
4. Summary (pre-computed percentiles)
http_request_duration_seconds{quantile="0.99"} = 0.85
Computed client-side. Not aggregatable across instances.
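Histogram buckets are cumulative — each bucket counts observations at or below its boundary — which is what makes server-side percentile estimation possible. A sketch using the example bucket counts listed above:

```typescript
// Percentile estimation from the cumulative bucket counts shown above.
const boundaries = [0.1, 0.5, 1.0]; // le="0.1", le="0.5", le="1.0"
const cumulative = [8000, 9500, 9900]; // observations at or below each boundary
const total = 10000; // le="+Inf"

function estimatePercentile(p: number): number {
  const target = Math.ceil(p * total);
  for (let i = 0; i < boundaries.length; i++) {
    if (cumulative[i] >= target) {
      // Linear interpolation within the bucket that contains the target rank
      const prevCum = i > 0 ? cumulative[i - 1] : 0;
      const prevBound = i > 0 ? boundaries[i - 1] : 0;
      const inBucket = cumulative[i] - prevCum;
      return prevBound + ((boundaries[i] - prevBound) * (target - prevCum)) / inBucket;
    }
  }
  return boundaries[boundaries.length - 1];
}
```

With these numbers, p50 lands inside the first bucket (interpolated to 0.0625s) and p95 lands exactly at the 0.5s boundary — an estimate, not an exact value, which is the trade-off histograms make for being aggregatable.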
Pipeline:
┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ App │──→ │ Collector │──→ │ TSDB │──→ │ Query / │
│ (client) │ │ (OTel/ │ │ (Prom/ │ │ Dashboard│
│ │ │ StatsD) │ │ Mimir) │ │ (Grafana)│
│ counter │ │ │ │ │ │ │
│ gauge │ │ aggregate │ │ store │ │ PromQL │
│ histogram│ │ batch │ │ compress │ │ alert │
│ │ │ export │ │ downsamp │ │ visualize│
└──────────┘ └───────────┘ └──────────┘ └──────────┘
// --- Metrics Collection Library ---
type Labels = Record<string, string>;
// Counter: monotonically increasing value
class Counter {
private values = new Map<string, number>();
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
) {}
inc(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) + value);
}
get(labels: Labels = {}): number {
return this.values.get(this.labelsToKey(labels)) ?? 0;
}
// Prometheus exposition format
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
for (const [key, value] of this.values) {
const labelStr = key ? `{${key}}` : '';
lines.push(`${this.name}${labelStr} ${value}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// Gauge: value that goes up and down
class Gauge {
private values = new Map<string, number>();
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
) {}
set(labels: Labels, value: number): void {
this.values.set(this.labelsToKey(labels), value);
}
inc(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) + value);
}
dec(labels: Labels = {}, value = 1): void {
const key = this.labelsToKey(labels);
this.values.set(key, (this.values.get(key) ?? 0) - value);
}
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} gauge`];
for (const [key, value] of this.values) {
const labelStr = key ? `{${key}}` : '';
lines.push(`${this.name}${labelStr} ${value}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// Histogram: distribution of values in configurable buckets
class Histogram {
private buckets: Map<string, number[]>; // label key → bucket counts
private sums = new Map<string, number>();
private counts = new Map<string, number>();
private boundaries: number[];
constructor(
public readonly name: string,
public readonly help: string,
private labelNames: string[] = [],
boundaries: number[] = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
) {
this.boundaries = boundaries.sort((a, b) => a - b);
this.buckets = new Map();
}
observe(labels: Labels, value: number): void {
const key = this.labelsToKey(labels);
// Initialize buckets if needed
if (!this.buckets.has(key)) {
this.buckets.set(key, new Array(this.boundaries.length + 1).fill(0));
this.sums.set(key, 0);
this.counts.set(key, 0);
}
const bucketCounts = this.buckets.get(key)!;
// Increment all buckets where value <= boundary
for (let i = 0; i < this.boundaries.length; i++) {
if (value <= this.boundaries[i]) {
bucketCounts[i]++;
}
}
bucketCounts[this.boundaries.length]++; // +Inf bucket always increments
this.sums.set(key, this.sums.get(key)! + value);
this.counts.set(key, this.counts.get(key)! + 1);
}
// Timer helper
startTimer(labels: Labels): () => void {
const start = performance.now();
return () => {
const duration = (performance.now() - start) / 1000; // seconds
this.observe(labels, duration);
};
}
// Estimate percentile from histogram buckets
percentile(labels: Labels, p: number): number {
const key = this.labelsToKey(labels);
const bucketCounts = this.buckets.get(key);
const total = this.counts.get(key);
if (!bucketCounts || !total || total === 0) return 0;
const target = Math.ceil(p * total);
let cumulative = 0;
for (let i = 0; i < this.boundaries.length; i++) {
cumulative = bucketCounts[i];
if (cumulative >= target) {
// Linear interpolation within this bucket
const prevCumulative = i > 0 ? bucketCounts[i - 1] : 0;
const prevBoundary = i > 0 ? this.boundaries[i - 1] : 0;
const bucketWidth = this.boundaries[i] - prevBoundary;
const bucketCount = cumulative - prevCumulative;
if (bucketCount === 0) return this.boundaries[i];
const ratio = (target - prevCumulative) / bucketCount;
return prevBoundary + bucketWidth * ratio;
}
}
return this.boundaries[this.boundaries.length - 1];
}
toPrometheus(): string {
const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} histogram`];
for (const [key, bucketCounts] of this.buckets) {
const labelStr = key ? `${key},` : '';
for (let i = 0; i < this.boundaries.length; i++) {
lines.push(`${this.name}_bucket{${labelStr}le="${this.boundaries[i]}"} ${bucketCounts[i]}`);
}
lines.push(`${this.name}_bucket{${labelStr}le="+Inf"} ${bucketCounts[this.boundaries.length]}`);
      const plainLabels = key ? `{${key}}` : '';
      lines.push(`${this.name}_sum${plainLabels} ${this.sums.get(key)}`);
      lines.push(`${this.name}_count${plainLabels} ${this.counts.get(key)}`);
}
return lines.join('\n');
}
private labelsToKey(labels: Labels): string {
return this.labelNames
.map((name) => `${name}="${labels[name] ?? ''}"`)
.join(',');
}
}
// --- Metrics Registry ---
class MetricsRegistry {
private metrics = new Map<string, Counter | Gauge | Histogram>();
createCounter(name: string, help: string, labelNames?: string[]): Counter {
const counter = new Counter(name, help, labelNames);
this.metrics.set(name, counter);
return counter;
}
createGauge(name: string, help: string, labelNames?: string[]): Gauge {
const gauge = new Gauge(name, help, labelNames);
this.metrics.set(name, gauge);
return gauge;
}
createHistogram(
name: string,
help: string,
labelNames?: string[],
boundaries?: number[],
): Histogram {
const histogram = new Histogram(name, help, labelNames, boundaries);
this.metrics.set(name, histogram);
return histogram;
}
// Expose /metrics endpoint (Prometheus format)
toPrometheus(): string {
const sections: string[] = [];
for (const metric of this.metrics.values()) {
sections.push(metric.toPrometheus());
}
return sections.join('\n\n') + '\n';
}
}
// --- Standard Backend Metrics ---
function createStandardMetrics(registry: MetricsRegistry) {
return {
// HTTP metrics
httpRequestsTotal: registry.createCounter(
'http_requests_total',
'Total HTTP requests',
['method', 'path', 'status'],
),
httpRequestDuration: registry.createHistogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'path'],
[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
),
httpRequestSize: registry.createHistogram(
'http_request_size_bytes',
'HTTP request body size in bytes',
['method', 'path'],
),
httpResponseSize: registry.createHistogram(
'http_response_size_bytes',
'HTTP response body size in bytes',
['method', 'path'],
),
// Database metrics
dbQueryDuration: registry.createHistogram(
'db_query_duration_seconds',
'Database query duration',
['operation', 'table'],
[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
),
dbConnectionsActive: registry.createGauge(
'db_connections_active',
'Active database connections',
['pool'],
),
dbConnectionsIdle: registry.createGauge(
'db_connections_idle',
'Idle database connections',
['pool'],
),
dbQueryErrors: registry.createCounter(
'db_query_errors_total',
'Database query errors',
['operation', 'error_type'],
),
// Cache metrics
cacheHits: registry.createCounter(
'cache_hits_total',
'Cache hit count',
['cache', 'operation'],
),
cacheMisses: registry.createCounter(
'cache_misses_total',
'Cache miss count',
['cache', 'operation'],
),
cacheLatency: registry.createHistogram(
'cache_operation_duration_seconds',
'Cache operation duration',
['cache', 'operation'],
[0.0001, 0.0005, 0.001, 0.005, 0.01],
),
// Queue metrics
queueDepth: registry.createGauge(
'queue_depth',
'Current queue depth',
['queue'],
),
queueProcessed: registry.createCounter(
'queue_messages_processed_total',
'Messages processed from queue',
['queue', 'status'],
),
queueProcessingDuration: registry.createHistogram(
'queue_processing_duration_seconds',
'Queue message processing duration',
['queue'],
),
// Runtime metrics
processMemoryBytes: registry.createGauge(
'process_memory_bytes',
'Process memory usage',
['type'],
),
processUptime: registry.createGauge(
'process_uptime_seconds',
'Process uptime in seconds',
),
eventLoopLag: registry.createHistogram(
'nodejs_event_loop_lag_seconds',
'Node.js event loop lag',
[],
[0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
),
};
}
SLO Monitoring & Error Budgets
SLI → SLO → Error Budget → Alerts
SLI (Service Level Indicator):
The measurement. Example: "99.2% of requests completed in < 500ms"
SLO (Service Level Objective):
The target. Example: "99.9% of requests must complete in < 500ms"
Error Budget:
The slack. 99.9% SLO = 0.1% error budget = 43.2 minutes/month
If you've used 30 minutes this month → 13 minutes remain
If budget exhausted → freeze deployments, focus on reliability
┌─────────────────────────────────────────────────────────────┐
│ SLO: 99.9% availability (43.2 min downtime/month budget) │
│ │
│ Day 1-10: ████████████░░░░░░░░░░░░░░░░░░ 85% budget left │
│ Day 10-20: ████████████████████░░░░░░░░░░ 50% budget left │
│ Day 20-25: █████████████████████████████░ 10% budget left │
│ Day 25-30: ██████████████████████████████ BUDGET EXHAUSTED │
│ ↑ │
│ Freeze deploys │
│ Page on-call │
│ │
│ Multi-window alerts: │
│ ┌────────────────┬──────────┬──────────┐ │
│ │ Window │ Burn Rate│ Action │ │
│ ├────────────────┼──────────┼──────────┤ │
│ │ 1 hour │ > 14.4x │ Page NOW │ │
│ │ 6 hours │ > 6x │ Page │ │
│ │ 3 days │ > 1x │ Ticket │ │
│ └────────────────┴──────────┴──────────┘ │
└─────────────────────────────────────────────────────────────┘
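The burn-rate thresholds in the table follow from simple arithmetic: 14.4x is the rate that spends 2% of a 30-day budget in a single hour, exhausting the whole budget in roughly two days. A sketch of the calculation:

```typescript
// Error-budget arithmetic for a 99.9% SLO over a 30-day window.
const target = 0.999;
const windowMinutes = 30 * 24 * 60; // 43200
const budgetMinutes = (1 - target) * windowMinutes; // 43.2 min per 30 days

// Burn rate = actual budget consumption / sustainable consumption.
// At 1.0 the budget lasts exactly 30 days; at 14.4x it lasts ~2.08 days.
function burnRate(badMinutes: number, windowHours: number): number {
  const sustainablePerHour = budgetMinutes / (30 * 24); // 0.06 min/hour
  return badMinutes / windowHours / sustainablePerHour;
}

// 0.864 bad minutes in 1 hour = 2% of the monthly budget = 14.4x burn.
// A full outage (60 bad minutes in 1 hour) burns at 1000x.
```

This is why the 1-hour window pages immediately: at 14.4x, waiting even a day would consume half the month's budget.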
// --- SLO Monitor ---
interface SLODefinition {
name: string;
description: string;
target: number; // e.g., 0.999 (99.9%)
windowDays: number; // e.g., 30 (rolling 30-day window)
sliType: 'availability' | 'latency' | 'throughput' | 'correctness';
// For latency SLOs
latencyThresholdMs?: number;
}
interface SLOStatus {
slo: SLODefinition;
currentSLI: number; // Current SLI value (e.g., 0.9985)
errorBudgetTotal: number; // Total budget (minutes or requests)
errorBudgetConsumed: number; // Used budget
errorBudgetRemaining: number;// Remaining budget
burnRate: number; // Current burn rate (1.0 = on track)
isViolating: boolean; // Are we currently below the SLO?
projectedSLI: number; // Where will we be at end of window?
}
class SLOMonitor {
private slos = new Map<string, SLODefinition>();
private measurements = new Map<string, {
totalEvents: number;
goodEvents: number;
windowStart: number;
buckets: Array<{ timestamp: number; total: number; good: number }>;
}>();
private alertCallbacks: Array<(alert: SLOAlert) => void> = [];
registerSLO(slo: SLODefinition): void {
this.slos.set(slo.name, slo);
this.measurements.set(slo.name, {
totalEvents: 0,
goodEvents: 0,
windowStart: Date.now(),
buckets: [],
});
}
// Record an event (request, operation, etc.)
record(sloName: string, isGood: boolean): void {
const measurement = this.measurements.get(sloName);
if (!measurement) return;
measurement.totalEvents++;
if (isGood) measurement.goodEvents++;
// Store in time buckets (1-minute granularity)
const now = Date.now();
const bucketKey = Math.floor(now / 60000) * 60000;
let bucket = measurement.buckets.find((b) => b.timestamp === bucketKey);
if (!bucket) {
bucket = { timestamp: bucketKey, total: 0, good: 0 };
measurement.buckets.push(bucket);
}
bucket.total++;
if (isGood) bucket.good++;
// Prune old buckets
const windowMs = (this.slos.get(sloName)?.windowDays ?? 30) * 86400000;
measurement.buckets = measurement.buckets.filter(
(b) => now - b.timestamp < windowMs,
);
// Check alerts
this.checkAlerts(sloName);
}
getStatus(sloName: string): SLOStatus | null {
const slo = this.slos.get(sloName);
const measurement = this.measurements.get(sloName);
if (!slo || !measurement) return null;
// Calculate SLI over the rolling window
const windowBuckets = this.getWindowBuckets(sloName);
const totalInWindow = windowBuckets.reduce((sum, b) => sum + b.total, 0);
const goodInWindow = windowBuckets.reduce((sum, b) => sum + b.good, 0);
const currentSLI = totalInWindow > 0 ? goodInWindow / totalInWindow : 1.0;
// Error budget
const errorBudgetTotal = (1 - slo.target) * totalInWindow;
const badEvents = totalInWindow - goodInWindow;
const errorBudgetRemaining = Math.max(0, errorBudgetTotal - badEvents);
// Burn rate: how fast are we consuming the budget?
// 1.0 = consuming at exactly the sustainable rate
// 2.0 = consuming twice as fast as sustainable
const elapsedFraction = this.getWindowElapsedFraction(slo);
const expectedBudgetConsumed = errorBudgetTotal * elapsedFraction;
const burnRate = expectedBudgetConsumed > 0
? badEvents / expectedBudgetConsumed
: 0;
// Project where we'll be at end of window
const projectedBadEvents = badEvents / Math.max(elapsedFraction, 0.01);
const projectedTotal = totalInWindow / Math.max(elapsedFraction, 0.01);
const projectedSLI = projectedTotal > 0
? (projectedTotal - projectedBadEvents) / projectedTotal
: 1.0;
return {
slo,
currentSLI,
errorBudgetTotal,
errorBudgetConsumed: badEvents,
errorBudgetRemaining,
burnRate,
isViolating: currentSLI < slo.target,
projectedSLI,
};
}
private getWindowBuckets(sloName: string): Array<{ total: number; good: number }> {
const slo = this.slos.get(sloName)!;
const measurement = this.measurements.get(sloName)!;
const windowMs = slo.windowDays * 86400000;
const now = Date.now();
return measurement.buckets.filter((b) => now - b.timestamp < windowMs);
}
private getWindowElapsedFraction(slo: SLODefinition): number {
const measurement = this.measurements.get(slo.name)!;
const windowMs = slo.windowDays * 86400000;
const elapsed = Date.now() - measurement.windowStart;
return Math.min(1, elapsed / windowMs);
}
// Multi-window, multi-burn-rate alerting (Google SRE book)
private checkAlerts(sloName: string): void {
const status = this.getStatus(sloName);
if (!status) return;
// Fast burn: 14.4x rate over 1 hour → exhausts budget in 2 days
// Checked with 5-minute short window
const burn1h = this.getBurnRateForWindow(sloName, 60);
const burn5m = this.getBurnRateForWindow(sloName, 5);
if (burn1h > 14.4 && burn5m > 14.4) {
this.emitAlert({
sloName,
severity: 'critical',
message: `SLO ${sloName}: burn rate ${burn1h.toFixed(1)}x (1h window). Only ${status.errorBudgetRemaining.toFixed(0)} error-budget events remaining.`,
burnRate: burn1h,
windowMinutes: 60,
budgetRemaining: status.errorBudgetRemaining,
});
}
// Medium burn: 6x rate over 6 hours
const burn6h = this.getBurnRateForWindow(sloName, 360);
const burn30m = this.getBurnRateForWindow(sloName, 30);
if (burn6h > 6 && burn30m > 6) {
this.emitAlert({
sloName,
severity: 'warning',
message: `SLO ${sloName}: burn rate ${burn6h.toFixed(1)}x (6h window). Budget at risk.`,
burnRate: burn6h,
windowMinutes: 360,
budgetRemaining: status.errorBudgetRemaining,
});
}
// Slow burn: 1x rate sustained over 3 days
const burn3d = this.getBurnRateForWindow(sloName, 4320);
if (burn3d > 1 && status.errorBudgetRemaining < status.errorBudgetTotal * 0.1) {
this.emitAlert({
sloName,
severity: 'info',
message: `SLO ${sloName}: <10% error budget remaining. Slow degradation.`,
burnRate: burn3d,
windowMinutes: 4320,
budgetRemaining: status.errorBudgetRemaining,
});
}
}
private getBurnRateForWindow(sloName: string, windowMinutes: number): number {
const slo = this.slos.get(sloName)!;
const measurement = this.measurements.get(sloName)!;
const windowMs = windowMinutes * 60000;
const now = Date.now();
const windowBuckets = measurement.buckets.filter(
(b) => now - b.timestamp < windowMs,
);
const total = windowBuckets.reduce((sum, b) => sum + b.total, 0);
const good = windowBuckets.reduce((sum, b) => sum + b.good, 0);
const bad = total - good;
if (total === 0) return 0;
const expectedBudgetPerMinute = (1 - slo.target) * total / windowMinutes;
const actualBudgetPerMinute = bad / windowMinutes;
return expectedBudgetPerMinute > 0
? actualBudgetPerMinute / expectedBudgetPerMinute
: 0;
}
private emitAlert(alert: SLOAlert): void {
for (const callback of this.alertCallbacks) {
callback(alert);
}
}
onAlert(callback: (alert: SLOAlert) => void): void {
this.alertCallbacks.push(callback);
}
}
interface SLOAlert {
sloName: string;
severity: 'critical' | 'warning' | 'info';
message: string;
burnRate: number;
windowMinutes: number;
budgetRemaining: number;
}
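To make the budget math in getStatus() concrete, here is its core isolated as a pure function — a sketch with illustrative names, not part of the class above:

```typescript
// The error-budget math from getStatus(), isolated as a pure function.
interface BudgetSnapshot {
  sli: number;
  budgetTotal: number;       // allowed bad events over the window
  budgetRemaining: number;
  burnRate: number;          // 1.0 = on track to exactly exhaust at window end
}

function computeBudget(
  target: number,                // e.g. 0.999
  totalEvents: number,
  badEvents: number,
  windowElapsedFraction: number, // 0..1 of the rolling window elapsed
): BudgetSnapshot {
  const sli = totalEvents > 0 ? (totalEvents - badEvents) / totalEvents : 1;
  const budgetTotal = (1 - target) * totalEvents;
  const expectedConsumed = budgetTotal * windowElapsedFraction;
  return {
    sli,
    budgetTotal,
    budgetRemaining: Math.max(0, budgetTotal - badEvents),
    burnRate: expectedConsumed > 0 ? badEvents / expectedConsumed : 0,
  };
}

// 1M events at a 99.9% target, halfway through the window, 800 bad events:
// budgetTotal ≈ 1000, budgetRemaining ≈ 200, burnRate ≈ 1.6 (ahead of schedule)
const snap = computeBudget(0.999, 1_000_000, 800, 0.5);
```

A burn rate of 1.6 halfway through the window projects to ~1600 bad events by window end — over budget, even though the SLO is not yet violated.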
Correlating Signals
The real power of observability is connecting logs, traces, and metrics through shared identifiers.
Correlation Strategy:
Every request gets a trace_id at the edge (API gateway).
This trace_id flows through:
→ Logs: every log line includes trace_id
→ Traces: the trace is identified by trace_id
→ Metrics: exemplars link metrics to trace_ids
User reports "my order was slow" → Support finds order_id
→ Logs: search trace_id from order_id log
→ Trace: view full waterfall for that trace_id
→ Metrics: check p99 latency spike at that timestamp
→ Alert: was there an active alert?
→ Deploy: was there a recent deployment?
┌─────────────────────────┐
│ Grafana Dashboard │
│ │
│ ┌─────────────────────┐ │
│ │ Metrics Panel │ │ Click on spike → jump to traces
│ │ p99 latency = 2.3s │─┼──→ Tempo: traces at t=14:32
│ │ ▲ │ │ │
│ │ │ ╱╲ │ │ ▼
│ │ │ ╱ ╲ │ │ Span: Payment API (1.8s)
│ │ │─╱────╲──── time │ │ │
│ └─────────────────────┘ │ ▼
│ │ Loki: logs for trace_id xyz
│ ┌─────────────────────┐ │ {"trace_id":"xyz",
│ │ Log Panel │ │ "error":"timeout",
│ │ ERROR: timeout │─┼──→ "service":"payment"}
│ │ trace_id: xyz │ │
│ └─────────────────────┘ │
└─────────────────────────┘
// --- Unified Observability Context ---
class ObservabilityContext {
private logger: StructuredLogger;
private tracer: Tracer;
private metrics: ReturnType<typeof createStandardMetrics>;
private sloMonitor: SLOMonitor;
constructor(
serviceName: string,
config: {
logLevel: LogLevel;
traceSamplingRate: number;
metricsRegistry: MetricsRegistry;
},
) {
this.logger = new StructuredLogger(serviceName, {
level: config.logLevel,
transports: [new ConsoleTransport()],
});
this.tracer = new Tracer(serviceName, {
sampler: new TraceIdRatioSampler(config.traceSamplingRate),
});
this.metrics = createStandardMetrics(config.metricsRegistry);
this.sloMonitor = new SLOMonitor();
}
// Instrument an HTTP handler with all three pillars
async instrumentHTTP<T>(
method: string,
path: string,
headers: Record<string, string>,
handler: (ctx: RequestContext) => Promise<T>,
): Promise<{ result: T; statusCode: number }> {
// 1. Extract or create trace context
const parentContext = W3CTraceContextPropagator.extract(headers);
const span = parentContext
? this.tracer.startSpanFromContext(`${method} ${path}`, parentContext, SpanKind.SERVER)
: this.tracer.startSpan(`${method} ${path}`, SpanKind.SERVER);
span.setAttributes({
'http.method': method,
'http.url': path,
});
// 2. Create correlated logger
const requestLogger = this.logger.child({
traceId: span.traceId,
spanId: span.spanId,
method,
path,
});
// 3. Start metrics timer
const timer = this.metrics.httpRequestDuration.startTimer({ method, path });
// 4. Build request context
const ctx: RequestContext = {
traceId: span.traceId,
spanId: span.spanId,
logger: requestLogger,
span,
};
let statusCode = 200;
try {
requestLogger.info('Request started');
const result = await handler(ctx);
span.setStatus(StatusCode.OK);
requestLogger.info('Request completed', { statusCode });
// Record SLO
this.sloMonitor.record('availability', true);
return { result, statusCode };
} catch (error) {
statusCode = 500;
span.recordException(error as Error);
requestLogger.error('Request failed', error as Error, { statusCode });
// Record SLO failure
this.sloMonitor.record('availability', false);
throw error;
} finally {
// Finalize metrics
timer(); // Records duration
this.metrics.httpRequestsTotal.inc({ method, path, status: String(statusCode) });
// End span
this.tracer.endSpan(span);
}
}
// Instrument a database query
async instrumentDB<T>(
ctx: RequestContext,
operation: string,
table: string,
query: () => Promise<T>,
): Promise<T> {
const childSpan = new Span(
`DB ${operation} ${table}`,
ctx.traceId,
ctx.spanId,
SpanKind.CLIENT,
);
childSpan.setAttributes({
'db.system': 'postgresql',
'db.operation': operation,
'db.sql.table': table,
});
const timer = this.metrics.dbQueryDuration.startTimer({ operation, table });
try {
const result = await query();
childSpan.setStatus(StatusCode.OK);
return result;
} catch (error) {
childSpan.recordException(error as Error);
this.metrics.dbQueryErrors.inc({ operation, error_type: (error as Error).name });
throw error;
} finally {
timer();
childSpan.end();
}
}
// Instrument a cache operation
async instrumentCache<T>(
ctx: RequestContext,
cacheName: string,
operation: string,
key: string,
fn: () => Promise<T | null>,
): Promise<T | null> {
const timer = this.metrics.cacheLatency.startTimer({ cache: cacheName, operation });
try {
const result = await fn();
if (result !== null) {
this.metrics.cacheHits.inc({ cache: cacheName, operation });
} else {
this.metrics.cacheMisses.inc({ cache: cacheName, operation });
}
return result;
} finally {
timer();
}
}
getSLOMonitor(): SLOMonitor {
return this.sloMonitor;
}
}
interface RequestContext {
traceId: string;
spanId: string;
logger: StructuredLogger;
span: Span;
}
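instrumentHTTP above depends on extracting W3C trace context from incoming headers. As a rough sketch of what such an extractor does internally (the W3CTraceContextPropagator class is assumed from earlier in the chapter; this standalone parser follows the traceparent format from the W3C spec):

```typescript
// Minimal W3C traceparent parser. Format: "00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>".
interface TraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
// → traceId "4bf92...", spanId "00f06...", sampled: true
```

In production, prefer the propagator shipped with your tracing SDK over a hand-rolled parser — it also handles the companion tracestate header.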
Alerting Strategy
Alert Quality Pyramid:
┌─────────┐
│ SLO-based│ Best: Alert on error budget burn rate
│ alerts │ "We're consuming budget 14x faster than sustainable"
├──────────┤
│ Symptom │ Good: Alert on user-facing symptoms
│ alerts │ "p99 latency > 2s for 5 minutes"
├──────────┤
│ Cause │ OK: Alert on known failure modes
│ alerts │ "Disk usage > 90%"
├──────────┤
│ Threshold│ Bad: Static thresholds on raw metrics
│ alerts │ "CPU > 80%" (could be fine)
└──────────┘
Alert Rules (best practices):
DO:
✓ Alert on SLO burn rate (multi-window)
✓ Alert on user-facing errors (5xx rate > 1%)
✓ Alert on latency degradation relative to baseline
✓ Page only for things that need immediate human action
✓ Include runbook links in every alert
✓ Auto-resolve when condition clears
DON'T:
✗ Alert on raw CPU/memory (causes, not user-facing symptoms)
✗ Alert per-instance (use aggregate metrics)
✗ Create non-actionable alerts
✗ Page for things that self-heal
✗ Set arbitrary static thresholds
// --- Alert Rule Engine ---
interface AlertRule {
name: string;
description: string;
query: string; // PromQL-like expression
condition: AlertCondition;
severity: 'critical' | 'warning' | 'info';
for: number; // Duration condition must hold (seconds)
runbookUrl?: string;
labels: Record<string, string>;
annotations: Record<string, string>;
}
interface AlertCondition {
type: 'threshold' | 'burn_rate' | 'anomaly';
operator: '>' | '<' | '>=' | '<=' | '==' | '!=';
value: number;
}
enum AlertState {
INACTIVE = 'inactive',
PENDING = 'pending', // Condition true, waiting for `for` duration
FIRING = 'firing',
RESOLVED = 'resolved',
}
interface ActiveAlert {
rule: AlertRule;
state: AlertState;
value: number;
startedAt: number;
firedAt?: number;
resolvedAt?: number;
fingerprint: string;
}
class AlertEngine {
private rules: AlertRule[] = [];
private activeAlerts = new Map<string, ActiveAlert>();
private notifiers: AlertNotifier[] = [];
private evaluationInterval = 15000; // 15 seconds
private evaluationTimer?: ReturnType<typeof setInterval>;
addRule(rule: AlertRule): void {
this.rules.push(rule);
}
addNotifier(notifier: AlertNotifier): void {
this.notifiers.push(notifier);
}
start(): void {
this.evaluationTimer = setInterval(() => this.evaluate(), this.evaluationInterval);
}
stop(): void {
if (this.evaluationTimer) clearInterval(this.evaluationTimer);
}
private async evaluate(): Promise<void> {
for (const rule of this.rules) {
const currentValue = this.evaluateQuery(rule.query);
const conditionMet = this.checkCondition(currentValue, rule.condition);
const fingerprint = this.computeFingerprint(rule);
const existing = this.activeAlerts.get(fingerprint);
if (conditionMet) {
if (!existing) {
// New alert — start pending
this.activeAlerts.set(fingerprint, {
rule,
state: AlertState.PENDING,
value: currentValue,
startedAt: Date.now(),
fingerprint,
});
} else if (
existing.state === AlertState.PENDING &&
Date.now() - existing.startedAt >= rule.for * 1000
) {
// Pending long enough — fire
existing.state = AlertState.FIRING;
existing.firedAt = Date.now();
existing.value = currentValue;
await this.notify(existing, 'firing');
} else if (existing.state === AlertState.RESOLVED) {
// Was resolved, now firing again
existing.state = AlertState.PENDING;
existing.startedAt = Date.now();
existing.resolvedAt = undefined;
}
// Update value
if (existing) existing.value = currentValue;
} else {
if (existing && existing.state === AlertState.FIRING) {
existing.state = AlertState.RESOLVED;
existing.resolvedAt = Date.now();
await this.notify(existing, 'resolved');
} else if (existing) {
// Condition cleared while pending (or after resolution) — remove
this.activeAlerts.delete(fingerprint);
}
}
}
}
private evaluateQuery(_query: string): number {
// In production: evaluate PromQL against metric store
// Simplified: return a random value for demonstration
return Math.random() * 100;
}
private checkCondition(value: number, condition: AlertCondition): boolean {
switch (condition.operator) {
case '>': return value > condition.value;
case '<': return value < condition.value;
case '>=': return value >= condition.value;
case '<=': return value <= condition.value;
case '==': return value === condition.value;
case '!=': return value !== condition.value;
}
}
private async notify(alert: ActiveAlert, action: 'firing' | 'resolved'): Promise<void> {
const notification = {
alertName: alert.rule.name,
description: alert.rule.description,
severity: alert.rule.severity,
state: action,
value: alert.value,
startedAt: new Date(alert.startedAt).toISOString(),
firedAt: alert.firedAt ? new Date(alert.firedAt).toISOString() : undefined,
resolvedAt: alert.resolvedAt ? new Date(alert.resolvedAt).toISOString() : undefined,
labels: alert.rule.labels,
annotations: alert.rule.annotations,
runbookUrl: alert.rule.runbookUrl,
};
for (const notifier of this.notifiers) {
try {
await notifier.send(notification);
} catch (error) {
console.error(`Failed to send alert notification: ${error}`);
}
}
}
private computeFingerprint(rule: AlertRule): string {
return `${rule.name}:${JSON.stringify(rule.labels)}`;
}
getActiveAlerts(): ActiveAlert[] {
return Array.from(this.activeAlerts.values())
.filter((a) => a.state === AlertState.FIRING);
}
}
interface AlertNotifier {
send(notification: Record<string, any>): Promise<void>;
}
// PagerDuty notifier
class PagerDutyNotifier implements AlertNotifier {
constructor(private apiKey: string) {}
async send(notification: Record<string, any>): Promise<void> {
// POST to PagerDuty Events API v2
if (notification.severity === 'critical' && notification.state === 'firing') {
console.log(`[PagerDuty] TRIGGER: ${notification.alertName}`);
} else if (notification.state === 'resolved') {
console.log(`[PagerDuty] RESOLVE: ${notification.alertName}`);
}
}
}
// Slack notifier
class SlackNotifier implements AlertNotifier {
constructor(private webhookUrl: string) {}
async send(notification: Record<string, any>): Promise<void> {
const emoji = notification.state === 'resolved' ? '✅' :
notification.severity === 'critical' ? '🔴' : '🟡';
console.log(
`[Slack] ${emoji} ${notification.alertName}: ${notification.description} ` +
`(${notification.state}, value=${notification.value.toFixed(2)})`,
);
}
}
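The pending/firing/resolved lifecycle inside evaluate() can be distilled into a pure transition function — illustrative names, same semantics as the engine above:

```typescript
// The alert lifecycle from evaluate(), as a pure state transition.
// State names mirror the AlertState enum above.
type State = "inactive" | "pending" | "firing" | "resolved";

function nextState(
  current: State,
  conditionMet: boolean,
  pendingLongEnough: boolean, // has the rule's `for` duration elapsed?
): State {
  if (conditionMet) {
    if (current === "inactive" || current === "resolved") return "pending";
    if (current === "pending") return pendingLongEnough ? "firing" : "pending";
    return "firing"; // already firing, stay firing
  }
  // Condition cleared: a firing alert resolves; anything else resets
  return current === "firing" ? "resolved" : "inactive";
}
```

The `for` duration is what prevents a single noisy scrape from paging anyone — the condition must hold continuously before an alert fires.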
Observability Stack Comparison
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Thanos/Mimir | Datadog, New Relic |
| Logs | Loki, ELK (Elastic) | Splunk, Datadog Logs |
| Traces | Jaeger, Tempo | Honeycomb, Lightstep |
| Dashboards | Grafana | Datadog, New Relic |
| Alerting | Alertmanager | PagerDuty, OpsGenie |
| Collection | OpenTelemetry | Datadog Agent |
| APM | SigNoz, Uptrace | Dynatrace, AppDynamics |

| Approach | Monthly Cost (100 services) | Pros | Cons |
|---|---|---|---|
| Self-hosted OSS | $5-15K (infra) | Full control, no data egress, unlimited cardinality | Operational burden, needs dedicated team |
| Managed OSS | $10-30K | Lower ops burden, scalable | Still complex, multi-tool |
| Full SaaS | $30-100K+ | Zero ops, integrated, easy onboard | Expensive at scale, vendor lock-in |
Interview Questions & Answers
Q1: What is the difference between monitoring and observability?
A: Monitoring is predefined — you decide in advance what metrics and thresholds to watch. It answers "is the system healthy?" with yes/no. Observability is exploratory — it lets you ask arbitrary questions about the system's internal state from its external outputs. It answers "why is the system unhealthy?" without needing to have predicted the failure mode in advance. A system is observable when you can understand any internal state just by examining telemetry (logs, metrics, traces). Example: monitoring catches "error rate > 5%." Observability lets you drill down: "errors are 90% from the payment service, specifically the Stripe timeout path, only for requests from the EU region, starting 10 minutes after deploy v2.4.1." The distinction matters because in distributed systems, most failures are novel — you can't pre-define alerts for every possible failure mode.
Q2: How do you implement distributed tracing without significant performance overhead?
A: Four strategies: (1) Sampling — don't trace every request. Head-based sampling (decide at the edge, propagate the decision) with 1-10% sampling rate captures enough data for insight while reducing overhead 10-100x. Use deterministic sampling (trace ID hash) so all services agree on the same decision. (2) Async export — buffer spans in memory and export in batches to the collector. Don't block the request path for telemetry I/O. Use a bounded buffer with backpressure (drop oldest spans when full, never slow down requests). (3) Tail-based sampling — trace everything, but only keep interesting traces (errors, high latency, specific paths). This catches 100% of problems while storing <5% of data. Trade-off: requires a centralized collector to make sampling decisions after the trace completes. (4) Context propagation is cheap — passing a 16-byte (32 hex character) trace ID in headers adds <0.01ms. The cost is in serializing and exporting span data. Keep span attributes minimal and export asynchronously.
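The deterministic sampling from (1) fits in a few lines — no hash needed here, simply reusing the low bits of the hex trace ID as a uniform value (an illustrative sketch):

```typescript
// Deterministic head-based sampling: every service derives the same
// keep/drop decision from the trace ID itself, so traces are never
// half-sampled across service boundaries.
function shouldSample(traceId: string, rate: number): boolean {
  // Treat the low 32 bits of the 128-bit hex trace ID as a uniform [0, 1) value
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < rate;
}

const id = "4bf92f3577b34da6a3ce929d0e0e4736";
shouldSample(id, 0.1); // same answer on every service, every time
```

Because the decision is a pure function of the trace ID, no sampling flag needs to be trusted across services — though in practice the W3C sampled flag is still propagated to avoid recomputation.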
Q3: How do you design effective SLOs and error budgets?
A: Start from user expectations, not system capabilities. (1) Choose SLIs — for a web service: availability (% of non-5xx responses), latency (% of requests under 500ms), correctness (% of correct results). (2) Set SLO targets — based on what users actually need, not "as high as possible." 99.9% (43 min/month downtime) is appropriate for most internal services. 99.95% for customer-facing. 99.99% only for critical infrastructure (auth, payments). (3) Calculate error budget — 99.9% SLO = 0.1% budget. For 1M requests/month, that's 1000 failed requests. (4) Alert on burn rate, not threshold — use multi-window burn rate alerting (Google SRE). A 14.4x burn rate over 1 hour means "budget will exhaust in 2 days" — page immediately. A 1x burn rate sustained over 3 days means "we'll barely miss" — create a ticket. (5) Policy — when budget is exhausted, freeze deployments and focus on reliability. When budget is healthy, deploy features faster. This aligns engineering velocity with reliability quantitatively.
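The budget arithmetic from (3) in code form (an illustrative helper):

```typescript
// Error budget expressed in time: (1 - target) × window length.
// Matches the figures quoted above for monthly windows.
function errorBudgetMinutes(target: number, windowDays = 30): number {
  return (1 - target) * windowDays * 24 * 60;
}

console.log(errorBudgetMinutes(0.999));  // ≈ 43.2 minutes/month
console.log(errorBudgetMinutes(0.9999)); // ≈ 4.3 minutes/month
```

Each extra nine cuts the budget by 10x — which is why 99.99% is reserved for infrastructure where a few minutes of downtime is genuinely unacceptable.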
Q4: How do you handle high-cardinality data in observability systems?
A: High cardinality means many unique label combinations (user_id, trace_id, request_id). Prometheus struggles past ~10M active time series. Solutions: (1) Don't use high-cardinality labels in metrics — user_id should be a log/trace attribute, not a metric label. Metrics are for aggregates (by endpoint, status code, region). (2) Use exemplars — attach a sample trace_id to a metric series. When you see a p99 spike, the exemplar links to an actual trace. You get cardinality in traces, not metrics. (3) Logs for unique identifiers — structured logs with user_id, order_id, etc. Indexed for search but not aggregated as time series. (4) Column-oriented storage — systems like ClickHouse handle high cardinality better than Prometheus (inverted indices + column compression). (5) Pre-aggregation — in the collector, combine per-instance metrics into per-service aggregates before storage. Reduces cardinality by 10-100x.
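A minimal sketch of the pre-aggregation idea from (5), collapsing per-instance samples into per-service totals (names are illustrative; a real collector would do this per scrape interval):

```typescript
// Pre-aggregation: drop the instance label before storage,
// cutting cardinality by the number of instances per service.
interface Sample { service: string; instance: string; value: number }

function aggregateByService(samples: Sample[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const s of samples) {
    totals.set(s.service, (totals.get(s.service) ?? 0) + s.value);
  }
  return totals;
}
```

The trade-off: once aggregated, you can no longer ask per-instance questions of the stored metrics — keep instance-level detail in logs or traces if you need it.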
Q5: How do you correlate logs, metrics, and traces?
A: The key is a shared trace_id that flows through all signals. (1) Inject trace_id at the edge — API gateway generates or extracts a W3C traceparent header. Every downstream service propagates it. (2) Logs include trace_id — every structured log line has trace_id and span_id fields. In Grafana: click a trace → view all logs with that trace_id in Loki. (3) Metrics use exemplars — when recording a histogram observation, attach the trace_id as an exemplar. In Grafana: click a histogram bucket → see the exemplar trace_id → jump to Tempo trace. (4) Service graph — traces inherently contain the service call graph. Aggregate traces to build a live service topology showing request flow and error rates. (5) Unified timestamps — ensure all signals use the same time source (NTP-synced). Logs, traces, and metrics must align temporally for correlation to work. OpenTelemetry is the standard for generating all three signals from a single SDK with automatic correlation.
Real-World Problems & How to Solve Them
Problem 1: Prometheus memory explodes after a release
Symptom: Metrics backend OOMs, scrape latency increases, and dashboards time out.
Root cause: High-cardinality labels (user_id, request_id, trace_id) create massive unique time series. Metrics are aggregate signals, not per-entity records.
Fix — Enforce an allowlist of low-cardinality labels:
type Labels = Record<string, string>;
const ALLOWED_LABELS = new Set(["service", "route", "method", "status_code", "region"]);
function sanitizeMetricLabels(input: Labels): Labels {
const output: Labels = {};
for (const [key, value] of Object.entries(input)) {
if (!ALLOWED_LABELS.has(key)) continue;
output[key] = value;
}
return output;
}
function recordHttpRequest(durationMs: number, labels: Labels): void {
const safe = sanitizeMetricLabels(labels);
httpRequestDurationHistogram.observe(safe, durationMs / 1000);
}
Problem 2: Traces are fragmented across services
Symptom: A user request appears as multiple unrelated traces in your tracing UI.
Root cause: W3C Trace Context (traceparent) is not propagated over HTTP and async boundaries (queues/events).
Fix — Inject and extract context on every boundary:
import { context, propagation } from "@opentelemetry/api";
async function publishOrderEvent(topic: string, payload: unknown): Promise<void> {
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
await messageBus.publish(topic, {
payload,
headers: carrier,
});
}
async function handleOrderEvent(message: { payload: unknown; headers: Record<string, string> }): Promise<void> {
const parent = propagation.extract(context.active(), message.headers);
await context.with(parent, async () => {
await orderConsumer.process(message.payload);
});
}
Problem 3: Logs are impossible to query during incidents
Symptom: Engineers grep plain text logs for hours and still can’t isolate failing requests.
Root cause: Unstructured log strings don’t preserve typed fields (trace_id, error_code, duration_ms) needed for indexing and correlation.
Fix — Emit JSON logs with required fields and context inheritance:
type LogLevel = "info" | "warn" | "error";
function log(level: LogLevel, message: string, fields: Record<string, unknown> = {}): void {
const entry = {
timestamp: new Date().toISOString(),
level,
service: "checkout-service",
trace_id: requestContext.get("traceId"),
span_id: requestContext.get("spanId"),
message,
...fields,
};
process.stdout.write(`${JSON.stringify(entry)}\n`);
}
log("error", "Payment declined", {
error_code: "PAYMENT_DECLINED",
order_id: "ord_123",
duration_ms: 843,
});
Problem 4: p99 latency is invisible until customers complain
Symptom: Average latency looks stable, but tail latency spikes hurt real user experience.
Root cause: Latency was instrumented as a gauge or average counter instead of a histogram, so percentile math is impossible.
Fix — Use histogram buckets designed for your SLO range:
import { Histogram } from "prom-client";
const httpRequestDurationSeconds = new Histogram({
name: "http_request_duration_seconds",
help: "HTTP request latency",
labelNames: ["service", "route", "method", "status_code"],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});
async function timedRequest(labels: Record<string, string>, fn: () => Promise<void>): Promise<void> {
const start = process.hrtime.bigint();
try {
await fn();
} finally {
const elapsedSeconds = Number(process.hrtime.bigint() - start) / 1e9;
httpRequestDurationSeconds.observe(labels, elapsedSeconds);
}
}
Problem 5: Alert fatigue from noisy threshold pages
Symptom: On-call gets paged repeatedly for transient spikes that self-heal.
Root cause: Static threshold alerts ignore error budget consumption rate and time window context.
Fix — Alert on multi-window burn rate, not raw error count:
type BurnRateWindow = { lookbackMinutes: number; threshold: number; severity: "page" | "ticket" };
const windows: BurnRateWindow[] = [
{ lookbackMinutes: 60, threshold: 14.4, severity: "page" },
{ lookbackMinutes: 360, threshold: 6, severity: "page" },
{ lookbackMinutes: 1440, threshold: 2, severity: "ticket" },
];
function evaluateBurnRate(currentBurnRate: number, lookbackMinutes: number): "page" | "ticket" | "none" {
const match = windows.find((w) => w.lookbackMinutes === lookbackMinutes);
if (!match) return "none";
if (currentBurnRate >= match.threshold) return match.severity;
return "none";
}
Problem 6: You can see a bad metric point but can’t inspect the request
Symptom: Dashboards show latency spikes, but engineers can’t jump to any representative trace.
Root cause: Histogram observations are recorded without exemplars, so metrics and traces are disconnected at query time.
Fix — Attach trace IDs as exemplars for sampled requests:
import { trace } from "@opentelemetry/api";
import { Histogram } from "prom-client";
// prom-client requires enableExemplars: true at histogram creation time
const httpRequestDurationHistogram = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency",
  labelNames: ["service", "route"],
  enableExemplars: true,
});
function recordLatencyWithExemplar(latencySeconds: number, labels: Record<string, string>): void {
  const traceId = trace.getActiveSpan()?.spanContext().traceId;
  // Exemplar-enabled histograms take a single object argument in prom-client
  httpRequestDurationHistogram.observe({
    labels,
    value: latencySeconds,
    exemplarLabels: traceId ? { trace_id: traceId } : {},
  });
}
Key Takeaways
- Structured logging (JSON) enables querying and aggregation — unstructured string logs are useless at scale; every log entry needs trace_id, service name, and typed fields
- Distributed tracing follows a single request across all services — each service creates a span with parent-child relationships forming a waterfall view
- W3C Trace Context (traceparent header) is the standard for propagating trace IDs across service boundaries — deterministic sampling ensures all services agree
- Metrics are counters, gauges, and histograms — counters for totals, gauges for current values, histograms for distributions (p50/p95/p99 latency)
- SLOs define reliability targets quantitatively — 99.9% availability = 43 minutes/month error budget; burn rate alerts are better than threshold alerts
- Multi-window burn rate alerting catches both fast and slow degradation — 14.4x burn over 1 hour pages immediately, 1x over 3 days creates a ticket
- Correlation via trace_id is the glue — the same trace_id appears in logs, traces, and metric exemplars, enabling drill-down from any signal
- Sample traces, not metrics — sampling at 1-10% reduces trace storage and overhead while maintaining statistical significance for debugging
- High-cardinality labels (user_id, request_id) belong in logs and traces, not metrics — metrics should use low-cardinality labels (method, status, service)
- OpenTelemetry is the convergence point — a single SDK for logs, metrics, and traces with automatic correlation, replacing vendor-specific agents