Backend Graceful Shutdown & Lifecycle: Signal Handling, Connection Draining, Health Checks & Kubernetes Pod Lifecycle
Why Graceful Shutdown Matters
When a server process terminates abruptly, in-flight HTTP requests get dropped, database transactions are left incomplete, WebSocket connections die without close frames, and background jobs vanish mid-execution. Users see 502 errors. Data becomes inconsistent. The next deploy becomes a mini-outage.
Graceful shutdown means: when the process receives a termination signal, it stops accepting new work, finishes in-flight work, cleans up resources, and then exits. Zero dropped requests. Zero data corruption.
Ungraceful Shutdown:
┌───────────────┐
│ Server │
│ ┌───────────┐ │ SIGTERM
│ │ Request A │ │←────────────── kill -15 pid
│ │ (halfway) │ │
│ ├───────────┤ │ Process exits immediately
│ │ Request B │ │    ┌──────────────────────────────┐
│ │ (waiting  │ │───→│ Request A: 502 Bad Gateway   │
│ │ for DB)   │ │    │ Request B: Connection reset  │
│ ├───────────┤ │    │ DB transaction: half-written │
│ │ Cron job  │ │    │ Cron job: partial execution  │
│ │ (running) │ │    └──────────────────────────────┘
│ └───────────┘ │
└───────────────┘
Graceful Shutdown:
┌───────────────┐
│ Server │ SIGTERM
│ │←────────────── 1. Stop accepting new connections
│ ┌───────────┐ │ 2. Health check → unhealthy
│ │ Request A │→│──→ completes 3. Finish in-flight requests
│ │ │ │ 4. Close DB connections
│ ├───────────┤ │ 5. Flush logs/metrics
│ │ Request B │→│──→ completes 6. Exit with code 0
│ ├───────────┤ │
│ │ Cron job │→│──→ checkpoint
│ └───────────┘ │
└───────────────┘
└──→ exit(0) after all work completes
Signal Handling in Node.js
Unix Signals for Process Lifecycle:
Signal │ Default Action │ Can Catch? │ Typical Use
───────────┼────────────────┼────────────┼──────────────────────────
SIGTERM │ Terminate │ Yes │ Polite "please shut down"
SIGINT │ Terminate │ Yes │ Ctrl+C in terminal
SIGQUIT │ Core dump │ Yes │ Ctrl+\ (debugging)
SIGKILL │ Terminate │ NO │ Force kill (cannot catch)
SIGHUP │ Terminate │ Yes │ Terminal closed / reload config
SIGUSR1 │ Terminate │ Yes │ Node: start debugger
SIGUSR2 │ Terminate │ Yes │ Custom (e.g., heap dump)
Kubernetes sends:
1. SIGTERM → pod → terminationGracePeriodSeconds countdown starts (default: 30s)
2. If pod still running after grace period → SIGKILL (uncatchable)
Docker sends:
docker stop → SIGTERM → wait 10 seconds → SIGKILL
Signal Handler Implementation
type ShutdownHook = {
name: string;
handler: () => Promise<void>;
priority: number; // Lower = runs first
timeoutMs: number; // Max time for this hook
};
class ProcessLifecycleManager {
private hooks: ShutdownHook[] = [];
private isShuttingDown = false;
private shutdownPromise: Promise<void> | null = null;
private forceShutdownTimeoutMs: number;
constructor(options: { forceShutdownTimeoutMs?: number } = {}) {
this.forceShutdownTimeoutMs = options.forceShutdownTimeoutMs ?? 30000;
this.registerSignalHandlers();
}
private registerSignalHandlers(): void {
// Handle SIGTERM (Kubernetes, docker stop, kill)
process.on('SIGTERM', () => {
console.log('[lifecycle] Received SIGTERM');
this.shutdown('SIGTERM');
});
// Handle SIGINT (Ctrl+C)
process.on('SIGINT', () => {
console.log('[lifecycle] Received SIGINT');
this.shutdown('SIGINT');
});
// Handle uncaught exceptions — try to shut down gracefully
process.on('uncaughtException', (error: Error) => {
console.error('[lifecycle] Uncaught exception:', error);
this.shutdown('uncaughtException', 1);
});
// Handle unhandled promise rejections
process.on('unhandledRejection', (reason: any) => {
console.error('[lifecycle] Unhandled rejection:', reason);
this.shutdown('unhandledRejection', 1);
});
// Handle second SIGINT/SIGTERM as force shutdown
let signalCount = 0;
const secondSignalHandler = (signal: string) => {
signalCount++;
if (signalCount > 1) {
console.log('[lifecycle] Force shutdown (second signal received)');
process.exit(1);
}
};
process.on('SIGTERM', secondSignalHandler);
process.on('SIGINT', secondSignalHandler);
}
// Register a shutdown hook
addHook(hook: ShutdownHook): void {
this.hooks.push(hook);
this.hooks.sort((a, b) => a.priority - b.priority);
}
// Convenience method for simple hooks
onShutdown(name: string, handler: () => Promise<void>, priority = 100): void {
this.addHook({ name, handler, priority, timeoutMs: 10000 });
}
async shutdown(reason: string, exitCode: number = 0): Promise<void> {
// Prevent concurrent shutdowns
if (this.isShuttingDown) {
return this.shutdownPromise!;
}
this.isShuttingDown = true;
console.log(`[lifecycle] Shutting down (reason: ${reason})`);
this.shutdownPromise = this.executeShutdown(exitCode);
return this.shutdownPromise;
}
private async executeShutdown(exitCode: number): Promise<void> {
const startTime = Date.now();
// Set a hard deadline — if hooks don't finish, force exit
const forceTimer = setTimeout(() => {
console.error('[lifecycle] Force shutdown — hooks did not complete in time');
process.exit(1);
}, this.forceShutdownTimeoutMs);
// Don't let this timer keep the process alive
forceTimer.unref();
// Execute hooks in priority order
for (const hook of this.hooks) {
const hookStart = Date.now();
try {
console.log(`[lifecycle] Running hook: ${hook.name}`);
await Promise.race([
hook.handler(),
new Promise<void>((_, reject) =>
setTimeout(
() => reject(new Error(`Hook ${hook.name} timed out`)),
hook.timeoutMs
)
)
]);
const elapsed = Date.now() - hookStart;
console.log(`[lifecycle] Hook ${hook.name} completed (${elapsed}ms)`);
} catch (error) {
console.error(`[lifecycle] Hook ${hook.name} failed:`, error);
// Continue with other hooks even if one fails
}
}
const totalElapsed = Date.now() - startTime;
console.log(`[lifecycle] All hooks completed (${totalElapsed}ms). Exiting.`);
clearTimeout(forceTimer);
process.exit(exitCode);
}
get shuttingDown(): boolean {
return this.isShuttingDown;
}
}
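The class's core behavior — run hooks in priority order, bound each by a timeout, keep going when one fails — can be distilled into a standalone sketch. Names such as MiniHook and runHooksInOrder are illustrative, not part of the class above; the sketch also clears the watchdog timer once the race settles, a cleanup the class can skip because it exits the process anyway.

```typescript
type MiniHook = {
  name: string;
  priority: number; // lower runs first
  run: () => Promise<void>;
  timeoutMs: number;
};

// Race a promise against a watchdog, then cancel the watchdog either way
async function withTimeout(p: Promise<void>, ms: number, label: string): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<void>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out`)), ms);
  });
  try {
    await Promise.race([p, watchdog]);
  } finally {
    clearTimeout(timer); // no stray rejection after the hook finishes on time
  }
}

async function runHooksInOrder(hooks: MiniHook[]): Promise<string[]> {
  const completed: string[] = [];
  for (const hook of [...hooks].sort((a, b) => a.priority - b.priority)) {
    try {
      await withTimeout(hook.run(), hook.timeoutMs, hook.name);
      completed.push(hook.name);
    } catch {
      completed.push(`${hook.name} (failed)`); // a failed hook never blocks later ones
    }
  }
  return completed;
}
```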
HTTP Server Connection Draining
Connection Draining Timeline:
SIGTERM received
│
▼
t=0s: Stop listening for new connections
server.close() — stops accepting new TCP connections
│
▼
t=0s: Existing connections continue processing
┌─────────────────────────────────────────────┐
│ In-flight Request A: processing → response │
│ In-flight Request B: processing → response │
│ Keep-Alive Conn C: idle → close │
│ WebSocket D: close frame → disconnect │
└─────────────────────────────────────────────┘
│
▼
t=5s: All in-flight requests completed
All connections drained
│
▼
t=5s: server 'close' event fires
└─→ proceed with remaining shutdown hooks
Edge Case: Long-Running Requests
What if a request takes 60s but grace period is 30s?
t=0 t=25s t=30s
│ │ │
▼ ▼ ▼
SIGTERM Connection-level SIGKILL
timeout fires (Kubernetes)
→ 503 response → Process dies
→ Socket destroyed
You must enforce a per-connection timeout shorter than the grace period.
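Node's http.Server exposes built-in timeouts that cover part of this requirement without custom code. A sketch with assumed values tuned to a 30s grace period — note that requestTimeout limits how long the client may take to send the request, and server.setTimeout covers socket inactivity; neither is a hard deadline on slow handler code, which still needs an application-level timeout:

```typescript
import http from 'http';

const server = http.createServer((req, res) => res.end('ok'));

// Client must deliver the entire request within 25s (< the 30s grace period)
server.requestTimeout = 25_000;
// Idle keep-alive sockets are torn down after 5s, so drains finish quickly
server.keepAliveTimeout = 5_000;
// Socket inactivity timeout; without a 'timeout' listener, the socket is destroyed
server.setTimeout(25_000);
```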
HTTP Connection Drainer
import { Server, IncomingMessage, ServerResponse, Socket } from 'http';
class HttpConnectionDrainer {
private server: Server;
private activeConnections: Map<Socket, ConnectionInfo> = new Map();
private requestCount = 0;
private isClosing = false;
constructor(server: Server, private options: DrainerOptions = {}) {
this.server = server;
this.options = {
connectionTimeoutMs: 25000, // Must be < Kubernetes grace period
keepAliveTimeoutMs: 5000, // Quickly close idle keep-alive connections
...options
};
this.trackConnections();
}
private trackConnections(): void {
// Track every TCP connection
this.server.on('connection', (socket: Socket) => {
const info: ConnectionInfo = {
socket,
activeRequests: 0,
connectedAt: Date.now(),
lastActivity: Date.now()
};
this.activeConnections.set(socket, info);
socket.on('close', () => {
this.activeConnections.delete(socket);
});
});
// Track request lifecycle on each connection
this.server.on('request', (req: IncomingMessage, res: ServerResponse) => {
const socket = req.socket;
const info = this.activeConnections.get(socket);
if (info) {
info.activeRequests++;
info.lastActivity = Date.now();
}
this.requestCount++;
// When response finishes, decrement active request count
res.on('finish', () => {
if (info) {
info.activeRequests--;
info.lastActivity = Date.now();
}
// If we're closing and this connection has no more requests, close it
if (this.isClosing && info && info.activeRequests === 0) {
socket.end(); // Gracefully close the TCP connection
}
});
// If shutting down, add "Connection: close" header
// This tells the client not to reuse this connection
if (this.isClosing) {
res.setHeader('Connection', 'close');
}
});
}
async drain(): Promise<void> {
this.isClosing = true;
return new Promise<void>((resolve) => {
// 1. Stop accepting new connections
this.server.close(() => {
// All connections closed
resolve();
});
// 2. Close idle keep-alive connections immediately
for (const [socket, info] of this.activeConnections) {
if (info.activeRequests === 0) {
socket.end(); // Graceful close for idle connections
}
}
// 3. Set connection-level timeout for in-flight requests
const timeout = setTimeout(() => {
// Force-close remaining connections
for (const [socket, info] of this.activeConnections) {
console.warn(
`[drain] Force-closing connection with ${info.activeRequests} active requests`
);
socket.destroy(); // Hard close — response lost
}
resolve();
}, this.options.connectionTimeoutMs);
timeout.unref();
});
}
getStats(): ConnectionStats {
let activeRequests = 0;
let idleConnections = 0;
for (const [, info] of this.activeConnections) {
activeRequests += info.activeRequests;
if (info.activeRequests === 0) idleConnections++;
}
return {
totalConnections: this.activeConnections.size,
activeRequests,
idleConnections,
isClosing: this.isClosing,
totalRequestsServed: this.requestCount
};
}
}
interface ConnectionInfo {
socket: Socket;
activeRequests: number;
connectedAt: number;
lastActivity: number;
}
interface DrainerOptions {
connectionTimeoutMs?: number;
keepAliveTimeoutMs?: number;
}
interface ConnectionStats {
totalConnections: number;
activeRequests: number;
idleConnections: number;
isClosing: boolean;
totalRequestsServed: number;
}
Health Check Integration
Readiness vs Liveness:
Liveness Probe: "Is the process alive?"
Failed → Kubernetes RESTARTS the pod
Example: process deadlocked, out of memory
Readiness Probe: "Can the process handle traffic?"
Failed → Kubernetes REMOVES pod from service endpoints
Example: shutting down, warming cache, DB connecting
During Graceful Shutdown:
┌──────────────────────────────────────────────────────────────┐
│ │
│ t=0: SIGTERM received │
│ readiness → FAIL │
│ liveness → PASS (process is still alive!) │
│ │
│ t=0-5s: Kubernetes sees failed readiness │
│ Removes pod from Service endpoints │
│ Load balancer stops sending NEW requests │
│ But in-flight requests continue processing │
│ │
│ t=5s: All in-flight requests complete │
│ DB connections closed │
│ Logs flushed │
│ │
│ t=5s: Process exits with code 0 │
│ │
│ If t > terminationGracePeriodSeconds: │
│ SIGKILL — forced termination │
│ │
└──────────────────────────────────────────────────────────────┘
CRITICAL RACE CONDITION:
Kubernetes endpoint update propagation is NOT instant.
After readiness fails, it takes ~1-5 seconds for the
kube-proxy / iptables / IPVS rules to update across
all nodes. During this window, new requests may still
arrive at the pod.
Solution: preStop hook with sleep
pod.spec.containers[].lifecycle.preStop:
exec:
command: ["sh", "-c", "sleep 5"]
This delays SIGTERM by 5 seconds, giving time for
endpoints to propagate before the app starts draining.
Health Check Controller
interface HealthStatus {
status: 'healthy' | 'degraded' | 'unhealthy';
checks: Record<string, CheckResult>;
timestamp: string;
uptime: number;
}
interface CheckResult {
status: 'pass' | 'warn' | 'fail';
message?: string;
latencyMs?: number;
}
type HealthCheck = {
name: string;
check: () => Promise<CheckResult>;
critical: boolean; // If critical check fails → unhealthy
intervalMs?: number; // Background check interval
};
class HealthCheckController {
private checks: HealthCheck[] = [];
private cachedResults: Map<string, CheckResult> = new Map();
private isReady = true;
private isAlive = true;
private startTime = Date.now();
private backgroundTimers: NodeJS.Timeout[] = [];
addCheck(check: HealthCheck): void {
this.checks.push(check);
// Run background checks periodically
if (check.intervalMs) {
const timer = setInterval(async () => {
try {
const result = await Promise.race([
check.check(),
new Promise<CheckResult>((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), 5000)
)
]);
this.cachedResults.set(check.name, result);
} catch (error) {
this.cachedResults.set(check.name, {
status: 'fail',
message: String(error)
});
}
}, check.intervalMs);
timer.unref();
this.backgroundTimers.push(timer);
}
}
// Called during shutdown
markUnready(): void {
this.isReady = false;
}
markUnhealthy(): void {
this.isAlive = false;
}
// Liveness endpoint: GET /health/live
async getLiveness(): Promise<HealthStatus> {
return {
status: this.isAlive ? 'healthy' : 'unhealthy',
checks: {},
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
// Readiness endpoint: GET /health/ready
async getReadiness(): Promise<HealthStatus> {
if (!this.isReady) {
return {
status: 'unhealthy',
checks: { shutdown: { status: 'fail', message: 'Shutting down' } },
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
const results: Record<string, CheckResult> = {};
let overallHealthy = true;
for (const check of this.checks) {
// Use cached result if available
const cached = this.cachedResults.get(check.name);
if (cached) {
results[check.name] = cached;
} else {
try {
const start = Date.now();
const result = await check.check();
result.latencyMs = Date.now() - start;
results[check.name] = result;
} catch (error) {
results[check.name] = {
status: 'fail',
message: String(error)
};
}
}
if (results[check.name].status === 'fail' && check.critical) {
overallHealthy = false;
}
}
return {
status: overallHealthy ? 'healthy' : 'unhealthy',
checks: results,
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
// Startup endpoint: GET /health/startup
// Used for slow-starting apps (Kubernetes startupProbe)
async getStartup(): Promise<HealthStatus> {
// Could check: DB connected, cache warmed, config loaded, etc.
return this.getReadiness();
}
dispose(): void {
for (const timer of this.backgroundTimers) {
clearInterval(timer);
}
}
}
// Common health checks
function databaseHealthCheck(pool: any): HealthCheck {
return {
name: 'database',
critical: true,
intervalMs: 10000,
check: async () => {
try {
const start = Date.now();
await pool.query('SELECT 1');
return {
status: 'pass',
latencyMs: Date.now() - start
};
} catch (error) {
return {
status: 'fail',
message: `Database unreachable: ${error}`
};
}
}
};
}
function redisHealthCheck(redis: any): HealthCheck {
return {
name: 'redis',
critical: false, // Non-critical — app can work without cache
intervalMs: 15000,
check: async () => {
try {
const start = Date.now();
await redis.ping();
return {
status: 'pass',
latencyMs: Date.now() - start
};
} catch {
return {
status: 'warn',
message: 'Redis unavailable — falling back to in-memory cache'
};
}
}
};
}
function diskSpaceHealthCheck(thresholdPercent = 90): HealthCheck {
return {
name: 'disk',
critical: true,
intervalMs: 60000,
check: async () => {
// execSync blocks the event loop — acceptable only because this runs on a 60s interval
const { execSync } = require('child_process');
const output = execSync("df -h / | tail -1 | awk '{print $5}'").toString().trim();
const usedPercent = parseInt(output.replace('%', ''), 10);
if (usedPercent > thresholdPercent) {
return {
status: 'fail',
message: `Disk usage at ${usedPercent}% (threshold: ${thresholdPercent}%)`
};
}
return {
status: usedPercent > thresholdPercent - 10 ? 'warn' : 'pass',
message: `Disk usage: ${usedPercent}%`
};
}
};
}
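The HealthStatus type allows 'degraded', but getReadiness above only ever reports healthy or unhealthy. A hypothetical aggregation helper showing how all three levels could be derived from check results (critical failures → unhealthy; any warn or non-critical failure → degraded):

```typescript
type CheckOutcome = { status: 'pass' | 'warn' | 'fail'; critical: boolean };

function aggregateStatus(results: CheckOutcome[]): 'healthy' | 'degraded' | 'unhealthy' {
  if (results.some(r => r.status === 'fail' && r.critical)) return 'unhealthy';
  if (results.some(r => r.status !== 'pass')) return 'degraded'; // e.g. Redis down, cache disabled
  return 'healthy';
}
```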
Complete Graceful Shutdown Orchestrator
Shutdown Phase Ordering:
Phase 1: Health Check → Unready
(Load balancer stops sending new traffic)
Wait for endpoint propagation (~5 seconds)
Phase 2: Stop HTTP Server
server.close() — stops accepting new TCP connections
Phase 3: Drain Connections
Wait for in-flight requests to complete
Force-close after timeout
Phase 4: Stop Background Workers
Cron jobs checkpoint and halt
Message consumers stop consuming
Worker threads send completion signal
Phase 5: Close External Connections
Database connection pools drain
Redis/cache connections close
Message broker connections close
Phase 6: Flush Buffers
Logs flushed to disk/remote
Metrics pushed to collector
Analytics events sent
Phase 7: Exit
process.exit(0)
Priority Map:
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ P:10 │ │ P:20 │ │ P:50 │ │ P:80 │ │ P:90 │
│ Health → │→│ HTTP │→│ Background │→│ External │→│ Flush │
│ Unready │ │ Server │ │ Workers │ │ Connections│ │ Buffers │
│ │ │ Close │ │ Halt │ │ Close │ │ & Exit │
└────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘
import http from 'http';
import { EventEmitter } from 'events';
interface ServerComponents {
httpServer: http.Server;
healthCheck: HealthCheckController;
connectionDrainer: HttpConnectionDrainer;
dbPool?: any; // Database connection pool
redis?: any; // Redis client
messageConsumer?: any; // Message queue consumer
workerManager?: WorkerManager;
metricsCollector?: MetricsCollector;
logger?: FlushableLogger;
}
class GracefulShutdownOrchestrator {
private lifecycle: ProcessLifecycleManager;
constructor(private components: ServerComponents) {
this.lifecycle = new ProcessLifecycleManager({
forceShutdownTimeoutMs: 28000 // Must be < K8s terminationGracePeriodSeconds
});
this.registerShutdownHooks();
}
private registerShutdownHooks(): void {
const {
healthCheck, httpServer, connectionDrainer,
dbPool, redis, messageConsumer,
workerManager, metricsCollector, logger
} = this.components;
// Phase 1: Mark as unready (Priority 10)
this.lifecycle.addHook({
name: 'mark-unready',
priority: 10,
timeoutMs: 6000, // must exceed the 5s propagation sleep below
handler: async () => {
healthCheck.markUnready();
// Wait for endpoint propagation
// In Kubernetes, use preStop hook instead
await sleep(5000);
}
});
// Phase 2: Stop accepting connections (Priority 20)
this.lifecycle.addHook({
name: 'stop-http-server',
priority: 20,
timeoutMs: 26000, // must exceed the drainer's connectionTimeoutMs (25s default)
handler: async () => {
await connectionDrainer.drain();
}
});
// Phase 3: Stop message consumers (Priority 30)
if (messageConsumer) {
this.lifecycle.addHook({
name: 'stop-message-consumer',
priority: 30,
timeoutMs: 10000,
handler: async () => {
// Stop consuming but finish processing current messages
await messageConsumer.close();
}
});
}
// Phase 4: Stop background workers (Priority 50)
if (workerManager) {
this.lifecycle.addHook({
name: 'stop-workers',
priority: 50,
timeoutMs: 15000,
handler: async () => {
await workerManager.shutdownAll();
}
});
}
// Phase 5: Close database pool (Priority 80)
if (dbPool) {
this.lifecycle.addHook({
name: 'close-database',
priority: 80,
timeoutMs: 5000,
handler: async () => {
await dbPool.end();
}
});
}
// Phase 5: Close Redis (Priority 80)
if (redis) {
this.lifecycle.addHook({
name: 'close-redis',
priority: 80,
timeoutMs: 3000,
handler: async () => {
await redis.quit();
}
});
}
// Phase 6: Flush metrics (Priority 90)
if (metricsCollector) {
this.lifecycle.addHook({
name: 'flush-metrics',
priority: 90,
timeoutMs: 5000,
handler: async () => {
await metricsCollector.flush();
}
});
}
// Phase 6: Flush logs (Priority 95)
if (logger) {
this.lifecycle.addHook({
name: 'flush-logs',
priority: 95,
timeoutMs: 3000,
handler: async () => {
await logger.flush();
}
});
}
// Dispose health check timers
this.lifecycle.addHook({
name: 'dispose-healthcheck',
priority: 99,
timeoutMs: 1000,
handler: async () => {
healthCheck.dispose();
}
});
}
// Check if system is shutting down (for middleware use)
get isShuttingDown(): boolean {
return this.lifecycle.shuttingDown;
}
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Worker Manager — manages background task workers
class WorkerManager {
private workers: Map<string, ManagedWorker> = new Map();
register(name: string, worker: ManagedWorker): void {
this.workers.set(name, worker);
}
async shutdownAll(): Promise<void> {
const shutdownPromises = Array.from(this.workers.entries()).map(
async ([name, worker]) => {
try {
console.log(`[worker] Stopping worker: ${name}`);
// Signal worker to stop after current task
worker.stop();
// Wait for the current task to finish
await Promise.race([
worker.waitForIdle(),
sleep(10000) // Max wait per worker
]);
console.log(`[worker] Worker stopped: ${name}`);
} catch (error) {
console.error(`[worker] Error stopping ${name}:`, error);
}
}
);
await Promise.all(shutdownPromises);
}
}
interface ManagedWorker {
stop(): void; // Signal to stop after current task
waitForIdle(): Promise<void>; // Resolves when no task is running
}
class IntervalWorker implements ManagedWorker {
private timer?: NodeJS.Timeout;
private running = false;
private shouldStop = false;
private idleResolvers: Array<() => void> = [];
constructor(
private name: string,
private task: () => Promise<void>,
private intervalMs: number
) {
this.start();
}
private start(): void {
this.timer = setInterval(async () => {
if (this.shouldStop || this.running) return;
this.running = true;
try {
await this.task();
} catch (error) {
console.error(`[worker:${this.name}] Error:`, error);
} finally {
this.running = false;
if (this.shouldStop) {
// Notify waiters that we're idle
for (const resolve of this.idleResolvers) {
resolve();
}
this.idleResolvers = [];
}
}
}, this.intervalMs);
}
stop(): void {
this.shouldStop = true;
if (this.timer) {
clearInterval(this.timer);
}
// If already idle, resolve immediately
if (!this.running) {
for (const resolve of this.idleResolvers) {
resolve();
}
this.idleResolvers = [];
}
}
waitForIdle(): Promise<void> {
if (!this.running) return Promise.resolve();
return new Promise(resolve => {
this.idleResolvers.push(resolve);
});
}
}
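The contract every ManagedWorker follows — finish the current task, then stop at the next checkpoint — can be sketched as a standalone loop (illustrative helper, not part of the classes above):

```typescript
// Process items until a stop is requested; checks the flag BETWEEN tasks,
// never aborting one mid-flight. Returns how many items were handled.
async function processUntilStopped<T>(
  items: T[],
  handle: (item: T) => Promise<void>,
  isStopping: () => boolean
): Promise<number> {
  let processed = 0;
  for (const item of items) {
    if (isStopping()) break; // checkpoint between tasks
    await handle(item);
    processed++;
  }
  return processed;
}
```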
interface MetricsCollector {
flush(): Promise<void>;
}
interface FlushableLogger {
flush(): Promise<void>;
}
Kubernetes Pod Lifecycle Integration
Kubernetes Pod Termination Sequence:
1. API Server marks pod as "Terminating"
│
2. Endpoints controller REMOVES pod from Service endpoints
│ IN PARALLEL
3. kubelet sees pod state, runs preStop hook ──────────────────────┐
│ │
│ preStop: sleep 5 ← Gives time for endpoints to propagate │
│ │
4. kubelet sends SIGTERM to container PID 1 │
│ │
5. terminationGracePeriodSeconds countdown starts (default: 30s) │
│ │
│ App: mark unready → drain connections → cleanup │
│ │
6. If still running when grace period expires: SIGKILL │
└───────────────────────────────────────────────────────────────┘
CRITICAL: Steps 2 and 3 happen IN PARALLEL, not sequentially.
The preStop hook delays SIGTERM so that by the time the app
starts draining, the Service endpoints are already updated.
Kubernetes Manifest:
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
terminationGracePeriodSeconds: 30 # Total grace period
containers:
- name: api-server
lifecycle:
preStop: # Runs BEFORE SIGTERM
exec:
command: ["sh", "-c", "sleep 5"] # Wait for endpoint propagation
readinessProbe:
httpGet:
path: /health/ready
port: 8080
periodSeconds: 5
failureThreshold: 1 # Fail fast on shutdown
livenessProbe:
httpGet:
path: /health/live
port: 8080
periodSeconds: 10
failureThreshold: 3 # Allow temporary issues
startupProbe:
httpGet:
path: /health/startup
port: 8080
periodSeconds: 5
failureThreshold: 30 # Allow 150s for slow startup
Timeline With preStop:
─────────────────────────────────────────────────
t=0s Pod marked Terminating + preStop starts
t=0-5s Endpoints propagating (pod still receiving traffic)
preStop: sleeping
t=5s preStop completes → SIGTERM sent to app
t=5s App: readiness → fail, drain starts
t=5-25s App: draining connections, cleanup
t=25s App exits with code 0
─────────────────────────────────────────────────
t=30s (deadline — SIGKILL would fire if app hadn't exited)
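The timeline's arithmetic is worth encoding next to the config so the two can't drift apart. All values below are assumptions mirroring the manifest above:

```typescript
// Derive the drain budget from the Kubernetes settings (assumed values)
const terminationGracePeriodSeconds = 30; // Deployment spec
const preStopSleepSeconds = 5;            // preStop: sleep 5
const safetyMarginSeconds = 2;            // slack for cleanup hooks and log flushing

const drainBudgetMs =
  (terminationGracePeriodSeconds - preStopSleepSeconds - safetyMarginSeconds) * 1000;
// drainBudgetMs is the ceiling for the drainer's connectionTimeoutMs
```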
Putting It All Together: Express/Fastify Server
import http from 'http';
import express from 'express';
async function main() {
const app = express();
// Create HTTP server
const server = http.createServer(app);
// Initialize components
const healthCheck = new HealthCheckController();
const connectionDrainer = new HttpConnectionDrainer(server, {
connectionTimeoutMs: 20000
});
// Simulate external dependencies
const dbPool = await createDatabasePool();
const redis = await createRedisClient();
// Register health checks
healthCheck.addCheck(databaseHealthCheck(dbPool));
healthCheck.addCheck(redisHealthCheck(redis));
// Background workers
const workerManager = new WorkerManager();
workerManager.register('metrics-aggregation', new IntervalWorker(
'metrics-aggregation',
async () => { /* aggregate metrics */ },
60000
));
// Setup graceful shutdown
const orchestrator = new GracefulShutdownOrchestrator({
httpServer: server,
healthCheck,
connectionDrainer,
dbPool,
redis,
workerManager
});
// Health endpoints
app.get('/health/live', async (req, res) => {
const health = await healthCheck.getLiveness();
res.status(health.status === 'healthy' ? 200 : 503).json(health);
});
app.get('/health/ready', async (req, res) => {
const health = await healthCheck.getReadiness();
res.status(health.status === 'healthy' ? 200 : 503).json(health);
});
// Middleware: reject requests during shutdown
app.use((req, res, next) => {
if (orchestrator.isShuttingDown) {
res.setHeader('Connection', 'close');
res.setHeader('Retry-After', '5'); // standard header counterpart of the body field
res.status(503).json({
error: 'Service shutting down',
retryAfter: 5
});
return;
}
next();
});
// Application routes
app.get('/api/data', async (req, res) => {
const data = await dbPool.query('SELECT * FROM data LIMIT 10');
res.json(data.rows);
});
// Start server
const PORT = Number(process.env.PORT) || 3000;
server.listen(PORT, () => {
console.log(`Server listening on port ${PORT}`);
console.log('Readiness probe: GET /health/ready');
console.log('Liveness probe: GET /health/live');
console.log('Graceful shutdown configured (SIGTERM/SIGINT handled)');
});
}
// Placeholder factory functions
async function createDatabasePool(): Promise<any> {
return {
query: async (sql: string) => ({ rows: [] }),
end: async () => console.log('[db] Pool closed')
};
}
async function createRedisClient(): Promise<any> {
return {
ping: async () => 'PONG',
quit: async () => console.log('[redis] Connection closed')
};
}
main().catch(console.error);
Common Pitfalls
Pitfall 1: PID 1 Problem in Docker
──────────────────────────────────
# BAD: npm start is PID 1 — Node doesn't handle signals properly
CMD ["npm", "start"]
# npm spawns node as a child process. SIGTERM goes to npm (PID 1),
# which does NOT forward it to the node process.
# GOOD: Node is PID 1
CMD ["node", "server.js"]
# BETTER: Use tini as init process
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]
# tini properly forwards signals and reaps zombie processes.
Pitfall 2: Keep-Alive Connections After server.close()
──────────────────────────────────────────────────────
server.close() only stops ACCEPTING new connections.
Existing keep-alive connections remain open indefinitely!
Without connection tracking:
server.close() → callback never fires (keep-alive connections persist)
→ process never exits → SIGKILL after grace period
With connection tracking:
server.close() + idle connection cleanup → callback fires
→ clean exit
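Since Node 18.2, http.Server ships helpers that cover the idle-connection half of this pitfall out of the box — a sketch, assuming Node >= 18.2:

```typescript
import http from 'http';

const server = http.createServer((req, res) => res.end('ok'));

function drainBuiltIn(srv: http.Server): Promise<void> {
  return new Promise(resolve => {
    srv.close(() => resolve());   // stop accepting; fires once every socket is gone
    srv.closeIdleConnections();   // Node >= 18.2: end keep-alive sockets with no active request
    // srv.closeAllConnections(); // last resort — destroys in-flight responses too
  });
}
```

These built-ins don't replace the custom drainer entirely — there's no per-connection timeout or stats — but they eliminate the "callback never fires" hang for idle keep-alive sockets.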
Pitfall 3: Ignoring the preStop / Endpoint Propagation Race
────────────────────────────────────────────────────────────
Without preStop delay:
t=0: SIGTERM → app marks unready → starts draining
t=0-3s: Kubernetes STILL routing traffic to pod (endpoints not yet updated)
t=0-3s: New requests arrive → get rejected with 503
Result: Spike of 503 errors during deploys
With preStop delay:
t=0: preStop runs (sleep 5)
t=0-5s: Kubernetes endpoints update, traffic stops arriving
t=5s: SIGTERM → app starts draining (only in-flight, no new traffic)
Result: Zero 503 errors during deploys
Pitfall 4: Database Transactions During Shutdown
────────────────────────────────────────────────
If shutdown closes the DB pool while a transaction is in-progress:
BEGIN → INSERT → [pool.end() called] → COMMIT fails → data lost
Solution: Drain HTTP first (priority 20), THEN close DB (priority 80).
In-flight requests complete their transactions before the pool closes.
Comparing Shutdown Strategies
┌────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Strategy │ Zero Downtime? │ Complexity │ Best For │
├────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ Kill + Restart │ No (dropped reqs) │ None │ Dev only │
│ │ │ │ │
│ server.close() │ Partial (keep- │ Low │ Simple HTTP │
│ only │ alive connections │ │ servers │
│ │ may hang) │ │ │
│ server.close() + │ Yes for HTTP │ Medium │ Stateless HTTP │
│ connection drain │ (no WebSocket/ │ │ APIs │
│ │ workers) │ │ │
│ Full orchestrated │ Yes (all work │ High │ Production │
│ shutdown │ types drained) │ │ services │
│ │ │ │ │
│ + K8s preStop │ Yes (zero 503s │ High + K8s config │ K8s deployments │
│ │ during deploys) │ │ │
└────────────────────┴────────────────────┴────────────────────┴────────────────────┘
Interview Questions
Q1: Walk through what happens when Kubernetes sends SIGTERM to a pod during a rolling deployment. How do you ensure zero dropped requests?
When Kubernetes initiates a rolling deployment, the sequence is: (1) API server marks the pod as Terminating. (2) In parallel: the endpoints controller removes the pod from Service endpoints AND the kubelet runs the preStop hook. This is the critical detail — endpoint removal and preStop run concurrently. (3) The preStop hook should sleep for 5-10 seconds. This delay ensures all kube-proxy iptables/IPVS rules across all nodes are updated before the app starts draining. Without this, new traffic can arrive at the pod AFTER it starts rejecting requests. (4) After preStop completes, kubelet sends SIGTERM to PID 1. (5) The app catches SIGTERM, marks readiness probe as failing, stops accepting new connections via server.close(), and waits for in-flight requests to complete. (6) It then drains background workers, closes database pools, flushes logs, and calls process.exit(0). (7) If the process doesn't exit within terminationGracePeriodSeconds (default 30s), Kubernetes sends SIGKILL. To ensure zero drops: the connection drain timeout must be less than terminationGracePeriodSeconds minus the preStop delay. So with 30s grace and 5s preStop, you have 25s max for draining.
Q2: Why does Node.js need special handling for signals in Docker containers? What's the PID 1 problem?
In a Docker container, the main process runs as PID 1, and the Linux kernel treats PID 1 specially: the default signal dispositions do not apply. An ordinary process with no handler is terminated by SIGTERM, but PID 1 ignores SIGTERM unless it explicitly registers a handler. Node.js does not install one by default, so running node server.js as PID 1 only handles SIGTERM if the application registers process.on('SIGTERM') — which is exactly what the lifecycle manager above does. The second problem is npm: CMD ["npm", "start"] makes npm PID 1, and npm spawns node as a child process. When Docker sends SIGTERM to PID 1 (npm), npm does not forward it to the child node process. After the grace period, Docker sends SIGKILL, killing everything abruptly. Solutions: (1) Run Node directly: CMD ["node", "server.js"], with signal handlers registered in the app. (2) Use tini as an init process — tini properly forwards signals and reaps zombie processes. (3) Use Docker's --init flag, which injects tini automatically. Without proper signal delivery, graceful shutdown logic never executes.
Q3: What's the difference between liveness, readiness, and startup probes in Kubernetes? How do each behave during graceful shutdown?
Liveness probe: answers "is the process functioning?" A failure triggers a container restart. This is for detecting deadlocks, infinite loops, or corrupted state where the process is alive but non-functional. During shutdown, the liveness probe should PASS — the process is still running and intentionally shutting down. If liveness fails, Kubernetes restarts the container, which defeats graceful shutdown. Readiness probe: answers "can this instance handle traffic?" A failure removes the pod from Service endpoints so no new traffic is sent to it. During shutdown, readiness should immediately FAIL. The app sets a shutdown flag and the readiness endpoint returns 503. This tells Kubernetes to stop routing traffic while in-flight requests complete. Startup probe: answers "has the application finished initializing?" Used for slow-starting apps. While the startup probe is failing, liveness and readiness probes are suspended. Once it passes, the other probes take over. During shutdown, the startup probe is irrelevant — it already passed during initialization.
Q4: How do you handle long-running WebSocket connections during graceful shutdown?
WebSocket connections are fundamentally different from HTTP requests — they can last minutes, hours, or days. For graceful shutdown: (1) Send a close frame (opcode 0x8) to each connected client with a status code indicating server shutdown (1001 — "going away"). (2) Give clients a short window (5-10 seconds) to reconnect to another instance. (3) After the window, force-close remaining connections. On the client side: the WebSocket client should have automatic reconnection logic. When it receives a 1001 close frame, it reconnects immediately — the load balancer routes it to a healthy pod. For critical operations (like a collaborative editing session), the server should persist uncommitted state to a shared store (Redis, database) before closing the connection, so another instance can pick up where it left off. For truly long-lived connections (chat, live dashboards), consider implementing a "drain mode" where the server stops sending updates and tells the client to reconnect, rather than abruptly closing.
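The two-phase close can be sketched as follows, assuming a ws-style client API (close(code, reason), terminate(), readyState); the function name and injectable timer are illustrative:

```javascript
// Two-phase drain: send a polite 1001 close frame first, then force-close
// whatever is still open after the grace window. The timer function is
// injectable so the logic can be tested without real sockets.
function drainWebSockets(clients, graceMs, setTimer = setTimeout) {
  for (const c of clients) {
    c.close(1001, 'server shutting down'); // RFC 6455 close code: "going away"
  }
  return setTimer(() => {
    for (const c of clients) {
      if (c.readyState !== 3) c.terminate(); // 3 = CLOSED in the ws library
    }
  }, graceMs);
}
```

On SIGTERM the server would call drainWebSockets(wss.clients, 10_000) after marking itself unready, so reconnecting clients land on a healthy pod.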
Q5: How do you test graceful shutdown behavior? How can you verify zero dropped requests during deployments?
Testing strategies: (1) Unit test shutdown hooks: Call shutdown() programmatically in tests and verify that DB pool, Redis, and message consumers are closed in the correct order. Assert that hooks execute by priority. (2) Integration test with load: Use a load testing tool (k6, artillery) to send continuous requests while triggering shutdown. Verify that no requests get 502/503 errors and all responses are complete. (3) Signal test: In CI, start the server, send traffic, then kill -SIGTERM $PID. Assert the process exits with code 0 and all in-flight requests completed. (4) Kubernetes test: During a rolling deployment (kubectl rollout), run concurrent requests and count non-200 responses. Zero 5xx errors means graceful shutdown is working. (5) Chaos testing: Use tools like Chaos Mesh or LitmusChaos to randomly kill pods while under load. Monitor error rates in the observability stack. (6) Connection tracking verification: During tests, log connection states (active, idle, draining) and verify that idle connections are closed immediately, active connections complete naturally, and force-close only happens at the timeout deadline.
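Strategy (1) can be sketched as a plain unit test; createLifecycle, registerHook, and the priority numbers are hypothetical stand-ins for whatever hook registry the application actually uses:

```javascript
// Minimal hook registry: lower priority number runs first.
function createLifecycle() {
  const hooks = [];
  return {
    registerHook(name, priority, fn) { hooks.push({ name, priority, fn }); },
    async shutdown() {
      for (const h of [...hooks].sort((a, b) => a.priority - b.priority)) {
        await h.fn(); // sequential: each hook runs only after the previous one finished
      }
    },
  };
}

// Register hooks out of order; the test asserts they fire by priority.
const order = [];
const lc = createLifecycle();
lc.registerHook('close-db', 30, async () => order.push('close-db'));
lc.registerHook('mark-unready', 10, async () => order.push('mark-unready'));
lc.registerHook('drain-http', 20, async () => order.push('drain-http'));
```

The same pattern extends to asserting that each hook respects its individual timeout.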
Key Takeaways
- Graceful shutdown prevents data loss and user-facing errors: Stop accepting new work, finish in-flight work, clean up resources, then exit. The difference between 502 errors and zero-downtime deploys.
- SIGTERM is the polite termination signal: Kubernetes, Docker, and process managers send SIGTERM first. Your app must catch it and initiate shutdown. SIGKILL cannot be caught — it's the fallback.
- server.close() is necessary but not sufficient: It stops accepting new TCP connections but doesn't close existing keep-alive connections. You must track and drain active connections explicitly.
- Shutdown hooks must be prioritized: Health check → unready first, then stop accepting traffic, then drain workers, then close external connections, then flush logs. Order prevents resource conflicts.
- preStop hook solves the Kubernetes endpoint race condition: Endpoint removal and SIGTERM happen in parallel. A 5-second preStop delay ensures traffic stops arriving before the app starts draining.
- PID 1 matters in Docker: Don't use CMD ["npm", "start"] — npm doesn't forward signals. Use CMD ["node", "server.js"] or tini as an init process.
- Connection drain timeout must respect the grace period: If Kubernetes gives 30 seconds and preStop takes 5 seconds, your drain timeout must be under 25 seconds. Otherwise SIGKILL interrupts the drain.
- Readiness probe should fail immediately on shutdown; liveness should keep passing: Failing readiness removes the pod from traffic. Failing liveness restarts the pod, which disrupts graceful shutdown.
- Per-hook timeouts prevent one slow hook from blocking shutdown: If the database pool takes 60 seconds to close, the force-exit timer kills the process. Use individual timeouts plus a global deadline.
- Test shutdown under load: Graceful shutdown bugs only surface when the server is actively processing requests. A load test during SIGTERM is the only way to verify zero-downtime behavior.