Backend Graceful Shutdown & Lifecycle: Signal Handling, Connection Draining, Health Checks & Kubernetes Pod Lifecycle
Why Graceful Shutdown Matters
When a server process terminates abruptly, in-flight HTTP requests get dropped, database transactions are left incomplete, WebSocket connections die without close frames, and background jobs vanish mid-execution. Users see 502 errors. Data becomes inconsistent. The next deploy becomes a mini-outage.
Graceful shutdown means: when the process receives a termination signal, it stops accepting new work, finishes in-flight work, cleans up resources, and then exits. Zero dropped requests. Zero data corruption.
Ungraceful Shutdown:
┌───────────────┐
│ Server │
│ ┌───────────┐ │ SIGTERM
│ │ Request A │ │←────────────── kill -15 pid
│ │ (halfway) │ │
│ ├───────────┤ │ Process exits immediately
│ │ Request B │ │    ┌──────────────────────────────┐
│ │ (waiting  │ │───→│ Request A: 502 Bad Gateway   │
│ │ for DB)   │ │    │ Request B: Connection reset  │
│ ├───────────┤ │    │ DB transaction: half-written │
│ │ Cron job  │ │    │ Cron job: partial execution  │
│ │ (running) │ │    └──────────────────────────────┘
│ └───────────┘ │
└───────────────┘
Graceful Shutdown:
┌───────────────┐
│ Server │ SIGTERM
│ │←────────────── 1. Stop accepting new connections
│ ┌───────────┐ │ 2. Health check → unhealthy
│ │ Request A │→│──→ completes 3. Finish in-flight requests
│ │ │ │ 4. Close DB connections
│ ├───────────┤ │ 5. Flush logs/metrics
│ │ Request B │→│──→ completes 6. Exit with code 0
│ ├───────────┤ │
│ │ Cron job │→│──→ checkpoint
│ └───────────┘ │
└───────────────┘
└──→ exit(0) after all work completes
Signal Handling in Node.js
Unix Signals for Process Lifecycle:
Signal │ Default Action │ Can Catch? │ Typical Use
───────────┼────────────────┼────────────┼──────────────────────────
SIGTERM │ Terminate │ Yes │ Polite "please shut down"
SIGINT │ Terminate │ Yes │ Ctrl+C in terminal
SIGQUIT │ Core dump │ Yes │ Ctrl+\ (debugging)
SIGKILL │ Terminate │ NO │ Force kill (cannot catch)
SIGHUP │ Terminate │ Yes │ Terminal closed / reload config
SIGUSR1 │ Terminate │ Yes │ Node: start debugger
SIGUSR2 │ Terminate │ Yes │ Custom (e.g., heap dump)
Kubernetes sends:
1. SIGTERM → pod → terminationGracePeriodSeconds countdown starts (default: 30s)
2. If pod still running after grace period → SIGKILL (uncatchable)
Docker sends:
docker stop → SIGTERM → wait 10 seconds → SIGKILL
Signal Handler Implementation
type ShutdownHook = {
name: string;
handler: () => Promise<void>;
priority: number; // Lower = runs first
timeoutMs: number; // Max time for this hook
};
class ProcessLifecycleManager {
private hooks: ShutdownHook[] = [];
private isShuttingDown = false;
private shutdownPromise: Promise<void> | null = null;
private forceShutdownTimeoutMs: number;
constructor(options: { forceShutdownTimeoutMs?: number } = {}) {
this.forceShutdownTimeoutMs = options.forceShutdownTimeoutMs ?? 30000;
this.registerSignalHandlers();
}
private registerSignalHandlers(): void {
// Handle SIGTERM (Kubernetes, docker stop, kill)
process.on('SIGTERM', () => {
console.log('[lifecycle] Received SIGTERM');
this.shutdown('SIGTERM');
});
// Handle SIGINT (Ctrl+C)
process.on('SIGINT', () => {
console.log('[lifecycle] Received SIGINT');
this.shutdown('SIGINT');
});
// Handle uncaught exceptions — try to shut down gracefully
process.on('uncaughtException', (error: Error) => {
console.error('[lifecycle] Uncaught exception:', error);
this.shutdown('uncaughtException', 1);
});
// Handle unhandled promise rejections
process.on('unhandledRejection', (reason: any) => {
console.error('[lifecycle] Unhandled rejection:', reason);
this.shutdown('unhandledRejection', 1);
});
// Handle second SIGINT/SIGTERM as force shutdown
let signalCount = 0;
const secondSignalHandler = (signal: string) => {
signalCount++;
if (signalCount > 1) {
console.log('[lifecycle] Force shutdown (second signal received)');
process.exit(1);
}
};
process.on('SIGTERM', secondSignalHandler);
process.on('SIGINT', secondSignalHandler);
}
// Register a shutdown hook
addHook(hook: ShutdownHook): void {
this.hooks.push(hook);
this.hooks.sort((a, b) => a.priority - b.priority);
}
// Convenience method for simple hooks
onShutdown(name: string, handler: () => Promise<void>, priority = 100): void {
this.addHook({ name, handler, priority, timeoutMs: 10000 });
}
async shutdown(reason: string, exitCode: number = 0): Promise<void> {
// Prevent concurrent shutdowns
if (this.isShuttingDown) {
return this.shutdownPromise!;
}
this.isShuttingDown = true;
console.log(`[lifecycle] Shutting down (reason: ${reason})`);
this.shutdownPromise = this.executeShutdown(exitCode);
return this.shutdownPromise;
}
private async executeShutdown(exitCode: number): Promise<void> {
const startTime = Date.now();
// Set a hard deadline — if hooks don't finish, force exit
const forceTimer = setTimeout(() => {
console.error('[lifecycle] Force shutdown — hooks did not complete in time');
process.exit(1);
}, this.forceShutdownTimeoutMs);
// Don't let this timer keep the process alive
forceTimer.unref();
// Execute hooks in priority order
for (const hook of this.hooks) {
const hookStart = Date.now();
try {
console.log(`[lifecycle] Running hook: ${hook.name}`);
await Promise.race([
hook.handler(),
new Promise<void>((_, reject) =>
setTimeout(
() => reject(new Error(`Hook ${hook.name} timed out`)),
hook.timeoutMs
)
)
]);
const elapsed = Date.now() - hookStart;
console.log(`[lifecycle] Hook ${hook.name} completed (${elapsed}ms)`);
} catch (error) {
console.error(`[lifecycle] Hook ${hook.name} failed:`, error);
// Continue with other hooks even if one fails
}
}
const totalElapsed = Date.now() - startTime;
console.log(`[lifecycle] All hooks completed (${totalElapsed}ms). Exiting.`);
clearTimeout(forceTimer);
process.exit(exitCode);
}
get shuttingDown(): boolean {
return this.isShuttingDown;
}
}
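The class's core behavior — run hooks in priority order, bound each by a timeout, keep going when one fails — can be distilled into a standalone sketch. Names such as MiniHook and runHooksInOrder are illustrative, not part of the class above; the sketch also clears the watchdog timer once the race settles, a cleanup the class can skip because it exits the process anyway.

```typescript
type MiniHook = {
  name: string;
  priority: number; // lower runs first
  run: () => Promise<void>;
  timeoutMs: number;
};

// Race a promise against a watchdog, then cancel the watchdog either way
async function withTimeout(p: Promise<void>, ms: number, label: string): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<void>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out`)), ms);
  });
  try {
    await Promise.race([p, watchdog]);
  } finally {
    clearTimeout(timer); // no stray rejection after the hook finishes on time
  }
}

async function runHooksInOrder(hooks: MiniHook[]): Promise<string[]> {
  const completed: string[] = [];
  for (const hook of [...hooks].sort((a, b) => a.priority - b.priority)) {
    try {
      await withTimeout(hook.run(), hook.timeoutMs, hook.name);
      completed.push(hook.name);
    } catch {
      completed.push(`${hook.name} (failed)`); // a failed hook never blocks later ones
    }
  }
  return completed;
}
```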
HTTP Server Connection Draining
Connection Draining Timeline:
SIGTERM received
│
▼
t=0s: Stop listening for new connections
server.close() — stops accepting new TCP connections
│
▼
t=0s: Existing connections continue processing
┌─────────────────────────────────────────────┐
│ In-flight Request A: processing → response │
│ In-flight Request B: processing → response │
│ Keep-Alive Conn C: idle → close │
│ WebSocket D: close frame → disconnect │
└─────────────────────────────────────────────┘
│
▼
t=5s: All in-flight requests completed
All connections drained
│
▼
t=5s: server 'close' event fires
└─→ proceed with remaining shutdown hooks
Edge Case: Long-Running Requests
What if a request takes 60s but grace period is 30s?
t=0 t=25s t=30s
│ │ │
▼ ▼ ▼
SIGTERM Connection-level SIGKILL
timeout fires (Kubernetes)
→ 503 response → Process dies
→ Socket destroyed
You must enforce a per-connection timeout shorter than the grace period.
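Node's http.Server exposes built-in timeouts that cover part of this requirement without custom code. A sketch with assumed values tuned to a 30s grace period — note that requestTimeout limits how long the client may take to send the request, and server.setTimeout covers socket inactivity; neither is a hard deadline on slow handler code, which still needs an application-level timeout:

```typescript
import http from 'http';

const server = http.createServer((req, res) => res.end('ok'));

// Client must deliver the entire request within 25s (< the 30s grace period)
server.requestTimeout = 25_000;
// Idle keep-alive sockets are torn down after 5s, so drains finish quickly
server.keepAliveTimeout = 5_000;
// Socket inactivity timeout; without a 'timeout' listener, the socket is destroyed
server.setTimeout(25_000);
```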
HTTP Connection Drainer
import { Server, IncomingMessage, ServerResponse, Socket } from 'http';
class HttpConnectionDrainer {
private server: Server;
private activeConnections: Map<Socket, ConnectionInfo> = new Map();
private requestCount = 0;
private isClosing = false;
constructor(server: Server, private options: DrainerOptions = {}) {
this.server = server;
this.options = {
connectionTimeoutMs: 25000, // Must be < Kubernetes grace period
keepAliveTimeoutMs: 5000, // Quickly close idle keep-alive connections
...options
};
this.trackConnections();
}
private trackConnections(): void {
// Track every TCP connection
this.server.on('connection', (socket: Socket) => {
const info: ConnectionInfo = {
socket,
activeRequests: 0,
connectedAt: Date.now(),
lastActivity: Date.now()
};
this.activeConnections.set(socket, info);
socket.on('close', () => {
this.activeConnections.delete(socket);
});
});
// Track request lifecycle on each connection
this.server.on('request', (req: IncomingMessage, res: ServerResponse) => {
const socket = req.socket;
const info = this.activeConnections.get(socket);
if (info) {
info.activeRequests++;
info.lastActivity = Date.now();
}
this.requestCount++;
// When response finishes, decrement active request count
res.on('finish', () => {
if (info) {
info.activeRequests--;
info.lastActivity = Date.now();
}
// If we're closing and this connection has no more requests, close it
if (this.isClosing && info && info.activeRequests === 0) {
socket.end(); // Gracefully close the TCP connection
}
});
// If shutting down, add "Connection: close" header
// This tells the client not to reuse this connection
if (this.isClosing) {
res.setHeader('Connection', 'close');
}
});
}
async drain(): Promise<void> {
this.isClosing = true;
return new Promise<void>((resolve) => {
// 1. Stop accepting new connections
this.server.close(() => {
// All connections closed
resolve();
});
// 2. Close idle keep-alive connections immediately
for (const [socket, info] of this.activeConnections) {
if (info.activeRequests === 0) {
socket.end(); // Graceful close for idle connections
}
}
// 3. Set connection-level timeout for in-flight requests
const timeout = setTimeout(() => {
// Force-close remaining connections
for (const [socket, info] of this.activeConnections) {
console.warn(
`[drain] Force-closing connection with ${info.activeRequests} active requests`
);
socket.destroy(); // Hard close — response lost
}
resolve();
}, this.options.connectionTimeoutMs);
timeout.unref();
});
}
getStats(): ConnectionStats {
let activeRequests = 0;
let idleConnections = 0;
for (const [, info] of this.activeConnections) {
activeRequests += info.activeRequests;
if (info.activeRequests === 0) idleConnections++;
}
return {
totalConnections: this.activeConnections.size,
activeRequests,
idleConnections,
isClosing: this.isClosing,
totalRequestsServed: this.requestCount
};
}
}
interface ConnectionInfo {
socket: Socket;
activeRequests: number;
connectedAt: number;
lastActivity: number;
}
interface DrainerOptions {
connectionTimeoutMs?: number;
keepAliveTimeoutMs?: number;
}
interface ConnectionStats {
totalConnections: number;
activeRequests: number;
idleConnections: number;
isClosing: boolean;
totalRequestsServed: number;
}
Health Check Integration
Readiness vs Liveness:
Liveness Probe: "Is the process alive?"
Failed → Kubernetes RESTARTS the pod
Example: process deadlocked, out of memory
Readiness Probe: "Can the process handle traffic?"
Failed → Kubernetes REMOVES pod from service endpoints
Example: shutting down, warming cache, DB connecting
During Graceful Shutdown:
┌──────────────────────────────────────────────────────────────┐
│ │
│ t=0: SIGTERM received │
│ readiness → FAIL │
│ liveness → PASS (process is still alive!) │
│ │
│ t=0-5s: Kubernetes sees failed readiness │
│ Removes pod from Service endpoints │
│ Load balancer stops sending NEW requests │
│ But in-flight requests continue processing │
│ │
│ t=5s: All in-flight requests complete │
│ DB connections closed │
│ Logs flushed │
│ │
│ t=5s: Process exits with code 0 │
│ │
│ If t > terminationGracePeriodSeconds: │
│ SIGKILL — forced termination │
│ │
└──────────────────────────────────────────────────────────────┘
CRITICAL RACE CONDITION:
Kubernetes endpoint update propagation is NOT instant.
After readiness fails, it takes ~1-5 seconds for the
kube-proxy / iptables / IPVS rules to update across
all nodes. During this window, new requests may still
arrive at the pod.
Solution: preStop hook with sleep
pod.spec.containers[].lifecycle.preStop:
exec:
command: ["sh", "-c", "sleep 5"]
This delays SIGTERM by 5 seconds, giving time for
endpoints to propagate before the app starts draining.
Health Check Controller
interface HealthStatus {
status: 'healthy' | 'degraded' | 'unhealthy';
checks: Record<string, CheckResult>;
timestamp: string;
uptime: number;
}
interface CheckResult {
status: 'pass' | 'warn' | 'fail';
message?: string;
latencyMs?: number;
}
type HealthCheck = {
name: string;
check: () => Promise<CheckResult>;
critical: boolean; // If critical check fails → unhealthy
intervalMs?: number; // Background check interval
};
class HealthCheckController {
private checks: HealthCheck[] = [];
private cachedResults: Map<string, CheckResult> = new Map();
private isReady = true;
private isAlive = true;
private startTime = Date.now();
private backgroundTimers: NodeJS.Timeout[] = [];
addCheck(check: HealthCheck): void {
this.checks.push(check);
// Run background checks periodically
if (check.intervalMs) {
const timer = setInterval(async () => {
try {
const result = await Promise.race([
check.check(),
new Promise<CheckResult>((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), 5000)
)
]);
this.cachedResults.set(check.name, result);
} catch (error) {
this.cachedResults.set(check.name, {
status: 'fail',
message: String(error)
});
}
}, check.intervalMs);
timer.unref();
this.backgroundTimers.push(timer);
}
}
// Called during shutdown
markUnready(): void {
this.isReady = false;
}
markUnhealthy(): void {
this.isAlive = false;
}
// Liveness endpoint: GET /health/live
async getLiveness(): Promise<HealthStatus> {
return {
status: this.isAlive ? 'healthy' : 'unhealthy',
checks: {},
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
// Readiness endpoint: GET /health/ready
async getReadiness(): Promise<HealthStatus> {
if (!this.isReady) {
return {
status: 'unhealthy',
checks: { shutdown: { status: 'fail', message: 'Shutting down' } },
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
const results: Record<string, CheckResult> = {};
let overallHealthy = true;
for (const check of this.checks) {
// Use cached result if available
const cached = this.cachedResults.get(check.name);
if (cached) {
results[check.name] = cached;
} else {
try {
const start = Date.now();
const result = await check.check();
result.latencyMs = Date.now() - start;
results[check.name] = result;
} catch (error) {
results[check.name] = {
status: 'fail',
message: String(error)
};
}
}
if (results[check.name].status === 'fail' && check.critical) {
overallHealthy = false;
}
}
return {
status: overallHealthy ? 'healthy' : 'unhealthy',
checks: results,
timestamp: new Date().toISOString(),
uptime: Date.now() - this.startTime
};
}
// Startup endpoint: GET /health/startup
// Used for slow-starting apps (Kubernetes startupProbe)
async getStartup(): Promise<HealthStatus> {
// Could check: DB connected, cache warmed, config loaded, etc.
return this.getReadiness();
}
dispose(): void {
for (const timer of this.backgroundTimers) {
clearInterval(timer);
}
}
}
// Common health checks
function databaseHealthCheck(pool: any): HealthCheck {
return {
name: 'database',
critical: true,
intervalMs: 10000,
check: async () => {
try {
const start = Date.now();
await pool.query('SELECT 1');
return {
status: 'pass',
latencyMs: Date.now() - start
};
} catch (error) {
return {
status: 'fail',
message: `Database unreachable: ${error}`
};
}
}
};
}
function redisHealthCheck(redis: any): HealthCheck {
return {
name: 'redis',
critical: false, // Non-critical — app can work without cache
intervalMs: 15000,
check: async () => {
try {
const start = Date.now();
await redis.ping();
return {
status: 'pass',
latencyMs: Date.now() - start
};
} catch {
return {
status: 'warn',
message: 'Redis unavailable — falling back to in-memory cache'
};
}
}
};
}
function diskSpaceHealthCheck(thresholdPercent = 90): HealthCheck {
return {
name: 'disk',
critical: true,
intervalMs: 60000,
check: async () => {
// execSync blocks the event loop — acceptable only because this runs on a 60s interval
const { execSync } = require('child_process');
const output = execSync("df -h / | tail -1 | awk '{print $5}'").toString().trim();
const usedPercent = parseInt(output.replace('%', ''), 10);
if (usedPercent > thresholdPercent) {
return {
status: 'fail',
message: `Disk usage at ${usedPercent}% (threshold: ${thresholdPercent}%)`
};
}
return {
status: usedPercent > thresholdPercent - 10 ? 'warn' : 'pass',
message: `Disk usage: ${usedPercent}%`
};
}
};
}
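The HealthStatus type allows 'degraded', but getReadiness above only ever reports healthy or unhealthy. A hypothetical aggregation helper showing how all three levels could be derived from check results (critical failures → unhealthy; any warn or non-critical failure → degraded):

```typescript
type CheckOutcome = { status: 'pass' | 'warn' | 'fail'; critical: boolean };

function aggregateStatus(results: CheckOutcome[]): 'healthy' | 'degraded' | 'unhealthy' {
  if (results.some(r => r.status === 'fail' && r.critical)) return 'unhealthy';
  if (results.some(r => r.status !== 'pass')) return 'degraded'; // e.g. Redis down, cache disabled
  return 'healthy';
}
```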
Complete Graceful Shutdown Orchestrator
Shutdown Phase Ordering:
Phase 1: Health Check → Unready
(Load balancer stops sending new traffic)
Wait for endpoint propagation (~5 seconds)
Phase 2: Stop HTTP Server
server.close() — stops accepting new TCP connections
Phase 3: Drain Connections
Wait for in-flight requests to complete
Force-close after timeout
Phase 4: Stop Background Workers
Cron jobs checkpoint and halt
Message consumers stop consuming
Worker threads send completion signal
Phase 5: Close External Connections
Database connection pools drain
Redis/cache connections close
Message broker connections close
Phase 6: Flush Buffers
Logs flushed to disk/remote
Metrics pushed to collector
Analytics events sent
Phase 7: Exit
process.exit(0)
Priority Map:
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ P:10 │ │ P:20 │ │ P:50 │ │ P:80 │ │ P:90 │
│ Health → │→│ HTTP │→│ Background │→│ External │→│ Flush │
│ Unready │ │ Server │ │ Workers │ │ Connections│ │ Buffers │
│ │ │ Close │ │ Halt │ │ Close │ │ & Exit │
└────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘
import http from 'http';
import { EventEmitter } from 'events';
interface ServerComponents {
httpServer: http.Server;
healthCheck: HealthCheckController;
connectionDrainer: HttpConnectionDrainer;
dbPool?: any; // Database connection pool
redis?: any; // Redis client
messageConsumer?: any; // Message queue consumer
workerManager?: WorkerManager;
metricsCollector?: MetricsCollector;
logger?: FlushableLogger;
}
class GracefulShutdownOrchestrator {
private lifecycle: ProcessLifecycleManager;
constructor(private components: ServerComponents) {
this.lifecycle = new ProcessLifecycleManager({
forceShutdownTimeoutMs: 28000 // Must be < K8s terminationGracePeriodSeconds
});
this.registerShutdownHooks();
}
private registerShutdownHooks(): void {
const {
healthCheck, httpServer, connectionDrainer,
dbPool, redis, messageConsumer,
workerManager, metricsCollector, logger
} = this.components;
// Phase 1: Mark as unready (Priority 10)
this.lifecycle.addHook({
name: 'mark-unready',
priority: 10,
timeoutMs: 6000, // must exceed the 5s propagation sleep below
handler: async () => {
healthCheck.markUnready();
// Wait for endpoint propagation
// In Kubernetes, use preStop hook instead
await sleep(5000);
}
});
// Phase 2: Stop accepting connections (Priority 20)
this.lifecycle.addHook({
name: 'stop-http-server',
priority: 20,
timeoutMs: 26000, // must exceed the drainer's connectionTimeoutMs (25s default)
handler: async () => {
await connectionDrainer.drain();
}
});
// Phase 3: Stop message consumers (Priority 30)
if (messageConsumer) {
this.lifecycle.addHook({
name: 'stop-message-consumer',
priority: 30,
timeoutMs: 10000,
handler: async () => {
// Stop consuming but finish processing current messages
await messageConsumer.close();
}
});
}
// Phase 4: Stop background workers (Priority 50)
if (workerManager) {
this.lifecycle.addHook({
name: 'stop-workers',
priority: 50,
timeoutMs: 15000,
handler: async () => {
await workerManager.shutdownAll();
}
});
}
// Phase 5: Close database pool (Priority 80)
if (dbPool) {
this.lifecycle.addHook({
name: 'close-database',
priority: 80,
timeoutMs: 5000,
handler: async () => {
await dbPool.end();
}
});
}
// Phase 5: Close Redis (Priority 80)
if (redis) {
this.lifecycle.addHook({
name: 'close-redis',
priority: 80,
timeoutMs: 3000,
handler: async () => {
await redis.quit();
}
});
}
// Phase 6: Flush metrics (Priority 90)
if (metricsCollector) {
this.lifecycle.addHook({
name: 'flush-metrics',
priority: 90,
timeoutMs: 5000,
handler: async () => {
await metricsCollector.flush();
}
});
}
// Phase 6: Flush logs (Priority 95)
if (logger) {
this.lifecycle.addHook({
name: 'flush-logs',
priority: 95,
timeoutMs: 3000,
handler: async () => {
await logger.flush();
}
});
}
// Dispose health check timers
this.lifecycle.addHook({
name: 'dispose-healthcheck',
priority: 99,
timeoutMs: 1000,
handler: async () => {
healthCheck.dispose();
}
});
}
// Check if system is shutting down (for middleware use)
get isShuttingDown(): boolean {
return this.lifecycle.shuttingDown;
}
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Worker Manager — manages background task workers
class WorkerManager {
private workers: Map<string, ManagedWorker> = new Map();
register(name: string, worker: ManagedWorker): void {
this.workers.set(name, worker);
}
async shutdownAll(): Promise<void> {
const shutdownPromises = Array.from(this.workers.entries()).map(
async ([name, worker]) => {
try {
console.log(`[worker] Stopping worker: ${name}`);
// Signal worker to stop after current task
worker.stop();
// Wait for the current task to finish
await Promise.race([
worker.waitForIdle(),
sleep(10000) // Max wait per worker
]);
console.log(`[worker] Worker stopped: ${name}`);
} catch (error) {
console.error(`[worker] Error stopping ${name}:`, error);
}
}
);
await Promise.all(shutdownPromises);
}
}
interface ManagedWorker {
stop(): void; // Signal to stop after current task
waitForIdle(): Promise<void>; // Resolves when no task is running
}
class IntervalWorker implements ManagedWorker {
private timer?: NodeJS.Timeout;
private running = false;
private shouldStop = false;
private idleResolvers: Array<() => void> = [];
constructor(
private name: string,
private task: () => Promise<void>,
private intervalMs: number
) {
this.start();
}
private start(): void {
this.timer = setInterval(async () => {
if (this.shouldStop || this.running) return;
this.running = true;
try {
await this.task();
} catch (error) {
console.error(`[worker:${this.name}] Error:`, error);
} finally {
this.running = false;
if (this.shouldStop) {
// Notify waiters that we're idle
for (const resolve of this.idleResolvers) {
resolve();
}
this.idleResolvers = [];
}
}
}, this.intervalMs);
}
stop(): void {
this.shouldStop = true;
if (this.timer) {
clearInterval(this.timer);
}
// If already idle, resolve immediately
if (!this.running) {
for (const resolve of this.idleResolvers) {
resolve();
}
this.idleResolvers = [];
}
}
waitForIdle(): Promise<void> {
if (!this.running) return Promise.resolve();
return new Promise(resolve => {
this.idleResolvers.push(resolve);
});
}
}
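The contract every ManagedWorker follows — finish the current task, then stop at the next checkpoint — can be sketched as a standalone loop (illustrative helper, not part of the classes above):

```typescript
// Process items until a stop is requested; checks the flag BETWEEN tasks,
// never aborting one mid-flight. Returns how many items were handled.
async function processUntilStopped<T>(
  items: T[],
  handle: (item: T) => Promise<void>,
  isStopping: () => boolean
): Promise<number> {
  let processed = 0;
  for (const item of items) {
    if (isStopping()) break; // checkpoint between tasks
    await handle(item);
    processed++;
  }
  return processed;
}
```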
interface MetricsCollector {
flush(): Promise<void>;
}
interface FlushableLogger {
flush(): Promise<void>;
}
Kubernetes Pod Lifecycle Integration
Kubernetes Pod Termination Sequence:
1. API Server marks pod as "Terminating"
│
2. Endpoints controller REMOVES pod from Service endpoints
│ IN PARALLEL
3. kubelet sees pod state, runs preStop hook ──────────────────────┐
│ │
│ preStop: sleep 5 ← Gives time for endpoints to propagate │
│ │
4. kubelet sends SIGTERM to container PID 1 │
│ │
5. terminationGracePeriodSeconds countdown starts (default: 30s) │
│ │
│ App: mark unready → drain connections → cleanup │
│ │
6. If still running when grace period expires: SIGKILL │
└───────────────────────────────────────────────────────────────┘
CRITICAL: Steps 2 and 3 happen IN PARALLEL, not sequentially.
The preStop hook delays SIGTERM so that by the time the app
starts draining, the Service endpoints are already updated.
Kubernetes Manifest:
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
terminationGracePeriodSeconds: 30 # Total grace period
containers:
- name: api-server
lifecycle:
preStop: # Runs BEFORE SIGTERM
exec:
command: ["sh", "-c", "sleep 5"] # Wait for endpoint propagation
readinessProbe:
httpGet:
path: /health/ready
port: 8080
periodSeconds: 5
failureThreshold: 1 # Fail fast on shutdown
livenessProbe:
httpGet:
path: /health/live
port: 8080
periodSeconds: 10
failureThreshold: 3 # Allow temporary issues
startupProbe:
httpGet:
path: /health/startup
port: 8080
periodSeconds: 5
failureThreshold: 30 # Allow 150s for slow startup
Timeline With preStop:
─────────────────────────────────────────────────
t=0s Pod marked Terminating + preStop starts
t=0-5s Endpoints propagating (pod still receiving traffic)
preStop: sleeping
t=5s preStop completes → SIGTERM sent to app
t=5s App: readiness → fail, drain starts
t=5-25s App: draining connections, cleanup
t=25s App exits with code 0
─────────────────────────────────────────────────
t=30s (deadline — SIGKILL would fire if app hadn't exited)
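The timeline's arithmetic is worth encoding next to the config so the two can't drift apart. All values below are assumptions mirroring the manifest above:

```typescript
// Derive the drain budget from the Kubernetes settings (assumed values)
const terminationGracePeriodSeconds = 30; // Deployment spec
const preStopSleepSeconds = 5;            // preStop: sleep 5
const safetyMarginSeconds = 2;            // slack for cleanup hooks and log flushing

const drainBudgetMs =
  (terminationGracePeriodSeconds - preStopSleepSeconds - safetyMarginSeconds) * 1000;
// drainBudgetMs is the ceiling for the drainer's connectionTimeoutMs
```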
Putting It All Together: Express/Fastify Server
import http from 'http';
import express from 'express';
async function main() {
const app = express();
// Create HTTP server
const server = http.createServer(app);
// Initialize components
const healthCheck = new HealthCheckController();
const connectionDrainer = new HttpConnectionDrainer(server, {
connectionTimeoutMs: 20000
});
// Simulate external dependencies
const dbPool = await createDatabasePool();
const redis = await createRedisClient();
// Register health checks
healthCheck.addCheck(databaseHealthCheck(dbPool));
healthCheck.addCheck(redisHealthCheck(redis));
// Background workers
const workerManager = new WorkerManager();
workerManager.register('metrics-aggregation', new IntervalWorker(
'metrics-aggregation',
async () => { /* aggregate metrics */ },
60000
));
// Setup graceful shutdown
const orchestrator = new GracefulShutdownOrchestrator({
httpServer: server,
healthCheck,
connectionDrainer,
dbPool,
redis,
workerManager
});
// Health endpoints
app.get('/health/live', async (req, res) => {
const health = await healthCheck.getLiveness();
res.status(health.status === 'healthy' ? 200 : 503).json(health);
});
app.get('/health/ready', async (req, res) => {
const health = await healthCheck.getReadiness();
res.status(health.status === 'healthy' ? 200 : 503).json(health);
});
// Middleware: reject requests during shutdown
app.use((req, res, next) => {
if (orchestrator.isShuttingDown) {
res.setHeader('Connection', 'close');
res.setHeader('Retry-After', '5'); // standard header counterpart of the body field
res.status(503).json({
error: 'Service shutting down',
retryAfter: 5
});
return;
}
next();
});
// Application routes
app.get('/api/data', async (req, res) => {
const data = await dbPool.query('SELECT * FROM data LIMIT 10');
res.json(data.rows);
});
// Start server
const PORT = Number(process.env.PORT) || 3000;
server.listen(PORT, () => {
console.log(`Server listening on port ${PORT}`);
console.log('Readiness probe: GET /health/ready');
console.log('Liveness probe: GET /health/live');
console.log('Graceful shutdown configured (SIGTERM/SIGINT handled)');
});
}
// Placeholder factory functions
async function createDatabasePool(): Promise<any> {
return {
query: async (sql: string) => ({ rows: [] }),
end: async () => console.log('[db] Pool closed')
};
}
async function createRedisClient(): Promise<any> {
return {
ping: async () => 'PONG',
quit: async () => console.log('[redis] Connection closed')
};
}
main().catch(console.error);
Common Pitfalls
Pitfall 1: PID 1 Problem in Docker
──────────────────────────────────
# BAD: npm start is PID 1 — Node doesn't handle signals properly
CMD ["npm", "start"]
# npm spawns node as a child process. SIGTERM goes to npm (PID 1),
# which does NOT forward it to the node process.
# GOOD: Node is PID 1
CMD ["node", "server.js"]
# BETTER: Use tini as init process
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]
# tini properly forwards signals and reaps zombie processes.
Pitfall 2: Keep-Alive Connections After server.close()
──────────────────────────────────────────────────────
server.close() only stops ACCEPTING new connections.
Existing keep-alive connections remain open indefinitely!
Without connection tracking:
server.close() → callback never fires (keep-alive connections persist)
→ process never exits → SIGKILL after grace period
With connection tracking:
server.close() + idle connection cleanup → callback fires
→ clean exit
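Since Node 18.2, http.Server ships helpers that cover the idle-connection half of this pitfall out of the box — a sketch, assuming Node >= 18.2:

```typescript
import http from 'http';

const server = http.createServer((req, res) => res.end('ok'));

function drainBuiltIn(srv: http.Server): Promise<void> {
  return new Promise(resolve => {
    srv.close(() => resolve());   // stop accepting; fires once every socket is gone
    srv.closeIdleConnections();   // Node >= 18.2: end keep-alive sockets with no active request
    // srv.closeAllConnections(); // last resort — destroys in-flight responses too
  });
}
```

These built-ins don't replace the custom drainer entirely — there's no per-connection timeout or stats — but they eliminate the "callback never fires" hang for idle keep-alive sockets.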
Pitfall 3: Ignoring the preStop / Endpoint Propagation Race
────────────────────────────────────────────────────────────
Without preStop delay:
t=0: SIGTERM → app marks unready → starts draining
t=0-3s: Kubernetes STILL routing traffic to pod (endpoints not yet updated)
t=0-3s: New requests arrive → get rejected with 503
Result: Spike of 503 errors during deploys
With preStop delay:
t=0: preStop runs (sleep 5)
t=0-5s: Kubernetes endpoints update, traffic stops arriving
t=5s: SIGTERM → app starts draining (only in-flight, no new traffic)
Result: Zero 503 errors during deploys
Pitfall 4: Database Transactions During Shutdown
────────────────────────────────────────────────
If shutdown closes the DB pool while a transaction is in-progress:
BEGIN → INSERT → [pool.end() called] → COMMIT fails → data lost
Solution: Drain HTTP first (priority 20), THEN close DB (priority 80).
In-flight requests complete their transactions before the pool closes.
Comparing Shutdown Strategies
┌────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Strategy │ Zero Downtime? │ Complexity │ Best For │
├────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ Kill + Restart │ No (dropped reqs) │ None │ Dev only │
│ │ │ │ │
│ server.close() │ Partial (keep- │ Low │ Simple HTTP │
│ only │ alive connections │ │ servers │
│ │ may hang) │ │ │
│ server.close() + │ Yes for HTTP │ Medium │ Stateless HTTP │
│ connection drain │ (no WebSocket/ │ │ APIs │
│ │ workers) │ │ │
│ Full orchestrated │ Yes (all work │ High │ Production │
│ shutdown │ types drained) │ │ services │
│ │ │ │ │
│ + K8s preStop │ Yes (zero 503s │ High + K8s config │ K8s deployments │
│ │ during deploys) │ │ │
└────────────────────┴────────────────────┴────────────────────┴────────────────────┘
Interview Questions
Q1: Walk through what happens when Kubernetes sends SIGTERM to a pod during a rolling deployment. How do you ensure zero dropped requests?
When Kubernetes initiates a rolling deployment, the sequence is: (1) API server marks the pod as Terminating. (2) In parallel: the endpoints controller removes the pod from Service endpoints AND the kubelet runs the preStop hook. This is the critical detail — endpoint removal and preStop run concurrently. (3) The preStop hook should sleep for 5-10 seconds. This delay ensures all kube-proxy iptables/IPVS rules across all nodes are updated before the app starts draining. Without this, new traffic can arrive at the pod AFTER it starts rejecting requests. (4) After preStop completes, kubelet sends SIGTERM to PID 1. (5) The app catches SIGTERM, marks readiness probe as failing, stops accepting new connections via server.close(), and waits for in-flight requests to complete. (6) It then drains background workers, closes database pools, flushes logs, and calls process.exit(0). (7) If the process doesn't exit within terminationGracePeriodSeconds (default 30s), Kubernetes sends SIGKILL. To ensure zero drops: the connection drain timeout must be less than terminationGracePeriodSeconds minus the preStop delay. So with 30s grace and 5s preStop, you have 25s max for draining.
Q2: Why does Node.js need special handling for signals in Docker containers? What's the PID 1 problem?
In a Docker container, the main process runs as PID 1, and the Linux kernel treats PID 1 specially: the default signal dispositions do not apply. An ordinary process with no handler is terminated by SIGTERM, but PID 1 ignores SIGTERM unless it explicitly registers a handler. Node.js does not install one by default, so running node server.js as PID 1 only handles SIGTERM if the application registers process.on('SIGTERM') — which is exactly what the lifecycle manager above does. The second problem is npm: CMD ["npm", "start"] makes npm PID 1, and npm spawns node as a child process. When Docker sends SIGTERM to PID 1 (npm), npm does not forward it to the child node process. After the grace period, Docker sends SIGKILL, killing everything abruptly. Solutions: (1) Run Node directly: CMD ["node", "server.js"], with signal handlers registered in the app. (2) Use tini as an init process — tini properly forwards signals and reaps zombie processes. (3) Use Docker's --init flag, which injects tini automatically. Without proper signal delivery, graceful shutdown logic never executes.
Q3: What's the difference between liveness, readiness, and startup probes in Kubernetes? How do each behave during graceful shutdown?
Liveness probe: answers "is the process functioning?" A failure triggers a container restart. This is for detecting deadlocks, infinite loops, or corrupted state where the process is alive but non-functional. During shutdown, the liveness probe should PASS — the process is still running and intentionally shutting down. If liveness fails, Kubernetes restarts the container, which defeats graceful shutdown. Readiness probe: answers "can this instance handle traffic?" A failure removes the pod from Service endpoints so no new traffic is sent to it. During shutdown, readiness should immediately FAIL. The app sets a shutdown flag and the readiness endpoint returns 503. This tells Kubernetes to stop routing traffic while in-flight requests complete. Startup probe: answers "has the application finished initializing?" Used for slow-starting apps. While the startup probe is failing, liveness and readiness probes are suspended. Once it passes, the other probes take over. During shutdown, the startup probe is irrelevant — it already passed during initialization.
Q4: How do you handle long-running WebSocket connections during graceful shutdown?
WebSocket connections are fundamentally different from HTTP requests — they can last minutes, hours, or days. For graceful shutdown: (1) Send a close frame (opcode 0x8) to each connected client with a status code indicating server shutdown (1001 — "going away"). (2) Give clients a short window (5-10 seconds) to reconnect to another instance. (3) After the window, force-close remaining connections. On the client side: the WebSocket client should have automatic reconnection logic. When it receives a 1001 close frame, it reconnects immediately — the load balancer routes it to a healthy pod. For critical operations (like a collaborative editing session), the server should persist uncommitted state to a shared store (Redis, database) before closing the connection, so another instance can pick up where it left off. For truly long-lived connections (chat, live dashboards), consider implementing a "drain mode" where the server stops sending updates and tells the client to reconnect, rather than abruptly closing.
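The two-phase close can be sketched as follows, assuming a ws-style client API (close(code, reason), terminate(), readyState); the function name and injectable timer are illustrative:

```javascript
// Two-phase drain: send a polite 1001 close frame first, then force-close
// whatever is still open after the grace window. The timer function is
// injectable so the logic can be tested without real sockets.
function drainWebSockets(clients, graceMs, setTimer = setTimeout) {
  for (const c of clients) {
    c.close(1001, 'server shutting down'); // RFC 6455 close code: "going away"
  }
  return setTimer(() => {
    for (const c of clients) {
      if (c.readyState !== 3) c.terminate(); // 3 = CLOSED in the ws library
    }
  }, graceMs);
}
```

On SIGTERM the server would call drainWebSockets(wss.clients, 10_000) after marking itself unready, so reconnecting clients land on a healthy pod.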
Q5: How do you test graceful shutdown behavior? How can you verify zero dropped requests during deployments?
Testing strategies: (1) Unit test shutdown hooks: Call shutdown() programmatically in tests and verify that DB pool, Redis, and message consumers are closed in the correct order. Assert that hooks execute by priority. (2) Integration test with load: Use a load testing tool (k6, artillery) to send continuous requests while triggering shutdown. Verify that no requests get 502/503 errors and all responses are complete. (3) Signal test: In CI, start the server, send traffic, then kill -SIGTERM $PID. Assert the process exits with code 0 and all in-flight requests completed. (4) Kubernetes test: During a rolling deployment (kubectl rollout), run concurrent requests and count non-200 responses. Zero 5xx errors means graceful shutdown is working. (5) Chaos testing: Use tools like Chaos Mesh or LitmusChaos to randomly kill pods while under load. Monitor error rates in the observability stack. (6) Connection tracking verification: During tests, log connection states (active, idle, draining) and verify that idle connections are closed immediately, active connections complete naturally, and force-close only happens at the timeout deadline.
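Strategy (1) can be sketched as a plain unit test; createLifecycle, registerHook, and the priority numbers are hypothetical stand-ins for whatever hook registry the application actually uses:

```javascript
// Minimal hook registry: lower priority number runs first.
function createLifecycle() {
  const hooks = [];
  return {
    registerHook(name, priority, fn) { hooks.push({ name, priority, fn }); },
    async shutdown() {
      for (const h of [...hooks].sort((a, b) => a.priority - b.priority)) {
        await h.fn(); // sequential: each hook runs only after the previous one finished
      }
    },
  };
}

// Register hooks out of order; the test asserts they fire by priority.
const order = [];
const lc = createLifecycle();
lc.registerHook('close-db', 30, async () => order.push('close-db'));
lc.registerHook('mark-unready', 10, async () => order.push('mark-unready'));
lc.registerHook('drain-http', 20, async () => order.push('drain-http'));
```

The same pattern extends to asserting that each hook respects its individual timeout.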
Key Takeaways
- Graceful shutdown prevents data loss and user-facing errors: Stop accepting new work, finish in-flight work, clean up resources, then exit. The difference between 502 errors and zero-downtime deploys.
- SIGTERM is the polite termination signal: Kubernetes, Docker, and process managers send SIGTERM first. Your app must catch it and initiate shutdown. SIGKILL cannot be caught — it's the fallback.
- server.close() is necessary but not sufficient: It stops accepting new TCP connections but doesn't close existing keep-alive connections. You must track and drain active connections explicitly.
- Shutdown hooks must be prioritized: Health check → unready first, then stop accepting traffic, then drain workers, then close external connections, then flush logs. Order prevents resource conflicts.
- preStop hook solves the Kubernetes endpoint race condition: Endpoint removal and SIGTERM happen in parallel. A 5-second preStop delay ensures traffic stops arriving before the app starts draining.
- PID 1 matters in Docker: Don't use CMD ["npm", "start"] — npm doesn't forward signals. Use CMD ["node", "server.js"] or tini as an init process.
- Connection drain timeout must respect the grace period: If Kubernetes gives 30 seconds and preStop takes 5 seconds, your drain timeout must be under 25 seconds. Otherwise SIGKILL interrupts the drain.
- Readiness probe should fail immediately on shutdown; liveness should keep passing: Failing readiness removes the pod from traffic. Failing liveness restarts the pod, which disrupts graceful shutdown.
- Per-hook timeouts prevent one slow hook from blocking shutdown: If the database pool takes 60 seconds to close, the force-exit timer kills the process. Use individual timeouts plus a global deadline.
- Test shutdown under load: Graceful shutdown bugs only surface when the server is actively processing requests. A load test during SIGTERM is the only way to verify zero-downtime behavior.