Zero-Downtime Deployments: The Unglamorous Details
Introduction
Every blog post about zero-downtime deployments shows the same diagram: blue boxes, green boxes, an arrow, and "zero downtime!" The reality is messier. Much messier.
This guide covers what those diagrams leave out—the database migrations that break everything, the health checks that lie, the connections that won't drain, and the hundred other details that determine whether your "zero-downtime" deployment actually has zero downtime.
What "Zero Downtime" Actually Means
Let's start by defining our terms, because "zero downtime" means different things to different people.
┌─────────────────────────────────────────────────────────────┐
│ DOWNTIME DEFINITION SPECTRUM │
├─────────────────────────────────────────────────────────────┤
│ │
│ DEFINITION WHAT IT MEANS IN PRACTICE │
│ ───────────────────────────────────────────────────────── │
│ │
│ "Zero errors" No user sees any error │
│ (Extremely hard to achieve) │
│ │
│ "No failed requests" Every request completes │
│ (Some may be slow or degraded) │
│ │
│ "No visible impact" User doesn't notice anything │
│ (Background tasks might fail) │
│ │
│ "No maintenance page" Site stays up, might be degraded │
│ (Most common definition) │
│ │
│ "Acceptable errors" <0.01% error rate during deploy │
│ (Pragmatic target) │
│ │
└─────────────────────────────────────────────────────────────┘
The Honest Truth:
True zero-downtime (no errors, no degradation, no impact) during deployments of stateful systems is extraordinarily difficult. Most organizations target "no visible user impact" with "acceptable error rate" as the realistic goal.
What Marketing Says:
  "Zero downtime!"

What Engineering Knows:
  "Less than 0.01% error rate during the 3-minute deploy window,
  excluding background jobs which may retry, and assuming no
  database migrations this release."
The Deployment Strategies
Strategy Comparison
┌─────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT STRATEGY COMPARISON │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ STRATEGY COMPLEXITY RESOURCE ROLLBACK RISK │
│ OVERHEAD SPEED EXPOSURE │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ Rolling Medium Low Slow Gradual │
│ (+0-25%) (minutes) (% of fleet) │
│ │
│ Blue-Green Low High Fast All-or-nothing │
│ (+100%) (seconds) (instant flip) │
│ │
│ Canary High Medium Fast Very low │
│ (+5-10%) (seconds) (1-5% traffic) │
│ │
│ Rolling Very High Medium Medium Controlled │
│ + Canary (+10-25%) (minutes) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Rolling Deployment: The Details
Rolling Deployment Timeline (10 instances, 2 at a time):
═══════════════════════════════════════════════════════════════════════
Time Instance State Traffic Distribution
──── ────────────────────────────────────────── ────────────────────
T+0 [v1][v1][v1][v1][v1][v1][v1][v1][v1][v1] 100% v1
T+1 [v1][v1][v1][v1][v1][v1][v1][v1][──][──] Draining 2 instances
Instances 9,10: Connection draining 80% v1, 20% draining
T+2 [v1][v1][v1][v1][v1][v1][v1][v1][v2][v2] Starting new
Instances 9,10: Starting v2 80% v1, 0% serving
T+3 [v1][v1][v1][v1][v1][v1][v1][v1][v2][v2] Health checking
Instances 9,10: Health checks 80% v1, 0% serving
T+4 [v1][v1][v1][v1][v1][v1][v1][v1][v2][v2] 80% v1, 20% v2
Instances 9,10: Receiving traffic Mixed traffic!
... repeat for remaining instances ...
T+20 [v2][v2][v2][v2][v2][v2][v2][v2][v2][v2] 100% v2
CRITICAL PERIOD: T+4 through T+16
─────────────────────────────────
During this window, BOTH versions serve traffic simultaneously.
Your code MUST handle:
• Old code calling new code (via internal APIs)
• New code calling old code
• Shared database with both versions
• Shared cache with both versions
• Shared queues with both versions
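The usual way to survive that mixed-version window is to make every shared format explicitly versioned and tolerant of the other side's shape. A minimal sketch, assuming a hypothetical queue message (the `QueueMessage` shape and its field names are illustrative, not from any specific system):

```typescript
// A queue message that both v1 and v2 workers can process during a
// rolling deploy. v1 published { name }; v2 publishes { username }.
interface QueueMessage {
  schemaVersion?: number; // absent in v1 messages
  name?: string;          // v1 field
  username?: string;      // v2 field
}

// v2 consumer: accept both shapes, never assume only the new one.
function extractUsername(msg: QueueMessage): string {
  if (msg.schemaVersion === 2 && msg.username !== undefined) {
    return msg.username;
  }
  if (msg.name !== undefined) {
    return msg.name; // fall back to the v1 field
  }
  throw new Error('Unrecognized message schema');
}

console.log(extractUsername({ name: 'alice' }));                     // v1 message
console.log(extractUsername({ schemaVersion: 2, username: 'bob' })); // v2 message
```

The same pattern applies to cache entries and internal API payloads: readers accept both formats, writers emit the new one, and the old reader path is deleted only after no v1 producers remain.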
The Rolling Deployment Gotchas:
# Kubernetes rolling deployment - looks simple
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2  # Take down 2 at a time
      maxSurge: 2        # Can temporarily have 12 pods

# What this DOESN'T tell you:
# 1. How long does your app take to start?
# 2. How long until it's ACTUALLY ready (not just passing health checks)?
# 3. What happens to in-flight requests on terminated pods?
# 4. Are your health checks actually meaningful?
# 5. Can v1 and v2 coexist safely?
Blue-Green Deployment: The Details
Blue-Green Architecture:
════════════════════════════════════════════════════════════════════
┌─────────────────┐
│ Load Balancer │
│ │
│ Points to: BLUE│
└────────┬────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ BLUE (v1) │ │ GREEN (v2) │
│ ════════ │ │ ═════════ │
│ │ │ │
│ [v1][v1][v1] │ │ [v2][v2][v2] │
│ [v1][v1][v1] │ │ [v2][v2][v2] │
│ │ │ │
│ SERVING TRAFFIC │ │ IDLE/TESTING │
└──────────────────┘ └──────────────────┘
│ │
└─────────────┬─────────────┘
│
▼
┌─────────────────┐
│ DATABASE │
│ (shared!) │
└─────────────────┘
The Switch (looks instant, isn't):
─────────────────────────────────
Before: LB → BLUE (100%) GREEN (0%)
DNS TTL: 60s remaining
Action: Switch LB target to GREEN
After: LB → BLUE (0%) GREEN (100%)
BUT WAIT:
• Active connections on BLUE don't instantly move
• Client-side connection pools still point to BLUE
• DNS caches at various levels
• CDN edge nodes may cache BLUE's IP
• Mobile apps with poor connectivity still talking to BLUE
Blue-Green Hidden Complexity:
┌─────────────────────────────────────────────────────────────┐
│ BLUE-GREEN: WHAT THEY DON'T TELL YOU │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. DATABASE SCHEMA │
│ Both blue and green use the SAME database. │
│ Schema changes must be backward compatible. │
│ You can't just "flip back" if migrations ran. │
│ │
│ 2. INFRASTRUCTURE COST │
│ You need 2x the compute capacity. │
│ That's not 2x the cost (idle green is cheap) │
│ but it's not free either. │
│ │
│ 3. STATE SYNCHRONIZATION │
│ Sessions created on blue won't exist on green. │
│ Caches are cold on green. │
│ In-memory state is lost. │
│ │
│ 4. THE "INSTANT" SWITCH ISN'T INSTANT │
│ - Load balancer propagation: 1-30 seconds │
│ - Health check intervals: 10-30 seconds │
│ - Connection draining: 30-300 seconds │
│ - Client reconnection: varies wildly │
│ │
│ 5. SHARED DEPENDENCIES │
│ Both environments share: │
│ - Database │
│ - Message queues │
│ - External APIs │
│ - File storage │
│ These can't be "blue" or "green" │
│ │
└─────────────────────────────────────────────────────────────┘
Canary Deployment: The Details
Canary Traffic Progression:
════════════════════════════════════════════════════════════════════
Phase 1: Deploy Canary (1% traffic)
─────────────────────────────────────
┌─────────────────────────────────────────────────────┐
│ Production Pool (v1) │
│ [v1][v1][v1][v1][v1][v1][v1][v1][v1][v1] 99% │
├─────────────────────────────────────────────────────┤
│ Canary Pool (v2) │
│ [v2] 1% │
└─────────────────────────────────────────────────────┘
Monitor for: 10-30 minutes
Success criteria: Error rate < 0.1%, latency p99 < 200ms
Phase 2: Expand Canary (10% traffic)
────────────────────────────────────
┌─────────────────────────────────────────────────────┐
│ Production Pool (v1) │
│ [v1][v1][v1][v1][v1][v1][v1][v1][v1] 90% │
├─────────────────────────────────────────────────────┤
│ Canary Pool (v2) │
│ [v2][v2][v2] 10% │
└─────────────────────────────────────────────────────┘
Monitor for: 15-30 minutes
Success criteria: Same as phase 1 + business metrics normal
Phase 3: Expand Canary (50% traffic)
────────────────────────────────────
┌─────────────────────────────────────────────────────┐
│ Production Pool (v1) │
│ [v1][v1][v1][v1][v1] 50% │
├─────────────────────────────────────────────────────┤
│ Canary Pool (v2) │
│ [v2][v2][v2][v2][v2] 50% │
└─────────────────────────────────────────────────────┘
Monitor for: 30-60 minutes
Phase 4: Complete Rollout (100% traffic)
───────────────────────────────────────
┌─────────────────────────────────────────────────────┐
│ Production Pool (v2) │
│ [v2][v2][v2][v2][v2][v2][v2][v2][v2][v2] 100% │
└─────────────────────────────────────────────────────┘
Canary Deployment Challenges:
┌─────────────────────────────────────────────────────────────┐
│ CANARY: THE HARD PROBLEMS │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. TRAFFIC SPLITTING │
│ How do you actually route 1% of traffic? │
│ • Random: Each request 1% chance to canary │
│ → Same user might flip between versions │
│ • Sticky: Hash user ID, 1% of users to canary │
│ → Better UX, but what if that 1% is different? │
│ • Feature-based: Specific users/accounts │
│ → Most controlled, but not representative │
│ │
│ 2. STATEFUL OPERATIONS │
│ User starts checkout on v1, continues on v2? │
│ Session data format changed between versions? │
│ Cart in v1 format, v2 can't read it? │
│ │
│ 3. METRIC SIGNIFICANCE │
│ 1% traffic = low sample size │
│ Is 0.5% error rate real or statistical noise? │
│ Need: Statistical significance calculations │
│ │
│ 4. SLOW-BURN BUGS │
│ Memory leak that takes 4 hours to manifest │
│ Connection pool exhaustion over time │
│ Cache pollution that builds gradually │
│ → Canary period might be too short to catch │
│ │
│ 5. INFRASTRUCTURE COMPLEXITY │
│ Need sophisticated traffic management │
│ Need real-time metrics comparison │
│ Need automated rollback triggers │
│ This isn't free to build or operate │
│ │
└─────────────────────────────────────────────────────────────┘
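The metric-significance problem above can be made concrete with a standard two-proportion z-test comparing baseline and canary error rates. A sketch (the traffic numbers and the 1.645 threshold, one-sided 95% confidence, are illustrative):

```typescript
// Two-proportion z-test: is the canary's error rate significantly
// worse than the baseline's, or just noise from a small sample?
function canaryZScore(
  baselineErrors: number, baselineTotal: number,
  canaryErrors: number, canaryTotal: number
): number {
  const p1 = baselineErrors / baselineTotal;
  const p2 = canaryErrors / canaryTotal;
  // Pooled proportion under the null hypothesis (both rates equal)
  const pooled = (baselineErrors + canaryErrors) / (baselineTotal + canaryTotal);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / baselineTotal + 1 / canaryTotal));
  return se === 0 ? 0 : (p2 - p1) / se;
}

// z > ~1.645 ≈ one-sided 95% confidence the canary is actually worse.
// 1,000,000 baseline requests at 0.10% errors vs 10,000 canary
// requests at 0.50% errors:
const z = canaryZScore(1000, 1_000_000, 50, 10_000);
console.log(z > 1.645); // true: a real regression, not noise
```

At 1% traffic the canary sample is small, so only large regressions clear the threshold quickly; this is exactly why canary phases at low percentages need longer observation windows.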
Database Migrations: Where Dreams Die
Database migrations are the #1 cause of "zero-downtime deployment" failures. This is where the real complexity lives.
The Fundamental Problem
The Impossible Triangle:
════════════════════════════════════════════════════════════════════
ZERO DOWNTIME
/\
/ \
/ \
/ \
/ ?? \
/ \
/ \
/______________\
SCHEMA CHANGE DATA INTEGRITY
You can have any two:
• Zero Downtime + Schema Change = Risk data integrity
(What if old code writes to removed column?)
• Zero Downtime + Data Integrity = No schema changes
(Just don't change the schema... not practical)
• Schema Change + Data Integrity = Downtime
(Take the app down, migrate, bring up new code)
The Expand-Contract Pattern
The only reliable way to do zero-downtime schema changes:
PHASE 1: EXPAND
═══════════════════════════════════════════════════════════════════
Goal: Add new schema elements without breaking old code
Example: Rename column 'user_name' to 'username'
Step 1.1: Add new column (nullable or with default)
──────────────────────────────────────────────────
-- Migration 001_add_username_column.sql
ALTER TABLE users ADD COLUMN username VARCHAR(255);
-- Create trigger to sync data (bidirectional!)
CREATE OR REPLACE FUNCTION sync_username() RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
    IF NEW.username IS NULL AND NEW.user_name IS NOT NULL THEN
      NEW.username := NEW.user_name;
    ELSIF NEW.user_name IS NULL AND NEW.username IS NOT NULL THEN
      NEW.user_name := NEW.username;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_username
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION sync_username();
Step 1.2: Backfill existing data
───────────────────────────────
-- Migration 002_backfill_username.sql
-- Do this in batches to avoid locking!
UPDATE users SET username = user_name
WHERE username IS NULL
AND id BETWEEN 1 AND 10000;
-- Repeat for all ID ranges
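Driving those batched UPDATEs by hand is error-prone, so backfills are usually run by a small loop. A sketch assuming a generic `db.query` client in the node-postgres style (batch size and pause are tunables, not prescriptions):

```typescript
// Batched backfill: copy user_name → username in small chunks so no
// single UPDATE holds row locks for long, pausing between batches to
// let replication and normal traffic catch up.
const sleep = (ms: number) => new Promise(res => setTimeout(res, ms));

interface Db {
  query(sql: string, params?: unknown[]): Promise<{ rowCount: number }>;
}

async function backfillUsername(
  db: Db,
  batchSize = 10_000,
  pauseMs = 100
): Promise<number> {
  let totalUpdated = 0;
  for (;;) {
    // Each pass updates at most batchSize rows that still need copying.
    const result = await db.query(
      `UPDATE users SET username = user_name
       WHERE id IN (
         SELECT id FROM users
         WHERE username IS NULL AND user_name IS NOT NULL
         LIMIT $1
       )`,
      [batchSize]
    );
    totalUpdated += result.rowCount;
    if (result.rowCount < batchSize) break; // nothing (or little) left
    await sleep(pauseMs); // back off between batches
  }
  return totalUpdated;
}
```

Using `id IN (SELECT ... LIMIT $1)` rather than manual ID ranges means the loop naturally terminates when no unsynced rows remain, and rows inserted mid-backfill are covered by the trigger anyway.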
State After Phase 1:
┌─────────────────────────────────────────┐
│ users table │
├─────────────────────────────────────────┤
│ id │ user_name │ username │ email │
│────┼───────────┼──────────┼────────────│
│ 1 │ alice │ alice │ a@test.com │
│ 2 │ bob │ bob │ b@test.com │
└─────────────────────────────────────────┘
Both columns exist, both have data, trigger keeps them in sync.
Old code (using user_name) works.
New code (using username) works.
PHASE 2: MIGRATE CODE
═══════════════════════════════════════════════════════════════════
Deploy new application code that uses 'username' instead of 'user_name'.
The trigger ensures both columns stay synchronized.
Old code still running: Writes to user_name → trigger copies to username
New code running: Writes to username → trigger copies to user_name
This is the mixed-version period. Both work.
PHASE 3: CONTRACT
═══════════════════════════════════════════════════════════════════
Goal: Remove old schema elements once all code is updated
Step 3.1: Verify no code uses old column
───────────────────────────────────────
-- Check for queries using user_name
-- Review application logs, query logs
-- Wait sufficient time (days/weeks, not hours)
Step 3.2: Remove trigger
───────────────────────
-- Migration 003_remove_sync_trigger.sql
DROP TRIGGER users_sync_username ON users;
DROP FUNCTION sync_username();
Step 3.3: Remove old column
──────────────────────────
-- Migration 004_remove_user_name_column.sql
ALTER TABLE users DROP COLUMN user_name;
Final State:
┌─────────────────────────────────────────┐
│ users table │
├─────────────────────────────────────────┤
│ id │ username │ email │
│────┼──────────┼─────────────────────────│
│ 1 │ alice │ a@test.com │
│ 2 │ bob │ b@test.com │
└─────────────────────────────────────────┘
Common Migration Scenarios
┌─────────────────────────────────────────────────────────────────────┐
│ MIGRATION SCENARIO PLAYBOOK │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: ADD NON-NULLABLE COLUMN │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: ALTER TABLE ADD COLUMN foo NOT NULL; │
│ → Fails if table has data, or locks table │
│ │
│ Right: │
│ 1. Add column as nullable: ADD COLUMN foo VARCHAR(255); │
│ 2. Deploy code that writes to new column │
│ 3. Backfill in batches: UPDATE ... WHERE foo IS NULL LIMIT 1000; │
│ 4. Add NOT NULL: ALTER TABLE ALTER COLUMN foo SET NOT NULL; │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ SCENARIO: ADD INDEX │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: CREATE INDEX idx_foo ON large_table(foo); │
│ → Locks table for duration of index build │
│ │
│ Right (PostgreSQL): │
│ CREATE INDEX CONCURRENTLY idx_foo ON large_table(foo); │
│ → Takes longer but doesn't lock │
│ │
│ Right (MySQL 5.6+): │
│ ALTER TABLE large_table ADD INDEX idx_foo(foo), ALGORITHM=INPLACE; │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ SCENARIO: CHANGE COLUMN TYPE │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: ALTER TABLE users ALTER COLUMN age TYPE BIGINT; │
│ → Rewrites entire table, locks during rewrite │
│ │
│ Right: │
│ 1. Add new column: ADD COLUMN age_new BIGINT; │
│ 2. Add trigger to sync old → new │
│ 3. Backfill in batches │
│ 4. Deploy code using new column │
│ 5. Remove trigger │
│ 6. Drop old column │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ SCENARIO: DROP COLUMN │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: Just drop it │
│ → Old code still running might reference it │
│ │
│ Right: │
│ 1. Remove all code references (deploy) │
│ 2. Wait for all old versions to drain │
│ 3. Drop column in separate deployment │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ SCENARIO: RENAME TABLE │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: RENAME TABLE old_name TO new_name; │
│ → Old code immediately breaks │
│ │
│ Right: │
│ 1. Create view: CREATE VIEW new_name AS SELECT * FROM old_name; │
│ 2. Deploy code using new_name │
│ 3. Once all code migrated, can restructure │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ SCENARIO: ADD FOREIGN KEY │
│ ───────────────────────────────────────────────────────────────── │
│ Wrong: ALTER TABLE ADD CONSTRAINT fk_foo FOREIGN KEY... │
│ → Validates all existing rows, locks both tables │
│ │
│ Right (PostgreSQL): │
│ 1. Add constraint as NOT VALID: │
│ ALTER TABLE ADD CONSTRAINT fk_foo FOREIGN KEY...NOT VALID; │
│ 2. Validate separately (allows concurrent access): │
│ ALTER TABLE VALIDATE CONSTRAINT fk_foo; │
│ │
└─────────────────────────────────────────────────────────────────────┘
Migration Timing
When to Run Migrations:
════════════════════════════════════════════════════════════════════
BEFORE deployment (expand phase):
┌──────────────────────────────────────────────────────────────────┐
│ │
│ Run │ Deploy │ Traffic │ Monitor │
│ Migration │ New Code │ Shifts │ & Verify │
│ │ │ │ │
│ [████]──────┼───[████]──────┼────[████]──────┼────[████] │
│ │ │ │ │
│ Schema │ Code can │ Rolling/ │ Both │
│ supports │ handle │ blue-green │ versions │
│ both │ both │ happens │ work │
│ versions │ schemas │ │ │
│ │
└──────────────────────────────────────────────────────────────────┘
AFTER deployment (contract phase):
┌──────────────────────────────────────────────────────────────────┐
│ │
│ Verify │ Wait │ Run │ Monitor │
│ All Code │ Period │ Cleanup │ & Verify │
│ Deployed │ │ Migration │ │
│ │ │ │ │
│ [████]──────┼───[████]──────┼────[████]──────┼────[████] │
│ │ │ │ │
│ No old │ Days to │ Remove │ Old │
│ code │ weeks, │ old │ columns │
│ running │ not hours │ columns │ gone │
│ │
└──────────────────────────────────────────────────────────────────┘
Critical Rule: NEVER run expand and contract in the same deployment.
Health Checks: The Lies They Tell
Health checks are critical for zero-downtime deployments. Bad health checks are worse than no health checks.
The Health Check Hierarchy
┌─────────────────────────────────────────────────────────────────────┐
│ HEALTH CHECK TYPES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ LIVENESS CHECK: "Is the process alive?" │
│ ─────────────────────────────────────── │
│ Purpose: Should we restart this container? │
│ Checks: Process responding, not deadlocked │
│ Failure: Kill and restart the container │
│ Example: GET /health/live → 200 OK │
│ │
│ DO: Return 200 if process can respond │
│ DON'T: Check dependencies (DB, Redis, etc.) │
│ DON'T: Do expensive operations │
│ │
│ READINESS CHECK: "Can this instance serve traffic?" │
│ ────────────────────────────────────────────────── │
│ Purpose: Should we route traffic here? │
│ Checks: Dependencies available, warmed up, ready │
│ Failure: Remove from load balancer, DON'T restart │
│ Example: GET /health/ready → 200 OK or 503 Not Ready │
│ │
│ DO: Verify critical dependencies (DB connection) │
│ DO: Check if warm-up is complete │
│ DON'T: Make it so strict one dep failure fails all │
│ │
│ STARTUP CHECK: "Has the app finished starting?" │
│ ───────────────────────────────────────────────── │
│ Purpose: Is initial startup complete? │
│ Checks: Migrations done, caches warmed, ready to serve │
│ Failure: Keep waiting (up to timeout) │
│ Example: GET /health/startup → 200 OK │
│ │
│ DO: Account for slow startup (cache warming) │
│ DO: Set appropriate timeout │
│ DON'T: Conflate with liveness (different failure modes) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Health Check Implementation
// health-checks.ts
// Assumes a `db` client with a query() method is available in scope.

interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, CheckResult>;
  timestamp: string;
}

interface CheckResult {
  status: 'pass' | 'fail' | 'warn';
  latency_ms?: number;
  message?: string;
}

class HealthChecker {
  private startupComplete = false;
  private lastSuccessfulDbCheck = 0;
  private dbCheckCache: CheckResult | null = null;

  // LIVENESS: Is the process alive and not deadlocked?
  // This should be FAST and have NO external dependencies
  async checkLiveness(): Promise<HealthStatus> {
    return {
      status: 'healthy',
      checks: {
        process: { status: 'pass' }
      },
      timestamp: new Date().toISOString()
    };
  }

  // READINESS: Can this instance serve traffic?
  async checkReadiness(): Promise<HealthStatus> {
    const checks: Record<string, CheckResult> = {};

    // Check database (with caching to prevent thundering herd)
    checks.database = await this.checkDatabaseCached();

    // Check if startup is complete
    checks.startup = {
      status: this.startupComplete ? 'pass' : 'fail',
      message: this.startupComplete ? 'Ready' : 'Still warming up'
    };

    // Determine overall status
    const hasFailure = Object.values(checks).some(c => c.status === 'fail');
    const hasWarning = Object.values(checks).some(c => c.status === 'warn');

    return {
      status: hasFailure ? 'unhealthy' : hasWarning ? 'degraded' : 'healthy',
      checks,
      timestamp: new Date().toISOString()
    };
  }

  // Cache DB checks to prevent health check storms
  private async checkDatabaseCached(): Promise<CheckResult> {
    const now = Date.now();
    const cacheAge = now - this.lastSuccessfulDbCheck;

    // Return cached result if recent and was successful
    if (this.dbCheckCache?.status === 'pass' && cacheAge < 5000) {
      return this.dbCheckCache;
    }

    try {
      const start = Date.now();
      await db.query('SELECT 1');
      const latency = Date.now() - start;

      this.dbCheckCache = {
        status: latency > 100 ? 'warn' : 'pass',
        latency_ms: latency,
        message: latency > 100 ? 'Slow response' : 'OK'
      };
      this.lastSuccessfulDbCheck = now;
    } catch (error) {
      this.dbCheckCache = {
        status: 'fail',
        message: `Connection failed: ${(error as Error).message}`
      };
    }

    return this.dbCheckCache;
  }

  // Called when startup tasks complete
  markStartupComplete() {
    this.startupComplete = true;
  }

  // DEEP CHECK: For debugging, not for LB health checks.
  // The individual helpers (checkDatabase, checkRedis, etc.) are
  // elided here.
  async checkDeep(): Promise<HealthStatus> {
    const checks: Record<string, CheckResult> = {};

    // All dependencies
    checks.database = await this.checkDatabase();
    checks.redis = await this.checkRedis();
    checks.elasticsearch = await this.checkElasticsearch();
    checks.externalApi = await this.checkExternalApi();

    // Resource checks
    checks.memory = this.checkMemory();
    checks.diskSpace = await this.checkDiskSpace();
    checks.connectionPool = this.checkConnectionPool();

    const hasFailure = Object.values(checks).some(c => c.status === 'fail');
    const hasWarning = Object.values(checks).some(c => c.status === 'warn');

    return {
      status: hasFailure ? 'unhealthy' : hasWarning ? 'degraded' : 'healthy',
      checks,
      timestamp: new Date().toISOString()
    };
  }
}
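Wiring these checks to HTTP endpoints is mostly about mapping status to response codes: a degraded instance should usually keep serving (200), while only an unhealthy one should be pulled from the load balancer (503). A self-contained sketch of that mapping (the `MiniHealthChecker` below is an illustrative stand-in, not the full class above):

```typescript
// Maps health state → readiness HTTP code. Degraded instances keep
// serving traffic; only unhealthy ones are removed from rotation.
type HealthState = 'healthy' | 'degraded' | 'unhealthy';

function readinessHttpCode(state: HealthState): number {
  return state === 'unhealthy' ? 503 : 200;
}

// Minimal stand-in checker: not ready until startup completes.
class MiniHealthChecker {
  private startupComplete = false;
  markStartupComplete() { this.startupComplete = true; }
  checkReadiness(): HealthState {
    return this.startupComplete ? 'healthy' : 'unhealthy';
  }
}

const checker = new MiniHealthChecker();
console.log(readinessHttpCode(checker.checkReadiness())); // 503 before warm-up
checker.markStartupComplete();
console.log(readinessHttpCode(checker.checkReadiness())); // 200 once ready
```

The liveness endpoint, by contrast, should return 200 unconditionally as long as the process can respond at all; it never consults dependencies.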
Health Check Anti-Patterns
┌─────────────────────────────────────────────────────────────────────┐
│ HEALTH CHECK ANTI-PATTERNS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ANTI-PATTERN: THE EXPENSIVE CHECK │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Health check queries entire database │
│ Impact: Health checks themselves cause load/timeouts │
│ Example: SELECT COUNT(*) FROM large_table │
│ Fix: SELECT 1; or ping with connection pool │
│ │
│ ANTI-PATTERN: THE DEPENDENCY CASCADE │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Check ALL dependencies for readiness │
│ Impact: One failed dependency = entire fleet "unhealthy" │
│ Example: Analytics service down → all pods not ready │
│ Fix: Only check CRITICAL dependencies │
│ │
│ ANTI-PATTERN: THE THUNDERING HERD │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Many instances check same dependency simultaneously │
│ Impact: Health checks DDoS your own database │
│ Example: 100 pods × 1 check/sec = 100 QPS just for health │
│ Fix: Cache checks, jitter timing, sample checking │
│ │
│ ANTI-PATTERN: THE ALWAYS-HEALTHY CHECK │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Health check returns 200 no matter what │
│ Impact: Broken instances keep receiving traffic │
│ Example: return res.status(200).send('OK'); │
│ Fix: Actually verify critical functionality │
│ │
│ ANTI-PATTERN: THE RESTART LOOP │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Liveness check fails during startup │
│ Impact: Container never finishes starting │
│ Example: Liveness starts at T+0, app needs 60s to start │
│ Fix: Use startupProbe, or delay livenessProbe start │
│ │
│ ANTI-PATTERN: THE OPTIMISTIC CHECK │
│ ───────────────────────────────────────────────────────────────── │
│ Problem: Check says ready before actually ready │
│ Impact: Traffic routes to instance serving errors │
│ Example: Returns ready before cache is warmed │
│ Fix: Include warmup completion in readiness │
│ │
└─────────────────────────────────────────────────────────────────────┘
Kubernetes Health Check Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: myapp:v2

          # STARTUP PROBE
          # Used during startup. Until it succeeds, liveness/readiness are disabled.
          # Prevents slow-starting containers from being killed.
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            initialDelaySeconds: 5  # Wait before first check
            periodSeconds: 5        # Check every 5s
            timeoutSeconds: 3       # Each check times out after 3s
            failureThreshold: 30    # Fail after 30 failures (150s total)
            # Total startup budget: 5 + (30 * 5) = 155 seconds

          # LIVENESS PROBE
          # Is the container alive? If not, kill and restart it.
          # Should be simple - don't check dependencies here.
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 0  # Starts after startup probe succeeds
            periodSeconds: 10       # Check every 10s
            timeoutSeconds: 2       # Quick timeout
            failureThreshold: 3     # Restart after 3 consecutive failures

          # READINESS PROBE
          # Can this container serve traffic? If not, remove from service.
          # Check dependencies here - temporary issues don't require restart.
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 0  # Starts after startup probe succeeds
            periodSeconds: 5        # Check frequently for fast recovery
            timeoutSeconds: 3
            failureThreshold: 2     # Remove after 2 failures (10s)
            successThreshold: 1     # Add back after 1 success

          # Lifecycle hooks for graceful shutdown
          lifecycle:
            preStop:
              exec:
                # Give time for the load balancer to remove us
                command: ["/bin/sh", "-c", "sleep 15"]
Connection Draining: The Silent Killer
When you remove an instance from a load balancer, existing connections don't magically disappear. Connection draining (also called deregistration delay) is critical.
The Connection Draining Problem
WITHOUT PROPER DRAINING:
════════════════════════════════════════════════════════════════════
Timeline:
─────────
T+0: Instance serving requests normally
Client A: Long-running request started
Client B: WebSocket connection active
T+1: Deployment starts, instance marked for removal
Load balancer stops sending NEW requests
T+2: Instance terminated ← PROBLEM!
Client A: Request killed mid-response → 502 ERROR
Client B: WebSocket dropped → DISCONNECT
Background job: Terminated mid-processing → DATA CORRUPTION?
WITH PROPER DRAINING:
════════════════════════════════════════════════════════════════════
Timeline:
─────────
T+0: Instance serving requests normally
Client A: Long-running request started
Client B: WebSocket connection active
T+1: Deployment starts, instance marked for removal
Load balancer stops sending NEW requests
Instance receives SIGTERM
T+2: Draining period begins
Instance stops accepting NEW connections
Instance continues serving IN-FLIGHT requests
Client A: Request completes normally → 200 OK
Client B: WebSocket server sends close frame
Background job: Completes current work, stops accepting new jobs
T+30: Draining period ends (configurable)
All connections closed or timed out
Instance terminates cleanly
Implementing Graceful Shutdown
// graceful-shutdown.ts
// Assumes `app` (Express), `jobQueue`, `db`, and `metrics` exist in scope.
import * as http from 'http';
import * as net from 'net';
import { Request, Response, NextFunction } from 'express';

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

class GracefulShutdown {
  private isShuttingDown = false;
  private server: http.Server;
  private connections = new Set<net.Socket>();
  private activeRequests = 0;

  constructor(server: http.Server) {
    this.server = server;
    this.trackConnections();
    this.setupSignalHandlers();
  }

  private trackConnections() {
    this.server.on('connection', (socket: net.Socket) => {
      this.connections.add(socket);
      socket.on('close', () => this.connections.delete(socket));
    });
  }

  private setupSignalHandlers() {
    // SIGTERM: Kubernetes sends this before killing the pod
    process.on('SIGTERM', () => this.shutdown('SIGTERM'));
    // SIGINT: Ctrl+C in development
    process.on('SIGINT', () => this.shutdown('SIGINT'));
  }

  // Middleware to track active requests
  requestTracker() {
    return (req: Request, res: Response, next: NextFunction) => {
      // Reject new requests during shutdown
      if (this.isShuttingDown) {
        res.setHeader('Connection', 'close');
        return res.status(503).json({
          error: 'Service shutting down',
          retryAfter: 5
        });
      }

      this.activeRequests++;
      res.on('finish', () => {
        this.activeRequests--;
      });
      next();
    };
  }

  // Health check endpoint respects shutdown state
  healthCheck() {
    return (req: Request, res: Response) => {
      if (this.isShuttingDown) {
        return res.status(503).json({ status: 'shutting_down' });
      }
      return res.status(200).json({ status: 'healthy' });
    };
  }

  private async shutdown(signal: string) {
    console.log(`Received ${signal}, starting graceful shutdown...`);
    this.isShuttingDown = true;

    // 1. Stop accepting new connections
    this.server.close();

    // 2. Close idle keep-alive connections.
    //    Connections with an active request close when the response completes.
    for (const socket of this.connections) {
      if (!socket.destroyed) {
        socket.end();
      }
    }

    // 3. Wait for active requests to complete (with timeout)
    const drainTimeout = 25000; // 25 seconds (leave buffer before SIGKILL)
    const startTime = Date.now();

    while (this.activeRequests > 0) {
      if (Date.now() - startTime > drainTimeout) {
        console.warn(`Drain timeout reached with ${this.activeRequests} active requests`);
        break;
      }
      console.log(`Waiting for ${this.activeRequests} requests to complete...`);
      await sleep(1000);
    }

    // 4. Cleanup background processes
    await this.stopBackgroundJobs();
    await this.closeDbConnections();
    await this.flushMetrics();

    console.log('Graceful shutdown complete');
    process.exit(0);
  }

  private async stopBackgroundJobs() {
    // Signal job processors to stop accepting new jobs,
    // then wait for the current job to complete
    await jobQueue.close();
  }

  private async closeDbConnections() {
    // Drain the connection pool
    await db.end();
  }

  private async flushMetrics() {
    // Ensure metrics are shipped before shutdown
    await metrics.flush();
  }
}

// Usage
const server = app.listen(8080);
const shutdown = new GracefulShutdown(server);
app.use(shutdown.requestTracker());
app.get('/health', shutdown.healthCheck());
Load Balancer Deregistration
AWS ALB Deregistration:
════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────┐
│ ALB DEREGISTRATION TIMELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ T+0 Target marked for deregistration │
│ ALB state: "draining" │
│ New requests: NOT sent to this target │
│ Existing connections: Continue to work │
│ │
│ T+0 ALB stops sending health checks to target │
│ to Target receives NO new requests │
│ T+300 Existing connections complete or timeout │
│ (deregistration_delay setting, default 300s) │
│ │
│ T+300 Deregistration complete │
│ All connections closed │
│ Safe to terminate instance │
│ │
└─────────────────────────────────────────────────────────────────┘
Configuration (Terraform):
resource "aws_lb_target_group" "api" {
  name     = "api-targets"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # Deregistration delay - time to drain connections
  deregistration_delay = 30 # 30 seconds, not the 300s default!

  # Health check configuration
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
    path                = "/health/ready"
    matcher             = "200"
  }

  # Stickiness - can cause problems during deployments
  stickiness {
    enabled         = false # Disable for easier draining
    type            = "lb_cookie"
    cookie_duration = 86400
  }
}
IMPORTANT TIMING:
Kubernetes terminationGracePeriodSeconds: 30
+ ALB deregistration_delay: 30
+ preStop hook sleep: 15
─────────────────────────────────────
Total shutdown budget: Needs coordination!
Recommended setup:
• ALB deregistration_delay: 30 seconds
• Pod terminationGracePeriodSeconds: 45 seconds
• preStop sleep: 15 seconds (for LB to remove target)
• App graceful shutdown timeout: 25 seconds
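Getting these timers wrong is easy to do and hard to spot in review, so it can pay to check the budget in code. A minimal TypeScript sketch — the interface and function are ours, not part of any Kubernetes or AWS SDK:

```typescript
interface ShutdownBudget {
  terminationGracePeriodSeconds: number; // Kubernetes pod setting
  preStopSleepSeconds: number;           // time for the LB to stop sending traffic
  appShutdownTimeoutSeconds: number;     // in-process graceful drain
  deregistrationDelaySeconds: number;    // ALB target group setting
}

// Returns a list of problems; an empty list means the budget is coherent.
function validateShutdownBudget(b: ShutdownBudget): string[] {
  const problems: string[] = [];
  // preStop sleep plus the app's own drain must fit inside the pod's grace
  // period, or Kubernetes SIGKILLs the process mid-drain.
  if (b.preStopSleepSeconds + b.appShutdownTimeoutSeconds > b.terminationGracePeriodSeconds) {
    problems.push('preStop + app shutdown exceeds terminationGracePeriodSeconds (SIGKILL risk)');
  }
  // The LB should give up on the target no later than the pod disappears,
  // otherwise the ALB may still count a dead backend as draining.
  if (b.deregistrationDelaySeconds > b.terminationGracePeriodSeconds) {
    problems.push('ALB keeps draining after the pod is already gone');
  }
  return problems;
}
```

With the recommended values above (45/15/25/30) the check passes; with the Kubernetes default grace period of 30 seconds it flags the SIGKILL risk.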
Cache Invalidation During Deployments
Cache invalidation is hard. Cache invalidation during deployments is harder.
The Version Mismatch Problem
THE PROBLEM:
════════════════════════════════════════════════════════════════════
T+0: All instances on v1
Cache: { user:123: {format: 'v1', name: 'Alice'} }
T+5: Rolling deployment starts
v1 instances: 8 (reading/writing v1 cache format)
v2 instances: 2 (expecting v2 cache format!)
T+6: v2 instance reads cache
Gets: {format: 'v1', name: 'Alice'}
Expects: {format: 'v2', username: 'alice', displayName: 'Alice'}
Result: CRASH or WRONG BEHAVIOR
T+7: v2 instance writes cache
Writes: {format: 'v2', username: 'alice', displayName: 'Alice'}
T+8: v1 instance reads that same cache key
Gets: {format: 'v2', username: 'alice', displayName: 'Alice'}
Expects: {format: 'v1', name: 'Alice'}
Result: CRASH or WRONG BEHAVIOR
Cache Compatibility Strategies
// Strategy 1: Versioned Cache Keys
// ─────────────────────────────────────────────────────────────
// Simple but wastes cache space during transition
const CACHE_VERSION = 'v2';
function cacheKey(type: string, id: string): string {
return `${CACHE_VERSION}:${type}:${id}`;
}
// v1 uses: 'v1:user:123'
// v2 uses: 'v2:user:123'
// No conflicts, but cache is cold for new version
// Strategy 2: Backward-Compatible Reads
// ─────────────────────────────────────────────────────────────
// Read both formats, write only new format
interface UserCacheV1 {
version?: undefined; // v1 didn't have version field
name: string;
}
interface UserCacheV2 {
version: 2;
username: string;
displayName: string;
}
type UserCache = UserCacheV1 | UserCacheV2;
function deserializeUser(cached: UserCache): User {
if (!cached.version || cached.version < 2) {
// Handle v1 format
return {
username: cached.name.toLowerCase(),
displayName: cached.name
};
}
// Handle v2 format
return {
username: cached.username,
displayName: cached.displayName
};
}
function serializeUser(user: User): UserCacheV2 {
// Always write latest format
return {
version: 2,
username: user.username,
displayName: user.displayName
};
}
// Strategy 3: Cache-Aside with Graceful Degradation
// ─────────────────────────────────────────────────────────────
// If cache format is wrong, treat as cache miss
async function getUser(id: string): Promise<User> {
const cacheKey = `user:${id}`;
try {
const cached = await cache.get(cacheKey);
if (cached) {
const parsed = JSON.parse(cached);
// Validate expected format
if (isValidUserCacheFormat(parsed)) {
return deserializeUser(parsed);
}
// Wrong format = treat as miss, don't crash
console.warn(`Cache format mismatch for ${cacheKey}, treating as miss`);
await cache.del(cacheKey); // Clear stale format
}
} catch (error) {
// Cache errors shouldn't break the app
console.error(`Cache read error: ${error}`);
}
// Cache miss - fetch from source
const user = await db.users.findById(id);
// Write to cache (best effort)
try {
await cache.setex(cacheKey, 3600, JSON.stringify(serializeUser(user)));
} catch (error) {
console.error(`Cache write error: ${error}`);
}
return user;
}
Full Cache Clear Strategy
WHEN TO CLEAR CACHE DURING DEPLOYMENT:
════════════════════════════════════════════════════════════════════
Option 1: Progressive Invalidation (Preferred)
──────────────────────────────────────────────
- Use versioned keys: No invalidation needed
- Use TTL: Let old entries expire naturally
- Use backward-compatible readers
Option 2: Pre-deployment Cache Warming
──────────────────────────────────────
1. Before deployment: Warm cache with new format
2. Deploy with code that reads both formats
3. Old format entries expire over time
Timeline:
T-10min: Start cache warming script (writes v2 format)
T+0: Deploy v2 code (reads v1 and v2, writes v2)
T+1hr: v1 cache entries have expired
All cache entries now v2 format
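The warming script in the timeline above can be as simple as walking the hot keys and rewriting them in the new format. A sketch against an in-memory stand-in — the `Cache` type and `toV2` helper are illustrative assumptions mirroring the `serializeUser` example earlier:

```typescript
// In-memory stand-ins for a cache client and a data source (assumptions, not real APIs).
type Cache = Map<string, string>;

interface User { id: string; name: string }

// v2 cache format, matching the serializeUser example earlier.
function toV2(user: User) {
  return { version: 2, username: user.name.toLowerCase(), displayName: user.name };
}

// Rewrite every listed key in the new format ahead of the deploy,
// so v2 instances mostly hit warm, correctly-shaped entries.
function warmCache(cache: Cache, users: User[]): number {
  let warmed = 0;
  for (const user of users) {
    cache.set(`user:${user.id}`, JSON.stringify(toV2(user)));
    warmed++;
  }
  return warmed;
}
```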
Option 3: Atomic Cache Flip (Blue-Green Style)
──────────────────────────────────────────────
1. Blue environment uses cache prefix "blue:"
2. Warm green cache with prefix "green:"
3. Deploy and flip
4. Green now uses prefix "green:"
Downsides:
- Doubles cache memory during transition
- Need to coordinate prefix with deployment
- Not always practical with shared cache
Option 4: Clear on Deploy (Last Resort)
───────────────────────────────────────
- Flush cache at start of deployment
- Accept cold cache performance hit
- Only for small caches or non-critical paths
redis-cli FLUSHDB # Nuclear option
Downsides:
- Performance degradation
- Thundering herd to database
- Only acceptable for small datasets
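The thundering-herd downside deserves a concrete mitigation: coalesce concurrent misses for the same key into a single loader call, so a cold cache produces one database query per key instead of one per request. A hedged sketch of the single-flight pattern:

```typescript
// Coalesce concurrent cache misses for the same key into one loader call.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  async do(key: string, loader: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // piggyback on the in-progress load
    const p = loader().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

In the getUser example earlier, the db.users.findById call would be wrapped in something like singleFlight.do(cacheKey, () => db.users.findById(id)).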
Queue and Worker Deployments
Deploying workers that process queues has unique challenges.
The Worker Deployment Problem
THE PROBLEM:
════════════════════════════════════════════════════════════════════
T+0: Worker v1 picks up job from queue
Job payload: { version: 1, userId: 123, action: 'process' }
Worker starts processing...
T+1: Deployment kills worker v1
Job processing: INTERRUPTED
Job status: UNKNOWN (partially processed? Failed?)
T+2: Worker v2 starts
Same job re-delivered (retry)
Worker v2: Expects version 2 payload format
Result: CRASH or WRONG BEHAVIOR
SOLUTIONS:
════════════════════════════════════════════════════════════════════
Solution 1: Graceful Worker Shutdown
────────────────────────────────────
1. SIGTERM received
2. Stop accepting NEW jobs
3. Complete CURRENT job (with timeout)
4. Exit cleanly
Solution 2: Idempotent Job Processing
────────────────────────────────────
1. Job should be safe to process multiple times
2. Track job progress externally
3. Resume-able processing
Solution 3: Job Versioning
────────────────────────────────────
1. Include version in job payload
2. Workers handle multiple versions
3. Eventually deprecate old versions
Worker Graceful Shutdown
// worker.ts
// Small helper used in the loops below
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
class QueueWorker {
private isShuttingDown = false;
private currentJob: Job | null = null;
private processingTimeout = 25000; // 25 seconds max per job during shutdown
constructor(private queue: Queue) {
this.setupSignalHandlers();
}
private setupSignalHandlers() {
process.on('SIGTERM', async () => {
console.log('SIGTERM received, initiating graceful shutdown');
await this.shutdown();
});
}
async start() {
while (!this.isShuttingDown) {
try {
// Blocking pop with timeout
// Returns null if timeout, allowing shutdown check
const job = await this.queue.pop({ timeout: 5000 });
if (job && !this.isShuttingDown) {
await this.processJob(job);
}
} catch (error) {
console.error('Error processing job:', error);
await sleep(1000); // Back off on error
}
}
console.log('Worker stopped accepting jobs');
}
private async processJob(job: Job) {
this.currentJob = job;
try {
// Process based on job version
const processor = this.getProcessor(job.version);
await processor(job);
// Acknowledge successful processing
await this.queue.ack(job.id);
} catch (error) {
// Handle failure
if (job.attempts < job.maxAttempts) {
// Re-queue with backoff
await this.queue.nack(job.id, { delay: this.calculateBackoff(job.attempts) });
} else {
// Move to dead letter queue
await this.queue.moveToDeadLetter(job);
}
} finally {
this.currentJob = null;
}
}
private getProcessor(version: number): JobProcessor {
const processors: Record<number, JobProcessor> = {
1: this.processV1.bind(this),
2: this.processV2.bind(this),
};
const processor = processors[version];
if (!processor) {
throw new Error(`Unknown job version: ${version}`);
}
return processor;
}
private async shutdown() {
this.isShuttingDown = true;
if (this.currentJob) {
console.log(`Waiting for current job ${this.currentJob.id} to complete...`);
// Wait for current job with timeout
const startTime = Date.now();
while (this.currentJob && Date.now() - startTime < this.processingTimeout) {
await sleep(100);
}
if (this.currentJob) {
console.warn(`Job ${this.currentJob.id} did not complete in time`);
// Job will be redelivered after visibility timeout
}
}
// Close queue connection
await this.queue.close();
console.log('Worker shutdown complete');
process.exit(0);
}
}
// Idempotent job processing example
class IdempotentJobProcessor {
async process(job: OrderFulfillmentJob) {
const orderId = job.orderId;
// Use distributed lock to prevent duplicate processing
const lock = await this.acquireLock(`order:${orderId}`, 300000); // 5 min lock
if (!lock) {
console.log(`Order ${orderId} is being processed by another worker`);
return; // Will be retried if other worker fails
}
try {
// Check if already processed
const order = await db.orders.findById(orderId);
if (order.status === 'fulfilled') {
console.log(`Order ${orderId} already fulfilled, skipping`);
return;
}
// Process with idempotent steps
if (!order.paymentCaptured) {
await this.capturePayment(order);
await db.orders.update(orderId, { paymentCaptured: true });
}
if (!order.inventoryReserved) {
await this.reserveInventory(order);
await db.orders.update(orderId, { inventoryReserved: true });
}
if (!order.shipmentCreated) {
await this.createShipment(order);
await db.orders.update(orderId, { shipmentCreated: true });
}
// Mark complete
await db.orders.update(orderId, { status: 'fulfilled' });
} finally {
await this.releaseLock(lock);
}
}
}
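The acquireLock call above carries a lot of weight. With Redis it is typically SET key token NX PX ttl, released only if the token still matches, so a worker can never free a lock it no longer holds. A self-contained, single-node sketch of that shape — the LockStore class is ours; a real deployment would back it with Redis or an equivalent store:

```typescript
// Minimal single-node lock in the style of Redis SET key token NX PX ttl.
class LockStore {
  private locks = new Map<string, { token: string; expiresAt: number }>();

  // Returns a release token on success, null if the lock is held.
  acquire(key: string, ttlMs: number): string | null {
    const now = Date.now();
    const held = this.locks.get(key);
    if (held && held.expiresAt > now) return null; // someone else holds it
    const token = Math.random().toString(36).slice(2);
    this.locks.set(key, { token, expiresAt: now + ttlMs });
    return token;
  }

  // Release only if the token matches; an expired/stolen lock is left alone.
  release(key: string, token: string): boolean {
    const held = this.locks.get(key);
    if (!held || held.token !== token) return false;
    this.locks.delete(key);
    return true;
  }
}
```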
Queue Deployment Coordination
DEPLOYMENT SEQUENCE FOR QUEUE-BASED SYSTEMS:
════════════════════════════════════════════════════════════════════
Scenario: Changing job payload format from v1 to v2
PHASE 1: PREPARE (No deployment yet)
────────────────────────────────────
1. Drain queue if possible (stop producers temporarily)
2. Or: Ensure all jobs can complete before new code deploys
PHASE 2: DEPLOY CONSUMERS FIRST
───────────────────────────────
1. Deploy workers that understand BOTH v1 and v2 formats
2. Verify workers can process existing v1 jobs
3. Verify workers can process v2 jobs (test in staging)
State:
┌──────────────┐ ┌─────────────────────────────┐
│ Producers │ │ Workers │
│ (v1 jobs) │────►│ (handles v1 AND v2) │
│ │ │ │
└──────────────┘ └─────────────────────────────┘
PHASE 3: DEPLOY PRODUCERS
─────────────────────────
1. Deploy producers that emit v2 format jobs
2. Workers continue processing both formats
3. Old v1 jobs drain from queue
State:
┌──────────────┐ ┌─────────────────────────────┐
│ Producers │ │ Workers │
│ (v2 jobs) │────►│ (handles v1 AND v2) │
│ │ │ │
└──────────────┘ └─────────────────────────────┘
PHASE 4: CLEANUP (Later)
────────────────────────
1. Verify no v1 jobs remain in queue
2. Deploy workers that only handle v2 (optional cleanup)
Critical Rules:
• ALWAYS deploy consumers before producers for new formats
• ALWAYS support backward compatibility during transition
• NEVER assume queue is empty
• ALWAYS handle job format mismatch gracefully
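On the producer side, "consumers first" only works if every payload carries an explicit version for workers to dispatch on (as getProcessor does in the worker example). A small sketch of a version-stamped envelope — the JobEnvelope shape is our assumption, not a standard:

```typescript
interface JobEnvelope<T> {
  version: number;   // workers dispatch on this, never guess
  type: string;
  payload: T;
  enqueuedAt: string;
}

// Producers always stamp an explicit version into the envelope.
function makeJob<T>(type: string, version: number, payload: T): JobEnvelope<T> {
  return { version, type, payload, enqueuedAt: new Date().toISOString() };
}
```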
Feature Flags: The Safety Net
Feature flags provide a safety net that decouples deployment from release.
Feature Flag Architecture
DEPLOYMENT VS RELEASE:
════════════════════════════════════════════════════════════════════
Traditional:
Deploy = Release (same moment)
Risk: Problems affect all users immediately
With Feature Flags:
Deploy: Code goes to production (flag off)
Release: Flag turned on (gradual, controlled)
Risk: Can release to 1% first, observe, expand
FEATURE FLAG DECISION FLOW:
════════════════════════════════════════════════════════════════════
Request comes in
│
▼
┌─────────────────┐
│ Check flag state│
│ for this user │
└────────┬────────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
Flag: OFF Flag: ON
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Execute old │ │ Execute new │
│ code path │ │ code path │
└─────────────────┘ └─────────────────┘
FLAG TYPES:
════════════════════════════════════════════════════════════════════
1. RELEASE FLAGS (temporary)
Purpose: Safely release new features
Lifecycle: Remove after feature is stable
Example: new_checkout_flow_enabled
2. OPS FLAGS (temporary)
Purpose: Operational control (kill switches)
Lifecycle: Remove when not needed
Example: disable_external_api_calls
3. EXPERIMENT FLAGS (temporary)
Purpose: A/B testing
Lifecycle: Remove after experiment concludes
Example: pricing_page_variant
4. PERMISSION FLAGS (long-lived)
Purpose: Feature access control
Lifecycle: Permanent (tied to entitlements)
Example: premium_analytics_enabled
Feature Flag Implementation
// feature-flags.ts
interface FeatureFlag {
key: string;
type: 'release' | 'ops' | 'experiment' | 'permission';
defaultValue: boolean;
rules: FlagRule[];
killSwitch?: boolean; // Override to always return false
}
interface FlagRule {
conditions: FlagCondition[];
percentage?: number; // Percentage rollout
value: boolean;
}
interface FlagCondition {
attribute: string; // user.id, user.email, user.plan, etc.
operator: 'equals' | 'contains' | 'in' | 'regex';
value: any;
}
interface FlagContext {
userId?: string;
email?: string;
accountId?: string;
plan?: string;
country?: string;
userAgent?: string;
// ... other attributes
}
class FeatureFlagService {
private flags: Map<string, FeatureFlag> = new Map();
private cache: Map<string, Map<string, boolean>> = new Map(); // flag -> userId -> value
async isEnabled(flagKey: string, context: FlagContext): Promise<boolean> {
const flag = await this.getFlag(flagKey);
if (!flag) {
console.warn(`Unknown flag: ${flagKey}, returning false`);
return false;
}
// Kill switch overrides everything
if (flag.killSwitch) {
return false;
}
// Check cache for this user
const cacheKey = this.getCacheKey(context);
if (this.cache.get(flagKey)?.has(cacheKey)) {
return this.cache.get(flagKey)!.get(cacheKey)!;
}
// Evaluate rules
const result = this.evaluateFlag(flag, context);
// Cache result
if (!this.cache.has(flagKey)) {
this.cache.set(flagKey, new Map());
}
this.cache.get(flagKey)!.set(cacheKey, result);
return result;
}
private evaluateFlag(flag: FeatureFlag, context: FlagContext): boolean {
// Check each rule in order
for (const rule of flag.rules) {
if (this.evaluateConditions(rule.conditions, context)) {
// Conditions match, check percentage rollout
if (rule.percentage !== undefined) {
return this.isInPercentage(context, flag.key, rule.percentage);
}
return rule.value;
}
}
// No rules matched, return default
return flag.defaultValue;
}
private evaluateConditions(conditions: FlagCondition[], context: FlagContext): boolean {
return conditions.every(condition => {
const contextValue = this.getContextValue(context, condition.attribute);
switch (condition.operator) {
case 'equals':
return contextValue === condition.value;
case 'contains':
return String(contextValue).includes(condition.value);
case 'in':
return condition.value.includes(contextValue);
case 'regex':
return new RegExp(condition.value).test(String(contextValue));
default:
return false;
}
});
}
private isInPercentage(context: FlagContext, flagKey: string, percentage: number): boolean {
// Deterministic: Same user always gets same result for same flag
const hash = this.hashString(`${flagKey}:${context.userId || context.accountId || 'anonymous'}`);
const bucket = hash % 100;
return bucket < percentage;
}
private hashString(str: string): number {
let hash = 0;
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32bit integer
}
return Math.abs(hash);
}
}
// Usage in deployment
class PaymentService {
constructor(private flags: FeatureFlagService) {}
async processPayment(order: Order, user: User) {
const context = { userId: user.id, plan: user.plan };
// Feature flag controls which code path runs
if (await this.flags.isEnabled('new_payment_processor', context)) {
return this.processWithStripe(order); // New code
} else {
return this.processWithBraintree(order); // Old code
}
}
}
// Gradual rollout example
const newFeatureFlag: FeatureFlag = {
key: 'new_checkout_flow',
type: 'release',
defaultValue: false,
rules: [
// Internal users always get new feature
{
conditions: [
{ attribute: 'email', operator: 'contains', value: '@ourcompany.com' }
],
value: true
},
// Beta users always get new feature
{
conditions: [
{ attribute: 'plan', operator: 'in', value: ['beta', 'early_access'] }
],
value: true
},
// 10% of regular users
{
conditions: [], // All other users
percentage: 10,
value: true
}
]
};
Feature Flag Deployment Pattern
SAFE DEPLOYMENT WITH FEATURE FLAGS:
════════════════════════════════════════════════════════════════════
Day 1: Deploy with flag OFF
────────────────────────────
1. Deploy code with new feature behind flag
2. Flag default: OFF (0% of users)
3. Verify deployment successful
4. Monitor error rates (should be unchanged)
Day 1: Enable for internal users
────────────────────────────────
1. Add rule: @ourcompany.com emails → ON
2. Internal testing in production
3. Monitor for issues
Day 2: Enable for beta users (1%)
─────────────────────────────────
1. Add rule: beta plan → ON
2. Or: 1% rollout to all users
3. Monitor: errors, latency, business metrics
4. Wait 24 hours minimum
Day 3-4: Gradual rollout
────────────────────────
1. Increase to 5% → monitor
2. Increase to 25% → monitor
3. Increase to 50% → monitor
4. Increase to 100%
Day 5+: Cleanup
───────────────
1. Remove feature flag code
2. Delete flag from system
3. CRITICAL: Don't leave flag code forever!
FLAG ROLLBACK (if issues found):
════════════════════════════════════════════════════════════════════
Option 1: Kill switch
─────────────────────
Set flag.killSwitch = true
→ Immediately returns false for all users
→ No deployment needed
→ Seconds to execute
Option 2: Set to 0%
───────────────────
Set rollout percentage to 0
→ All new requests use old code
→ No deployment needed
Option 3: Specific targeting
───────────────────────────
Add rule to exclude affected users
→ Surgical fix while investigating
IMPORTANT: Flag rollback is NOT a substitute for code rollback.
If the bug is severe, roll back the deployment too.
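To make the kill-switch semantics concrete: it is evaluated before any rule or percentage, which is what makes it a seconds-fast rollback. A compressed restatement of the evaluation order from the service above (simplified Flag shape, ours):

```typescript
interface Flag {
  defaultValue: boolean;
  percentage: number;   // 0-100 rollout
  killSwitch?: boolean; // wins over everything else
}

// Kill switch first, then percentage bucket, then the default.
function isEnabled(flag: Flag, bucket: number): boolean {
  if (flag.killSwitch) return false;
  if (bucket < flag.percentage) return true;
  return flag.defaultValue;
}
```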
Static Assets and CDN Considerations
The Cache Busting Problem
THE PROBLEM:
════════════════════════════════════════════════════════════════════
T+0: v1 deployed
- index.html references app.js
- CDN caches app.js (v1 code)
- Users load app.js from CDN edge
T+1: v2 deployed
- Server returns new index.html
- index.html references app.js (same URL!)
- User's browser has app.js cached locally
- CDN edge might still have v1 of app.js
Result: User gets new HTML with old JavaScript
→ Application breaks
SOLUTION: CONTENT-HASHED FILENAMES
════════════════════════════════════════════════════════════════════
v1 deployed:
index.html → references app.a1b2c3.js
CDN caches: app.a1b2c3.js
v2 deployed:
index.html → references app.d4e5f6.js (different hash!)
User requests app.d4e5f6.js
Not in cache → fetches from origin
Gets new code!
Implementation (webpack):
// webpack.config.js
module.exports = {
output: {
filename: '[name].[contenthash].js',
chunkFilename: '[name].[contenthash].chunk.js',
assetModuleFilename: 'assets/[name].[contenthash][ext]',
clean: true // Clears the local build dir only; keep OLD hashed files live on the CDN during rollout
},
optimization: {
moduleIds: 'deterministic', // Consistent chunk hashes
runtimeChunk: 'single',
splitChunks: {
cacheGroups: {
vendor: {
test: /[\\/]node_modules[\\/]/,
name: 'vendors',
chunks: 'all',
},
},
},
},
};
CDN Deployment Strategy
CDN DEPLOYMENT SEQUENCE:
════════════════════════════════════════════════════════════════════
WRONG ORDER (causes broken experiences):
────────────────────────────────────────
1. Update backend API
2. Deploy new HTML to CDN
3. Users get new HTML but old static assets (still cached)
4. App breaks
CORRECT ORDER:
──────────────
1. Deploy new static assets to CDN (new filenames)
app.d4e5f6.js now exists alongside app.a1b2c3.js
CDN: /assets/app.a1b2c3.js (old, still served)
/assets/app.d4e5f6.js (new, available)
2. Update backend API (new code deployed)
3. Update index.html (references new assets)
Users fetching index.html get reference to app.d4e5f6.js
4. Wait for old index.html cache to expire
Or: Short/no cache on index.html
5. Optionally cleanup old assets (after cache expiry)
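The correct order can be enforced by the deploy script itself rather than trusted to a runbook. A sketch where each step is a function you supply (the step names are illustrative):

```typescript
type Step = { name: string; run: () => Promise<void> };

// Run deploy steps strictly in order, stopping at the first failure so a
// broken asset upload never lets a new index.html go out referencing it.
async function runDeploy(steps: Step[]): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    await step.run(); // a throw aborts the sequence
    completed.push(step.name);
  }
  return completed;
}

// The sequence from above: assets first, then backend, then HTML.
const defaultSequence = ['upload-hashed-assets', 'deploy-backend', 'publish-index-html'];
```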
CDN CACHE STRATEGY:
════════════════════════════════════════════════════════════════════
File Type Cache Control Why
──────────────────────────────────────────────────────────────────
index.html no-cache, max-age=0 Always fresh
(or short max-age=60)
*.{hash}.js max-age=31536000 Immutable (hash
*.{hash}.css (1 year) changes on change)
immutable
/api/* no-store Dynamic content
service-worker.js max-age=0 Must be fresh for
update checks
Nginx example:
# nginx CDN origin configuration
# HTML - always revalidate
location ~* \.html$ {
add_header Cache-Control "no-cache, max-age=0, must-revalidate";
}
# Hashed assets - cache forever
location ~* \.[a-f0-9]{8,}\.(js|css|woff2|png|jpg|svg)$ {
add_header Cache-Control "public, max-age=31536000, immutable";
}
# Non-hashed assets - short cache
location ~* \.(js|css|woff2|png|jpg|svg)$ {
add_header Cache-Control "public, max-age=3600";
}
# API - no cache
location /api/ {
add_header Cache-Control "no-store";
}
Multi-Version Asset Support
SUPPORTING MULTIPLE VERSIONS DURING ROLLOUT:
════════════════════════════════════════════════════════════════════
Scenario: Canary deployment with frontend changes
Problem:
- 10% of users get v2 backend
- v2 backend might expect v2 frontend
- But user might have v1 frontend cached
Solutions:
1. ASSET VERSION IN API RESPONSE
─────────────────────────────────
Backend returns expected asset version in API response
Frontend checks if its version matches
If mismatch: Force refresh
// Frontend code
const response = await fetch('/api/user');
const data = await response.json();
if (data.meta.expectedFrontendVersion !== window.APP_VERSION) {
// Clear cache and reload
if ('caches' in window) {
const keys = await caches.keys();
await Promise.all(keys.map(key => caches.delete(key)));
}
window.location.reload(); // the forced-reload argument is deprecated; caches were cleared above
}
2. VERSION-MATCHED ROUTING
──────────────────────────
Route requests based on frontend version
// Request header from frontend
X-Frontend-Version: 2.3.1
// Backend routing logic
if (request.headers['x-frontend-version'] === '2.3.1') {
routeToV2Backend();
} else {
routeToV1Backend();
}
3. BACKWARD COMPATIBLE APIs
───────────────────────────
APIs support both old and new frontend expectations
(This is the safest approach)
// API response includes both old and new field names
{
"user_name": "alice", // v1 frontend uses this
"username": "alice", // v2 frontend uses this
"displayName": "Alice" // v2 frontend uses this
}
Monitoring During Deployments
Deployment Observability
WHAT TO MONITOR DURING DEPLOYMENT:
════════════════════════════════════════════════════════════════════
┌──────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT DASHBOARD │
├──────────────────────────────────────────────────────────────────┤
│ │
│ DEPLOYMENT STATUS │
│ ├── Current version: v2.3.1 │
│ ├── Previous version: v2.3.0 │
│ ├── Instances: 8/10 running v2.3.1 │
│ └── Status: ROLLING (80% complete) │
│ │
│ ERROR RATES (compare to baseline) │
│ ├── HTTP 5xx: 0.02% (baseline: 0.01%) ⚠ +100% │
│ ├── HTTP 4xx: 2.1% (baseline: 2.0%) ✓ normal │
│ └── Exceptions: 5/min (baseline: 3/min) ⚠ elevated │
│ │
│ LATENCY (compare to baseline) │
│ ├── p50: 45ms (baseline: 42ms) ✓ +7% │
│ ├── p95: 180ms (baseline: 165ms) ✓ +9% │
│ └── p99: 450ms (baseline: 350ms) ⚠ +29% │
│ │
│ SATURATION │
│ ├── CPU: 45% (baseline: 40%) ✓ normal │
│ ├── Memory: 68% (baseline: 65%) ✓ normal │
│ └── DB connections: 80/100 ⚠ elevated │
│ │
│ BUSINESS METRICS │
│ ├── Checkout completion: 3.2% (baseline: 3.1%) ✓ │
│ ├── Search success: 94% (baseline: 95%) ⚠ -1% │
│ └── API calls/sec: 1,250 (baseline: 1,200) ✓ │
│ │
└──────────────────────────────────────────────────────────────────┘
Automated Rollback Triggers
// deployment-monitor.ts
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
interface DeploymentMetrics {
errorRate5xx: number; // percentage
errorRateExceptions: number;
latencyP99Ms: number;
cpuPercent: number;
memoryPercent: number;
}
interface RollbackThresholds {
errorRate5xxMax: number; // e.g., 1%
errorRate5xxIncreaseMax: number; // e.g., 5x baseline
latencyP99IncreaseMax: number; // e.g., 2x baseline
evaluationWindowSeconds: number; // e.g., 300 (5 min)
}
class DeploymentMonitor {
private baseline: DeploymentMetrics;
private thresholds: RollbackThresholds;
async monitorDeployment(deploymentId: string): Promise<void> {
// Get baseline metrics (before deployment)
this.baseline = await this.getBaselineMetrics();
// Monitor during deployment
const startTime = Date.now();
const monitorDuration = 30 * 60 * 1000; // 30 minutes
while (Date.now() - startTime < monitorDuration) {
const current = await this.getCurrentMetrics();
const evaluation = this.evaluateMetrics(current);
if (evaluation.shouldRollback) {
console.error('Rollback triggered:', evaluation.reason);
await this.triggerRollback(deploymentId, evaluation.reason);
return;
}
if (evaluation.warnings.length > 0) {
await this.alertTeam(evaluation.warnings);
}
await sleep(10000); // Check every 10 seconds
}
console.log('Deployment monitoring completed successfully');
}
private evaluateMetrics(current: DeploymentMetrics): EvaluationResult {
const warnings: string[] = [];
// Check absolute thresholds
if (current.errorRate5xx > this.thresholds.errorRate5xxMax) {
return {
shouldRollback: true,
reason: `5xx error rate ${current.errorRate5xx}% exceeds maximum ${this.thresholds.errorRate5xxMax}%`
};
}
// Check relative thresholds (compared to baseline)
const errorRateIncrease = current.errorRate5xx / Math.max(this.baseline.errorRate5xx, 0.001);
if (errorRateIncrease > this.thresholds.errorRate5xxIncreaseMax) {
return {
shouldRollback: true,
reason: `5xx error rate increased ${errorRateIncrease.toFixed(1)}x from baseline`
};
}
const latencyIncrease = current.latencyP99Ms / this.baseline.latencyP99Ms;
if (latencyIncrease > this.thresholds.latencyP99IncreaseMax) {
return {
shouldRollback: true,
reason: `p99 latency increased ${latencyIncrease.toFixed(1)}x from baseline`
};
}
// Warnings (don't rollback, but alert)
if (errorRateIncrease > 2) {
warnings.push(`5xx error rate elevated: ${errorRateIncrease.toFixed(1)}x baseline`);
}
if (latencyIncrease > 1.5) {
warnings.push(`p99 latency elevated: ${latencyIncrease.toFixed(1)}x baseline`);
}
return { shouldRollback: false, warnings };
}
private async triggerRollback(deploymentId: string, reason: string) {
// Notify team immediately
await this.sendAlert({
severity: 'critical',
title: 'Automatic Rollback Triggered',
message: reason,
deploymentId
});
// Execute rollback
await this.deploymentService.rollback(deploymentId);
// Log for post-mortem
await this.logRollback({
deploymentId,
reason,
metrics: await this.getCurrentMetrics(),
baseline: this.baseline,
timestamp: new Date().toISOString()
});
}
}
Version-Aware Logging
// logging.ts
// Add version to all log entries
const logger = winston.createLogger({
defaultMeta: {
version: process.env.APP_VERSION,
deploymentId: process.env.DEPLOYMENT_ID,
instance: process.env.HOSTNAME
},
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [/* ... */]
});
// Log entry example
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "error",
"message": "Payment processing failed",
"version": "2.3.1", // Which version produced this log
"deploymentId": "deploy-abc", // Which deployment
"instance": "api-pod-xyz", // Which instance
"error": {
"code": "STRIPE_ERROR",
"message": "Card declined"
}
}
// Query in logging platform:
// version:2.3.1 AND level:error AND timestamp:[now-30m TO now]
// Compare error rates between versions during deployment
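With the version stamped on every line, comparing versions during a rollout reduces to a group-by over parsed entries. A sketch (the LogEntry shape matches the example above):

```typescript
interface LogEntry { version: string; level: string }

// Error rate per version: during a rolling deploy, a v2 rate well above
// v1's on the same traffic is the clearest rollback signal.
function errorRateByVersion(entries: LogEntry[]): Map<string, number> {
  const totals = new Map<string, { errors: number; all: number }>();
  for (const e of entries) {
    const t = totals.get(e.version) ?? { errors: 0, all: 0 };
    t.all++;
    if (e.level === 'error') t.errors++;
    totals.set(e.version, t);
  }
  const rates = new Map<string, number>();
  totals.forEach((t, version) => rates.set(version, t.errors / t.all));
  return rates;
}
```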
Common Failure Modes
┌─────────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT FAILURE MODES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ FAILURE: Database migration locks table │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Queries timeout during deployment │
│ Cause: ALTER TABLE without CONCURRENTLY/ONLINE │
│ Fix: Use non-locking migration strategies │
│ Prevent: Test migrations against production-size data │
│ │
│ FAILURE: New code, old database schema │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: New instances crash or error immediately │
│ Cause: Migration didn't run, or ran out of order │
│ Fix: Roll back code, investigate migration │
│ Prevent: Migration as separate deployment step │
│ │
│ FAILURE: Health check passes but app broken │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: 200 OK health checks, but 500s on real requests │
│ Cause: Health check too simple │
│ Fix: Add request path to health check, verify dependencies │
│ Prevent: Health checks that exercise critical paths │
│ │
│ FAILURE: Connection pool exhaustion │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Errors after deployment, even though code is fine │
│ Cause: Old instances held connections, new instances can't get │
│ Fix: Wait for old instances to fully drain │
│ Prevent: pool size × (old + new instances) < DB max connections │
│ │
│ FAILURE: Cache format mismatch │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Random errors during mixed-version period │
│ Cause: v1 wrote cache, v2 can't read it (or vice versa) │
│ Fix: Clear cache or deploy backward-compatible readers │
│ Prevent: Versioned cache keys or compatible formats │
│ │
│ FAILURE: Thundering herd on restart │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Downstream services overwhelmed during deployment │
│ Cause: All new instances hit cold caches simultaneously │
│ Fix: Stagger instance startup, warm caches before traffic │
│ Prevent: Startup jitter, cache warming in readiness check │
│ │
│ FAILURE: Long-running requests killed │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Report generation, file uploads fail during deploy │
│ Cause: Drain timeout shorter than request duration │
│ Fix: Increase drain timeout or move to async processing │
│ Prevent: Know your max request duration, set drain accordingly │
│ │
│ FAILURE: WebSocket connections dropped │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Real-time features break during deployment │
│ Cause: No graceful WebSocket shutdown │
│ Fix: Send close frames, client reconnect logic │
│ Prevent: WebSocket server graceful shutdown, client retry │
│ │
│ FAILURE: Background job left in bad state │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Jobs stuck, duplicated, or data inconsistent │
│ Cause: Worker killed mid-job │
│ Fix: Idempotent jobs, proper job status tracking │
│ Prevent: Graceful worker shutdown, distributed locks │
│ │
│ FAILURE: API version mismatch │
│ ───────────────────────────────────────────────────────────────── │
│ Symptom: Mobile app or frontend breaks during deploy │
│ Cause: API changed in non-backward-compatible way │
│ Fix: Roll back, maintain backward compatibility │
│ Prevent: API versioning, backward compatible changes only │
│ │
└─────────────────────────────────────────────────────────────────────┘
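The "health check passes but app broken" failure is worth a concrete counter-example: a readiness probe that exercises critical dependencies and reports which one is failing. A hedged sketch — the dependency checks stand in for your real DB/cache clients:

```typescript
interface DependencyCheck { name: string; check: () => Promise<boolean> }

// Readiness = every critical dependency answers. A bare `return 200`
// tells the load balancer nothing about whether real requests will work.
async function readiness(deps: DependencyCheck[]): Promise<{ ok: boolean; failing: string[] }> {
  const results = await Promise.all(
    deps.map(async d => ({ name: d.name, ok: await d.check().catch(() => false) }))
  );
  const failing = results.filter(r => !r.ok).map(r => r.name);
  return { ok: failing.length === 0, failing };
}
```

Wired into the /health/ready endpoint, this would return 503 with the failing dependency names instead of a meaningless 200.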
The Zero-Downtime Checklist
┌─────────────────────────────────────────────────────────────────────┐
│ PRE-DEPLOYMENT CHECKLIST │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ DATABASE │
│ □ Migration is backward compatible (old code works) │
│ □ Migration tested against production-size data │
│ □ No table locks (CONCURRENTLY, ONLINE, etc.) │
│ □ Expand phase separate from contract phase │
│ │
│ CODE │
│ □ Feature flagged if risky │
│ □ Backward compatible with old instances │
│ □ Handles both old and new data formats │
│ □ API changes are additive (not breaking) │
│ │
│ INFRASTRUCTURE │
│ □ Health checks actually verify functionality │
│ □ Graceful shutdown implemented │
│ □ Drain timeout > max request duration │
│ □ Connection pool size appropriate │
│ │
│ CACHING │
│ □ Cache format compatible or versioned keys │
│ □ No thundering herd on cold cache │
│ □ Cache invalidation strategy defined │
│ │
│ QUEUES/WORKERS │
│ □ Job format compatible │
│ □ Workers can gracefully shutdown │
│ □ Jobs are idempotent │
│ │
│ STATIC ASSETS │
│ □ Content-hashed filenames │
│ □ Deployed before backend │
│ □ CDN cache headers correct │
│ │
│ MONITORING │
│ □ Deployment metrics dashboard ready │
│ □ Rollback triggers configured │
│ □ Alerting in place │
│ □ Baseline metrics recorded │
│ │
│ ROLLBACK │
│ □ Rollback procedure documented and tested │
│ □ Previous version still deployable │
│ □ Database rollback plan (if needed) │
│ │
└─────────────────────────────────────────────────────────────────────┘
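The "jobs are idempotent" item deserves emphasis, because a worker killed mid-deploy will usually receive the same job again after restart. One common shape is a completion-record guard; this is a sketch where the in-memory `seen` set stands in for a durable store (a database table or Redis set keyed by job id):

```python
def process_once(job_id, seen, handler):
    """Idempotent-ish job wrapper: skip jobs already recorded as done.

    Sketch only: `seen` stands in for durable storage. Note the gap:
    a crash between handler() and seen.add() re-runs the job, so the
    handler itself must still tolerate being executed twice.
    """
    if job_id in seen:
        return "skipped"
    result = handler()
    seen.add(job_id)  # record completion only after the work succeeds
    return result
```

This is why the checklist pairs "jobs are idempotent" with "graceful worker shutdown": the wrapper narrows the duplicate window, and draining workers before killing them narrows it further, but neither eliminates it on its own.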
┌─────────────────────────────────────────────────────────────────────┐
│ DURING DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ □ Watch error rates (compare to baseline) │
│ □ Watch latency (especially p99) │
│ □ Watch instance health (startup, liveness, readiness) │
│ □ Watch resource saturation (CPU, memory, connections) │
│ □ Watch business metrics (conversion, success rates) │
│ □ Be ready to rollback │
│ │
└─────────────────────────────────────────────────────────────────────┘
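"Be ready to rollback" works much better when the trigger is mechanical rather than a judgment call made under stress. A sketch of an error-rate trigger against the recorded baseline (the tolerance multiplier and minimum sample size are illustrative; real tooling would also gate on p99 latency and saturation, per the list above):

```python
def should_rollback(baseline_error_rate, window_errors, window_requests,
                    tolerance=3.0, min_requests=100):
    """Return True if the deploy-window error rate exceeds
    `tolerance` x the pre-deploy baseline.

    Sketch only: thresholds are illustrative, not recommendations.
    """
    if window_requests < min_requests:
        return False  # not enough traffic to judge yet
    rate = window_errors / window_requests
    # Avoid a hair-trigger when the baseline is near zero.
    floor = max(baseline_error_rate, 1e-4)
    return rate > tolerance * floor
```

The `min_requests` guard matters: early in a rollout the new instances have seen so little traffic that one unlucky request can look like a 100% error rate.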
┌─────────────────────────────────────────────────────────────────────┐
│ POST-DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ □ Verify all instances on new version │
│ □ Monitor for slow-burn issues (memory leaks, etc.) │
│ □ Check background job processing │
│ □ Verify external integrations still working │
│ □ Clean up feature flags (if applicable) │
│ □ Schedule contract migration (if applicable) │
│ □ Document any issues encountered │
│ │
└─────────────────────────────────────────────────────────────────────┘
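The first post-deployment item, verifying that every instance is on the new version, is worth automating, since a straggler on the old build silently reintroduces every mixed-version hazard. A sketch, assuming deploy tooling can collect a version string from each instance (for example from a version endpoint, which is an assumption, not a given):

```python
def all_on_version(instance_versions, expected):
    """Post-deploy check: confirm every instance reports the new build.

    `instance_versions` maps instance name -> reported version string.
    Returns (ok, stragglers) so tooling can name the instances left behind.
    """
    stragglers = {name: v
                  for name, v in instance_versions.items()
                  if v != expected}
    return len(stragglers) == 0, stragglers
```

Returning the stragglers, rather than a bare boolean, turns "the deploy looks stuck" into "web-7 is still on the old build", which is the difference between a five-minute fix and an hour of digging.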
Conclusion: The Unglamorous Truth
Zero-downtime deployments are achievable, but they require:
- Discipline: Following the expand-contract pattern, even when it feels slow
- Testing: Verifying migrations against production-scale data
- Monitoring: Watching the right metrics during deployment
- Humility: Accepting that "zero" is aspirational, not absolute
The teams that do this well share common traits:
- They deploy frequently (practice makes better)
- They keep changes small (easier to debug)
- They automate rollbacks (not just deployments)
- They treat incidents as learning opportunities
The diagram with the blue and green boxes isn't wrong—it's just incomplete. The real work is in the details: the database migrations, the health checks, the connection draining, and the hundred other small things that determine success.
Zero-downtime deployment isn't a feature you enable. It's a practice you develop over time, one unglamorous detail at a time.