Blue-Green Deployment Architecture for Frontends: Zero-Downtime Releases at Scale
Introduction: Why Blue-Green Deployments Exist
When Shopify's engineering team pushed a critical checkout update during Black Friday 2019, they needed absolute certainty that if anything went wrong, they could revert to the previous version in under 30 seconds. They achieved this using blue-green deployments—maintaining two identical production environments and switching traffic atomically between them.
Blue-green deployment is conceptually simple: run two identical environments (Blue and Green), deploy to the inactive one, test it, then switch traffic. But at scale, the implementation becomes complex, especially for frontends where static assets, CDN caches, client-side state, and hydration timing create challenges that don't exist in backend systems.
This article covers how production systems implement blue-green deployments for frontends, including the infrastructure decisions, traffic switching mechanisms, database considerations, and the failure modes you'll encounter at scale.
Scale Context: Production Reality
Traffic Profile:
- DAU: 40M daily active users
- Peak RPS: 380K requests/second
- Asset Requests: 2.2M RPS (JS/CSS/images)
- Geographic Distribution: 150+ countries
- CDN PoPs: 280+ edge locations
- Deployment Frequency: 15-25 deploys/day
Infrastructure Requirements:
- Environment Parity: 100% identical Blue and Green stacks
- Switch Time: <5 seconds (load balancer switch; DNS-based switching is slower because of TTL caching)
- Rollback Time: <10 seconds
- Zero Downtime: 99.99% availability during deployments
- Cost Overhead: 2x infrastructure (both environments hot)
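To make the availability target concrete, the sketch below works out how much downtime 99.99% leaves per day and per deploy at the upper end of the stated deployment frequency. The function names are illustrative, and the even split across deploys is an assumption made only for the arithmetic.

```typescript
// Hypothetical helpers: downtime budget implied by an availability target.
const SECONDS_PER_DAY = 24 * 60 * 60;

function dailyDowntimeBudgetSeconds(availability: number): number {
  return SECONDS_PER_DAY * (1 - availability);
}

// Assumes the daily budget is shared evenly by every deploy that day.
function perDeployBudgetSeconds(availability: number, deploysPerDay: number): number {
  return dailyDowntimeBudgetSeconds(availability) / deploysPerDay;
}

const daily = dailyDowntimeBudgetSeconds(0.9999);
const perDeploy = perDeployBudgetSeconds(0.9999, 25);
console.log(`${daily.toFixed(2)}s/day, ${perDeploy.toFixed(3)}s per deploy`);
```

At 25 deploys a day the budget comes to roughly a third of a second per deploy, which is why the switch mechanism matters more than any other design decision here.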
Frontend Architecture:
- Framework: Next.js 14 with App Router
- Rendering: Hybrid SSR/SSG/ISR
- Bundle Size: 1.1MB initial, 4.2MB total
- API Dependencies: 8-12 services per page
- WebSocket Connections: 5M concurrent
High-Level Architecture: Blue-Green System
┌─────────────────────────────────────────────────────────────────────┐
│                            USER REQUEST                             │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1: DNS / Global Load Balancer                                │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │  Traffic Router (Active Environment Pointer)                │   │
│   │  - Current: BLUE (v47)                                      │   │
│   │  - Standby: GREEN (v48) ← Deploying here                    │   │
│   └─────────────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              │                                 │
              ▼                                 ▼
 ┌──────────────────────────┐      ┌──────────────────────────┐
 │     BLUE ENVIRONMENT     │      │    GREEN ENVIRONMENT     │
 │         (ACTIVE)         │      │        (STANDBY)         │
 │                          │      │                          │
 │  ┌────────────────────┐  │      │  ┌────────────────────┐  │
 │  │  CDN Edge (Blue)   │  │      │  │  CDN Edge (Green)  │  │
 │  │  /dist-blue/       │  │      │  │  /dist-green/      │  │
 │  └────────────────────┘  │      │  └────────────────────┘  │
 │            │             │      │            │             │
 │            ▼             │      │            ▼             │
 │  ┌────────────────────┐  │      │  ┌────────────────────┐  │
 │  │  Origin Servers    │  │      │  │  Origin Servers    │  │
 │  │  (K8s: blue-ns)    │  │      │  │  (K8s: green-ns)   │  │
 │  │  - SSR Pods (20)   │  │      │  │  - SSR Pods (20)   │  │
 │  │  - BFF Pods (10)   │  │      │  │  - BFF Pods (10)   │  │
 │  └────────────────────┘  │      │  └────────────────────┘  │
 │            │             │      │            │             │
 │            ▼             │      │            ▼             │
 │  ┌────────────────────┐  │      │  ┌────────────────────┐  │
 │  │  Static Assets     │  │      │  │  Static Assets     │  │
 │  │  S3: blue-bucket   │  │      │  │  S3: green-bucket  │  │
 │  └────────────────────┘  │      │  └────────────────────┘  │
 └──────────────────────────┘      └──────────────────────────┘
              │                                 │
              └────────────────┬────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  SHARED INFRASTRUCTURE (Both environments connect)                  │
│   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐   │
│   │  Database       │   │  Redis          │   │  Message Queue  │   │
│   │  (PostgreSQL)   │   │  (Session)      │   │  (Kafka)        │   │
│   └─────────────────┘   └─────────────────┘   └─────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
Key Architectural Principles:
- Complete Environment Isolation - Blue and Green have separate Kubernetes namespaces, CDN configurations, S3 buckets, and load balancers. They share only the database and external services.
- Atomic Traffic Switch - Traffic moves from Blue to Green in a single operation (DNS update or load balancer target group swap). No gradual migration.
- Hot Standby - Both environments run continuously at production capacity. Green can receive traffic immediately after switch.
- Shared Persistent State - Database, session store, and message queues are shared. Both environments must be compatible with the same data schema.
- Version-Tagged Assets - Each environment serves assets from its own path (/dist-blue/, /dist-green/) to prevent cache collisions.
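The principles above can be read as a small state model: deploys only touch the standby side, and both the switch and the rollback are the same pointer flip. The following is a minimal sketch of that idea; `PointerState`, `deployToStandby`, and `switchTraffic` are hypothetical names, not part of any real tool.

```typescript
// Hypothetical model of the "active environment pointer".
type Environment = 'blue' | 'green';

interface PointerState {
  active: Environment;
  versions: Record<Environment, string>; // release deployed in each environment
}

function standbyOf(env: Environment): Environment {
  return env === 'blue' ? 'green' : 'blue';
}

// Deploys only ever touch the standby side; the active side is left alone.
function deployToStandby(state: PointerState, version: string): PointerState {
  const standby = standbyOf(state.active);
  return { ...state, versions: { ...state.versions, [standby]: version } };
}

// The switch is a single pointer flip; rollback is the same flip applied again.
function switchTraffic(state: PointerState): PointerState {
  return { ...state, active: standbyOf(state.active) };
}

let state: PointerState = { active: 'blue', versions: { blue: 'v47', green: 'v47' } };
state = deployToStandby(state, 'v48');
state = switchTraffic(state);
console.log(state.active, state.versions[state.active]); // green v48
state = switchTraffic(state);
console.log(state.active, state.versions[state.active]); // blue v47
```

Because the old version stays deployed on the previous side, rolling back never requires a rebuild or redeploy.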
Traffic Switching Mechanisms
The traffic switch is the core operation in blue-green deployment. There are three primary approaches, each with different characteristics.
1. DNS-Based Switching
Before Switch
┌─────────────┐     DNS Query      ┌─────────────────┐
│   Browser   │ ──────────────────▶│   DNS Server    │
└─────────────┘                    │                 │
                                   │ app.example.com │
                                   │   → 1.2.3.4     │
                                   │   (Blue LB)     │
                                   └─────────────────┘
After Switch
┌─────────────┐     DNS Query      ┌─────────────────┐
│   Browser   │ ──────────────────▶│   DNS Server    │
└─────────────┘                    │                 │
                                   │ app.example.com │
                                   │   → 5.6.7.8     │
                                   │   (Green LB)    │
                                   └─────────────────┘
Implementation (AWS Route 53):
import { Route53Client, ChangeResourceRecordSetsCommand } from '@aws-sdk/client-route-53';
class DNSBasedSwitch {
private route53: Route53Client;
private hostedZoneId: string;
private recordName: string;
async switchToGreen(): Promise<void> {
const greenLbDns = 'green-lb-123456.us-east-1.elb.amazonaws.com';
const command = new ChangeResourceRecordSetsCommand({
HostedZoneId: this.hostedZoneId,
ChangeBatch: {
Comment: 'Switch traffic to Green environment',
Changes: [
{
Action: 'UPSERT',
ResourceRecordSet: {
Name: this.recordName,
Type: 'A',
AliasTarget: {
HostedZoneId: 'Z35SXDOTRQ7X7K', // ALB hosted zone
DNSName: greenLbDns,
EvaluateTargetHealth: true,
},
},
},
],
},
});
await this.route53.send(command);
// DNS propagation takes time
console.log('DNS updated. Propagation may take 60-300 seconds.');
}
async switchToBlue(): Promise<void> {
const blueLbDns = 'blue-lb-789012.us-east-1.elb.amazonaws.com';
// Same logic, different target
}
}
Why DNS Switching Is Problematic:
| Issue | Impact |
|---|---|
| TTL Propagation | DNS caches (browser, OS, ISP) hold old IP for 60-3600 seconds |
| No Instant Rollback | Rollback takes same propagation time as switch |
| Split Traffic | During propagation, some users hit Blue, some hit Green |
| Health Check Lag | Route 53 health checks run at 10- or 30-second intervals |
When to Use: Only for disaster recovery where you're switching to a completely different region. Not recommended for regular deployments.
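The split-traffic problem can be illustrated with a deliberately simple model. The sketch below assumes every client caches the record for exactly one TTL and that cache expiries are spread uniformly over that window; real resolvers may clamp or ignore TTLs, so treat the numbers as a lower bound on how long the split lasts. `fractionOnOldEnvironment` is a hypothetical helper.

```typescript
// Hypothetical model: share of clients still resolving to the old
// environment `elapsedSeconds` after the DNS record changes.
function fractionOnOldEnvironment(ttlSeconds: number, elapsedSeconds: number): number {
  if (ttlSeconds <= 0) return 0;
  const remaining = 1 - elapsedSeconds / ttlSeconds;
  return Math.min(1, Math.max(0, remaining));
}

for (const elapsed of [0, 15, 30, 60]) {
  const pct = fractionOnOldEnvironment(60, elapsed) * 100;
  console.log(`t+${elapsed}s: ${pct.toFixed(0)}% of clients still on Blue`);
}
```

Even in this best case, a rollback issued halfway through the window leaves users spread across both environments for another full TTL.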
2. Load Balancer Target Group Switching (Recommended)
graph TB
User[User Request] --> ALB[Application Load Balancer]
ALB --> Listener[HTTPS Listener :443]
Listener --> Rules{Listener Rules}
Rules -->|Active| BlueTG[Blue Target Group<br/>weight: 100%]
Rules -->|Standby| GreenTG[Green Target Group<br/>weight: 0%]
BlueTG --> BlueEC2[Blue Pods<br/>10.0.1.x]
GreenTG --> GreenEC2[Green Pods<br/>10.0.2.x]
subgraph Switch Operation
Before[Before: Blue 100%, Green 0%]
After[After: Blue 0%, Green 100%]
Before -->|Atomic Swap| After
end
Implementation (AWS ALB):
import {
  ElasticLoadBalancingV2Client,
  ModifyListenerCommand,
  DescribeTargetHealthCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

interface HealthStatus {
  healthyCount: number;
  unhealthyCount: number;
}

class LoadBalancerSwitch {
  constructor(
    private elbv2: ElasticLoadBalancingV2Client,
    private listenerArn: string,
    private blueTargetGroupArn: string,
    private greenTargetGroupArn: string,
  ) {}

  async switchToGreen(): Promise<void> {
    const duration = await this.forwardTo('Green', this.greenTargetGroupArn);
    console.log(`Traffic switched to Green in ${duration}ms`);
    // Log the switch event
    await this.logSwitchEvent('blue', 'green', duration);
  }

  async switchToBlue(): Promise<void> {
    await this.forwardTo('Blue', this.blueTargetGroupArn);
    console.log('Traffic switched to Blue (rollback complete)');
  }

  // Verify the target group is healthy, then point the listener at it
  private async forwardTo(label: string, targetGroupArn: string): Promise<number> {
    const health = await this.checkTargetGroupHealth(targetGroupArn);
    // An empty target group has zero unhealthy targets but cannot serve traffic
    if (health.unhealthyCount > 0 || health.healthyCount === 0) {
      throw new Error(
        `${label} environment unhealthy: ${health.unhealthyCount} targets down, ${health.healthyCount} healthy`
      );
    }
    // Atomic switch: change listener default action
    const startTime = Date.now();
    await this.elbv2.send(
      new ModifyListenerCommand({
        ListenerArn: this.listenerArn,
        DefaultActions: [{ Type: 'forward', TargetGroupArn: targetGroupArn }],
      })
    );
    return Date.now() - startTime;
  }

  private async checkTargetGroupHealth(targetGroupArn: string): Promise<HealthStatus> {
    // DescribeTargetGroups does not return health counts; DescribeTargetHealth
    // reports a state for each registered target.
    const response = await this.elbv2.send(
      new DescribeTargetHealthCommand({ TargetGroupArn: targetGroupArn })
    );
    const states = (response.TargetHealthDescriptions ?? []).map(
      (d) => d.TargetHealth?.State
    );
    const healthyCount = states.filter((s) => s === 'healthy').length;
    // Any state other than 'healthy' (initial, draining, unhealthy, ...) counts as down
    return { healthyCount, unhealthyCount: states.length - healthyCount };
  }

  // Placeholder audit hook; replace with your own event logging
  private async logSwitchEvent(from: string, to: string, durationMs: number): Promise<void> {
    console.log(`switch ${from} -> ${to} took ${durationMs}ms`);
  }
}
Why Load Balancer Switching Works:
| Benefit | Details |
|---|---|
| Instant Switch | ALB rule change takes <2 seconds |
| Instant Rollback | Same speed as forward switch |
| Health-Aware | Only switches if target group is healthy |
| No DNS Propagation | DNS points to ALB, which doesn't change |
| Connection Draining | Existing connections gracefully complete |
Connection Draining Behavior:
T+0s Switch command sent
T+0.5s New requests go to Green
T+0.5s Existing Blue connections continue
T+300s Default deregistration delay expires (the timer starts when Blue targets are deregistered, not at the listener change)
T+300s Remaining Blue connections forcefully closed (if still open)
3. Kubernetes Service Switching
For Kubernetes-native deployments, you can use label selectors to switch traffic:
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-blue
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: frontend
      version: blue
  template:
    metadata:
      labels:
        app: frontend
        version: blue
    spec:
      containers:
        - name: frontend
          image: frontend:v47
          ports:
            - containerPort: 3000
---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-green
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: frontend
      version: green
  template:
    metadata:
      labels:
        app: frontend
        version: green
    spec:
      containers:
        - name: frontend
          image: frontend:v48
          ports:
            - containerPort: 3000
---
# Service (points to active version)
apiVersion: v1
kind: Service
metadata:
  name: frontend
  namespace: production
spec:
  selector:
    app: frontend
    version: blue  # ← Switch this to 'green' for traffic switch
  ports:
    - port: 80
      targetPort: 3000
Switching Script:
import { KubeConfig, CoreV1Api } from '@kubernetes/client-node';
class KubernetesSwitch {
private k8sApi: CoreV1Api;
private namespace = 'production';
private serviceName = 'frontend';
async switchToGreen(): Promise<void> {
// Read current service
const service = await this.k8sApi.readNamespacedService(
this.serviceName,
this.namespace
);
// Update selector
service.body.spec!.selector = {
app: 'frontend',
version: 'green',
};
// Apply update
await this.k8sApi.replaceNamespacedService(
this.serviceName,
this.namespace,
service.body
);
console.log('Traffic switched to Green');
}
async switchToBlue(): Promise<void> {
const service = await this.k8sApi.readNamespacedService(
this.serviceName,
this.namespace
);
service.body.spec!.selector = {
app: 'frontend',
version: 'blue',
};
await this.k8sApi.replaceNamespacedService(
this.serviceName,
this.namespace,
service.body
);
console.log('Traffic switched to Blue (rollback)');
}
}
Comparison of Switching Mechanisms:
| Mechanism | Switch Time | Rollback Time | Complexity | Best For |
|---|---|---|---|---|
| DNS | 60-300s | 60-300s | Low | DR failover only |
| Load Balancer | <2s | <2s | Medium | Production deployments |
| K8s Service | <5s | <5s | Medium | K8s-native apps |
| Istio/Service Mesh | <1s | <1s | High | Advanced traffic control |
Frontend-Specific Blue-Green Challenges
Challenge 1: CDN Cache Coherence
The Problem:
User loads page during switch:
- HTML served from Green (new version)
- JS bundle request goes to CDN
- CDN has cached Blue version of main.js
- Hydration fails because JS doesn't match HTML
Timeline:
T+0 HTML request → Green origin → v48 HTML
T+0.1s JS request → CDN cache hit → v47 JS (cached from Blue)
T+0.2s Browser: "React hydration error: text mismatch"
Solution 1: Versioned Asset Paths
Each environment serves assets from a unique path:
// next.config.js
module.exports = {
assetPrefix: process.env.ASSET_PREFIX, // '/dist-blue/' or '/dist-green/'
generateBuildId: async () => {
return process.env.BUILD_VERSION; // 'v47' or 'v48'
},
};
HTML Output:
<!-- Blue environment -->
<script src="/dist-blue/_next/static/v47/main.js"></script>
<!-- Green environment -->
<script src="/dist-green/_next/static/v48/main.js"></script>
CDN Configuration (Cloudflare):
// Cloudflare Worker
interface Env {
  BLUE_ORIGIN: string;
  GREEN_ORIGIN: string;
  KV: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    // Forward the method, headers, body, and query string, not just the path
    const proxy = (origin: string): Promise<Response> =>
      fetch(new Request(`${origin}${url.pathname}${url.search}`, request));
    // Route based on asset prefix
    if (url.pathname.startsWith('/dist-blue/')) {
      return proxy(env.BLUE_ORIGIN);
    }
    if (url.pathname.startsWith('/dist-green/')) {
      return proxy(env.GREEN_ORIGIN);
    }
    // HTML requests go to active environment
    const activeEnv = await env.KV.get('active-environment');
    return proxy(activeEnv === 'green' ? env.GREEN_ORIGIN : env.BLUE_ORIGIN);
  }
};
Solution 2: Dual CDN Purge on Switch
class BlueGreenDeployer {
async switch(from: 'blue' | 'green', to: 'blue' | 'green'): Promise<void> {
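// Step 1: Warm up the standby environment. warmCDNCache requests the origin
// directly, so it primes SSR/ISR caches at the origin rather than CDN edges.
await this.warmCDNCache(to);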
// Step 2: Switch load balancer
await this.switchLoadBalancer(to);
// Step 3: Purge old environment's HTML from CDN
// (JS/CSS can stay - content-addressed URLs)
await this.purgeCDN(`/dist-${from}/*.html`);
console.log(`Switched from ${from} to ${to}`);
}
private async warmCDNCache(env: 'blue' | 'green'): Promise<void> {
const criticalPaths = [
'/',
'/products',
'/cart',
'/checkout',
];
const origin = env === 'green' ? this.greenOrigin : this.blueOrigin;
await Promise.all(
criticalPaths.map(path =>
fetch(`${origin}${path}`, {
headers: { 'X-Warm-Cache': 'true' }
})
)
);
}
}
Challenge 2: Client-Side State Compatibility
The Problem:
User has localStorage data from Blue (v47). Switch happens. User navigates, gets Green (v48). Green expects different localStorage schema.
// v47 (Blue) wrote this:
localStorage.setItem('preferences', JSON.stringify({
theme: 'dark',
notifications: true
}));
// v48 (Green) expects this:
interface PreferencesV2 {
version: 2;
ui: {
theme: 'dark' | 'light';
fontSize: number;
};
notifications: {
email: boolean;
push: boolean;
};
}
Solution: Schema Migration with Version Tolerance
interface PreferencesV1 {
theme: string;
notifications: boolean;
}
interface PreferencesV2 {
version: 2;
ui: { theme: 'dark' | 'light'; fontSize: number };
notifications: { email: boolean; push: boolean };
}
type Preferences = PreferencesV1 | PreferencesV2;
function loadPreferences(): PreferencesV2 {
const raw = localStorage.getItem('preferences');
if (!raw) {
return getDefaultPreferences();
}
try {
const data = JSON.parse(raw) as Preferences;
// Check version
if ('version' in data && data.version === 2) {
return data;
}
// Migrate v1 → v2
const v1 = data as PreferencesV1;
const migrated: PreferencesV2 = {
version: 2,
ui: {
theme: v1.theme === 'dark' ? 'dark' : 'light',
fontSize: 16,
},
notifications: {
email: v1.notifications,
push: v1.notifications,
},
};
// Save migrated version
localStorage.setItem('preferences', JSON.stringify(migrated));
return migrated;
} catch (error) {
console.error('Failed to load preferences, resetting', error);
const defaults = getDefaultPreferences();
localStorage.setItem('preferences', JSON.stringify(defaults));
return defaults;
}
}
Bidirectional Compatibility:
During blue-green transition, users may switch between environments. Both versions must handle each other's data:
// Both v47 and v48 must include this
function loadPreferencesCompat(): PreferencesV2 {
const raw = localStorage.getItem('preferences');
if (!raw) return getDefaultPreferences();
const data = JSON.parse(raw);
// v2 format
if (data.version === 2) {
return data;
}
// v1 format (migrate)
return migrateV1ToV2(data);
}
// v47 must also read v2 format (forward compatibility)
function loadPreferencesV47(): PreferencesV1 {
const raw = localStorage.getItem('preferences');
if (!raw) return { theme: 'light', notifications: true };
const data = JSON.parse(raw);
// Handle v2 format written by v48
if (data.version === 2) {
return {
theme: data.ui.theme,
notifications: data.notifications.email,
};
}
return data;
}
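The loaders above trust whatever `JSON.parse` returns, and a type assertion does not check anything at runtime. One way to harden them, sketched here under the assumption that malformed data should be treated as missing, is a type guard. `isPreferencesV2` and `parseStoredPreferences` are hypothetical helpers, and the function takes the raw string so it does not depend on `localStorage` directly.

```typescript
interface PreferencesV2 {
  version: 2;
  ui: { theme: 'dark' | 'light'; fontSize: number };
  notifications: { email: boolean; push: boolean };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Runtime check that a parsed value really has the v2 shape.
function isPreferencesV2(value: unknown): value is PreferencesV2 {
  if (!isRecord(value) || value.version !== 2) return false;
  const { ui, notifications } = value;
  if (!isRecord(ui) || !isRecord(notifications)) return false;
  return (
    (ui.theme === 'dark' || ui.theme === 'light') &&
    typeof ui.fontSize === 'number' &&
    typeof notifications.email === 'boolean' &&
    typeof notifications.push === 'boolean'
  );
}

// Corrupted or foreign data yields null instead of throwing, so the caller
// can fall back to migration or defaults.
function parseStoredPreferences(raw: string | null): PreferencesV2 | null {
  if (raw === null) return null;
  try {
    const data: unknown = JSON.parse(raw);
    return isPreferencesV2(data) ? data : null;
  } catch {
    return null;
  }
}
```

A caller would pass `localStorage.getItem('preferences')` and run the v1 migration only when the guard returns `null` for data that still parses.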
Challenge 3: In-Flight Requests During Switch
The Problem:
User loads the checkout page from Blue at T+0 and receives a CSRF token that Blue keeps in its own local state. Switch happens at T+0.5s. User clicks "Submit Order" at T+1s, the request lands on Green, and Green has no record of that token.
T+0s User: GET /checkout (Blue issues CSRF token)
T+0.5s Operator: Switch to Green
T+1s User: POST /api/orders (routed to Green)
T+1s Response: 403 Forbidden (CSRF token invalid)
T+1s User: "My order failed!"
Solution: Shared Session Store
// Redis session store (shared by Blue and Green)
import Redis from 'ioredis';
class SessionStore {
private redis: Redis;
async getSession(sessionId: string): Promise<Session | null> {
const data = await this.redis.get(`session:${sessionId}`);
return data ? JSON.parse(data) : null;
}
async setSession(sessionId: string, session: Session): Promise<void> {
await this.redis.setex(
`session:${sessionId}`,
86400, // 24 hour TTL
JSON.stringify(session)
);
}
async validateCSRF(sessionId: string, token: string): Promise<boolean> {
const session = await this.getSession(sessionId);
return session?.csrfToken === token;
}
}
Connection Draining Strategy:
Configure load balancer to wait for in-flight requests:
// AWS ALB deregistration delay
const targetGroupConfig = {
TargetGroupArn: blueTargetGroupArn,
Attributes: [
{
Key: 'deregistration_delay.timeout_seconds',
Value: '30', // Wait 30s for in-flight requests
},
],
};
Timeline with Connection Draining:
T+0s User: POST /api/orders (Blue)
T+0.5s Operator: Switch to Green
T+0.5s New requests → Green
T+0.5s Blue still processing existing request
T+1s Blue returns response to user
T+1s User: "Order successful!"
T+30.5s Blue connections fully drained
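The draining timeline above boils down to one rule: the old environment may be torn down once nothing is in flight, or once the drain timeout has passed, whichever comes first. Here is a minimal sketch of that rule as an application-level gate. `InFlightTracker` is a hypothetical class, and time is passed in explicitly so the logic can be checked without waiting.

```typescript
// Hypothetical drain gate for the environment that just lost traffic.
class InFlightTracker {
  private inFlight = 0;
  private drainStartedAt: number | null = null;

  constructor(private readonly drainTimeoutMs: number) {}

  begin(): void {
    this.inFlight += 1;
  }

  end(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  startDraining(nowMs: number): void {
    this.drainStartedAt = nowMs;
  }

  canTerminate(nowMs: number): boolean {
    if (this.drainStartedAt === null) return false; // still serving traffic
    if (this.inFlight === 0) return true; // fully drained
    return nowMs - this.drainStartedAt >= this.drainTimeoutMs; // timed out
  }
}

// Replaying the timeline: request starts at T+0, switch at T+0.5s,
// response at T+1s, with a 30s drain timeout.
const tracker = new InFlightTracker(30_000);
tracker.begin();
tracker.startDraining(500);
console.log(tracker.canTerminate(600)); // false: the order request is still running
tracker.end();
console.log(tracker.canTerminate(1_000)); // true: nothing left in flight
```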
Challenge 4: WebSocket Connection Migration
The Problem:
Users have active WebSocket connections to Blue. Switch to Green. WebSockets are still connected to Blue, which is now stale.
Solution: Graceful WebSocket Migration
// Server-side: Broadcast migration notice before switch
class WebSocketManager {
private connections: Map<string, WebSocket> = new Map();
async prepareForSwitch(): Promise<void> {
// Notify all connected clients
const migrationMessage = JSON.stringify({
type: 'ENVIRONMENT_SWITCH',
reconnectIn: 5000, // 5 seconds
newEndpoint: 'wss://green.example.com/ws',
});
for (const [id, ws] of this.connections) {
ws.send(migrationMessage);
}
// Wait for clients to reconnect to Green
await sleep(10000);
// Close remaining connections
for (const [id, ws] of this.connections) {
ws.close(1000, 'Environment switch');
}
}
}
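// Client-side: Handle migration
class WebSocketClient {
  private ws!: WebSocket;
  private attempts = 0;

  constructor(private endpoint: string) {}

  connect(): void {
    const ws = new WebSocket(this.endpoint);
    this.ws = ws;
    ws.onopen = () => {
      this.attempts = 0;
    };
    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      if (message.type === 'ENVIRONMENT_SWITCH') {
        console.log('Server switching environments, reconnecting...');
        // Reconnect to new endpoint after delay
        setTimeout(() => {
          this.endpoint = message.newEndpoint;
          // Detach onclose first: otherwise this planned close also triggers
          // the backoff reconnect and the client ends up with two sockets
          ws.onclose = null;
          ws.close();
          this.connect();
        }, message.reconnectIn);
      }
    };
    ws.onclose = () => {
      // Auto-reconnect with exponential backoff
      this.reconnectWithBackoff();
    };
  }

  private reconnectWithBackoff(): void {
    const delayMs = Math.min(30000, 1000 * 2 ** this.attempts);
    this.attempts += 1;
    setTimeout(() => this.connect(), delayMs);
  }
}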
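With millions of concurrent connections, a fixed `reconnectIn` means every client reconnects to Green at the same instant. A common mitigation, shown here as a sketch rather than as part of the original design, is to add random jitter on the client. Both function names and the 30-second window are assumptions.

```typescript
// Hypothetical helper: spread client reconnects across a window instead of
// having every client reconnect at the same moment.
function jitteredReconnectDelayMs(
  baseDelayMs: number,
  jitterWindowMs: number,
  random: () => number = Math.random, // injectable so tests are deterministic
): number {
  return baseDelayMs + Math.floor(random() * jitterWindowMs);
}

// Rough reconnect rate if clients spread evenly over the jitter window.
function reconnectsPerSecond(connections: number, jitterWindowMs: number): number {
  return connections / (jitterWindowMs / 1000);
}

console.log(jitteredReconnectDelayMs(5000, 30_000));
console.log(reconnectsPerSecond(5_000_000, 30_000));
```

The client would use `jitteredReconnectDelayMs(message.reconnectIn, windowMs)` in place of the raw `reconnectIn`, and the server-side wait before force-closing connections would need to cover the whole window.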
Database Compatibility: The Hardest Problem
Blue-green deployment for frontends is relatively straightforward. The database is where things get complicated.
The Database Challenge
Both Blue and Green environments connect to the same database. If Green requires a schema change, you have two options:
- Backward-Compatible Migrations - Green's schema changes must work with Blue's code
- Database Blue-Green - Maintain two databases with replication (complex)
Strategy 1: Backward-Compatible Migrations
Rule: Every migration must work with both old and new application code.
Example: Adding a Column
-- Migration: Add 'middle_name' column
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);
-- This works because:
-- - Blue (old) ignores middle_name when reading (SELECT * is fine)
-- - Blue (old) doesn't write middle_name (NULL is acceptable)
-- - Green (new) can read and write middle_name
Example: Renaming a Column (Problematic)
-- WRONG: This breaks Blue immediately
ALTER TABLE users RENAME COLUMN name TO full_name;
-- RIGHT: Expand-Contract Pattern
-- Step 1: Add new column (deploy to Green, switch traffic)
ALTER TABLE users ADD COLUMN full_name VARCHAR(200);
-- Step 2: Backfill data
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Step 3: Application reads from both, writes to both
-- (Both Blue and Green must be updated for this step)
-- Rows that old code changed after the backfill are now stale in full_name,
-- so re-run the backfill once dual writes are live:
UPDATE users SET full_name = name WHERE full_name IS DISTINCT FROM name;
-- Step 4: Remove old column (after Blue is decommissioned)
ALTER TABLE users DROP COLUMN name;
Expand-Contract Migration Pattern:
graph LR
A[Original Schema] --> B[Add New Column]
B --> C[Backfill Data]
C --> D[App Uses Both]
D --> E[Remove Old Column]
subgraph "Blue Compatible"
A
B
C
D
end
subgraph "Green Only"
E
end
Strategy 2: Feature Flags for Schema-Dependent Code
// Both Blue and Green include this code
async function getUserDisplayName(userId: string): Promise<string> {
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// Feature flag controls which field to use
if (await featureFlags.isEnabled('use_full_name_field')) {
return user.full_name || user.name; // Fallback to old field
}
return user.name;
}
async function updateUserName(userId: string, name: string): Promise<void> {
if (await featureFlags.isEnabled('use_full_name_field')) {
// Write to both fields during transition
await db.query(
'UPDATE users SET name = $2, full_name = $2 WHERE id = $1',
[userId, name]
);
} else {
await db.query(
'UPDATE users SET name = $2 WHERE id = $1',
[userId, name]
);
}
}
Migration Timeline:
T+0 Blue live, Green deploying
Schema: users (name)
Flag: use_full_name_field = false
T+1 Deploy migration: ADD COLUMN full_name
Blue and Green both work (ignore new column)
T+2 Run backfill: full_name = name
T+3 Switch traffic to Green
Enable flag: use_full_name_field = true
Green reads full_name, writes both
T+4 Verify Green is stable
Blue is dormant (no traffic)
T+5 Deploy code to remove old column dependency
(Next release)
T+6 Drop column: ALTER TABLE users DROP COLUMN name
Deployment Pipeline: End-to-End Flow
sequenceDiagram
participant Dev as Developer
participant CI as CI/CD Pipeline
participant Green as Green Environment
participant Smoke as Smoke Tests
participant LB as Load Balancer
participant Blue as Blue Environment
participant Monitor as Monitoring
Dev->>CI: Push to main
CI->>CI: Build & Test
CI->>Green: Deploy to Green
Green->>Green: Health checks pass
CI->>Smoke: Run smoke tests against Green
Smoke->>Green: HTTP requests
Green->>Smoke: 200 OK
Smoke->>CI: Tests passed
CI->>LB: Switch traffic to Green
LB->>Green: 100% traffic
LB->>Blue: 0% traffic
Monitor->>Green: Collect metrics
alt Metrics OK
Monitor->>CI: Deployment successful
CI->>Blue: Update Blue with new version (prepare for next deploy)
else Metrics Bad
Monitor->>CI: Anomaly detected
CI->>LB: Rollback to Blue
LB->>Blue: 100% traffic
LB->>Green: 0% traffic
CI->>Dev: Alert: Deployment rolled back
end
Implementation:
class BlueGreenPipeline {
async deploy(version: string): Promise<DeploymentResult> {
const startTime = Date.now();
const standbyEnv = await this.getStandbyEnvironment();
console.log(`Deploying ${version} to ${standbyEnv} environment`);
// Step 1: Deploy to standby
await this.deployToEnvironment(standbyEnv, version);
// Step 2: Wait for health checks
await this.waitForHealthy(standbyEnv, 300000); // 5 min timeout
// Step 3: Run smoke tests
const smokeResult = await this.runSmokeTests(standbyEnv);
if (!smokeResult.passed) {
throw new Error(`Smoke tests failed: ${smokeResult.failures.join(', ')}`);
}
// Step 4: Switch traffic
const activeEnv = await this.getActiveEnvironment();
await this.switchTraffic(activeEnv, standbyEnv);
// Step 5: Monitor for anomalies (5 minutes)
const monitorResult = await this.monitorForAnomalies(standbyEnv, 300000);
if (monitorResult.anomalyDetected) {
console.error('Anomaly detected, rolling back');
await this.switchTraffic(standbyEnv, activeEnv);
throw new Error(`Rollback triggered: ${monitorResult.reason}`);
}
// Step 6: Mark deployment successful
await this.recordDeployment({
version,
environment: standbyEnv,
duration: Date.now() - startTime,
status: 'success',
});
return { success: true, environment: standbyEnv };
}
private async runSmokeTests(env: 'blue' | 'green'): Promise<SmokeTestResult> {
const endpoint = env === 'green' ? this.greenEndpoint : this.blueEndpoint;
const tests = [
{ name: 'homepage', path: '/', expectedStatus: 200 },
{ name: 'api-health', path: '/api/health', expectedStatus: 200 },
{ name: 'static-asset', path: '/_next/static/chunks/main.js', expectedStatus: 200 },
];
const results = await Promise.all(
tests.map(async (test) => {
const response = await fetch(`${endpoint}${test.path}`);
return {
name: test.name,
passed: response.status === test.expectedStatus,
actualStatus: response.status,
};
})
);
return {
passed: results.every(r => r.passed),
failures: results.filter(r => !r.passed).map(r => r.name),
};
}
private async monitorForAnomalies(
env: 'blue' | 'green',
durationMs: number
): Promise<MonitorResult> {
const startTime = Date.now();
const checkInterval = 10000; // 10 seconds
while (Date.now() - startTime < durationMs) {
const metrics = await this.collectMetrics(env);
// Check error rate
if (metrics.errorRate > 0.01) { // >1% error rate
return {
anomalyDetected: true,
reason: `Error rate ${(metrics.errorRate * 100).toFixed(2)}% exceeds threshold`,
};
}
// Check latency
if (metrics.p95Latency > 2000) { // >2s p95
return {
anomalyDetected: true,
reason: `P95 latency ${metrics.p95Latency}ms exceeds threshold`,
};
}
await sleep(checkInterval);
}
return { anomalyDetected: false };
}
}
Cost Analysis: Blue-Green Economics
Blue-green deployment requires running two production environments, which has significant cost implications.
Infrastructure Costs
| Component | Blue | Green | Total | Notes |
|---|---|---|---|---|
| EC2/EKS Compute | $15K/mo | $15K/mo | $30K/mo | Both at full capacity |
| Load Balancers | $500/mo | $500/mo | $1K/mo | One per environment |
| CDN (separate configs) | $8K/mo | $8K/mo | $16K/mo | Asset isolation |
| S3 (static assets) | $200/mo | $200/mo | $400/mo | Versioned assets |
| Total Infrastructure | | | $47.4K/mo | |
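The total can be checked directly from the per-environment figures in the table. The sketch below only restates that arithmetic; the variable names are illustrative.

```typescript
// Monthly figures per environment, copied from the table, in US dollars.
const monthlyCostPerEnvironment: Record<string, number> = {
  compute: 15_000,
  loadBalancers: 500,
  cdn: 8_000,
  staticAssets: 200,
};

const perEnvironment = Object.values(monthlyCostPerEnvironment)
  .reduce((sum, cost) => sum + cost, 0);
const total = perEnvironment * 2; // Blue and Green, both hot

console.log(perEnvironment); // 23700
console.log(total); // 47400
```

Half of the bill pays for an environment that serves no traffic most of the time, and that half is what the optimization strategies target.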
Cost Optimization Strategies
1. Scale Down Standby Environment:
Keep standby at minimum capacity, scale up before switch:
class CostOptimizedBlueGreen {
async prepareForSwitch(standbyEnv: 'blue' | 'green'): Promise<void> {
// Scale up standby to match production capacity
await this.scaleEnvironment(standbyEnv, {
minReplicas: 20,
maxReplicas: 50,
});
// Wait for pods to be ready
await this.waitForCapacity(standbyEnv, 20);
}
async afterSwitch(previousEnv: 'blue' | 'green'): Promise<void> {
// Scale down previous environment to minimum
await this.scaleEnvironment(previousEnv, {
minReplicas: 2, // Keep warm for fast rollback
maxReplicas: 5,
});
}
}
Cost Savings:
| Strategy | Before | After | Savings |
|---|---|---|---|
| Standby at minimum | $30K | $18K | 40% |
| Spot instances for standby | $30K | $21K | 30% |
| Combined | $30K | $15K | 50% |
2. Share CDN Configuration:
Use path-based routing instead of separate CDN configurations:
// Single CDN, path-based routing
const cdnConfig = {
origins: [
{ name: 'blue', domain: 'blue-origin.example.com' },
{ name: 'green', domain: 'green-origin.example.com' },
],
routes: [
{ path: '/dist-blue/*', origin: 'blue' },
{ path: '/dist-green/*', origin: 'green' },
{ path: '/*', origin: 'active' }, // Dynamic based on switch state
],
};
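The `active` origin in the last route has to be resolved at request time. A minimal sketch of that resolution follows, assuming prefix matching in declaration order and a switch state supplied by the caller. `resolveOriginDomain` is hypothetical, and a real CDN would express this in its own rule language.

```typescript
type Environment = 'blue' | 'green';

interface Route {
  prefix: string;
  origin: Environment | 'active';
}

// First matching prefix wins, so the asset routes must come before '/'.
const routes: Route[] = [
  { prefix: '/dist-blue/', origin: 'blue' },
  { prefix: '/dist-green/', origin: 'green' },
  { prefix: '/', origin: 'active' },
];

const originDomains: Record<Environment, string> = {
  blue: 'blue-origin.example.com',
  green: 'green-origin.example.com',
};

function resolveOriginDomain(pathname: string, active: Environment): string {
  const route = routes.find((r) => pathname.startsWith(r.prefix));
  if (route === undefined || route.origin === 'active') {
    return originDomains[active];
  }
  // Versioned assets stay pinned to the environment that built them,
  // regardless of which side is currently active.
  return originDomains[route.origin];
}

console.log(resolveOriginDomain('/dist-blue/_next/static/v47/main.js', 'green'));
console.log(resolveOriginDomain('/checkout', 'green'));
```

Pinning asset paths this way is what lets a page rendered by Blue keep loading Blue's bundles after traffic has moved to Green.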
Blue-Green vs Canary: When to Use Each
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic Split | 0/100 or 100/0 (atomic) | Gradual (5% → 25% → 100%) |
| Rollback Speed | Instant (<5s) | Fast (<60s with CDN purge) |
| User Impact | All users switch together | Subset experiences new version |
| Testing Confidence | Lower (no production validation) | Higher (real traffic testing) |
| Infrastructure Cost | 2x (two full environments) | 1.05-1.25x (minimal extra capacity) |
| Complexity | Lower | Higher (traffic splitting logic) |
| Best For | Database migrations, major releases | Incremental features, A/B testing |
Decision Framework:
Is this a database schema change?
YES → Blue-Green (need consistent schema)
Is this a major version upgrade (React, Next.js)?
YES → Blue-Green (all-or-nothing change)
Is this a new feature that can be gradually rolled out?
YES → Canary (validate with subset first)
Do you need instant rollback capability?
YES → Blue-Green (faster rollback)
Is cost a primary concern?
YES → Canary (lower infrastructure overhead)
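The decision framework can be written as a function that asks the questions in the order listed and stops at the first "yes". This is a sketch: `ChangeProfile` and `chooseStrategy` are hypothetical, and the fallback when no question applies is an assumption, since the framework does not specify one.

```typescript
interface ChangeProfile {
  schemaChange: boolean;
  majorUpgrade: boolean;
  graduallyRollable: boolean;
  needsInstantRollback: boolean;
  costSensitive: boolean;
}

type Strategy = 'blue-green' | 'canary';

// Questions are evaluated top to bottom; the first "yes" decides.
function chooseStrategy(change: ChangeProfile): Strategy {
  if (change.schemaChange) return 'blue-green';
  if (change.majorUpgrade) return 'blue-green';
  if (change.graduallyRollable) return 'canary';
  if (change.needsInstantRollback) return 'blue-green';
  if (change.costSensitive) return 'canary';
  // Assumption: default to the cheaper option when nothing above applies.
  return 'canary';
}

const frameworkUpgrade: ChangeProfile = {
  schemaChange: false,
  majorUpgrade: true,
  graduallyRollable: false,
  needsInstantRollback: false,
  costSensitive: true,
};
console.log(chooseStrategy(frameworkUpgrade)); // blue-green
```

Writing it out makes the ordering explicit: a cost-sensitive team still gets blue-green for a major upgrade because that question is asked first.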
Production Incidents and Lessons
Incident 1: Session Store Key Prefix Mismatch
What Happened:
Switched from Blue to Green. Users logged out immediately after switch.
Root Cause:
Blue used Redis key prefix session:blue:. Green used session:green:. Sessions weren't shared.
Fix:
// Use environment-agnostic session keys
const SESSION_PREFIX = 'session:'; // Not 'session:blue:' or 'session:green:'
function getSessionKey(sessionId: string): string {
return `${SESSION_PREFIX}${sessionId}`;
}
Incident 2: CDN Cache Serving Stale HTML
What Happened:
Switched to Green, but users were still seeing Blue HTML for 30 minutes.
Root Cause:
HTML was cached at CDN with 1-hour TTL. Switch changed origin but CDN didn't know.
Fix:
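async function switchWithCachePurge(to: 'blue' | 'green'): Promise<void> {
  // Step 1: Switch load balancer first. Purging before the switch would let
  // the CDN re-cache the old environment's HTML while the purge propagates.
  await switchLoadBalancer(to);
  // Step 2: Purge HTML from CDN so edges refetch from the new environment
  await purgeCDN('/*.html');
  await purgeCDN('/');
  // Step 3: Wait for purge propagation
  await sleep(5000);
}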
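Purging fixes the symptom. The root cause was a long TTL on URLs whose content changes with every switch. One complementary approach, sketched here with assumed header values, is to give versioned assets and HTML different cache policies. `cacheControlFor` is a hypothetical helper, and the 30-second edge TTL is an illustrative choice, not a recommendation from the incident.

```typescript
// Hypothetical policy: versioned assets can be cached for a long time,
// HTML cannot, because the same URL must reflect the active environment.
function cacheControlFor(pathname: string): string {
  const isVersionedAsset =
    pathname.startsWith('/dist-blue/') || pathname.startsWith('/dist-green/');
  if (isVersionedAsset) {
    // The URL changes with every release, so the content behind it never does.
    return 'public, max-age=31536000, immutable';
  }
  // Browsers always revalidate; the CDN may hold HTML for at most 30 seconds.
  return 'public, max-age=0, s-maxage=30, must-revalidate';
}

console.log(cacheControlFor('/dist-green/_next/static/v48/main.js'));
console.log(cacheControlFor('/checkout'));
```

With a short edge TTL on HTML, a missed or slow purge costs seconds of stale pages instead of an hour.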
Incident 3: Database Migration Timing
What Happened:
Ran migration, switched to Green, Green crashed because migration wasn't complete.
Root Cause:
Migration was still running when traffic switch happened.
Fix:
async function deployWithMigration(version: string): Promise<void> {
// Step 1: Run migration
console.log('Running database migration...');
await runMigration();
// Step 2: Verify migration complete
const migrationStatus = await checkMigrationStatus();
if (migrationStatus !== 'complete') {
throw new Error('Migration not complete, aborting deploy');
}
// Step 3: Deploy application
await deployToStandby(version);
// Step 4: Verify standby can use new schema
await runSchemaValidationTests();
// Step 5: Switch traffic
await switchTraffic();
}
Summary: Blue-Green Architecture Principles
- Environment Parity - Blue and Green must be identical except for application version. Same infrastructure, same configuration, same capacity.
- Atomic Switching - Traffic moves from one environment to another in a single operation. No gradual migration.
- Shared Persistent State - Database, session store, and external services are shared. Both environments must be compatible.
- Instant Rollback - The primary value of blue-green is rollback speed. If switch fails, revert in seconds.
- Asset Isolation - Each environment serves assets from unique paths to prevent CDN cache collisions.
- Connection Draining - Allow in-flight requests to complete before fully switching.
- Health Verification - Never switch to an environment that isn't healthy. Automated health checks before switch.
- Cost Awareness - Two environments cost 2x. Optimize by scaling down standby between deployments.
Blue-green deployment trades infrastructure cost for deployment safety. For applications where downtime is unacceptable and rollback speed is critical, it's the right choice. For incremental feature releases, canary deployments are more appropriate.
The architecture outlined here—load balancer switching, versioned asset paths, shared session storage, and automated pipelines—is what production systems at scale use for zero-downtime deployments.