Blue-Green Deployment Architecture for Frontends: Zero-Downtime Releases at Scale

June 14, 2026

Introduction: Why Blue-Green Deployments Exist

When Shopify's engineering team pushed a critical checkout update during Black Friday 2019, they needed absolute certainty that if anything went wrong, they could revert to the previous version in under 30 seconds. They achieved this using blue-green deployments: maintaining two identical production environments and switching traffic atomically between them.

Blue-green deployment is conceptually simple: run two identical environments (Blue and Green), deploy to the inactive one, test it, then switch traffic. But at scale, the implementation becomes complex, especially for frontends where static assets, CDN caches, client-side state, and hydration timing create challenges that don't exist in backend systems.

This article covers how production systems implement blue-green deployments for frontends, including the infrastructure decisions, traffic switching mechanisms, database considerations, and the failure modes you'll encounter at scale.

Scale Context: Production Reality

Traffic Profile:

DAU: 40M daily active users
Peak RPS: 380K requests/second
Asset Requests: 2.2M RPS (JS/CSS/images)
Geographic Distribution: 150+ countries
CDN PoPs: 280+ edge locations
Deployment Frequency: 15-25 deploys/day

Infrastructure Requirements:

Environment Parity: 100% identical Blue and Green stacks
Switch Time: <5 seconds (load balancer or service-level switch; DNS cannot meet this, as discussed under Traffic Switching Mechanisms)
Rollback Time: <10 seconds
Zero Downtime: 99.99% availability during deployments
Cost Overhead: 2x infrastructure (both environments hot)

Frontend Architecture:

Framework: Next.js 14 with App Router
Rendering: Hybrid SSR/SSG/ISR
Bundle Size: 1.1MB initial, 4.2MB total
API Dependencies: 8-12 services per page
WebSocket Connections: 5M concurrent

High-Level Architecture: Blue-Green System

┌─────────────────────────────────────────────────────────────────────┐
│                        USER REQUEST                                  │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1: DNS / Global Load Balancer                                │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Traffic Router (Active Environment Pointer)                │    │
│  │  - Current: BLUE (v47)                                      │    │
│  │  - Standby: GREEN (v48) ← Deploying here                    │    │
│  └─────────────────────────────────────────────────────────────┘    │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
              ┌────────────────┴────────────────┐
              │                                 │
              ▼                                 ▼
┌──────────────────────────┐      ┌──────────────────────────┐
│     BLUE ENVIRONMENT     │      │    GREEN ENVIRONMENT     │
│         (ACTIVE)         │      │        (STANDBY)         │
│                          │      │                          │
│  ┌────────────────────┐  │      │  ┌────────────────────┐  │
│  │   CDN Edge (Blue)  │  │      │  │  CDN Edge (Green)  │  │
│  │   /dist-blue/      │  │      │  │  /dist-green/      │  │
│  └────────────────────┘  │      │  └────────────────────┘  │
│            │             │      │            │             │
│            ▼             │      │            ▼             │
│  ┌────────────────────┐  │      │  ┌────────────────────┐  │
│  │   Origin Servers   │  │      │  │   Origin Servers   │  │
│  │   (K8s: blue-ns)   │  │      │  │   (K8s: green-ns)  │  │
│  │   - SSR Pods (20)  │  │      │  │   - SSR Pods (20)  │  │
│  │   - BFF Pods (10)  │  │      │  │   - BFF Pods (10)  │  │
│  └────────────────────┘  │      │  └────────────────────┘  │
│            │             │      │            │             │
│            ▼             │      │            ▼             │
│  ┌────────────────────┐  │      │  ┌────────────────────┐  │
│  │   Static Assets    │  │      │  │   Static Assets    │  │
│  │   S3: blue-bucket  │  │      │  │   S3: green-bucket │  │
│  └────────────────────┘  │      │  └────────────────────┘  │
└──────────────────────────┘      └──────────────────────────┘
              │                                 │
              └─────────────┬───────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│  SHARED INFRASTRUCTURE (Both environments connect)                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │    Database     │  │      Redis      │  │   Message Queue │      │
│  │   (PostgreSQL)  │  │    (Session)    │  │     (Kafka)     │      │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘      │
└─────────────────────────────────────────────────────────────────────┘

Key Architectural Principles:

Complete Environment Isolation - Blue and Green have separate Kubernetes namespaces, CDN configurations, S3 buckets, and load balancers. They share only persistent state (database, session store, message queue) and external services.
Atomic Traffic Switch - Traffic moves from Blue to Green in a single operation, typically a load balancer target group swap. There is no gradual migration. A DNS update is also a single operation, but clients observe it gradually as their caches expire.
Hot Standby - Both environments run continuously at production capacity. Green can receive traffic immediately after switch.
Shared Persistent State - Database, session store, and message queues are shared. Both environments must be compatible with the same data schema.
Version-Tagged Assets - Each environment serves assets from its own path (/dist-blue/, /dist-green/) to prevent cache collisions.
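
The principles above reduce to one piece of mutable state: a pointer that says which environment is active. A minimal sketch of that state follows; the `EnvironmentPointer` type and `swap` function are hypothetical names used for illustration, not part of any library.

```typescript
type Environment = 'blue' | 'green';

interface EnvironmentPointer {
  active: Environment;
  standby: Environment;
  versions: Record<Environment, string>;
}

// Returns a new pointer with active and standby exchanged.
// Versions stay attached to their environment, so a second swap is a rollback.
function swap(pointer: EnvironmentPointer): EnvironmentPointer {
  return { ...pointer, active: pointer.standby, standby: pointer.active };
}

const before: EnvironmentPointer = {
  active: 'blue',
  standby: 'green',
  versions: { blue: 'v47', green: 'v48' },
};

const afterSwitch = swap(before);
const afterRollback = swap(afterSwitch);

console.log(afterSwitch.active, afterSwitch.versions[afterSwitch.active]);
console.log(afterRollback.active, afterRollback.versions[afterRollback.active]);
```

Because the swap is a pure exchange, rollback needs no extra logic: it is the same operation applied again.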

Traffic Switching Mechanisms

The traffic switch is the core operation in blue-green deployment. There are three primary approaches, each with different characteristics.

1. DNS-Based Switching

                                Before Switch
┌─────────────┐     DNS Query      ┌─────────────────┐
│   Browser   │ ──────────────────▶│   DNS Server    │
└─────────────┘                    │                 │
                                   │ app.example.com │
                                   │   → 1.2.3.4     │
                                   │   (Blue LB)     │
                                   └─────────────────┘

                                After Switch
┌─────────────┐     DNS Query      ┌─────────────────┐
│   Browser   │ ──────────────────▶│   DNS Server    │
└─────────────┘                    │                 │
                                   │ app.example.com │
                                   │   → 5.6.7.8     │
                                   │   (Green LB)    │
                                   └─────────────────┘

Implementation (AWS Route 53):

import { Route53Client, ChangeResourceRecordSetsCommand } from '@aws-sdk/client-route-53';

type Environment = 'blue' | 'green';

class DNSBasedSwitch {
  // Load balancer DNS names per environment (example values)
  private readonly lbDns: Record<Environment, string> = {
    blue: 'blue-lb-789012.us-east-1.elb.amazonaws.com',
    green: 'green-lb-123456.us-east-1.elb.amazonaws.com',
  };

  constructor(
    private readonly route53: Route53Client,
    private readonly hostedZoneId: string, // hosted zone that owns recordName
    private readonly recordName: string    // e.g. 'app.example.com'
  ) {}

  switchToGreen(): Promise<void> {
    return this.pointTo('green');
  }

  switchToBlue(): Promise<void> {
    // Same logic, different target
    return this.pointTo('blue');
  }

  private async pointTo(env: Environment): Promise<void> {
    const command = new ChangeResourceRecordSetsCommand({
      HostedZoneId: this.hostedZoneId,
      ChangeBatch: {
        Comment: `Switch traffic to ${env} environment`,
        Changes: [
          {
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: this.recordName,
              Type: 'A',
              AliasTarget: {
                // ALB hosted zone ID (region-specific; must match the load balancer's region)
                HostedZoneId: 'Z35SXDOTRQ7X7K',
                DNSName: this.lbDns[env],
                EvaluateTargetHealth: true,
              },
            },
          },
        ],
      },
    });

    await this.route53.send(command);

    // The API call returns quickly, but resolvers keep serving cached answers
    console.log(`DNS now points to ${env}. Propagation may take 60-300 seconds.`);
  }
}

Why DNS Switching Is Problematic:

Issue	Impact
TTL Propagation	DNS caches (browser, OS, ISP) hold old IP for 60-3600 seconds
No Instant Rollback	Rollback takes same propagation time as switch
Split Traffic	During propagation, some users hit Blue, some hit Green
Health Check Lag	Route 53 health checks have 10-30 second intervals

When to Use: Only for disaster recovery where you're switching to a completely different region. Not recommended for regular deployments.
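
To get a feel for the split-traffic window, here is a deliberately rough model. It assumes every cache picked up the record at a uniformly random moment within the last TTL and honors the TTL exactly, which real resolvers and browsers often do not. The function name `staleFraction` is illustrative.

```typescript
// Fraction of caches still holding the old record `elapsedSeconds` after the change,
// under the uniform-expiry assumption described above.
function staleFraction(ttlSeconds: number, elapsedSeconds: number): number {
  if (ttlSeconds <= 0) return 0;
  const remaining = 1 - elapsedSeconds / ttlSeconds;
  return Math.min(1, Math.max(0, remaining));
}

// With a 300 second TTL, half of the caches are still on Blue after 150 seconds
console.log(staleFraction(300, 150));
```

Even in this optimistic model a rollback faces the same curve in reverse, which is why the table above lists rollback as no faster than the forward switch.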

2. Load Balancer Target Group Switching (Recommended)

graph TB
    User[User Request] --> ALB[Application Load Balancer]

    ALB --> Listener[HTTPS Listener :443]

    Listener --> Rules{Listener Rules}

    Rules -->|Active| BlueTG[Blue Target Group<br/>weight: 100%]
    Rules -->|Standby| GreenTG[Green Target Group<br/>weight: 0%]

    BlueTG --> BlueEC2[Blue Pods<br/>10.0.1.x]
    GreenTG --> GreenEC2[Green Pods<br/>10.0.2.x]

    subgraph Switch Operation
        Before[Before: Blue 100%, Green 0%]
        After[After: Blue 0%, Green 100%]
        Before -->|Atomic Swap| After
    end

Implementation (AWS ALB):

import {
  ElasticLoadBalancingV2Client,
  ModifyListenerCommand,
  DescribeTargetHealthCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

type Environment = 'blue' | 'green';

interface HealthStatus {
  healthyCount: number;
  unhealthyCount: number; // every target that is not in the 'healthy' state
}

class LoadBalancerSwitch {
  constructor(
    private readonly elbv2: ElasticLoadBalancingV2Client,
    private readonly listenerArn: string,
    private readonly targetGroupArns: Record<Environment, string>
  ) {}

  async switchToGreen(): Promise<void> {
    const duration = await this.forwardTo('green');
    // Record the switch in your audit log here
    console.log(`Traffic switched to Green in ${duration}ms`);
  }

  async switchToBlue(): Promise<void> {
    await this.forwardTo('blue');
    console.log('Traffic switched to Blue (rollback complete)');
  }

  // Points the listener's default action at one target group and
  // returns how long the API call took in milliseconds.
  private async forwardTo(env: Environment): Promise<number> {
    const targetGroupArn = this.targetGroupArns[env];

    // Verify the destination is healthy first. An empty target group
    // has zero unhealthy targets, so also require at least one healthy one.
    const health = await this.checkTargetGroupHealth(targetGroupArn);

    if (health.healthyCount === 0 || health.unhealthyCount > 0) {
      throw new Error(
        `${env} environment unhealthy: ${health.healthyCount} healthy, ` +
          `${health.unhealthyCount} not healthy`
      );
    }

    // Atomic switch: change listener default action
    const command = new ModifyListenerCommand({
      ListenerArn: this.listenerArn,
      DefaultActions: [{ Type: 'forward', TargetGroupArn: targetGroupArn }],
    });

    const startTime = Date.now();
    await this.elbv2.send(command);
    return Date.now() - startTime;
  }

  private async checkTargetGroupHealth(targetGroupArn: string): Promise<HealthStatus> {
    // DescribeTargetHealth reports per-target state. DescribeTargetGroups
    // returns configuration only and carries no health counts.
    const response = await this.elbv2.send(
      new DescribeTargetHealthCommand({ TargetGroupArn: targetGroupArn })
    );

    const states = (response.TargetHealthDescriptions ?? []).map(
      (d) => d.TargetHealth?.State
    );
    const healthyCount = states.filter((s) => s === 'healthy').length;

    return { healthyCount, unhealthyCount: states.length - healthyCount };
  }
}

Why Load Balancer Switching Works:

Benefit	Details
Instant Switch	ALB rule change takes <2 seconds
Instant Rollback	Same speed as forward switch
Health-Aware	Only switches if target group is healthy
No DNS Propagation	DNS points to ALB, which doesn't change
Connection Draining	Existing connections gracefully complete

Connection Draining Behavior:

T+0s     Switch command sent
T+0.5s   New requests go to Green
T+0.5s   Requests already in flight on Blue continue
T+300s   Default deregistration delay expires (counted from when Blue targets are deregistered, not from the listener change)
T+300s   Remaining Blue connections forcefully closed (if still open)

3. Kubernetes Service Switching

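For Kubernetes-native deployments, you can use label selectors to switch traffic. A Service only selects Pods in its own namespace, so this approach requires both Deployments to share a namespace (production below) rather than the separate blue-ns and green-ns namespaces shown in the architecture diagram: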

# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-blue
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: frontend
      version: blue
  template:
    metadata:
      labels:
        app: frontend
        version: blue
    spec:
      containers:
        - name: frontend
          image: frontend:v47
          ports:
            - containerPort: 3000

---
# Green Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-green
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: frontend
      version: green
  template:
    metadata:
      labels:
        app: frontend
        version: green
    spec:
      containers:
        - name: frontend
          image: frontend:v48
          ports:
            - containerPort: 3000

---
# Service (points to active version)
apiVersion: v1
kind: Service
metadata:
  name: frontend
  namespace: production
spec:
  selector:
    app: frontend
    version: blue  # ← Switch this to 'green' for traffic switch
  ports:
    - port: 80
      targetPort: 3000

Switching Script:

import { KubeConfig, CoreV1Api } from '@kubernetes/client-node';

type Environment = 'blue' | 'green';

// Written against @kubernetes/client-node 0.x, where calls take positional
// arguments and resolve to { response, body }. Version 1.x takes a single
// object argument and resolves to the resource directly.
class KubernetesSwitch {
  private k8sApi: CoreV1Api;
  private namespace = 'production';
  private serviceName = 'frontend';

  constructor() {
    const kubeConfig = new KubeConfig();
    kubeConfig.loadFromDefault(); // ~/.kube/config or in-cluster credentials
    this.k8sApi = kubeConfig.makeApiClient(CoreV1Api);
  }

  async switchToGreen(): Promise<void> {
    await this.switchTo('green');
    console.log('Traffic switched to Green');
  }

  async switchToBlue(): Promise<void> {
    await this.switchTo('blue');
    console.log('Traffic switched to Blue (rollback)');
  }

  private async switchTo(version: Environment): Promise<void> {
    // Read current service
    const service = await this.k8sApi.readNamespacedService(
      this.serviceName,
      this.namespace
    );

    // Update selector
    service.body.spec!.selector = { app: 'frontend', version };

    // Apply update. The body carries the resourceVersion that was read, so the
    // API server rejects the write with 409 Conflict if the Service changed in between.
    await this.k8sApi.replaceNamespacedService(
      this.serviceName,
      this.namespace,
      service.body
    );
  }
}

Comparison of Switching Mechanisms:

Mechanism	Switch Time	Rollback Time	Complexity	Best For
DNS	60-300s	60-300s	Low	DR failover only
Load Balancer	<2s	<2s	Medium	Production deployments
K8s Service	<5s	<5s	Medium	K8s-native apps
Istio/Service Mesh	<1s	<1s	High	Advanced traffic control

Frontend-Specific Blue-Green Challenges

Challenge 1: CDN Cache Coherence

The Problem:

User loads page during switch:

HTML served from Green (new version)
JS bundle request goes to CDN
CDN has cached Blue version of main.js
Hydration fails because JS doesn't match HTML

Timeline:
T+0      HTML request → Green origin → v48 HTML
T+0.1s   JS request → CDN cache hit → v47 JS (cached from Blue)
T+0.2s   Browser: "React hydration error: text mismatch"
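
One client-side guard against this mismatch is to compare build IDs before hydrating. The sketch below assumes the server embeds its build ID in the HTML (for example in a data attribute) and that the bundle carries its own build ID as a compiled-in constant. Both mechanisms, and the name `resolveSkew`, are assumptions made for illustration rather than Next.js features.

```typescript
type SkewAction = 'hydrate' | 'reload' | 'skip-hydration';

// htmlBuildId:     build ID the server embedded in the HTML (assumed)
// bundleBuildId:   build ID compiled into the JS bundle (assumed)
// alreadyReloaded: whether this page load is itself the result of a skew reload
function resolveSkew(
  htmlBuildId: string,
  bundleBuildId: string,
  alreadyReloaded: boolean
): SkewAction {
  if (htmlBuildId === bundleBuildId) return 'hydrate';

  // One reload usually fetches matching HTML and JS. A second mismatch
  // points at a stale cache somewhere, so stop instead of looping.
  return alreadyReloaded ? 'skip-hydration' : 'reload';
}

// The timeline above: v48 HTML paired with a cached v47 bundle
console.log(resolveSkew('v48', 'v47', false));
```

The `alreadyReloaded` flag would have to survive the reload, for instance in sessionStorage; without it a persistently stale CDN entry would cause a reload loop.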

Solution 1: Versioned Asset Paths

Each environment serves assets from a unique path:

// next.config.js
module.exports = {
  assetPrefix: process.env.ASSET_PREFIX, // '/dist-blue/' or '/dist-green/'

  generateBuildId: async () => {
    return process.env.BUILD_VERSION; // 'v47' or 'v48'
  },
};

HTML Output:

<!-- Blue environment -->
<script src="/dist-blue/_next/static/v47/main.js"></script>

<!-- Green environment -->
<script src="/dist-green/_next/static/v48/main.js"></script>

CDN Configuration (Cloudflare):

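// Cloudflare Worker
interface Env {
  BLUE_ORIGIN: string;  // e.g. 'https://blue-origin.example.com'
  GREEN_ORIGIN: string; // e.g. 'https://green-origin.example.com'
  KV: KVNamespace;      // holds the 'active-environment' key
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Forward to an origin, preserving query string, method, headers, and body
    const proxy = (origin: string) =>
      fetch(new Request(`${origin}${url.pathname}${url.search}`, request));

    // Route based on asset prefix
    if (url.pathname.startsWith('/dist-blue/')) {
      return proxy(env.BLUE_ORIGIN);
    }

    if (url.pathname.startsWith('/dist-green/')) {
      return proxy(env.GREEN_ORIGIN);
    }

    // HTML requests go to active environment.
    // KV is eventually consistent, so edge locations may pick up a
    // changed value at slightly different times.
    const activeEnv = await env.KV.get('active-environment');
    const origin = activeEnv === 'green' ? env.GREEN_ORIGIN : env.BLUE_ORIGIN;

    return proxy(origin);
  }
};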

Solution 2: Dual CDN Purge on Switch

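type Environment = 'blue' | 'green';

abstract class BlueGreenDeployer {
  // HTML routes that are cached at the CDN and requested before a switch
  private readonly criticalPaths = ['/', '/products', '/cart', '/checkout'];

  constructor(
    private readonly blueOrigin: string,
    private readonly greenOrigin: string
  ) {}

  // Provided by the load balancer and CDN integrations
  protected abstract switchLoadBalancer(to: Environment): Promise<void>;
  protected abstract purgeCDN(paths: string[]): Promise<void>;

  async switch(from: Environment, to: Environment): Promise<void> {
    // Step 1: Warm up the standby environment
    await this.warmCDNCache(to);

    // Step 2: Switch load balancer
    await this.switchLoadBalancer(to);

    // Step 3: Purge cached HTML that was rendered by the old environment.
    // HTML is served from the page routes, not from /dist-<env>/.
    // (JS/CSS can stay - content-addressed URLs)
    await this.purgeCDN(this.criticalPaths);

    console.log(`Switched from ${from} to ${to}`);
  }

  private async warmCDNCache(env: Environment): Promise<void> {
    const origin = env === 'green' ? this.greenOrigin : this.blueOrigin;

    // These requests go straight to the origin, so they warm its SSR/ISR
    // caches. CDN edges repopulate on the first request after the purge.
    await Promise.all(
      this.criticalPaths.map(path =>
        fetch(`${origin}${path}`, {
          headers: { 'X-Warm-Cache': 'true' }
        })
      )
    );
  }
}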
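
Purging stays cheap when HTML and versioned assets carry different cache lifetimes, because only HTML ever needs to be invalidated. A minimal sketch of such a policy, assuming the /dist-blue/ and /dist-green/ prefixes used in this article; the function name and the exact header values are illustrative choices, not taken from the source:

```typescript
// Returns a Cache-Control header value for a request path.
function cacheControlFor(pathname: string): string {
  const isVersionedAsset =
    pathname.startsWith('/dist-blue/') || pathname.startsWith('/dist-green/');

  if (isVersionedAsset) {
    // The URL changes whenever the content changes, so it can be cached for a year
    return 'public, max-age=31536000, immutable';
  }

  // HTML must be revalidated so that a switch is visible on the next navigation
  return 'public, max-age=0, must-revalidate';
}

console.log(cacheControlFor('/dist-green/_next/static/v48/main.js'));
console.log(cacheControlFor('/checkout'));
```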

Challenge 2: Client-Side State Compatibility

The Problem:

User has localStorage data from Blue (v47). Switch happens. User navigates, gets Green (v48). Green expects different localStorage schema.

// v47 (Blue) wrote this:
localStorage.setItem('preferences', JSON.stringify({
  theme: 'dark',
  notifications: true
}));

// v48 (Green) expects this:
interface PreferencesV2 {
  version: 2;
  ui: {
    theme: 'dark' | 'light';
    fontSize: number;
  };
  notifications: {
    email: boolean;
    push: boolean;
  };
}

Solution: Schema Migration with Version Tolerance

interface PreferencesV1 {
  theme: string;
  notifications: boolean;
}

interface PreferencesV2 {
  version: 2;
  ui: { theme: 'dark' | 'light'; fontSize: number };
  notifications: { email: boolean; push: boolean };
}

type Preferences = PreferencesV1 | PreferencesV2;

function loadPreferences(): PreferencesV2 {
  const raw = localStorage.getItem('preferences');

  if (!raw) {
    return getDefaultPreferences();
  }

  try {
    const data = JSON.parse(raw) as Preferences;

    // Check version
    if ('version' in data && data.version === 2) {
      return data;
    }

    // Migrate v1 → v2
    const v1 = data as PreferencesV1;
    const migrated: PreferencesV2 = {
      version: 2,
      ui: {
        theme: v1.theme === 'dark' ? 'dark' : 'light',
        fontSize: 16,
      },
      notifications: {
        email: v1.notifications,
        push: v1.notifications,
      },
    };

    // Save migrated version
    localStorage.setItem('preferences', JSON.stringify(migrated));

    return migrated;
  } catch (error) {
    console.error('Failed to load preferences, resetting', error);
    const defaults = getDefaultPreferences();
    localStorage.setItem('preferences', JSON.stringify(defaults));
    return defaults;
  }
}

Bidirectional Compatibility:

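During blue-green transition, users may switch between environments, so each version must handle the other's data. The older version's tolerance for the new format has to ship in a release before the schema change, because v47 is already deployed by the time v48 is written:

// v48 (Green): reads v2 natively and migrates v1 on the fly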
function loadPreferencesCompat(): PreferencesV2 {
  const raw = localStorage.getItem('preferences');

  if (!raw) return getDefaultPreferences();

  const data = JSON.parse(raw);

  // v2 format
  if (data.version === 2) {
    return data;
  }

  // v1 format (migrate)
  return migrateV1ToV2(data);
}

// v47 must also read v2 format (forward compatibility)
function loadPreferencesV47(): PreferencesV1 {
  const raw = localStorage.getItem('preferences');

  if (!raw) return { theme: 'light', notifications: true };

  const data = JSON.parse(raw);

  // Handle v2 format written by v48
  if (data.version === 2) {
    return {
      theme: data.ui.theme,
      notifications: data.notifications.email,
    };
  }

  return data;
}
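
The snippets above call `migrateV1ToV2` without showing it. Below is a sketch of that helper together with its inverse, using the same field mapping as `loadPreferences` and `loadPreferencesV47`. The name `downgradeV2ToV1` is introduced here for illustration.

```typescript
interface PreferencesV1 {
  theme: string;
  notifications: boolean;
}

interface PreferencesV2 {
  version: 2;
  ui: { theme: 'dark' | 'light'; fontSize: number };
  notifications: { email: boolean; push: boolean };
}

// v1 -> v2: unknown themes fall back to 'light', font size gets a default
function migrateV1ToV2(v1: PreferencesV1): PreferencesV2 {
  return {
    version: 2,
    ui: { theme: v1.theme === 'dark' ? 'dark' : 'light', fontSize: 16 },
    notifications: { email: v1.notifications, push: v1.notifications },
  };
}

// v2 -> v1: fontSize and the separate push setting have no v1 equivalent
function downgradeV2ToV1(v2: PreferencesV2): PreferencesV1 {
  return { theme: v2.ui.theme, notifications: v2.notifications.email };
}

const original: PreferencesV1 = { theme: 'dark', notifications: true };
const roundTripped = downgradeV2ToV1(migrateV1ToV2(original));
console.log(roundTripped);
```

The round trip preserves v1 data only for the themes 'dark' and 'light'. Any other v1 theme value becomes 'light', and v2-only fields are lost on the way back, which is worth knowing before relying on rollback.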

Challenge 3: In-Flight Requests During Switch

The Problem:

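User loads the checkout form from Blue at T+0, and Blue issues a CSRF token. The switch happens at T+0.5s. The user clicks "Submit Order" at T+1s and the POST lands on Green. If CSRF or session state lives in environment-local memory, Green cannot validate a token that Blue issued. A request that is already in flight at the moment of the switch is a separate case: it stays on Blue and is handled by connection draining, covered below.

T+0s      User: GET /checkout (Blue issues CSRF token)
T+0.5s    Operator: Switch to Green
T+1s      User: POST /api/orders (routed to Green)
T+1s      Response: 403 Forbidden (CSRF token invalid)
T+1s      User: "My order failed!"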

Solution: Shared Session Store

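// Redis session store (shared by Blue and Green)
import Redis from 'ioredis';
import { timingSafeEqual } from 'node:crypto';

// Fields shown are illustrative; only csrfToken is used below
interface Session {
  userId: string;
  csrfToken: string;
}

class SessionStore {
  // Both environments must be given the same Redis endpoint and key prefix
  constructor(private readonly redis: Redis) {}

  async getSession(sessionId: string): Promise<Session | null> {
    const data = await this.redis.get(`session:${sessionId}`);
    return data ? JSON.parse(data) : null;
  }

  async setSession(sessionId: string, session: Session): Promise<void> {
    await this.redis.setex(
      `session:${sessionId}`,
      86400, // 24 hour TTL
      JSON.stringify(session)
    );
  }

  async validateCSRF(sessionId: string, token: string): Promise<boolean> {
    const session = await this.getSession(sessionId);
    if (!session) return false;

    // Constant-time comparison avoids leaking token contents through timing
    const expected = Buffer.from(session.csrfToken);
    const actual = Buffer.from(token);
    return expected.length === actual.length && timingSafeEqual(expected, actual);
  }
}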
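
Sharing the Redis instance is not enough on its own: both environments must also agree on the key scheme. The sketch below uses an in-memory Map as a stand-in for Redis to show the difference. The helper names are hypothetical, and the per-environment prefix is included only as a counterexample.

```typescript
type Environment = 'blue' | 'green';

// Stand-in for Redis: one Map shared by both environments
const store = new Map<string, string>();

// Shared key scheme: the environment name does not appear in the key
function sharedKey(sessionId: string): string {
  return `session:${sessionId}`;
}

// Counterexample: an environment-specific prefix splits the keyspace
function perEnvKey(env: Environment, sessionId: string): string {
  return `session:${env}:${sessionId}`;
}

function issueToken(key: string, token: string): void {
  store.set(key, token);
}

function validateToken(key: string, token: string): boolean {
  return store.get(key) === token;
}

// Blue issues a token, Green validates it after the switch
issueToken(sharedKey('abc'), 'tok-1');
const sharedOk = validateToken(sharedKey('abc'), 'tok-1');

issueToken(perEnvKey('blue', 'xyz'), 'tok-2');
const perEnvOk = validateToken(perEnvKey('green', 'xyz'), 'tok-2');

console.log({ sharedOk, perEnvOk });
```

With the shared scheme Green finds the token that Blue stored. With the per-environment prefix the lookup misses, and every user appears logged out after the switch.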

Connection Draining Strategy:

Configure load balancer to wait for in-flight requests:

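// AWS ALB deregistration delay
// (blueTargetGroupArn is the ARN of the Blue target group, defined elsewhere)
import {
  ElasticLoadBalancingV2Client,
  ModifyTargetGroupAttributesCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

const elbv2 = new ElasticLoadBalancingV2Client({});

await elbv2.send(
  new ModifyTargetGroupAttributesCommand({
    TargetGroupArn: blueTargetGroupArn,
    Attributes: [
      {
        Key: 'deregistration_delay.timeout_seconds',
        Value: '30', // Wait 30s for in-flight requests
      },
    ],
  })
);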

Timeline with Connection Draining:

T+0s      User: POST /api/orders (Blue)
T+0.5s    Operator: Switch to Green
T+0.5s    New requests → Green
T+0.5s    Blue still processing existing request
T+1s      Blue returns response to user
T+1s      User: "Order successful!"
T+30.5s   Blue connections fully drained
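
The draining rule in this timeline can be stated as a small state machine: stop accepting new work, then shut down once nothing is in flight or the timeout has passed. The sketch below models that rule in application code. `DrainTracker` is a hypothetical class for illustration; on AWS the load balancer enforces the equivalent behavior itself.

```typescript
class DrainTracker {
  private inFlight = 0;
  private drainStartedAt: number | null = null;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Returns false once draining has started: new work belongs to the other environment
  begin(): boolean {
    if (this.drainStartedAt !== null) return false;
    this.inFlight++;
    return true;
  }

  end(): void {
    if (this.inFlight > 0) this.inFlight--;
  }

  startDraining(nowMs: number): void {
    this.drainStartedAt = nowMs;
  }

  // Safe to stop when nothing is in flight, or when the timeout has expired
  canStop(nowMs: number): boolean {
    if (this.drainStartedAt === null) return false;
    return this.inFlight === 0 || nowMs - this.drainStartedAt >= this.timeoutMs;
  }
}

// Replay of the timeline above, with times in milliseconds
const blue = new DrainTracker(30000);
blue.begin();             // T+0s    POST /api/orders arrives on Blue
blue.startDraining(500);  // T+0.5s  switch to Green
const acceptedAfterSwitch = blue.begin();
const stoppableWhileBusy = blue.canStop(600);
blue.end();               // T+1s    Blue returns the response
const stoppableWhenIdle = blue.canStop(1000);

console.log({ acceptedAfterSwitch, stoppableWhileBusy, stoppableWhenIdle });
```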

Challenge 4: WebSocket Connection Migration

The Problem:

Users have active WebSocket connections to Blue. Switch to Green. WebSockets are still connected to Blue, which is now stale.

Solution: Graceful WebSocket Migration

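// Server-side: Broadcast migration notice before switch
// (WebSocket here is the server library's socket type, for example from the ws package)
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class WebSocketManager {
  // Entries are removed elsewhere when a socket closes, so after the wait
  // below only clients that have not yet moved to Green remain
  private connections: Map<string, WebSocket> = new Map();

  async prepareForSwitch(): Promise<void> {
    // Notify all connected clients
    const migrationMessage = JSON.stringify({
      type: 'ENVIRONMENT_SWITCH',
      reconnectIn: 5000, // 5 seconds
      newEndpoint: 'wss://green.example.com/ws',
    });

    for (const ws of this.connections.values()) {
      if (ws.readyState === WebSocket.OPEN) ws.send(migrationMessage);
    }

    // Wait for clients to reconnect to Green
    await sleep(10000);

    // Close remaining connections
    for (const ws of this.connections.values()) {
      ws.close(1000, 'Environment switch');
    }
  }
}

// Client-side: Handle migration
class WebSocketClient {
  private ws!: WebSocket;

  constructor(private endpoint: string) {}

  connect(): void {
    this.ws = new WebSocket(this.endpoint);

    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);

      if (message.type === 'ENVIRONMENT_SWITCH') {
        console.log('Server switching environments, reconnecting...');

        // Pick a random delay within the window so that clients
        // do not all reconnect to Green at the same instant
        const delay = Math.random() * message.reconnectIn;

        setTimeout(() => {
          this.endpoint = message.newEndpoint;
          // Detach onclose first; otherwise closing here would also
          // trigger the backoff reconnect and open a second socket
          this.ws.onclose = null;
          this.ws.close();
          this.connect();
        }, delay);
      }
    };

    this.ws.onclose = () => {
      // Auto-reconnect with exponential backoff (helper not shown)
      this.reconnectWithBackoff();
    };
  }
}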
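
The client calls `reconnectWithBackoff`, which the article does not show. One common way to compute its delay is exponential backoff with full jitter, sketched below. The function name, default values, and injectable random source are assumptions for illustration.

```typescript
// Delay before reconnect attempt number `attempt` (0-based).
// The ceiling doubles per attempt up to capMs; the actual delay is a random
// point below the ceiling so that clients spread out instead of retrying in lockstep.
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Example with real randomness: delays for the first five attempts
const delays = [0, 1, 2, 3, 4].map((attempt) => backoffDelay(attempt));
console.log(delays);
```

Passing the random source as a parameter keeps the function deterministic in tests. With millions of concurrent connections, the jitter matters more than the exact base and cap.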

Database Compatibility: The Hardest Problem

Blue-green deployment for frontends is relatively straightforward. The database is where things get complicated.

The Database Challenge

Both Blue and Green environments connect to the same database. If Green requires a schema change, you have two options:

Backward-Compatible Migrations - Green's schema changes must work with Blue's code
Database Blue-Green - Maintain two databases with replication (complex)

Strategy 1: Backward-Compatible Migrations

Rule: Every migration must work with both old and new application code.

Example: Adding a Column

-- Migration: Add 'middle_name' column
ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);

-- This works because:
-- - Blue (old) ignores middle_name when reading (SELECT * is fine)
-- - Blue (old) doesn't write middle_name (NULL is acceptable)
-- - Green (new) can read and write middle_name

Example: Renaming a Column (Problematic)

-- WRONG: This breaks Blue immediately
ALTER TABLE users RENAME COLUMN name TO full_name;

-- RIGHT: Expand-Contract Pattern
-- Step 1: Add new column (deploy to Green, switch traffic)
ALTER TABLE users ADD COLUMN full_name VARCHAR(200);

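-- Step 2: Backfill data
-- Rows written after this point by code that only sets name will have a NULL
-- or stale full_name, so re-run the backfill once every live version writes both columns
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Step 3: Application reads from both, writes to both
-- (Both Blue and Green must be updated for this step)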

-- Step 4: Remove old column (after Blue is decommissioned)
ALTER TABLE users DROP COLUMN name;

Expand-Contract Migration Pattern:

graph LR
    A[Original Schema] --> B[Add New Column]
    B --> C[Backfill Data]
    C --> D[App Uses Both]
    D --> E[Remove Old Column]

    subgraph "Blue Compatible"
        A
        B
        C
        D
    end

    subgraph "Green Only"
        E
    end
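
A CI check can catch the most obvious violations of the backward-compatibility rule before a migration runs. The sketch below is a keyword heuristic and not a SQL parser: it only flags statements that break an old version that is still live, and it will miss many cases. The function name and pattern list are assumptions for illustration.

```typescript
type Verdict = 'compatible' | 'breaking';

// Classifies a single migration statement with respect to old code that is still deployed.
function classifyMigration(sql: string): Verdict {
  const normalized = sql.toUpperCase().replace(/\s+/g, ' ');

  // Removing or renaming something the old code still references
  const breakingPatterns = [/\bDROP COLUMN\b/, /\bRENAME COLUMN\b/, /\bDROP TABLE\b/];
  if (breakingPatterns.some((pattern) => pattern.test(normalized))) {
    return 'breaking';
  }

  // A new NOT NULL column without a default rejects inserts from old code
  const addsColumn = /\bADD COLUMN\b/.test(normalized);
  const notNull = /\bNOT NULL\b/.test(normalized);
  const hasDefault = /\bDEFAULT\b/.test(normalized);
  if (addsColumn && notNull && !hasDefault) {
    return 'breaking';
  }

  return 'compatible';
}

console.log(classifyMigration('ALTER TABLE users ADD COLUMN middle_name VARCHAR(100);'));
console.log(classifyMigration('ALTER TABLE users RENAME COLUMN name TO full_name;'));
```

A 'breaking' verdict does not mean the statement is forbidden. Dropping the old column is the final step of expand-contract, and it becomes safe once no deployed version reads that column.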

Strategy 2: Feature Flags for Schema-Dependent Code

// Both Blue and Green include this code
async function getUserDisplayName(userId: string): Promise<string> {
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  // Feature flag controls which field to use
  if (await featureFlags.isEnabled('use_full_name_field')) {
    return user.full_name || user.name; // Fallback to old field
  }

  return user.name;
}

async function updateUserName(userId: string, name: string): Promise<void> {
  if (await featureFlags.isEnabled('use_full_name_field')) {
    // Write to both fields during transition
    await db.query(
      'UPDATE users SET name = $2, full_name = $2 WHERE id = $1',
      [userId, name]
    );
  } else {
    await db.query(
      'UPDATE users SET name = $2 WHERE id = $1',
      [userId, name]
    );
  }
}

Migration Timeline:

T+0      Blue live, Green deploying
         Schema: users (name)
         Flag: use_full_name_field = false

T+1      Deploy migration: ADD COLUMN full_name
         Blue and Green both work (ignore new column)

T+2      Run backfill: full_name = name

T+3      Switch traffic to Green
         Enable flag: use_full_name_field = true
         Green reads full_name, writes both

T+4      Verify Green is stable
         Blue is dormant (no traffic)

T+5      Deploy code to remove old column dependency
         (Next release)

T+6      Drop column: ALTER TABLE users DROP COLUMN name

Deployment Pipeline: End-to-End Flow

sequenceDiagram
    participant Dev as Developer
    participant CI as CI/CD Pipeline
    participant Green as Green Environment
    participant Smoke as Smoke Tests
    participant LB as Load Balancer
    participant Blue as Blue Environment
    participant Monitor as Monitoring

    Dev->>CI: Push to main
    CI->>CI: Build & Test
    CI->>Green: Deploy to Green

    Green->>Green: Health checks pass

    CI->>Smoke: Run smoke tests against Green
    Smoke->>Green: HTTP requests
    Green->>Smoke: 200 OK

    Smoke->>CI: Tests passed

    CI->>LB: Switch traffic to Green
    LB->>Green: 100% traffic
    LB->>Blue: 0% traffic

    Monitor->>Green: Collect metrics

    alt Metrics OK
        Monitor->>CI: Deployment successful
        CI->>Blue: Update Blue with new version (prepare for next deploy)
    else Metrics Bad
        Monitor->>CI: Anomaly detected
        CI->>LB: Rollback to Blue
        LB->>Blue: 100% traffic
        LB->>Green: 0% traffic
        CI->>Dev: Alert: Deployment rolled back
    end

Implementation:

class BlueGreenPipeline {
  async deploy(version: string): Promise<DeploymentResult> {
    const startTime = Date.now();
    const standbyEnv = await this.getStandbyEnvironment();

    console.log(`Deploying ${version} to ${standbyEnv} environment`);

    // Step 1: Deploy to standby
    await this.deployToEnvironment(standbyEnv, version);

    // Step 2: Wait for health checks
    await this.waitForHealthy(standbyEnv, 300000); // 5 min timeout

    // Step 3: Run smoke tests
    const smokeResult = await this.runSmokeTests(standbyEnv);

    if (!smokeResult.passed) {
      throw new Error(`Smoke tests failed: ${smokeResult.failures.join(', ')}`);
    }

    // Step 4: Switch traffic
    const activeEnv = await this.getActiveEnvironment();
    await this.switchTraffic(activeEnv, standbyEnv);

    // Step 5: Monitor for anomalies (5 minutes)
    const monitorResult = await this.monitorForAnomalies(standbyEnv, 300000);

    if (monitorResult.anomalyDetected) {
      console.error('Anomaly detected, rolling back');
      await this.switchTraffic(standbyEnv, activeEnv);
      throw new Error(`Rollback triggered: ${monitorResult.reason}`);
    }

    // Step 6: Mark deployment successful
    await this.recordDeployment({
      version,
      environment: standbyEnv,
      duration: Date.now() - startTime,
      status: 'success',
    });

    return { success: true, environment: standbyEnv };
  }

  private async runSmokeTests(env: 'blue' | 'green'): Promise<SmokeTestResult> {
    const endpoint = env === 'green' ? this.greenEndpoint : this.blueEndpoint;

    const tests = [
      { name: 'homepage', path: '/', expectedStatus: 200 },
      { name: 'api-health', path: '/api/health', expectedStatus: 200 },
      { name: 'static-asset', path: '/_next/static/chunks/main.js', expectedStatus: 200 },
    ];

    const results = await Promise.all(
      tests.map(async (test) => {
        const response = await fetch(`${endpoint}${test.path}`);
        return {
          name: test.name,
          passed: response.status === test.expectedStatus,
          actualStatus: response.status,
        };
      })
    );

    return {
      passed: results.every(r => r.passed),
      failures: results.filter(r => !r.passed).map(r => r.name),
    };
  }

  private async monitorForAnomalies(
    env: 'blue' | 'green',
    durationMs: number
  ): Promise<MonitorResult> {
    const startTime = Date.now();
    const checkInterval = 10000; // 10 seconds

    while (Date.now() - startTime < durationMs) {
      const metrics = await this.collectMetrics(env);

      // Check error rate
      if (metrics.errorRate > 0.01) { // >1% error rate
        return {
          anomalyDetected: true,
          reason: `Error rate ${(metrics.errorRate * 100).toFixed(2)}% exceeds threshold`,
        };
      }

      // Check latency
      if (metrics.p95Latency > 2000) { // >2s p95
        return {
          anomalyDetected: true,
          reason: `P95 latency ${metrics.p95Latency}ms exceeds threshold`,
        };
      }

      await sleep(checkInterval);
    }

    return { anomalyDetected: false };
  }
}

Cost Analysis: Blue-Green Economics

Blue-green deployment requires running two production environments, which has significant cost implications.

Infrastructure Costs

Component	Blue	Green	Total	Notes
EC2/EKS Compute	$15K/mo	$15K/mo	$30K/mo	Both at full capacity
Load Balancers	$500/mo	$500/mo	$1K/mo	One per environment
CDN (separate configs)	$8K/mo	$8K/mo	$16K/mo	Asset isolation
S3 (static assets)	$200/mo	$200/mo	$400/mo	Versioned assets
Total Infrastructure			$47.4K/mo

Cost Optimization Strategies

1. Scale Down Standby Environment:

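Keep standby at minimum capacity, scale up before switch. The trade-off is rollback speed: a previous environment scaled down to 2 replicas cannot absorb full production traffic, so a rollback has to scale it back up first, which conflicts with the <10 second rollback target: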

class CostOptimizedBlueGreen {
  async prepareForSwitch(standbyEnv: 'blue' | 'green'): Promise<void> {
    // Scale up standby to match production capacity
    await this.scaleEnvironment(standbyEnv, {
      minReplicas: 20,
      maxReplicas: 50,
    });

    // Wait for pods to be ready
    await this.waitForCapacity(standbyEnv, 20);
  }

  async afterSwitch(previousEnv: 'blue' | 'green'): Promise<void> {
    // Scale down previous environment to minimum
    await this.scaleEnvironment(previousEnv, {
      minReplicas: 2, // Keep warm for fast rollback
      maxReplicas: 5,
    });
  }
}
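The class above hard-codes 20 and 50 replicas. As a minimal sketch, the scale-up target could instead be derived from the active environment's current replica count. The `ScaleTarget` interface, the `standbyTargetFor` helper, and the `headroom` parameter are illustrative assumptions, not part of the original class.

```typescript
// Hypothetical shape mirroring the { minReplicas, maxReplicas } options above.
interface ScaleTarget {
  minReplicas: number;
  maxReplicas: number;
}

// Derive the standby's pre-switch target from the active environment's
// current replica count instead of a hard-coded value.
// `headroom` is an assumed tuning knob: how far autoscaling may grow
// beyond the current count (2.5 turns 20 replicas into a max of 50).
function standbyTargetFor(activeReplicas: number, headroom = 2.5): ScaleTarget {
  if (!Number.isInteger(activeReplicas) || activeReplicas < 1) {
    throw new RangeError(`activeReplicas must be a positive integer, got ${activeReplicas}`);
  }
  if (headroom < 1) {
    throw new RangeError(`headroom must be >= 1, got ${headroom}`);
  }
  return {
    minReplicas: activeReplicas,
    maxReplicas: Math.ceil(activeReplicas * headroom),
  };
}
```

With 20 active replicas and the default headroom this reproduces the 20/50 values used in `prepareForSwitch`, so the standby always matches whatever production is actually running at switch time.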

Cost Savings:

Strategy	Before (per month)	After (per month)	Savings
Standby at minimum	$30K	$18K	40%
Spot instances for standby	$30K	$21K	30%
Combined	$30K	$15K	50%
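The $30K baseline in this table matches the compute line of the infrastructure table, so the percentages appear to apply to compute only. The sketch below works through the arithmetic under that assumption, showing what the combined strategy means for the whole monthly bill. The `LineItem` type and helper names are illustrative.

```typescript
interface LineItem {
  name: string;
  monthlyUsd: number;
}

// Totals per component, copied from the infrastructure table.
const lineItems: LineItem[] = [
  { name: 'compute', monthlyUsd: 30000 },
  { name: 'loadBalancers', monthlyUsd: 1000 },
  { name: 'cdn', monthlyUsd: 16000 },
  { name: 's3', monthlyUsd: 400 },
];

function totalMonthly(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.monthlyUsd, 0);
}

// Percentage saved going from `before` to `after`, rounded to one decimal.
function savingsPercent(before: number, after: number): number {
  return Math.round(((before - after) / before) * 1000) / 10;
}

// Apply the "Combined" row: compute drops from $30K to $15K.
const optimized = lineItems.map(item =>
  item.name === 'compute' ? { ...item, monthlyUsd: 15000 } : item
);

const billBefore = totalMonthly(lineItems); // 47400
const billAfter = totalMonthly(optimized);  // 32400

console.log(savingsPercent(30000, 15000));          // 50 (compute line only)
console.log(savingsPercent(billBefore, billAfter)); // 31.6 (whole bill)
```

Under that reading, a 50% compute saving reduces the total bill by about a third, because CDN, load balancer, and storage costs are unchanged.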

2. Share CDN Configuration:

Use path-based routing instead of separate CDN configurations:

// Single CDN, path-based routing
const cdnConfig = {
  origins: [
    { name: 'blue', domain: 'blue-origin.example.com' },
    { name: 'green', domain: 'green-origin.example.com' },
  ],
  routes: [
    { path: '/dist-blue/*', origin: 'blue' },
    { path: '/dist-green/*', origin: 'green' },
    { path: '/*', origin: 'active' }, // Dynamic based on switch state
  ],
};
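The `'active'` origin in the config above is a placeholder that has to be resolved per request. A minimal sketch of that resolution follows, assuming simple prefix matching in declaration order. The `resolveOrigin` function and the `Route` shape are hypothetical and do not correspond to any particular CDN's API.

```typescript
type Env = 'blue' | 'green';

interface Route {
  prefix: string;         // path prefix, e.g. '/dist-blue/'
  origin: Env | 'active'; // 'active' is resolved at request time
}

const originDomains: Record<Env, string> = {
  blue: 'blue-origin.example.com',
  green: 'green-origin.example.com',
};

// Ordered most-specific first; the catch-all must stay last.
const routeTable: Route[] = [
  { prefix: '/dist-blue/', origin: 'blue' },
  { prefix: '/dist-green/', origin: 'green' },
  { prefix: '/', origin: 'active' },
];

// Return the origin domain for a request path, given the currently
// active environment (the switch state).
function resolveOrigin(path: string, activeEnv: Env): string {
  const route = routeTable.find(r => path.startsWith(r.prefix));
  if (!route) {
    throw new Error(`No route matches path: ${path}`);
  }
  const env: Env = route.origin === 'active' ? activeEnv : route.origin;
  return originDomains[env];
}
```

Versioned asset paths always resolve to the environment that built them, regardless of the switch state. Only the catch-all route follows the switch, so a page rendered by Blue can still load its `/dist-blue/` assets after traffic moves to Green.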

Blue-Green vs Canary: When to Use Each

Aspect	Blue-Green	Canary
Traffic Split	0/100 or 100/0 (atomic)	Gradual (5% → 25% → 100%)
Rollback Speed	Instant (<5s)	Fast (<60s with CDN purge)
User Impact	All users switch together	Subset experiences new version
Testing Confidence	Lower (no production validation)	Higher (real traffic testing)
Infrastructure Cost	2x (two full environments)	1.05-1.25x (minimal extra capacity)
Complexity	Lower	Higher (traffic splitting logic)
Best For	Database migrations, major releases	Incremental features, A/B testing

Decision Framework:

Is this a database schema change?
  YES → Blue-Green (need consistent schema)

Is this a major version upgrade (React, Next.js)?
  YES → Blue-Green (all-or-nothing change)

Is this a new feature that can be gradually rolled out?
  YES → Canary (validate with subset first)

Do you need instant rollback capability?
  YES → Blue-Green (faster rollback)

Is cost a primary concern?
  YES → Canary (lower infrastructure overhead)
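
The framework above can be sketched as a small function. This assumes the questions are evaluated top to bottom and the first YES wins, which the list itself does not state. The `ReleaseTraits` field names are illustrative.

```typescript
type Strategy = 'blue-green' | 'canary';

// One boolean per question in the decision framework (names are assumed).
interface ReleaseTraits {
  schemaChange: boolean;
  majorUpgrade: boolean;
  gradualRollout: boolean;
  needsInstantRollback: boolean;
  costSensitive: boolean;
}

// Walk the questions in order and return the first recommendation.
// Returns undefined when no question answers YES, leaving the choice open.
function chooseStrategy(traits: ReleaseTraits): Strategy | undefined {
  if (traits.schemaChange) return 'blue-green';         // need consistent schema
  if (traits.majorUpgrade) return 'blue-green';         // all-or-nothing change
  if (traits.gradualRollout) return 'canary';           // validate with subset first
  if (traits.needsInstantRollback) return 'blue-green'; // faster rollback
  if (traits.costSensitive) return 'canary';            // lower infrastructure overhead
  return undefined;
}
```

Ordering matters under this assumption: a cost-sensitive release that also changes the database schema still resolves to blue-green, because the schema question is asked first.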

Production Incidents and Lessons

Incident 1: Session Store Key Prefix Mismatch

What Happened:

Switched from Blue to Green. Users were logged out immediately after the switch.

Root Cause:

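Blue stored sessions under the Redis key prefix session:blue:, while Green looked them up under session:green:. After the switch, Green could not find any session that Blue had created, so every user appeared unauthenticated.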

Fix:

// Use environment-agnostic session keys
const SESSION_PREFIX = 'session:'; // Not 'session:blue:' or 'session:green:'

function getSessionKey(sessionId: string): string {
  return `${SESSION_PREFIX}${sessionId}`;
}

Incident 2: CDN Cache Serving Stale HTML

What Happened:

Switched to Green, but users kept receiving Blue HTML for 30 minutes.

Root Cause:

HTML was cached at the CDN with a 1-hour TTL. The switch changed the origin, but the CDN kept serving its cached copies until they expired.

Fix:

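async function switchWithCachePurge(to: 'blue' | 'green'): Promise<void> {
  // Step 1: Switch the load balancer first, so every cache miss from
  // this point on is filled from the new environment
  await switchLoadBalancer(to);

  // Step 2: Purge cached HTML so the CDN refetches it from the new origin.
  // Purging before the switch would let the CDN re-cache the old HTML.
  await purgeCDN('/*.html');
  await purgeCDN('/');

  // Step 3: Wait for the purge to propagate across edge locations
  await sleep(5000);
}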
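
A purge fixes the switch itself, but the 1-hour TTL is what made the stale window so long. As a complementary sketch, HTML responses could carry a short shared-cache lifetime so that a missed purge is bounded to seconds rather than an hour. This assumes HTML paths end in `/` or `.html`, matching the purge patterns above. The helper names and the 60-second default are illustrative, not taken from the incident.

```typescript
// Assumption: HTML documents are served at '/' style paths or '*.html',
// the same patterns the purge calls above target.
function isHtmlPath(path: string): boolean {
  return path.endsWith('/') || path.endsWith('.html');
}

// Return a Cache-Control value for HTML, or undefined for other paths
// so that the existing asset caching policy is left untouched.
// max-age=0 makes browsers revalidate on every navigation;
// s-maxage lets shared caches (the CDN) hold the page briefly.
function htmlCacheControl(path: string, cdnTtlSeconds = 60): string | undefined {
  if (!isHtmlPath(path)) {
    return undefined;
  }
  if (!Number.isInteger(cdnTtlSeconds) || cdnTtlSeconds < 0) {
    throw new RangeError(`cdnTtlSeconds must be a non-negative integer, got ${cdnTtlSeconds}`);
  }
  return `public, max-age=0, s-maxage=${cdnTtlSeconds}`;
}
```

With this policy, even if a purge fails, the CDN stops serving the old environment's HTML once the short `s-maxage` window expires.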

Incident 3: Database Migration Timing

What Happened:

Ran migration, switched to Green, Green crashed because migration wasn't complete.

Root Cause:

Migration was still running when traffic switch happened.

Fix:

async function deployWithMigration(version: string): Promise<void> {
  // Step 1: Run migration
  console.log('Running database migration...');
  await runMigration();

  // Step 2: Verify migration complete
  const migrationStatus = await checkMigrationStatus();
  if (migrationStatus !== 'complete') {
    throw new Error('Migration not complete, aborting deploy');
  }

  // Step 3: Deploy application
  await deployToStandby(version);

  // Step 4: Verify standby can use new schema
  await runSchemaValidationTests();

  // Step 5: Switch traffic
  await switchTraffic();
}

Summary: Blue-Green Architecture Principles

Environment Parity - Blue and Green must be identical except for application version: same infrastructure, same configuration, and same capacity at switch time.
Atomic Switching - Traffic moves from one environment to another in a single operation. No gradual migration.
Shared Persistent State - Database, session store, and external services are shared. Both environments must be compatible.
Instant Rollback - The primary value of blue-green is rollback speed. If a switch fails, revert in seconds.
Asset Isolation - Each environment serves assets from unique paths to prevent CDN cache collisions.
Connection Draining - Allow in-flight requests to complete before fully switching.
Health Verification - Never switch to an environment that isn't healthy. Automated health checks before switch.
Cost Awareness - Two hot environments cost 2x. Scale down the standby between deployments and scale it back up before each switch.

Blue-green deployment trades infrastructure cost for deployment safety. For applications where downtime is unacceptable and rollback speed is critical, it's the right choice. For incremental feature releases, canary deployments are more appropriate.

The architecture outlined here (load balancer switching, versioned asset paths, shared session storage, and automated pipelines) covers the core building blocks for zero-downtime frontend deployments at scale.
