Frontend Disaster Recovery Architecture: Building Resilient Systems at Scale

June 17, 2026 · 111 min read

failover

frontend architecture

resilience engineering

multi region architecture

Introduction: Why Frontend DR Is Different

When AWS us-east-1 experienced a major outage in December 2021, companies with proper disaster recovery plans failed over in minutes. Companies without them were offline for hours. The difference wasn't backend infrastructure—it was whether their frontend could serve users from an alternate location while maintaining functionality.

Frontend disaster recovery is fundamentally different from backend DR because:

Static Assets Create False Confidence - "Our app is on a CDN, it's already distributed." Until the CDN's origin becomes unreachable and cached content expires.
Client-Side Code Can't Fail Over on Its Own - Your React app doesn't know to use a different API endpoint when the primary fails. That logic must be built in.
Browser State Is Untransferable - localStorage, IndexedDB, and Service Worker caches are device-specific. Users can't seamlessly continue on a different device.
Hydration Depends on API Availability - SSR/SSG pages look fine until JavaScript tries to hydrate and discovers APIs are unreachable.
Third-Party Dependencies Are Hidden SPOFs - Your auth provider, analytics, feature flags, and payment processor are all single points of failure.

This article covers how to architect frontend systems that survive regional outages, CDN failures, and origin unavailability—while maintaining user experience during degraded states.

Scale Context: Production Reality

System Profile:

DAU: 35M daily active users
Peak RPS: 320K requests/second
Geographic Distribution: 140+ countries
Primary Region: US-East (us-east-1)
Secondary Region: US-West (us-west-2)
Tertiary Region: EU-West (eu-west-1)

Availability Targets:

SLA: 99.95% (about 4.4 hours of downtime/year; a 26-minute yearly budget would correspond to 99.995%)
RTO (Recovery Time Objective): 5 minutes
RPO (Recovery Point Objective): 0 for stateless, 1 minute for stateful
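
An availability percentage is easier to reason about once it is converted into a downtime budget. The sketch below shows that arithmetic; `downtimeBudgetMinutes` is a hypothetical helper introduced here for illustration, and it assumes a 365-day year.

```typescript
// Hypothetical helper: converts an availability percentage into allowed downtime.
const MINUTES_PER_DAY = 24 * 60;

function downtimeBudgetMinutes(availabilityPercent: number, periodDays = 365): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError('availabilityPercent must be between 0 and 100');
  }
  const unavailableFraction = 1 - availabilityPercent / 100;
  return unavailableFraction * periodDays * MINUTES_PER_DAY;
}

// Budget for a 99.95% target over a year and over a 30-day month.
const yearlyBudget = Math.round(downtimeBudgetMinutes(99.95));
const monthlyBudget = Math.round(downtimeBudgetMinutes(99.95, 30));
console.log({ yearlyBudget, monthlyBudget });
```

Tracking the budget per month as well as per year makes it clearer how much of it a single 5-minute failover consumes.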

Infrastructure:

CDN: Multi-provider (Cloudflare + CloudFront)
Origin: Kubernetes across 3 regions
Database: PostgreSQL with cross-region replication
Cache: Redis with active-passive replication
Static Assets: S3 with cross-region replication

High-Level DR Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER REQUEST                                    │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  LAYER 1: Global DNS (Route 53 / Cloudflare DNS)                            │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Health-Based Routing                                                 │  │
│  │  - Primary: us-east-1 (weight: 100, health: ✓)                       │  │
│  │  - Secondary: us-west-2 (weight: 0, failover if primary unhealthy)   │  │
│  │  - Tertiary: eu-west-1 (latency-based for EU users)                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  LAYER 2: CDN Edge (Multi-Provider)                                         │
│  ┌─────────────────────────────┐    ┌─────────────────────────────┐         │
│  │      Cloudflare (Primary)  │    │    CloudFront (Failover)    │         │
│  │  - 300+ PoPs               │    │  - 450+ PoPs                │         │
│  │  - Edge Workers            │    │  - Lambda@Edge              │         │
│  │  - KV Storage              │    │  - S3 Origin                │         │
│  └─────────────────────────────┘    └─────────────────────────────┘         │
│                                                                              │
│  Cache Strategy:                                                             │
│  - HTML: 60s TTL + stale-while-revalidate: 3600s + stale-if-error: 86400s  │
│  - JS/CSS: Immutable (content-addressed)                                    │
│  - API responses: Vary by region, 30s TTL                                   │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
              ┌───────────────────────┼───────────────────────┐
              │                       │                       │
              ▼                       ▼                       ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│    US-EAST-1 (PRIMARY)  │ │   US-WEST-2 (SECONDARY) │ │   EU-WEST-1 (TERTIARY)  │
│                         │ │                         │ │                         │
│  ┌───────────────────┐  │ │  ┌───────────────────┐  │ │  ┌───────────────────┐  │
│  │   Load Balancer   │  │ │  │   Load Balancer   │  │ │  │   Load Balancer   │  │
│  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │
│            │            │ │            │            │ │            │            │
│  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │
│  │  Frontend Cluster │  │ │  │  Frontend Cluster │  │ │  │  Frontend Cluster │  │
│  │  (K8s: 20 pods)   │  │ │  │  (K8s: 10 pods)   │  │ │  │  (K8s: 10 pods)   │  │
│  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │
│            │            │ │            │            │ │            │            │
│  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │
│  │   BFF Services    │  │ │  │   BFF Services    │  │ │  │   BFF Services    │  │
│  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │ │  └─────────┬─────────┘  │
│            │            │ │            │            │ │            │            │
│  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │ │  ┌─────────┴─────────┐  │
│  │  PostgreSQL (RW)  │  │ │  │  PostgreSQL (RO)  │  │ │  │  PostgreSQL (RO)  │  │
│  │  Redis (Primary)  │  │ │  │  Redis (Replica)  │  │ │  │  Redis (Replica)  │  │
│  └───────────────────┘  │ │  └───────────────────┘  │ │  └───────────────────┘  │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
              │                       │                       │
              └───────────────────────┴───────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  SHARED SERVICES (Cross-Region)                                              │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐                    │
│  │  Auth0 (Auth) │  │ Stripe (Pay)  │  │ LaunchDarkly  │                    │
│  │  Multi-region │  │  Multi-region │  │  (Flags)      │                    │
│  └───────────────┘  └───────────────┘  └───────────────┘                    │
└─────────────────────────────────────────────────────────────────────────────┘

RTO and RPO: Defining Recovery Objectives

Recovery Time Objective (RTO)

Time from failure detection to service restoration.

Component	RTO Target	Mechanism
Static Assets (CDN)	0	Already distributed, stale-if-error
HTML Pages	30s	CDN failover + stale content
API Endpoints	5 min	DNS failover + health checks
Database Writes	15 min	Promote read replica
Full Functionality	30 min	All services recovered

Recovery Point Objective (RPO)

Maximum acceptable data loss (time between last backup and failure).

Data Type	RPO Target	Mechanism
User Sessions	0	Redis replication (requires synchronous acknowledgement, e.g. WAIT; default Redis replication is asynchronous)
User Data	1 min	PostgreSQL streaming replication
Analytics Events	5 min	Kafka mirroring
Audit Logs	0	Multi-region writes

DNS-Level Failover Architecture

Health Check Configuration

// AWS Route 53 Health Check
const healthCheckConfig = {
  Type: 'HTTPS',
  ResourcePath: '/api/health',
  FullyQualifiedDomainName: 'api.example.com',
  Port: 443,
  RequestInterval: 10, // Check every 10 seconds
  FailureThreshold: 3, // 3 failures = unhealthy
  MeasureLatency: true,
  Regions: [
    'us-east-1',
    'us-west-2',
    'eu-west-1',
    'ap-southeast-1',
  ],
};

// Health check endpoint
app.get('/api/health', async (req, res) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkExternalAPIs(),
  ]);

  const allHealthy = checks.every(c => c.status === 'fulfilled');

  if (allHealthy) {
    res.status(200).json({ status: 'healthy', timestamp: Date.now() });
  } else {
    const failures = checks
      .filter(c => c.status === 'rejected')
      .map(c => c.reason);

    res.status(503).json({ status: 'unhealthy', failures });
  }
});
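
The endpoint above returns a single status, and any failed check, including a third-party API, marks the whole region unhealthy. One possible refinement is to report per-service status and return 503 only when a critical dependency is down, so that a non-critical outage does not trigger a regional failover. This is a sketch under assumptions: `buildHealthReport`, the `services` field, and the choice of critical services are illustrative and not part of the original system.

```typescript
type ServiceStatus = 'healthy' | 'unhealthy';

interface HealthReport {
  status: 'healthy' | 'degraded' | 'unhealthy';
  httpStatus: 200 | 503;
  services: Record<string, ServiceStatus>;
}

// Assumed split: only these dependencies should take the region out of rotation.
const CRITICAL_SERVICES = ['database', 'redis'];

function buildHealthReport(
  results: Record<string, boolean>,
  critical: string[] = CRITICAL_SERVICES
): HealthReport {
  const services: Record<string, ServiceStatus> = {};
  for (const [name, ok] of Object.entries(results)) {
    services[name] = ok ? 'healthy' : 'unhealthy';
  }

  const criticalDown = critical.some((name) => results[name] === false);
  const anyDown = Object.values(results).some((ok) => !ok);

  if (criticalDown) {
    return { status: 'unhealthy', httpStatus: 503, services };
  }
  // Non-critical failures keep the region in rotation but stay visible to clients.
  return { status: anyDown ? 'degraded' : 'healthy', httpStatus: 200, services };
}

const report = buildHealthReport({ database: true, redis: true, recommendations: false });
console.log(report.status, report.httpStatus, report.services);
```

A client can then read `services` to disable individual features while the DNS health check keeps seeing a 200.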

Failover DNS Configuration

// Route 53 failover routing policy
const dnsConfig = {
  HostedZoneId: 'Z1234567890',
  ChangeBatch: {
    Changes: [
      // Primary record
      {
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: 'app.example.com',
          Type: 'A',
          SetIdentifier: 'primary-us-east',
          Failover: 'PRIMARY',
          HealthCheckId: 'health-check-us-east',
          AliasTarget: {
            HostedZoneId: 'Z35SXDOTRQ7X7K',
            DNSName: 'alb-us-east.example.com',
            EvaluateTargetHealth: true,
          },
        },
      },
      // Secondary record
      {
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: 'app.example.com',
          Type: 'A',
          SetIdentifier: 'secondary-us-west',
          Failover: 'SECONDARY',
          HealthCheckId: 'health-check-us-west',
          AliasTarget: {
            HostedZoneId: 'Z1H1FL5HABSF5',
            DNSName: 'alb-us-west.example.com',
            EvaluateTargetHealth: true,
          },
        },
      },
    ],
  },
};

Failover Timeline

T+0s      Primary region fails
T+10s     First health check fails
T+20s     Second health check fails
T+30s     Third health check fails (threshold reached)
T+30s     Route 53 marks primary unhealthy
T+30s     DNS starts returning secondary IP
T+60s     DNS TTL expires (most clients)
T+90s     90% of traffic on secondary

Total failover time: ~90 seconds (dominated by DNS TTL)
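
The timeline above reduces to simple arithmetic: detection time (check interval times failure threshold) plus one record TTL. The helper below is a hypothetical sketch of that estimate. It ignores resolvers that disregard TTLs and clients holding long-lived connections, so treat the result as a rough figure.

```typescript
interface DnsFailoverInputs {
  requestIntervalSeconds: number; // health check interval
  failureThreshold: number;       // consecutive failures before "unhealthy"
  recordTtlSeconds: number;       // how long resolvers may cache the old answer
}

// Rough estimate: time to detect the failure plus one full TTL for cached answers to expire.
function estimateDnsFailoverSeconds(inputs: DnsFailoverInputs): number {
  const detectionSeconds = inputs.requestIntervalSeconds * inputs.failureThreshold;
  return detectionSeconds + inputs.recordTtlSeconds;
}

const withSixtySecondTtl = estimateDnsFailoverSeconds({
  requestIntervalSeconds: 10,
  failureThreshold: 3,
  recordTtlSeconds: 60,
});
console.log(`Estimated failover: ${withSixtySecondTtl}s`);
```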

Reducing DNS Failover Time:

// Use low TTL for failover records
const lowTTLConfig = {
  TTL: 30, // 30 seconds (vs default 300s)
  // Tradeoff: More DNS queries, faster failover
};

// Or use AWS Global Accelerator to avoid waiting on DNS TTLs
// (simplified shape for illustration, not the exact Global Accelerator API)
const globalAcceleratorConfig = {
  // Static anycast IPs, no DNS propagation delay
  StaticIpAddresses: ['1.2.3.4', '5.6.7.8'],
  // Health checks at network level
  EndpointGroups: [
    { Region: 'us-east-1', Weight: 100, HealthCheckPort: 443 },
    { Region: 'us-west-2', Weight: 0, HealthCheckPort: 443 },
  ],
};

CDN Resilience Strategies

Multi-CDN Architecture

// Edge worker: Multi-origin failover
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const origins = [
      { name: 'primary', url: env.PRIMARY_ORIGIN, timeout: 5000 },
      { name: 'secondary', url: env.SECONDARY_ORIGIN, timeout: 5000 },
      { name: 'tertiary', url: env.TERTIARY_ORIGIN, timeout: 10000 },
    ];

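    const { pathname, search } = new URL(request.url);

    for (const origin of origins) {
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), origin.timeout);

      try {
        // Build the URL against this origin; request.url points at the edge itself
        const targetUrl = new URL(pathname + search, origin.url);

        // Clone so a request body can be replayed against the next origin
        const response = await fetch(
          new Request(targetUrl.toString(), request.clone()),
          { signal: controller.signal }
        );

        // Anything below 500 (including 404) is a real answer, not an origin failure
        if (response.status < 500) {
          // Add origin header for debugging
          const newResponse = new Response(response.body, response);
          newResponse.headers.set('X-Served-By', origin.name);
          return newResponse;
        }

        console.error(`Origin ${origin.name} returned ${response.status}`);
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        console.error(`Origin ${origin.name} failed: ${message}`);
      } finally {
        // Always clear the timer, including on failure
        clearTimeout(timeoutId);
      }
    }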

    // All origins failed - serve stale content if available
    const cached = await caches.default.match(request);

    if (cached) {
      const staleResponse = new Response(cached.body, cached);
      staleResponse.headers.set('X-Served-From', 'stale-cache');
      staleResponse.headers.set('X-Cache-Age', cached.headers.get('Age') || 'unknown');
      return staleResponse;
    }

    // No cached content - return error page
    return new Response(ERROR_PAGE_HTML, {
      status: 503,
      headers: { 'Content-Type': 'text/html' },
    });
  },
};

const ERROR_PAGE_HTML = `
<!DOCTYPE html>
<html>
<head>
  <title>Service Temporarily Unavailable</title>
  <style>
    body { font-family: system-ui; text-align: center; padding: 50px; }
    h1 { color: #333; }
    p { color: #666; }
  </style>
</head>
<body>
  <h1>We'll be back shortly</h1>
  <p>Our team is working to restore service. Please try again in a few minutes.</p>
  <script>
    // Auto-refresh after 30 seconds
    setTimeout(() => location.reload(), 30000);
  </script>
</body>
</html>
`;

Stale-While-Revalidate + Stale-If-Error

// Origin server cache headers
function setCacheHeaders(res: Response, contentType: string): void {
  if (contentType.includes('text/html')) { // also matches "text/html; charset=utf-8"
    // HTML: Short TTL, but serve stale on errors
    res.setHeader('Cache-Control', [
      'public',
      'max-age=60',                  // Fresh for 60 seconds
      'stale-while-revalidate=3600', // Serve stale for 1 hour while revalidating
      'stale-if-error=86400',        // Serve stale for 24 hours if origin errors
    ].join(', '));
  } else if (contentType.includes('javascript') || contentType.includes('css')) {
    // Static assets: Immutable (content-addressed)
    res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
  } else if (contentType.includes('json')) {
    // API responses: Short TTL
    res.setHeader('Cache-Control', 'public, max-age=30, stale-if-error=300');
  }
}

Failover Timeline with Stale-If-Error:

T+0       Origin fails
T+0       CDN keeps serving cached content (still fresh within max-age)
T+0       Users see content immediately (no interruption)
T+60s     Cache age exceeds max-age (60s)
T+60s     CDN attempts revalidation (fails)
T+60s     CDN continues serving stale (stale-if-error active)
T+86460s  stale-if-error window ends (24 hours after the content went stale)
T+86460s  If origin still down, 503 errors begin

Effective RTO for cached content: 0 seconds
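
The timeline can be expressed as a small decision function. RFC 5861 measures both stale windows from the moment a response stops being fresh, which is what the sketch below models. `decideCacheResponse` and its return labels are hypothetical names for illustration; real CDNs differ in which directives they honor and how, so this is a model of the intended behavior, not any provider's implementation.

```typescript
interface CachePolicy {
  maxAge: number;               // seconds the response stays fresh
  staleWhileRevalidate: number; // seconds of staleness allowed while revalidating
  staleIfError: number;         // seconds of staleness allowed when the origin errors
}

type CacheDecision =
  | 'fresh'
  | 'stale-while-revalidate'
  | 'stale-if-error'
  | 'fetch-origin'
  | 'error';

function decideCacheResponse(
  ageSeconds: number,
  policy: CachePolicy,
  originHealthy: boolean
): CacheDecision {
  if (ageSeconds <= policy.maxAge) return 'fresh';

  // Both stale windows are counted from the end of freshness.
  const staleness = ageSeconds - policy.maxAge;

  if (originHealthy) {
    return staleness <= policy.staleWhileRevalidate ? 'stale-while-revalidate' : 'fetch-origin';
  }
  return staleness <= policy.staleIfError ? 'stale-if-error' : 'error';
}

const htmlPolicy: CachePolicy = { maxAge: 60, staleWhileRevalidate: 3600, staleIfError: 86400 };
console.log(decideCacheResponse(7200, htmlPolicy, false));
```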

Client-Side Resilience Patterns

Service Worker Offline Cache

// sw.js - Service Worker
const CACHE_VERSION = 'v1';
const STATIC_CACHE = `static-${CACHE_VERSION}`;
const DYNAMIC_CACHE = `dynamic-${CACHE_VERSION}`;

const STATIC_ASSETS = [
  '/',
  '/offline.html',
  '/manifest.json',
  '/_next/static/css/main.css',
  '/_next/static/js/main.js',
];

// Install: Cache static assets
self.addEventListener('install', (event) => {
  event.waitUntil(
    caches.open(STATIC_CACHE).then((cache) => {
      return cache.addAll(STATIC_ASSETS);
    })
  );
});

// Fetch: Network-first with cache fallback
self.addEventListener('fetch', (event) => {
  const request = event.request;

  // Skip non-GET requests
  if (request.method !== 'GET') return;

  // API requests: network-first, falling back to cached responses
  if (request.url.includes('/api/')) {
    event.respondWith(handleAPIRequest(request));
    return;
  }

  // Static assets: Cache-first
  if (isStaticAsset(request.url)) {
    event.respondWith(cacheFirst(request));
    return;
  }

  // HTML: Network-first with offline fallback
  event.respondWith(networkFirstWithOfflineFallback(request));
});

async function networkFirstWithOfflineFallback(request: Request): Promise<Response> {
  try {
    const networkResponse = await fetch(request);

    // Cache successful responses
    if (networkResponse.ok) {
      const cache = await caches.open(DYNAMIC_CACHE);
      cache.put(request, networkResponse.clone());
    }

    return networkResponse;
  } catch (error) {
    // Network failed - try cache
    const cached = await caches.match(request);

    if (cached) {
      return cached;
    }

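    // No cache - return offline page (precached at install time)
    const offline = await caches.match('/offline.html');
    return offline ?? new Response('Offline', {
      status: 503,
      headers: { 'Content-Type': 'text/plain' },
    });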
  }
}

async function handleAPIRequest(request: Request): Promise<Response> {
  try {
    const response = await fetch(request);

    // Cache GET API responses
    if (response.ok && request.method === 'GET') {
      const cache = await caches.open(DYNAMIC_CACHE);
      cache.put(request, response.clone());
    }

    return response;
  } catch (error) {
    // Network failed - return cached response or error
    const cached = await caches.match(request);

    if (cached) {
      // Add header to indicate stale data
      const staleResponse = new Response(cached.body, {
        status: cached.status,
        headers: {
          ...Object.fromEntries(cached.headers),
          'X-Stale-Data': 'true',
        },
      });
      return staleResponse;
    }

    return new Response(
      JSON.stringify({ error: 'Offline', cached: false }),
      { status: 503, headers: { 'Content-Type': 'application/json' } }
    );
  }
}

async function cacheFirst(request: Request): Promise<Response> {
  const cached = await caches.match(request);

  if (cached) {
    return cached;
  }

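  const networkResponse = await fetch(request);

  // Only cache successful responses so an error page is never pinned in the cache
  if (networkResponse.ok) {
    const cache = await caches.open(STATIC_CACHE);
    await cache.put(request, networkResponse.clone());
  }

  return networkResponse;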
}

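function isStaticAsset(url: string): boolean {
  // Test the pathname only, so query strings such as ?v=123 do not defeat the match
  return /\.(js|css|woff2|png|jpg|svg)$/.test(new URL(url).pathname);
}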
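
The service worker above versions its cache names but never removes caches from earlier versions. A common companion is an activate-time cleanup, sketched below. `staleCacheNames` and `purgeOldCaches` are hypothetical helpers, and the structural `CacheStorageLike` type exists only so the sketch compiles outside a service worker. Note that this deletes every cache not on the keep list, which is only appropriate when the worker owns all caches for the origin.

```typescript
// Minimal structural type matching the parts of CacheStorage used here.
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Pure helper: returns the cache names that are not on the keep list.
function staleCacheNames(allNames: string[], keep: string[]): string[] {
  const keepSet = new Set(keep);
  return allNames.filter((name) => !keepSet.has(name));
}

// Deletes every cache not on the keep list and returns the names removed.
async function purgeOldCaches(storage: CacheStorageLike, keep: string[]): Promise<string[]> {
  const stale = staleCacheNames(await storage.keys(), keep);
  await Promise.all(stale.map((name) => storage.delete(name)));
  return stale;
}

// In sw.js this would be wired up roughly as:
//   self.addEventListener('activate', (event) => {
//     event.waitUntil(purgeOldCaches(caches, [STATIC_CACHE, DYNAMIC_CACHE]));
//   });
```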

API Client with Retry and Failover

interface APIClientConfig {
  primaryEndpoint: string;
  failoverEndpoints: string[];
  timeout: number;
  maxRetries: number;
}

class ResilientAPIClient {
  private config: APIClientConfig;
  private currentEndpoint: string;
  private endpointFailures: Map<string, number> = new Map();

  constructor(config: APIClientConfig) {
    this.config = config;
    this.currentEndpoint = config.primaryEndpoint;
  }

  async fetch<T>(path: string, options: RequestInit = {}): Promise<T> {
    const endpoints = [this.config.primaryEndpoint, ...this.config.failoverEndpoints];

    for (const endpoint of endpoints) {
      // Skip endpoints with recent failures
      if (this.isEndpointCircuitOpen(endpoint)) {
        continue;
      }

      try {
        const response = await this.fetchWithTimeout(
          `${endpoint}${path}`,
          options,
          this.config.timeout
        );

        if (response.ok) {
          // Reset failure count on success
          this.endpointFailures.set(endpoint, 0);
          return response.json();
        }

        // Server error - try next endpoint
        if (response.status >= 500) {
          this.recordFailure(endpoint);
          continue;
        }

        // Client error - don't retry
        throw new APIError(response.status, await response.text());
      } catch (error) {
        if (error instanceof APIError) throw error;

        this.recordFailure(endpoint);
        console.error(`Endpoint ${endpoint} failed: ${error.message}`);
      }
    }

    // All endpoints failed
    throw new Error('All API endpoints unavailable');
  }

  private async fetchWithTimeout(
    url: string,
    options: RequestInit,
    timeout: number
  ): Promise<Response> {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);

    try {
      const response = await fetch(url, {
        ...options,
        signal: controller.signal,
      });
      return response;
    } finally {
      clearTimeout(timeoutId);
    }
  }

  private isEndpointCircuitOpen(endpoint: string): boolean {
    const failures = this.endpointFailures.get(endpoint) || 0;
    // Open circuit after 3 failures
    return failures >= 3;
  }

  private recordFailure(endpoint: string): void {
    const current = this.endpointFailures.get(endpoint) || 0;
    this.endpointFailures.set(endpoint, current + 1);

    // Reset after 30 seconds (half-open circuit)
    setTimeout(() => {
      this.endpointFailures.set(endpoint, 0);
    }, 30000);
  }
}

class APIError extends Error {
  constructor(public status: number, public body: string) {
    super(`API Error: ${status}`);
  }
}

// Usage
const apiClient = new ResilientAPIClient({
  primaryEndpoint: 'https://api.example.com',
  failoverEndpoints: [
    'https://api-us-west.example.com',
    'https://api-eu.example.com',
  ],
  timeout: 5000,
  maxRetries: 3,
});
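
In the client above, every failure schedules its own 30-second reset timer, so a timer from an early failure can clear the count while later failures are still recent. One alternative is to track when the circuit opened and derive the state from the clock. The `CircuitBreaker` class below is an illustrative sketch and not part of the original client. Its half-open state lets requests through until one of them reports a result, which is simpler than a strict single-probe design.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly failureThreshold = 3,
    private readonly cooldownMs = 30000,
    private readonly now: () => number = () => Date.now() // injectable for tests
  ) {}

  state(): CircuitState {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  // Closed and half-open circuits let a request through; open ones do not.
  canRequest(): boolean {
    return this.state() !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in half-open state re-opens the circuit and restarts the cooldown.
    if (this.state() === 'half-open' || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}

// One breaker per endpoint, keyed by URL.
const breakers = new Map<string, CircuitBreaker>();
breakers.set('https://api.example.com', new CircuitBreaker());
```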

Graceful Degradation UI

import React, { useEffect, useState } from 'react';

// React hook for degraded mode detection
function useDegradedMode(): DegradedModeState {
  const [state, setState] = useState<DegradedModeState>({
    isOnline: navigator.onLine,
    apiAvailable: true,
    usingStaleData: false,
    features: {
      checkout: true,
      search: true,
      recommendations: true,
    },
  });

  useEffect(() => {
    // Monitor network status
    const handleOnline = () => setState(s => ({ ...s, isOnline: true }));
    const handleOffline = () => setState(s => ({ ...s, isOnline: false }));

    window.addEventListener('online', handleOnline);
    window.addEventListener('offline', handleOffline);

    // Check API availability periodically
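    const checkAPI = async () => {
      try {
        // fetch() has no `timeout` option; AbortSignal.timeout aborts after 3 seconds
        const response = await fetch('/api/health', { signal: AbortSignal.timeout(3000) });
        if (!response.ok) throw new Error(`Health check returned ${response.status}`);
        // Assumes the health payload exposes per-service status under `services`
        const health = await response.json();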

        setState(s => ({
          ...s,
          apiAvailable: true,
          features: {
            checkout: health.services.checkout === 'healthy',
            search: health.services.search === 'healthy',
            recommendations: health.services.recommendations === 'healthy',
          },
        }));
      } catch (error) {
        setState(s => ({
          ...s,
          apiAvailable: false,
          features: {
            checkout: false,
            search: false,
            recommendations: false,
          },
        }));
      }
    };

    const interval = setInterval(checkAPI, 30000);
    checkAPI(); // Initial check

    return () => {
      window.removeEventListener('online', handleOnline);
      window.removeEventListener('offline', handleOffline);
      clearInterval(interval);
    };
  }, []);

  return state;
}

// Degraded mode component wrapper
function DegradedModeProvider({ children }: { children: React.ReactNode }) {
  const degradedMode = useDegradedMode();

  return (
    <DegradedModeContext.Provider value={degradedMode}>
      {!degradedMode.isOnline && <OfflineBanner />}
      {!degradedMode.apiAvailable && degradedMode.isOnline && <APIUnavailableBanner />}
      {degradedMode.usingStaleData && <StaleDataBanner />}
      {children}
    </DegradedModeContext.Provider>
  );
}

// Feature-specific degradation
function CheckoutButton() {
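  // Read the shared state from the provider instead of starting another health poller
  const { features } = React.useContext(DegradedModeContext);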

  if (!features.checkout) {
    return (
      <button disabled className="checkout-button disabled">
        Checkout Temporarily Unavailable
      </button>
    );
  }

  return (
    <button className="checkout-button" onClick={handleCheckout}>
      Proceed to Checkout
    </button>
  );
}

function SearchBar() {
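  // Read the shared state from the provider instead of starting another health poller
  const { features } = React.useContext(DegradedModeContext);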

  if (!features.search) {
    return (
      <div className="search-degraded">
        <input disabled placeholder="Search unavailable - browse categories instead" />
        <CategoryBrowser />
      </div>
    );
  }

  return <FullSearchBar />;
}

Database Failover Strategies

Read Replica Promotion

import { Pool, QueryResult } from 'pg';

// Database client with failover
class DatabaseClient {
  private primaryPool: Pool;
  private replicaPools: Pool[];
  private currentWritePool: Pool;

  constructor(config: DatabaseConfig) {
    this.primaryPool = new Pool({
      host: config.primaryHost,
      port: 5432,
      ...config.credentials,
    });

    this.replicaPools = config.replicaHosts.map(host =>
      new Pool({ host, port: 5432, ...config.credentials })
    );

    this.currentWritePool = this.primaryPool;
  }

  async query(sql: string, params?: any[]): Promise<QueryResult> {
    const isWrite = this.isWriteQuery(sql);

    if (isWrite) {
      return this.executeWrite(sql, params);
    }

    return this.executeRead(sql, params);
  }

  private async executeRead(sql: string, params?: any[]): Promise<QueryResult> {
    // Try replicas first (load distribution)
    const shuffledReplicas = this.shuffleArray([...this.replicaPools]);

    for (const pool of shuffledReplicas) {
      try {
        return await pool.query(sql, params);
      } catch (error) {
        console.error(`Replica failed: ${error.message}`);
      }
    }

    // All replicas failed - try primary
    return this.primaryPool.query(sql, params);
  }

  private async executeWrite(sql: string, params?: any[]): Promise<QueryResult> {
    try {
      return await this.currentWritePool.query(sql, params);
    } catch (error) {
      if (this.isPrimaryFailure(error)) {
        // Attempt to promote replica
        await this.promoteReplica();
        return this.currentWritePool.query(sql, params);
      }
      throw error;
    }
  }

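  // Caution: promoting from application code risks split-brain when several app
  // instances race each other. In production this step usually belongs to an
  // orchestrator or a managed failover; it is shown here to illustrate the flow.
  private async promoteReplica(): Promise<void> {
    console.log('Primary failed, promoting replica...');

    for (const replica of this.replicaPools) {
      try {
        // Only a server still in recovery (a standby) can be promoted
        const result = await replica.query('SELECT pg_is_in_recovery() AS in_recovery');

        if (result.rows[0].in_recovery) {
          // pg_promote() (PostgreSQL 12+) returns true once promotion has completed
          const promotion = await replica.query('SELECT pg_promote() AS promoted');
          if (!promotion.rows[0].promoted) continue;

          this.currentWritePool = replica;
          // The new primary no longer serves as a read replica
          this.replicaPools = this.replicaPools.filter(pool => pool !== replica);
          console.log('Replica promoted successfully');
          return;
        }
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        console.error(`Failed to promote replica: ${message}`);
      }
    }

    throw new Error('No replica available for promotion');
  }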

  private isWriteQuery(sql: string): boolean {
    const writePatterns = /^(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP|TRUNCATE)/i;
    return writePatterns.test(sql.trim());
  }

  private isPrimaryFailure(error: Error): boolean {
    return error.message.includes('ECONNREFUSED') ||
           error.message.includes('Connection terminated');
  }

  private shuffleArray<T>(array: T[]): T[] {
    for (let i = array.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
  }
}
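
The `isWriteQuery` check above looks only at the first keyword, so it sends data-modifying CTEs (`WITH ... INSERT`) and row-locking reads (`SELECT ... FOR UPDATE`) to a replica, where they fail. A more conservative heuristic is sketched below; `classifyQuery` is a hypothetical replacement, not the author's implementation. It still cannot detect `SELECT` statements that call functions with side effects, so anything ambiguous is routed to the primary.

```typescript
type QueryRoute = 'read' | 'write';

// Removes leading "-- line" and "/* block */" comments.
function stripLeadingComments(sql: string): string {
  let rest = sql.trim();
  for (;;) {
    if (rest.startsWith('--')) {
      const newline = rest.indexOf('\n');
      rest = newline === -1 ? '' : rest.slice(newline + 1).trim();
    } else if (rest.startsWith('/*')) {
      const end = rest.indexOf('*/');
      rest = end === -1 ? '' : rest.slice(end + 2).trim();
    } else {
      return rest;
    }
  }
}

const WRITE_KEYWORDS = /\b(INSERT|UPDATE|DELETE|MERGE|TRUNCATE|CREATE|ALTER|DROP)\b/i;
const ROW_LOCK = /\bFOR\s+(UPDATE|NO\s+KEY\s+UPDATE|SHARE|KEY\s+SHARE)\b/i;

function classifyQuery(sql: string): QueryRoute {
  const body = stripLeadingComments(sql);
  const firstWord = (body.match(/^[A-Za-z]+/)?.[0] ?? '').toUpperCase();

  if (firstWord === 'SELECT') {
    // Row-locking reads cannot run on a standby.
    return ROW_LOCK.test(body) ? 'write' : 'read';
  }
  if (firstWord === 'WITH') {
    // Data-modifying CTEs: WITH x AS (DELETE ...) SELECT ...
    return WRITE_KEYWORDS.test(body) || ROW_LOCK.test(body) ? 'write' : 'read';
  }
  // Everything else, including unknown statements, goes to the primary.
  return 'write';
}

console.log(classifyQuery('SELECT * FROM products LIMIT 10'));
```

False positives, such as a string literal that happens to contain `for update`, only cost an unnecessary trip to the primary, which is the safe direction to err in.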

Read-Only Mode During Failover

// Application read-only mode
class ReadOnlyModeManager {
  private isReadOnly: boolean = false;
  private readOnlyReason: string = '';

  async checkAndSetMode(): Promise<void> {
    try {
      // Try a simple write query
      await db.query("INSERT INTO health_check (ts) VALUES (NOW()) ON CONFLICT DO NOTHING");
      this.isReadOnly = false;
    } catch (error) {
      console.error('Write failed, entering read-only mode:', error.message);
      this.isReadOnly = true;
      this.readOnlyReason = 'Database write unavailable';
    }
  }

  getMode(): { isReadOnly: boolean; reason: string } {
    return { isReadOnly: this.isReadOnly, reason: this.readOnlyReason };
  }
}

// API middleware
function readOnlyMiddleware(req: Request, res: Response, next: NextFunction) {
  const mode = readOnlyManager.getMode();

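  // Allow safe (non-mutating) methods
  if (['GET', 'HEAD', 'OPTIONS'].includes(req.method)) {
    return next();
  }

  // Block writes in read-only mode
  if (mode.isReadOnly) {
    res.setHeader('Retry-After', '60'); // standard hint for clients and proxies
    return res.status(503).json({
      error: 'Service in read-only mode',
      reason: mode.reason,
      retryAfter: 60,
    });
  }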

  next();
}

// Frontend handling
async function handleFormSubmit(data: FormData): Promise<void> {
  try {
    await api.post('/api/submit', data);
  } catch (error) {
    if (error.status === 503 && error.body.error === 'Service in read-only mode') {
      // Queue for later submission
      await queueForLater(data);
      showNotification('Your changes will be saved when service is restored');
    } else {
      throw error;
    }
  }
}
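
The `queueForLater` call above is not defined in this article. One way it could work is a small persistent queue that stores pending writes and replays them in order once the service is writable again. The sketch below is an assumption about its shape: `PendingWriteQueue` and `KeyValueStore` are hypothetical names, and payloads must be JSON-serializable, so a `FormData` object would need converting to a plain object first.

```typescript
// Minimal subset of the Web Storage API, so the same code works with
// window.localStorage in a browser or an in-memory stand-in elsewhere.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface QueuedWrite<T> {
  id: string; // can double as an idempotency key on the server
  queuedAt: number;
  payload: T;
}

class PendingWriteQueue<T> {
  constructor(private store: KeyValueStore, private key = 'pending-writes') {}

  private read(): QueuedWrite<T>[] {
    try {
      const parsed = JSON.parse(this.store.getItem(this.key) ?? '[]');
      return Array.isArray(parsed) ? parsed : [];
    } catch {
      return []; // Corrupt storage: start over rather than crash
    }
  }

  enqueue(payload: T): QueuedWrite<T> {
    const entry: QueuedWrite<T> = {
      id: `${Date.now()}-${Math.random().toString(36).slice(2)}`,
      queuedAt: Date.now(),
      payload,
    };
    this.store.setItem(this.key, JSON.stringify([...this.read(), entry]));
    return entry;
  }

  size(): number {
    return this.read().length;
  }

  // Replays writes oldest-first and stops at the first failure so order is preserved.
  async flush(send: (entry: QueuedWrite<T>) => Promise<void>): Promise<number> {
    let sent = 0;
    for (const entry of this.read()) {
      try {
        await send(entry);
      } catch {
        break;
      }
      sent += 1;
      const remaining = this.read().filter((e) => e.id !== entry.id);
      this.store.setItem(this.key, JSON.stringify(remaining));
    }
    return sent;
  }
}

// In-memory stand-in for localStorage, used here so the sketch runs anywhere.
const demoBacking = new Map<string, string>();
const demoStore: KeyValueStore = {
  getItem: (k) => demoBacking.get(k) ?? null,
  setItem: (k, v) => { demoBacking.set(k, v); },
};
const demoQueue = new PendingWriteQueue<{ title: string }>(demoStore);
demoQueue.enqueue({ title: 'Draft saved during read-only mode' });
console.log(`Pending writes: ${demoQueue.size()}`);
```

Because a replayed write may reach the server twice if the response is lost, the endpoint should treat the entry `id` as an idempotency key or otherwise tolerate duplicates.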

Third-Party Dependency Resilience

Auth Provider Failover

// Multi-provider auth client
class ResilientAuthClient {
  private providers: AuthProvider[];
  private currentProvider: AuthProvider;

  constructor() {
    this.providers = [
      new Auth0Provider(process.env.AUTH0_CONFIG),
      new CognitoProvider(process.env.COGNITO_CONFIG), // Backup
    ];
    this.currentProvider = this.providers[0];
  }

  async validateToken(token: string): Promise<User | null> {
    for (const provider of this.providers) {
      try {
        const user = await provider.validateToken(token);

        if (user) {
          this.currentProvider = provider;
          return user;
        }
      } catch (error) {
        console.error(`Auth provider ${provider.name} failed:`, error.message);
      }
    }

    // All providers failed - check local cache
    return this.getFromCache(token);
  }

  private async getFromCache(token: string): Promise<User | null> {
    const cached = await redis.get(`auth:token:${hashToken(token)}`);

    if (cached) {
      const user = JSON.parse(cached);

      // Only use cache if token hasn't expired
      if (user.exp > Date.now() / 1000) {
        return user;
      }
    }

    return null;
  }

  async login(credentials: Credentials): Promise<AuthResult> {
    for (const provider of this.providers) {
      try {
        return await provider.login(credentials);
      } catch (error) {
        console.error(`Login via ${provider.name} failed:`, error.message);
      }
    }

    throw new Error('All auth providers unavailable');
  }
}
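
The `getFromCache` fallback above only helps if validated users are written to the cache while the provider is still healthy, and nothing shown here does that. When adding that write, the cache entry should never outlive the token. The helpers below sketch the TTL calculation; `authCacheTtlSeconds`, `isCachedUserUsable`, and the one-hour cap are assumptions for illustration.

```typescript
interface CachedUser {
  sub: string;
  exp: number; // token expiry, in seconds since the Unix epoch
}

// Returns the TTL (in seconds) to use when caching a validated user, or 0 when
// the token has already expired and must not be cached.
function authCacheTtlSeconds(user: CachedUser, nowMs: number, maxTtlSeconds = 3600): number {
  const remainingSeconds = Math.floor(user.exp - nowMs / 1000);
  if (remainingSeconds <= 0) return 0;
  return Math.min(remainingSeconds, maxTtlSeconds);
}

// Mirrors the expiry check used when reading from the cache.
function isCachedUserUsable(user: CachedUser, nowMs: number): boolean {
  return user.exp > nowMs / 1000;
}

// After a successful provider validation, the entry could be stored with this
// TTL, for example via the Redis command: SET auth:token:<hash> <json> EX <ttl>
const exampleUser: CachedUser = { sub: 'user-123', exp: Math.floor(Date.now() / 1000) + 900 };
console.log(authCacheTtlSeconds(exampleUser, Date.now()));
```

The cap is a trade-off: while the provider is down, a revoked token stays valid for up to the cached TTL, so a shorter cap narrows that window at the cost of less outage coverage.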

Feature Flag Fallback

// Feature flag client with fallback
class ResilientFeatureFlags {
  private client: LaunchDarklyClient;
  private localDefaults: Map<string, boolean>;
  private cachedFlags: Map<string, boolean> = new Map();

  constructor() {
    this.client = new LaunchDarklyClient(process.env.LD_SDK_KEY);
    this.localDefaults = new Map([
      ['new-checkout', false],
      ['dark-mode', true],
      ['recommendations', true],
    ]);

    // Cache flags on successful fetch
    this.client.on('ready', () => {
      this.cacheAllFlags();
    });

    this.client.on('update', () => {
      this.cacheAllFlags();
    });
  }

  async isEnabled(flag: string, user?: User): Promise<boolean> {
    try {
      const value = await this.client.variation(flag, user, null);

      if (value !== null) {
        // Update cache
        this.cachedFlags.set(flag, value);
        return value;
      }
    } catch (error) {
      console.error(`Feature flag service unavailable: ${error.message}`);
    }

    // Fallback to cache
    const cached = this.cachedFlags.get(flag);
    if (cached !== undefined) {
      return cached;
    }

    // Fallback to local defaults
    return this.localDefaults.get(flag) ?? false;
  }

  private async cacheAllFlags(): Promise<void> {
    const allFlags = await this.client.allFlagsState();

    for (const [key, value] of Object.entries(allFlags)) {
      this.cachedFlags.set(key, value);
    }

    // Persist to localStorage for offline access
    if (typeof window !== 'undefined') {
      localStorage.setItem('feature-flags-cache', JSON.stringify(
        Object.fromEntries(this.cachedFlags)
      ));
    }
  }
}
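
The class above persists flags to localStorage but never reads them back, so the persisted copy does not help after a page reload during an outage. A defensive loader for that stored value might look like the sketch below. `parseCachedFlags` and `resolveFlag` are hypothetical helpers; they take the raw string as input so they can run outside a browser.

```typescript
// Parses the persisted flag cache, ignoring malformed JSON and non-boolean values.
function parseCachedFlags(raw: string | null): Map<string, boolean> {
  const flags = new Map<string, boolean>();
  if (!raw) return flags;

  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      return flags;
    }
    for (const [key, value] of Object.entries(parsed)) {
      if (typeof value === 'boolean') flags.set(key, value);
    }
  } catch {
    // Malformed JSON: fall through and return whatever was collected (nothing)
  }
  return flags;
}

// Resolution order when the flag service is unreachable: cache, then defaults, then off.
function resolveFlag(
  flag: string,
  cached: Map<string, boolean>,
  defaults: Map<string, boolean>
): boolean {
  return cached.get(flag) ?? defaults.get(flag) ?? false;
}

// In the browser, the constructor could seed its cache with:
//   parseCachedFlags(localStorage.getItem('feature-flags-cache'))
const seeded = parseCachedFlags('{"dark-mode": true, "new-checkout": false}');
console.log(resolveFlag('dark-mode', seeded, new Map()));
```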

DR Testing: Chaos Engineering for Frontend

Automated DR Tests

// DR test suite
describe('Disaster Recovery Tests', () => {
  describe('CDN Failover', () => {
    it('should serve stale content when origin is down', async () => {
      // Setup: Ensure page is cached
      await fetch('https://app.example.com/');

      // Simulate origin failure
      await originSimulator.disable('us-east-1');

      try {
        // Verify stale content is served
        const response = await fetch('https://app.example.com/');
        expect(response.ok).toBe(true);
        expect(response.headers.get('X-Cache')).toContain('STALE');
      } finally {
        // Cleanup: restore the origin even if an assertion fails
        await originSimulator.enable('us-east-1');
      }
    });
  });

  describe('API Failover', () => {
    it('should fail over to secondary API endpoint', async () => {
      // Disable primary API
      await apiSimulator.disable('primary');

      try {
        // Make request
        const response = await apiClient.fetch('/api/products');

        // Verify response came from secondary
        expect(response.headers.get('X-Served-By')).toBe('secondary');
      } finally {
        // Cleanup: restore the primary even if an assertion fails
        await apiSimulator.enable('primary');
      }
    });
  });

  describe('Database Failover', () => {
    it('should continue serving reads during primary failure', async () => {
      // Disable primary database
      await dbSimulator.failPrimary();

      try {
        // Verify reads still work
        const products = await db.query('SELECT * FROM products LIMIT 10');
        expect(products.rows.length).toBeGreaterThan(0);
      } finally {
        // Cleanup: restore the primary even if an assertion fails
        await dbSimulator.restorePrimary();
      }
    });

    it('should enter read-only mode when writes fail', async () => {
      await dbSimulator.failPrimary();

      try {
        // Attempt write (absolute URL: a relative one fails to parse outside a browser)
        const response = await fetch('https://app.example.com/api/orders', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ items: [1, 2, 3] }),
        });

        expect(response.status).toBe(503);
        expect(await response.json()).toMatchObject({
          error: 'Service in read-only mode',
        });
      } finally {
        await dbSimulator.restorePrimary();
      }
    });
  });

  describe('Third-Party Failover', () => {
    it('should use cached auth when provider is down', async () => {
      // Login first (caches token)
      // Assumption: login() resolves with an object containing the issued token
      const { token: cachedToken } = await authClient.login({
        email: 'test@example.com',
        password: 'test',
      });

      // Disable auth provider
      await authSimulator.disable();

      try {
        // Verify token validation still works
        const user = await authClient.validateToken(cachedToken);
        expect(user).not.toBeNull();
      } finally {
        await authSimulator.enable();
      }
    });
  });
});
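
Every test in the suite follows the same shape: take a dependency down, assert, bring it back. One way to centralize that pattern is a small wrapper that always restores the dependency, whatever happens inside the test body. This is a sketch only; `withOutage` and the `OutageSimulator` interface are hypothetical, modeled on the `disable`/`enable` calls the simulators expose above.

```typescript
// Assumed shape shared by originSimulator, apiSimulator, and authSimulator.
interface OutageSimulator {
  disable(target?: string): Promise<void>;
  enable(target?: string): Promise<void>;
}

// Runs `body` while the target is disabled and re-enables it afterwards.
async function withOutage<T>(
  simulator: OutageSimulator,
  target: string | undefined,
  body: () => Promise<T>,
): Promise<T> {
  await simulator.disable(target);
  try {
    return await body();
  } finally {
    // Runs even when body() throws, so a failed assertion
    // cannot leave the simulated outage active for later tests.
    await simulator.enable(target);
  }
}
```

A test body then becomes `await withOutage(originSimulator, 'us-east-1', async () => { /* assertions */ })`, and the cleanup no longer depends on the assertions passing.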

Game Day Runbook

# DR Game Day Runbook

## Pre-Game Day (24 hours before)
1. [ ] Notify stakeholders
2. [ ] Verify monitoring dashboards
3. [ ] Confirm rollback procedures
4. [ ] Brief on-call team

## Game Day Scenarios

### Scenario 1: Primary Region Failure (30 minutes)
1. Force the us-east-1 health check to fail (some DNS providers treat a disabled check as healthy)
2. Monitor DNS failover (expected: 60-90 seconds)
3. Verify traffic routing to us-west-2
4. Test critical user flows:
   - [ ] Homepage loads
   - [ ] User can log in
   - [ ] Product pages load
   - [ ] Cart functionality (read-only acceptable)
   - [ ] Checkout disabled notification shown
5. Re-enable us-east-1
6. Verify traffic returns to primary

### Scenario 2: CDN Outage (20 minutes)
1. Simulate Cloudflare outage (block at firewall)
2. Verify CloudFront failover
3. Test asset loading:
   - [ ] JS bundles load
   - [ ] CSS loads
   - [ ] Images load
4. Restore Cloudflare
5. Verify traffic distribution

### Scenario 3: Database Primary Failure (25 minutes)
1. Stop PostgreSQL primary
2. Monitor automatic failover of read traffic to the replica
3. Verify read-only mode activated
4. Test degraded functionality:
   - [ ] Product browsing works
   - [ ] Search works
   - [ ] Write operations show appropriate error
5. Promote replica to primary
6. Verify write operations resume

## Post-Game Day
1. Document any issues
2. Update runbooks
3. Schedule fixes for identified gaps
4. Share learnings with team
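
The runbook gives an expected failover window (60-90 seconds for DNS in Scenario 1), which is easier to verify with a timer than by watching a dashboard. A minimal sketch of such a measurement follows, assuming you can write a probe that reports which region served a request (for example, from a response header your edge layer sets). `measureFailover` and `RegionProbe` are hypothetical names, and the default timeout and interval are illustrative.

```typescript
// Hypothetical probe: resolves with the region that served one request.
type RegionProbe = () => Promise<string>;

interface FailoverResult {
  switched: boolean;
  elapsedMs: number;
}

// Polls the probe until it reports `expectedRegion` or the timeout expires.
async function measureFailover(
  probe: RegionProbe,
  expectedRegion: string,
  timeoutMs: number = 120_000,
  intervalMs: number = 5_000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<FailoverResult> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    try {
      if ((await probe()) === expectedRegion) {
        return { switched: true, elapsedMs: Date.now() - start };
      }
    } catch {
      // Request failures are expected mid-failover; keep polling.
    }
    await sleep(intervalMs);
  }
  return { switched: false, elapsedMs: Date.now() - start };
}
```

Starting the timer at the moment the health check is failed, and recording `elapsedMs` in the post-game notes, gives a number to compare against the 5-minute RTO from one game day to the next.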

Monitoring and Alerting for DR

// DR-specific monitoring
const drMetrics = {
  // Origin health
  'origin.health.us_east': () => checkOriginHealth('us-east-1'),
  'origin.health.us_west': () => checkOriginHealth('us-west-2'),
  'origin.health.eu_west': () => checkOriginHealth('eu-west-1'),

  // Failover status
  'failover.active': () => isFailoverActive(),
  'failover.duration': () => getFailoverDuration(),

  // Degraded mode
  'degraded.read_only': () => isReadOnlyMode(),
  'degraded.features_disabled': () => countDisabledFeatures(),

  // Cache health
  'cache.stale_serving': () => isServingStale(),
  'cache.hit_rate': () => getCacheHitRate(),
};

// Alerts
const drAlerts = [
  {
    name: 'Primary Region Unhealthy',
    condition: 'origin.health.us_east == 0 for 1 minute',
    severity: 'critical',
    action: 'page-oncall',
  },
  {
    name: 'All Regions Unhealthy',
    condition: 'origin.health.us_east == 0 AND origin.health.us_west == 0 AND origin.health.eu_west == 0',
    severity: 'critical',
    action: 'page-oncall, notify-leadership',
  },
  {
    name: 'Extended Failover',
    condition: 'failover.active == 1 AND failover.duration > 30 minutes',
    severity: 'high',
    action: 'page-oncall',
  },
  {
    name: 'Read-Only Mode Active',
    condition: 'degraded.read_only == 1 for 5 minutes',
    severity: 'high',
    action: 'notify-team',
  },
  {
    name: 'Serving Stale Content',
    condition: 'cache.stale_serving == 1 for 10 minutes',
    severity: 'medium',
    action: 'notify-team',
  },
];
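
Several of the alert conditions above carry a "for N minutes" clause: the condition must hold continuously for that long before the alert fires, and a single healthy sample resets the clock. Most monitoring systems implement this natively, so the sketch below only illustrates the semantics. `SustainedCondition` is a hypothetical name, and timestamps are passed in explicitly to keep the logic testable.

```typescript
// Tracks how long a boolean condition has been continuously true.
class SustainedCondition {
  private readonly holdMs: number;
  private trueSince: number | null = null;

  constructor(holdMs: number) {
    this.holdMs = holdMs;
  }

  // Feed one sample; returns true once the condition has held for holdMs.
  observe(conditionMet: boolean, nowMs: number): boolean {
    if (!conditionMet) {
      this.trueSince = null; // any healthy sample resets the window
      return false;
    }
    if (this.trueSince === null) {
      this.trueSince = nowMs;
    }
    return nowMs - this.trueSince >= this.holdMs;
  }
}

// Example: 'origin.health.us_east == 0 for 1 minute'
const primaryUnhealthy = new SustainedCondition(60_000);
```

The reset-on-recovery behavior is the design choice that matters: it keeps a flapping health check from paging on-call, at the cost of delaying the alert by up to the hold period.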

Summary: DR Architecture Principles

Multi-Layer Redundancy - DNS, CDN, origin, and database each have independent failover mechanisms. No single point of failure.
Stale-If-Error Is Your Friend - Configure CDN to serve stale content when origins fail. Users get content, not errors.
Client-Side Resilience - Service workers, retry logic, and circuit breakers ensure the frontend handles backend failures gracefully.
Graceful Degradation - Design features to work in degraded modes. Read-only is better than offline.
Shared State Replication - Sessions, user data, and critical state must replicate across regions with minimal lag.
Third-Party Fallbacks - Auth, payments, and feature flags need backup strategies. Don't let external dependencies become SPOFs.
Test Your DR - Regular game days and chaos engineering ensure DR plans actually work when needed.
RTO/RPO Reality - Define realistic recovery objectives and architect to meet them. 5-minute RTO requires different infrastructure than 1-hour RTO.

The architecture outlined here—multi-region deployment, CDN resilience, client-side caching, database replication, and graceful degradation—represents production-grade disaster recovery for frontend systems at scale.

When AWS goes down, your users should barely notice.
