Frontend Disaster Recovery Architecture: Building Resilient Systems at Scale
Introduction: Why Frontend DR Is Different
When AWS us-east-1 experienced a major outage in December 2021, companies with proper disaster recovery plans failed over in minutes. Companies without them were offline for hours. The difference wasn't backend infrastructure—it was whether their frontend could serve users from an alternate location while maintaining functionality.
Frontend disaster recovery is fundamentally different from backend DR because:
- Static Assets Create False Confidence - "Our app is on a CDN, it's already distributed." Until the CDN's origin becomes unreachable and cached content expires.
- Client-Side Code Can't Failover - Your React app doesn't know to use a different API endpoint when the primary fails. That logic must be built in.
- Browser State Is Untransferable - localStorage, IndexedDB, and Service Worker caches are device-specific. Users can't seamlessly continue on a different device.
- Hydration Depends on API Availability - SSR/SSG pages look fine until JavaScript tries to hydrate and discovers APIs are unreachable.
- Third-Party Dependencies Are Hidden SPOFs - Your auth provider, analytics, feature flags, and payment processor are all single points of failure.
This article covers how to architect frontend systems that survive regional outages, CDN failures, and origin unavailability—while maintaining user experience during degraded states.
Scale Context: Production Reality
System Profile:
- DAU: 35M daily active users
- Peak RPS: 320K requests/second
- Geographic Distribution: 140+ countries
- Primary Region: US-East (us-east-1)
- Secondary Region: US-West (us-west-2)
- Tertiary Region: EU-West (eu-west-1)
Availability Targets:
- SLA: 99.95% (about 4.4 hours of downtime per year)
- RTO (Recovery Time Objective): 5 minutes
- RPO (Recovery Point Objective): 0 for stateless, 1 minute for stateful
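The relationship between an availability percentage and its downtime budget is simple arithmetic, and it is worth checking before committing to a number. The sketch below is illustrative only; `downtimeBudgetMinutes` is a hypothetical helper, not part of any system described here, and it assumes a 365-day year.

```typescript
// Convert an availability percentage into an allowed-downtime budget.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600 (leap years ignored)

function downtimeBudgetMinutes(
  slaPercent: number,
  periodMinutes: number = MINUTES_PER_YEAR,
): number {
  if (slaPercent < 0 || slaPercent > 100) {
    throw new RangeError('slaPercent must be between 0 and 100');
  }
  // The budget is the fraction of the period that may be spent down.
  return periodMinutes * (1 - slaPercent / 100);
}

const yearlyBudget = downtimeBudgetMinutes(99.95); // ≈ 262.8 minutes
const monthlyBudget = downtimeBudgetMinutes(99.95, 30 * 24 * 60); // ≈ 21.6 minutes
const tighterBudget = downtimeBudgetMinutes(99.995); // ≈ 26.3 minutes

console.log({ yearlyBudget, monthlyBudget, tighterBudget });
```

A yearly budget in the range of 26 minutes corresponds to roughly 99.995% availability, which is a noticeably stricter target than 99.95%.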
Infrastructure:
- CDN: Multi-provider (Cloudflare + CloudFront)
- Origin: Kubernetes across 3 regions
- Database: PostgreSQL with cross-region replication
- Cache: Redis with active-passive replication
- Static Assets: S3 with cross-region replication
High-Level DR Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER REQUEST │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 1: Global DNS (Route 53 / Cloudflare DNS) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Health-Based Routing │ │
│ │ - Primary: us-east-1 (weight: 100, health: ✓) │ │
│ │ - Secondary: us-west-2 (weight: 0, failover if primary unhealthy) │ │
│ │ - Tertiary: eu-west-1 (latency-based for EU users) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 2: CDN Edge (Multi-Provider) │
│ ┌─────────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Cloudflare (Primary) │ │ CloudFront (Failover) │ │
│ │ - 300+ PoPs │ │ - 450+ PoPs │ │
│ │ - Edge Workers │ │ - Lambda@Edge │ │
│ │ - KV Storage │ │ - S3 Origin │ │
│ └─────────────────────────────┘ └─────────────────────────────┘ │
│ │
│ Cache Strategy: │
│ - HTML: 60s TTL + stale-while-revalidate: 3600s + stale-if-error: 86400s │
│ - JS/CSS: Immutable (content-addressed) │
│ - API responses: Vary by region, 30s TTL │
└─────────────────────────────────────┬───────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ US-EAST-1 (PRIMARY) │ │ US-WEST-2 (SECONDARY) │ │ EU-WEST-1 (TERTIARY) │
│ │ │ │ │ │
│ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │ ┌───────────────────┐ │
│ │ Load Balancer │ │ │ │ Load Balancer │ │ │ │ Load Balancer │ │
│ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │
│ │ Frontend Cluster │ │ │ │ Frontend Cluster │ │ │ │ Frontend Cluster │ │
│ │ (K8s: 20 pods) │ │ │ │ (K8s: 10 pods) │ │ │ │ (K8s: 10 pods) │ │
│ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │
│ │ BFF Services │ │ │ │ BFF Services │ │ │ │ BFF Services │ │
│ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │ │ └─────────┬─────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │ │ ┌─────────┴─────────┐ │
│ │ PostgreSQL (RW) │ │ │ │ PostgreSQL (RO) │ │ │ │ PostgreSQL (RO) │ │
│ │ Redis (Primary) │ │ │ │ Redis (Replica) │ │ │ │ Redis (Replica) │ │
│ └───────────────────┘ │ │ └───────────────────┘ │ │ └───────────────────┘ │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
│ │ │
└───────────────────────┴───────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SHARED SERVICES (Cross-Region) │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Auth0 (Auth) │ │ Stripe (Pay) │ │ LaunchDarkly │ │
│ │ Multi-region │ │ Multi-region │ │ (Flags) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
RTO and RPO: Defining Recovery Objectives
Recovery Time Objective (RTO)
Maximum acceptable time from failure to service restoration.
| Component | RTO Target | Mechanism |
|---|---|---|
| Static Assets (CDN) | 0 | Already distributed, stale-if-error |
| HTML Pages | 30s | CDN failover + stale content |
| API Endpoints | 5 min | DNS failover + health checks |
| Database Writes | 15 min | Promote read replica |
| Full Functionality | 30 min | All services recovered |
Recovery Point Objective (RPO)
Maximum acceptable data loss (time between last backup and failure).
| Data Type | RPO Target | Mechanism |
|---|---|---|
| User Sessions | 0 | Redis replication (asynchronous by default, so pair critical writes with `WAIT` or accept a small loss window) |
| User Data | 1 min | PostgreSQL streaming replication |
| Analytics Events | 5 min | Kafka mirroring |
| Audit Logs | 0 | Multi-region writes |
DNS-Level Failover Architecture
Health Check Configuration
// AWS Route 53 Health Check
const healthCheckConfig = {
Type: 'HTTPS',
ResourcePath: '/api/health',
FullyQualifiedDomainName: 'api.example.com',
Port: 443,
RequestInterval: 10, // Check every 10 seconds
FailureThreshold: 3, // 3 failures = unhealthy
MeasureLatency: true,
Regions: [
'us-east-1',
'us-west-2',
'eu-west-1',
'ap-southeast-1',
],
};
// Health check endpoint
app.get('/api/health', async (req, res) => {
const checks = await Promise.allSettled([
checkDatabase(),
checkRedis(),
checkExternalAPIs(),
]);
const allHealthy = checks.every(c => c.status === 'fulfilled');
if (allHealthy) {
res.status(200).json({ status: 'healthy', timestamp: Date.now() });
} else {
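const failures = checks
.filter((c): c is PromiseRejectedResult => c.status === 'rejected')
// Error objects serialize to {} in JSON, so send the message instead
.map(c => (c.reason instanceof Error ? c.reason.message : String(c.reason)));
res.status(503).json({ status: 'unhealthy', failures });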
}
});
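A health endpoint is only useful if it answers quickly, because health checkers typically enforce a short response deadline, and a hung dependency check can make a healthy region look dead. The sketch below bounds each check with a timeout and reports a per-service status map. `withTimeout`, `collectHealth`, and the 1500 ms default are assumptions for illustration, not the article's implementation.

```typescript
type ServiceStatus = 'healthy' | 'unhealthy';

interface HealthReport {
  status: ServiceStatus;
  services: Record<string, ServiceStatus>;
}

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run every named check in parallel and summarize the results.
async function collectHealth(
  checks: Record<string, () => Promise<unknown>>,
  timeoutMs: number = 1500,
): Promise<HealthReport> {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs, name),
    ),
  );
  const services: Record<string, ServiceStatus> = {};
  results.forEach((result, i) => {
    services[names[i]] = result.status === 'fulfilled' ? 'healthy' : 'unhealthy';
  });
  const allHealthy = Object.values(services).every((s) => s === 'healthy');
  return { status: allHealthy ? 'healthy' : 'unhealthy', services };
}
```

A handler could call `collectHealth({ database: checkDatabase, redis: checkRedis })` and return 200 or 503 based on `status`, while a client reads `services` to decide which features to disable.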
Failover DNS Configuration
// Route 53 failover routing policy
const dnsConfig = {
HostedZoneId: 'Z1234567890',
ChangeBatch: {
Changes: [
// Primary record
{
Action: 'UPSERT',
ResourceRecordSet: {
Name: 'app.example.com',
Type: 'A',
SetIdentifier: 'primary-us-east',
Failover: 'PRIMARY',
HealthCheckId: 'health-check-us-east',
AliasTarget: {
HostedZoneId: 'Z35SXDOTRQ7X7K',
DNSName: 'alb-us-east.example.com',
EvaluateTargetHealth: true,
},
},
},
// Secondary record
{
Action: 'UPSERT',
ResourceRecordSet: {
Name: 'app.example.com',
Type: 'A',
SetIdentifier: 'secondary-us-west',
Failover: 'SECONDARY',
HealthCheckId: 'health-check-us-west',
AliasTarget: {
HostedZoneId: 'Z1H1FL5HABSF5',
DNSName: 'alb-us-west.example.com',
EvaluateTargetHealth: true,
},
},
},
],
},
};
Failover Timeline
T+0s Primary region fails
T+10s First health check fails
T+20s Second health check fails
T+30s Third health check fails (threshold reached)
T+30s Route 53 marks primary unhealthy
T+30s DNS starts returning secondary IP
T+60s DNS TTL expires (most clients)
T+90s 90% of traffic on secondary
Total failover time: ~90 seconds (dominated by DNS TTL)
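The timeline above is the sum of two terms: detection time (check interval multiplied by the failure threshold) and the time cached DNS answers remain valid. A small sketch makes the tradeoff explicit. `estimateDnsFailoverSeconds` is a hypothetical helper, and the 60-second TTL is an assumption chosen to match the ~90-second total; resolvers that ignore TTLs will take longer.

```typescript
interface FailoverTimingInput {
  requestIntervalSeconds: number; // how often the health checker probes
  failureThreshold: number;       // consecutive failures before "unhealthy"
  dnsTtlSeconds: number;          // how long resolvers may cache the old answer
}

interface FailoverEstimate {
  detectionSeconds: number; // time until the record is marked unhealthy
  worstCaseSeconds: number; // detection plus one full TTL for the last cached answer
}

function estimateDnsFailoverSeconds(input: FailoverTimingInput): FailoverEstimate {
  const detectionSeconds = input.requestIntervalSeconds * input.failureThreshold;
  return {
    detectionSeconds,
    worstCaseSeconds: detectionSeconds + input.dnsTtlSeconds,
  };
}

// The configuration from this section: 10s interval, threshold of 3.
const withTtl60 = estimateDnsFailoverSeconds({
  requestIntervalSeconds: 10,
  failureThreshold: 3,
  dnsTtlSeconds: 60,
}); // detection 30s, worst case 90s

const withTtl30 = estimateDnsFailoverSeconds({
  requestIntervalSeconds: 10,
  failureThreshold: 3,
  dnsTtlSeconds: 30,
}); // detection 30s, worst case 60s

console.log(withTtl60, withTtl30);
```

Halving the TTL removes 30 seconds from the worst case, while detection time stays fixed unless the interval or threshold changes.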
Reducing DNS Failover Time:
// Use low TTL for failover records
const lowTTLConfig = {
TTL: 30, // 30 seconds (vs default 300s)
// Tradeoff: More DNS queries, faster failover
};
// Or use AWS Global Accelerator to take client-side DNS caching out of the
// failover path (illustrative shape, not the exact API field names)
const globalAcceleratorConfig = {
// Static anycast IPs, no DNS propagation delay
StaticIpAddresses: ['1.2.3.4', '5.6.7.8'],
// Health checks at network level
EndpointGroups: [
{ Region: 'us-east-1', Weight: 100, HealthCheckPort: 443 },
{ Region: 'us-west-2', Weight: 0, HealthCheckPort: 443 },
],
};
CDN Resilience Strategies
Multi-CDN Architecture
// Edge worker: Multi-origin failover
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const origins = [
{ name: 'primary', url: env.PRIMARY_ORIGIN, timeout: 5000 },
{ name: 'secondary', url: env.SECONDARY_ORIGIN, timeout: 5000 },
{ name: 'tertiary', url: env.TERTIARY_ORIGIN, timeout: 10000 },
];
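const incoming = new URL(request.url);
for (const origin of origins) {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), origin.timeout);
try {
// Rebuild the URL against this origin. Passing only the pathname would
// send every attempt to the same host and drop the query string.
const target = new URL(incoming.pathname + incoming.search, origin.url);
// Clone the request so a body can be replayed against the next origin.
const response = await fetch(
new Request(target.toString(), request.clone()),
{ signal: controller.signal }
);
// Anything below 500 is a real answer (including 404), so do not fail over.
if (response.status < 500) {
// Add origin header for debugging
const newResponse = new Response(response.body, response);
newResponse.headers.set('X-Served-By', origin.name);
return newResponse;
}
console.error(`Origin ${origin.name} returned ${response.status}`);
} catch (error) {
const message = error instanceof Error ? error.message : String(error);
console.error(`Origin ${origin.name} failed: ${message}`);
} finally {
clearTimeout(timeoutId);
}
}
// All origins failed - fall back to the Workers cache. This only helps if
// responses were stored with cache.put() earlier, and the Cache API does not
// return entries that have already expired.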
const cached = await caches.default.match(request);
if (cached) {
const staleResponse = new Response(cached.body, cached);
staleResponse.headers.set('X-Served-From', 'stale-cache');
staleResponse.headers.set('X-Cache-Age', cached.headers.get('Age') || 'unknown');
return staleResponse;
}
// No cached content - return error page
return new Response(ERROR_PAGE_HTML, {
status: 503,
headers: { 'Content-Type': 'text/html' },
});
},
};
const ERROR_PAGE_HTML = `
<!DOCTYPE html>
<html>
<head>
<title>Service Temporarily Unavailable</title>
<style>
body { font-family: system-ui; text-align: center; padding: 50px; }
h1 { color: #333; }
p { color: #666; }
</style>
</head>
<body>
<h1>We'll be back shortly</h1>
<p>Our team is working to restore service. Please try again in a few minutes.</p>
<script>
// Auto-refresh after 30 seconds
setTimeout(() => location.reload(), 30000);
</script>
</body>
</html>
`;
Stale-While-Revalidate + Stale-If-Error
// Origin server cache headers
function setCacheHeaders(res: Response, contentType: string): void {
if (contentType === 'text/html') {
// HTML: Short TTL, but serve stale on errors
res.setHeader('Cache-Control', [
'public',
'max-age=60', // Fresh for 60 seconds
'stale-while-revalidate=3600', // Serve stale for 1 hour while revalidating
'stale-if-error=86400', // Serve stale for 24 hours if origin errors
].join(', '));
} else if (contentType.includes('javascript') || contentType.includes('css')) {
// Static assets: Immutable (content-addressed)
res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
} else if (contentType.includes('json')) {
// API responses: Short TTL
res.setHeader('Cache-Control', 'public, max-age=30, stale-if-error=300');
}
}
Failover Timeline with Stale-If-Error:
T+0 Origin fails
T+0 CDN keeps serving its cached copy (still fresh: age is under max-age)
T+0 Users see content immediately (no interruption)
T+60s Latest point at which the cached copy exceeds max-age (60s) and becomes stale
T+60s CDN attempts revalidation (fails)
T+60s CDN continues serving stale (stale-if-error active)
T+86460s stale-if-error window ends (86400s counted from when the copy went stale)
T+86460s If origin still down, 503 errors begin
Effective RTO for cached content: 0 seconds (applies only to URLs already cached at the serving edge)
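The way a cache moves through these windows can be modeled in a few lines. The sketch below parses the three directives used in this section and decides how a cached copy may be served at a given age. It is a simplified model under stated assumptions (both stale windows are measured from the moment the copy becomes stale, and other directives are ignored); `parseCacheControl` and `decideServing` are hypothetical names, and real CDNs differ in which directives they honor.

```typescript
interface CachePolicy {
  maxAge: number;
  staleWhileRevalidate: number;
  staleIfError: number;
}

type ServingDecision =
  | 'fresh'                  // serve from cache, no origin contact needed
  | 'stale-while-revalidate' // serve stale, refresh in the background
  | 'stale-if-error'         // origin is failing, serve stale anyway
  | 'fetch-origin'           // too stale: must go to the origin
  | 'error';                 // too stale and the origin is down

function parseCacheControl(header: string): CachePolicy {
  const policy: CachePolicy = { maxAge: 0, staleWhileRevalidate: 0, staleIfError: 0 };
  for (const part of header.split(',')) {
    const [rawName, rawValue] = part.trim().split('=');
    const seconds = Number.parseInt(rawValue ?? '', 10);
    if (Number.isNaN(seconds)) continue; // valueless directives such as "public"
    switch (rawName.toLowerCase()) {
      case 'max-age': policy.maxAge = seconds; break;
      case 'stale-while-revalidate': policy.staleWhileRevalidate = seconds; break;
      case 'stale-if-error': policy.staleIfError = seconds; break;
    }
  }
  return policy;
}

function decideServing(
  policy: CachePolicy,
  ageSeconds: number,
  originHealthy: boolean,
): ServingDecision {
  if (ageSeconds <= policy.maxAge) return 'fresh';
  const staleFor = ageSeconds - policy.maxAge;
  if (originHealthy) {
    return staleFor <= policy.staleWhileRevalidate ? 'stale-while-revalidate' : 'fetch-origin';
  }
  return staleFor <= policy.staleIfError ? 'stale-if-error' : 'error';
}

const htmlPolicy = parseCacheControl(
  'public, max-age=60, stale-while-revalidate=3600, stale-if-error=86400',
);
console.log(decideServing(htmlPolicy, 30, false));    // "fresh"
console.log(decideServing(htmlPolicy, 7200, false));  // "stale-if-error"
console.log(decideServing(htmlPolicy, 90000, false)); // "error"
```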
Client-Side Resilience Patterns
Service Worker Offline Cache
// sw.js - Service Worker
const CACHE_VERSION = 'v1';
const STATIC_CACHE = `static-${CACHE_VERSION}`;
const DYNAMIC_CACHE = `dynamic-${CACHE_VERSION}`;
const STATIC_ASSETS = [
'/',
'/offline.html',
'/manifest.json',
'/_next/static/css/main.css',
'/_next/static/js/main.js',
];
// Install: Cache static assets
self.addEventListener('install', (event) => {
event.waitUntil(
caches.open(STATIC_CACHE).then((cache) => {
return cache.addAll(STATIC_ASSETS);
})
);
});
// Fetch: Network-first with cache fallback
self.addEventListener('fetch', (event) => {
const request = event.request;
// Skip non-GET requests
if (request.method !== 'GET') return;
// API requests: network-first, falling back to a cached JSON response
if (request.url.includes('/api/')) {
event.respondWith(handleAPIRequest(request));
return;
}
// Static assets: Cache-first
if (isStaticAsset(request.url)) {
event.respondWith(cacheFirst(request));
return;
}
// HTML: Network-first with offline fallback
event.respondWith(networkFirstWithOfflineFallback(request));
});
async function networkFirstWithOfflineFallback(request: Request): Promise<Response> {
try {
const networkResponse = await fetch(request);
// Cache successful responses
if (networkResponse.ok) {
const cache = await caches.open(DYNAMIC_CACHE);
cache.put(request, networkResponse.clone());
}
return networkResponse;
} catch (error) {
// Network failed - try cache
const cached = await caches.match(request);
if (cached) {
return cached;
}
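// No cache - return offline page (precached at install time)
const offline = await caches.match('/offline.html');
// caches.match() can resolve to undefined, so always return a real Response
return offline ?? new Response('Offline', {
status: 503,
headers: { 'Content-Type': 'text/plain' },
});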
}
}
async function handleAPIRequest(request: Request): Promise<Response> {
try {
const response = await fetch(request);
// Cache GET API responses
if (response.ok && request.method === 'GET') {
const cache = await caches.open(DYNAMIC_CACHE);
cache.put(request, response.clone());
}
return response;
} catch (error) {
// Network failed - return cached response or error
const cached = await caches.match(request);
if (cached) {
// Add header to indicate stale data
const staleResponse = new Response(cached.body, {
status: cached.status,
headers: {
...Object.fromEntries(cached.headers),
'X-Stale-Data': 'true',
},
});
return staleResponse;
}
return new Response(
JSON.stringify({ error: 'Offline', cached: false }),
{ status: 503, headers: { 'Content-Type': 'application/json' } }
);
}
}
async function cacheFirst(request: Request): Promise<Response> {
const cached = await caches.match(request);
if (cached) {
return cached;
}
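const networkResponse = await fetch(request);
// Only cache successful responses so an error page is not pinned in the cache
if (networkResponse.ok) {
const cache = await caches.open(STATIC_CACHE);
cache.put(request, networkResponse.clone());
}
return networkResponse;
}
function isStaticAsset(url: string): boolean {
// Test the pathname only, so a query string (e.g. ?v=2) does not defeat the
// match, and use test() so the function really returns a boolean
return /\.(js|css|woff2|png|jpg|svg)$/.test(new URL(url).pathname);
}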
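The service worker above versions its cache names but never removes caches from earlier versions, so storage grows with each release. One common approach is to delete unknown caches during the `activate` event. The sketch below isolates that logic in a function that accepts any object shaped like `CacheStorage`; `deleteOutdatedCaches` and `CacheStorageLike` are hypothetical names introduced for illustration.

```typescript
// Minimal subset of the browser CacheStorage interface used here.
interface CacheStorageLike {
  keys(): Promise<string[]>;
  delete(name: string): Promise<boolean>;
}

// Delete every cache whose name is not in `keep`; returns the deleted names.
async function deleteOutdatedCaches(
  storage: CacheStorageLike,
  keep: string[],
): Promise<string[]> {
  const names = await storage.keys();
  const outdated = names.filter((name) => !keep.includes(name));
  await Promise.all(outdated.map((name) => storage.delete(name)));
  return outdated;
}

// In sw.js this could be wired up roughly as follows:
//
// self.addEventListener('activate', (event) => {
//   event.waitUntil(deleteOutdatedCaches(caches, [STATIC_CACHE, DYNAMIC_CACHE]));
// });
```

Keeping the logic separate from the event listener makes it possible to exercise it with an in-memory fake instead of a real browser.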
API Client with Retry and Failover
interface APIClientConfig {
primaryEndpoint: string;
failoverEndpoints: string[];
timeout: number;
maxRetries: number;
}
class ResilientAPIClient {
private config: APIClientConfig;
private currentEndpoint: string;
private endpointFailures: Map<string, number> = new Map();
constructor(config: APIClientConfig) {
this.config = config;
this.currentEndpoint = config.primaryEndpoint;
}
async fetch<T>(path: string, options: RequestInit = {}): Promise<T> {
const endpoints = [this.config.primaryEndpoint, ...this.config.failoverEndpoints];
for (const endpoint of endpoints) {
// Skip endpoints with recent failures
if (this.isEndpointCircuitOpen(endpoint)) {
continue;
}
try {
const response = await this.fetchWithTimeout(
`${endpoint}${path}`,
options,
this.config.timeout
);
if (response.ok) {
// Reset failure count on success
this.endpointFailures.set(endpoint, 0);
return response.json();
}
// Server error - try next endpoint
if (response.status >= 500) {
this.recordFailure(endpoint);
continue;
}
// Client error - don't retry
throw new APIError(response.status, await response.text());
} catch (error) {
if (error instanceof APIError) throw error;
this.recordFailure(endpoint);
console.error(`Endpoint ${endpoint} failed: ${error.message}`);
}
}
// All endpoints failed
throw new Error('All API endpoints unavailable');
}
private async fetchWithTimeout(
url: string,
options: RequestInit,
timeout: number
): Promise<Response> {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeout);
try {
const response = await fetch(url, {
...options,
signal: controller.signal,
});
return response;
} finally {
clearTimeout(timeoutId);
}
}
private isEndpointCircuitOpen(endpoint: string): boolean {
const failures = this.endpointFailures.get(endpoint) || 0;
// Open circuit after 3 failures
return failures >= 3;
}
private recordFailure(endpoint: string): void {
const current = this.endpointFailures.get(endpoint) || 0;
this.endpointFailures.set(endpoint, current + 1);
// Clear the count after 30 seconds so the endpoint is tried again.
// This is a timed reset rather than a true half-open probe.
setTimeout(() => {
this.endpointFailures.set(endpoint, 0);
}, 30000);
}
}
class APIError extends Error {
constructor(public status: number, public body: string) {
super(`API Error: ${status}`);
}
}
// Usage
const apiClient = new ResilientAPIClient({
primaryEndpoint: 'https://api.example.com',
failoverEndpoints: [
'https://api-us-west.example.com',
'https://api-eu.example.com',
],
timeout: 5000,
maxRetries: 3, // accepted by the config, but fetch() above does not read it yet
});
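The client above resets failure counts on a timer. A fuller circuit breaker tracks three states: closed (requests flow), open (requests are skipped), and half-open (one probe is allowed after a cooldown). The sketch below shows one way to model that per endpoint. `CircuitBreaker` and its parameters are assumptions for illustration, and the injectable clock exists only to make the behavior testable.

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private probeInFlight = false;

  constructor(
    private readonly failureThreshold: number = 3,
    private readonly cooldownMs: number = 30_000,
    private readonly now: () => number = Date.now,
  ) {}

  state(): CircuitState {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  // Returns true if the caller may send a request right now.
  canRequest(): boolean {
    const current = this.state();
    if (current === 'closed') return true;
    if (current === 'open') return false;
    // Half-open: let exactly one probe through until it reports back.
    if (this.probeInFlight) return false;
    this.probeInFlight = true;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
    this.probeInFlight = false;
  }

  recordFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      // Opens the circuit, or restarts the cooldown after a failed probe.
      this.openedAt = this.now();
    }
  }
}
```

A client would keep one breaker per endpoint, call `canRequest()` before each attempt, and report the outcome with `recordSuccess()` or `recordFailure()`.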
Graceful Degradation UI
// React hook for degraded mode detection
function useDegradedMode(): DegradedModeState {
const [state, setState] = useState<DegradedModeState>({
isOnline: navigator.onLine,
apiAvailable: true,
usingStaleData: false,
features: {
checkout: true,
search: true,
recommendations: true,
},
});
useEffect(() => {
// Monitor network status
const handleOnline = () => setState(s => ({ ...s, isOnline: true }));
const handleOffline = () => setState(s => ({ ...s, isOnline: false }));
window.addEventListener('online', handleOnline);
window.addEventListener('offline', handleOffline);
// Check API availability periodically
const checkAPI = async () => {
try {
// fetch() has no `timeout` option; abort through a signal instead
const response = await fetch('/api/health', { signal: AbortSignal.timeout(3000) });
const health = await response.json();
setState(s => ({
...s,
apiAvailable: true,
features: {
checkout: health.services.checkout === 'healthy',
search: health.services.search === 'healthy',
recommendations: health.services.recommendations === 'healthy',
},
}));
} catch (error) {
setState(s => ({
...s,
apiAvailable: false,
features: {
checkout: false,
search: false,
recommendations: false,
},
}));
}
};
const interval = setInterval(checkAPI, 30000);
checkAPI(); // Initial check
return () => {
window.removeEventListener('online', handleOnline);
window.removeEventListener('offline', handleOffline);
clearInterval(interval);
};
}, []);
return state;
}
// Degraded mode component wrapper
function DegradedModeProvider({ children }: { children: React.ReactNode }) {
const degradedMode = useDegradedMode();
return (
<DegradedModeContext.Provider value={degradedMode}>
{!degradedMode.isOnline && <OfflineBanner />}
{!degradedMode.apiAvailable && degradedMode.isOnline && <APIUnavailableBanner />}
{degradedMode.usingStaleData && <StaleDataBanner />}
{children}
</DegradedModeContext.Provider>
);
}
// Feature-specific degradation
function CheckoutButton() {
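// Read from the provider. Calling useDegradedMode() here would start a
// separate polling loop for every component that uses it.
const { features } = useContext(DegradedModeContext);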
if (!features.checkout) {
return (
<button disabled className="checkout-button disabled">
Checkout Temporarily Unavailable
</button>
);
}
return (
<button className="checkout-button" onClick={handleCheckout}>
Proceed to Checkout
</button>
);
}
function SearchBar() {
// Read from the provider rather than starting another polling loop
const { features } = useContext(DegradedModeContext);
if (!features.search) {
return (
<div className="search-degraded">
<input disabled placeholder="Search unavailable - browse categories instead" />
<CategoryBrowser />
</div>
);
}
return <FullSearchBar />;
}
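The hook above reads `health.services.checkout` directly, which throws if the health payload has no `services` field. A defensive mapping from payload to feature availability keeps the UI predictable when the payload is partial or malformed. The sketch below assumes the payload shape `{ services: { name: 'healthy' | ... } }`; `deriveFeatures` and `FeatureAvailability` are hypothetical names.

```typescript
interface FeatureAvailability {
  checkout: boolean;
  search: boolean;
  recommendations: boolean;
}

// Map an untrusted health payload to feature availability.
// Anything missing or not explicitly "healthy" is treated as unavailable.
function deriveFeatures(payload: unknown): FeatureAvailability {
  let services: Record<string, unknown> = {};
  if (typeof payload === 'object' && payload !== null) {
    const candidate = (payload as { services?: unknown }).services;
    if (typeof candidate === 'object' && candidate !== null) {
      services = candidate as Record<string, unknown>;
    }
  }
  const isHealthy = (name: string): boolean => services[name] === 'healthy';
  return {
    checkout: isHealthy('checkout'),
    search: isHealthy('search'),
    recommendations: isHealthy('recommendations'),
  };
}

console.log(deriveFeatures({ services: { checkout: 'healthy', search: 'degraded' } }));
// { checkout: true, search: false, recommendations: false }
```

Failing closed (unknown means unavailable) is a design choice; a product that prefers optimistic behavior could default missing services to available instead.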
Database Failover Strategies
Read Replica Promotion
// Database client with failover
class DatabaseClient {
private primaryPool: Pool;
private replicaPools: Pool[];
private currentWritePool: Pool;
constructor(config: DatabaseConfig) {
this.primaryPool = new Pool({
host: config.primaryHost,
port: 5432,
...config.credentials,
});
this.replicaPools = config.replicaHosts.map(host =>
new Pool({ host, port: 5432, ...config.credentials })
);
this.currentWritePool = this.primaryPool;
}
async query(sql: string, params?: any[]): Promise<QueryResult> {
const isWrite = this.isWriteQuery(sql);
if (isWrite) {
return this.executeWrite(sql, params);
}
return this.executeRead(sql, params);
}
private async executeRead(sql: string, params?: any[]): Promise<QueryResult> {
// Try replicas first (load distribution)
const shuffledReplicas = this.shuffleArray([...this.replicaPools]);
for (const pool of shuffledReplicas) {
try {
return await pool.query(sql, params);
} catch (error) {
console.error(`Replica failed: ${error.message}`);
}
}
// All replicas failed - try primary
return this.primaryPool.query(sql, params);
}
private async executeWrite(sql: string, params?: any[]): Promise<QueryResult> {
try {
return await this.currentWritePool.query(sql, params);
} catch (error) {
if (this.isPrimaryFailure(error)) {
// Attempt to promote replica
await this.promoteReplica();
return this.currentWritePool.query(sql, params);
}
throw error;
}
}
private async promoteReplica(): Promise<void> {
console.log('Primary failed, promoting replica...');
for (const replica of this.replicaPools) {
try {
// Check if replica can be promoted
const result = await replica.query('SELECT pg_is_in_recovery()');
if (result.rows[0].pg_is_in_recovery) {
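// Promote replica to primary. Caution: pg_promote() needs elevated
// privileges, and every app instance that reaches this path could promote
// a different replica (split brain). Production setups usually delegate
// promotion to a single coordinator such as Patroni or the cloud provider.
const promoted = await replica.query('SELECT pg_promote() AS ok');
if (!promoted.rows[0].ok) continue;
this.currentWritePool = replica;
console.log('Replica promoted successfully');
return;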
}
} catch (error) {
console.error(`Failed to promote replica: ${error.message}`);
}
}
throw new Error('No replica available for promotion');
}
private isWriteQuery(sql: string): boolean {
const writePatterns = /^(INSERT|UPDATE|DELETE|CREATE|ALTER|DROP|TRUNCATE)/i;
return writePatterns.test(sql.trim());
}
private isPrimaryFailure(error: Error): boolean {
return error.message.includes('ECONNREFUSED') ||
error.message.includes('Connection terminated');
}
private shuffleArray<T>(array: T[]): T[] {
for (let i = array.length - 1; i > 0; i--) {
const j = Math.floor(Math.random() * (i + 1));
[array[i], array[j]] = [array[j], array[i]];
}
return array;
}
}
Read-Only Mode During Failover
// Application read-only mode
class ReadOnlyModeManager {
private isReadOnly: boolean = false;
private readOnlyReason: string = '';
async checkAndSetMode(): Promise<void> {
try {
// Try a simple write query
await db.query("INSERT INTO health_check (ts) VALUES (NOW()) ON CONFLICT DO NOTHING");
this.isReadOnly = false;
this.readOnlyReason = ''; // clear the stale reason once writes succeed again
} catch (error) {
console.error('Write failed, entering read-only mode:', error.message);
this.isReadOnly = true;
this.readOnlyReason = 'Database write unavailable';
}
}
getMode(): { isReadOnly: boolean; reason: string } {
return { isReadOnly: this.isReadOnly, reason: this.readOnlyReason };
}
}
// API middleware
function readOnlyMiddleware(req: Request, res: Response, next: NextFunction) {
const mode = readOnlyManager.getMode();
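// Allow safe (read-only) methods
if (['GET', 'HEAD', 'OPTIONS'].includes(req.method)) {
return next();
}
// Block writes in read-only mode
if (mode.isReadOnly) {
// Standard header so clients and proxies know when to retry
res.set('Retry-After', '60');
return res.status(503).json({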
error: 'Service in read-only mode',
reason: mode.reason,
retryAfter: 60,
});
}
next();
}
// Frontend handling
async function handleFormSubmit(data: FormData): Promise<void> {
try {
await api.post('/api/submit', data);
} catch (error) {
if (error.status === 503 && error.body.error === 'Service in read-only mode') {
// Queue for later submission
await queueForLater(data);
showNotification('Your changes will be saved when service is restored');
} else {
throw error;
}
}
}
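The `queueForLater` call above is not defined in this article. One possible shape is a small persistent queue that stores pending submissions and replays them in order once writes succeed again. The sketch below is an assumption-laden illustration: `PendingSubmissionQueue` and `StorageLike` are hypothetical, items must be JSON-serializable (a `FormData` object would need converting to a plain object first), and the server should deduplicate replays, for example with an idempotency key.

```typescript
// Minimal subset of the Web Storage interface (localStorage satisfies it).
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

class PendingSubmissionQueue<T> {
  constructor(
    private readonly storage: StorageLike,
    private readonly key: string = 'pending-submissions',
  ) {}

  private read(): T[] {
    try {
      const raw = this.storage.getItem(this.key);
      const parsed: unknown = raw ? JSON.parse(raw) : [];
      return Array.isArray(parsed) ? (parsed as T[]) : [];
    } catch {
      return []; // corrupted storage: start from an empty queue
    }
  }

  enqueue(item: T): void {
    this.storage.setItem(this.key, JSON.stringify([...this.read(), item]));
  }

  size(): number {
    return this.read().length;
  }

  // Replay items in order; stop at the first failure so ordering is preserved.
  // Returns the number of items sent successfully.
  async flush(send: (item: T) => Promise<void>): Promise<number> {
    const items = this.read();
    let sent = 0;
    for (const item of items) {
      try {
        await send(item);
        sent += 1;
      } catch {
        break;
      }
    }
    // Re-read before writing so items enqueued during the flush are kept.
    this.storage.setItem(this.key, JSON.stringify(this.read().slice(sent)));
    return sent;
  }
}
```

In a browser, `new PendingSubmissionQueue(localStorage)` would work as the backing store, with `flush` triggered when the read-only banner clears or the `online` event fires.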
Third-Party Dependency Resilience
Auth Provider Failover
// Multi-provider auth client
class ResilientAuthClient {
private providers: AuthProvider[];
private currentProvider: AuthProvider;
constructor() {
this.providers = [
new Auth0Provider(process.env.AUTH0_CONFIG),
new CognitoProvider(process.env.COGNITO_CONFIG), // Backup
];
this.currentProvider = this.providers[0];
}
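async validateToken(token: string): Promise<User | null> {
for (const provider of this.providers) {
try {
const user = await provider.validateToken(token);
if (user) {
this.currentProvider = provider;
// Cache the validated user so getFromCache() has data during an outage.
// A cache write failure must not fail the request.
await this.saveToCache(token, user).catch(() => undefined);
return user;
}
} catch (error) {
console.error(`Auth provider ${provider.name} failed:`, error.message);
}
}
// All providers failed - check local cache.
// Tradeoff: a revoked token stays usable from the cache until it expires.
return this.getFromCache(token);
}
private async saveToCache(token: string, user: User): Promise<void> {
const ttlSeconds = Math.floor(user.exp - Date.now() / 1000);
if (ttlSeconds <= 0) return;
// ioredis-style arguments (an assumption); node-redis v4 takes { EX: ttlSeconds }
await redis.set(`auth:token:${hashToken(token)}`, JSON.stringify(user), 'EX', ttlSeconds);
}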
private async getFromCache(token: string): Promise<User | null> {
const cached = await redis.get(`auth:token:${hashToken(token)}`);
if (cached) {
const user = JSON.parse(cached);
// Only use cache if token hasn't expired
if (user.exp > Date.now() / 1000) {
return user;
}
}
return null;
}
async login(credentials: Credentials): Promise<AuthResult> {
for (const provider of this.providers) {
try {
return await provider.login(credentials);
} catch (error) {
console.error(`Login via ${provider.name} failed:`, error.message);
}
}
throw new Error('All auth providers unavailable');
}
}
Feature Flag Fallback
// Feature flag client with fallback
class ResilientFeatureFlags {
private client: LaunchDarklyClient;
private localDefaults: Map<string, boolean>;
private cachedFlags: Map<string, boolean> = new Map();
constructor() {
this.client = new LaunchDarklyClient(process.env.LD_SDK_KEY);
this.localDefaults = new Map([
['new-checkout', false],
['dark-mode', true],
['recommendations', true],
]);
// Cache flags on successful fetch
this.client.on('ready', () => {
this.cacheAllFlags();
});
this.client.on('update', () => {
this.cacheAllFlags();
});
}
async isEnabled(flag: string, user?: User): Promise<boolean> {
try {
const value = await this.client.variation(flag, user, null);
if (value !== null) {
// Update cache
this.cachedFlags.set(flag, value);
return value;
}
} catch (error) {
console.error(`Feature flag service unavailable: ${error.message}`);
}
// Fallback to cache, then to local defaults, then to "off"
return this.cachedFlags.get(flag) ?? this.localDefaults.get(flag) ?? false;
}
private async cacheAllFlags(): Promise<void> {
const allFlags = await this.client.allFlagsState();
for (const [key, value] of Object.entries(allFlags)) {
this.cachedFlags.set(key, value);
}
// Persist to localStorage for offline access
if (typeof window !== 'undefined') {
localStorage.setItem('feature-flags-cache', JSON.stringify(
Object.fromEntries(this.cachedFlags)
));
}
}
}
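The class above writes flags to localStorage but never reads them back, so the persisted copy does not help on a cold start during an outage. A small loader can restore it, as long as it treats the stored value as untrusted. The sketch below is illustrative; `parsePersistedFlags` is a hypothetical helper and the storage key is taken from the code above.

```typescript
// Parse a persisted flag snapshot, keeping only well-formed boolean entries.
function parsePersistedFlags(raw: string | null): Map<string, boolean> {
  const flags = new Map<string, boolean>();
  if (!raw) return flags;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      return flags;
    }
    for (const [key, value] of Object.entries(parsed)) {
      // Skip anything that is not a plain boolean (multivariate flags, nulls).
      if (typeof value === 'boolean') flags.set(key, value);
    }
  } catch {
    // Corrupted JSON: fall through and return whatever was collected (nothing).
  }
  return flags;
}

// In the browser, the constructor could seed its cache roughly like this:
//
// this.cachedFlags = parsePersistedFlags(localStorage.getItem('feature-flags-cache'));

console.log(parsePersistedFlags('{"dark-mode":true,"new-checkout":false,"variant":"b"}'));
```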
DR Testing: Chaos Engineering for Frontend
Automated DR Tests
// DR test suite
describe('Disaster Recovery Tests', () => {
describe('CDN Failover', () => {
it('should serve stale content when origin is down', async () => {
// Setup: Ensure page is cached
await fetch('https://app.example.com/');
// Simulate origin failure
await originSimulator.disable('us-east-1');
// Verify stale content is served
const response = await fetch('https://app.example.com/');
expect(response.ok).toBe(true);
expect(response.headers.get('X-Cache')).toContain('STALE');
// Cleanup
await originSimulator.enable('us-east-1');
});
});
describe('API Failover', () => {
it('should fail over to secondary API endpoint', async () => {
// Disable primary API
await apiSimulator.disable('primary');
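// Make request. ResilientAPIClient.fetch() resolves to the parsed JSON body,
// not a Response, so there are no headers to inspect on the result.
const products = await apiClient.fetch('/api/products');
// A successful result with the primary disabled means a failover endpoint
// answered. To assert which one, the API would need to echo it in the body.
expect(products).toBeDefined();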
// Cleanup
await apiSimulator.enable('primary');
});
});
describe('Database Failover', () => {
it('should continue serving reads during primary failure', async () => {
// Disable primary database
await dbSimulator.failPrimary();
// Verify reads still work
const products = await db.query('SELECT * FROM products LIMIT 10');
expect(products.rows.length).toBeGreaterThan(0);
// Cleanup
await dbSimulator.restorePrimary();
});
it('should enter read-only mode when writes fail', async () => {
await dbSimulator.failPrimary();
// Attempt write
const response = await fetch('/api/orders', {
method: 'POST',
body: JSON.stringify({ items: [1, 2, 3] }),
});
expect(response.status).toBe(503);
expect(await response.json()).toMatchObject({
error: 'Service in read-only mode',
});
await dbSimulator.restorePrimary();
});
});
describe('Third-Party Failover', () => {
it('should use cached auth when provider is down', async () => {
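// Login first (caches token); assumes AuthResult exposes a `token` field
const { token } = await authClient.login({ email: 'test@example.com', password: 'test' });
// Validate once while the provider is up so the token is in the cache
await authClient.validateToken(token);
// Disable auth provider
await authSimulator.disable();
// Verify token validation still works
const user = await authClient.validateToken(token);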
expect(user).not.toBeNull();
await authSimulator.enable();
});
});
});
Game Day Runbook
# DR Game Day Runbook
## Pre-Game Day (24 hours before)
1. [ ] Notify stakeholders
2. [ ] Verify monitoring dashboards
3. [ ] Confirm rollback procedures
4. [ ] Brief on-call team
## Game Day Scenarios
### Scenario 1: Primary Region Failure (30 minutes)
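1. Force the us-east-1 health check to fail (for example, invert it or make /api/health return 503). Merely disabling a Route 53 health check leaves it reported as healthy.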
2. Monitor DNS failover (expected: 60-90 seconds)
3. Verify traffic routing to us-west-2
4. Test critical user flows:
- [ ] Homepage loads
- [ ] User can log in
- [ ] Product pages load
- [ ] Cart functionality (read-only acceptable)
- [ ] Checkout disabled notification shown
5. Re-enable us-east-1
6. Verify traffic returns to primary
### Scenario 2: CDN Outage (20 minutes)
1. Simulate Cloudflare outage (block at firewall)
2. Verify CloudFront failover
3. Test asset loading:
- [ ] JS bundles load
- [ ] CSS loads
- [ ] Images load
4. Restore Cloudflare
5. Verify traffic distribution
### Scenario 3: Database Primary Failure (25 minutes)
1. Stop PostgreSQL primary
2. Monitor automatic failover
3. Verify read-only mode activated
4. Test degraded functionality:
- [ ] Product browsing works
- [ ] Search works
- [ ] Write operations show appropriate error
5. Promote replica to primary
6. Verify write operations resume
## Post-Game Day
1. Document any issues
2. Update runbooks
3. Schedule fixes for identified gaps
4. Share learnings with team
Monitoring and Alerting for DR
// DR-specific monitoring
const drMetrics = {
// Origin health
'origin.health.us_east': () => checkOriginHealth('us-east-1'),
'origin.health.us_west': () => checkOriginHealth('us-west-2'),
'origin.health.eu_west': () => checkOriginHealth('eu-west-1'),
// Failover status
'failover.active': () => isFailoverActive(),
'failover.duration': () => getFailoverDuration(),
// Degraded mode
'degraded.read_only': () => isReadOnlyMode(),
'degraded.features_disabled': () => countDisabledFeatures(),
// Cache health
'cache.stale_serving': () => isServingStale(),
'cache.hit_rate': () => getCacheHitRate(),
};
// Alerts
const drAlerts = [
{
name: 'Primary Region Unhealthy',
condition: 'origin.health.us_east == 0 for 1 minute',
severity: 'critical',
action: 'page-oncall',
},
{
name: 'All Regions Unhealthy',
condition: 'origin.health.us_east == 0 AND origin.health.us_west == 0 AND origin.health.eu_west == 0',
severity: 'critical',
action: 'page-oncall, notify-leadership',
},
{
name: 'Extended Failover',
condition: 'failover.active == 1 AND failover.duration > 30 minutes',
severity: 'high',
action: 'page-oncall',
},
{
name: 'Read-Only Mode Active',
condition: 'degraded.read_only == 1 for 5 minutes',
severity: 'high',
action: 'notify-team',
},
{
name: 'Serving Stale Content',
condition: 'cache.stale_serving == 1 for 10 minutes',
severity: 'medium',
action: 'notify-team',
},
];
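Several of the alert conditions above include a "for N minutes" clause, which means the condition must hold continuously before the alert fires. Most monitoring systems provide this natively; the sketch below only illustrates the underlying logic. `SustainedCondition` is a hypothetical class, and timestamps are passed in explicitly so the behavior is deterministic.

```typescript
// Tracks whether a condition has been continuously true for at least holdMs.
class SustainedCondition {
  private since: number | null = null;

  constructor(private readonly holdMs: number) {}

  // Feed the latest observation; returns true when the alert should fire.
  update(conditionTrue: boolean, nowMs: number): boolean {
    if (!conditionTrue) {
      this.since = null; // any healthy sample resets the window
      return false;
    }
    if (this.since === null) this.since = nowMs;
    return nowMs - this.since >= this.holdMs;
  }
}

// Example: "origin.health.us_east == 0 for 1 minute"
const primaryUnhealthy = new SustainedCondition(60_000);
console.log(primaryUnhealthy.update(true, 0));      // false (window just started)
console.log(primaryUnhealthy.update(true, 30_000)); // false (30s so far)
console.log(primaryUnhealthy.update(true, 60_000)); // true  (held for a full minute)
```

Requiring a sustained window trades a slower page for fewer false alarms from a single failed probe.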
Summary: DR Architecture Principles
- Multi-Layer Redundancy - DNS, CDN, origin, and database each have independent failover mechanisms. No single point of failure.
- Stale-If-Error Is Your Friend - Configure CDN to serve stale content when origins fail. Users get content, not errors.
- Client-Side Resilience - Service workers, retry logic, and circuit breakers ensure the frontend handles backend failures gracefully.
- Graceful Degradation - Design features to work in degraded modes. Read-only is better than offline.
- Shared State Replication - Sessions, user data, and critical state must replicate across regions with minimal lag.
- Third-Party Fallbacks - Auth, payments, and feature flags need backup strategies. Don't let external dependencies become SPOFs.
- Test Your DR - Regular game days and chaos engineering ensure DR plans actually work when needed.
- RTO/RPO Reality - Define realistic recovery objectives and architect to meet them. 5-minute RTO requires different infrastructure than 1-hour RTO.
The architecture outlined here—multi-region deployment, CDN resilience, client-side caching, database replication, and graceful degradation—represents production-grade disaster recovery for frontend systems at scale.
When AWS goes down, your users should barely notice.