Canary Deployment Architecture for Frontends: Production-Grade Strategies at Scale
Introduction: Why Frontend Canaries Are Fundamentally Different
When Netflix pushed a buggy JavaScript bundle that caused infinite re-render loops for 2% of their traffic, they caught it in 90 seconds. When a smaller company did the same without canary deployments, they discovered the issue after 40,000 users had already experienced crashes, leading to a 12% same-day churn spike.
The difference? A production-grade canary deployment architecture built specifically for frontend constraints.
Most engineers understand canary deployments for backend services: route 5% of traffic to new code, compare error rates, proceed or rollback. Simple. But frontends break this mental model in fundamental ways:
The Frontend Canary Problem:
- Static Assets Are Immutable - You can't "gradually deploy" a JavaScript bundle. Once it's on the CDN, it's there. You need traffic splitting, not deployment splitting.
- Browser Caching Creates Version Chaos - User A might load index.html (new) but main.js (old). User B might have the opposite. You're not deploying one version, you're managing version matrices.
- Client-Side State Persists Across Deployments - A user's localStorage, IndexedDB, and ServiceWorker cache might be from v47, while your canary is v51. Migrations happen in the browser, not on deploy.
- Hydration Failures Are Silent - SSR/SSG apps can serve perfectly fine HTML from the canary, then crash during hydration. Traditional monitoring catches this too late.
- CDN Cache Invalidation Takes Time - You can't instantly rollback a frontend deployment. CDN propagation means bad code lives for 30-300 seconds minimum, affecting thousands of requests.
This article explains how companies like Vercel, Cloudflare, and Netflix build frontend canary systems that account for these constraints while keeping edge routing decisions fast, automated rollout decisions under a minute, and rollbacks free of downtime.
We'll cover the architecture decisions that matter: edge-based traffic splitting, client-side version detection, hydration monitoring, cache invalidation strategies, and the automation systems that make decisions faster than humans can.
Scale Context: Production Reality
Before diving into architecture, let's establish realistic production constraints for a hyper-scale frontend:
Traffic Profile:
- DAU: 50M daily active users
- Peak RPS: 450K requests/second (main HTML)
- Asset Requests: 2.8M RPS (JS/CSS/images/fonts)
- Geographic Distribution: 180+ countries, 60% mobile
- CDN PoPs: 300+ edge locations worldwide
- Simultaneous Deploys: 40-60 per day (feature teams + hotfixes)
Frontend Architecture:
- Framework: Next.js 14 (App Router) with React Server Components
- Rendering: Hybrid SSR + SSG + ISR
- Bundle Size: 850KB initial (gzipped), 3.2MB total (all routes)
- Code Splits: 120+ dynamic chunks
- API Calls per Page: 6-12 (BFF aggregation)
- WebSocket Connections: 8M concurrent (realtime features)
Deployment Constraints:
- Build Time: 4-8 minutes (full production build)
- CDN Propagation: 30-90 seconds (global edge cache)
- Canary Duration: 5-45 minutes (depends on confidence)
- Rollback SLA: <2 minutes (detection + action)
- Acceptable Error Budget: 0.1% additional error rate during canary
Monitoring Requirements:
- Metric Collection Latency: <10 seconds
- Decision Loop: <60 seconds
- Sample Size for Statistical Significance: Minimum 10K requests
- False Positive Rate: <1% (automated rollback)
Cost Constraints:
- CDN Bandwidth: $0.08/GB (1.2PB/month = $96K/month)
- Edge Compute: $0.50 per million requests
- Monitoring: $12K/month (APM + RUM + logs)
- Canary Overhead: Must stay <5% of infrastructure costs
At this scale, a naive canary implementation breaks. You need purpose-built architecture.
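To make these numbers concrete, the sketch below estimates how long a 5% canary needs to collect the 10K-request minimum, both in aggregate and for a small stratum. It assumes traffic is spread evenly over time, which real traffic is not, and the names `TrafficProfile` and `secondsToMinSamples` are illustrative rather than part of any system described here.

```typescript
// Back-of-the-envelope estimate: time until a canary slice has enough samples.
// Assumes evenly distributed traffic (a simplification).

interface TrafficProfile {
  peakRps: number;        // HTML requests per second at peak
  canaryFraction: number; // 0.05 for a 5% canary
}

/** Requests per second that reach the canary. */
function canaryRps(profile: TrafficProfile): number {
  return profile.peakRps * profile.canaryFraction;
}

/**
 * Seconds needed for one stratum to reach `minSamples`.
 * `stratumShare` is that stratum's share of all traffic (0-1).
 */
function secondsToMinSamples(
  profile: TrafficProfile,
  minSamples: number,
  stratumShare: number
): number {
  return minSamples / (canaryRps(profile) * stratumShare);
}

const profile: TrafficProfile = { peakRps: 450000, canaryFraction: 0.05 };

// Whole canary population: well under a second at peak.
const aggregateSeconds = secondsToMinSamples(profile, 10000, 1);

// A stratum holding 0.1% of traffic: several minutes.
const smallStratumSeconds = secondsToMinSamples(profile, 10000, 0.001);
```

The aggregate threshold is reached almost immediately at peak, but a stratum with 0.1% of traffic needs minutes, which is one reason canary durations stretch to the upper end of the 5-45 minute range.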
High-Level Architecture: Frontend Canary System
A production-grade frontend canary system has seven layers:
┌─────────────────────────────────────────────────────────────────┐
│ USER REQUEST │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 1: DNS / Global Load Balancer │
│ - Geographic routing (latency-based) │
│ - DDoS protection │
│ - Health checks │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 2: CDN Edge (300+ PoPs) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Edge Worker (V8 Isolate) │ │
│ │ - Traffic splitting logic │ │
│ │ - Version assignment (cookie/header) │ │
│ │ - Cache key variation │ │
│ │ - Client hints inspection │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Canary Decision: 95% → stable / 5% → canary │
└─────────┬────────────────────────────────┬─────────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ STABLE ORIGIN │ │ CANARY ORIGIN │
│ │ │ │
│ /dist-v47/ │ │ /dist-v48/ │
│ main.js │ │ main.js │
│ index.html │ │ index.html │
│ _next/chunks/ │ │ _next/chunks/ │
└──────────────────┘ └──────────────────┘
│ │
└────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 3: Origin Servers (Kubernetes) │
│ - SSR rendering (Node.js pods) │
│ - API BFF (GraphQL aggregation) │
│ - Server Components execution │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 4: Client-Side Instrumentation │
│ - Version detection (injected in HTML) │
│ - Performance metrics (Core Web Vitals) │
│ - Error tracking (window.onerror, React error boundaries) │
│ - Hydration timing (React profiling) │
│ - Navigation timing (PerformanceObserver) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 5: Telemetry Pipeline │
│ - Structured logging (JSON) │
│ - Metrics aggregation (Prometheus/Datadog) │
│ - Real-time streaming (Kafka → Flink) │
│ - Time-series database (InfluxDB/TimescaleDB) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 6: Canary Analysis Engine │
│ - Statistical comparison (two-sample t-test) │
│ - Anomaly detection (IQR, Z-score) │
│ - Threshold evaluation (SLO-based) │
│ - Confidence scoring (Bayesian inference) │
│ - Automated decision (proceed/hold/rollback) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ LAYER 7: Deployment Orchestration │
│ - Progressive rollout (5% → 25% → 50% → 100%) │
│ - CDN cache purge (selective invalidation) │
│ - Feature flag coordination (LaunchDarkly/custom) │
│ - Rollback execution (atomic pointer swap) │
│ - Incident automation (PagerDuty/Slack) │
└─────────────────────────────────────────────────────────────────┘
Key Architectural Principles:
- Edge-First Traffic Splitting - Decision happens at CDN edge (not origin). This prevents origin load amplification and enables <1ms routing decisions.
- Sticky Sessions via Cookie - Once a user is assigned canary/stable, they stay there for the entire session. Prevents A/B switching mid-session which causes hydration failures.
- Separate Asset Paths - Canary and stable assets live in different CDN paths (/dist-v47/ vs /dist-v48/). No shared cache keys. This prevents version collisions.
- Client-Side Version Injection - Every HTML response includes <meta name="app-version" content="v48-canary">. Enables client-side telemetry tagging and error attribution.
- Real-Time Metrics Streaming - Telemetry flows through Kafka to Flink for sub-10-second aggregation. Batch processing (5-minute windows) is too slow for canary decisions.
- Automated Decision Loop - Humans approve the deploy, but robots decide rollout progression. Statistical tests run every 30 seconds. If canary fails, rollback happens in <60 seconds without human intervention.
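Stickiness and progressive rollout interact: when the percentage moves from 5% to 25%, users already on the canary should stay there. One way to get that property, sketched below under the assumption that every user has a stable ID, is deterministic bucketing. The helper names (`bucketFor`, `assign`, `canaryUsers`) are illustrative.

```typescript
// Deterministic bucketing: the same ID always maps to the same bucket (0-99),
// so raising the canary percentage only ever adds users to the canary.

/** 32-bit string hash (the scheme used by Java's String.hashCode). */
function hashCode(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0; // keep it 32-bit
  }
  return hash;
}

function bucketFor(userId: string): number {
  return Math.abs(hashCode(userId)) % 100;
}

function assign(userId: string, canaryPct: number): "canary" | "stable" {
  return bucketFor(userId) < canaryPct ? "canary" : "stable";
}

/** IDs that land on the canary at a given percentage. */
function canaryUsers(ids: string[], pct: number): string[] {
  return ids.filter((id) => assign(id, pct) === "canary");
}

const ids = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const at5 = canaryUsers(ids, 5);
const at25 = new Set(canaryUsers(ids, 25));

// Every user who was on the canary at 5% is still on it at 25%.
const everyoneKept = at5.every((id) => at25.has(id));
```

Because a bucket below 5 is also below 25, the canary population only grows as the rollout proceeds; nobody flips back to stable unless the percentage is lowered.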
Traffic Splitting Mechanisms: Edge vs Origin vs Client
There are four places you can implement canary traffic splitting. Each has different tradeoffs.
1. DNS-Based Splitting (DON'T USE)
user → DNS (weighted records) → 95% to stable-lb.example.com
→ 5% to canary-lb.example.com
Why This Fails:
- DNS caching (60s-3600s TTL) means rollback takes minutes to hours
- Recursive resolvers cache a single answer for many users, so the effective split rarely matches the configured weights
- No session stickiness
- Geographic distribution is uneven
Verdict: Never use DNS for frontend canaries. It's too slow and unpredictable.
2. Load Balancer-Based Splitting (LEGACY)
user → ALB/NLB → weighted target groups → 95% stable pods
→ 5% canary pods
Why This Works (Sort Of):
- Session stickiness via cookies
- Fast rollback (<10 seconds)
- Origin-level control
Why This Fails at Scale:
- Load balancer becomes a bottleneck (L4/L7 inspection overhead)
- No geographic granularity (all regions get same canary %)
- Origin load amplification (cache misses hit origin harder)
- Doesn't work with CDN-cached static assets
Verdict: Works for SSR-heavy apps without CDN, but not optimal for modern frontends.
3. CDN-Based Splitting (GOOD)
graph TB
User[User Request] --> CDN[CDN Edge PoP]
CDN --> Cache{Asset in<br/>Edge Cache?}
Cache -->|Yes| Return[Return Cached Asset]
Cache -->|No| EdgeLogic[Edge Worker Logic]
EdgeLogic --> Hash{Hash user ID<br/>% 100}
Hash -->|"< 5"| Canary[Set canary cookie<br/>Route to /dist-v48/]
Hash -->|">= 5"| Stable[Set stable cookie<br/>Route to /dist-v47/]
Canary --> FetchCanary[Fetch from Canary Origin]
Stable --> FetchStable[Fetch from Stable Origin]
FetchCanary --> CacheCanary[Cache with key:<br/>canary-v48-/path]
FetchStable --> CacheStable[Cache with key:<br/>stable-v47-/path]
CacheCanary --> Return
CacheStable --> Return
Implementation (Cloudflare Workers):
// Deployed to 300+ edge PoPs
export default {
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
const url = new URL(request.url);
const cookies = parseCookies(request.headers.get('Cookie') || '');
// Check if user already has a version assignment
let assignedVersion = cookies['x-canary-version'];
if (!assignedVersion) {
// New user - assign based on hash
const userId = cookies['user_id'] || generateAnonymousId();
const hash = hashCode(userId);
const bucket = Math.abs(hash) % 100;
// Get current canary percentage from KV (updated by control plane)
const canaryPct = parseInt(await env.KV.get('canary-percentage') || '5');
assignedVersion = bucket < canaryPct ? 'canary' : 'stable';
}
// Get version numbers from KV
const stableVersion = await env.KV.get('stable-version'); // "v47"
const canaryVersion = await env.KV.get('canary-version'); // "v48"
const targetVersion = assignedVersion === 'canary'
? canaryVersion
: stableVersion;
// Rewrite URL to version-specific path
const originPath = `/dist-${targetVersion}${url.pathname}`;
const originUrl = new URL(originPath, env.ORIGIN_URL);
// Add cache key variation
const cacheKey = new Request(originUrl.toString(), {
headers: request.headers,
cf: {
cacheKey: `${assignedVersion}-${targetVersion}-${url.pathname}`
}
});
// Check edge cache first
let response = await caches.default.match(cacheKey);
if (!response) {
// Cache miss - fetch from origin
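// Copy the incoming headers first: spreading a Headers object yields no entries
const originHeaders = new Headers(request.headers);
originHeaders.set('X-Canary-Version', assignedVersion);
originHeaders.set('X-App-Version', targetVersion ?? '');
response = await fetch(originUrl, { headers: originHeaders });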
// Cache the response (if cacheable)
if (response.ok && response.headers.get('Cache-Control')) {
const cachedResponse = response.clone();
ctx.waitUntil(caches.default.put(cacheKey, cachedResponse));
}
}
// Inject version cookie in response
const newResponse = new Response(response.body, response);
newResponse.headers.set(
'Set-Cookie',
`x-canary-version=${assignedVersion}; Path=/; Max-Age=86400; SameSite=Lax; Secure`
);
// Add version header for telemetry
newResponse.headers.set('X-Served-Version', targetVersion);
newResponse.headers.set('X-Canary-Assignment', assignedVersion);
return newResponse;
}
};
function hashCode(str: string): number {
let hash = 0;
for (let i = 0; i < str.length; i++) {
const char = str.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32-bit integer
}
return hash;
}
function parseCookies(cookieHeader: string): Record<string, string> {
return Object.fromEntries(
cookieHeader.split(';').map(c => {
const [key, ...v] = c.trim().split('=');
return [key, v.join('=')];
})
);
}
function generateAnonymousId(): string {
return crypto.randomUUID();
}
Why This Works:
- Decision happens at edge (300+ PoPs, <1ms latency)
- User stickiness via cookie (session-consistent)
- Separate cache keys prevent version collisions
- Dynamic canary percentage (KV store update → instant effect)
- Works for both HTML and static assets
Why This Still Has Limitations:
- Edge Worker CPU limits (50ms execution time)
- KV eventual consistency (can take 60s to propagate globally)
- Requires CDN that supports edge compute (Cloudflare, Fastly, AWS CloudFront Functions)
Verdict: This is the industry standard for frontend canaries at scale.
4. Client-Side Splitting (AVOID)
// In initial HTML
<script>
const canaryPct = 5;
const bucket = Math.floor(Math.random() * 100);
const version = bucket < canaryPct ? 'v48' : 'v47';
// Dynamically load versioned bundle
const script = document.createElement('script');
script.src = `/dist-${version}/main.js`;
document.head.appendChild(script);
</script>
Why This Fails:
- Initial HTML is already cached (can't control version)
- Random assignment changes on every page load (no stickiness)
- No SSR support
- Breaks preloading and resource hints
- Hurts Core Web Vitals (delayed script execution)
Verdict: Only use as a last resort if you have zero backend control.
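If client-side splitting really is the only option, the stickiness problem at least can be reduced by rolling once and persisting the result. The sketch below is one possible shape, not a recommended design: the storage key `canary-assignment` is a made-up name, and `KeyValueStore` mirrors only the part of the Web Storage API that is used, so `window.localStorage` could be passed in directly.

```typescript
// Sticky client-side assignment: roll once, persist the result, reuse it.

interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Assignment = "canary" | "stable";

const STORAGE_KEY = "canary-assignment"; // hypothetical key name

function getStickyAssignment(
  store: KeyValueStore,
  canaryPct: number,
  random: () => number = Math.random
): Assignment {
  const existing = store.getItem(STORAGE_KEY);
  if (existing === "canary" || existing === "stable") {
    return existing; // reuse the earlier roll
  }
  const assignment: Assignment =
    random() * 100 < canaryPct ? "canary" : "stable";
  store.setItem(STORAGE_KEY, assignment);
  return assignment;
}

// In-memory store so the sketch also runs outside a browser.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}
```

This addresses only the per-page-load reassignment. The cached HTML, missing SSR support, and Core Web Vitals cost listed above still apply.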
Frontend-Specific Canary Challenges
Backend canary deployments are stateless: send request, get response, compare metrics. Frontend canaries have three fundamental problems that break this model.
Challenge 1: Static Asset Cache Coherence
The Problem:
You deploy canary v48. A user requests:
- index.html (v48, canary) ← Edge routes to canary origin
- main.js (v47, stable) ← Browser cache hit from yesterday
- chunk-profile.js (v48, canary) ← Code-split route, cache miss
Now the user is running v48 HTML + v47 main bundle + v48 profile chunk. The webpack runtime can crash while resolving modules because the chunk manifests don't align.
Why This Happens:
CDN cache and browser cache have different TTLs:
- HTML: Cache-Control: public, max-age=0, must-revalidate (always check origin)
- JS bundles: Cache-Control: public, max-age=31536000, immutable (cache forever)
When you deploy canary, HTML immediately points to v48 assets, but browser cache still has v47 bundles.
Solution 1: Content-Addressed Assets (Standard)
Every asset has a hash in its filename:
main.a3f5d2b9.js ← v47 stable
main.c7e9f1a4.js ← v48 canary
When HTML changes version, it references different asset URLs. Browser cache is keyed by URL, so no collision.
Webpack/Next.js automatically does this:
// next.config.js
const { execSync } = require('child_process');
module.exports = {
  generateBuildId: async () => {
    // Use git commit SHA as build ID
    return execSync('git rev-parse HEAD').toString().trim();
  },
  // The build ID then appears in asset paths under /_next/static/<buildId>/
};
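Independent of any bundler, content addressing just means the file name is a function of the file's bytes. The sketch below illustrates that idea with a small FNV-1a hash over UTF-16 code units, chosen only to keep the example dependency-free; bundlers use their own hash functions, and `hashedFileName` is an illustrative helper, not a webpack or Next.js API.

```typescript
// Content addressing: derive the file name from the file's content.

/** 32-bit FNV-1a over UTF-16 code units, as 8 hex characters. */
function fnv1a(content: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

function hashedFileName(baseName: string, ext: string, content: string): string {
  return `${baseName}.${fnv1a(content)}.${ext}`;
}

// Identical content yields an identical URL; any change yields a new one.
const stableName = hashedFileName("main", "js", "console.log('v47')");
const canaryName = hashedFileName("main", "js", "console.log('v48')");
```

Since the two builds produce different file names, a browser that cached the stable bundle cannot serve it in response to a request for the canary bundle.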
Solution 2: Versioned Asset Paths
CDN edge worker rewrites paths:
// Canary: /dist-v48/_next/static/chunks/main.js
// Stable: /dist-v47/_next/static/chunks/main.js
Both can coexist in CDN cache with different cache keys.
The Hydration Problem:
Even with content-addressed assets, Server-Side Rendering creates timing issues.
Scenario:
- User requests /product/123 at 10:00:00 AM
- Edge routes to canary origin (v48)
- SSR renders React tree with v48 code
- HTML sent to client with <script src="/main.c7e9f1a4.js">
- Client fetches main.js at 10:00:02 AM
- During those 2 seconds, canary fails and gets rolled back
- CDN edge now routes to stable origin (v47)
- But HTML already references v48 assets
- Hydration error: Client-side React tree doesn't match server-rendered HTML
Solution: Version Pinning in HTML
// SSR render time
const response = await renderToString(
<App version="v48" buildId="c7e9f1a4" />
);
// Inject version in HTML
const html = `
<!DOCTYPE html>
<html>
<head>
<meta name="app-version" content="v48" data-build-id="c7e9f1a4">
<script>
// Client checks version before hydration
window.__APP_VERSION__ = 'v48';
window.__BUILD_ID__ = 'c7e9f1a4';
</script>
<!-- Asset URLs include version -->
<script src="/dist-v48/main.c7e9f1a4.js"></script>
</head>
<body>
<div id="root">${response}</div>
</body>
</html>
`;
Client-side version check:
// Runs before React hydration
async function validateVersion() {
const expectedVersion = window.__APP_VERSION__;
const expectedBuildId = window.__BUILD_ID__;
// Check if version is still valid
const response = await fetch('/api/version-check', {
headers: {
'X-Client-Version': expectedVersion,
'X-Build-ID': expectedBuildId
}
});
const { valid, currentVersion } = await response.json();
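if (!valid) {
console.warn(`Version mismatch: expected ${expectedVersion}, current ${currentVersion}`);
// Pick ONE strategy; running both would start a migration on a page that is already reloading.
// Option 1: Hard reload (loses client state)
window.location.reload();
return;
// Option 2 (alternative to the reload above): Soft migration (preserve state)
// await migrateClientState(expectedVersion, currentVersion);
}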
}
validateVersion().then(() => {
// Safe to hydrate
hydrateRoot(document.getElementById('root'), <App />);
});
The Cost:
- Extra API call before hydration (adds 50-150ms to TTI)
- Complicates deployment (version registry service)
- Can cause reload loops if migration fails
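The reload-loop risk can be bounded with a small guard that remembers recent reloads. The sketch below is one way to do that, assuming a Web Storage-like object (for example `window.sessionStorage`) is passed in; the key name `version-reload-log` and the default limits are arbitrary choices for illustration.

```typescript
// Reload guard: allow at most `maxReloads` version-mismatch reloads within
// `windowMs`, then refuse so the caller can fall back to something else.

interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const RELOAD_KEY = "version-reload-log"; // hypothetical key name

function shouldReload(
  store: KeyValueStore,
  now: number,
  maxReloads = 2,
  windowMs = 60000
): boolean {
  let timestamps: number[] = [];
  try {
    const parsed: unknown = JSON.parse(store.getItem(RELOAD_KEY) || "[]");
    if (Array.isArray(parsed)) {
      timestamps = parsed.filter((t): t is number => typeof t === "number");
    }
  } catch {
    timestamps = []; // corrupted log: start over
  }

  // Only reloads inside the time window count against the limit.
  const recent = timestamps.filter((t) => now - t < windowMs);
  if (recent.length >= maxReloads) {
    return false; // too many recent reloads: break the loop
  }

  recent.push(now);
  store.setItem(RELOAD_KEY, JSON.stringify(recent));
  return true;
}
```

A caller would check `shouldReload(sessionStorage, Date.now())` before `window.location.reload()` and, when it returns false, continue with the current page and report the mismatch through telemetry instead.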
Better Solution: Server-Sent Version Hints
CDN edge injects version into HTML response:
// Cloudflare Worker
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await fetch(request);
// Only inject for HTML responses
if (!response.headers.get('Content-Type')?.includes('text/html')) {
return response;
}
const html = await response.text();
// Get the versions currently being served from KV (updated on rollback)
const currentCanaryVersion = await env.KV.get('canary-version');
const currentStableVersion = await env.KV.get('stable-version');
const liveVersionsJson = JSON.stringify([currentStableVersion, currentCanaryVersion]);
// Inject version check script. Comparing against the canary version alone
// would force every stable user to reload, so accept either live version.
const versionCheckScript = `
<script>
(function() {
const liveVersions = ${liveVersionsJson};
const clientVersion = document.querySelector('meta[name="app-version"]')?.content;
if (clientVersion && !liveVersions.includes(clientVersion)) {
console.warn('Version drift detected, reloading...');
window.location.reload();
}
})();
</script>
`;
const modifiedHtml = html.replace('</head>', `${versionCheckScript}</head>`);
return new Response(modifiedHtml, {
status: response.status,
headers: response.headers
});
}
};
Verdict: Content-addressed assets + versioned paths + edge version checks = cache coherence.
Challenge 2: Canary Metrics Are Skewed
The Sampling Bias Problem:
You deploy canary to 5% of traffic. After 10 minutes:
- Canary error rate: 0.15%
- Stable error rate: 0.08%
Should you rollback? Not necessarily.
Why Metrics Are Biased:
- Geographic Distribution: Canary users might be disproportionately in regions with slower networks (affects LCP/FID).
- Device Distribution: If canary hash function correlates with user IDs, and newer users tend to have better devices, canary will show better performance even if code is identical.
- Bot Traffic: Bots don't have cookies, so they get reassigned canary/stable on every request. This dilutes real user metrics.
- Time-of-Day Effects: If you start canary at 2 PM EST, US traffic dominates. By 10 PM EST, APAC traffic dominates. Different usage patterns skew metrics.
- Sample Size Disparity: 5% canary = 22.5K RPS. 95% stable = 427.5K RPS. Small sample sizes have higher variance.
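The sample size point can be quantified with the standard error of a proportion, SE = sqrt(p(1 - p) / n). The sketch below applies it to one second of traffic at the rates quoted above. It uses the normal approximation and treats requests as independent, both simplifying assumptions, and the function names are illustrative.

```typescript
// Standard error of an observed error rate p over n requests:
//   SE = sqrt(p * (1 - p) / n)

function standardError(rate: number, n: number): number {
  return Math.sqrt((rate * (1 - rate)) / n);
}

/** Approximate 95% confidence interval (normal approximation). */
function confidenceInterval95(rate: number, n: number): [number, number] {
  const margin = 1.96 * standardError(rate, n);
  return [Math.max(0, rate - margin), rate + margin];
}

// One second of traffic: 22.5K canary requests vs 427.5K stable requests.
const canaryCi = confidenceInterval95(0.0015, 22500);
const stableCi = confidenceInterval95(0.0008, 427500);

// Width of each interval; the canary's is several times wider.
const canaryWidth = canaryCi[1] - canaryCi[0];
const stableWidth = stableCi[1] - stableCi[0];
```

With the same observation period, the canary estimate is far less precise than the stable one, and splitting the canary further into strata widens its intervals again.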
Solution: Stratified Sampling + Statistical Tests
interface CanaryMetrics {
version: 'canary' | 'stable';
timestamp: number;
// Stratification dimensions
region: string; // "us-east", "eu-west", "ap-southeast"
deviceType: string; // "mobile", "desktop", "tablet"
connectionType: string; // "4g", "3g", "wifi", "unknown"
// Core metrics
errorRate: number; // errors / total requests
p50Latency: number; // milliseconds
p95Latency: number;
p99Latency: number;
// Core Web Vitals
lcp: number; // Largest Contentful Paint
fid: number; // First Input Delay (deprecated, use INP)
inp: number; // Interaction to Next Paint
cls: number; // Cumulative Layout Shift
// Hydration metrics
hydrationTime: number; // milliseconds
hydrationError: boolean;
// Sample size
sampleSize: number;
}
class CanaryAnalyzer {
async compareMetrics(
canaryMetrics: CanaryMetrics[],
stableMetrics: CanaryMetrics[]
): Promise<AnalysisResult> {
// Group metrics by stratification dimensions
const canaryByStrata = this.stratify(canaryMetrics);
const stableByStrata = this.stratify(stableMetrics);
const results: StratumResult[] = [];
// Compare each stratum independently
for (const stratum of Object.keys(canaryByStrata)) {
const canaryData = canaryByStrata[stratum];
const stableData = stableByStrata[stratum];
if (!stableData) {
console.warn(`No stable data for stratum ${stratum}`);
continue;
}
// Require minimum sample size for statistical significance
if (canaryData.sampleSize < 1000 || stableData.sampleSize < 1000) {
console.warn(`Insufficient sample size for stratum ${stratum}`);
continue;
}
// Perform two-sample t-test for each metric
const errorRateTest = this.twoSampleTTest(
canaryData.errorRates,
stableData.errorRates
);
const p95LatencyTest = this.twoSampleTTest(
canaryData.p95Latencies,
stableData.p95Latencies
);
const lcpTest = this.twoSampleTTest(
canaryData.lcps,
stableData.lcps
);
results.push({
stratum,
sampleSize: canaryData.sampleSize, // used as the weight in aggregateResults
errorRateDelta: errorRateTest.delta,
errorRateSignificant: errorRateTest.pValue < 0.05,
p95LatencyDelta: p95LatencyTest.delta,
p95LatencySignificant: p95LatencyTest.pValue < 0.05,
lcpDelta: lcpTest.delta,
lcpSignificant: lcpTest.pValue < 0.05,
});
}
// Aggregate results across strata
return this.aggregateResults(results);
}
private stratify(metrics: CanaryMetrics[]): Record<string, AggregatedMetrics> {
const stratified: Record<string, CanaryMetrics[]> = {};
for (const metric of metrics) {
// Create stratum key
const key = `${metric.region}:${metric.deviceType}:${metric.connectionType}`;
if (!stratified[key]) {
stratified[key] = [];
}
stratified[key].push(metric);
}
// Aggregate within each stratum
const aggregated: Record<string, AggregatedMetrics> = {};
for (const [key, metrics] of Object.entries(stratified)) {
aggregated[key] = {
errorRates: metrics.map(m => m.errorRate),
p95Latencies: metrics.map(m => m.p95Latency),
lcps: metrics.map(m => m.lcp),
sampleSize: metrics.reduce((sum, m) => sum + m.sampleSize, 0),
};
}
return aggregated;
}
private twoSampleTTest(
sample1: number[],
sample2: number[]
): { delta: number; pValue: number } {
const mean1 = this.mean(sample1);
const mean2 = this.mean(sample2);
const delta = mean1 - mean2;
const variance1 = this.variance(sample1);
const variance2 = this.variance(sample2);
const n1 = sample1.length;
const n2 = sample2.length;
// Welch's t-test (unequal variances)
const tStatistic = delta / Math.sqrt(variance1 / n1 + variance2 / n2);
// Degrees of freedom (Welch-Satterthwaite equation)
const df = Math.pow(variance1 / n1 + variance2 / n2, 2) /
(Math.pow(variance1 / n1, 2) / (n1 - 1) + Math.pow(variance2 / n2, 2) / (n2 - 1));
// Calculate p-value (simplified - use stats library in production)
const pValue = this.tTestPValue(tStatistic, df);
return { delta, pValue };
}
private mean(values: number[]): number {
return values.reduce((sum, v) => sum + v, 0) / values.length;
}
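private variance(values: number[]): number {
// Unbiased sample variance (divide by n - 1), which Welch's t-test expects
const mean = this.mean(values);
const sumSquaredDiffs = values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0);
return sumSquaredDiffs / (values.length - 1);
}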
private tTestPValue(tStatistic: number, df: number): number {
// Use jStat or math.js for actual t-distribution CDF
// Simplified approximation for example
return 2 * (1 - this.normalCDF(Math.abs(tStatistic)));
}
private normalCDF(x: number): number {
// Standard normal CDF approximation
return 0.5 * (1 + this.erf(x / Math.sqrt(2)));
}
private erf(x: number): number {
// Error function approximation
const sign = x >= 0 ? 1 : -1;
x = Math.abs(x);
const a1 = 0.254829592;
const a2 = -0.284496736;
const a3 = 1.421413741;
const a4 = -1.453152027;
const a5 = 1.061405429;
const p = 0.3275911;
const t = 1 / (1 + p * x);
const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);
return sign * y;
}
private aggregateResults(results: StratumResult[]): AnalysisResult {
// Weighted average by sample size
let totalSampleSize = 0;
let weightedErrorDelta = 0;
let weightedLatencyDelta = 0;
let weightedLcpDelta = 0;
for (const result of results) {
const weight = result.sampleSize || 1;
totalSampleSize += weight;
weightedErrorDelta += result.errorRateDelta * weight;
weightedLatencyDelta += result.p95LatencyDelta * weight;
weightedLcpDelta += result.lcpDelta * weight;
}
return {
overallErrorRateDelta: weightedErrorDelta / totalSampleSize,
overallP95LatencyDelta: weightedLatencyDelta / totalSampleSize,
overallLcpDelta: weightedLcpDelta / totalSampleSize,
significantRegressions: results.filter(r =>
(r.errorRateSignificant && r.errorRateDelta > 0) ||
(r.p95LatencySignificant && r.p95LatencyDelta > 0) ||
(r.lcpSignificant && r.lcpDelta > 0)
).length,
recommendation: this.makeRecommendation(results),
};
}
private makeRecommendation(results: StratumResult[]): 'proceed' | 'hold' | 'rollback' {
// Rollback if ANY stratum shows significant regression in critical metrics
const criticalRegressions = results.filter(r =>
(r.errorRateSignificant && r.errorRateDelta > 0.01) || // >1% error rate increase
(r.lcpSignificant && r.lcpDelta > 500) // >500ms LCP increase
);
if (criticalRegressions.length > 0) {
return 'rollback';
}
// Hold if minor regressions in non-critical strata
const minorRegressions = results.filter(r =>
(r.p95LatencySignificant && r.p95LatencyDelta > 100) // >100ms latency increase
);
if (minorRegressions.length > results.length * 0.2) { // >20% of strata
return 'hold';
}
return 'proceed';
}
}
Production Thresholds (Netflix-scale):
| Metric | Threshold | Action |
|---|---|---|
| Error Rate | >0.05% increase | Immediate rollback |
| LCP | >300ms increase | Rollback |
| INP | >100ms increase | Rollback |
| Hydration Errors | >0.01% occurrence | Rollback |
| P95 API Latency | >200ms increase | Hold, investigate |
| Memory Usage | >50MB increase | Hold, investigate |
| Bundle Size | >100KB increase | Review, no auto-rollback |
Verdict: Don't compare raw metrics. Use stratified sampling + statistical tests + domain-specific thresholds.
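A threshold table like the one above is easy to keep as data rather than scattered conditionals. The sketch below encodes the delta-based rows and returns the most severe action triggered. The metric keys (`errorRatePct`, `lcpMs`, and so on) are made-up names, and the hydration-error row is left out because it is an absolute occurrence rate rather than an increase over stable.

```typescript
// Data-driven threshold evaluation: the most severe triggered action wins.

type Decision = "proceed" | "review" | "hold" | "rollback";

interface ThresholdRule {
  metric: string;      // hypothetical metric key
  maxIncrease: number; // canary minus stable, in the metric's unit
  action: Decision;
}

// Values copied from the table above.
const rules: ThresholdRule[] = [
  { metric: "errorRatePct", maxIncrease: 0.05, action: "rollback" },
  { metric: "lcpMs", maxIncrease: 300, action: "rollback" },
  { metric: "inpMs", maxIncrease: 100, action: "rollback" },
  { metric: "p95ApiLatencyMs", maxIncrease: 200, action: "hold" },
  { metric: "memoryMb", maxIncrease: 50, action: "hold" },
  { metric: "bundleSizeKb", maxIncrease: 100, action: "review" },
];

const severity: Record<Decision, number> = {
  proceed: 0,
  review: 1,
  hold: 2,
  rollback: 3,
};

/** `deltas` maps metric key to (canary - stable); missing metrics are skipped. */
function evaluate(deltas: Record<string, number>): Decision {
  let worst: Decision = "proceed";
  for (const rule of rules) {
    const delta = deltas[rule.metric];
    if (delta === undefined || delta <= rule.maxIncrease) continue;
    if (severity[rule.action] > severity[worst]) {
      worst = rule.action;
    }
  }
  return worst;
}
```

For example, `evaluate({ lcpMs: 350, p95ApiLatencyMs: 250 })` returns "rollback", because the LCP rule outranks the latency rule's "hold".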
Challenge 3: Client-Side State Migrations
The Problem:
You deploy canary v48 which changes localStorage schema:
// v47 (stable)
localStorage.setItem('cart', JSON.stringify({
items: [{ id: 1, qty: 2 }]
}));
// v48 (canary)
localStorage.setItem('cart', JSON.stringify({
version: 2,
items: [{ productId: 1, quantity: 2, addedAt: Date.now() }]
}));
A user loads v47 (stable), adds items to cart, then navigates to a new page which gets routed to v48 (canary). The canary code reads localStorage, finds old schema, crashes.
Why This Happens:
Unlike backend deployments where you can run database migrations atomically, frontend state lives in user's browser across deployments. Canary and stable can read/write the same storage.
Solution 1: Defensive Reads with Schema Versioning
interface CartV1 {
items: Array<{ id: number; qty: number }>;
}
interface CartV2 {
version: 2;
items: Array<{
productId: number;
quantity: number;
addedAt: number;
}>;
}
type Cart = CartV1 | CartV2;
function readCart(): CartV2 {
const raw = localStorage.getItem('cart');
if (!raw) {
return { version: 2, items: [] };
}
try {
const data = JSON.parse(raw) as Cart;
// Check version
if ('version' in data && data.version === 2) {
return data; // Already v2
}
// Migrate v1 → v2
const v1 = data as CartV1;
const migrated: CartV2 = {
version: 2,
items: v1.items.map(item => ({
productId: item.id,
quantity: item.qty,
addedAt: Date.now(), // Best guess
})),
};
// Write migrated version
localStorage.setItem('cart', JSON.stringify(migrated));
return migrated;
} catch (error) {
console.error('Failed to read cart', error);
// Corrupted data - reset
const empty: CartV2 = { version: 2, items: [] };
localStorage.setItem('cart', JSON.stringify(empty));
return empty;
}
}
The Conflict Problem:
User opens two tabs:
- Tab 1: Stable (v47) - writes v1 schema
- Tab 2: Canary (v48) - reads, migrates to v2, writes v2 schema
- Tab 1: Writes again - overwrites with v1 schema
- Tab 2: Reads - sees v1 again, re-migrates
This causes data loss and migration loops.
Solution 2: Versioned Keys
function readCart(version: string): CartV2 {
const key = `cart:${version}`;
const raw = localStorage.getItem(key);
if (raw) {
return JSON.parse(raw);
}
// Try to migrate from previous version
const previousVersion = getPreviousVersion(version);
if (previousVersion) {
const previousKey = `cart:${previousVersion}`;
const previousRaw = localStorage.getItem(previousKey);
if (previousRaw) {
const migrated = migrateCart(JSON.parse(previousRaw), previousVersion, version);
localStorage.setItem(key, JSON.stringify(migrated));
return migrated;
}
}
return { version: 2, items: [] };
}
function writeCart(version: string, cart: CartV2): void {
const key = `cart:${version}`;
localStorage.setItem(key, JSON.stringify(cart));
}
The Storage Explosion Problem:
If you keep versioned keys indefinitely, localStorage fills up (5-10MB limit). You need garbage collection:
function cleanupOldVersions(currentVersion: string): void {
const allKeys = Object.keys(localStorage);
const cartKeys = allKeys.filter(k => k.startsWith('cart:'));
for (const key of cartKeys) {
const [, version] = key.split(':');
if (version !== currentVersion && isOlderThan(version, currentVersion, 7)) {
// Delete versions older than 7 days
localStorage.removeItem(key);
}
}
}
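The cleanup above leaves `isOlderThan` undefined. One possible approach, sketched below, is to skip version comparison entirely and store a write timestamp inside each entry. This assumes the cart writer adds a `savedAt` field (epoch milliseconds), which the earlier snippets do not do, and it treats entries without a readable timestamp as expired, a design choice that trades possible data loss for bounded storage. `StorageLike` covers the subset of the Web Storage API used here.

```typescript
// Age-based garbage collection for `cart:<version>` keys.

interface StorageLike {
  readonly length: number;
  key(index: number): string | null;
  getItem(key: string): string | null;
  removeItem(key: string): void;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

/** Removes stale cart entries other than the current version's; returns removed keys. */
function cleanupOldCarts(
  store: StorageLike,
  currentVersion: string,
  now: number,
  maxAgeMs: number = SEVEN_DAYS_MS
): string[] {
  // Collect keys first: removing while iterating by index would skip entries.
  const keys: string[] = [];
  for (let i = 0; i < store.length; i++) {
    const k = store.key(i);
    if (k !== null && k.startsWith("cart:")) keys.push(k);
  }

  const removed: string[] = [];
  for (const key of keys) {
    if (key === `cart:${currentVersion}`) continue; // never drop the live entry

    let savedAt = 0; // no readable timestamp: treated as expired (assumption)
    try {
      const parsed = JSON.parse(store.getItem(key) || "null");
      if (parsed && typeof parsed.savedAt === "number") savedAt = parsed.savedAt;
    } catch {
      // unparseable entry: keep savedAt = 0 so it gets removed
    }

    if (now - savedAt > maxAgeMs) {
      store.removeItem(key);
      removed.push(key);
    }
  }
  return removed;
}
```

In a browser this could run once at startup as `cleanupOldCarts(localStorage, currentVersion, Date.now())`.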
Solution 3: Backend-Synchronized State (Best)
Don't rely on client storage for critical state. Sync to backend:
interface CachedCart {
  data: CartV2;
  timestamp: number;
  version: string;
}
class CartManager {
  constructor(private version: string, private userId: string) {}
  async loadCart(): Promise<CartV2> {
    // Try local cache first
    const cached = this.readLocalCache();
    if (cached && !this.isStale(cached)) {
      return cached.data; // unwrap the envelope written by writeLocalCache
    }
    // Fetch from backend (source of truth)
    const response = await fetch('/api/cart', {
      headers: { 'X-App-Version': this.version }
    });
    const cart: CartV2 = await response.json();
    // Update local cache
    this.writeLocalCache(cart);
    return cart;
  }
  async updateCart(updates: Partial<CartV2>): Promise<void> {
    // Optimistic update
    const current = await this.loadCart();
    const updated = { ...current, ...updates };
    this.writeLocalCache(updated);
    // Sync to backend
    try {
      const response = await fetch('/api/cart', {
        method: 'PUT',
        headers: {
          'Content-Type': 'application/json',
          'X-App-Version': this.version,
        },
        body: JSON.stringify(updated),
      });
      // fetch only rejects on network failure, so check the HTTP status too
      if (!response.ok) {
        throw new Error(`Cart sync failed with status ${response.status}`);
      }
    } catch (error) {
      // Roll back optimistic update
      this.writeLocalCache(current);
      throw error;
    }
  }
  private readLocalCache(): CachedCart | null {
    const raw = localStorage.getItem('cart:cache');
    return raw ? JSON.parse(raw) : null;
  }
  private writeLocalCache(cart: CartV2): void {
    localStorage.setItem('cart:cache', JSON.stringify({
      data: cart,
      timestamp: Date.now(),
      version: this.version,
    }));
  }
  private isStale(cached: CachedCart): boolean {
    const age = Date.now() - cached.timestamp;
    return age > 60000 || cached.version !== this.version;
  }
}
Verdict: Use versioned schemas + backend sync for critical state. Accept client-only state will occasionally break during canary and require resets.
Canary Analysis System: Automated Decision Making
The entire canary system lives or dies on the analysis pipeline. Humans approve the deploy, but robots must decide progression, because:
- Canary windows are short (5-45 minutes)
- Decisions need to happen every 30-60 seconds
- Metrics are noisy (need statistical significance)
- False positives are expensive (block good deploys)
- False negatives are catastrophic (let bad code reach 100%)
Here's how production systems do it:
Architecture: Real-Time Metrics Pipeline
graph LR
Client[Browser/Device] -->|RUM Beacon| Ingestion[Kafka Ingestion]
CDNLogs[CDN Access Logs] -->|Stream| Ingestion
OriginLogs[Origin Server Logs] -->|Stream| Ingestion
Ingestion --> Flink[Flink Stream Processing]
Flink --> Aggregate[Time-Window Aggregation<br/>30s tumbling windows]
Aggregate --> Stratify[Stratification<br/>by region/device/connection]
Stratify --> TSDB[(InfluxDB/TimescaleDB<br/>Time-Series Storage)]
TSDB --> Analyzer[Canary Analyzer Service]
Analyzer --> StatTests[Statistical Tests<br/>t-test, Mann-Whitney]
StatTests --> Anomaly[Anomaly Detection<br/>IQR, Z-score]
Anomaly --> Threshold[Threshold Evaluation<br/>SLO-based]
Threshold --> Bayes[Bayesian Confidence<br/>Prior + Evidence]
Bayes --> Decision{Decision}
Decision -->|Proceed| Progression[Progressive Rollout<br/>5% → 25% → 50% → 100%]
Decision -->|Hold| Monitor[Continue Monitoring]
Decision -->|Rollback| Rollback[Automated Rollback<br/>+ Incident Creation]
Progression --> UpdateKV[Update KV Store<br/>canary-percentage]
Rollback --> UpdateKV
UpdateKV --> CDNEdge[CDN Edge Workers]
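To make the "30s tumbling windows" stage of the diagram concrete, here is a minimal in-memory sketch of the aggregation step. In the pipeline above this work would run inside the stream processor; the `Beacon` shape, the nearest-rank percentile, and the function names are illustrative assumptions.

```typescript
interface Beacon {
  timestamp: number; // epoch milliseconds
  version: 'canary' | 'stable';
  latencyMs: number;
  isError: boolean;
}

interface WindowStats {
  windowStart: number;
  version: 'canary' | 'stable';
  requests: number;
  errorRate: number;
  p95Latency: number;
}

const WINDOW_MS = 30000;

// Nearest-rank percentile on an ascending-sorted array
function nearestRankPercentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.min(sortedAsc.length - 1, Math.max(0, rank - 1))];
}

// Tumbling windows do not overlap: each beacon falls into exactly one bucket
function aggregateTumbling(beacons: Beacon[]): WindowStats[] {
  const groups = new Map<
    string,
    { windowStart: number; version: Beacon['version']; latencies: number[]; errors: number }
  >();
  for (const b of beacons) {
    const windowStart = Math.floor(b.timestamp / WINDOW_MS) * WINDOW_MS;
    const key = `${windowStart}|${b.version}`;
    let group = groups.get(key);
    if (!group) {
      group = { windowStart, version: b.version, latencies: [], errors: 0 };
      groups.set(key, group);
    }
    group.latencies.push(b.latencyMs);
    if (b.isError) group.errors += 1;
  }
  const out: WindowStats[] = [];
  groups.forEach(g => {
    const sorted = g.latencies.slice().sort((x, y) => x - y);
    out.push({
      windowStart: g.windowStart,
      version: g.version,
      requests: sorted.length,
      errorRate: g.errors / sorted.length,
      p95Latency: nearestRankPercentile(sorted, 95),
    });
  });
  return out;
}
```

Each output row corresponds to one point that the analyzer later reads back from the time-series store; stratification would add region, device and connection type to the grouping key.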
Implementation: Canary Analyzer Service
interface MetricSnapshot {
timestamp: number;
version: 'canary' | 'stable';
stratum: {
region: string;
deviceType: string;
connectionType: string;
};
metrics: {
requests: number;
errors: number;
errorRate: number;
p50Latency: number;
p95Latency: number;
p99Latency: number;
lcp: number;
inp: number;
cls: number;
hydrationErrors: number;
jsErrors: number;
};
}
interface CanaryDecision {
timestamp: number;
decision: 'proceed' | 'hold' | 'rollback';
confidence: number; // 0-1
reason: string;
metrics: {
errorRateDelta: number;
latencyDelta: number;
lcpDelta: number;
};
recommendation: {
nextPercentage?: number; // If proceeding
holdDuration?: number; // If holding
rollbackReason?: string; // If rolling back
};
}
class CanaryAnalyzerService {
private tsdb: TimeSeriesDB;
private config: CanaryConfig;
constructor(tsdb: TimeSeriesDB, config: CanaryConfig) {
this.tsdb = tsdb;
this.config = config;
}
async analyze(deploymentId: string): Promise<CanaryDecision> {
// Fetch metrics from last 5 minutes
const endTime = Date.now();
const startTime = endTime - (5 * 60 * 1000);
const canaryMetrics = await this.tsdb.query({
measurement: 'frontend_metrics',
tags: {
deployment_id: deploymentId,
version: 'canary',
},
timeRange: [startTime, endTime],
});
const stableMetrics = await this.tsdb.query({
measurement: 'frontend_metrics',
tags: {
version: 'stable',
},
timeRange: [startTime, endTime],
});
// Check minimum sample size
const canaryRequests = canaryMetrics.reduce((sum, m) => sum + m.metrics.requests, 0);
if (canaryRequests < this.config.minSampleSize) {
return {
timestamp: Date.now(),
decision: 'hold',
confidence: 0,
reason: `Insufficient sample size: ${canaryRequests} < ${this.config.minSampleSize}`,
metrics: { errorRateDelta: 0, latencyDelta: 0, lcpDelta: 0 },
recommendation: {
holdDuration: 60000, // Wait 1 more minute
},
};
}
// Stratify metrics
const canaryByStrata = this.stratify(canaryMetrics);
const stableByStrata = this.stratify(stableMetrics);
// Run analysis per stratum
const stratumResults: StratumAnalysis[] = [];
for (const stratum of Object.keys(canaryByStrata)) {
const canaryData = canaryByStrata[stratum];
const stableData = stableByStrata[stratum];
if (!stableData) continue;
const result = await this.analyzeStratum(canaryData, stableData, stratum);
stratumResults.push(result);
}
// Run anomaly detection
const anomalies = await this.detectAnomalies(deploymentId, canaryMetrics);
// Decide from the per-stratum results; makeBayesianDecision computes the
// sample-weighted aggregate deltas itself
return this.makeBayesianDecision(stratumResults, anomalies);
}
private stratify(metrics: MetricSnapshot[]): Record<string, MetricSnapshot[]> {
const stratified: Record<string, MetricSnapshot[]> = {};
for (const metric of metrics) {
const key = `${metric.stratum.region}:${metric.stratum.deviceType}:${metric.stratum.connectionType}`;
if (!stratified[key]) {
stratified[key] = [];
}
stratified[key].push(metric);
}
return stratified;
}
private async analyzeStratum(
canaryMetrics: MetricSnapshot[],
stableMetrics: MetricSnapshot[],
stratum: string
): Promise<StratumAnalysis> {
// Extract metric arrays
const canaryErrors = canaryMetrics.map(m => m.metrics.errorRate);
const stableErrors = stableMetrics.map(m => m.metrics.errorRate);
const canaryP95 = canaryMetrics.map(m => m.metrics.p95Latency);
const stableP95 = stableMetrics.map(m => m.metrics.p95Latency);
const canaryLCP = canaryMetrics.map(m => m.metrics.lcp);
const stableLCP = stableMetrics.map(m => m.metrics.lcp);
// Statistical tests
const errorTest = this.welchTTest(canaryErrors, stableErrors);
const latencyTest = this.welchTTest(canaryP95, stableP95);
const lcpTest = this.welchTTest(canaryLCP, stableLCP);
// Calculate effect sizes (Cohen's d)
const errorEffectSize = this.cohensD(canaryErrors, stableErrors);
const latencyEffectSize = this.cohensD(canaryP95, stableP95);
const lcpEffectSize = this.cohensD(canaryLCP, stableLCP);
return {
stratum,
sampleSize: canaryMetrics.reduce((sum, m) => sum + m.metrics.requests, 0),
errorRate: {
canaryMean: this.mean(canaryErrors),
stableMean: this.mean(stableErrors),
delta: errorTest.delta,
pValue: errorTest.pValue,
significant: errorTest.pValue < 0.05,
effectSize: errorEffectSize,
},
p95Latency: {
canaryMean: this.mean(canaryP95),
stableMean: this.mean(stableP95),
delta: latencyTest.delta,
pValue: latencyTest.pValue,
significant: latencyTest.pValue < 0.05,
effectSize: latencyEffectSize,
},
lcp: {
canaryMean: this.mean(canaryLCP),
stableMean: this.mean(stableLCP),
delta: lcpTest.delta,
pValue: lcpTest.pValue,
significant: lcpTest.pValue < 0.05,
effectSize: lcpEffectSize,
},
};
}
private welchTTest(sample1: number[], sample2: number[]): TTestResult {
const mean1 = this.mean(sample1);
const mean2 = this.mean(sample2);
const delta = mean1 - mean2;
const variance1 = this.variance(sample1);
const variance2 = this.variance(sample2);
const n1 = sample1.length;
const n2 = sample2.length;
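const standardError = Math.sqrt(variance1 / n1 + variance2 / n2);
// Degenerate inputs: the test is undefined with fewer than two points per sample
if (n1 < 2 || n2 < 2) {
return { delta, pValue: 1, tStatistic: 0 };
}
// Zero standard error means both samples are constant; any difference is decisive
if (standardError === 0) {
return { delta, pValue: delta === 0 ? 1 : 0, tStatistic: delta === 0 ? 0 : Math.sign(delta) * Infinity };
}
const tStatistic = delta / standardError;
// Welch-Satterthwaite degrees of freedom
const df = Math.pow(variance1 / n1 + variance2 / n2, 2) /
(Math.pow(variance1 / n1, 2) / (n1 - 1) + Math.pow(variance2 / n2, 2) / (n2 - 1));
// Two-sided p-value (use a stats library in production)
const pValue = this.tDistributionCDF(Math.abs(tStatistic), df);
return { delta, pValue, tStatistic };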
}
private cohensD(sample1: number[], sample2: number[]): number {
const mean1 = this.mean(sample1);
const mean2 = this.mean(sample2);
const variance1 = this.variance(sample1);
const variance2 = this.variance(sample2);
const n1 = sample1.length;
const n2 = sample2.length;
// Pooled standard deviation
const pooledSD = Math.sqrt(
((n1 - 1) * variance1 + (n2 - 1) * variance2) / (n1 + n2 - 2)
);
// Avoid NaN/Infinity when both samples have zero variance
return pooledSD > 0 ? (mean1 - mean2) / pooledSD : 0;
}
private async detectAnomalies(
deploymentId: string,
canaryMetrics: MetricSnapshot[]
): Promise<Anomaly[]> {
const anomalies: Anomaly[] = [];
// Get historical baseline (last 7 days)
const baseline = await this.tsdb.query({
measurement: 'frontend_metrics',
tags: { version: 'stable' },
timeRange: [Date.now() - (7 * 24 * 60 * 60 * 1000), Date.now()],
aggregation: 'mean',
});
// Calculate IQR for each metric
const errorRates = baseline.map(m => m.metrics.errorRate).sort((a, b) => a - b);
const errorIQR = this.calculateIQR(errorRates);
const latencies = baseline.map(m => m.metrics.p95Latency).sort((a, b) => a - b);
const latencyIQR = this.calculateIQR(latencies);
// Check canary metrics against baseline
for (const metric of canaryMetrics) {
// Error rate anomaly
if (metric.metrics.errorRate > errorIQR.q3 + 1.5 * errorIQR.iqr) {
anomalies.push({
type: 'error_rate',
severity: 'critical',
value: metric.metrics.errorRate,
threshold: errorIQR.q3 + 1.5 * errorIQR.iqr,
message: `Error rate ${metric.metrics.errorRate.toFixed(4)} exceeds threshold`,
});
}
// Latency anomaly
if (metric.metrics.p95Latency > latencyIQR.q3 + 1.5 * latencyIQR.iqr) {
anomalies.push({
type: 'latency',
severity: 'warning',
value: metric.metrics.p95Latency,
threshold: latencyIQR.q3 + 1.5 * latencyIQR.iqr,
message: `P95 latency ${metric.metrics.p95Latency}ms exceeds threshold`,
});
}
// Hydration error spike
if (metric.metrics.hydrationErrors > 0) {
anomalies.push({
type: 'hydration_error',
severity: 'critical',
value: metric.metrics.hydrationErrors,
threshold: 0,
message: `Hydration errors detected: ${metric.metrics.hydrationErrors}`,
});
}
}
return anomalies;
}
private calculateIQR(sortedValues: number[]): { q1: number; q3: number; iqr: number } {
const n = sortedValues.length;
const q1Index = Math.floor(n * 0.25);
const q3Index = Math.floor(n * 0.75);
const q1 = sortedValues[q1Index];
const q3 = sortedValues[q3Index];
return { q1, q3, iqr: q3 - q1 };
}
private makeBayesianDecision(
stratumResults: StratumAnalysis[],
anomalies: Anomaly[]
): CanaryDecision {
// Prior probability (based on historical rollback rate).
// Note: the updates below are rule-based overrides of this prior, not a formal Bayes update.
const priorRollbackRate = 0.08; // 8% of canaries get rolled back
let posteriorRollbackProb = priorRollbackRate;
// Update based on statistical tests
const criticalRegressions = stratumResults.filter(r =>
(r.errorRate.significant && r.errorRate.delta > this.config.thresholds.errorRate) ||
(r.lcp.significant && r.lcp.delta > this.config.thresholds.lcp)
);
if (criticalRegressions.length > 0) {
// Strong evidence of regression
posteriorRollbackProb = 0.95;
}
// Update based on anomalies
const criticalAnomalies = anomalies.filter(a => a.severity === 'critical');
if (criticalAnomalies.length > 0) {
posteriorRollbackProb = Math.max(posteriorRollbackProb, 0.9);
}
// Effect size consideration
const largeEffectSizes = stratumResults.filter(r =>
Math.abs(r.errorRate.effectSize) > 0.8 || // Large effect
Math.abs(r.lcp.effectSize) > 0.8
);
if (largeEffectSizes.length > 0 && posteriorRollbackProb < 0.5) {
posteriorRollbackProb = 0.5; // Moderate confidence
}
// Make decision
let decision: 'proceed' | 'hold' | 'rollback';
let reason: string;
const recommendation: CanaryDecision['recommendation'] = {};
if (posteriorRollbackProb > 0.7) {
decision = 'rollback';
reason = `High rollback probability (${posteriorRollbackProb.toFixed(2)}). `;
if (criticalAnomalies.length > 0) {
reason += `Critical anomalies: ${criticalAnomalies.map(a => a.message).join(', ')}`;
} else {
reason += `Significant regressions in ${criticalRegressions.length} strata`;
}
recommendation.rollbackReason = reason;
} else if (posteriorRollbackProb > 0.3 || largeEffectSizes.length > 0) {
decision = 'hold';
reason = `Moderate rollback probability (${posteriorRollbackProb.toFixed(2)}). Collecting more data.`;
recommendation.holdDuration = 120000; // Hold for 2 minutes
} else {
decision = 'proceed';
reason = `Low rollback probability (${posteriorRollbackProb.toFixed(2)}). Metrics within acceptable range.`;
recommendation.nextPercentage = this.calculateNextPercentage(stratumResults);
}
// Calculate aggregate deltas
// Fall back to 1 so the weighted deltas are 0 (not NaN) when no strata matched
const totalSampleSize = stratumResults.reduce((sum, r) => sum + r.sampleSize, 0) || 1;
const weightedErrorDelta = stratumResults.reduce(
(sum, r) => sum + r.errorRate.delta * r.sampleSize,
0
) / totalSampleSize;
const weightedLatencyDelta = stratumResults.reduce(
(sum, r) => sum + r.p95Latency.delta * r.sampleSize,
0
) / totalSampleSize;
const weightedLcpDelta = stratumResults.reduce(
(sum, r) => sum + r.lcp.delta * r.sampleSize,
0
) / totalSampleSize;
return {
timestamp: Date.now(),
decision,
confidence: 1 - posteriorRollbackProb,
reason,
metrics: {
errorRateDelta: weightedErrorDelta,
latencyDelta: weightedLatencyDelta,
lcpDelta: weightedLcpDelta,
},
recommendation,
};
}
private calculateNextPercentage(stratumResults: StratumAnalysis[]): number {
// Conservative progression if any warnings
const warnings = stratumResults.filter(r =>
(r.errorRate.significant && r.errorRate.delta > 0) ||
(r.p95Latency.significant && r.p95Latency.delta > 100) ||
(r.lcp.significant && r.lcp.delta > 200)
);
if (warnings.length > 0) {
return 25; // Go to 25% cautiously
}
// Aggressive progression if clear improvements
const improvements = stratumResults.filter(r =>
(r.errorRate.significant && r.errorRate.delta < 0) ||
(r.p95Latency.significant && r.p95Latency.delta < -50) ||
(r.lcp.significant && r.lcp.delta < -100)
);
if (improvements.length > stratumResults.length * 0.5) {
return 100; // Go to 100% quickly
}
// Default: gradual progression
return 50;
}
private mean(values: number[]): number {
return values.reduce((sum, v) => sum + v, 0) / values.length;
}
private variance(values: number[]): number {
const mean = this.mean(values);
return values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / (values.length - 1);
}
private tDistributionCDF(t: number, df: number): number {
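// Simplified: despite the name, this returns a two-sided p-value from a normal
// approximation and ignores df, so it is only reasonable for large df.
// Use a statistics library such as jStat in production.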
return 2 * (1 - this.normalCDF(t));
}
private normalCDF(z: number): number {
return 0.5 * (1 + this.erf(z / Math.sqrt(2)));
}
private erf(x: number): number {
// Abramowitz and Stegun approximation
const sign = x >= 0 ? 1 : -1;
x = Math.abs(x);
const a1 = 0.254829592;
const a2 = -0.284496736;
const a3 = 1.421413741;
const a4 = -1.453152027;
const a5 = 1.061405429;
const p = 0.3275911;
const t = 1 / (1 + p * x);
const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);
return sign * y;
}
}
interface CanaryConfig {
minSampleSize: number;
thresholds: {
errorRate: number;
latency: number;
lcp: number;
inp: number;
};
}
interface StratumAnalysis {
stratum: string;
sampleSize: number;
errorRate: MetricComparison;
p95Latency: MetricComparison;
lcp: MetricComparison;
}
interface MetricComparison {
canaryMean: number;
stableMean: number;
delta: number;
pValue: number;
significant: boolean;
effectSize: number;
}
interface TTestResult {
delta: number;
pValue: number;
tStatistic: number;
}
interface Anomaly {
type: string;
severity: 'critical' | 'warning';
value: number;
threshold: number;
message: string;
}
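The pipeline diagram lists Mann-Whitney alongside the t-test, but the analyzer above only implements Welch's t-test. Latency and LCP samples are typically skewed, which is where a rank-based test tends to be more robust. The following is a minimal sketch of the Mann-Whitney U test using the large-sample normal approximation without a tie correction; the function names are illustrative and this is not presented as the analyzer's actual implementation.

```typescript
// Abramowitz and Stegun approximation of the error function
function erfApprox(x: number): number {
  const sign = x >= 0 ? 1 : -1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erfApprox(z / Math.sqrt(2)));
}

interface MannWhitneyResult {
  u: number;      // U statistic for the first sample
  z: number;      // normal-approximation z-score
  pValue: number; // two-sided
}

function mannWhitneyU(a: number[], b: number[]): MannWhitneyResult {
  const combined = a
    .map(v => ({ v, fromA: true }))
    .concat(b.map(v => ({ v, fromA: false })))
    .sort((x, y) => x.v - y.v);

  // Assign 1-based ranks, giving tied values their average rank
  const ranks: number[] = [];
  let i = 0;
  while (i < combined.length) {
    let j = i;
    while (j + 1 < combined.length && combined[j + 1].v === combined[i].v) j++;
    const avgRank = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) ranks[k] = avgRank;
    i = j + 1;
  }

  let rankSumA = 0;
  combined.forEach((item, idx) => {
    if (item.fromA) rankSumA += ranks[idx];
  });

  const n1 = a.length;
  const n2 = b.length;
  const u = rankSumA - (n1 * (n1 + 1)) / 2;
  const meanU = (n1 * n2) / 2;
  const sdU = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12); // no tie correction
  const z = sdU > 0 ? (u - meanU) / sdU : 0;
  const pValue = Math.min(1, 2 * (1 - normalCdf(Math.abs(z))));
  return { u, z, pValue };
}
```

Swapping this in for `welchTTest` on the latency and LCP arrays would keep the rest of the per-stratum comparison unchanged, since only the p-value feeds the `significant` flag.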
Decision Loop Cadence:
┌─────────────────────────────────────────────────────────────┐
│ Canary Timeline (Progressive Rollout) │
└─────────────────────────────────────────────────────────────┘
T+0m Deploy canary (5% traffic)
├─ Edge workers updated
├─ Metrics start flowing
└─ Analysis: HOLD (insufficient data)
T+2m First decision point
├─ Sample size: 15K requests
├─ Analysis: PROCEED (no regressions)
└─ Action: Increase to 25%
T+5m Second decision point
├─ Sample size: 75K requests
├─ Analysis: HOLD (latency spike in EU)
└─ Action: Hold at 25%, investigate
T+8m Third decision point
├─ Sample size: 180K requests
├─ Analysis: PROCEED (spike was CDN issue, resolved)
└─ Action: Increase to 50%
T+12m Fourth decision point
├─ Sample size: 450K requests
├─ Analysis: PROCEED (metrics within bounds)
└─ Action: Increase to 100%
T+15m Canary complete
├─ Total requests: 1.2M
├─ Final error rate delta: +0.003% (acceptable)
└─ Mark as stable, promote to production
Rollback Scenario:
T+0m Deploy canary (5% traffic)
T+2m First decision point
├─ Sample size: 15K requests
├─ Analysis: ROLLBACK
│ ├─ Error rate: 0.24% (stable: 0.08%)
│ ├─ Delta: +0.16% (threshold: +0.05%)
│ └─ Confidence: 0.95
└─ Action: IMMEDIATE ROLLBACK
T+2m:30s Rollback initiated
├─ Update KV: canary-percentage = 0
├─ Edge workers stop routing to canary
├─ CDN cache purge: /dist-v48/*
└─ Incident created in PagerDuty
T+3m Rollback complete
├─ 100% traffic on stable (v47)
├─ Canary origin scaled down
└─ Engineering team notified
Verdict: Automated analysis must run every 30-60 seconds with Bayesian decision-making and anomaly detection. A false negative (a missed regression) is worse than a false positive (a blocked good deploy).
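The analyzer shown earlier sets its posterior rollback probability with fixed values once a rule fires. For comparison, a literal "prior + evidence" update can be sketched with a Beta-Binomial model on raw error counts: each arm's error rate gets a Beta posterior, and the decision input is the probability that the canary's rate exceeds the stable rate by more than the threshold. The uniform Beta(1, 1) prior, the normal approximation to the difference of the two posteriors, and all names below are assumptions for illustration.

```typescript
interface ArmCounts {
  requests: number;
  errors: number;
}

// Abramowitz and Stegun approximation of the error function
function erfApprox(x: number): number {
  const sign = x >= 0 ? 1 : -1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erfApprox(z / Math.sqrt(2)));
}

// Posterior mean and variance of the error rate under a Beta(priorAlpha, priorBeta) prior
function betaPosterior(counts: ArmCounts, priorAlpha: number = 1, priorBeta: number = 1) {
  const alpha = priorAlpha + counts.errors;
  const beta = priorBeta + counts.requests - counts.errors;
  const total = alpha + beta;
  return {
    mean: alpha / total,
    variance: (alpha * beta) / (total * total * (total + 1)),
  };
}

// P(canaryRate - stableRate > threshold), approximating the difference as normal.
// The approximation is rough when error counts are very small.
function probabilityOfRegression(canary: ArmCounts, stable: ArmCounts, threshold: number): number {
  const c = betaPosterior(canary);
  const s = betaPosterior(stable);
  const diffMean = c.mean - s.mean;
  const diffSd = Math.sqrt(c.variance + s.variance);
  return 1 - normalCdf((threshold - diffMean) / diffSd);
}

// Counts modeled on the rollback scenario: 0.24% vs 0.08% error rate, +0.05% threshold.
// The stable request volume is a hypothetical figure.
const pRegression = probabilityOfRegression(
  { requests: 15000, errors: 36 },
  { requests: 285000, errors: 228 },
  0.0005
);
console.log(`P(regression beyond threshold) = ${pRegression.toFixed(3)}`);
```

A value like this could replace the hardcoded 0.95 and 0.9 assignments, with the same 0.7 and 0.3 cut-offs deciding between rollback, hold and proceed.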
Progressive Rollout Strategies
Once the analyzer says "proceed," you need to gradually increase canary traffic. The progression strategy depends on risk tolerance and confidence.
Strategy 1: Percentage-Based (Standard)
5% → 25% → 50% → 100%
Timing:
- Hold at 5% for 5-10 minutes (minimum viable sample)
- Hold at 25% for 5-10 minutes (catch medium-impact bugs)
- Hold at 50% for 5-10 minutes (validate at scale)
- Jump to 100% if all clear
Implementation:
class ProgressiveRolloutController {
private currentPercentage: number = 5;
// Timestamp of the most recent stage change; reset on every progression so
// each stage gets its own minimum hold time
private stageStartTime: number = Date.now();
async progressToNextStage(analysis: CanaryDecision): Promise<void> {
if (analysis.decision === 'rollback') {
await this.rollback(analysis.reason);
return;
}
if (analysis.decision === 'hold') {
console.log(`Holding at ${this.currentPercentage}%: ${analysis.reason}`);
return;
}
// Proceed to next stage
const nextPercentage = this.calculateNextPercentage();
if (nextPercentage === this.currentPercentage) {
console.log(`Already at ${this.currentPercentage}%, no progression`);
return;
}
console.log(`Progressing from ${this.currentPercentage}% to ${nextPercentage}%`);
await this.updateCanaryPercentage(nextPercentage);
this.currentPercentage = nextPercentage;
this.stageStartTime = Date.now(); // Restart the minimum-duration clock
if (nextPercentage === 100) {
await this.finalizeDeployment();
}
}
private calculateNextPercentage(): number {
const stages = [5, 25, 50, 100];
const currentIndex = stages.indexOf(this.currentPercentage);
if (currentIndex === -1 || currentIndex === stages.length - 1) {
return this.currentPercentage;
}
const timeAtCurrentStage = Date.now() - this.stageStartTime;
const minDuration = 5 * 60 * 1000; // 5 minutes
if (timeAtCurrentStage < minDuration) {
return this.currentPercentage; // Not ready to progress
}
return stages[currentIndex + 1];
}
private async updateCanaryPercentage(percentage: number): Promise<void> {
// Update KV store (propagates to all edge workers)
await this.kv.put('canary-percentage', percentage.toString());
// Log event
await this.logger.info('canary_progression', {
from: this.currentPercentage,
to: percentage,
timestamp: Date.now(),
});
// Send metrics
await this.metrics.gauge('canary.percentage', percentage);
}
private async rollback(reason: string): Promise<void> {
console.error(`Rolling back canary: ${reason}`);
// Set percentage to 0 (disables canary routing)
await this.updateCanaryPercentage(0);
this.currentPercentage = 0; // 0 is not a rollout stage, so no further progression occurs
// Purge canary assets from CDN
await this.purgeCDNCache('/dist-v48/*');
// Create incident
await this.createIncident({
title: 'Canary Rollback: v48',
severity: 'high',
description: reason,
});
// Notify team
await this.notifyTeam('Canary rolled back', reason);
}
private async finalizeDeployment(): Promise<void> {
console.log('Canary successful, finalizing deployment');
// Update stable version pointer
await this.kv.put('stable-version', 'v48');
// Purge old stable assets
await this.purgeCDNCache('/dist-v47/*');
// Update deployment status
await this.db.updateDeployment({
id: this.deploymentId,
status: 'completed',
completedAt: Date.now(),
});
// Notify team
await this.notifyTeam('Deployment complete', 'Canary reached 100% successfully');
}
}
Strategy 2: Region-Based (Geographic Rollout)
US-West → US-East → EU → APAC → Global
Why This Works:
- Different regions have different peak times (natural load distribution)
- Can catch region-specific bugs (locale, timezone, network)
- Limits blast radius (if US-West fails, APAC is unaffected)
Implementation:
class RegionBasedRollout {
private regions = ['us-west', 'us-east', 'eu-west', 'eu-central', 'ap-southeast', 'ap-northeast'];
private currentRegionIndex = 0;
async progressToNextRegion(analysis: CanaryDecision): Promise<void> {
if (analysis.decision === 'rollback') {
await this.rollback(analysis.reason);
return;
}
if (analysis.decision === 'hold') {
console.log(`Holding in region ${this.getCurrentRegion()}: ${analysis.reason}`);
return;
}
// Mark current region as complete
await this.markRegionComplete(this.getCurrentRegion());
// Move to next region
this.currentRegionIndex++;
if (this.currentRegionIndex >= this.regions.length) {
await this.finalizeGlobalRollout();
return;
}
const nextRegion = this.getCurrentRegion();
console.log(`Starting canary in region: ${nextRegion}`);
await this.enableRegion(nextRegion);
}
private async enableRegion(region: string): Promise<void> {
// Update KV with region-specific routing
await this.kv.put(`canary-enabled:${region}`, 'true');
// Edge workers check this key
// If enabled for region, route to canary
}
private getCurrentRegion(): string {
return this.regions[this.currentRegionIndex];
}
}
Edge Worker Implementation:
export default {
async fetch(request: Request, env: Env): Promise<Response> {
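// Note: on Cloudflare, request.cf.region is the first-level subdivision of the
// client IP (for example "Texas"), not a cloud-style region id. Derive the
// rollout region ('us-west', 'eu-west', ...) from request.cf.colo or
// request.cf.continent instead. toRolloutRegion is a hypothetical mapping helper.
const clientRegion = toRolloutRegion(request.cf) ?? 'unknown';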
// Check if canary is enabled for this region
const canaryEnabled = await env.KV.get(`canary-enabled:${clientRegion}`);
if (canaryEnabled !== 'true') {
// Route to stable
return fetchStableOrigin(request);
}
// Canary enabled for region - do percentage-based split
const percentage = parseInt((await env.KV.get('canary-percentage')) || '50', 10);
const userId = getUserId(request);
const bucket = hash(userId) % 100;
if (bucket < percentage) {
return fetchCanaryOrigin(request);
}
return fetchStableOrigin(request);
}
};
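The edge worker above relies on `hash(userId) % 100` without defining `hash`. One way to get stable, sticky buckets is a small non-cryptographic hash such as 32-bit FNV-1a, sketched below. The salt and the function names are assumptions; the hash runs over UTF-16 code units rather than bytes, which is adequate for bucketing but not a canonical FNV-1a over UTF-8.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units
function fnv1a(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force unsigned 32-bit
}

// Map a user to a stable bucket in [0, 99]. The salt lets a new rollout
// reshuffle users so the same people are not always first in line.
function bucketFor(userId: string, salt: string = 'canary'): number {
  return fnv1a(`${salt}:${userId}`) % 100;
}

function isInCanary(userId: string, percentage: number): boolean {
  return bucketFor(userId) < percentage;
}
```

Because the bucket is fixed per user, raising the percentage from 5 to 25 only adds users to the canary; nobody who was already on it flips back to stable mid-session.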
Strategy 3: User-Segment-Based (Cohort Rollout)
Internal → Beta Users → Premium → Free
Why This Works:
- Internal users (employees) catch bugs first
- Beta users opt in to instability
- Premium users receive the canary only after internal and beta cohorts have validated it
- Free users, the largest cohort, come last, once earlier cohorts show no regressions
Implementation:
class CohortBasedRollout {
async determineCanaryEligibility(user: User): Promise<boolean> {
// No stage set means no canary is in flight, so nobody is eligible
const currentStage = await this.kv.get('canary-stage');
if (!currentStage) {
return false;
}
const isInternal = user.email.endsWith('@company.com');
const isBeta = user.betaProgram === true;
const isPremium = user.tier === 'premium' || user.tier === 'enterprise';
// Stages are cumulative: each stage keeps every earlier cohort on the canary
switch (currentStage) {
case 'internal':
return isInternal;
case 'beta':
return isInternal || isBeta;
case 'premium':
return isInternal || isBeta || isPremium;
case 'free':
return true; // All users
default:
return false;
}
}
}
Edge Worker Integration:
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const userId = getUserIdFromCookie(request);
if (!userId) {
// Anonymous user - use percentage-based routing
return percentageBasedRouting(request, env);
}
// Fetch user data (cached at edge)
const user = await fetchUserData(userId, env);
if (!user) {
return percentageBasedRouting(request, env);
}
// Check canary eligibility
const eligible = await determineCanaryEligibility(user, env);
if (eligible) {
return fetchCanaryOrigin(request, env);
}
return fetchStableOrigin(request, env);
}
};
async function determineCanaryEligibility(user: any, env: Env): Promise<boolean> {
const stage = await env.KV.get('canary-stage');
const isInternal = user.email?.endsWith('@company.com') ?? false;
const isBeta = user.betaProgram === true;
const isPremium = ['premium', 'enterprise'].includes(user.tier);
// Stages are cumulative: moving to a later stage must not drop earlier cohorts
switch (stage) {
case 'internal':
return isInternal;
case 'beta':
return isInternal || isBeta;
case 'premium':
return isInternal || isBeta || isPremium;
case 'free':
return true;
default:
return false; // No canary in flight
}
}
Comparison Table
| Strategy | Blast Radius | Detection Speed | Complexity | Best For |
|---|---|---|---|---|
| Percentage-Based | 5-50% of users | Fast (5-10 min) | Low | Standard deploys, high traffic apps |
| Region-Based | One region at a time | Medium (15-30 min per region) | Medium | Global apps with regional isolation |
| Cohort-Based | Specific user segments | Slow (hours to days) | High | B2B SaaS, tiered products |
| Hybrid (Cohort + Percentage) | Controlled subsets | Fast within cohort | High | Enterprise apps with beta programs |
Verdict: Use percentage-based for most deploys, region-based for global apps, cohort-based for high-stakes enterprise products.
Rollback Mechanisms and Timing
When canary analysis detects a regression, the rollback must happen in <60 seconds. Here's how production systems do it.
Atomic Rollback: KV Store Pointer Swap
class AtomicRollbackController {
async rollback(reason: string): Promise<void> {
const startTime = Date.now();
console.error(`[ROLLBACK INITIATED] ${reason}`);
// Step 1: Disable canary routing (atomic operation)
await this.kv.put('canary-percentage', '0');
// Step 2: Purge canary assets from CDN (parallel)
const purgePromises = [
this.purgeCDN('/dist-v48/*'),
this.purgeCDN('/api/v48/*'),
];
await Promise.all(purgePromises);
// Step 3: Scale down canary origin (don't wait)
this.scaleDownCanary().catch(err => {
console.error('Failed to scale down canary:', err);
});
// Step 4: Create incident
await this.createIncident({
title: `Canary Rollback: ${this.deploymentId}`,
severity: 'high',
description: reason,
tags: ['canary', 'rollback', 'automated'],
});
// Step 5: Notify team (Slack + PagerDuty)
await this.notifyTeam(reason);
const duration = Date.now() - startTime;
console.log(`[ROLLBACK COMPLETE] Duration: ${duration}ms`);
// Metrics
await this.metrics.increment('canary.rollback', 1, {
deployment: this.deploymentId,
reason: this.categorizeReason(reason),
});
await this.metrics.histogram('canary.rollback_duration', duration);
}
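private async purgeCDN(pattern: string): Promise<void> {
// Cloudflare example. The `files` field takes exact URLs and does not expand
// wildcards, so a path pattern is purged by prefix instead. Prefix purge
// availability depends on the Cloudflare plan; this.hostname is an assumed field.
const prefix = `${this.hostname}${pattern.replace(/\*$/, '')}`;
const response = await fetch(
`https://api.cloudflare.com/client/v4/zones/${this.zoneId}/purge_cache`,
{
method: 'POST',
headers: {
'Authorization': `Bearer ${this.cfApiToken}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
prefixes: [prefix],
}),
}
);
if (!response.ok) {
throw new Error(`CDN purge failed: ${response.status} ${response.statusText}`);
}
}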
private categorizeReason(reason: string): string {
if (reason.includes('error rate')) return 'error_rate';
if (reason.includes('latency')) return 'latency';
if (reason.includes('hydration')) return 'hydration';
if (reason.includes('anomaly')) return 'anomaly';
return 'unknown';
}
}
The CDN Propagation Problem
Challenge: Even after you set canary-percentage = 0, edge workers at 300+ PoPs need to read the updated value. KV stores have eventual consistency (30-60 seconds).
Solution 1: Pessimistic Rollback (Purge Everything)
async function emergencyRollback(): Promise<void> {
// Nuclear option: purge ALL HTML from CDN
await purgeCDN('*.html');
// This forces all edge workers to fetch fresh HTML from origin
// Origin will render with stable version
// Downside: Cache hit rate drops to 0%, origin load spikes
}
Solution 2: Edge Worker Cache Bypass
// Cloudflare Worker
export default {
async fetch(request: Request, env: Env): Promise<Response> {
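// Check KV with the shortest edge-cache TTL the platform allows.
// Cloudflare KV rejects very small cacheTtl values (the documented minimum
// has been in the 30-60 second range), so a 5-second TTL is not possible here.
const canaryPercentage = await env.KV.get('canary-percentage', {
cacheTtl: 30, // Assumed platform minimum; verify against the current KV docs
});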
// If canary disabled, bypass all canary logic
if (canaryPercentage === '0') {
return fetchStableOrigin(request);
}
// Normal routing logic
// ...
}
};
Solution 3: Push-Based Invalidation
class EdgeInvalidationService {
async broadcastRollback(deploymentId: string): Promise<void> {
// Use pub/sub to notify all edge workers
await this.pubsub.publish('rollback', {
deployment: deploymentId,
timestamp: Date.now(),
});
// Edge workers subscribe to this channel
// They immediately disable canary routing
}
}
// In edge worker
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// Check local in-memory flag (updated via pub/sub)
if (env.ROLLBACK_ACTIVE) {
return fetchStableOrigin(request);
}
// Normal routing
// ...
}
};
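// Pub/sub listener (pseudocode: assumes an edge runtime with long-lived
// processes and shared mutable state. Request-scoped runtimes such as
// Cloudflare Workers cannot hold a subscription like this and would need to
// read the rollback flag from a store on each request instead.)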
env.PUBSUB.subscribe('rollback', (message) => {
console.log(`Rollback received for deployment ${message.deployment}`);
env.ROLLBACK_ACTIVE = true;
});
Rollback Timing Breakdown:
T+0s Analyzer detects regression
└─ Decision: ROLLBACK
T+0.5s Update KV store (canary-percentage = 0)
└─ Atomic write completes
T+1s Edge workers that re-read KV start seeing the updated value
└─ New requests route to stable there; other PoPs follow as cached reads expire (up to 30-60s)
T+2s CDN cache purge initiated (parallel)
├─ HTML: /dist-v48/*.html
├─ JS: /dist-v48/*.js
└─ CSS: /dist-v48/*.css
T+30s CDN purge propagated globally
└─ All edges serve stable assets
T+60s Rollback complete
├─ 100% traffic on stable
├─ Incident created
└─ Team notified
The Hidden Cost:
During rollback, some users will still hit canary for 30-60 seconds. This is unavoidable due to CDN propagation delays.
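A back-of-the-envelope estimate makes this cost concrete. The sketch below gives an upper bound that assumes no edge location picks up the change until the propagation window ends; the function name and the traffic figure are hypothetical.

```typescript
// Upper bound on requests served by the canary while a rollback propagates
function requestsExposedDuringRollback(
  requestsPerSecond: number,
  canaryPercentage: number,
  propagationSeconds: number
): number {
  return Math.round(requestsPerSecond * (canaryPercentage / 100) * propagationSeconds);
}

// Hypothetical site at 1,000 requests/second with the canary at 25%
const bestCase = requestsExposedDuringRollback(1000, 25, 30);  // 7500
const worstCase = requestsExposedDuringRollback(1000, 25, 60); // 15000
console.log(`Exposed requests: ${bestCase} to ${worstCase}`);
```

The same arithmetic explains why catching a regression at 5% matters: at that stage the exposure for the same site is one fifth of the figures above.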
Mitigation:
Use client-side error recovery:
// Injected in all HTML (stable + canary)
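window.addEventListener('error', () => {
const errorCount = parseInt(sessionStorage.getItem('error-count') || '0', 10) + 1;
sessionStorage.setItem('error-count', errorCount.toString());
if (errorCount <= 3) return;
const url = new URL(window.location.href);
// Already on the forced-stable path: do not redirect again (avoids a reload loop)
if (url.searchParams.has('force-stable')) return;
// Too many errors - possible bad deployment
console.warn('Excessive errors detected, redirecting to stable');
sessionStorage.setItem('error-count', '0');
// location.reload(true) is non-standard, so redirect to stable explicitly.
// The edge worker is assumed to honor ?force-stable=1.
url.searchParams.set('force-stable', '1');
window.location.replace(url.toString());
});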
Verdict: Atomic KV update + CDN purge + client-side recovery = <60s rollback window.
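The client-side recovery snippet redirects with a `force-stable` query parameter, which only helps if the routing layer honors it. A minimal sketch of that routing decision as a pure function follows; the function name and the precedence order (explicit override, then disabled canary, then percentage split) are assumptions.

```typescript
type Version = 'stable' | 'canary';

// Decide which version serves a request.
// `bucket` is the user's stable bucket in [0, 99]; `canaryPercentage` comes from the KV store.
function chooseVersion(requestUrl: string, bucket: number, canaryPercentage: number): Version {
  const url = new URL(requestUrl);
  // An explicit client override always wins
  if (url.searchParams.get('force-stable') === '1') return 'stable';
  // Canary disabled or rolled back
  if (canaryPercentage <= 0) return 'stable';
  return bucket < canaryPercentage ? 'canary' : 'stable';
}
```

Keeping the decision in a pure function like this makes it straightforward to unit test the override separately from the worker's fetch and caching code.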
Feature Flags vs Canary Deployments
A common question: "Can't we just use feature flags instead of canary deployments?"
Short Answer: No, but they complement each other.
Comparison
| Aspect | Feature Flags | Canary Deployment |
|---|---|---|
| What Changes | Application behavior | Entire codebase |
| Scope | Single feature/code path | All code, including infrastructure |
| Granularity | Per-user, per-feature | Per-request, per-version |
| Rollback Speed | Instant (toggle off) | 30-60s (CDN purge) |
| Bundle Impact | Increases bundle (both paths shipped) | No impact (only one version loaded) |
| Testing Coverage | Only new code path | Entire app (including dependencies) |
| Use Case | A/B testing, gradual feature release | Deployment risk mitigation |
When to Use Feature Flags
- A/B Testing - "Should button be blue or green?"
- Gradual Feature Rollout - "Enable new checkout flow for 10% of users"
- Emergency Kill Switch - "Disable payment processor if it's down"
- User Segmentation - "Show premium features only to paid users"
When to Use Canary Deployments
- Dependency Updates - "Upgraded React 17 → 18, will it break?"
- Build Tool Changes - "Switched Webpack → Vite, are bundles correct?"
- Infrastructure Changes - "Migrated CDN provider, is routing correct?"
- Large Refactors - "Rewrote state management, does it work?"
The Hybrid Approach (Best)
Use both in coordination:
class HybridDeploymentStrategy {
async deploy(version: string, featureFlags: string[]): Promise<void> {
// Step 1: Deploy canary with new features DISABLED
console.log(`Deploying canary ${version} with features disabled`);
await this.deployCanary(version, {
featureFlags: featureFlags.map(f => ({ name: f, enabled: false })),
});
// Step 2: Wait for canary to stabilize (5-10 minutes)
await this.waitForStability();
// Step 3: Enable features gradually via flags
for (const flag of featureFlags) {
console.log(`Enabling feature: ${flag}`);
await this.featureFlagService.enable(flag, {
percentage: 5, // Start at 5%
canaryOnly: true, // Only in canary traffic
});
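// Step 4: Monitor feature-specific metrics
// (assumes monitorFeature resolves to true when the feature looks healthy)
const healthy = await this.monitorFeature(flag, 5 * 60 * 1000); // 5 minutes
if (!healthy) {
// Turn the feature back off and stop before widening exposure
await this.featureFlagService.enable(flag, { percentage: 0, canaryOnly: true });
throw new Error(`Feature ${flag} regressed at 5%; halting rollout`);
}
// Step 5: Stable, so increase to 100%
await this.featureFlagService.enable(flag, {
percentage: 100,
canaryOnly: true,
});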
}
// Step 6: If all features stable, progress canary
await this.progressCanary(25); // 5% → 25%
// Step 7: Eventually enable features in stable too
await this.enableFeaturesInStable(featureFlags);
}
}
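The `canaryOnly` and `percentage` options passed to the flag service above imply an evaluation rule on the client or edge. A minimal sketch of how such a rule could be evaluated is shown below; the `FlagRule` shape, the hash, and the function names are assumptions for illustration and do not describe any particular feature flag product.

```typescript
type AppVersion = 'stable' | 'canary';

interface FlagRule {
  enabled: boolean;
  percentage: number; // 0-100
  canaryOnly: boolean;
}

// Stable bucket in [0, 99] from a 32-bit FNV-1a hash of the key
function bucketOf(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % 100;
}

function isFlagEnabled(
  flagName: string,
  rule: FlagRule,
  userId: string,
  version: AppVersion
): boolean {
  if (!rule.enabled) return false;
  // canaryOnly flags never fire for users on the stable build
  if (rule.canaryOnly && version !== 'canary') return false;
  // Hash flag name and user together so different flags pick different users
  return bucketOf(`${flagName}:${userId}`) < rule.percentage;
}
```

Including the flag name in the hash key means the 5% of users who see one new feature are not automatically the same 5% who see every other new feature.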
Real-World Example: Meta's Gatekeeper + Canary System
Meta uses "Gatekeeper" (feature flags) + canary deployments in tandem:
- Deploy new React Native version as canary (code change)
- Keep new features behind Gatekeeper flags (behavior change)
- Stabilize canary (no flags enabled)
- Enable flags at 1% in canary only
- Monitor metrics (crashes, performance)
- Enable flags at 10% → 50% → 100% in canary
- Progress canary to 100% of traffic
- Enable flags in stable version
- Clean up flag code in next release
Verdict: Use canaries for deployment safety, use feature flags for feature safety. Combine them for maximum control.
CDN Cache Invalidation During Canary
CDN caching and canary deployments have a fundamental tension: you want aggressive caching (performance) but you also need instant rollback (safety).
The Caching Problem
Naive Approach:
Cache-Control: public, max-age=3600, s-maxage=86400
Why This Breaks Canaries:
- User requests /index.html at T+0 (canary)
- CDN caches it for 24 hours (s-maxage=86400)
- At T+10m, canary is rolled back
- User requests /index.html at T+15m
- CDN serves cached canary version (oops)
User sees broken canary for up to 24 hours.
Solution 1: Short TTLs for HTML, Long TTLs for Assets
HTML: Cache-Control: public, max-age=0, s-maxage=60
JS/CSS: Cache-Control: public, max-age=31536000, immutable
Tradeoffs:
- HTML is re-fetched frequently (acceptable - small payload)
- JS/CSS cached forever (good - large payloads, content-addressed)
- Rollback takes 60 seconds max (time for CDN TTL to expire)
Implementation:
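// Origin server (Next.js API route)
import type { NextApiRequest, NextApiResponse } from 'next';
export default function handler(req: NextApiRequest, res: NextApiResponse) {
// Strip the query string so the extension checks see only the path
const path = (req.url ?? '').split('?')[0];
if (path.endsWith('.html')) {
res.setHeader('Cache-Control', 'public, max-age=0, s-maxage=60, must-revalidate');
} else if (/\.(js|css|woff2)$/.test(path)) {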
res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
}
// ...
}
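The header policy can also be isolated into a pure function, which makes it testable without a framework. This is a minimal sketch: `cacheControlFor`, the extension list, and the fallback policy for other file types are illustrative assumptions, not part of Next.js.

```typescript
// Hypothetical helper: choose a Cache-Control value from the request path.
// HTML and extensionless routes get a short shared-cache TTL so a rollback
// takes effect quickly; fingerprinted static assets are cached as immutable.
const IMMUTABLE_ASSET = /\.(js|css|woff2)$/;

function cacheControlFor(pathname: string): string {
  if (IMMUTABLE_ASSET.test(pathname)) {
    return 'public, max-age=31536000, immutable';
  }
  const hasExtension = /\.[a-z0-9]+$/i.test(pathname);
  if (pathname.endsWith('.html') || !hasExtension) {
    return 'public, max-age=0, s-maxage=60, must-revalidate';
  }
  // Anything else (images, JSON, ...) gets a conservative default in this sketch.
  return 'public, max-age=0, must-revalidate';
}
```

Keeping the decision in one function means the 60-second rollback bound is stated in exactly one place.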
Solution 2: Versioned Cache Keys
Edge worker creates different cache keys for canary and stable:
export default {
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
const url = new URL(request.url);
const version = await determineVersion(request, env);
// Create version-specific cache key
const cacheKey = new Request(request.url, {
headers: request.headers,
cf: {
cacheKey: `${version}:${url.pathname}`,
},
});
// Check cache
let response = await caches.default.match(cacheKey);
if (!response) {
// Fetch from origin
response = await fetch(getOriginUrl(url, version));
// Cache with version-specific key
ctx.waitUntil(caches.default.put(cacheKey, response.clone()));
}
return response;
}
};
Why This Works:
- Canary assets: canary-v48:/index.html
- Stable assets: stable-v47:/index.html
- No key collision, both can coexist in cache
- Rollback = change routing, don't purge cache
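To make the "rollback is a routing change" point concrete, here is a small in-memory sketch. A `Map` stands in for the edge cache, and every name (`VersionedCache`, `resolveVersion`, `activeCanary`) is an assumption made for illustration; a real CDN cache also has TTLs and eviction.

```typescript
// In-memory stand-in for an edge cache, keyed by "<version>:<pathname>".
function versionedKey(version: string, pathname: string): string {
  return `${version}:${pathname}`;
}

class VersionedCache {
  private entries = new Map<string, string>();

  put(version: string, pathname: string, body: string): void {
    this.entries.set(versionedKey(version, pathname), body);
  }

  get(version: string, pathname: string): string | undefined {
    return this.entries.get(versionedKey(version, pathname));
  }
}

// Routing state: the only thing a rollback needs to change.
let activeCanary: string | null = 'canary-v48';

function resolveVersion(inCanaryBucket: boolean): string {
  return inCanaryBucket && activeCanary !== null ? activeCanary : 'stable-v47';
}

const edgeCache = new VersionedCache();
edgeCache.put('stable-v47', '/index.html', '<html>v47</html>');
edgeCache.put('canary-v48', '/index.html', '<html>v48</html>');

const beforeRollback = edgeCache.get(resolveVersion(true), '/index.html');
activeCanary = null; // rollback: flip routing, leave the cache untouched
const afterRollback = edgeCache.get(resolveVersion(true), '/index.html');
```

The canary entry is still in the cache after rollback; it simply stops being addressed.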
Solution 3: Selective Cache Purge
Instead of purging all canary assets, purge only critical paths:
async function rollbackCanary(): Promise<void> {
// Only purge HTML (entry points)
await purgeCDN('/index.html');
await purgeCDN('/_next/data/**/*.json'); // Next.js data files
// Don't purge JS/CSS - content-addressed, no collision
// Update routing
await kv.put('canary-percentage', '0');
}
Why This Works:
- HTML is purged (users get stable entry point)
- JS/CSS are content-addressed (v47 vs v48 filenames differ)
- Smaller purge = faster propagation
- Less CDN load
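One way to decide what belongs in the purge list is to skip any file whose name embeds a content hash. The sketch below assumes hashes appear as eight or more hex characters before the extension; `pathsToPurge` and the pattern are hypothetical and would need adjusting to a real bundler's naming scheme.

```typescript
// Files whose name embeds a content hash never collide across builds,
// so they do not need purging on rollback.
const CONTENT_HASHED = /[.-][0-9a-f]{8,}\.(js|css|woff2|png|svg)$/i;

// Return only the paths worth purging: entry points and data files.
function pathsToPurge(releaseFiles: string[]): string[] {
  return releaseFiles.filter((file) => !CONTENT_HASHED.test(file));
}

const purgeList = pathsToPurge([
  '/index.html',
  '/_next/static/chunks/main-3f9a1c2b7d.js',
  '/_next/data/build48/home.json',
  '/styles.9c0d1e2f.css',
]);
```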
Solution 4: Stale-While-Revalidate for Resilience
Cache-Control: public, max-age=60, stale-while-revalidate=600
Behavior:
- CDN serves cached copy for 60 seconds
- After 60s, CDN serves stale copy while fetching fresh in background
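- For up to 600 seconds after that, the CDN may keep serving the stale copy while it revalidates (serving stale when the origin is down is what stale-if-error controls)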
Benefits During Rollback:
- Origin load is smoothed (no thundering herd)
- Users get content even during rollback
- Stale canary is better than error page
Downside:
- Users might see canary for up to 60s after rollback
Acceptable tradeoff for most apps.
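The three phases of a response's life under this header can be written as a small classifier. This is a sketch of the semantics described in RFC 5861, with times in seconds; the function and type names are illustrative.

```typescript
type Freshness = 'fresh' | 'stale-while-revalidate' | 'must-refetch';

// Classify a cached response by its age given max-age and
// stale-while-revalidate values from Cache-Control.
function freshnessAt(ageSeconds: number, maxAge: number, swr: number): Freshness {
  if (ageSeconds < maxAge) return 'fresh';
  if (ageSeconds < maxAge + swr) return 'stale-while-revalidate';
  return 'must-refetch';
}

// With max-age=60, stale-while-revalidate=600:
const atThirtySeconds = freshnessAt(30, 60, 600);
const atTwoMinutes = freshnessAt(120, 60, 600);
const atTwelveMinutes = freshnessAt(720, 60, 600);
```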
Comparison
| Strategy | Rollback Speed | Cache Hit Rate | Complexity | Best For |
|---|---|---|---|---|
| Short TTLs | ~60s | Low for HTML, high for assets | Low | Standard setups |
| Versioned Keys | Instant | High for everything | Medium | CDN with edge workers |
| Selective Purge | ~30s | High | Low | Simple CDN setups |
| Stale-While-Revalidate | ~60s (graceful) | High | Low | Resilience-focused |
Verdict: Use versioned cache keys if you have edge workers, otherwise use short TTLs + selective purge.
Client-Side Version Detection
The client needs to know which version it's running for telemetry and error attribution. Here's how to implement it.
Version Injection in HTML
// SSR render (Next.js)
import Head from 'next/head';
import Script from 'next/script';
export async function getServerSideProps(context) {
const version = process.env.APP_VERSION || 'unknown';
const buildId = process.env.BUILD_ID || 'unknown';
return {
props: {
version,
buildId,
},
};
}
export default function App({ version, buildId, Component, pageProps }) {
return (
<>
<Head>
<meta name="app-version" content={version} />
<meta name="build-id" content={buildId} />
</Head>
<Script
id="version-init"
strategy="beforeInteractive"
dangerouslySetInnerHTML={{
__html: `
window.__APP_VERSION__ = ${JSON.stringify(version)};
window.__BUILD_ID__ = ${JSON.stringify(buildId)};
window.__CANARY_ASSIGNMENT__ = document.cookie.match(/x-canary-version=([^;]+)/)?.[1] || 'unknown';
`,
}}
/>
<Component {...pageProps} />
</>
);
}
Client-Side Telemetry Tagging
class TelemetryClient {
private version: string;
private buildId: string;
private canaryAssignment: string;
constructor() {
this.version = window.__APP_VERSION__ || 'unknown';
this.buildId = window.__BUILD_ID__ || 'unknown';
this.canaryAssignment = window.__CANARY_ASSIGNMENT__ || 'unknown';
}
trackEvent(name: string, properties: Record<string, any> = {}): void {
const enriched = {
...properties,
version: this.version,
buildId: this.buildId,
canaryAssignment: this.canaryAssignment,
timestamp: Date.now(),
sessionId: this.getSessionId(),
};
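// Send to analytics; keepalive lets the request outlive page unload
fetch('/api/telemetry', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
keepalive: true,
body: JSON.stringify({
event: name,
properties: enriched,
}),
}).catch(() => {
// Telemetry failures must never surface as unhandled rejections
});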
}
trackError(error: Error, context: Record<string, any> = {}): void {
this.trackEvent('error', {
...context,
errorMessage: error.message,
errorStack: error.stack,
errorName: error.name,
});
}
trackPerformance(metric: string, value: number, context: Record<string, any> = {}): void {
this.trackEvent('performance', {
...context,
metric,
value,
});
}
private getSessionId(): string {
let sessionId = sessionStorage.getItem('session-id');
if (!sessionId) {
sessionId = crypto.randomUUID();
sessionStorage.setItem('session-id', sessionId);
}
return sessionId;
}
}
// Global instance
export const telemetry = new TelemetryClient();
// React Error Boundary
export class ErrorBoundary extends React.Component {
componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
telemetry.trackError(error, {
componentStack: errorInfo.componentStack,
boundary: 'app',
});
}
render() {
return this.props.children;
}
}
// Performance monitoring
if (typeof window !== 'undefined') {
// Core Web Vitals
import('web-vitals').then(({ onCLS, onLCP, onFCP, onTTFB, onINP }) => {
onCLS((metric) => telemetry.trackPerformance('cls', metric.value));
onINP((metric) => telemetry.trackPerformance('inp', metric.value));
onLCP((metric) => telemetry.trackPerformance('lcp', metric.value));
onFCP((metric) => telemetry.trackPerformance('fcp', metric.value));
onTTFB((metric) => telemetry.trackPerformance('ttfb', metric.value));
});
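// Rough proxy for hydration cost: time from this script running to window load.
// This is not true hydration time; for that, time the hydrateRoot() call itself.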
const hydrationStart = performance.now();
window.addEventListener('load', () => {
const hydrationEnd = performance.now();
const hydrationTime = hydrationEnd - hydrationStart;
telemetry.trackPerformance('hydration_time', hydrationTime);
});
}
Version Mismatch Detection
class VersionMismatchDetector {
private expectedVersion: string;
private checkInterval: number = 60000; // Check every minute
constructor() {
this.expectedVersion = window.__APP_VERSION__;
this.startMonitoring();
}
private startMonitoring(): void {
setInterval(async () => {
await this.checkVersion();
}, this.checkInterval);
}
private async checkVersion(): Promise<void> {
try {
const response = await fetch('/api/version', {
headers: {
'X-Client-Version': this.expectedVersion,
},
});
const data = await response.json();
if (data.currentVersion !== this.expectedVersion) {
console.warn('Version mismatch detected', {
client: this.expectedVersion,
server: data.currentVersion,
});
// Show update notification
this.showUpdateNotification();
}
} catch (error) {
console.error('Version check failed', error);
}
}
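private showUpdateNotification(): void {
// Avoid stacking a new banner on every check once a mismatch is detected
if (document.getElementById('version-update-notice')) return;
const notification = document.createElement('div');
notification.id = 'version-update-notice';
notification.innerHTML = `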
<div style="
position: fixed;
bottom: 20px;
right: 20px;
background: #1a1a1a;
color: white;
padding: 16px;
border-radius: 8px;
box-shadow: 0 4px 12px rgba(0,0,0,0.3);
z-index: 9999;
">
<p style="margin: 0 0 8px 0;">A new version is available</p>
<button onclick="window.location.reload()" style="
background: #0070f3;
color: white;
border: none;
padding: 8px 16px;
border-radius: 4px;
cursor: pointer;
">
Reload
</button>
</div>
`;
document.body.appendChild(notification);
}
}
// Initialize
if (typeof window !== 'undefined') {
new VersionMismatchDetector();
}
Verdict: Inject version in HTML, tag all telemetry, monitor for mismatches, prompt user to reload when versions drift.
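The inline script reads the assignment cookie with a regular expression, which would also match a cookie whose name merely ends in `x-canary-version`. A stricter parser can be sketched as follows; the function name and the `'unknown'` fallback are assumptions, while the cookie name is the one used in the injected script.

```typescript
// Extract the canary assignment from a Cookie header or document.cookie string,
// matching the cookie name exactly rather than by substring.
function readCanaryAssignment(cookieString: string): string {
  for (const part of cookieString.split(';')) {
    const [rawName, ...rest] = part.trim().split('=');
    if (rawName === 'x-canary-version') {
      // Re-join in case the value itself contains '='.
      return decodeURIComponent(rest.join('=')) || 'unknown';
    }
  }
  return 'unknown';
}

const assignment = readCanaryAssignment('theme=dark; x-canary-version=canary-v48; sid=1');
```

Because it takes a plain string, the same function can run in the browser and in an edge worker reading the `Cookie` header.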
Production Incidents and Lessons Learned
Real-world canary failures and how they were detected/resolved.
Incident 1: Infinite Re-Render Loop
What Happened:
Deployed canary with React 18 upgrade. Used useEffect without dependency array in a component rendered 1000+ times per page (list items). Each render triggered another effect, causing infinite loop.
// Buggy code in v48 canary
function ListItem({ id }) {
const [data, setData] = useState(null);
useEffect(() => {
fetch(`/api/items/${id}`)
.then(res => res.json())
.then(setData);
// Missing dependency array - runs on every render!
});
return <div>{data?.name}</div>;
}
Detection:
- Canary deployed at 5% (22.5K RPS)
- Within 90 seconds, CPU usage spiked to 100% on client devices
- INP (Interaction to Next Paint) jumped from 80ms → 4500ms
- Browser tab crashes increased 50x
Canary Analysis System Caught It:
T+1m:30s Anomaly detected: INP > 2000ms (baseline: 85ms)
Effect size: 2.8 (extremely large)
Confidence: 0.98
Decision: ROLLBACK
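The log reports an effect size without saying how it is computed. One common definition is Cohen's d: the difference in means divided by the pooled standard deviation. The sketch below assumes that definition, and the sample values are made up for illustration, not taken from the incident.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Cohen's d: difference in means divided by the pooled standard deviation.
function cohensD(baseline: number[], canary: number[]): number {
  const nB = baseline.length;
  const nC = canary.length;
  const pooledVar =
    ((nB - 1) * variance(baseline) + (nC - 1) * variance(canary)) / (nB + nC - 2);
  return (mean(canary) - mean(baseline)) / Math.sqrt(pooledVar);
}

// Made-up INP samples in milliseconds, for illustration only.
const baselineInp = [80, 85, 90, 85, 85];
const canaryInp = [4400, 4500, 4600, 4450, 4550];
const inpEffect = cohensD(baselineInp, canaryInp);
```

By the usual rule of thumb, values above 0.8 already count as a large effect, so a regression of this kind is far outside normal variation.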
Rollback:
- Automated rollback triggered at T+1m:45s
- KV updated: canary-percentage = 0
- CDN purged: /dist-v48/*.html
- Total affected users: ~135K (90s * 1500 RPS)
- Damage: 47 users reported crashes, 0 churn
Root Cause:
React 18's concurrent rendering changed useEffect timing. Missing dependency array became fatal.
Prevention:
- Added ESLint rule: react-hooks/exhaustive-deps (enforced in CI)
- Added performance regression tests: Lighthouse CI checks INP < 200ms
- Improved canary sensitivity: INP threshold lowered to +500ms (was +1000ms)
Incident 2: Hydration Mismatch on Timezone Boundaries
What Happened:
Deployed canary that rendered server timestamp using Date.now(). Server was in UTC, client rendered in local timezone.
// Buggy code
function Timestamp() {
const now = Date.now();
return (
<div>
Last updated: {new Date(now).toLocaleString()}
</div>
);
}
Why This Broke:
- SSR rendered: "Last updated: 3/15/2024, 10:00:00 PM UTC"
- Client hydrated: "Last updated: 3/15/2024, 3:00:00 PM PST"
- React saw mismatch, threw hydration error
Detection:
- Canary deployed to 5%
- Hydration errors started appearing immediately
- Error rate: 2.3% (much higher than threshold of 0.01%)
- Weird part: Only affected users in PST/MST timezones (US West Coast)
Why Stratification Saved Us:
Naive analysis would have shown: "2.3% of canary users have errors, but canary sample is small, could be noise."
Stratified analysis showed:
- US-West region: 8.5% error rate (CRITICAL)
- US-East region: 0.03% error rate (normal)
- EU region: 0.02% error rate (normal)
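A stratified check of this kind can be sketched as a per-stratum comparison against a threshold. The request counts below are invented so that the rates match the ones quoted above, and the 0.1% per-stratum threshold is an assumption chosen for the example.

```typescript
interface StratumStats {
  stratum: string; // e.g. region or connection type
  requests: number;
  errors: number;
}

interface StratumVerdict {
  stratum: string;
  errorRate: number; // fraction, 0..1
  regressed: boolean;
}

// Flag each stratum whose error rate exceeds the threshold, instead of
// judging the canary on a single pooled rate.
function analyzeByStratum(stats: StratumStats[], maxErrorRate: number): StratumVerdict[] {
  return stats.map(({ stratum, requests, errors }) => {
    const errorRate = requests === 0 ? 0 : errors / requests;
    return { stratum, errorRate, regressed: errorRate > maxErrorRate };
  });
}

const verdicts = analyzeByStratum(
  [
    { stratum: 'us-west', requests: 10000, errors: 850 }, // 8.5%
    { stratum: 'us-east', requests: 10000, errors: 3 },   // 0.03%
    { stratum: 'eu', requests: 10000, errors: 2 },        // 0.02%
  ],
  0.001, // assumed per-stratum threshold of 0.1%
);
const regressedStrata = verdicts.filter((v) => v.regressed).map((v) => v.stratum);
```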
Rollback:
- T+2m: Stratified analysis flagged US-West regression
- T+2m:15s: Automated rollback triggered
- Total affected: ~45K users (mostly US West Coast)
Root Cause:
Server timestamp rendered during SSR used server's timezone. Client rendered in user's timezone. Hydration failed.
Fix:
function Timestamp() {
const [mounted, setMounted] = useState(false);
useEffect(() => {
setMounted(true);
}, []);
// Server: render nothing (or placeholder)
// Client: render actual timestamp
if (!mounted) {
return <div>Last updated: ...</div>;
}
return (
<div>
Last updated: {new Date().toLocaleString()}
</div>
);
}
Prevention:
- Added hydration error monitoring (track console.error for "Hydration failed")
- Added timezone-aware test suite (run Playwright tests with TZ=America/Los_Angeles)
- Improved documentation: "Never render Date.now() in SSR without timezone handling"
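Hydration error monitoring via `console.error` can be sketched as a wrapper that counts matching messages and forwards everything to the original logger. The matched substrings are assumptions, since exact wording varies between React versions, and `installHydrationErrorCounter` is a hypothetical name.

```typescript
// Substrings that suggest a hydration failure (assumed; check your React version).
const HYDRATION_PATTERNS = ['Hydration failed', 'did not match'];

interface ErrorLogger {
  error: (...args: unknown[]) => void;
}

// Wrap logger.error so hydration-looking messages are counted.
// Returns a getter for the current count.
function installHydrationErrorCounter(logger: ErrorLogger): () => number {
  let count = 0;
  const original = logger.error;
  logger.error = (...args: unknown[]) => {
    const text = args.map(String).join(' ');
    if (HYDRATION_PATTERNS.some((p) => text.includes(p))) count += 1;
    original.apply(logger, args); // always forward to the real logger
  };
  return () => count;
}
```

In a browser this would be called once with `console` before hydration starts, and the count reported through the telemetry client.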
Incident 3: CDN Cache Stampede During Rollback
What Happened:
Deployed canary, caught a bug, rolled back. During rollback, purged ALL HTML from CDN. 450K RPS suddenly hit origin.
Timeline:
T+0 Deploy canary (5%)
T+5m Detect error rate regression
T+5m:15s Initiate rollback
T+5m:20s CDN cache purge sent: *.html (all HTML, not just canary)
T+5m:25s Purge propagates globally
T+5m:30s ALL HTML REQUESTS HIT ORIGIN
Origin load: 8K RPS → 450K RPS
Origin crashes (OOM)
T+5m:45s Site down (503 errors)
T+8m Auto-scaling kicks in (slow)
T+12m Site recovers
Damage:
- 6.5 minutes of total outage (worse than canary bug itself)
- 2.9M failed requests
- Significant revenue impact
Root Cause:
- Purged too aggressively (all HTML, not just canary)
- No request coalescing at CDN edge
- Origin not prepared for cache miss spike
Fix 1: Selective Purge
async function rollback() {
// Before: purge all HTML
// await purgeCDN('*.html');
// After: only purge canary-specific HTML
await purgeCDN('/dist-v48/*.html');
// Stable HTML stays cached
}
Fix 2: Request Coalescing
// Cloudflare Worker
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const url = new URL(request.url);
// Create coalescing key
const coalesceKey = `coalesce:${url.pathname}`;
// Check if request is already in-flight
const inFlight = await env.KV.get(coalesceKey);
if (inFlight) {
// Wait for in-flight request to complete
await sleep(100);
// Try cache again
const cached = await caches.default.match(request);
if (cached) return cached;
}
// Mark request as in-flight
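// KV requires expirationTtl >= 60 seconds; KV is also eventually consistent,
// so this flag is a best-effort hint rather than a lock
await env.KV.put(coalesceKey, 'true', { expirationTtl: 60 });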
// Fetch from origin
const response = await fetch(request);
// Cache result
await caches.default.put(request, response.clone());
// Clear in-flight flag
await env.KV.delete(coalesceKey);
return response;
}
};
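Within a single process or isolate, coalescing is often implemented by sharing the in-flight promise rather than polling a flag. The following single-flight sketch illustrates that technique; it is not the worker above, and it does not coordinate across edge locations.

```typescript
// Single-flight: concurrent callers for the same key share one in-flight promise.
const inFlight = new Map<string, Promise<string>>();

function coalesce(key: string, load: () => Promise<string>): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) return existing;

  const pending = load().finally(() => {
    // Allow a fresh load once this one settles (success or failure).
    inFlight.delete(key);
  });
  inFlight.set(key, pending);
  return pending;
}

// Usage: two simultaneous requests trigger a single origin fetch.
let originCalls = 0;
const fakeOrigin = (): Promise<string> => {
  originCalls += 1;
  return Promise.resolve('<html>stable</html>');
};

const firstCaller = coalesce('/index.html', fakeOrigin);
const secondCaller = coalesce('/index.html', fakeOrigin);
```

Both callers receive the same promise, so a cache miss on a hot path costs the origin one request per isolate instead of one per user.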
Fix 3: Stale-While-Revalidate
res.setHeader('Cache-Control', 'public, max-age=60, stale-while-revalidate=600, stale-if-error=3600');
CDN serves stale content during origin outage.
Prevention:
- Added origin load shedding (reject requests if CPU > 80%)
- Added CDN-level rate limiting (max 10K RPS to origin)
- Changed rollback strategy to gradual (100% → 50% → 0% over 2 minutes)
Incident 4: Mobile Network Timeout Cascade
What Happened:
Deployed canary with larger bundle size (+120KB gzipped). Desktop users unaffected. Mobile users on 3G experienced cascading timeouts.
Metrics:
- Desktop LCP: 1.2s (stable) → 1.3s (canary) ✅
- Mobile 4G LCP: 1.8s (stable) → 2.1s (canary) ⚠️
- Mobile 3G LCP: 3.2s (stable) → 8.7s (canary) ❌
Why Stratification Caught This:
Overall LCP increase: +300ms (below threshold of +500ms)
But stratified by connection type:
- WiFi: +50ms (fine)
- 4G: +300ms (acceptable)
- 3G: +5500ms (CRITICAL)
Root Cause:
- 3G bandwidth: ~1Mbps (theoretical), 400Kbps (real-world)
- 120KB extra bundle = 2.4s extra download on 3G
- Load timeout was set to 5s; the page took 8.7s to load, so the timeout fired
- User got white screen
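The 2.4s figure follows directly from the bundle size and the real-world bandwidth. A minimal sketch of the arithmetic is below; it ignores latency, TCP slow start, and parse time, and the 4G bandwidth is an assumed value for comparison.

```typescript
// Transfer time in seconds for a payload of `kilobytes` over a link of `kbps`
// (kilobits per second).
function transferSeconds(kilobytes: number, kbps: number): number {
  const kilobits = kilobytes * 8;
  return kilobits / kbps;
}

const extraOn3g = transferSeconds(120, 400);   // 960 kilobits / 400 kbps
const extraOn4g = transferSeconds(120, 10000); // assumed ~10 Mbps effective 4G
```

Transfer time accounts for 2.4s of the observed 5.5s regression on 3G; the write-up does not itemize the remainder.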
Fix:
// Adaptive bundle loading
async function loadAppBundle() {
const connection = (navigator as any).connection;
const effectiveType = connection?.effectiveType || '4g';
if (effectiveType === 'slow-2g' || effectiveType === '2g' || effectiveType === '3g') {
// Load minimal bundle for slow connections
await import('./bundles/minimal.js');
} else {
// Load full bundle
await import('./bundles/full.js');
}
}
Prevention:
- Added connection-type stratification to canary analysis
- Added bundle size regression tests (fail if >50KB increase)
- Added network throttling to CI (test on simulated 3G)
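A bundle size regression test can be as simple as comparing two size manifests. The sketch below assumes each build emits a map from file name to gzipped size in bytes; `bundleRegressions` and the manifest shape are hypothetical.

```typescript
interface BundleSizes {
  [file: string]: number; // gzipped size in bytes
}

// Return the files whose gzipped size grew by more than `budgetBytes`
// relative to the baseline build. New files count their full size as growth.
function bundleRegressions(base: BundleSizes, next: BundleSizes, budgetBytes: number): string[] {
  return Object.keys(next).filter(
    (file) => (next[file] ?? 0) - (base[file] ?? 0) > budgetBytes,
  );
}

const overBudget = bundleRegressions(
  { 'main.js': 300000 },
  { 'main.js': 420000, 'vendor.js': 10000 },
  50 * 1024, // 50KB budget from the prevention list
);
```

A CI step would fail the build when the returned list is non-empty.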
Tradeoffs and Engineering Decisions
Every canary architecture decision involves tradeoffs. Here are the critical ones.
Tradeoff 1: Canary Percentage (Risk vs Sample Size)
Lower Percentage (1-5%):
- ✅ Smaller blast radius (fewer affected users)
- ✅ Lower risk of revenue impact
- ❌ Smaller sample size (takes longer to detect issues)
- ❌ Higher false positive rate (noise dominates signal)
Higher Percentage (10-25%):
- ✅ Larger sample size (faster detection)
- ✅ Lower false positive rate
- ❌ Larger blast radius
- ❌ Higher risk if canary is bad
Decision:
Start at 5% for 5-10 minutes (detect critical issues), then jump to 25-50% (validate at scale). This balances risk and detection speed.
| App Type | Initial % | First Jump | Final Jump |
|---|---|---|---|
| E-commerce | 2% | 10% | 50% → 100% |
| Social media | 5% | 25% | 100% |
| SaaS dashboard | 10% | 50% | 100% |
| Internal tool | 25% | 100% | N/A |
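The sample-size side of this tradeoff is simple arithmetic: the lower the percentage, the longer it takes to observe a given number of canary requests. The sketch below uses the 450K RPS figure from Incident 3; the target sample count is an arbitrary example, not a statistical recommendation.

```typescript
// Seconds needed to observe `targetSamples` canary requests when the canary
// receives `canaryPercent` of `totalRps` requests per second.
function secondsToCollect(targetSamples: number, totalRps: number, canaryPercent: number): number {
  const canaryRps = totalRps * (canaryPercent / 100);
  if (canaryRps <= 0) return Infinity;
  return targetSamples / canaryRps;
}

const atFivePercent = secondsToCollect(225000, 450000, 5); // about 10 seconds
const atOnePercent = secondsToCollect(225000, 450000, 1);  // about 50 seconds
```

On a low-traffic app the same calculation yields minutes or hours, which is why the table starts internal tools at a much higher percentage.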
Tradeoff 2: Decision Latency (Speed vs Accuracy)
Fast Decisions (30-second windows):
- ✅ Catch bad deploys quickly
- ✅ Minimize blast radius
- ❌ Higher false positive rate (noise)
- ❌ May not catch slow-burn issues
Slow Decisions (5-minute windows):
- ✅ Better statistical significance
- ✅ Lower false positive rate
- ❌ Slow to catch critical bugs
- ❌ More users affected before rollback
Decision:
Use hybrid approach:
const DECISION_CONFIG = {
// Critical metrics: fast decisions
errorRate: {
window: 30, // 30 seconds
threshold: 0.05, // 0.05% increase
action: 'rollback_immediate',
},
hydrationErrors: {
window: 30,
threshold: 0.01,
action: 'rollback_immediate',
},
// Performance metrics: slower decisions
lcp: {
window: 300, // 5 minutes
threshold: 500, // +500ms
action: 'rollback_delayed',
},
cls: {
window: 300,
threshold: 0.1,
action: 'hold',
},
};
Verdict: Fast decisions for errors, slower decisions for performance.
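A config like this needs an evaluator that turns observed deltas into a single action. One possible sketch follows, assuming each delta is "canary minus stable" over the rule's window and that the most severe triggered action wins; the severity ordering and the `'continue'` default are assumptions.

```typescript
type Action = 'rollback_immediate' | 'rollback_delayed' | 'hold' | 'continue';

interface MetricRule {
  window: number;    // seconds of data the delta was computed over
  threshold: number; // maximum tolerated increase
  action: Exclude<Action, 'continue'>;
}

// Severity order used to pick one action when several rules fire.
const SEVERITY: Action[] = ['continue', 'hold', 'rollback_delayed', 'rollback_immediate'];

function decide(rules: Record<string, MetricRule>, deltas: Record<string, number>): Action {
  let worst: Action = 'continue';
  for (const metric of Object.keys(rules)) {
    const delta = deltas[metric];
    if (delta === undefined) continue; // no data for this metric yet
    const rule = rules[metric];
    if (delta > rule.threshold && SEVERITY.indexOf(rule.action) > SEVERITY.indexOf(worst)) {
      worst = rule.action;
    }
  }
  return worst;
}

const exampleRules: Record<string, MetricRule> = {
  errorRate: { window: 30, threshold: 0.05, action: 'rollback_immediate' },
  lcp: { window: 300, threshold: 500, action: 'rollback_delayed' },
  cls: { window: 300, threshold: 0.1, action: 'hold' },
};

const slowButHealthy = decide(exampleRules, { lcp: 650, cls: 0.2 });
const erroring = decide(exampleRules, { errorRate: 0.12, lcp: 650 });
```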
Tradeoff 3: Edge Logic Complexity (Flexibility vs Maintainability)
Simple Edge Logic:
- Percentage-based routing only
- No stratification
- No user segmentation
Pros:
- Easy to reason about
- Low edge CPU usage
- Easy to debug
Cons:
- No fine-grained control
- Can't do region-based rollouts
- Can't do cohort-based testing
Complex Edge Logic:
- Percentage + region + cohort + feature flags
- Real-time KV lookups
- User data fetching
Pros:
- Maximum control
- Can do sophisticated rollouts
- Can segment by any dimension
Cons:
- Edge worker CPU limits (50ms timeout)
- More KV reads = higher latency
- Harder to debug
Decision:
Start simple, add complexity as needed.
Phase 1 (MVP):
// Just percentage-based
const pct = parseInt((await kv.get('canary-percentage')) ?? '0', 10);
const bucket = hash(userId) % 100;
return bucket < pct ? 'canary' : 'stable';
Phase 2 (Add Regions):
const enabled = await kv.get(`canary-enabled:${region}`);
// KV values are strings, so compare explicitly ('false' would be truthy)
if (enabled !== 'true') return 'stable';
// ... percentage logic
Phase 3 (Add Cohorts):
const user = await fetchUser(userId);
const eligible = checkCohortEligibility(user);
if (!eligible) return 'stable';
// ... percentage + region logic
Verdict: Start with simple percentage-based, add sophistication only when product needs it.
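The Phase 1 snippet relies on a `hash` function that is not shown. Any deterministic hash with a reasonably uniform spread works; the sketch below uses 32-bit FNV-1a over UTF-16 code units as one illustrative choice, since the article does not say which hash is used in production.

```typescript
// FNV-1a 32-bit hash: small, fast, and deterministic across runtimes.
// This variant hashes UTF-16 code units rather than bytes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned 32-bit
}

function assignVariant(userId: string, canaryPercent: number): 'canary' | 'stable' {
  const bucket = fnv1a(userId) % 100; // 0..99
  return bucket < canaryPercent ? 'canary' : 'stable';
}
```

Because the bucket depends only on the user ID, raising the percentage from 5 to 25 keeps every existing canary user in the canary and only adds new ones.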
Tradeoff 4: Automated vs Manual Rollback
Fully Automated:
- ✅ Fast (60-second rollback)
- ✅ Works 24/7 (no human needed)
- ❌ False positives block good deploys
- ❌ Can't handle nuanced situations
Manual Only:
- ✅ Humans make nuanced decisions
- ✅ No false positives
- ❌ Slow (humans not always available)
- ❌ Blast radius grows during detection time
Decision:
Automated rollback with human override:
class RollbackController {
async executeRollback(reason: string, confidence: number): Promise<void> {
if (confidence > 0.95) {
// Very confident - auto-rollback immediately
console.log(`AUTO-ROLLBACK (confidence: ${confidence})`);
await this.rollback(reason);
await this.notifyTeam(`Auto-rollback executed: ${reason}`);
} else if (confidence > 0.7) {
// Moderately confident - notify team, give 2 minutes to override
console.log(`ROLLBACK PENDING (confidence: ${confidence})`);
await this.notifyTeam(`Rollback pending in 2 minutes: ${reason}. Reply 'CANCEL' to abort.`);
await sleep(120000); // Wait 2 minutes
// Check if human canceled
const canceled = await this.kv.get('rollback-canceled');
if (canceled === 'true') {
console.log('Rollback canceled by human');
await this.kv.delete('rollback-canceled');
return;
}
// No cancellation - proceed
await this.rollback(reason);
} else {
// Low confidence - just alert humans
console.log(`ROLLBACK SUGGESTED (confidence: ${confidence})`);
await this.notifyTeam(`Canary showing issues: ${reason}. Manual review recommended.`);
}
}
}
Rollback Override (Slack Bot):
[3:42 PM] CanaryBot:
⚠️ ROLLBACK PENDING in 2 minutes
Reason: Error rate increased by 0.12% (threshold: 0.05%)
Confidence: 0.78
React with ❌ to cancel rollback
User: [clicks ❌]
[3:42 PM] CanaryBot:
✅ Rollback canceled. Canary will continue.
Verdict: Use automated rollback with confidence thresholds and human override capability.
Summary: Key Architectural Insights
Building production-grade canary deployments for frontends requires rethinking backend patterns:
- Traffic Splitting Must Happen at the Edge
  - CDN edge workers give you <1ms routing decisions
  - Origin-based splitting amplifies load and limits geographic control
  - DNS-based splitting is too slow for rollbacks
- Static Assets Break Traditional Canary Models
  - Content-addressed assets prevent cache collisions
  - Versioned CDN paths enable canary/stable coexistence
  - Short HTML TTLs + long asset TTLs balance performance and rollback speed
- Hydration Is Your Biggest Risk
  - SSR/SSG creates timing windows where version mismatches cause failures
  - Version injection in HTML + client-side checks prevent drift
  - Monitor hydration errors separately from runtime errors
- Stratified Analysis Prevents False Negatives
  - Raw metric comparisons miss region/device-specific regressions
  - Statistical tests (t-test, effect size) reduce false positives
  - Anomaly detection catches edge cases that thresholds miss
- Rollback Speed Matters More Than Deployment Speed
  - CDN propagation delays mean bad code lives for 30-60s minimum
  - Client-side error recovery bridges the rollback window
  - Aggressive cache purging causes origin stampedes (gradual rollback is safer)
- Canaries and Feature Flags Complement Each Other
  - Canaries validate entire codebase (including deps, tooling, infra)
  - Feature flags control individual code paths
  - Hybrid approach: deploy canary with flags off, enable flags incrementally
- Automation Is Non-Negotiable at Scale
  - Humans can't make decisions in 60-second windows
  - Bayesian inference + confidence thresholds enable safe automation
  - Manual override prevents robots from blocking good deploys
- The Real Cost Is Operational Complexity
  - Edge compute adds $5-10K/month at scale
  - Monitoring/telemetry is $10-15K/month
  - Engineering time to build + maintain the system is 3-6 months
  - But the cost of a bad deploy (downtime, churn, reputation) is 10-100x higher
Final Thought:
Canary deployments for frontends are fundamentally harder than backend canaries because:
- Static assets create caching complexity
- Client-side execution means you can't control the runtime
- Hydration creates version synchronization problems
- CDN propagation delays limit rollback speed
But the investment is worth it. At scale, canaries are the difference between "we caught it in 90 seconds" and "40,000 users churned before we noticed."
The architecture outlined here—edge-based routing, stratified analysis, automated decisions, gradual rollout—is what companies like Netflix, Vercel, and Cloudflare use in production. It's not simple, but it works.
References and Further Reading
Industry Engineering Blogs:
- Netflix TechBlog: "Automated Canary Analysis at Netflix with Kayenta"
- Uber Engineering: "Introducing Domain-Oriented Microservice Architecture"
- Meta Engineering: "Building Reliable Systems with Gatekeeper"
- Cloudflare: "How We Use Edge Workers for Progressive Rollouts"
Papers:
- "Continuous Delivery and Progressive Deployment" (Google SRE Book)
- "Statistical Analysis of A/B Test Results" (Microsoft Research)
Tools:
- Cloudflare Workers (edge compute)
- Vercel Edge Functions (edge routing)
- LaunchDarkly (feature flags)
- Datadog/New Relic (RUM + APM)