Canary Deployment Architecture for Frontends: Production-Grade Strategies at Scale

June 18, 2026 · 231 min read

canary deployment

frontend architecture

progressive delivery

feature flags

deployment strategies

Introduction: Why Frontend Canaries Are Fundamentally Different

When Netflix pushed a buggy JavaScript bundle that caused infinite re-render loops for 2% of their traffic, they caught it in 90 seconds. When a smaller company did the same without canary deployments, they discovered the issue after 40,000 users had already experienced crashes, leading to a 12% same-day churn spike.

The difference? A production-grade canary deployment architecture built specifically for frontend constraints.

Most engineers understand canary deployments for backend services: route 5% of traffic to new code, compare error rates, proceed or roll back. Simple. But frontends break this mental model in fundamental ways:

The Frontend Canary Problem:

Static Assets Are Immutable - You can't "gradually deploy" a JavaScript bundle. Once it's on the CDN, it's there. You need traffic splitting, not deployment splitting.
Browser Caching Creates Version Chaos - User A might load index.html (new) but main.js (old). User B might have the opposite. You're not deploying one version, you're managing version matrices.
Client-Side State Persists Across Deployments - A user's localStorage, IndexedDB, and ServiceWorker cache might be from v47, while your canary is v51. Migrations happen in the browser, not on deploy.
Hydration Failures Are Silent - SSR/SSG apps can serve perfectly fine HTML from the canary, then crash during hydration. Traditional monitoring catches this too late.
CDN Cache Invalidation Takes Time - You can't instantly roll back a frontend deployment. CDN propagation means bad code lives for 30-300 seconds minimum, affecting thousands of requests.

This article explains how companies like Vercel, Cloudflare, and Netflix build frontend canary systems that account for these constraints while maintaining sub-100ms decision loops and zero-downtime rollbacks.

We'll cover the architecture decisions that matter: edge-based traffic splitting, client-side version detection, hydration monitoring, cache invalidation strategies, and the automation systems that make decisions faster than humans can.

Scale Context: Production Reality

Before diving into architecture, let's establish realistic production constraints for a hyper-scale frontend:

Traffic Profile:

DAU: 50M daily active users
Peak RPS: 450K requests/second (main HTML)
Asset Requests: 2.8M RPS (JS/CSS/images/fonts)
Geographic Distribution: 180+ countries, 60% mobile
CDN PoPs: 300+ edge locations worldwide
Simultaneous Deploys: 40-60 per day (feature teams + hotfixes)

Frontend Architecture:

Framework: Next.js 14 (App Router) with React Server Components
Rendering: Hybrid SSR + SSG + ISR
Bundle Size: 850KB initial (gzipped), 3.2MB total (all routes)
Code Splits: 120+ dynamic chunks
API Calls per Page: 6-12 (BFF aggregation)
WebSocket Connections: 8M concurrent (realtime features)

Deployment Constraints:

Build Time: 4-8 minutes (full production build)
CDN Propagation: 30-90 seconds (global edge cache)
Canary Duration: 5-45 minutes (depends on confidence)
Rollback SLA: <2 minutes (detection + action)
Acceptable Error Budget: 0.1% additional error rate during canary

Monitoring Requirements:

Metric Collection Latency: <10 seconds
Decision Loop: <60 seconds
Sample Size for Statistical Significance: Minimum 10K requests
False Positive Rate: <1% (automated rollback)

Cost Constraints:

CDN Bandwidth: $0.08/GB (1.2PB/month = $96K/month)
Edge Compute: $0.50 per million requests
Monitoring: $12K/month (APM + RUM + logs)
Canary Overhead: Must stay <5% of infrastructure costs

At this scale, a naive canary implementation breaks. You need purpose-built architecture.

High-Level Architecture: Frontend Canary System

A production-grade frontend canary system has seven layers:

┌─────────────────────────────────────────────────────────────────┐
│                         USER REQUEST                             │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 1: DNS / Global Load Balancer                            │
│  - Geographic routing (latency-based)                           │
│  - DDoS protection                                              │
│  - Health checks                                                │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 2: CDN Edge (300+ PoPs)                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Edge Worker (V8 Isolate)                                │   │
│  │  - Traffic splitting logic                               │   │
│  │  - Version assignment (cookie/header)                    │   │
│  │  - Cache key variation                                   │   │
│  │  - Client hints inspection                               │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Canary Decision: 95% → stable / 5% → canary                   │
└─────────┬────────────────────────────────┬─────────────────────┘
          │                                │
          ▼                                ▼
┌──────────────────┐              ┌──────────────────┐
│  STABLE ORIGIN   │              │  CANARY ORIGIN   │
│                  │              │                  │
│  /dist-v47/      │              │  /dist-v48/      │
│  main.js         │              │  main.js         │
│  index.html      │              │  index.html      │
│  _next/chunks/   │              │  _next/chunks/   │
└──────────────────┘              └──────────────────┘
          │                                │
          └────────────┬───────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 3: Origin Servers (Kubernetes)                           │
│  - SSR rendering (Node.js pods)                                 │
│  - API BFF (GraphQL aggregation)                                │
│  - Server Components execution                                  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 4: Client-Side Instrumentation                           │
│  - Version detection (injected in HTML)                         │
│  - Performance metrics (Core Web Vitals)                        │
│  - Error tracking (window.onerror, React error boundaries)      │
│  - Hydration timing (React profiling)                           │
│  - Navigation timing (PerformanceObserver)                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 5: Telemetry Pipeline                                    │
│  - Structured logging (JSON)                                    │
│  - Metrics aggregation (Prometheus/Datadog)                     │
│  - Real-time streaming (Kafka → Flink)                          │
│  - Time-series database (InfluxDB/TimescaleDB)                  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 6: Canary Analysis Engine                                │
│  - Statistical comparison (two-sample t-test)                   │
│  - Anomaly detection (IQR, Z-score)                             │
│  - Threshold evaluation (SLO-based)                             │
│  - Confidence scoring (Bayesian inference)                      │
│  - Automated decision (proceed/hold/rollback)                   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  LAYER 7: Deployment Orchestration                              │
│  - Progressive rollout (5% → 25% → 50% → 100%)                 │
│  - CDN cache purge (selective invalidation)                     │
│  - Feature flag coordination (LaunchDarkly/custom)              │
│  - Rollback execution (atomic pointer swap)                     │
│  - Incident automation (PagerDuty/Slack)                        │
└─────────────────────────────────────────────────────────────────┘

Key Architectural Principles:

Edge-First Traffic Splitting - Decision happens at CDN edge (not origin). This prevents origin load amplification and enables <1ms routing decisions.
Sticky Sessions via Cookie - Once a user is assigned canary/stable, they stay there for the entire session. Prevents A/B switching mid-session which causes hydration failures.
Separate Asset Paths - Canary and stable assets live in different CDN paths (/dist-v47/ vs /dist-v48/). No shared cache keys. This prevents version collisions.
Client-Side Version Injection - Every HTML response includes <meta name="app-version" content="v48-canary">. Enables client-side telemetry tagging and error attribution.
Real-Time Metrics Streaming - Telemetry flows through Kafka to Flink for sub-10-second aggregation. Batch processing (5-minute windows) is too slow for canary decisions.
Automated Decision Loop - Humans approve the deploy, but robots decide rollout progression. Statistical tests run every 30 seconds. If canary fails, rollback happens in <60 seconds without human intervention.
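
The automated decision loop and the staged rollout can be sketched as a small pure function. This is a minimal illustration, not the orchestration code of any particular platform: the stage list mirrors the 5% → 25% → 50% → 100% progression from the diagram, and the names `Recommendation`, `STAGES`, and `nextCanaryPercentage` are hypothetical.

```typescript
// Hypothetical sketch: maps an analysis verdict to the next canary percentage.
type Recommendation = "proceed" | "hold" | "rollback";

// Stages taken from the rollout progression above (5% -> 25% -> 50% -> 100%).
const STAGES: readonly number[] = [5, 25, 50, 100];

function nextCanaryPercentage(current: number, verdict: Recommendation): number {
  if (verdict === "rollback") return 0;   // send all traffic back to stable
  if (verdict === "hold") return current; // keep collecting data at this stage

  // "proceed": advance to the first stage larger than the current percentage
  const next = STAGES.find((stage) => stage > current);
  return next === undefined ? 100 : next; // already at the last stage
}
```

A control plane would call something like this on every analysis tick and write the result to the store the edge worker reads from.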

Traffic Splitting Mechanisms: Edge vs Origin vs Client

There are four places you can implement canary traffic splitting. Each has different tradeoffs.

1. DNS-Based Splitting (DON'T USE)

user → DNS (weighted records) → 95% to stable-lb.example.com
                              → 5% to canary-lb.example.com

Why This Fails:

DNS caching (60s-3600s TTL) means rollback takes minutes to hours
Recursive resolvers cache a single answer and serve it to many users, so the effective split drifts from the configured weights
No session stickiness
Geographic distribution is uneven

Verdict: Never use DNS for frontend canaries. It's too slow and unpredictable.

2. Load Balancer-Based Splitting (LEGACY)

user → ALB/NLB → weighted target groups → 95% stable pods
                                        → 5% canary pods

Why This Works (Sort Of):

Session stickiness via cookies
Fast rollback (<10 seconds)
Origin-level control

Why This Fails at Scale:

Load balancer becomes a bottleneck (L4/L7 inspection overhead)
No geographic granularity (all regions get same canary %)
Origin load amplification (cache misses hit origin harder)
Doesn't work with CDN-cached static assets

Verdict: Works for SSR-heavy apps without CDN, but not optimal for modern frontends.

3. CDN-Based Splitting (GOOD)

graph TB
    User[User Request] --> CDN[CDN Edge PoP]
    CDN --> Cache{Asset in<br/>Edge Cache?}

    Cache -->|Yes| Return[Return Cached Asset]
    Cache -->|No| EdgeLogic[Edge Worker Logic]

    EdgeLogic --> Hash{Hash user ID<br/>% 100}
    Hash -->|"< 5"| Canary[Set canary cookie<br/>Route to /dist-v48/]
    Hash -->|">= 5"| Stable[Set stable cookie<br/>Route to /dist-v47/]

    Canary --> FetchCanary[Fetch from Canary Origin]
    Stable --> FetchStable[Fetch from Stable Origin]

    FetchCanary --> CacheCanary[Cache with key:<br/>canary-v48-/path]
    FetchStable --> CacheStable[Cache with key:<br/>stable-v47-/path]

    CacheCanary --> Return
    CacheStable --> Return

Implementation (Cloudflare Workers):

// Deployed to 300+ edge PoPs
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);
    const cookies = parseCookies(request.headers.get('Cookie') || '');

    // Check if user already has a version assignment
    let assignedVersion = cookies['x-canary-version'];

    if (!assignedVersion) {
      // New user - assign based on hash
      const userId = cookies['user_id'] || generateAnonymousId();
      const hash = hashCode(userId);
      const bucket = Math.abs(hash) % 100;

      // Get current canary percentage from KV (updated by control plane)
      const canaryPct = parseInt(await env.KV.get('canary-percentage') || '5');

      assignedVersion = bucket < canaryPct ? 'canary' : 'stable';
    }

    // Get version numbers from KV
    const stableVersion = await env.KV.get('stable-version'); // "v47"
    const canaryVersion = await env.KV.get('canary-version'); // "v48"

    const targetVersion = assignedVersion === 'canary'
      ? canaryVersion
      : stableVersion;

    // Rewrite URL to version-specific path
    const originPath = `/dist-${targetVersion}${url.pathname}`;
    const originUrl = new URL(originPath, env.ORIGIN_URL);

    // Cache key: the Cache API keys entries by request URL. The rewritten origin
    // URL already contains the version path (/dist-v48/...), so stable and
    // canary entries cannot collide.
    const cacheKey = new Request(originUrl.toString(), { method: 'GET' });

    // Check edge cache first
    let response = await caches.default.match(cacheKey);

    if (!response) {
      // Cache miss - fetch from origin
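      // Copy the incoming headers; spreading a Headers object yields no entries
      const originHeaders = new Headers(request.headers);
      originHeaders.set('X-Canary-Version', assignedVersion);
      originHeaders.set('X-App-Version', targetVersion ?? '');
      response = await fetch(originUrl.toString(), { headers: originHeaders });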

      // Cache the response (if cacheable)
      if (response.ok && response.headers.get('Cache-Control')) {
        const cachedResponse = response.clone();
        ctx.waitUntil(caches.default.put(cacheKey, cachedResponse));
      }
    }

    // Inject version cookie in response (append, so origin cookies are preserved)
    const newResponse = new Response(response.body, response);
    newResponse.headers.append(
      'Set-Cookie',
      `x-canary-version=${assignedVersion}; Path=/; Max-Age=86400; SameSite=Lax; Secure`
    );

    // Add version header for telemetry
    newResponse.headers.set('X-Served-Version', targetVersion);
    newResponse.headers.set('X-Canary-Assignment', assignedVersion);

    return newResponse;
  }
};

function hashCode(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash; // Convert to 32-bit integer
  }
  return hash;
}

function parseCookies(cookieHeader: string): Record<string, string> {
  return Object.fromEntries(
    cookieHeader.split(';').map(c => {
      const [key, ...v] = c.trim().split('=');
      return [key, v.join('=')];
    })
  );
}

function generateAnonymousId(): string {
  return crypto.randomUUID();
}

Why This Works:

Decision happens at edge (300+ PoPs, <1ms latency)
User stickiness via cookie (session-consistent)
Separate cache keys prevent version collisions
Dynamic canary percentage (a KV update takes effect without a redeploy, subject to KV propagation delay)
Works for both HTML and static assets

Why This Still Has Limitations:

Edge Worker CPU limits (as low as 10-50ms of CPU time per request, depending on provider and plan)
KV eventual consistency (can take 60s to propagate globally)
Requires CDN that supports edge compute (Cloudflare, Fastly, AWS CloudFront Functions)

Verdict: This is the industry standard for frontend canaries at scale.
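
One property worth checking in any bucketing scheme is that raising the canary percentage only adds users to the canary and never moves existing canary users back to stable. The sketch below illustrates this with an FNV-1a hash, which is an assumption chosen for illustration and not the `hashCode` function used in the worker above; `fnv1a`, `bucketFor`, and `assign` are hypothetical names.

```typescript
// FNV-1a (32-bit), applied to UTF-16 code units. A common non-cryptographic
// hash, used here only as an illustrative stand-in.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash >>> 0;
}

// Bucket in [0, 100). The same user ID always yields the same bucket.
function bucketFor(userId: string): number {
  return fnv1a(userId) % 100;
}

// A user is in the canary when their bucket falls below the current percentage.
function assign(userId: string, canaryPct: number): "canary" | "stable" {
  return bucketFor(userId) < canaryPct ? "canary" : "stable";
}
```

Because the bucket is fixed per user and only the threshold moves, a user assigned to the canary at 5% is still in the canary at 25%, which keeps the ramp consistent even if the assignment cookie is lost.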

4. Client-Side Splitting (AVOID)

<!-- In initial HTML -->
<script>
  const canaryPct = 5;
  const bucket = Math.floor(Math.random() * 100);
  const version = bucket < canaryPct ? 'v48' : 'v47';

  // Dynamically load versioned bundle
  const script = document.createElement('script');
  script.src = `/dist-${version}/main.js`;
  document.head.appendChild(script);
</script>

Why This Fails:

Initial HTML is already cached (can't control version)
Random assignment changes on every page load (no stickiness)
No SSR support
Breaks preloading and resource hints
Hurts Core Web Vitals (delayed script execution)

Verdict: Only use as a last resort if you have zero backend control.

Frontend-Specific Canary Challenges

Backend canary deployments are stateless: send request, get response, compare metrics. Frontend canaries have three fundamental problems that break this model.

Challenge 1: Static Asset Cache Coherence

The Problem:

You deploy canary v48. A user requests:

index.html (v48, canary) ← Edge routes to canary origin
main.js (v47, stable) ← Browser cache hit from yesterday
chunk-profile.js (v48, canary) ← Code-split route, cache miss

Now the user is running v48 HTML + v47 main bundle + v48 profile chunk. The webpack runtime crashes (typically with a chunk load error) because the chunk manifests don't align.

Why This Happens:

CDN cache and browser cache have different TTLs:

HTML: Cache-Control: public, max-age=0, must-revalidate (always check origin)
JS bundles: Cache-Control: public, max-age=31536000, immutable (cache forever)

When you deploy canary, HTML immediately points to v48 assets, but browser cache still has v47 bundles.
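
The two header policies above can be expressed as a path-based rule. This is a sketch under the assumption that long-lived assets carry a hex content hash of at least eight characters in their filename; the `HASHED_ASSET` pattern and `cacheControlFor` helper are hypothetical and would need adjusting to a real build's naming scheme.

```typescript
// Assumed naming scheme: name.<hex hash>.<ext>, e.g. main.a3f5d2b9.js
const HASHED_ASSET = /\.[0-9a-f]{8,}\.(?:js|css|woff2?)$/;

// Hypothetical helper: picks a Cache-Control value from the request path.
function cacheControlFor(pathname: string): string {
  if (HASHED_ASSET.test(pathname)) {
    // The URL changes whenever the content changes, so caching forever is safe.
    return "public, max-age=31536000, immutable";
  }
  // HTML and any file without a content hash: revalidate on every request.
  return "public, max-age=0, must-revalidate";
}
```

The long TTL is only safe for URLs that change with their content; a stable URL such as `/main.js` served with `immutable` is exactly what produces the mixed-version state described above.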

Solution 1: Content-Addressed Assets (Standard)

Every asset has a hash in its filename:

main.a3f5d2b9.js    ← v47 stable
main.c7e9f1a4.js    ← v48 canary

When HTML changes version, it references different asset URLs. Browser cache is keyed by URL, so no collision.

Webpack/Next.js automatically does this:

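// next.config.js
const { execSync } = require('child_process');

module.exports = {
  generateBuildId: async () => {
    // Use git commit SHA as build ID
    return execSync('git rev-parse HEAD').toString().trim();
  },

  // The build ID appears in paths such as /_next/static/<buildId>/_buildManifest.js;
  // chunk filenames carry their own content hashes
}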

Solution 2: Versioned Asset Paths

CDN edge worker rewrites paths:

// Canary: /dist-v48/_next/static/chunks/main.js
// Stable: /dist-v47/_next/static/chunks/main.js

Both can coexist in CDN cache with different cache keys.

The Hydration Problem:

Even with content-addressed assets, Server-Side Rendering creates timing issues.

Scenario:

User requests /product/123 at 10:00:00 AM
Edge routes to canary origin (v48)
SSR renders React tree with v48 code
HTML sent to client with <script src="/main.c7e9f1a4.js">
Client fetches main.js at 10:00:02 AM
During those 2 seconds, canary fails and gets rolled back
CDN edge now routes to stable origin (v47)
But HTML already references v48 assets
Hydration error: Client-side React tree doesn't match server-rendered HTML

Solution: Version Pinning in HTML

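// SSR render time
import { renderToString } from 'react-dom/server';

// renderToString is synchronous and returns the markup as a string
const response = renderToString(
  <App version="v48" buildId="c7e9f1a4" />
);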

// Inject version in HTML
const html = `
<!DOCTYPE html>
<html>
<head>
  <meta name="app-version" content="v48" data-build-id="c7e9f1a4">
  <script>
    // Client checks version before hydration
    window.__APP_VERSION__ = 'v48';
    window.__BUILD_ID__ = 'c7e9f1a4';
  </script>

  <!-- Asset URLs include version -->
  <script src="/dist-v48/main.c7e9f1a4.js"></script>
</head>
<body>
  <div id="root">${response}</div>
</body>
</html>
`;

Client-side version check:

// Runs before React hydration
async function validateVersion() {
  const expectedVersion = window.__APP_VERSION__;
  const expectedBuildId = window.__BUILD_ID__;

  // Check if version is still valid
  const response = await fetch('/api/version-check', {
    headers: {
      'X-Client-Version': expectedVersion,
      'X-Build-ID': expectedBuildId
    }
  });

  const { valid, currentVersion } = await response.json();

  if (!valid) {
    console.warn(`Version mismatch: expected ${expectedVersion}, current ${currentVersion}`);

    // Option 1: Hard reload (loses client state)
    window.location.reload();
    return false;

    // Option 2 (instead of reloading): soft migration that preserves state
    // await migrateClientState(expectedVersion, currentVersion);
  }

  return true;
}

validateVersion().then((ok) => {
  // Hydrate only if the page is not about to reload
  if (ok) {
    hydrateRoot(document.getElementById('root'), <App />);
  }
});

The Cost:

Extra API call before hydration (adds 50-150ms to TTI)
Complicates deployment (version registry service)
Can cause reload loops if migration fails

Better Solution: Server-Sent Version Hints

CDN edge injects version into HTML response:

// Cloudflare Worker
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await fetch(request);

    // Only inject for HTML responses
    if (!response.headers.get('Content-Type')?.includes('text/html')) {
      return response;
    }

    const html = await response.text();

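    // Get current versions from KV (updated on rollback)
    const currentCanaryVersion = await env.KV.get('canary-version');
    const currentStableVersion = await env.KV.get('stable-version');

    // The expected version depends on the user's assignment cookie;
    // comparing every user against the canary version would reload stable users forever
    const isCanary = /(?:^|;\s*)x-canary-version=canary(?:;|$)/
      .test(request.headers.get('Cookie') || '');
    const expectedVersion = isCanary ? currentCanaryVersion : currentStableVersion;

    // Inject version check script
    const versionCheckScript = `
      <script>
        (function() {
          const serverVersion = ${JSON.stringify(expectedVersion)};
          const clientVersion = document.querySelector('meta[name="app-version"]')?.content;

          if (serverVersion && clientVersion && serverVersion !== clientVersion) {
            console.warn('Version drift detected, reloading...');
            window.location.reload();
          }
        })();
      </script>
    `;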

    const modifiedHtml = html.replace('</head>', `${versionCheckScript}</head>`);

    return new Response(modifiedHtml, {
      status: response.status,
      headers: response.headers
    });
  }
};

Verdict: Content-addressed assets + versioned paths + edge version checks = cache coherence.

Challenge 2: Canary Metrics Are Skewed

The Sampling Bias Problem:

You deploy canary to 5% of traffic. After 10 minutes:

Canary error rate: 0.15%
Stable error rate: 0.08%

Should you roll back? Not necessarily.

Why Metrics Are Biased:

Geographic Distribution: Canary users might be disproportionately in regions with slower networks (affects LCP/INP).
Device Distribution: If bucket assignment correlates with user attributes (for example, sequential user IDs run through a weak hash), and newer users tend to have better devices, canary will show better performance even if code is identical.
Bot Traffic: Bots don't have cookies, so they get reassigned canary/stable on every request. This dilutes real user metrics.
Time-of-Day Effects: If you start canary at 2 PM EST, US traffic dominates. By 10 PM EST, APAC traffic dominates. Different usage patterns skew metrics.
Sample Size Disparity: 5% canary = 22.5K RPS. 95% stable = 427.5K RPS. Small sample sizes have higher variance.
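
The sample-size point can be made concrete with the standard error of a proportion, sqrt(p(1 - p) / n). The sketch below applies it to the 0.15% vs 0.08% error rates from the example above; the request counts are illustrative assumptions that keep the 5%/95% split, and `standardError` and `zScore` are hypothetical helper names.

```typescript
// Standard error of an observed rate p measured over n requests.
function standardError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// z-score of the difference between two independent proportions (unpooled).
function zScore(
  pCanary: number, nCanary: number,
  pStable: number, nStable: number
): number {
  const seCanary = standardError(pCanary, nCanary);
  const seStable = standardError(pStable, nStable);
  return (pCanary - pStable) / Math.sqrt(seCanary * seCanary + seStable * seStable);
}

// Same observed rates, two different (assumed) sample sizes.
const small = zScore(0.0015, 10000, 0.0008, 190000);
const large = zScore(0.0015, 1000000, 0.0008, 19000000);
console.log(small.toFixed(2), large.toFixed(2));
```

With 10K canary requests the gap is roughly 1.8 standard errors, which is below the conventional 1.96 cutoff for a 5% significance level. With 1M canary requests the same gap is roughly 18 standard errors. The observed rates are identical in both cases; only the amount of evidence differs.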

Solution: Stratified Sampling + Statistical Tests

interface CanaryMetrics {
  version: 'canary' | 'stable';
  timestamp: number;

  // Stratification dimensions
  region: string;        // "us-east", "eu-west", "ap-southeast"
  deviceType: string;    // "mobile", "desktop", "tablet"
  connectionType: string; // "4g", "3g", "wifi", "unknown"

  // Core metrics
  errorRate: number;     // errors / total requests
  p50Latency: number;    // milliseconds
  p95Latency: number;
  p99Latency: number;

  // Core Web Vitals
  lcp: number;           // Largest Contentful Paint
  fid: number;           // First Input Delay (deprecated, use INP)
  inp: number;           // Interaction to Next Paint
  cls: number;           // Cumulative Layout Shift

  // Hydration metrics
  hydrationTime: number; // milliseconds
  hydrationError: boolean;

  // Sample size
  sampleSize: number;
}

class CanaryAnalyzer {
  async compareMetrics(
    canaryMetrics: CanaryMetrics[],
    stableMetrics: CanaryMetrics[]
  ): Promise<AnalysisResult> {
    // Group metrics by stratification dimensions
    const canaryByStrata = this.stratify(canaryMetrics);
    const stableByStrata = this.stratify(stableMetrics);

    const results: StratumResult[] = [];

    // Compare each stratum independently
    for (const stratum of Object.keys(canaryByStrata)) {
      const canaryData = canaryByStrata[stratum];
      const stableData = stableByStrata[stratum];

      if (!stableData) {
        console.warn(`No stable data for stratum ${stratum}`);
        continue;
      }

      // Require minimum sample size for statistical significance
      if (canaryData.sampleSize < 1000 || stableData.sampleSize < 1000) {
        console.warn(`Insufficient sample size for stratum ${stratum}`);
        continue;
      }

      // Perform two-sample t-test for each metric
      const errorRateTest = this.twoSampleTTest(
        canaryData.errorRates,
        stableData.errorRates
      );

      const p95LatencyTest = this.twoSampleTTest(
        canaryData.p95Latencies,
        stableData.p95Latencies
      );

      const lcpTest = this.twoSampleTTest(
        canaryData.lcps,
        stableData.lcps
      );

      results.push({
        stratum,
        sampleSize: canaryData.sampleSize, // used as the weight in aggregateResults
        errorRateDelta: errorRateTest.delta,
        errorRateSignificant: errorRateTest.pValue < 0.05,
        p95LatencyDelta: p95LatencyTest.delta,
        p95LatencySignificant: p95LatencyTest.pValue < 0.05,
        lcpDelta: lcpTest.delta,
        lcpSignificant: lcpTest.pValue < 0.05,
      });
    }

    // Aggregate results across strata
    return this.aggregateResults(results);
  }

  private stratify(metrics: CanaryMetrics[]): Record<string, AggregatedMetrics> {
    const stratified: Record<string, CanaryMetrics[]> = {};

    for (const metric of metrics) {
      // Create stratum key
      const key = `${metric.region}:${metric.deviceType}:${metric.connectionType}`;

      if (!stratified[key]) {
        stratified[key] = [];
      }

      stratified[key].push(metric);
    }

    // Aggregate within each stratum
    const aggregated: Record<string, AggregatedMetrics> = {};

    for (const [key, metrics] of Object.entries(stratified)) {
      aggregated[key] = {
        errorRates: metrics.map(m => m.errorRate),
        p95Latencies: metrics.map(m => m.p95Latency),
        lcps: metrics.map(m => m.lcp),
        sampleSize: metrics.reduce((sum, m) => sum + m.sampleSize, 0),
      };
    }

    return aggregated;
  }

  private twoSampleTTest(
    sample1: number[],
    sample2: number[]
  ): { delta: number; pValue: number } {
    const mean1 = this.mean(sample1);
    const mean2 = this.mean(sample2);
    const delta = mean1 - mean2;

    const variance1 = this.variance(sample1);
    const variance2 = this.variance(sample2);

    const n1 = sample1.length;
    const n2 = sample2.length;

    // Welch's t-test (unequal variances)
    const tStatistic = delta / Math.sqrt(variance1 / n1 + variance2 / n2);

    // Degrees of freedom (Welch-Satterthwaite equation)
    const df = Math.pow(variance1 / n1 + variance2 / n2, 2) /
      (Math.pow(variance1 / n1, 2) / (n1 - 1) + Math.pow(variance2 / n2, 2) / (n2 - 1));

    // Calculate p-value (simplified - use stats library in production)
    const pValue = this.tTestPValue(tStatistic, df);

    return { delta, pValue };
  }

  private mean(values: number[]): number {
    return values.reduce((sum, v) => sum + v, 0) / values.length;
  }

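  private variance(values: number[]): number {
    // Sample variance (divides by n - 1), as Welch's t-test expects
    const mean = this.mean(values);
    const squaredDiffs = values.map(v => Math.pow(v - mean, 2));
    return squaredDiffs.reduce((sum, d) => sum + d, 0) / (values.length - 1);
  }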

  private tTestPValue(tStatistic: number, df: number): number {
    // Use jStat or math.js for actual t-distribution CDF
    // Simplified approximation for example
    return 2 * (1 - this.normalCDF(Math.abs(tStatistic)));
  }

  private normalCDF(x: number): number {
    // Standard normal CDF approximation
    return 0.5 * (1 + this.erf(x / Math.sqrt(2)));
  }

  private erf(x: number): number {
    // Error function approximation
    const sign = x >= 0 ? 1 : -1;
    x = Math.abs(x);

    const a1 = 0.254829592;
    const a2 = -0.284496736;
    const a3 = 1.421413741;
    const a4 = -1.453152027;
    const a5 = 1.061405429;
    const p = 0.3275911;

    const t = 1 / (1 + p * x);
    const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);

    return sign * y;
  }

  private aggregateResults(results: StratumResult[]): AnalysisResult {
    // Weighted average by sample size
    let totalSampleSize = 0;
    let weightedErrorDelta = 0;
    let weightedLatencyDelta = 0;
    let weightedLcpDelta = 0;

    for (const result of results) {
      const weight = result.sampleSize || 1;
      totalSampleSize += weight;
      weightedErrorDelta += result.errorRateDelta * weight;
      weightedLatencyDelta += result.p95LatencyDelta * weight;
      weightedLcpDelta += result.lcpDelta * weight;
    }

    return {
      overallErrorRateDelta: weightedErrorDelta / totalSampleSize,
      overallP95LatencyDelta: weightedLatencyDelta / totalSampleSize,
      overallLcpDelta: weightedLcpDelta / totalSampleSize,
      significantRegressions: results.filter(r =>
        (r.errorRateSignificant && r.errorRateDelta > 0) ||
        (r.p95LatencySignificant && r.p95LatencyDelta > 0) ||
        (r.lcpSignificant && r.lcpDelta > 0)
      ).length,
      recommendation: this.makeRecommendation(results),
    };
  }

  private makeRecommendation(results: StratumResult[]): 'proceed' | 'hold' | 'rollback' {
    // Rollback if ANY stratum shows significant regression in critical metrics
    const criticalRegressions = results.filter(r =>
      (r.errorRateSignificant && r.errorRateDelta > 0.01) || // >1% error rate increase
      (r.lcpSignificant && r.lcpDelta > 500) // >500ms LCP increase
    );

    if (criticalRegressions.length > 0) {
      return 'rollback';
    }

    // Hold if minor regressions in non-critical strata
    const minorRegressions = results.filter(r =>
      (r.p95LatencySignificant && r.p95LatencyDelta > 100) // >100ms latency increase
    );

    if (minorRegressions.length > results.length * 0.2) { // >20% of strata
      return 'hold';
    }

    return 'proceed';
  }
}

Production Thresholds (Netflix-scale):

Metric	Threshold	Action
Error Rate	>0.05% increase	Immediate rollback
LCP	>300ms increase	Rollback
INP	>100ms increase	Rollback
Hydration Errors	>0.01% occurrence	Rollback
P95 API Latency	>200ms increase	Hold, investigate
Memory Usage	>50MB increase	Hold, investigate
Bundle Size	>100KB increase	Review, no auto-rollback

Verdict: Don't compare raw metrics. Use stratified sampling + statistical tests + domain-specific thresholds.
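
The threshold table can be encoded directly as a rule function. This is a sketch, not a description of any company's production system: the `MetricDeltas` shape, the `Action` type, and `evaluateThresholds` are hypothetical, and the numbers are copied from the table above.

```typescript
type Action = "rollback" | "hold" | "review" | "none";

// Canary-minus-stable deltas, in the units used by the table above.
interface MetricDeltas {
  errorRatePct: number;      // percentage points, e.g. 0.05 means +0.05%
  lcpMs: number;
  inpMs: number;
  hydrationErrorPct: number; // occurrence rate in percent
  p95ApiLatencyMs: number;
  memoryMb: number;
  bundleSizeKb: number;
}

function evaluateThresholds(d: MetricDeltas): Action {
  // Rollback-class rules are checked first so they always win.
  if (
    d.errorRatePct > 0.05 ||
    d.lcpMs > 300 ||
    d.inpMs > 100 ||
    d.hydrationErrorPct > 0.01
  ) {
    return "rollback";
  }
  if (d.p95ApiLatencyMs > 200 || d.memoryMb > 50) return "hold";
  if (d.bundleSizeKb > 100) return "review"; // flagged, never auto-rolled back
  return "none";
}
```

Ordering the checks by severity means a canary that trips both a rollback rule and a hold rule is rolled back rather than merely paused.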

Challenge 3: Client-Side State Migrations

The Problem:

You deploy canary v48 which changes localStorage schema:

// v47 (stable)
localStorage.setItem('cart', JSON.stringify({
  items: [{ id: 1, qty: 2 }]
}));

// v48 (canary)
localStorage.setItem('cart', JSON.stringify({
  version: 2,
  items: [{ productId: 1, quantity: 2, addedAt: Date.now() }]
}));

A user loads v47 (stable), adds items to cart, then navigates to a new page which gets routed to v48 (canary). The canary code reads localStorage, finds old schema, crashes.

Why This Happens:

Unlike backend deployments where you can run database migrations atomically, frontend state lives in the user's browser across deployments. Canary and stable can read/write the same storage.

Solution 1: Defensive Reads with Schema Versioning

interface CartV1 {
  items: Array<{ id: number; qty: number }>;
}

interface CartV2 {
  version: 2;
  items: Array<{
    productId: number;
    quantity: number;
    addedAt: number;
  }>;
}

type Cart = CartV1 | CartV2;

function readCart(): CartV2 {
  const raw = localStorage.getItem('cart');

  if (!raw) {
    return { version: 2, items: [] };
  }

  try {
    const data = JSON.parse(raw) as Cart;

    // Check version
    if ('version' in data && data.version === 2) {
      return data; // Already v2
    }

    // Migrate v1 → v2
    const v1 = data as CartV1;
    const migrated: CartV2 = {
      version: 2,
      items: v1.items.map(item => ({
        productId: item.id,
        quantity: item.qty,
        addedAt: Date.now(), // Best guess
      })),
    };

    // Write migrated version
    localStorage.setItem('cart', JSON.stringify(migrated));

    return migrated;
  } catch (error) {
    console.error('Failed to read cart', error);

    // Corrupted data - reset
    const empty: CartV2 = { version: 2, items: [] };
    localStorage.setItem('cart', JSON.stringify(empty));
    return empty;
  }
}
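
Note that `JSON.parse(raw) as Cart` is only a compile-time assertion, so malformed data would reach the migration unchecked. A minimal sketch of runtime guards that could sit in front of it follows. The guard names and the `...Shape` interfaces are hypothetical; the shapes simply mirror CartV1 and CartV2 above.

```typescript
// Sketch: runtime validation for the two cart schemas. All names are illustrative.

interface CartV1Shape {
  items: Array<{ id: number; qty: number }>;
}

interface CartV2Shape {
  version: 2;
  items: Array<{ productId: number; quantity: number; addedAt: number }>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

function isCartV2(value: unknown): value is CartV2Shape {
  if (!isRecord(value) || value.version !== 2) return false;
  const items = value.items;
  if (!Array.isArray(items)) return false;
  return items.every(
    (item) =>
      isRecord(item) &&
      typeof item.productId === 'number' &&
      typeof item.quantity === 'number' &&
      typeof item.addedAt === 'number'
  );
}

function isCartV1(value: unknown): value is CartV1Shape {
  // v1 carts have no version field at all.
  if (!isRecord(value) || 'version' in value) return false;
  const items = value.items;
  if (!Array.isArray(items)) return false;
  return items.every(
    (item) => isRecord(item) && typeof item.id === 'number' && typeof item.qty === 'number'
  );
}

// Classify a parsed payload before deciding whether to return, migrate, or reset.
function classifyCart(value: unknown): 'v2' | 'v1' | 'invalid' {
  if (isCartV2(value)) return 'v2';
  if (isCartV1(value)) return 'v1';
  return 'invalid';
}
```

With guards like these, the reader can migrate only when the payload is a valid v1 cart and reset on anything else, instead of relying on the catch block to absorb a TypeError.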

The Conflict Problem:

User opens two tabs:

Tab 1: Stable (v47) - writes v1 schema
Tab 2: Canary (v48) - reads, migrates to v2, writes v2 schema
Tab 1: Writes again - overwrites with v1 schema
Tab 2: Reads - sees v1 again, re-migrates

This causes data loss and migration loops.
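
The sequence can be reproduced with an in-memory stand-in for localStorage. This is a simplified simulation rather than browser code: `MemoryStore` and the two tab functions are hypothetical, and timestamps are passed in explicitly so the run is deterministic.

```typescript
// Simulation of the two-tab conflict using an in-memory store instead of localStorage.
class MemoryStore {
  private data = new Map<string, string>();

  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }

  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}

type V1Item = { id: number; qty: number };
type V2Item = { productId: number; quantity: number; addedAt: number };

// Stable (v47) tab: always writes the v1 schema.
function stableTabWrite(store: MemoryStore, items: V1Item[]): void {
  store.setItem('cart', JSON.stringify({ items }));
}

// Canary (v48) tab: reads the cart, migrating v1 to v2 when needed.
function canaryTabRead(
  store: MemoryStore,
  now: number
): { items: V2Item[]; migrated: boolean } {
  const parsed = JSON.parse(store.getItem('cart') ?? '{"version":2,"items":[]}');

  if (parsed.version === 2) {
    return { items: parsed.items as V2Item[], migrated: false };
  }

  const items: V2Item[] = (parsed.items as V1Item[]).map((item) => ({
    productId: item.id,
    quantity: item.qty,
    addedAt: now, // The original add time is unknown, so it is reset on every migration.
  }));
  store.setItem('cart', JSON.stringify({ version: 2, items }));
  return { items, migrated: true };
}

const sharedStore = new MemoryStore();
stableTabWrite(sharedStore, [{ id: 1, qty: 2 }]);   // Tab 1 writes v1
const firstRead = canaryTabRead(sharedStore, 1000);  // Tab 2 migrates to v2
stableTabWrite(sharedStore, [{ id: 1, qty: 3 }]);   // Tab 1 overwrites with v1
const secondRead = canaryTabRead(sharedStore, 2000); // Tab 2 has to migrate again

console.log(firstRead.migrated, secondRead.migrated, secondRead.items[0].addedAt);
```

Both reads report a migration, and the second one carries a fresh `addedAt`: any v2-only data written between the two steps is gone.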

Solution 2: Versioned Keys

function readCart(version: string): CartV2 {
  const key = `cart:${version}`;
  const raw = localStorage.getItem(key);

  if (raw) {
    return JSON.parse(raw);
  }

  // Try to migrate from previous version
  const previousVersion = getPreviousVersion(version);
  if (previousVersion) {
    const previousKey = `cart:${previousVersion}`;
    const previousRaw = localStorage.getItem(previousKey);

    if (previousRaw) {
      const migrated = migrateCart(JSON.parse(previousRaw), previousVersion, version);
      localStorage.setItem(key, JSON.stringify(migrated));
      return migrated;
    }
  }

  return { version: 2, items: [] };
}

function writeCart(version: string, cart: CartV2): void {
  const key = `cart:${version}`;
  localStorage.setItem(key, JSON.stringify(cart));
}

The Storage Explosion Problem:

If you keep versioned keys indefinitely, localStorage fills up (browsers typically cap it at roughly 5-10 MB per origin). You need garbage collection:

function cleanupOldVersions(currentVersion: string): void {
  const allKeys = Object.keys(localStorage);
  const cartKeys = allKeys.filter(k => k.startsWith('cart:'));

  for (const key of cartKeys) {
    const [, version] = key.split(':');

    if (version !== currentVersion && isOlderThan(version, currentVersion, 7)) {
      // Delete versions older than 7 days
      localStorage.removeItem(key);
    }
  }
}
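
The `isOlderThan` helper above is left undefined, and a version string alone says nothing about age. One way to make age-based cleanup concrete is to store a timestamp next to each cart. The sketch below assumes a `{ savedAt, cart }` envelope and a small `KeyValueStore` interface; both are illustrative and not part of the schema shown earlier.

```typescript
// Sketch: age-based cleanup for versioned cart keys.
// Assumes each value is stored as an envelope { savedAt, cart } (an assumption).
interface KeyValueStore {
  keys(): string[];
  getItem(key: string): string | null;
  removeItem(key: string): void;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

function cleanupStaleCartKeys(
  store: KeyValueStore,
  currentVersion: string,
  now: number,
  maxAgeMs: number = SEVEN_DAYS_MS
): string[] {
  const removed: string[] = [];

  for (const key of store.keys()) {
    // Leave unrelated keys and the current version's cart alone.
    if (!key.startsWith('cart:') || key === `cart:${currentVersion}`) continue;

    let savedAt = 0; // Entries without a readable timestamp count as infinitely old.
    try {
      const parsed = JSON.parse(store.getItem(key) ?? 'null');
      if (parsed && typeof parsed.savedAt === 'number') {
        savedAt = parsed.savedAt;
      }
    } catch {
      // Corrupted JSON: keep savedAt = 0 so the entry is removed below.
    }

    if (now - savedAt > maxAgeMs) {
      store.removeItem(key);
      removed.push(key);
    }
  }

  return removed;
}
```

In a browser, a thin adapter over localStorage (with `keys()` backed by `Object.keys(localStorage)`) would satisfy `KeyValueStore`; injecting the store also keeps the function testable outside the browser.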

Solution 3: Backend-Synchronized State (Best)

Don't rely on client storage for critical state. Sync to backend:

// Shape of the localStorage cache entry (cart plus cache metadata)
interface CachedCart {
  data: CartV2;
  timestamp: number;
  version: string;
}

class CartManager {
  constructor(
    private readonly version: string,
    private readonly userId: string
  ) {}

  async loadCart(): Promise<CartV2> {
    // Try local cache first
    const cached = this.readLocalCache();
    if (cached && !this.isStale(cached)) {
      return cached.data; // Unwrap the envelope written by writeLocalCache
    }

    // Fetch from backend (source of truth)
    const response = await fetch('/api/cart', {
      headers: { 'X-App-Version': this.version }
    });

    if (!response.ok) {
      throw new Error(`Failed to load cart: ${response.status}`);
    }

    const cart: CartV2 = await response.json();

    // Update local cache
    this.writeLocalCache(cart);

    return cart;
  }

  async updateCart(updates: Partial<CartV2>): Promise<void> {
    // Optimistic update
    const current = await this.loadCart();
    const updated = { ...current, ...updates };
    this.writeLocalCache(updated);

    // Sync to backend
    try {
      const response = await fetch('/api/cart', {
        method: 'PUT',
        headers: {
          'Content-Type': 'application/json',
          'X-App-Version': this.version,
        },
        body: JSON.stringify(updated),
      });

      // fetch() only rejects on network failure, so check HTTP errors explicitly
      if (!response.ok) {
        throw new Error(`Failed to save cart: ${response.status}`);
      }
    } catch (error) {
      // Rollback optimistic update
      this.writeLocalCache(current);
      throw error;
    }
  }

  private readLocalCache(): CachedCart | null {
    const raw = localStorage.getItem('cart:cache');
    if (!raw) return null;

    try {
      return JSON.parse(raw) as CachedCart;
    } catch {
      return null; // Corrupted cache entry: fall back to the backend
    }
  }

  private writeLocalCache(cart: CartV2): void {
    const entry: CachedCart = {
      data: cart,
      timestamp: Date.now(),
      version: this.version,
    };
    localStorage.setItem('cart:cache', JSON.stringify(entry));
  }

  private isStale(cached: CachedCart): boolean {
    // Stale after 60 seconds, or when written by a different app version
    const age = Date.now() - cached.timestamp;
    return age > 60000 || cached.version !== this.version;
  }
}

Verdict: Use versioned schemas + backend sync for critical state. Accept that client-only state will occasionally break during a canary and require a reset.

Canary Analysis System: Automated Decision Making

The entire canary system lives or dies on the analysis pipeline. Humans approve the deploy, but robots must decide progression, because:

Canary windows are short (5-45 minutes)
Decisions need to happen every 30-60 seconds
Metrics are noisy (need statistical significance)
False positives are expensive (block good deploys)
False negatives are catastrophic (let bad code reach 100%)

Here's how production systems do it:

Architecture: Real-Time Metrics Pipeline

graph LR
    Client[Browser/Device] -->|RUM Beacon| Ingestion[Kafka Ingestion]
    CDNLogs[CDN Access Logs] -->|Stream| Ingestion
    OriginLogs[Origin Server Logs] -->|Stream| Ingestion

    Ingestion --> Flink[Flink Stream Processing]

    Flink --> Aggregate[Time-Window Aggregation<br/>30s tumbling windows]

    Aggregate --> Stratify[Stratification<br/>by region/device/connection]

    Stratify --> TSDB[(InfluxDB/TimescaleDB<br/>Time-Series Storage)]

    TSDB --> Analyzer[Canary Analyzer Service]

    Analyzer --> StatTests[Statistical Tests<br/>t-test, Mann-Whitney]

    StatTests --> Anomaly[Anomaly Detection<br/>IQR, Z-score]

    Anomaly --> Threshold[Threshold Evaluation<br/>SLO-based]

    Threshold --> Bayes[Bayesian Confidence<br/>Prior + Evidence]

    Bayes --> Decision{Decision}

    Decision -->|Proceed| Progression[Progressive Rollout<br/>5% → 25% → 50% → 100%]
    Decision -->|Hold| Monitor[Continue Monitoring]
    Decision -->|Rollback| Rollback[Automated Rollback<br/>+ Incident Creation]

    Progression --> UpdateKV[Update KV Store<br/>canary-percentage]
    Rollback --> UpdateKV

    UpdateKV --> CDNEdge[CDN Edge Workers]
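
The aggregation and stratification steps in the diagram can be illustrated in a few lines. The sketch below only shows the windowing logic on an in-memory array; in the pipeline above this work would run inside the stream processor. `BeaconEvent`, `WindowAggregate`, and their fields are assumptions, not a real RUM schema.

```typescript
// Sketch: 30-second tumbling-window aggregation keyed by version and stratum.
interface BeaconEvent {
  timestampMs: number;
  version: 'canary' | 'stable';
  region: string;
  deviceType: string;
  isError: boolean;
}

interface WindowAggregate {
  windowStartMs: number;
  version: 'canary' | 'stable';
  stratum: string;
  requests: number;
  errors: number;
}

const WINDOW_MS = 30000;

function aggregateTumbling(events: BeaconEvent[]): WindowAggregate[] {
  const buckets = new Map<string, WindowAggregate>();

  for (const event of events) {
    // Tumbling windows do not overlap: each event belongs to exactly one window.
    const windowStartMs = Math.floor(event.timestampMs / WINDOW_MS) * WINDOW_MS;
    const stratum = `${event.region}:${event.deviceType}`;
    const key = `${windowStartMs}|${event.version}|${stratum}`;

    let bucket = buckets.get(key);
    if (!bucket) {
      bucket = { windowStartMs, version: event.version, stratum, requests: 0, errors: 0 };
      buckets.set(key, bucket);
    }

    bucket.requests += 1;
    if (event.isError) bucket.errors += 1;
  }

  return Array.from(buckets.values()).sort((a, b) => a.windowStartMs - b.windowStartMs);
}
```

Keeping raw counts (requests, errors) per window rather than precomputed rates lets the analyzer pool windows later without averaging averages.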

Implementation: Canary Analyzer Service

interface MetricSnapshot {
  timestamp: number;
  version: 'canary' | 'stable';
  stratum: {
    region: string;
    deviceType: string;
    connectionType: string;
  };
  metrics: {
    requests: number;
    errors: number;
    errorRate: number;
    p50Latency: number;
    p95Latency: number;
    p99Latency: number;
    lcp: number;
    inp: number;
    cls: number;
    hydrationErrors: number;
    jsErrors: number;
  };
}

interface CanaryDecision {
  timestamp: number;
  decision: 'proceed' | 'hold' | 'rollback';
  confidence: number; // 0-1
  reason: string;
  metrics: {
    errorRateDelta: number;
    latencyDelta: number;
    lcpDelta: number;
  };
  recommendation: {
    nextPercentage?: number; // If proceeding
    holdDuration?: number; // If holding
    rollbackReason?: string; // If rolling back
  };
}

class CanaryAnalyzerService {
  private tsdb: TimeSeriesDB;
  private config: CanaryConfig;

  constructor(tsdb: TimeSeriesDB, config: CanaryConfig) {
    this.tsdb = tsdb;
    this.config = config;
  }

  async analyze(deploymentId: string): Promise<CanaryDecision> {
    // Fetch metrics from last 5 minutes
    const endTime = Date.now();
    const startTime = endTime - (5 * 60 * 1000);

    const canaryMetrics = await this.tsdb.query({
      measurement: 'frontend_metrics',
      tags: {
        deployment_id: deploymentId,
        version: 'canary',
      },
      timeRange: [startTime, endTime],
    });

    const stableMetrics = await this.tsdb.query({
      measurement: 'frontend_metrics',
      tags: {
        version: 'stable',
      },
      timeRange: [startTime, endTime],
    });

    // Check minimum sample size
    const canaryRequests = canaryMetrics.reduce((sum, m) => sum + m.metrics.requests, 0);

    if (canaryRequests < this.config.minSampleSize) {
      return {
        timestamp: Date.now(),
        decision: 'hold',
        confidence: 0,
        reason: `Insufficient sample size: ${canaryRequests} < ${this.config.minSampleSize}`,
        metrics: { errorRateDelta: 0, latencyDelta: 0, lcpDelta: 0 },
        recommendation: {
          holdDuration: 60000, // Wait 1 more minute
        },
      };
    }

    // Stratify metrics
    const canaryByStrata = this.stratify(canaryMetrics);
    const stableByStrata = this.stratify(stableMetrics);

    // Run analysis per stratum
    const stratumResults: StratumAnalysis[] = [];

    for (const stratum of Object.keys(canaryByStrata)) {
      const canaryData = canaryByStrata[stratum];
      const stableData = stableByStrata[stratum];

      if (!stableData) continue;

      const result = await this.analyzeStratum(canaryData, stableData, stratum);
      stratumResults.push(result);
    }

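    // No stratum has both canary and stable data yet, so there is nothing to compare
    if (stratumResults.length === 0) {
      return {
        timestamp: Date.now(),
        decision: 'hold',
        confidence: 0,
        reason: 'No strata with both canary and stable data',
        metrics: { errorRateDelta: 0, latencyDelta: 0, lcpDelta: 0 },
        recommendation: { holdDuration: 60000 },
      };
    }

    // Run anomaly detection
    const anomalies = await this.detectAnomalies(deploymentId, canaryMetrics);

    // Bayesian decision making (per-stratum results are weighted inside)
    const decision = this.makeBayesianDecision(stratumResults, anomalies);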

    return decision;
  }

  private stratify(metrics: MetricSnapshot[]): Record<string, MetricSnapshot[]> {
    const stratified: Record<string, MetricSnapshot[]> = {};

    for (const metric of metrics) {
      const key = `${metric.stratum.region}:${metric.stratum.deviceType}:${metric.stratum.connectionType}`;

      if (!stratified[key]) {
        stratified[key] = [];
      }

      stratified[key].push(metric);
    }

    return stratified;
  }

  private async analyzeStratum(
    canaryMetrics: MetricSnapshot[],
    stableMetrics: MetricSnapshot[],
    stratum: string
  ): Promise<StratumAnalysis> {
    // Extract metric arrays
    const canaryErrors = canaryMetrics.map(m => m.metrics.errorRate);
    const stableErrors = stableMetrics.map(m => m.metrics.errorRate);

    const canaryP95 = canaryMetrics.map(m => m.metrics.p95Latency);
    const stableP95 = stableMetrics.map(m => m.metrics.p95Latency);

    const canaryLCP = canaryMetrics.map(m => m.metrics.lcp);
    const stableLCP = stableMetrics.map(m => m.metrics.lcp);

    // Statistical tests
    const errorTest = this.welchTTest(canaryErrors, stableErrors);
    const latencyTest = this.welchTTest(canaryP95, stableP95);
    const lcpTest = this.welchTTest(canaryLCP, stableLCP);

    // Calculate effect sizes (Cohen's d)
    const errorEffectSize = this.cohensD(canaryErrors, stableErrors);
    const latencyEffectSize = this.cohensD(canaryP95, stableP95);
    const lcpEffectSize = this.cohensD(canaryLCP, stableLCP);

    return {
      stratum,
      sampleSize: canaryMetrics.reduce((sum, m) => sum + m.metrics.requests, 0),
      errorRate: {
        canaryMean: this.mean(canaryErrors),
        stableMean: this.mean(stableErrors),
        delta: errorTest.delta,
        pValue: errorTest.pValue,
        significant: errorTest.pValue < 0.05,
        effectSize: errorEffectSize,
      },
      p95Latency: {
        canaryMean: this.mean(canaryP95),
        stableMean: this.mean(stableP95),
        delta: latencyTest.delta,
        pValue: latencyTest.pValue,
        significant: latencyTest.pValue < 0.05,
        effectSize: latencyEffectSize,
      },
      lcp: {
        canaryMean: this.mean(canaryLCP),
        stableMean: this.mean(stableLCP),
        delta: lcpTest.delta,
        pValue: lcpTest.pValue,
        significant: lcpTest.pValue < 0.05,
        effectSize: lcpEffectSize,
      },
    };
  }

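  private welchTTest(sample1: number[], sample2: number[]): TTestResult {
    const mean1 = this.mean(sample1);
    const mean2 = this.mean(sample2);
    const delta = mean1 - mean2;

    const n1 = sample1.length;
    const n2 = sample2.length;

    // Sample variance needs at least two observations per group
    if (n1 < 2 || n2 < 2) {
      return { delta, pValue: 1, tStatistic: 0 };
    }

    const variance1 = this.variance(sample1);
    const variance2 = this.variance(sample2);

    const standardError = Math.sqrt(variance1 / n1 + variance2 / n2);

    // Both samples are constant: avoid dividing by zero
    if (standardError === 0) {
      return delta === 0
        ? { delta, pValue: 1, tStatistic: 0 }
        : { delta, pValue: 0, tStatistic: Math.sign(delta) * Infinity };
    }

    const tStatistic = delta / standardError;

    // Degrees of freedom (Welch-Satterthwaite)
    const df = Math.pow(variance1 / n1 + variance2 / n2, 2) /
      (Math.pow(variance1 / n1, 2) / (n1 - 1) + Math.pow(variance2 / n2, 2) / (n2 - 1));

    // Two-sided p-value. Despite its name, tDistributionCDF returns the p-value
    // directly (use a stats library in production).
    const pValue = this.tDistributionCDF(Math.abs(tStatistic), df);

    return { delta, pValue, tStatistic };
  }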

  private cohensD(sample1: number[], sample2: number[]): number {
    const mean1 = this.mean(sample1);
    const mean2 = this.mean(sample2);

    const variance1 = this.variance(sample1);
    const variance2 = this.variance(sample2);

    const n1 = sample1.length;
    const n2 = sample2.length;

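    // Pooled standard deviation
    const pooledSD = Math.sqrt(
      ((n1 - 1) * variance1 + (n2 - 1) * variance2) / (n1 + n2 - 2)
    );

    // Too few observations to estimate spread
    if (!Number.isFinite(pooledSD)) return 0;

    // Both samples constant: effect is either absent or unbounded
    if (pooledSD === 0) {
      return mean1 === mean2 ? 0 : Math.sign(mean1 - mean2) * Infinity;
    }

    return (mean1 - mean2) / pooledSD;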
  }

  private async detectAnomalies(
    deploymentId: string,
    canaryMetrics: MetricSnapshot[]
  ): Promise<Anomaly[]> {
    const anomalies: Anomaly[] = [];

    // Get historical baseline (last 7 days)
    const baseline = await this.tsdb.query({
      measurement: 'frontend_metrics',
      tags: { version: 'stable' },
      timeRange: [Date.now() - (7 * 24 * 60 * 60 * 1000), Date.now()],
      aggregation: 'mean',
    });

    // Calculate IQR for each metric
    const errorRates = baseline.map(m => m.metrics.errorRate).sort((a, b) => a - b);
    const errorIQR = this.calculateIQR(errorRates);

    const latencies = baseline.map(m => m.metrics.p95Latency).sort((a, b) => a - b);
    const latencyIQR = this.calculateIQR(latencies);

    // Check canary metrics against baseline
    for (const metric of canaryMetrics) {
      // Error rate anomaly
      if (metric.metrics.errorRate > errorIQR.q3 + 1.5 * errorIQR.iqr) {
        anomalies.push({
          type: 'error_rate',
          severity: 'critical',
          value: metric.metrics.errorRate,
          threshold: errorIQR.q3 + 1.5 * errorIQR.iqr,
          message: `Error rate ${metric.metrics.errorRate.toFixed(4)} exceeds threshold`,
        });
      }

      // Latency anomaly
      if (metric.metrics.p95Latency > latencyIQR.q3 + 1.5 * latencyIQR.iqr) {
        anomalies.push({
          type: 'latency',
          severity: 'warning',
          value: metric.metrics.p95Latency,
          threshold: latencyIQR.q3 + 1.5 * latencyIQR.iqr,
          message: `P95 latency ${metric.metrics.p95Latency}ms exceeds threshold`,
        });
      }

      // Hydration error spike
      if (metric.metrics.hydrationErrors > 0) {
        anomalies.push({
          type: 'hydration_error',
          severity: 'critical',
          value: metric.metrics.hydrationErrors,
          threshold: 0,
          message: `Hydration errors detected: ${metric.metrics.hydrationErrors}`,
        });
      }
    }

    return anomalies;
  }

  private calculateIQR(sortedValues: number[]): { q1: number; q3: number; iqr: number } {
    const n = sortedValues.length;
    const q1Index = Math.floor(n * 0.25);
    const q3Index = Math.floor(n * 0.75);

    const q1 = sortedValues[q1Index];
    const q3 = sortedValues[q3Index];

    return { q1, q3, iqr: q3 - q1 };
  }

  private makeBayesianDecision(
    stratumResults: StratumAnalysis[],
    anomalies: Anomaly[]
  ): CanaryDecision {
    // Prior probability (based on historical rollback rate)
    const priorRollbackRate = 0.08; // 8% of canaries get rolled back
    let posteriorRollbackProb = priorRollbackRate;

    // Update based on statistical tests
    const criticalRegressions = stratumResults.filter(r =>
      (r.errorRate.significant && r.errorRate.delta > this.config.thresholds.errorRate) ||
      (r.lcp.significant && r.lcp.delta > this.config.thresholds.lcp)
    );

    if (criticalRegressions.length > 0) {
      // Strong evidence of regression
      posteriorRollbackProb = 0.95;
    }

    // Update based on anomalies
    const criticalAnomalies = anomalies.filter(a => a.severity === 'critical');

    if (criticalAnomalies.length > 0) {
      posteriorRollbackProb = Math.max(posteriorRollbackProb, 0.9);
    }

    // Effect size consideration
    const largeEffectSizes = stratumResults.filter(r =>
      Math.abs(r.errorRate.effectSize) > 0.8 || // Large effect
      Math.abs(r.lcp.effectSize) > 0.8
    );

    if (largeEffectSizes.length > 0 && posteriorRollbackProb < 0.5) {
      posteriorRollbackProb = 0.5; // Moderate confidence
    }

    // Make decision
    let decision: 'proceed' | 'hold' | 'rollback';
    let reason: string;
    const recommendation: CanaryDecision['recommendation'] = {};

    if (posteriorRollbackProb > 0.7) {
      decision = 'rollback';
      reason = `High rollback probability (${posteriorRollbackProb.toFixed(2)}). `;

      if (criticalAnomalies.length > 0) {
        reason += `Critical anomalies: ${criticalAnomalies.map(a => a.message).join(', ')}`;
      } else {
        reason += `Significant regressions in ${criticalRegressions.length} strata`;
      }

      recommendation.rollbackReason = reason;
    } else if (posteriorRollbackProb > 0.3 || largeEffectSizes.length > 0) {
      decision = 'hold';
      reason = `Moderate rollback probability (${posteriorRollbackProb.toFixed(2)}). Collecting more data.`;
      recommendation.holdDuration = 120000; // Hold for 2 minutes
    } else {
      decision = 'proceed';
      reason = `Low rollback probability (${posteriorRollbackProb.toFixed(2)}). Metrics within acceptable range.`;
      recommendation.nextPercentage = this.calculateNextPercentage(stratumResults);
    }

    // Calculate aggregate deltas
    const totalSampleSize = stratumResults.reduce((sum, r) => sum + r.sampleSize, 0);
    const weightedErrorDelta = stratumResults.reduce(
      (sum, r) => sum + r.errorRate.delta * r.sampleSize,
      0
    ) / totalSampleSize;

    const weightedLatencyDelta = stratumResults.reduce(
      (sum, r) => sum + r.p95Latency.delta * r.sampleSize,
      0
    ) / totalSampleSize;

    const weightedLcpDelta = stratumResults.reduce(
      (sum, r) => sum + r.lcp.delta * r.sampleSize,
      0
    ) / totalSampleSize;

    return {
      timestamp: Date.now(),
      decision,
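      // Confidence in the decision actually taken (a rollback at 0.95 reports 0.95)
      confidence: decision === 'rollback' ? posteriorRollbackProb : 1 - posteriorRollbackProb,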
      reason,
      metrics: {
        errorRateDelta: weightedErrorDelta,
        latencyDelta: weightedLatencyDelta,
        lcpDelta: weightedLcpDelta,
      },
      recommendation,
    };
  }

  private calculateNextPercentage(stratumResults: StratumAnalysis[]): number {
    // Conservative progression if any warnings
    const warnings = stratumResults.filter(r =>
      (r.errorRate.significant && r.errorRate.delta > 0) ||
      (r.p95Latency.significant && r.p95Latency.delta > 100) ||
      (r.lcp.significant && r.lcp.delta > 200)
    );

    if (warnings.length > 0) {
      return 25; // Go to 25% cautiously
    }

    // Aggressive progression if clear improvements
    const improvements = stratumResults.filter(r =>
      (r.errorRate.significant && r.errorRate.delta < 0) ||
      (r.p95Latency.significant && r.p95Latency.delta < -50) ||
      (r.lcp.significant && r.lcp.delta < -100)
    );

    if (improvements.length > stratumResults.length * 0.5) {
      return 100; // Go to 100% quickly
    }

    // Default: gradual progression
    return 50;
  }

  private mean(values: number[]): number {
    return values.reduce((sum, v) => sum + v, 0) / values.length;
  }

  private variance(values: number[]): number {
    const mean = this.mean(values);
    return values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / (values.length - 1);
  }

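  private tDistributionCDF(t: number, df: number): number {
    // Simplified: returns a two-sided p-value from the normal approximation and
    // ignores df, which understates p-values for small samples.
    // Use jStat or math.js in production.
    return 2 * (1 - this.normalCDF(t));
  }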

  private normalCDF(z: number): number {
    return 0.5 * (1 + this.erf(z / Math.sqrt(2)));
  }

  private erf(x: number): number {
    // Abramowitz and Stegun approximation
    const sign = x >= 0 ? 1 : -1;
    x = Math.abs(x);

    const a1 = 0.254829592;
    const a2 = -0.284496736;
    const a3 = 1.421413741;
    const a4 = -1.453152027;
    const a5 = 1.061405429;
    const p = 0.3275911;

    const t = 1 / (1 + p * x);
    const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);

    return sign * y;
  }
}

interface CanaryConfig {
  minSampleSize: number;
  thresholds: {
    errorRate: number;
    latency: number;
    lcp: number;
    inp: number;
  };
}

interface StratumAnalysis {
  stratum: string;
  sampleSize: number;
  errorRate: MetricComparison;
  p95Latency: MetricComparison;
  lcp: MetricComparison;
}

interface MetricComparison {
  canaryMean: number;
  stableMean: number;
  delta: number;
  pValue: number;
  significant: boolean;
  effectSize: number;
}

interface TTestResult {
  delta: number;
  pValue: number;
  tStatistic: number;
}

interface Anomaly {
  type: string;
  severity: 'critical' | 'warning';
  value: number;
  threshold: number;
  message: string;
}
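
`makeBayesianDecision` above assigns fixed posterior values (0.95, 0.9, 0.5) once evidence appears. For comparison, an explicit application of Bayes' rule is sketched below. Only the 8% prior comes from the code above; the likelihoods are invented for illustration and would need calibrating against your own deploy history, and the `Evidence` type and function names are hypothetical.

```typescript
// Sketch: explicit Bayes update for P(bad deploy | evidence).
// Likelihood values are hypothetical placeholders.
interface Evidence {
  label: string;
  probGivenBad: number;  // P(evidence | deploy is bad)
  probGivenGood: number; // P(evidence | deploy is good)
}

function updatePosterior(prior: number, evidence: Evidence): number {
  const numerator = evidence.probGivenBad * prior;
  const denominator = numerator + evidence.probGivenGood * (1 - prior);
  // Evidence that is impossible under both hypotheses carries no information.
  return denominator === 0 ? prior : numerator / denominator;
}

// Applies each observation in turn, assuming the observations are
// conditionally independent given the deploy's true state.
function posteriorAfter(prior: number, observations: Evidence[]): number {
  return observations.reduce(
    (posterior, evidence) => updatePosterior(posterior, evidence),
    prior
  );
}

const priorRollbackRate = 0.08; // from the analyzer above

const significantRegression: Evidence = {
  label: 'significant error-rate regression in one stratum',
  probGivenBad: 0.9,
  probGivenGood: 0.05,
};

const criticalAnomaly: Evidence = {
  label: 'critical anomaly against the 7-day baseline',
  probGivenBad: 0.8,
  probGivenGood: 0.1,
};

const afterOne = posteriorAfter(priorRollbackRate, [significantRegression]);
const afterTwo = posteriorAfter(priorRollbackRate, [significantRegression, criticalAnomaly]);
console.log(afterOne.toFixed(3), afterTwo.toFixed(3));
```

With these made-up likelihoods, a single significant regression lifts the 8% prior to roughly 0.61 (a hold under the analyzer's 0.7 rollback threshold), and adding a critical anomaly takes it to roughly 0.93 (a rollback).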

Decision Loop Cadence:

┌─────────────────────────────────────────────────────────────┐
│  Canary Timeline (Progressive Rollout)                       │
└─────────────────────────────────────────────────────────────┘

T+0m    Deploy canary (5% traffic)
        ├─ Edge workers updated
        ├─ Metrics start flowing
        └─ Analysis: HOLD (insufficient data)

T+2m    First decision point
        ├─ Sample size: 15K requests
        ├─ Analysis: PROCEED (no regressions)
        └─ Action: Increase to 25%

T+5m    Second decision point
        ├─ Sample size: 75K requests
        ├─ Analysis: HOLD (latency spike in EU)
        └─ Action: Hold at 25%, investigate

T+8m    Third decision point
        ├─ Sample size: 180K requests
        ├─ Analysis: PROCEED (spike was CDN issue, resolved)
        └─ Action: Increase to 50%

T+12m   Fourth decision point
        ├─ Sample size: 450K requests
        ├─ Analysis: PROCEED (metrics within bounds)
        └─ Action: Increase to 100%

T+15m   Canary complete
        ├─ Total requests: 1.2M
        ├─ Final error rate delta: +0.003% (acceptable)
        └─ Mark as stable, promote to production

Rollback Scenario:

T+0m    Deploy canary (5% traffic)

T+2m    First decision point
        ├─ Sample size: 15K requests
        ├─ Analysis: ROLLBACK
        │   ├─ Error rate: 0.24% (stable: 0.08%)
        │   ├─ Delta: +0.16% (threshold: +0.05%)
        │   └─ Confidence: 0.95
        └─ Action: IMMEDIATE ROLLBACK

T+2m:30s Rollback initiated
        ├─ Update KV: canary-percentage = 0
        ├─ Edge workers stop routing to canary
        ├─ CDN cache purge: /dist-v48/*
        └─ Incident created in PagerDuty

T+3m    Rollback complete
        ├─ 100% traffic on stable (v47)
        ├─ Canary origin scaled down
        └─ Engineering team notified
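
Both timelines can be modelled as a pure state transition: each decision point takes the current rollout state plus a verdict and yields the next state. The sketch below is a simplification (it marks the rollout completed as soon as 100% is reached and ignores minimum stage durations), and all names in it are illustrative.

```typescript
// Sketch: the rollout timeline as a pure state transition.
type Verdict = 'proceed' | 'hold' | 'rollback';

interface RolloutState {
  percentage: number;
  phase: 'running' | 'completed' | 'rolled_back';
  history: string[];
}

const ROLLOUT_STAGES = [5, 25, 50, 100];

function applyVerdict(state: RolloutState, verdict: Verdict, minute: number): RolloutState {
  // Terminal states ignore further verdicts.
  if (state.phase !== 'running') return state;

  if (verdict === 'rollback') {
    return {
      percentage: 0,
      phase: 'rolled_back',
      history: [...state.history, `T+${minute}m rollback from ${state.percentage}%`],
    };
  }

  if (verdict === 'hold') {
    return {
      ...state,
      history: [...state.history, `T+${minute}m hold at ${state.percentage}%`],
    };
  }

  // Proceed: move to the next stage, never past the last one.
  const nextIndex = Math.min(
    ROLLOUT_STAGES.indexOf(state.percentage) + 1,
    ROLLOUT_STAGES.length - 1
  );
  const next = ROLLOUT_STAGES[nextIndex];
  return {
    percentage: next,
    phase: next === 100 ? 'completed' : 'running',
    history: [...state.history, `T+${minute}m proceed to ${next}%`],
  };
}

// Replay of the progressive timeline: hold, proceed, hold, proceed, proceed.
const timeline: Array<[Verdict, number]> = [
  ['hold', 0],
  ['proceed', 2],
  ['hold', 5],
  ['proceed', 8],
  ['proceed', 12],
];

let replayState: RolloutState = { percentage: 5, phase: 'running', history: [] };
for (const [verdict, minute] of timeline) {
  replayState = applyVerdict(replayState, verdict, minute);
}
console.log(replayState.percentage, replayState.phase);
```

Keeping the transition pure makes the controller easy to unit test and to replay against recorded verdicts after an incident.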

Verdict: Automated analysis must run every 30-60 seconds, combining Bayesian decision-making with anomaly detection. A false negative (a missed regression) is worse than a false positive (a blocked good deploy).

Progressive Rollout Strategies

Once the analyzer says "proceed," you need to gradually increase canary traffic. The progression strategy depends on risk tolerance and confidence.

Strategy 1: Percentage-Based (Standard)

5% → 25% → 50% → 100%

Timing:

Hold at 5% for 5-10 minutes (minimum viable sample)
Hold at 25% for 5-10 minutes (catch medium-impact bugs)
Hold at 50% for 5-10 minutes (validate at scale)
Jump to 100% if all clear

Implementation:

class ProgressiveRolloutController {
  private currentPercentage: number = 5;
  private deploymentStartTime: number = Date.now();
  // Tracks when the current stage began, so every stage gets its own minimum hold
  private stageStartTime: number = Date.now();

  async progressToNextStage(analysis: CanaryDecision): Promise<void> {
    if (analysis.decision === 'rollback') {
      await this.rollback(analysis.reason);
      return;
    }

    if (analysis.decision === 'hold') {
      console.log(`Holding at ${this.currentPercentage}%: ${analysis.reason}`);
      return;
    }

    // Proceed to next stage
    const nextPercentage = this.calculateNextPercentage();

    if (nextPercentage === this.currentPercentage) {
      console.log(`Already at ${this.currentPercentage}%, no progression`);
      return;
    }

    console.log(`Progressing from ${this.currentPercentage}% to ${nextPercentage}%`);

    await this.updateCanaryPercentage(nextPercentage);
    this.currentPercentage = nextPercentage;
    this.stageStartTime = Date.now(); // Restart the minimum-duration clock

    if (nextPercentage === 100) {
      await this.finalizeDeployment();
    }
  }

  private calculateNextPercentage(): number {
    const stages = [5, 25, 50, 100];
    const currentIndex = stages.indexOf(this.currentPercentage);

    if (currentIndex === -1 || currentIndex === stages.length - 1) {
      return this.currentPercentage;
    }

    // Measure time in the current stage, not time since the deployment started;
    // otherwise every stage after the first would progress immediately
    const timeAtCurrentStage = Date.now() - this.stageStartTime;
    const minDuration = 5 * 60 * 1000; // 5 minutes

    if (timeAtCurrentStage < minDuration) {
      return this.currentPercentage; // Not ready to progress
    }

    return stages[currentIndex + 1];
  }

  private async updateCanaryPercentage(percentage: number): Promise<void> {
    // Update KV store (propagates to all edge workers)
    await this.kv.put('canary-percentage', percentage.toString());

    // Log event
    await this.logger.info('canary_progression', {
      from: this.currentPercentage,
      to: percentage,
      timestamp: Date.now(),
    });

    // Send metrics
    await this.metrics.gauge('canary.percentage', percentage);
  }

  private async rollback(reason: string): Promise<void> {
    console.error(`Rolling back canary: ${reason}`);

    // Set percentage to 0 (disables canary routing)
    await this.updateCanaryPercentage(0);

    // Purge canary assets from CDN
    await this.purgeCDNCache('/dist-v48/*');

    // Create incident
    await this.createIncident({
      title: 'Canary Rollback: v48',
      severity: 'high',
      description: reason,
    });

    // Notify team
    await this.notifyTeam('Canary rolled back', reason);
  }

  private async finalizeDeployment(): Promise<void> {
    console.log('Canary successful, finalizing deployment');

    // Update stable version pointer
    await this.kv.put('stable-version', 'v48');

    // Purge old stable assets
    await this.purgeCDNCache('/dist-v47/*');

    // Update deployment status
    await this.db.updateDeployment({
      id: this.deploymentId,
      status: 'completed',
      completedAt: Date.now(),
    });

    // Notify team
    await this.notifyTeam('Deployment complete', 'Canary reached 100% successfully');
  }
}

Strategy 2: Region-Based (Geographic Rollout)

US-West → US-East → EU → APAC → Global

Why This Works:

Different regions have different peak times (natural load distribution)
Can catch region-specific bugs (locale, timezone, network)
Limits blast radius (if US-West fails, APAC is unaffected)

Implementation:

class RegionBasedRollout {
  private regions = ['us-west', 'us-east', 'eu-west', 'eu-central', 'ap-southeast', 'ap-northeast'];
  private currentRegionIndex = 0;

  async progressToNextRegion(analysis: CanaryDecision): Promise<void> {
    if (analysis.decision === 'rollback') {
      await this.rollback(analysis.reason);
      return;
    }

    if (analysis.decision === 'hold') {
      console.log(`Holding in region ${this.getCurrentRegion()}: ${analysis.reason}`);
      return;
    }

    // Mark current region as complete
    await this.markRegionComplete(this.getCurrentRegion());

    // Move to next region
    this.currentRegionIndex++;

    if (this.currentRegionIndex >= this.regions.length) {
      await this.finalizeGlobalRollout();
      return;
    }

    const nextRegion = this.getCurrentRegion();
    console.log(`Starting canary in region: ${nextRegion}`);

    await this.enableRegion(nextRegion);
  }

  private async enableRegion(region: string): Promise<void> {
    // Update KV with region-specific routing
    await this.kv.put(`canary-enabled:${region}`, 'true');

    // Edge workers check this key
    // If enabled for region, route to canary
  }

  private getCurrentRegion(): string {
    return this.regions[this.currentRegionIndex];
  }
}

Edge Worker Implementation:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
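    // request.cf.region is the visitor's state/province name (e.g. "Texas"), not a
    // rollout region, so map Cloudflare's geo data to the keys that enableRegion() writes.
    // toRolloutRegion is an application-specific helper (for example, keyed on request.cf.colo).
    const clientRegion = toRolloutRegion(request.cf) ?? 'unknown';

    // Check if canary is enabled for this region
    const canaryEnabled = await env.KV.get(`canary-enabled:${clientRegion}`);

    if (canaryEnabled !== 'true') {
      // Route to stable
      return fetchStableOrigin(request);
    }

    // Canary enabled for region - do percentage-based split
    const percentage = parseInt((await env.KV.get('canary-percentage')) || '50', 10);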
    const userId = getUserId(request);
    const bucket = hash(userId) % 100;

    if (bucket < percentage) {
      return fetchCanaryOrigin(request);
    }

    return fetchStableOrigin(request);
  }
};
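
The worker above relies on a `hash()` helper that is not shown. One possible stand-in is 32-bit FNV-1a, sketched below. The `salt` parameter is an addition (for example, a deployment ID) so that the same users do not land in the canary bucket on every deploy; it and the function names are assumptions.

```typescript
// Sketch: deterministic user bucketing with 32-bit FNV-1a.
// Operates on UTF-16 code units, which matches byte-wise FNV-1a for ASCII IDs.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // Force unsigned so the modulo below is never negative.
}

// Maps a user to a stable bucket in the range 0-99.
function bucketFor(userId: string, salt: string): number {
  return fnv1a32(`${salt}:${userId}`) % 100;
}

function isInCanary(userId: string, percentage: number, salt: string): boolean {
  return bucketFor(userId, salt) < percentage;
}

// The same user always gets the same answer for a given salt and percentage.
console.log(isInCanary('user-123', 25, 'deploy-v48'));
```

Because the bucket is fixed per user and salt, raising the percentage from 5 to 25 only adds users to the canary; nobody already on the canary is moved back to stable mid-session.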

Strategy 3: User-Segment-Based (Cohort Rollout)

Internal → Beta Users → Premium → Free

Why This Works:

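Internal users (employees) catch bugs first
Beta users have opted in to instability
Premium users get a more stable experience: the build reaches them only after internal and beta traffic has validated it
Free users receive the release last, once it has been validated on every earlier cohort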

Implementation:

class CohortBasedRollout {
  async determineCanaryEligibility(user: User): Promise<boolean> {
    // Read the rollout stage first, so that clearing 'canary-stage' (e.g. on
    // rollback) disables the canary for everyone, including internal and beta users
    const currentStage = await this.kv.get('canary-stage');

    const isInternal = user.email.endsWith('@company.com');
    const isBeta = user.betaProgram === true;
    const isPaid = user.tier === 'premium' || user.tier === 'enterprise';

    // Stages are cumulative: each stage also includes every earlier cohort
    switch (currentStage) {
      case 'internal':
        return isInternal;

      case 'beta':
        return isInternal || isBeta;

      case 'premium':
        return isInternal || isBeta || isPaid;

      case 'free':
        return true; // All users

      default:
        return false;
    }
  }
}

Edge Worker Integration:

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const userId = getUserIdFromCookie(request);

    if (!userId) {
      // Anonymous user - use percentage-based routing
      return percentageBasedRouting(request, env);
    }

    // Fetch user data (cached at edge)
    const user = await fetchUserData(userId, env);

    if (!user) {
      return percentageBasedRouting(request, env);
    }

    // Check canary eligibility
    const eligible = await determineCanaryEligibility(user, env);

    if (eligible) {
      return fetchCanaryOrigin(request, env);
    }

    return fetchStableOrigin(request, env);
  }
};

async function determineCanaryEligibility(user: any, env: Env): Promise<boolean> {
  const stage = await env.KV.get('canary-stage');

  const isInternal = user.email?.endsWith('@company.com') === true;
  const isBeta = user.betaProgram === true;
  const isPaid = ['premium', 'enterprise'].includes(user.tier);

  // Stages are cumulative: each stage also includes every earlier cohort
  if (stage === 'internal') {
    return isInternal;
  }

  if (stage === 'beta') {
    return isInternal || isBeta;
  }

  if (stage === 'premium') {
    return isInternal || isBeta || isPaid;
  }

  if (stage === 'free') {
    return true;
  }

  return false;
}

Comparison Table

Strategy	Blast Radius	Detection Speed	Complexity	Best For
Percentage-Based	5-50% of users	Fast (5-10 min)	Low	Standard deploys, high traffic apps
Region-Based	One region at a time	Medium (15-30 min per region)	Medium	Global apps with regional isolation
Cohort-Based	Specific user segments	Slow (hours to days)	High	B2B SaaS, tiered products
Hybrid (Cohort + Percentage)	Controlled subsets	Fast within cohort	High	Enterprise apps with beta programs

Verdict: Use percentage-based for most deploys, region-based for global apps, cohort-based for high-stakes enterprise products.
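
The hybrid row in the table (cohort + percentage) is not implemented above. A minimal sketch of how the two gates might combine follows: cohorts earlier than the current stage are fully on the canary, later cohorts are fully on stable, and the cohort currently being rolled out ramps by percentage. The types, stage names, and the `bucket` input are illustrative assumptions.

```typescript
// Sketch of the hybrid strategy: cohort gate first, then a percentage split
// inside the cohort that is currently being rolled out.
type Tier = 'free' | 'premium' | 'enterprise';
type RolloutStage = 'internal' | 'beta' | 'premium' | 'free';

interface RolloutUser {
  isInternal: boolean;
  betaProgram: boolean;
  tier: Tier;
}

const STAGE_ORDER: RolloutStage[] = ['internal', 'beta', 'premium', 'free'];

// The earliest stage at which this user becomes eligible.
function cohortOf(user: RolloutUser): RolloutStage {
  if (user.isInternal) return 'internal';
  if (user.betaProgram) return 'beta';
  if (user.tier === 'premium' || user.tier === 'enterprise') return 'premium';
  return 'free';
}

function routeToCanary(
  user: RolloutUser,
  currentStage: RolloutStage,
  percentageWithinStage: number,
  bucket: number // 0-99, e.g. from a stable hash of the user ID
): boolean {
  const userRank = STAGE_ORDER.indexOf(cohortOf(user));
  const stageRank = STAGE_ORDER.indexOf(currentStage);

  if (userRank > stageRank) return false; // Cohort not reached yet.
  if (userRank < stageRank) return true;  // Earlier cohorts are fully rolled out.
  return bucket < percentageWithinStage;  // Current cohort ramps by percentage.
}

const premiumUser: RolloutUser = { isInternal: false, betaProgram: false, tier: 'premium' };
console.log(routeToCanary(premiumUser, 'premium', 25, 10));
```

This keeps the blast radius inside one cohort at a time while still giving the analyzer a percentage knob to turn within that cohort.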

Rollback Mechanisms and Timing

When canary analysis detects a regression, the rollback must happen in <60 seconds. Here's how production systems do it.

Atomic Rollback: KV Store Pointer Swap

class AtomicRollbackController {
  async rollback(reason: string): Promise<void> {
    const startTime = Date.now();

    console.error(`[ROLLBACK INITIATED] ${reason}`);

    // Step 1: Disable canary routing (atomic operation)
    await this.kv.put('canary-percentage', '0');

    // Step 2: Purge canary assets from CDN (parallel)
    const purgePromises = [
      this.purgeCDN('/dist-v48/*'),
      this.purgeCDN('/api/v48/*'),
    ];

    await Promise.all(purgePromises);

    // Step 3: Scale down canary origin (don't wait)
    this.scaleDownCanary().catch(err => {
      console.error('Failed to scale down canary:', err);
    });

    // Step 4: Create incident
    await this.createIncident({
      title: `Canary Rollback: ${this.deploymentId}`,
      severity: 'high',
      description: reason,
      tags: ['canary', 'rollback', 'automated'],
    });

    // Step 5: Notify team (Slack + PagerDuty)
    await this.notifyTeam(reason);

    const duration = Date.now() - startTime;
    console.log(`[ROLLBACK COMPLETE] Duration: ${duration}ms`);

    // Metrics
    await this.metrics.increment('canary.rollback', 1, {
      deployment: this.deploymentId,
      reason: this.categorizeReason(reason),
    });

    await this.metrics.histogram('canary.rollback_duration', duration);
  }

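  private async purgeCDN(pattern: string): Promise<void> {
    // Cloudflare example. The `files` field takes exact URLs and does not expand
    // wildcards, so a path pattern like '/dist-v48/*' is sent as a prefix purge.
    // `this.hostname` is an assumed field (e.g. 'www.example.com'); prefixes are
    // host + path without a scheme.
    const prefix = `${this.hostname}${pattern.replace(/\*$/, '')}`;

    const response = await fetch(
      `https://api.cloudflare.com/client/v4/zones/${this.zoneId}/purge_cache`,
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${this.cfApiToken}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          prefixes: [prefix],
        }),
      }
    );

    if (!response.ok) {
      throw new Error(`CDN purge failed: ${response.statusText}`);
    }
  }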

  private categorizeReason(reason: string): string {
    if (reason.includes('error rate')) return 'error_rate';
    if (reason.includes('latency')) return 'latency';
    if (reason.includes('hydration')) return 'hydration';
    if (reason.includes('anomaly')) return 'anomaly';
    return 'unknown';
  }
}

The CDN Propagation Problem

Challenge: Even after you set canary-percentage = 0, edge workers at 300+ PoPs need to read the updated value. KV stores have eventual consistency (30-60 seconds).

Solution 1: Pessimistic Rollback (Purge Everything)

async function emergencyRollback(): Promise<void> {
  // Nuclear option: purge ALL HTML from CDN
  await purgeCDN('*.html');

  // This forces all edge workers to fetch fresh HTML from origin
  // Origin will render with stable version

  // Downside: Cache hit rate drops to 0%, origin load spikes
}

Solution 2: Edge Worker Cache Bypass

// Cloudflare Worker
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
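    // Check KV with the shortest edge cache the platform allows
    const canaryPercentage = await env.KV.get('canary-percentage', {
      cacheTtl: 60, // Workers KV enforces a minimum cacheTtl (60 seconds historically); a value like 5 is rejected
    });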

    // If canary disabled, bypass all canary logic
    if (canaryPercentage === '0') {
      return fetchStableOrigin(request);
    }

    // Normal routing logic
    // ...
  }
};

Solution 3: Push-Based Invalidation

class EdgeInvalidationService {
  async broadcastRollback(deploymentId: string): Promise<void> {
    // Use pub/sub to notify all edge workers
    await this.pubsub.publish('rollback', {
      deployment: deploymentId,
      timestamp: Date.now(),
    });

    // Edge workers subscribe to this channel
    // They immediately disable canary routing
  }
}

// In edge worker
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Check local in-memory flag (updated via pub/sub)
    if (env.ROLLBACK_ACTIVE) {
      return fetchStableOrigin(request);
    }

    // Normal routing
    // ...
  }
};

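// Pub/sub listener (conceptual sketch: `env.PUBSUB` is a hypothetical binding,
// and a flag set on `env` lives only in one isolate's memory, so a real system
// needs a platform-specific push channel that reaches every isolate)
env.PUBSUB.subscribe('rollback', (message) => {
  console.log(`Rollback received for deployment ${message.deployment}`);
  env.ROLLBACK_ACTIVE = true;
});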

Rollback Timing Breakdown:

T+0s     Analyzer detects regression
         └─ Decision: ROLLBACK

T+0.5s   Update KV store (canary-percentage = 0)
         └─ Atomic write completes

T+1s     First edge workers start seeing updated value
         └─ New requests route to stable as KV propagates (up to 30-60s per PoP)

T+2s     CDN cache purge initiated (parallel)
         ├─ HTML: /dist-v48/*.html
         ├─ JS: /dist-v48/*.js
         └─ CSS: /dist-v48/*.css

T+30s    CDN purge propagated globally
         └─ All edges serve stable assets

T+60s    Rollback complete
         ├─ 100% traffic on stable
         ├─ Incident created
         └─ Team notified

The Hidden Cost:

During rollback, some users will still hit canary for 30-60 seconds. This is unavoidable due to CDN propagation delays.

Mitigation:

Use client-side error recovery:

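// Injected in all HTML (stable + canary)
window.addEventListener('error', () => {
  const errorCount = parseInt(sessionStorage.getItem('error-count') || '0', 10) + 1;
  sessionStorage.setItem('error-count', errorCount.toString());

  // Recover at most once per session so a persistent error cannot cause a reload loop
  if (errorCount > 3 && !sessionStorage.getItem('recovery-attempted')) {
    sessionStorage.setItem('recovery-attempted', '1');

    // Too many errors - possible bad deployment
    console.warn('Excessive errors detected, redirecting to stable');

    // Redirect to stable explicitly; the changed URL also avoids cached HTML.
    // (location.reload(true) is non-standard and not supported across browsers.)
    const url = new URL(window.location.href);
    url.searchParams.set('force-stable', '1');
    window.location.replace(url.toString());
  }
});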

Verdict: Atomic KV update + CDN purge + client-side recovery = <60s rollback window.

Feature Flags vs Canary Deployments

A common question: "Can't we just use feature flags instead of canary deployments?"

Short Answer: No, but they complement each other.

Comparison

Aspect	Feature Flags	Canary Deployment
What Changes	Application behavior	Entire codebase
Scope	Single feature/code path	All code, including infrastructure
Granularity	Per-user, per-feature	Per-request, per-version
Rollback Speed	Instant (toggle off)	30-60s (CDN purge)
Bundle Impact	Increases bundle (both paths shipped)	No impact (only one version loaded)
Testing Coverage	Only new code path	Entire app (including dependencies)
Use Case	A/B testing, gradual feature release	Deployment risk mitigation

When to Use Feature Flags

A/B Testing - "Should button be blue or green?"
Gradual Feature Rollout - "Enable new checkout flow for 10% of users"
Emergency Kill Switch - "Disable payment processor if it's down"
User Segmentation - "Show premium features only to paid users"

When to Use Canary Deployments

Dependency Updates - "Upgraded React 17 → 18, will it break?"
Build Tool Changes - "Switched Webpack → Vite, are bundles correct?"
Infrastructure Changes - "Migrated CDN provider, is routing correct?"
Large Refactors - "Rewrote state management, does it work?"

The Hybrid Approach (Best)

Use both in coordination:

class HybridDeploymentStrategy {
  async deploy(version: string, featureFlags: string[]): Promise<void> {
    // Step 1: Deploy canary with new features DISABLED
    console.log(`Deploying canary ${version} with features disabled`);

    await this.deployCanary(version, {
      featureFlags: featureFlags.map(f => ({ name: f, enabled: false })),
    });

    // Step 2: Wait for canary to stabilize (5-10 minutes)
    await this.waitForStability();

    // Step 3: Enable features gradually via flags
    for (const flag of featureFlags) {
      console.log(`Enabling feature: ${flag}`);

      await this.featureFlagService.enable(flag, {
        percentage: 5, // Start at 5%
        canaryOnly: true, // Only in canary traffic
      });

      // Step 4: Monitor feature-specific metrics
      await this.monitorFeature(flag, 5 * 60 * 1000); // 5 minutes

      // Step 5: If stable, increase to 100%
      await this.featureFlagService.enable(flag, {
        percentage: 100,
        canaryOnly: true,
      });
    }

    // Step 6: If all features stable, progress canary
    await this.progressCanary(25); // 5% → 25%

    // Step 7: Eventually enable features in stable too
    await this.enableFeaturesInStable(featureFlags);
  }
}

Real-World Example: Meta's Gatekeeper + Canary System

Meta uses "Gatekeeper" (feature flags) + canary deployments in tandem:

Deploy new React Native version as canary (code change)
Keep new features behind Gatekeeper flags (behavior change)
Stabilize canary (no flags enabled)
Enable flags at 1% in canary only
Monitor metrics (crashes, performance)
Enable flags at 10% → 50% → 100% in canary
Progress canary to 100% of traffic
Enable flags in stable version
Clean up flag code in next release

Verdict: Use canaries for deployment safety, use feature flags for feature safety. Combine them for maximum control.

CDN Cache Invalidation During Canary

CDN caching and canary deployments have a fundamental tension: you want aggressive caching (performance) but you also need instant rollback (safety).

The Caching Problem

Naive Approach:

Cache-Control: public, max-age=3600, s-maxage=86400

Why This Breaks Canaries:

User requests /index.html at T+0 (canary)
CDN caches it for 24 hours (s-maxage=86400)
At T+10m, canary is rolled back
User requests /index.html at T+15m
CDN serves cached canary version (oops)

User sees broken canary for up to 24 hours.

Solution 1: Short TTLs for HTML, Long TTLs for Assets

HTML:   Cache-Control: public, max-age=0, s-maxage=60
JS/CSS: Cache-Control: public, max-age=31536000, immutable

Tradeoffs:

HTML is re-fetched frequently (acceptable - small payload)
JS/CSS cached forever (good - large payloads, content-addressed)
Rollback takes 60 seconds max (time for CDN TTL to expire)

Implementation:

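// Origin server (Next.js API-style handler)
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  // Strip the query string so '/index.html?x=1' still matches
  const path = (req.url ?? '').split('?')[0];

  if (path.endsWith('.html')) {
    res.setHeader('Cache-Control', 'public, max-age=0, s-maxage=60, must-revalidate');
  } else if (/\.(js|css|woff2)$/.test(path)) {
    res.setHeader('Cache-Control', 'public, max-age=31536000, immutable');
  }

  // ...
}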

Solution 2: Versioned Cache Keys

Edge worker creates different cache keys for canary and stable:

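export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const url = new URL(request.url);
    const version = await determineVersion(request, env);

    // Create version-specific cache key. The Cache API keys entries by request
    // URL, so the version is encoded into a synthetic URL used only as the key
    // (the '/__cache/' prefix is an arbitrary, hypothetical choice).
    const cacheUrl = new URL(request.url);
    cacheUrl.pathname = `/__cache/${version}${url.pathname}`;
    const cacheKey = new Request(cacheUrl.toString(), { method: 'GET' });

    // Check cache
    let response = await caches.default.match(cacheKey);

    if (!response) {
      // Fetch from origin
      response = await fetch(getOriginUrl(url, version));

      // Cache with version-specific key (ctx keeps the write alive after the response returns)
      ctx.waitUntil(caches.default.put(cacheKey, response.clone()));
    }

    return response;
  }
};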

Why This Works:

Canary assets: canary-v48:/index.html
Stable assets: stable-v47:/index.html
No key collision, both can coexist in cache
Rollback = change routing, don't purge cache

Solution 3: Selective Cache Purge

Instead of purging all canary assets, purge only critical paths:

async function rollbackCanary(): Promise<void> {
  // Update routing first, so HTML re-fetched after the purge comes from stable
  await kv.put('canary-percentage', '0');

  // Only purge HTML (entry points)
  await purgeCDN('/index.html');
  await purgeCDN('/_next/data/**/*.json'); // Next.js data files

  // Don't purge JS/CSS - content-addressed, no collision
}

Why This Works:

HTML is purged (users get stable entry point)
JS/CSS are content-addressed (v47 vs v48 filenames differ)
Smaller purge = faster propagation
Less CDN load

Solution 4: Stale-While-Revalidate for Resilience

Cache-Control: public, max-age=60, stale-while-revalidate=600

Behavior:

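CDN serves cached copy as fresh for 60 seconds
For the next 600 seconds, CDN serves the stale copy immediately while fetching a fresh one in the background
Serving stale content when the origin is down is a separate directive (stale-if-error); stale-while-revalidate alone does not cover origin failures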

Benefits During Rollback:

Origin load is smoothed (no thundering herd)
Users get content even during rollback
Stale canary is better than error page

Downside:

Users might see canary for up to 60s after rollback, plus one stale response per edge location while the background revalidation runs

Acceptable tradeoff for most apps.
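
The cache states behind this header can be made concrete with a small classifier. This is a minimal sketch, not CDN code: `classifyFreshness` and its parameter names are hypothetical, and it assumes the CDN follows the standard `max-age` / `stale-while-revalidate` semantics.

```typescript
type Freshness = 'fresh' | 'stale-while-revalidate' | 'expired';

// Hypothetical helper: classify a cached response by its age in seconds.
function classifyFreshness(
  ageSeconds: number,
  maxAge: number,
  staleWhileRevalidate: number,
): Freshness {
  // Fresh: served straight from cache, origin is not contacted
  if (ageSeconds < maxAge) return 'fresh';
  // Stale window: served from cache immediately, refreshed in the background
  if (ageSeconds < maxAge + staleWhileRevalidate) return 'stale-while-revalidate';
  // Past both windows: the CDN must fetch from origin before responding
  return 'expired';
}

// With max-age=60 and stale-while-revalidate=600:
const at30s = classifyFreshness(30, 60, 600);
const at90s = classifyFreshness(90, 60, 600);
const at700s = classifyFreshness(700, 60, 600);
```

For rollback planning, the middle state is the one that matters: a canary page cached just before rollback can still be handed out once from the stale window at an edge location that has not revalidated yet.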

Comparison

Strategy	Rollback Speed	Cache Hit Rate	Complexity	Best For
Short TTLs	~60s	Low for HTML, high for assets	Low	Standard setups
Versioned Keys	Instant	High for everything	Medium	CDN with edge workers
Selective Purge	~30s	High	Low	Simple CDN setups
Stale-While-Revalidate	~60s (graceful)	High	Low	Resilience-focused

Verdict: Use versioned cache keys if you have edge workers, otherwise use short TTLs + selective purge.

Client-Side Version Detection

The client needs to know which version it's running for telemetry and error attribution. Here's how to implement it.

Version Injection in HTML

// SSR render (Next.js pages router)
// getServerSideProps is only supported in page files, not in _app, so export
// it from each page; the returned props reach App through pageProps.
export async function getServerSideProps(context) {
  const version = process.env.APP_VERSION || 'unknown';
  const buildId = process.env.BUILD_ID || 'unknown';

  return {
    props: {
      version,
      buildId,
    },
  };
}

// pages/_app.tsx
export default function App({ Component, pageProps }) {
  const { version = 'unknown', buildId = 'unknown' } = pageProps;

  return (
    <>
      <Head>
        <meta name="app-version" content={version} />
        <meta name="build-id" content={buildId} />
      </Head>

      <Script
        id="version-init"
        strategy="beforeInteractive"
        dangerouslySetInnerHTML={{
          // JSON.stringify quotes and escapes the values before they are inlined
          __html: `
            window.__APP_VERSION__ = ${JSON.stringify(version)};
            window.__BUILD_ID__ = ${JSON.stringify(buildId)};
            window.__CANARY_ASSIGNMENT__ = document.cookie.match(/x-canary-version=([^;]+)/)?.[1] || 'unknown';
          `,
        }}
      />

      <Component {...pageProps} />
    </>
  );
}

Client-Side Telemetry Tagging

class TelemetryClient {
  private version: string;
  private buildId: string;
  private canaryAssignment: string;

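  constructor() {
    // Guard for SSR: this module is also evaluated on the server, where `window` is undefined
    const w = typeof window !== 'undefined' ? (window as any) : {};
    this.version = w.__APP_VERSION__ || 'unknown';
    this.buildId = w.__BUILD_ID__ || 'unknown';
    this.canaryAssignment = w.__CANARY_ASSIGNMENT__ || 'unknown';
  }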

  trackEvent(name: string, properties: Record<string, any> = {}): void {
    const enriched = {
      ...properties,
      version: this.version,
      buildId: this.buildId,
      canaryAssignment: this.canaryAssignment,
      timestamp: Date.now(),
      sessionId: this.getSessionId(),
    };

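    // Send to analytics. keepalive lets the request outlive page unload;
    // failures are swallowed so telemetry never breaks the app.
    fetch('/api/telemetry', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        event: name,
        properties: enriched,
      }),
      keepalive: true,
    }).catch(() => {});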
  }

  trackError(error: Error, context: Record<string, any> = {}): void {
    this.trackEvent('error', {
      ...context,
      errorMessage: error.message,
      errorStack: error.stack,
      errorName: error.name,
    });
  }

  trackPerformance(metric: string, value: number, context: Record<string, any> = {}): void {
    this.trackEvent('performance', {
      ...context,
      metric,
      value,
    });
  }

  private getSessionId(): string {
    let sessionId = sessionStorage.getItem('session-id');

    if (!sessionId) {
      sessionId = crypto.randomUUID();
      sessionStorage.setItem('session-id', sessionId);
    }

    return sessionId;
  }
}

// Global instance
export const telemetry = new TelemetryClient();

// React Error Boundary
export class ErrorBoundary extends React.Component {
  componentDidCatch(error: Error, errorInfo: React.ErrorInfo) {
    telemetry.trackError(error, {
      componentStack: errorInfo.componentStack,
      boundary: 'app',
    });
  }

  render() {
    return this.props.children;
  }
}

// Performance monitoring
if (typeof window !== 'undefined') {
  // Core Web Vitals
  import('web-vitals').then(({ onCLS, onLCP, onFCP, onTTFB, onINP }) => {
    onCLS((metric) => telemetry.trackPerformance('cls', metric.value));
    onINP((metric) => telemetry.trackPerformance('inp', metric.value));
    onLCP((metric) => telemetry.trackPerformance('lcp', metric.value));
    onFCP((metric) => telemetry.trackPerformance('fcp', metric.value));
    onTTFB((metric) => telemetry.trackPerformance('ttfb', metric.value));
  });

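  // Rough hydration proxy: time from this script's evaluation to window load.
  // This is not React's actual hydration duration; measure around the hydrate call for that.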
  const hydrationStart = performance.now();

  window.addEventListener('load', () => {
    const hydrationEnd = performance.now();
    const hydrationTime = hydrationEnd - hydrationStart;

    telemetry.trackPerformance('hydration_time', hydrationTime);
  });
}

Version Mismatch Detection

class VersionMismatchDetector {
  private expectedVersion: string;
  private checkInterval: number = 60000; // Check every minute

  constructor() {
    this.expectedVersion = window.__APP_VERSION__;
    this.startMonitoring();
  }

  private startMonitoring(): void {
    setInterval(async () => {
      await this.checkVersion();
    }, this.checkInterval);
  }

  private async checkVersion(): Promise<void> {
    try {
      const response = await fetch('/api/version', {
        headers: {
          'X-Client-Version': this.expectedVersion,
        },
      });

      if (!response.ok) {
        throw new Error(`Version check returned ${response.status}`);
      }

      const data = await response.json();

      if (data.currentVersion !== this.expectedVersion) {
        console.warn('Version mismatch detected', {
          client: this.expectedVersion,
          server: data.currentVersion,
        });

        // Show update notification
        this.showUpdateNotification();
      }
    } catch (error) {
      console.error('Version check failed', error);
    }
  }

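  private showUpdateNotification(): void {
    // Show the banner once; checkVersion keeps firing every minute while versions differ
    if (document.getElementById('version-update-notice')) return;

    const notification = document.createElement('div');
    notification.id = 'version-update-notice';
    notification.innerHTML = `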
      <div style="
        position: fixed;
        bottom: 20px;
        right: 20px;
        background: #1a1a1a;
        color: white;
        padding: 16px;
        border-radius: 8px;
        box-shadow: 0 4px 12px rgba(0,0,0,0.3);
        z-index: 9999;
      ">
        <p style="margin: 0 0 8px 0;">A new version is available</p>
        <button onclick="window.location.reload()" style="
          background: #0070f3;
          color: white;
          border: none;
          padding: 8px 16px;
          border-radius: 4px;
          cursor: pointer;
        ">
          Reload
        </button>
      </div>
    `;

    document.body.appendChild(notification);
  }
}

// Initialize
if (typeof window !== 'undefined') {
  new VersionMismatchDetector();
}

Verdict: Inject version in HTML, tag all telemetry, monitor for mismatches, prompt user to reload when versions drift.

Production Incidents and Lessons Learned

Real-world canary failures and how they were detected/resolved.

Incident 1: Infinite Re-Render Loop

What Happened:

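Deployed canary with React 18 upgrade. A component rendered 1000+ times per page (list items) used useEffect without a dependency array, so the effect ran after every render. Each completed fetch called setData, which re-rendered the item and re-ran the effect, producing an infinite fetch/render loop.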

// Buggy code in v48 canary
function ListItem({ id }) {
  const [data, setData] = useState(null);

  useEffect(() => {
    fetch(`/api/items/${id}`)
      .then(res => res.json())
      .then(setData);
    // Missing dependency array - runs on every render!
  });

  return <div>{data?.name}</div>;
}

Detection:

Canary deployed at 5% (22.5K RPS)
Within 90 seconds, CPU usage spiked to 100% on client devices
INP (Interaction to Next Paint) jumped from 80ms → 4500ms
Browser tab crashes increased 50x

Canary Analysis System Caught It:

T+1m:30s  Anomaly detected: INP > 2000ms (baseline: 85ms)
          Effect size: 2.8 (extremely large)
          Confidence: 0.98
          Decision: ROLLBACK

Rollback:

Automated rollback triggered at T+1m:45s
KV updated: canary-percentage = 0
CDN purged: /dist-v48/*.html
Total affected users: ~135K (90s * 1500 RPS)
Damage: 47 users reported crashes, 0 churn

Root Cause:

React 18's concurrent rendering changed useEffect timing. Missing dependency array became fatal.

Prevention:

Added ESLint rule: react-hooks/exhaustive-deps (enforced in CI)
Added performance regression tests: Lighthouse CI checks INP < 200ms
Improved canary sensitivity: INP threshold lowered to +500ms (was +1000ms)

Incident 2: Hydration Mismatch on Timezone Boundaries

What Happened:

Deployed canary that rendered server timestamp using Date.now(). Server was in UTC, client rendered in local timezone.

// Buggy code
function Timestamp() {
  const now = Date.now();

  return (
    <div>
      Last updated: {new Date(now).toLocaleString()}
    </div>
  );
}

Why This Broke:

SSR rendered: "Last updated: 3/15/2024, 10:00:00 PM UTC"
Client hydrated: "Last updated: 3/15/2024, 3:00:00 PM PDT"
React saw mismatch, threw hydration error

Detection:

Canary deployed to 5%
Hydration errors started appearing immediately
Error rate: 2.3% (much higher than threshold of 0.01%)
Weird part: Only affected users in Pacific/Mountain time zones (US West)

Why Stratification Saved Us:

Naive analysis would have shown: "2.3% of canary users have errors, but canary sample is small, could be noise."

Stratified analysis showed:

US-West region: 8.5% error rate (CRITICAL)
US-East region: 0.03% error rate (normal)
EU region: 0.02% error rate (normal)
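
The difference between pooled and stratified analysis can be sketched in a few lines. This is an illustrative sketch only: the `stratify` helper, the 5% threshold, and the request counts are assumptions chosen to show the effect, not data from the incident.

```typescript
interface SegmentCounts {
  segment: string; // e.g. region name
  requests: number;
  errors: number;
}

interface SegmentResult {
  segment: string;
  errorRate: number; // fraction between 0 and 1
  breached: boolean;
}

// Compute the pooled error rate plus per-segment rates, flagging any
// segment whose rate exceeds the threshold.
function stratify(
  counts: SegmentCounts[],
  threshold: number,
): { overallRate: number; segments: SegmentResult[] } {
  const totalRequests = counts.reduce((sum, c) => sum + c.requests, 0);
  const totalErrors = counts.reduce((sum, c) => sum + c.errors, 0);

  const segments = counts.map((c) => {
    const errorRate = c.requests === 0 ? 0 : c.errors / c.requests;
    return { segment: c.segment, errorRate, breached: errorRate > threshold };
  });

  return {
    overallRate: totalRequests === 0 ? 0 : totalErrors / totalRequests,
    segments,
  };
}

// Hypothetical counts: one region is badly broken, the others are healthy
const stratified = stratify(
  [
    { segment: 'us-west', requests: 2000, errors: 170 },
    { segment: 'us-east', requests: 3000, errors: 1 },
    { segment: 'eu', requests: 5000, errors: 1 },
  ],
  0.05,
);

const flaggedSegments = stratified.segments
  .filter((s) => s.breached)
  .map((s) => s.segment);
```

With these counts the pooled rate stays under the threshold while `us-west` alone breaches it, which is the failure mode a pooled comparison hides.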

Rollback:

T+2m: Stratified analysis flagged US-West regression
T+2m:15s: Automated rollback triggered
Total affected: ~45K users (mostly US West Coast)

Root Cause:

Server timestamp rendered during SSR used server's timezone. Client rendered in user's timezone. Hydration failed.

Fix:

function Timestamp() {
  const [mounted, setMounted] = useState(false);

  useEffect(() => {
    setMounted(true);
  }, []);

  // Server: render nothing (or placeholder)
  // Client: render actual timestamp
  if (!mounted) {
    return <div>Last updated: ...</div>;
  }

  return (
    <div>
      Last updated: {new Date().toLocaleString()}
    </div>
  );
}

Prevention:

Added hydration error monitoring (track console.error for "Hydration failed")
Added timezone-aware test suite (run Playwright tests with TZ=America/Los_Angeles)
Improved documentation: "Never render Date.now() in SSR without timezone handling"

Incident 3: CDN Cache Stampede During Rollback

What Happened:

Deployed canary, caught a bug, rolled back. During rollback, purged ALL HTML from CDN. 450K RPS suddenly hit origin.

Timeline:

T+0      Deploy canary (5%)
T+5m     Detect error rate regression
T+5m:15s Initiate rollback
T+5m:20s CDN cache purge sent: *.html (all HTML, not only /dist-v48/)
T+5m:25s Purge propagates globally
T+5m:30s ALL HTML REQUESTS HIT ORIGIN
         Origin load: 8K RPS → 450K RPS
         Origin crashes (OOM)
T+5m:45s Site down (503 errors)
T+8m     Auto-scaling kicks in (slow)
T+12m    Site recovers

Damage:

6.5 minutes of total outage (worse than canary bug itself)
2.9M failed requests
Significant revenue impact

Root Cause:

Purged too aggressively (all HTML, not just canary)
No request coalescing at CDN edge
Origin not prepared for cache miss spike

Fix 1: Selective Purge

async function rollback() {
  // Before: purge all HTML
  // await purgeCDN('*.html');

  // After: only purge canary-specific HTML
  await purgeCDN('/dist-v48/*.html');

  // Stable HTML stays cached
}

Fix 2: Request Coalescing

// Cloudflare Worker
// Best-effort only: KV is eventually consistent, so concurrent requests at
// other PoPs may not see the in-flight marker. Strict coalescing needs a
// strongly consistent coordinator.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Create coalescing key
    const coalesceKey = `coalesce:${url.pathname}`;

    // Check if request is already in-flight
    const inFlight = await env.KV.get(coalesceKey);

    if (inFlight) {
      // Wait for in-flight request to complete
      await sleep(100);

      // Try cache again
      const cached = await caches.default.match(request);
      if (cached) return cached;
    }

    // Mark request as in-flight (KV requires expirationTtl of at least 60 seconds)
    await env.KV.put(coalesceKey, 'true', { expirationTtl: 60 });

    try {
      // Fetch from origin
      const response = await fetch(request);

      // Cache result
      await caches.default.put(request, response.clone());

      return response;
    } finally {
      // Clear in-flight flag even if the origin fetch fails
      await env.KV.delete(coalesceKey);
    }
  }
};
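
For comparison, request coalescing inside a single long-lived process (for example the origin server or a Node-based proxy) can be done without any external store. The sketch below is a generic "single-flight" helper under that assumption; `SingleFlight` and its `run` method are hypothetical names, and it is not a drop-in replacement for the Worker above.

```typescript
// Hypothetical single-flight helper: concurrent callers for the same key
// share one in-flight promise instead of each triggering its own load.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, loader: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // join the load that is already running

    const promise = loader().then(
      (value) => {
        this.inFlight.delete(key); // later requests trigger a fresh load
        return value;
      },
      (error) => {
        this.inFlight.delete(key); // do not pin a failed load
        throw error;
      },
    );

    this.inFlight.set(key, promise);
    return promise;
  }
}

// Usage: two concurrent requests for the same path cause one origin call
let originCalls = 0;
const flights = new SingleFlight<string>();
const loadPage = (): Promise<string> => {
  originCalls += 1;
  return Promise.resolve('<html>stable</html>');
};

const firstRequest = flights.run('/index.html', loadPage);
const secondRequest = flights.run('/index.html', loadPage); // joins the first call
```

The map entry is removed as soon as the load settles, so the helper only collapses requests that overlap in time; caching the result is left to the CDN or an explicit cache layer.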

Fix 3: Stale-While-Revalidate

res.setHeader('Cache-Control', 'public, max-age=60, stale-while-revalidate=600, stale-if-error=3600');

CDN serves stale content during origin outage.

Prevention:

Added origin load shedding (reject requests if CPU > 80%)
Added CDN-level rate limiting (max 10K RPS to origin)
Changed rollback strategy to gradual (100% → 50% → 0% over 2 minutes)
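
The gradual rollback in the last item can be expressed as a schedule of percentage writes. This is a minimal sketch under one reading of "100% → 50% → 0%", namely that the canary's current traffic share is scaled down in evenly spaced steps; `buildRollbackSchedule` and its parameters are hypothetical.

```typescript
interface RollbackStep {
  atMs: number;       // delay from the start of the rollback
  percentage: number; // canary percentage to write at that time
}

// Build evenly spaced steps that scale the current canary share by each factor.
function buildRollbackSchedule(
  currentPercentage: number,
  factors: number[],
  totalMs: number,
): RollbackStep[] {
  const intervals = Math.max(factors.length - 1, 1);

  return factors.map((factor, i) => ({
    atMs: Math.round((totalMs * i) / intervals),
    percentage: currentPercentage * factor,
  }));
}

// A 5% canary stepped down to half, then to zero, over 2 minutes
const rollbackSchedule = buildRollbackSchedule(5, [1, 0.5, 0], 120000);
```

Each step would be applied by writing `percentage` to the routing store at `atMs`, so origin load from cache misses ramps up in two smaller bursts instead of one.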

Incident 4: Mobile Network Timeout Cascade

What Happened:

Deployed canary with larger bundle size (+120KB gzipped). Desktop users unaffected. Mobile users on 3G experienced cascading timeouts.

Metrics:

Desktop LCP: 1.2s (stable) → 1.3s (canary) ✅
Mobile 4G LCP: 1.8s (stable) → 2.1s (canary) ⚠️
Mobile 3G LCP: 3.2s (stable) → 8.7s (canary) ❌

Why Stratification Caught This:

Overall LCP increase: +300ms (below threshold of +500ms)

But stratified by connection type:

WiFi: +50ms (fine)
4G: +300ms (acceptable)
3G: +5500ms (CRITICAL)

Root Cause:

3G bandwidth: ~1Mbps (theoretical), 400Kbps (real-world)
120KB extra bundle = 2.4s extra download on 3G
Timeout set to 5s - download took 8.7s - timeout fired
User got white screen
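
The 2.4s figure follows directly from the payload size and the real-world throughput quoted above. A small sketch of the arithmetic, assuming 1 KB = 1,000 bytes and ignoring latency, TCP slow start, and parse time (`extraDownloadSeconds` is a hypothetical helper):

```typescript
// Extra transfer time for additional payload at a given sustained throughput.
function extraDownloadSeconds(extraKilobytes: number, throughputKbps: number): number {
  const extraKilobits = extraKilobytes * 8; // bytes to bits
  return extraKilobits / throughputKbps;
}

// 120 KB extra at 400 Kbps: 960 kilobits / 400 Kbps
const extraOn3g = extraDownloadSeconds(120, 400);
```

Transfer time alone accounts for 2.4s; the rest of the observed regression would have to come from factors this estimate leaves out, such as extra round trips, parse time, and the timeout itself.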

Fix:

// Adaptive bundle loading
async function loadAppBundle() {
  const connection = (navigator as any).connection;
  const effectiveType = connection?.effectiveType || '4g';

  if (effectiveType === 'slow-2g' || effectiveType === '2g' || effectiveType === '3g') {
    // Load minimal bundle for slow connections
    await import('./bundles/minimal.js');
  } else {
    // Load full bundle
    await import('./bundles/full.js');
  }
}

Prevention:

Added connection-type stratification to canary analysis
Added bundle size regression tests (fail if >50KB increase)
Added network throttling to CI (test on simulated 3G)

Tradeoffs and Engineering Decisions

Every canary architecture decision involves tradeoffs. Here are the critical ones.

Tradeoff 1: Canary Percentage (Risk vs Sample Size)

Lower Percentage (1-5%):

✅ Smaller blast radius (fewer affected users)
✅ Lower risk of revenue impact
❌ Smaller sample size (takes longer to detect issues)
❌ Higher false positive rate (noise dominates signal)

Higher Percentage (10-25%):

✅ Larger sample size (faster detection)
✅ Lower false positive rate
❌ Larger blast radius
❌ Higher risk if canary is bad

Decision:

Start at 5% for 5-10 minutes (detect critical issues), then jump to 25-50% (validate at scale). This balances risk and detection speed.

App Type	Initial %	First Jump	Final Jump
E-commerce	2%	10%	50% → 100%
Social media	5%	25%	100%
SaaS dashboard	10%	50%	100%
Internal tool	25%	100%	N/A
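
The risk versus sample-size tradeoff is easiest to see as time-to-signal. The sketch below estimates how long a canary needs to collect a target number of requests; the helper name, the 10,000-sample target, and the 1,000 RPS total are illustrative assumptions rather than recommendations.

```typescript
// Seconds needed for the canary to accumulate a target number of requests.
function secondsToCollect(
  targetSamples: number,
  totalRps: number,
  canaryPercent: number,
): number {
  const canaryRps = (totalRps * canaryPercent) / 100;
  if (canaryRps <= 0) return Infinity; // no canary traffic, no signal
  return targetSamples / canaryRps;
}

// Hypothetical app at 1,000 RPS needing 10,000 canary requests per decision
const secondsAt2Percent = secondsToCollect(10000, 1000, 2);
const secondsAt10Percent = secondsToCollect(10000, 1000, 10);
const secondsAt25Percent = secondsToCollect(10000, 1000, 25);
```

Under these assumptions a 2% canary needs several minutes per decision while a 25% canary needs well under one, which is why low-traffic apps in the table start at higher percentages.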

Tradeoff 2: Decision Latency (Speed vs Accuracy)

Fast Decisions (30-second windows):

✅ Catch bad deploys quickly
✅ Minimize blast radius
❌ Higher false positive rate (noise)
❌ May not catch slow-burn issues

Slow Decisions (5-minute windows):

✅ Better statistical significance
✅ Lower false positive rate
❌ Slow to catch critical bugs
❌ More users affected before rollback

Decision:

Use hybrid approach:

const DECISION_CONFIG = {
  // Critical metrics: fast decisions
  errorRate: {
    window: 30,      // 30 seconds
    threshold: 0.05, // 0.05% increase
    action: 'rollback_immediate',
  },
  hydrationErrors: {
    window: 30,
    threshold: 0.01,
    action: 'rollback_immediate',
  },

  // Performance metrics: slower decisions
  lcp: {
    window: 300,     // 5 minutes
    threshold: 500,  // +500ms
    action: 'rollback_delayed',
  },
  cls: {
    window: 300,
    threshold: 0.1,
    action: 'hold',
  },
};

Verdict: Fast decisions for errors, slower decisions for performance.
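
A config like this only matters once something evaluates it. The following is a minimal sketch of such an evaluator, assuming each observation carries the canary-minus-stable delta and how long the metric has been observed; the `evaluate` function and the `Observation` shape are hypothetical, and the rules are a trimmed copy of the config above.

```typescript
type Action = 'rollback_immediate' | 'rollback_delayed' | 'hold';

interface MetricRule {
  window: number;    // seconds of data required before deciding
  threshold: number; // maximum tolerated increase over baseline
  action: Action;
}

interface Observation {
  metric: string;
  delta: number;           // canary minus stable, in the metric's unit
  observedSeconds: number; // how long this metric has been collected
}

// Return the actions triggered by the observations, most severe first.
function evaluate(rules: Record<string, MetricRule>, observations: Observation[]): Action[] {
  const severityOrder: Action[] = ['rollback_immediate', 'rollback_delayed', 'hold'];
  const triggered: Action[] = [];

  for (const obs of observations) {
    const rule = rules[obs.metric];
    if (!rule) continue;                             // unknown metric: ignore
    if (obs.observedSeconds < rule.window) continue; // window not full yet
    if (obs.delta > rule.threshold && triggered.indexOf(rule.action) === -1) {
      triggered.push(rule.action);
    }
  }

  return severityOrder.filter((action) => triggered.indexOf(action) !== -1);
}

const decisionRules: Record<string, MetricRule> = {
  errorRate: { window: 30, threshold: 0.05, action: 'rollback_immediate' },
  lcp: { window: 300, threshold: 500, action: 'rollback_delayed' },
};

const triggeredActions = evaluate(decisionRules, [
  { metric: 'errorRate', delta: 0.12, observedSeconds: 45 },
  { metric: 'lcp', delta: 800, observedSeconds: 120 }, // window not full: no decision yet
]);
```

Here the error-rate rule fires after 45 seconds, while the LCP regression is ignored until its 5-minute window fills, which is the fast/slow split the config encodes.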

Tradeoff 3: Edge Logic Complexity (Flexibility vs Maintainability)

Simple Edge Logic:

Percentage-based routing only
No stratification
No user segmentation

Pros:

Easy to reason about
Low edge CPU usage
Easy to debug

Cons:

No fine-grained control
Can't do region-based rollouts
Can't do cohort-based testing

Complex Edge Logic:

Percentage + region + cohort + feature flags
Real-time KV lookups
User data fetching

Pros:

Maximum control
Can do sophisticated rollouts
Can segment by any dimension

Cons:

Edge worker CPU limits (per-request CPU budgets, e.g. 50ms on some plans)
More KV reads = higher latency
Harder to debug

Decision:

Start simple, add complexity as needed.

Phase 1 (MVP):

// Just percentage-based
const pct = parseInt((await kv.get('canary-percentage')) ?? '0', 10);
const bucket = hash(userId) % 100;
return bucket < pct ? 'canary' : 'stable';
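
The Phase 1 snippet leaves `hash` undefined. One possible choice, shown here as a sketch rather than the article's actual implementation, is the 32-bit FNV-1a hash: it is deterministic, needs no dependencies, and keeps a user in the same bucket across requests. `fnv1a` and `assignVersion` are hypothetical names.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; for ASCII IDs this matches the
// byte-wise definition of the algorithm.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime in 32 bits
  }
  return hash >>> 0; // convert to unsigned
}

// Map a user to a stable bucket in 0..99 and compare against the rollout percentage.
function assignVersion(userId: string, canaryPercentage: number): 'canary' | 'stable' {
  const bucket = fnv1a(userId) % 100;
  return bucket < canaryPercentage ? 'canary' : 'stable';
}
```

Because the bucket depends only on the user ID, raising the percentage from 5 to 25 keeps every user who was already on canary there and only adds new ones.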

Phase 2 (Add Regions):

const enabled = await kv.get(`canary-enabled:${region}`);
if (enabled !== 'true') return 'stable'; // KV returns strings, so 'false' would be truthy
// ... percentage logic

Phase 3 (Add Cohorts):

const user = await fetchUser(userId);
const eligible = checkCohortEligibility(user);
if (!eligible) return 'stable';
// ... percentage + region logic

Verdict: Start with simple percentage-based, add sophistication only when product needs it.

Tradeoff 4: Automated vs Manual Rollback

Fully Automated:

✅ Fast (60-second rollback)
✅ Works 24/7 (no human needed)
❌ False positives block good deploys
❌ Can't handle nuanced situations

Manual Only:

✅ Humans make nuanced decisions
✅ No false positives
❌ Slow (humans not always available)
❌ Blast radius grows during detection time

Decision:

Automated rollback with human override:

class RollbackController {
  async executeRollback(reason: string, confidence: number): Promise<void> {
    if (confidence > 0.95) {
      // Very confident - auto-rollback immediately
      console.log(`AUTO-ROLLBACK (confidence: ${confidence})`);
      await this.rollback(reason);
      await this.notifyTeam(`Auto-rollback executed: ${reason}`);
    } else if (confidence > 0.7) {
      // Moderately confident - notify team, give 2 minutes to override
      console.log(`ROLLBACK PENDING (confidence: ${confidence})`);
      await this.notifyTeam(`Rollback pending in 2 minutes: ${reason}. React with ❌ to cancel.`);

      await sleep(120000); // Wait 2 minutes

      // Check if human canceled
      const canceled = await this.kv.get('rollback-canceled');

      if (canceled === 'true') {
        console.log('Rollback canceled by human');
        await this.kv.delete('rollback-canceled');
        return;
      }

      // No cancellation - proceed
      await this.rollback(reason);
    } else {
      // Low confidence - just alert humans
      console.log(`ROLLBACK SUGGESTED (confidence: ${confidence})`);
      await this.notifyTeam(`Canary showing issues: ${reason}. Manual review recommended.`);
    }
  }
}

Rollback Override (Slack Bot):

[3:42 PM] CanaryBot:
⚠️ ROLLBACK PENDING in 2 minutes
Reason: Error rate increased by 0.12% (threshold: 0.05%)
Confidence: 0.78
React with ❌ to cancel rollback

User: [clicks ❌]

[3:42 PM] CanaryBot:
✅ Rollback canceled. Canary will continue.

Verdict: Use automated rollback with confidence thresholds and human override capability.
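
The confidence tiers are easier to tune when the policy is separated from its side effects. A minimal sketch, using the same 0.95 and 0.7 cut-offs as the controller above (`rollbackMode` and the mode names are hypothetical):

```typescript
type RollbackMode = 'auto' | 'pending' | 'alert';

// Pure policy function: maps analyzer confidence to a rollback mode.
function rollbackMode(confidence: number): RollbackMode {
  if (confidence > 0.95) return 'auto';   // roll back immediately
  if (confidence > 0.7) return 'pending'; // notify and wait for a human override
  return 'alert';                         // notify only, no automatic action
}

const modeForSlackExample = rollbackMode(0.78);
```

Keeping this as a pure function lets the thresholds be unit tested and changed without touching the code that talks to KV, the CDN, or Slack.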

Summary: Key Architectural Insights

Building production-grade canary deployments for frontends requires rethinking backend patterns:

Traffic Splitting Must Happen at the Edge
- CDN edge workers give you <1ms routing decisions
- Origin-based splitting amplifies load and limits geographic control
- DNS-based splitting is too slow for rollbacks
Static Assets Break Traditional Canary Models
- Content-addressed assets prevent cache collisions
- Versioned CDN paths enable canary/stable coexistence
- Short HTML TTLs + long asset TTLs balance performance and rollback speed
Hydration Is Your Biggest Risk
- SSR/SSG creates timing windows where version mismatches cause failures
- Version injection in HTML + client-side checks prevent drift
- Monitor hydration errors separately from runtime errors
Stratified Analysis Prevents False Negatives
- Raw metric comparisons miss region/device-specific regressions
- Statistical tests (t-test, effect size) reduce false positives
- Anomaly detection catches edge cases that thresholds miss
Rollback Speed Matters More Than Deployment Speed
- CDN propagation delays mean bad code lives for 30-60s minimum
- Client-side error recovery bridges the rollback window
- Aggressive cache purging causes origin stampedes (gradual rollback is safer)
Canaries and Feature Flags Complement Each Other
- Canaries validate entire codebase (including deps, tooling, infra)
- Feature flags control individual code paths
- Hybrid approach: deploy canary with flags off, enable flags incrementally
Automation Is Non-Negotiable at Scale
- Humans can't make decisions in 60-second windows
- Bayesian inference + confidence thresholds enable safe automation
- Manual override prevents robots from blocking good deploys
The Real Cost Is Operational Complexity
- Edge compute adds $5-10K/month at scale
- Monitoring/telemetry is $10-15K/month
- Engineering time to build + maintain the system is 3-6 months
- But the cost of a bad deploy (downtime, churn, reputation) is 10-100x higher

Final Thought:

Canary deployments for frontends are fundamentally harder than backend canaries because:

Static assets create caching complexity
Client-side execution means you can't control the runtime
Hydration creates version synchronization problems
CDN propagation delays limit rollback speed

But the investment is worth it. At scale, canaries are the difference between "we caught it in 90 seconds" and "40,000 users churned before we noticed."

The architecture outlined here (edge-based routing, stratified analysis, automated decisions, gradual rollout) reflects the patterns that companies like Netflix, Vercel, and Cloudflare describe for production use. It's not simple, but it works.

