The "Rebuild vs Refactor" Decision Framework

February 25, 2026 · 3 min read

A structured way to evaluate whether to rewrite a system or incrementally improve it, with real signals, stakeholder communication, and risk mapping

The Most Expensive Question in Software

"Should we rewrite this?"

I've seen this question destroy teams. I've seen rewrites that saved companies and rewrites that killed them. I've seen refactors that worked miracles and refactors that just delayed the inevitable.

The problem isn't that rewrites are always bad (they're not) or that refactors are always better (they're not). The problem is that most teams make this decision based on frustration, not analysis. They're tired of the old system, excited about new technology, and convinced that "this time we'll do it right."

This framework won't tell you what to do. It will help you make the decision with clear eyes, communicate it to stakeholders who don't understand the tradeoffs, and map the risks so you're not surprised when things go wrong.

Why Rewrites Fail (And Sometimes Succeed)

The Rewrite Trap

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE CLASSIC REWRITE FAILURE PATTERN                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Month 0: "The old system is unmaintainable!"                               │
│           Team is frustrated, velocity is low                               │
│           Everyone agrees: we need to rewrite                               │
│                                                                              │
│  Month 3: "The new system is so much cleaner!"                              │
│           Team is excited, making good progress                             │
│           Old system still running, still getting patches                   │
│                                                                              │
│  Month 6: "We're almost at feature parity!"                                 │
│           But edge cases keep appearing                                     │
│           Old system keeps getting features new one doesn't have            │
│                                                                              │
│  Month 9: "Just a few more months..."                                       │
│           Business pressure mounting                                        │
│           Team is tired of building what already exists                     │
│           New system has its own bugs now                                   │
│                                                                              │
│  Month 12: "We need to ship SOMETHING"                                      │
│            Launch new system with missing features                          │
│            Users complain about regressions                                 │
│            Now maintaining TWO broken systems                               │
│                                                                              │
│  Month 18: "Let's just go back to the old system"                           │
│            Or: "Let's rewrite the rewrite"                                  │
│            Team demoralized, business trust destroyed                       │
│                                                                              │
│  Total cost: 18 months, team morale, business trust                        │
│  Could have refactored incrementally in 6 months                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

When Rewrites Actually Work

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SUCCESSFUL REWRITE PATTERNS                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Pattern 1: The Platform Shift                                              │
│  ─────────────────────────────                                               │
│  • Moving from desktop to web                                               │
│  • Moving from web to mobile-first                                          │
│  • Moving from on-prem to cloud-native                                      │
│  Why it works: The old system literally can't run in the new environment   │
│                                                                              │
│  Pattern 2: The Scope Reduction                                             │
│  ──────────────────────────────                                              │
│  • Rewriting 20% of features that provide 80% of value                     │
│  • Deliberately not rebuilding legacy features                             │
│  • New system is smaller than old system                                   │
│  Why it works: Less to build = actually finishable                         │
│                                                                              │
│  Pattern 3: The Strangler Fig                                               │
│  ────────────────────────────                                                │
│  • Building new system piece by piece                                       │
│  • Each piece replaces part of old system                                  │
│  • Never a "big bang" cutover                                              │
│  Why it works: Risk is distributed, can stop anytime                       │
│                                                                              │
│  Pattern 4: The Domain Expertise                                            │
│  ────────────────────────────────                                            │
│  • Team has deep knowledge of the domain                                   │
│  • Requirements are well understood                                        │
│  • Few surprises in what needs to be built                                 │
│  Why it works: The hard part isn't coding, it's knowing what to build      │
│                                                                              │
│  Pattern 5: The Burning Platform                                            │
│  ────────────────────────────────                                            │
│  • Old technology is truly end-of-life                                     │
│  • Security vulnerabilities can't be patched                               │
│  • Vendor going out of business                                            │
│  Why it works: No choice but to move, full organizational commitment       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

The Decision Framework

Step 1: Diagnose the Real Problem

Before deciding HOW to fix something, understand WHAT is actually broken.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PROBLEM DIAGNOSIS MATRIX                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SYMPTOM                        │  POSSIBLE CAUSES                          │
│  ────────────────────────────────────────────────────────────────────       │
│  "Development is slow"          │  • Tech debt in specific areas           │
│                                 │  • Poor tooling (not code)               │
│                                 │  • Unclear requirements                  │
│                                 │  • Team knowledge gaps                   │
│                                 │  • Process problems                      │
│                                 │  • Actually: architecture is fine        │
│                                                                              │
│  "Code is unmaintainable"       │  • Specific modules are bad              │
│                                 │  • All modules are bad                   │
│                                 │  • Documentation is missing              │
│                                 │  • Original authors left                 │
│                                 │  • Actually: learning curve is normal    │
│                                                                              │
│  "We can't add features"        │  • Architecture doesn't support them     │
│                                 │  • Too much coupling                     │
│                                 │  • Missing extension points              │
│                                 │  • Actually: features conflict           │
│                                                                              │
│  "Performance is bad"           │  • Database queries                      │
│                                 │  • Missing caching                       │
│                                 │  • Algorithm inefficiency                │
│                                 │  • Infrastructure undersized             │
│                                 │  • Actually: architecture is fine        │
│                                                                              │
│  "Technology is outdated"       │  • Security vulnerabilities              │
│                                 │  • Can't hire developers                 │
│                                 │  • Missing modern features               │
│                                 │  • Actually: still works fine            │
│                                                                              │
│  Most problems that feel like "we need a rewrite" have specific,           │
│  addressable causes that a rewrite won't automatically fix.                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
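
One way to tell "specific modules are bad" from "all modules are bad" is to check where bug fixes actually concentrate. The sketch below assumes you can export bug-fix changes tagged by module. The `BugFix` shape, the sample data, and the 50% threshold are hypothetical illustrations, not an established rule.

```typescript
// Hypothetical record of one bug-fix change, e.g. exported from a tracker.
interface BugFix {
  module: string;
}

interface ModuleShare {
  module: string;
  fixes: number;
  share: number; // fraction of all bug fixes, 0..1
}

// Count bug fixes per module and sort from most to least affected.
function bugFixShares(fixes: BugFix[]): ModuleShare[] {
  const counts: Record<string, number> = {};
  for (const fix of fixes) {
    counts[fix.module] = (counts[fix.module] || 0) + 1;
  }
  return Object.keys(counts)
    .map(module => ({
      module,
      fixes: counts[module],
      share: counts[module] / fixes.length,
    }))
    .sort((a, b) => b.fixes - a.fixes);
}

// Assumed rule of thumb: if the two worst modules account for at least
// `threshold` of all fixes, treat the problem as localized.
function looksLocalized(shares: ModuleShare[], threshold = 0.5): boolean {
  const topTwo = shares.slice(0, 2).reduce((sum, s) => sum + s.share, 0);
  return topTwo >= threshold;
}

// Made-up sample data for illustration only.
const sampleFixes: BugFix[] = [
  { module: 'payments' }, { module: 'payments' }, { module: 'payments' },
  { module: 'payments' }, { module: 'orders' }, { module: 'orders' },
  { module: 'auth' }, { module: 'reports' },
];

const shares = bugFixShares(sampleFixes);
const localized = looksLocalized(shares);
```

In this sample, payments and orders together account for six of eight fixes, so the pain is concentrated in two modules. That points toward targeted work rather than a rewrite.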

Step 2: Measure the Current State

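// System Health Assessment Scorecard
// Every metric is scored on the same 1-5 scale (1 = worst, 5 = best), so raw
// measurements such as weeks or percentages must be converted to a score first.

interface SystemHealthAssessment {
  // Technical Health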
  technical: {
    codeQuality: number;           // Static analysis, code review difficulty
    testCoverage: number;          // Percentage and quality of tests
    deployability: number;         // How easy/safe is deployment
    observability: number;         // Can you understand what's happening
    security: number;              // Vulnerability count, update currency
  };

  // Velocity Health
  velocity: {
    timeToProduction: number;      // Idea → live for small change
    bugFixTime: number;            // Time to fix typical bug
    onboardingTime: number;        // Weeks until new dev is productive
    incidentFrequency: number;     // Production incidents per month
    changeFailureRate: number;     // % of changes that cause problems
  };

  // Business Health
  business: {
    featureVelocity: number;       // Features shipped per quarter
    customerSatisfaction: number;  // Related to system quality
    competitivePosition: number;   // Can we match competitor features
    operationalCost: number;       // Infrastructure + maintenance cost
    opportunityCost: number;       // What we can't build because of system
  };

  // Team Health
  team: {
    morale: number;                // Team satisfaction with system
    retention: number;             // Are people leaving because of system
    skillMatch: number;            // Does team have skills for system
    knowledgeDistribution: number; // Bus factor, knowledge silos
  };
}

// Example scoring
const currentSystemHealth: SystemHealthAssessment = {
  technical: {
    codeQuality: 2,           // "It works but it's ugly"
    testCoverage: 1,          // "What tests?"
    deployability: 3,         // "We can deploy, but it's scary"
    observability: 2,         // "We find out about problems from users"
    security: 2,              // "Some outdated dependencies"
  },
  velocity: {
    timeToProduction: 2,      // "Two weeks for a button change"
    bugFixTime: 3,            // "Usually a day or two"
    onboardingTime: 1,        // "Months before people are effective"
    incidentFrequency: 2,     // "Weekly 'events'"
    changeFailureRate: 2,     // "30% of deploys need rollback"
  },
  business: {
    featureVelocity: 2,       // "Behind on roadmap"
    customerSatisfaction: 3,   // "It works, but complaints"
    competitivePosition: 2,    // "Missing table stakes features"
    operationalCost: 2,        // "High for what we get"
    opportunityCost: 1,        // "Can't pursue new opportunities"
  },
  team: {
    morale: 2,                 // "Nobody wants to work on this"
    retention: 2,              // "Lost 2 people citing the codebase"
    skillMatch: 3,             // "We know the stack, just not this code"
    knowledgeDistribution: 1,  // "Only Sarah understands payments"
  },
};

function calculateOverallHealth(assessment: SystemHealthAssessment): {
  score: number;
  verdict: 'healthy' | 'concerning' | 'critical';
  worstAreas: string[];
} {
  const allScores = [
    ...Object.values(assessment.technical),
    ...Object.values(assessment.velocity),
    ...Object.values(assessment.business),
    ...Object.values(assessment.team),
  ];

  const average = allScores.reduce((a, b) => a + b, 0) / allScores.length;

  // Find worst areas
  const worstAreas: string[] = [];
  for (const [category, scores] of Object.entries(assessment)) {
    for (const [metric, score] of Object.entries(scores)) {
      if (score <= 1) {
        worstAreas.push(`${category}.${metric}`);
      }
    }
  }

  return {
    score: average,
    verdict: average >= 3.5 ? 'healthy' : average >= 2.5 ? 'concerning' : 'critical',
    worstAreas,
  };
}
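
To make the arithmetic concrete, here is a small self-contained sketch that runs the example scores through per-category and overall averages. The `Assessment` alias and helper names are assumptions made for this illustration. The numbers are the ones from the example scoring above.

```typescript
// Scores are assumed to share one 1-5 scale, as in the scorecard above.
type Assessment = Record<string, Record<string, number>>;

function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Collect the scores of one category as a plain array.
function scoresOf(assessment: Assessment, category: string): number[] {
  const scores = assessment[category];
  return Object.keys(scores).map(name => scores[name]);
}

// Average within each category so weak categories stand out.
function categoryAverages(assessment: Assessment): Record<string, number> {
  const result: Record<string, number> = {};
  for (const category of Object.keys(assessment)) {
    result[category] = average(scoresOf(assessment, category));
  }
  return result;
}

// Same numbers as the example scoring above.
const exampleScores: Assessment = {
  technical: { codeQuality: 2, testCoverage: 1, deployability: 3, observability: 2, security: 2 },
  velocity: { timeToProduction: 2, bugFixTime: 3, onboardingTime: 1, incidentFrequency: 2, changeFailureRate: 2 },
  business: { featureVelocity: 2, customerSatisfaction: 3, competitivePosition: 2, operationalCost: 2, opportunityCost: 1 },
  team: { morale: 2, retention: 2, skillMatch: 3, knowledgeDistribution: 1 },
};

const averages = categoryAverages(exampleScores);
// technical: 10 / 5 = 2, velocity: 10 / 5 = 2, business: 10 / 5 = 2, team: 8 / 4 = 2

const everyScore = Object.keys(exampleScores)
  .map(category => scoresOf(exampleScores, category))
  .reduce((all, part) => all.concat(part), [] as number[]);
const overall = average(everyScore);
// overall: 38 / 19 = 2, below the 2.5 cutoff, so the verdict is 'critical'
```

Every category averages exactly 2 here, so the average alone says the weakness is spread evenly. The four metrics scored 1 are what tell you where to look first.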

Step 3: Evaluate Your Options

┌─────────────────────────────────────────────────────────────────────────────┐
│                    OPTION EVALUATION FRAMEWORK                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Option A: Do Nothing (Baseline)                                            │
│  ──────────────────────────────────                                          │
│  Estimated cost over 2 years:                                               │
│  • Ongoing maintenance: $___                                                │
│  • Lost velocity (features not built): $___                                 │
│  • Incidents and firefighting: $___                                         │
│  • Team turnover: $___                                                      │
│  • Security/compliance risk: $___                                           │
│  Total: $___                                                                │
│                                                                              │
│  Option B: Targeted Refactoring                                             │
│  ──────────────────────────────────                                          │
│  Scope: Improve specific problem areas without restructuring               │
│  • Effort to implement: ___ person-months                                  │
│  • Risk of failure: Low/Medium/High                                        │
│  • Expected improvement: ___% of problems addressed                        │
│  • Time until benefit: ___ months                                          │
│  • Can stop/pivot: Yes/No                                                  │
│                                                                              │
│  Option C: Incremental Rebuild (Strangler Fig)                              │
│  ──────────────────────────────────────────────                              │
│  Scope: Replace system piece by piece                                      │
│  • Effort to implement: ___ person-months                                  │
│  • Risk of failure: Low/Medium/High                                        │
│  • Expected improvement: ___% of problems addressed                        │
│  • Time until first benefit: ___ months                                    │
│  • Time until complete: ___ months                                         │
│  • Can stop/pivot: Yes (at module boundaries)                              │
│                                                                              │
│  Option D: Full Rewrite                                                     │
│  ──────────────────────────                                                  │
│  Scope: Build new system from scratch, switch over                         │
│  • Effort to implement: ___ person-months                                  │
│  • Risk of failure: Low/Medium/High                                        │
│  • Expected improvement: ___% of problems addressed                        │
│  • Time until benefit: ___ months (big bang)                               │
│  • Can stop/pivot: No (sunk cost)                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
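
Once the blanks are filled in, the options can be compared on one number. The sketch below uses a deliberately simple model: net cost over the 2-year window is the implementation cost plus whatever part of the do-nothing cost the option fails to remove. The model, the field names, and every figure are placeholder assumptions, so substitute your own estimates.

```typescript
interface OptionEstimate {
  name: string;
  personMonths: number;         // effort to implement
  monthsUntilBenefit: number;   // time until first benefit
  problemsAddressedPct: number; // expected improvement, 0..100
}

const HORIZON_MONTHS = 24; // the 2-year window used for Option A

// Net cost = build cost + baseline cost - baseline cost avoided
// during the months in which the option is already paying off.
function netCost(
  option: OptionEstimate,
  doNothingCostPerMonth: number,
  costPerPersonMonth: number,
): number {
  const build = option.personMonths * costPerPersonMonth;
  const baseline = doNothingCostPerMonth * HORIZON_MONTHS;
  const monthsWithBenefit = Math.max(0, HORIZON_MONTHS - option.monthsUntilBenefit);
  const avoided =
    (doNothingCostPerMonth * option.problemsAddressedPct * monthsWithBenefit) / 100;
  return build + baseline - avoided;
}

// Placeholder estimates for illustration only.
const options: OptionEstimate[] = [
  { name: 'Targeted refactoring', personMonths: 12, monthsUntilBenefit: 2, problemsAddressedPct: 40 },
  { name: 'Strangler fig', personMonths: 30, monthsUntilBenefit: 4, problemsAddressedPct: 70 },
  { name: 'Full rewrite', personMonths: 60, monthsUntilBenefit: 14, problemsAddressedPct: 90 },
];

const costs = options.map(option => ({
  name: option.name,
  cost: netCost(option, 50000, 15000),
}));
```

With these placeholder inputs the full rewrite is the most expensive option inside the window, because its benefit arrives late. A longer horizon or a higher do-nothing cost can change the ranking, which is why the inputs matter more than the formula.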

Step 4: Apply the Decision Criteria

// Decision criteria (boolean signals, counted per group rather than weighted)

interface DecisionFactors {
  // Favors Refactoring
  refactoringFactors: {
    problemsAreLocalized: boolean;         // Issues in specific modules
    teamKnowsCodebase: boolean;            // Institutional knowledge exists
    businessCantWait: boolean;             // Need improvements now
    uncertainRequirements: boolean;        // Don't know what we're building toward
    limitedResources: boolean;             // Can't afford parallel effort
    workingInProduction: boolean;          // System fundamentally works
  };

  // Favors Rewriting
  rewriteFactors: {
    fundamentalArchitectureProblem: boolean; // Core design is wrong
    technologyEndOfLife: boolean;            // Platform is dying
    teamHasNoKnowledge: boolean;             // Original authors long gone
    securityUnfixable: boolean;              // Can't patch vulnerabilities
    requirementsClear: boolean;              // Know exactly what to build
    canAffordParallelDev: boolean;           // Resources for both systems
    cleanBreakPossible: boolean;             // Can cut over without migration
  };

  // Red flags for rewriting
  rewriteRedFlags: {
    movingTarget: boolean;                   // Requirements keep changing
    featureParityRequired: boolean;          // Must match all old features
    noOneUnderstandsOldSystem: boolean;      // Can't know what to rebuild
    excitementDriven: boolean;               // Motivated by new tech, not problems
    underestimatedScope: boolean;            // "It's simpler than the old one"
    singleBigBang: boolean;                  // No incremental path planned
  };
}

function evaluateDecision(factors: DecisionFactors): {
  recommendation: 'refactor' | 'rewrite' | 'strangler' | 'do-nothing';
  confidence: 'high' | 'medium' | 'low';
  reasoning: string[];
} {
  const reasoning: string[] = [];

  // Count red flags for rewrite
  const redFlagCount = Object.values(factors.rewriteRedFlags).filter(Boolean).length;

  if (redFlagCount >= 3) {
    reasoning.push(`${redFlagCount} red flags for rewriting detected`);
    return {
      recommendation: 'refactor',
      confidence: 'high',
      reasoning: [
        ...reasoning,
        'Too many risk factors for a full rewrite',
        'Consider targeted refactoring or strangler fig pattern'
      ],
    };
  }

  // Strong refactor signals
  const refactorScore = Object.values(factors.refactoringFactors).filter(Boolean).length;
  const rewriteScore = Object.values(factors.rewriteFactors).filter(Boolean).length;

  if (refactorScore >= 4 && rewriteScore <= 2) {
    return {
      recommendation: 'refactor',
      confidence: 'high',
      reasoning: [
        'Problems are localized and addressable',
        'Team has necessary knowledge',
        'Incremental improvement is lower risk'
      ],
    };
  }

  if (rewriteScore >= 5 && refactorScore <= 2) {
    // Check if strangler is possible
    if (!factors.rewriteFactors.cleanBreakPossible) {
      return {
        recommendation: 'strangler',
        confidence: 'medium',
        reasoning: [
          'Fundamental issues require rebuilding',
          'But clean cutover not possible',
          'Strangler fig pattern reduces risk'
        ],
      };
    }

    return {
      recommendation: 'rewrite',
      confidence: 'medium',
      reasoning: [
        'Fundamental architecture problems',
        'Technology or knowledge blockers',
        'Clear requirements and resources available'
      ],
    };
  }

  // Mixed signals
  return {
    recommendation: 'strangler',
    confidence: 'low',
    reasoning: [
      'Mixed signals - neither clear refactor nor rewrite',
      'Strangler fig pattern allows course correction',
      'Start with highest-value module replacement'
    ],
  };
}
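
The reasoning strings above are fixed text. A small helper can name the signals that actually fired, which makes the recommendation easier to defend. This is a sketch: `activeSignals` and the sample flag values are hypothetical and are not part of the function above.

```typescript
// Works with any group of boolean signals, such as the red flags above.
function activeSignals(group: Record<string, boolean>): string[] {
  return Object.keys(group).filter(name => group[name]);
}

// Hypothetical red-flag values for illustration.
const exampleRedFlags = {
  movingTarget: true,
  featureParityRequired: true,
  noOneUnderstandsOldSystem: false,
  excitementDriven: true,
  underestimatedScope: false,
  singleBigBang: false,
};

const raised = activeSignals(exampleRedFlags);
// Three flags are raised, which meets the red-flag threshold of 3.
const reasoningLine = `${raised.length} red flags: ${raised.join(', ')}`;
```

Here `reasoningLine` reads "3 red flags: movingTarget, featureParityRequired, excitementDriven", which gives stakeholders something specific to challenge.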

The Strangler Fig: Your Default Strategy

When in doubt, default to the strangler fig pattern. It delivers value incrementally, keeps the system working at every step, and lets you stop or change course at module boundaries.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STRANGLER FIG PATTERN                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Phase 1: Identify Boundaries                                               │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  Old System                                                          │    │
│  │  ┌──────────┬──────────┬──────────┬──────────┬──────────┐          │    │
│  │  │  Auth    │  Users   │  Orders  │ Payments │ Reports  │          │    │
│  │  │          │          │          │          │          │          │    │
│  │  └──────────┴──────────┴──────────┴──────────┴──────────┘          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Identify: Which modules can be extracted independently?                    │
│  Start with: High-value, low-coupling, well-understood modules             │
│                                                                              │
│  Phase 2: Build First Module                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              Facade/Router                                           │    │
│  │                   │                                                  │    │
│  │         ┌─────────┴─────────┐                                       │    │
│  │         ▼                   ▼                                       │    │
│  │  ┌──────────┐    ┌─────────────────────────────────────────┐       │    │
│  │  │  NEW     │    │  Old System (minus Auth)                 │       │    │
│  │  │  Auth    │    │  ┌──────────┬──────────┬──────────┐     │       │    │
│  │  │  Module  │    │  │  Users   │  Orders  │ Payments │     │       │    │
│  │  └──────────┘    │  └──────────┴──────────┴──────────┘     │       │    │
│  │                   └─────────────────────────────────────────┘       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Phase 3: Continue Module by Module                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │              Facade/Router                                           │    │
│  │                   │                                                  │    │
│  │    ┌──────────────┼──────────────┐                                  │    │
│  │    ▼              ▼              ▼                                  │    │
│  │  ┌────────┐  ┌────────┐   ┌───────────────────┐                    │    │
│  │  │  NEW   │  │  NEW   │   │  Old System       │                    │    │
│  │  │  Auth  │  │ Orders │   │  ┌──────────────┐ │                    │    │
│  │  └────────┘  └────────┘   │  │  Payments    │ │                    │    │
│  │                           │  └──────────────┘ │                    │    │
│  │                           └───────────────────┘                    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Phase 4: Complete Migration                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  New System                                                          │    │
│  │  ┌──────────┬──────────┬──────────┬──────────┬──────────┐          │    │
│  │  │  Auth    │  Users   │  Orders  │ Payments │ Reports  │          │    │
│  │  │  (new)   │  (new)   │  (new)   │  (new)   │  (new)   │          │    │
│  │  └──────────┴──────────┴──────────┴──────────┴──────────┘          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  Key: At each phase, the system is fully functional                        │
│       You can stop at any phase if priorities change                       │
│       Risk is distributed across the entire timeline                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
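
The Facade/Router in the diagram can be very small. The sketch below shows the idea with plain functions standing in for the two systems. The handler names, path prefixes, and string results are hypothetical. A real facade would typically be an HTTP proxy or API gateway.

```typescript
type Handler = (path: string) => string;

// Hypothetical stand-ins for the systems behind the facade.
const oldSystem: Handler = path => `old system handled ${path}`;
const newAuth: Handler = path => `new auth handled ${path}`;

interface Route {
  prefix: string; // path prefix owned by a migrated module
  handler: Handler;
}

// The facade sends migrated prefixes to new modules and everything
// else to the old system, so callers never see the migration.
function createFacade(migrated: Route[], fallback: Handler): Handler {
  return path => {
    for (const route of migrated) {
      if (path.indexOf(route.prefix) === 0) {
        return route.handler(path);
      }
    }
    return fallback(path);
  };
}

// Phase 2 from the diagram: only Auth has been replaced.
const facade = createFacade([{ prefix: '/auth/', handler: newAuth }], oldSystem);

const authResult = facade('/auth/login');  // "new auth handled /auth/login"
const ordersResult = facade('/orders/42'); // "old system handled /orders/42"
```

Moving to Phase 3 means adding one more route for Orders, and rolling a module back means removing its route. That is what keeps each step reversible.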

Planning the Strangler

// Prioritization framework for strangler fig

interface ModuleAssessment {
  name: string;

  // Value factors
  businessValue: 1 | 2 | 3 | 4 | 5;        // Impact on business outcomes
  painLevel: 1 | 2 | 3 | 4 | 5;            // How much trouble it causes
  changeFrequency: 1 | 2 | 3 | 4 | 5;      // How often we need to modify it

  // Difficulty factors
  coupling: 1 | 2 | 3 | 4 | 5;             // Dependencies on other modules
  complexity: 1 | 2 | 3 | 4 | 5;           // Inherent complexity
  dataComplexity: 1 | 2 | 3 | 4 | 5;       // Data migration difficulty
  unknowns: 1 | 2 | 3 | 4 | 5;             // How poorly we understand it (5 = many unknowns)
}

function prioritizeModules(modules: ModuleAssessment[]): ModuleAssessment[] {
  return modules
    .map(module => {
      // Higher value score = more valuable to replace
      const valueScore =
        module.businessValue * 2 +
        module.painLevel * 2 +
        module.changeFrequency;
      // Higher difficulty score = harder to replace
      const difficultyScore =
        module.coupling * 2 +
        module.complexity +
        module.dataComplexity +
        module.unknowns;
      // Final score: value / difficulty
      return {
        ...module,
        valueScore,
        difficultyScore,
        priority: valueScore / difficultyScore,
      };
    })
    .sort((a, b) => b.priority - a.priority);
}

// Example
const modules: ModuleAssessment[] = [
  {
    name: 'Authentication',
    businessValue: 4,
    painLevel: 5,           // Security issues, blocking upgrades
    changeFrequency: 2,
    coupling: 5,            // Everything depends on it
    complexity: 3,
    dataComplexity: 2,      // Users table migration
    unknowns: 1,            // Well understood
  },
  {
    name: 'Reporting',
    businessValue: 3,
    painLevel: 4,
    changeFrequency: 4,     // Lots of report requests
    coupling: 1,            // Read-only, no dependencies
    complexity: 2,
    dataComplexity: 1,      // No migration needed
    unknowns: 2,
  },
  {
    name: 'Payments',
    businessValue: 5,
    painLevel: 3,
    changeFrequency: 2,
    coupling: 3,
    complexity: 5,          // Complex business logic
    dataComplexity: 4,      // Financial data migration
    unknowns: 4,            // Original developer left
  },
];

// Result: Reporting first (high value, low coupling),
//         then Auth, then Payments last

Stakeholder Communication

The Executive Summary

┌─────────────────────────────────────────────────────────────────────────────┐
│                    EXECUTIVE COMMUNICATION TEMPLATE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE SITUATION                                                              │
│  ─────────────                                                               │
│  Our [system name] is causing [specific problems]:                          │
│  • [Problem 1]: Costs us [$ or time] per [period]                           │
│  • [Problem 2]: Caused [N] incidents in past [period]                       │
│  • [Problem 3]: Taking [X weeks] for changes that should take [Y days]     │
│                                                                              │
│  THE OPTIONS                                                                 │
│  ───────────                                                                 │
│  Option A: Continue as-is                                                   │
│  • Cost: $[X] over [timeframe]                                              │
│  • Risk: [specific risks]                                                   │
│  • Outcome: Problems continue/worsen                                        │
│                                                                              │
│  Option B: Targeted improvements (RECOMMENDED)                              │
│  • Investment: [Y] person-months                                            │
│  • Timeline: [Z] months for first improvements                              │
│  • Expected outcome: [specific improvements]                                │
│  • Risk: Low - incremental, reversible                                      │
│                                                                              │
│  Option C: Full rebuild                                                     │
│  • Investment: [W] person-months                                            │
│  • Timeline: [V] months before any improvement                              │
│  • Expected outcome: [eventual improvements]                                │
│  • Risk: High - all-or-nothing                                              │
│                                                                              │
│  THE ASK                                                                     │
│  ───────                                                                     │
│  Approve [N] engineers for [M] months to execute Option [B]                │
│  First checkpoint: [date] with [specific deliverable]                      │
│  We'll provide [weekly/monthly] updates on progress                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Translating Technical to Business

// Translation guide for stakeholder communication

const translations = {
  technical: {
    "Technical debt": {
      businessTerm: "Maintenance burden",
      explanation: "Work that slows us down because of past shortcuts",
      metric: "Extra hours per feature due to workarounds",
    },

    "Legacy system": {
      businessTerm: "Older system",
      explanation: "Built with different requirements than we have today",
      metric: "Features we can't build without major changes",
    },

    "Refactoring": {
      businessTerm: "Incremental improvement",
      explanation: "Fixing problems while keeping the system running",
      metric: "Improvements delivered per sprint",
    },

    "Rewrite": {
      businessTerm: "Rebuild from scratch",
      explanation: "Building a new system to replace the current one",
      metric: "Months until any business value delivered",
    },

    "Architecture": {
      businessTerm: "System design",
      explanation: "How the parts of the system fit together",
      metric: "How hard it is to add new features",
    },

    "Coupling": {
      businessTerm: "Dependencies",
      explanation: "How much changing one thing affects other things",
      metric: "Unrelated things that break when we make changes",
    },

    "Test coverage": {
      businessTerm: "Safety net",
      explanation: "How confident we can be that changes don't break things",
      metric: "Bugs found after release vs before",
    },
  },

  // Framing problems in business terms
  problemFraming: {
    slowVelocity: {
      technical: "High cyclomatic complexity and tight coupling",
      business: "Each feature takes 3x longer than it should because engineers have to work around old decisions",
      impact: "We're delivering 4 features per quarter instead of 12",
    },

    frequentBugs: {
      technical: "No unit tests, unclear module boundaries",
      business: "Changes in one area unexpectedly break other areas",
      impact: "2 incidents per week, each costing $X in engineer time and customer trust",
    },

    scalingIssues: {
      technical: "N+1 queries, no caching layer, synchronous processing",
      business: "System slows down as we add more customers",
      impact: "Can't onboard enterprise customers without performance fixes",
    },
  },
};
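As a usage sketch, a glossary like the one above can be applied mechanically to a draft status update before it goes to stakeholders. The `glossary` subset and the `toBusinessLanguage` helper below are hypothetical, and simple word replacement is only a first pass, not a substitute for rewriting the message.

```typescript
// A small, hypothetical subset of the translation guide (lowercase keys).
const glossary: Record<string, string> = {
  'technical debt': 'maintenance burden',
  'legacy system': 'older system',
  'refactoring': 'incremental improvement',
  'coupling': 'dependencies',
};

// Replaces each known technical term with its business term (case-insensitive).
// Note: the replacement is always lowercase, so sentence-initial capitals are lost.
function toBusinessLanguage(text: string): string {
  let result = text;
  for (const [term, businessTerm] of Object.entries(glossary)) {
    result = result.replace(new RegExp(`\\b${term}\\b`, 'gi'), businessTerm);
  }
  return result;
}

console.log(toBusinessLanguage('Refactoring will reduce coupling in the legacy system.'));
// prints "incremental improvement will reduce dependencies in the older system."
```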

Managing Expectations

┌─────────────────────────────────────────────────────────────────────────────┐
│                    EXPECTATION MANAGEMENT                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  What stakeholders often expect:                                            │
│  ───────────────────────────────                                             │
│  • A date when everything will be "fixed"                                  │
│  • No disruption during the improvement                                    │
│  • Continued feature delivery at full speed                                │
│  • Zero risk                                                                │
│                                                                              │
│  What you need to communicate:                                              │
│  ───────────────────────────────                                             │
│  • Improvements will be incremental and continuous                         │
│  • Some velocity reduction during improvement (invest now, gain later)     │
│  • Risk exists with ANY option including doing nothing                     │
│  • Regular checkpoints where we'll share progress and adjust               │
│                                                                              │
│  Useful framings:                                                            │
│  ─────────────────                                                           │
│  "Think of it like road construction":                                      │
│  - We could close the highway for 6 months (rewrite)                       │
│  - Or fix one lane at a time while traffic continues (refactor)            │
│  - The second option is slower per-lane but never blocks traffic           │
│                                                                              │
│  "Think of it like paying off debt":                                        │
│  - We can keep paying interest forever (do nothing)                        │
│  - Pay extra each month to pay it down (refactor)                          │
│  - Take out a new loan to pay off the old one (rewrite) - risky            │
│                                                                              │
│  Checkpoint commitments:                                                     │
│  ────────────────────────                                                    │
│  • Week 4: First module extracted and in production                        │
│  • Week 8: Measurable velocity improvement in that area                    │
│  • Week 12: Decision point on continuing vs adjusting approach             │
│  • Monthly: Metrics dashboard showing progress                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
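The "invest now, gain later" message lands better with a break-even estimate attached. The following is a minimal sketch under simplifying assumptions: baseline velocity is constant, the slowdown and the later gain are fixed fractions of it, and the function name and example numbers are hypothetical.

```typescript
// Months AFTER the improvement period until cumulative output catches up
// with what the team would have shipped without investing.
// Output lost = investmentMonths * velocityReduction (in baseline-months);
// output regained per month afterwards = velocityGain.
function breakEvenMonths(
  investmentMonths: number,  // how long velocity is reduced
  velocityReduction: number, // e.g. 0.25 = 25% slower during the investment
  velocityGain: number,      // e.g. 0.5 = 50% faster afterwards
): number {
  if (velocityGain <= 0) return Infinity; // the investment never pays back
  return (investmentMonths * velocityReduction) / velocityGain;
}

// 3 months at 25% reduced velocity, then 50% faster than baseline.
console.log(breakEvenMonths(3, 0.25, 0.5)); // 1.5
```

Because velocity in a struggling system usually declines if nothing is done, a flat baseline probably understates the benefit, which makes this a conservative number to present.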

Risk Mapping

Risk Categories

// Comprehensive risk assessment

interface RiskAssessment {
  category: string;
  risk: string;
  // 'very high' is included because several rewrite risks below use it
  likelihood: 'low' | 'medium' | 'high' | 'very high';
  impact: 'low' | 'medium' | 'high' | 'very high';
  mitigations: string[];
  triggerSigns: string[];
}

const refactoringRisks: RiskAssessment[] = [
  {
    category: 'Scope',
    risk: 'Problems deeper than expected',
    likelihood: 'medium',
    impact: 'medium',
    mitigations: [
      'Start with investigation spike',
      'Time-box initial refactoring',
      'Have pivot criteria defined upfront',
    ],
    triggerSigns: [
      'Every fix reveals new problems',
      'Estimates consistently wrong',
      'Team morale declining',
    ],
  },
  {
    category: 'Knowledge',
    risk: 'Insufficient understanding of existing system',
    likelihood: 'medium',
    impact: 'high',
    mitigations: [
      'Document as you explore',
      'Pair with experienced team members',
      'Add tests before changing',
    ],
    triggerSigns: [
      'Unexpected breakages',
      'Confusion about why code exists',
      '"Nobody knows why this works"',
    ],
  },
  {
    category: 'Business',
    risk: 'Refactoring slows feature delivery too much',
    likelihood: 'high',
    impact: 'medium',
    mitigations: [
      'Interleave with feature work',
      'Refactor on the path of features',
      'Communicate velocity expectations',
    ],
    triggerSigns: [
      'Stakeholder complaints increase',
      'Pressure to "just ship features"',
      'Team feels guilty about refactoring',
    ],
  },
];

const rewriteRisks: RiskAssessment[] = [
  {
    category: 'Scope',
    risk: 'Second system effect - over-engineering',
    likelihood: 'high',
    impact: 'high',
    mitigations: [
      'Strict MVP definition',
      'Feature parity is not the goal',
      'Ship early and iterate',
    ],
    triggerSigns: [
      '"Let\'s add this feature too while we\'re at it"',
      'Architecture discussions taking weeks',
      'No working code after a month',
    ],
  },
  {
    category: 'Timeline',
    risk: 'Takes much longer than estimated',
    likelihood: 'very high',
    impact: 'high',
    mitigations: [
      '3x your initial estimate',
      'Identify unknowns explicitly',
      'Plan in milestones with off-ramps',
    ],
    triggerSigns: [
      'Every milestone slips',
      '"Just two more weeks" repeatedly',
      'Scope keeps growing',
    ],
  },
  {
    category: 'Business',
    risk: 'Business needs change during rewrite',
    likelihood: 'high',
    impact: 'very high',
    mitigations: [
      'Regular business alignment meetings',
      'Build for current needs, not future guesses',
      'Strangler fig over big bang',
    ],
    triggerSigns: [
      'New requirements incompatible with new design',
      'Competitors ship features you haven\'t built',
      'Stakeholders asking "is it done yet?" frequently',
    ],
  },
  {
    category: 'Team',
    risk: 'Team burns out before completion',
    likelihood: 'medium',
    impact: 'very high',
    mitigations: [
      'Celebrate intermediate milestones',
      'Rotate team members if possible',
      'Mix rewrite with new feature work',
    ],
    triggerSigns: [
      'Cynical comments about the project',
      'Key people asking about other projects',
      'Sick days and PTO increasing',
    ],
  },
  {
    category: 'Parity',
    risk: 'Missing features discovered at launch',
    likelihood: 'very high',
    impact: 'high',
    mitigations: [
      'Exhaustive feature inventory from old system',
      'Beta period with real users',
      'Run systems in parallel',
    ],
    triggerSigns: [
      '"I thought we covered that"',
      'Users reporting missing functionality in beta',
      'Old system behavior was undocumented',
    ],
  },
];

function createRiskMatrix(risks: RiskAssessment[]): void {
  console.log('\nRisk Matrix:');
  console.log('─'.repeat(60));

  // Most severe first. 'very high' must be listed here: several rewrite
  // risks use it, and they would otherwise be silently dropped.
  const impactLevels: Array<[label: string, level: string]> = [
    ['Very High Impact', 'very high'],
    ['High Impact', 'high'],
    ['Medium Impact', 'medium'],
    ['Low Impact', 'low'],
  ];

  // Explicit lookup so an unexpected likelihood never falls through to green
  const likelihoodEmoji: Record<string, string> = {
    'very high': '🔴',
    high: '🔴',
    medium: '🟡',
    low: '🟢',
  };

  for (const [label, level] of impactLevels) {
    const levelRisks = risks.filter(r => r.impact === level);
    if (levelRisks.length === 0) continue; // skip empty groups

    console.log(`\n${label}:`);
    for (const risk of levelRisks) {
      console.log(`  ${likelihoodEmoji[risk.likelihood] ?? '⚪'} ${risk.risk}`);
    }
  }
}
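Grouping by impact shows what could hurt most, but it does not say which risk to tackle first. One common approach is to score each risk as likelihood times impact and sort. The sketch below assumes a hypothetical 1 to 4 weighting and its own small `ScoredRisk` type; the weights are arbitrary, and any monotonic scale would do.

```typescript
type Level = 'low' | 'medium' | 'high' | 'very high';

interface ScoredRisk {
  risk: string;
  likelihood: Level;
  impact: Level;
}

// Hypothetical numeric weights for each level.
const weight: Record<Level, number> = { low: 1, medium: 2, high: 3, 'very high': 4 };

// Score = likelihood x impact, sorted so the most severe risks come first.
function rankRisks(risks: ScoredRisk[]): Array<ScoredRisk & { score: number }> {
  return risks
    .map(r => ({ ...r, score: weight[r.likelihood] * weight[r.impact] }))
    .sort((a, b) => b.score - a.score);
}

const ranked = rankRisks([
  { risk: 'Problems deeper than expected', likelihood: 'medium', impact: 'medium' },
  { risk: 'Takes much longer than estimated', likelihood: 'very high', impact: 'high' },
  { risk: 'Team burns out before completion', likelihood: 'medium', impact: 'very high' },
]);

for (const r of ranked) {
  console.log(`${r.score}  ${r.risk}`);
}
// 12  Takes much longer than estimated
// 8  Team burns out before completion
// 4  Problems deeper than expected
```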

Mitigation Plan Template

## Risk Mitigation Plan

### Risk: [Name of risk]
**Likelihood:** High | Medium | Low
**Impact:** High | Medium | Low

#### Prevention Measures
1. [Action to reduce likelihood]
2. [Action to reduce likelihood]

#### Detection (Trigger Signs)
- [ ] [Observable sign that risk is materializing]
- [ ] [Observable sign that risk is materializing]

#### Response Plan
If this risk materializes:
1. [Immediate action]
2. [Communication to stakeholders]
3. [Course correction options]

#### Contingency
If prevention and response fail:
- [Fallback plan]
- [Acceptable outcome if risk fully materializes]
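If the risks already exist as data, the first half of this template can be generated rather than typed. This is a minimal sketch: `PlannedRisk` mirrors the fields of the risk entries above, `renderMitigationPlan` is a hypothetical helper, and the response and contingency sections are deliberately left out because they need human judgment.

```typescript
interface PlannedRisk {
  risk: string;
  likelihood: string;
  impact: string;
  mitigations: string[];
  triggerSigns: string[];
}

// Renders the header, prevention, and detection sections of the template.
function renderMitigationPlan(r: PlannedRisk): string {
  const lines = [
    `### Risk: ${r.risk}`,
    `**Likelihood:** ${r.likelihood}`,
    `**Impact:** ${r.impact}`,
    '',
    '#### Prevention Measures',
    ...r.mitigations.map((m, i) => `${i + 1}. ${m}`),
    '',
    '#### Detection (Trigger Signs)',
    ...r.triggerSigns.map(s => `- [ ] ${s}`),
  ];
  return lines.join('\n');
}

const plan = renderMitigationPlan({
  risk: 'Problems deeper than expected',
  likelihood: 'Medium',
  impact: 'Medium',
  mitigations: ['Start with investigation spike', 'Time-box initial refactoring'],
  triggerSigns: ['Every fix reveals new problems', 'Estimates consistently wrong'],
});

console.log(plan);
```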

Decision Documentation

Architecture Decision Record (ADR)

# ADR-023: Approach to Inventory System Modernization

## Status
Accepted

## Context
The current inventory management system was built in 2018 and is causing significant issues:
- Average of 4 hours to implement simple changes
- 2-3 production incidents per month related to inventory sync
- Team velocity decreased 40% year-over-year
- 3 engineers (out of 8) have cited the codebase in exit interviews

We evaluated three options: do nothing, targeted refactoring, and full rewrite.

## Decision
We will use a **strangler fig approach**, replacing modules incrementally over 9 months.

### Module Replacement Order
1. **Reporting module** (months 1-2) - Lowest coupling, high pain
2. **Stock level sync** (months 3-4) - Root cause of most incidents
3. **Order integration** (months 5-6) - Highest business value
4. **Core inventory** (months 7-9) - Most complex; by then we'll have established patterns

### Why Not Full Rewrite
- We estimated 12-18 months for full rewrite
- No value delivery until complete
- High risk of scope creep and timeline slip
- Cannot pause for business priorities

### Why Not Pure Refactoring
- Architecture fundamentally doesn't support required features
- Test coverage too low to refactor safely
- Would take longer than replacement for most modules

## Consequences

### Positive
- Value delivered incrementally (first module in production by month 2)
- Can adjust course based on learnings
- Visible progress should help team morale
- Risk distributed over time

### Negative
- Running two architectures for 9 months
- API compatibility layer needed
- Team must learn both old and new patterns
- Higher short-term complexity

### Risks and Mitigations
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Integration complexity | High | Dedicated compatibility layer from start |
| Business pressure to stop | Medium | Monthly stakeholder updates with metrics |
| Scope creep in new modules | Medium | Strict MVP per module, defer features |

## Review Points
- Month 3: After first module, evaluate approach
- Month 6: Checkpoint on timeline and scope
- Month 9: Completion assessment

## Participants
- [Names of people involved in decision]
- Approved by: [Technical lead / CTO]
- Date: [Date]
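The compatibility layer this ADR calls for can start out very small. The sketch below shows one possible shape, not the layer any particular team built: a facade that sends each module's traffic to the new implementation once it has been migrated and to the legacy system otherwise. The `StranglerFacade` class, the `ModuleName` values, and the string-based handlers are all hypothetical simplifications.

```typescript
type Handler = (request: string) => string;

// Hypothetical module names, following the replacement order in the ADR.
type ModuleName = 'reporting' | 'stockSync' | 'orders' | 'coreInventory';

class StranglerFacade {
  private readonly migrated = new Map<ModuleName, Handler>();
  private readonly legacy: Handler;

  constructor(legacy: Handler) {
    this.legacy = legacy;
  }

  // Call once a replacement module is ready for production traffic.
  migrate(module: ModuleName, handler: Handler): void {
    this.migrated.set(module, handler);
  }

  // Off-ramp: send a module's traffic back to the legacy system.
  rollback(module: ModuleName): void {
    this.migrated.delete(module);
  }

  // Routes to the new handler if the module is migrated, else to legacy.
  handle(module: ModuleName, request: string): string {
    const handler = this.migrated.get(module) ?? this.legacy;
    return handler(request);
  }
}

const facade = new StranglerFacade(req => `legacy:${req}`);
facade.migrate('reporting', req => `new:${req}`);

console.log(facade.handle('reporting', 'monthly-report')); // new:monthly-report
console.log(facade.handle('orders', 'order-42'));          // legacy:order-42
```

Keeping `rollback` in the facade from day one is what makes the review points meaningful: if a checkpoint goes badly, a module can be switched back without a deploy-time scramble.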

Quick Reference

Decision Cheat Sheet

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REBUILD VS REFACTOR CHEAT SHEET                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  REFACTOR when:                         REWRITE when:                       │
│  ─────────────────                      ──────────────                       │
│  □ Problems are localized               □ Fundamental architecture wrong    │
│  □ Team understands codebase            □ Technology is truly end-of-life   │
│  □ Business can't wait                  □ Nobody understands old system     │
│  □ Requirements unclear                 □ Security unfixable                │
│  □ Limited resources                    □ Requirements crystal clear        │
│  □ System fundamentally works           □ Can afford parallel effort        │
│                                         □ Clean break possible              │
│                                                                              │
│  STRANGLER FIG when:                    DO NOTHING when:                    │
│  ─────────────────────                  ───────────────────                  │
│  □ Mixed signals                        □ Problems are minor annoyances     │
│  □ Can't afford full rewrite risk       □ System is actually fine           │
│  □ Need incremental value               □ No resources for improvement      │
│  □ Want option to pivot                 □ End-of-life planned anyway        │
│  □ Unsure of full scope                 □ Opportunity cost is low           │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────      │
│                                                                              │
│  RED FLAGS - Do NOT rewrite if:                                             │
│  □ "The new tech will solve everything"                                    │
│  □ No inventory of current behavior (you'll recreate bugs)                 │
│  □ Requirements keep changing                                               │
│  □ Must have 100% feature parity                                            │
│  □ Driven by frustration, not analysis                                     │
│  □ Only plan is "big bang" cutover                                         │
│  □ Team is already burned out                                               │
│                                                                              │
│  ─────────────────────────────────────────────────────────────────────      │
│                                                                              │
│  ALWAYS:                                                                     │
│  ✓ Document the decision and reasoning                                     │
│  ✓ Define success metrics upfront                                          │
│  ✓ Plan checkpoints and off-ramps                                          │
│  ✓ Get stakeholder alignment in writing                                    │
│  ✓ Have a rollback/pivot plan                                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
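For teams that want to run the cheat sheet as a group exercise, the tally can be sketched in a few lines. This is a conversation starter under stated assumptions, not a decision procedure: the `Signals` shape, the rule that any red flag vetoes a rewrite, and the tie-breaking order are all hypothetical choices.

```typescript
type Approach = 'do nothing' | 'refactor' | 'strangler fig' | 'rewrite';

interface Signals {
  checked: Record<Approach, number>; // boxes ticked in each quadrant
  redFlags: number;                  // boxes ticked under RED FLAGS
}

// Picks the quadrant with the most ticked boxes. Assumptions: any red flag
// rules out a rewrite, and ties go to the option listed earlier in `order`.
function suggestApproach(signals: Signals): Approach {
  const order: Approach[] = ['do nothing', 'refactor', 'strangler fig', 'rewrite'];
  let best: Approach = 'do nothing';
  for (const approach of order) {
    if (approach === 'rewrite' && signals.redFlags > 0) continue;
    if (signals.checked[approach] > signals.checked[best]) best = approach;
  }
  return best;
}

const tally = { 'do nothing': 0, refactor: 2, 'strangler fig': 3, rewrite: 5 };

console.log(suggestApproach({ checked: tally, redFlags: 1 })); // strangler fig
console.log(suggestApproach({ checked: tally, redFlags: 0 })); // rewrite
```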

Closing Thoughts

The rebuild vs refactor decision isn't really about code—it's about risk management, resource allocation, and organizational patience. The best technical choice is worthless if the organization can't sustain it.

Key principles:

Diagnose before prescribing. Most "we need to rewrite" feelings have specific, addressable causes. Find them.
Default to incremental. The strangler fig pattern is almost always safer than big-bang rewrites. Prove it wrong before choosing otherwise.
Rewrites are a bet. You're betting that you can build a better system faster than you can fix the existing one, AND that requirements won't change, AND that you won't make different mistakes. That's a lot of ands.
Communicate in business terms. Technical debt, coupling, and architecture don't matter to stakeholders. Velocity, incidents, and opportunity cost do.
Plan for failure. Not failure of the project, but failure of your assumptions. What will you do if the rewrite takes 2x longer? What if refactoring reveals the problem is worse than expected?
Document everything. Future you will want to know why this decision was made. Future colleagues will want to know what was considered.
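To see why "a lot of ands" matters, multiply the odds. The sketch below assumes the three conditions of the rewrite bet are independent and assigns each a made-up 80% confidence; both the independence assumption and the numbers are illustrative only.

```typescript
// Probability that every assumption holds, assuming they are independent.
function jointProbability(probabilities: number[]): number {
  return probabilities.reduce((acc, p) => acc * p, 1);
}

// Hypothetical confidence in each "and": faster to build than to fix,
// requirements stay stable, no new class of mistakes.
const rewriteBet = jointProbability([0.8, 0.8, 0.8]);

console.log(rewriteBet.toFixed(2)); // 0.51
```

Even with 80% confidence in each assumption, the combined bet is close to a coin flip.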

The right answer isn't always "refactor" and isn't always "rewrite." The right answer is the one you arrived at through analysis, communicated clearly, and planned for carefully.

The goal isn't to make the perfect decision. It's to make a defensible decision with clear reasoning, appropriate risk mitigation, and the humility to adjust when new information emerges.
