Architecting LLM-Powered Features Without Coupling Your Core System
February 19, 2026
A Practical Guide to Building AI Features That Can Evolve, Fail Gracefully, and Be Replaced
Table of Contents
- Introduction: The Coupling Problem
- Why LLMs Are Different
- The Cost of Tight Coupling
- Architectural Principles
- The LLM Abstraction Layer
- Gateway Pattern
- Capability-Based Design
- Graceful Degradation Strategies
- Feature Flag Integration
- Queue-Based Architecture
- Caching and Memoization
- Cost Control Architecture
- Testing Strategies
- Observability and Monitoring
- Multi-Provider Strategy
- Prompt Management
- Data Flow Isolation
- Migration and Evolution
- Real-World Patterns
- Decision Framework
Introduction: The Coupling Problem
You're building a product. The PM wants AI features. You integrate OpenAI's API directly into your codebase. Six months later:
- OpenAI changes their API, breaking your integration
- Costs are 10x what you budgeted
- Users complain when the AI is slow or down
- You can't easily A/B test different models
- Your codebase is littered with await openai.chat.completions.create() calls
- You want to try Claude or Gemini, but switching would require rewriting everything
This is the coupling problem.
┌─────────────────────────────────────────────────────────────────┐
│ Tight Coupling: What Goes Wrong │
├─────────────────────────────────────────────────────────────────┤
│ │
│ YOUR CODEBASE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ UserService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ ├── const openai = new OpenAI({ apiKey: '...' }) │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ │ ProductService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ │ SearchService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.embeddings.create(...) │ │
│ │ │ │
│ │ SupportBot.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Problems: │
│ ✗ 47 files import OpenAI directly │
│ ✗ API key scattered or in shared config │
│ ✗ No way to swap providers without 47 file changes │
│ ✗ No centralized error handling │
│ ✗ No cost tracking per feature │
│ ✗ No fallback when OpenAI is down │
│ ✗ Can't test without hitting real API (or mocking everywhere) │
│ │
└─────────────────────────────────────────────────────────────────┘
Why LLMs Are Different
LLM integrations have unique characteristics that make traditional integration patterns insufficient:
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Challenges │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. NON-DETERMINISTIC OUTPUT │
│ • Same input → different output │
│ • Makes testing fundamentally different │
│ • Behavior changes with model updates │
│ │
│ 2. HIGH LATENCY │
│ • 500ms - 30s response times (vs 50ms for typical APIs) │
│ • Streaming responses complicate architecture │
│ • Timeout handling is critical │
│ │
│ 3. UNPREDICTABLE COSTS │
│ • Per-token billing │
│ • Costs scale with usage AND input size │
│ • A bug can cost thousands in minutes │
│ │
│ 4. RAPID EVOLUTION │
│ • New models every few months │
│ • Old models deprecated │
│ • API changes (GPT-3.5 → GPT-4 → GPT-4 Turbo → GPT-4o) │
│ │
│ 5. QUALITY VARIANCE │
│ • Different models excel at different tasks │
│ • Need to experiment to find best fit │
│ • Quality can degrade unexpectedly │
│ │
│ 6. RATE LIMITS AND QUOTAS │
│ • Tokens per minute limits │
│ • Requests per minute limits │
│ • Vary by tier and model │
│ │
│ 7. REGULATORY CONCERNS │
│ • Data residency requirements │
│ • PII handling │
│ • May need to switch providers for compliance │
│ │
└─────────────────────────────────────────────────────────────────┘
The Volatility Matrix
┌─────────────────────────────────────────────────────────────────┐
│ What Changes and How Often │
├─────────────────────────────────────────────────────────────────┤
│ │
│ FREQUENCY OF CHANGE │
│ Low ◄─────────────────► High │
│ │
│ S │ Database │ │ Prompts │ │
│ T │ Schema │ │ │ │
│ A │ │ │ │ │
│ B ├───────────────┼───────────────┼────────────────┤ │
│ I │ Core │ API │ Model │ │
│ L │ Business │ Contracts │ Selection │ │
│ I │ Logic │ │ │ │
│ T ├───────────────┼───────────────┼────────────────┤ │
│ Y │ │ UI │ LLM Provider │ │
│ │ │ Components │ API Changes │ │
│ ▼ │ │ │ │ │
│ │
│ LESSON: Isolate high-volatility components (LLM-related) │
│ from low-volatility components (core business logic) │
│ │
└─────────────────────────────────────────────────────────────────┘
The Cost of Tight Coupling
Technical Debt Accumulation
// Month 1: "Let's just use OpenAI directly, we can refactor later"
// services/user.service.ts
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function generateUserBio(userInfo: UserInfo): Promise<string> {
const response = await openai.chat.completions.create({
model: 'gpt-4',
messages: [
{ role: 'system', content: 'Generate a professional bio.' },
{ role: 'user', content: JSON.stringify(userInfo) }
],
max_tokens: 200,
});
return response.choices[0].message.content;
}
// Month 6: 47 files later...
// - Same pattern copied everywhere
// - No error handling standardization
// - No retry logic
// - No cost tracking
// - No way to test without mocking OpenAI in every test file
// - PM asks: "Can we try Claude for the chat feature?"
// - Answer: "That's a 2-week refactor"
Hidden Costs
┌─────────────────────────────────────────────────────────────────┐
│ Hidden Costs of Tight Coupling │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEVELOPMENT VELOCITY │
│ • Every LLM feature change touches multiple files │
│ • Testing requires complex mocking │
│ • New developers must understand OpenAI API │
│ • Code reviews focus on implementation, not business logic │
│ │
│ OPERATIONAL COSTS │
│ • Can't easily implement cost controls │
│ • No centralized monitoring │
│ • Debugging requires understanding scattered implementations │
│ • Outages affect everything simultaneously │
│ │
│ BUSINESS AGILITY │
│ • Can't quickly A/B test different models │
│ • Vendor lock-in limits negotiation leverage │
│ • Compliance requirements may force painful migrations │
│ • Can't gradually roll out model changes │
│ │
│ TECHNICAL RISK │
│ • Provider outage = your outage │
│ • API deprecation = emergency rewrite │
│ • No fallback capability │
│ • Rate limits hit unexpectedly │
│ │
└─────────────────────────────────────────────────────────────────┘
Architectural Principles
Core Principles for LLM Integration
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Design Principles │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DEPENDENCY INVERSION │
│ High-level modules should not depend on LLM providers. │
│ Both should depend on abstractions. │
│ │
│ ❌ UserService → OpenAI SDK │
│ ✅ UserService → LLMService (interface) ← OpenAIProvider │
│ │
│ 2. SINGLE POINT OF INTEGRATION │
│ All LLM calls flow through one gateway. │
│ Enables centralized control, monitoring, and modification. │
│ │
│ 3. GRACEFUL DEGRADATION │
│ Every LLM-powered feature must have a fallback. │
│ The app should work (perhaps with reduced functionality) │
│ even when the LLM is unavailable. │
│ │
│ 4. CAPABILITY, NOT IMPLEMENTATION │
│ Define what you need (summarize, classify, generate), │
│ not how it's done (OpenAI, Claude, local model). │
│ │
│ 5. CONFIGURATION OVER CODE │
│ Model selection, prompts, and parameters should be │
│ configurable without code changes. │
│ │
│ 6. OBSERVABILITY FIRST │
│ Every LLM call should be traceable, measurable, │
│ and attributable to a feature and user. │
│ │
│ 7. COST AWARENESS │
│ Cost should be a first-class consideration in the │
│ architecture, not an afterthought. │
│ │
└─────────────────────────────────────────────────────────────────┘
Layered Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Decoupled LLM Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ APPLICATION LAYER │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Feature │ │ Feature │ │ Feature │ │ Feature │ │ │
│ │ │ A │ │ B │ │ C │ │ D │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ └───────┼────────────┼────────────┼────────────┼──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CAPABILITY LAYER │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │Summarizer │ │Classifier │ │ Generator │ ... │ │
│ │ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │ │
│ └────────┼──────────────┼──────────────┼───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LLM GATEWAY │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ • Request routing • Rate limiting │ │ │
│ │ │ • Cost tracking • Retry logic │ │ │
│ │ │ • Caching • Circuit breaker │ │ │
│ │ │ • Logging/tracing • Fallback handling │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PROVIDER LAYER │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ OpenAI │ │ Anthro- │ │ Local │ │ Mock │ │ │
│ │ │Provider │ │ pic │ │ (Llama) │ │Provider │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The LLM Abstraction Layer
Provider Interface
// llm/interfaces/provider.interface.ts
export interface LLMMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
export interface LLMCompletionRequest {
messages: LLMMessage[];
model?: string;
maxTokens?: number;
temperature?: number;
stream?: boolean;
metadata?: {
feature: string;
userId?: string;
requestId: string;
};
}
export interface LLMCompletionResponse {
content: string;
model: string;
usage: {
promptTokens: number;
completionTokens: number;
totalTokens: number;
};
finishReason: 'stop' | 'length' | 'content_filter' | 'error';
latencyMs: number;
cost: number;
cached: boolean;
}
export interface LLMEmbeddingRequest {
input: string | string[];
model?: string;
metadata?: {
feature: string;
requestId: string;
};
}
export interface LLMEmbeddingResponse {
embeddings: number[][];
model: string;
usage: {
totalTokens: number;
};
latencyMs: number;
cost: number;
}
// The provider interface - all providers implement this
export interface LLMProvider {
name: string;
complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse>;
completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse>;
embed(request: LLMEmbeddingRequest): Promise<LLMEmbeddingResponse>;
isAvailable(): Promise<boolean>;
getModels(): string[];
}
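The Mock provider in the provider-layer diagram is worth building first: a deterministic, zero-cost implementation lets application code and tests run without network access. A minimal sketch follows; in the real codebase it would declare `implements LLMProvider`, and the response shapes below mirror the interfaces above.

```typescript
// llm/providers/mock.provider.ts
// A deterministic, zero-cost provider for tests and local development.
export class MockProvider {
  name = 'mock';

  constructor(private cannedResponse: string = 'mock response') {}

  async complete(_request: unknown) {
    return {
      content: this.cannedResponse,
      model: 'mock-model',
      usage: { promptTokens: 0, completionTokens: 0, totalTokens: 0 },
      finishReason: 'stop' as const,
      latencyMs: 0,
      cost: 0,
      cached: false,
    };
  }

  async *completeStream(request: unknown) {
    // Stream the canned response word by word, like a real provider would
    for (const word of this.cannedResponse.split(' ')) {
      yield word + ' ';
    }
    return await this.complete(request);
  }

  async embed(request: { input: string | string[] }) {
    const inputs = Array.isArray(request.input) ? request.input : [request.input];
    return {
      embeddings: inputs.map(() => [0, 0, 0]), // fixed dummy vectors
      model: 'mock-embedding',
      usage: { totalTokens: 0 },
      latencyMs: 0,
      cost: 0,
    };
  }

  async isAvailable() {
    return true;
  }

  getModels() {
    return ['mock-model'];
  }
}
```

Registering this provider in the gateway's provider map means any feature can be exercised end-to-end in CI without an API key.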
OpenAI Provider Implementation
// llm/providers/openai.provider.ts
import OpenAI from 'openai';
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class OpenAIProvider implements LLMProvider {
name = 'openai';
private client: OpenAI;
private costPerToken: Record<string, { input: number; output: number }>;
constructor(config: { apiKey: string }) {
this.client = new OpenAI({ apiKey: config.apiKey });
// Cost per 1K tokens (as of 2024)
this.costPerToken = {
'gpt-4o': { input: 0.005, output: 0.015 },
'gpt-4-turbo': { input: 0.01, output: 0.03 },
'gpt-4': { input: 0.03, output: 0.06 },
'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
};
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'gpt-4o';
try {
const response = await this.client.chat.completions.create({
model,
messages: request.messages.map(m => ({
role: m.role,
content: m.content,
})),
max_tokens: request.maxTokens,
temperature: request.temperature,
});
const latencyMs = Date.now() - startTime;
const usage = response.usage!;
const cost = this.calculateCost(model, usage.prompt_tokens, usage.completion_tokens);
return {
content: response.choices[0].message.content || '',
model: response.model,
usage: {
promptTokens: usage.prompt_tokens,
completionTokens: usage.completion_tokens,
totalTokens: usage.total_tokens,
},
finishReason: this.mapFinishReason(response.choices[0].finish_reason),
latencyMs,
cost,
cached: false,
};
} catch (error) {
throw this.mapError(error);
}
}
async *completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'gpt-4o';
let content = '';
const stream = await this.client.chat.completions.create({
model,
messages: request.messages.map(m => ({
role: m.role,
content: m.content,
})),
max_tokens: request.maxTokens,
temperature: request.temperature,
stream: true,
stream_options: { include_usage: true },
});
let finalUsage: any;
let finishReason: string = 'stop';
for await (const chunk of stream) {
if (chunk.choices[0]?.delta?.content) {
const text = chunk.choices[0].delta.content;
content += text;
yield text;
}
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
}
if (chunk.usage) {
finalUsage = chunk.usage;
}
}
const latencyMs = Date.now() - startTime;
const cost = finalUsage
? this.calculateCost(model, finalUsage.prompt_tokens, finalUsage.completion_tokens)
: 0;
return {
content,
model,
usage: {
promptTokens: finalUsage?.prompt_tokens || 0,
completionTokens: finalUsage?.completion_tokens || 0,
totalTokens: finalUsage?.total_tokens || 0,
},
finishReason: this.mapFinishReason(finishReason),
latencyMs,
cost,
cached: false,
};
}
async embed(request: LLMEmbeddingRequest): Promise<LLMEmbeddingResponse> {
const startTime = Date.now();
const model = request.model || 'text-embedding-3-small';
const response = await this.client.embeddings.create({
model,
input: request.input,
});
return {
embeddings: response.data.map(d => d.embedding),
model: response.model,
usage: { totalTokens: response.usage.total_tokens },
latencyMs: Date.now() - startTime,
cost: response.usage.total_tokens * 0.00002 / 1000, // text-embedding-3-small: ~$0.02 per 1M tokens
};
}
async isAvailable(): Promise<boolean> {
try {
await this.client.models.list();
return true;
} catch {
return false;
}
}
getModels(): string[] {
return ['gpt-4o', 'gpt-4-turbo', 'gpt-4', 'gpt-3.5-turbo'];
}
private calculateCost(model: string, inputTokens: number, outputTokens: number): number {
const pricing = this.costPerToken[model] || this.costPerToken['gpt-4o'];
return (inputTokens * pricing.input + outputTokens * pricing.output) / 1000;
}
private mapFinishReason(reason: string): LLMCompletionResponse['finishReason'] {
const mapping: Record<string, LLMCompletionResponse['finishReason']> = {
stop: 'stop',
length: 'length',
content_filter: 'content_filter',
};
return mapping[reason] || 'error';
}
private mapError(error: any): Error {
// Map provider-specific errors to generic errors
if (error.status === 429) {
return new RateLimitError('OpenAI rate limit exceeded', error);
}
if (error.status === 503) {
return new ProviderUnavailableError('OpenAI service unavailable', error);
}
return new LLMError('OpenAI request failed', error);
}
}
// Custom error types
export class LLMError extends Error {
constructor(message: string, public cause?: Error) {
super(message);
this.name = 'LLMError';
}
}
export class RateLimitError extends LLMError {
name = 'RateLimitError';
}
export class ProviderUnavailableError extends LLMError {
name = 'ProviderUnavailableError';
}
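With typed errors in place, the retry logic the gateway diagram promises can be generic: retry only the transient failures (rate limits, provider outages) and rethrow everything else immediately. A sketch with exponential backoff and jitter; the attempt counts and delays are illustrative defaults, not prescribed values.

```typescript
// llm/gateway/retry.ts
// Retry transient failures with exponential backoff and jitter.
// Non-retryable errors (e.g. bad requests) are rethrown immediately.
export interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  isRetryable?: (error: unknown) => boolean;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 500;
  const isRetryable =
    options.isRetryable ??
    ((e: unknown) =>
      e instanceof Error &&
      (e.name === 'RateLimitError' || e.name === 'ProviderUnavailableError'));

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts || !isRetryable(error)) throw error;
      // Exponential backoff: 500ms, 1s, 2s... plus up to 100ms of jitter
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The gateway would wrap each provider call as `withRetry(() => provider.complete(request))`, keeping retry policy out of both the providers and the application code.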
Anthropic Provider Implementation
// llm/providers/anthropic.provider.ts
import Anthropic from '@anthropic-ai/sdk';
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class AnthropicProvider implements LLMProvider {
name = 'anthropic';
private client: Anthropic;
constructor(config: { apiKey: string }) {
this.client = new Anthropic({ apiKey: config.apiKey });
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'claude-sonnet-4-20250514';
// Extract system message (Anthropic handles it separately)
const systemMessage = request.messages.find(m => m.role === 'system')?.content;
const messages = request.messages
.filter(m => m.role !== 'system')
.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
}));
const response = await this.client.messages.create({
model,
max_tokens: request.maxTokens || 4096,
system: systemMessage,
messages,
});
const latencyMs = Date.now() - startTime;
const content = response.content[0].type === 'text'
? response.content[0].text
: '';
return {
content,
model: response.model,
usage: {
promptTokens: response.usage.input_tokens,
completionTokens: response.usage.output_tokens,
totalTokens: response.usage.input_tokens + response.usage.output_tokens,
},
finishReason: response.stop_reason === 'end_turn' ? 'stop' : 'length',
latencyMs,
cost: this.calculateCost(model, response.usage),
cached: false,
};
}
// ... streaming and embed implementations
async isAvailable(): Promise<boolean> {
try {
// Simple health check
await this.client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1,
messages: [{ role: 'user', content: 'hi' }],
});
return true;
} catch {
return false;
}
}
getModels(): string[] {
return ['claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-haiku-20240307'];
}
private calculateCost(model: string, usage: { input_tokens: number; output_tokens: number }): number {
const pricing: Record<string, { input: number; output: number }> = {
'claude-sonnet-4-20250514': { input: 0.003, output: 0.015 },
'claude-3-5-sonnet-20241022': { input: 0.003, output: 0.015 },
'claude-3-haiku-20240307': { input: 0.00025, output: 0.00125 },
};
const p = pricing[model] || pricing['claude-sonnet-4-20250514'];
return (usage.input_tokens * p.input + usage.output_tokens * p.output) / 1000;
}
}
Gateway Pattern
The LLM Gateway
// llm/gateway/llm.gateway.ts
import {
LLMProvider,
LLMCompletionRequest,
LLMCompletionResponse,
LLMEmbeddingRequest,
LLMEmbeddingResponse,
} from '../interfaces';
import { CircuitBreaker } from './circuit-breaker';
import { RateLimiter } from './rate-limiter';
import { Cache } from './cache';
import { CostTracker } from './cost-tracker';
import { Logger } from './logger';
interface GatewayConfig {
providers: Record<string, LLMProvider>;
defaultProvider: string;
fallbackProviders: string[];
cache?: Cache;
costTracker?: CostTracker;
rateLimiter?: RateLimiter;
logger?: Logger;
}
export class LLMGateway {
private providers: Record<string, LLMProvider>;
private defaultProvider: string;
private fallbackProviders: string[];
private circuitBreakers: Record<string, CircuitBreaker>;
private cache?: Cache;
private costTracker?: CostTracker;
private rateLimiter?: RateLimiter;
private logger?: Logger;
constructor(config: GatewayConfig) {
this.providers = config.providers;
this.defaultProvider = config.defaultProvider;
this.fallbackProviders = config.fallbackProviders;
this.cache = config.cache;
this.costTracker = config.costTracker;
this.rateLimiter = config.rateLimiter;
this.logger = config.logger;
// Initialize circuit breakers for each provider
this.circuitBreakers = {};
Object.keys(this.providers).forEach(name => {
this.circuitBreakers[name] = new CircuitBreaker({
failureThreshold: 5,
recoveryTimeout: 30000,
});
});
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const requestId = request.metadata?.requestId || crypto.randomUUID();
// Check rate limits
if (this.rateLimiter) {
const allowed = await this.rateLimiter.checkLimit(
request.metadata?.feature || 'default',
request.metadata?.userId
);
if (!allowed) {
throw new RateLimitExceededError('Rate limit exceeded');
}
}
// Check cache
if (this.cache) {
const cached = await this.cache.get(this.getCacheKey(request));
if (cached) {
this.logger?.info('Cache hit', { requestId, feature: request.metadata?.feature });
return { ...cached, cached: true };
}
}
// Try providers in order
const providersToTry = [this.defaultProvider, ...this.fallbackProviders];
let lastError: Error | null = null;
for (const providerName of providersToTry) {
const provider = this.providers[providerName];
const circuitBreaker = this.circuitBreakers[providerName];
if (!provider || circuitBreaker.isOpen()) {
continue;
}
try {
this.logger?.info('Attempting provider', { providerName, requestId });
const response = await circuitBreaker.execute(() =>
provider.complete(request)
);
// Track cost
if (this.costTracker) {
await this.costTracker.track({
provider: providerName,
model: response.model,
feature: request.metadata?.feature || 'unknown',
userId: request.metadata?.userId,
inputTokens: response.usage.promptTokens,
outputTokens: response.usage.completionTokens,
cost: response.cost,
latencyMs: response.latencyMs,
requestId,
});
}
// Cache successful response
if (this.cache && request.temperature === 0) {
await this.cache.set(this.getCacheKey(request), response);
}
this.logger?.info('Request completed', {
providerName,
requestId,
latencyMs: response.latencyMs,
tokens: response.usage.totalTokens,
cost: response.cost,
});
return response;
} catch (error) {
lastError = error as Error;
this.logger?.warn('Provider failed', {
providerName,
requestId,
error: lastError.message,
});
// Continue to next provider
}
}
this.logger?.error('All providers failed', { requestId, error: lastError?.message });
throw lastError || new Error('All LLM providers failed');
}
async *completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse> {
const provider = this.providers[this.defaultProvider];
// Streaming doesn't support fallback easily, use primary only
const generator = provider.completeStream(request);
let chunk = await generator.next();
while (!chunk.done) {
yield chunk.value;
chunk = await generator.next();
}
// Track cost for streaming response
if (this.costTracker && chunk.value) {
await this.costTracker.track({
provider: this.defaultProvider,
model: chunk.value.model,
feature: request.metadata?.feature || 'unknown',
inputTokens: chunk.value.usage.promptTokens,
outputTokens: chunk.value.usage.completionTokens,
cost: chunk.value.cost,
latencyMs: chunk.value.latencyMs,
requestId: request.metadata?.requestId || '',
});
}
return chunk.value;
}
  private getCacheKey(request: LLMCompletionRequest): string {
    // Note: createHash comes from Node's crypto module
    // (import crypto from 'node:crypto'); the WebCrypto global used
    // for randomUUID() above does not provide it.
    return crypto
      .createHash('sha256')
      .update(JSON.stringify({
        messages: request.messages,
        model: request.model,
        maxTokens: request.maxTokens,
        temperature: request.temperature,
      }))
      .digest('hex');
  }
}
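The gateway imports a `Cache` that hasn't been shown. A minimal in-memory implementation with per-entry TTL is enough to start with; because the gateway only calls `get` and `set`, a Redis-backed version can be swapped in later behind the same two methods. The default TTL here is an assumption.

```typescript
// llm/gateway/cache.ts
// Minimal in-memory cache with per-entry TTL. Expired entries are
// evicted lazily on read.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

export class Cache<T = unknown> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private defaultTtlMs = 60 * 60 * 1000) {} // 1 hour default

  async get(key: string): Promise<T | undefined> {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: T, ttlMs = this.defaultTtlMs): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}
```

Note the gateway only caches when `temperature === 0`; caching non-deterministic responses would silently freeze output the user expects to vary.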
Circuit Breaker Implementation
// llm/gateway/circuit-breaker.ts
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
interface CircuitBreakerConfig {
failureThreshold: number;
recoveryTimeout: number;
halfOpenRequests?: number;
}
export class CircuitBreaker {
private state: CircuitState = 'CLOSED';
private failureCount = 0;
private successCount = 0;
private lastFailureTime = 0;
private config: CircuitBreakerConfig;
constructor(config: CircuitBreakerConfig) {
this.config = {
halfOpenRequests: 3,
...config,
};
}
isOpen(): boolean {
if (this.state === 'OPEN') {
// Check if recovery timeout has passed
if (Date.now() - this.lastFailureTime >= this.config.recoveryTimeout) {
this.state = 'HALF_OPEN';
this.successCount = 0;
return false;
}
return true;
}
return false;
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
throw new CircuitOpenError('Circuit breaker is open');
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
this.failureCount = 0;
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.config.halfOpenRequests!) {
this.state = 'CLOSED';
}
}
}
private onFailure(): void {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.config.failureThreshold) {
this.state = 'OPEN';
}
}
}
export class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = 'CircuitOpenError';
}
}
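The gateway also imports a `RateLimiter`. A simple fixed-window implementation keyed by feature (and optionally user) is enough to satisfy the `checkLimit(feature, userId)` call the gateway makes; the default limits below are illustrative, and production systems often prefer a sliding window or token bucket to avoid bursts at window boundaries.

```typescript
// llm/gateway/rate-limiter.ts
// Fixed-window rate limiter keyed by feature (and optionally user).
export class RateLimiter {
  private windows = new Map<string, { count: number; windowStart: number }>();

  constructor(
    private maxRequests = 60,
    private windowMs = 60_000
  ) {}

  async checkLimit(feature: string, userId?: string): Promise<boolean> {
    const key = userId ? `${feature}:${userId}` : feature;
    const now = Date.now();
    const window = this.windows.get(key);

    // Start a fresh window if none exists or the current one has elapsed
    if (!window || now - window.windowStart >= this.windowMs) {
      this.windows.set(key, { count: 1, windowStart: now });
      return true;
    }

    if (window.count >= this.maxRequests) {
      return false; // limit hit; the gateway rejects (or could queue)
    }
    window.count++;
    return true;
  }
}
```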
Capability-Based Design
Define Capabilities, Not Implementations
// llm/capabilities/index.ts
// Define WHAT you need, not HOW it's done
export interface TextSummarizer {
summarize(text: string, options?: SummarizeOptions): Promise<Summary>;
}
export interface SummarizeOptions {
maxLength?: number;
style?: 'brief' | 'detailed' | 'bullet-points';
targetAudience?: string;
}
export interface Summary {
text: string;
keyPoints: string[];
originalLength: number;
summaryLength: number;
}
export interface TextClassifier {
classify(text: string, categories: string[]): Promise<Classification>;
}
export interface Classification {
category: string;
confidence: number;
reasoning?: string;
}
export interface ContentGenerator {
generate(prompt: GeneratePrompt): Promise<GeneratedContent>;
}
export interface GeneratePrompt {
type: 'email' | 'blog-post' | 'product-description' | 'social-media';
context: Record<string, any>;
tone?: 'formal' | 'casual' | 'professional';
length?: 'short' | 'medium' | 'long';
}
export interface GeneratedContent {
content: string;
suggestions?: string[];
metadata: {
wordCount: number;
readingTimeMinutes: number;
};
}
export interface SemanticSearch {
index(documents: Document[]): Promise<void>;
search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
}
Capability Implementations
// llm/capabilities/summarizer.ts
import { TextSummarizer, SummarizeOptions, Summary } from './index';
import { LLMGateway } from '../gateway';
import { PromptManager } from '../prompts';
export class LLMSummarizer implements TextSummarizer {
constructor(
private gateway: LLMGateway,
private prompts: PromptManager
) {}
async summarize(text: string, options?: SummarizeOptions): Promise<Summary> {
const prompt = this.prompts.get('summarize', {
text,
maxLength: options?.maxLength || 200,
style: options?.style || 'brief',
targetAudience: options?.targetAudience || 'general',
});
const response = await this.gateway.complete({
messages: [
{ role: 'system', content: prompt.system },
{ role: 'user', content: prompt.user },
],
temperature: 0.3,
metadata: {
feature: 'summarizer',
requestId: crypto.randomUUID(),
},
});
// Parse structured response
const parsed = this.parseResponse(response.content);
return {
text: parsed.summary,
keyPoints: parsed.keyPoints,
originalLength: text.length,
summaryLength: parsed.summary.length,
};
}
private parseResponse(content: string): { summary: string; keyPoints: string[] } {
// Handle JSON response or extract from text
try {
return JSON.parse(content);
} catch {
// Fallback: treat entire response as summary
return {
summary: content,
keyPoints: [],
};
}
}
}
// Factory function for dependency injection
export function createSummarizer(gateway: LLMGateway): TextSummarizer {
const prompts = new PromptManager();
return new LLMSummarizer(gateway, prompts);
}
Using Capabilities in Application Code
// services/document.service.ts
import { TextSummarizer, TextClassifier } from '../llm/capabilities';
// Application code depends on CAPABILITIES, not LLM details
export class DocumentService {
constructor(
private summarizer: TextSummarizer,
private classifier: TextClassifier,
private documentRepo: DocumentRepository
) {}
async processDocument(doc: Document): Promise<ProcessedDocument> {
// Summarize
const summary = await this.summarizer.summarize(doc.content, {
style: 'bullet-points',
maxLength: 500,
});
// Classify
const classification = await this.classifier.classify(doc.content, [
'legal',
'financial',
'technical',
'marketing',
'other',
]);
// Store
const processed = {
...doc,
summary: summary.text,
keyPoints: summary.keyPoints,
category: classification.category,
categoryConfidence: classification.confidence,
};
await this.documentRepo.save(processed);
return processed;
}
}
// The DocumentService has NO IDEA:
// - Which LLM provider is being used
// - What model is being called
// - What the prompts look like
// - How much it costs
// - How errors are handled
//
// It just uses summarizer.summarize() and classifier.classify()
Graceful Degradation Strategies
Fallback Hierarchy
// llm/fallback/fallback-strategy.ts
export interface FallbackConfig {
feature: string;
llmCapability: () => Promise<any>;
fallbacks: FallbackOption[];
}
export interface FallbackOption {
name: string;
condition?: () => boolean | Promise<boolean>;
execute: () => Promise<any>;
}
export class FallbackStrategy {
async execute(config: FallbackConfig): Promise<any> {
// Try LLM first
try {
return await config.llmCapability();
} catch (error) {
console.warn(`LLM failed for ${config.feature}:`, error);
}
// Try fallbacks in order
for (const fallback of config.fallbacks) {
try {
if (fallback.condition) {
const shouldUse = await fallback.condition();
if (!shouldUse) continue;
}
console.info(`Using fallback ${fallback.name} for ${config.feature}`);
return await fallback.execute();
} catch (error) {
console.warn(`Fallback ${fallback.name} failed:`, error);
}
}
throw new Error(`All options exhausted for ${config.feature}`);
}
}
// Example usage
const summarizeWithFallback = async (text: string) => {
const strategy = new FallbackStrategy();
return strategy.execute({
feature: 'summarization',
llmCapability: () => summarizer.summarize(text),
fallbacks: [
{
name: 'cached-summary',
condition: async () => {
const cached = await cache.get(`summary:${hash(text)}`);
return !!cached;
},
execute: async () => cache.get(`summary:${hash(text)}`),
},
{
name: 'extractive-summary',
execute: async () => extractiveSummarize(text), // Non-LLM algorithm
},
{
name: 'first-paragraph',
execute: async () => ({
text: text.split('\n\n')[0],
keyPoints: [],
originalLength: text.length,
summaryLength: text.split('\n\n')[0].length,
}),
},
],
});
};
Feature-Level Degradation
// features/ai-features.ts
export class AIFeatureManager {
private featureStatus: Map<string, FeatureStatus> = new Map();
constructor(private llmGateway: LLMGateway) {
this.initializeHealthChecks();
}
private initializeHealthChecks() {
// Periodic health checks (assumes the gateway exposes an aggregate
// isAvailable() across its providers)
setInterval(async () => {
const available = await this.llmGateway.isAvailable();
this.updateAllFeatures(available ? 'full' : 'degraded');
}, 30000);
}
getFeatureMode(feature: string): 'full' | 'degraded' | 'disabled' {
return this.featureStatus.get(feature)?.mode || 'full';
}
setFeatureMode(feature: string, mode: 'full' | 'degraded' | 'disabled') {
this.featureStatus.set(feature, { mode, updatedAt: new Date() });
}
private updateAllFeatures(mode: 'full' | 'degraded') {
for (const [feature] of this.featureStatus) {
// Don't override manually disabled features
if (this.featureStatus.get(feature)?.mode !== 'disabled') {
this.setFeatureMode(feature, mode);
}
}
}
}
// Usage in UI component
function SmartSuggestions({ document }: { document: Document }) {
const featureMode = useFeatureMode('smart-suggestions');
if (featureMode === 'disabled') {
return null;
}
if (featureMode === 'degraded') {
return <BasicSuggestions document={document} />;
}
return <AISuggestions document={document} />;
}
Degradation Communication
// components/AIStatusIndicator.tsx
import React from 'react';
import { useAIStatus } from '../hooks/useAIStatus';
export function AIStatusIndicator() {
const status = useAIStatus();
if (status.mode === 'full') {
return null; // Don't show anything when working normally
}
return (
<div className={`ai-status ai-status--${status.mode}`}>
{status.mode === 'degraded' && (
<>
<span className="ai-status__icon">⚡</span>
<span>AI features are limited. Some suggestions may be simpler.</span>
</>
)}
{status.mode === 'disabled' && (
<>
<span className="ai-status__icon">🔌</span>
<span>AI features are temporarily unavailable.</span>
</>
)}
</div>
);
}
Feature Flag Integration
LLM Feature Flags
// config/feature-flags.ts
export interface LLMFeatureFlag {
name: string;
enabled: boolean;
provider?: string; // Override default provider
model?: string; // Override default model
rolloutPercentage?: number;
userSegments?: string[];
maxCostPerUser?: number;
maxRequestsPerMinute?: number;
}
export const llmFeatureFlags: Record<string, LLMFeatureFlag> = {
'ai-summarization': {
name: 'AI Summarization',
enabled: true,
provider: 'openai',
model: 'gpt-4o',
rolloutPercentage: 100,
maxCostPerUser: 0.50, // $0.50 per user per day
},
'ai-chat': {
name: 'AI Chat Assistant',
enabled: true,
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
rolloutPercentage: 50, // 50% of users
userSegments: ['premium', 'beta'],
maxRequestsPerMinute: 10,
},
'ai-code-review': {
name: 'AI Code Review',
enabled: false, // Not yet launched
provider: 'openai',
model: 'gpt-4-turbo',
},
};
Feature Flag Service
// services/feature-flag.service.ts
import { llmFeatureFlags, LLMFeatureFlag } from '../config/feature-flags';
export class FeatureFlagService {
private flags: Record<string, LLMFeatureFlag>;
  private userCosts: Map<string, number> = new Map(); // In production, persist per-day (e.g. Redis with a daily TTL) so limits reset
constructor() {
this.flags = llmFeatureFlags;
// In production, load from remote config (LaunchDarkly, etc.)
}
isEnabled(
featureName: string,
context: { userId: string; userSegments?: string[] }
): boolean {
const flag = this.flags[featureName];
if (!flag || !flag.enabled) {
return false;
}
// Check user segments
if (flag.userSegments && flag.userSegments.length > 0) {
const hasSegment = context.userSegments?.some(s =>
flag.userSegments!.includes(s)
);
if (!hasSegment) {
return false;
}
}
// Check rollout percentage (consistent per user)
    if (flag.rolloutPercentage !== undefined && flag.rolloutPercentage < 100) {
      const bucket = this.hashUserId(context.userId, featureName);
      // Buckets are 0-99, so admitting buckets below the percentage
      // includes exactly rolloutPercentage% of users
      if (bucket >= flag.rolloutPercentage) {
        return false;
      }
    }
// Check cost limits
if (flag.maxCostPerUser !== undefined) {
const currentCost = this.userCosts.get(context.userId) || 0;
if (currentCost >= flag.maxCostPerUser) {
return false;
}
}
return true;
}
getConfig(featureName: string): { provider?: string; model?: string } {
const flag = this.flags[featureName];
return {
provider: flag?.provider,
model: flag?.model,
};
}
recordCost(userId: string, cost: number): void {
const current = this.userCosts.get(userId) || 0;
this.userCosts.set(userId, current + cost);
}
private hashUserId(userId: string, feature: string): number {
// Consistent hashing for rollout percentage
const str = `${userId}:${feature}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = (hash << 5) - hash + str.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash % 100);
}
}
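The rollout check hinges on `hashUserId` being deterministic: the same user lands in the same 0-99 bucket on every request, so their experience stays stable as `rolloutPercentage` ramps up. The hash, extracted standalone:

```typescript
// Same hash as FeatureFlagService.hashUserId above: a 32-bit string
// hash folded into a 0-99 bucket, deterministic per (userId, feature).
function rolloutBucket(userId: string, feature: string): number {
  const str = `${userId}:${feature}`;
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash << 5) - hash + str.charCodeAt(i);
    hash = hash & hash; // keep it in 32-bit integer range
  }
  return Math.abs(hash % 100);
}
```

A user admitted at a 50% rollout stays admitted when you raise it to 75%, because a bucket below the old threshold is also below the new one.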
Using Feature Flags
// services/document.service.ts (updated)
export class DocumentService {
constructor(
private summarizer: TextSummarizer,
private featureFlags: FeatureFlagService,
private fallbackSummarizer: SimpleSummarizer
) {}
async summarizeDocument(
doc: Document,
userId: string,
userSegments: string[]
): Promise<Summary> {
const isAIEnabled = this.featureFlags.isEnabled('ai-summarization', {
userId,
userSegments,
});
if (!isAIEnabled) {
// Use non-AI fallback
return this.fallbackSummarizer.summarize(doc.content);
}
try {
const summary = await this.summarizer.summarize(doc.content);
// Cost tracking handled by LLM Gateway
return summary;
} catch (error) {
// Graceful degradation to non-AI fallback
console.warn('AI summarization failed, using fallback', error);
return this.fallbackSummarizer.summarize(doc.content);
}
}
}
Queue-Based Architecture
Async LLM Processing
┌─────────────────────────────────────────────────────────────────┐
│ Queue-Based LLM Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SYNCHRONOUS (User Waits) ASYNCHRONOUS (Background) │
│ ───────────────────────────────────────────────────────────── │
│ │
│ User Action User Action │
│ │ │ │
│ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │
│ │ API │ │ API │ │
│ │Handler │ │Handler │ │
│ └───┬────┘ └───┬────┘ │
│ │ │ │
│ ▼ │ Enqueue │
│ ┌────────┐ ▼ │
│ │ LLM │ ┌────────┐ │
│ │Gateway │ │ Queue │──► Job ID │
│ └───┬────┘ └───┬────┘ returned │
│ │ │ to user │
│ │ 2-30 seconds │ │
│ ▼ ▼ │
│ Response ┌────────┐ │
│ to User │ Worker │ │
│ └───┬────┘ │
│ Good for: │ │
│ • Chat ▼ │
│ • Short completions ┌────────┐ │
│ • Real-time features │ LLM │ │
│ │Gateway │ │
│ └───┬────┘ │
│ │ │
│ ▼ │
│ Webhook/ │
│ Notification │
│ to User │
│ │
│ Good for: │
│ • Document processing │
│ • Batch operations │
│ • Non-urgent features │
│ │
└─────────────────────────────────────────────────────────────────┘
Queue Implementation
// llm/queue/llm-job.ts
export interface LLMJob {
id: string;
type: 'completion' | 'embedding' | 'batch-completion';
payload: any;
metadata: {
feature: string;
userId: string;
priority: 'low' | 'normal' | 'high';
callbackUrl?: string;
webhookSecret?: string;
};
status: 'pending' | 'processing' | 'completed' | 'failed';
result?: any;
error?: string;
createdAt: Date;
processedAt?: Date;
completedAt?: Date;
}
// llm/queue/llm-queue.service.ts
import { Queue, Worker, Job } from 'bullmq';
import { Redis } from 'ioredis';
import { LLMGateway } from '../gateway';
export class LLMQueueService {
private queue: Queue;
private worker: Worker;
constructor(
private gateway: LLMGateway,
private redis: Redis
) {
this.queue = new Queue('llm-jobs', { connection: redis });
this.initializeWorker();
}
async enqueue(job: Omit<LLMJob, 'id' | 'status' | 'createdAt'>): Promise<string> {
const bullJob = await this.queue.add(job.type, job, {
priority: this.getPriority(job.metadata.priority),
attempts: 3,
backoff: {
type: 'exponential',
delay: 1000,
},
});
return bullJob.id!;
}
async getJobStatus(jobId: string): Promise<LLMJob | null> {
const job = await this.queue.getJob(jobId);
if (!job) return null;
return {
id: job.id!,
type: job.name as LLMJob['type'],
payload: job.data.payload,
metadata: job.data.metadata,
status: await this.mapJobState(job),
result: job.returnvalue,
error: job.failedReason,
createdAt: new Date(job.timestamp),
processedAt: job.processedOn ? new Date(job.processedOn) : undefined,
completedAt: job.finishedOn ? new Date(job.finishedOn) : undefined,
};
}
private initializeWorker() {
this.worker = new Worker(
'llm-jobs',
async (job: Job) => {
switch (job.name) {
case 'completion':
return this.processCompletion(job.data);
case 'embedding':
return this.processEmbedding(job.data);
case 'batch-completion':
return this.processBatchCompletion(job.data);
default:
throw new Error(`Unknown job type: ${job.name}`);
}
},
{
connection: this.redis,
concurrency: 10, // Process 10 jobs in parallel
}
);
this.worker.on('completed', async (job, result) => {
// Send webhook if configured
if (job.data.metadata.callbackUrl) {
await this.sendWebhook(job.data.metadata, result);
}
});
this.worker.on('failed', async (job, error) => {
console.error(`Job ${job?.id} failed:`, error);
// Could send failure webhook
});
}
  private async processCompletion(data: any): Promise<any> {
    return this.gateway.complete({
      messages: data.payload.messages,
      model: data.payload.model,
      maxTokens: data.payload.maxTokens,
      metadata: data.metadata,
    });
  }
  private async processEmbedding(data: any): Promise<any> {
    return this.gateway.embed({
      input: data.payload.input,
      metadata: data.metadata,
    });
  }
private async processBatchCompletion(data: any): Promise<any[]> {
const results = [];
for (const item of data.payload.items) {
const result = await this.gateway.complete({
messages: item.messages,
model: data.payload.model,
metadata: data.metadata,
});
results.push(result);
}
return results;
}
private getPriority(priority: 'low' | 'normal' | 'high'): number {
return { low: 10, normal: 5, high: 1 }[priority];
}
private async mapJobState(job: Job): Promise<LLMJob['status']> {
const state = await job.getState();
const mapping: Record<string, LLMJob['status']> = {
waiting: 'pending',
active: 'processing',
completed: 'completed',
failed: 'failed',
};
return mapping[state] || 'pending';
}
  private async sendWebhook(metadata: any, result: any): Promise<void> {
    // Send result to callback URL
    await fetch(metadata.callbackUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': this.signPayload(result, metadata.webhookSecret),
      },
      body: JSON.stringify(result),
    });
  }
  private signPayload(payload: any, secret: string): string {
    // HMAC signature so the receiver can verify the webhook is genuine
    return crypto.createHmac('sha256', secret).update(JSON.stringify(payload)).digest('hex');
  }
}
API Endpoints for Queue
// routes/llm.routes.ts
import { Router } from 'express';
import { LLMQueueService } from '../llm/queue';
export function createLLMRoutes(queueService: LLMQueueService): Router {
const router = Router();
// Enqueue a job
router.post('/jobs', async (req, res) => {
const jobId = await queueService.enqueue({
type: req.body.type,
payload: req.body.payload,
      metadata: {
        feature: req.body.feature,
        userId: req.user.id,
        priority: req.body.priority || 'normal',
        callbackUrl: req.body.callbackUrl,
        webhookSecret: req.body.webhookSecret,
      },
});
res.status(202).json({
jobId,
status: 'pending',
statusUrl: `/api/llm/jobs/${jobId}`,
});
});
// Check job status
router.get('/jobs/:jobId', async (req, res) => {
const job = await queueService.getJobStatus(req.params.jobId);
if (!job) {
return res.status(404).json({ error: 'Job not found' });
}
// Only return result if completed
res.json({
id: job.id,
status: job.status,
...(job.status === 'completed' && { result: job.result }),
...(job.status === 'failed' && { error: job.error }),
});
});
return router;
}
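On the client side, the 202 + status-URL contract is typically consumed by polling until the job settles. A sketch of such a helper (hypothetical; the `getStatus` callback stands in for a `fetch` of `GET /api/llm/jobs/:jobId`):

```typescript
// Hypothetical polling helper for the 202/status-URL flow above.
async function pollJob<T>(
  getStatus: () => Promise<{ status: string; result?: T; error?: string }>,
  opts: { intervalMs?: number; timeoutMs?: number } = {}
): Promise<T> {
  const intervalMs = opts.intervalMs ?? 1000;
  const deadline = Date.now() + (opts.timeoutMs ?? 60_000);
  while (Date.now() < deadline) {
    const job = await getStatus();
    if (job.status === 'completed') return job.result as T;
    if (job.status === 'failed') throw new Error(job.error ?? 'Job failed');
    // Still pending or processing: wait before asking again
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for job');
}
```

Webhooks avoid the polling traffic entirely, but a poller is a useful fallback for clients that cannot expose a callback URL.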
Caching and Memoization
Multi-Level Cache
// llm/cache/cache.ts
export interface CacheConfig {
l1: { maxSize: number; ttlMs: number }; // In-memory
l2?: { redis: Redis; ttlMs: number }; // Redis
l3?: { storage: Storage; ttlMs: number }; // Persistent
}
export class LLMCache {
private l1: Map<string, { value: any; expiry: number }> = new Map();
private l1MaxSize: number;
private l2?: Redis;
private l3?: Storage;
private config: CacheConfig;
constructor(config: CacheConfig) {
this.config = config;
this.l1MaxSize = config.l1.maxSize;
this.l2 = config.l2?.redis;
this.l3 = config.l3?.storage;
}
async get(key: string): Promise<any | null> {
// L1: In-memory
const l1Entry = this.l1.get(key);
if (l1Entry && l1Entry.expiry > Date.now()) {
return l1Entry.value;
}
// L2: Redis
if (this.l2) {
const l2Value = await this.l2.get(`llm:${key}`);
if (l2Value) {
const parsed = JSON.parse(l2Value);
// Populate L1
this.setL1(key, parsed);
return parsed;
}
}
// L3: Persistent storage
if (this.l3) {
const l3Value = await this.l3.get(`llm:${key}`);
if (l3Value) {
const parsed = JSON.parse(l3Value);
// Populate L1 and L2
this.setL1(key, parsed);
if (this.l2) {
await this.l2.setex(`llm:${key}`, this.config.l2!.ttlMs / 1000, l3Value);
}
return parsed;
}
}
return null;
}
async set(key: string, value: any): Promise<void> {
const serialized = JSON.stringify(value);
// L1
this.setL1(key, value);
// L2
if (this.l2) {
await this.l2.setex(`llm:${key}`, this.config.l2!.ttlMs / 1000, serialized);
}
// L3
if (this.l3) {
await this.l3.set(`llm:${key}`, serialized, this.config.l3!.ttlMs);
}
}
  private setL1(key: string, value: any): void {
    // Evict the oldest entry when full (Maps iterate in insertion order;
    // true LRU would also re-insert on every get)
    if (this.l1.size >= this.l1MaxSize) {
      const oldestKey = this.l1.keys().next().value;
      if (oldestKey !== undefined) {
        this.l1.delete(oldestKey);
      }
    }
this.l1.set(key, {
value,
expiry: Date.now() + this.config.l1.ttlMs,
});
}
}
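What to use as the cache key matters as much as the cache itself: it should cover every request field that changes the completion (model, messages, sampling parameters) and nothing else. One plausible derivation, assuming deterministic output at the cached settings:

```typescript
import { createHash } from 'crypto';

// Sketch of a cache-key derivation: hash only the fields that affect
// the output. Volatile metadata (requestId, timestamps) must stay out,
// or every request becomes a cache miss.
function completionCacheKey(req: {
  model?: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): string {
  const canonical = JSON.stringify({
    model: req.model ?? 'default',
    messages: req.messages,
    temperature: req.temperature ?? 0,
    maxTokens: req.maxTokens ?? null,
  });
  return createHash('sha256').update(canonical).digest('hex');
}
```

Note the defaults are normalized before hashing, so an omitted `temperature` and an explicit `temperature: 0` share a cache entry.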
Semantic Caching
// llm/cache/semantic-cache.ts
// For similar (not identical) queries
export class SemanticCache {
constructor(
private embedder: LLMGateway,
private vectorStore: VectorStore,
private cache: LLMCache,
private similarityThreshold = 0.95
) {}
async get(query: string): Promise<any | null> {
// Get embedding for query
const embedding = await this.embedder.embed({
input: query,
metadata: { feature: 'semantic-cache', requestId: crypto.randomUUID() },
});
// Search for similar cached queries
const results = await this.vectorStore.search(embedding.embeddings[0], {
topK: 1,
minScore: this.similarityThreshold,
});
if (results.length > 0) {
// Found similar query, return cached response
const cachedKey = results[0].metadata.cacheKey;
return this.cache.get(cachedKey);
}
return null;
}
async set(query: string, response: any): Promise<void> {
const cacheKey = this.generateKey(query);
// Get embedding for query
const embedding = await this.embedder.embed({
input: query,
metadata: { feature: 'semantic-cache', requestId: crypto.randomUUID() },
});
// Store in vector store for similarity search
await this.vectorStore.upsert({
id: cacheKey,
vector: embedding.embeddings[0],
metadata: { cacheKey, query },
});
// Store actual response in regular cache
await this.cache.set(cacheKey, response);
}
private generateKey(query: string): string {
return crypto.createHash('sha256').update(query).digest('hex');
}
}
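The `minScore` the vector store applies is usually cosine similarity between the query embedding and the stored embeddings (the store's actual metric is an assumption here). Shown standalone:

```typescript
// Cosine similarity: 1.0 for identical directions, 0 for orthogonal.
// A threshold like 0.95 only admits near-paraphrases as cache hits.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Tune the threshold against real traffic: too low and users get answers to questions they didn't ask; too high and the semantic cache degenerates into an exact-match cache.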
Cost Control Architecture
Cost Tracking
// llm/cost/cost-tracker.ts
import { startOfDay } from 'date-fns';
export interface CostRecord {
provider: string;
model: string;
feature: string;
userId?: string;
inputTokens: number;
outputTokens: number;
cost: number;
latencyMs: number;
requestId: string;
timestamp: Date;
}
export class CostTracker {
constructor(
private storage: CostStorage,
private alertThresholds: AlertThresholds
) {}
async track(record: Omit<CostRecord, 'timestamp'>): Promise<void> {
const fullRecord: CostRecord = {
...record,
timestamp: new Date(),
};
// Store the record
await this.storage.save(fullRecord);
// Check alerts
await this.checkAlerts(fullRecord);
}
async getCostsByFeature(
startDate: Date,
endDate: Date
): Promise<Record<string, number>> {
const records = await this.storage.query({ startDate, endDate });
return records.reduce((acc, record) => {
acc[record.feature] = (acc[record.feature] || 0) + record.cost;
return acc;
}, {} as Record<string, number>);
}
async getCostsByUser(
userId: string,
startDate: Date,
endDate: Date
): Promise<number> {
const records = await this.storage.query({ userId, startDate, endDate });
return records.reduce((sum, record) => sum + record.cost, 0);
}
private async checkAlerts(record: CostRecord): Promise<void> {
// Feature cost alert
const featureCostToday = await this.getFeatureCostToday(record.feature);
if (featureCostToday > this.alertThresholds.featureDaily) {
await this.sendAlert({
type: 'feature-cost',
message: `Feature ${record.feature} exceeded daily budget`,
cost: featureCostToday,
threshold: this.alertThresholds.featureDaily,
});
}
// User cost alert
if (record.userId) {
const userCostToday = await this.getCostsByUser(
record.userId,
startOfDay(new Date()),
new Date()
);
if (userCostToday > this.alertThresholds.userDaily) {
await this.sendAlert({
type: 'user-cost',
message: `User ${record.userId} exceeded daily budget`,
cost: userCostToday,
threshold: this.alertThresholds.userDaily,
});
}
}
}
}
Budget Enforcement
// llm/cost/budget-enforcer.ts
import { startOfDay } from 'date-fns';
export interface Budget {
type: 'user' | 'feature' | 'organization';
id: string;
dailyLimit: number;
monthlyLimit: number;
action: 'block' | 'warn' | 'throttle';
}
export class BudgetEnforcer {
constructor(
private costTracker: CostTracker,
private budgets: Budget[]
) {}
async checkBudget(
feature: string,
userId?: string,
estimatedCost?: number
): Promise<{ allowed: boolean; reason?: string; remainingBudget?: number }> {
// Check feature budget
const featureBudget = this.budgets.find(
b => b.type === 'feature' && b.id === feature
);
if (featureBudget) {
const featureCost = await this.costTracker.getCostsByFeature(
startOfDay(new Date()),
new Date()
);
const remaining = featureBudget.dailyLimit - (featureCost[feature] || 0);
if (remaining <= 0) {
return this.handleBudgetExceeded(featureBudget, remaining);
}
}
// Check user budget
if (userId) {
const userBudget = this.budgets.find(
b => b.type === 'user' && b.id === userId
);
if (userBudget) {
const userCost = await this.costTracker.getCostsByUser(
userId,
startOfDay(new Date()),
new Date()
);
const remaining = userBudget.dailyLimit - userCost;
if (remaining <= 0) {
return this.handleBudgetExceeded(userBudget, remaining);
}
}
}
return { allowed: true };
}
  private handleBudgetExceeded(
    budget: Budget,
    remaining: number
  ): { allowed: boolean; reason?: string; remainingBudget: number } {
switch (budget.action) {
case 'block':
return {
allowed: false,
reason: `Budget exceeded for ${budget.type} ${budget.id}`,
remainingBudget: remaining,
};
case 'warn':
console.warn(`Budget warning: ${budget.type} ${budget.id}`);
return { allowed: true, remainingBudget: remaining };
case 'throttle':
// Could implement request queuing or rate limiting
return { allowed: true, remainingBudget: remaining };
}
}
}
Cost Dashboard Data
// llm/cost/cost-analytics.ts
export class CostAnalytics {
constructor(private costTracker: CostTracker) {}
async getDashboardData(organizationId: string): Promise<CostDashboard> {
const now = new Date();
const startOfMonth = new Date(now.getFullYear(), now.getMonth(), 1);
const startOfLastMonth = new Date(now.getFullYear(), now.getMonth() - 1, 1);
const [
thisMonthCosts,
lastMonthCosts,
costsByFeature,
costsByModel,
topUsers,
] = await Promise.all([
this.costTracker.getTotalCost(startOfMonth, now),
this.costTracker.getTotalCost(startOfLastMonth, startOfMonth),
this.costTracker.getCostsByFeature(startOfMonth, now),
this.costTracker.getCostsByModel(startOfMonth, now),
this.costTracker.getTopUsersByCost(startOfMonth, now, 10),
]);
return {
summary: {
totalCostThisMonth: thisMonthCosts,
totalCostLastMonth: lastMonthCosts,
        changePercent: lastMonthCosts > 0
          ? ((thisMonthCosts - lastMonthCosts) / lastMonthCosts) * 100
          : 0,
projectedMonthEnd: this.projectMonthEndCost(thisMonthCosts, now),
},
breakdown: {
byFeature: costsByFeature,
byModel: costsByModel,
topUsers: topUsers,
},
trends: await this.getDailyTrends(startOfMonth, now),
};
}
private projectMonthEndCost(currentCost: number, now: Date): number {
const daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();
const dayOfMonth = now.getDate();
return (currentCost / dayOfMonth) * daysInMonth;
}
}
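As a sanity check of the linear projection above: $150 spent by the 15th of a 30-day month projects to $300 at month end. The arithmetic, extracted for verification:

```typescript
// The projection used above: (spend so far / days elapsed) * days in month.
function projectMonthEndCost(currentCost: number, now: Date): number {
  // Day 0 of the next month is the last day of this one
  const daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();
  return (currentCost / now.getDate()) * daysInMonth;
}
```

A straight-line projection over-reacts early in the month (one expensive day dominates); smoothing over a trailing window gives steadier numbers once you have the daily trend data.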
Testing Strategies
Testing Without LLM Calls
// llm/testing/mock-provider.ts
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class MockLLMProvider implements LLMProvider {
name = 'mock';
private responses: Map<string, string> = new Map();
private defaultResponse = 'Mock response';
private callLog: LLMCompletionRequest[] = [];
// Configure mock responses
mockResponse(pattern: string | RegExp, response: string): void {
if (typeof pattern === 'string') {
this.responses.set(pattern, response);
} else {
this.responses.set(pattern.source, response);
}
}
setDefaultResponse(response: string): void {
this.defaultResponse = response;
}
  async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
    this.callLog.push(request);
    const userMessage = request.messages.find(m => m.role === 'user')?.content || '';
    const response = this.findResponse(userMessage);
    const promptTokens = this.estimateTokens(request.messages);
    const completionTokens = this.estimateTokens([{ role: 'assistant', content: response }]);
    return {
      content: response,
      model: 'mock-model',
      usage: {
        promptTokens,
        completionTokens,
        totalTokens: promptTokens + completionTokens,
      },
      finishReason: 'stop',
      latencyMs: 50,
      cost: 0,
      cached: false,
    };
  }
// Test utilities
getCalls(): LLMCompletionRequest[] {
return [...this.callLog];
}
getLastCall(): LLMCompletionRequest | undefined {
return this.callLog[this.callLog.length - 1];
}
clearCalls(): void {
this.callLog = [];
}
assertCalled(times?: number): void {
if (times !== undefined && this.callLog.length !== times) {
throw new Error(`Expected ${times} calls, got ${this.callLog.length}`);
}
if (this.callLog.length === 0) {
throw new Error('Expected at least one call');
}
}
assertCalledWith(matcher: (req: LLMCompletionRequest) => boolean): void {
const match = this.callLog.find(matcher);
if (!match) {
throw new Error('No call matched the criteria');
}
}
private findResponse(userMessage: string): string {
for (const [pattern, response] of this.responses) {
if (userMessage.includes(pattern) || new RegExp(pattern).test(userMessage)) {
return response;
}
}
return this.defaultResponse;
}
private estimateTokens(messages: { content: string }[]): number {
return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
async isAvailable(): Promise<boolean> {
return true;
}
getModels(): string[] {
return ['mock-model'];
}
}
Integration Tests
// tests/integration/summarizer.test.ts
import { describe, it, expect, beforeEach } from 'vitest';
import { MockLLMProvider } from '../../llm/testing/mock-provider';
import { LLMGateway } from '../../llm/gateway';
import { LLMSummarizer } from '../../llm/capabilities/summarizer';
describe('Summarizer Integration', () => {
let mockProvider: MockLLMProvider;
let gateway: LLMGateway;
let summarizer: LLMSummarizer;
beforeEach(() => {
mockProvider = new MockLLMProvider();
gateway = new LLMGateway({
providers: { mock: mockProvider },
defaultProvider: 'mock',
fallbackProviders: [],
});
summarizer = new LLMSummarizer(gateway, new PromptManager());
});
it('should summarize text', async () => {
mockProvider.mockResponse(
'summarize',
JSON.stringify({
summary: 'This is a test summary.',
keyPoints: ['Point 1', 'Point 2'],
})
);
const result = await summarizer.summarize('Long text to summarize...');
expect(result.text).toBe('This is a test summary.');
expect(result.keyPoints).toHaveLength(2);
mockProvider.assertCalled(1);
});
it('should handle LLM errors gracefully', async () => {
mockProvider.complete = async () => {
throw new Error('Provider unavailable');
};
await expect(summarizer.summarize('text')).rejects.toThrow();
});
it('should include correct metadata in requests', async () => {
mockProvider.setDefaultResponse('{"summary": "test", "keyPoints": []}');
await summarizer.summarize('text');
const call = mockProvider.getLastCall();
expect(call?.metadata?.feature).toBe('summarizer');
expect(call?.metadata?.requestId).toBeDefined();
});
});
Snapshot Testing for Prompts
// tests/prompts/summarizer-prompts.test.ts
import { describe, it, expect } from 'vitest';
import { PromptManager } from '../../llm/prompts';
describe('Summarizer Prompts', () => {
const prompts = new PromptManager();
it('should generate consistent prompts', () => {
const prompt = prompts.get('summarize', {
text: 'Sample text',
maxLength: 200,
style: 'brief',
targetAudience: 'general',
});
// Snapshot ensures prompts don't change unexpectedly
expect(prompt).toMatchSnapshot();
});
it('should handle different styles', () => {
const styles = ['brief', 'detailed', 'bullet-points'] as const;
for (const style of styles) {
const prompt = prompts.get('summarize', {
text: 'Sample text',
style,
});
expect(prompt.system).toContain(style);
}
});
});
Contract Testing
// tests/contract/provider-contract.test.ts
import { describe, it, expect, beforeAll } from 'vitest';
import { LLMProvider, LLMCompletionRequest } from '../../llm/interfaces';
import { OpenAIProvider } from '../../llm/providers/openai.provider';
import { AnthropicProvider } from '../../llm/providers/anthropic.provider';
// Contract test: all providers must behave the same way
function testProviderContract(
name: string,
createProvider: () => LLMProvider
) {
describe(`${name} Provider Contract`, () => {
let provider: LLMProvider;
beforeAll(() => {
provider = createProvider();
});
const standardRequest: LLMCompletionRequest = {
messages: [
{ role: 'system', content: 'You are helpful.' },
{ role: 'user', content: 'Say "hello"' },
],
maxTokens: 10,
temperature: 0,
metadata: { feature: 'test', requestId: 'test-123' },
};
it('should return required fields', async () => {
const response = await provider.complete(standardRequest);
expect(response).toHaveProperty('content');
expect(response).toHaveProperty('model');
expect(response).toHaveProperty('usage');
expect(response.usage).toHaveProperty('promptTokens');
expect(response.usage).toHaveProperty('completionTokens');
expect(response.usage).toHaveProperty('totalTokens');
expect(response).toHaveProperty('finishReason');
expect(response).toHaveProperty('latencyMs');
expect(response).toHaveProperty('cost');
});
it('should return valid finish reasons', async () => {
const response = await provider.complete(standardRequest);
expect(['stop', 'length', 'content_filter', 'error']).toContain(
response.finishReason
);
});
it('should report availability', async () => {
const available = await provider.isAvailable();
expect(typeof available).toBe('boolean');
});
it('should list models', () => {
const models = provider.getModels();
expect(Array.isArray(models)).toBe(true);
expect(models.length).toBeGreaterThan(0);
});
});
}
// Run contract tests for each provider (in CI, use mocks or test accounts)
if (process.env.RUN_INTEGRATION_TESTS) {
testProviderContract('OpenAI', () =>
new OpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! })
);
testProviderContract('Anthropic', () =>
new AnthropicProvider({ apiKey: process.env.ANTHROPIC_API_KEY! })
);
}
Observability and Monitoring
Structured Logging
// llm/observability/logger.ts
export interface LLMLogEntry {
timestamp: string;
level: 'debug' | 'info' | 'warn' | 'error';
event: string;
requestId: string;
provider?: string;
model?: string;
feature?: string;
userId?: string;
latencyMs?: number;
tokens?: {
prompt: number;
completion: number;
total: number;
};
cost?: number;
cached?: boolean;
error?: {
name: string;
message: string;
stack?: string;
};
metadata?: Record<string, any>;
}
export class LLMLogger {
constructor(private sink: (entry: LLMLogEntry) => void) {}
logRequest(details: Partial<LLMLogEntry>): void {
this.log('info', 'llm.request.start', details);
}
logResponse(details: Partial<LLMLogEntry>): void {
this.log('info', 'llm.request.complete', details);
}
logError(error: Error, details: Partial<LLMLogEntry>): void {
this.log('error', 'llm.request.error', {
...details,
error: {
name: error.name,
message: error.message,
stack: error.stack,
},
});
}
logCacheHit(details: Partial<LLMLogEntry>): void {
this.log('debug', 'llm.cache.hit', { ...details, cached: true });
}
logFallback(from: string, to: string, details: Partial<LLMLogEntry>): void {
this.log('warn', 'llm.fallback', {
...details,
metadata: { from, to },
});
}
private log(
level: LLMLogEntry['level'],
event: string,
details: Partial<LLMLogEntry>
): void {
    const entry: LLMLogEntry = {
      ...details,
      timestamp: new Date().toISOString(),
      level,
      event,
      requestId: details.requestId || 'unknown',
    };
this.sink(entry);
}
}
// Usage with different sinks
const consoleLogger = new LLMLogger((entry) => {
console.log(JSON.stringify(entry));
});
const datadogLogger = new LLMLogger((entry) => {
  // Send to Datadog (assumes the Datadog logs client is initialized elsewhere)
  datadogLogs.logger.log(entry.level, entry.event, entry);
});
Metrics Collection
// llm/observability/metrics.ts
import { Counter, Histogram, Gauge } from 'prom-client';
export class LLMMetrics {
// Request metrics
private requestCounter = new Counter({
name: 'llm_requests_total',
help: 'Total LLM requests',
labelNames: ['provider', 'model', 'feature', 'status'],
});
private latencyHistogram = new Histogram({
name: 'llm_request_duration_seconds',
help: 'LLM request latency',
labelNames: ['provider', 'model', 'feature'],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
});
private tokenCounter = new Counter({
name: 'llm_tokens_total',
help: 'Total tokens used',
labelNames: ['provider', 'model', 'feature', 'type'],
});
private costCounter = new Counter({
name: 'llm_cost_dollars_total',
help: 'Total cost in dollars',
labelNames: ['provider', 'model', 'feature'],
});
// Cache metrics
private cacheHitCounter = new Counter({
name: 'llm_cache_hits_total',
help: 'Cache hits',
labelNames: ['feature'],
});
// Circuit breaker metrics
private circuitBreakerGauge = new Gauge({
name: 'llm_circuit_breaker_state',
help: 'Circuit breaker state (0=closed, 1=half-open, 2=open)',
labelNames: ['provider'],
});
recordRequest(labels: {
provider: string;
model: string;
feature: string;
status: 'success' | 'error';
}): void {
this.requestCounter.inc(labels);
}
recordLatency(
labels: { provider: string; model: string; feature: string },
durationMs: number
): void {
this.latencyHistogram.observe(labels, durationMs / 1000);
}
recordTokens(
labels: { provider: string; model: string; feature: string },
tokens: { prompt: number; completion: number }
): void {
this.tokenCounter.inc({ ...labels, type: 'prompt' }, tokens.prompt);
this.tokenCounter.inc({ ...labels, type: 'completion' }, tokens.completion);
}
recordCost(
labels: { provider: string; model: string; feature: string },
cost: number
): void {
this.costCounter.inc(labels, cost);
}
recordCacheHit(feature: string): void {
this.cacheHitCounter.inc({ feature });
}
setCircuitBreakerState(
provider: string,
state: 'closed' | 'half-open' | 'open'
): void {
const stateValue = { closed: 0, 'half-open': 1, open: 2 }[state];
this.circuitBreakerGauge.set({ provider }, stateValue);
}
}
Distributed Tracing
// llm/observability/tracing.ts
import { trace, Span, SpanStatusCode } from '@opentelemetry/api';
export class LLMTracer {
private tracer = trace.getTracer('llm-gateway');
async traceRequest<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(name, async (span) => {
try {
// Set attributes
Object.entries(attributes).forEach(([key, value]) => {
span.setAttribute(`llm.${key}`, value);
});
const result = await fn(span);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : 'Unknown error',
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
}
// Usage in gateway
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
return this.tracer.traceRequest(
'llm.complete',
{
provider: this.defaultProvider,
feature: request.metadata?.feature || 'unknown',
model: request.model || 'default',
},
async (span) => {
// Add request details
span.setAttribute('llm.messages_count', request.messages.length);
const response = await this.executeWithFallback(request);
// Add response details
span.setAttribute('llm.tokens.prompt', response.usage.promptTokens);
span.setAttribute('llm.tokens.completion', response.usage.completionTokens);
span.setAttribute('llm.latency_ms', response.latencyMs);
span.setAttribute('llm.cost', response.cost);
span.setAttribute('llm.cached', response.cached);
return response;
}
);
}
Multi-Provider Strategy
Provider Selection
// llm/routing/router.ts
export interface RoutingRule {
name: string;
condition: (request: LLMCompletionRequest) => boolean;
provider: string;
model?: string;
}
export class LLMRouter {
private rules: RoutingRule[] = [];
constructor(private defaultProvider: string) {}
addRule(rule: RoutingRule): void {
this.rules.push(rule);
}
route(request: LLMCompletionRequest): { provider: string; model?: string } {
for (const rule of this.rules) {
if (rule.condition(request)) {
return { provider: rule.provider, model: rule.model };
}
}
return { provider: this.defaultProvider };
}
}
// Example routing rules
const router = new LLMRouter('openai');
// Route coding tasks to Claude
router.addRule({
name: 'coding-to-claude',
  condition: (req) => req.metadata?.feature?.includes('code') ?? false,
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
});
// Route long-form content to GPT-4
router.addRule({
name: 'long-content-to-gpt4',
condition: (req) => (req.maxTokens || 0) > 2000,
provider: 'openai',
model: 'gpt-4-turbo',
});
// Route simple tasks to cheaper models
router.addRule({
name: 'simple-to-haiku',
condition: (req) => {
const systemPrompt = req.messages.find(m => m.role === 'system')?.content || '';
return systemPrompt.length < 200;
},
provider: 'anthropic',
model: 'claude-3-haiku-20240307',
});
A/B Testing Models
// llm/routing/ab-testing.ts
export interface Experiment {
id: string;
name: string;
feature: string;
variants: Array<{
name: string;
provider: string;
model: string;
weight: number;
}>;
enabled: boolean;
}
export class LLMExperimentRouter {
constructor(
private experiments: Experiment[],
private analytics: AnalyticsService
) {}
selectVariant(
feature: string,
userId: string
): { provider: string; model: string; variant: string } | null {
const experiment = this.experiments.find(
e => e.feature === feature && e.enabled
);
if (!experiment) {
return null;
}
// Consistent assignment based on user ID
const hash = this.hashUserId(userId, experiment.id);
let cumulativeWeight = 0;
for (const variant of experiment.variants) {
cumulativeWeight += variant.weight;
if (hash < cumulativeWeight) {
// Track assignment
this.analytics.track('experiment_assigned', {
experimentId: experiment.id,
variant: variant.name,
userId,
});
return {
provider: variant.provider,
model: variant.model,
variant: variant.name,
};
}
}
return null;
}
trackOutcome(
experimentId: string,
userId: string,
outcome: {
success: boolean;
latencyMs: number;
userSatisfaction?: number;
}
): void {
this.analytics.track('experiment_outcome', {
experimentId,
userId,
...outcome,
});
}
// Deterministic bucket in [0, 100): the same user + experiment always hashes the same
private hashUserId(userId: string, experimentId: string): number {
const str = `${userId}:${experimentId}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = (hash << 5) - hash + str.charCodeAt(i);
hash |= 0; // force 32-bit integer overflow semantics
}
return Math.abs(hash % 100);
}
}
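The key property of the bucketing above is determinism: the same `(userId, experimentId)` pair always lands in the same bucket, so a user never flips between variants mid-experiment. A minimal standalone sketch:

```typescript
// Deterministic hash into [0, 100), mirroring the class above
function hashToBucket(userId: string, experimentId: string): number {
  const str = `${userId}:${experimentId}`;
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash << 5) - hash + str.charCodeAt(i);
    hash |= 0; // keep it a 32-bit integer
  }
  return Math.abs(hash % 100);
}

// Walk cumulative weights until the bucket falls inside a variant's slice.
// Weights are assumed to sum to 100; if they sum to less, some users get null.
function pickVariant(
  variants: Array<{ name: string; weight: number }>,
  bucket: number
): string | null {
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return null;
}

const variants = [
  { name: 'gpt-4-turbo', weight: 50 },
  { name: 'claude-sonnet', weight: 50 },
];
const bucket = hashToBucket('user-123', 'exp-summarize-v2');
console.log(pickVariant(variants, bucket)); // same answer on every call
```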
Prompt Management
Centralized Prompt Repository
// llm/prompts/prompt-manager.ts
export interface PromptTemplate {
id: string;
version: string;
system: string;
user: string;
variables: string[];
metadata: {
author: string;
description: string;
lastUpdated: string;
testCases?: Array<{ input: Record<string, any>; expectedOutput: string }>;
};
}
export class PromptManager {
private prompts: Map<string, PromptTemplate> = new Map();
private interpolationRegex = /\{\{(\w+)\}\}/g;
constructor() {
this.loadPrompts();
}
get(
promptId: string,
variables: Record<string, any>
): { system: string; user: string } {
const template = this.prompts.get(promptId);
if (!template) {
throw new Error(`Prompt not found: ${promptId}`);
}
return {
system: this.interpolate(template.system, variables),
user: this.interpolate(template.user, variables),
};
}
private interpolate(template: string, variables: Record<string, any>): string {
return template.replace(this.interpolationRegex, (match, key) => {
if (!(key in variables)) {
throw new Error(`Missing variable: ${key}`);
}
return String(variables[key]);
});
}
private loadPrompts(): void {
// Load from files or database
this.prompts.set('summarize', {
id: 'summarize',
version: '1.2.0',
system: `You are an expert summarizer. Create {{style}} summaries for a {{targetAudience}} audience.
Rules:
- Maximum length: {{maxLength}} characters
- Focus on key points
- Use clear, concise language
- Output JSON: { "summary": "...", "keyPoints": ["...", "..."] }`,
user: `Summarize the following text:
{{text}}`,
variables: ['style', 'targetAudience', 'maxLength', 'text'],
metadata: {
author: 'ai-team',
description: 'General-purpose text summarization',
lastUpdated: '2024-01-15',
},
});
// Add more prompts...
}
// Hot-reload prompts without restart
async reloadPrompts(): Promise<void> {
// Fetch from remote config or database
const freshPrompts = await this.fetchPromptsFromConfig();
this.prompts = new Map(freshPrompts.map(p => [p.id, p]));
}
// Stub: wire this to your config store or database
private async fetchPromptsFromConfig(): Promise<PromptTemplate[]> {
throw new Error('fetchPromptsFromConfig not implemented');
}
}
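The interpolation at the heart of the manager boils down to a single `replace` call with a function replacer: substitute every `{{name}}` token, and fail loudly on any variable the caller forgot to supply. A standalone sketch:

```typescript
// Minimal {{variable}} interpolation, mirroring PromptManager.interpolate
function interpolate(template: string, variables: Record<string, unknown>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    if (!(key in variables)) {
      // Failing fast here surfaces prompt/variable drift at call time,
      // instead of silently sending a literal "{{maxLength}}" to the model
      throw new Error(`Missing variable: ${key}`);
    }
    return String(variables[key]);
  });
}

const out = interpolate('Create {{style}} summaries under {{maxLength}} chars.', {
  style: 'concise',
  maxLength: 280,
});
console.log(out); // Create concise summaries under 280 chars.
```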
Prompt Versioning
// llm/prompts/versioned-prompts.ts
export interface VersionedPrompt {
id: string;
versions: Array<{
version: string;
template: PromptTemplate;
status: 'draft' | 'active' | 'deprecated';
activatedAt?: Date;
deprecatedAt?: Date;
}>;
}
export class VersionedPromptManager {
constructor(private storage: PromptStorage) {}
async getActiveVersion(promptId: string): Promise<PromptTemplate> {
const prompt = await this.storage.get(promptId);
const active = prompt.versions.find(v => v.status === 'active');
if (!active) {
throw new Error(`No active version for prompt: ${promptId}`);
}
return active.template;
}
async createVersion(
promptId: string,
template: Omit<PromptTemplate, 'id' | 'version'>
): Promise<string> {
const prompt = await this.storage.get(promptId);
const newVersion = this.incrementVersion(
prompt.versions[prompt.versions.length - 1]?.version || '0.0.0'
);
prompt.versions.push({
version: newVersion,
template: { ...template, id: promptId, version: newVersion },
status: 'draft',
});
await this.storage.save(prompt);
return newVersion;
}
async activateVersion(promptId: string, version: string): Promise<void> {
const prompt = await this.storage.get(promptId);
// Deprecate current active
prompt.versions.forEach(v => {
if (v.status === 'active') {
v.status = 'deprecated';
v.deprecatedAt = new Date();
}
});
// Activate new version
const target = prompt.versions.find(v => v.version === version);
if (!target) {
throw new Error(`Version not found: ${version}`);
}
target.status = 'active';
target.activatedAt = new Date();
await this.storage.save(prompt);
}
async rollback(promptId: string): Promise<void> {
const prompt = await this.storage.get(promptId);
const currentActive = prompt.versions.find(v => v.status === 'active');
const previousActive = prompt.versions
.filter(v => v.status === 'deprecated')
.sort((a, b) => (b.deprecatedAt?.getTime() || 0) - (a.deprecatedAt?.getTime() || 0))[0];
if (!previousActive) {
throw new Error('No previous version to rollback to');
}
if (currentActive) {
currentActive.status = 'deprecated';
currentActive.deprecatedAt = new Date();
}
previousActive.status = 'active';
previousActive.activatedAt = new Date();
await this.storage.save(prompt);
}
// Bump the patch component of a semver string, e.g. '1.2.0' -> '1.2.1'
private incrementVersion(version: string): string {
const [major, minor, patch] = version.split('.').map(Number);
return `${major}.${minor}.${patch + 1}`;
}
}
Data Flow Isolation
Preventing Data Leakage
// llm/security/data-sanitizer.ts
export interface SanitizationRule {
type: 'pii' | 'secrets' | 'custom';
pattern: RegExp;
replacement: string | ((match: string) => string);
}
export class DataSanitizer {
private rules: SanitizationRule[] = [
// PII patterns
{
type: 'pii',
pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
replacement: '[EMAIL]',
},
{
type: 'pii',
pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
replacement: '[PHONE]',
},
{
type: 'pii',
pattern: /\b\d{3}[-]?\d{2}[-]?\d{4}\b/g,
replacement: '[SSN]',
},
{
type: 'pii',
pattern: /\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g,
replacement: '[CREDIT_CARD]',
},
// Secrets
{
type: 'secrets',
pattern: /(api[_-]?key|apikey|secret|password|token|auth)['":\s=]+['"]?[\w-]+['"]?/gi,
replacement: '[REDACTED_SECRET]',
},
{
type: 'secrets',
pattern: /sk-[a-zA-Z0-9]{48}/g, // legacy-format OpenAI keys; newer key formats vary
replacement: '[OPENAI_KEY]',
},
{
type: 'secrets',
pattern: /ghp_[a-zA-Z0-9]{36}/g, // GitHub tokens
replacement: '[GITHUB_TOKEN]',
},
];
sanitize(text: string): { sanitized: string; redactions: string[] } {
let result = text;
const redactions: string[] = [];
for (const rule of this.rules) {
result = result.replace(rule.pattern, (match) => {
redactions.push(`${rule.type}: ${match.substring(0, 10)}...`); // prefix only; still treat these logs as sensitive
return typeof rule.replacement === 'function'
? rule.replacement(match)
: rule.replacement;
});
}
return { sanitized: result, redactions };
}
addRule(rule: SanitizationRule): void {
this.rules.push(rule);
}
}
// Middleware for the gateway
export function createSanitizationMiddleware(sanitizer: DataSanitizer) {
return async (
request: LLMCompletionRequest,
next: (req: LLMCompletionRequest) => Promise<LLMCompletionResponse>
): Promise<LLMCompletionResponse> => {
// Sanitize user messages before sending to LLM
const sanitizedMessages = request.messages.map(msg => {
if (msg.role === 'user') {
const { sanitized, redactions } = sanitizer.sanitize(msg.content);
if (redactions.length > 0) {
console.info('Sanitized sensitive data:', redactions);
}
return { ...msg, content: sanitized };
}
return msg;
});
return next({ ...request, messages: sanitizedMessages });
};
}
Data Residency
// llm/security/data-residency.ts
export interface DataResidencyConfig {
region: string;
allowedProviders: string[];
allowedEndpoints: Record<string, string>; // provider -> endpoint
}
export class DataResidencyEnforcer {
constructor(private configs: Record<string, DataResidencyConfig>) {}
getConfig(userRegion: string): DataResidencyConfig {
return this.configs[userRegion] || this.configs['default'];
}
validateProvider(userRegion: string, provider: string): boolean {
const config = this.getConfig(userRegion);
return config.allowedProviders.includes(provider);
}
getEndpoint(userRegion: string, provider: string): string | undefined {
const config = this.getConfig(userRegion);
return config.allowedEndpoints[provider];
}
}
// Example configuration
const dataResidencyConfigs: Record<string, DataResidencyConfig> = {
eu: {
region: 'eu',
allowedProviders: ['azure-openai', 'anthropic-eu'],
allowedEndpoints: {
'azure-openai': 'https://eu-west.openai.azure.com',
'anthropic-eu': 'https://api.eu.anthropic.com',
},
},
us: {
region: 'us',
allowedProviders: ['openai', 'anthropic'],
allowedEndpoints: {
openai: 'https://api.openai.com',
anthropic: 'https://api.anthropic.com',
},
},
default: {
region: 'default',
allowedProviders: ['openai', 'anthropic'],
allowedEndpoints: {
openai: 'https://api.openai.com',
anthropic: 'https://api.anthropic.com',
},
},
};
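In the gateway, the enforcer should fail closed: if a provider is not on the region's allowlist, reject the call rather than silently falling back. A minimal sketch of that check (types trimmed to the essentials for illustration):

```typescript
interface ResidencyConfig {
  allowedProviders: string[];
  allowedEndpoints: Record<string, string>;
}

const configs: Record<string, ResidencyConfig> = {
  eu: {
    allowedProviders: ['azure-openai'],
    allowedEndpoints: { 'azure-openai': 'https://eu-west.openai.azure.com' },
  },
  default: {
    allowedProviders: ['openai', 'anthropic'],
    allowedEndpoints: { openai: 'https://api.openai.com' },
  },
};

function resolveEndpoint(userRegion: string, provider: string): string {
  const config = configs[userRegion] ?? configs['default'];
  if (!config.allowedProviders.includes(provider)) {
    // Fail closed: never route restricted data to a disallowed region
    throw new Error(`Provider ${provider} not permitted for region ${userRegion}`);
  }
  return config.allowedEndpoints[provider];
}

console.log(resolveEndpoint('eu', 'azure-openai')); // https://eu-west.openai.azure.com
```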
Real-World Patterns
Pattern 1: The AI Feature Toggle
// patterns/ai-feature-toggle.ts
// Everything works with or without AI
export class SmartSearch {
constructor(
private traditionalSearch: SearchEngine,
private aiSearchEnhancer: AISearchEnhancer | null,
private featureFlags: FeatureFlagService
) {}
async search(query: string, userId: string): Promise<SearchResults> {
// Always do traditional search
const results = await this.traditionalSearch.search(query);
// Optionally enhance with AI if available and enabled
if (
this.aiSearchEnhancer &&
this.featureFlags.isEnabled('ai-search', { userId })
) {
try {
const enhanced = await this.aiSearchEnhancer.enhance(query, results);
return {
...enhanced,
aiEnhanced: true,
};
} catch (error) {
// Log but don't fail
console.warn('AI enhancement failed, using traditional results', error);
return { ...results, aiEnhanced: false };
}
}
return { ...results, aiEnhanced: false };
}
}
Pattern 2: The Async AI Pipeline
// patterns/async-pipeline.ts
// AI runs in background, results delivered asynchronously
export class DocumentProcessor {
constructor(
private docService: DocumentService,
private aiQueue: LLMQueueService,
private notifications: NotificationService
) {}
async uploadDocument(file: File, userId: string): Promise<Document> {
// Immediately save and return
const doc = await this.docService.save({
file,
userId,
status: 'uploaded',
aiProcessingStatus: 'pending',
});
// Queue AI processing
await this.aiQueue.enqueue({
type: 'completion',
payload: {
messages: [
{ role: 'system', content: 'Extract metadata and summarize.' },
{ role: 'user', content: doc.content },
],
},
metadata: {
feature: 'document-processing',
userId,
priority: 'normal',
callbackUrl: `${process.env.API_URL}/webhooks/ai-complete`,
webhookSecret: doc.id, // Used to identify the document
},
});
return doc;
}
// Called when AI processing completes
async handleAIComplete(docId: string, result: any): Promise<void> {
await this.docService.update(docId, {
summary: result.summary,
metadata: result.metadata,
aiProcessingStatus: 'complete',
});
await this.notifications.send({
type: 'document-ready',
docId,
});
}
}
Pattern 3: The AI Copilot Sidecar
// patterns/ai-copilot.ts
// AI provides suggestions, user is always in control
export class EditorCopilot {
constructor(
private llmGateway: LLMGateway,
private featureFlags: FeatureFlagService
) {}
async getSuggestion(context: EditorContext): Promise<Suggestion | null> {
if (!this.featureFlags.isEnabled('copilot', { userId: context.userId })) {
return null;
}
try {
const response = await this.llmGateway.complete({
messages: [
{ role: 'system', content: 'Provide a brief, helpful suggestion.' },
{ role: 'user', content: this.buildPrompt(context) },
],
maxTokens: 200,
temperature: 0.3,
metadata: {
feature: 'copilot',
userId: context.userId,
requestId: crypto.randomUUID(),
},
});
return {
text: response.content,
confidence: this.estimateConfidence(response),
action: 'suggest', // Never auto-apply
};
} catch (error) {
// Copilot failure is silent - it's optional
console.debug('Copilot suggestion failed', error);
return null;
}
}
private buildPrompt(context: EditorContext): string {
return `Current text: ${context.text.substring(0, 500)}
Cursor position: ${context.cursorPosition}
User intent: ${context.lastAction}
Suggest a brief completion or improvement.`;
}
private estimateConfidence(response: LLMCompletionResponse): number {
// Lower confidence for longer responses (more uncertain)
const lengthFactor = Math.max(0.5, 1 - response.usage.completionTokens / 200);
return lengthFactor;
}
}
Decision Framework
When to Use Each Pattern
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Decision Framework │
├─────────────────────────────────────────────────────────────────┤
│ │
│ QUESTION 1: Is real-time response required? │
│ ───────────────────────────────────────────── │
│ YES → Synchronous with timeout and fallback │
│ NO → Queue-based async processing │
│ │
│ QUESTION 2: Is the feature critical to core UX? │
│ ───────────────────────────────────────────── │
│ YES → Implement robust fallbacks, never block on AI │
│ NO → Can gracefully hide feature when AI unavailable │
│ │
│ QUESTION 3: How sensitive is the data? │
│ ───────────────────────────────────────────── │
│ HIGH → Sanitization, data residency, audit logging │
│ LOW → Standard security measures sufficient │
│ │
│ QUESTION 4: What's the cost tolerance? │
│ ───────────────────────────────────────────── │
│ LOW → Aggressive caching, cheaper models, rate limiting │
│ HIGH → Optimize for quality, less aggressive caching │
│ │
│ QUESTION 5: How mature is the use case? │
│ ───────────────────────────────────────────── │
│ EXPERIMENTAL → Feature flags, A/B testing, easy rollback │
│ PROVEN → Standard integration with monitoring │
│ │
└─────────────────────────────────────────────────────────────────┘
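The five questions above can be encoded as a small advisory helper, useful in design reviews to keep the conversation structured. The profile fields and recommendation strings here are illustrative, not prescriptive:

```typescript
interface FeatureProfile {
  realTime: boolean;         // Q1: is real-time response required?
  coreUX: boolean;           // Q2: critical to core UX?
  sensitiveData: boolean;    // Q3: highly sensitive data?
  lowCostTolerance: boolean; // Q4: low cost tolerance?
  experimental: boolean;     // Q5: experimental use case?
}

function recommend(profile: FeatureProfile): string[] {
  const advice: string[] = [];
  advice.push(profile.realTime
    ? 'Synchronous call with timeout and fallback'
    : 'Queue-based async processing');
  advice.push(profile.coreUX
    ? 'Robust fallbacks; never block core flow on AI'
    : 'Hide feature gracefully when AI is unavailable');
  if (profile.sensitiveData) advice.push('Sanitization, data residency, audit logging');
  if (profile.lowCostTolerance) advice.push('Aggressive caching, cheaper models, rate limits');
  if (profile.experimental) advice.push('Feature flags, A/B testing, easy rollback');
  return advice;
}

// Example: an experimental real-time copilot with tight cost limits
console.log(recommend({
  realTime: true, coreUX: true, sensitiveData: false,
  lowCostTolerance: true, experimental: true,
}));
```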
Architecture Selection Guide
┌─────────────────────────────────────────────────────────────────┐
│ When to Use What │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CHAT / CONVERSATION │
│ ───────────────────────────────────────────── │
│ Architecture: Streaming with fallback to queued │
│ Caching: Semantic cache for similar queries │
│ Degradation: Show "AI unavailable" message │
│ │
│ DOCUMENT PROCESSING │
│ ───────────────────────────────────────────── │
│ Architecture: Queue-based async │
│ Caching: Hash-based for identical documents │
│ Degradation: Mark as "pending manual review" │
│ │
│ REAL-TIME SUGGESTIONS │
│ ───────────────────────────────────────────── │
│ Architecture: Sync with aggressive timeout │
│ Caching: Aggressive, even slightly stale is OK │
│ Degradation: Hide suggestions silently │
│ │
│ SEARCH ENHANCEMENT │
│ ───────────────────────────────────────────── │
│ Architecture: Parallel (traditional + AI) │
│ Caching: Cache AI enhancements │
│ Degradation: Use traditional search only │
│ │
│ CONTENT GENERATION │
│ ───────────────────────────────────────────── │
│ Architecture: Queue-based with preview capability │
│ Caching: Template-based, not full responses │
│ Degradation: Offer templates/suggestions instead │
│ │
└─────────────────────────────────────────────────────────────────┘
Summary
┌─────────────────────────────────────────────────────────────────┐
│ Key Takeaways │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. ABSTRACT THE PROVIDER │
│ Never import OpenAI/Anthropic SDK directly in feature code. │
│ Use an interface and gateway pattern. │
│ │
│ 2. DEFINE CAPABILITIES, NOT IMPLEMENTATIONS │
│ Your code should call summarizer.summarize(), not │
│ openai.chat.completions.create(). │
│ │
│ 3. ALWAYS HAVE A FALLBACK │
│ Every AI feature must work (even if degraded) when AI │
│ is unavailable. │
│ │
│ 4. CENTRALIZE CROSS-CUTTING CONCERNS │
│ Rate limiting, cost tracking, logging, retries - all in │
│ one gateway, not scattered across features. │
│ │
│ 5. MAKE IT TESTABLE │
│ Mock providers, contract tests, snapshot tests for prompts. │
│ Never require a real API key to run tests. │
│ │
│ 6. CONTROL COSTS ARCHITECTURALLY │
│ Caching, budgets, feature flags - design for cost control, │
│ don't bolt it on later. │
│ │
│ 7. OBSERVE EVERYTHING │
│ Every LLM call should be logged, traced, and measured. │
│ You can't optimize what you can't measure. │
│ │
│ 8. PREPARE FOR CHANGE │
│ Models change, providers change, APIs change. │
│ Your architecture should make migration easy. │
│ │
└─────────────────────────────────────────────────────────────────┘
Quick Start Checklist
## MVP LLM Integration Checklist
### Day 1: Foundation
- [ ] Create LLMProvider interface
- [ ] Implement one provider (OpenAI or Anthropic)
- [ ] Create LLMGateway with basic error handling
- [ ] Add structured logging
### Week 1: Production Readiness
- [ ] Add circuit breaker
- [ ] Implement basic caching
- [ ] Add cost tracking
- [ ] Create mock provider for tests
- [ ] Add feature flags
### Month 1: Scale
- [ ] Add second provider for fallback
- [ ] Implement queue for async processing
- [ ] Add semantic caching
- [ ] Create cost dashboards
- [ ] Implement budget controls
### Ongoing
- [ ] Monitor and alert on costs
- [ ] A/B test models
- [ ] Review and update prompts
- [ ] Audit for data leakage
References
- OpenAI API Best Practices
- Anthropic API Documentation
- LangChain Architecture Concepts
- Circuit Breaker Pattern (Martin Fowler)
- Feature Toggles (Martin Fowler)
The best LLM integration is one you can rip out and replace in a day.