Architecting LLM-Powered Features Without Coupling Your Core System
February 19, 2026
A Practical Guide to Building AI Features That Can Evolve, Fail Gracefully, and Be Replaced
Table of Contents
- Introduction: The Coupling Problem
- Why LLMs Are Different
- The Cost of Tight Coupling
- Architectural Principles
- The LLM Abstraction Layer
- Gateway Pattern
- Capability-Based Design
- Graceful Degradation Strategies
- Feature Flag Integration
- Queue-Based Architecture
- Caching and Memoization
- Cost Control Architecture
- Testing Strategies
- Observability and Monitoring
- Multi-Provider Strategy
- Prompt Management
- Data Flow Isolation
- Migration and Evolution
- Real-World Patterns
- Decision Framework
Introduction: The Coupling Problem
You're building a product. The PM wants AI features. You integrate OpenAI's API directly into your codebase. Six months later:
- OpenAI changes their API, breaking your integration
- Costs are 10x what you budgeted
- Users complain when the AI is slow or down
- You can't easily A/B test different models
- Your codebase is littered with await openai.chat.completions.create() calls
- You want to try Claude or Gemini, but switching would require rewriting everything
This is the coupling problem.
┌─────────────────────────────────────────────────────────────────┐
│ Tight Coupling: What Goes Wrong │
├─────────────────────────────────────────────────────────────────┤
│ │
│ YOUR CODEBASE │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ UserService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ ├── const openai = new OpenAI({ apiKey: '...' }) │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ │ ProductService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ │ SearchService.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.embeddings.create(...) │ │
│ │ │ │
│ │ SupportBot.ts │ │
│ │ ├── import OpenAI from 'openai' │ │
│ │ └── openai.chat.completions.create(...) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Problems: │
│ ✗ 47 files import OpenAI directly │
│ ✗ API key scattered or in shared config │
│ ✗ No way to swap providers without 47 file changes │
│ ✗ No centralized error handling │
│ ✗ No cost tracking per feature │
│ ✗ No fallback when OpenAI is down │
│ ✗ Can't test without hitting real API (or mocking everywhere) │
│ │
└─────────────────────────────────────────────────────────────────┘
Why LLMs Are Different
LLM integrations have unique characteristics that make traditional integration patterns insufficient:
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Challenges │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. NON-DETERMINISTIC OUTPUT │
│ • Same input → different output │
│ • Makes testing fundamentally different │
│ • Behavior changes with model updates │
│ │
│ 2. HIGH LATENCY │
│ • 500ms - 30s response times (vs 50ms for typical APIs) │
│ • Streaming responses complicate architecture │
│ • Timeout handling is critical │
│ │
│ 3. UNPREDICTABLE COSTS │
│ • Per-token billing │
│ • Costs scale with usage AND input size │
│ • A bug can cost thousands in minutes │
│ │
│ 4. RAPID EVOLUTION │
│ • New models every few months │
│ • Old models deprecated │
│ • API changes (GPT-3.5 → GPT-4 → GPT-4 Turbo → GPT-4o) │
│ │
│ 5. QUALITY VARIANCE │
│ • Different models excel at different tasks │
│ • Need to experiment to find best fit │
│ • Quality can degrade unexpectedly │
│ │
│ 6. RATE LIMITS AND QUOTAS │
│ • Tokens per minute limits │
│ • Requests per minute limits │
│ • Vary by tier and model │
│ │
│ 7. REGULATORY CONCERNS │
│ • Data residency requirements │
│ • PII handling │
│ • May need to switch providers for compliance │
│ │
└─────────────────────────────────────────────────────────────────┘
The Volatility Matrix
┌─────────────────────────────────────────────────────────────────┐
│ What Changes and How Often │
├─────────────────────────────────────────────────────────────────┤
│ │
│ FREQUENCY OF CHANGE │
│ Low ◄─────────────────► High │
│ │
│ S │ Database │ │ Prompts │ │
│ T │ Schema │ │ │ │
│ A │ │ │ │ │
│ B ├───────────────┼───────────────┼────────────────┤ │
│ I │ Core │ API │ Model │ │
│ L │ Business │ Contracts │ Selection │ │
│ I │ Logic │ │ │ │
│ T ├───────────────┼───────────────┼────────────────┤ │
│ Y │ │ UI │ LLM Provider │ │
│ │ │ Components │ API Changes │ │
│ ▼ │ │ │ │ │
│ │
│ LESSON: Isolate high-volatility components (LLM-related) │
│ from low-volatility components (core business logic) │
│ │
└─────────────────────────────────────────────────────────────────┘
The Cost of Tight Coupling
Technical Debt Accumulation
// Month 1: "Let's just use OpenAI directly, we can refactor later"
// services/user.service.ts
import OpenAI from 'openai';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
export async function generateUserBio(userInfo: UserInfo): Promise<string> {
const response = await openai.chat.completions.create({
model: 'gpt-4',
messages: [
{ role: 'system', content: 'Generate a professional bio.' },
{ role: 'user', content: JSON.stringify(userInfo) }
],
max_tokens: 200,
});
return response.choices[0].message.content;
}
// Month 6: 47 files later...
// - Same pattern copied everywhere
// - No error handling standardization
// - No retry logic
// - No cost tracking
// - No way to test without mocking OpenAI in every test file
// - PM asks: "Can we try Claude for the chat feature?"
// - Answer: "That's a 2-week refactor"
Hidden Costs
┌─────────────────────────────────────────────────────────────────┐
│ Hidden Costs of Tight Coupling │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEVELOPMENT VELOCITY │
│ • Every LLM feature change touches multiple files │
│ • Testing requires complex mocking │
│ • New developers must understand OpenAI API │
│ • Code reviews focus on implementation, not business logic │
│ │
│ OPERATIONAL COSTS │
│ • Can't easily implement cost controls │
│ • No centralized monitoring │
│ • Debugging requires understanding scattered implementations │
│ • Outages affect everything simultaneously │
│ │
│ BUSINESS AGILITY │
│ • Can't quickly A/B test different models │
│ • Vendor lock-in limits negotiation leverage │
│ • Compliance requirements may force painful migrations │
│ • Can't gradually roll out model changes │
│ │
│ TECHNICAL RISK │
│ • Provider outage = your outage │
│ • API deprecation = emergency rewrite │
│ • No fallback capability │
│ • Rate limits hit unexpectedly │
│ │
└─────────────────────────────────────────────────────────────────┘
Architectural Principles
Core Principles for LLM Integration
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Design Principles │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. DEPENDENCY INVERSION │
│ High-level modules should not depend on LLM providers. │
│ Both should depend on abstractions. │
│ │
│ ❌ UserService → OpenAI SDK │
│ ✅ UserService → LLMService (interface) ← OpenAIProvider │
│ │
│ 2. SINGLE POINT OF INTEGRATION │
│ All LLM calls flow through one gateway. │
│ Enables centralized control, monitoring, and modification. │
│ │
│ 3. GRACEFUL DEGRADATION │
│ Every LLM-powered feature must have a fallback. │
│ The app should work (perhaps with reduced functionality) │
│ even when the LLM is unavailable. │
│ │
│ 4. CAPABILITY, NOT IMPLEMENTATION │
│ Define what you need (summarize, classify, generate), │
│ not how it's done (OpenAI, Claude, local model). │
│ │
│ 5. CONFIGURATION OVER CODE │
│ Model selection, prompts, and parameters should be │
│ configurable without code changes. │
│ │
│ 6. OBSERVABILITY FIRST │
│ Every LLM call should be traceable, measurable, │
│ and attributable to a feature and user. │
│ │
│ 7. COST AWARENESS │
│ Cost should be a first-class consideration in the │
│ architecture, not an afterthought. │
│ │
└─────────────────────────────────────────────────────────────────┘
Layered Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Decoupled LLM Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ APPLICATION LAYER │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Feature │ │ Feature │ │ Feature │ │ Feature │ │ │
│ │ │ A │ │ B │ │ C │ │ D │ │ │
│ │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │
│ └───────┼────────────┼────────────┼────────────┼──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CAPABILITY LAYER │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │Summarizer │ │Classifier │ │ Generator │ ... │ │
│ │ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │ │
│ └────────┼──────────────┼──────────────┼───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LLM GATEWAY │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ • Request routing • Rate limiting │ │ │
│ │ │ • Cost tracking • Retry logic │ │ │
│ │ │ • Caching • Circuit breaker │ │ │
│ │ │ • Logging/tracing • Fallback handling │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └──────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PROVIDER LAYER │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ OpenAI │ │ Anthro- │ │ Local │ │ Mock │ │ │
│ │ │Provider │ │ pic │ │ (Llama) │ │Provider │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The LLM Abstraction Layer
Provider Interface
// llm/interfaces/provider.interface.ts
export interface LLMMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
export interface LLMCompletionRequest {
messages: LLMMessage[];
model?: string;
maxTokens?: number;
temperature?: number;
stream?: boolean;
metadata?: {
feature: string;
userId?: string;
requestId: string;
};
}
export interface LLMCompletionResponse {
content: string;
model: string;
usage: {
promptTokens: number;
completionTokens: number;
totalTokens: number;
};
finishReason: 'stop' | 'length' | 'content_filter' | 'error';
latencyMs: number;
cost: number;
cached: boolean;
}
export interface LLMEmbeddingRequest {
input: string | string[];
model?: string;
metadata?: {
feature: string;
requestId: string;
};
}
export interface LLMEmbeddingResponse {
embeddings: number[][];
model: string;
usage: {
totalTokens: number;
};
latencyMs: number;
cost: number;
}
// The provider interface - all providers implement this
export interface LLMProvider {
name: string;
complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse>;
completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse>;
embed(request: LLMEmbeddingRequest): Promise<LLMEmbeddingResponse>;
isAvailable(): Promise<boolean>;
getModels(): string[];
}
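The Mock provider in the provider-layer diagram is worth building first: a deterministic, zero-cost implementation lets application code and tests run without network access. A minimal sketch follows; in the real codebase it would declare `implements LLMProvider`, and the response shapes below mirror the interfaces above.

```typescript
// llm/providers/mock.provider.ts
// A deterministic, zero-cost provider for tests and local development.
export class MockProvider {
  name = 'mock';

  constructor(private cannedResponse: string = 'mock response') {}

  async complete(_request: unknown) {
    return {
      content: this.cannedResponse,
      model: 'mock-model',
      usage: { promptTokens: 0, completionTokens: 0, totalTokens: 0 },
      finishReason: 'stop' as const,
      latencyMs: 0,
      cost: 0,
      cached: false,
    };
  }

  async *completeStream(request: unknown) {
    // Stream the canned response word by word, like a real provider would
    for (const word of this.cannedResponse.split(' ')) {
      yield word + ' ';
    }
    return await this.complete(request);
  }

  async embed(request: { input: string | string[] }) {
    const inputs = Array.isArray(request.input) ? request.input : [request.input];
    return {
      embeddings: inputs.map(() => [0, 0, 0]), // fixed dummy vectors
      model: 'mock-embedding',
      usage: { totalTokens: 0 },
      latencyMs: 0,
      cost: 0,
    };
  }

  async isAvailable() {
    return true;
  }

  getModels() {
    return ['mock-model'];
  }
}
```

Registering this provider in the gateway's provider map means any feature can be exercised end-to-end in CI without an API key.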
OpenAI Provider Implementation
// llm/providers/openai.provider.ts
import OpenAI from 'openai';
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class OpenAIProvider implements LLMProvider {
name = 'openai';
private client: OpenAI;
private costPerToken: Record<string, { input: number; output: number }>;
constructor(config: { apiKey: string }) {
this.client = new OpenAI({ apiKey: config.apiKey });
// Cost per 1K tokens (as of 2024)
this.costPerToken = {
'gpt-4o': { input: 0.005, output: 0.015 },
'gpt-4-turbo': { input: 0.01, output: 0.03 },
'gpt-4': { input: 0.03, output: 0.06 },
'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
};
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'gpt-4o';
try {
const response = await this.client.chat.completions.create({
model,
messages: request.messages.map(m => ({
role: m.role,
content: m.content,
})),
max_tokens: request.maxTokens,
temperature: request.temperature,
});
const latencyMs = Date.now() - startTime;
const usage = response.usage!;
const cost = this.calculateCost(model, usage.prompt_tokens, usage.completion_tokens);
return {
content: response.choices[0].message.content || '',
model: response.model,
usage: {
promptTokens: usage.prompt_tokens,
completionTokens: usage.completion_tokens,
totalTokens: usage.total_tokens,
},
finishReason: this.mapFinishReason(response.choices[0].finish_reason),
latencyMs,
cost,
cached: false,
};
} catch (error) {
throw this.mapError(error);
}
}
async *completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'gpt-4o';
let content = '';
const stream = await this.client.chat.completions.create({
model,
messages: request.messages.map(m => ({
role: m.role,
content: m.content,
})),
max_tokens: request.maxTokens,
temperature: request.temperature,
stream: true,
stream_options: { include_usage: true },
});
let finalUsage: any;
let finishReason: string = 'stop';
for await (const chunk of stream) {
if (chunk.choices[0]?.delta?.content) {
const text = chunk.choices[0].delta.content;
content += text;
yield text;
}
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
}
if (chunk.usage) {
finalUsage = chunk.usage;
}
}
const latencyMs = Date.now() - startTime;
const cost = finalUsage
? this.calculateCost(model, finalUsage.prompt_tokens, finalUsage.completion_tokens)
: 0;
return {
content,
model,
usage: {
promptTokens: finalUsage?.prompt_tokens || 0,
completionTokens: finalUsage?.completion_tokens || 0,
totalTokens: finalUsage?.total_tokens || 0,
},
finishReason: this.mapFinishReason(finishReason),
latencyMs,
cost,
cached: false,
};
}
async embed(request: LLMEmbeddingRequest): Promise<LLMEmbeddingResponse> {
const startTime = Date.now();
const model = request.model || 'text-embedding-3-small';
const response = await this.client.embeddings.create({
model,
input: request.input,
});
return {
embeddings: response.data.map(d => d.embedding),
model: response.model,
usage: { totalTokens: response.usage.total_tokens },
latencyMs: Date.now() - startTime,
cost: response.usage.total_tokens * 0.00002 / 1000, // text-embedding-3-small: ~$0.02 per 1M tokens
};
}
async isAvailable(): Promise<boolean> {
try {
await this.client.models.list();
return true;
} catch {
return false;
}
}
getModels(): string[] {
return ['gpt-4o', 'gpt-4-turbo', 'gpt-4', 'gpt-3.5-turbo'];
}
private calculateCost(model: string, inputTokens: number, outputTokens: number): number {
const pricing = this.costPerToken[model] || this.costPerToken['gpt-4o'];
return (inputTokens * pricing.input + outputTokens * pricing.output) / 1000;
}
private mapFinishReason(reason: string): LLMCompletionResponse['finishReason'] {
const mapping: Record<string, LLMCompletionResponse['finishReason']> = {
stop: 'stop',
length: 'length',
content_filter: 'content_filter',
};
return mapping[reason] || 'error';
}
private mapError(error: any): Error {
// Map provider-specific errors to generic errors
if (error.status === 429) {
return new RateLimitError('OpenAI rate limit exceeded', error);
}
if (error.status === 503) {
return new ProviderUnavailableError('OpenAI service unavailable', error);
}
return new LLMError('OpenAI request failed', error);
}
}
// Custom error types
export class LLMError extends Error {
constructor(message: string, public cause?: Error) {
super(message);
this.name = 'LLMError';
}
}
export class RateLimitError extends LLMError {
name = 'RateLimitError';
}
export class ProviderUnavailableError extends LLMError {
name = 'ProviderUnavailableError';
}
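With typed errors in place, the retry logic the gateway diagram promises can be generic: retry only the transient failures (rate limits, provider outages) and rethrow everything else immediately. A sketch with exponential backoff and jitter; the attempt counts and delays are illustrative defaults, not prescribed values.

```typescript
// llm/gateway/retry.ts
// Retry transient failures with exponential backoff and jitter.
// Non-retryable errors (e.g. bad requests) are rethrown immediately.
export interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  isRetryable?: (error: unknown) => boolean;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const maxAttempts = options.maxAttempts ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 500;
  const isRetryable =
    options.isRetryable ??
    ((e: unknown) =>
      e instanceof Error &&
      (e.name === 'RateLimitError' || e.name === 'ProviderUnavailableError'));

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts || !isRetryable(error)) throw error;
      // Exponential backoff: 500ms, 1s, 2s... plus up to 100ms of jitter
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The gateway would wrap each provider call as `withRetry(() => provider.complete(request))`, keeping retry policy out of both the providers and the application code.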
Anthropic Provider Implementation
// llm/providers/anthropic.provider.ts
import Anthropic from '@anthropic-ai/sdk';
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class AnthropicProvider implements LLMProvider {
name = 'anthropic';
private client: Anthropic;
constructor(config: { apiKey: string }) {
this.client = new Anthropic({ apiKey: config.apiKey });
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const model = request.model || 'claude-sonnet-4-20250514';
// Extract system message (Anthropic handles it separately)
const systemMessage = request.messages.find(m => m.role === 'system')?.content;
const messages = request.messages
.filter(m => m.role !== 'system')
.map(m => ({
role: m.role as 'user' | 'assistant',
content: m.content,
}));
const response = await this.client.messages.create({
model,
max_tokens: request.maxTokens || 4096,
system: systemMessage,
messages,
});
const latencyMs = Date.now() - startTime;
const content = response.content[0].type === 'text'
? response.content[0].text
: '';
return {
content,
model: response.model,
usage: {
promptTokens: response.usage.input_tokens,
completionTokens: response.usage.output_tokens,
totalTokens: response.usage.input_tokens + response.usage.output_tokens,
},
finishReason: response.stop_reason === 'end_turn' ? 'stop' : 'length',
latencyMs,
cost: this.calculateCost(model, response.usage),
cached: false,
};
}
// ... streaming and embed implementations
async isAvailable(): Promise<boolean> {
try {
// Simple health check
await this.client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1,
messages: [{ role: 'user', content: 'hi' }],
});
return true;
} catch {
return false;
}
}
getModels(): string[] {
return ['claude-sonnet-4-20250514', 'claude-3-5-sonnet-20241022', 'claude-3-haiku-20240307'];
}
private calculateCost(model: string, usage: { input_tokens: number; output_tokens: number }): number {
const pricing: Record<string, { input: number; output: number }> = {
'claude-sonnet-4-20250514': { input: 0.003, output: 0.015 },
'claude-3-5-sonnet-20241022': { input: 0.003, output: 0.015 },
'claude-3-haiku-20240307': { input: 0.00025, output: 0.00125 },
};
const p = pricing[model] || pricing['claude-sonnet-4-20250514'];
return (usage.input_tokens * p.input + usage.output_tokens * p.output) / 1000;
}
}
Gateway Pattern
The LLM Gateway
// llm/gateway/llm.gateway.ts
import {
LLMProvider,
LLMCompletionRequest,
LLMCompletionResponse,
LLMEmbeddingRequest,
LLMEmbeddingResponse,
} from '../interfaces';
import { CircuitBreaker } from './circuit-breaker';
import { RateLimiter } from './rate-limiter';
import { Cache } from './cache';
import { CostTracker } from './cost-tracker';
import { Logger } from './logger';
interface GatewayConfig {
providers: Record<string, LLMProvider>;
defaultProvider: string;
fallbackProviders: string[];
cache?: Cache;
costTracker?: CostTracker;
rateLimiter?: RateLimiter;
logger?: Logger;
}
export class LLMGateway {
private providers: Record<string, LLMProvider>;
private defaultProvider: string;
private fallbackProviders: string[];
private circuitBreakers: Record<string, CircuitBreaker>;
private cache?: Cache;
private costTracker?: CostTracker;
private rateLimiter?: RateLimiter;
private logger?: Logger;
constructor(config: GatewayConfig) {
this.providers = config.providers;
this.defaultProvider = config.defaultProvider;
this.fallbackProviders = config.fallbackProviders;
this.cache = config.cache;
this.costTracker = config.costTracker;
this.rateLimiter = config.rateLimiter;
this.logger = config.logger;
// Initialize circuit breakers for each provider
this.circuitBreakers = {};
Object.keys(this.providers).forEach(name => {
this.circuitBreakers[name] = new CircuitBreaker({
failureThreshold: 5,
recoveryTimeout: 30000,
});
});
}
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
const startTime = Date.now();
const requestId = request.metadata?.requestId || crypto.randomUUID();
// Check rate limits
if (this.rateLimiter) {
const allowed = await this.rateLimiter.checkLimit(
request.metadata?.feature || 'default',
request.metadata?.userId
);
if (!allowed) {
throw new RateLimitExceededError('Rate limit exceeded');
}
}
// Check cache
if (this.cache) {
const cached = await this.cache.get(this.getCacheKey(request));
if (cached) {
this.logger?.info('Cache hit', { requestId, feature: request.metadata?.feature });
return { ...cached, cached: true };
}
}
// Try providers in order
const providersToTry = [this.defaultProvider, ...this.fallbackProviders];
let lastError: Error | null = null;
for (const providerName of providersToTry) {
const provider = this.providers[providerName];
const circuitBreaker = this.circuitBreakers[providerName];
if (!provider || circuitBreaker.isOpen()) {
continue;
}
try {
this.logger?.info('Attempting provider', { providerName, requestId });
const response = await circuitBreaker.execute(() =>
provider.complete(request)
);
// Track cost
if (this.costTracker) {
await this.costTracker.track({
provider: providerName,
model: response.model,
feature: request.metadata?.feature || 'unknown',
userId: request.metadata?.userId,
inputTokens: response.usage.promptTokens,
outputTokens: response.usage.completionTokens,
cost: response.cost,
latencyMs: response.latencyMs,
requestId,
});
}
// Cache successful response
if (this.cache && request.temperature === 0) {
await this.cache.set(this.getCacheKey(request), response);
}
this.logger?.info('Request completed', {
providerName,
requestId,
latencyMs: response.latencyMs,
tokens: response.usage.totalTokens,
cost: response.cost,
});
return response;
} catch (error) {
lastError = error as Error;
this.logger?.warn('Provider failed', {
providerName,
requestId,
error: lastError.message,
});
// Continue to next provider
}
}
this.logger?.error('All providers failed', { requestId, error: lastError?.message });
throw lastError || new Error('All LLM providers failed');
}
async *completeStream(
request: LLMCompletionRequest
): AsyncGenerator<string, LLMCompletionResponse> {
const provider = this.providers[this.defaultProvider];
// Streaming doesn't support fallback easily, use primary only
const generator = provider.completeStream(request);
let chunk = await generator.next();
while (!chunk.done) {
yield chunk.value;
chunk = await generator.next();
}
// Track cost for streaming response
if (this.costTracker && chunk.value) {
await this.costTracker.track({
provider: this.defaultProvider,
model: chunk.value.model,
feature: request.metadata?.feature || 'unknown',
inputTokens: chunk.value.usage.promptTokens,
outputTokens: chunk.value.usage.completionTokens,
cost: chunk.value.cost,
latencyMs: chunk.value.latencyMs,
requestId: request.metadata?.requestId || '',
});
}
return chunk.value;
}
  private getCacheKey(request: LLMCompletionRequest): string {
    // Note: createHash comes from Node's crypto module
    // (import crypto from 'node:crypto'); the WebCrypto global used
    // for randomUUID() above does not provide it.
    return crypto
      .createHash('sha256')
      .update(JSON.stringify({
        messages: request.messages,
        model: request.model,
        maxTokens: request.maxTokens,
        temperature: request.temperature,
      }))
      .digest('hex');
  }
}
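The gateway imports a `Cache` that hasn't been shown. A minimal in-memory implementation with per-entry TTL is enough to start with; because the gateway only calls `get` and `set`, a Redis-backed version can be swapped in later behind the same two methods. The default TTL here is an assumption.

```typescript
// llm/gateway/cache.ts
// Minimal in-memory cache with per-entry TTL. Expired entries are
// evicted lazily on read.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

export class Cache<T = unknown> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private defaultTtlMs = 60 * 60 * 1000) {} // 1 hour default

  async get(key: string): Promise<T | undefined> {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: T, ttlMs = this.defaultTtlMs): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}
```

Note the gateway only caches when `temperature === 0`; caching non-deterministic responses would silently freeze output the user expects to vary.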
Circuit Breaker Implementation
// llm/gateway/circuit-breaker.ts
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
interface CircuitBreakerConfig {
failureThreshold: number;
recoveryTimeout: number;
halfOpenRequests?: number;
}
export class CircuitBreaker {
private state: CircuitState = 'CLOSED';
private failureCount = 0;
private successCount = 0;
private lastFailureTime = 0;
private config: CircuitBreakerConfig;
constructor(config: CircuitBreakerConfig) {
this.config = {
halfOpenRequests: 3,
...config,
};
}
isOpen(): boolean {
if (this.state === 'OPEN') {
// Check if recovery timeout has passed
if (Date.now() - this.lastFailureTime >= this.config.recoveryTimeout) {
this.state = 'HALF_OPEN';
this.successCount = 0;
return false;
}
return true;
}
return false;
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
throw new CircuitOpenError('Circuit breaker is open');
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
this.failureCount = 0;
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.config.halfOpenRequests!) {
this.state = 'CLOSED';
}
}
}
private onFailure(): void {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.config.failureThreshold) {
this.state = 'OPEN';
}
}
}
export class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = 'CircuitOpenError';
}
}
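The gateway also imports a `RateLimiter`. A simple fixed-window implementation keyed by feature (and optionally user) is enough to satisfy the `checkLimit(feature, userId)` call the gateway makes; the default limits below are illustrative, and production systems often prefer a sliding window or token bucket to avoid bursts at window boundaries.

```typescript
// llm/gateway/rate-limiter.ts
// Fixed-window rate limiter keyed by feature (and optionally user).
export class RateLimiter {
  private windows = new Map<string, { count: number; windowStart: number }>();

  constructor(
    private maxRequests = 60,
    private windowMs = 60_000
  ) {}

  async checkLimit(feature: string, userId?: string): Promise<boolean> {
    const key = userId ? `${feature}:${userId}` : feature;
    const now = Date.now();
    const window = this.windows.get(key);

    // Start a fresh window if none exists or the current one has elapsed
    if (!window || now - window.windowStart >= this.windowMs) {
      this.windows.set(key, { count: 1, windowStart: now });
      return true;
    }

    if (window.count >= this.maxRequests) {
      return false; // limit hit; the gateway rejects (or could queue)
    }
    window.count++;
    return true;
  }
}
```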
Capability-Based Design
Define Capabilities, Not Implementations
// llm/capabilities/index.ts
// Define WHAT you need, not HOW it's done
export interface TextSummarizer {
summarize(text: string, options?: SummarizeOptions): Promise<Summary>;
}
export interface SummarizeOptions {
maxLength?: number;
style?: 'brief' | 'detailed' | 'bullet-points';
targetAudience?: string;
}
export interface Summary {
text: string;
keyPoints: string[];
originalLength: number;
summaryLength: number;
}
export interface TextClassifier {
classify(text: string, categories: string[]): Promise<Classification>;
}
export interface Classification {
category: string;
confidence: number;
reasoning?: string;
}
export interface ContentGenerator {
generate(prompt: GeneratePrompt): Promise<GeneratedContent>;
}
export interface GeneratePrompt {
type: 'email' | 'blog-post' | 'product-description' | 'social-media';
context: Record<string, any>;
tone?: 'formal' | 'casual' | 'professional';
length?: 'short' | 'medium' | 'long';
}
export interface GeneratedContent {
content: string;
suggestions?: string[];
metadata: {
wordCount: number;
readingTimeMinutes: number;
};
}
export interface SemanticSearch {
index(documents: Document[]): Promise<void>;
search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
}
Capability Implementations
// llm/capabilities/summarizer.ts
import { TextSummarizer, SummarizeOptions, Summary } from './index';
import { LLMGateway } from '../gateway';
import { PromptManager } from '../prompts';
export class LLMSummarizer implements TextSummarizer {
constructor(
private gateway: LLMGateway,
private prompts: PromptManager
) {}
async summarize(text: string, options?: SummarizeOptions): Promise<Summary> {
const prompt = this.prompts.get('summarize', {
text,
maxLength: options?.maxLength || 200,
style: options?.style || 'brief',
targetAudience: options?.targetAudience || 'general',
});
const response = await this.gateway.complete({
messages: [
{ role: 'system', content: prompt.system },
{ role: 'user', content: prompt.user },
],
temperature: 0.3,
metadata: {
feature: 'summarizer',
requestId: crypto.randomUUID(),
},
});
// Parse structured response
const parsed = this.parseResponse(response.content);
return {
text: parsed.summary,
keyPoints: parsed.keyPoints,
originalLength: text.length,
summaryLength: parsed.summary.length,
};
}
private parseResponse(content: string): { summary: string; keyPoints: string[] } {
// Handle JSON response or extract from text
try {
return JSON.parse(content);
} catch {
// Fallback: treat entire response as summary
return {
summary: content,
keyPoints: [],
};
}
}
}
// Factory function for dependency injection
export function createSummarizer(gateway: LLMGateway): TextSummarizer {
const prompts = new PromptManager();
return new LLMSummarizer(gateway, prompts);
}
Using Capabilities in Application Code
// services/document.service.ts
import { TextSummarizer, TextClassifier } from '../llm/capabilities';
// Application code depends on CAPABILITIES, not LLM details
export class DocumentService {
constructor(
private summarizer: TextSummarizer,
private classifier: TextClassifier,
private documentRepo: DocumentRepository
) {}
async processDocument(doc: Document): Promise<ProcessedDocument> {
// Summarize
const summary = await this.summarizer.summarize(doc.content, {
style: 'bullet-points',
maxLength: 500,
});
// Classify
const classification = await this.classifier.classify(doc.content, [
'legal',
'financial',
'technical',
'marketing',
'other',
]);
// Store
const processed = {
...doc,
summary: summary.text,
keyPoints: summary.keyPoints,
category: classification.category,
categoryConfidence: classification.confidence,
};
await this.documentRepo.save(processed);
return processed;
}
}
// The DocumentService has NO IDEA:
// - Which LLM provider is being used
// - What model is being called
// - What the prompts look like
// - How much it costs
// - How errors are handled
//
// It just uses summarizer.summarize() and classifier.classify()
Graceful Degradation Strategies
Fallback Hierarchy
// llm/fallback/fallback-strategy.ts
export interface FallbackConfig {
feature: string;
llmCapability: () => Promise<any>;
fallbacks: FallbackOption[];
}
export interface FallbackOption {
name: string;
condition?: () => boolean | Promise<boolean>;
execute: () => Promise<any>;
}
export class FallbackStrategy {
async execute(config: FallbackConfig): Promise<any> {
// Try LLM first
try {
return await config.llmCapability();
} catch (error) {
console.warn(`LLM failed for ${config.feature}:`, error);
}
// Try fallbacks in order
for (const fallback of config.fallbacks) {
try {
if (fallback.condition) {
const shouldUse = await fallback.condition();
if (!shouldUse) continue;
}
console.info(`Using fallback ${fallback.name} for ${config.feature}`);
return await fallback.execute();
} catch (error) {
console.warn(`Fallback ${fallback.name} failed:`, error);
}
}
throw new Error(`All options exhausted for ${config.feature}`);
}
}
// Example usage
const summarizeWithFallback = async (text: string) => {
const strategy = new FallbackStrategy();
return strategy.execute({
feature: 'summarization',
llmCapability: () => summarizer.summarize(text),
fallbacks: [
{
name: 'cached-summary',
condition: async () => {
const cached = await cache.get(`summary:${hash(text)}`);
return !!cached;
},
execute: async () => cache.get(`summary:${hash(text)}`),
},
{
name: 'extractive-summary',
execute: async () => extractiveSummarize(text), // Non-LLM algorithm
},
{
name: 'first-paragraph',
execute: async () => ({
text: text.split('\n\n')[0],
keyPoints: [],
originalLength: text.length,
summaryLength: text.split('\n\n')[0].length,
}),
},
],
});
};
Feature-Level Degradation
// features/ai-features.ts
export class AIFeatureManager {
private featureStatus: Map<string, FeatureStatus> = new Map();
constructor(private llmGateway: LLMGateway) {
this.initializeHealthChecks();
}
private initializeHealthChecks() {
// Periodic health checks (assumes the gateway exposes an aggregate
// isAvailable() across its providers)
setInterval(async () => {
const available = await this.llmGateway.isAvailable();
this.updateAllFeatures(available ? 'full' : 'degraded');
}, 30000);
}
getFeatureMode(feature: string): 'full' | 'degraded' | 'disabled' {
return this.featureStatus.get(feature)?.mode || 'full';
}
setFeatureMode(feature: string, mode: 'full' | 'degraded' | 'disabled') {
this.featureStatus.set(feature, { mode, updatedAt: new Date() });
}
private updateAllFeatures(mode: 'full' | 'degraded') {
for (const [feature] of this.featureStatus) {
// Don't override manually disabled features
if (this.featureStatus.get(feature)?.mode !== 'disabled') {
this.setFeatureMode(feature, mode);
}
}
}
}
// Usage in UI component
function SmartSuggestions({ document }: { document: Document }) {
const featureMode = useFeatureMode('smart-suggestions');
if (featureMode === 'disabled') {
return null;
}
if (featureMode === 'degraded') {
return <BasicSuggestions document={document} />;
}
return <AISuggestions document={document} />;
}
Degradation Communication
// components/AIStatusIndicator.tsx
import React from 'react';
import { useAIStatus } from '../hooks/useAIStatus';
export function AIStatusIndicator() {
const status = useAIStatus();
if (status.mode === 'full') {
return null; // Don't show anything when working normally
}
return (
<div className={`ai-status ai-status--${status.mode}`}>
{status.mode === 'degraded' && (
<>
<span className="ai-status__icon">⚡</span>
<span>AI features are limited. Some suggestions may be simpler.</span>
</>
)}
{status.mode === 'disabled' && (
<>
<span className="ai-status__icon">🔌</span>
<span>AI features are temporarily unavailable.</span>
</>
)}
</div>
);
}
Feature Flag Integration
LLM Feature Flags
// config/feature-flags.ts
export interface LLMFeatureFlag {
name: string;
enabled: boolean;
provider?: string; // Override default provider
model?: string; // Override default model
rolloutPercentage?: number;
userSegments?: string[];
maxCostPerUser?: number;
maxRequestsPerMinute?: number;
}
export const llmFeatureFlags: Record<string, LLMFeatureFlag> = {
'ai-summarization': {
name: 'AI Summarization',
enabled: true,
provider: 'openai',
model: 'gpt-4o',
rolloutPercentage: 100,
maxCostPerUser: 0.50, // $0.50 per user per day
},
'ai-chat': {
name: 'AI Chat Assistant',
enabled: true,
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
rolloutPercentage: 50, // 50% of users
userSegments: ['premium', 'beta'],
maxRequestsPerMinute: 10,
},
'ai-code-review': {
name: 'AI Code Review',
enabled: false, // Not yet launched
provider: 'openai',
model: 'gpt-4-turbo',
},
};
Feature Flag Service
// services/feature-flag.service.ts
import { llmFeatureFlags, LLMFeatureFlag } from '../config/feature-flags';
export class FeatureFlagService {
private flags: Record<string, LLMFeatureFlag>;
  private userCosts: Map<string, number> = new Map(); // In production, persist per-day (e.g. Redis with a daily TTL) so limits reset
constructor() {
this.flags = llmFeatureFlags;
// In production, load from remote config (LaunchDarkly, etc.)
}
isEnabled(
featureName: string,
context: { userId: string; userSegments?: string[] }
): boolean {
const flag = this.flags[featureName];
if (!flag || !flag.enabled) {
return false;
}
// Check user segments
if (flag.userSegments && flag.userSegments.length > 0) {
const hasSegment = context.userSegments?.some(s =>
flag.userSegments!.includes(s)
);
if (!hasSegment) {
return false;
}
}
// Check rollout percentage (consistent per user)
    if (flag.rolloutPercentage !== undefined && flag.rolloutPercentage < 100) {
      const bucket = this.hashUserId(context.userId, featureName);
      // Buckets are 0-99, so admitting buckets below the percentage
      // includes exactly rolloutPercentage% of users
      if (bucket >= flag.rolloutPercentage) {
        return false;
      }
    }
// Check cost limits
if (flag.maxCostPerUser !== undefined) {
const currentCost = this.userCosts.get(context.userId) || 0;
if (currentCost >= flag.maxCostPerUser) {
return false;
}
}
return true;
}
getConfig(featureName: string): { provider?: string; model?: string } {
const flag = this.flags[featureName];
return {
provider: flag?.provider,
model: flag?.model,
};
}
recordCost(userId: string, cost: number): void {
const current = this.userCosts.get(userId) || 0;
this.userCosts.set(userId, current + cost);
}
private hashUserId(userId: string, feature: string): number {
// Consistent hashing for rollout percentage
const str = `${userId}:${feature}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = (hash << 5) - hash + str.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash % 100);
}
}
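The rollout check hinges on `hashUserId` being deterministic: the same user lands in the same 0-99 bucket on every request, so their experience stays stable as `rolloutPercentage` ramps up. The hash, extracted standalone:

```typescript
// Same hash as FeatureFlagService.hashUserId above: a 32-bit string
// hash folded into a 0-99 bucket, deterministic per (userId, feature).
function rolloutBucket(userId: string, feature: string): number {
  const str = `${userId}:${feature}`;
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash << 5) - hash + str.charCodeAt(i);
    hash = hash & hash; // keep it in 32-bit integer range
  }
  return Math.abs(hash % 100);
}
```

A user admitted at a 50% rollout stays admitted when you raise it to 75%, because a bucket below the old threshold is also below the new one.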
Using Feature Flags
// services/document.service.ts (updated)
export class DocumentService {
constructor(
private summarizer: TextSummarizer,
private featureFlags: FeatureFlagService,
private fallbackSummarizer: SimpleSummarizer
) {}
async summarizeDocument(
doc: Document,
userId: string,
userSegments: string[]
): Promise<Summary> {
const isAIEnabled = this.featureFlags.isEnabled('ai-summarization', {
userId,
userSegments,
});
if (!isAIEnabled) {
// Use non-AI fallback
return this.fallbackSummarizer.summarize(doc.content);
}
try {
const summary = await this.summarizer.summarize(doc.content);
// Cost tracking handled by LLM Gateway
return summary;
} catch (error) {
// Graceful degradation to non-AI fallback
console.warn('AI summarization failed, using fallback', error);
return this.fallbackSummarizer.summarize(doc.content);
}
}
}
Queue-Based Architecture
Async LLM Processing
┌─────────────────────────────────────────────────────────────────┐
│ Queue-Based LLM Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SYNCHRONOUS (User Waits) ASYNCHRONOUS (Background) │
│ ───────────────────────────────────────────────────────────── │
│ │
│ User Action User Action │
│ │ │ │
│ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │
│ │ API │ │ API │ │
│ │Handler │ │Handler │ │
│ └───┬────┘ └───┬────┘ │
│ │ │ │
│ ▼ │ Enqueue │
│ ┌────────┐ ▼ │
│ │ LLM │ ┌────────┐ │
│ │Gateway │ │ Queue │──► Job ID │
│ └───┬────┘ └───┬────┘ returned │
│ │ │ to user │
│ │ 2-30 seconds │ │
│ ▼ ▼ │
│ Response ┌────────┐ │
│ to User │ Worker │ │
│ └───┬────┘ │
│ Good for: │ │
│ • Chat ▼ │
│ • Short completions ┌────────┐ │
│ • Real-time features │ LLM │ │
│ │Gateway │ │
│ └───┬────┘ │
│ │ │
│ ▼ │
│ Webhook/ │
│ Notification │
│ to User │
│ │
│ Good for: │
│ • Document processing │
│ • Batch operations │
│ • Non-urgent features │
│ │
└─────────────────────────────────────────────────────────────────┘
Queue Implementation
// llm/queue/llm-job.ts
export interface LLMJob {
id: string;
type: 'completion' | 'embedding' | 'batch-completion';
payload: any;
metadata: {
feature: string;
userId: string;
priority: 'low' | 'normal' | 'high';
callbackUrl?: string;
webhookSecret?: string;
};
status: 'pending' | 'processing' | 'completed' | 'failed';
result?: any;
error?: string;
createdAt: Date;
processedAt?: Date;
completedAt?: Date;
}
// llm/queue/llm-queue.service.ts
import { Queue, Worker, Job } from 'bullmq';
import { Redis } from 'ioredis';
import { LLMGateway } from '../gateway';
export class LLMQueueService {
private queue: Queue;
private worker: Worker;
constructor(
private gateway: LLMGateway,
private redis: Redis
) {
this.queue = new Queue('llm-jobs', { connection: redis });
this.initializeWorker();
}
async enqueue(job: Omit<LLMJob, 'id' | 'status' | 'createdAt'>): Promise<string> {
const bullJob = await this.queue.add(job.type, job, {
priority: this.getPriority(job.metadata.priority),
attempts: 3,
backoff: {
type: 'exponential',
delay: 1000,
},
});
return bullJob.id!;
}
async getJobStatus(jobId: string): Promise<LLMJob | null> {
const job = await this.queue.getJob(jobId);
if (!job) return null;
return {
id: job.id!,
type: job.name as LLMJob['type'],
payload: job.data.payload,
metadata: job.data.metadata,
status: await this.mapJobState(job),
result: job.returnvalue,
error: job.failedReason,
createdAt: new Date(job.timestamp),
processedAt: job.processedOn ? new Date(job.processedOn) : undefined,
completedAt: job.finishedOn ? new Date(job.finishedOn) : undefined,
};
}
private initializeWorker() {
this.worker = new Worker(
'llm-jobs',
async (job: Job) => {
switch (job.name) {
case 'completion':
return this.processCompletion(job.data);
case 'embedding':
return this.processEmbedding(job.data);
case 'batch-completion':
return this.processBatchCompletion(job.data);
default:
throw new Error(`Unknown job type: ${job.name}`);
}
},
{
connection: this.redis,
concurrency: 10, // Process 10 jobs in parallel
}
);
this.worker.on('completed', async (job, result) => {
// Send webhook if configured
if (job.data.metadata.callbackUrl) {
await this.sendWebhook(job.data.metadata, result);
}
});
this.worker.on('failed', async (job, error) => {
console.error(`Job ${job?.id} failed:`, error);
// Could send failure webhook
});
}
  private async processCompletion(data: any): Promise<any> {
    return this.gateway.complete({
      messages: data.payload.messages,
      model: data.payload.model,
      maxTokens: data.payload.maxTokens,
      metadata: data.metadata,
    });
  }
  private async processEmbedding(data: any): Promise<any> {
    return this.gateway.embed({
      input: data.payload.input,
      metadata: data.metadata,
    });
  }
private async processBatchCompletion(data: any): Promise<any[]> {
const results = [];
for (const item of data.payload.items) {
const result = await this.gateway.complete({
messages: item.messages,
model: data.payload.model,
metadata: data.metadata,
});
results.push(result);
}
return results;
}
private getPriority(priority: 'low' | 'normal' | 'high'): number {
return { low: 10, normal: 5, high: 1 }[priority];
}
private async mapJobState(job: Job): Promise<LLMJob['status']> {
const state = await job.getState();
const mapping: Record<string, LLMJob['status']> = {
waiting: 'pending',
active: 'processing',
completed: 'completed',
failed: 'failed',
};
return mapping[state] || 'pending';
}
  private async sendWebhook(metadata: any, result: any): Promise<void> {
    // Send result to callback URL
    await fetch(metadata.callbackUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': this.signPayload(result, metadata.webhookSecret),
      },
      body: JSON.stringify(result),
    });
  }
  private signPayload(payload: any, secret: string): string {
    // HMAC signature so the receiver can verify the webhook is genuine
    return crypto.createHmac('sha256', secret).update(JSON.stringify(payload)).digest('hex');
  }
}
API Endpoints for Queue
// routes/llm.routes.ts
import { Router } from 'express';
import { LLMQueueService } from '../llm/queue';
export function createLLMRoutes(queueService: LLMQueueService): Router {
const router = Router();
// Enqueue a job
router.post('/jobs', async (req, res) => {
const jobId = await queueService.enqueue({
type: req.body.type,
payload: req.body.payload,
      metadata: {
        feature: req.body.feature,
        userId: req.user.id,
        priority: req.body.priority || 'normal',
        callbackUrl: req.body.callbackUrl,
        webhookSecret: req.body.webhookSecret,
      },
});
res.status(202).json({
jobId,
status: 'pending',
statusUrl: `/api/llm/jobs/${jobId}`,
});
});
// Check job status
router.get('/jobs/:jobId', async (req, res) => {
const job = await queueService.getJobStatus(req.params.jobId);
if (!job) {
return res.status(404).json({ error: 'Job not found' });
}
// Only return result if completed
res.json({
id: job.id,
status: job.status,
...(job.status === 'completed' && { result: job.result }),
...(job.status === 'failed' && { error: job.error }),
});
});
return router;
}
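On the client side, the 202 + status-URL contract is typically consumed by polling until the job settles. A sketch of such a helper (hypothetical; the `getStatus` callback stands in for a `fetch` of `GET /api/llm/jobs/:jobId`):

```typescript
// Hypothetical polling helper for the 202/status-URL flow above.
async function pollJob<T>(
  getStatus: () => Promise<{ status: string; result?: T; error?: string }>,
  opts: { intervalMs?: number; timeoutMs?: number } = {}
): Promise<T> {
  const intervalMs = opts.intervalMs ?? 1000;
  const deadline = Date.now() + (opts.timeoutMs ?? 60_000);
  while (Date.now() < deadline) {
    const job = await getStatus();
    if (job.status === 'completed') return job.result as T;
    if (job.status === 'failed') throw new Error(job.error ?? 'Job failed');
    // Still pending or processing: wait before asking again
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for job');
}
```

Webhooks avoid the polling traffic entirely, but a poller is a useful fallback for clients that cannot expose a callback URL.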
Caching and Memoization
Multi-Level Cache
// llm/cache/cache.ts
export interface CacheConfig {
l1: { maxSize: number; ttlMs: number }; // In-memory
l2?: { redis: Redis; ttlMs: number }; // Redis
l3?: { storage: Storage; ttlMs: number }; // Persistent
}
export class LLMCache {
private l1: Map<string, { value: any; expiry: number }> = new Map();
private l1MaxSize: number;
private l2?: Redis;
private l3?: Storage;
private config: CacheConfig;
constructor(config: CacheConfig) {
this.config = config;
this.l1MaxSize = config.l1.maxSize;
this.l2 = config.l2?.redis;
this.l3 = config.l3?.storage;
}
async get(key: string): Promise<any | null> {
// L1: In-memory
const l1Entry = this.l1.get(key);
if (l1Entry && l1Entry.expiry > Date.now()) {
return l1Entry.value;
}
// L2: Redis
if (this.l2) {
const l2Value = await this.l2.get(`llm:${key}`);
if (l2Value) {
const parsed = JSON.parse(l2Value);
// Populate L1
this.setL1(key, parsed);
return parsed;
}
}
// L3: Persistent storage
if (this.l3) {
const l3Value = await this.l3.get(`llm:${key}`);
if (l3Value) {
const parsed = JSON.parse(l3Value);
// Populate L1 and L2
this.setL1(key, parsed);
if (this.l2) {
await this.l2.setex(`llm:${key}`, this.config.l2!.ttlMs / 1000, l3Value);
}
return parsed;
}
}
return null;
}
async set(key: string, value: any): Promise<void> {
const serialized = JSON.stringify(value);
// L1
this.setL1(key, value);
// L2
if (this.l2) {
await this.l2.setex(`llm:${key}`, this.config.l2!.ttlMs / 1000, serialized);
}
// L3
if (this.l3) {
await this.l3.set(`llm:${key}`, serialized, this.config.l3!.ttlMs);
}
}
  private setL1(key: string, value: any): void {
    // Evict the oldest entry when full (Maps iterate in insertion order;
    // true LRU would also re-insert on every get)
    if (this.l1.size >= this.l1MaxSize) {
      const oldestKey = this.l1.keys().next().value;
      if (oldestKey !== undefined) {
        this.l1.delete(oldestKey);
      }
    }
this.l1.set(key, {
value,
expiry: Date.now() + this.config.l1.ttlMs,
});
}
}
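What to use as the cache key matters as much as the cache itself: it should cover every request field that changes the completion (model, messages, sampling parameters) and nothing else. One plausible derivation, assuming deterministic output at the cached settings:

```typescript
import { createHash } from 'crypto';

// Sketch of a cache-key derivation: hash only the fields that affect
// the output. Volatile metadata (requestId, timestamps) must stay out,
// or every request becomes a cache miss.
function completionCacheKey(req: {
  model?: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  maxTokens?: number;
}): string {
  const canonical = JSON.stringify({
    model: req.model ?? 'default',
    messages: req.messages,
    temperature: req.temperature ?? 0,
    maxTokens: req.maxTokens ?? null,
  });
  return createHash('sha256').update(canonical).digest('hex');
}
```

Note the defaults are normalized before hashing, so an omitted `temperature` and an explicit `temperature: 0` share a cache entry.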
Semantic Caching
// llm/cache/semantic-cache.ts
// For similar (not identical) queries
export class SemanticCache {
constructor(
private embedder: LLMGateway,
private vectorStore: VectorStore,
private cache: LLMCache,
private similarityThreshold = 0.95
) {}
async get(query: string): Promise<any | null> {
// Get embedding for query
const embedding = await this.embedder.embed({
input: query,
metadata: { feature: 'semantic-cache', requestId: crypto.randomUUID() },
});
// Search for similar cached queries
const results = await this.vectorStore.search(embedding.embeddings[0], {
topK: 1,
minScore: this.similarityThreshold,
});
if (results.length > 0) {
// Found similar query, return cached response
const cachedKey = results[0].metadata.cacheKey;
return this.cache.get(cachedKey);
}
return null;
}
async set(query: string, response: any): Promise<void> {
const cacheKey = this.generateKey(query);
// Get embedding for query
const embedding = await this.embedder.embed({
input: query,
metadata: { feature: 'semantic-cache', requestId: crypto.randomUUID() },
});
// Store in vector store for similarity search
await this.vectorStore.upsert({
id: cacheKey,
vector: embedding.embeddings[0],
metadata: { cacheKey, query },
});
// Store actual response in regular cache
await this.cache.set(cacheKey, response);
}
private generateKey(query: string): string {
return crypto.createHash('sha256').update(query).digest('hex');
}
}
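The `minScore` the vector store applies is usually cosine similarity between the query embedding and the stored embeddings (the store's actual metric is an assumption here). Shown standalone:

```typescript
// Cosine similarity: 1.0 for identical directions, 0 for orthogonal.
// A threshold like 0.95 only admits near-paraphrases as cache hits.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Tune the threshold against real traffic: too low and users get answers to questions they didn't ask; too high and the semantic cache degenerates into an exact-match cache.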
Cost Control Architecture
Cost Tracking
// llm/cost/cost-tracker.ts
import { startOfDay } from 'date-fns';
export interface CostRecord {
provider: string;
model: string;
feature: string;
userId?: string;
inputTokens: number;
outputTokens: number;
cost: number;
latencyMs: number;
requestId: string;
timestamp: Date;
}
export class CostTracker {
constructor(
private storage: CostStorage,
private alertThresholds: AlertThresholds
) {}
async track(record: Omit<CostRecord, 'timestamp'>): Promise<void> {
const fullRecord: CostRecord = {
...record,
timestamp: new Date(),
};
// Store the record
await this.storage.save(fullRecord);
// Check alerts
await this.checkAlerts(fullRecord);
}
async getCostsByFeature(
startDate: Date,
endDate: Date
): Promise<Record<string, number>> {
const records = await this.storage.query({ startDate, endDate });
return records.reduce((acc, record) => {
acc[record.feature] = (acc[record.feature] || 0) + record.cost;
return acc;
}, {} as Record<string, number>);
}
async getCostsByUser(
userId: string,
startDate: Date,
endDate: Date
): Promise<number> {
const records = await this.storage.query({ userId, startDate, endDate });
return records.reduce((sum, record) => sum + record.cost, 0);
}
private async checkAlerts(record: CostRecord): Promise<void> {
// Feature cost alert
const featureCostToday = await this.getFeatureCostToday(record.feature);
if (featureCostToday > this.alertThresholds.featureDaily) {
await this.sendAlert({
type: 'feature-cost',
message: `Feature ${record.feature} exceeded daily budget`,
cost: featureCostToday,
threshold: this.alertThresholds.featureDaily,
});
}
// User cost alert
if (record.userId) {
const userCostToday = await this.getCostsByUser(
record.userId,
startOfDay(new Date()),
new Date()
);
if (userCostToday > this.alertThresholds.userDaily) {
await this.sendAlert({
type: 'user-cost',
message: `User ${record.userId} exceeded daily budget`,
cost: userCostToday,
threshold: this.alertThresholds.userDaily,
});
}
}
}
}
Budget Enforcement
// llm/cost/budget-enforcer.ts
import { startOfDay } from 'date-fns';
export interface Budget {
type: 'user' | 'feature' | 'organization';
id: string;
dailyLimit: number;
monthlyLimit: number;
action: 'block' | 'warn' | 'throttle';
}
export class BudgetEnforcer {
constructor(
private costTracker: CostTracker,
private budgets: Budget[]
) {}
async checkBudget(
feature: string,
userId?: string,
estimatedCost?: number
): Promise<{ allowed: boolean; reason?: string; remainingBudget?: number }> {
// Check feature budget
const featureBudget = this.budgets.find(
b => b.type === 'feature' && b.id === feature
);
if (featureBudget) {
const featureCost = await this.costTracker.getCostsByFeature(
startOfDay(new Date()),
new Date()
);
const remaining = featureBudget.dailyLimit - (featureCost[feature] || 0);
if (remaining <= 0) {
return this.handleBudgetExceeded(featureBudget, remaining);
}
}
// Check user budget
if (userId) {
const userBudget = this.budgets.find(
b => b.type === 'user' && b.id === userId
);
if (userBudget) {
const userCost = await this.costTracker.getCostsByUser(
userId,
startOfDay(new Date()),
new Date()
);
const remaining = userBudget.dailyLimit - userCost;
if (remaining <= 0) {
return this.handleBudgetExceeded(userBudget, remaining);
}
}
}
return { allowed: true };
}
  private handleBudgetExceeded(
    budget: Budget,
    remaining: number
  ): { allowed: boolean; reason?: string; remainingBudget: number } {
switch (budget.action) {
case 'block':
return {
allowed: false,
reason: `Budget exceeded for ${budget.type} ${budget.id}`,
remainingBudget: remaining,
};
case 'warn':
console.warn(`Budget warning: ${budget.type} ${budget.id}`);
return { allowed: true, remainingBudget: remaining };
case 'throttle':
// Could implement request queuing or rate limiting
return { allowed: true, remainingBudget: remaining };
}
}
}
Cost Dashboard Data
// llm/cost/cost-analytics.ts
export class CostAnalytics {
constructor(private costTracker: CostTracker) {}
async getDashboardData(organizationId: string): Promise<CostDashboard> {
const now = new Date();
const startOfMonth = new Date(now.getFullYear(), now.getMonth(), 1);
const startOfLastMonth = new Date(now.getFullYear(), now.getMonth() - 1, 1);
const [
thisMonthCosts,
lastMonthCosts,
costsByFeature,
costsByModel,
topUsers,
] = await Promise.all([
this.costTracker.getTotalCost(startOfMonth, now),
this.costTracker.getTotalCost(startOfLastMonth, startOfMonth),
this.costTracker.getCostsByFeature(startOfMonth, now),
this.costTracker.getCostsByModel(startOfMonth, now),
this.costTracker.getTopUsersByCost(startOfMonth, now, 10),
]);
return {
summary: {
totalCostThisMonth: thisMonthCosts,
totalCostLastMonth: lastMonthCosts,
        changePercent: lastMonthCosts > 0
          ? ((thisMonthCosts - lastMonthCosts) / lastMonthCosts) * 100
          : 0,
projectedMonthEnd: this.projectMonthEndCost(thisMonthCosts, now),
},
breakdown: {
byFeature: costsByFeature,
byModel: costsByModel,
topUsers: topUsers,
},
trends: await this.getDailyTrends(startOfMonth, now),
};
}
private projectMonthEndCost(currentCost: number, now: Date): number {
const daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();
const dayOfMonth = now.getDate();
return (currentCost / dayOfMonth) * daysInMonth;
}
}
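As a sanity check of the linear projection above: $150 spent by the 15th of a 30-day month projects to $300 at month end. The arithmetic, extracted for verification:

```typescript
// The projection used above: (spend so far / days elapsed) * days in month.
function projectMonthEndCost(currentCost: number, now: Date): number {
  // Day 0 of the next month is the last day of this one
  const daysInMonth = new Date(now.getFullYear(), now.getMonth() + 1, 0).getDate();
  return (currentCost / now.getDate()) * daysInMonth;
}
```

A straight-line projection over-reacts early in the month (one expensive day dominates); smoothing over a trailing window gives steadier numbers once you have the daily trend data.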
Testing Strategies
Testing Without LLM Calls
// llm/testing/mock-provider.ts
import { LLMProvider, LLMCompletionRequest, LLMCompletionResponse } from '../interfaces';
export class MockLLMProvider implements LLMProvider {
name = 'mock';
private responses: Map<string, string> = new Map();
private defaultResponse = 'Mock response';
private callLog: LLMCompletionRequest[] = [];
// Configure mock responses
mockResponse(pattern: string | RegExp, response: string): void {
if (typeof pattern === 'string') {
this.responses.set(pattern, response);
} else {
this.responses.set(pattern.source, response);
}
}
setDefaultResponse(response: string): void {
this.defaultResponse = response;
}
  async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
    this.callLog.push(request);
    const userMessage = request.messages.find(m => m.role === 'user')?.content || '';
    const response = this.findResponse(userMessage);
    const promptTokens = this.estimateTokens(request.messages);
    const completionTokens = this.estimateTokens([{ role: 'assistant', content: response }]);
    return {
      content: response,
      model: 'mock-model',
      usage: {
        promptTokens,
        completionTokens,
        totalTokens: promptTokens + completionTokens,
      },
      finishReason: 'stop',
      latencyMs: 50,
      cost: 0,
      cached: false,
    };
  }
// Test utilities
getCalls(): LLMCompletionRequest[] {
return [...this.callLog];
}
getLastCall(): LLMCompletionRequest | undefined {
return this.callLog[this.callLog.length - 1];
}
clearCalls(): void {
this.callLog = [];
}
assertCalled(times?: number): void {
if (times !== undefined && this.callLog.length !== times) {
throw new Error(`Expected ${times} calls, got ${this.callLog.length}`);
}
if (this.callLog.length === 0) {
throw new Error('Expected at least one call');
}
}
assertCalledWith(matcher: (req: LLMCompletionRequest) => boolean): void {
const match = this.callLog.find(matcher);
if (!match) {
throw new Error('No call matched the criteria');
}
}
private findResponse(userMessage: string): string {
for (const [pattern, response] of this.responses) {
if (userMessage.includes(pattern) || new RegExp(pattern).test(userMessage)) {
return response;
}
}
return this.defaultResponse;
}
private estimateTokens(messages: { content: string }[]): number {
return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}
async isAvailable(): Promise<boolean> {
return true;
}
getModels(): string[] {
return ['mock-model'];
}
}
Integration Tests
// tests/integration/summarizer.test.ts
import { describe, it, expect, beforeEach } from 'vitest';
import { MockLLMProvider } from '../../llm/testing/mock-provider';
import { LLMGateway } from '../../llm/gateway';
import { LLMSummarizer } from '../../llm/capabilities/summarizer';
describe('Summarizer Integration', () => {
let mockProvider: MockLLMProvider;
let gateway: LLMGateway;
let summarizer: LLMSummarizer;
beforeEach(() => {
mockProvider = new MockLLMProvider();
gateway = new LLMGateway({
providers: { mock: mockProvider },
defaultProvider: 'mock',
fallbackProviders: [],
});
summarizer = new LLMSummarizer(gateway, new PromptManager());
});
it('should summarize text', async () => {
mockProvider.mockResponse(
'summarize',
JSON.stringify({
summary: 'This is a test summary.',
keyPoints: ['Point 1', 'Point 2'],
})
);
const result = await summarizer.summarize('Long text to summarize...');
expect(result.text).toBe('This is a test summary.');
expect(result.keyPoints).toHaveLength(2);
mockProvider.assertCalled(1);
});
it('should handle LLM errors gracefully', async () => {
mockProvider.complete = async () => {
throw new Error('Provider unavailable');
};
await expect(summarizer.summarize('text')).rejects.toThrow();
});
it('should include correct metadata in requests', async () => {
mockProvider.setDefaultResponse('{"summary": "test", "keyPoints": []}');
await summarizer.summarize('text');
const call = mockProvider.getLastCall();
expect(call?.metadata?.feature).toBe('summarizer');
expect(call?.metadata?.requestId).toBeDefined();
});
});
Snapshot Testing for Prompts
// tests/prompts/summarizer-prompts.test.ts
import { describe, it, expect } from 'vitest';
import { PromptManager } from '../../llm/prompts';
describe('Summarizer Prompts', () => {
const prompts = new PromptManager();
it('should generate consistent prompts', () => {
const prompt = prompts.get('summarize', {
text: 'Sample text',
maxLength: 200,
style: 'brief',
targetAudience: 'general',
});
// Snapshot ensures prompts don't change unexpectedly
expect(prompt).toMatchSnapshot();
});
it('should handle different styles', () => {
const styles = ['brief', 'detailed', 'bullet-points'] as const;
for (const style of styles) {
const prompt = prompts.get('summarize', {
text: 'Sample text',
style,
});
expect(prompt.system).toContain(style);
}
});
});
Contract Testing
// tests/contract/provider-contract.test.ts
import { describe, it, expect, beforeAll } from 'vitest';
import { LLMProvider, LLMCompletionRequest } from '../../llm/interfaces';
import { OpenAIProvider } from '../../llm/providers/openai.provider';
import { AnthropicProvider } from '../../llm/providers/anthropic.provider';
// Contract test: all providers must behave the same way
function testProviderContract(
name: string,
createProvider: () => LLMProvider
) {
describe(`${name} Provider Contract`, () => {
let provider: LLMProvider;
beforeAll(() => {
provider = createProvider();
});
const standardRequest: LLMCompletionRequest = {
messages: [
{ role: 'system', content: 'You are helpful.' },
{ role: 'user', content: 'Say "hello"' },
],
maxTokens: 10,
temperature: 0,
metadata: { feature: 'test', requestId: 'test-123' },
};
it('should return required fields', async () => {
const response = await provider.complete(standardRequest);
expect(response).toHaveProperty('content');
expect(response).toHaveProperty('model');
expect(response).toHaveProperty('usage');
expect(response.usage).toHaveProperty('promptTokens');
expect(response.usage).toHaveProperty('completionTokens');
expect(response.usage).toHaveProperty('totalTokens');
expect(response).toHaveProperty('finishReason');
expect(response).toHaveProperty('latencyMs');
expect(response).toHaveProperty('cost');
});
it('should return valid finish reasons', async () => {
const response = await provider.complete(standardRequest);
expect(['stop', 'length', 'content_filter', 'error']).toContain(
response.finishReason
);
});
it('should report availability', async () => {
const available = await provider.isAvailable();
expect(typeof available).toBe('boolean');
});
it('should list models', () => {
const models = provider.getModels();
expect(Array.isArray(models)).toBe(true);
expect(models.length).toBeGreaterThan(0);
});
});
}
// Run contract tests for each provider (in CI, use mocks or test accounts)
if (process.env.RUN_INTEGRATION_TESTS) {
testProviderContract('OpenAI', () =>
new OpenAIProvider({ apiKey: process.env.OPENAI_API_KEY! })
);
testProviderContract('Anthropic', () =>
new AnthropicProvider({ apiKey: process.env.ANTHROPIC_API_KEY! })
);
}
Observability and Monitoring
Structured Logging
// llm/observability/logger.ts
export interface LLMLogEntry {
timestamp: string;
level: 'debug' | 'info' | 'warn' | 'error';
event: string;
requestId: string;
provider?: string;
model?: string;
feature?: string;
userId?: string;
latencyMs?: number;
tokens?: {
prompt: number;
completion: number;
total: number;
};
cost?: number;
cached?: boolean;
error?: {
name: string;
message: string;
stack?: string;
};
metadata?: Record<string, any>;
}
export class LLMLogger {
constructor(private sink: (entry: LLMLogEntry) => void) {}
logRequest(details: Partial<LLMLogEntry>): void {
this.log('info', 'llm.request.start', details);
}
logResponse(details: Partial<LLMLogEntry>): void {
this.log('info', 'llm.request.complete', details);
}
logError(error: Error, details: Partial<LLMLogEntry>): void {
this.log('error', 'llm.request.error', {
...details,
error: {
name: error.name,
message: error.message,
stack: error.stack,
},
});
}
logCacheHit(details: Partial<LLMLogEntry>): void {
this.log('debug', 'llm.cache.hit', { ...details, cached: true });
}
logFallback(from: string, to: string, details: Partial<LLMLogEntry>): void {
this.log('warn', 'llm.fallback', {
...details,
metadata: { from, to },
});
}
private log(
level: LLMLogEntry['level'],
event: string,
details: Partial<LLMLogEntry>
): void {
    const entry: LLMLogEntry = {
      ...details,
      timestamp: new Date().toISOString(),
      level,
      event,
      requestId: details.requestId || 'unknown',
    };
this.sink(entry);
}
}
// Usage with different sinks
const consoleLogger = new LLMLogger((entry) => {
console.log(JSON.stringify(entry));
});
const datadogLogger = new LLMLogger((entry) => {
  // Send to Datadog (assumes the Datadog logs client is initialized elsewhere)
  datadogLogs.logger.log(entry.level, entry.event, entry);
});
Metrics Collection
// llm/observability/metrics.ts
import { Counter, Histogram, Gauge } from 'prom-client';
export class LLMMetrics {
// Request metrics
private requestCounter = new Counter({
name: 'llm_requests_total',
help: 'Total LLM requests',
labelNames: ['provider', 'model', 'feature', 'status'],
});
private latencyHistogram = new Histogram({
name: 'llm_request_duration_seconds',
help: 'LLM request latency',
labelNames: ['provider', 'model', 'feature'],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30],
});
private tokenCounter = new Counter({
name: 'llm_tokens_total',
help: 'Total tokens used',
labelNames: ['provider', 'model', 'feature', 'type'],
});
private costCounter = new Counter({
name: 'llm_cost_dollars_total',
help: 'Total cost in dollars',
labelNames: ['provider', 'model', 'feature'],
});
// Cache metrics
private cacheHitCounter = new Counter({
name: 'llm_cache_hits_total',
help: 'Cache hits',
labelNames: ['feature'],
});
// Circuit breaker metrics
private circuitBreakerGauge = new Gauge({
name: 'llm_circuit_breaker_state',
help: 'Circuit breaker state (0=closed, 1=half-open, 2=open)',
labelNames: ['provider'],
});
recordRequest(labels: {
provider: string;
model: string;
feature: string;
status: 'success' | 'error';
}): void {
this.requestCounter.inc(labels);
}
recordLatency(
labels: { provider: string; model: string; feature: string },
durationMs: number
): void {
this.latencyHistogram.observe(labels, durationMs / 1000);
}
recordTokens(
labels: { provider: string; model: string; feature: string },
tokens: { prompt: number; completion: number }
): void {
this.tokenCounter.inc({ ...labels, type: 'prompt' }, tokens.prompt);
this.tokenCounter.inc({ ...labels, type: 'completion' }, tokens.completion);
}
recordCost(
labels: { provider: string; model: string; feature: string },
cost: number
): void {
this.costCounter.inc(labels, cost);
}
recordCacheHit(feature: string): void {
this.cacheHitCounter.inc({ feature });
}
setCircuitBreakerState(
provider: string,
state: 'closed' | 'half-open' | 'open'
): void {
const stateValue = { closed: 0, 'half-open': 1, open: 2 }[state];
this.circuitBreakerGauge.set({ provider }, stateValue);
}
}
Distributed Tracing
// llm/observability/tracing.ts
import { trace, Span, SpanStatusCode } from '@opentelemetry/api';
export class LLMTracer {
private tracer = trace.getTracer('llm-gateway');
async traceRequest<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(name, async (span) => {
try {
// Set attributes
Object.entries(attributes).forEach(([key, value]) => {
span.setAttribute(`llm.${key}`, value);
});
const result = await fn(span);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : 'Unknown error',
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
}
// Usage in gateway
async complete(request: LLMCompletionRequest): Promise<LLMCompletionResponse> {
return this.tracer.traceRequest(
'llm.complete',
{
provider: this.defaultProvider,
feature: request.metadata?.feature || 'unknown',
model: request.model || 'default',
},
async (span) => {
// Add request details
span.setAttribute('llm.messages_count', request.messages.length);
const response = await this.executeWithFallback(request);
// Add response details
span.setAttribute('llm.tokens.prompt', response.usage.promptTokens);
span.setAttribute('llm.tokens.completion', response.usage.completionTokens);
span.setAttribute('llm.latency_ms', response.latencyMs);
span.setAttribute('llm.cost', response.cost);
span.setAttribute('llm.cached', response.cached);
return response;
}
);
}
Multi-Provider Strategy
Provider Selection
// llm/routing/router.ts
export interface RoutingRule {
name: string;
condition: (request: LLMCompletionRequest) => boolean;
provider: string;
model?: string;
}
export class LLMRouter {
private rules: RoutingRule[] = [];
constructor(private defaultProvider: string) {}
addRule(rule: RoutingRule): void {
this.rules.push(rule);
}
route(request: LLMCompletionRequest): { provider: string; model?: string } {
for (const rule of this.rules) {
if (rule.condition(request)) {
return { provider: rule.provider, model: rule.model };
}
}
return { provider: this.defaultProvider };
}
}
// Example routing rules
const router = new LLMRouter('openai');
// Route coding tasks to Claude
router.addRule({
name: 'coding-to-claude',
  condition: (req) => req.metadata?.feature?.includes('code') ?? false,
provider: 'anthropic',
model: 'claude-sonnet-4-20250514',
});
// Route long-form content to GPT-4
router.addRule({
name: 'long-content-to-gpt4',
condition: (req) => (req.maxTokens || 0) > 2000,
provider: 'openai',
model: 'gpt-4-turbo',
});
// Route simple tasks to cheaper models
router.addRule({
name: 'simple-to-haiku',
condition: (req) => {
const systemPrompt = req.messages.find(m => m.role === 'system')?.content || '';
return systemPrompt.length < 200;
},
provider: 'anthropic',
model: 'claude-3-haiku-20240307',
});
A/B Testing Models
// llm/routing/ab-testing.ts
export interface Experiment {
id: string;
name: string;
feature: string;
variants: Array<{
name: string;
provider: string;
model: string;
weight: number;
}>;
enabled: boolean;
}
export class LLMExperimentRouter {
constructor(
private experiments: Experiment[],
private analytics: AnalyticsService
) {}
selectVariant(
feature: string,
userId: string
): { provider: string; model: string; variant: string } | null {
const experiment = this.experiments.find(
e => e.feature === feature && e.enabled
);
if (!experiment) {
return null;
}
// Consistent assignment based on user ID
const hash = this.hashUserId(userId, experiment.id);
let cumulativeWeight = 0;
for (const variant of experiment.variants) {
cumulativeWeight += variant.weight;
if (hash < cumulativeWeight) {
// Track assignment
this.analytics.track('experiment_assigned', {
experimentId: experiment.id,
variant: variant.name,
userId,
});
return {
provider: variant.provider,
model: variant.model,
variant: variant.name,
};
}
}
return null;
}
trackOutcome(
experimentId: string,
userId: string,
outcome: {
success: boolean;
latencyMs: number;
userSatisfaction?: number;
}
): void {
this.analytics.track('experiment_outcome', {
experimentId,
userId,
...outcome,
});
}
// Deterministic bucket in [0, 100): the same user + experiment always hashes the same
private hashUserId(userId: string, experimentId: string): number {
const str = `${userId}:${experimentId}`;
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = (hash << 5) - hash + str.charCodeAt(i);
hash |= 0; // force 32-bit integer overflow semantics
}
return Math.abs(hash % 100);
}
}
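The key property of the bucketing above is determinism: the same `(userId, experimentId)` pair always lands in the same bucket, so a user never flips between variants mid-experiment. A minimal standalone sketch:

```typescript
// Deterministic hash into [0, 100), mirroring the class above
function hashToBucket(userId: string, experimentId: string): number {
  const str = `${userId}:${experimentId}`;
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash << 5) - hash + str.charCodeAt(i);
    hash |= 0; // keep it a 32-bit integer
  }
  return Math.abs(hash % 100);
}

// Walk cumulative weights until the bucket falls inside a variant's slice.
// Weights are assumed to sum to 100; if they sum to less, some users get null.
function pickVariant(
  variants: Array<{ name: string; weight: number }>,
  bucket: number
): string | null {
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return null;
}

const variants = [
  { name: 'gpt-4-turbo', weight: 50 },
  { name: 'claude-sonnet', weight: 50 },
];
const bucket = hashToBucket('user-123', 'exp-summarize-v2');
console.log(pickVariant(variants, bucket)); // same answer on every call
```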
Prompt Management
Centralized Prompt Repository
// llm/prompts/prompt-manager.ts
export interface PromptTemplate {
id: string;
version: string;
system: string;
user: string;
variables: string[];
metadata: {
author: string;
description: string;
lastUpdated: string;
testCases?: Array<{ input: Record<string, any>; expectedOutput: string }>;
};
}
export class PromptManager {
private prompts: Map<string, PromptTemplate> = new Map();
private interpolationRegex = /\{\{(\w+)\}\}/g;
constructor() {
this.loadPrompts();
}
get(
promptId: string,
variables: Record<string, any>
): { system: string; user: string } {
const template = this.prompts.get(promptId);
if (!template) {
throw new Error(`Prompt not found: ${promptId}`);
}
return {
system: this.interpolate(template.system, variables),
user: this.interpolate(template.user, variables),
};
}
private interpolate(template: string, variables: Record<string, any>): string {
return template.replace(this.interpolationRegex, (match, key) => {
if (!(key in variables)) {
throw new Error(`Missing variable: ${key}`);
}
return String(variables[key]);
});
}
private loadPrompts(): void {
// Load from files or database
this.prompts.set('summarize', {
id: 'summarize',
version: '1.2.0',
system: `You are an expert summarizer. Create {{style}} summaries for a {{targetAudience}} audience.
Rules:
- Maximum length: {{maxLength}} characters
- Focus on key points
- Use clear, concise language
- Output JSON: { "summary": "...", "keyPoints": ["...", "..."] }`,
user: `Summarize the following text:
{{text}}`,
variables: ['style', 'targetAudience', 'maxLength', 'text'],
metadata: {
author: 'ai-team',
description: 'General-purpose text summarization',
lastUpdated: '2024-01-15',
},
});
// Add more prompts...
}
// Hot-reload prompts without restart
async reloadPrompts(): Promise<void> {
// Fetch from remote config or database
const freshPrompts = await this.fetchPromptsFromConfig();
this.prompts = new Map(freshPrompts.map(p => [p.id, p]));
}
// Stub: wire this to your config store or database
private async fetchPromptsFromConfig(): Promise<PromptTemplate[]> {
throw new Error('fetchPromptsFromConfig not implemented');
}
}
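The interpolation at the heart of the manager boils down to a single `replace` call with a function replacer: substitute every `{{name}}` token, and fail loudly on any variable the caller forgot to supply. A standalone sketch:

```typescript
// Minimal {{variable}} interpolation, mirroring PromptManager.interpolate
function interpolate(template: string, variables: Record<string, unknown>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    if (!(key in variables)) {
      // Failing fast here surfaces prompt/variable drift at call time,
      // instead of silently sending a literal "{{maxLength}}" to the model
      throw new Error(`Missing variable: ${key}`);
    }
    return String(variables[key]);
  });
}

const out = interpolate('Create {{style}} summaries under {{maxLength}} chars.', {
  style: 'concise',
  maxLength: 280,
});
console.log(out); // Create concise summaries under 280 chars.
```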
Prompt Versioning
// llm/prompts/versioned-prompts.ts
export interface VersionedPrompt {
id: string;
versions: Array<{
version: string;
template: PromptTemplate;
status: 'draft' | 'active' | 'deprecated';
activatedAt?: Date;
deprecatedAt?: Date;
}>;
}
export class VersionedPromptManager {
constructor(private storage: PromptStorage) {}
async getActiveVersion(promptId: string): Promise<PromptTemplate> {
const prompt = await this.storage.get(promptId);
const active = prompt.versions.find(v => v.status === 'active');
if (!active) {
throw new Error(`No active version for prompt: ${promptId}`);
}
return active.template;
}
async createVersion(
promptId: string,
template: Omit<PromptTemplate, 'id' | 'version'>
): Promise<string> {
const prompt = await this.storage.get(promptId);
const newVersion = this.incrementVersion(
prompt.versions[prompt.versions.length - 1]?.version || '0.0.0'
);
prompt.versions.push({
version: newVersion,
template: { ...template, id: promptId, version: newVersion },
status: 'draft',
});
await this.storage.save(prompt);
return newVersion;
}
async activateVersion(promptId: string, version: string): Promise<void> {
const prompt = await this.storage.get(promptId);
// Deprecate current active
prompt.versions.forEach(v => {
if (v.status === 'active') {
v.status = 'deprecated';
v.deprecatedAt = new Date();
}
});
// Activate new version
const target = prompt.versions.find(v => v.version === version);
if (!target) {
throw new Error(`Version not found: ${version}`);
}
target.status = 'active';
target.activatedAt = new Date();
await this.storage.save(prompt);
}
async rollback(promptId: string): Promise<void> {
const prompt = await this.storage.get(promptId);
const currentActive = prompt.versions.find(v => v.status === 'active');
const previousActive = prompt.versions
.filter(v => v.status === 'deprecated')
.sort((a, b) => (b.deprecatedAt?.getTime() || 0) - (a.deprecatedAt?.getTime() || 0))[0];
if (!previousActive) {
throw new Error('No previous version to rollback to');
}
if (currentActive) {
currentActive.status = 'deprecated';
currentActive.deprecatedAt = new Date();
}
previousActive.status = 'active';
previousActive.activatedAt = new Date();
await this.storage.save(prompt);
}
// Bump the patch component of a semver string, e.g. '1.2.0' -> '1.2.1'
private incrementVersion(version: string): string {
const [major, minor, patch] = version.split('.').map(Number);
return `${major}.${minor}.${patch + 1}`;
}
}
Data Flow Isolation
Preventing Data Leakage
// llm/security/data-sanitizer.ts
export interface SanitizationRule {
type: 'pii' | 'secrets' | 'custom';
pattern: RegExp;
replacement: string | ((match: string) => string);
}
export class DataSanitizer {
private rules: SanitizationRule[] = [
// PII patterns
{
type: 'pii',
pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
replacement: '[EMAIL]',
},
{
type: 'pii',
pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
replacement: '[PHONE]',
},
{
type: 'pii',
pattern: /\b\d{3}[-]?\d{2}[-]?\d{4}\b/g,
replacement: '[SSN]',
},
{
type: 'pii',
pattern: /\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g,
replacement: '[CREDIT_CARD]',
},
// Secrets
{
type: 'secrets',
pattern: /(api[_-]?key|apikey|secret|password|token|auth)['":\s=]+['"]?[\w-]+['"]?/gi,
replacement: '[REDACTED_SECRET]',
},
{
type: 'secrets',
pattern: /sk-[a-zA-Z0-9]{48}/g, // legacy-format OpenAI keys; newer key formats vary
replacement: '[OPENAI_KEY]',
},
{
type: 'secrets',
pattern: /ghp_[a-zA-Z0-9]{36}/g, // GitHub tokens
replacement: '[GITHUB_TOKEN]',
},
];
sanitize(text: string): { sanitized: string; redactions: string[] } {
let result = text;
const redactions: string[] = [];
for (const rule of this.rules) {
result = result.replace(rule.pattern, (match) => {
redactions.push(`${rule.type}: ${match.substring(0, 10)}...`); // prefix only; still treat these logs as sensitive
return typeof rule.replacement === 'function'
? rule.replacement(match)
: rule.replacement;
});
}
return { sanitized: result, redactions };
}
addRule(rule: SanitizationRule): void {
this.rules.push(rule);
}
}
// Middleware for the gateway
export function createSanitizationMiddleware(sanitizer: DataSanitizer) {
return async (
request: LLMCompletionRequest,
next: (req: LLMCompletionRequest) => Promise<LLMCompletionResponse>
): Promise<LLMCompletionResponse> => {
// Sanitize user messages before sending to LLM
const sanitizedMessages = request.messages.map(msg => {
if (msg.role === 'user') {
const { sanitized, redactions } = sanitizer.sanitize(msg.content);
if (redactions.length > 0) {
console.info('Sanitized sensitive data:', redactions);
}
return { ...msg, content: sanitized };
}
return msg;
});
return next({ ...request, messages: sanitizedMessages });
};
}
Data Residency
// llm/security/data-residency.ts
export interface DataResidencyConfig {
region: string;
allowedProviders: string[];
allowedEndpoints: Record<string, string>; // provider -> endpoint
}
export class DataResidencyEnforcer {
constructor(private configs: Record<string, DataResidencyConfig>) {}
getConfig(userRegion: string): DataResidencyConfig {
return this.configs[userRegion] || this.configs['default'];
}
validateProvider(userRegion: string, provider: string): boolean {
const config = this.getConfig(userRegion);
return config.allowedProviders.includes(provider);
}
getEndpoint(userRegion: string, provider: string): string | undefined {
const config = this.getConfig(userRegion);
return config.allowedEndpoints[provider];
}
}
// Example configuration
const dataResidencyConfigs: Record<string, DataResidencyConfig> = {
eu: {
region: 'eu',
allowedProviders: ['azure-openai', 'anthropic-eu'],
allowedEndpoints: {
'azure-openai': 'https://eu-west.openai.azure.com',
'anthropic-eu': 'https://api.eu.anthropic.com',
},
},
us: {
region: 'us',
allowedProviders: ['openai', 'anthropic'],
allowedEndpoints: {
openai: 'https://api.openai.com',
anthropic: 'https://api.anthropic.com',
},
},
default: {
region: 'default',
allowedProviders: ['openai', 'anthropic'],
allowedEndpoints: {
openai: 'https://api.openai.com',
anthropic: 'https://api.anthropic.com',
},
},
};
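In the gateway, the enforcer should fail closed: if a provider is not on the region's allowlist, reject the call rather than silently falling back. A minimal sketch of that check (types trimmed to the essentials for illustration):

```typescript
interface ResidencyConfig {
  allowedProviders: string[];
  allowedEndpoints: Record<string, string>;
}

const configs: Record<string, ResidencyConfig> = {
  eu: {
    allowedProviders: ['azure-openai'],
    allowedEndpoints: { 'azure-openai': 'https://eu-west.openai.azure.com' },
  },
  default: {
    allowedProviders: ['openai', 'anthropic'],
    allowedEndpoints: { openai: 'https://api.openai.com' },
  },
};

function resolveEndpoint(userRegion: string, provider: string): string {
  const config = configs[userRegion] ?? configs['default'];
  if (!config.allowedProviders.includes(provider)) {
    // Fail closed: never route restricted data to a disallowed region
    throw new Error(`Provider ${provider} not permitted for region ${userRegion}`);
  }
  return config.allowedEndpoints[provider];
}

console.log(resolveEndpoint('eu', 'azure-openai')); // https://eu-west.openai.azure.com
```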
Real-World Patterns
Pattern 1: The AI Feature Toggle
// patterns/ai-feature-toggle.ts
// Everything works with or without AI
export class SmartSearch {
constructor(
private traditionalSearch: SearchEngine,
private aiSearchEnhancer: AISearchEnhancer | null,
private featureFlags: FeatureFlagService
) {}
async search(query: string, userId: string): Promise<SearchResults> {
// Always do traditional search
const results = await this.traditionalSearch.search(query);
// Optionally enhance with AI if available and enabled
if (
this.aiSearchEnhancer &&
this.featureFlags.isEnabled('ai-search', { userId })
) {
try {
const enhanced = await this.aiSearchEnhancer.enhance(query, results);
return {
...enhanced,
aiEnhanced: true,
};
} catch (error) {
// Log but don't fail
console.warn('AI enhancement failed, using traditional results', error);
return { ...results, aiEnhanced: false };
}
}
return { ...results, aiEnhanced: false };
}
}
Pattern 2: The Async AI Pipeline
// patterns/async-pipeline.ts
// AI runs in background, results delivered asynchronously
export class DocumentProcessor {
constructor(
private docService: DocumentService,
private aiQueue: LLMQueueService,
private notifications: NotificationService
) {}
async uploadDocument(file: File, userId: string): Promise<Document> {
// Immediately save and return
const doc = await this.docService.save({
file,
userId,
status: 'uploaded',
aiProcessingStatus: 'pending',
});
// Queue AI processing
await this.aiQueue.enqueue({
type: 'completion',
payload: {
messages: [
{ role: 'system', content: 'Extract metadata and summarize.' },
{ role: 'user', content: doc.content },
],
},
metadata: {
feature: 'document-processing',
userId,
priority: 'normal',
callbackUrl: `${process.env.API_URL}/webhooks/ai-complete`,
webhookSecret: doc.id, // Used to identify the document
},
});
return doc;
}
// Called when AI processing completes
async handleAIComplete(docId: string, result: any): Promise<void> {
await this.docService.update(docId, {
summary: result.summary,
metadata: result.metadata,
aiProcessingStatus: 'complete',
});
await this.notifications.send({
type: 'document-ready',
docId,
});
}
}
Pattern 3: The AI Copilot Sidecar
// patterns/ai-copilot.ts
// AI provides suggestions, user is always in control
export class EditorCopilot {
constructor(
private llmGateway: LLMGateway,
private featureFlags: FeatureFlagService
) {}
async getSuggestion(context: EditorContext): Promise<Suggestion | null> {
if (!this.featureFlags.isEnabled('copilot', { userId: context.userId })) {
return null;
}
try {
const response = await this.llmGateway.complete({
messages: [
{ role: 'system', content: 'Provide a brief, helpful suggestion.' },
{ role: 'user', content: this.buildPrompt(context) },
],
maxTokens: 200,
temperature: 0.3,
metadata: {
feature: 'copilot',
userId: context.userId,
requestId: crypto.randomUUID(),
},
});
return {
text: response.content,
confidence: this.estimateConfidence(response),
action: 'suggest', // Never auto-apply
};
} catch (error) {
// Copilot failure is silent - it's optional
console.debug('Copilot suggestion failed', error);
return null;
}
}
private buildPrompt(context: EditorContext): string {
return `Current text: ${context.text.substring(0, 500)}
Cursor position: ${context.cursorPosition}
User intent: ${context.lastAction}
Suggest a brief completion or improvement.`;
}
private estimateConfidence(response: LLMCompletionResponse): number {
// Lower confidence for longer responses (more uncertain)
const lengthFactor = Math.max(0.5, 1 - response.usage.completionTokens / 200);
return lengthFactor;
}
}
Decision Framework
When to Use Each Pattern
┌─────────────────────────────────────────────────────────────────┐
│ LLM Integration Decision Framework │
├─────────────────────────────────────────────────────────────────┤
│ │
│ QUESTION 1: Is real-time response required? │
│ ───────────────────────────────────────────── │
│ YES → Synchronous with timeout and fallback │
│ NO → Queue-based async processing │
│ │
│ QUESTION 2: Is the feature critical to core UX? │
│ ───────────────────────────────────────────── │
│ YES → Implement robust fallbacks, never block on AI │
│ NO → Can gracefully hide feature when AI unavailable │
│ │
│ QUESTION 3: How sensitive is the data? │
│ ───────────────────────────────────────────── │
│ HIGH → Sanitization, data residency, audit logging │
│ LOW → Standard security measures sufficient │
│ │
│ QUESTION 4: What's the cost tolerance? │
│ ───────────────────────────────────────────── │
│ LOW → Aggressive caching, cheaper models, rate limiting │
│ HIGH → Optimize for quality, less aggressive caching │
│ │
│ QUESTION 5: How mature is the use case? │
│ ───────────────────────────────────────────── │
│ EXPERIMENTAL → Feature flags, A/B testing, easy rollback │
│ PROVEN → Standard integration with monitoring │
│ │
└─────────────────────────────────────────────────────────────────┘
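The five questions above can be encoded as a small advisory helper, useful in design reviews to keep the conversation structured. The profile fields and recommendation strings here are illustrative, not prescriptive:

```typescript
interface FeatureProfile {
  realTime: boolean;         // Q1: is real-time response required?
  coreUX: boolean;           // Q2: critical to core UX?
  sensitiveData: boolean;    // Q3: highly sensitive data?
  lowCostTolerance: boolean; // Q4: low cost tolerance?
  experimental: boolean;     // Q5: experimental use case?
}

function recommend(profile: FeatureProfile): string[] {
  const advice: string[] = [];
  advice.push(profile.realTime
    ? 'Synchronous call with timeout and fallback'
    : 'Queue-based async processing');
  advice.push(profile.coreUX
    ? 'Robust fallbacks; never block core flow on AI'
    : 'Hide feature gracefully when AI is unavailable');
  if (profile.sensitiveData) advice.push('Sanitization, data residency, audit logging');
  if (profile.lowCostTolerance) advice.push('Aggressive caching, cheaper models, rate limits');
  if (profile.experimental) advice.push('Feature flags, A/B testing, easy rollback');
  return advice;
}

// Example: an experimental real-time copilot with tight cost limits
console.log(recommend({
  realTime: true, coreUX: true, sensitiveData: false,
  lowCostTolerance: true, experimental: true,
}));
```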
Architecture Selection Guide
┌─────────────────────────────────────────────────────────────────┐
│ When to Use What │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CHAT / CONVERSATION │
│ ───────────────────────────────────────────── │
│ Architecture: Streaming with fallback to queued │
│ Caching: Semantic cache for similar queries │
│ Degradation: Show "AI unavailable" message │
│ │
│ DOCUMENT PROCESSING │
│ ───────────────────────────────────────────── │
│ Architecture: Queue-based async │
│ Caching: Hash-based for identical documents │
│ Degradation: Mark as "pending manual review" │
│ │
│ REAL-TIME SUGGESTIONS │
│ ───────────────────────────────────────────── │
│ Architecture: Sync with aggressive timeout │
│ Caching: Aggressive, even slightly stale is OK │
│ Degradation: Hide suggestions silently │
│ │
│ SEARCH ENHANCEMENT │
│ ───────────────────────────────────────────── │
│ Architecture: Parallel (traditional + AI) │
│ Caching: Cache AI enhancements │
│ Degradation: Use traditional search only │
│ │
│ CONTENT GENERATION │
│ ───────────────────────────────────────────── │
│ Architecture: Queue-based with preview capability │
│ Caching: Template-based, not full responses │
│ Degradation: Offer templates/suggestions instead │
│ │
└─────────────────────────────────────────────────────────────────┘
Summary
┌─────────────────────────────────────────────────────────────────┐
│ Key Takeaways │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. ABSTRACT THE PROVIDER │
│ Never import OpenAI/Anthropic SDK directly in feature code. │
│ Use an interface and gateway pattern. │
│ │
│ 2. DEFINE CAPABILITIES, NOT IMPLEMENTATIONS │
│ Your code should call summarizer.summarize(), not │
│ openai.chat.completions.create(). │
│ │
│ 3. ALWAYS HAVE A FALLBACK │
│ Every AI feature must work (even if degraded) when AI │
│ is unavailable. │
│ │
│ 4. CENTRALIZE CROSS-CUTTING CONCERNS │
│ Rate limiting, cost tracking, logging, retries - all in │
│ one gateway, not scattered across features. │
│ │
│ 5. MAKE IT TESTABLE │
│ Mock providers, contract tests, snapshot tests for prompts. │
│ Never require a real API key to run tests. │
│ │
│ 6. CONTROL COSTS ARCHITECTURALLY │
│ Caching, budgets, feature flags - design for cost control, │
│ don't bolt it on later. │
│ │
│ 7. OBSERVE EVERYTHING │
│ Every LLM call should be logged, traced, and measured. │
│ You can't optimize what you can't measure. │
│ │
│ 8. PREPARE FOR CHANGE │
│ Models change, providers change, APIs change. │
│ Your architecture should make migration easy. │
│ │
└─────────────────────────────────────────────────────────────────┘
Quick Start Checklist
## MVP LLM Integration Checklist
### Day 1: Foundation
- [ ] Create LLMProvider interface
- [ ] Implement one provider (OpenAI or Anthropic)
- [ ] Create LLMGateway with basic error handling
- [ ] Add structured logging
### Week 1: Production Readiness
- [ ] Add circuit breaker
- [ ] Implement basic caching
- [ ] Add cost tracking
- [ ] Create mock provider for tests
- [ ] Add feature flags
### Month 1: Scale
- [ ] Add second provider for fallback
- [ ] Implement queue for async processing
- [ ] Add semantic caching
- [ ] Create cost dashboards
- [ ] Implement budget controls
### Ongoing
- [ ] Monitor and alert on costs
- [ ] A/B test models
- [ ] Review and update prompts
- [ ] Audit for data leakage
References
- OpenAI API Best Practices
- Anthropic API Documentation
- LangChain Architecture Concepts
- Circuit Breaker Pattern (Martin Fowler)
- Feature Toggles (Martin Fowler)
The best LLM integration is one you can rip out and replace in a day.