
AI Code Completion and Copilot Internals: How Autocomplete Actually Works

May 9, 2026 · 81 min read


Real-World Problem Context

A frontend developer types function validate in their editor and an AI autocomplete suggests a full validation function with error messages, type checks, and edge case handling — all contextually relevant to their project. GitHub Copilot, Cursor Tab, Cody, Supermaven, and TabNine power this experience for millions of developers daily. But how does it actually work? How does the model know about your project's coding style, the framework you're using, which variables are in scope, and what you're likely to type next? The answer involves transformer-based language models, context window construction from open files and neighboring tabs, tokenization of source code, speculative decoding for low-latency suggestions, ghost text rendering in the editor, and caching strategies for real-time performance. This post dissects the end-to-end architecture: from keystroke to suggestion.


Problem Statements

  1. Context Construction: How does an AI code completion tool decide what context to send to the model — which files, which lines, which imports, which recently edited code — given a fixed context window (e.g., 8K-128K tokens)?

  2. Low-Latency Inference: Users expect suggestions within 100-300ms of pausing typing — how do completion tools achieve sub-300ms latency using speculative decoding, model quantization, local models, caching, and request cancellation?

  3. Editor Integration: How does the suggestion appear as "ghost text" in the editor, handle partial accepts (word-by-word), multi-line completions, and avoid interfering with the language server's own completions (IntelliSense)?


Deep Dive: Internal Mechanisms

1. End-to-End Architecture

/*
 * Keystroke → Suggestion pipeline:
 *
 *   Developer types code
 *       │
 *       ▼
 *   Editor Extension (VS Code, JetBrains, Neovim)
 *   1. Debounce keystrokes (150-300ms)
 *   2. Gather context (current file, open tabs, imports)
 *   3. Build prompt (prefix + suffix for fill-in-middle)
 *       │
 *       ▼
 *   Request to AI backend:
 *   ┌─── Local model (Ollama, Supermaven) ───┐
 *   │         OR                               │
 *   ├─── Cloud API (Copilot, Cursor) ────────┤
 *   │         OR                               │
 *   ├─── Hybrid (local draft + cloud refine) ─┤
 *   └─────────────────────────────────────────┘
 *       │
 *       ▼
 *   Model inference:
 *   - Tokenize prompt
 *   - Run transformer forward pass
 *   - Generate tokens (autoregressive or speculative)
 *   - Stop at natural boundary (end of function, blank line)
 *       │
 *       ▼
 *   Post-processing:
 *   - Filter low-confidence suggestions
 *   - De-duplicate with recent suggestions
 *   - Validate syntax (bracket matching)
 *   - Trim trailing whitespace/newlines
 *       │
 *       ▼
 *   Ghost text rendered in editor
 *   User presses Tab to accept (or keeps typing to dismiss)
 */

2. Context Window Construction

/*
 * The model has a fixed context window (8K-128K tokens).
 * What goes in matters enormously for suggestion quality.
 *
 * Context sources (priority order):
 *
 * 1. Current file: code before cursor (prefix) + after cursor (suffix)
 * 2. Recently edited files (sorted by edit recency)
 * 3. Open editor tabs (sorted by relevance)
 * 4. Imported/required files (follow import graph)
 * 5. Related test files (if editing source, include test)
 * 6. Project-level snippets (README, config files)
 * 7. Language/framework documentation snippets
 *
 * Budget allocation (example for 8K token window):
 *   - Prefix (before cursor):     ~3000 tokens
 *   - Suffix (after cursor):      ~1000 tokens
 *   - Neighboring files:          ~3000 tokens
 *   - System prompt/instructions: ~1000 tokens
 */

function buildCompletionContext(editor, tokenBudget = 8000) {
    const cursor = editor.getCursorPosition();
    const currentFile = editor.getDocument().getText();
    
    // Split current file at cursor position:
    const prefix = currentFile.substring(0, cursor.offset);
    const suffix = currentFile.substring(cursor.offset);
    
    // Allocate token budget (reserve a fixed slice for the system prompt
    // so the fractions below sum to the remaining budget):
    const systemPromptTokens = 200;
    const available = tokenBudget - systemPromptTokens;
    const budgets = {
        systemPrompt: systemPromptTokens,
        prefix: Math.floor(available * 0.35),
        suffix: Math.floor(available * 0.12),
        neighbors: Math.floor(available * 0.45),
        metadata: Math.floor(available * 0.08),
    };
    
    // Truncate prefix (keep the END — most relevant to cursor):
    const truncatedPrefix = truncateFromStart(prefix, budgets.prefix);
    
    // Truncate suffix (keep the START — immediately after cursor):
    const truncatedSuffix = truncateFromEnd(suffix, budgets.suffix);
    
    // Select and rank neighboring files:
    const neighbors = selectNeighborFiles(editor, budgets.neighbors);
    
    // File metadata (language, framework hints):
    const metadata = buildMetadata(editor);
    
    return {
        prefix: truncatedPrefix,
        suffix: truncatedSuffix,
        neighbors,
        metadata,
    };
}

// Neighbor file ranking heuristics:
function selectNeighborFiles(editor, tokenBudget) {
    const candidates = [];
    
    // Recently edited files (highest signal):
    for (const file of getRecentlyEditedFiles()) {
        candidates.push({
            path: file.path,
            content: file.content,
            score: 10 + 1 / (Date.now() - file.lastEditTime + 1), // recency bonus; +1 avoids divide-by-zero
        });
    }
    
    // Files imported by current file:
    const imports = extractImports(editor.getDocument().getText());
    for (const imp of imports) {
        const resolvedPath = resolveImport(imp, editor.getFilePath());
        if (resolvedPath) {
            candidates.push({
                path: resolvedPath,
                content: readFile(resolvedPath),
                score: 8, // High relevance — directly imported
            });
        }
    }
    
    // Open tabs:
    for (const tab of editor.getOpenTabs()) {
        if (!candidates.find(c => c.path === tab.path)) {
            candidates.push({
                path: tab.path,
                content: tab.content,
                score: 3,
            });
        }
    }
    
    // Sort by score, fill up to token budget:
    candidates.sort((a, b) => b.score - a.score);
    
    const selected = [];
    let remaining = tokenBudget;
    
    for (const candidate of candidates) {
        const tokens = estimateTokenCount(candidate.content);
        if (tokens <= remaining) {
            selected.push(candidate);
            remaining -= tokens;
        } else {
            // Include partial (most relevant portion):
            selected.push({
                ...candidate,
                content: truncateToTokens(candidate.content, remaining),
            });
            break;
        }
    }
    
    return selected;
}
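
The truncation helpers referenced above (truncateFromStart, truncateFromEnd, truncateToTokens) are not defined in the snippet; a minimal sketch is shown below, using the rough ~3.5-characters-per-token heuristic from the tokenization section later in this post.

// Hypothetical truncation helpers — budgets are approximated at ~3.5 chars/token.
const CHARS_PER_TOKEN = 3.5;

function truncateFromStart(text, tokenBudget) {
    // Keep the END of the text (the lines closest to the cursor).
    const maxChars = Math.floor(tokenBudget * CHARS_PER_TOKEN);
    return text.length <= maxChars ? text : text.slice(text.length - maxChars);
}

function truncateFromEnd(text, tokenBudget) {
    // Keep the START of the text (the lines immediately after the cursor).
    const maxChars = Math.floor(tokenBudget * CHARS_PER_TOKEN);
    return text.length <= maxChars ? text : text.slice(0, maxChars);
}

function truncateToTokens(text, tokenBudget) {
    // Keep the beginning of a neighbor file, up to the remaining budget.
    return text.slice(0, Math.floor(tokenBudget * CHARS_PER_TOKEN));
}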

3. Fill-in-the-Middle (FIM) Prompting

/*
 * Most code completions use Fill-in-the-Middle (FIM) format,
 * not standard left-to-right generation.
 *
 * Standard (left-to-right):
 *   "function add(a, b) {\n  return "  →  "a + b;\n}"
 *   Only sees what comes BEFORE the cursor.
 *
 * FIM (fill-in-middle):
 *   Prefix: "function add(a, b) {\n  return "
 *   Suffix: "\n}\n\nconsole.log(add(1, 2));"
 *   Model fills: "a + b;"
 *   Sees BOTH before AND after cursor — much better context.
 *
 * FIM token format (varies by model):
 *
 * StarCoder/CodeLlama:
 *   <fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{generation}
 *
 * GPT-style:
 *   <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{generation}
 *
 * Copilot uses a custom format with file path annotations:
 *   # Path: src/utils/math.ts
 *   {prefix}<CURSOR>{suffix}
 */

function buildFIMPrompt(context, modelFormat = 'starcoder') {
    const { prefix, suffix, neighbors, metadata } = context;
    
    let prompt = '';
    
    // Add neighbor file context:
    for (const neighbor of neighbors) {
        prompt += `// Path: ${neighbor.path}\n`;
        prompt += neighbor.content + '\n\n';
    }
    
    const currentFilePath = metadata.filePath;
    
    switch (modelFormat) {
        case 'starcoder':
            prompt += `<fim_prefix>// Path: ${currentFilePath}\n`;
            prompt += prefix;
            prompt += `<fim_suffix>`;
            prompt += suffix;
            prompt += `<fim_middle>`;
            break;
            
        case 'codellama':
            prompt += `<PRE> // Path: ${currentFilePath}\n`;
            prompt += prefix;
            prompt += ` <SUF> `;
            prompt += suffix;
            prompt += ` <MID>`;
            break;
            
        case 'deepseek':
            prompt += `<|fim▁begin|>// Path: ${currentFilePath}\n`;
            prompt += prefix;
            prompt += `<|fim▁hole|>`;
            prompt += suffix;
            prompt += `<|fim▁end|>`;
            break;
    }
    
    return prompt;
}
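
As a usage sketch, here is what the StarCoder-format prompt looks like for a tiny, made-up context (file paths and contents are illustrative):

// Usage sketch — the context values below are made up for illustration.
const fimPrompt = buildFIMPrompt({
    prefix: 'function add(a, b) {\n  return ',
    suffix: '\n}\n',
    neighbors: [{ path: 'src/utils/math.ts', content: 'export const PI = 3.14159;' }],
    metadata: { filePath: 'src/utils/sum.ts' },
}, 'starcoder');

// fimPrompt now contains:
//   // Path: src/utils/math.ts
//   export const PI = 3.14159;
//
//   <fim_prefix>// Path: src/utils/sum.ts
//   function add(a, b) {
//     return <fim_suffix>
//   }
//   <fim_middle>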

/*
 * Why FIM matters for frontend developers:
 *
 * Editing in the MIDDLE of a component:
 *
 *   function UserProfile({ user }) {
 *     const [editing, setEditing] = useState(false);
 *     |  ← cursor here
 *     return (
 *       <div className="profile">
 *         <h1>{user.name}</h1>
 *         {editing && <EditForm user={user} />}
 *       </div>
 *     );
 *   }
 *
 * With FIM, the model sees BOTH the state above AND the JSX below.
 * It knows to suggest something that connects them:
 *   "const handleEdit = () => setEditing(true);"
 *
 * Without FIM (left-to-right only), it has no idea about the JSX.
 */

4. Tokenization of Source Code

/*
 * LLMs don't see characters — they see tokens.
 * Tokenization heavily affects code completion quality.
 *
 * BPE (Byte Pair Encoding) for code:
 *   "function" → [1 token]  (common word)
 *   "useState" → [1-2 tokens]  (common in training data)
 *   "myCustomHookName" → [3-5 tokens]  (split by subwords)
 *   "  " (indentation) → [1 token per ~4 spaces]
 *
 * Code-optimized tokenizers (used by Copilot, StarCoder):
 *   - Trained on code corpora (not just English text)
 *   - Better handling of indentation, brackets, operators
 *   - Keywords like "const", "return", "async" are single tokens
 *   - Common patterns like "() => {" may be 2-3 tokens
 *
 * Token efficiency matters:
 *   More efficient tokenization → more code fits in context window
 *   → better suggestions
 *
 * Example token counts (GPT-4 tokenizer vs code-optimized):
 *
 *   Code snippet (100 chars):
 *     GPT-4 tokenizer:     ~35 tokens
 *     StarCoder tokenizer: ~25 tokens
 *     → ~30% fewer tokens for the same code, so more of it fits in the window
 */

// Estimating tokens for budget allocation:
function estimateTokenCount(code) {
    // Rule of thumb: ~3.5 characters per token for code
    // More accurate: use tiktoken or the model's tokenizer
    return Math.ceil(code.length / 3.5);
}

// Accurate token counting (using tiktoken):
// import { encoding_for_model } from 'tiktoken';
// const enc = encoding_for_model('gpt-4');
// const tokens = enc.encode(code);
// console.log(tokens.length);

/*
 * Tokenization pitfalls for code:
 *
 * 1. Indentation waste:
 *    4 spaces = 1 token, 8 spaces = 2 tokens
 *    Deeply nested code burns tokens on whitespace
 *    → Some tools strip/normalize indentation before sending
 *
 * 2. Long variable names:
 *    "useAuthenticationProviderContext" = 5+ tokens
 *    vs "useAuth" = 1-2 tokens
 *    → Not a reason to use short names, but context budget impact
 *
 * 3. Comments:
 *    JSDoc comments consume tokens but improve suggestion quality
 *    → Include comments near cursor, trim distant ones
 */
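
One possible mitigation for the indentation cost, sketched below, is to re-indent context before sending it. Whether a given tool actually does this, and how, is an implementation detail; the 4-spaces-to-tab mapping here is an assumption.

// Indentation normalization sketch — collapse each 4-space indent level to a
// single tab so deeply nested code spends fewer tokens on whitespace.
function normalizeIndentation(code, spacesPerLevel = 4) {
    return code
        .split('\n')
        .map(line => {
            const leading = line.match(/^ +/);
            if (!leading) return line;
            const levels = Math.floor(leading[0].length / spacesPerLevel);
            const remainder = leading[0].length % spacesPerLevel;
            return '\t'.repeat(levels) + ' '.repeat(remainder) + line.slice(leading[0].length);
        })
        .join('\n');
}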

5. Speculative Decoding for Low Latency

/*
 * Standard autoregressive generation:
 *   Token 1 → Token 2 → Token 3 → ... → Token N
 *   Each token requires a full model forward pass.
 *   Latency = N × forward_pass_time
 *
 * For a 20-token suggestion with a cloud model:
 *   20 × 30ms = 600ms — too slow for inline completion.
 *
 * Speculative decoding:
 *   Use a SMALL local model to draft tokens quickly.
 *   Use the LARGE model to verify/correct in parallel.
 *
 *   Small model (local, 1B params):
 *     Draft: "const result = data.filter(item => item.active)"
 *     Speed: 5ms per token
 *
 *   Large model (cloud, 70B+ params):
 *     Verify all draft tokens in ONE forward pass
 *     Accept matching tokens, regenerate from first mismatch
 *     Speed: 30ms for the batch
 *
 *   Result: ~130ms total instead of ~600ms (20 × 5ms draft + one 30ms verify pass)
 *
 *
 * Copilot's approach (simplified):
 *
 *   ┌─── Local Model ─────────────────┐
 *   │ Fast draft (runs on CPU/GPU)     │
 *   │ Lower quality but instant        │──draft──┐
 *   └─────────────────────────────────┘         │
 *                                                ▼
 *   ┌─── Cloud Model ─────────────────┐   ┌──────────┐
 *   │ High quality (runs on GPU cloud) │◀──│ Verify   │
 *   │ Slower but more accurate         │   │ & accept │
 *   └─────────────────────────────────┘   └──────────┘
 *
 * Supermaven's approach:
 *   Uses a very large context window (300K+) with a
 *   custom model architecture optimized for code completion.
 *   Runs partially local for speed.
 */
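
A minimal sketch of the draft-and-verify loop is shown below; draftModel and targetModel are hypothetical interfaces, not a real API.

// Speculative decoding sketch — draftModel/targetModel are hypothetical interfaces.
async function speculativeGenerate(promptTokens, draftModel, targetModel, maxTokens = 24) {
    const output = [];
    
    while (output.length < maxTokens) {
        // 1. Small model drafts a short run of tokens (fast, but serial):
        const draft = await draftModel.generate([...promptTokens, ...output], { count: 8 });
        
        // 2. Large model scores every draft position in ONE forward pass:
        const verified = await targetModel.verify([...promptTokens, ...output], draft);
        
        // 3. Keep the longest accepted prefix; at the first mismatch, take the
        //    large model's own token and continue drafting from there:
        output.push(...draft.slice(0, verified.acceptedCount));
        if (verified.acceptedCount < draft.length) {
            output.push(verified.correctionToken);
        }
        
        if (verified.reachedEnd) break;
    }
    
    return output;
}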

// Request management for low latency:
class CompletionRequestManager {
    constructor() {
        this.pendingRequest = null;
        this.cache = new LRUCache(100);
    }
    
    async requestCompletion(context) {
        // Cancel any in-flight request:
        if (this.pendingRequest) {
            this.pendingRequest.abort();
        }
        
        // Check cache (same prefix → same suggestion):
        const cacheKey = hashContext(context);
        const cached = this.cache.get(cacheKey);
        if (cached) return cached;
        
        const controller = new AbortController();
        this.pendingRequest = controller;
        
        try {
            const response = await fetch('/api/completions', {
                method: 'POST',
                body: JSON.stringify(context),
                signal: controller.signal,
                headers: { 'Content-Type': 'application/json' },
            });
            
            const suggestion = await response.json();
            
            // Cache for reuse:
            this.cache.set(cacheKey, suggestion);
            
            return suggestion;
        } catch (error) {
            if (error.name === 'AbortError') return null;
            throw error;
        }
    }
}

6. Ghost Text Rendering in Editors

/*
 * "Ghost text" = the gray/dimmed suggestion text shown inline.
 *
 * VS Code API for inline completions:
 *   vscode.languages.registerInlineCompletionItemProvider()
 *
 * The extension provides InlineCompletionItem objects.
 * VS Code renders them as ghost text overlay.
 *
 * Key behaviors:
 *   - Tab accepts the full suggestion
 *   - Ctrl+Right accepts word-by-word (partial accept)
 *   - Typing a character that matches keeps showing suggestion
 *   - Typing a different character dismisses
 *   - Escape dismisses
 *   - Multiple suggestions: cycle with Alt+[ and Alt+]
 */

// VS Code inline completion provider:
const provider = vscode.languages.registerInlineCompletionItemProvider(
    { pattern: '**' }, // All files
    {
        async provideInlineCompletionItems(document, position, context, token) {
            // context.triggerKind distinguishes automatic triggers (typing) from
            // explicit invocation (InlineCompletionTriggerKind.Invoke); a real
            // extension might skip debouncing when the user invokes it manually.
            
            // Build context from document:
            const prefix = document.getText(
                new vscode.Range(0, 0, position.line, position.character)
            );
            const suffix = document.getText(
                new vscode.Range(
                    position.line, position.character,
                    document.lineCount, 0
                )
            );
            
            // Request completion from AI model:
            const suggestion = await requestCompletion({
                prefix,
                suffix,
                language: document.languageId,
                filePath: document.uri.fsPath,
            });
            
            if (!suggestion || token.isCancellationRequested) {
                return { items: [] };
            }
            
            return {
                items: [{
                    insertText: suggestion.text,
                    range: new vscode.Range(position, position),
                    // For partial accept (word-by-word):
                    command: {
                        command: 'copilot.acceptPartial',
                        title: 'Accept next word',
                    },
                }],
            };
        },
    }
);

/*
 * Ghost text rendering details:
 *
 * 1. Single-line completion:
 *    const x = |getSortedList()     ← ghost text after cursor
 *
 * 2. Multi-line completion:
 *    function sort(arr) {|
 *      return arr.sort((a, b) => {  ← ghost text
 *        return a - b;               ← ghost text
 *      });                           ← ghost text
 *    }                               ← ghost text
 *
 * 3. Partial accept:
 *    User presses Ctrl+Right:
 *    const x = getSortedList|()     ← accepted "getSortedList"
 *    Remaining ghost text: "()"
 *
 * The editor decorates the ghost text with a specific CSS style:
 *   color: var(--vscode-editorGhostText-foreground)
 *   opacity: 0.5-0.7
 *   font-style: italic (optional)
 */
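
Partial accept can be implemented by splitting the suggestion at word boundaries; a minimal sketch follows (the copilot.acceptPartial command name used above is illustrative):

// Word-by-word partial accept sketch: insert the next "word" of the
// suggestion at the cursor, keep the remainder as ghost text.
function acceptNextWord(suggestionText) {
    // Leading whitespace plus one identifier, or a single punctuation character:
    const match = suggestionText.match(/^\s*(?:\w+|[^\s\w])/);
    if (!match) return { accepted: suggestionText, remaining: '' };
    
    return {
        accepted: match[0],
        remaining: suggestionText.slice(match[0].length),
    };
}

// acceptNextWord('getSortedList()')  → { accepted: 'getSortedList', remaining: '()' }
// acceptNextWord('()')               → { accepted: '(', remaining: ')' }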

7. Debouncing and Request Lifecycle

/*
 * Completion timing is critical:
 *
 * Too fast: unnecessary requests while user is typing
 * Too slow: user waits and breaks flow
 *
 * Typical lifecycle:
 *
 * Time → 
 * ──────────────────────────────────────────
 * User types: f, u, n, c, t, i, o, n, ' '
 *                                          │
 *                                          ▼ pause (>200ms)
 *                                    ┌──────────┐
 *                                    │ Debounce  │
 *                                    └─────┬────┘
 *                                          ▼
 *                                    Build context
 *                                          │
 *                                    ┌─────▼────┐
 *                                    │ API call  │ ~100-300ms
 *                                    └─────┬────┘
 *                                          ▼
 *                                    Show ghost text
 *                                          │
 *                              User presses Tab → Accept
 *                              User types char → Dismiss
 */

class CompletionTrigger {
    constructor(options = {}) {
        this.debounceMs = options.debounceMs || 250;
        this.minPrefixLength = options.minPrefixLength || 3;
        this.timer = null;
        this.lastRequest = null;
    }
    
    onTextChange(document, position) {
        // Clear previous debounce:
        clearTimeout(this.timer);
        
        // Cancel in-flight request:
        if (this.lastRequest) {
            this.lastRequest.abort();
        }
        
        // Don't trigger in comments or strings (optional):
        if (this.isInCommentOrString(document, position)) {
            return;
        }
        
        // Don't trigger with very short prefix:
        const linePrefix = document.lineAt(position.line)
            .text.substring(0, position.character).trim();
        if (linePrefix.length < this.minPrefixLength) {
            return;
        }
        
        // Debounce:
        this.timer = setTimeout(() => {
            this.triggerCompletion(document, position);
        }, this.debounceMs);
    }
    
    async triggerCompletion(document, position) {
        const controller = new AbortController();
        this.lastRequest = controller;
        
        const startTime = performance.now();
        
        try {
            const context = buildCompletionContext(document, position);
            const suggestion = await fetchCompletion(context, controller.signal);
            
            const latency = performance.now() - startTime;
            
            // Track metrics:
            telemetry.track('completion_latency', {
                latencyMs: latency,
                language: document.languageId,
                contextTokens: estimateTokenCount(context.prefix + context.suffix),
            });
            
            if (suggestion && !controller.signal.aborted) {
                showGhostText(suggestion, position);
            }
        } catch (error) {
            if (error.name !== 'AbortError') {
                console.error('Completion error:', error);
            }
        }
    }
}
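
The isInCommentOrString guard above is left undefined; a very rough, line-local heuristic is sketched below as a standalone function (a real implementation would use the language server's token information instead).

// Rough heuristic: only inspects the current line, so it misses block comments
// and template literals that span multiple lines (and "//" inside URLs).
function isInCommentOrString(document, position) {
    const lineText = document.lineAt(position.line)
        .text.substring(0, position.character);
    
    // After a line comment marker?
    if (lineText.includes('//')) return true;
    
    // Inside an unterminated string on this line? (odd number of quotes)
    for (const quote of ['"', "'", '`']) {
        const count = lineText.split(quote).length - 1;
        if (count % 2 === 1) return true;
    }
    
    return false;
}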

8. Suggestion Quality and Filtering

/*
 * Not every model output should be shown to the user.
 * Post-processing filters improve signal-to-noise ratio.
 */

function filterSuggestion(suggestion, context) {
    // 1. Empty or whitespace-only:
    if (!suggestion.text.trim()) return null;
    
    // 2. Too short (not useful):
    if (suggestion.text.trim().length < 5) return null;
    
    // 3. Exact duplicate of existing code:
    if (context.suffix.startsWith(suggestion.text)) return null;
    
    // 4. Low confidence (if model provides scores):
    if (suggestion.confidence !== undefined && suggestion.confidence < 0.3) {
        return null;
    }
    
    // 5. Bracket/quote balance check:
    if (!hasBracketBalance(context.prefix + suggestion.text)) {
        // Try trimming to last balanced position:
        const trimmed = trimToBalanced(context.prefix, suggestion.text);
        if (trimmed) {
            suggestion.text = trimmed;
        } else {
            return null;
        }
    }
    
    // 6. Stop at natural boundaries:
    suggestion.text = trimAtBoundary(suggestion.text);
    
    // 7. Remove trailing incomplete tokens:
    suggestion.text = removeIncompleteLastLine(suggestion.text);
    
    return suggestion;
}

function trimAtBoundary(text) {
    // Stop at natural code boundaries:
    const boundaries = [
        /\n\n/,                    // Blank line (function boundary)
        /\n(export|import|class|function|const|let|var)\s/,  // Top-level declaration
        /\n(describe|it|test)\(/,  // Test blocks
    ];
    
    for (const boundary of boundaries) {
        const match = text.match(boundary);
        if (match && match.index > 0) {
            return text.substring(0, match.index);
        }
    }
    
    return text;
}
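
The hasBracketBalance check used in filterSuggestion is also undefined in the snippet; a simple stack-based scan is sketched below. It deliberately ignores brackets inside strings and comments, which a production check would have to handle.

// Rough bracket-balance check: reject suggestions that close brackets the
// prefix never opened; unclosed openers are tolerated (the code may continue).
function hasBracketBalance(code) {
    const closerFor = { '(': ')', '[': ']', '{': '}' };
    const closers = new Set([')', ']', '}']);
    const stack = [];
    
    for (const ch of code) {
        if (closerFor[ch]) {
            stack.push(closerFor[ch]);
        } else if (closers.has(ch)) {
            if (stack.pop() !== ch) return false; // Mismatched or extra closer
        }
    }
    return true;
}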

/*
 * Acceptance telemetry (how Copilot improves):
 *
 * Track:
 *   - Suggestion shown → accepted (Tab) or dismissed
 *   - Partial accept (word/line)
 *   - Time between suggestion shown and acceptance
 *   - Whether accepted code was immediately edited
 *   - Language, file type, completion type (single/multi-line)
 *
 * This data trains the NEXT version of the model
 * and tunes the filtering/ranking heuristics.
 */
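
A rough sketch of recording those outcomes on the client side follows; the event name and fields are illustrative, and telemetry.track is the same assumed helper used earlier in this post.

// Illustrative acceptance tracking — field names are made up for this sketch.
function trackSuggestionOutcome(shown, outcome) {
    telemetry.track('completion_outcome', {
        outcome,                                  // 'accepted' | 'partial_accept' | 'dismissed'
        latencyToDecisionMs: Date.now() - shown.displayedAt,
        suggestionLines: shown.text.split('\n').length,
        multiLine: shown.text.includes('\n'),
        language: shown.language,
    });
}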

9. Caching and Prefetching

/*
 * Caching reduces latency for common patterns:
 *
 * 1. Prefix cache: same code prefix → same suggestion
 * 2. Semantic cache: similar intent → similar suggestion
 * 3. KV cache: model's key-value cache for incremental generation
 * 4. Prefetch: predict what the user will type next
 */

class CompletionCache {
    constructor() {
        // LRU cache of recent completions:
        this.cache = new Map();
        this.maxSize = 200;
        
        // Track recently shown suggestions for dedup:
        this.recentSuggestions = [];
    }
    
    getCacheKey(prefix, suffix) {
        // Use last N characters of prefix + first N of suffix:
        const prefixTail = prefix.slice(-500);
        const suffixHead = suffix.slice(0, 200);
        return hashString(prefixTail + '|CURSOR|' + suffixHead);
    }
    
    get(prefix, suffix) {
        const key = this.getCacheKey(prefix, suffix);
        const entry = this.cache.get(key);
        
        if (entry && (Date.now() - entry.timestamp) < 30000) { // 30s TTL
            // Move to front (LRU):
            this.cache.delete(key);
            this.cache.set(key, entry);
            return entry.suggestion;
        }
        
        return null;
    }
    
    set(prefix, suffix, suggestion) {
        const key = this.getCacheKey(prefix, suffix);
        
        // Evict oldest if at capacity:
        if (this.cache.size >= this.maxSize) {
            const oldest = this.cache.keys().next().value;
            this.cache.delete(oldest);
        }
        
        this.cache.set(key, {
            suggestion,
            timestamp: Date.now(),
        });
    }
}
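
The hashString helper above is not defined; any cheap, stable string hash works for cache keys. A minimal FNV-1a sketch (one reasonable choice, not necessarily what any particular tool uses):

// Minimal FNV-1a string hash — stable and fast enough for cache keys.
function hashString(str) {
    let hash = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        hash ^= str.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16);
}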

/*
 * KV Cache (model-level optimization):
 *
 * When the user accepts a suggestion and types more,
 * much of the context is the same as the previous request.
 *
 * Without KV cache:
 *   Request 1: process 4000 tokens of context → generate
 *   Request 2: process 4050 tokens (mostly same!) → generate
 *
 * With KV cache:
 *   Request 1: process 4000 tokens → cache KV pairs → generate
 *   Request 2: reuse 4000 cached KV pairs + process 50 new → generate
 *   
 *   Speed: ~10x faster for incremental completions
 *
 * This is why Supermaven claims 300K+ context — they aggressively
 * cache the KV state so reprocessing the context is free.
 */
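
From the client's perspective this often surfaces as server-side "prompt caching"; the sketch below only measures how much of two consecutive prompts is shared, i.e. the portion whose KV state could be reused (purely illustrative).

// Illustrative: how many leading characters two consecutive prompts share.
// That shared span is what a server-side prefix/KV cache avoids recomputing.
function sharedPrefixLength(previousPrompt, currentPrompt) {
    const max = Math.min(previousPrompt.length, currentPrompt.length);
    let i = 0;
    while (i < max && previousPrompt[i] === currentPrompt[i]) i++;
    return i;
}

// e.g. 3950 shared of 4000 characters → nearly the whole context is reusable.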

10. Multi-Line and Multi-Step Completions

/*
 * Beyond single-line: completing entire functions, blocks, and workflows.
 *
 * Single-line: const x = |                → items.filter(i => i.active)
 * Multi-line:  function validate(|         → full function body
 * Multi-step:  // Create a form component| → entire component with state
 *
 * Multi-line completion challenges:
 *   1. When to stop generating (don't generate infinite code)
 *   2. Indentation must match the file's style
 *   3. Closing brackets must balance with opening ones
 *   4. Import statements may be needed at the top of file
 */

// Stop-sequence logic — returns the truncated completion text when
// generation should stop, or null to keep generating:
const STOP_SEQUENCES = [
    '\n\n\n',              // Triple newline (section break)
    '\nexport ',           // New export (different declaration)
    '\nfunction ',         // New top-level function
    '\nclass ',            // New class
    '\ndescribe(',         // New test suite
    '\n// ---',            // Section separator comment
];

function shouldStopGeneration(generatedSoFar, nextToken) {
    const combined = generatedSoFar + nextToken;
    
    for (const stop of STOP_SEQUENCES) {
        if (combined.includes(stop)) {
            // Return text up to the stop sequence:
            return combined.substring(0, combined.indexOf(stop));
        }
    }
    
    // Also stop if we've generated too many lines:
    const lineCount = (generatedSoFar.match(/\n/g) || []).length;
    if (lineCount > 30) {
        // Trim to last complete statement:
        return trimToLastCompleteStatement(generatedSoFar);
    }
    
    return null; // Continue generating
}

/*
 * Next-edit prediction (Cursor's approach):
 *
 * Beyond completing at the cursor, predict the NEXT place
 * the user will edit and pre-generate that completion too.
 *
 * Example:
 *   User adds a new parameter to a function:
 *     function process(data, options) { ... }
 *                            ^^^^^^^^ added
 *
 *   AI predicts: user needs to update the call sites too.
 *   Pre-generates edits for:
 *     process(data)  →  process(data, {})
 *
 * This is "multi-edit" or "next action prediction" —
 * going from autocomplete to auto-refactor.
 */

Trade-offs & Considerations

Aspect          | Local Model       | Cloud Model        | Hybrid
----------------|-------------------|--------------------|---------------------------
Latency         | ~20-80ms          | ~200-500ms         | ~50-150ms
Quality         | Lower (smaller)   | Higher (larger)    | Best of both
Privacy         | Code stays local  | Code sent to cloud | Drafts local, verify cloud
Offline         | Works             | Fails              | Degrades gracefully
Cost            | Hardware (GPU)    | API pricing        | Both
Context window  | Limited (4-8K)    | Large (32-128K)    | Tiered
Model size      | 1-7B params       | 70B+ params        | 1B + 70B

Best Practices

  1. Construct context with FIM (fill-in-middle) format including both prefix and suffix — code before the cursor tells the model what's being written; code after tells it what needs to connect; FIM-aware models produce dramatically better mid-function completions than left-to-right-only generation; include imports, type definitions, and the current function's signature in the prefix.

  2. Aggressively cancel in-flight requests on every keystroke — users type faster than models generate; every character change invalidates the previous request; use AbortController to cancel pending API calls immediately, reducing wasted compute and avoiding stale suggestions appearing after the user has moved on.

  3. Cache completions by prefix+suffix hash with short TTL — when a user deletes and retypes, or moves the cursor back and forth, the same context often recurs; cache recent completions for 15-30 seconds to provide instant suggestions for repeated contexts; invalidate on file save or significant edits.

  4. Filter suggestions for bracket balance, natural boundaries, and minimum quality — raw model output often has unmatched brackets, trailing incomplete lines, or low-confidence gibberish; validate bracket/quote balance, trim at natural boundaries (blank lines, top-level declarations), and reject suggestions below a confidence threshold.

  5. Track acceptance rate and post-acceptance edits as quality signals — the acceptance rate (suggestions shown vs. accepted) is the primary quality metric; track whether accepted code was immediately edited (indicating partial usefulness) or deleted (indicating bad suggestion); use these signals to tune debounce timing, context construction, and model selection.


Conclusion

AI code completion works through a carefully orchestrated pipeline: the editor extension debounces keystrokes (150-300ms), constructs a context window from the current file (prefix + suffix via FIM format), neighboring files ranked by relevance (recent edits, imports, open tabs), and project metadata — all within the model's token budget (8K-128K tokens). The context is tokenized using a code-optimized tokenizer (more efficient than general-purpose ones), then sent to a language model (local 1-7B for speed, cloud 70B+ for quality, or hybrid via speculative decoding). The model generates tokens autoregressively, stopping at natural code boundaries. Post-processing filters the output for bracket balance, minimum quality, natural stopping points, and deduplication against existing code. The validated suggestion renders as ghost text in the editor via the inline completion API, supporting full accept (Tab), partial accept (word-by-word), and dismiss (Escape or keep typing). Caching (prefix hash, KV cache) and request cancellation (AbortController on every keystroke) keep latency under 300ms. The critical insight: suggestion quality depends more on context construction (what you send to the model) than model size — a well-constructed 8K context with relevant imports, types, and neighboring code outperforms a poorly constructed 128K context of irrelevant files.


© 2026 Vidhya Sagar Thakur. All rights reserved.