AI-Augmented Code Reviews: Architecting the Workflow, Not Just the Prompt
The naive approach to AI code review is seductive: point an LLM at a diff, ask it to "review this code," and paste the output as a comment. This creates noise, erodes trust, and eventually gets ignored. The challenge isn't getting an LLM to generate review comments—it's architecting a system where AI augments human judgment without drowning it.
This guide covers building an AI-augmented review pipeline that knows its boundaries, integrates with existing tooling, and keeps senior engineers focused on what matters: architecture, design, and mentorship.
The Problem with Naive AI Reviews
┌─────────────────────────────────────────────────────────────────────┐
│ NAIVE AI REVIEW ANTI-PATTERNS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PR Opened │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ "Review this code and find all issues" │ │
│ │ │ │
│ │ LLM Response: │ │
│ │ 1. Consider adding a docstring here │ │
│ │ 2. This variable name could be more descriptive │ │
│ │ 3. You might want to handle the edge case where... │ │
│ │ 4. Consider using const instead of let │ │
│ │ 5. This function could be split into smaller functions │ │
│ │ 6. Add error handling for network failures │ │
│ │ 7. Consider adding unit tests │ │
│ │ 8. The indentation on line 47 seems inconsistent │ │
│ │ ... 47 more suggestions ... │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Problems: │
│ • No prioritization (style nits mixed with security issues) │
│ • Duplicates what linters already catch │
│ • No context about codebase conventions │
│ • No understanding of PR intent │
│ • Creates alert fatigue → gets ignored │
│ • Undermines human reviewers' authority │
│ • No feedback loop for improvement │
│ │
└─────────────────────────────────────────────────────────────────────┘
What Goes Wrong
- Signal-to-noise ratio collapses - 50 comments where 3 matter
- Redundancy with existing tools - AI suggests what ESLint already enforces
- Context blindness - AI doesn't know your team's conventions
- Authority confusion - Is this a suggestion or a requirement?
- No learning - Same unhelpful comments on every PR
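A first-order fix for the noise problem is mechanical: rank findings by severity and cap how many survive before anything reaches the PR. A minimal sketch (the `Finding` shape and the cap of 5 are illustrative choices, not prescriptions):

```typescript
// Minimal finding shape for illustration (mirrors the analyzer types
// introduced later in this guide).
interface Finding {
  severity: 'info' | 'warning' | 'critical';
  title: string;
}

const SEVERITY_RANK: Record<Finding['severity'], number> = {
  critical: 0,
  warning: 1,
  info: 2,
};

// Sort severity-first and keep at most `cap` findings, so the reviewer
// sees the three issues that matter rather than fifty that don't.
function prioritize(findings: Finding[], cap = 5): Finding[] {
  return [...findings]
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, cap);
}
```

Capping is crude on its own, but combined with the routing and dedup strategies below it keeps the comment volume bounded regardless of how chatty the model is.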
The Right Mental Model
AI should handle what computers do best, freeing humans for what they do best:
┌─────────────────────────────────────────────────────────────────────┐
│ RESPONSIBILITY BOUNDARIES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ AUTOMATED (No Human Needed) HUMAN JUDGMENT REQUIRED │
│ ════════════════════════════ ════════════════════════ │
│ │
│ Static Analysis (ESLint, etc.) Architecture decisions │
│ ├─ Syntax errors ├─ Does this belong here? │
│ ├─ Style violations ├─ Is this the right │
│ ├─ Unused variables │ abstraction? │
│ └─ Import sorting └─ Will this scale? │
│ │
│ Type Checking (TypeScript) Design patterns │
│ ├─ Type mismatches ├─ Is this idiomatic? │
│ ├─ Null safety ├─ Does it follow our │
│ └─ Interface compliance │ conventions? │
│ │
│ Security Scanning (Semgrep) Business logic │
│ ├─ Known vulnerabilities ├─ Does this solve the │
│ ├─ Injection patterns │ actual problem? │
│ └─ Secrets detection ├─ Edge cases? │
│ └─ Correctness? │
│ │
│ AI-AUGMENTED (Human Reviews Output) Mentorship │
│ ═══════════════════════════════════ ├─ Teaching opportunities │
│ ├─ Career growth │
│ Complexity analysis └─ Team knowledge sharing │
│ Documentation gaps │
│ Test coverage suggestions Trade-off decisions │
│ Performance red flags ├─ Tech debt vs velocity │
│ Cross-cutting concern detection ├─ Consistency vs progress │
│ └─ Scope of change │
│ │
└─────────────────────────────────────────────────────────────────────┘
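This split can also be made executable. A hypothetical routing table (the concern names and layer labels are assumptions, not a standard taxonomy) lets the pipeline refuse to send linter-territory concerns to the LLM at all:

```typescript
// Each review concern maps to the layer responsible for it, so the
// pipeline never asks the LLM to re-litigate what a linter enforces.
type Layer = 'automated' | 'ai-augmented' | 'human';

const RESPONSIBILITY: Record<string, Layer> = {
  'style': 'automated',         // ESLint / Prettier
  'types': 'automated',         // tsc
  'known-vulns': 'automated',   // Semgrep
  'complexity': 'ai-augmented',
  'test-gaps': 'ai-augmented',
  'performance': 'ai-augmented',
  'architecture': 'human',
  'business-logic': 'human',
  'mentorship': 'human',
};

function layerFor(concern: string): Layer {
  // Unknown concerns default to human judgment, never silent automation.
  return RESPONSIBILITY[concern] ?? 'human';
}
```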
Pipeline Architecture
A well-designed review pipeline runs checks in stages, each with clear responsibilities:
┌─────────────────────────────────────────────────────────────────────┐
│ REVIEW PIPELINE STAGES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PR Opened/Updated │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 1: GATE CHECKS (Block if fail) │ │
│ │ ├─ CI/CD (tests, build) │ │
│ │ ├─ Linting (ESLint, Prettier) │ │
│ │ ├─ Type checking (tsc --noEmit) │ │
│ │ └─ Security scan (Semgrep, Snyk) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ (Only if Stage 1 passes) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 2: AI TRIAGE (Categorize & Filter) │ │
│ │ ├─ Classify PR type (feature/bugfix/refactor/docs) │ │
│ │ ├─ Identify high-risk areas (auth, payments, data) │ │
│ │ ├─ Detect cross-cutting concerns │ │
│ │ └─ Flag for specific reviewer expertise │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 3: AI ANALYSIS (Generate insights, not comments) │ │
│ │ ├─ Complexity delta analysis │ │
│ │ ├─ Test coverage gaps │ │
│ │ ├─ Documentation requirements │ │
│ │ ├─ Performance implications │ │
│ │ └─ Summary for reviewer │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 4: HUMAN REVIEW (Final authority) │ │
│ │ ├─ Reviewer sees AI summary (not raw comments) │ │
│ │ ├─ Reviewer has full context │ │
│ │ ├─ Reviewer makes approve/request changes decision │ │
│ │ └─ AI suggestions are optional, not blocking │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 5: FEEDBACK LOOP │ │
│ │ ├─ Track which AI suggestions were accepted │ │
│ │ ├─ Learn from reviewer corrections │ │
│ │ └─ Tune prompts based on team patterns │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
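The staged flow above can be sketched as a small runner. Stages here are simplified to synchronous booleans for illustration; a real Stage 1 would shell out to CI, linters, and scanners:

```typescript
// A stage either passes (true) or fails (false). Only blocking stages
// (Stage 1 gate checks) stop the pipeline; AI stages never block.
interface Stage {
  name: string;
  blocking: boolean;
  run: () => boolean;
}

// Runs stages in order and returns the names of stages that executed.
function runPipeline(stages: Stage[]): string[] {
  const completed: string[] = [];
  for (const stage of stages) {
    const passed = stage.run();
    completed.push(stage.name);
    if (!passed && stage.blocking) break; // gate failure halts everything downstream
  }
  return completed;
}
```

The key property is that a failing gate check means the AI stages never run at all, which is exactly the ordering the GitHub Actions workflow below enforces with `needs:`.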
Implementation: GitHub Actions Workflow
Main Workflow Configuration
# .github/workflows/pr-review-pipeline.yml
name: PR Review Pipeline
on:
pull_request:
types: [opened, synchronize, ready_for_review]
permissions:
contents: read
pull-requests: write
checks: write
jobs:
# Stage 1: Gate Checks (must pass before AI runs)
gate-checks:
name: Gate Checks
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Type check
run: npm run typecheck
- name: Lint
run: npm run lint
- name: Test
run: npm run test:ci
- name: Security scan
uses: returntocorp/semgrep-action@v1
with:
config: >-
p/security-audit
p/secrets
p/owasp-top-ten
# Stage 2 & 3: AI Analysis (only runs after gates pass)
ai-analysis:
name: AI Analysis
needs: gate-checks
runs-on: ubuntu-latest
if: github.event.pull_request.draft == false
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Need full history for diff
- name: Get PR diff
id: diff
run: |
git diff origin/${{ github.base_ref }}...HEAD > pr.diff
echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT
# Note: a step that runs `exit 0` does not skip later steps, so the
# AI analysis step below is gated on the same size condition instead.
- name: Skip large PRs
if: steps.diff.outputs.diff_size > 2000
run: |
echo "PR too large for AI review (${{ steps.diff.outputs.diff_size }} lines); skipping AI analysis"
echo "Consider breaking it into smaller PRs"
- name: Run AI analysis
if: steps.diff.outputs.diff_size <= 2000
id: ai-review
uses: ./.github/actions/ai-review
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
diff-file: pr.diff
pr-number: ${{ github.event.pull_request.number }}
- name: Post review summary
if: steps.ai-review.outputs.has-findings == 'true'
uses: actions/github-script@v7
with:
script: |
const summary = require('./ai-review-output.json');
await github.rest.pulls.createReview({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: context.issue.number,
event: 'COMMENT',
body: summary.reviewerSummary
});
Custom AI Review Action
# .github/actions/ai-review/action.yml
name: 'AI Code Review'
description: 'Runs AI-augmented code review analysis'
inputs:
github-token:
description: 'GitHub token'
required: true
openai-api-key:
description: 'OpenAI API key'
required: true
diff-file:
description: 'Path to diff file'
required: true
pr-number:
description: 'PR number'
required: true
outputs:
has-findings:
description: 'Whether the analysis found anything notable'
value: ${{ steps.analyze.outputs.has-findings }}
runs:
using: 'composite'
steps:
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install analysis tools
shell: bash
run: npm install -g @company/ai-review-cli
- name: Gather context
id: context
shell: bash
run: |
# Get PR metadata
gh pr view ${{ inputs.pr-number }} --json title,body,labels,author > pr-metadata.json
# Get recent commits for context
git log --oneline -20 > recent-commits.txt
# Get the list of changed file paths
git diff --name-only origin/${{ github.base_ref }}...HEAD > changed-files.txt
# Get codebase conventions (if exists)
if [ -f ".github/REVIEW_GUIDELINES.md" ]; then
cp .github/REVIEW_GUIDELINES.md review-guidelines.md
fi
env:
GH_TOKEN: ${{ inputs.github-token }}
- name: Run analysis
id: analyze
shell: bash
run: |
# Only pass --guidelines if the file was copied in the previous step
ARGS=(--diff "${{ inputs.diff-file }}" --context pr-metadata.json --output ai-review-output.json)
if [ -f review-guidelines.md ]; then
ARGS+=(--guidelines review-guidelines.md)
fi
ai-review analyze "${ARGS[@]}"
if [ -s ai-review-output.json ]; then
echo "has-findings=true" >> $GITHUB_OUTPUT
else
echo "has-findings=false" >> $GITHUB_OUTPUT
fi
env:
OPENAI_API_KEY: ${{ inputs.openai-api-key }}
The Analysis Engine
The core analysis tool runs multiple specialized prompts, each focused on a specific review concern.
// tools/ai-review/src/analyzer.ts
import OpenAI from 'openai';
import { readFileSync } from 'fs';
interface PRContext {
title: string;
body: string;
labels: string[];
author: string;
diff: string;
changedFiles: string[];
guidelines?: string;
}
interface AnalysisResult {
reviewerSummary: string;
riskLevel: 'low' | 'medium' | 'high';
categories: {
security: Finding[];
performance: Finding[];
testability: Finding[];
documentation: Finding[];
};
suggestedReviewers: string[];
estimatedReviewTime: string;
}
interface Finding {
severity: 'info' | 'warning' | 'critical';
file: string;
line?: number;
title: string;
description: string;
suggestion?: string;
}
const openai = new OpenAI();
export async function analyzepr(context: PRContext): Promise<AnalysisResult> {
// Run specialized analyses in parallel
const [
classification,
securityAnalysis,
performanceAnalysis,
testabilityAnalysis,
documentationAnalysis,
] = await Promise.all([
classifyPR(context),
analyzeSecurityImplications(context),
analyzePerformanceImplications(context),
analyzeTestability(context),
analyzeDocumentationNeeds(context),
]);
// Generate reviewer summary (not raw findings)
const summary = await generateReviewerSummary({
classification,
security: securityAnalysis,
performance: performanceAnalysis,
testability: testabilityAnalysis,
documentation: documentationAnalysis,
context,
});
return {
reviewerSummary: summary,
riskLevel: calculateRiskLevel(securityAnalysis, performanceAnalysis),
categories: {
security: securityAnalysis,
performance: performanceAnalysis,
testability: testabilityAnalysis,
documentation: documentationAnalysis,
},
suggestedReviewers: classification.suggestedReviewers,
estimatedReviewTime: classification.estimatedReviewTime,
};
}
async function classifyPR(context: PRContext): Promise<{
type: string;
riskAreas: string[];
suggestedReviewers: string[];
estimatedReviewTime: string;
}> {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.1, // Low temperature for classification
messages: [
{
role: 'system',
content: `You are a PR triage system. Classify PRs and identify review requirements.
Your job is to:
1. Classify the PR type (feature, bugfix, refactor, docs, deps, config)
2. Identify high-risk areas that need careful review
3. Suggest reviewers based on file ownership patterns
4. Estimate review time
Output JSON only, no explanation.`,
},
{
role: 'user',
content: `
PR Title: ${context.title}
PR Description: ${context.body}
Labels: ${context.labels.join(', ')}
Changed Files: ${context.changedFiles.join('\n')}
Classify this PR.`,
},
],
response_format: { type: 'json_object' },
});
return JSON.parse(response.choices[0].message.content!);
}
async function analyzeSecurityImplications(context: PRContext): Promise<Finding[]> {
// Skip if no security-relevant files changed
const securityRelevantPatterns = [
/auth/i, /login/i, /password/i, /token/i, /secret/i,
/payment/i, /billing/i, /credit/i,
/admin/i, /permission/i, /role/i,
/api.*route/i, /middleware/i,
/\.env/, /config/i,
];
const hasSecurityRelevantChanges = context.changedFiles.some(file =>
securityRelevantPatterns.some(pattern => pattern.test(file))
);
if (!hasSecurityRelevantChanges) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.2,
messages: [
{
role: 'system',
content: `You are a security-focused code reviewer. Your ONLY job is to identify potential security issues.
DO NOT comment on:
- Code style
- Performance (unless it's a DoS vector)
- Documentation
- General best practices
ONLY flag:
- Authentication/authorization bypasses
- Injection vulnerabilities (SQL, XSS, command)
- Sensitive data exposure
- Insecure cryptography
- Access control issues
- Input validation gaps in security-critical paths
If you find nothing security-relevant, return an empty array.
Be specific about the vulnerability, not vague concerns.
Output a JSON object of the form {"findings": [...]}; each finding has: severity, file, line (if known), title, description, suggestion`,
},
{
role: 'user',
content: `Review this diff for security issues:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzePerformanceImplications(context: PRContext): Promise<Finding[]> {
// Only analyze if there are code changes (not just docs/config)
const codeExtensions = /\.(ts|tsx|js|jsx|py|go|rs|java)$/;
const hasCodeChanges = context.changedFiles.some(f => codeExtensions.test(f));
if (!hasCodeChanges) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.2,
messages: [
{
role: 'system',
content: `You are a performance-focused code reviewer. Identify potential performance issues.
Focus on:
- N+1 query patterns
- Unbounded loops or recursion
- Missing pagination on list endpoints
- Large payload serialization
- Synchronous operations that should be async
- Missing caching opportunities for expensive operations
- Memory leaks (event listeners, subscriptions)
DO NOT flag:
- Micro-optimizations
- Style preferences
- Theoretical issues without real impact
Only flag issues that would materially impact user experience or costs.
Output a JSON object of the form {"findings": [...]}; each finding has: severity, file, line, title, description, suggestion`,
},
{
role: 'user',
content: `Review for performance issues:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzeTestability(context: PRContext): Promise<Finding[]> {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `You are reviewing test coverage. Your job is NOT to write tests, but to identify gaps.
Flag when:
- New public functions lack corresponding tests
- Complex branching logic isn't tested
- Error paths aren't tested
- Integration points (API calls, DB) lack tests
DO NOT:
- Suggest specific test implementations
- Flag private/internal functions
- Require 100% coverage
Output a JSON object of the form {"findings": [...]}; each finding has: severity (info for suggestions, warning for gaps), file, title, description`,
},
{
role: 'user',
content: `Review test coverage:\n\nChanged files:\n${context.changedFiles.join('\n')}\n\nDiff:\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzeDocumentationNeeds(context: PRContext): Promise<Finding[]> {
// Only flag documentation for public API changes
const isPublicAPIChange = context.changedFiles.some(f =>
/^(src\/api|src\/lib|packages\/.*\/src)/.test(f)
);
if (!isPublicAPIChange) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `You review documentation needs for public API changes.
Flag when:
- New exported functions lack JSDoc
- Breaking changes aren't documented
- New configuration options lack explanation
- New environment variables aren't documented
DO NOT flag:
- Internal functions
- Self-explanatory code
- Minor changes
Output a JSON object of the form {"findings": [...]}; each finding has: severity (info), file, title, description`,
},
{
role: 'user',
content: `Review documentation needs:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function generateReviewerSummary(analysis: {
classification: any;
security: Finding[];
performance: Finding[];
testability: Finding[];
documentation: Finding[];
context: PRContext;
}): Promise<string> {
const criticalCount =
analysis.security.filter(f => f.severity === 'critical').length +
analysis.performance.filter(f => f.severity === 'critical').length;
const warningCount =
analysis.security.filter(f => f.severity === 'warning').length +
analysis.performance.filter(f => f.severity === 'warning').length +
analysis.testability.filter(f => f.severity === 'warning').length;
// Don't generate a summary if there's nothing notable
if (criticalCount === 0 && warningCount === 0) {
return '';
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.4,
messages: [
{
role: 'system',
content: `You write concise review summaries for human reviewers.
Guidelines:
- Be brief (max 200 words)
- Lead with the most important finding
- Group related issues
- Don't repeat what static analysis already caught
- End with what the human reviewer should focus on
- Use markdown formatting
This is a SUMMARY for the reviewer, not the full review.`,
},
{
role: 'user',
content: `
PR: ${analysis.context.title}
Type: ${analysis.classification.type}
Risk Areas: ${analysis.classification.riskAreas.join(', ')}
Findings:
${JSON.stringify({
security: analysis.security,
performance: analysis.performance,
testability: analysis.testability,
documentation: analysis.documentation,
}, null, 2)}
Write a summary for the human reviewer.`,
},
],
});
return response.choices[0].message.content!;
}
function calculateRiskLevel(
security: Finding[],
performance: Finding[]
): 'low' | 'medium' | 'high' {
const criticalCount =
security.filter(f => f.severity === 'critical').length +
performance.filter(f => f.severity === 'critical').length;
const warningCount =
security.filter(f => f.severity === 'warning').length;
if (criticalCount > 0 || warningCount >= 3) return 'high';
if (warningCount > 0) return 'medium';
return 'low';
}
Review Guidelines Document
Provide the AI with team-specific context:
<!-- .github/REVIEW_GUIDELINES.md -->
# Code Review Guidelines
## Our Conventions
### TypeScript
- Strict mode enabled, no `any` without comment explaining why
- Prefer `interface` over `type` for object shapes
- Use `const` assertions for literal types
### React
- Functional components only
- Colocate styles with components
- Use React Query for server state
- Error boundaries around route components
### API Design
- RESTful endpoints under `/api/v1/`
- Always return `{ data, error, meta }` wrapper
- Use Zod for request validation
### Testing
- Unit tests for business logic
- Integration tests for API endpoints
- E2E tests for critical user flows
- Minimum 80% coverage on new code
## High-Risk Areas (Require Senior Review)
- `src/services/auth/` - Authentication logic
- `src/services/billing/` - Payment processing
- `src/lib/permissions/` - Authorization
- `prisma/migrations/` - Database schema changes
## Code Owners
- `src/services/auth/*` → @security-team
- `src/services/billing/*` → @payments-team
- `src/components/ui/*` → @design-system-team
- `prisma/*` → @backend-team
## What We DON'T Review
These are enforced automatically:
- Formatting (Prettier)
- Import ordering (eslint-plugin-import)
- Unused variables (TypeScript + ESLint)
- Known security patterns (Semgrep)
## What Human Reviewers Focus On
- Is this the right solution to the problem?
- Does the architecture scale?
- Are there edge cases not covered?
- Is the error handling appropriate?
- Would a new team member understand this?
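Wiring these guidelines into the analysis is just a matter of splicing the file's contents into each specialized system prompt. A sketch (the section heading and closing instruction are wording assumptions; loading the file is left to the caller, e.g. the `review-guidelines.md` copy made in the composite action):

```typescript
// Splice team guidelines into a specialized reviewer's system prompt.
// Passing null (guidelines file absent) leaves the base prompt untouched.
function withGuidelines(basePrompt: string, guidelines: string | null): string {
  if (!guidelines) return basePrompt;
  return (
    `${basePrompt}\n\n## Team Conventions\n${guidelines}\n` +
    `Apply these conventions, and do not flag anything the ` +
    `"What We DON'T Review" section already excludes.`
  );
}
```

Because every specialized prompt goes through the same function, updating `REVIEW_GUIDELINES.md` updates all analyzers at once.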
Output Format: Summary, Not Comments
Instead of inline comments that clutter the PR, generate a single summary:
<!-- Example AI Review Summary -->
## 🤖 AI Review Summary
**PR Type:** Feature
**Risk Level:** 🟡 Medium
**Estimated Review Time:** 20 minutes
### Key Findings
#### Security (1 warning)
- **Rate limiting missing on `/api/auth/reset-password`**
`src/app/api/auth/reset-password/route.ts`
The password reset endpoint accepts unlimited requests. Consider adding rate limiting to prevent enumeration attacks.
#### Performance (1 info)
- **N+1 query potential in user list**
`src/services/user.service.ts:45`
The `getUsersWithTeams()` function fetches teams in a loop. Consider using `include` or a join.
### Suggested Focus Areas
1. Verify the rate limiting concern above
2. Check the new permission logic in `src/lib/permissions/team.ts`
3. Confirm the migration is backward compatible
---
*This summary was generated to assist human review. All findings require human verification.*
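A summary like the one above can be rendered directly from the analyzer's output rather than asked for free-form. A sketch (the field shapes mirror the `Finding` type from the analyzer; the emoji labels are cosmetic choices):

```typescript
// Finding shape, mirroring the analyzer's type for illustration.
interface Finding {
  severity: 'info' | 'warning' | 'critical';
  file: string;
  title: string;
  description: string;
}

// Render the single summary comment. Empty categories are omitted
// entirely, so a clean PR produces a short summary, not a long one.
function renderSummary(
  riskLevel: 'low' | 'medium' | 'high',
  categories: Record<string, Finding[]>
): string {
  const risk = { low: '🟢 Low', medium: '🟡 Medium', high: '🔴 High' }[riskLevel];
  const lines = ['## 🤖 AI Review Summary', `**Risk Level:** ${risk}`, '### Key Findings'];
  for (const [category, findings] of Object.entries(categories)) {
    if (findings.length === 0) continue;
    lines.push(`#### ${category} (${findings.length})`);
    for (const f of findings) {
      lines.push(`- **${f.title}**`, `  \`${f.file}\``, `  ${f.description}`);
    }
  }
  return lines.join('\n');
}
```

Deterministic rendering also makes the feedback loop easier: findings can be parsed back out of the comment by the same structure that produced them.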
Feedback Loop Implementation
Track what AI suggestions get accepted to improve over time:
// tools/ai-review/src/feedback.ts
import { Octokit } from '@octokit/rest';
interface FeedbackEvent {
prNumber: number;
findingId: string;
findingType: 'security' | 'performance' | 'testability' | 'documentation';
severity: string;
action: 'accepted' | 'rejected' | 'ignored';
reviewerComment?: string;
timestamp: string;
}
export async function collectFeedback(octokit: Octokit, repo: { owner: string; repo: string }) {
// Get recently merged PRs
const { data: prs } = await octokit.pulls.list({
...repo,
state: 'closed',
sort: 'updated',
direction: 'desc',
per_page: 50,
});
const mergedPRs = prs.filter(pr => pr.merged_at);
for (const pr of mergedPRs) {
// Get AI review comment
const { data: comments } = await octokit.issues.listComments({
...repo,
issue_number: pr.number,
});
const aiComment = comments.find(c =>
c.user?.login === 'github-actions[bot]' &&
c.body?.includes('AI Review Summary')
);
if (!aiComment) continue;
// Get human reviewer reactions/responses
const { data: reviews } = await octokit.pulls.listReviews({
...repo,
pull_number: pr.number,
});
// Parse what happened to each finding
const feedback = await analyzeFeedback(aiComment.body!, reviews, pr);
// Store for analysis
await storeFeedback(feedback);
}
}
async function analyzeFeedback(
aiComment: string,
reviews: any[],
pr: any
): Promise<FeedbackEvent[]> {
const events: FeedbackEvent[] = [];
// Parse findings from AI comment
const findings = parseAIFindings(aiComment);
for (const finding of findings) {
// Check if there were commits addressing this finding
const wasAddressed = await checkIfAddressed(finding, pr);
// Check if reviewer explicitly dismissed
const wasDismissed = reviews.some(r =>
r.body?.toLowerCase().includes(finding.title.toLowerCase()) &&
(r.body?.includes('not applicable') || r.body?.includes('false positive'))
);
events.push({
prNumber: pr.number,
findingId: finding.id,
findingType: finding.type,
severity: finding.severity,
action: wasDismissed ? 'rejected' : wasAddressed ? 'accepted' : 'ignored',
timestamp: new Date().toISOString(),
});
}
return events;
}
// Use feedback to tune prompts
export async function generatePromptAdjustments(): Promise<string[]> {
const feedback = await loadRecentFeedback();
const adjustments: string[] = [];
// Find patterns in rejected findings
const rejectedByType = groupBy(
feedback.filter(f => f.action === 'rejected'),
'findingType'
);
for (const [type, rejected] of Object.entries(rejectedByType)) {
const rejectionRate = rejected.length / feedback.filter(f => f.findingType === type).length;
if (rejectionRate > 0.5) {
adjustments.push(
`${type} findings have high rejection rate (${(rejectionRate * 100).toFixed(0)}%). ` +
`Consider being more conservative or adding specific exclusions.`
);
}
}
// Find patterns in ignored findings
const ignoredByType = groupBy(
feedback.filter(f => f.action === 'ignored'),
'findingType'
);
for (const [type, ignored] of Object.entries(ignoredByType)) {
const ignoreRate = ignored.length / feedback.filter(f => f.findingType === type).length;
if (ignoreRate > 0.7) {
adjustments.push(
`${type} findings are frequently ignored (${(ignoreRate * 100).toFixed(0)}%). ` +
`These may not be providing value.`
);
}
}
return adjustments;
}
Handling Different PR Types
Not all PRs need the same level of AI scrutiny:
// tools/ai-review/src/pr-router.ts
interface ReviewStrategy {
runSecurityAnalysis: boolean;
runPerformanceAnalysis: boolean;
runTestabilityAnalysis: boolean;
runDocumentationAnalysis: boolean;
requireHumanReview: boolean;
suggestedReviewers: string[];
}
export function determineReviewStrategy(context: PRContext): ReviewStrategy {
const labels = context.labels.map(l => l.toLowerCase());
const changedFiles = context.changedFiles;
// Documentation-only PRs
if (changedFiles.every(f => /\.(md|mdx|txt|rst)$/.test(f) || f.startsWith('docs/'))) {
return {
runSecurityAnalysis: false,
runPerformanceAnalysis: false,
runTestabilityAnalysis: false,
runDocumentationAnalysis: false, // Docs reviewing docs is circular
requireHumanReview: true,
suggestedReviewers: ['@docs-team'],
};
}
// Dependency updates (Dependabot, Renovate)
if (labels.includes('dependencies') || context.author.includes('[bot]')) {
return {
runSecurityAnalysis: true, // Always check security for deps
runPerformanceAnalysis: false,
runTestabilityAnalysis: false,
runDocumentationAnalysis: false,
requireHumanReview: hasBreakingDependencyChange(changedFiles),
suggestedReviewers: ['@security-team'],
};
}
// Hotfix/emergency PRs
if (labels.includes('hotfix') || labels.includes('emergency')) {
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: false, // Speed over thoroughness
runTestabilityAnalysis: false,
runDocumentationAnalysis: false,
requireHumanReview: true,
suggestedReviewers: ['@on-call'],
};
}
// Security-sensitive areas
if (changedFiles.some(f => isSecuritySensitive(f))) {
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: true,
runTestabilityAnalysis: true,
runDocumentationAnalysis: true,
requireHumanReview: true,
suggestedReviewers: ['@security-team', '@senior-engineers'],
};
}
// Default: full analysis
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: true,
runTestabilityAnalysis: true,
runDocumentationAnalysis: true,
requireHumanReview: true,
suggestedReviewers: determineCodeOwners(changedFiles),
};
}
function isSecuritySensitive(file: string): boolean {
const patterns = [
/auth/i,
/login/i,
/password/i,
/session/i,
/token/i,
/permission/i,
/role/i,
/admin/i,
/payment/i,
/billing/i,
/credit/i,
/api\/.*route/i,
/middleware/i,
/\.env/,
/secret/i,
/key/i,
/credential/i,
];
return patterns.some(p => p.test(file));
}
Integrating Static Analysis Results
Don't let AI duplicate what tools already do—feed static analysis results into context:
// tools/ai-review/src/static-analysis-context.ts
interface StaticAnalysisResults {
eslint: ESLintResult[];
typescript: TypeScriptDiagnostic[];
semgrep: SemgrepFinding[];
testCoverage: CoverageReport;
}
export async function gatherStaticAnalysisContext(
prFiles: string[]
): Promise<StaticAnalysisResults> {
const [eslint, typescript, semgrep, coverage] = await Promise.all([
runESLint(prFiles),
runTypeScript(prFiles),
runSemgrep(prFiles),
getCoverageReport(prFiles),
]);
return { eslint, typescript, semgrep, testCoverage: coverage };
}
export function buildContextPrompt(staticResults: StaticAnalysisResults): string {
const sections: string[] = [];
// ESLint already caught these - don't duplicate
if (staticResults.eslint.length > 0) {
sections.push(`
## Already Flagged by ESLint
The following issues are already caught by ESLint and should NOT be mentioned:
${staticResults.eslint.map(e => `- ${e.ruleId}: ${e.message} (${e.filePath}:${e.line})`).join('\n')}
`);
}
// TypeScript errors - don't duplicate
if (staticResults.typescript.length > 0) {
sections.push(`
## Already Flagged by TypeScript
These type errors are caught by the compiler:
${staticResults.typescript.map(d => `- ${d.code}: ${d.message} (${d.file}:${d.line})`).join('\n')}
`);
}
// Security findings from Semgrep
if (staticResults.semgrep.length > 0) {
sections.push(`
## Already Flagged by Semgrep
These security issues are already caught:
${staticResults.semgrep.map(f => `- ${f.check_id}: ${f.message} (${f.path}:${f.line})`).join('\n')}
`);
}
// Coverage gaps for context
if (staticResults.testCoverage.uncoveredFiles.length > 0) {
sections.push(`
## Test Coverage Context
Files with low coverage (consider mentioning if critical):
${staticResults.testCoverage.uncoveredFiles.map(f => `- ${f.path}: ${f.coverage}%`).join('\n')}
`);
}
return sections.join('\n');
}
// Use in main analyzer
export async function analyzeWithStaticContext(context: PRContext): Promise<AnalysisResult> {
const staticResults = await gatherStaticAnalysisContext(context.changedFiles);
const staticContext = buildContextPrompt(staticResults);
// Pass to AI with explicit instruction not to duplicate
const enrichedContext = {
...context,
systemAddendum: `
${staticContext}
IMPORTANT: Do not flag any issues that are listed above as "Already Flagged."
Focus only on issues that static analysis cannot catch:
- Business logic errors
- Architectural concerns
- Cross-cutting implications
- Performance patterns that require understanding intent
- Security issues beyond pattern matching
`,
};
return analyzepr(enrichedContext);
}
Rate Limiting and Cost Control
Prevent runaway API costs:
// tools/ai-review/src/cost-control.ts
interface CostLimits {
maxTokensPerPR: number;
maxPRsPerDay: number;
maxDailySpend: number;
skipLargeDiffs: number; // lines
}
const DEFAULT_LIMITS: CostLimits = {
maxTokensPerPR: 100000,
maxPRsPerDay: 100,
maxDailySpend: 50, // USD
skipLargeDiffs: 2000,
};
class CostController {
private dailyUsage = {
tokens: 0,
prs: 0,
spend: 0,
date: new Date().toDateString(),
};
async shouldProcess(context: PRContext): Promise<{ allowed: boolean; reason?: string }> {
// Reset daily counters
if (this.dailyUsage.date !== new Date().toDateString()) {
this.dailyUsage = {
tokens: 0,
prs: 0,
spend: 0,
date: new Date().toDateString(),
};
}
// Check daily PR limit
if (this.dailyUsage.prs >= DEFAULT_LIMITS.maxPRsPerDay) {
return { allowed: false, reason: 'Daily PR limit reached' };
}
// Check daily spend limit
if (this.dailyUsage.spend >= DEFAULT_LIMITS.maxDailySpend) {
return { allowed: false, reason: 'Daily spend limit reached' };
}
// Check diff size
const diffLines = context.diff.split('\n').length;
if (diffLines > DEFAULT_LIMITS.skipLargeDiffs) {
return {
allowed: false,
reason: `PR too large (${diffLines} lines). Consider breaking into smaller PRs.`,
};
}
// Estimate tokens
const estimatedTokens = estimateTokenCount(context);
if (estimatedTokens > DEFAULT_LIMITS.maxTokensPerPR) {
return {
allowed: false,
reason: `PR would exceed token limit (${estimatedTokens} estimated)`,
};
}
return { allowed: true };
}
recordUsage(tokens: number, cost: number): void {
this.dailyUsage.tokens += tokens;
this.dailyUsage.prs += 1;
this.dailyUsage.spend += cost;
}
}
function estimateTokenCount(context: PRContext): number {
// Rough estimate: 1 token ≈ 4 characters
const diffTokens = Math.ceil(context.diff.length / 4);
const contextTokens = Math.ceil(context.body.length / 4);
const systemPromptTokens = 2000; // Fixed overhead
return diffTokens + contextTokens + systemPromptTokens;
}
Self-Hosted LLM Option
For sensitive codebases that can't send code to third-party APIs, abstract the LLM behind a provider interface so you can swap the hosted API for fully local inference:
// tools/ai-review/src/providers/local.ts
import Anthropic from '@anthropic-ai/sdk';
interface LLMProvider {
complete(prompt: string, options: CompletionOptions): Promise<string>;
}
// Use Claude via local proxy or Anthropic API
class AnthropicProvider implements LLMProvider {
private client: Anthropic;
constructor() {
this.client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
}
async complete(prompt: string, options: CompletionOptions): Promise<string> {
const response = await this.client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: options.maxTokens || 4096,
messages: [
{ role: 'user', content: prompt }
],
});
return response.content[0].type === 'text'
? response.content[0].text
: '';
}
}
// Use Ollama for fully local inference
class OllamaProvider implements LLMProvider {
constructor(
private baseUrl = 'http://localhost:11434',
private model = 'codellama:34b'
) {}
async complete(prompt: string, options: CompletionOptions): Promise<string> {
const response = await fetch(`${this.baseUrl}/api/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: this.model,
prompt,
stream: false,
options: {
temperature: options.temperature || 0.2,
num_predict: options.maxTokens || 4096,
},
}),
});
if (!response.ok) {
throw new Error(`Ollama request failed: ${response.status}`);
}
const data = await response.json();
return data.response;
}
}
// Factory for selecting provider
export function createProvider(): LLMProvider {
const provider = process.env.LLM_PROVIDER || 'anthropic';
switch (provider) {
case 'ollama':
return new OllamaProvider(
process.env.OLLAMA_URL,
process.env.OLLAMA_MODEL
);
case 'anthropic':
return new AnthropicProvider();
default:
throw new Error(`Unknown provider: ${provider}`);
}
}
Metrics and Dashboards
Track the impact of AI-augmented reviews:
// tools/ai-review/src/metrics.ts
interface ReviewMetrics {
// Volume
totalPRsAnalyzed: number;
prsWithFindings: number;
findingsByCategory: Record<string, number>;
// Quality
findingsAccepted: number;
findingsRejected: number;
findingsIgnored: number;
acceptanceRate: number;
// Impact
avgTimeToFirstReview: number; // minutes
avgTimeToMerge: number;
revertedPRs: number;
securityIncidentsFromMergedPRs: number;
// Cost
totalTokensUsed: number;
totalAPICost: number;
costPerPR: number;
}
export async function computeMetrics(
startDate: Date,
endDate: Date
): Promise<ReviewMetrics> {
const feedback = await loadFeedback(startDate, endDate);
const prs = await loadPRData(startDate, endDate);
const findingsByCategory: Record<string, number> = {};
for (const f of feedback) {
findingsByCategory[f.findingType] = (findingsByCategory[f.findingType] || 0) + 1;
}
const accepted = feedback.filter(f => f.action === 'accepted').length;
const rejected = feedback.filter(f => f.action === 'rejected').length;
const ignored = feedback.filter(f => f.action === 'ignored').length;
return {
totalPRsAnalyzed: prs.length,
prsWithFindings: prs.filter(p => p.hadFindings).length,
findingsByCategory,
findingsAccepted: accepted,
findingsRejected: rejected,
findingsIgnored: ignored,
acceptanceRate: feedback.length > 0 ? accepted / (accepted + rejected + ignored) : 0,
avgTimeToFirstReview: average(prs.map(p => p.timeToFirstReview)),
avgTimeToMerge: average(prs.map(p => p.timeToMerge)),
revertedPRs: prs.filter(p => p.wasReverted).length,
securityIncidentsFromMergedPRs: await countSecurityIncidents(prs),
totalTokensUsed: sum(prs.map(p => p.tokensUsed)),
totalAPICost: sum(prs.map(p => p.apiCost)),
costPerPR: prs.length > 0 ? sum(prs.map(p => p.apiCost)) / prs.length : 0,
};
}
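`computeMetrics` relies on `average` and `sum` helpers that aren't shown; minimal versions might look like:

```typescript
// Minimal numeric helpers assumed by computeMetrics above.
function sum(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0);
}

function average(values: number[]): number {
  // Guard against empty inputs (e.g. no PRs merged in the window).
  return values.length === 0 ? 0 : sum(values) / values.length;
}
```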
// Grafana dashboard queries
const GRAFANA_DASHBOARD = `
# AI Review Metrics Dashboard
## Acceptance Rate Over Time
SELECT
date_trunc('week', timestamp) as week,
count(*) filter (where action = 'accepted') * 100.0 / count(*) as acceptance_rate
FROM ai_review_feedback
GROUP BY week
ORDER BY week;
## Findings by Category
SELECT
finding_type,
count(*) as total,
count(*) filter (where action = 'accepted') as accepted,
count(*) filter (where action = 'rejected') as rejected
FROM ai_review_feedback
WHERE timestamp > now() - interval '30 days'
GROUP BY finding_type;
## Cost Efficiency
SELECT
date_trunc('day', created_at) as day,
sum(api_cost) as daily_cost,
count(*) as prs_analyzed,
sum(api_cost) / count(*) as cost_per_pr
FROM ai_review_runs
GROUP BY day
ORDER BY day;
`;
Production Checklist
Pipeline Setup
- Gate checks run before AI analysis
- AI analysis only on non-draft PRs
- Large PR handling (skip or summarize)
- Cost controls and rate limiting
- Fallback when API unavailable
Quality Controls
- AI outputs summary, not inline comments
- Clear distinction between AI and human review
- No blocking on AI suggestions (advisory only)
- Feedback loop to track acceptance
- Regular prompt tuning based on data
Integration
- Static analysis results fed to AI context
- CODEOWNERS respected for routing
- Team guidelines document provided
- Different strategies for different PR types
Security
- API keys in GitHub Secrets
- Diff content not logged
- Option for self-hosted model
- No sensitive code sent to third parties (where policy requires it)
Observability
- Metrics collection
- Cost tracking dashboard
- Acceptance rate monitoring
- Alert on high rejection rates
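The last item above can be a small scheduled job over the same feedback records used for metrics. A sketch, with the feedback shape and the 40% threshold as assumptions:

```typescript
// Hypothetical scheduled check: alert when the rejection rate over a
// window exceeds a threshold, signalling the prompt needs retuning.
interface FeedbackEntry { action: 'accepted' | 'rejected' | 'ignored' }

function rejectionRate(feedback: FeedbackEntry[]): number {
  if (feedback.length === 0) return 0;
  const rejected = feedback.filter(f => f.action === 'rejected').length;
  return rejected / feedback.length;
}

function shouldAlert(feedback: FeedbackEntry[], threshold = 0.4): boolean {
  return rejectionRate(feedback) > threshold;
}
```

Wire `shouldAlert` to whatever paging or chat notification your team already uses; the point is that a rising rejection rate is the earliest signal the AI reviewer is drifting into noise.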
Anti-Patterns to Avoid
- AI as gatekeeper - AI should never block merges; it's advisory only
- Inline comment spam - One summary beats 50 inline comments
- Duplicating linters - Don't flag what tools already catch
- Style policing - Code style is for formatters, not AI
- Ignoring context - Team conventions matter more than general best practices
- No feedback loop - Without tracking acceptance, you can't improve
- One-size-fits-all - Hotfixes need different treatment than features
- Undermining humans - AI augments reviewers, doesn't replace them
Summary
Effective AI-augmented code review is about workflow design, not prompt engineering. The key principles:
- Layer appropriately - Static analysis → AI analysis → Human review
- Know your boundaries - AI catches patterns; humans judge architecture
- Reduce noise - Summaries over comments, categorization over volume
- Close the loop - Track what gets accepted and tune accordingly
- Respect authority - Human reviewers have final say
The goal isn't to automate code review—it's to focus human attention where it matters most. A senior engineer's time is better spent discussing trade-offs than spotting missing null checks. AI handles the checklist; humans handle the judgment.