AI-Augmented Code Reviews: Architecting the Workflow, Not Just the Prompt
The naive approach to AI code review is seductive: point an LLM at a diff, ask it to "review this code," and paste the output as a comment. This creates noise, erodes trust, and eventually gets ignored. The challenge isn't getting an LLM to generate review comments—it's architecting a system where AI augments human judgment without drowning it.
This guide covers building an AI-augmented review pipeline that knows its boundaries, integrates with existing tooling, and keeps senior engineers focused on what matters: architecture, design, and mentorship.
The Problem with Naive AI Reviews
┌─────────────────────────────────────────────────────────────────────┐
│ NAIVE AI REVIEW ANTI-PATTERNS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PR Opened │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ "Review this code and find all issues" │ │
│ │ │ │
│ │ LLM Response: │ │
│ │ 1. Consider adding a docstring here │ │
│ │ 2. This variable name could be more descriptive │ │
│ │ 3. You might want to handle the edge case where... │ │
│ │ 4. Consider using const instead of let │ │
│ │ 5. This function could be split into smaller functions │ │
│ │ 6. Add error handling for network failures │ │
│ │ 7. Consider adding unit tests │ │
│ │ 8. The indentation on line 47 seems inconsistent │ │
│ │ ... 47 more suggestions ... │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Problems: │
│ • No prioritization (style nits mixed with security issues) │
│ • Duplicates what linters already catch │
│ • No context about codebase conventions │
│ • No understanding of PR intent │
│ • Creates alert fatigue → gets ignored │
│ • Undermines human reviewers' authority │
│ • No feedback loop for improvement │
│ │
└─────────────────────────────────────────────────────────────────────┘
What Goes Wrong
- Signal-to-noise ratio collapses - 50 comments where 3 matter
- Redundancy with existing tools - AI suggests what ESLint already enforces
- Context blindness - AI doesn't know your team's conventions
- Authority confusion - Is this a suggestion or a requirement?
- No learning - Same unhelpful comments on every PR
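A first-order fix for the noise problem is mechanical: rank findings by severity and cap how many survive before anything reaches the PR. A minimal sketch (the `Finding` shape and the cap of 5 are illustrative choices, not prescriptions):

```typescript
// Minimal finding shape for illustration (mirrors the analyzer types
// introduced later in this guide).
interface Finding {
  severity: 'info' | 'warning' | 'critical';
  title: string;
}

const SEVERITY_RANK: Record<Finding['severity'], number> = {
  critical: 0,
  warning: 1,
  info: 2,
};

// Sort severity-first and keep at most `cap` findings, so the reviewer
// sees the three issues that matter rather than fifty that don't.
function prioritize(findings: Finding[], cap = 5): Finding[] {
  return [...findings]
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, cap);
}
```

Capping is crude on its own, but combined with the routing and dedup strategies below it keeps the comment volume bounded regardless of how chatty the model is.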
The Right Mental Model
AI should handle what computers do best, freeing humans for what they do best:
┌─────────────────────────────────────────────────────────────────────┐
│ RESPONSIBILITY BOUNDARIES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ AUTOMATED (No Human Needed) HUMAN JUDGMENT REQUIRED │
│ ════════════════════════════ ════════════════════════ │
│ │
│ Static Analysis (ESLint, etc.) Architecture decisions │
│ ├─ Syntax errors ├─ Does this belong here? │
│ ├─ Style violations ├─ Is this the right │
│ ├─ Unused variables │ abstraction? │
│ └─ Import sorting └─ Will this scale? │
│ │
│ Type Checking (TypeScript) Design patterns │
│ ├─ Type mismatches ├─ Is this idiomatic? │
│ ├─ Null safety ├─ Does it follow our │
│ └─ Interface compliance │ conventions? │
│ │
│ Security Scanning (Semgrep) Business logic │
│ ├─ Known vulnerabilities ├─ Does this solve the │
│ ├─ Injection patterns │ actual problem? │
│ └─ Secrets detection ├─ Edge cases? │
│ └─ Correctness? │
│ │
│ AI-AUGMENTED (Human Reviews Output) Mentorship │
│ ═══════════════════════════════════ ├─ Teaching opportunities │
│ ├─ Career growth │
│ Complexity analysis └─ Team knowledge sharing │
│ Documentation gaps │
│ Test coverage suggestions Trade-off decisions │
│ Performance red flags ├─ Tech debt vs velocity │
│ Cross-cutting concern detection ├─ Consistency vs progress │
│ └─ Scope of change │
│ │
└─────────────────────────────────────────────────────────────────────┘
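This split can also be made executable. A hypothetical routing table (the concern names and layer labels are assumptions, not a standard taxonomy) lets the pipeline refuse to send linter-territory concerns to the LLM at all:

```typescript
// Each review concern maps to the layer responsible for it, so the
// pipeline never asks the LLM to re-litigate what a linter enforces.
type Layer = 'automated' | 'ai-augmented' | 'human';

const RESPONSIBILITY: Record<string, Layer> = {
  'style': 'automated',         // ESLint / Prettier
  'types': 'automated',         // tsc
  'known-vulns': 'automated',   // Semgrep
  'complexity': 'ai-augmented',
  'test-gaps': 'ai-augmented',
  'performance': 'ai-augmented',
  'architecture': 'human',
  'business-logic': 'human',
  'mentorship': 'human',
};

function layerFor(concern: string): Layer {
  // Unknown concerns default to human judgment, never silent automation.
  return RESPONSIBILITY[concern] ?? 'human';
}
```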
Pipeline Architecture
A well-designed review pipeline runs checks in stages, each with clear responsibilities:
┌─────────────────────────────────────────────────────────────────────┐
│ REVIEW PIPELINE STAGES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PR Opened/Updated │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 1: GATE CHECKS (Block if fail) │ │
│ │ ├─ CI/CD (tests, build) │ │
│ │ ├─ Linting (ESLint, Prettier) │ │
│ │ ├─ Type checking (tsc --noEmit) │ │
│ │ └─ Security scan (Semgrep, Snyk) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ (Only if Stage 1 passes) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 2: AI TRIAGE (Categorize & Filter) │ │
│ │ ├─ Classify PR type (feature/bugfix/refactor/docs) │ │
│ │ ├─ Identify high-risk areas (auth, payments, data) │ │
│ │ ├─ Detect cross-cutting concerns │ │
│ │ └─ Flag for specific reviewer expertise │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 3: AI ANALYSIS (Generate insights, not comments) │ │
│ │ ├─ Complexity delta analysis │ │
│ │ ├─ Test coverage gaps │ │
│ │ ├─ Documentation requirements │ │
│ │ ├─ Performance implications │ │
│ │ └─ Summary for reviewer │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 4: HUMAN REVIEW (Final authority) │ │
│ │ ├─ Reviewer sees AI summary (not raw comments) │ │
│ │ ├─ Reviewer has full context │ │
│ │ ├─ Reviewer makes approve/request changes decision │ │
│ │ └─ AI suggestions are optional, not blocking │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Stage 5: FEEDBACK LOOP │ │
│ │ ├─ Track which AI suggestions were accepted │ │
│ │ ├─ Learn from reviewer corrections │ │
│ │ └─ Tune prompts based on team patterns │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
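The staged flow above can be sketched as a small runner. Stages here are simplified to synchronous booleans for illustration; a real Stage 1 would shell out to CI, linters, and scanners:

```typescript
// A stage either passes (true) or fails (false). Only blocking stages
// (Stage 1 gate checks) stop the pipeline; AI stages never block.
interface Stage {
  name: string;
  blocking: boolean;
  run: () => boolean;
}

// Runs stages in order and returns the names of stages that executed.
function runPipeline(stages: Stage[]): string[] {
  const completed: string[] = [];
  for (const stage of stages) {
    const passed = stage.run();
    completed.push(stage.name);
    if (!passed && stage.blocking) break; // gate failure halts everything downstream
  }
  return completed;
}
```

The key property is that a failing gate check means the AI stages never run at all, which is exactly the ordering the GitHub Actions workflow below enforces with `needs:`.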
Implementation: GitHub Actions Workflow
Main Workflow Configuration
# .github/workflows/pr-review-pipeline.yml
name: PR Review Pipeline
on:
pull_request:
types: [opened, synchronize, ready_for_review]
permissions:
contents: read
pull-requests: write
checks: write
jobs:
# Stage 1: Gate Checks (must pass before AI runs)
gate-checks:
name: Gate Checks
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Type check
run: npm run typecheck
- name: Lint
run: npm run lint
- name: Test
run: npm run test:ci
- name: Security scan
uses: returntocorp/semgrep-action@v1
with:
config: >-
p/security-audit
p/secrets
p/owasp-top-ten
# Stage 2 & 3: AI Analysis (only runs after gates pass)
ai-analysis:
name: AI Analysis
needs: gate-checks
runs-on: ubuntu-latest
if: github.event.pull_request.draft == false
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Need full history for diff
- name: Get PR diff
id: diff
run: |
git diff origin/${{ github.base_ref }}...HEAD > pr.diff
echo "diff_size=$(wc -l < pr.diff)" >> $GITHUB_OUTPUT
# Note: a step that runs `exit 0` does not skip later steps, so the
# AI analysis step below is gated on the same size condition instead.
- name: Skip large PRs
if: steps.diff.outputs.diff_size > 2000
run: |
echo "PR too large for AI review (${{ steps.diff.outputs.diff_size }} lines); skipping AI analysis"
echo "Consider breaking it into smaller PRs"
- name: Run AI analysis
if: steps.diff.outputs.diff_size <= 2000
id: ai-review
uses: ./.github/actions/ai-review
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
diff-file: pr.diff
pr-number: ${{ github.event.pull_request.number }}
- name: Post review summary
if: steps.ai-review.outputs.has-findings == 'true'
uses: actions/github-script@v7
with:
script: |
const summary = require('./ai-review-output.json');
await github.rest.pulls.createReview({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: context.issue.number,
event: 'COMMENT',
body: summary.reviewerSummary
});
Custom AI Review Action
# .github/actions/ai-review/action.yml
name: 'AI Code Review'
description: 'Runs AI-augmented code review analysis'
inputs:
github-token:
description: 'GitHub token'
required: true
openai-api-key:
description: 'OpenAI API key'
required: true
diff-file:
description: 'Path to diff file'
required: true
pr-number:
description: 'PR number'
required: true
outputs:
has-findings:
description: 'Whether the analysis found anything notable'
value: ${{ steps.analyze.outputs.has-findings }}
runs:
using: 'composite'
steps:
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install analysis tools
shell: bash
run: npm install -g @company/ai-review-cli
- name: Gather context
id: context
shell: bash
run: |
# Get PR metadata
gh pr view ${{ inputs.pr-number }} --json title,body,labels,author > pr-metadata.json
# Get recent commits for context
git log --oneline -20 > recent-commits.txt
# Get the list of changed file paths
git diff --name-only origin/${{ github.base_ref }}...HEAD > changed-files.txt
# Get codebase conventions (if exists)
if [ -f ".github/REVIEW_GUIDELINES.md" ]; then
cp .github/REVIEW_GUIDELINES.md review-guidelines.md
fi
env:
GH_TOKEN: ${{ inputs.github-token }}
- name: Run analysis
id: analyze
shell: bash
run: |
# Only pass --guidelines if the file was copied in the previous step
ARGS=(--diff "${{ inputs.diff-file }}" --context pr-metadata.json --output ai-review-output.json)
if [ -f review-guidelines.md ]; then
ARGS+=(--guidelines review-guidelines.md)
fi
ai-review analyze "${ARGS[@]}"
if [ -s ai-review-output.json ]; then
echo "has-findings=true" >> $GITHUB_OUTPUT
else
echo "has-findings=false" >> $GITHUB_OUTPUT
fi
env:
OPENAI_API_KEY: ${{ inputs.openai-api-key }}
The Analysis Engine
The core analysis tool runs multiple specialized prompts, each focused on a specific review concern.
// tools/ai-review/src/analyzer.ts
import OpenAI from 'openai';
import { readFileSync } from 'fs';
interface PRContext {
title: string;
body: string;
labels: string[];
author: string;
diff: string;
changedFiles: string[];
guidelines?: string;
}
interface AnalysisResult {
reviewerSummary: string;
riskLevel: 'low' | 'medium' | 'high';
categories: {
security: Finding[];
performance: Finding[];
testability: Finding[];
documentation: Finding[];
};
suggestedReviewers: string[];
estimatedReviewTime: string;
}
interface Finding {
severity: 'info' | 'warning' | 'critical';
file: string;
line?: number;
title: string;
description: string;
suggestion?: string;
}
const openai = new OpenAI();
export async function analyzepr(context: PRContext): Promise<AnalysisResult> {
// Run specialized analyses in parallel
const [
classification,
securityAnalysis,
performanceAnalysis,
testabilityAnalysis,
documentationAnalysis,
] = await Promise.all([
classifyPR(context),
analyzeSecurityImplications(context),
analyzePerformanceImplications(context),
analyzeTestability(context),
analyzeDocumentationNeeds(context),
]);
// Generate reviewer summary (not raw findings)
const summary = await generateReviewerSummary({
classification,
security: securityAnalysis,
performance: performanceAnalysis,
testability: testabilityAnalysis,
documentation: documentationAnalysis,
context,
});
return {
reviewerSummary: summary,
riskLevel: calculateRiskLevel(securityAnalysis, performanceAnalysis),
categories: {
security: securityAnalysis,
performance: performanceAnalysis,
testability: testabilityAnalysis,
documentation: documentationAnalysis,
},
suggestedReviewers: classification.suggestedReviewers,
estimatedReviewTime: classification.estimatedReviewTime,
};
}
async function classifyPR(context: PRContext): Promise<{
type: string;
riskAreas: string[];
suggestedReviewers: string[];
estimatedReviewTime: string;
}> {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.1, // Low temperature for classification
messages: [
{
role: 'system',
content: `You are a PR triage system. Classify PRs and identify review requirements.
Your job is to:
1. Classify the PR type (feature, bugfix, refactor, docs, deps, config)
2. Identify high-risk areas that need careful review
3. Suggest reviewers based on file ownership patterns
4. Estimate review time
Output JSON only, no explanation.`,
},
{
role: 'user',
content: `
PR Title: ${context.title}
PR Description: ${context.body}
Labels: ${context.labels.join(', ')}
Changed Files: ${context.changedFiles.join('\n')}
Classify this PR.`,
},
],
response_format: { type: 'json_object' },
});
return JSON.parse(response.choices[0].message.content!);
}
async function analyzeSecurityImplications(context: PRContext): Promise<Finding[]> {
// Skip if no security-relevant files changed
const securityRelevantPatterns = [
/auth/i, /login/i, /password/i, /token/i, /secret/i,
/payment/i, /billing/i, /credit/i,
/admin/i, /permission/i, /role/i,
/api.*route/i, /middleware/i,
/\.env/, /config/i,
];
const hasSecurityRelevantChanges = context.changedFiles.some(file =>
securityRelevantPatterns.some(pattern => pattern.test(file))
);
if (!hasSecurityRelevantChanges) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.2,
messages: [
{
role: 'system',
content: `You are a security-focused code reviewer. Your ONLY job is to identify potential security issues.
DO NOT comment on:
- Code style
- Performance (unless it's a DoS vector)
- Documentation
- General best practices
ONLY flag:
- Authentication/authorization bypasses
- Injection vulnerabilities (SQL, XSS, command)
- Sensitive data exposure
- Insecure cryptography
- Access control issues
- Input validation gaps in security-critical paths
If you find nothing security-relevant, return an empty array.
Be specific about the vulnerability, not vague concerns.
Output a JSON object of the form {"findings": [...]}; each finding has: severity, file, line (if known), title, description, suggestion`,
},
{
role: 'user',
content: `Review this diff for security issues:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzePerformanceImplications(context: PRContext): Promise<Finding[]> {
// Only analyze if there are code changes (not just docs/config)
const codeExtensions = /\.(ts|tsx|js|jsx|py|go|rs|java)$/;
const hasCodeChanges = context.changedFiles.some(f => codeExtensions.test(f));
if (!hasCodeChanges) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.2,
messages: [
{
role: 'system',
content: `You are a performance-focused code reviewer. Identify potential performance issues.
Focus on:
- N+1 query patterns
- Unbounded loops or recursion
- Missing pagination on list endpoints
- Large payload serialization
- Synchronous operations that should be async
- Missing caching opportunities for expensive operations
- Memory leaks (event listeners, subscriptions)
DO NOT flag:
- Micro-optimizations
- Style preferences
- Theoretical issues without real impact
Only flag issues that would materially impact user experience or costs.
Output a JSON object of the form {"findings": [...]}; each finding has: severity, file, line, title, description, suggestion`,
},
{
role: 'user',
content: `Review for performance issues:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzeTestability(context: PRContext): Promise<Finding[]> {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `You are reviewing test coverage. Your job is NOT to write tests, but to identify gaps.
Flag when:
- New public functions lack corresponding tests
- Complex branching logic isn't tested
- Error paths aren't tested
- Integration points (API calls, DB) lack tests
DO NOT:
- Suggest specific test implementations
- Flag private/internal functions
- Require 100% coverage
Output a JSON object of the form {"findings": [...]}; each finding has: severity (info for suggestions, warning for gaps), file, title, description`,
},
{
role: 'user',
content: `Review test coverage:\n\nChanged files:\n${context.changedFiles.join('\n')}\n\nDiff:\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function analyzeDocumentationNeeds(context: PRContext): Promise<Finding[]> {
// Only flag documentation for public API changes
const isPublicAPIChange = context.changedFiles.some(f =>
/^(src\/api|src\/lib|packages\/.*\/src)/.test(f)
);
if (!isPublicAPIChange) {
return [];
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.3,
messages: [
{
role: 'system',
content: `You review documentation needs for public API changes.
Flag when:
- New exported functions lack JSDoc
- Breaking changes aren't documented
- New configuration options lack explanation
- New environment variables aren't documented
DO NOT flag:
- Internal functions
- Self-explanatory code
- Minor changes
Output a JSON object of the form {"findings": [...]}; each finding has: severity (info), file, title, description`,
},
{
role: 'user',
content: `Review documentation needs:\n\n${context.diff}`,
},
],
response_format: { type: 'json_object' },
});
const result = JSON.parse(response.choices[0].message.content!);
return result.findings || [];
}
async function generateReviewerSummary(analysis: {
classification: any;
security: Finding[];
performance: Finding[];
testability: Finding[];
documentation: Finding[];
context: PRContext;
}): Promise<string> {
const criticalCount =
analysis.security.filter(f => f.severity === 'critical').length +
analysis.performance.filter(f => f.severity === 'critical').length;
const warningCount =
analysis.security.filter(f => f.severity === 'warning').length +
analysis.performance.filter(f => f.severity === 'warning').length +
analysis.testability.filter(f => f.severity === 'warning').length;
// Don't generate a summary if there's nothing notable
if (criticalCount === 0 && warningCount === 0) {
return '';
}
const response = await openai.chat.completions.create({
model: 'gpt-4o',
temperature: 0.4,
messages: [
{
role: 'system',
content: `You write concise review summaries for human reviewers.
Guidelines:
- Be brief (max 200 words)
- Lead with the most important finding
- Group related issues
- Don't repeat what static analysis already caught
- End with what the human reviewer should focus on
- Use markdown formatting
This is a SUMMARY for the reviewer, not the full review.`,
},
{
role: 'user',
content: `
PR: ${analysis.context.title}
Type: ${analysis.classification.type}
Risk Areas: ${analysis.classification.riskAreas.join(', ')}
Findings:
${JSON.stringify({
security: analysis.security,
performance: analysis.performance,
testability: analysis.testability,
documentation: analysis.documentation,
}, null, 2)}
Write a summary for the human reviewer.`,
},
],
});
return response.choices[0].message.content!;
}
function calculateRiskLevel(
security: Finding[],
performance: Finding[]
): 'low' | 'medium' | 'high' {
const criticalCount =
security.filter(f => f.severity === 'critical').length +
performance.filter(f => f.severity === 'critical').length;
const warningCount =
security.filter(f => f.severity === 'warning').length;
if (criticalCount > 0 || warningCount >= 3) return 'high';
if (warningCount > 0) return 'medium';
return 'low';
}
Review Guidelines Document
Provide the AI with team-specific context:
<!-- .github/REVIEW_GUIDELINES.md -->
# Code Review Guidelines
## Our Conventions
### TypeScript
- Strict mode enabled, no `any` without comment explaining why
- Prefer `interface` over `type` for object shapes
- Use `const` assertions for literal types
### React
- Functional components only
- Colocate styles with components
- Use React Query for server state
- Error boundaries around route components
### API Design
- RESTful endpoints under `/api/v1/`
- Always return `{ data, error, meta }` wrapper
- Use Zod for request validation
### Testing
- Unit tests for business logic
- Integration tests for API endpoints
- E2E tests for critical user flows
- Minimum 80% coverage on new code
## High-Risk Areas (Require Senior Review)
- `src/services/auth/` - Authentication logic
- `src/services/billing/` - Payment processing
- `src/lib/permissions/` - Authorization
- `prisma/migrations/` - Database schema changes
## Code Owners
- `src/services/auth/*` → @security-team
- `src/services/billing/*` → @payments-team
- `src/components/ui/*` → @design-system-team
- `prisma/*` → @backend-team
## What We DON'T Review
These are enforced automatically:
- Formatting (Prettier)
- Import ordering (eslint-plugin-import)
- Unused variables (TypeScript + ESLint)
- Known security patterns (Semgrep)
## What Human Reviewers Focus On
- Is this the right solution to the problem?
- Does the architecture scale?
- Are there edge cases not covered?
- Is the error handling appropriate?
- Would a new team member understand this?
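Wiring these guidelines into the analysis is just a matter of splicing the file's contents into each specialized system prompt. A sketch (the section heading and closing instruction are wording assumptions; loading the file is left to the caller, e.g. the `review-guidelines.md` copy made in the composite action):

```typescript
// Splice team guidelines into a specialized reviewer's system prompt.
// Passing null (guidelines file absent) leaves the base prompt untouched.
function withGuidelines(basePrompt: string, guidelines: string | null): string {
  if (!guidelines) return basePrompt;
  return (
    `${basePrompt}\n\n## Team Conventions\n${guidelines}\n` +
    `Apply these conventions, and do not flag anything the ` +
    `"What We DON'T Review" section already excludes.`
  );
}
```

Because every specialized prompt goes through the same function, updating `REVIEW_GUIDELINES.md` updates all analyzers at once.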
Output Format: Summary, Not Comments
Instead of inline comments that clutter the PR, generate a single summary:
<!-- Example AI Review Summary -->
## 🤖 AI Review Summary
**PR Type:** Feature
**Risk Level:** 🟡 Medium
**Estimated Review Time:** 20 minutes
### Key Findings
#### Security (1 warning)
- **Rate limiting missing on `/api/auth/reset-password`**
`src/app/api/auth/reset-password/route.ts`
The password reset endpoint accepts unlimited requests. Consider adding rate limiting to prevent enumeration attacks.
#### Performance (1 info)
- **N+1 query potential in user list**
`src/services/user.service.ts:45`
The `getUsersWithTeams()` function fetches teams in a loop. Consider using `include` or a join.
### Suggested Focus Areas
1. Verify the rate limiting concern above
2. Check the new permission logic in `src/lib/permissions/team.ts`
3. Confirm the migration is backward compatible
---
*This summary was generated to assist human review. All findings require human verification.*
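A summary like the one above can be rendered directly from the analyzer's output rather than asked for free-form. A sketch (the field shapes mirror the `Finding` type from the analyzer; the emoji labels are cosmetic choices):

```typescript
// Finding shape, mirroring the analyzer's type for illustration.
interface Finding {
  severity: 'info' | 'warning' | 'critical';
  file: string;
  title: string;
  description: string;
}

// Render the single summary comment. Empty categories are omitted
// entirely, so a clean PR produces a short summary, not a long one.
function renderSummary(
  riskLevel: 'low' | 'medium' | 'high',
  categories: Record<string, Finding[]>
): string {
  const risk = { low: '🟢 Low', medium: '🟡 Medium', high: '🔴 High' }[riskLevel];
  const lines = ['## 🤖 AI Review Summary', `**Risk Level:** ${risk}`, '### Key Findings'];
  for (const [category, findings] of Object.entries(categories)) {
    if (findings.length === 0) continue;
    lines.push(`#### ${category} (${findings.length})`);
    for (const f of findings) {
      lines.push(`- **${f.title}**`, `  \`${f.file}\``, `  ${f.description}`);
    }
  }
  return lines.join('\n');
}
```

Deterministic rendering also makes the feedback loop easier: findings can be parsed back out of the comment by the same structure that produced them.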
Feedback Loop Implementation
Track what AI suggestions get accepted to improve over time:
// tools/ai-review/src/feedback.ts
import { Octokit } from '@octokit/rest';
interface FeedbackEvent {
prNumber: number;
findingId: string;
findingType: 'security' | 'performance' | 'testability' | 'documentation';
severity: string;
action: 'accepted' | 'rejected' | 'ignored';
reviewerComment?: string;
timestamp: string;
}
export async function collectFeedback(octokit: Octokit, repo: { owner: string; repo: string }) {
// Get recently merged PRs
const { data: prs } = await octokit.pulls.list({
...repo,
state: 'closed',
sort: 'updated',
direction: 'desc',
per_page: 50,
});
const mergedPRs = prs.filter(pr => pr.merged_at);
for (const pr of mergedPRs) {
// Get AI review comment
const { data: comments } = await octokit.issues.listComments({
...repo,
issue_number: pr.number,
});
const aiComment = comments.find(c =>
c.user?.login === 'github-actions[bot]' &&
c.body?.includes('AI Review Summary')
);
if (!aiComment) continue;
// Get human reviewer reactions/responses
const { data: reviews } = await octokit.pulls.listReviews({
...repo,
pull_number: pr.number,
});
// Parse what happened to each finding
const feedback = await analyzeFeedback(aiComment.body!, reviews, pr);
// Store for analysis
await storeFeedback(feedback);
}
}
async function analyzeFeedback(
aiComment: string,
reviews: any[],
pr: any
): Promise<FeedbackEvent[]> {
const events: FeedbackEvent[] = [];
// Parse findings from AI comment
const findings = parseAIFindings(aiComment);
for (const finding of findings) {
// Check if there were commits addressing this finding
const wasAddressed = await checkIfAddressed(finding, pr);
// Check if reviewer explicitly dismissed
const wasDismissed = reviews.some(r =>
r.body?.toLowerCase().includes(finding.title.toLowerCase()) &&
(r.body?.includes('not applicable') || r.body?.includes('false positive'))
);
events.push({
prNumber: pr.number,
findingId: finding.id,
findingType: finding.type,
severity: finding.severity,
action: wasDismissed ? 'rejected' : wasAddressed ? 'accepted' : 'ignored',
timestamp: new Date().toISOString(),
});
}
return events;
}
// Use feedback to tune prompts
export async function generatePromptAdjustments(): Promise<string[]> {
const feedback = await loadRecentFeedback();
const adjustments: string[] = [];
// Find patterns in rejected findings
const rejectedByType = groupBy(
feedback.filter(f => f.action === 'rejected'),
'findingType'
);
for (const [type, rejected] of Object.entries(rejectedByType)) {
const rejectionRate = rejected.length / feedback.filter(f => f.findingType === type).length;
if (rejectionRate > 0.5) {
adjustments.push(
`${type} findings have high rejection rate (${(rejectionRate * 100).toFixed(0)}%). ` +
`Consider being more conservative or adding specific exclusions.`
);
}
}
// Find patterns in ignored findings
const ignoredByType = groupBy(
feedback.filter(f => f.action === 'ignored'),
'findingType'
);
for (const [type, ignored] of Object.entries(ignoredByType)) {
const ignoreRate = ignored.length / feedback.filter(f => f.findingType === type).length;
if (ignoreRate > 0.7) {
adjustments.push(
`${type} findings are frequently ignored (${(ignoreRate * 100).toFixed(0)}%). ` +
`These may not be providing value.`
);
}
}
return adjustments;
}
Handling Different PR Types
Not all PRs need the same level of AI scrutiny:
// tools/ai-review/src/pr-router.ts
interface ReviewStrategy {
runSecurityAnalysis: boolean;
runPerformanceAnalysis: boolean;
runTestabilityAnalysis: boolean;
runDocumentationAnalysis: boolean;
requireHumanReview: boolean;
suggestedReviewers: string[];
}
export function determineReviewStrategy(context: PRContext): ReviewStrategy {
const labels = context.labels.map(l => l.toLowerCase());
const changedFiles = context.changedFiles;
// Documentation-only PRs
if (changedFiles.every(f => /\.(md|mdx|txt|rst)$/.test(f) || f.startsWith('docs/'))) {
return {
runSecurityAnalysis: false,
runPerformanceAnalysis: false,
runTestabilityAnalysis: false,
runDocumentationAnalysis: false, // Docs reviewing docs is circular
requireHumanReview: true,
suggestedReviewers: ['@docs-team'],
};
}
// Dependency updates (Dependabot, Renovate)
if (labels.includes('dependencies') || context.author.includes('[bot]')) {
return {
runSecurityAnalysis: true, // Always check security for deps
runPerformanceAnalysis: false,
runTestabilityAnalysis: false,
runDocumentationAnalysis: false,
requireHumanReview: hasBreakingDependencyChange(changedFiles),
suggestedReviewers: ['@security-team'],
};
}
// Hotfix/emergency PRs
if (labels.includes('hotfix') || labels.includes('emergency')) {
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: false, // Speed over thoroughness
runTestabilityAnalysis: false,
runDocumentationAnalysis: false,
requireHumanReview: true,
suggestedReviewers: ['@on-call'],
};
}
// Security-sensitive areas
if (changedFiles.some(f => isSecuritySensitive(f))) {
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: true,
runTestabilityAnalysis: true,
runDocumentationAnalysis: true,
requireHumanReview: true,
suggestedReviewers: ['@security-team', '@senior-engineers'],
};
}
// Default: full analysis
return {
runSecurityAnalysis: true,
runPerformanceAnalysis: true,
runTestabilityAnalysis: true,
runDocumentationAnalysis: true,
requireHumanReview: true,
suggestedReviewers: determineCodeOwners(changedFiles),
};
}
function isSecuritySensitive(file: string): boolean {
const patterns = [
/auth/i,
/login/i,
/password/i,
/session/i,
/token/i,
/permission/i,
/role/i,
/admin/i,
/payment/i,
/billing/i,
/credit/i,
/api\/.*route/i,
/middleware/i,
/\.env/,
/secret/i,
/key/i,
/credential/i,
];
return patterns.some(p => p.test(file));
}
Integrating Static Analysis Results
Don't let AI duplicate what tools already do—feed static analysis results into context:
// tools/ai-review/src/static-analysis-context.ts
interface StaticAnalysisResults {
eslint: ESLintResult[];
typescript: TypeScriptDiagnostic[];
semgrep: SemgrepFinding[];
testCoverage: CoverageReport;
}
export async function gatherStaticAnalysisContext(
prFiles: string[]
): Promise<StaticAnalysisResults> {
const [eslint, typescript, semgrep, coverage] = await Promise.all([
runESLint(prFiles),
runTypeScript(prFiles),
runSemgrep(prFiles),
getCoverageReport(prFiles),
]);
return { eslint, typescript, semgrep, testCoverage: coverage };
}
export function buildContextPrompt(staticResults: StaticAnalysisResults): string {
const sections: string[] = [];
// ESLint already caught these - don't duplicate
if (staticResults.eslint.length > 0) {
sections.push(`
## Already Flagged by ESLint
The following issues are already caught by ESLint and should NOT be mentioned:
${staticResults.eslint.map(e => `- ${e.ruleId}: ${e.message} (${e.filePath}:${e.line})`).join('\n')}
`);
}
// TypeScript errors - don't duplicate
if (staticResults.typescript.length > 0) {
sections.push(`
## Already Flagged by TypeScript
These type errors are caught by the compiler:
${staticResults.typescript.map(d => `- ${d.code}: ${d.message} (${d.file}:${d.line})`).join('\n')}
`);
}
// Security findings from Semgrep
if (staticResults.semgrep.length > 0) {
sections.push(`
## Already Flagged by Semgrep
These security issues are already caught:
${staticResults.semgrep.map(f => `- ${f.check_id}: ${f.message} (${f.path}:${f.line})`).join('\n')}
`);
}
// Coverage gaps for context
if (staticResults.testCoverage.uncoveredFiles.length > 0) {
sections.push(`
## Test Coverage Context
Files with low coverage (consider mentioning if critical):
${staticResults.testCoverage.uncoveredFiles.map(f => `- ${f.path}: ${f.coverage}%`).join('\n')}
`);
}
return sections.join('\n');
}
// Use in main analyzer
export async function analyzeWithStaticContext(context: PRContext): Promise<AnalysisResult> {
const staticResults = await gatherStaticAnalysisContext(context.changedFiles);
const staticContext = buildContextPrompt(staticResults);
// Pass to AI with explicit instruction not to duplicate
const enrichedContext = {
...context,
systemAddendum: `
${staticContext}
IMPORTANT: Do not flag any issues that are listed above as "Already Flagged."
Focus only on issues that static analysis cannot catch:
- Business logic errors
- Architectural concerns
- Cross-cutting implications
- Performance patterns that require understanding intent
- Security issues beyond pattern matching
`,
};
return analyzepr(enrichedContext);
}
Rate Limiting and Cost Control
Prevent runaway API costs:
// tools/ai-review/src/cost-control.ts
interface CostLimits {
maxTokensPerPR: number;
maxPRsPerDay: number;
maxDailySpend: number;
skipLargeDiffs: number; // lines
}
const DEFAULT_LIMITS: CostLimits = {
maxTokensPerPR: 100000,
maxPRsPerDay: 100,
maxDailySpend: 50, // USD
skipLargeDiffs: 2000,
};
class CostController {
private dailyUsage = {
tokens: 0,
prs: 0,
spend: 0,
date: new Date().toDateString(),
};
async shouldProcess(context: PRContext): Promise<{ allowed: boolean; reason?: string }> {
// Reset daily counters
if (this.dailyUsage.date !== new Date().toDateString()) {
this.dailyUsage = {
tokens: 0,
prs: 0,
spend: 0,
date: new Date().toDateString(),
};
}
// Check daily PR limit
if (this.dailyUsage.prs >= DEFAULT_LIMITS.maxPRsPerDay) {
return { allowed: false, reason: 'Daily PR limit reached' };
}
// Check daily spend limit
if (this.dailyUsage.spend >= DEFAULT_LIMITS.maxDailySpend) {
return { allowed: false, reason: 'Daily spend limit reached' };
}
// Check diff size
const diffLines = context.diff.split('\n').length;
if (diffLines > DEFAULT_LIMITS.skipLargeDiffs) {
return {
allowed: false,
reason: `PR too large (${diffLines} lines). Consider breaking into smaller PRs.`,
};
}
// Estimate tokens
const estimatedTokens = estimateTokenCount(context);
if (estimatedTokens > DEFAULT_LIMITS.maxTokensPerPR) {
return {
allowed: false,
reason: `PR would exceed token limit (${estimatedTokens} estimated)`,
};
}
return { allowed: true };
}
recordUsage(tokens: number, cost: number): void {
this.dailyUsage.tokens += tokens;
this.dailyUsage.prs += 1;
this.dailyUsage.spend += cost;
}
}
function estimateTokenCount(context: PRContext): number {
// Rough estimate: 1 token ≈ 4 characters
const diffTokens = Math.ceil(context.diff.length / 4);
const contextTokens = Math.ceil(context.body.length / 4);
const systemPromptTokens = 2000; // Fixed overhead
return diffTokens + contextTokens + systemPromptTokens;
}
Self-Hosted LLM Option
For sensitive codebases that can't send code to third-party APIs, abstract the LLM behind a provider interface so you can swap the hosted API for fully local inference:
// tools/ai-review/src/providers/local.ts
import Anthropic from '@anthropic-ai/sdk';
interface LLMProvider {
complete(prompt: string, options: CompletionOptions): Promise<string>;
}
// Use Claude via local proxy or Anthropic API
class AnthropicProvider implements LLMProvider {
private client: Anthropic;
constructor() {
this.client = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
}
async complete(prompt: string, options: CompletionOptions): Promise<string> {
const response = await this.client.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: options.maxTokens || 4096,
messages: [
{ role: 'user', content: prompt }
],
});
return response.content[0].type === 'text'
? response.content[0].text
: '';
}
}
// Use Ollama for fully local inference
class OllamaProvider implements LLMProvider {
constructor(
private baseUrl = 'http://localhost:11434',
private model = 'codellama:34b'
) {}
async complete(prompt: string, options: CompletionOptions): Promise<string> {
const response = await fetch(`${this.baseUrl}/api/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: this.model,
prompt,
stream: false,
options: {
temperature: options.temperature || 0.2,
num_predict: options.maxTokens || 4096,
},
}),
});
if (!response.ok) {
throw new Error(`Ollama request failed: ${response.status}`);
}
const data = await response.json();
return data.response;
}
}
// Factory for selecting provider
export function createProvider(): LLMProvider {
const provider = process.env.LLM_PROVIDER || 'anthropic';
switch (provider) {
case 'ollama':
return new OllamaProvider(
process.env.OLLAMA_URL,
process.env.OLLAMA_MODEL
);
case 'anthropic':
return new AnthropicProvider();
default:
throw new Error(`Unknown provider: ${provider}`);
}
}
Metrics and Dashboards
Track the impact of AI-augmented reviews:
// tools/ai-review/src/metrics.ts
interface ReviewMetrics {
// Volume
totalPRsAnalyzed: number;
prsWithFindings: number;
findingsByCategory: Record<string, number>;
// Quality
findingsAccepted: number;
findingsRejected: number;
findingsIgnored: number;
acceptanceRate: number;
// Impact
avgTimeToFirstReview: number; // minutes
avgTimeToMerge: number;
revertedPRs: number;
securityIncidentsFromMergedPRs: number;
// Cost
totalTokensUsed: number;
totalAPICost: number;
costPerPR: number;
}
export async function computeMetrics(
startDate: Date,
endDate: Date
): Promise<ReviewMetrics> {
const feedback = await loadFeedback(startDate, endDate);
const prs = await loadPRData(startDate, endDate);
const findingsByCategory: Record<string, number> = {};
for (const f of feedback) {
findingsByCategory[f.findingType] = (findingsByCategory[f.findingType] || 0) + 1;
}
const accepted = feedback.filter(f => f.action === 'accepted').length;
const rejected = feedback.filter(f => f.action === 'rejected').length;
const ignored = feedback.filter(f => f.action === 'ignored').length;
return {
totalPRsAnalyzed: prs.length,
prsWithFindings: prs.filter(p => p.hadFindings).length,
findingsByCategory,
findingsAccepted: accepted,
findingsRejected: rejected,
findingsIgnored: ignored,
acceptanceRate: feedback.length > 0 ? accepted / (accepted + rejected + ignored) : 0,
avgTimeToFirstReview: average(prs.map(p => p.timeToFirstReview)),
avgTimeToMerge: average(prs.map(p => p.timeToMerge)),
revertedPRs: prs.filter(p => p.wasReverted).length,
securityIncidentsFromMergedPRs: await countSecurityIncidents(prs),
totalTokensUsed: sum(prs.map(p => p.tokensUsed)),
totalAPICost: sum(prs.map(p => p.apiCost)),
costPerPR: prs.length > 0 ? sum(prs.map(p => p.apiCost)) / prs.length : 0,
};
}
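`computeMetrics` relies on `average` and `sum` helpers that aren't shown; minimal versions might look like:

```typescript
// Minimal numeric helpers assumed by computeMetrics above.
function sum(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0);
}

function average(values: number[]): number {
  // Guard against empty inputs (e.g. no PRs merged in the window).
  return values.length === 0 ? 0 : sum(values) / values.length;
}
```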
// Grafana dashboard queries
const GRAFANA_DASHBOARD = `
# AI Review Metrics Dashboard
## Acceptance Rate Over Time
SELECT
date_trunc('week', timestamp) as week,
count(*) filter (where action = 'accepted') * 100.0 / count(*) as acceptance_rate
FROM ai_review_feedback
GROUP BY week
ORDER BY week;
## Findings by Category
SELECT
finding_type,
count(*) as total,
count(*) filter (where action = 'accepted') as accepted,
count(*) filter (where action = 'rejected') as rejected
FROM ai_review_feedback
WHERE timestamp > now() - interval '30 days'
GROUP BY finding_type;
## Cost Efficiency
SELECT
date_trunc('day', created_at) as day,
sum(api_cost) as daily_cost,
count(*) as prs_analyzed,
sum(api_cost) / count(*) as cost_per_pr
FROM ai_review_runs
GROUP BY day
ORDER BY day;
`;
Production Checklist
Pipeline Setup
- Gate checks run before AI analysis
- AI analysis only on non-draft PRs
- Large PR handling (skip or summarize)
- Cost controls and rate limiting
- Fallback when API unavailable
Quality Controls
- AI outputs summary, not inline comments
- Clear distinction between AI and human review
- No blocking on AI suggestions (advisory only)
- Feedback loop to track acceptance
- Regular prompt tuning based on data
Integration
- Static analysis results fed to AI context
- CODEOWNERS respected for routing
- Team guidelines document provided
- Different strategies for different PR types
Security
- API keys in GitHub Secrets
- Diff content not logged
- Option for self-hosted model
- No sensitive code sent to third parties (where policy requires it)
Observability
- Metrics collection
- Cost tracking dashboard
- Acceptance rate monitoring
- Alert on high rejection rates
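The last item above can be a small scheduled job over the same feedback records used for metrics. A sketch, with the feedback shape and the 40% threshold as assumptions:

```typescript
// Hypothetical scheduled check: alert when the rejection rate over a
// window exceeds a threshold, signalling the prompt needs retuning.
interface FeedbackEntry { action: 'accepted' | 'rejected' | 'ignored' }

function rejectionRate(feedback: FeedbackEntry[]): number {
  if (feedback.length === 0) return 0;
  const rejected = feedback.filter(f => f.action === 'rejected').length;
  return rejected / feedback.length;
}

function shouldAlert(feedback: FeedbackEntry[], threshold = 0.4): boolean {
  return rejectionRate(feedback) > threshold;
}
```

Wire `shouldAlert` to whatever paging or chat notification your team already uses; the point is that a rising rejection rate is the earliest signal the AI reviewer is drifting into noise.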
Anti-Patterns to Avoid
- AI as gatekeeper - AI should never block merges; it's advisory only
- Inline comment spam - One summary beats 50 inline comments
- Duplicating linters - Don't flag what tools already catch
- Style policing - Code style is for formatters, not AI
- Ignoring context - Team conventions matter more than general best practices
- No feedback loop - Without tracking acceptance, you can't improve
- One-size-fits-all - Hotfixes need different treatment than features
- Undermining humans - AI augments reviewers, doesn't replace them
Summary
Effective AI-augmented code review is about workflow design, not prompt engineering. The key principles:
- Layer appropriately - Static analysis → AI analysis → Human review
- Know your boundaries - AI catches patterns; humans judge architecture
- Reduce noise - Summaries over comments, categorization over volume
- Close the loop - Track what gets accepted and tune accordingly
- Respect authority - Human reviewers have final say
The goal isn't to automate code review—it's to focus human attention where it matters most. A senior engineer's time is better spent discussing trade-offs than spotting missing null checks. AI handles the checklist; humans handle the judgment.