AI Content Moderation and Frontend Toxicity Detection
Real-World Problem Context
A social platform with user-generated content (comments, posts, profile bios, image uploads) handles 50,000 new text submissions per hour. The moderation team of 12 reviewers processes flagged content, but the volume has outgrown manual capacity — average review time is 8 hours for reported content, and harmful content stays visible during that window. The existing keyword-based filter catches only 15% of toxic content because users employ evasion techniques: leetspeak ("h4te"), Unicode substitution ("hаte" with Cyrillic 'а'), zero-width characters, and context-dependent toxicity ("you should try jumping" in a gaming context vs a self-harm context). The frontend team integrates AI moderation at three layers: (1) client-side pre-screening using a lightweight TensorFlow.js model that runs in the browser before submission, providing instant feedback, (2) edge/API moderation using a hosted classifier that blocks content before it enters the database, and (3) async review using an LLM that analyzes context, intent, and severity for nuanced cases. This post covers how each layer works, the model architectures involved, and the tradeoffs between speed, accuracy, and false positive rates.
Problem Statements
- Client-Side Pre-Screening: How do you run a toxicity classifier in the browser using TensorFlow.js without blocking the UI thread? How do you handle model size, inference latency, and the tradeoff between catching harmful content and false-flagging legitimate text?
- Evasion-Resistant Text Normalization: How do you detect and normalize text obfuscation techniques (leetspeak, Unicode homoglyphs, zero-width characters, whitespace insertion) before classification? How do you maintain a pipeline that adapts to new evasion patterns?
- Context-Aware Severity Assessment: How does an LLM distinguish between context-dependent toxicity (e.g., "kill" in a gaming context vs a threat) and assign appropriate severity levels? How do you prevent both over-moderation (censoring legitimate speech) and under-moderation (missing subtle harassment)?
Deep Dive: Internal Mechanisms
1. Client-Side Toxicity Detection with TensorFlow.js
/*
* Running a toxicity model in the browser provides INSTANT feedback
* before the content is even submitted to the server.
*
* Architecture:
* ┌──────────────────────────────────────────────────┐
* │ Browser │
* │ ┌────────────────────────┐ │
* │ │ User types comment │ │
* │ └──────────┬─────────────┘ │
* │ │ debounce 300ms │
* │ ▼ │
* │ ┌────────────────────────┐ │
* │ │ Web Worker │ Main thread free! │
* │ │ ├─ Tokenize text │ │
* │ │ ├─ Run TF.js model │ │
* │ │ └─ Return scores │ │
* │ └──────────┬─────────────┘ │
* │ │ │
* │ ▼ │
* │ ┌────────────────────────┐ │
* │ │ UI: show warning if │ │
* │ │ toxicity > threshold │ │
* │ └────────────────────────┘ │
* │ │
* │ Model: ~5MB quantized, loaded once │
* │ Inference: ~50ms per prediction │
* │ Categories: toxic, severe_toxic, obscene, │
* │ threat, insult, identity_hate │
* └──────────────────────────────────────────────────┘
*/
// toxicity-worker.js — Web Worker for off-thread inference
import * as tf from '@tensorflow/tfjs';
let model = null;
let tokenizer = null;
// Load model once when worker starts:
async function initModel() {
// Use quantized model for smaller size (~5MB vs ~20MB):
model = await tf.loadLayersModel('/models/toxicity/model.json');
tokenizer = await loadTokenizer('/models/toxicity/tokenizer.json');
// Warm up: run a dummy prediction to compile the graph:
const dummy = tokenizer.encode('warm up');
const dummyTensor = tf.tensor2d([padSequence(dummy, 128)]);
model.predict(dummyTensor).dispose();
dummyTensor.dispose();
postMessage({ type: 'ready' });
}
self.onmessage = async (event) => {
if (event.data.type === 'predict') {
const { text, requestId } = event.data;
// Tokenize:
const tokens = tokenizer.encode(text.toLowerCase());
const padded = padSequence(tokens, 128); // Fixed input length
// Run inference:
const inputTensor = tf.tensor2d([padded]);
const prediction = model.predict(inputTensor);
const scores = await prediction.data();
// Clean up tensors (prevent memory leak):
inputTensor.dispose();
prediction.dispose();
// Map scores to categories:
const categories = ['toxic', 'severe_toxic', 'obscene',
'threat', 'insult', 'identity_hate'];
const result = {};
categories.forEach((cat, i) => {
result[cat] = scores[i];
});
postMessage({ type: 'result', requestId, scores: result });
}
};
function padSequence(tokens, maxLength) {
if (tokens.length >= maxLength) {
return tokens.slice(0, maxLength);
}
return [...tokens, ...new Array(maxLength - tokens.length).fill(0)];
}
initModel();
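The worker assumes a loadTokenizer helper that isn't shown. A minimal sketch, assuming the tokenizer JSON is a flat word-to-index vocabulary map (one common export format for Keras-style tokenizers), could look like this; adapt it to however your model was actually tokenized during training:
// Hypothetical tokenizer loader (the JSON shape here is an assumption):
async function loadTokenizer(url) {
  const response = await fetch(url);
  const vocab = await response.json(); // e.g. { "the": 1, "you": 2, ... }
  return {
    encode(text) {
      return text
        .split(/\s+/)
        .filter(Boolean)
        .map((word) => vocab[word] ?? 0); // 0 = out-of-vocabulary / padding
    },
  };
}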
// React hook for using the toxicity worker:
import { useEffect, useMemo, useRef, useState } from 'react';
import debounce from 'lodash/debounce'; // or any equivalent debounce helper
function useToxicityDetection(options = {}) {
const { threshold = 0.7, debounceMs = 300 } = options;
const workerRef = useRef(null);
const [result, setResult] = useState(null);
const [isReady, setIsReady] = useState(false);
const requestIdRef = useRef(0);
useEffect(() => {
const worker = new Worker(
new URL('./toxicity-worker.js', import.meta.url),
{ type: 'module' }
);
worker.onmessage = (event) => {
if (event.data.type === 'ready') {
setIsReady(true);
}
if (event.data.type === 'result' &&
    event.data.requestId === requestIdRef.current) {
  // Ignore stale results from superseded (older) requests:
  setResult(event.data);
}
};
workerRef.current = worker;
return () => worker.terminate();
}, []);
const analyze = useMemo(() =>
debounce((text) => {
if (!workerRef.current || !isReady) return;
const requestId = ++requestIdRef.current;
workerRef.current.postMessage({
type: 'predict',
text,
requestId
});
}, debounceMs),
[isReady, debounceMs]
);
const isToxic = result?.scores &&
Object.values(result.scores).some(score => score > threshold);
const highestCategory = result?.scores &&
Object.entries(result.scores)
.sort(([,a], [,b]) => b - a)[0];
return { analyze, result: result?.scores, isToxic, highestCategory, isReady };
}
// Usage in a comment form:
function CommentForm({ onSubmit }) {
const [text, setText] = useState('');
const { analyze, isToxic, highestCategory, isReady } = useToxicityDetection();
const handleChange = (e) => {
setText(e.target.value);
analyze(e.target.value);
};
return (
<form onSubmit={(e) => {
e.preventDefault();
if (!isToxic) onSubmit(text);
}}>
<textarea value={text} onChange={handleChange} />
{isToxic && (
<div role="alert" className="warning">
This comment may violate our community guidelines.
Detected: {highestCategory?.[0]}
(confidence: {Math.round(highestCategory?.[1] * 100)}%)
</div>
)}
<button type="submit" disabled={isToxic || !isReady}>
Post Comment
</button>
</form>
);
}
2. Text Normalization and Evasion Detection
/*
* Users evade content filters using:
*
* 1. Leetspeak: h4t3, @ss, sh1t
* 2. Unicode homoglyphs: hаte (Cyrillic а), ℎate (script h)
* 3. Zero-width characters: hate (with ZWJ/ZWNJ)
* 4. Whitespace insertion: h a t e, h.a.t.e
* 5. Combining diacritics: h̶a̶t̶e̶ (strikethrough)
* 6. Mixed scripts: HλTE (Greek lambda)
* 7. Reversed text / Unicode tricks
* 8. Emoji substitution: 🖕 or coded sequences
*
* Normalization pipeline:
*
* ┌──────────────────────────────────────────────────┐
* │ "H4T∈ u" (input) │
* │ │ │
* │ ▼ │
* │ 1. Strip zero-width chars → "H4T∈ u" │
* │ │ │
* │ ▼ │
* │ 2. Unicode normalize (NFKC) → "H4T∈ u" │
* │ │ │
* │ ▼ │
* │ 3. Homoglyph → ASCII map → "H4TE u" │
* │ │ │
* │ ▼ │
* │ 4. Leetspeak decode → "HATE u" │
* │ │ │
* │ ▼ │
* │ 5. Lowercase → "hate u" │
* │ │ │
* │ ▼ │
* │ 6. Remove repeated chars → "hate u" │
* │ │ │
* │ ▼ │
* │ Both original AND normalized fed to classifier │
* └──────────────────────────────────────────────────┘
*/
class TextNormalizer {
constructor() {
// Leetspeak mapping:
this.leetMap = {
'0': 'o', '1': 'i', '3': 'e', '4': 'a', '5': 's',
'7': 't', '8': 'b', '9': 'g', '@': 'a', '$': 's',
'!': 'i', '|': 'l', '+': 't', '€': 'e', '¢': 'c',
};
// Common Unicode homoglyphs → ASCII:
this.homoglyphMap = new Map([
// Cyrillic:
['а', 'a'], ['е', 'e'], ['о', 'o'], ['р', 'p'],
['с', 'c'], ['у', 'y'], ['х', 'x'], ['А', 'A'],
['В', 'B'], ['Е', 'E'], ['К', 'K'], ['М', 'M'],
['Н', 'H'], ['О', 'O'], ['Р', 'P'], ['С', 'C'],
['Т', 'T'], ['Х', 'X'],
// Greek:
['α', 'a'], ['β', 'b'], ['ε', 'e'], ['η', 'n'],
['ι', 'i'], ['κ', 'k'], ['ο', 'o'], ['ρ', 'p'],
['τ', 't'], ['υ', 'u'], ['ν', 'v'], ['λ', 'l'],
// Math/special:
['∈', 'e'], ['∅', '0'], ['ℎ', 'h'], ['ℓ', 'l'],
]);
// Zero-width character ranges:
this.zeroWidthRegex = /[\u200B-\u200F\u2028-\u202F\uFEFF\u00AD]/g;
// Combining diacritical marks:
this.combiningMarksRegex = /[\u0300-\u036F\u0489\u20D0-\u20FF]/g;
}
normalize(text) {
let result = text;
// 1. Remove zero-width characters:
result = result.replace(this.zeroWidthRegex, '');
// 2. Unicode NFKC normalization:
result = result.normalize('NFKC');
// 3. Remove combining diacritical marks:
result = result.replace(this.combiningMarksRegex, '');
// 4. Replace homoglyphs:
result = [...result].map(char =>
this.homoglyphMap.get(char) || char
).join('');
// 5. Decode leetspeak:
result = [...result].map(char =>
this.leetMap[char] || char
).join('');
// 6. Lowercase:
result = result.toLowerCase();
// 7. Remove separator characters between letters:
// "h.a.t.e" → "hate", "h a t e" → "hate"
result = result.replace(/(\w)[.\-_\s]{1,2}(?=\w)/g, '$1');
// 8. Collapse repeated characters (3+ → 2):
// "haaaate" → "haate"
result = result.replace(/(.)\1{2,}/g, '$1$1');
return result;
}
// Detect if text contains evasion attempts:
detectEvasion(text) {
const indicators = [];
// Mixed scripts:
const scripts = detectScripts(text);
if (scripts.size > 1) {
indicators.push({
type: 'mixed-scripts',
detail: `Contains ${[...scripts].join(', ')} scripts`,
});
}
// Zero-width characters:
const zwCount = (text.match(this.zeroWidthRegex) || []).length;
if (zwCount > 0) {
indicators.push({
type: 'zero-width-chars',
detail: `Contains ${zwCount} zero-width characters`,
});
}
// Excessive leetspeak:
const leetCount = [...text].filter(c => c in this.leetMap).length;
const letterCount = [...text].filter(c => /[a-zA-Z]/.test(c)).length;
if (leetCount > 0 && leetCount / (letterCount + leetCount) > 0.3) {
indicators.push({
type: 'leetspeak',
detail: `${Math.round(leetCount / (letterCount + leetCount) * 100)}% leetspeak characters`,
});
}
// Spaced-out text:
if (text.match(/\w[.\-\s]\w[.\-\s]\w[.\-\s]\w/)) {
indicators.push({
type: 'spaced-text',
detail: 'Individual characters separated by spaces/punctuation',
});
}
return {
hasEvasion: indicators.length > 0,
indicators,
normalizedText: this.normalize(text),
};
}
}
function detectScripts(text) {
const scripts = new Set();
for (const char of text) {
const code = char.codePointAt(0);
if (code >= 0x0041 && code <= 0x024F) scripts.add('Latin');
else if (code >= 0x0400 && code <= 0x04FF) scripts.add('Cyrillic');
else if (code >= 0x0370 && code <= 0x03FF) scripts.add('Greek');
else if (code >= 0x0600 && code <= 0x06FF) scripts.add('Arabic');
else if (code >= 0x3000 && code <= 0x9FFF) scripts.add('CJK');
}
return scripts;
}
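A quick usage sketch of the normalizer, with illustrative outputs that follow from the maps defined above:
// Usage sketch (illustrative outputs, given the maps defined above):
const normalizer = new TextNormalizer();
normalizer.normalize('h4t∈');   // → "hate" (homoglyph + leetspeak decoded)
normalizer.normalize('hаte');   // → "hate" (Cyrillic 'а' mapped to Latin 'a')
const report = normalizer.detectEvasion('h\u200Ba\u200Bt\u200Be');
// report.hasEvasion === true
// report.indicators: [{ type: 'zero-width-chars', detail: 'Contains 3 zero-width characters' }]
// report.normalizedText === 'hate'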
3. Server-Side Classification Pipeline
/*
* The server-side pipeline handles content that passes
* client-side screening OR is submitted via API directly.
*
* Multi-stage classification:
*
* ┌──────────────────────────────────────────────────┐
* │ Stage 1: Fast filter (< 5ms) │
* │ ├─ Blocklist check (exact match) │
* │ ├─ Regex pattern match │
* │ └─ Text normalization + blocklist │
* │ │ Pass? ──────────────────────┐ │
* │ ▼ │ │
* │ Stage 2: ML classifier (< 50ms) │ Block │
* │ ├─ Toxicity model (ONNX) │ │
* │ ├─ Spam model │ │
* │ └─ PII detection │ │
* │ │ Uncertain? ────────────────┤ │
* │ ▼ │ │
* │ Stage 3: LLM review (< 2s) │ │
* │ ├─ Context analysis │ │
* │ ├─ Intent classification │ │
* │ └─ Severity scoring │ │
* │ │ │ │
* │ ▼ ▼ │
* │ Decision: allow / block / flag for human review │
* └──────────────────────────────────────────────────┘
*/
class ContentModerationPipeline {
constructor(config) {
this.blocklist = new Set(config.blockedTerms);
this.normalizer = new TextNormalizer();
this.classifier = null; // ONNX model loaded separately
this.thresholds = config.thresholds || {
autoBlock: 0.95, // High confidence → block immediately
humanReview: 0.7, // Medium confidence → queue for review
autoAllow: 0.3, // Low confidence → allow
};
}
async moderate(content, context = {}) {
const startTime = Date.now();
// Stage 1: Fast filter
const stage1 = this.fastFilter(content);
if (stage1.blocked) {
return this.buildResult('blocked', 'blocklist', stage1, startTime);
}
// Stage 2: ML classifier
const normalized = this.normalizer.normalize(content.text);
const evasion = this.normalizer.detectEvasion(content.text);
const stage2 = await this.classifyText(
content.text,
normalized,
evasion
);
if (stage2.maxScore >= this.thresholds.autoBlock) {
return this.buildResult('blocked', 'classifier', stage2, startTime);
}
if (stage2.maxScore <= this.thresholds.autoAllow && !evasion.hasEvasion) {
return this.buildResult('allowed', 'classifier', stage2, startTime);
}
// Stage 3: LLM review for uncertain cases
const stage3 = await this.llmReview(content, context, stage2);
if (stage3.severity === 'high' || stage3.severity === 'critical') {
return this.buildResult('blocked', 'llm', stage3, startTime);
}
if (stage3.severity === 'medium') {
return this.buildResult('flagged', 'llm', stage3, startTime);
}
return this.buildResult('allowed', 'llm', stage3, startTime);
}
fastFilter(content) {
const text = content.text.toLowerCase();
const normalized = this.normalizer.normalize(content.text);
// Check blocklist against both original and normalized:
for (const term of this.blocklist) {
if (text.includes(term) || normalized.includes(term)) {
return { blocked: true, reason: 'blocklist', matchedTerm: term };
}
}
return { blocked: false };
}
async classifyText(originalText, normalizedText, evasion) {
// Run classification on both versions:
const originalScores = await this.runModel(originalText);
const normalizedScores = await this.runModel(normalizedText);
// Take the MAX score across both — evasion shouldn't reduce scores:
const combinedScores = {};
for (const category of Object.keys(originalScores)) {
combinedScores[category] = Math.max(
originalScores[category],
normalizedScores[category]
);
}
// Boost scores if evasion was detected (suspicious):
if (evasion.hasEvasion) {
for (const category of Object.keys(combinedScores)) {
combinedScores[category] = Math.min(1,
combinedScores[category] * 1.2
);
}
}
return {
scores: combinedScores,
maxScore: Math.max(...Object.values(combinedScores)),
topCategory: Object.entries(combinedScores)
.sort(([,a], [,b]) => b - a)[0],
evasionDetected: evasion.hasEvasion,
evasionIndicators: evasion.indicators,
};
}
buildResult(decision, stage, details, startTime) {
return {
decision, // 'allowed' | 'blocked' | 'flagged'
stage, // which stage made the decision
details,
latencyMs: Date.now() - startTime,
timestamp: Date.now(),
};
}
}
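The pipeline leaves this.runModel unimplemented (this.llmReview is covered in the next section). A minimal sketch of the classifier call using onnxruntime-node follows; the model path, the tokenizer, and the input/output tensor names are assumptions about how the model was exported, not a prescribed layout:
// Hypothetical backing for this.runModel; adjust names to your exported model:
import * as ort from 'onnxruntime-node';
let toxicitySession = null;
async function runToxicityModel(text, tokenize) {
  if (!toxicitySession) {
    toxicitySession = await ort.InferenceSession.create('./models/toxicity.onnx');
  }
  const ids = tokenize(text); // project-specific tokenizer → array of token ids
  const inputIds = new ort.Tensor(
    'int64',
    BigInt64Array.from(ids.map((id) => BigInt(id))),
    [1, ids.length]
  );
  const outputs = await toxicitySession.run({ input_ids: inputIds }); // assumed input name
  const probs = Array.from(outputs.probabilities.data);               // assumed output name
  const categories = ['toxic', 'severe_toxic', 'obscene',
                      'threat', 'insult', 'identity_hate'];
  return Object.fromEntries(categories.map((cat, i) => [cat, probs[i]]));
}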
4. LLM-Based Context Analysis
/*
* The LLM excels at understanding CONTEXT and INTENT
* that ML classifiers miss:
*
* "I'm going to kill it in this interview" → Not a threat
* "I'm going to kill you" → Threat
* "Kill the process running on port 3000" → Technical
*
* "You're so toxic" → Could be insult OR gaming term
* "This build is cancer" → Frustration, not hate speech
*
* The LLM evaluates:
* 1. Intent: hostile, frustrated, sarcastic, neutral
* 2. Context: gaming, technical, social, political
* 3. Target: individual, group, self, nobody
* 4. Severity: critical, high, medium, low, none
*/
async function llmContextualReview(content, context, classifierScores) {
const prompt = `You are a content moderator for a social platform.
Analyze this user submission for policy violations.
CONTENT:
"${content.text}"
CONTEXT:
- Posted in: ${context.section || 'general feed'}
- Thread topic: ${context.threadTitle || 'none'}
- User history: ${context.userReputation || 'unknown'}
- Previous messages in thread: ${context.recentMessages?.slice(-3).map(m => `"${m}"`).join(', ') || 'none'}
ML CLASSIFIER SCORES (0-1):
${Object.entries(classifierScores.scores).map(([cat, score]) =>
` ${cat}: ${score.toFixed(3)}`
).join('\n')}
${classifierScores.evasionDetected ?
`\nEVASION DETECTED: ${classifierScores.evasionIndicators.map(i => i.type).join(', ')}` : ''}
EVALUATE:
1. INTENT: What is the user trying to communicate? (hostile/frustrated/sarcastic/humorous/neutral)
2. CONTEXT: Does the surrounding context change the meaning? (explain briefly)
3. TARGET: Is this directed at a person, group, or nobody?
4. POLICY: Does this violate community guidelines? Which specific rule?
5. SEVERITY: none / low / medium / high / critical
6. CONFIDENCE: How confident are you? (0-1)
7. ACTION: allow / warn-user / hide-pending-review / block
8. EXPLANATION: One sentence explaining your decision (shown to moderator)
Respond as JSON.`;
const response = await callLLM(prompt, {
model: 'gpt-4o-mini', // Fast model for moderation
temperature: 0, // Deterministic for consistency
maxTokens: 300,
});
return JSON.parse(response);
}
// Example responses:
/*
* Input: "I'm going to kill it in this interview tomorrow!"
* Response:
* {
* "intent": "neutral",
* "context": "Idiomatic expression meaning 'do very well'",
* "target": "nobody",
* "policy": "No violation",
* "severity": "none",
* "confidence": 0.95,
* "action": "allow",
* "explanation": "Idiomatic expression 'kill it' meaning to perform excellently"
* }
*
* Input: "u should kys lol" (in a gaming forum)
* Response:
* {
* "intent": "hostile",
* "context": "Despite gaming context, 'kys' (kill yourself) is a direct incitement to self-harm",
* "target": "individual (addressed with 'u')",
* "policy": "Violates self-harm/suicide policy",
* "severity": "critical",
* "confidence": 0.98,
* "action": "block",
* "explanation": "Direct incitement to self-harm, regardless of gaming context"
* }
*/
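The callLLM helper used here (and again in the appeal flow below) isn't shown. A minimal sketch against the OpenAI Chat Completions API could look like this; the endpoint, environment variable, and JSON-mode option are assumptions about your provider:
// Hypothetical LLM wrapper (provider, endpoint, and options are assumptions):
async function callLLM(prompt, { model, temperature = 0, maxTokens = 300 } = {}) {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model,
      temperature,
      max_tokens: maxTokens,
      response_format: { type: 'json_object' }, // ask for parseable JSON
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!response.ok) {
    throw new Error(`LLM request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
}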
5. Image and Multimodal Content Moderation
/*
* Text-only moderation misses:
* - Offensive images
* - Text-in-images (bypassing text filters)
* - Memes with hateful overlay text
* - Profile pictures with symbols/imagery
*
* Client-side image pre-screening:
* 1. NSFW detection using a lightweight CNN
* 2. OCR to extract text from images
* 3. Combined text + image classification
*/
// Client-side NSFW image detection:
class ImageModerator {
constructor() {
this.model = null;
this.canvas = document.createElement('canvas');
this.ctx = this.canvas.getContext('2d');
}
async init() {
// Load NSFWJS model (MobileNet-based, ~2MB quantized):
const nsfwjs = await import('nsfwjs');
this.model = await nsfwjs.load('/models/nsfw/model.json', {
type: 'graph',
});
}
async classifyImage(imageElement) {
// Resize to model input size:
this.canvas.width = 224;
this.canvas.height = 224;
this.ctx.drawImage(imageElement, 0, 0, 224, 224);
// Classify:
const predictions = await this.model.classify(this.canvas);
// predictions = [
// { className: 'Neutral', probability: 0.85 },
// { className: 'Drawing', probability: 0.10 },
// { className: 'Sexy', probability: 0.03 },
// { className: 'Hentai', probability: 0.01 },
// { className: 'Porn', probability: 0.01 },
// ]
const result = {};
for (const pred of predictions) {
result[pred.className.toLowerCase()] = pred.probability;
}
return {
scores: result,
isSafe: result.neutral > 0.7 || result.drawing > 0.5,
isNSFW: (result.porn || 0) + (result.hentai || 0) + (result.sexy || 0) > 0.5,
};
}
// Extract text from image for content check:
async extractTextFromImage(imageElement) {
// Use Tesseract.js for client-side OCR:
const { createWorker } = await import('tesseract.js');
const worker = await createWorker('eng');
const { data } = await worker.recognize(imageElement);
await worker.terminate();
return data.text;
}
// Combined image + text moderation:
async moderateUpload(file) {
const img = await loadImage(file);
// Run in parallel: NSFW check + OCR:
const [nsfwResult, extractedText] = await Promise.all([
this.classifyImage(img),
this.extractTextFromImage(img),
]);
// Check extracted text for toxicity:
let textResult = null;
if (extractedText.trim().length > 5) {
textResult = await analyzeText(extractedText);
}
return {
image: nsfwResult,
extractedText,
textAnalysis: textResult,
isAcceptable: nsfwResult.isSafe &&
(!textResult || !textResult.isToxic),
};
}
}
function loadImage(file) {
  return new Promise((resolve, reject) => {
    const img = new Image();
    const url = URL.createObjectURL(file);
    img.onload = () => {
      URL.revokeObjectURL(url); // avoid leaking the blob URL
      resolve(img);
    };
    img.onerror = reject;
    img.src = url;
  });
}
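A usage sketch wiring the image moderator into an upload handler; fileInput, showUploadWarning, and uploadToServer are hypothetical placeholders for your own UI and upload code:
// Usage sketch (fileInput, showUploadWarning, uploadToServer are hypothetical):
const imageModerator = new ImageModerator();
await imageModerator.init();
fileInput.addEventListener('change', async (event) => {
  const file = event.target.files[0];
  if (!file) return;
  const verdict = await imageModerator.moderateUpload(file);
  if (!verdict.isAcceptable) {
    showUploadWarning(verdict); // e.g. explain the NSFW score or toxic overlay text
    return;
  }
  uploadToServer(file);
});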
6. Rate Limiting and Abuse Prevention
/*
* Content moderation must also handle SPAM and ABUSE patterns:
* - Rapid-fire posting
* - Duplicate content across threads
* - Coordinated inauthentic behavior (multiple accounts posting same content)
* - Gradual policy testing (slowly escalating toxicity to find the threshold)
*
* These require BEHAVIORAL analysis, not just content analysis.
*/
class BehavioralModerator {
constructor(redis) {
this.redis = redis;
}
async checkBehavior(userId, content) {
const flags = [];
// 1. Rate limiting:
const recentPostCount = await this.redis.incr(
`ratelimit:posts:${userId}`
);
if (recentPostCount === 1) {
await this.redis.expire(`ratelimit:posts:${userId}`, 60);
}
if (recentPostCount > 10) { // >10 posts/minute
flags.push({
type: 'rate-limit',
severity: 'high',
detail: `${recentPostCount} posts in last 60 seconds`,
});
}
// 2. Duplicate content detection:
const contentHash = hashContent(content.text);
const duplicateCount = await this.redis.scard(
`duplicates:${contentHash}`
);
await this.redis.sadd(`duplicates:${contentHash}`, userId);
await this.redis.expire(`duplicates:${contentHash}`, 3600);
if (duplicateCount >= 3) { // Same content already posted by 3+ other users
flags.push({
type: 'coordinated-duplicate',
severity: 'medium',
detail: `Same content posted by ${duplicateCount + 1} users`,
});
}
// 3. Escalation pattern:
// Track user's toxicity scores over time:
const key = `toxicity-history:${userId}`;
const recentScores = await this.redis.lrange(key, 0, 19);
if (recentScores.length >= 5) {
const scores = recentScores.map(Number);
const trend = calculateTrend(scores);
if (trend > 0.05) { // Increasing toxicity
flags.push({
type: 'escalation-pattern',
severity: 'medium',
detail: `Toxicity trending upward (${scores.slice(-3).map(s => s.toFixed(2)).join(' → ')})`,
});
}
}
return flags;
}
}
function hashContent(text) {
// Simple content hash (normalize whitespace and case):
const normalized = text.toLowerCase().replace(/\s+/g, ' ').trim();
let hash = 0;
for (const char of normalized) {
hash = ((hash << 5) - hash) + char.charCodeAt(0);
hash |= 0;
}
return hash.toString(36);
}
function calculateTrend(scores) {
// Simple linear regression slope:
const n = scores.length;
const sumX = n * (n - 1) / 2;
const sumY = scores.reduce((a, b) => a + b, 0);
const sumXY = scores.reduce((sum, y, x) => sum + x * y, 0);
const sumX2 = n * (n - 1) * (2 * n - 1) / 6;
return (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
}
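The escalation check reads toxicity-history:<userId>, but nothing above writes to it. A sketch of the recording side, called after each moderation decision, might look like this (the key layout is an assumption, kept consistent with the lrange call above):
// Hypothetical writer for the toxicity history read by checkBehavior():
async function recordToxicityScore(redis, userId, maxScore) {
  const key = `toxicity-history:${userId}`;
  await redis.rpush(key, maxScore.toFixed(3)); // chronological order (oldest → newest)
  await redis.ltrim(key, -20, -1);             // keep only the 20 most recent scores
  await redis.expire(key, 7 * 24 * 3600);      // let idle users' history expire
}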
7. User Feedback and Appeal Flow
/*
* False positives damage user trust.
* Users must be able to:
* 1. Understand WHY their content was moderated
* 2. Appeal the decision
* 3. Get a timely response
*
* ┌──────────────────────────────────────────────────┐
* │ Content blocked │
* │ │ │
* │ ▼ │
* │ Show reason: "Your comment was flagged for │
* │ potential harassment. [Appeal] [Edit & Resubmit]" │
* │ │ │
* │ ├─ [Edit & Resubmit] → Re-run moderation │
* │ │ │
* │ └─ [Appeal] → Queue for human review │
* │ │ │
* │ ▼ │
* │ LLM pre-screens appeal: │
* │ "Is the original moderation likely correct?" │
* │ │ │
* │ ├─ Likely incorrect → Auto-approve + flag │
* │ │ for policy review │
* │ │ │
* │ └─ Likely correct → Queue for human review │
* │ (prioritized by user reputation) │
* └──────────────────────────────────────────────────┘
*/
function ModerationNotice({ decision, content, onAppeal, onEdit }) {
const reasonMap = {
toxic: 'potential toxicity',
threat: 'threatening language',
insult: 'personal attacks',
identity_hate: 'hateful content targeting identity groups',
obscene: 'explicit language',
spam: 'spam or repetitive content',
};
const reason = reasonMap[decision.details?.topCategory?.[0]] ||
'community guideline violation';
return (
<div role="alert" className="moderation-notice">
<div className="notice-header">
<WarningIcon />
<h3>Content not posted</h3>
</div>
<p>
Your comment was flagged for {reason}.
This helps keep our community safe.
</p>
<div className="notice-content">
<blockquote>{content.text}</blockquote>
</div>
<p className="notice-help">
You can edit your comment and try again, or appeal
if you believe this was a mistake.
</p>
<div className="notice-actions">
<button onClick={onEdit} className="btn-primary">
Edit & Resubmit
</button>
<button onClick={onAppeal} className="btn-secondary">
Appeal Decision
</button>
</div>
<details className="notice-details">
<summary>Technical details</summary>
<dl>
<dt>Decision stage</dt>
<dd>{decision.stage}</dd>
<dt>Processing time</dt>
<dd>{decision.latencyMs}ms</dd>
<dt>Category scores</dt>
<dd>
{Object.entries(decision.details?.scores || {})
.filter(([, score]) => score > 0.1)
.map(([cat, score]) => (
<span key={cat}>
{cat}: {Math.round(score * 100)}%
</span>
))}
</dd>
</dl>
</details>
</div>
);
}
// Appeal pre-screening:
async function preScreenAppeal(originalContent, decision, appealText) {
const response = await callLLM(`
A user is appealing a content moderation decision.
ORIGINAL CONTENT:
"${originalContent}"
MODERATION DECISION:
Action: ${decision.decision}
Reason: ${decision.details?.topCategory?.[0]}
Confidence: ${decision.details?.scores?.[decision.details.topCategory[0]]?.toFixed(3)}
USER'S APPEAL:
"${appealText}"
Evaluate:
1. Was the moderation likely CORRECT or INCORRECT?
2. Is there a reasonable interpretation where the content is non-violating?
3. Recommendation: auto-approve / human-review-needed / uphold-block
Respond as JSON: { "likely_correct": bool, "reasoning": "...", "recommendation": "..." }`,
{ model: 'gpt-4o-mini', temperature: 0, maxTokens: 150 }
);
return JSON.parse(response);
}
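For illustration, a pre-screen response in the same spirit as the earlier examples (hypothetical values):
/*
 * Input appeal: "It's gaming slang, 'this build is cancer' just means the
 * strategy is annoying. I wasn't attacking anyone."
 * Response:
 * {
 *   "likely_correct": false,
 *   "reasoning": "The phrase targets a game strategy, not a person or group; common gaming hyperbole.",
 *   "recommendation": "auto-approve"
 * }
 */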
8. Real-Time Content Filtering in Chat
/*
* For real-time chat (WebSocket), moderation must be:
* - Sub-50ms (messages appear instantly)
* - Non-blocking (other messages keep flowing)
* - Retroactive (can remove messages after deeper analysis)
*/
class RealTimeChatModerator {
constructor(pipeline) {
this.pipeline = pipeline;
this.pendingReviews = new Map();
}
async moderateMessage(message, socket) {
// Stage 1: Synchronous fast check (< 5ms):
const fastResult = this.pipeline.fastFilter(message);
if (fastResult.blocked) {
// Block immediately, don't broadcast:
socket.emit('message:blocked', {
messageId: message.id,
reason: 'policy_violation',
});
return;
}
// Allow message through immediately (optimistic):
socket.broadcast.emit('message:new', message);
// Stage 2: Async deeper analysis:
this.deepAnalyze(message, socket);
}
async deepAnalyze(message, socket) {
const result = await this.pipeline.moderate(message, {
section: 'chat',
threadTitle: message.channelName,
});
if (result.decision === 'blocked') {
// Retroactively remove the message:
socket.broadcast.emit('message:removed', {
messageId: message.id,
reason: 'Content removed by automated moderation',
});
// Notify the sender:
socket.emit('message:moderated', {
messageId: message.id,
reason: result.details?.topCategory?.[0],
});
} else if (result.decision === 'flagged') {
// Add to human review queue but keep visible:
await this.queueForReview(message, result);
}
}
// Shadow-ban: user sees their messages, others don't:
async applyShadowBan(userId, socket) {
// Messages from this user are:
// 1. Shown to the user (socket.emit to them)
// 2. NOT broadcast to others
// 3. Queued for human review
// NOTE: onBeforeBroadcast is a conceptual hook, not a built-in Socket.IO API;
// in practice this check lives in server-side middleware before broadcasting:
socket.onBeforeBroadcast = (event, data) => {
if (data.userId === userId) {
// Send only to the user, not to room:
socket.emit(event, data);
return false; // Prevent broadcast
}
return true;
};
}
}
9. Model Training Data and Bias Mitigation
/*
* Content moderation models have known biases:
* - African American Vernacular English (AAVE) flagged as toxic at higher rates
* - Reclaimed terms ("queer" in LGBTQ+ context) flagged as identity hate
* - Non-English text under-moderated (model trained mostly on English)
* - Sarcasm and irony misclassified
*
* Mitigation strategies:
*/
class BiasAwareModerator {
constructor(baseModel, biasConfig) {
this.baseModel = baseModel;
this.biasConfig = biasConfig;
// Context-dependent term overrides:
this.contextOverrides = new Map([
// Term → contexts where it's acceptable
['queer', ['lgbtq', 'pride', 'identity', 'community']],
['crip', ['disability', 'accessibility', 'crip-theory']],
['bitch', ['dog-breeding', 'veterinary']],
]);
}
async classify(text, context) {
const baseResult = await this.baseModel.classify(text);
// Apply context overrides:
const adjustedResult = this.applyContextOverrides(
text, baseResult, context
);
// Apply demographic parity adjustments:
const debiasedResult = this.applyDebiasing(
text, adjustedResult
);
return debiasedResult;
}
applyContextOverrides(text, result, context) {
const lowerText = text.toLowerCase();
for (const [term, safeContexts] of this.contextOverrides) {
if (lowerText.includes(term)) {
const inSafeContext = safeContexts.some(ctx =>
context.section?.includes(ctx) ||
context.threadTitle?.toLowerCase().includes(ctx)
);
if (inSafeContext) {
// Reduce toxicity scores — term used in appropriate context:
for (const category of Object.keys(result.scores)) {
result.scores[category] *= 0.3;
}
result.contextNote = `"${term}" used in ${context.section} context`;
}
}
}
return result;
}
applyDebiasing(text, result) {
// Detect potential dialect bias:
// If text uses AAVE patterns AND toxicity is medium (not clearly toxic),
// reduce confidence and flag for human review instead of auto-blocking.
const aavePatterns = [
/\bfinna\b/, /\bion\b/, /\baint\b/, /\bbruh\b/,
/\bbet\b/, /\bfr\b/, /\bnocap\b/, /\bslay\b/,
];
const hasDialectMarkers = aavePatterns.filter(p =>
p.test(text.toLowerCase())
).length >= 2;
if (hasDialectMarkers && result.maxScore > 0.5 && result.maxScore < 0.85) {
// Medium confidence + dialect markers → needs human review:
result.biasFlag = 'potential-dialect-bias';
result.recommendHumanReview = true;
// Don't auto-block; route to human:
result.maxScore = Math.min(result.maxScore, 0.69);
}
return result;
}
}
// Audit moderation decisions for demographic disparities:
async function auditModerationBias(decisions, userDemographics) {
// Group decisions by demographic attributes:
const byGroup = {};
for (const decision of decisions) {
const group = userDemographics[decision.userId]?.group || 'unknown';
if (!byGroup[group]) {
byGroup[group] = { total: 0, blocked: 0, flagged: 0 };
}
byGroup[group].total++;
if (decision.decision === 'blocked') byGroup[group].blocked++;
if (decision.decision === 'flagged') byGroup[group].flagged++;
}
// Calculate block rates per group:
const rates = {};
for (const [group, counts] of Object.entries(byGroup)) {
rates[group] = {
blockRate: counts.blocked / Math.max(counts.total, 1),
flagRate: counts.flagged / Math.max(counts.total, 1),
total: counts.total,
};
}
// Flag significant disparities:
const avgBlockRate = decisions.filter(d => d.decision === 'blocked').length / decisions.length;
const disparities = [];
for (const [group, rate] of Object.entries(rates)) {
if (rate.blockRate > avgBlockRate * 1.5 && rate.total > 100) {
disparities.push({
group,
blockRate: rate.blockRate,
avgBlockRate,
ratio: rate.blockRate / avgBlockRate,
sampleSize: rate.total,
});
}
}
return { rates, disparities, needsReview: disparities.length > 0 };
}
10. Moderation Dashboard and Analytics
/*
* The frontend moderation dashboard shows:
* 1. Real-time moderation metrics
* 2. Queue of content flagged for human review
* 3. Appeal queue
* 4. Model performance metrics
* 5. Bias audit results
*/
// React dashboard component:
function ModerationDashboard() {
const metrics = useModerationMetrics();
const queue = useModerationQueue();
return (
<div className="mod-dashboard">
{/* Real-time metrics */}
<div className="metrics-grid">
<MetricCard
title="Processed / hour"
value={metrics.processedPerHour}
trend={metrics.processedTrend}
/>
<MetricCard
title="Auto-blocked"
value={`${metrics.autoBlockRate.toFixed(1)}%`}
color={metrics.autoBlockRate > 5 ? 'red' : 'green'}
/>
<MetricCard
title="False positive rate"
value={`${metrics.falsePositiveRate.toFixed(1)}%`}
color={metrics.falsePositiveRate > 2 ? 'red' : 'green'}
/>
<MetricCard
title="Avg latency"
value={`${metrics.avgLatencyMs}ms`}
color={metrics.avgLatencyMs > 200 ? 'yellow' : 'green'}
/>
<MetricCard
title="Pending review"
value={queue.length}
color={queue.length > 100 ? 'red' : 'green'}
/>
<MetricCard
title="Evasion detected"
value={`${metrics.evasionRate.toFixed(1)}%`}
/>
</div>
{/* Category breakdown chart */}
<div className="category-chart">
<h3>Moderation by Category (24h)</h3>
<BarChart data={metrics.categoryBreakdown} />
</div>
{/* Review queue */}
<div className="review-queue">
<h3>Pending Human Review ({queue.length})</h3>
{queue.map(item => (
<ReviewCard
key={item.id}
content={item.content}
scores={item.scores}
context={item.context}
onApprove={() => handleApprove(item.id)}
onReject={() => handleReject(item.id)}
onEscalate={() => handleEscalate(item.id)}
/>
))}
</div>
</div>
);
}
function ReviewCard({ content, scores, context, onApprove, onReject, onEscalate }) {
const [expanded, setExpanded] = useState(false);
return (
<div className="review-card">
<div className="review-header">
<span className={`severity-badge ${getSeverityClass(scores)}`}>
{scores.topCategory[0]}
</span>
<span className="confidence">
{Math.round(scores.topCategory[1] * 100)}% confidence
</span>
<time>{formatRelative(content.timestamp)}</time>
</div>
<blockquote className="review-content">
{content.text}
</blockquote>
{context && (
<div className="review-context">
<small>
Posted in {context.section}
{context.threadTitle && ` • Thread: ${context.threadTitle}`}
</small>
</div>
)}
{scores.evasionDetected && (
<div className="evasion-warning">
Evasion detected: {scores.evasionIndicators.map(i => i.type).join(', ')}
</div>
)}
<div className="review-actions">
<button onClick={onApprove} className="btn-approve">
Approve (Not Violating)
</button>
<button onClick={onReject} className="btn-reject">
Confirm Violation
</button>
<button onClick={onEscalate} className="btn-escalate">
Escalate to Senior Mod
</button>
</div>
</div>
);
}
Trade-offs & Considerations
| Aspect | Keyword Filter | ML Classifier | LLM Analysis | Multi-Layer |
|---|---|---|---|---|
| Latency | < 1ms | 20-50ms | 500-2000ms | 5ms-2s (staged) |
| Accuracy | 15-30% | 70-85% | 85-95% | 90-95% |
| False positives | Very high | Medium | Low | Low |
| Evasion resistance | None | Medium | High | High |
| Context understanding | None | Limited | Strong | Strong |
| Cost per check | Free | ~$0.001 | ~$0.01 | ~$0.003 avg |
| Multi-language | Per-language rules | Model-dependent | Good | Good |
| Bias risk | Low | Medium-High | Medium | Mitigated |
Best Practices
- Use a multi-stage pipeline — fast filter first, ML second, LLM for uncertain cases only: run a blocklist check (< 1ms) before the ML classifier (20-50ms) before the LLM (500ms+). Because roughly 80% of content is clearly safe and exits at Stage 1, and only 5-10% needs LLM analysis, median latency stays in the low milliseconds and the LLM is reserved for the nuanced cases where context matters.
- Normalize text before classification to defeat evasion, but classify BOTH original and normalized versions — strip zero-width characters, replace Unicode homoglyphs with ASCII equivalents, decode leetspeak, and collapse spaced-out text; take the MAX toxicity score across original and normalized versions so evasion can only increase detection, never decrease it; boost scores when evasion indicators are detected.
- Run client-side models in a Web Worker to avoid blocking the main thread — a TensorFlow.js toxicity model takes ~50ms per inference, enough to cause visible jank if it runs on the main thread; move it to a Web Worker, debounce input by 300ms, and warm up the model on load. This provides instant user feedback ("This comment may violate guidelines") before submission.
- Implement bias mitigation as a core feature, not an afterthought — audit moderation decisions for demographic disparities; apply context overrides for reclaimed terms in appropriate spaces; when dialect markers are detected with medium-confidence toxicity scores, route to human review instead of auto-blocking; publish moderation transparency reports.
- Provide clear moderation feedback with appeal paths — opaque moderation erodes trust. Show users WHY their content was moderated (category, not raw scores), let them edit and resubmit, and provide a one-click appeal; pre-screen appeals with an LLM to auto-approve likely false positives; track appeal overturn rates as a key metric for model quality.
Conclusion
Frontend content moderation requires a multi-layered approach that balances speed, accuracy, and fairness. The pipeline stages content through increasingly sophisticated checks: blocklist matching (< 1ms) catches known bad content, ML classifiers (20-50ms via ONNX or TensorFlow.js) handle the bulk of classification, and LLMs (500ms+) analyze uncertain cases requiring context understanding. Text normalization defeats evasion by stripping zero-width characters, replacing Unicode homoglyphs with ASCII equivalents, and decoding leetspeak — then scoring both original and normalized text and taking the maximum.

Client-side pre-screening in a Web Worker provides instant feedback using a lightweight TensorFlow.js model (~5MB quantized), preventing obviously toxic content from being submitted. Image moderation adds NSFW classification via a MobileNet-based CNN and OCR text extraction to catch text-in-image bypass attempts.

Bias mitigation requires active measures: context-aware overrides for reclaimed terms, dialect-aware scoring adjustments, and systematic auditing of block rates across demographic groups. The appeal flow preserves user trust by explaining moderation reasons, offering edit-and-resubmit, and pre-screening appeals with LLM analysis. Behavioral analysis (rate limiting, duplicate detection, escalation pattern tracking) catches spam and coordinated abuse that content analysis alone misses. The moderation dashboard surfaces real-time metrics (false positive rate, latency, category distribution), the human review queue, and bias audit results — keeping the system accountable and continuously improvable.