AI-Powered Internationalization, Translation, and Localization Automation
Real-World Problem Context
A B2B SaaS product supports 14 languages across 22 markets. The frontend monorepo contains 4,200 translation keys spread across 180 React components. The traditional workflow has five steps: developers write English strings in code, a localization engineer extracts them into JSON files, batches go to a translation vendor (2-3 day turnaround), QC reviewers check translations in context (another day), and finally the translations are merged. With this 4-5 day cycle, features ship English-only for most of a week, creating a two-tier user experience. String extraction is manual and error-prone: developers forget to mark strings for translation (leaving hardcoded text), ICU message format syntax errors appear frequently, and plural rules for languages like Arabic or Polish are often wrong.
The team integrates AI at four points: (1) automatic detection of untranslated strings in JSX at build time, (2) instant machine translation with quality estimation for rapid iteration, (3) context-aware translation that understands UI constraints (button width, placeholder length), and (4) automated quality checks for ICU format, gender agreement, plural rules, and truncation risk. This post covers how each mechanism works internally.
Problem Statements
- Hardcoded String Detection: How do you automatically find strings in JSX/TSX that should be internationalized but aren't? How does AST analysis distinguish UI text from technical strings (CSS class names, event names, API keys)?
- Context-Aware Machine Translation: How does AI translate UI strings while respecting constraints like maximum character length, plural forms, gender agreement, and variable interpolation? How do you maintain terminology consistency across thousands of keys?
- Translation Quality Validation: How do you automatically detect translation errors (broken ICU format, missing interpolation variables, wrong plural categories, text that will overflow UI containers) before they reach production?
Deep Dive: Internal Mechanisms
1. AST-Based Hardcoded String Detection
/*
* Detecting untranslated strings in JSX requires understanding
* what IS a user-visible string vs what is NOT.
*
* User-visible (should be translated):
* - Text content in JSX: <h1>Dashboard</h1>
* - String props: placeholder="Enter name"
* - aria-label="Close dialog"
* - title="Click to expand"
*
* NOT user-visible (should NOT be translated):
* - className="header-title"
* - data-testid="dashboard-heading"
* - event handlers: onClick={...}
* - import paths: import Foo from './Foo'
* - Object keys: { type: 'submit' }
* - Enum values, constants: STATUS.ACTIVE
*
* AST approach:
*
* ┌──────────────────────────────────────────────────┐
* │ Source file (TSX) │
* │ │ │
* │ ▼ │
* │ Parse AST (Babel/TypeScript parser) │
* │ │ │
* │ ▼ │
* │ Walk JSX nodes: │
* │ ├── JSXText: "Dashboard" → CANDIDATE │
* │ ├── JSXAttribute: │
* │ │ ├── name="className" → SKIP │
* │ │ ├── name="placeholder" → CANDIDATE │
* │ │ └── name="data-testid" → SKIP │
* │ └── StringLiteral in JSXExpression: │
* │ ├── in ternary: cond ? "Yes" : "No" → CAND│
* │ └── in template: `${count} items` → CAND │
* │ │ │
* │ ▼ │
* │ Filter candidates: │
* │ - Remove single-word technical strings │
* │ - Remove strings matching known patterns │
* │ - Score remaining by likelihood of being UI text │
* │ │ │
* │ ▼ │
* │ Report: file, line, string, confidence │
* └──────────────────────────────────────────────────┘
*/
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;
// Props that contain user-visible text:
const TRANSLATABLE_PROPS = new Set([
'placeholder', 'title', 'aria-label', 'aria-description',
'aria-placeholder', 'aria-roledescription', 'aria-valuetext',
'alt', 'label', 'helperText', 'errorMessage', 'description',
'tooltip', 'caption', 'heading', 'subheading', 'confirmText',
'cancelText', 'emptyText', 'loadingText',
]);
// Props that should NOT be translated:
const NON_TRANSLATABLE_PROPS = new Set([
'className', 'style', 'id', 'key', 'ref', 'data-testid',
'data-cy', 'name', 'type', 'value', 'href', 'src', 'action',
'method', 'role', 'tabIndex', 'htmlFor',
]);
function detectHardcodedStrings(sourceCode, filePath) {
const ast = parser.parse(sourceCode, {
sourceType: 'module',
plugins: ['jsx', 'typescript'],
});
const candidates = [];
traverse(ast, {
// Direct text in JSX: <div>Hello World</div>
JSXText(path) {
const text = path.node.value.trim();
if (text && !isWhitespaceOnly(text)) {
candidates.push({
type: 'jsx-text',
value: text,
line: path.node.loc.start.line,
confidence: 0.95,
});
}
},
// String attributes: placeholder="Enter email"
JSXAttribute(path) {
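const nameNode = path.node.name;
// Namespaced attributes (e.g. xlink:href) are JSXNamespacedName nodes whose
// .name is itself a JSXIdentifier, so they need separate handling:
const propName = nameNode.type === 'JSXNamespacedName'
? `${nameNode.namespace.name}:${nameNode.name.name}`
: nameNode.name;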
if (NON_TRANSLATABLE_PROPS.has(propName)) return;
const value = path.node.value;
if (value?.type === 'StringLiteral' && value.value.trim()) {
const isKnownTranslatable = TRANSLATABLE_PROPS.has(propName);
candidates.push({
type: 'jsx-attribute',
prop: propName,
value: value.value,
line: value.loc.start.line,
confidence: isKnownTranslatable ? 0.95 : 0.6,
});
}
},
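// String literals in JSX expressions: {condition ? "Yes" : "No"}
StringLiteral(path) {
// Attribute values such as placeholder="..." are also StringLiteral nodes.
// The JSXAttribute visitor already handles them, so skip them here to
// avoid reporting the same string twice:
if (path.parentPath.isJSXAttribute()) return;
if (!isInsideJSX(path)) return;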
if (isInsideImport(path)) return;
if (isObjectKey(path)) return;
const value = path.node.value.trim();
if (value && looksLikeUIText(value)) {
candidates.push({
type: 'jsx-expression-string',
value,
line: path.node.loc.start.line,
confidence: 0.7,
});
}
},
// Template literals: {`Welcome, ${name}!`}
TemplateLiteral(path) {
if (!isInsideJSX(path)) return;
const quasis = path.node.quasis.map(q => q.value.raw).join('{}');
if (looksLikeUIText(quasis)) {
candidates.push({
type: 'template-literal',
value: quasis,
line: path.node.loc.start.line,
confidence: 0.8,
});
}
},
});
// Filter out false positives:
return candidates
.filter(c => c.confidence > 0.5)
.filter(c => !isTechnicalString(c.value))
.sort((a, b) => b.confidence - a.confidence);
}
function looksLikeUIText(str) {
// Heuristics for identifying user-visible text:
if (str.length < 2) return false;
if (str.match(/^[a-z][a-zA-Z]+$/)) return false; // camelCase identifier
if (str.match(/^[A-Z_]+$/)) return false; // CONSTANT_CASE
if (str.match(/^[a-z-]+$/)) return false; // kebab-case (CSS class)
if (str.match(/^(https?:|\/|\.)/)) return false; // URL or path
if (str.match(/^\d+(\.\d+)?$/)) return false; // Number
if (str.includes(' ') || str.match(/^[A-Z]/)) return true; // Contains spaces or starts uppercase
return false;
}
function isTechnicalString(str) {
const technicalPatterns = [
/^#[0-9a-f]{3,8}$/i, // Color hex
/^\d+(\.\d+)?(px|rem|em|%)$/, // CSS units
/^(GET|POST|PUT|DELETE|PATCH)$/,
/^(primary|secondary|default|error|warning|info|success)$/,
/^(sm|md|lg|xl|xxl)$/,
/^(left|right|center|top|bottom)$/,
];
return technicalPatterns.some(p => p.test(str));
}
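The detector above calls four helpers (isWhitespaceOnly, isInsideJSX, isInsideImport, isObjectKey) that the post does not show. The following is a minimal sketch of how they might look, not the original implementation. It assumes only that each path exposes node and parentPath, as Babel's NodePath objects do; the set of ancestor types treated as "inside JSX" is an assumption.

```javascript
// Hypothetical helpers for detectHardcodedStrings. They rely only on the
// `node` and `parentPath` fields that Babel's NodePath objects expose.
const JSX_ANCESTORS = new Set(['JSXElement', 'JSXFragment']);
const IMPORT_ANCESTORS = new Set(['ImportDeclaration']);

// Walk up the parent chain looking for any of the given node types.
function hasAncestorOfType(path, types) {
  for (let p = path.parentPath; p; p = p.parentPath) {
    if (types.has(p.node.type)) return true;
  }
  return false;
}

const isWhitespaceOnly = (text) => /^\s*$/.test(text);
const isInsideJSX = (path) => hasAncestorOfType(path, JSX_ANCESTORS);
const isInsideImport = (path) => hasAncestorOfType(path, IMPORT_ANCESTORS);

// True when the literal is the key in { 'some-key': value }, not the value.
function isObjectKey(path) {
  const parent = path.parentPath && path.parentPath.node;
  return Boolean(parent) &&
    parent.type === 'ObjectProperty' &&
    parent.key === path.node &&
    !parent.computed;
}

// Usage with hand-built stand-ins for NodePath objects:
const jsxElement = { node: { type: 'JSXElement' }, parentPath: null };
const container = {
  node: { type: 'JSXExpressionContainer' },
  parentPath: jsxElement,
};
const literal = {
  node: { type: 'StringLiteral', value: 'Yes' },
  parentPath: container,
};
console.log(isInsideJSX(literal)); // true
```

Because the helpers only read two fields, they can be unit-tested with plain objects like the ones above, without parsing real source files.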
2. Translation Key Extraction and Wrapping
/*
* Once hardcoded strings are detected, the next step
* is to automatically wrap them with i18n function calls.
*
* Before:
* <button>Submit Order</button>
* <input placeholder="Search products..." />
* {count === 0 ? "No results" : `${count} results found`}
*
* After:
* <button>{t('order.submitButton')}</button>
* <input placeholder={t('search.placeholder')} />
* {count === 0 ? t('search.noResults') : t('search.resultsCount', { count })}
*
* The AI generates semantically meaningful key names
* instead of generic keys like "button_1" or "text_42".
*/
async function autoWrapStrings(candidates, filePath, componentName) {
// Group candidates by semantic context:
const grouped = groupByContext(candidates);
// Generate key names using AI:
const keyPrompt = `Generate i18n translation key names for these UI strings.
Component: ${componentName}
File: ${filePath}
Strings to internationalize:
${candidates.map((c, i) => `${i + 1}. "${c.value}" (${c.type}, line ${c.line})`).join('\n')}
Rules:
- Use dot-notation namespacing: component.element.descriptor
- Keys should be descriptive: "userProfile.nameField.placeholder" not "text_1"
- Group related strings under the same namespace
- Use camelCase for the final segment
- Keep keys under 50 characters
- For strings with variables, note the interpolation
Return JSON: [{ "string": "...", "key": "...", "variables": [] }]`;
const keyMappings = JSON.parse(await callLLM(keyPrompt, {
model: 'gpt-4o-mini',
temperature: 0.1,
maxTokens: 500,
}));
// Apply transformations using AST:
return generateCodemod(filePath, keyMappings);
}
function generateCodemod(filePath, keyMappings) {
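// Generate a jscodeshift codemod. The generated transform rewrites string
// literals only: JSXText nodes need their own pass, each component still needs
// a const { t } = useTranslation() call, and isTranslatablePosition must be
// defined alongside the transformer.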
return `
// Auto-generated codemod for ${filePath}
export default function transformer(file, api) {
const j = api.jscodeshift;
const root = j(file.source);
// Ensure i18n import exists:
const hasI18nImport = root.find(j.ImportDeclaration, {
source: { value: 'react-i18next' }
}).length > 0;
if (!hasI18nImport) {
const firstImport = root.find(j.ImportDeclaration).at(0);
firstImport.insertBefore(
j.importDeclaration(
[j.importSpecifier(j.identifier('useTranslation'))],
j.literal('react-i18next')
)
);
}
${keyMappings.map(mapping => `
// "${mapping.string}" → t('${mapping.key}')
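root.find(j.StringLiteral, { value: ${JSON.stringify(mapping.string)} })
.forEach(path => {
if (isTranslatablePosition(path)) {
const call = j.callExpression(
j.identifier('t'),
[j.literal('${mapping.key}')${
mapping.variables?.length
? `, j.objectExpression([${
mapping.variables.map(v =>
`j.property('init', j.identifier('${v}'), j.identifier('${v}'))`
).join(', ')
}])`
: ''
}]
);
// A JSX attribute value cannot be a bare call expression, so wrap it:
// placeholder="..." becomes placeholder={t('...')}
path.replace(
path.parent.node.type === 'JSXAttribute'
? j.jsxExpressionContainer(call)
: call
);
}
});`).join('\n')}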
return root.toSource();
};`;
}
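The key names come back from a model, so it seems prudent to check them against the prompt's own rules before generating a codemod. The sketch below is one possible check; validateKeyMappings and its messages are hypothetical names, and the rules are simply those stated in the prompt above (dot notation, camelCase segments, under 50 characters).

```javascript
// Hypothetical validator for the LLM's key suggestions. The rules mirror
// the prompt: dot-notation, camelCase segments, under 50 characters.
const SEGMENT = /^[a-z][a-zA-Z0-9]*$/;

function validateKeyMappings(mappings, sourceStrings) {
  const problems = [];
  const seenKeys = new Map(); // key -> source string that first used it
  const known = new Set(sourceStrings);

  for (const { string, key } of mappings) {
    // The model may hallucinate strings that were never sent.
    if (!known.has(string)) {
      problems.push(`"${string}" was not in the request`);
      continue;
    }
    const segments = String(key).split('.');
    if (segments.length < 2 || !segments.every((s) => SEGMENT.test(s))) {
      problems.push(`"${key}" is not dot-separated camelCase`);
    }
    if (String(key).length >= 50) {
      problems.push(`"${key}" is 50 characters or longer`);
    }
    // Two different strings must never share one key.
    const previous = seenKeys.get(key);
    if (previous !== undefined && previous !== string) {
      problems.push(`"${key}" is assigned to two different strings`);
    }
    seenKeys.set(key, string);
  }
  return problems;
}

const problems = validateKeyMappings(
  [
    { string: 'Submit Order', key: 'order.submitButton' },
    { string: 'No results', key: 'text_1' },
  ],
  ['Submit Order', 'No results']
);
console.log(problems); // one problem, reported for "text_1"
```

If the list is non-empty, the caller could retry the prompt with the problems appended rather than applying a codemod built on bad keys.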
3. Context-Aware Machine Translation
/*
* Naive translation: send string to translation API, get back text.
* Problem: UI strings need CONTEXT to translate correctly.
*
* Examples:
* - "Save" (button) vs "Save" (noun, as in "a save file")
* German: "Speichern" vs "Speicherstand"
* - "Post" (verb, submit) vs "Post" (noun, blog post)
* Spanish: "Publicar" vs "Publicación"
* - "Close" (verb, close dialog) vs "Close" (adjective, nearby)
* French: "Fermer" vs "Proche"
*
* Context-aware translation includes:
* 1. Component type (button, heading, tooltip, error message)
* 2. Surrounding UI text (what's nearby)
* 3. Screenshot or visual context
* 4. ICU message format details (plurals, gender)
* 5. Max character length for the UI element
*/
async function translateWithContext(entries, targetLocale, glossary) {
// Build context for each entry:
const contextualEntries = entries.map(entry => ({
key: entry.key,
source: entry.value,
context: entry.context || inferContext(entry),
maxLength: entry.maxLength || null,
icuFormat: detectICUFormat(entry.value),
glossaryTerms: findGlossaryMatches(entry.value, glossary),
}));
const prompt = `Translate these UI strings to ${getLanguageName(targetLocale)}.
GLOSSARY (must use these exact translations):
${glossary.filter(g => g.locale === targetLocale).map(g =>
`"${g.source}" → "${g.target}"`
).join('\n')}
STRINGS TO TRANSLATE:
${contextualEntries.map((e, i) => `
${i + 1}. Key: ${e.key}
English: "${e.source}"
Context: ${e.context}
${e.maxLength ? `Max length: ${e.maxLength} characters` : ''}
${e.icuFormat ? `ICU format: ${e.icuFormat}` : ''}
${e.glossaryTerms.length ? `Contains glossary terms: ${e.glossaryTerms.map(g => g.source).join(', ')}` : ''}
`).join('\n')}
RULES:
1. Preserve all ICU syntax exactly: {variable}, {count, plural, ...}, {gender, select, ...}
2. Preserve all HTML tags: <b>, <a>, <br/>
3. Do NOT translate variable names inside {}
4. Respect max character length — use abbreviations if needed
5. Use the glossary translations exactly as given
6. For plural forms, provide ALL required categories for ${targetLocale}
7. Match formality level: ${getFormalityForLocale(targetLocale)}
Return JSON array: [{ "key": "...", "translation": "...", "notes": "..." }]`;
const translations = JSON.parse(await callLLM(prompt, {
model: 'gpt-4o',
temperature: 0.1,
maxTokens: 2000,
}));
// Validate each translation:
return translations.map(t => ({
...t,
validation: validateTranslation(
contextualEntries.find(e => e.key === t.key),
t.translation,
targetLocale
),
}));
}
function inferContext(entry) {
// Infer context from the translation key:
const parts = entry.key.split('.');
const contexts = [];
if (parts.some(p => p.match(/button|btn|cta/i))) contexts.push('button text');
if (parts.some(p => p.match(/title|heading|header/i))) contexts.push('heading/title');
if (parts.some(p => p.match(/placeholder/i))) contexts.push('input placeholder');
if (parts.some(p => p.match(/error|validation/i))) contexts.push('error message');
if (parts.some(p => p.match(/tooltip|hint/i))) contexts.push('tooltip');
if (parts.some(p => p.match(/label/i))) contexts.push('form label');
if (parts.some(p => p.match(/confirm|dialog|modal/i))) contexts.push('dialog text');
return contexts.join(', ') || 'general UI text';
}
function detectICUFormat(str) {
if (str.includes('{') && str.includes(', plural,')) return 'plural';
if (str.includes('{') && str.includes(', select,')) return 'select';
if (str.includes('{') && str.includes(', selectordinal,')) return 'ordinal';
if (str.match(/\{\s*\w+\s*\}/)) return 'interpolation'; // also matches {user_name}, {item2}
return null;
}
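translateWithContext passes the model's reply straight to JSON.parse, which throws if the reply is wrapped in a code fence or preceded by a sentence of prose. The sketch below shows one defensive way to parse it; parseLLMJsonArray and pickValidTranslations are hypothetical helpers, not part of the original pipeline.

```javascript
// Hypothetical helper: parse a JSON array from raw model output, tolerating
// a Markdown code fence or surrounding prose around the array.
function parseLLMJsonArray(raw) {
  const text = String(raw).trim();
  const start = text.indexOf('[');
  const end = text.lastIndexOf(']');
  if (start === -1 || end <= start) {
    throw new Error('No JSON array found in model output');
  }
  const parsed = JSON.parse(text.slice(start, end + 1));
  if (!Array.isArray(parsed)) throw new Error('Expected a JSON array');
  return parsed;
}

// Keep only well-formed entries for keys that were actually requested.
function pickValidTranslations(parsed, requestedKeys) {
  const wanted = new Set(requestedKeys);
  return parsed.filter(
    (t) => t && wanted.has(t.key) && typeof t.translation === 'string'
  );
}

// Simulated model reply wrapped in a Markdown fence:
const fence = '`'.repeat(3);
const raw =
  fence + 'json\n[{"key":"save.button","translation":"Speichern"}]\n' + fence;
const result = pickValidTranslations(parseLLMJsonArray(raw), ['save.button']);
console.log(result[0].translation); // Speichern
```

Comparing the returned keys against the requested ones also reveals entries the model silently dropped, which can then be re-queued.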
4. ICU Message Format Validation
/*
* ICU MessageFormat is the standard for i18n strings
* with variables, plurals, and selections.
*
* Common bugs AI introduces or developers write:
*
* 1. Missing plural categories:
* English needs: one, other
* Arabic needs: zero, one, two, few, many, other
* Polish needs: one, few, many, other
*
* 2. Broken syntax:
* "{count, plural, one {# item} other {# items}" ← missing closing }
*
* 3. Missing variables:
* Source: "Hello, {name}!"
* Translation: "Bonjour!" ← lost {name}
*
* 4. Wrong nesting:
* "{gender, select, male {He} female {She}} has {count, plural, ...}"
* Nesting these correctly is hard for both humans and AI.
*/
const { parse } = require('@formatjs/icu-messageformat-parser');
// Plural categories required by each locale (CLDR data):
const PLURAL_RULES = {
en: ['one', 'other'],
ar: ['zero', 'one', 'two', 'few', 'many', 'other'],
pl: ['one', 'few', 'many', 'other'],
fr: ['one', 'many', 'other'],
ja: ['other'],
zh: ['other'],
ru: ['one', 'few', 'many', 'other'],
de: ['one', 'other'],
cs: ['one', 'few', 'many', 'other'],
// ... loaded from CLDR
};
function validateICUMessage(source, translation, targetLocale) {
const errors = [];
// 1. Parse both strings:
let sourceAST, translationAST;
try {
sourceAST = parse(source);
} catch (e) {
errors.push({ type: 'source-parse-error', message: e.message });
return errors; // Can't validate further if source is broken
}
try {
translationAST = parse(translation);
} catch (e) {
errors.push({
type: 'translation-parse-error',
message: `ICU syntax error in translation: ${e.message}`,
severity: 'error',
});
return errors;
}
// 2. Check variable names match:
const sourceVars = extractVariables(sourceAST);
const translationVars = extractVariables(translationAST);
for (const v of sourceVars) {
if (!translationVars.has(v)) {
errors.push({
type: 'missing-variable',
message: `Variable {${v}} present in source but missing in translation`,
severity: 'error',
});
}
}
for (const v of translationVars) {
if (!sourceVars.has(v)) {
errors.push({
type: 'extra-variable',
message: `Variable {${v}} in translation not present in source`,
severity: 'warning',
});
}
}
// 3. Check plural categories:
const translationPlurals = extractPluralNodes(translationAST);
const requiredCategories = PLURAL_RULES[targetLocale] || PLURAL_RULES.en;
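for (const plural of translationPlurals) {
// In the @formatjs AST, plural.options is an object keyed by category
// (not an array) and plural.value holds the variable name. This assumes
// extractPluralNodes returns the parser's plural elements unchanged:
const providedCategories = Object.keys(plural.options);
for (const required of requiredCategories) {
if (!providedCategories.includes(required)) {
errors.push({
type: 'missing-plural-category',
message: `Plural for {${plural.value}} missing required category "${required}" for locale ${targetLocale}`,
severity: 'error',
});
}
}
}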
// 4. Check HTML tag balance:
const sourceTags = extractHTMLTags(source);
const translationTags = extractHTMLTags(translation);
if (!arraysEqual(sourceTags, translationTags)) {
errors.push({
type: 'html-tag-mismatch',
message: `HTML tags differ: source has ${sourceTags.join(',')} but translation has ${translationTags.join(',')}`,
severity: 'error',
});
}
return errors;
}
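// TYPE is the parser's element-type enum (argument = 1, select = 5, plural = 6, ...):
const { TYPE } = require('@formatjs/icu-messageformat-parser');
function extractVariables(ast) {
  const vars = new Set();
  function walk(nodes) {
    for (const node of nodes) {
      switch (node.type) {
        case TYPE.argument: // {name}
        case TYPE.number:   // {price, number, ...}
        case TYPE.date:     // {when, date, ...}
        case TYPE.time:     // {when, time, ...}
          vars.add(node.value);
          break;
        case TYPE.select:
        case TYPE.plural:
          vars.add(node.value);
          for (const option of Object.values(node.options)) {
            walk(option.value);
          }
          break;
        case TYPE.tag: // <b>...</b> can contain variables too
          walk(node.children);
          break;
      }
    }
  }
  walk(ast);
  return vars;
}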
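The HTML tag check in validateICUMessage uses extractHTMLTags and arraysEqual, which are not defined in the post. A minimal regex-based sketch follows. Sorting the tag list is a design choice made here (an assumption, not the original behaviour) so that a translation which legitimately reorders tags is still accepted.

```javascript
// Hypothetical helpers: collect tag names with a regex, then compare them
// as sorted lists so that reordered tags are accepted.
function extractHTMLTags(message) {
  const tags = [];
  // Captures: optional leading slash, tag name, optional trailing slash.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  for (const match of message.matchAll(tagPattern)) {
    const [, closing, name, selfClosing] = match;
    const tag = name.toLowerCase();
    if (selfClosing) tags.push(`<${tag}/>`);
    else tags.push(closing ? `</${tag}>` : `<${tag}>`);
  }
  return tags.sort();
}

function arraysEqual(a, b) {
  return a.length === b.length && a.every((item, i) => item === b[i]);
}

const sourceTags = extractHTMLTags('Read the <b>terms</b> first');
const brokenTags = extractHTMLTags('Lisez les <b>conditions');
console.log(arraysEqual(sourceTags, brokenTags)); // false
```

Attributes are ignored on purpose: a translator may change an href or title, but the set of tags should stay the same.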
5. Character Length and UI Overflow Detection
/*
* German is ~30% longer than English.
* Finnish can be ~40% longer.
* Chinese/Japanese are typically shorter.
*
* A button that fits "Submit" may overflow with "Absenden" (German)
* or "Vahvista lähetys" (Finnish).
*
* ┌────────────────────────────────────────────────────┐
* │ Approaches to length validation: │
* │ │
* │ 1. Character count ratio (rough): │
* │ If translation > source * expansionFactor → warn │
* │ │
* │ 2. Text measurement (precise): │
* │ Use Canvas API to measure pixel width │
* │ in the actual font used by the component │
* │ │
* │ 3. Visual regression (most accurate): │
* │ Render component with translations │
* │ Check for overflow, truncation, wrapping │
* └────────────────────────────────────────────────────┘
*/
// Expansion factors by locale (approximate):
const EXPANSION_FACTORS = {
de: 1.35, // German
fr: 1.30, // French
es: 1.25, // Spanish
it: 1.25, // Italian
pt: 1.25, // Portuguese
nl: 1.30, // Dutch
fi: 1.40, // Finnish
sv: 1.25, // Swedish
ja: 0.60, // Japanese (shorter)
zh: 0.60, // Chinese (shorter)
ko: 0.70, // Korean (shorter)
ar: 1.25, // Arabic
ru: 1.30, // Russian
pl: 1.30, // Polish
};
function checkTranslationLength(source, translation, targetLocale, maxLength) {
const warnings = [];
const factor = EXPANSION_FACTORS[targetLocale] || 1.2;
// 1. Basic character count check:
if (maxLength && translation.length > maxLength) {
warnings.push({
type: 'exceeds-max-length',
severity: 'error',
message: `Translation is ${translation.length} chars, max is ${maxLength}`,
suggestion: `Shorten to ${maxLength} chars or use abbreviation`,
});
}
// 2. Expansion ratio check:
const expectedMax = Math.ceil(source.length * factor * 1.1); // 10% tolerance
if (source.length > 0 && translation.length > expectedMax) { // guard against division by zero below
warnings.push({
type: 'unusual-expansion',
severity: 'warning',
message: `Translation (${translation.length} chars) is ${
Math.round(translation.length / source.length * 100)}% of source (${source.length} chars). ` +
`Expected ~${Math.round(factor * 100)}% for ${targetLocale}`,
});
}
// 3. Pixel width is checked separately by validateTranslationFit() below.
return warnings;
}
// Canvas-based pixel width measurement (browser only: in Node or CI this
// needs a headless browser or a canvas implementation):
function measureTextWidth(text, font) {
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.font = font;
return ctx.measureText(text).width;
}
async function validateTranslationFit(translations, componentSpecs) {
const overflows = [];
for (const [key, locales] of Object.entries(translations)) {
const spec = componentSpecs[key];
if (!spec) continue;
for (const [locale, text] of Object.entries(locales)) {
const width = measureTextWidth(text, spec.font);
if (width > spec.maxWidth) {
overflows.push({
key,
locale,
text,
measuredWidth: Math.round(width),
maxWidth: spec.maxWidth,
overflowPx: Math.round(width - spec.maxWidth),
});
}
}
}
return overflows;
}
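One caveat for the character-count checks: translation.length counts UTF-16 code units, so accented characters written with combining marks, emoji, and some scripts are over-counted. A possible refinement, sketched here with the built-in Intl.Segmenter, counts user-perceived characters instead. graphemeLength and exceedsMaxLength are hypothetical names.

```javascript
// Count user-perceived characters (grapheme clusters) instead of UTF-16
// code units. Intl.Segmenter is built into Node 16+ and modern browsers.
function graphemeLength(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count += 1;
  return count;
}

// Drop-in alternative to `translation.length > maxLength`.
function exceedsMaxLength(translation, maxLength, locale) {
  return graphemeLength(translation, locale) > maxLength;
}

const decomposed = 'Cafe\u0301'; // "Café" written with a combining accent
console.log(decomposed.length);          // 5
console.log(graphemeLength(decomposed)); // 4
```

Grapheme counts are still only a proxy for rendered width, so the pixel measurement above remains the more reliable check for tight layouts.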
6. Glossary and Terminology Consistency
/*
* Glossary enforcement ensures consistent translation
* of product-specific terms across all strings.
*
* Example glossary for a project management tool:
*
* | English | German | Japanese |
* |-------------|----------------|---------------|
* | Sprint | Sprint | スプリント |
* | Backlog | Backlog | バックログ |
* | Story points | Story Points | ストーリーポイント |
* | Assignee | Zuständiger | 担当者 |
* | Due date | Fälligkeitsdatum| 期日 |
*
* Without glossary enforcement, AI might translate:
* - "Sprint" → "Rennen" (German for "race")
* - "Backlog" → "Rückstand" (German for "arrears")
*/
class GlossaryEnforcer {
constructor(glossaryEntries) {
// Index by locale and source term:
this.byLocale = {};
for (const entry of glossaryEntries) {
if (!this.byLocale[entry.locale]) {
this.byLocale[entry.locale] = new Map();
}
this.byLocale[entry.locale].set(
entry.source.toLowerCase(),
entry
);
}
}
// Check if a translation uses correct glossary terms:
validate(source, translation, locale) {
const localeGlossary = this.byLocale[locale];
if (!localeGlossary) return [];
const violations = [];
for (const [sourceTerm, entry] of localeGlossary) {
// Check if source contains this term:
if (source.toLowerCase().includes(sourceTerm)) {
// Check if translation uses the correct term:
if (!translation.includes(entry.target)) {
violations.push({
type: 'glossary-violation',
severity: 'error',
sourceTerm: entry.source,
expectedTranslation: entry.target,
message: `"${entry.source}" should be translated as "${entry.target}" but was not found in translation`,
});
}
}
}
return violations;
}
// Build glossary instruction for AI translation prompt:
getPromptInstructions(locale, sourceTexts) {
const localeGlossary = this.byLocale[locale];
if (!localeGlossary) return '';
// Only include glossary terms that appear in the source texts:
const relevantTerms = [];
const allSource = sourceTexts.join(' ').toLowerCase();
for (const [sourceTerm, entry] of localeGlossary) {
if (allSource.includes(sourceTerm)) {
relevantTerms.push(entry);
}
}
if (relevantTerms.length === 0) return '';
return `MANDATORY GLOSSARY (use these exact translations):
${relevantTerms.map(t => ` "${t.source}" → "${t.target}"${t.notes ? ` (${t.notes})` : ''}`).join('\n')}`;
}
}
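GlossaryEnforcer matches terms with a plain substring test, so the glossary entry "sprint" also fires on "Sprinting". One way to tighten this, sketched below under the assumption that glossary terms should only match as whole words, is a Unicode-aware boundary check. containsTerm is a hypothetical helper that could stand in for the includes() calls.

```javascript
// Hypothetical whole-term matcher. \p{L} and \p{N} need the "u" flag;
// lookbehind is supported in current Node.js and evergreen browsers.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function containsTerm(text, term) {
  // The term must not be preceded or followed by a letter or digit.
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(term)}(?![\\p{L}\\p{N}])`,
    'iu'
  );
  return pattern.test(text);
}

console.log(containsTerm('Start the next sprint', 'Sprint'));   // true
console.log(containsTerm('Sprinting is tiring', 'sprint'));     // false
console.log(containsTerm('Nächster Sprint: 3 Tage', 'Sprint')); // true
```

The trade-off is that inflected forms ("sprints", or a declined German noun) no longer match, so glossaries for inflecting languages would need the accepted variants listed explicitly.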
7. Plural Rule Generation
/*
* Pluralization is one of the hardest i18n problems.
* English has 2 forms: one, other.
* Arabic has 6: zero, one, two, few, many, other.
*
* AI must generate ALL required forms for each locale.
*
* English source:
* "{count, plural, one {# item} other {# items}}"
*
* Arabic translation needs:
* "{count, plural,
* zero {لا عناصر}
* one {عنصر واحد}
* two {عنصران}
* few {# عناصر}
* many {# عنصرًا}
* other {# عنصر}}"
*
* The AI prompt must specify which categories are needed
* AND give examples of which numbers map to which category.
*/
// CLDR plural examples for AI context:
const PLURAL_EXAMPLES = {
ar: {
zero: '0',
one: '1',
two: '2',
few: 'n mod 100 is 3-10 (e.g., 3, 4, 10, 103, 110)',
many: 'n mod 100 is 11-99 (e.g., 11, 12, 25, 99, 111)',
other: 'everything else (e.g., 100, 101, 102, 200, 1000) and fractions',
},
pl: {
one: '1',
few: '2-4, 22-24, 32-34... (ends in 2-4, not 12-14)',
many: '0, 5-21, 25-31... (everything else)',
other: 'fractional numbers (1.5, 2.3)',
},
ru: {
one: '1, 21, 31, 41... (ends in 1, not 11)',
few: '2-4, 22-24, 32-34... (ends in 2-4, not 12-14)',
many: '0, 5-20, 25-30... (ends in 0,5-9 or 11-14)',
other: 'fractional numbers',
},
};
async function generatePluralTranslation(sourceICU, targetLocale, context) {
const requiredCategories = PLURAL_RULES[targetLocale] || ['other'];
const examples = PLURAL_EXAMPLES[targetLocale] || {};
const prompt = `Translate this ICU plural message to ${getLanguageName(targetLocale)}.
SOURCE (English):
${sourceICU}
CONTEXT: ${context}
REQUIRED PLURAL CATEGORIES for ${targetLocale}:
${requiredCategories.map(cat =>
`- ${cat}: used when count is ${examples[cat] || '(standard CLDR rules)'}`
).join('\n')}
RULES:
1. Include ALL ${requiredCategories.length} plural categories listed above
2. Use # as the number placeholder (replaced at runtime)
3. Each form should be grammatically correct for that count
4. Maintain ICU MessageFormat syntax exactly
5. Do NOT add categories not listed above
6. For the "other" category, use the most generic/default form
Return ONLY the ICU message string, nothing else.`;
const translation = await callLLM(prompt, {
model: 'gpt-4o',
temperature: 0.1,
maxTokens: 300,
});
// Validate the returned ICU has all required categories:
const parsed = parse(translation.trim());
const providedCategories = extractPluralCategories(parsed);
const missing = requiredCategories.filter(c => !providedCategories.includes(c));
if (missing.length > 0) {
// Retry with explicit correction:
return await retryWithCorrection(translation, missing, targetLocale);
}
return translation.trim();
}
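The hand-maintained PLURAL_RULES and PLURAL_EXAMPLES tables can drift from CLDR. As a cross-check, the runtime's own CLDR data is available through the built-in Intl.PluralRules. The sketch below derives the required categories and one sample integer per category; the function names are hypothetical, and results depend on the ICU data shipped with the runtime.

```javascript
// Derive required plural categories from the runtime's CLDR data.
function requiredPluralCategories(locale) {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
}

// Find one sample integer per category by probing select(). Categories
// that apply only to fractions (e.g. "other" in Polish) get no sample.
function sampleCountsByCategory(locale, limit = 200) {
  const rules = new Intl.PluralRules(locale);
  const samples = {};
  for (let n = 0; n <= limit; n += 1) {
    const category = rules.select(n);
    if (!(category in samples)) samples[category] = n;
  }
  return samples;
}

console.log(requiredPluralCategories('en'));
console.log(sampleCountsByCategory('ar'));
```

The samples can be interpolated into the prompt in place of the static examples, and a CI test can assert that PLURAL_RULES agrees with requiredPluralCategories for every supported locale.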
8. Translation Memory and Fuzzy Matching
/*
* Translation Memory (TM) stores previously translated strings.
* Before sending a string to AI, check if we already have
* a translation — or a SIMILAR string that was translated.
*
* This reduces:
* - Cost (fewer AI API calls)
* - Inconsistency (same string always gets same translation)
* - Turnaround time
*
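* Fuzzy matching handles cases like:
* Previously translated: "Save changes"
* New string: "Save all changes"
* → close match: reuse and adapt the previous translation.
* Similarity here is cosine similarity between embeddings, so the
* thresholds used below (0.85 to consider, 0.95 to adapt) are not the
* edit-distance percentages that traditional TM tools report.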
*/
class TranslationMemory {
constructor() {
this.entries = new Map(); // key: source -> Map<locale, translation>
this.embeddings = new Map(); // key: source -> embedding vector
}
async addEntry(source, locale, translation) {
if (!this.entries.has(source)) {
this.entries.set(source, new Map());
// Generate embedding for fuzzy matching:
this.embeddings.set(source, await getEmbedding(source));
}
this.entries.get(source).set(locale, translation);
}
// Exact match:
getExact(source, locale) {
return this.entries.get(source)?.get(locale) ?? null; // ?? keeps an intentionally empty translation
}
// Fuzzy match:
async getFuzzy(source, locale, threshold = 0.85) {
const sourceEmbedding = await getEmbedding(source);
let bestMatch = null;
let bestSimilarity = 0;
for (const [existingSource, embedding] of this.embeddings) {
const similarity = cosineSimilarity(sourceEmbedding, embedding);
if (similarity > bestSimilarity && similarity >= threshold) {
const translation = this.entries.get(existingSource)?.get(locale);
if (translation) {
bestSimilarity = similarity;
bestMatch = {
source: existingSource,
translation,
similarity,
};
}
}
}
return bestMatch;
}
// Translate with TM priority:
async translate(source, locale) {
// 1. Check exact match:
const exact = this.getExact(source, locale);
if (exact) {
return { translation: exact, source: 'tm-exact', confidence: 1.0 };
}
// 2. Check fuzzy match:
const fuzzy = await this.getFuzzy(source, locale);
if (fuzzy && fuzzy.similarity >= 0.95) {
// Very high similarity — adapt the existing translation:
const adapted = await adaptTranslation(
fuzzy.source, fuzzy.translation, source, locale
);
return { translation: adapted, source: 'tm-fuzzy', confidence: fuzzy.similarity };
}
// 3. Fall back to AI translation:
const aiTranslation = await translateWithAI(source, locale);
// 4. Store in TM for future use:
await this.addEntry(source, locale, aiTranslation);
return { translation: aiTranslation, source: 'ai', confidence: 0.7 };
}
}
async function adaptTranslation(originalSource, existingTranslation, newSource, locale) {
const prompt = `Adapt this existing translation for a slightly different source string.
Original English: "${originalSource}"
Existing ${getLanguageName(locale)} translation: "${existingTranslation}"
New English string: "${newSource}"
Adapt the existing translation to match the new source.
Preserve the translation style and terminology.
Return only the adapted translation.`;
return await callLLM(prompt, {
model: 'gpt-4o-mini',
temperature: 0.1,
maxTokens: 100,
});
}
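getFuzzy depends on a cosineSimilarity function that is not shown. A straightforward version, assuming embeddings arrive as plain numeric arrays of equal length, might look like this:

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; zero vectors are treated as similarity 0.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([3, 4], [3, 4])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

With thousands of entries, the linear scan in getFuzzy calls this once per stored string; a vector index would be the usual next step if that becomes a bottleneck.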
9. CI/CD Integration for Translation Quality
/*
* Translation checks run in CI to prevent broken
* translations from reaching production.
*
* ┌──────────────────────────────────────────────────┐
* │ PR Pipeline: │
* │ │
* │ 1. Detect changed translation files │
* │ (git diff --name-only | grep locales/) │
* │ │ │
* │ ▼ │
* │ 2. Validate ICU syntax for ALL changed keys │
* │ │ │
* │ ▼ │
* │ 3. Check variable consistency across locales │
* │ │ │
* │ ▼ │
* │ 4. Check plural categories for each locale │
* │ │ │
* │ ▼ │
* │ 5. Check glossary compliance │
* │ │ │
* │ ▼ │
* │ 6. Estimate overflow risk for length-sensitive │
* │ UI elements │
* │ │ │
* │ ▼ │
* │ 7. Report: pass/fail with detailed errors │
* └──────────────────────────────────────────────────┘
*/
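// CI script: validate-translations.js
// parse comes from '@formatjs/icu-messageformat-parser' (see section 4).
const fs = require('fs/promises');
const path = require('path');
// changedFiles is accepted for future incremental checks; this version
// validates every locale file on each run.
async function validateTranslations(changedFiles) {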
const errors = [];
const warnings = [];
// Load all locale files:
const localeDir = 'src/locales';
const locales = await fs.readdir(localeDir);
const translations = {};
for (const locale of locales) {
const filePath = path.join(localeDir, locale, 'messages.json');
translations[locale] = JSON.parse(await fs.readFile(filePath, 'utf-8'));
}
const sourceLocale = 'en';
const sourceKeys = Object.keys(translations[sourceLocale]);
// 1. Check for missing keys in each locale:
for (const locale of locales) {
if (locale === sourceLocale) continue;
for (const key of sourceKeys) {
if (!(key in translations[locale])) {
warnings.push({
locale,
key,
type: 'missing-key',
message: `Key "${key}" exists in ${sourceLocale} but not in ${locale}`,
});
}
}
}
// 2. Validate ICU syntax for all translations:
for (const locale of locales) {
for (const [key, value] of Object.entries(translations[locale])) {
try {
parse(value);
} catch (e) {
errors.push({
locale,
key,
type: 'icu-parse-error',
message: `ICU syntax error: ${e.message}`,
value,
});
}
}
}
// 3. Check variable consistency:
for (const key of sourceKeys) {
const sourceVars = extractVariableNames(translations[sourceLocale][key]);
for (const locale of locales) {
if (locale === sourceLocale || !translations[locale][key]) continue;
const localeVars = extractVariableNames(translations[locale][key]);
const missingVars = sourceVars.filter(v => !localeVars.includes(v));
const extraVars = localeVars.filter(v => !sourceVars.includes(v));
if (missingVars.length > 0) {
errors.push({
locale, key,
type: 'missing-variables',
message: `Missing variables: ${missingVars.join(', ')}`,
});
}
if (extraVars.length > 0) {
warnings.push({
locale, key,
type: 'extra-variables',
message: `Extra variables: ${extraVars.join(', ')}`,
});
}
}
}
// 4. Check plural categories:
for (const locale of locales) {
if (locale === sourceLocale) continue;
for (const [key, value] of Object.entries(translations[locale])) {
const pluralErrors = validatePluralCategories(value, locale);
errors.push(...pluralErrors.map(e => ({ ...e, locale, key })));
}
}
// Output report:
console.log(`\n=== Translation Validation Report ===`);
console.log(`Errors: ${errors.length} | Warnings: ${warnings.length}`);
// Print warnings first so they are not lost when errors end the process:
if (warnings.length > 0) {
console.warn('\nWARNINGS:');
warnings.forEach(w => console.warn(` [${w.locale}] ${w.key}: ${w.message}`));
}
if (errors.length > 0) {
console.error('\nERRORS (must fix):');
errors.forEach(e => console.error(` [${e.locale}] ${e.key}: ${e.message}`));
process.exit(1);
}
}
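The CI script assumes each messages.json is a flat object of key to string. Many projects nest their messages by namespace, in which case parse(value) would receive an object. A small normalization step, sketched here as a hypothetical flattenMessages helper, could run right after each file is loaded:

```javascript
// Flatten nested message objects into dot-separated keys so that the
// validators always see { "order.submitButton": "Submit Order", ... }.
function flattenMessages(messages, prefix = '') {
  const flat = {};
  for (const [key, value] of Object.entries(messages)) {
    const fullKey = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(flat, flattenMessages(value, fullKey));
    } else if (typeof value === 'string') {
      flat[fullKey] = value;
    }
    // Non-string leaves (numbers, arrays, null) are skipped on purpose.
  }
  return flat;
}

const flatMessages = flattenMessages({
  order: { submitButton: 'Submit Order', cancel: 'Cancel' },
  title: 'Dashboard',
});
console.log(Object.keys(flatMessages));
// [ 'order.submitButton', 'order.cancel', 'title' ]
```

Already-flat files pass through unchanged, so the step is safe to apply to every locale.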
10. Right-to-Left (RTL) Layout Validation
/*
* RTL languages (Arabic, Hebrew, Persian, Urdu) require:
* - Mirrored layouts (sidebar on right, text right-aligned)
* - Bidirectional text handling (mixed LTR/RTL in one string)
* - Logical CSS properties (margin-inline-start vs margin-left)
* - Icon mirroring (arrows, navigation icons)
*
* AI can detect RTL issues by:
* 1. Checking CSS for physical properties (left/right) instead of logical
* 2. Detecting bidirectional text mixing issues
* 3. Identifying icons that need mirroring
*/
// CSS RTL lint rules:
const PHYSICAL_TO_LOGICAL = {
'margin-left': 'margin-inline-start',
'margin-right': 'margin-inline-end',
'padding-left': 'padding-inline-start',
'padding-right': 'padding-inline-end',
'border-left': 'border-inline-start',
'border-right': 'border-inline-end',
'left': 'inset-inline-start',
'right': 'inset-inline-end',
'text-align: left': 'text-align: start',
'text-align: right': 'text-align: end',
'float: left': 'float: inline-start',
'float: right': 'float: inline-end',
};
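function lintCSSForRTL(cssContent, filePath) {
  const issues = [];
  const lines = cssContent.split('\n');
  lines.forEach((line, index) => {
    if (line.includes('/*rtl:ignore*/')) return;
    for (const [physical, logical] of Object.entries(PHYSICAL_TO_LOGICAL)) {
      // Match whole declarations only: a plain substring test would also flag
      // "left" inside "margin-left" or "text-align: left" and report duplicates.
      const [prop, value] = physical.split(':').map(part => part.trim());
      const pattern = new RegExp(`(^|[;{\\s])${prop}\\s*:\\s*${value ? `${value}\\b` : ''}`);
      if (pattern.test(line)) {
        issues.push({
          file: filePath,
          line: index + 1,
          physical,
          logical,
          severity: 'warning',
          message: `Use logical property "${logical}" instead of "${physical}" for RTL support`,
        });
      }
    }
  });
  return issues;
}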
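The lint rule only reports physical properties. A small autofix can complement it, and the following is a minimal sketch under stated assumptions: `toLogicalCSS` and its two mapping tables are hypothetical names, the mapping is deliberately incomplete, and the regex handles plain `property: value` declarations rather than full CSS syntax. Note that rewriting `left`/`right` offsets mirrors the element in RTL, which is usually the intent but not always.

```javascript
// Assumption: a deliberately small mapping; extend it to match the lint table.
const LOGICAL_PROPERTY = new Map([
  ['margin-left', 'margin-inline-start'],
  ['margin-right', 'margin-inline-end'],
  ['padding-left', 'padding-inline-start'],
  ['padding-right', 'padding-inline-end'],
  ['left', 'inset-inline-start'],
  ['right', 'inset-inline-end'],
]);

// Properties whose physical *value* (not name) needs replacing.
const LOGICAL_VALUE = new Map([
  ['text-align', new Map([['left', 'start'], ['right', 'end']])],
  ['float', new Map([['left', 'inline-start'], ['right', 'inline-end']])],
]);

// Hypothetical autofix: rewrite simple `property: value` declarations in a CSS string.
function toLogicalCSS(css) {
  // The property must start a declaration, so "left" never matches inside "margin-left".
  const declaration = /(^|[;{\s])([a-z-]+)(\s*:\s*)([^;{}]+)/g;
  return css.replace(declaration, (match, lead, prop, sep, value) => {
    if (LOGICAL_PROPERTY.has(prop)) {
      return `${lead}${LOGICAL_PROPERTY.get(prop)}${sep}${value}`;
    }
    const keyword = value.trim();
    const logicalKeyword = LOGICAL_VALUE.get(prop)?.get(keyword);
    if (logicalKeyword) {
      return `${lead}${prop}${sep}${value.replace(keyword, logicalKeyword)}`;
    }
    return match; // anything unrecognized is left untouched
  });
}
```

Values with extra tokens, such as `text-align: left !important`, are left unchanged by this sketch and would still need a manual fix.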
// Bidirectional text validation:
function validateBidiText(translation, locale) {
  const issues = [];
  // Compare the base language so regional tags such as "ar-SA" are covered too.
  const isRTL = ['ar', 'he', 'fa', 'ur'].includes(locale.split(/[-_]/)[0]);
  if (!isRTL) return issues;
  // Directional markers and isolates already present in the translation:
  const hasLRM = translation.includes('\u200E'); // Left-to-right mark
  const hasRLM = translation.includes('\u200F'); // Right-to-left mark
  const hasBidiIsolate = translation.includes('\u2066') || translation.includes('\u2068');
  // Check for LTR content embedded in RTL text without proper markers.
  // A single LTR run is tolerated; two or more are flagged.
  const ltrPattern = /[a-zA-Z0-9]{3,}/g;
  const matches = [...translation.matchAll(ltrPattern)];
  if (matches.length > 1 && !hasLRM && !hasRLM && !hasBidiIsolate) {
    issues.push({
      type: 'bidi-mixed-content',
      severity: 'warning',
      message: `RTL text contains ${matches.length} LTR segments without directional markers. ` +
        `Consider adding Unicode bidi isolates to prevent display issues.`,
      ltrSegments: matches.map(m => m[0]),
    });
  }
  // Check for numbers with units (common RTL display bug), unless already isolated:
  const numberUnitPattern = /\d+\s*(px|rem|em|%|MB|GB|KB)/g;
  const numberMatches = [...translation.matchAll(numberUnitPattern)];
  if (numberMatches.length > 0 && !hasBidiIsolate) {
    issues.push({
      type: 'bidi-number-unit',
      severity: 'info',
      message: `RTL text contains number+unit patterns that may display incorrectly. ` +
        `Wrap with Unicode first strong isolate (U+2068...U+2069).`,
      patterns: numberMatches.map(m => m[0]),
    });
  }
  return issues;
}
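The validator recommends wrapping embedded LTR content in isolates (U+2068...U+2069). As a hedged illustration of what that wrapping could look like, the sketch below uses a hypothetical `isolateLtrSegments` helper. It assumes only simple `{placeholder}` arguments (nested ICU plural or select blocks are out of scope) and treats any run of Latin letters and digits as LTR content.

```javascript
const FSI = '\u2068'; // First Strong Isolate
const PDI = '\u2069'; // Pop Directional Isolate

// Either a simple {placeholder} (left untouched) or a run of Latin letters/digits,
// optionally continued across spaces, dots, underscores, or hyphens
// ("Google Drive", "10 MB", "50%").
const SEGMENT = /\{[^{}]*\}|[A-Za-z0-9][A-Za-z0-9%]*(?:[ ._-]+[A-Za-z0-9][A-Za-z0-9%]*)*/g;

// Hypothetical helper: wrap each LTR run in FSI...PDI so it cannot reorder its neighbors.
function isolateLtrSegments(text) {
  // Assumption: text that already contains isolates was handled deliberately.
  if (/[\u2066-\u2069]/.test(text)) return text;
  return text.replace(SEGMENT, segment =>
    segment.startsWith('{') ? segment : `${FSI}${segment}${PDI}`
  );
}
```

Running translations through a helper like this before validation would make the `bidi-mixed-content` and `bidi-number-unit` findings rarer, at the cost of invisible characters in the translation files, so it may be preferable to apply it at render time instead.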
// Rendered-layout RTL validation using a headless browser (Playwright):
async function validateRTLLayout(componentUrl, locales) {
  // Launch a browser here so the function does not rely on an outer `browser` variable.
  // A dynamic import works from both CommonJS and ES modules.
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  const issues = [];
  try {
    for (const locale of locales) {
      const isRTL = ['ar', 'he', 'fa', 'ur'].includes(locale.split(/[-_]/)[0]);
      if (!isRTL) continue;
      const page = await browser.newPage();
      try {
        await page.goto(`${componentUrl}?locale=${locale}`);
        // Check dir attribute:
        const htmlDir = await page.getAttribute('html', 'dir');
        if (htmlDir !== 'rtl') {
          issues.push({
            locale,
            type: 'missing-dir-attribute',
            message: 'HTML dir attribute is not set to "rtl"',
          });
        }
        // Check for horizontal overflow (common RTL layout bug).
        // clientWidth excludes the vertical scrollbar, unlike window.innerWidth.
        const overflow = await page.evaluate(() => {
          const root = document.documentElement;
          return root.scrollWidth > root.clientWidth;
        });
        if (overflow) {
          issues.push({
            locale,
            type: 'horizontal-overflow',
            message: 'Page has horizontal overflow in RTL mode',
          });
        }
      } finally {
        await page.close(); // close the page even if navigation or a check throws
      }
    }
  } finally {
    await browser.close();
  }
  return issues;
}
Trade-offs & Considerations
| Aspect | Manual Translation | Machine Translation Only | AI + Glossary + Validation | Full AI Pipeline |
|---|---|---|---|---|
| Quality | Highest | Low-Medium | High | High |
| Speed | 3-5 days | Minutes | Minutes + review | Minutes |
| Cost per word | $0.10-0.25 | ~$0.001 | ~$0.005 | ~$0.005 |
| Consistency | Variable | Low | High (glossary enforced) | High |
| Plural handling | Human expertise | Often wrong | Validated per CLDR | Validated |
| Context awareness | High (if given) | None | Provided via prompt | Provided |
| RTL support | Manual QA needed | N/A | Automated checks | Automated |
| Regulatory (legal) | Certified available | Not certifiable | Not certifiable | Not certifiable |
Best Practices
-
Detect hardcoded strings with AST analysis, not grep, because the AST understands context. Walk the JSX AST to find string literals in translatable positions (JSXText, placeholder props, aria-labels) while ignoring technical strings (className, data-testid, import paths). Use heuristics (contains spaces, starts with uppercase, not camelCase) to filter false positives, and run the check as a lint rule in CI so that new hardcoded strings cannot merge.
-
Include UI context in translation prompts, because a word's meaning depends on where it appears. "Save" as a button label translates differently than "Save" as a noun. Include the component type (button, heading, tooltip), surrounding text, maximum character length, and variable names in the translation prompt; this context prevents the ambiguity-driven errors that make up the most common class of mistranslation.
-
Validate ICU MessageFormat programmatically for every locale: parse, check variables, check plural categories. Use @formatjs/icu-messageformat-parser to parse translations; verify that all variables from the source appear in translations; verify that plural messages include all CLDR-required categories for the target locale (Arabic needs 6 categories, not 2); run these checks in CI to fail the build on ICU errors.
-
Use logical CSS properties instead of physical left/right for RTL support. Replace margin-left with margin-inline-start, text-align: left with text-align: start, and left with inset-inline-start; lint for physical properties in CI; add bidirectional text markers (Unicode isolates) around embedded LTR content in RTL translations to prevent display reordering.
-
Maintain a Translation Memory (TM) alongside AI to reduce cost and improve consistency. Cache every approved translation; before calling the AI API, check for exact matches (free) and fuzzy matches (above 85% similarity, which can be adapted cheaply). This typically reduces AI translation calls by 40-60% over time and ensures that an identical string always gets the same translation regardless of when it was translated.
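The Translation Memory lookup order described in the last practice (exact match, then fuzzy match, then fall through to the AI call) can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: the `TranslationMemory` class and its method names are hypothetical, and character-trigram Dice similarity is swapped in as a self-contained stand-in for an embedding-based similarity score.

```javascript
// Character-trigram set for a string (lowercased, whitespace-normalized, padded).
function trigrams(text) {
  const s = `  ${text.toLowerCase().replace(/\s+/g, ' ').trim()} `;
  const grams = new Set();
  for (let i = 0; i + 3 <= s.length; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

// Dice coefficient on trigram sets: 1 for identical strings, 0 for no overlap.
// Assumption: a stand-in for whatever similarity measure the real TM uses.
function similarity(a, b) {
  const ga = trigrams(a);
  const gb = trigrams(b);
  let shared = 0;
  for (const gram of ga) if (gb.has(gram)) shared++;
  return (2 * shared) / (ga.size + gb.size);
}

// Hypothetical in-memory TM keyed by locale, then by source string.
class TranslationMemory {
  constructor(fuzzyThreshold = 0.85) {
    this.fuzzyThreshold = fuzzyThreshold;
    this.byLocale = new Map(); // locale -> Map(source -> approved translation)
  }

  add(locale, source, translation) {
    if (!this.byLocale.has(locale)) this.byLocale.set(locale, new Map());
    this.byLocale.get(locale).set(source, translation);
  }

  // Returns { type: 'exact' | 'fuzzy' | 'miss', ... }; only a miss needs a fresh AI call.
  lookup(locale, source) {
    const entries = this.byLocale.get(locale);
    if (!entries) return { type: 'miss' };
    if (entries.has(source)) {
      return { type: 'exact', translation: entries.get(source), score: 1 };
    }
    let best = null;
    for (const [candidate, translation] of entries) {
      const score = similarity(source, candidate);
      if (score > this.fuzzyThreshold && (!best || score > best.score)) {
        best = { type: 'fuzzy', source: candidate, translation, score };
      }
    }
    return best ?? { type: 'miss' };
  }
}
```

A fuzzy hit returns the stored translation together with the source it was approved for, so a reviewer or a cheaper adaptation prompt can adjust it instead of translating from scratch. The linear scan is fine for a sketch; a TM with thousands of entries would want an index.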
Conclusion
AI-powered internationalization automates the most error-prone and time-consuming parts of the localization pipeline. AST-based hardcoded string detection walks the JSX tree to find untranslated text in translatable positions — JSXText, placeholder attributes, aria-labels — while filtering out technical strings using heuristics and known non-translatable prop names. Context-aware machine translation includes component type (button vs heading), maximum character length, glossary terms (mandatory translations for product terminology), and ICU format details in the prompt, preventing the ambiguity-driven mistranslations that plague context-free translation APIs. ICU MessageFormat validation programmatically parses every translation to verify syntax, variable consistency (every {variable} in the source must appear in the translation), and plural category completeness (Arabic requires 6 forms, Polish requires 4, Japanese requires 1). Character length validation uses locale-specific expansion factors (German ~35% longer, Japanese ~40% shorter) and Canvas-based pixel measurement to predict UI overflow before it reaches production. Translation Memory caches approved translations and fuzzy-matches similar strings (cosine similarity on embeddings) to reduce AI API calls by 40-60% while maintaining consistency. RTL validation lints CSS for physical properties (left/right) that should be logical (inline-start/inline-end) and checks for bidirectional text mixing issues. The CI pipeline ties everything together: every PR with translation changes is validated for ICU syntax, variable consistency, plural completeness, glossary compliance, and length constraints — failing the build on errors and warning on risks.