Zero-Downtime Schema Migrations: Strategies for Production Databases
Database schema migrations are among the most dangerous routine operations in production systems. A bad migration can lock tables for hours, corrupt data, or take down your entire application. Zero-downtime migrations are the art of changing your database schema while traffic is flowing — without long locks, without downtime, and without losing data.
This article covers the mechanics of online schema changes: how to add columns without locking, how to rename tables while queries are running, and how to migrate billions of rows without your pager going off at 3 AM.
Why Schema Migrations Are Dangerous
┌─────────────────────────────────────────────────────────────────────────┐
│ SCHEMA MIGRATION FAILURE MODES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. TABLE LOCKS │
│ ALTER TABLE in many databases acquires exclusive lock │
│ All queries to that table block until ALTER completes │
│ On a 100GB table, this can be hours │
│ │
│ 2. TRANSACTION LOG EXPLOSION │
│ Large migrations generate massive write-ahead logs │
│ Disk fills up, replication falls behind, failover breaks │
│ │
│ 3. BACKWARD INCOMPATIBILITY │
│ Old application code runs alongside new schema │
│ Column rename: old code inserts to old name, new code reads new │
│ Data ends up in wrong place or queries fail │
│ │
│ 4. DATA CORRUPTION │
│ Bad migration logic (wrong default, bad transform) │
│ No rollback possible after data is transformed │
│ │
│ 5. REPLICATION ISSUES │
│ Schema change propagates to replicas at different times │
│ Queries hit replicas with old schema, fail │
│ Replication lag causes inconsistent reads │
│ │
│ 6. GHOST MIGRATIONS │
│ Migration runs on staging, not production │
│ Or runs partially, leaves schema in inconsistent state │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Expand-Contract Pattern
The core strategy for zero-downtime migrations is expand-contract (also called parallel change):
┌─────────────────────────────────────────────────────────────────────────┐
│ EXPAND-CONTRACT MIGRATION PATTERN │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GOAL: Rename column "name" to "full_name" │
│ │
│ WRONG (causes downtime): │
│ ───────────────────────── │
│ 1. Deploy: ALTER TABLE users RENAME COLUMN name TO full_name; │
│ 2. Deploy new code that uses full_name │
│ ❌ Between step 1 and 2: all queries fail │
│ │
│ EXPAND-CONTRACT (zero downtime): │
│ ──────────────────────────────── │
│ │
│ PHASE 1: EXPAND │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Add new column (no lock on Postgres): │ │
│ │ ALTER TABLE users ADD COLUMN full_name VARCHAR(255); │ │
│ │ │ │
│ │ Schema now: [id, name, email, full_name] │ │
│ │ Code reads: name │ │
│ │ Code writes: name │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 2: DUAL WRITE │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Deploy code that writes to BOTH columns: │ │
│ │ INSERT INTO users (name, full_name, ...) VALUES (?, ?, ...)│ │
│ │ UPDATE users SET name = ?, full_name = ? WHERE id = ? │ │
│ │ │ │
│ │ Code reads: name (still) │ │
│ │ Code writes: name AND full_name │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 3: BACKFILL │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Copy data from old column to new: │ │
│ │ UPDATE users SET full_name = name WHERE full_name IS NULL│ │
│ │ (Do in batches to avoid locks!) │ │
│ │ │ │
│ │ Now both columns have same data │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 4: SWITCH READS │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Deploy code that reads from new column: │ │
│ │ SELECT full_name FROM users │ │
│ │ │ │
│ │ Code reads: full_name │ │
│ │ Code writes: name AND full_name (still dual-writing) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ PHASE 5: CONTRACT │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Stop writing to old column: │ │
│ │ Code now only writes to full_name │ │
│ │ │ │
│ │ Then, later: DROP COLUMN name │ │
│ │ (Or keep for rollback safety period) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Each phase is a separate deploy. Each is independently safe. │
│ Any phase can be rolled back. Zero downtime throughout. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
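In application code, Phases 2 and 4 amount to a write path that targets both columns and a read path behind a flag. A runnable sketch of that shape (the in-memory `UserStore` and its method names are illustrative stand-ins for the real table and data-access layer):

```typescript
type UserRow = { id: number; name: string; full_name?: string };

// In-memory stand-in for the users table, so the pattern runs without a DB.
class UserStore {
  private rows = new Map<number, UserRow>();

  // Phase 2: every write goes to BOTH the old and new column.
  dualWrite(id: number, value: string): void {
    this.rows.set(id, { id, name: value, full_name: value });
  }

  // Phase 4 is a deploy that flips which column reads come from.
  read(id: number, readFromNew: boolean): string | undefined {
    const row = this.rows.get(id);
    if (!row) return undefined;
    return readFromNew ? row.full_name : row.name;
  }
}
```

Because the flag only changes the read path, flipping it back is a safe rollback at any point while dual-writing continues.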
PostgreSQL Online Schema Changes
PostgreSQL is relatively good at online schema changes, but has gotchas:
Operations That Don't Lock
-- These are essentially instant (metadata-only), though each still
-- takes a brief ACCESS EXCLUSIVE lock to update the catalog:
-- Add column with no default
ALTER TABLE users ADD COLUMN new_field VARCHAR(255);
-- Add column with a non-volatile default (Postgres 11+ stores the
-- default as metadata instead of rewriting every row)
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT now();
-- Drop column (marks it dropped in the catalog, doesn't rewrite)
ALTER TABLE users DROP COLUMN old_field;
-- Create index concurrently (doesn't block writes)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Add constraint without validation
ALTER TABLE users ADD CONSTRAINT fk_org
FOREIGN KEY (org_id) REFERENCES orgs(id) NOT VALID;
Operations That DO Lock
-- These rewrite the table or hold an ACCESS EXCLUSIVE lock for a long time:
-- Add column with any default before Postgres 11, or with a
-- VOLATILE default (e.g. random()) on any version
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
-- Pre-11: LOCKS entire table, rewrites every row
-- Set NOT NULL on an existing column
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Holds the lock while it scans the entire table to verify the constraint
-- Change column type
ALTER TABLE users ALTER COLUMN age TYPE bigint;
-- Rewrites entire table
-- Add unique constraint
ALTER TABLE users ADD CONSTRAINT uk_email UNIQUE (email);
-- Creates index (use CREATE INDEX CONCURRENTLY + ADD CONSTRAINT USING INDEX)
-- Validate foreign key
ALTER TABLE users VALIDATE CONSTRAINT fk_org;
-- Scans entire table
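Even the "instant" operations above can cause an outage if the brief ACCESS EXCLUSIVE lock queues behind one long-running transaction: everything behind the ALTER then blocks too. A common guard is to set a short `lock_timeout` so the ALTER gives up fast, then retry with backoff. A sketch with the executor injected so it runs without a live database (in practice `execute` would be `(sql) => pool.query(sql)`; the function name and retry policy are ours, `lock_timeout` is a real Postgres setting):

```typescript
// Run one DDL statement guarded by a short lock_timeout, retrying
// with capped exponential backoff if the lock can't be acquired.
async function runDdlWithRetry(
  execute: (sql: string) => Promise<unknown>,
  ddl: string,
  maxAttempts = 5,
  baseDelayMs = 1000,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Give up quickly instead of queueing behind a long transaction
      // and blocking all traffic to the table.
      await execute(`SET lock_timeout = '2s'`);
      await execute(ddl);
      return attempt; // attempts used, for logging
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delay = Math.min(baseDelayMs * 2 ** attempt, 10_000);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error('unreachable');
}
```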
Safe Pattern: Adding NOT NULL Column
-- WRONG: Locks table for full rewrite
ALTER TABLE users ADD COLUMN role VARCHAR(20) NOT NULL DEFAULT 'user';
-- SAFE: Multi-step approach
-- Step 1: Add nullable column (instant)
ALTER TABLE users ADD COLUMN role VARCHAR(20);
-- Step 2: Set default for new rows (instant)
ALTER TABLE users ALTER COLUMN role SET DEFAULT 'user';
-- Step 3: Backfill in batches (no lock per batch)
-- Note: COMMIT inside a DO block requires Postgres 11+, and the block
-- must not itself run inside an explicit transaction
DO $$
DECLARE
batch_size INT := 10000;
affected INT;
BEGIN
LOOP
UPDATE users
SET role = 'user'
WHERE id IN (
SELECT id FROM users
WHERE role IS NULL
LIMIT batch_size
FOR UPDATE SKIP LOCKED -- Don't block on locked rows
);
GET DIAGNOSTICS affected = ROW_COUNT;
EXIT WHEN affected = 0;
COMMIT;
PERFORM pg_sleep(0.1); -- Brief pause to reduce load
END LOOP;
END $$;
-- Step 4: Add NOT NULL constraint (scans but doesn't rewrite; on
-- Postgres 12+ the scan is skipped if a validated
-- CHECK (role IS NOT NULL) constraint already exists)
ALTER TABLE users ALTER COLUMN role SET NOT NULL;
MySQL Online DDL and pt-online-schema-change
MySQL's online DDL has improved but still has limitations. For large tables, use pt-online-schema-change or gh-ost:
┌─────────────────────────────────────────────────────────────────────────┐
│ pt-online-schema-change ALGORITHM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GOAL: ALTER TABLE orders ADD COLUMN tracking_num VARCHAR(100); │
│ (Table has 500M rows, normal ALTER would take hours with lock) │
│ │
│ STEP 1: CREATE SHADOW TABLE │
│ ──────────────────────────── │
│ CREATE TABLE _orders_new LIKE orders; │
│ ALTER TABLE _orders_new ADD COLUMN tracking_num VARCHAR(100); │
│ │
│ STEP 2: CREATE TRIGGERS │
│ ─────────────────────── │
│ -- Capture all changes to original table │
│ CREATE TRIGGER pt_osc_ins AFTER INSERT ON orders │
│ FOR EACH ROW REPLACE INTO _orders_new (...) VALUES (...); │
│ │
│ CREATE TRIGGER pt_osc_upd AFTER UPDATE ON orders │
│ FOR EACH ROW REPLACE INTO _orders_new (...) VALUES (...); │
│ │
│ CREATE TRIGGER pt_osc_del AFTER DELETE ON orders │
│ FOR EACH ROW DELETE FROM _orders_new WHERE id = OLD.id; │
│ │
│ STEP 3: COPY DATA IN CHUNKS │
│ ─────────────────────────── │
│ -- Copy in small batches (e.g., 1000 rows at a time) │
│ INSERT INTO _orders_new SELECT * FROM orders │
│ WHERE id BETWEEN ? AND ?; │
│ │
│ -- Triggers ensure new writes are captured │
│ -- Copy takes hours but no locking │
│ │
│ STEP 4: ATOMIC SWAP │
│ ─────────────────────── │
│ -- Brief lock for atomic rename │
│ RENAME TABLE orders TO _orders_old, _orders_new TO orders; │
│ │
│ STEP 5: CLEANUP │
│ ─────────────────── │
│ DROP TABLE _orders_old; │
│ -- (MySQL drops one trigger per statement) │
│ DROP TRIGGER pt_osc_ins; DROP TRIGGER pt_osc_upd; DROP TRIGGER pt_osc_del; │
│ │
│ RESULT: Schema changed with only milliseconds of lock time │
│ │
└─────────────────────────────────────────────────────────────────────────┘
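The triggers use REPLACE INTO (an upsert) rather than plain INSERT because the two code paths race: a row can be modified before its chunk is copied, or copied and then modified. Upsert semantics on the trigger side, plus a copy that refuses to clobber rows the trigger already placed (pt-osc's copier uses INSERT LOW_PRIORITY IGNORE for this), make the outcome order-independent. A toy simulation, with Maps standing in for the two tables:

```typescript
type Order = { id: number; total: number };

// Maps stand in for `orders` and `_orders_new`.
const orders = new Map<number, Order>([
  [1, { id: 1, total: 10 }],
  [2, { id: 2, total: 20 }],
]);
const ordersNew = new Map<number, Order>();

// Trigger path: any write to the original is upserted into the shadow.
function onWrite(row: Order): void {
  orders.set(row.id, row);
  ordersNew.set(row.id, row); // REPLACE INTO semantics
}

// Chunk copier: copies existing rows, but must not overwrite a newer
// version already placed by the trigger (INSERT IGNORE semantics).
function copyChunk(ids: number[]): void {
  for (const id of ids) {
    if (!ordersNew.has(id)) {
      const row = orders.get(id);
      if (row) ordersNew.set(id, row);
    }
  }
}
```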
gh-ost: Triggerless Online Migration
# GitHub's gh-ost uses binary log parsing instead of triggers
# Safer for high-write tables (triggers add latency)
gh-ost \
--host=db.example.com \
--database=myapp \
--table=orders \
--alter="ADD COLUMN tracking_num VARCHAR(100)" \
--execute \
--chunk-size=1000 \
--max-load=Threads_running=50 \
--critical-load=Threads_running=100 \
--postpone-cut-over-flag-file=/tmp/gh-ost.postpone
┌─────────────────────────────────────────────────────────────────────────┐
│ gh-ost ALGORITHM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ KEY DIFFERENCE: No triggers. Reads MySQL binary log instead. │
│ │
│ 1. Create ghost table with new schema │
│ 2. Connect to replica and stream binary log │
│ 3. Copy existing rows from original table (in chunks) │
│ 4. Apply binary log events to ghost table (keeps it in sync) │
│ 5. When copy complete and lag is zero: atomic table swap │
│ │
│ ADVANTAGES OVER pt-osc: │
│ • No triggers means no write latency impact │
│ • Can pause/resume migration │
│ • Can throttle based on replica lag │
│ • Can postpone cut-over until maintenance window │
│ • Can test on replica before running on primary │
│ │
│ BINARY LOG STREAMING: │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Original Table │ Ghost Table │ │
│ │ ────────────── │ ─────────── │ │
│ │ │ │ ▲ │ │
│ │ │ writes │ │ applied │ │
│ │ ▼ │ │ │ │
│ │ Binary Log ────────────────────────►│ │ │
│ │ (binlog stream) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
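The `--max-load` and `--critical-load` flags in the command above amount to a three-way decision on an observed status variable such as Threads_running: below max-load keep copying, between the thresholds pause the row copy, above critical-load abort. A simplified model of that decision (the function name and return values are ours, not gh-ost's):

```typescript
type LoadDecision = 'copy' | 'throttle' | 'abort';

// Simplified model of gh-ost's --max-load / --critical-load checks.
function loadDecision(
  observed: number,       // e.g. current Threads_running
  maxLoad: number,        // e.g. --max-load=Threads_running=50
  criticalLoad: number,   // e.g. --critical-load=Threads_running=100
): LoadDecision {
  if (observed >= criticalLoad) return 'abort';   // bail out of the migration
  if (observed >= maxLoad) return 'throttle';     // pause row copy, keep binlog apply
  return 'copy';
}
```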
Handling Foreign Keys
Foreign keys make migrations much harder:
// Migration strategy for foreign key constraints
interface ForeignKeyMigration {
steps: MigrationStep[];
}
const renameForeignKeyColumn: ForeignKeyMigration = {
steps: [
// 1. Add new column (nullable)
{
type: 'add_column',
sql: 'ALTER TABLE orders ADD COLUMN customer_uuid UUID',
},
// 2. Add FK constraint without validation
{
type: 'add_constraint',
sql: `ALTER TABLE orders ADD CONSTRAINT fk_orders_customer_uuid
FOREIGN KEY (customer_uuid) REFERENCES customers(uuid) NOT VALID`,
note: 'NOT VALID skips validation scan',
},
// 3. Deploy dual-write code
{
type: 'code_deploy',
description: 'Write to both customer_id and customer_uuid',
},
// 4. Backfill data
{
type: 'backfill',
sql: `UPDATE orders o
SET customer_uuid = c.uuid
FROM customers c
WHERE o.customer_id = c.id
AND o.customer_uuid IS NULL`,
batch_size: 10000,
},
// 5. Validate constraint (scans but doesn't lock)
{
type: 'validate_constraint',
sql: 'ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer_uuid',
},
// 6. Switch reads to new column
{
type: 'code_deploy',
description: 'Read from customer_uuid',
},
// 7. Make new column NOT NULL
{
type: 'alter_column',
sql: 'ALTER TABLE orders ALTER COLUMN customer_uuid SET NOT NULL',
},
// 8. Drop old FK and column
{
type: 'drop_constraint',
sql: 'ALTER TABLE orders DROP CONSTRAINT fk_orders_customer_id',
},
{
type: 'drop_column',
sql: 'ALTER TABLE orders DROP COLUMN customer_id',
delay: '7 days', // Wait for rollback safety
},
],
};
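Note that the `code_deploy` steps in a plan like this cannot be executed by the migration runner: they are fence posts where an application deploy must ship before the remaining SQL is safe. A small helper can split the plan into SQL batches at those fence posts (step shapes as in the plan above; the helper itself is ours):

```typescript
type MigrationStep =
  | { type: 'code_deploy'; description: string }
  | { type: string; sql: string; note?: string };

// Split a plan into SQL batches, broken at each code deploy.
// Each inner array runs through the migration tool; between batches,
// the corresponding deploy must complete before continuing.
function splitAtDeploys(steps: MigrationStep[]): string[][] {
  const batches: string[][] = [[]];
  for (const step of steps) {
    if (step.type === 'code_deploy') {
      batches.push([]);
    } else if ('sql' in step) {
      batches[batches.length - 1].push(step.sql);
    }
  }
  return batches;
}
```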
Backfill Strategies
Chunked Backfill with Progress Tracking
import { Pool } from 'pg';
interface BackfillConfig {
tableName: string;
setClause: string;
whereClause: string;
batchSize: number;
delayMs: number;
primaryKey: string;
onProgress?: (processed: number, total: number) => void;
}
async function backfillInChunks(pool: Pool, config: BackfillConfig): Promise<void> {
const {
tableName,
setClause,
whereClause,
batchSize,
delayMs,
primaryKey,
onProgress
} = config;
// Get total count for progress reporting
const countResult = await pool.query(
`SELECT COUNT(*) as total FROM ${tableName} WHERE ${whereClause}`
);
const total = parseInt(countResult.rows[0].total, 10);
let processed = 0;
// Get min/max ID for range-based batching
const rangeResult = await pool.query(
`SELECT MIN(${primaryKey}) as min_id, MAX(${primaryKey}) as max_id
FROM ${tableName}
WHERE ${whereClause}`
);
if (!rangeResult.rows[0].min_id) {
console.log('No rows to backfill');
return;
}
// pg returns bigint aggregates as strings; normalize to numbers
let currentId = Number(rangeResult.rows[0].min_id);
const maxId = Number(rangeResult.rows[0].max_id);
console.log(`Backfilling ${total} rows from ID ${currentId} to ${maxId}`);
while (currentId <= maxId) {
const batchStart = currentId;
const batchEnd = currentId + batchSize;
// Update batch
const result = await pool.query(
`UPDATE ${tableName}
SET ${setClause}
WHERE ${primaryKey} >= $1
AND ${primaryKey} < $2
AND ${whereClause}`,
[batchStart, batchEnd]
);
processed += result.rowCount ?? 0;
currentId = batchEnd;
onProgress?.(processed, total);
// Throttle to avoid overwhelming the database
if (delayMs > 0) {
await new Promise(resolve => setTimeout(resolve, delayMs));
}
// Check replication lag (pg_last_xact_replay_timestamp() is only
// meaningful on a standby; on a primary it returns NULL and this
// check becomes a no-op)
const lagResult = await pool.query(
`SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int as lag_seconds`
);
const lagSeconds = lagResult.rows[0]?.lag_seconds ?? 0;
if (lagSeconds > 30) {
console.log(`Replication lag ${lagSeconds}s, pausing...`);
await new Promise(resolve => setTimeout(resolve, 10000));
}
}
console.log(`Backfill complete: ${processed} rows updated`);
}
// Usage
await backfillInChunks(pool, {
tableName: 'users',
setClause: 'full_name = name',
whereClause: 'full_name IS NULL',
batchSize: 5000,
delayMs: 100,
primaryKey: 'id',
onProgress: (processed, total) => {
console.log(`Progress: ${processed}/${total} (${((processed/total)*100).toFixed(1)}%)`);
},
});
Parallel Backfill for Large Tables
import { Pool } from 'pg';
import { Worker, isMainThread, parentPort, workerData } from 'worker_threads';
interface BackfillTask {
startId: number;
endId: number;
connectionString: string;
updateQuery: string;
}
// Main thread: coordinate workers
async function parallelBackfill(
connectionString: string,
tableName: string,
setClause: string,
whereClause: string,
workerCount: number = 4
): Promise<void> {
const pool = new Pool({ connectionString });
// Get ID range
const { rows } = await pool.query(
`SELECT MIN(id) as min_id, MAX(id) as max_id FROM ${tableName} WHERE ${whereClause}`
);
const minId = Number(rows[0].min_id);
const maxId = Number(rows[0].max_id);
// +1 so the inclusive maxId lands inside the last half-open range
const rangeSize = Math.ceil((maxId - minId + 1) / workerCount);
console.log(`Spawning ${workerCount} workers for IDs ${minId}-${maxId}`);
const workers: Promise<number>[] = [];
for (let i = 0; i < workerCount; i++) {
const startId = minId + (i * rangeSize);
const endId = Math.min(startId + rangeSize, maxId + 1);
const task: BackfillTask = {
startId,
endId,
connectionString,
updateQuery: `UPDATE ${tableName} SET ${setClause} WHERE id >= $1 AND id < $2 AND ${whereClause}`,
};
workers.push(runWorker(task));
}
const results = await Promise.all(workers);
const totalUpdated = results.reduce((sum, count) => sum + count, 0);
console.log(`All workers complete. Total updated: ${totalUpdated}`);
await pool.end();
}
function runWorker(task: BackfillTask): Promise<number> {
return new Promise((resolve, reject) => {
const worker = new Worker(__filename, { workerData: task });
worker.on('message', resolve);
worker.on('error', reject);
worker.on('exit', (code) => {
if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
});
});
}
// Worker thread: process assigned range
if (!isMainThread) {
const task = workerData as BackfillTask;
(async () => {
const pool = new Pool({ connectionString: task.connectionString });
let updated = 0;
let currentId = task.startId;
const batchSize = 1000;
while (currentId < task.endId) {
const result = await pool.query(task.updateQuery, [
currentId,
Math.min(currentId + batchSize, task.endId),
]);
updated += result.rowCount ?? 0;
currentId += batchSize;
// Brief pause to reduce contention
await new Promise(resolve => setTimeout(resolve, 50));
}
await pool.end();
parentPort?.postMessage(updated);
})();
}
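Partitioning the ID space is the easiest part of this code to get wrong: with half-open `[start, end)` ranges, the last range must end at `maxId + 1` or the final row is silently skipped. Extracted as a pure helper so it can be tested in isolation:

```typescript
// Split [minId, maxId] (inclusive) into up to `parts` contiguous
// half-open ranges [start, end) that cover every id exactly once.
function splitIdRange(
  minId: number,
  maxId: number,
  parts: number,
): Array<{ start: number; end: number }> {
  const span = maxId - minId + 1;        // +1 because maxId is inclusive
  const size = Math.ceil(span / parts);
  const ranges: Array<{ start: number; end: number }> = [];
  for (let i = 0; i < parts; i++) {
    const start = minId + i * size;
    if (start > maxId) break;            // fewer ids than requested parts
    ranges.push({ start, end: Math.min(start + size, maxId + 1) });
  }
  return ranges;
}
```

Each returned range maps directly onto one worker's `startId`/`endId` pair.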
Handling Rollbacks
┌─────────────────────────────────────────────────────────────────────────┐
│ MIGRATION ROLLBACK STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ROLLBACK TYPES: │
│ │
│ 1. CODE ROLLBACK (easy) │
│ Schema is forward-compatible │
│ Old code can work with new schema │
│ Just redeploy old code version │
│ │
│ 2. SCHEMA ROLLBACK (hard) │
│ Reverse the schema change │
│ May need to recover dropped data │
│ Expand-contract makes this safer │
│ │
│ EXPAND-CONTRACT ROLLBACK SAFETY: │
│ ───────────────────────────────── │
│ │
│ Phase 1: Expand (add column) │
│ → Rollback: Drop column (no data loss, old code unaffected) │
│ │
│ Phase 2: Dual-write │
│ → Rollback: Deploy single-write code (new column has data, but unused)│
│ │
│ Phase 3: Backfill │
│ → Rollback: Nothing needed (backfill is idempotent) │
│ │
│ Phase 4: Switch reads │
│ → Rollback: Deploy old-read code (data still in old column) │
│ │
│ Phase 5: Stop dual-write │
│ → Rollback: Resume dual-writing (brief gap in new column data) │
│ │
│ Phase 6: Drop old column │
│ → POINT OF NO RETURN - cannot easily rollback │
│ → Keep old column for N days before dropping │
│ │
│ KEEP OLD ARTIFACTS: │
│ • Don't drop columns immediately — wait 7-14 days │
│ • Don't drop tables — rename to _deprecated_tablename │
│ • Don't delete data — soft delete or move to archive │
│ │
└─────────────────────────────────────────────────────────────────────────┘
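The "keep the old column for N days" rule above is easy to enforce mechanically: record when dual-writing stopped, and refuse the destructive DROP until the safety window has elapsed. A minimal check (the 7-day default is this article's suggestion, not a standard):

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// True only if the rollback-safety window has fully elapsed since
// dual-writing to the old column stopped.
function safeToDrop(
  dualWriteStoppedAt: Date,
  now: Date,
  safetyDays = 7,
): boolean {
  return now.getTime() - dualWriteStoppedAt.getTime() >= safetyDays * DAY_MS;
}
```

A migration runner can call this as a gate before emitting the final `DROP COLUMN` statement.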
Migration State Machine
enum MigrationPhase {
NOT_STARTED = 'not_started',
EXPAND = 'expand',
DUAL_WRITE = 'dual_write',
BACKFILL = 'backfill',
SWITCH_READS = 'switch_reads',
SINGLE_WRITE = 'single_write',
CONTRACTED = 'contracted',
ROLLED_BACK = 'rolled_back',
}
interface MigrationState {
name: string;
phase: MigrationPhase;
startedAt: Date;
completedPhases: MigrationPhase[];
error?: string;
}
class MigrationStateMachine {
private state: MigrationState;
constructor(name: string) {
this.state = {
name,
phase: MigrationPhase.NOT_STARTED,
startedAt: new Date(),
completedPhases: [],
};
}
async transition(to: MigrationPhase, action: () => Promise<void>): Promise<void> {
const validTransitions: Record<MigrationPhase, MigrationPhase[]> = {
[MigrationPhase.NOT_STARTED]: [MigrationPhase.EXPAND],
[MigrationPhase.EXPAND]: [MigrationPhase.DUAL_WRITE, MigrationPhase.ROLLED_BACK],
[MigrationPhase.DUAL_WRITE]: [MigrationPhase.BACKFILL, MigrationPhase.EXPAND],
[MigrationPhase.BACKFILL]: [MigrationPhase.SWITCH_READS, MigrationPhase.DUAL_WRITE],
[MigrationPhase.SWITCH_READS]: [MigrationPhase.SINGLE_WRITE, MigrationPhase.BACKFILL],
[MigrationPhase.SINGLE_WRITE]: [MigrationPhase.CONTRACTED, MigrationPhase.SWITCH_READS],
[MigrationPhase.CONTRACTED]: [], // Terminal state
[MigrationPhase.ROLLED_BACK]: [MigrationPhase.EXPAND], // Can retry
};
if (!validTransitions[this.state.phase].includes(to)) {
throw new Error(
`Invalid transition from ${this.state.phase} to ${to}. ` +
`Valid transitions: ${validTransitions[this.state.phase].join(', ')}`
);
}
console.log(`Migration ${this.state.name}: ${this.state.phase} → ${to}`);
try {
await action();
this.state.completedPhases.push(this.state.phase);
this.state.phase = to;
await this.persistState();
} catch (error) {
this.state.error = error instanceof Error ? error.message : 'Unknown error';
console.error(`Migration failed at ${this.state.phase}:`, error);
throw error;
}
}
async rollbackTo(phase: MigrationPhase, action: () => Promise<void>): Promise<void> {
if (!this.state.completedPhases.includes(phase)) {
throw new Error(`Cannot rollback to ${phase}: not in completed phases`);
}
console.log(`Rolling back ${this.state.name} from ${this.state.phase} to ${phase}`);
await action();
this.state.phase = phase;
this.state.completedPhases = this.state.completedPhases.filter(p =>
this.phaseOrder(p) <= this.phaseOrder(phase)
);
await this.persistState();
}
private phaseOrder(phase: MigrationPhase): number {
const order = [
MigrationPhase.NOT_STARTED,
MigrationPhase.EXPAND,
MigrationPhase.DUAL_WRITE,
MigrationPhase.BACKFILL,
MigrationPhase.SWITCH_READS,
MigrationPhase.SINGLE_WRITE,
MigrationPhase.CONTRACTED,
];
return order.indexOf(phase);
}
private async persistState(): Promise<void> {
// Store in a database or file for recovery
// (`db` is an application-level query handle, assumed to be in scope)
await db.query(
`INSERT INTO migration_states (name, phase, completed_phases, updated_at)
VALUES ($1, $2, $3, NOW())
ON CONFLICT (name) DO UPDATE SET
phase = EXCLUDED.phase,
completed_phases = EXCLUDED.completed_phases,
updated_at = NOW()`,
[this.state.name, this.state.phase, JSON.stringify(this.state.completedPhases)]
);
}
}
Testing Migrations
// Migration test framework
import { Pool } from 'pg';
interface MigrationTest {
name: string;
seedData: string;
migration: string;
assertions: (pool: Pool) => Promise<void>;
}
async function testMigration(test: MigrationTest): Promise<void> {
  const testDbName = `test_migration_${Date.now()}`;
  const pool = new Pool({ database: 'postgres' });
  let testPool: Pool | undefined;
  try {
    // Create isolated test database
    await pool.query(`CREATE DATABASE ${testDbName}`);
    testPool = new Pool({ database: testDbName });
    // Apply schema up to migration point
    await testPool.query(test.seedData);
    console.log(`Testing migration: ${test.name}`);
    console.log('Pre-migration data seeded');
    // Apply migration
    await testPool.query(test.migration);
    console.log('Migration applied');
    // Run assertions
    await test.assertions(testPool);
    console.log('All assertions passed');
  } finally {
    // Close test connections first: DROP DATABASE fails while
    // sessions are still connected to it
    await testPool?.end();
    await pool.query(`DROP DATABASE IF EXISTS ${testDbName}`);
    await pool.end();
  }
}
// Example test
const renameColumnTest: MigrationTest = {
name: 'rename_name_to_full_name',
seedData: `
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
email VARCHAR(255)
);
INSERT INTO users (name, email) VALUES
('Alice Smith', 'alice@example.com'),
('Bob Jones', 'bob@example.com');
`,
migration: `
-- Expand: add the new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Backfill: copy old column into new
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Tighten: enforce NOT NULL once backfilled
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
`,
assertions: async (pool: Pool) => {
// Verify column exists
const { rows: columns } = await pool.query(`
SELECT column_name, is_nullable
FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'full_name'
`);
if (columns.length === 0) {
throw new Error('full_name column not created');
}
if (columns[0].is_nullable !== 'NO') {
throw new Error('full_name should be NOT NULL');
}
// Verify data migrated
const { rows: users } = await pool.query(
'SELECT name, full_name FROM users ORDER BY id'
);
if (users[0].full_name !== 'Alice Smith') {
throw new Error('Data not migrated correctly');
}
// Verify old data preserved (for rollback)
if (users[0].name !== users[0].full_name) {
throw new Error('Old column data corrupted');
}
},
};
await testMigration(renameColumnTest);
Monitoring Migrations in Production
// Migration observability
interface MigrationMetrics {
name: string;
phase: string;
rowsProcessed: number;
rowsRemaining: number;
estimatedCompletionTime: Date | null;
errorCount: number;
replicationLagSeconds: number;
lockWaitCount: number;
}
async function monitorMigration(pool: Pool, migrationName: string): Promise<MigrationMetrics> {
// Get migration progress from tracking table
const { rows: [progress] } = await pool.query(
`SELECT * FROM migration_progress WHERE name = $1`,
[migrationName]
);
// Check replication lag
const { rows: [lag] } = await pool.query(
`SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int as lag`
);
// Check for lock waits
// Check for lock waits (parameterized rather than interpolated into SQL)
const { rows: [locks] } = await pool.query(
  `SELECT COUNT(*) as lock_count
   FROM pg_stat_activity
   WHERE wait_event_type = 'Lock'
     AND query LIKE '%' || $1 || '%'`,
  [migrationName]
);
// Estimate completion time (guard the division: no elapsed time or
// no progress yet means no meaningful estimate)
const elapsedSeconds =
  (Date.now() - new Date(progress.started_at).getTime()) / 1000;
const rowsPerSecond =
  elapsedSeconds > 0 ? progress.rows_processed / elapsedSeconds : 0;
return {
  name: migrationName,
  phase: progress.phase,
  rowsProcessed: progress.rows_processed,
  rowsRemaining: progress.rows_remaining,
  estimatedCompletionTime: rowsPerSecond > 0
    ? new Date(Date.now() + (progress.rows_remaining / rowsPerSecond) * 1000)
    : null,
errorCount: progress.error_count,
replicationLagSeconds: lag?.lag ?? 0,
lockWaitCount: parseInt(locks?.lock_count ?? '0', 10),
};
}
// Alert if migration is unhealthy
function checkMigrationHealth(metrics: MigrationMetrics): string[] {
const alerts: string[] = [];
if (metrics.replicationLagSeconds > 60) {
alerts.push(`High replication lag: ${metrics.replicationLagSeconds}s`);
}
if (metrics.lockWaitCount > 0) {
alerts.push(`Lock contention detected: ${metrics.lockWaitCount} waiting`);
}
if (metrics.errorCount > 100) {
alerts.push(`High error count: ${metrics.errorCount}`);
}
// Detect stalled migration
if (metrics.rowsProcessed === 0 && metrics.phase !== 'not_started') {
alerts.push('Migration appears stalled: no progress');
}
return alerts;
}
Conclusion
Zero-downtime schema migrations require:
- Expand-contract pattern — Never break running code, always add before removing
- Online DDL knowledge — Know which operations lock and which don't
- Chunked backfills — Never UPDATE millions of rows in one transaction
- Replication awareness — Monitor lag, throttle when needed
- State tracking — Know exactly which phase each migration is in
- Tested rollbacks — Every phase should be reversible
The tooling matters: pt-online-schema-change and gh-ost for MySQL, native online DDL for PostgreSQL. But the pattern matters more — even with perfect tooling, you need expand-contract discipline to keep code and schema in sync.
The safest migration is the one you don't have to do. Design schemas for evolution: use UUIDs instead of auto-increment, use JSON columns for frequently-changing attributes, and think about how you'll change things before you ship v1.