Write-Ahead Log From Scratch: Crash Recovery, Checkpoints, and ARIES
Why Write-Ahead Logs?
Every database, message queue, and storage engine faces the same nightmare: what happens when the power cuts out mid-write? Without crash recovery, a half-written page corrupts data permanently. The Write-Ahead Log (WAL) is the universal solution: before modifying any data page, write the intended change to a sequential log file. If the system crashes, replay the log to restore consistency. WAL is used by PostgreSQL, SQLite, MySQL (InnoDB), Redis (AOF), Kafka, etcd, and every serious storage system. This post builds a complete WAL implementation with checkpointing, recovery, and the ARIES protocol.
Architecture
WRITE-AHEAD LOG PROTOCOL:
═════════════════════════
Transaction: UPDATE balance SET amount=500 WHERE id=42
Step 1: Write to WAL (sequential, fast):
┌──────────────────────────────────────────────┐
│ WAL File (append-only) │
│ ┌─────┐ ┌─────┐ ┌─────────────────┐ │
│ │LSN 1│ │LSN 2│ │LSN 3 │ │
│ │BEGIN│ │UPDT │ │COMMIT │ │
│ │Tx1 │ │pg=7 │ │Tx1 │ │
│ │ │ │old= │ │ │ │
│ │ │ │200 │ │ ← fsync() │ │
│ │ │ │new= │ │ here! │ │
│ │ │ │500 │ │ │ │
│ └─────┘ └─────┘ └─────────────────┘ │
└──────────────────────────────────────────────┘
Step 2: Modify data page (in memory, lazy flush):
┌──────────────────────────────────┐
│ Buffer Pool │
│ Page 7: balance = 500 (dirty) │
│ (will be written to disk later) │
└──────────────────────────────────┘
CRASH RECOVERY:
═══════════════
┌─────────────────────────────────────────────────┐
│ WAL File │
│ ... │ CKPT │ ... │ UPDATE │ COMMIT │ UPDATE │ │
│ ↑ │ │ │ (no │ │
│ last checkpoint │ Tx1 │ Tx1 │ commit)│ │
│ │ REDO ✓│ REDO ✓ │ UNDO ✗ │ │
└─────────────────────────────────────────────────┘
ARIES Protocol:
1. ANALYSIS: scan WAL from last checkpoint, find active transactions
2. REDO: replay ALL logged changes (committed or not)
3. UNDO: reverse changes of uncommitted transactions
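The analysis step can be previewed in miniature before we build the real thing. A sketch (the record shape here is a simplified stand-in for the full WALRecord type defined below): any transaction that began but never logged a commit or abort is a "loser" that UNDO must roll back.

```typescript
// Toy log: Tx1 commits, Tx2 never does (crash cuts it off).
type Rec = { type: "begin" | "update" | "commit" | "abort"; txId: number };

const log: Rec[] = [
  { type: "begin", txId: 1 },
  { type: "update", txId: 1 },
  { type: "begin", txId: 2 },
  { type: "commit", txId: 1 },
  { type: "update", txId: 2 },
  // crash: no commit record for Tx2
];

// ANALYSIS in miniature: whatever is still "active" at end of log
// is a loser transaction that the UNDO phase must reverse.
const active = new Set<number>();
for (const r of log) {
  if (r.type === "begin") active.add(r.txId);
  if (r.type === "commit" || r.type === "abort") active.delete(r.txId);
}

console.log([...active]); // Tx2 is the only loser
```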
WAL Record Types
/**
 * WAL Record: the fundamental unit of the write-ahead log.
 *
 * Every modification to the database is first recorded here.
 * Records are append-only and sequentially numbered by LSN.
 */
type LSN = number; // Log Sequence Number — monotonically increasing

type WALRecord =
  | {
      type: "begin";
      lsn: LSN;
      txId: number;
      timestamp: number;
    }
  | {
      type: "update";
      lsn: LSN;
      txId: number;
      pageId: number;
      offset: number;
      beforeImage: string; // Old value (for UNDO)
      afterImage: string;  // New value (for REDO)
      prevLsn: LSN;        // Previous LSN for this transaction (undo chain)
    }
  | {
      type: "insert";
      lsn: LSN;
      txId: number;
      pageId: number;
      afterImage: string;
      prevLsn: LSN;
    }
  | {
      type: "delete";
      lsn: LSN;
      txId: number;
      pageId: number;
      beforeImage: string;
      prevLsn: LSN;
    }
  | {
      type: "commit";
      lsn: LSN;
      txId: number;
      prevLsn: LSN;
    }
  | {
      type: "abort";
      lsn: LSN;
      txId: number;
      prevLsn: LSN;
    }
  | {
      type: "checkpoint";
      lsn: LSN;
      activeTxIds: number[];            // Transactions active at checkpoint
      dirtyPageTable: Map<number, LSN>; // Dirty pages → oldest LSN that dirtied them
    }
  | {
      type: "clr"; // Compensation Log Record (written during UNDO)
      lsn: LSN;
      txId: number;
      pageId: number;
      undoNextLsn: LSN; // Next record to undo
      afterImage: string;
    };
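The prevLsn fields form a per-transaction linked list through the log. A minimal illustration of walking that chain (plain object literals, standalone from the classes below): starting at a transaction's last LSN and following prevLsn until 0 visits exactly that transaction's records in reverse, which is what ABORT and the UNDO phase rely on.

```typescript
// A committed transaction's records, linked by prevLsn (0 = end of chain).
const records = [
  { lsn: 1, type: "begin", txId: 7, prevLsn: 0 },
  { lsn: 2, type: "update", txId: 7, prevLsn: 1, beforeImage: "x=1", afterImage: "x=9" },
  { lsn: 3, type: "commit", txId: 7, prevLsn: 2 },
];

// Index by LSN, then walk backwards from the transaction's last LSN.
const byLsn = new Map<number, (typeof records)[number]>();
for (const r of records) byLsn.set(r.lsn, r);

const chain: number[] = [];
for (let lsn = 3; lsn !== 0; lsn = byLsn.get(lsn)!.prevLsn) {
  chain.push(lsn);
}

console.log(chain); // visits the transaction's records newest-first
```

This is why the log never needs a per-transaction index: the chain is embedded in the records themselves.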
WAL Manager
/**
 * Write-Ahead Log Manager: handles writing, flushing, and reading log records.
 *
 * Key invariant (WAL protocol):
 *   A dirty page cannot be flushed to disk unless ALL log records
 *   that modified it have been flushed to the WAL first.
 *   (pageLSN ≤ flushedLSN)
 */

// `Omit` does not distribute over unions (it would collapse WALRecord to
// its common keys), so append() needs a distributive variant:
type DistributiveOmit<T, K extends keyof any> = T extends any ? Omit<T, K> : never;

class WALManager {
  private records: WALRecord[] = [];
  private nextLSN: LSN = 1;
  private flushedLSN: LSN = 0;
  private txLastLSN: Map<number, LSN> = new Map(); // Transaction → last LSN

  // Buffer for unflushed records:
  private buffer: WALRecord[] = [];
  private bufferSizeBytes = 0;
  private maxBufferSize = 64 * 1024; // 64KB — flush when exceeded

  /**
   * Append a record to the WAL.
   * Returns the LSN assigned to the record.
   */
  append(record: DistributiveOmit<WALRecord, "lsn">): LSN {
    const lsn = this.nextLSN++;
    const fullRecord = { ...record, lsn } as WALRecord;

    // Track the prevLsn chain for undo:
    if ("txId" in fullRecord && fullRecord.type !== "checkpoint") {
      const prevLsn = this.txLastLSN.get(fullRecord.txId) ?? 0;
      if ("prevLsn" in fullRecord) {
        (fullRecord as any).prevLsn = prevLsn;
      }
      this.txLastLSN.set(fullRecord.txId, lsn);
    }

    this.buffer.push(fullRecord);
    this.bufferSizeBytes += this.estimateRecordSize(fullRecord);

    // Auto-flush if the buffer is full:
    if (this.bufferSizeBytes >= this.maxBufferSize) {
      this.flush();
    }
    return lsn;
  }
  /**
   * Flush buffered records to disk (fsync).
   * This is the critical durability point.
   *
   * COMMIT records MUST be flushed before acknowledging to the client.
   * This is the "force-log-at-commit" rule.
   */
  flush(): void {
    if (this.buffer.length === 0) return;
    // Simulate writing to disk:
    for (const record of this.buffer) {
      this.records.push(record);
    }
    this.flushedLSN = this.records[this.records.length - 1].lsn;
    this.buffer = [];
    this.bufferSizeBytes = 0;
  }

  /**
   * Force a flush covering a specific LSN.
   * Used before committing a transaction. (Simplified: we flush the
   * whole buffer, which necessarily covers targetLSN.)
   */
  flushTo(targetLSN: LSN): void {
    if (targetLSN <= this.flushedLSN) return;
    this.flush();
  }

  /**
   * Read all flushed records from a given LSN forward.
   * Used during crash recovery. Note: records still sitting in the
   * buffer are not visible; callers that need the log tail must flush first.
   */
  readFrom(startLSN: LSN): WALRecord[] {
    return this.records.filter((r) => r.lsn >= startLSN);
  }

  /**
   * Find the most recent checkpoint record.
   */
  findLastCheckpoint(): WALRecord | null {
    for (let i = this.records.length - 1; i >= 0; i--) {
      if (this.records[i].type === "checkpoint") return this.records[i];
    }
    return null;
  }

  getFlushedLSN(): LSN { return this.flushedLSN; }

  private estimateRecordSize(record: WALRecord): number {
    return JSON.stringify(record).length; // Simplified
  }
}
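The buffering above is what makes group commit possible: many records ride on a single fsync(). A stripped-down sketch of that flush boundary (standalone, not using the WALManager class), where flushedLSN jumps from 0 to the last buffered LSN in one step:

```typescript
// Minimal WAL buffer: records accumulate in memory; one flush() makes
// everything up to the last buffered LSN durable at once.
let flushedLSN = 0;
const buffer: number[] = []; // LSNs awaiting fsync

function append(lsn: number) {
  buffer.push(lsn);
}

function flush() {
  if (buffer.length === 0) return;
  // One fsync() covers every buffered record: this is group commit.
  flushedLSN = buffer[buffer.length - 1];
  buffer.length = 0;
}

append(1); append(2); append(3);
console.log(flushedLSN); // nothing durable yet
flush();
console.log(flushedLSN); // all three records durable after one sync
```

Real systems exploit this by batching concurrent committers onto one fsync(), amortizing the most expensive operation in the commit path.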
Transaction Manager with WAL
/**
 * Simulated database pages (buffer pool).
 */
interface Page {
  id: number;
  data: Map<string, string>;
  pageLSN: LSN; // LSN of last modification (for recovery)
  dirty: boolean;
}

class BufferPool {
  private pages: Map<number, Page> = new Map();

  getPage(id: number): Page {
    if (!this.pages.has(id)) {
      this.pages.set(id, { id, data: new Map(), pageLSN: 0, dirty: false });
    }
    return this.pages.get(id)!;
  }

  markDirty(pageId: number, lsn: LSN): void {
    const page = this.getPage(pageId);
    page.dirty = true;
    page.pageLSN = lsn;
  }

  getDirtyPages(): Map<number, LSN> {
    const result = new Map<number, LSN>();
    for (const [id, page] of this.pages) {
      if (page.dirty) result.set(id, page.pageLSN);
    }
    return result;
  }

  flushPage(pageId: number): void {
    const page = this.pages.get(pageId);
    if (page) page.dirty = false;
  }
}
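Note that flushPage clears the dirty bit unconditionally; in a real engine the buffer pool would enforce the WAL invariant before letting a page reach disk. A sketch of that guard (canFlushPage is a hypothetical helper, not part of the classes in this post):

```typescript
// The WAL rule as one predicate: a dirty page may reach disk only
// after every log record that modified it is durable in the WAL.
function canFlushPage(pageLSN: number, flushedLSN: number): boolean {
  return pageLSN <= flushedLSN;
}

console.log(canFlushPage(5, 7)); // page's last change is already logged: OK
console.log(canFlushPage(9, 7)); // not OK: flush the WAL up to LSN 9 first
```

When the check fails, the buffer pool calls the WAL's flushTo() before writing the page, never the other way around.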
/**
 * Transaction Manager: coordinates transactions with the WAL.
 */
class TransactionManager {
  private wal: WALManager;
  private bufferPool: BufferPool;
  private nextTxId = 1;
  private activeTxs: Set<number> = new Set();

  constructor() {
    this.wal = new WALManager();
    this.bufferPool = new BufferPool();
  }

  begin(): number {
    const txId = this.nextTxId++;
    this.wal.append({ type: "begin", txId, timestamp: Date.now() });
    this.activeTxs.add(txId);
    return txId;
  }

  /**
   * Update a value. The WAL record is written BEFORE modifying the page.
   */
  update(txId: number, pageId: number, key: string, newValue: string): void {
    this.ensureActive(txId);
    const page = this.bufferPool.getPage(pageId);
    const oldValue = page.data.get(key) ?? "";

    // Step 1: Write to the WAL (before image + after image):
    const lsn = this.wal.append({
      type: "update",
      txId,
      pageId,
      offset: 0,
      beforeImage: `${key}=${oldValue}`,
      afterImage: `${key}=${newValue}`,
      prevLsn: 0, // Will be set by WALManager
    });

    // Step 2: Modify the page in the buffer pool:
    page.data.set(key, newValue);
    this.bufferPool.markDirty(pageId, lsn);
  }

  insert(txId: number, pageId: number, key: string, value: string): void {
    this.ensureActive(txId);
    const lsn = this.wal.append({
      type: "insert",
      txId,
      pageId,
      afterImage: `${key}=${value}`,
      prevLsn: 0,
    });
    const page = this.bufferPool.getPage(pageId);
    page.data.set(key, value);
    this.bufferPool.markDirty(pageId, lsn);
  }

  /**
   * Commit: force the WAL to disk, then acknowledge.
   *
   * The "force-log-at-commit" rule: the COMMIT record must be
   * durable (fsync'd) before we tell the client "committed."
   * Pages do NOT need to be flushed — they can be written lazily.
   */
  commit(txId: number): void {
    this.ensureActive(txId);
    const commitLSN = this.wal.append({
      type: "commit",
      txId,
      prevLsn: 0,
    });
    // CRITICAL: force the WAL flush before acknowledging:
    this.wal.flushTo(commitLSN);
    this.activeTxs.delete(txId);
  }
  /**
   * Abort: undo all changes by this transaction.
   * Walk the transaction's records backwards, applying before-images.
   */
  abort(txId: number): void {
    this.ensureActive(txId);

    // Flush first: readFrom() only sees flushed records, and this
    // transaction's updates may still be sitting in the WAL buffer.
    this.wal.flush();

    // Walk the transaction's records in reverse (a simplification of
    // following the prevLsn chain):
    const allRecords = this.wal.readFrom(0);
    const txRecords = allRecords
      .filter((r) => "txId" in r && r.txId === txId)
      .reverse();

    for (const record of txRecords) {
      if (record.type === "update") {
        // Undo: restore the before-image
        const [key, oldValue] = record.beforeImage.split("=");
        const page = this.bufferPool.getPage(record.pageId);
        page.data.set(key, oldValue);

        // Write a CLR (Compensation Log Record):
        this.wal.append({
          type: "clr",
          txId,
          pageId: record.pageId,
          undoNextLsn: record.prevLsn,
          afterImage: record.beforeImage,
        });
      }
    }

    this.wal.append({ type: "abort", txId, prevLsn: 0 });
    this.wal.flush();
    this.activeTxs.delete(txId);
  }
  /**
   * Checkpoint: snapshot of active state for faster recovery.
   * Fuzzy checkpoint — doesn't block transactions.
   */
  checkpoint(): void {
    const dirtyPages = this.bufferPool.getDirtyPages();
    this.wal.append({
      type: "checkpoint",
      activeTxIds: [...this.activeTxs],
      dirtyPageTable: dirtyPages,
    });
    this.wal.flush();
  }

  private ensureActive(txId: number): void {
    if (!this.activeTxs.has(txId)) {
      throw new Error(`Transaction ${txId} is not active`);
    }
  }

  getWAL(): WALManager { return this.wal; }
  getBufferPool(): BufferPool { return this.bufferPool; }
}
ARIES Recovery
/**
 * ARIES (Algorithm for Recovery and Isolation Exploiting Semantics):
 * the gold-standard crash recovery protocol, implemented by DB2 and
 * SQL Server and influential on nearly every other database's
 * recovery design.
 *
 * Three phases:
 *   1. ANALYSIS: determine what needs to be redone/undone
 *   2. REDO: replay history (repeat all logged changes)
 *   3. UNDO: reverse uncommitted transactions
 */
class ARIESRecovery {
  private wal: WALManager;
  private bufferPool: BufferPool;

  constructor(wal: WALManager, bufferPool: BufferPool) {
    this.wal = wal;
    this.bufferPool = bufferPool;
  }

  recover(): { redone: number; undone: number } {
    console.log("=== ARIES RECOVERY START ===");
    // Phase 1: Analysis
    const { activeTxs, dirtyPageTable, redoStartLSN } = this.analysisPhase();
    // Phase 2: Redo
    const redone = this.redoPhase(redoStartLSN, dirtyPageTable);
    // Phase 3: Undo
    const undone = this.undoPhase(activeTxs);
    console.log("=== ARIES RECOVERY COMPLETE ===");
    return { redone, undone };
  }

  /**
   * ANALYSIS PHASE: scan the WAL from the last checkpoint to the end.
   * Determine:
   *   - Which transactions were active at crash (need UNDO)
   *   - Which pages might be dirty (need REDO)
   *   - Where to start the REDO phase
   */
  private analysisPhase(): {
    activeTxs: Set<number>;
    dirtyPageTable: Map<number, LSN>;
    redoStartLSN: LSN;
  } {
    console.log("Phase 1: ANALYSIS");
    const checkpoint = this.wal.findLastCheckpoint();
    const startLSN = checkpoint?.lsn ?? 1;

    // Initialize from the checkpoint:
    const activeTxs = new Set<number>(
      checkpoint?.type === "checkpoint" ? checkpoint.activeTxIds : []
    );
    const dirtyPageTable = new Map<number, LSN>(
      checkpoint?.type === "checkpoint" ? checkpoint.dirtyPageTable : []
    );

    // Scan forward from the checkpoint:
    const records = this.wal.readFrom(startLSN);
    for (const record of records) {
      if ("txId" in record && record.type !== "checkpoint") {
        if (record.type === "begin") {
          activeTxs.add(record.txId);
        } else if (record.type === "commit" || record.type === "abort") {
          activeTxs.delete(record.txId);
        }
      }
      // Track dirty pages (recLSN = first LSN that dirtied the page):
      if ("pageId" in record) {
        if (!dirtyPageTable.has(record.pageId)) {
          dirtyPageTable.set(record.pageId, record.lsn);
        }
      }
    }

    // REDO starts from the earliest recLSN in the dirty page table:
    const redoStartLSN = dirtyPageTable.size > 0
      ? Math.min(...dirtyPageTable.values())
      : startLSN;

    console.log(`  Active transactions: [${[...activeTxs].join(", ")}]`);
    console.log(`  Dirty pages: ${dirtyPageTable.size}`);
    console.log(`  Redo start LSN: ${redoStartLSN}`);
    return { activeTxs, dirtyPageTable, redoStartLSN };
  }
  /**
   * REDO PHASE: repeat history from redoStartLSN.
   *
   * For each logged update: if the page's LSN < the record's LSN,
   * the update was lost (the page wasn't flushed) → redo it.
   *
   * REDO applies to ALL transactions (committed AND uncommitted).
   * This restores the exact state at crash time.
   */
  private redoPhase(startLSN: LSN, dirtyPageTable: Map<number, LSN>): number {
    console.log("Phase 2: REDO");
    const records = this.wal.readFrom(startLSN);
    let redoCount = 0;

    for (const record of records) {
      // Only records that touch a page can be redone (this also
      // excludes begin/commit/abort/checkpoint records):
      if (!("pageId" in record)) continue;
      const pageId = record.pageId;

      // Skip if the page is not in the dirty page table:
      if (!dirtyPageTable.has(pageId)) continue;

      // Skip records older than the page's recLSN — the page on disk
      // already contains those changes:
      if (record.lsn < dirtyPageTable.get(pageId)!) continue;

      // Skip if the page was flushed after this log record:
      const page = this.bufferPool.getPage(pageId);
      if (page.pageLSN >= record.lsn) continue;

      // REDO the operation:
      this.applyRedo(record);
      page.pageLSN = record.lsn;
      redoCount++;
    }
    console.log(`  Redone ${redoCount} operations`);
    return redoCount;
  }
  /**
   * UNDO PHASE: reverse all uncommitted transactions.
   *
   * Walk each loser transaction's records backwards, applying
   * before-images. Write a CLR (Compensation Log Record) for each
   * undo action so that re-recovery doesn't undo what's already
   * been undone.
   */
  private undoPhase(activeTxs: Set<number>): number {
    console.log("Phase 3: UNDO");
    let undoCount = 0;
    const allRecords = this.wal.readFrom(0);

    for (const txId of activeTxs) {
      // Get all records for this transaction in reverse order:
      const txRecords = allRecords
        .filter((r) => "txId" in r && r.txId === txId)
        .reverse();

      for (const record of txRecords) {
        if (record.type === "update") {
          this.applyUndo(record);
          // Write a CLR:
          this.wal.append({
            type: "clr",
            txId,
            pageId: record.pageId,
            undoNextLsn: record.prevLsn,
            afterImage: record.beforeImage,
          });
          undoCount++;
        }
      }
      // Write an abort record:
      this.wal.append({ type: "abort", txId, prevLsn: 0 });
    }

    this.wal.flush();
    console.log(`  Undone ${undoCount} operations from ${activeTxs.size} transactions`);
    return undoCount;
  }

  private applyRedo(record: WALRecord): void {
    if ("afterImage" in record && "pageId" in record) {
      const [key, value] = record.afterImage.split("=");
      const page = this.bufferPool.getPage(record.pageId);
      page.data.set(key, value);
    }
  }

  private applyUndo(record: WALRecord): void {
    if (record.type === "update") {
      const [key, value] = record.beforeImage.split("=");
      const page = this.bufferPool.getPage(record.pageId);
      page.data.set(key, value);
    }
  }
}
Interview Q&A
Q: Why is the WAL protocol necessary — why not just write changes directly to data files?
Writing directly to data files is catastrophically unsafe. Data pages are 4-16 KB. Disk writes are not atomic at the page level — a power failure mid-write produces a "torn page" (half old data, half new data). Even with journaling filesystems, a multi-page transaction (updating an index + a data page) can leave the database in an inconsistent state if only one page is written. WAL solves this with two properties: (1) Sequential writes — WAL records are appended sequentially, which is 100-1000× faster than random writes to data pages. A single fsync() makes multiple records durable. (2) Before + after images — each record stores both old and new values, enabling both REDO (restore new value) and UNDO (restore old value). The protocol rule: "no dirty page hits disk until its WAL records are flushed" ensures that after a crash, we can always reconstruct a consistent state. This is why COMMIT must fsync() the WAL — until that sync completes, the transaction isn't durable.
Q: What are the three phases of ARIES recovery, and why does REDO apply to uncommitted transactions?
ARIES uses three phases: (1) Analysis: scan the WAL from the last checkpoint to identify which transactions were active at crash time (need UNDO) and which pages might be dirty (need REDO). (2) Redo: replay ALL logged changes from the earliest dirty-page LSN — this includes uncommitted transactions. (3) Undo: reverse changes made by transactions that never committed. The surprising part is that REDO replays uncommitted work. Why? Because we need to restore the exact state at crash time before we can UNDO cleanly. Uncommitted transactions may have modified pages that were later modified by committed transactions. We need to reconstruct the full state to correctly apply UNDO. Also, REDO is physical (apply to specific page at specific offset) while UNDO is logical (reverse a semantic operation). UNDO writes CLRs (Compensation Log Records) — if we crash again during recovery, CLRs prevent us from undoing work that was already undone. This makes ARIES recovery idempotent.
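The idempotence claim is easy to demonstrate: with the pageLSN guard, replaying the same log twice leaves the state unchanged, so recovery can itself crash and re-run safely. A self-contained sketch (simplified record shape, not the classes above):

```typescript
// Redo with a pageLSN guard: each record is applied at most once.
const page = { pageLSN: 0, data: new Map<string, string>() };
const log = [
  { lsn: 1, key: "a", value: "1" },
  { lsn: 2, key: "a", value: "2" },
];

function redo() {
  for (const r of log) {
    if (page.pageLSN >= r.lsn) continue; // already applied, skip
    page.data.set(r.key, r.value);
    page.pageLSN = r.lsn;
  }
}

redo();
redo(); // the second pass is a no-op thanks to the guard
console.log(page.data.get("a"), page.pageLSN); // "2" 2
```

CLRs extend the same idea to UNDO: because each compensation is itself logged, a re-run of recovery sees what was already undone and skips it.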
Q: What are checkpoints and why don't they need to pause transactions?
A checkpoint records the current set of active transactions and dirty pages in the WAL, so recovery doesn't have to scan the entire log from the beginning. Without checkpoints, a database running for months would need to replay millions of log records on crash recovery. Fuzzy checkpoints (used by ARIES) don't pause transactions: they write a BEGIN_CHECKPOINT record, capture the active transaction table and dirty page table, then write an END_CHECKPOINT record. Transactions continue running during this time — the checkpoint is a "fuzzy" snapshot that may be slightly stale. This is fine because the ANALYSIS phase scans forward from the checkpoint and corrects any discrepancies. The alternative — "sharp checkpoints" — requires flushing all dirty pages to disk while blocking transactions, which causes visible stalls. PostgreSQL avoids the stall by pacing its checkpoints: they run periodically (checkpoint_timeout, default 5 minutes) and the dirty-page writes are spread across the interval (checkpoint_completion_target) rather than issued in one burst. SQLite uses a different model: in WAL mode all changes go to the WAL, and a separate "checkpoint" step copies committed changes back into the main database file.