DEV Community 2h ago

Engineering a High-Performance LSM-Tree Storage Engine: MemTables, SSTables, and Compaction

The Write Bottleneck: B-Trees vs. LSM-Trees

Traditional relational databases use B-Trees to index data. While B-Trees excel at random read lookups, they perform poorly for write-heavy workloads. This is because every write requires updating arbitrary pages on disk, resulting in expensive random disk I/O.

Log-Structured Merge-Trees (LSM-Trees) solve this write bottleneck. Instead of updating records in place, an LSM-tree appends all writes sequentially. By transforming random writes into sequential disk append operations, LSM-trees deliver massive write throughput. This makes them the foundation of modern high-performance databases like RocksDB, Cassandra, and InfluxDB.

The Write Path: Durability and Memory

Writes to an LSM-tree follow a simple, high-speed path:

Client ─── Write ───> Write-Ahead Log (WAL) ─── [Append to Disk]
                           │
                           ▼
                    MemTable ──────────────── [In-Memory Update]

Write-Ahead Log (WAL): The write is appended sequentially to a log file on disk. This is extremely fast because it requires no disk seeks. The WAL ensures durability - if the system crashes, the log is replayed to recover lost memory state.
MemTable: Once logged to the WAL, the write updates an in-memory data structure called the MemTable. The MemTable is typically implemented as a self-balancing binary search tree (like a Red-Black Tree) or a Skip List, keeping keys sorted.

At this point, the write is complete. The client receives a success response, having avoided any expensive random disk I/O.

The Flush: Creating Sorted String Tables (SSTables)

When the MemTable reaches a size limit (e.g. 64MB), it is converted into a read-only structure and flushed to disk as a Sorted String Table (SSTable). An SSTable is a sequential file containing sorted keys and values divided into data blocks. Because the MemTable kept keys sorted in memory, writing the SSTable is a fast, sequential disk append.

[MemTable] ─── (Sorted keys) ───> [SSTable on Disk]
                                      ├── Sparse Index (Offsets)
                                      └── Bloom Filter (Probabilistic)

To enable fast searches without reading the entire SSTable file, the flush generates two helper structures:

Sparse Index: Instead of indexing every key, a sparse index maps a key to its data block offset every few kilobytes. Finding a key requires a binary search on the sparse index to locate the correct block, which is then scanned sequentially.
Bloom Filter: A space-efficient probabilistic data structure that answers: Is this key in this SSTable? It can return false positives, but never false negatives. If the Bloom Filter returns false, the engine skips searching that SSTable entirely, saving an expensive disk seek.

The Read Path: Multi-Tier Search

Since keys are scattered across multiple SSTable files on disk, reading a key requires a structured lookup path:

Step 1: Check active MemTable (RAM) ─── (Found?) ───> Return
                │
               (No)
                ▼
Step 2: Check immutable MemTables (RAM)
                │
               (No)
                ▼
Step 3: Scan SSTables on disk (Newest to Oldest)
           ├── 1. Query Bloom Filter ── (Not present?) ──> Skip file
           ├── 2. Binary search Sparse Index to find block offset
           └── 3. Scan block sequentially for target key

Space Amplification: Compaction

Because SSTables are immutable, updates and deletes (represented by marker records called Tombstones) accumulate on disk. This leads to Space Amplification (wasted storage) and Read Amplification (having to search too many SSTable files). To clean this up, the storage engine runs a background process called Compaction.

In Leveled Compaction, SSTables are organized into numbered levels (L1, L2, L3...). Level $L_i$ has a maximum size capacity (e.g., L1 = 10MB, L2 = 100MB, L3 = 1GB). When a level exceeds its capacity, the engine picks one of its SSTables, reads overlapping SSTables from the next level, merges them (performing a multi-way merge sort, discarding duplicate keys and tombstones), and writes new sorted files to the next level.

TypeScript LSM Storage Engine Implementation

Here is a modular TypeScript class modeling the Write-Ahead Log (WAL) append, MemTable updates, sparse index mapping, and Bloom Filter seeks:

interface LogEntry {
  op: "PUT" | "DELETE";
  key: string;
  value: string;
}

interface SSTableIndex {
  key: string;
  offset: number;
}

export class LSMStorageEngine {
  private memTable: Map<string, string> = new Map();
  private walLog: LogEntry[] = [];
  private activeMemTableBytes: number = 0;
  private readonly MAX_MEMTABLE_BYTES = 64 * 1024 * 1024; // 64MB

  /**
   * Appends operation to Write-Ahead Log (WAL),
   * updates the in-memory MemTable, and flushes to disk if full.
   */
  public put(key: string, value: string): void {
    const entry: LogEntry = { op: "PUT", key, value };

    // 1. Durability: Sequential append to WAL log
    this.appendWal(entry);

    // 2. High Throughput: In-memory write to MemTable
    this.memTable.set(key, value);
    this.activeMemTableBytes += key.length + value.length;

    // 3. Flush Check
    if (this.activeMemTableBytes >= this.MAX_MEMTABLE_BYTES) {
      this.flushMemTableToSSTable();
    }
  }

  public get(key: string): string | null {
    // 1. Check in-memory MemTable
    if (this.memTable.has(key)) {
      return this.memTable.get(key) || null;
    }

    // 2. Read from disk SSTables in order (newest to oldest)
    return this.searchSSTablesOnDisk(key);
  }

  private flushMemTableToSSTable(): void {
    // Extract keys and sort alphabetically
    const sortedKeys = Array.from(this.memTable.keys()).sort();
    const sstableData: { k: string; v: string }[] = [];
    const sparseIndex: SSTableIndex[] = [];
    let currentOffset = 0;

    sortedKeys.forEach((key, idx) => {
      const val = this.memTable.get(key)!;
      sstableData.push({ k: key, v: val });

      // Create sparse index entry every 4KB block boundaries (simulated every 16 keys)
      if (idx % 16 === 0) {
        sparseIndex.push({ key, offset: currentOffset });
      }
      currentOffset += key.length + val.length;
    });

    // Write data, index blocks, and computed Bloom Filter to disk
    this.writeDiskFile({
      data: sstableData,
      index: sparseIndex,
      bloomFilter: this.generateBloomFilter(sortedKeys)
    });

    // Reset memory structures
    this.memTable.clear();
    this.walLog = [];
    this.activeMemTableBytes = 0;
  }

  private searchSSTablesOnDisk(key: string): string | null {
    const activeFiles = this.getSSTableDiskFiles();

    for (const file of activeFiles) {
      // 1. Check Bloom Filter (Bypasses disk if key definitely doesn't exist)
      if (!file.bloomFilter.test(key)) {
        continue;
      }

      // 2. Scan sparse index to locate target block offset
      const blockOffset = this.binarySearchSparseIndex(file.index, key);
      const dataBlock = this.loadDiskBlock(file.path, blockOffset);

      // 3. Scan block sequentially for target match
      const result = dataBlock.find(item => item.k === key);
      if (result) return result.v;
    }

    return null;
  }

  private generateBloomFilter(keys: string[]): any {
    return { test: (k: string) => true }; // Simulated filter response
  }

  private appendWal(entry: LogEntry) { this.walLog.push(entry); }
  private writeDiskFile(payload: any) {}
  private getSSTableDiskFiles(): any[] { return []; }
  private binarySearchSparseIndex(index: any[], key: string): number { return 0; }
  private loadDiskBlock(path: string, offset: number): any[] { return []; }
}

Engineering Takeaways

Sequential writes are orders of magnitude faster than random writes: By appending data sequentially, LSM-trees achieve massive write performance.
Bloom Filters are critical for read speed: They prevent the database from checking every single SSTable file on disk for a missing key.
Compaction is necessary to manage space and read speed: Merging and sorting SSTables in the background balances write, read, and space amplification.

The full article features a live interactive LSM-tree storage sandbox - perform inserts, watch the MemTable flush, inspect index blocks and Bloom Filters, and trigger leveled compaction directly in your browser. Read the full interactive article →

Written by Ebenezer Akinseinde - Software Developer & AI Automations Engineer. Portfolio · GitHub

Read on DEV Community ↗ ← Back to News