Why B-Tree indexes still dominate relational databases: the page-level trade-offs

B-Tree indexes remain the default in PostgreSQL, MySQL, Oracle, and SQL Server because they solve a fundamental problem: databases read data in fixed-size pages, not individual rows. Understanding this constraint explains why they beat alternatives for most workloads, and where they don't.

The page problem that B-Trees solve

Relational databases don't read individual rows; they read fixed-size pages (8KB by default in PostgreSQL and SQL Server, 16KB in InnoDB). When you query for a single record without an index, the database scans every page in the table. With 10 million rows, that's expensive.
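A back-of-the-envelope sketch makes the cost concrete. The row width and page size below are assumptions for illustration (100-byte rows, 8KB pages), not figures from any particular database:

```python
ROW_SIZE = 100          # bytes; assumed average row width
PAGE_SIZE = 8 * 1024    # 8KB pages, the PostgreSQL default
rows = 10_000_000

rows_per_page = PAGE_SIZE // ROW_SIZE   # ~81 rows fit in one page
pages = -(-rows // rows_per_page)       # ceiling division: pages to scan

# A full table scan must read every one of these pages to find one row.
print(pages)  # 123457
```

Contrast that six-figure page count with the handful of page reads an index lookup needs, and the case for indexing writes itself.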

B-Tree indexes exist because they're optimized for this page-based reality. Each node fits neatly in one page and holds hundreds of keys, not just two like a textbook binary tree. This wide, shallow structure means finding any record typically requires 3-5 page reads, regardless of table size. The math: with 200 keys per node, a three-level tree handles 8 million entries.
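The fan-out arithmetic above can be checked directly. This is a simplified model that assumes a uniform fan-out of 200 keys per node; real trees have variable occupancy:

```python
import math

def max_entries(fanout: int, levels: int) -> int:
    """Upper bound on entries a tree can index with the given
    per-node fan-out and number of levels."""
    return fanout ** levels

def levels_needed(fanout: int, entries: int) -> int:
    """Minimum tree depth -- i.e. page reads per lookup -- to
    cover `entries` keys at the given fan-out."""
    return math.ceil(math.log(entries, fanout))

# With 200 keys per node, three levels cover 200**3 entries.
print(max_entries(200, 3))             # 8000000
# Even 10 million rows need only one more level.
print(levels_needed(200, 10_000_000))  # 4
```

A binary tree over the same 10 million keys would be over 20 levels deep; the wide node is what keeps lookups to a few page reads.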

Why writes cost more than you'd expect

Every insert updates the B-Tree. If a leaf node is full, it splits into two nodes and promotes a separator key to the parent. These splits can propagate upward but remain rare in practice. PostgreSQL (since version 13) mitigates index growth through B-tree deduplication, storing a duplicated key once with a list of row pointers; on skewed data this shrinks indexes by 50% or more. SQL Server, by contrast, stores every key-value pair fully.
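The split mechanics can be sketched in a few lines. This is a toy model of a B+-tree-style leaf split, not any engine's actual implementation:

```python
def split_leaf(leaf):
    """Split a full, sorted leaf into two halves.

    Returns (left, right, separator): the separator is the first key
    of the right half, copied up into the parent so future searches
    know which half to descend into.
    """
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]

full = [10, 20, 30, 40]          # a leaf at capacity
left, right, sep = split_leaf(full)
print(left, right, sep)          # [10, 20] [30, 40] 30
```

If the parent is itself full when it receives the separator, it splits too, which is how a split can cascade to the root and add a level to the tree.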

The write overhead is the price for keeping data sorted. That sorted structure enables range scans: after locating the first matching key, the database walks linked leaf nodes sequentially. Hash indexes offer constant-time lookups for exact matches but can't handle ranges, ordering, or prefix scans. This is why LIKE 'abc%' works with B-Trees but requires a full scan with hash indexes.
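The sorted-versus-hashed distinction is easy to demonstrate with Python's own sorted-search and hashing primitives, standing in for B-Tree leaves and a hash index respectively:

```python
import bisect

# Sorted leaf-level keys, as a B-Tree keeps them.
keys = ["abacus", "abc", "abcd", "abce", "abd", "zebra"]

def prefix_scan(sorted_keys, prefix):
    """Binary-search to the first key >= prefix, then walk forward
    while the prefix matches -- the same shape as a B-Tree answering
    LIKE 'abc%'."""
    i = bisect.bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out

print(prefix_scan(keys, "abc"))  # ['abc', 'abcd', 'abce']

# A hash structure (dict) answers exact matches in constant time...
h = {k: True for k in keys}
print("abc" in h)                # True
# ...but its keys are unordered, so 'abc%' would mean checking every entry.
```

The dict has no notion of "the next key after abc", which is exactly why hash indexes can't serve ranges, ordering, or prefix scans.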

Implementation differences matter

MySQL's InnoDB stores full rows in its clustered-index leaf nodes. PostgreSQL leaf nodes store pointers into the heap (TIDs). Both choices create trade-offs: InnoDB's approach speeds primary-key range scans but makes inserts heavier, while every PostgreSQL index lookup requires an extra step (index to heap pointer to row), adding overhead.
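A toy model of the two layouts shows where the extra lookup comes from. The table data, page labels, and function names here are invented for illustration:

```python
# PostgreSQL-style: the index maps key -> heap pointer (TID),
# and the heap maps that pointer to the actual row.
index = {"alice": ("page7", 3)}
heap = {("page7", 3): {"name": "alice", "age": 30}}

def pg_lookup(key):
    tid = index[key]    # lookup 1: descend the index to a TID
    return heap[tid]    # lookup 2: fetch the row from the heap page

# InnoDB-style clustered index: the leaf entry *is* the row.
clustered = {"alice": {"name": "alice", "age": 30}}

def innodb_lookup(key):
    return clustered[key]  # one lookup; the row lives in the leaf

print(pg_lookup("alice") == innodb_lookup("alice"))  # True
```

Both return the same row; the difference is purely how many page visits it takes to get there, which is the overhead the paragraph above describes.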

B-Tree bloat is real. Deleted rows leave dead entries, page splits fragment data, and performance degrades over time. PostgreSQL's pgstattuple extension reports bloat percentages; REINDEX CONCURRENTLY rebuilds an index without blocking writes, though it's slower than a plain REINDEX. The fine print: monitoring index health matters more than most teams assume.

The pattern holds

B-Trees have dominated since the 1970s not through inertia but because the page-level constraints haven't changed. Storage is still block-based. Reads still dominate most workloads. The logarithmic lookup cost grows slowly enough that even at massive scale, a handful of page reads beats alternatives. We'll see if column stores or LSM trees change this for write-heavy workloads, but for general-purpose relational databases, B-Trees remain the rational default.