It is a frustrating, almost universal rite of passage for every backend developer.
You launch a new application, and initially, everything is lightning-fast. But as your platform scales, you hit a milestone. Say, 50 million records in your primary transactions table. A user navigates to your data grid and decides to click “Page 10,000.”
Behind the scenes, your ORM translates this benign request into a seemingly standard SQL query:
SELECT * FROM transactions ORDER BY created_at LIMIT 50 OFFSET 500000;

Suddenly, your database CPU utilization spikes to 100%. Query latency jumps from a snappy 2 milliseconds to an agonizing 4,500 milliseconds. The memory pressure from that massive sequential scan forces the cache to evict frequently-accessed data, which you can watch unfold in real time as your Page Life Expectancy metrics plummet.
The OFFSET clause has no algorithmic shortcut. To reach row 500,000, the database must physically retrieve the first 500,050 rows into memory, throw away the first 500,000, and return the last 50. It cannot “skip ahead” the way you can with a bookmark. The same applies to SELECT COUNT(*): there’s no cached answer; the engine counts every qualifying row from scratch.

To engineers familiar with algorithms, one theoretical fix looks obvious: the Order Statistics Tree (OST). It is a mathematically elegant data structure that would reduce both counting and pagination from O(N) to O(log n).
Yet no mainstream transactional database (PostgreSQL, MySQL/InnoDB, Oracle, SQL Server) uses them. This is not an oversight. It is a deliberate, unavoidable architectural trade-off driven by three fundamental incompatibilities. Let’s break each one down.
Background
What Is a B-Tree?
The index structure used by virtually every relational database. Each internal node stores routing keys: values used to navigate left or right toward a leaf node, where the actual data pointer lives. Internal nodes contain no information about how many rows exist beneath them.
What Is an Order Statistics Tree?
A B-tree augmented with one extra piece of data per internal node: the exact count of all rows in that node’s subtree. With this, finding the 500,000th row becomes a tree traversal: at each node, compare your target rank against the left subtree’s count and go left or right accordingly. This turns an O(N) scan into an O(log n) descent.
Here’s a simplified illustration of why this works: if the root node says its left subtree has 400,000 rows and you want row 500,000, you immediately know to go right and look for row 100,000 in the right subtree, skipping 400,000 rows in a single step. Repeat for each level of the tree (typically 4–6 levels in a production database) and you’ve found your row in just a handful of steps.
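The descent described above can be sketched in a few lines. This is a hypothetical in-memory model (a plain binary tree with a `left_count` field, not a real disk-based B-tree page layout); the names and node shape are illustrative, chosen only to make the rank arithmetic concrete:

```python
class Node:
    """Order-statistics node: each node knows its left subtree's row count."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        # The augmentation: count of rows in the left subtree,
        # maintained on every write.
        self.left_count = Node._size(left)

    @staticmethod
    def _size(node):
        if node is None:
            return 0
        return node.left_count + 1 + Node._size(node.right)

def select(node, k):
    """Return the key of the k-th smallest row (1-based), in O(tree height)."""
    while node is not None:
        if k == node.left_count + 1:
            return node.key           # this node is exactly the k-th row
        elif k <= node.left_count:
            node = node.left          # target is inside the left subtree
        else:
            k -= node.left_count + 1  # skip the entire left subtree at once
            node = node.right
    raise IndexError("rank out of range")

def build(sorted_keys):
    """Build a balanced tree from sorted keys (for demonstration only)."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return Node(sorted_keys[mid], build(sorted_keys[:mid]), build(sorted_keys[mid + 1:]))

tree = build(list(range(1, 16)))  # keys 1..15
print(select(tree, 8))            # the 8th-smallest key, found without scanning
```

Each comparison discards an entire subtree, which is exactly the “skip 400,000 rows in one step” behavior the paragraph above describes.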
The promise is compelling. The execution, however, collides with three pillars of how real databases work.
1. Multi-Version Concurrency Control makes a globally-correct row count impossible to store on a node.
2. Cascading count updates create a universal lock choke point that kills multi-core scalability.
3. Propagating counts up the tree multiplies physical I/O and WAL logging costs by the tree’s depth.
Reason 01
The MVCC Paradox: “Count” Is Relative
The most mathematically unyielding barrier is the industry-wide reliance on Multi-Version Concurrency Control (MVCC), the mechanism that lets your database handle thousands of simultaneous reads and writes without them blocking each other.
Instead of locking a row while it’s being written (which would force readers to wait), MVCC keeps multiple physical versions of every row simultaneously. Each transaction gets its own consistent snapshot of the database, the state as it existed when that transaction began. This is why you can run a long report query while inserts are happening around you and see a perfectly stable result.
MVCC is what makes modern databases usable under load. But it creates a fundamental problem for any data structure that tries to store a definitive row count: there is no single correct answer.
Consider this scenario:
- Transaction A starts and inserts 10,000 new rows, but hasn’t committed yet.
- Transaction B starts and runs SELECT COUNT(*).
From Transaction B’s snapshot, those 10,000 rows don’t exist yet. The correct count for B excludes them. But from Transaction A’s own perspective, the rows definitely exist. Now, if we stored a count of N + 10,000 on the root node, Transaction B would see an inflated, incorrect count. If we stored N, Transaction A would see an incorrect count of its own in-progress work. There is no single integer the root node could store that satisfies both snapshots at once: under MVCC, the correct count is a property of the transaction asking, not of the tree.
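The scenario above can be modeled with a toy visibility check, loosely inspired by PostgreSQL’s xmin/snapshot rules but heavily simplified; every name and structure here is illustrative, not a real engine’s internals:

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    xmin: int  # id of the transaction that inserted this row version

@dataclass
class Snapshot:
    xmax: int          # first txid not yet started when the snapshot was taken
    in_progress: set   # txids still uncommitted at snapshot time

def visible(tup, snap):
    # A row counts only if its inserting transaction had already
    # committed when this snapshot was taken.
    return tup.xmin < snap.xmax and tup.xmin not in snap.in_progress

# Three old committed rows, plus two rows inserted by the still-open
# Transaction A (txid 100).
rows = [Tuple(xmin=50)] * 3 + [Tuple(xmin=100)] * 2

# Transaction B's snapshot: txid 100 is still in progress, so B sees 3 rows.
snap_b = Snapshot(xmax=101, in_progress={100})
count_b = sum(visible(t, snap_b) for t in rows)
print(count_b)  # 3

# Transaction A sees its own writes (real engines special-case "my own txid";
# modeled here simply by leaving 100 out of in_progress). A sees 5 rows.
snap_a = Snapshot(xmax=101, in_progress=set())
print(sum(visible(t, snap_a) for t in rows))  # 5
```

Both answers, 3 and 5, are simultaneously correct for their respective transactions, which is precisely why no single stored counter can serve them both.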
Reason 02
The “Hot Root” Bottleneck: A Concurrency Disaster
Even if we invented a version of MVCC that could tolerate stored counts, Order Statistics Trees would still be rejected. This time for a reason that hits even harder in practice: they are fundamentally incompatible with how modern databases achieve high concurrency.
Modern database engines are designed to let multiple threads work on the index tree simultaneously with minimal locking. Most use a variant of the B-link tree architecture, which adds sideways pointers between sibling nodes. These pointers let threads traverse the tree without holding locks across multiple levels at once: a write operation only needs to lock the specific leaf it’s modifying, and then it’s done.
An Order Statistics Tree destroys this property entirely.
Think about what happens when a new row is inserted into a leaf node. That leaf’s subtree count must increase by 1. But so must the parent node’s count. And the grandparent’s. And so on, all the way to the root. This cascade is not optional; it is the mathematical requirement for the O(log n) guarantee to hold. And since every insert, anywhere in the table, must touch the same root node, the root becomes a single lock that all writers queue behind: a “hot root” that serializes an otherwise parallel workload.
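A minimal sketch of that cascade, assuming an illustrative node shape with one lock and one counter per node (not any real engine’s internals), shows why every insert ends up contending on the same root:

```python
import threading

class CountedNode:
    """Illustrative OST node: a subtree counter guarded by its own lock."""
    def __init__(self, parent=None):
        self.parent = parent
        self.subtree_count = 0
        self.lock = threading.Lock()

def insert_into(leaf):
    """Simulate the ancestor-count cascade triggered by one row insert."""
    node, locked = leaf, []
    while node is not None:
        node.lock.acquire()   # every concurrent insert contends here...
        locked.append(node)
        node = node.parent    # ...all the way up to the single shared root
    for n in locked:
        n.subtree_count += 1  # the non-optional count maintenance
    for n in reversed(locked):
        n.lock.release()

# A 3-level path: root -> inner -> leaf. Note that EVERY insert in the
# entire tree, whatever leaf it lands on, must take the root's lock.
root = CountedNode()
inner = CountedNode(parent=root)
leaf = CountedNode(parent=inner)
insert_into(leaf)
print(root.subtree_count)  # 1
```

Contrast this with a B-link tree, where the same insert would touch only the leaf’s lock and never climb toward the root.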
Reason 03
Write Amplification: One Insert, Five Disk Writes
The final nail in the coffin is physical: Order Statistics Trees wage war on your storage hardware.
Write amplification is the ratio between how much data is physically written to disk and how much data you logically changed. All databases have some write amplification (changing a 30-byte row means writing a full 4KB page to disk), but good architecture keeps it localized and predictable.
In a standard B-tree, inserting a row dirties exactly one page: the leaf node. That’s one disk write (or one WAL entry, more precisely).
In a 5-level-deep Order Statistics Tree, that same insert dirties five pages: the leaf, its parent, grandparent, great-grandparent, and the root. You’ve multiplied your write load by 5×.
Before any page is modified on disk, databases first record the change in a sequential log file called the Write-Ahead Log (WAL). This is what allows crash recovery: if the server dies mid-write, the WAL can replay the changes. Every dirtied page costs a WAL entry.
| Structure | Pages dirtied per insert | WAL entries per insert | Replication cost |
|---|---|---|---|
| Standard B-Tree | 1 page | 1 entry | Low |
| Order Statistics Tree (5 levels) | 5 pages | 5 entries | 5× higher |
At scale, this matters enormously. The compounding consequences are:
- Disk I/O saturation: storage throughput limits are hit much earlier under write load.
- Replication lag: replica servers must replay every WAL entry; 5× the entries means 5× the lag risk.
- SSD wear: enterprise SSDs have finite write endurance (TBW ratings); amplified writes burn through that endurance proportionally faster.
The Practical Fix
How We Survive: Keyset Pagination
Since OSTs are off the table, developers need a different approach to pagination at scale. The industry-standard answer is Keyset Pagination (also called cursor-based pagination).
The insight is simple: instead of telling the database “skip 500,000 rows,” give it a bookmark (the last value you saw) and ask for everything after that. Your existing B-tree index can jump directly to that value in O(log n) time, no scanning required.
-- Offset pagination: must scan and discard 500,000 rows
SELECT * FROM transactions
ORDER BY created_at
LIMIT 50 OFFSET 500000;

-- Keyset pagination: the index seeks straight to the bookmark
SELECT * FROM transactions
WHERE created_at > '2025-09-17T14:32:00' -- bookmark from last page
ORDER BY created_at
LIMIT 50;

The trade-off is that keyset pagination doesn’t support jumping to an arbitrary page number. You can only go “next” or “previous.” For most real-world APIs (infinite scroll, feed pagination, export jobs), this is entirely acceptable. For admin interfaces that genuinely need “go to page 10,000,” the honest answer is that the feature should be redesigned, or the count should be approximated.
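On the application side, the bookmark travels with the response as a cursor. Here is a small in-memory sketch of that flow; the `fetch_page` helper and data shapes are hypothetical stand-ins for the SQL above, and it adds the row id as a tiebreaker, since timestamps can repeat:

```python
def fetch_page(rows, cursor=None, page_size=50):
    """rows: (created_at, id) tuples already sorted by (created_at, id).

    Stand-in for: WHERE (created_at, id) > (:last_created_at, :last_id)
                  ORDER BY created_at, id LIMIT :page_size
    """
    if cursor is not None:
        # Tuple comparison mirrors SQL row-value comparison: strictly
        # after the bookmark, never re-serving or skipping rows.
        rows = [r for r in rows if (r[0], r[1]) > cursor]
    page = rows[:page_size]
    # The cursor for the next request is simply the last row we returned.
    next_cursor = (page[-1][0], page[-1][1]) if page else None
    return page, next_cursor

data = sorted((f"2025-09-17T14:{m:02d}", i) for i, m in enumerate([5, 5, 7, 9]))
page1, cur = fetch_page(data, page_size=2)
page2, _ = fetch_page(data, cursor=cur, page_size=2)
print(page2)  # the two rows strictly after the bookmark
```

Because the duplicate timestamps are disambiguated by id, every row is served exactly once, which a bare `created_at >` comparison cannot guarantee.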
PostgreSQL, for example, maintains a statistics-based row estimate in pg_class.reltuples. Querying it is essentially free: SELECT reltuples FROM pg_class WHERE relname = 'transactions'. It won’t be exact, but for display purposes it’s usually good enough and avoids a full COUNT(*) scan entirely.

A Deliberate, Brilliant Trade-off
The absence of Order Statistics Trees in your favorite relational database is not a failure of imagination. It is a testament to how deeply the constraints of concurrency, isolation, and physical hardware shape what is architecturally possible.
Augmenting every B-tree node with a subtree count looks like a free lunch, a small addition that buys O(log n) counting and pagination for free. But that integer cascades upward on every write, violates transactional isolation under MVCC, serializes all writes through the root, and multiplies physical I/O by the tree’s depth.
Database architects made the right call. The job of an OLTP engine is to handle thousands of concurrent writes safely, accurately, and durably. Sacrificing write throughput, multi-core scalability, and ACID isolation to optimize OFFSET queries is an untenable trade-off, especially when keyset pagination solves the underlying problem without any of those costs.
As developers, the burden falls on us to adapt: ditch arbitrary offsets, use index-aware cursors, and let the database do what it was designed to do.