Lecture · L05

Lower Bounds for Sorting


Every sorting algorithm we have seen so far — merge sort, heap sort, the sort hidden inside BST construction — takes $\Theta(n \log n)$ comparisons in the worst case, and insertion sort does even worse. The natural question: is $n \log n$ a fundamental wall, or just a failure of imagination so far? This lecture answers both halves of that question. The structure is theorem → proof → counterexample, where the "counterexample" is not a bug in the theorem but a demonstration that its hypotheses were doing real work. Change the hypotheses and the answer changes.

Concretely, the lecture does three things:

  1. Defines a restricted model of computation — the comparison model — that captures every sorting algorithm we've studied so far.
  2. Proves that inside that model, sorting requires $\Omega(n \log n)$ comparisons and searching requires $\Omega(\log n)$ comparisons. So merge sort and binary search are not merely "best known" — they are provably optimal.
  3. Steps outside the comparison model into the standard RAM model, where keys are integers we are allowed to do arithmetic on, and shows two algorithms that sort in linear time when the keys are not too large: counting sort and radix sort.

The Comparison Model

The setup is deliberately restrictive. Input items are treated as black boxes — abstract data types whose internal structure is hidden. The only operation the algorithm is permitted to perform on items is comparison: given two items $x$ and $y$, ask one of $x < y$, $x \leq y$, $x > y$, $x \geq y$, or $x = y$. Each such comparison returns one bit (yes or no).

The cost of an algorithm in this model is defined as the number of comparisons it performs. Everything else is free: pointer manipulation, swaps, allocating arrays, integer arithmetic on indices, copying data — none of it counts. This is the strange and important part of the model. We are being deliberately generous about what is free, because if we can prove a lower bound under this generosity, the bound is unconditional: it holds for any algorithm restricted to inspecting items only via comparisons, no matter how clever the bookkeeping.

Why this captures everything we've seen

Every sorting and searching algorithm from earlier lectures fits inside this model:

  • Merge sort moves items between arrays, but only inspects them via $<$ during the merge step.
  • Heap sort sifts items up and down, but the only thing it ever asks about an item is how it compares to its parent or child.
  • BST operations walk the tree by comparing the query key to each node's key.
  • Binary search compares the target to the middle element of a subarray.

So a lower bound proved in the comparison model applies to all of these simultaneously.

A subtlety about cost

When we previously said "binary search runs in $O(\log n)$", the $\log n$ was simultaneously the number of comparisons it performs and the number of real-time steps it takes on a RAM machine. In the comparison model we are only counting the first. The lower bound we are about to prove says: even if everything except comparisons is free, you still cannot get below $\log n$ for searching, or below $n \log n$ for sorting.


Decision Trees

The trick that makes lower bounds in this model tractable is to draw out all possible executions of an algorithm at once, as a tree.

Fix an input size $n$. For any deterministic algorithm in the comparison model, the very first comparison it makes is determined by the algorithm itself — it doesn't yet depend on anything about the input. After that comparison returns yes or no, the algorithm branches into two possible second comparisons (one for each answer), and so on. This branching structure is a binary tree, called the decision tree of the algorithm at input size $n$.

  • Each internal node is a comparison the algorithm performs at that point in some execution. It has two children, one for each possible binary outcome.
  • Each leaf represents the algorithm having terminated and produced an answer. The leaf is labeled with that answer.

Concrete example: binary search at $n = 3$

Take an array $A[0..2]$ and a query value $x$. Binary search at $n = 3$ first compares $x$ to the middle element $A[1]$:

  • Is $A[1] < x$?
    • No (so $x \leq A[1]$): now compare $x$ to $A[0]$.
      • Is $A[0] < x$?
        • No: $x \leq A[0]$. Output: $x$ falls at or before position 0.
        • Yes: $A[0] < x \leq A[1]$. Output: $x$ falls strictly between positions 0 and 1.
    • Yes (so $x > A[1]$): now compare $x$ to $A[2]$.
      • Is $A[2] < x$?
        • No: $A[1] < x \leq A[2]$. Output: $x$ falls between positions 1 and 2.
        • Yes: $x > A[2]$. Output: $x$ falls past position 2.

Two comparisons in every execution, four possible answers — four leaves, height 2. This is what the decision tree of binary search looks like at $n = 3$. For larger $n$, the same shape extends: a balanced binary tree of height $\lceil \log_2 n \rceil$.
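The path-length correspondence is easy to watch in code. Here is a small Python sketch (my own illustration, not lecture code; `search_position` is a made-up name) of a binary search that counts its own comparisons — each count corresponds to one internal node visited on the root-to-leaf path:

```python
def search_position(A, x):
    """Return (insertion point of x in sorted A, number of comparisons made)."""
    lo, hi = 0, len(A)            # invariant: the answer lies in [lo, hi]
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1          # one comparison per internal node visited
        if A[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo, comparisons

A = [10, 20, 30]
for x in (5, 15, 25, 35):         # one query per leaf of the n = 3 tree
    pos, c = search_position(A, x)
    assert c == 2                 # every root-to-leaf path has length 2
```

For $n = 3$ this traces exactly the tree drawn above: first a comparison against $A[1]$, then one against $A[0]$ or $A[2]$, reaching one of four leaves.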

For sorting, the trees become enormous — exponentially many nodes — so we never actually draw them. But the conceptual picture is the same: each internal node asks "$A_i < A_j$?" for some $i, j$, and each leaf is labeled with a permutation of the input that the algorithm has determined to be the sorted order.

Reading running time off the tree

This is where decision trees become useful, because they translate questions about algorithms into questions about trees, and we know a lot about trees.

  • A single execution of the algorithm corresponds to a root-to-leaf path in the decision tree. At each internal node along the path, the comparison happens; the path turns left or right depending on the answer; eventually it reaches a leaf and outputs the label.
  • The cost of that single execution is the length of that path (number of comparisons performed). One comparison per internal node visited.
  • The worst-case running time of the algorithm is therefore the height of the tree — the length of the longest root-to-leaf path.

So the question "how few comparisons does any algorithm need in the worst case to solve problem $P$?" becomes "what is the minimum possible height of any decision tree that solves $P$?" And that is a question about binary trees, which is much easier to reason about.


Lower Bound for Searching

Claim. Any comparison-based algorithm that searches among $n$ preprocessed items requires $\Omega(\log n)$ comparisons in the worst case.

The phrase "preprocessed" is a strong concession. It means we are allowed to do arbitrarily much work on the $n$ items ahead of time — sort them, build them into an AVL tree, build any data structure we like — and none of that work counts. The clock starts only when a query item $x$ arrives, and we count only the comparisons between $x$ and the stored items during that query.

Proof. Take any comparison-based search algorithm and look at its decision tree.

  • The decision tree is binary, because each comparison returns one bit.
  • The decision tree must contain at least $n$ leaves. Why? Because the algorithm has to be able to output any of at least $n$ distinct answers — for instance, "$x$ matches item 1", "$x$ matches item 2", …, "$x$ matches item $n$" — and each distinct answer needs at least one leaf labeled with it. There may be more leaves than this (the same answer can appear in multiple leaves, depending on the algorithm), but there cannot be fewer.

A binary tree with at least $n$ leaves has height at least $\log_2 n$. (This is the basic fact that a binary tree of height $h$ has at most $2^h$ leaves, so $h \geq \log_2(\text{leaves})$.) The height is the worst-case running time. So any comparison-based search needs at least $\log_2 n$ comparisons in the worst case. $\blacksquare$

That justifies retroactively why binary search trees were worth caring about: in the comparison model, $\Theta(\log n)$ for search, predecessor, and successor is the best anyone can ever do. AVL trees hit that bound, so they're optimal.

A subtle point worth pausing on: this lower bound holds even with preprocessing allowed. If we drop the preprocessing assumption, the truth is actually stronger — you need $\Omega(n)$ time, because you might have to look at every item just to know what's there. But this proof technique doesn't capture that stronger bound; the decision-tree argument only ever gives $\Omega(\log n)$. So the technique is useful but not always tight.


Lower Bound for Sorting

Claim. Any comparison-based sorting algorithm requires $\Omega(n \log n)$ comparisons in the worst case.

The strategy is identical: bound the number of leaves the decision tree must have, then take the logarithm.

A leaf in a sorting algorithm's decision tree is labeled with a permutation of the input — saying something like "the smallest element was originally $A_5$, the next was $A_7$, then $A_1$, then $A_0$, …". The algorithm has done enough comparisons by the time it reaches that leaf to know what the sorted order is, and it just writes it down (writing is free).

How many distinct outputs does a sorting algorithm need to be able to produce? All possible permutations of the input, since for any permutation there exists some input that requires that exact output. There are $n!$ permutations of $n$ distinct items. So:

  • The decision tree is binary.
  • It must have at least $n!$ leaves.
  • Therefore its height is at least $\log_2(n!)$.

The remaining task is purely algebraic: show that $\log_2(n!) = \Omega(n \log n)$.

Bounding $\log_2(n!)$ from below

Two ways. The clean one uses Stirling's approximation; the more elementary one is a direct summation argument.

Summation approach. Use the identity $\log(ab) = \log a + \log b$ to expand:

$$\log_2(n!) = \log_2 n + \log_2(n-1) + \cdots + \log_2 2 + \log_2 1 = \sum_{i=1}^{n} \log_2 i$$

We want to lower-bound this sum. The trick: throw away the smaller half of the terms, then bound each remaining term from below by the smallest one in that half.

$$\sum_{i=1}^{n} \log_2 i \;\geq\; \sum_{i=n/2}^{n} \log_2 i \;\geq\; \sum_{i=n/2}^{n} \log_2(n/2)$$

The first inequality drops the first half of the terms (they're all nonnegative, so the sum only gets smaller). The second replaces every remaining term $\log_2 i$ with the smallest such term, $\log_2(n/2)$ — since $i \geq n/2$ throughout the sum, every term is at least $\log_2(n/2)$.

Now the sum is trivial: there are at least $n/2$ identical terms, each equal to $\log_2(n/2) = \log_2 n - 1$. So:

$$\sum_{i=n/2}^{n} \log_2(n/2) \;\geq\; \frac{n}{2}(\log_2 n - 1) = \frac{n \log_2 n}{2} - \frac{n}{2}$$

The dominating term is $\frac{n \log_2 n}{2}$, and the $-n/2$ is negligible asymptotically. So $\log_2(n!) \geq \frac{n \log_2 n}{2} - \frac{n}{2} = \Omega(n \log n)$. $\blacksquare$
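The bound is easy to sanity-check numerically. This small Python sketch (my own illustration, not part of the lecture) compares the exact value of $\log_2(n!)$ against the $\frac{n}{2}(\log_2 n - 1)$ lower bound for a few sizes:

```python
import math

for n in (4, 16, 256, 4096):
    log2_fact = sum(math.log2(i) for i in range(1, n + 1))   # exact log2(n!)
    lower = (n / 2) * (math.log2(n) - 1)                     # the summation bound
    assert log2_fact >= lower
```

The gap between the two grows, which is consistent with the summation argument losing a factor of 2 in the leading constant relative to Stirling.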

Stirling approach. Stirling's formula says $n! \approx \sqrt{2\pi n}\,(n/e)^n$. Taking $\log_2$ of both sides and using $\log(ab) = \log a + \log b$:

$$\log_2(n!) \approx \tfrac{1}{2}\log_2(2\pi n) + n \log_2 n - n \log_2 e$$

The dominant term is $n \log_2 n$, with a linear correction of $-n \log_2 e + \tfrac{1}{2}\log_2(2\pi n)$. Asymptotically, $\log_2(n!) = n \log_2 n - O(n)$. The leading constant is exactly 1, so this method even gives a tighter result than the summation argument (which had a constant of $1/2$). Either way, the conclusion is the same: $\log_2(n!) = \Omega(n \log n)$, hence sorting requires $\Omega(n \log n)$ comparisons. $\blacksquare$

So merge sort, heap sort, and any other $O(n \log n)$ comparison sort are optimal in the comparison model. The framework is now closed: in this model, search is $\Theta(\log n)$, sort is $\Theta(n \log n)$, and we cannot do better.


Leaving the Comparison Model

The remaining question is: what if we allow ourselves more than just comparisons? In the standard RAM (Random Access Machine) model, integers fit in machine words and we can do arithmetic — addition, subtraction, multiplication, division, modulus, indexing into arrays — all in $O(1)$ time per operation. Comparisons are still allowed too, of course, but they are no longer the only thing.

The next two algorithms exploit this extra power. Both are integer sorting algorithms — they assume the keys we are sorting are integers, not arbitrary black-box comparables. This is a real assumption, but a practically benign one: most things you sort on a computer are already represented as integers (or can be mapped to integers cheaply).

Setup for integer sorting

Throughout the rest of the lecture, the assumptions are:

  • We are sorting $n$ items whose keys are integers in the range $\{0, 1, \ldots, k-1\}$.
  • Each key fits in a single machine word, so all the standard arithmetic operations on a key cost $O(1)$.

Both $n$ and $k$ are now parameters. The interesting question is how the running time depends on each.

A side note: integer sorting is still an active area of research. The current best general result (when nothing is assumed about $k$ beyond "fits in a word") is roughly $O(n \sqrt{\log \log n})$ in expectation — almost linear, not quite. Whether you can sort $n$ word-sized integers in deterministic $O(n)$ time remains open. We're not going to touch that here. We will instead focus on the regime where $k$ is "not too large", and show two algorithms that achieve linear time in that regime.


Counting Sort

The intuition is exactly what the name suggests. If I hand you the multiset $\{3, 5, 7, 5, 5, 3, 6\}$, you can sort it by first counting: there are two 3s, three 5s, one 6, one 7. Then you reconstruct the sorted output: write two 3s, then three 5s, then a 6, then a 7. Done.

To turn this into an algorithm, allocate an array $L$ of length $k$ — one slot per possible key value. There's a tempting first version where each slot is just an integer counter, but that version can only sort the keys themselves, not items with keys plus extra payload. We want to handle the more general case where each item has a key but also drags along other data — like a spreadsheet row where we sort by one column but want all the other columns to come along for the ride. So instead, each slot of $L$ holds a list of items.

The algorithm

def counting_sort(A, key, k):
    L = [[] for _ in range(k)]          # allocate k empty lists
    for j in range(len(A)):
        L[key(A[j])].append(A[j])       # drop each item into its key's list
    output = []
    for i in range(k):
        output.extend(L[i])             # concatenate the lists in key order
    return output

Walking through it: the first loop scans the input array once, and for each item, appends it to the list indexed by its key. After this loop, all items with key $0$ live in $L[0]$, all items with key $1$ live in $L[1]$, and so on — and within each list, items appear in the order they appeared in the input. The second loop concatenates these lists in order $L[0], L[1], \ldots, L[k-1]$, producing a sorted output.

Note the use of key(item) rather than treating the item itself as the integer. This is the same convention as Python's sort(key=...): the items can be anything, as long as we have a function that extracts an integer key from each one. The key extraction is assumed to be $O(1)$, which is reasonable since keys fit in a word.

Running time

  • Allocating $L$: creating $k$ empty lists is $O(k)$.
  • First loop: $n$ iterations, each doing one append (which is $O(1)$ in Python and most reasonable list implementations). Total: $O(n)$.
  • Second loop: visits each of the $k$ slots, and for each slot $L[i]$, extends the output by the contents of $L[i]$. Visiting an empty slot is $O(1)$; copying $|L[i]|$ items is $O(|L[i]|)$. Summed over all $i$, the visit cost is $O(k)$ and the copy cost is $O(n)$ (because the total number of items across all lists is exactly $n$).

Adding it up: $O(n + k)$.

So counting sort runs in $\Theta(n + k)$. If $k = O(n)$, this is linear. As soon as $k$ grows much larger than $n$ — say, $k = n^2$ — the running time degrades and counting sort becomes worse than merge sort. So counting sort is good for small key ranges, useless for large ones. We need something better.

Stability — a property worth naming

Notice that within each list $L[i]$, items appear in the same order they appeared in the input array. This means counting sort is stable: items with equal keys preserve their original relative order in the output. We didn't have to do anything special to achieve this — it falls out of the fact that we append items in input order. Stability looks like a minor accounting detail, but it is exactly what makes the next algorithm work.
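To see stability concretely, here is a minimal Python sketch (my own example; it re-implements the bucket-list counting sort from above so it stands alone) sorting (key, payload) pairs:

```python
def counting_sort(A, key, k):
    L = [[] for _ in range(k)]
    for x in A:
        L[key(x)].append(x)       # equal keys enter their bucket in input order
    out = []
    for bucket in L:
        out.extend(bucket)
    return out

# Rows sorted by the first field; ties keep their original order.
rows = [(1, "b"), (0, "x"), (1, "a"), (0, "y")]
assert counting_sort(rows, key=lambda r: r[0], k=2) == [
    (0, "x"), (0, "y"), (1, "b"), (1, "a")
]
```

The two key-1 rows come out as "b" before "a" — not alphabetical, but exactly their input order, which is what stability promises.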


Radix Sort

Radix sort is the cooler algorithm. It uses counting sort as a subroutine — that's why we spent so much time on counting sort even though it isn't satisfying on its own — and dramatically extends the regime where linear-time integer sorting works. With counting sort alone, $k$ has to be $O(n)$. With radix sort, $k$ can be polynomial in $n$ — for example, all keys can be integers up to $n^{100}$, and radix sort still runs in linear time. That's a huge generalization.

The idea

Pick a base $b$ (which we'll choose later) and imagine writing every key in base $b$. If the maximum key value is $k$, then each key has at most

$$d = \lceil \log_b k \rceil + 1$$

base-$b$ digits. The digits of a number $x$ in base $b$ are extracted by $x \bmod b$ (least significant digit), $\lfloor x/b \rfloor \bmod b$ (next digit), and so on — each extraction is a constant number of arithmetic operations on word-sized integers, so each digit can be computed in $O(1)$.

We never actually rewrite the numbers in base $b$; we just compute digits on demand when we need them.
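On-demand digit extraction is a one-liner; a quick Python sketch (illustrative only; `digit` is a made-up helper name):

```python
def digit(x, i, b):
    """The i-th least significant base-b digit of x."""
    return (x // b**i) % b

# 13 is 1101 in base 2: its digits, least significant first, are 1, 0, 1, 1
assert [digit(13, i, 2) for i in range(4)] == [1, 0, 1, 1]
```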

The algorithm itself is then:

  1. Sort all the integers by their least significant digit.
  2. Sort by the next least significant digit.
  3. Continue up through the digits, ending with a sort by the most significant digit.

That is, $d$ passes total, each one a complete sort of the entire array of $n$ items, but using only one digit as the key. The output of the final pass is the fully sorted array.

This is the same trick spreadsheets use: if you want to sort by several columns, you can click the least important column first, then the next-most-important, then the most important — and at the end you get a multi-column sort. It is genuinely surprising that this works, and the reason it works is stability.

Why it works (informal)

Suppose you've already sorted the items by the lower-order digits, and now you sort them by the next digit up. Two items that differ in the new digit get separated correctly by this new sort — that's obvious. Two items that agree in the new digit need to remain in their previous relative order, because their previous relative order was determined by the lower-order digits, which is correct. The stable sort guarantees exactly this: equal keys (here, equal values in the current digit) preserve their relative order. So each pass extends the sorted-prefix-of-digits one position higher without disturbing the work done by earlier passes. After $d$ passes, all $d$ digits have been incorporated and the array is fully sorted. (A clean proof is by induction on the number of passes completed.)

This is also why we need counting sort specifically as the inner sort — counting sort is stable, and it's fast for small key ranges, which is exactly what we have when we're sorting by a single base-$b$ digit (range $\{0, \ldots, b-1\}$).
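Putting the pieces together, here is a compact Python sketch (my own illustration, not the lecture's code; `radix_sort` is a made-up name) that runs one stable bucket pass per base-$b$ digit, with $b$ set to $n$ as discussed below:

```python
def radix_sort(A):
    """Sort nonnegative integers with LSD radix sort, base b = n,
    using a stable counting-sort pass for each digit."""
    if not A:
        return []
    b = max(len(A), 2)                 # base b = n (clamped to at least 2)
    k = max(A)                         # largest key determines the digit count
    out = list(A)
    exp = 1                            # current digit weight: b ** (pass number)
    while exp <= k:                    # one stable pass per base-b digit
        buckets = [[] for _ in range(b)]
        for x in out:
            buckets[(x // exp) % b].append(x)   # extract the digit on demand
        out = [x for bucket in buckets for x in bucket]
        exp *= b
    return out

data = [170, 45, 75, 90, 802, 24, 2, 66]
assert radix_sort(data) == sorted(data)
```

For simplicity this version sorts bare integers; handling items with payloads works the same way, bucketing by a digit of `key(x)` instead of `x`.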

Running time

Each pass is a counting sort over a key range of size $b$. By the counting sort analysis, that costs $O(n + b)$. There are $d$ passes, so the total is:

$$T(n) = d \cdot O(n + b) = O\big((n + b) \cdot d\big) = O\big((n + b) \log_b k\big)$$

Now we get to choose $b$. We want this to be as small as possible. We have a sum of two terms multiplied by a $\log_b k$, and the standard trick when minimizing such a thing is to balance the two parts of the sum — typically the optimum is reached when they are equal, or close to it.

If we set $b = n$, then $n + b = 2n = O(n)$, and $\log_b k = \log_n k$, so:

$$T(n) = O(n \log_n k)$$

Why is this the right choice? If $b$ is much smaller than $n$, then $\log_b k$ is unnecessarily large. If $b$ is much larger than $n$, then the $b$ term in $(n + b)$ dominates and we're paying for the size of the key range without using it. Setting $b = \Theta(n)$ keeps $(n + b)$ proportional to $n$ while making each digit as wide as possible.

When this is linear

The key payoff: suppose $k \leq n^c$ for some constant $c$. Then:

$$\log_n k \leq \log_n(n^c) = c$$

so:

$$T(n) = O(n \cdot c) = O(n)$$

The constant $c$ disappears into the asymptotic notation. So if your integer keys are anywhere in the range $\{0, 1, \ldots, n^c\}$ for any constant $c$ — keys up to $n$, $n^2$, $n^{100}$, anything polynomial in $n$ — radix sort sorts them in linear time.

This is the punchline of the lecture. In the comparison model, sorting needs $\Theta(n \log n)$. By stepping out of that model and exploiting integer arithmetic, we sort polynomially-bounded integers in $\Theta(n)$. The $n \log n$ wall isn't a property of sorting — it's a property of sorting via comparisons only.


Summary

The lecture's arc: a model is a contract about what your algorithm is allowed to do, and lower bounds are statements about that contract. The comparison model captures everything we'd previously called sorting and searching, and inside it the $n \log n$ and $\log n$ bounds are tight. The proof is short and entirely structural: count the answers an algorithm must distinguish; a binary decision tree distinguishing that many things needs that many leaves, and a binary tree with $L$ leaves has height at least $\log_2 L$.

Once we relax the model and let the algorithm do arithmetic on integer keys, the picture changes. Counting sort gives $\Theta(n + k)$ — linear when $k = O(n)$, useless when $k$ is much larger. Radix sort uses counting sort as a stable inner subroutine and digit-decomposes the keys, achieving $O(n \log_n k)$, which is linear whenever $k$ is polynomial in $n$.

The pattern worth taking away from all of this: an asymptotic bound is never just a property of the problem. It is always a property of the (problem, model) pair. Change the model and the bound can change dramatically. The job of an algorithm designer is partly to know which model your real-world situation actually inhabits, so that you reach for the right tool.