Every sorting algorithm we have seen so far — merge sort, heap sort, insertion sort, the sort hidden inside BST construction — runs in Θ(n log n) time at best. The natural question: is n log n a fundamental wall, or just a failure of imagination so far? This lecture answers both halves of that question. The structure is theorem → proof → counterexample, where the "counterexample" is not a bug in the theorem but a demonstration that its hypotheses were doing real work. Change the hypotheses and the answer changes.
Concretely, the lecture does three things: it pins down the comparison model, it proves lower bounds inside that model (Ω(log n) for searching, Ω(n log n) for sorting), and it then steps outside the model to sort integer keys in linear time with counting sort and radix sort.
The setup is deliberately restrictive. Input items are treated as black boxes — abstract data types whose internal structure is hidden. The only operation the algorithm is permitted to perform on items is comparison: given two items x and y, ask one of x < y, x ≤ y, x = y, x ≥ y, or x > y. Each such comparison returns one bit (yes or no).
The cost of an algorithm in this model is defined as the number of comparisons it performs. Everything else is free: pointer manipulation, swaps, allocating arrays, integer arithmetic on indices, copying data — none of it counts. This is the strange and important part of the model. We are being deliberately generous about what is free, because if we can prove a lower bound under this generosity, the bound is unconditional: it holds for any algorithm restricted to inspecting items only via comparisons, no matter how clever the bookkeeping.
Every sorting and searching algorithm from earlier lectures fits inside this model: merge sort, heap sort, insertion sort, binary search, and binary search trees (including AVL trees) all inspect items exclusively through comparisons.
So a lower bound proved in the comparison model applies to all of these simultaneously.
When we previously said "binary search runs in O(log n)", the O(log n) was simultaneously the number of comparisons it performs and the number of real-time steps it takes on a RAM machine. In the comparison model we are only counting the first. The lower bound we are about to prove says: even if everything except comparisons is free, you still cannot get below Ω(log n) for searching, or below Ω(n log n) for sorting.
The trick that makes lower bounds in this model tractable is to draw out all possible executions of an algorithm at once, as a tree.
Fix an input size n. For any deterministic algorithm in the comparison model, the very first comparison it makes is determined by the algorithm itself — it doesn't yet depend on anything about the input. After that comparison returns yes or no, the algorithm branches into two possible second comparisons (one for each answer), and so on. This branching structure is a binary tree, called the decision tree of the algorithm at input size n.
Take a sorted array A of size n = 3 and a query value x. Binary search first compares x to the middle element A[1], then — depending on the answer — compares x to A[0] or to A[2].
Two comparisons in every execution, four possible answers — four leaves, height 2. This is what the decision tree of binary search looks like at n = 3. For larger n, the same shape extends: a balanced binary tree of height Θ(log n).
For sorting, the trees become enormous — exponentially many nodes — so we never actually draw them. But the conceptual picture is the same: each internal node asks "A[i] < A[j]?" for some pair of indices i, j, and each leaf is labeled with a permutation of the input that the algorithm has determined to be the sorted order.
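The decision-tree height can be observed directly by instrumenting an algorithm. A minimal sketch (not from the lecture): binary search with a comparison counter, where each three-way probe of A[mid] is counted as one comparison. The worst case over all queries is exactly the height of the tree, ⌊log n⌋ + 1 levels of probes:

```python
import math

def binary_search(A, x):
    """Binary search on a sorted list, counting item comparisons.
    Each three-way probe of A[mid] counts as one comparison."""
    comparisons = 0
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if x == A[mid]:
            return mid, comparisons
        elif x < A[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
    return None, comparisons

n = 1024
A = list(range(n))
worst = max(binary_search(A, x)[1] for x in range(n))
print(worst)   # → 11, i.e. floor(log2(1024)) + 1
```

The count matches the tree picture: 1024 answers need 1024 leaves, and a binary tree with 1024 internal probe levels of that many nodes has depth 11.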
This is where decision trees become useful, because they translate questions about algorithms into questions about trees, and we know a lot about trees.
So the question "how few comparisons does any algorithm need in the worst case to solve problem P?" becomes "what is the minimum possible height of any decision tree that solves P?" And that is a question about binary trees, which is much easier to reason about.
Claim. Any comparison-based algorithm that searches among n preprocessed items requires Ω(log n) comparisons in the worst case.
The phrase "preprocessed" is a strong concession. It means we are allowed to do arbitrarily much work on the items ahead of time — sort them, build them into an AVL tree, build any data structure we like — and none of that work counts. The clock starts only when a query item x arrives, and we count only the comparisons between x and the stored items during that query.
Proof. Take any comparison-based search algorithm and look at its decision tree. The tree must have at least n leaves, because there are at least n distinct answers the algorithm must be able to give — the query might match any of the n stored items.
A binary tree with at least n leaves has height at least log n (logs base 2 throughout). (This is the basic fact that a binary tree of height h has at most 2^h leaves, so 2^h ≥ n forces h ≥ log n.) The height is the worst-case running time. So any comparison-based search needs at least log n = Ω(log n) comparisons in the worst case.
That retroactively justifies why binary search trees were worth caring about: in the comparison model, Θ(log n) for search, predecessor, and successor is the best anyone can ever do. AVL trees hit that bound, so they're optimal.
A subtle point worth pausing on: this lower bound holds even with preprocessing allowed. If we drop the preprocessing assumption, the truth is actually stronger — you need Ω(n) time, because you might have to look at every item just to know what's there. But this proof technique doesn't capture that stronger bound; the decision-tree argument only ever gives Ω(log n). So the technique is useful but not always tight.
Claim. Any comparison-based sorting algorithm requires Ω(n log n) comparisons in the worst case.
The strategy is identical: bound the number of leaves the decision tree must have, then take the logarithm.
A leaf in a sorting algorithm's decision tree is labeled with a permutation of the input — saying something like "the smallest element was originally at index i₁, the next at i₂, then i₃, and so on". The algorithm has done enough comparisons by the time it reaches that leaf to know what the sorted order is, and it just writes it down (writing is free).
How many distinct outputs does a sorting algorithm need to be able to produce? All possible permutations of the input, since for any permutation there exists some input that requires that exact output. There are n! permutations of n distinct items. So the decision tree has at least n! leaves, and its height — the worst-case number of comparisons — is at least log(n!).
The remaining task is purely algebraic: show that log(n!) = Ω(n log n).
Two ways. The clean one uses Stirling's approximation; the more elementary one is a direct summation argument.
Summation approach. Use the identity log(ab) = log a + log b to expand:

    log(n!) = log 1 + log 2 + ⋯ + log n = Σ_{i=1}^{n} log i.
We want to lower-bound this sum. The trick: throw away the smaller half of the terms, then bound each remaining term from below by the smallest one in that half:

    Σ_{i=1}^{n} log i  ≥  Σ_{i=n/2}^{n} log i  ≥  Σ_{i=n/2}^{n} log(n/2).

The first inequality drops the first half of the terms (they're all positive, so the sum only gets smaller). The second replaces every remaining term with the smallest such term, log(n/2) — since i ≥ n/2 throughout the retained range, every term is at least log(n/2).
Now the sum is trivial: there are n/2 identical terms, each equal to log(n/2). So:

    log(n!) ≥ (n/2) log(n/2) = (n/2)(log n − 1).

The dominating term is (n/2) log n, and the −n/2 correction is negligible asymptotically. So log(n!) = Ω(n log n).
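The inequality can be sanity-checked numerically. A small sketch (illustrative values, even n only, not part of the proof):

```python
import math

# Check log2(n!) >= (n/2) * log2(n/2) at a few even values of n.
for n in (8, 64, 1024):
    log_fact = sum(math.log2(i) for i in range(1, n + 1))  # log2(n!)
    bound = (n / 2) * math.log2(n / 2)
    print(n, round(log_fact, 1), round(bound, 1))
```

The gap is visible at small n (the thrown-away half really does cost a constant factor) but both sides grow like n log n.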
Stirling approach. Stirling's formula says n! ≈ √(2πn) (n/e)^n. Taking log of both sides and using log(ab) = log a + log b:

    log(n!) ≈ n log n − n log e + (1/2) log(2πn).
The dominant term is n log n, with a linear correction of −n log e ≈ −1.44n. Asymptotically, log(n!) ∼ n log n. The leading constant is exactly 1, so this method even gives a tighter result than the summation argument (which had a constant of 1/2). Either way, the conclusion is the same: log(n!) = Θ(n log n), hence sorting requires Ω(n log n) comparisons.
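How tight is Stirling? A quick numerical sketch comparing the exact value of log2(n!) (via `math.lgamma`) against the three terms above, at an arbitrary n = 1000:

```python
import math

n = 1000
exact = math.lgamma(n + 1) / math.log(2)   # log2(n!) without overflow
approx = (n * math.log2(n)
          - n * math.log2(math.e)
          + 0.5 * math.log2(2 * math.pi * n))
print(round(exact, 3), round(approx, 3))   # agree to ~4 decimal places
```

The remaining error of Stirling's approximation is O(1/n) in the log, which is why the two printed numbers essentially coincide.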
So merge sort, heap sort, and any other O(n log n) comparison sort are optimal in the comparison model. The framework is now closed: in this model, search is Θ(log n), sort is Θ(n log n), and we cannot do better.
The remaining question is: what if we allow ourselves more than just comparisons? In the standard RAM (Random Access Machine) model, integers fit in machine words and we can do arithmetic — addition, subtraction, multiplication, division, modulus, indexing into arrays — all in O(1) time per operation. Comparisons are still allowed too, of course, but they are no longer the only thing.
The next two algorithms exploit this extra power. Both are integer sorting algorithms — they assume the keys we are sorting are integers, not arbitrary black-box comparables. This is a real assumption, but a practically benign one: most things you sort on a computer are already represented as integers (or can be mapped to integers cheaply).
Throughout the rest of the lecture, the assumptions are: there are n input items; each item's key is an integer in the range {0, 1, …, k−1}; and each key fits in one machine word.
Both n and k are now parameters. The interesting question is how the running time depends on each.
A side note: integer sorting is still an active area of research. The current best general result (when nothing is assumed about k beyond "fits in a word") is roughly O(n √(log log n)) in expectation — almost linear, not quite. Whether you can sort word-sized integers in deterministic O(n) time remains open. We're not going to touch that here. We will instead focus on the regime where k is "not too large", and show two algorithms that achieve linear time in that regime.
The intuition is exactly what the name suggests. If I hand you the multiset {5, 3, 7, 5, 3, 6, 5}, you can sort it by first counting: there are two 3s, three 5s, one 6, one 7. Then you reconstruct the sorted output: write two 3s, then three 5s, then a 6, then a 7. Done.
To turn this into an algorithm, allocate an array L of length k — one slot per possible key value. There's a tempting first version where each slot is just an integer counter, but that version can only sort the keys themselves, not items with keys plus extra payload. We want to handle the more general case where each item has a key but also drags along other data — like a spreadsheet row where we sort by one column but want all the other columns to come along for the ride. So instead, each slot L[i] holds a list of items.
Allocate L[0..k-1], where each L[i] is an empty list.
For j in range(n):
    append A[j] to L[key(A[j])].
Output = empty list.
For i in range(k):
    extend Output with L[i].
Return Output.
Walking through it: the first loop scans the input array once, and for each item, appends it to the list indexed by its key. After this loop, all items with key 0 live in L[0], all items with key 1 live in L[1], and so on — and within each list, items appear in the order they appeared in the input. The second loop concatenates these lists in order L[0], L[1], …, L[k−1], producing a sorted output.
Note the use of key(item) rather than treating the item itself as the integer. This is the same convention as Python's sort(key=...): the items can be anything, as long as we have a function that extracts an integer key from each one. The key extraction is assumed to be O(1), which is reasonable since keys fit in a word.
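The pseudocode above translates almost line-for-line into Python. A sketch, with the key function passed the same way as Python's sort(key=...), run on hypothetical (key, payload) rows to show the payload coming along for the ride:

```python
def counting_sort(A, key, k):
    """Stable counting sort: items with equal keys keep their input order.
    Assumes key(x) is an integer in range(k) for every item x in A."""
    L = [[] for _ in range(k)]      # one bucket per possible key value
    for x in A:                     # append in input order (this is stability)
        L[key(x)].append(x)
    output = []
    for i in range(k):              # concatenate buckets in key order
        output.extend(L[i])
    return output

# Rows sorted by their first column; the payload strings tag original order.
rows = [(3, "a"), (5, "b"), (3, "c"), (0, "d")]
print(counting_sort(rows, key=lambda r: r[0], k=6))
# → [(0, 'd'), (3, 'a'), (3, 'c'), (5, 'b')]
```

Note that the two key-3 rows come out in their original order "a" then "c" — the stability property discussed below falls out for free.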
Adding it up: O(k) to allocate the k empty lists, O(n) for the first loop, and O(n + k) for the concatenation — O(n + k) in total.
So counting sort runs in O(n + k). If k = O(n), this is linear. As soon as k grows much larger than n — say, k = n² — the running time degrades and counting sort becomes worse than merge sort. So counting sort is good for small key ranges, useless for large ones. We need something better.
Notice that within each list L[i], items appear in the same order they appeared in the input array. This means counting sort is stable: items with equal keys preserve their original relative order in the output. We didn't have to do anything special to achieve this — it falls out of the fact that we append items in input order. Stability looks like a minor accounting detail, but it is exactly what makes the next algorithm work.
Radix sort is the cooler algorithm. It uses counting sort as a subroutine — that's why we spent so much time on counting sort even though it isn't satisfying on its own — and dramatically extends the regime where linear-time integer sorting works. With counting sort alone, k has to be O(n). With radix sort, k can be polynomial in n — for example, all keys can be integers up to n^100, and radix sort still runs in linear time. That's a huge generalization.
Pick a base b (which we'll choose later) and imagine writing every key in base b. If the maximum key value is k − 1, then each key has at most

    ⌈log_b k⌉

base-b digits. The digits of a number x in base b are extracted by x mod b (least significant digit), ⌊x/b⌋ mod b (next digit), ⌊x/b²⌋ mod b, and so on — each extraction is a constant number of arithmetic operations on word-sized integers, so each digit can be computed in O(1).
We never actually rewrite the numbers in base b; we just compute digits on demand when we need them.
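The on-demand extraction is one line of arithmetic. A sketch with a hypothetical helper `digit` (base 10 in the example, purely for readability):

```python
def digit(x, i, b):
    # The i-th base-b digit of x, counting from the least significant (i = 0).
    # A constant number of word operations: one divide, one mod.
    return (x // b**i) % b

print([digit(745, i, 10) for i in range(3)])   # → [5, 4, 7]
```

Reading the output right-to-left recovers 745, confirming that digit 0 is the least significant.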
The algorithm itself is then: sort the entire array by the least significant digit (using counting sort, with that digit as the key), then re-sort by the next digit up, and so on, finishing with the most significant digit.
That is, ⌈log_b k⌉ passes total, each one a complete sort of the entire array of n items, but using only one digit as the key. The output of the last pass is the final sorted array.
This is the same trick spreadsheets use: if you want to sort by several columns, you can click the least important column first, then the next-most-important, then the most important — and at the end you get a multi-column sort. It is genuinely surprising that this works, and the reason it works is stability.
Suppose you've already sorted the items by the lower-order digits, and now you sort them by the next digit up. Two items that differ in the new digit get separated correctly by this new sort — that's obvious. Two items that agree in the new digit need to remain in their previous relative order, because their previous relative order was determined by the lower-order digits, which is correct. The stable sort guarantees exactly this: equal keys (here, equal values in the current digit) preserve their relative order. So each pass extends the sorted-prefix-of-digits one position higher without disturbing the work done by earlier passes. After passes, all digits have been incorporated and the array is fully sorted. (A clean proof is by induction on the number of passes completed.)
This is also why we need counting sort specifically as the inner sort — counting sort is stable, and it's fast for small key ranges, which is exactly what we have when we're sorting by a single base-b digit (range {0, …, b−1}).
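Putting the pieces together — a self-contained sketch that sorts bare non-negative integers (keys only, no payload), re-running the stable bucket-append pass once per base-b digit:

```python
def radix_sort(A, b):
    """Sort non-negative integers by repeated stable passes over base-b digits."""
    if not A:
        return A
    k = max(A) + 1                       # keys lie in range(k)
    place = 1                            # b**i for the current digit position
    while place < k:                     # one pass per base-b digit
        buckets = [[] for _ in range(b)]
        for v in A:                      # stable: append in current pass order
            buckets[(v // place) % b].append(v)
        A = [v for bucket in buckets for v in bucket]
        place *= b
    return A

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66], b=10))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

Deleting the stability (say, by prepending instead of appending inside a bucket) breaks the induction from the previous paragraph, which is a quick way to convince yourself the stability argument is doing real work.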
Each pass is a counting sort over a key range of size b. By the counting sort analysis, that costs O(n + b). There are ⌈log_b k⌉ passes, so the total is:

    O((n + b) log_b k).
Now we get to choose b. We want this to be as small as possible. We have a sum of two terms (n and b) multiplied by a log factor, and the standard trick when minimizing such a thing is to balance the two parts of the sum — typically the optimum is reached when they are equal, or close to it.
If we set b = n, then n + b = 2n = O(n), and log_b k = log_n k, so the total becomes:

    O(n log_n k).
Why is this the right choice? If b is much smaller than n, then log_b k is unnecessarily large (more passes than needed). If b is much larger than n, then the b term in n + b dominates and we're paying for the size of the key range without using it. Setting b = n keeps n + b proportional to n while making each digit as wide as possible.
The key payoff: suppose k ≤ n^c for some constant c. Then:

    log_n k ≤ log_n n^c = c,

so:

    O(n log_n k) = O(cn) = O(n).
The constant c disappears into the asymptotic notation. So if your integer keys are anywhere in the range {0, …, n^c − 1} for any constant c — keys up to n², n³, n^100, anything polynomial in n — radix sort sorts them in linear time.
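The pass count can be checked with integer arithmetic (avoiding floating-point log, which can round the wrong way at exact powers). A sketch with a hypothetical helper `num_digits`: with base b = n, the largest key n^c − 1 has exactly c base-n digits, so radix sort makes exactly c passes:

```python
def num_digits(x, b):
    # Number of base-b digits of x (at least 1) — i.e. the number of passes.
    d = 1
    while x >= b:
        x //= b
        d += 1
    return d

n = 1000
for c in (1, 2, 3):
    print(c, num_digits(n**c - 1, n))   # → 1 1, 2 2, 3 3
```

Constant passes over n items is where the O(cn) = O(n) bound comes from.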
This is the punchline of the lecture. In the comparison model, sorting needs Ω(n log n). By stepping out of that model and exploiting integer arithmetic, we sort polynomially-bounded integers in O(n). The n log n wall isn't a property of sorting — it's a property of sorting via comparisons only.
The lecture's arc: a model is a contract about what your algorithm is allowed to do, and lower bounds are statements about that contract. The comparison model captures everything we'd previously called sorting and searching, and inside it the Θ(log n) and Θ(n log n) bounds are tight. The proof is short and entirely structural: count the answers an algorithm must distinguish, note that a binary decision tree distinguishing that many things needs that many leaves, and recall that a binary tree with N leaves has height at least log N.
Once we relax the model and let the algorithm do arithmetic on integer keys, the picture changes. Counting sort gives O(n + k) — linear when k = O(n), useless when k is much larger. Radix sort uses counting sort as a stable inner subroutine and digit-decomposes the keys, achieving O(n log_n k), which is linear whenever k is polynomial in n.
The pattern worth taking away from all of this: an asymptotic bound is never just a property of the problem. It is always a property of the (problem, model) pair. Change the model and the bound can change dramatically. The job of an algorithm designer is partly to know which model your real-world situation actually inhabits, so that you reach for the right tool.