Lecture · L05

Lower Bounds for Sorting


Every sorting algorithm we have seen so far — merge sort, heap sort, the sort hidden inside BST construction — takes $\Theta(n \log n)$ comparisons in the worst case, and insertion sort does even worse. The natural question: is $n \log n$ a fundamental wall, or just a failure of imagination so far? This lecture answers both halves of that question. The structure is theorem → proof → counterexample, where the "counterexample" is not a bug in the theorem but a demonstration that its hypotheses were doing real work. Change the hypotheses and the answer changes.

Concretely, the lecture does three things:

  1. Defines a restricted model of computation — the comparison model — that captures every sorting algorithm we've studied so far.
  2. Proves that inside that model, sorting requires $\Omega(n \log n)$ comparisons and searching requires $\Omega(\log n)$ comparisons. So merge sort and binary search are not merely "best known" — they are provably optimal.
  3. Steps outside the comparison model into the standard RAM model, where keys are integers we are allowed to do arithmetic on, and shows two algorithms that sort in linear time when the keys are not too large: counting sort and radix sort.

The Comparison Model

The setup is deliberately restrictive. Input items are treated as black boxes — abstract data types whose internal structure is hidden. The only operation the algorithm is permitted to perform on items is comparison: given two items $x$ and $y$, ask one of $x < y$, $x \leq y$, $x > y$, $x \geq y$, or $x = y$. Each such comparison returns one bit (yes or no).

The cost of an algorithm in this model is defined as the number of comparisons it performs. Everything else is free: pointer manipulation, swaps, allocating arrays, integer arithmetic on indices, copying data — none of it counts. This is the strange and important part of the model. We are being deliberately generous about what is free, because if we can prove a lower bound under this generosity, the bound is unconditional: it holds for any algorithm restricted to inspecting items only via comparisons, no matter how clever the bookkeeping.

Why this captures everything we've seen

Every sorting and searching algorithm from earlier lectures fits inside this model:

  • Merge sort moves items between arrays, but only inspects them via $<$ during the merge step.
  • Heap sort sifts items up and down, but the only thing it ever asks about an item is how it compares to its parent or child.
  • BST operations walk the tree by comparing the query key to each node's key.
  • Binary search compares the target to the middle element of a subarray.

So a lower bound proved in the comparison model applies to all of these simultaneously.

A subtlety about cost

When we previously said "binary search runs in $O(\log n)$", the $\log n$ was simultaneously the number of comparisons it performs and the number of real-time steps it takes on a RAM machine. In the comparison model we are only counting the first. The lower bound we are about to prove says: even if everything except comparisons is free, you still cannot get below $\log n$ for searching, or below $n \log n$ for sorting.


Decision Trees

The trick that makes lower bounds in this model tractable is to draw out all possible executions of an algorithm at once, as a tree.

Fix an input size $n$. For any deterministic algorithm in the comparison model, the very first comparison it makes is determined by the algorithm itself — it doesn't yet depend on anything about the input. After that comparison returns yes or no, the algorithm branches into two possible second comparisons (one for each answer), and so on. This branching structure is a binary tree, called the decision tree of the algorithm at input size $n$.

  • Each internal node is a comparison the algorithm performs at that point in some execution. It has two children, one for each possible binary outcome.
  • Each leaf represents the algorithm having terminated and produced an answer. The leaf is labeled with that answer.

Concrete example: binary search at $n = 3$

Take an array $A[0..2]$ and a query value $x$. Binary search at $n = 3$ first compares $x$ to the middle element $A[1]$:

  • Is $A[1] < x$?
    • No (so $x \leq A[1]$): now compare $x$ to $A[0]$.
      • Is $A[0] < x$?
        • No: $x \leq A[0]$. Output: $x$ falls at or before position 0.
        • Yes: $A[0] < x \leq A[1]$. Output: $x$ falls strictly between positions 0 and 1.
    • Yes (so $x > A[1]$): now compare $x$ to $A[2]$.
      • Is $A[2] < x$?
        • No: $A[1] < x \leq A[2]$. Output: $x$ falls between positions 1 and 2.
        • Yes: $x > A[2]$. Output: $x$ falls past position 2.

Two comparisons in every execution, four possible answers — four leaves, height 2. This is what the decision tree of binary search looks like at $n = 3$. For larger $n$, the same shape extends: a balanced binary tree of height $\lceil \log_2 n \rceil$.
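The path-length correspondence is easy to watch in code. Here is a small Python sketch (my own illustration, not lecture code; `search_position` is a made-up name) of a binary search that counts its own comparisons — each count corresponds to one internal node visited on the root-to-leaf path:

```python
def search_position(A, x):
    """Return (insertion point of x in sorted A, number of comparisons made)."""
    lo, hi = 0, len(A)            # invariant: the answer lies in [lo, hi]
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1          # one comparison per internal node visited
        if A[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo, comparisons

A = [10, 20, 30]
for x in (5, 15, 25, 35):         # one query per leaf of the n = 3 tree
    pos, c = search_position(A, x)
    assert c == 2                 # every root-to-leaf path has length 2
```

For $n = 3$ this traces exactly the tree drawn above: first a comparison against $A[1]$, then one against $A[0]$ or $A[2]$, reaching one of four leaves.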

For sorting, the trees become enormous — exponentially many nodes — so we never actually draw them. But the conceptual picture is the same: each internal node asks "$A_i < A_j$?" for some $i, j$, and each leaf is labeled with a permutation of the input that the algorithm has determined to be the sorted order.

Reading running time off the tree

This is where decision trees become useful, because they translate questions about algorithms into questions about trees, and we know a lot about trees.

  • A single execution of the algorithm corresponds to a root-to-leaf path in the decision tree. At each internal node along the path, the comparison happens; the path turns left or right depending on the answer; eventually it reaches a leaf and outputs the label.
  • The cost of that single execution is the length of that path (number of comparisons performed). One comparison per internal node visited.
  • The worst-case running time of the algorithm is therefore the height of the tree — the length of the longest root-to-leaf path.

So the question "how few comparisons does any algorithm need in the worst case to solve problem $P$?" becomes "what is the minimum possible height of any decision tree that solves $P$?" And that is a question about binary trees, which is much easier to reason about.


Lower Bound for Searching

Claim. Any comparison-based algorithm that searches among $n$ preprocessed items requires $\Omega(\log n)$ comparisons in the worst case.

The phrase "preprocessed" is a strong concession. It means we are allowed to do arbitrarily much work on the $n$ items ahead of time — sort them, build them into an AVL tree, build any data structure we like — and none of that work counts. The clock starts only when a query item $x$ arrives, and we count only the comparisons between $x$ and the stored items during that query.

Proof. Take any comparison-based search algorithm and look at its decision tree.

  • The decision tree is binary, because each comparison returns one bit.
  • The decision tree must contain at least $n$ leaves. Why? Because the algorithm has to be able to output any of at least $n$ distinct answers — for instance, "$x$ matches item 1", "$x$ matches item 2", …, "$x$ matches item $n$" — and each distinct answer needs at least one leaf labeled with it. There may be more leaves than this (the same answer can appear in multiple leaves, depending on the algorithm), but there cannot be fewer.

A binary tree with at least $n$ leaves has height at least $\log_2 n$. (This is the basic fact that a binary tree of height $h$ has at most $2^h$ leaves, so $h \geq \log_2(\text{leaves})$.) The height is the worst-case running time. So any comparison-based search needs at least $\log_2 n$ comparisons in the worst case. $\blacksquare$

That justifies retroactively why binary search trees were worth caring about: in the comparison model, $\Theta(\log n)$ for search, predecessor, and successor is the best anyone can ever do. AVL trees hit that bound, so they're optimal.

A subtle point worth pausing on: this lower bound holds even with preprocessing allowed. If we drop the preprocessing assumption, the truth is actually stronger — you need $\Omega(n)$ time, because you might have to look at every item just to know what's there. But this proof technique doesn't capture that stronger bound; the decision-tree argument only ever gives $\Omega(\log n)$. So the technique is useful but not always tight.


Lower Bound for Sorting

Claim. Any comparison-based sorting algorithm requires $\Omega(n \log n)$ comparisons in the worst case.

The strategy is identical: bound the number of leaves the decision tree must have, then take the logarithm.

A leaf in a sorting algorithm's decision tree is labeled with a permutation of the input — saying something like "the smallest element was originally $A_5$, the next was $A_7$, then $A_1$, then $A_0$, …". The algorithm has done enough comparisons by the time it reaches that leaf to know what the sorted order is, and it just writes it down (writing is free).

How many distinct outputs does a sorting algorithm need to be able to produce? All possible permutations of the input, since for any permutation there exists some input that requires that exact output. There are $n!$ permutations of $n$ distinct items. So:

  • The decision tree is binary.
  • It must have at least $n!$ leaves.
  • Therefore its height is at least $\log_2(n!)$.

The remaining task is purely algebraic: show that $\log_2(n!) = \Omega(n \log n)$.

Bounding $\log_2(n!)$ from below

Two ways. The clean one uses Stirling's approximation; the more elementary one is a direct summation argument.

Summation approach. Use the identity $\log(ab) = \log a + \log b$ to expand:

$$\log_2(n!) = \log_2 n + \log_2(n-1) + \cdots + \log_2 2 + \log_2 1 = \sum_{i=1}^{n} \log_2 i$$

We want to lower-bound this sum. The trick: throw away the smaller half of the terms, then bound each remaining term from below by the smallest one in that half.

$$\sum_{i=1}^{n} \log_2 i \;\geq\; \sum_{i=n/2}^{n} \log_2 i \;\geq\; \sum_{i=n/2}^{n} \log_2(n/2)$$

The first inequality drops the first half of the terms (they're all nonnegative, so the sum only gets smaller). The second replaces every remaining term $\log_2 i$ with the smallest such term, $\log_2(n/2)$ — since $i \geq n/2$ throughout the sum, every term is at least $\log_2(n/2)$.

Now the sum is trivial: there are at least $n/2$ identical terms, each equal to $\log_2(n/2) = \log_2 n - 1$. So:

$$\sum_{i=n/2}^{n} \log_2(n/2) \;\geq\; \frac{n}{2}(\log_2 n - 1) = \frac{n \log_2 n}{2} - \frac{n}{2}$$

The dominating term is $\frac{n \log_2 n}{2}$, and the $-n/2$ is negligible asymptotically. So $\log_2(n!) \geq \frac{n \log_2 n}{2} - \frac{n}{2} = \Omega(n \log n)$. $\blacksquare$
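The bound is easy to sanity-check numerically. This small Python sketch (my own illustration, not part of the lecture) compares the exact value of $\log_2(n!)$ against the $\frac{n}{2}(\log_2 n - 1)$ lower bound for a few sizes:

```python
import math

for n in (4, 16, 256, 4096):
    log2_fact = sum(math.log2(i) for i in range(1, n + 1))   # exact log2(n!)
    lower = (n / 2) * (math.log2(n) - 1)                     # the summation bound
    assert log2_fact >= lower
```

The gap between the two grows, which is consistent with the summation argument losing a factor of 2 in the leading constant relative to Stirling.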

Stirling approach. Stirling's formula says $n! \approx \sqrt{2\pi n}\,(n/e)^n$. Taking $\log_2$ of both sides and using $\log(ab) = \log a + \log b$:

$$\log_2(n!) \approx \tfrac{1}{2}\log_2(2\pi n) + n \log_2 n - n \log_2 e$$

The dominant term is $n \log_2 n$, with a linear correction of $-n \log_2 e + \tfrac{1}{2}\log_2(2\pi n)$. Asymptotically, $\log_2(n!) = n \log_2 n - O(n)$. The leading constant is exactly 1, so this method even gives a tighter result than the summation argument (which had a constant of $1/2$). Either way, the conclusion is the same: $\log_2(n!) = \Omega(n \log n)$, hence sorting requires $\Omega(n \log n)$ comparisons. $\blacksquare$

So merge sort, heap sort, and any other $O(n \log n)$ comparison sort are optimal in the comparison model. The framework is now closed: in this model, search is $\Theta(\log n)$, sort is $\Theta(n \log n)$, and we cannot do better.


Leaving the Comparison Model

The remaining question is: what if we allow ourselves more than just comparisons? In the standard RAM (Random Access Machine) model, integers fit in machine words and we can do arithmetic — addition, subtraction, multiplication, division, modulus, indexing into arrays — all in $O(1)$ time per operation. Comparisons are still allowed too, of course, but they are no longer the only thing.

The next two algorithms exploit this extra power. Both are integer sorting algorithms — they assume the keys we are sorting are integers, not arbitrary black-box comparables. This is a real assumption, but a practically benign one: most things you sort on a computer are already represented as integers (or can be mapped to integers cheaply).

Setup for integer sorting

Throughout the rest of the lecture, the assumptions are:

  • We are sorting $n$ items whose keys are integers in the range $\{0, 1, \ldots, k-1\}$.
  • Each key fits in a single machine word, so all the standard arithmetic operations on a key cost $O(1)$.

Both $n$ and $k$ are now parameters. The interesting question is how the running time depends on each.

A side note: integer sorting is still an active area of research. The current best general result (when nothing is assumed about $k$ beyond "fits in a word") is roughly $O(n \sqrt{\log \log n})$ in expectation — almost linear, not quite. Whether you can sort $n$ word-sized integers in deterministic $O(n)$ time remains open. We're not going to touch that here. We will instead focus on the regime where $k$ is "not too large", and show two algorithms that achieve linear time in that regime.


Counting Sort

The intuition is exactly what the name suggests. If I hand you the multiset $\{3, 5, 7, 5, 5, 3, 6\}$, you can sort it by first counting: there are two 3s, three 5s, one 6, one 7. Then you reconstruct the sorted output: write two 3s, then three 5s, then a 6, then a 7. Done.

To turn this into an algorithm, allocate an array $L$ of length $k$ — one slot per possible key value. There's a tempting first version where each slot is just an integer counter, but that version can only sort the keys themselves, not items with keys plus extra payload. We want to handle the more general case where each item has a key but also drags along other data — like a spreadsheet row where we sort by one column but want all the other columns to come along for the ride. So instead, each slot of $L$ holds a list of items.

The algorithm

def counting_sort(A, key, k):
    L = [[] for _ in range(k)]          # allocate k empty lists
    for j in range(len(A)):
        L[key(A[j])].append(A[j])       # drop each item into its key's list
    output = []
    for i in range(k):
        output.extend(L[i])             # concatenate the lists in key order
    return output

Walking through it: the first loop scans the input array once, and for each item, appends it to the list indexed by its key. After this loop, all items with key $0$ live in $L[0]$, all items with key $1$ live in $L[1]$, and so on — and within each list, items appear in the order they appeared in the input. The second loop concatenates these lists in order $L[0], L[1], \ldots, L[k-1]$, producing a sorted output.

Note the use of key(item) rather than treating the item itself as the integer. This is the same convention as Python's sort(key=...): the items can be anything, as long as we have a function that extracts an integer key from each one. The key extraction is assumed to be $O(1)$, which is reasonable since keys fit in a word.

Running time

  • Allocating $L$: creating $k$ empty lists is $O(k)$.
  • First loop: $n$ iterations, each doing one append (which is $O(1)$ in Python and most reasonable list implementations). Total: $O(n)$.
  • Second loop: visits each of the $k$ slots, and for each slot $L[i]$, extends the output by the contents of $L[i]$. Visiting an empty slot is $O(1)$; copying $|L[i]|$ items is $O(|L[i]|)$. Summed over all $i$, the visit cost is $O(k)$ and the copy cost is $O(n)$ (because the total number of items across all lists is exactly $n$).

Adding it up: $O(n + k)$.

So counting sort runs in $\Theta(n + k)$. If $k = O(n)$, this is linear. As soon as $k$ grows much larger than $n$ — say, $k = n^2$ — the running time degrades and counting sort becomes worse than merge sort. So counting sort is good for small key ranges, useless for large ones. We need something better.

Stability — a property worth naming

Notice that within each list $L[i]$, items appear in the same order they appeared in the input array. This means counting sort is stable: items with equal keys preserve their original relative order in the output. We didn't have to do anything special to achieve this — it falls out of the fact that we append items in input order. Stability looks like a minor accounting detail, but it is exactly what makes the next algorithm work.
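To see stability concretely, here is a minimal Python sketch (my own example; it re-implements the bucket-list counting sort from above so it stands alone) sorting (key, payload) pairs:

```python
def counting_sort(A, key, k):
    L = [[] for _ in range(k)]
    for x in A:
        L[key(x)].append(x)       # equal keys enter their bucket in input order
    out = []
    for bucket in L:
        out.extend(bucket)
    return out

# Rows sorted by the first field; ties keep their original order.
rows = [(1, "b"), (0, "x"), (1, "a"), (0, "y")]
assert counting_sort(rows, key=lambda r: r[0], k=2) == [
    (0, "x"), (0, "y"), (1, "b"), (1, "a")
]
```

The two key-1 rows come out as "b" before "a" — not alphabetical, but exactly their input order, which is what stability promises.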


Radix Sort

Radix sort is the cooler algorithm. It uses counting sort as a subroutine — that's why we spent so much time on counting sort even though it isn't satisfying on its own — and dramatically extends the regime where linear-time integer sorting works. With counting sort alone, $k$ has to be $O(n)$. With radix sort, $k$ can be polynomial in $n$ — for example, all keys can be integers up to $n^{100}$, and radix sort still runs in linear time. That's a huge generalization.

The idea

Pick a base $b$ (which we'll choose later) and imagine writing every key in base $b$. If the maximum key value is $k$, then each key has at most

$$d = \lceil \log_b k \rceil + 1$$

base-$b$ digits. The digits of a number $x$ in base $b$ are extracted by $x \bmod b$ (least significant digit), $\lfloor x/b \rfloor \bmod b$ (next digit), and so on — each extraction is a constant number of arithmetic operations on word-sized integers, so each digit can be computed in $O(1)$.

We never actually rewrite the numbers in base $b$; we just compute digits on demand when we need them.
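On-demand digit extraction is a one-liner; a quick Python sketch (illustrative only; `digit` is a made-up helper name):

```python
def digit(x, i, b):
    """The i-th least significant base-b digit of x."""
    return (x // b**i) % b

# 13 is 1101 in base 2: its digits, least significant first, are 1, 0, 1, 1
assert [digit(13, i, 2) for i in range(4)] == [1, 0, 1, 1]
```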

The algorithm itself is then:

  1. Sort all the integers by their least significant digit.
  2. Sort by the next least significant digit.
  3. Continue up through the digits, ending with a sort by the most significant digit.

That is, $d$ passes total, each one a complete sort of the entire array of $n$ items, but using only one digit as the key. The output of the final pass is the fully sorted array.

This is the same trick spreadsheets use: if you want to sort by several columns, you can click the least important column first, then the next-most-important, then the most important — and at the end you get a multi-column sort. It is genuinely surprising that this works, and the reason it works is stability.

Why it works (informal)

Suppose you've already sorted the items by the lower-order digits, and now you sort them by the next digit up. Two items that differ in the new digit get separated correctly by this new sort — that's obvious. Two items that agree in the new digit need to remain in their previous relative order, because their previous relative order was determined by the lower-order digits, which is correct. The stable sort guarantees exactly this: equal keys (here, equal values in the current digit) preserve their relative order. So each pass extends the sorted-prefix-of-digits one position higher without disturbing the work done by earlier passes. After $d$ passes, all $d$ digits have been incorporated and the array is fully sorted. (A clean proof is by induction on the number of passes completed.)

This is also why we need counting sort specifically as the inner sort — counting sort is stable, and it's fast for small key ranges, which is exactly what we have when we're sorting by a single base-$b$ digit (range $\{0, \ldots, b-1\}$).
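Putting the pieces together, here is a compact Python sketch (my own illustration, not the lecture's code; `radix_sort` is a made-up name) that runs one stable bucket pass per base-$b$ digit, with $b$ set to $n$ as discussed below:

```python
def radix_sort(A):
    """Sort nonnegative integers with LSD radix sort, base b = n,
    using a stable counting-sort pass for each digit."""
    if not A:
        return []
    b = max(len(A), 2)                 # base b = n (clamped to at least 2)
    k = max(A)                         # largest key determines the digit count
    out = list(A)
    exp = 1                            # current digit weight: b ** (pass number)
    while exp <= k:                    # one stable pass per base-b digit
        buckets = [[] for _ in range(b)]
        for x in out:
            buckets[(x // exp) % b].append(x)   # extract the digit on demand
        out = [x for bucket in buckets for x in bucket]
        exp *= b
    return out

data = [170, 45, 75, 90, 802, 24, 2, 66]
assert radix_sort(data) == sorted(data)
```

For simplicity this version sorts bare integers; handling items with payloads works the same way, bucketing by a digit of `key(x)` instead of `x`.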

Running time

Each pass is a counting sort over a key range of size $b$. By the counting sort analysis, that costs $O(n + b)$. There are $d$ passes, so the total is:

$$T(n) = d \cdot O(n + b) = O\big((n + b) \cdot d\big) = O\big((n + b) \log_b k\big)$$

Now we get to choose $b$. We want this to be as small as possible. We have a sum of two terms multiplied by a $\log_b k$, and the standard trick when minimizing such a thing is to balance the two parts of the sum — typically the optimum is reached when they are equal, or close to it.

If we set $b = n$, then $n + b = 2n = O(n)$, and $\log_b k = \log_n k$, so:

$$T(n) = O(n \log_n k)$$

Why is this the right choice? If $b$ is much smaller than $n$, then $\log_b k$ is unnecessarily large. If $b$ is much larger than $n$, then the $b$ term in $(n + b)$ dominates and we're paying for the size of the key range without using it. Setting $b = \Theta(n)$ keeps $(n + b)$ proportional to $n$ while making each digit as wide as possible.

When this is linear

The key payoff: suppose $k \leq n^c$ for some constant $c$. Then:

$$\log_n k \leq \log_n(n^c) = c$$

so:

$$T(n) = O(n \cdot c) = O(n)$$

The constant $c$ disappears into the asymptotic notation. So if your integer keys are anywhere in the range $\{0, 1, \ldots, n^c\}$ for any constant $c$ — keys up to $n$, $n^2$, $n^{100}$, anything polynomial in $n$ — radix sort sorts them in linear time.

This is the punchline of the lecture. In the comparison model, sorting needs $\Theta(n \log n)$. By stepping out of that model and exploiting integer arithmetic, we sort polynomially-bounded integers in $\Theta(n)$. The $n \log n$ wall isn't a property of sorting — it's a property of sorting via comparisons only.


Summary

The lecture's arc: a model is a contract about what your algorithm is allowed to do, and lower bounds are statements about that contract. The comparison model captures everything we'd previously called sorting and searching, and inside it the $n \log n$ and $\log n$ bounds are tight. The proof is short and entirely structural: count the answers an algorithm must distinguish; a binary decision tree distinguishing that many things needs that many leaves, and a binary tree with $L$ leaves has height at least $\log_2 L$.

Once we relax the model and let the algorithm do arithmetic on integer keys, the picture changes. Counting sort gives $\Theta(n + k)$ — linear when $k = O(n)$, useless when $k$ is much larger. Radix sort uses counting sort as a stable inner subroutine and digit-decomposes the keys, achieving $O(n \log_n k)$, which is linear whenever $k$ is polynomial in $n$.

The pattern worth taking away from all of this: an asymptotic bound is never just a property of the problem. It is always a property of the (problem, model) pair. Change the model and the bound can change dramatically. The job of an algorithm designer is partly to know which model your real-world situation actually inhabits, so that you reach for the right tool.