The most fundamental cache-oblivious primitive is scanning: scanning an array with N elements incurs Θ(N/B) memory transfers for any value of B. Thus algorithms such as median finding and data structures such as stacks and queues that rely only on scanning are automatically cache-oblivious; in fact, these examples are optimal in the cache-oblivious model. Other examples of algorithms that rely only on scanning include Quicksort and Mergesort. However, they are not asymptotically optimal in the cache-oblivious model, since they use O((N/B) log₂(N/M)) memory transfers rather than Θ(Sort_{M,B}(N)).

Apart from algorithms and data structures that only utilize scanning, most cache-oblivious results use recursion to obtain efficiency; in almost all cases, the sizes of the recursive problems decrease double-exponentially. In this section we describe two of the most fundamental such recursive schemes, namely the *van Emde Boas layout* and the *k-merger*.

One of the most fundamental data structures in the I/O-model is the B-tree [7]. A B-tree is basically a fanout-Θ(B) tree with all leaves on the same level. Since it has height O(log_B N) and each node can be accessed in O(1) memory transfers, it supports searches in O(log_B N) memory transfers. It also supports range queries, that is, the reporting of all K elements in a given query range, in O(log_B N + K/B) memory transfers. Since B is an integral part of the definition of the structure, it seems challenging to develop a cache-oblivious B-tree structure. However, Prokop [24] showed how a binary tree can be laid out in memory in order to obtain a (static) cache-oblivious version of a B-tree. The main idea is to use a recursively defined layout called the *van Emde Boas layout*, closely related to the definition of a van Emde Boas tree [26]. The layout has been used as a basic building block of most cache-oblivious search structures (e.g., in [1, 8, 10–12, 18, 25]).

*Layout*

For simplicity, we only consider complete binary trees. A binary tree is complete if it has N = 2^h − 1 nodes and height h for some integer h. The basic idea in the van Emde Boas layout of a complete binary tree T is to divide T at the middle level and lay out the pieces recursively (Figure 34.3). More precisely, if T has only one node, it is simply laid out as a single node in memory. Otherwise, we define the *top tree* T₀ to be the subtree consisting of the nodes in the topmost ⌊h/2⌋ levels of T, and the *bottom trees* T₁, …, T_k to be the Θ(√N) subtrees rooted in the nodes on level ⌊h/2⌋ of T; note that all the subtrees have size Θ(√N). The van Emde Boas layout of T consists of the van Emde Boas layout of T₀ followed by the van Emde Boas layouts of T₁, …, T_k.
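As a concrete illustration, the recursion can be sketched in a few lines of code. This sketch is not from the chapter; the BFS node numbering (node i has children 2i and 2i + 1) is an assumption made purely for the example.

```python
def veb_order(h):
    """Return the nodes of a complete binary tree of height h listed in
    van Emde Boas order. Nodes are identified by their BFS numbers
    1..2^h - 1, where node i has children 2i and 2i + 1 (a numbering
    chosen for this sketch, not part of the layout itself)."""
    def lay(root, height):
        if height == 1:
            return [root]                       # a single node
        top_h = height // 2                     # topmost floor(height/2) levels
        bot_h = height - top_h                  # height of each bottom tree
        order = lay(root, top_h)                # lay out the top tree first...
        for j in range(2 ** top_h):             # ...then each bottom tree,
            order += lay(root * 2 ** top_h + j, bot_h)  # left to right
        return order
    return lay(1, h)

# Height 3: top tree {1}, bottom trees {2,4,5} and {3,6,7}.
print(veb_order(3))  # [1, 2, 4, 5, 3, 6, 7]
```

Note how each recursive call touches only the topmost `height` levels below `root`, so the top tree is laid out by the same routine as the bottom trees.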

*Search*

To analyze the number of memory transfers needed to perform a search in T, that is, to traverse a root-leaf path, we consider the first recursive level of the van Emde Boas layout where the subtrees are smaller than B. At this level, T is divided into a set of *base trees* of size between Θ(√B) and Θ(B), that is, of height Ω(log B) (Figure 34.4). By the definition of the layout, each base tree is stored in O(B) contiguous memory locations and can thus be accessed in O(1) memory transfers. That the search is performed in O(log_B N) memory transfers then follows since the search path traverses O((log N)/log B) = O(log_B N) different base trees.
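The effect can be checked with a small, hypothetical experiment (the helper names and parameters below are ours, not the chapter's): lay a complete tree out in van Emde Boas order and in ordinary BFS order, then count how many size-B blocks a root-to-leaf path touches, assuming blocks begin at array offsets that are multiples of B.

```python
def veb_order(h, root=1):
    """Nodes of a complete binary tree of height h (BFS ids, node i has
    children 2i and 2i+1) in van Emde Boas order."""
    if h == 1:
        return [root]
    top_h = h // 2
    order = veb_order(top_h, root)              # top tree
    for j in range(2 ** top_h):                 # bottom trees, left to right
        order += veb_order(h - top_h, root * 2 ** top_h + j)
    return order

def path_blocks(pos, h, B):
    """Distinct size-B blocks touched by the leftmost root-to-leaf path,
    given a map from node id to array position; the leftmost node on
    depth d has BFS id 2^d."""
    return len({pos[2 ** d] // B for d in range(h)})

h, B = 16, 16
veb_pos = {node: i for i, node in enumerate(veb_order(h))}
bfs_pos = {node: node - 1 for node in range(1, 2 ** h)}   # BFS (level) layout
print(path_blocks(veb_pos, h, B), path_blocks(bfs_pos, h, B))  # 5 12
```

The van Emde Boas layout touches a block per base tree on the path, while the BFS layout touches a fresh block for almost every node below the first few levels.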

*Range query*

To analyze the number of memory transfers needed to answer a range query [x₁, x₂] on T using the standard recursive algorithm that traverses the relevant parts of T (starting at the root), we first note that the two paths to x₁ and x₂ are traversed in O(log_B N) memory transfers. Next we consider the traversed nodes v that are not on the two paths to x₁ and x₂. Since all elements in the subtree T_v rooted at such a node v are reported, and since a subtree of height log B stores Θ(B) elements, O(K/B) subtrees T_v of height log B are visited. This in turn means that the number of visited nodes above the last log B levels of T is also O(K/B); thus they can all be accessed in O(K/B) memory transfers. Now consider the smallest recursive level of the van Emde Boas layout that completely contains T_v. This level has size between Ω(B) and O(B²) (Figure 34.5(a)). On the next level of recursion, T_v is broken into a top part and O(√B) bottom parts of size between Ω(√B) and O(B) each (Figure 34.5(b)). The top part is contained in a recursive level of size O(B) and is thus stored within O(B) consecutive memory locations; therefore it can be accessed in O(1) memory transfers. Similarly, the O(B) nodes in the O(√B) bottom parts are stored consecutively in memory; therefore they can all be accessed in a total of O(1) memory transfers. Hence, the optimal paging strategy can ensure that any traversal of T_v is performed in O(1) memory transfers, simply by accessing the relevant O(1) blocks. Overall, a range query is thus performed in O(log_B N + K/B) memory transfers.

*THEOREM 34.1  Let T be a complete binary tree with N leaves laid out using the van Emde Boas layout. The number of memory transfers needed to perform a search (traverse a root-to-leaf path) and a range query in T is O(log_B N) and O(log_B N + K/B), respectively.*

Note that the navigation from node to node in the van Emde Boas layout is straightforward if the tree is implemented using pointers. However, navigation using arithmetic on array indexes is also possible [18]. This avoids the use of pointers and hence saves space.

The constant in the O(log_B N) bound for searching in Theorem 34.1 can be seen to be four. Further investigations of which constants are possible for cache-oblivious comparison-based searching appear in [9].

In the I/O-model, the two basic optimal sorting algorithms are multi-way versions of Mergesort and distribution sorting (Quicksort) [2]. Similarly, Frigo et al. [20] showed how both merge-based and distribution-based optimal cache-oblivious sorting algorithms can be developed. The merge-based algorithm, *Funnelsort*, is based on a so-called *k-merger*. This structure has been used as a basic building block in several cache-oblivious algorithms. Here we describe a simplified version of the k-merger due to Brodal and Fagerberg [15].

*Binary mergers and merge trees*

A *binary merger* merges two sorted input streams into a sorted output stream: in one merge step, an element is moved from the head of one of the input streams to the tail of the output stream; the heads of the input streams, as well as the tail of the output stream, reside in *buffers* of limited capacity.

Binary mergers can be combined to form *binary merge trees* by letting the output buffer of one merger be the input buffer of another. In other words, a binary merge tree is a binary tree with mergers at the nodes and buffers at the edges, and it is used to merge a set of sorted input streams (at the leaves) into one sorted output stream (at the root). Refer to Figure 34.6 for an example.

An *invocation* of a binary merger in a binary merge tree is a recursive procedure that performs merge steps until the output buffer is full (or both input streams are exhausted); if an input buffer becomes empty during the invocation (and the corresponding stream is not exhausted), the input buffer is recursively filled by an invocation of the merger having this buffer as its output buffer. If both input streams of a merger become exhausted, the corresponding output stream is marked as exhausted. A procedure Fill(v) performing an invocation of a binary merger v is shown in Figure 34.7 (ignoring exhaustion issues). A single invocation Fill(r) on the root r of a merge tree will merge the streams at the leaves of the tree.
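Figure 34.7 is not reproduced here, so the following is a reconstruction of Fill(v) from the description above, with exhaustion handled explicitly; the buffer representation (Python deques) and the `Merger` class are our own assumptions for the sketch.

```python
from collections import deque

class Merger:
    """A binary merger: two input buffers, one bounded output buffer.
    child[i] is the merger whose output buffer feeds input i, or None
    if input i is an input stream at a leaf of the merge tree."""
    def __init__(self, capacity, left=None, right=None):
        self.capacity = capacity
        self.out = deque()
        self.inp = [deque(), deque()]
        self.child = [left, right]
        self.exhausted = [False, False]

def fill(v):
    """Invocation of merger v: merge steps until the output buffer is
    full or both input streams are exhausted (cf. Figure 34.7)."""
    while len(v.out) < v.capacity:
        for i in (0, 1):
            if not v.inp[i] and not v.exhausted[i]:
                if v.child[i] is None:
                    v.exhausted[i] = True          # leaf stream used up
                else:
                    fill(v.child[i])               # recursively refill input i
                    v.inp[i], v.child[i].out = v.child[i].out, deque()
                    if not v.inp[i]:
                        v.exhausted[i] = True      # child's streams exhausted
        if v.inp[0] and v.inp[1]:                  # one merge step: move the
            src = min(v.inp, key=lambda b: b[0])   # smaller head to the output
        elif v.inp[0] or v.inp[1]:
            src = v.inp[0] or v.inp[1]
        else:
            return                                 # everything exhausted
        v.out.append(src.popleft())

# A merge tree of three binary mergers merging four sorted streams; the
# small child capacity forces recursive refilling during fill(root).
a, b = Merger(2), Merger(2)
a.inp = [deque([1, 5]), deque([3, 7])]
b.inp = [deque([2, 6]), deque([4, 8])]
root = Merger(8, a, b)
fill(root)
print(list(root.out))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In the text a child's output buffer and the parent's input buffer are the same buffer; the sketch models this by handing the filled deque over to the parent after each recursive invocation.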

*k-merger*

A *k*-merger is a binary merge tree with specific buffer sizes. For simplicity, we assume that k is a power of two, in which case a k-merger is a complete binary tree of k − 1 binary mergers. The output buffer at the root has size k³, and the sizes of the rest of the buffers are defined recursively in a manner resembling the definition of the van Emde Boas layout: Let i = log k be the height of the k-merger. We define the *top tree* to be the subtree consisting of all mergers of depth at most ⌈i/2⌉, and the *bottom trees* to be the subtrees rooted in the nodes at depth ⌈i/2⌉ + 1. We let the edges between the top and the bottom trees have buffers of size k^{3/2}, and define the sizes of the remaining buffers by recursion on the top and the bottom trees. The input buffers at the leaves hold the input streams and are not part of the k-merger definition. The space required by a k-merger, excluding the output buffer at the root, is given by S(k) = k^{1/2} · k^{3/2} + (k^{1/2} + 1) · S(k^{1/2}), which has the solution S(k) = Θ(k²).
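The recurrence can be checked numerically. The sketch below assumes k of the form 2^(2^j), so that √k stays a power of two at every level, and it charges a single unit for the state of one binary merger; both are normalizations of ours, not the chapter's.

```python
def merger_space(k):
    """Buffer space of a k-merger, excluding the root's output buffer,
    per the recurrence S(k) = sqrt(k)*k^{3/2} + (sqrt(k)+1)*S(sqrt(k)).
    Assumes k = 2^(2^j); the base case charges one unit for a single
    binary merger (an arbitrary O(1) normalization)."""
    if k <= 2:
        return 1
    r = round(k ** 0.5)                           # sqrt(k), exact for these k
    return r * r ** 3 + (r + 1) * merger_space(r) # sqrt(k)*k^{3/2} = sqrt(k)^4

for j in range(2, 6):
    k = 2 ** (2 ** j)
    print(k, merger_space(k) / k ** 2)            # ratio stays bounded: Θ(k²)
```

The printed ratios S(k)/k² stay between 1 and 1.5 and tend to 1, illustrating the Θ(k²) solution.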

We now analyze the number of memory transfers needed to fill the output buffer of size k³ at the root of a k-merger. In the recursive definition of the buffer sizes in the k-merger, consider the first level where the subtrees (excluding output buffers) have size less than M/3; if k̄ is the number of leaves of one such subtree, then by the space usage of k-mergers we have k̄² ≤ M/3 and (k̄²)² = k̄⁴ = Ω(M). We call these subtrees of the k-merger *base trees* and the buffers between the base trees *large buffers*. Assuming B² ≤ M/3, a base tree T_v rooted in v, together with one block from each of the large buffers surrounding it (i.e., its single output buffer and k̄ input buffers), can be contained in fast memory, since M/3 + B + k̄ · B ≤ M/3 + B + (M/3)^{1/2} · (M/3)^{1/2} ≤ M. If the k-merger consists of a single base tree, the number of memory transfers used to fill its output buffer with k³ elements during an invocation is trivially O(k³/B + k). Otherwise, consider an invocation of the root v of a base tree T_v, which will fill up the size-Ω(k̄³) output buffer of v. Loading T_v and one block for each of the k̄ buffers just below it into fast memory incurs O(k̄²/B + k̄) memory transfers. This is O(1/B) memory transfers for each of the Ω(k̄³) elements output, since k̄⁴ = Ω(M) implies k̄² = Ω(M^{1/2}) = Ω(B), from which k̄ = O(k̄³/B) follows. Provided that none of the input buffers just below T_v becomes empty, the output buffer can then be filled in O(k̄³/B) memory transfers, since elements can be read from the input buffers in O(1/B) transfers amortized. If a buffer below T_v becomes empty, a recursive invocation is needed. This invocation may evict T_v from memory, leading to its reloading when the invocation finishes. We charge this cost to the Ω(k̄³) elements in the filled buffer, or O(1/B) memory transfers per element. Finally, the last time an invocation is used to fill a particular buffer, the buffer may not be completely filled (due to exhaustion). However, this happens only once for each buffer, so we can pay the cost by charging O(1/B) memory transfers to each position in each buffer in the k-merger. Since the entire k-merger uses O(k²) space and merges k³ elements, these charges add up to O(1/B) memory transfers per element.

We charge an element O(1/B) memory transfers each time it is inserted into a large buffer. Since k̄ = Ω(M^{1/4}), each element is inserted into O(log_k̄ k) = O(log_M k³) large buffers. Thus we have the following.
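Summing the charges gives the bound of Theorem 34.2; in our notation (a sketch of the accounting, not the chapter's derivation): each of the k³ output elements pays O(1/B) per large buffer it passes through, each of the O(k²) buffer positions absorbs one O(1/B) charge for its final partial filling, and O(k³/B + k) covers the base case of a single base tree:

```latex
k^3 \cdot O\!\left(\frac{1}{B}\right) \cdot O(\log_M k^3)
\;+\; O(k^2) \cdot O\!\left(\frac{1}{B}\right)
\;+\; O\!\left(\frac{k^3}{B} + k\right)
\;=\; O\!\left(\frac{k^3}{B}\log_M k^3 + k\right).
```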

*THEOREM 34.2  Excluding the output buffers, the size of a k-merger is O(k²) and it performs O((k³/B) log_M k³ + k) memory transfers during an invocation to fill up its output buffer of size k³.*

*Funnelsort*

The cache-oblivious sorting algorithm Funnelsort is easily obtained once the k-merger structure is defined: Funnelsort breaks the N input elements into N^{1/3} groups of size N^{2/3}, sorts them recursively, and then merges the sorted groups using an N^{1/3}-merger.
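The recursion pattern (though not the cache behaviour) can be sketched directly; here Python's heapq.merge deliberately stands in for the k-merger, so only the N^{1/3}-way split of Funnelsort is illustrated, and the cutoff of 8 is an arbitrary choice of ours.

```python
import heapq

def funnelsort(a):
    """Sketch of Funnelsort's recursion: split the input into roughly
    N^{1/3} groups of roughly N^{2/3} elements, sort them recursively,
    and k-way merge the sorted groups (heapq.merge replaces the
    k-merger of the text, so cache behaviour is not modelled)."""
    n = len(a)
    if n <= 8:
        return sorted(a)                   # small base case
    groups = max(2, round(n ** (1 / 3)))   # ~N^{1/3} groups
    size = -(-n // groups)                 # ~N^{2/3} elements each (ceiling)
    runs = [funnelsort(a[i:i + size]) for i in range(0, n, size)]
    return list(heapq.merge(*runs))

print(funnelsort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```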

Funnelsort can be analyzed as follows: Since the space usage of a k-merger is sub-linear in its output, the elements in a recursive sort of size M/3 only need to be loaded into memory once during the entire following recursive sort. For k-mergers at the remaining higher levels in the recursion tree, we have k³ ≥ M/3 ≥ B², which implies k² ≥ B^{4/3} > B and hence k³/B > k. By Theorem 34.2, the number of memory transfers during a merge involving N′ elements is then O(log_M(N′)/B) per element. Since the recursive problem sizes are N^{(2/3)^i}, the total number of memory transfers per element is O((1/B)(1 + Σ_{i≥0} log_M N^{(2/3)^i})) = O(log_M(N)/B), the sum being geometric; over all N elements this is O((N/B) log_M N) = O(Sort_{M,B}(N)).

In the above analysis, the exact (tall-cache) assumption on the size of the fast memory is B² ≤ M/3. In [15] it is shown how to generalize Funnelsort such that it works under the weaker assumption B^{1+ε} ≤ M, for fixed ε > 0. The resulting algorithm incurs the optimal O(Sort_{M,B}(N)) memory transfers when B^{1+ε} = M, at the price of incurring O((1/ε) · Sort_{M,B}(N)) memory transfers when B² ≤ M. It is shown in [17] that this trade-off is the best possible for comparison-based cache-oblivious sorting.