Towards a Theory of Cache-Efficient Algorithms
Summary for the seminar: Analysis of algorithms in hierarchical memory – Spring 2004
by Gala Golan
The RAM Model
• In the previous lecture we discussed a cache in an operating system.
• We saw a lower bound on sorting of Ω((N/B)·log_(M/B)(N/B)) block transfers, where:
• N = number of elements to sort
• B = number of elements in each block
• M = memory size
The I/O Model
• A datum can be accessed only from fast memory.
• B elements are brought to memory in each access.
• Computation cost << I/O cost.
• A block of data can be placed anywhere in fast memory.
• I/O operations are explicit.
The Cache Model
• A datum can be accessed only from fast memory. √
• B elements are brought to memory in each access. √
• Computation cost << I/O cost → L denotes the normalized cache latency; accessing a block from the cache costs 1.
• A block of data can be placed anywhere in fast memory → a fixed mapping distributes main memory blocks among the cache frames.
• I/O operations are explicit → the cache is not visible to the programmer.
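As a rough illustration of the fixed mapping (a minimal C sketch; the constants and the function name are illustrative, not taken from the paper), a direct-mapped cache sends every memory block to exactly one frame:

```c
#include <stdint.h>

/* Illustrative parameters: block size in bytes and number of cache frames. */
#define BLOCK_SIZE 64
#define NUM_FRAMES 1024

/* A memory address is mapped to exactly one cache frame: the block number
 * modulo the number of frames.  The programmer never performs this step
 * explicitly; the hardware applies it on every access. */
static inline uint32_t cache_frame(uintptr_t addr) {
    uintptr_t block = addr / BLOCK_SIZE;     /* which memory block the address lies in */
    return (uint32_t)(block % NUM_FRAMES);   /* the unique frame it may occupy */
}
```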
Notation
• I(M,B) – the I/O model.
• C(M,B,L) – the cache model.
• n = N/B, m = M/B – the size of the data and of the memory in blocks (instead of elements).
• The goal of algorithm design is to minimize running time = (number of cache accesses) + L × (number of memory accesses).
Reminder – Cache Associativity
• Associativity specifies the number of different frames in which a memory block can reside.
[Figure: fully associative, direct-mapped, and 2-way set associative caches]
Emulation Theorem
• An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm A_c in C(M,B,L) that runs in O(I + (L + B)·T) steps.
• The additional memory requirement is m blocks.
• In other words, an algorithm that is efficient in main memory can be efficient in cache (a code sketch of the emulation follows the proof figures below).
Proof (slides 1–6)
[Figure sequence: the cache C[] with m frames, a buffer Buf[] of m blocks, and main memory Mem[] of n blocks; individual blocks (a, b, q) are copied between Mem[], Buf[], and the cache to emulate the I/O algorithm's explicit block transfers.]
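A minimal C sketch of the emulation idea, reconstructed from the figures (the names Mem, Buf, ELEM and the constants are illustrative assumptions, not the paper's code): the I/O algorithm's fast memory of m blocks is represented by a contiguous array Buf[] that maps one-to-one onto the cache frames, and each explicit block transfer of the I/O algorithm becomes a copy between Mem[] and Buf[].

```c
#include <stddef.h>

typedef int ELEM;                      /* element type (illustrative) */
enum { B = 16, m = 64, n = 4096 };     /* elements per block, fast-memory blocks, data blocks (illustrative) */

static ELEM Mem[n * B];                /* main memory of the I/O algorithm: n blocks of B elements */
static ELEM Buf[m * B];                /* emulated fast memory: m contiguous blocks; if Buf is aligned
                                          to the cache, its m blocks occupy distinct cache frames, so a
                                          block "resident in fast memory" stays cache-resident between uses */

/* Emulate "read block src of Mem into fast-memory block dst".
 * Cost in C(M,B,L): O(L) for the cache misses plus O(B) element copies. */
static void emulated_read(size_t src, size_t dst) {
    for (size_t i = 0; i < B; i++)
        Buf[dst * B + i] = Mem[src * B + i];
}

/* Emulate "write fast-memory block src back to block dst of Mem". */
static void emulated_write(size_t src, size_t dst) {
    for (size_t i = 0; i < B; i++)
        Mem[dst * B + i] = Buf[src * B + i];
}
```

The I/O algorithm's computation then operates on Buf[] instead of its abstract fast memory; each of its T transfers costs O(L + B) here, which gives the O(I + (L + B)·T) bound of the theorem.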
Block efficient algorithms
• For a block efficient algorithm, computation is done on at least a constant fraction of the elements in the blocks transferred.
• In such a case B·T = O(I), so the B·T term is absorbed into I and an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps.
• The algorithms for sorting, FFT, and matrix transposition are block efficient.
Extension to set-associative cache
• In a k-way set-associative cache, if all k lines of a set are occupied, the hardware uses LRU to choose which line of that set to replace for the referenced block.
• In the emulation technique described before we do not have explicit control of the replacement.
• Instead, a property of LRU will be used, and the cache will be used only partially.
Optimal Replacement Algorithm for Cache
• OPT or MIN – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.
• Offline – it knows in advance which blocks will be accessed next.
• It evicts from the cache the block that will be accessed again furthest in the future.
• Proven to be optimal – better than any online algorithm.
• Proposed by Belady in 1966.
• Used to theoretically test the efficiency of online algorithms.
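A small C sketch of Belady's policy on a known trace (illustrative code and names, not from the paper): on a miss with a full memory, evict the resident block whose next reference lies furthest in the future.

```c
#include <stddef.h>

/* Count the misses of Belady's OPT on the access trace trace[0..len-1]
 * (block identifiers), with a fully associative fast memory of `frames`
 * blocks.  `cache` is caller-provided scratch space of `frames` ints. */
static size_t opt_misses(const int *trace, size_t len, size_t frames, int *cache) {
    size_t used = 0, misses = 0;

    for (size_t t = 0; t < len; t++) {
        int hit = 0;
        for (size_t i = 0; i < used; i++)
            if (cache[i] == trace[t]) { hit = 1; break; }
        if (hit) continue;

        misses++;
        if (used < frames) { cache[used++] = trace[t]; continue; }

        /* Evict the block whose next reference is furthest in the future
         * (or that is never referenced again). */
        size_t victim = 0, best_dist = 0;
        for (size_t i = 0; i < used; i++) {
            size_t dist = len;                    /* "never used again" */
            for (size_t u = t + 1; u < len; u++)
                if (trace[u] == cache[i]) { dist = u - t; break; }
            if (dist > best_dist) { best_dist = dist; victim = i; }
        }
        cache[victim] = trace[t];
    }
    return misses;
}
```

The forward scan makes this quadratic in the trace length, which is fine for checking small examples against an online policy such as LRU.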
LRU vs. OPT
• For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1 − 1/c)·m.
• For example, LRU with a cache of 9 blocks makes at most 3 times as many misses as OPT with a memory of 6 = (1 − 1/3)·9 blocks: if OPT incurs X misses, LRU incurs at most 3X.
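To see the bound concretely one can count LRU misses with m frames against OPT misses with (1 − 1/c)·m frames on the same trace; a hypothetical LRU counter in the same style as the OPT sketch above:

```c
#include <stddef.h>

/* Count the misses of LRU on trace[0..len-1] with a fully associative fast
 * memory of `frames` blocks.  `cache` and `last_use` are caller-provided
 * scratch arrays of `frames` entries; last_use[] records, per resident
 * block, the time of its most recent reference. */
static size_t lru_misses(const int *trace, size_t len, size_t frames,
                         int *cache, size_t *last_use) {
    size_t used = 0, misses = 0;

    for (size_t t = 0; t < len; t++) {
        size_t i, hit_idx = used;
        for (i = 0; i < used; i++)
            if (cache[i] == trace[t]) { hit_idx = i; break; }

        if (hit_idx < used) {           /* hit: refresh recency */
            last_use[hit_idx] = t;
            continue;
        }

        misses++;
        if (used < frames) {            /* a cold frame is still available */
            cache[used] = trace[t];
            last_use[used] = t;
            used++;
            continue;
        }

        /* Evict the least recently used resident block. */
        size_t victim = 0;
        for (i = 1; i < used; i++)
            if (last_use[i] < last_use[victim]) victim = i;
        cache[victim] = trace[t];
        last_use[victim] = t;
    }
    return misses;
}
```

For c = 3, lru_misses with 9 frames should stay within roughly 3 times opt_misses with 6 frames (up to a small additive term) on any trace, matching the 9-versus-6 example above.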
Extension to set-associative cache – Cont.
• Similarly, LRU with cache size m makes at most 2 times as many misses as OPT with memory of size m/2.
• We therefore emulate the I/O algorithm using only half of Buf[]: instead of k cache lines for every set, there are now k/2.
• These k/2 blocks are managed optimally, by the optimality of the I/O algorithm.
• In the real cache, the k lines of each set are managed by LRU and experience at most twice the misses.
Extension to set-associative cache – Cont.
[Figure: cache C[] with m frames, buffer Buf[] of m blocks, and main memory Mem[] of n blocks; only half of Buf[] is used, as described above.]
Generalized Emulation Theorem
• An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm A_c in the k-way associative cache model C(M,B,L) that runs in O(I + (L + B)·T) steps.
• The additional memory requirement is m/2 blocks.
The cache complexity of sorting
• The lower bound for sorting in I(M,B) is T = Ω(n·log_m n) block transfers.
• The lower bound for sorting in C(M,B,L) is therefore Ω(I + L·T), where:
• I = computations (Ω(N·log N) comparisons)
• T = I/O operations (Ω(n·log_m n) block transfers)
Cache Miss Classes
• Compulsory miss – a block is being referenced for the first time.
• Capacity miss – a block was evicted from the cache because the cache is too small.
• Conflict miss – a block was evicted from the cache because another block was mapped to the same set.
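As a rough way to see the three classes on a concrete trace (a common classification methodology sketched here under assumptions of our own: block ids below 4096, at most 1024 frames; not something prescribed by the paper), simulate the direct-mapped cache alongside a fully associative LRU cache of the same total size: a first reference is a compulsory miss, a miss that the fully associative cache also suffers is a capacity miss, and the rest are conflict misses caused purely by the fixed mapping.

```c
#include <stddef.h>

typedef struct { size_t compulsory, capacity, conflict; } miss_classes;

/* Classify the misses of a direct-mapped cache with `frames` frames on a
 * trace of block ids (sketch: ids assumed < 4096, frames <= 1024).
 * A fully associative LRU cache of the same size is simulated in parallel
 * to separate capacity misses from conflict misses. */
static miss_classes classify_misses(const int *trace, size_t len, size_t frames) {
    miss_classes mc = {0, 0, 0};
    int dm[1024];            /* direct-mapped frames: resident block id or -1 */
    int fa[1024];            /* fully associative frames */
    size_t fa_last[1024];    /* recency stamps for LRU */
    char seen[4096] = {0};   /* has this block id been referenced before? */
    size_t fa_used = 0;

    for (size_t i = 0; i < frames; i++) dm[i] = fa[i] = -1;

    for (size_t t = 0; t < len; t++) {
        int b = trace[t];

        /* Fully associative LRU cache of the same total size. */
        int fa_hit = 0;
        for (size_t i = 0; i < fa_used; i++)
            if (fa[i] == b) { fa_last[i] = t; fa_hit = 1; break; }
        if (!fa_hit) {
            if (fa_used < frames) { fa[fa_used] = b; fa_last[fa_used] = t; fa_used++; }
            else {
                size_t v = 0;
                for (size_t i = 1; i < frames; i++)
                    if (fa_last[i] < fa_last[v]) v = i;
                fa[v] = b; fa_last[v] = t;
            }
        }

        /* Direct-mapped cache: block b may live only in frame b % frames. */
        size_t f = (size_t)b % frames;
        if (dm[f] != b) {
            if (!seen[b])      mc.compulsory++;   /* first reference ever */
            else if (!fa_hit)  mc.capacity++;     /* full associativity would also miss */
            else               mc.conflict++;     /* caused only by the fixed mapping */
            dm[f] = b;
        }
        seen[b] = 1;
    }
    return mc;
}
```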
Average case performance of merge-sort in the cache model
• We want to estimate the number of cache misses while performing the algorithm:
• Compulsory misses are unavoidable.
• Capacity misses are minimized by the I/O algorithm.
• We can quantify the expected number of conflict misses.
When does a conflict miss occur?
• s cache sets are available for k runs S1…Sk.
• The expected number of elements in any run Si is N/k.
• A leading block is a cache line containing the leading element of a run; bi is the leading block of Si.
• A conflict occurs when two leading blocks are mapped to the same cache set.
When does a conflict miss occur – Cont.
• Formally: a conflict miss occurs for element Si,j+1 when there is at least one element x in a leading block bk, k ≠ i, such that Si,j < x < Si,j+1 and S(bi) = S(bk), where S(b) denotes the cache set to which block b is mapped.
[Figure: run Si with consecutive elements at positions j and j+1, and an element x of run Sk falling between them]
How many conflict misses to expect
• Pi = the probability of conflict for element i, 1 ≤ i ≤ N.
• Assume a uniform distribution of:
• the leading blocks among the cache sets,
• the leading element within the leading block.
• If k is Ω(s) then Pi is Ω(1) (see the simulation sketch below).
• For each pass, the number of conflict misses is Ω(N).
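A quick illustrative Monte Carlo sketch (our own construction, not from the paper) of why k = Ω(s) makes the conflict probability constant: place the k leading blocks uniformly at random among the s cache sets and estimate the probability that a given run's leading block shares its set with another one. Under this uniformity assumption that probability is about 1 − (1 − 1/s)^(k−1) ≈ 1 − e^(−k/s), which is bounded away from zero once k is a constant fraction of s; sharing a set is the necessary condition for the conflict miss described above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Estimate the probability that run 0's leading block shares a cache set
 * with at least one of the other k-1 leading blocks, when each leading
 * block lands in one of s sets uniformly at random (illustrative model). */
static double collision_probability(int s, int k, int trials) {
    int collisions = 0;
    for (int t = 0; t < trials; t++) {
        int my_set = rand() % s;
        for (int j = 1; j < k; j++)
            if (rand() % s == my_set) { collisions++; break; }
    }
    return (double)collisions / trials;
}

int main(void) {
    int s = 1024;                       /* number of cache sets (illustrative) */
    srand(12345);
    /* As the number of runs k grows toward s, the collision probability
     * approaches a constant (about 1 - (1 - 1/s)^(k-1) ~ 1 - e^(-k/s)). */
    for (int k = s / 8; k <= 2 * s; k *= 2)
        printf("k = %4d  P(collision) ~ %.3f\n", k, collision_probability(s, k, 20000));
    return 0;
}
```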
How many conflict misses to expect – Cont.
• The expected number of conflict misses throughout merge-sort includes Ω(N) misses for each pass.
• By choosing k << s we minimize the probability of conflict misses, but we incur more capacity misses, since more merge passes are needed.
Conclusions
• There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.
• It applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.
• The constants hidden in these asymptotic bounds are important in practice.