Towards a Theory of Cache-Efficient Algorithms

Presentation Transcript


  1. Towards a Theory of Cache-Efficient Algorithms Summary for the seminar: Analysis of algorithms in hierarchical memory – Spring 2004 by Gala Golan

  2. The RAM Model • In the previous lecture we discussed a cache in an operating system. • We saw a lower bound on sorting: Ω((N/B) · log_{M/B}(N/B)) block transfers, where • N = number of elements being sorted • B = number of elements in each block • M = memory size
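For a feel of the numbers, here is a small sketch (the parameters are made up for illustration, not taken from the slides) that evaluates the bound Ω((N/B) · log_{M/B}(N/B)) for concrete values of N, B, and M:

    import math

    def io_sort_lower_bound(N, B, M):
        # Evaluate the I/O-model sorting lower bound Omega((N/B) * log_{M/B}(N/B)).
        n = N / B   # data size in blocks
        m = M / B   # fast memory size in blocks
        return n * math.log(n, m)

    # Hypothetical parameters: 2**26 elements, 128 elements per block,
    # 2**15 elements of fast memory.
    print(io_sort_lower_bound(N=2**26, B=2**7, M=2**15))   # ~1.2 million block transfers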

  3. The I/O Model • A datum can be accessed only from fast memory • B elements are brought to memory in each access • Computation cost << I/O cost • A block of data can be placed anywhere in fast memory • I/O operations are explicit

  4. The Cache Model • A datum can be accessed only from fast memory √ • B elements are brought to memory in each access √ • Computation cost << I/O cost: L denotes the normalized cache latency, and accessing a block from cache costs 1 • A block of data can be placed anywhere in fast memory: instead, a fixed mapping distributes main memory in the cache • I/O operations are explicit: instead, the cache is not visible to the programmer

  5. Notation • I(M,B) - the I/O model • C(M,B,L) - the cache model • n = N/B, m = M/B – the sizes of the data and of memory in blocks (instead of elements) • The goal of algorithm design is to minimize running time = (number of cache accesses) + L · (number of memory accesses)
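As a minimal illustration of this cost model (the function name and the counts below are hypothetical, not from the paper), the running time is just a weighted sum of the two kinds of accesses:

    def running_time(cache_accesses, memory_accesses, L):
        # Cost model of C(M,B,L): a cache access costs 1, a memory (block) access costs L.
        return cache_accesses + L * memory_accesses

    # Hypothetical counts: 10**6 cache accesses, 10**4 block fetches from memory, L = 50.
    print(running_time(10**6, 10**4, L=50))   # 1500000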

  6. Reminder – Cache Associativity • Associativity specifies the number of different frames in which a memory block can reside. • Common organizations: fully associative, direct mapped, 2-way set associative.

  7. Emulation Theorem • An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in C(M,B,L) that runs in O(I + (L + B)T) steps. • The additional memory requirement is m blocks. • In other words, an algorithm that is efficient in main memory can be made efficient in cache.

  8.–13. Proof (1)–(6) [figures: the emulation maintains the direct-mapped cache C[] with m frames, a buffer Buf[] of m blocks kept in main memory, and the data array Mem[] of n blocks; the sequence of figures traces how a block q requested by the I/O algorithm is copied from Mem[] into a frame of Buf[], displacing the block previously held there, so that subsequent accesses hit the corresponding cache frame. A code sketch of this emulation follows below.]
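Below is a minimal Python sketch of the emulation idea (class and method names are illustrative, not from the paper): every block transfer of the I/O algorithm becomes a copy of B elements between Mem[] and Buf[], costing B element moves plus at most one cache miss, which is where the O(I + (L + B)T) bound comes from.

    class IOEmulation:
        # The I/O algorithm's fast memory of m blocks is simulated by a buffer Buf[]
        # kept in main memory. Because Buf[] is contiguous and exactly the size of
        # the cache, block i of Buf[] always maps to cache frame i in a direct-mapped
        # cache, so after one miss it stays cache-resident until frame i is reused.

        def __init__(self, mem_blocks, m, B):
            self.Mem = mem_blocks        # main memory: n blocks of B elements each
            self.Buf = [None] * m        # simulated fast memory: m blocks
            self.B = B
            self.elements_copied = 0     # accounts for the B*T term of the theorem

        def read_block(self, q, i):
            # Emulate "bring block q of Mem[] into fast-memory frame i".
            self.Buf[i] = list(self.Mem[q])
            self.elements_copied += self.B

        def write_block(self, i, q):
            # Emulate "write fast-memory frame i back to block q of Mem[]".
            self.Mem[q] = list(self.Buf[i])
            self.elements_copied += self.B

        def access(self, i, j):
            # All in-core accesses of the I/O algorithm go through Buf[], i.e. the cache.
            return self.Buf[i][j]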

  14. Block efficient algorithms • For a block-efficient algorithm, computation is done on at least a constant fraction of the elements in the blocks transferred. • In such a case I = Ω(B·T), i.e. O(B·T) ≡ O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps. • The algorithms for sorting, FFT, and matrix transposition are block efficient.

  15. Extension to set-associative cache • In a k-way set-associative cache, if all k lines of a set are occupied, the hardware uses LRU to choose which block of that set to replace. • In the emulation technique described before we do not have explicit control of the replacement. • Instead, a property of LRU will be used, and the cache will be used only partially.

  16. Optimal Replacement Algorithm for Cache • OPT or MIN – a hypothetical algorithm that minimizes cache misses for a given (finite) access trace. • Offline – it knows in advance which blocks will be accessed next. • Evicts from cache the block whose next access lies furthest in the future. • Proven to be optimal – better than any online algorithm. • Proposed by Belady in 1966. • Used to theoretically test the efficiency of online algorithms.
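A short sketch of this offline policy (the function name and trace representation are assumptions for illustration): on each miss with a full cache, it scans the rest of the trace and evicts the resident block whose next reference is furthest away.

    def belady_misses(trace, cache_size):
        # Count misses of Belady's OPT: on eviction, drop the resident block whose
        # next use lies furthest in the future (or never occurs again).
        cache, misses = set(), 0
        for t, block in enumerate(trace):
            if block in cache:
                continue
            misses += 1
            if len(cache) == cache_size:
                def next_use(b):
                    for u in range(t + 1, len(trace)):
                        if trace[u] == b:
                            return u
                    return float("inf")
                cache.remove(max(cache, key=next_use))
            cache.add(block)
        return misses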

  17. LRU vs. OPT • For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1 - 1/c)m. • For example, with c = 3: LRU with a cache of 9 blocks makes at most 3 times as many misses as OPT with a memory of 6 = (1 - 1/3)·9 blocks; if OPT incurs X misses, LRU incurs at most 3X misses.
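To make the bound concrete, the following small experiment (parameters chosen arbitrarily, reusing belady_misses from the sketch above) compares LRU with 9 frames against OPT with 6 frames on a random trace; the LRU miss count should stay within roughly 3 times the OPT miss count.

    import random
    from collections import OrderedDict

    def lru_misses(trace, cache_size):
        # Count misses of LRU with the given number of block frames.
        cache, misses = OrderedDict(), 0
        for block in trace:
            if block in cache:
                cache.move_to_end(block)        # mark as most recently used
            else:
                misses += 1
                if len(cache) == cache_size:
                    cache.popitem(last=False)   # evict the least recently used block
                cache[block] = True
        return misses

    # c = 3, m = 9: LRU with 9 frames vs. OPT with (1 - 1/3) * 9 = 6 frames
    # on a random trace of block references.
    trace = [random.randrange(20) for _ in range(2000)]
    print(lru_misses(trace, 9), 3 * belady_misses(trace, 6))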

  18. Extension to set-associative cache – Cont. • Similarly (c = 2), LRU with cache size m makes at most twice as many misses as OPT with memory of size m/2. • We emulate the I/O algorithm using only half the size of Buf[]: instead of k cache lines for every set, only k/2 are used. • These k/2 blocks are managed optimally, by the optimality of the I/O algorithm. • In the real cache, the k lines of each set are managed by LRU and experience at most twice as many misses.

  19. Extension to set-associative cache – Cont. [figure: the arrays C[] (m frames), Buf[] (m blocks), and Mem[] (n blocks) from the emulation, with only half of Buf[] used by the emulated algorithm]

  20. Generalized Emulation Theorem • An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in the k-way associative cache model C(M,B,L) that runs in O(I + (L + B)T) steps. • The additional memory requirement is m/2 blocks.

  21. The cache complexity of sorting • The lower bound for sorting in I(M,B) is T = Ω(n · log_m n) block transfers. • The corresponding lower bound in C(M,B,L) is Ω(I + L·T) = Ω(N·log N + L·n·log_m n), where I = computations, T = I/O operations.

  22. Cache Miss Classes • Compulsory miss – a block is being referenced for the first time. • Capacity miss – a block was evicted from the cache because the cache is too small. • Conflict miss – a block was evicted from the cache because another block was mapped to the same set.

  23. Average case performance of merge-sort in the cache model • We want to estimate the number of cache misses while performing the algorithm: • Compulsory misses are unavoidable • Capacity misses are minimized by the I/O algorithm • We can quantify the expected number of conflict misses.

  24. When does a conflict miss occur? • s cache sets are available for k runs S1…Sk. • The expected number of elements in any run Si is N/k. • A leading block is a cache line containing the leading element of a run (the next element of that run to be consumed by the merge). bi is the leading block of Si. • A conflict occurs when two leading blocks are mapped to the same cache set.

  25. When does a conflict miss occur – Cont. • Formally: a conflict miss occurs for element Si,j+1 when there is at least one element x in a leading block bk, k ≠ i, such that Si,j < x < Si,j+1 and S(bi) = S(bk), i.e. the two leading blocks map to the same cache set.

  26. How many conflict misses to expect • Pi = the probability of conflict for element i, 1 ≤ i ≤ N. • Assume a uniform distribution of the leading blocks among the cache sets, and of the leading element's position within its leading block. • If k is Ω(s) then Pi is Ω(1). • For each merge pass, the number of conflict misses is then Ω(N).
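A small simulation (parameters are illustrative) makes the claim concrete: with leading blocks mapped uniformly at random, the probability that a given run's leading block collides with at least one other is 1 - (1 - 1/s)^(k-1), about 1 - 1/e ≈ 0.63 when k = s.

    import random

    def conflict_probability(k, s, trials=100_000):
        # Monte Carlo estimate of Pi: the chance that the leading block of one run
        # shares a cache set with the leading block of at least one of the other
        # k-1 runs, each leading block mapped uniformly at random to one of s sets.
        hits = 0
        for _ in range(trials):
            mine = random.randrange(s)
            if any(random.randrange(s) == mine for _ in range(k - 1)):
                hits += 1
        return hits / trials

    # Example with k = s = 512: the analytic value 1 - (1 - 1/s)**(k - 1) is about
    # 1 - 1/e = 0.63, i.e. Omega(1), matching the claim that k = Omega(s) gives Pi = Omega(1).
    print(conflict_probability(k=512, s=512))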

  27. How many conflict misses to expect – Cont. • The expected number of conflict misses throughout merge-sort includes Ω(N) misses for each merge pass. • By choosing k << s we reduce the probability of conflict misses, but we incur more capacity misses.

  28. Conclusions • There is a way to transform I/O-efficient algorithms into cache-efficient algorithms. • It applies to a blocking, direct-mapped cache that does not distinguish between reads and writes. • The constants hidden in these asymptotic bounds are important.
