ECE729 : Advanced Computer Architecture

  1. ECE729 : Advanced Computer Architecture
     Lectures 20-23 : Memory Hierarchy - Cache Performance
     2nd – 12th March, 2010
     Anshul Kumar, CSE IITD

  2. Lecture 20, 2nd March, 2010

  3. Performance
     • Time to access cache = t1 (usually 1 CPU cycle)
     • Time to access main memory = t2 (one order of magnitude higher than t1)
     • Hit probability (hit ratio or hit rate) = h
     • Miss probability (miss ratio or miss rate) = m = 1 - h
     • Time spent when a hit occurs = t1 (hit time)
     • Time spent when a miss occurs = t1 + t2 (t2 = miss penalty)
     • Teff = h * t1 + m * (t1 + t2) = t1 + m * t2
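
The two forms of the effective access time can be checked against each other with a small C sketch. The function names are illustrative only; `t1` stands for the cache access (hit) time, `t2` for the miss penalty and `h` for the hit ratio.

```c
/* Effective access time, long form: weight the hit and miss cases.
   t1 = cache access (hit) time, t2 = miss penalty, h = hit ratio. */
double teff_long(double h, double t1, double t2)
{
    double m = 1.0 - h;                /* miss ratio */
    return h * t1 + m * (t1 + t2);
}

/* Simplified form: every access pays t1, misses pay t2 on top. */
double teff_short(double h, double t1, double t2)
{
    return t1 + (1.0 - h) * t2;
}
```

With a hypothetical hit ratio of 0.95, t1 = 1 and t2 = 10, both functions give 1.5 cycles.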

  4. Performance contd.
     • Teff = t1 + m * t2, where m * t2 is the memory stall time per access
     • Average memory access time = Hit time + Miss rate * Miss penalty
     • Program execution time = IC * Cycle time * (CPIexec + Mem stalls / instr)
     • Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr
     • Miss penalty in an OOO (out-of-order) processor = Total miss latency - Overlapped miss latency
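
The execution-time relation can be written out as a minimal C sketch. The function and parameter names are assumptions made for illustration, and the numbers in the note after the code are hypothetical, not taken from the lecture.

```c
/* Memory stall cycles per instruction. */
double mem_stalls_per_instr(double miss_rate, double miss_penalty,
                            double accesses_per_instr)
{
    return miss_rate * miss_penalty * accesses_per_instr;
}

/* Program execution time in seconds.
   ic = instruction count, cycle_time = clock period in seconds,
   cpi_exec = CPI with a perfect memory system. */
double exec_time(double ic, double cycle_time, double cpi_exec,
                 double stalls_per_instr)
{
    return ic * cycle_time * (cpi_exec + stalls_per_instr);
}
```

For example, a 2% miss rate, a 50-cycle penalty and 1.5 accesses per instruction give 1.5 stall cycles per instruction; with 10^9 instructions, a 1 ns cycle and CPIexec = 1.0, the program takes 2.5 s.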

  5. Transferring blocks to/from memory
     [Figure: three CPU / cache / bus / memory organizations: (a) one-word-wide memory, (b) four-word-wide memory, (c) interleaved memory with banks 0 to 3]

  6. Miss penalty example
     • 1 clock cycle to send address
     • 15 cycles for RAM access
     • 1 cycle for sending data
     • block size = 4 words
     Miss penalty:
     • case (a), one-word-wide memory: 4 * (1 + 15 + 1) = 68, or 1 + 4 * (15 + 1) = 65 if the address is sent only once
     • case (b), four-word-wide memory: 1 + 1 * (15 + 1) = 17
     • case (c), interleaved memory: 1 + 15 + 4 = 20
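
The three organizations can be compared with a short C sketch. The function names are illustrative, and the sketch assumes that the block size is a multiple of the memory width or bank count and that the address is resent for every memory access (the 68-cycle variant of case (a)).

```c
/* Miss penalty in cycles for fetching a block of `words` words.
   addr = cycles to send the address, access = RAM access cycles,
   xfer = cycles to transfer one word (or one wide word). */

/* (a) one-word-wide memory: every word pays the full sequence. */
int penalty_narrow(int words, int addr, int access, int xfer)
{
    return words * (addr + access + xfer);
}

/* (b) memory and bus `width` words wide: one sequence per wide word. */
int penalty_wide(int words, int width, int addr, int access, int xfer)
{
    return (words / width) * (addr + access + xfer);
}

/* (c) `banks` interleaved banks: the bank accesses overlap,
   but the words are still transferred one at a time. */
int penalty_interleaved(int words, int banks, int addr, int access, int xfer)
{
    return (words / banks) * (addr + access + banks * xfer);
}
```

For a 4-word block with a 15-cycle RAM access these give 68, 17 and 20 cycles. The same functions give 192, 48 and 60 cycles for a 16-word block with a 10-cycle memory latency.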

  7. DRAM with page mode
     • Memory cells are organized as a 2-D structure
     • An entire row is accessed at a time internally and kept in a buffer
     • Reading multiple bits from a row can be done very fast
       • sequentially, without giving the address again
       • randomly, giving only the column addresses

  8. Performance analysis example
     • CPIeff = CPI + Miss rate * Miss penalty * Mem accesses / instr
     • CPI = 1.2
     • Miss rate = 0.5%
     • Block size = 16 words
     • Mem accesses / instr = 1 (assumption)
     • Miss penalty = ? (depends on the memory organization)

  9. Miss penalty calculation
     • Data / address transfer time = 1 cycle
     • Memory latency = 10 cycles
     • a) Miss penalty = 16 * (1 + 10 + 1) = 192
     • b) Miss penalty = 4 * (1 + 10 + 1) = 48
     • c) Miss penalty = 4 * (1 + 10 + 4 * 1) = 60

  10. Back to CPI calculation
      CPIeff = 1.2 + .005 * miss penalty * 1.0
      • a) 1.2 + .005 * 192 * 1.0 = 1.2 + .96 = 2.16
      • b) 1.2 + .005 * 48 * 1.0 = 1.2 + .24 = 1.44
      • c) 1.2 + .005 * 60 * 1.0 = 1.2 + .30 = 1.50
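
The CPI calculation above is a single expression; a minimal C sketch (the function name is illustrative) makes it easy to try other miss penalties.

```c
/* Effective CPI: base CPI plus memory stall cycles per instruction. */
double cpi_eff(double cpi, double miss_rate, double miss_penalty,
               double accesses_per_instr)
{
    return cpi + miss_rate * miss_penalty * accesses_per_instr;
}
```

Calling `cpi_eff(1.2, 0.005, p, 1.0)` with p = 192, 48 and 60 reproduces the three results 2.16, 1.44 and 1.50.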

  11. Performance Improvement
      • Reducing miss penalty
      • Reducing miss rate
      • Reducing miss penalty * miss rate
      • Reducing hit time

  12. Reducing Miss Penalty
      • Multi level caches
      • Critical word first and early restart
      • Write Through policy
      • Giving priority to read misses over writes
      • Merging write buffer
      • Victim caches

  13. Multi Level Caches
      • Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
      • Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
      • Multi level inclusion and multi level exclusion
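
The nested formula can be sketched in C as follows. The function name and the sample numbers in the note are hypothetical; the L2 miss rate is taken to be the local one, since it is applied only to accesses that already missed in L1.

```c
/* Average memory access time for a two-level cache.
   local_miss_rate_l2 is measured against requests that reach L2. */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double local_miss_rate_l2,
                      double penalty_l2)
{
    /* The L1 miss penalty is itself an average access time, seen from L2. */
    double penalty_l1 = hit_l2 + local_miss_rate_l2 * penalty_l2;
    return hit_l1 + miss_rate_l1 * penalty_l1;
}
```

For instance, a 1-cycle L1 with a 5% miss rate, a 10-cycle L2 with a 40% local miss rate and a 100-cycle memory give an L1 miss penalty of 50 cycles and an average access time of 3.5 cycles.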

  14. Misses in Multilevel Cache
      • Local miss rate
        • no. of misses / no. of requests, as seen at a level
      • Global miss rate
        • no. of misses / no. of requests, on the whole
      • Solo miss rate
        • miss rate if only this cache was present

  15. Two level cache miss example
      Requests fall into four classes (~ marks a miss at that level):
      • A: L1, L2
      • B: ~L1, L2
      • C: L1, ~L2
      • D: ~L1, ~L2
      Miss rates:
      • Local miss (L1) = (B+D)/(A+B+C+D)
      • Local miss (L2) = D/(B+D)
      • Global miss = D/(A+B+C+D)
      • Solo miss (L2) = (C+D)/(A+B+C+D)
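
The four miss rates can be computed from the class counts A, B, C and D with a small C sketch. The struct and function names are illustrative, the counts in the note are hypothetical, and the sketch assumes B + D is non-zero.

```c
/* Miss rates of a two-level cache, from request counts:
   a = hit in L1 and L2, b = miss in L1 only,
   c = miss in L2 only,  d = miss in both. */
struct miss_rates {
    double local_l1;   /* misses seen at L1 / all requests */
    double local_l2;   /* misses at L2 / requests that reach L2 */
    double global;     /* requests that go to memory / all requests */
    double solo_l2;    /* miss rate of L2 if it were the only cache */
};

struct miss_rates two_level_rates(double a, double b, double c, double d)
{
    double total = a + b + c + d;
    struct miss_rates r;
    r.local_l1 = (b + d) / total;
    r.local_l2 = d / (b + d);
    r.global   = d / total;
    r.solo_l2  = (c + d) / total;
    return r;
}
```

With A = 90, B = 6, C = 2, D = 2 the rates are 8%, 25%, 2% and 4%. The global rate equals the product of the two local rates, 0.08 * 0.25 = 0.02.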

  16. Multi-level cache example
      • CPI with no miss = 1.0
      • Clock = 500 MHz (cycle time = 2 ns)
      • Main mem access time = 200 ns
      • Miss rate = 5%
      • Adding an L2 cache (20 ns) reduces the miss rate to main memory to 2%. Find the performance improvement.
      • Miss penalty (mem) = 200/2 = 100 cycles
      • Effective CPI with L1 only = 1 + 5% * 100 = 6

  17. Example continued
      • Miss penalty (L2) = 20/2 = 10 cycles
      • Total CPI = Base CPI + stalls due to L1 miss + stalls due to L2 miss
        = 1.0 + 5% * 10 + 2% * 100 = 1.0 + 0.5 + 2.0 = 3.5
      • Performance ratio = 6.0/3.5 = 1.7
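
The whole example can be replayed with a minimal C sketch. The helper names are illustrative, and the 2% figure is treated as a global miss rate, as in the calculation above.

```c
/* Convert a latency in ns to cycles for a clock given in MHz. */
double ns_to_cycles(double ns, double clock_mhz)
{
    return ns * clock_mhz / 1000.0;   /* one cycle lasts 1000 / MHz ns */
}

/* CPI when every L1 miss goes straight to main memory. */
double cpi_l1_only(double base_cpi, double l1_miss, double mem_cycles)
{
    return base_cpi + l1_miss * mem_cycles;
}

/* CPI with an L2: every L1 miss pays the L2 access,
   and the global misses also pay the main memory access. */
double cpi_with_l2(double base_cpi, double l1_miss, double l2_cycles,
                   double global_miss, double mem_cycles)
{
    return base_cpi + l1_miss * l2_cycles + global_miss * mem_cycles;
}
```

At 500 MHz, 200 ns and 20 ns become 100 and 10 cycles, so the two CPIs are 6.0 and 3.5 and the ratio is about 1.71.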

  18. Lecture 21, 3rd March, 2010

  19. Critical Word First and Early Restart
      • Read policy
        • initiate memory access along with cache access in anticipation of a miss
        • forward data to CPU as it gets filled in cache
      • Load policy
        • wrap around load
      More effective when block size is large

  20. Write Through Policy
      • Write Through Policy reduces block traffic at the cost of
        • increased word traffic
        • increased miss rate

  21. Read Miss Priority Over Write
      • Provide write buffers
      • Processor writes into buffer and proceeds (for write through as well as write back)
      On read miss
      • wait for buffer to be empty, or
      • check addresses in buffer for conflict

  22. Merging Write Buffer
      • Merge writes belonging to the same block in case of write through

  23. Victim Cache (proposed by Jouppi)
      • Evicted blocks are recycled
      • Much faster than getting a block from the next level
      • Size = a few blocks only
      • A significant fraction of misses may be found in victim cache
      [Figure: cache and victim cache, with paths labelled "to proc" and "from mem"]

  24. Reducing Miss Rate
      • Large block size
      • Larger cache
      • LRU replacement
      • WB, WTWA write policies
      • Higher associativity
      • Way prediction, pseudo-associative cache
      • Warm start in multi-tasking
      • Compiler optimizations

  25. Large Block Size
      • Reduces compulsory misses
      • Too large block size - misses increase
      • Miss penalty increases

  26. Large Cache
      • Reduces capacity misses
      • Hit time increases
      • Keep small L1 cache and large L2 cache

  27. LRU Replacement Policy
      • Choice of replacement policy
        • optimal: replace the block whose next reference is farthest away in the future
        • practical (LRU): replace the block whose last reference is farthest away in the past
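
The practical policy can be illustrated with a small miss counter for one fully associative set. This is a sketch under stated assumptions: `lru_misses`, `MAX_WAYS` and the timestamp-based bookkeeping are choices made for illustration, not a description of how hardware tracks recency.

```c
#define MAX_WAYS 16

/* Count misses for a fully associative cache of `ways` blocks
   (ways <= MAX_WAYS) under LRU replacement.
   trace[i] is the block address of the i-th reference. */
int lru_misses(const int *trace, int n, int ways)
{
    int block[MAX_WAYS];      /* resident block addresses */
    int last_use[MAX_WAYS];   /* time of the most recent reference */
    int used = 0, misses = 0;

    for (int t = 0; t < n; t++) {
        int hit = -1;
        for (int w = 0; w < used; w++)
            if (block[w] == trace[t])
                hit = w;

        if (hit >= 0) {
            last_use[hit] = t;            /* refresh recency on a hit */
            continue;
        }

        misses++;
        int victim = 0;
        if (used < ways) {
            victim = used++;              /* fill an empty way first */
        } else {
            for (int w = 1; w < ways; w++)    /* oldest reference is evicted */
                if (last_use[w] < last_use[victim])
                    victim = w;
        }
        block[victim] = trace[t];
        last_use[victim] = t;
    }
    return misses;
}
```

For the block trace 1, 2, 1, 3, 2 a two-block cache misses four times (the reference to 3 evicts block 2, which is needed again next), while a three-block cache misses three times.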

  28. WB, WTWA Policies
      • WB and WTWA write policies tend to reduce the miss rate as compared to WTNWA at the cost of
        • increased block traffic

  29. Higher Associativity
      • Reduces conflict misses
      • 8-way is almost like fully associative
      • Hit time increases

  30. Associative cache example
      Cache  Mapping     Block size  I-miss  D-miss  CPI
      1      direct      1 word      4%      8%      2.0
      2      direct      4 words     2%      5%      ??
      3      2-way s.a.  4 words     2%      4%      ??
      • Miss penalty = 6 + block size
      • 50% of instructions have a data reference
      Stall cycles:
      • cache 1: 7 * (.04 + .08 * .5) = .56
      • cache 2: 10 * (.02 + .05 * .5) = .45
      • cache 3: 10 * (.02 + .04 * .5) = .40

  31. Example continued
      Cache  CPI                     Clock period  Time/instr
      1      2.0                     2.0           4.0
      2      2.0 - .56 + .45 = 1.89  2.0           3.78
      3      2.0 - .56 + .40 = 1.84  2.4           4.416
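
The table can be reproduced with a minimal C sketch (function names are illustrative). Subtracting cache 1's stall cycles from its CPI of 2.0 leaves the stall-free CPI of 1.44, to which each alternative cache's stall cycles are added.

```c
/* Stall cycles per instruction with separate I and D miss rates. */
double stall_cycles(double penalty, double i_miss, double d_miss,
                    double data_refs_per_instr)
{
    return penalty * (i_miss + d_miss * data_refs_per_instr);
}

/* Time per instruction, in the same unit as the clock period. */
double time_per_instr(double cpi, double clock_period)
{
    return cpi * clock_period;
}
```

Cache 3 has the lowest CPI (1.84), but its longer clock period of 2.4 makes it the slowest per instruction (4.416 against 3.78 for cache 2).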

  32. Way Prediction and Pseudo-associative Cache
      Way prediction: low miss rate of SA cache with hit time of DM cache
      • Only one tag is compared initially
      • Extra bits are kept for prediction
      • Hit time in case of mis-prediction is high
      Pseudo-assoc. or column assoc. cache: get advantage of SA cache in a DM cache
      • Check sequentially in a pseudo-set
      • Fast hit and slow hit

  33. Warm Start in Multi-tasking
      • Cold start
        • process starts with empty cache
        • blocks of previous process invalidated
      • Warm start
        • some blocks from previous activation are still available

  34. Lecture 22, 6th March, 2010

  35. Compiler optimizations
      Loop interchange
      • Improve spatial locality by scanning arrays row-wise
      Blocking
      • Improve temporal and spatial locality
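
Loop interchange can be shown on a simple array sum. The sketch below assumes C's row-major array layout; the array size and function names are made up for illustration.

```c
#define ROWS 64
#define COLS 64

/* Before interchange: the inner loop walks down a column, so
   consecutive accesses are COLS elements apart in memory. */
long sum_column_wise(int x[ROWS][COLS])
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += x[i][j];
    return s;
}

/* After interchange: the inner loop walks along a row, so
   consecutive accesses fall in the same cache block. */
long sum_row_wise(int x[ROWS][COLS])
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += x[i][j];
    return s;
}
```

Both functions return the same sum; only the order in which memory is touched changes, which is what makes the transformation safe for a compiler to apply here.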

  36. Improving Locality: Matrix Multiplication example

  37. Cache Organization for the example
      • Cache line (or block) = 4 matrix elements.
      • Matrices are stored row-wise.
      • Cache can't accommodate a full row/column. (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.)
      • Ignore misses due to conflict between matrices (as if there was a separate cache for each matrix).

  38. Matrix Multiplication : Code I
        for (i = 0; i < L; i++)
          for (j = 0; j < M; j++)
            for (k = 0; k < N; k++)
              C[i][j] += A[i][k] * B[k][j];

                  C      A      B
      accesses    LM     LMN    LMN
      misses      LM/4   LMN/4  LMN
      Total misses = LM(5N+1)/4

  39. Matrix Multiplication : Code II
        for (k = 0; k < N; k++)
          for (i = 0; i < L; i++)
            for (j = 0; j < M; j++)
              C[i][j] += A[i][k] * B[k][j];

                  C      A      B
      accesses    LMN    LN     LMN
      misses      LMN/4  LN     LMN/4
      Total misses = LN(2M+4)/4

  40. Matrix Multiplication : Code III
        for (i = 0; i < L; i++)
          for (k = 0; k < N; k++)
            for (j = 0; j < M; j++)
              C[i][j] += A[i][k] * B[k][j];

                  C      A      B
      accesses    LMN    LN     LMN
      misses      LMN/4  LN/4   LMN/4
      Total misses = LN(2M+1)/4
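
The three totals can be compared numerically with a short C sketch. The function names (after the loop orders i-j-k, k-i-j and i-k-j) are illustrative, and the formulas hold only under the cache assumptions stated for this example (4 elements per block, row-wise storage, rows too large to stay cached).

```c
/* Total misses predicted for each loop order. */

/* Code I: loops i, j, k */
double misses_ijk(double L, double M, double N)
{
    return L * M * (5 * N + 1) / 4;    /* LM/4 + LMN/4 + LMN */
}

/* Code II: loops k, i, j */
double misses_kij(double L, double M, double N)
{
    return L * N * (2 * M + 4) / 4;    /* LMN/4 + LN + LMN/4 */
}

/* Code III: loops i, k, j */
double misses_ikj(double L, double M, double N)
{
    return L * N * (2 * M + 1) / 4;    /* LMN/4 + LN/4 + LMN/4 */
}
```

For L = M = N = 100 the predictions are 1,252,500, 510,000 and 502,500 misses. Moving j to the innermost loop removes the column-wise walk over B, which accounts for nearly all of the saving.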

  41. Blocking
      • 5 nested loops
      • blocking factor = b

                  C        A        B
      accesses    LMN/b    LMN      LMN
      misses      LMN/4b   LMN/4b   MN/4
      Total misses = MN(2L/b+1)/4
      [Figure: blocks of C, A and B traversed by indices i, j, k, with jj and kk marking the current block]
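
One common way to write the five-loop blocked version is sketched below in C. This is an illustration under assumptions, not the lecture's own code: the flat row-major arrays, the function names and the register accumulator `r` are choices made here.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* C (L x M) += A (L x N) * B (N x M), all stored row-major in flat arrays.
   jj and kk select a b-by-b tile of B; i, j and k work inside that tile,
   so the tile is reused for every row i before it leaves the cache. */
void matmul_blocked(int L, int M, int N, int b,
                    const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < M; jj += b)
        for (int kk = 0; kk < N; kk += b)
            for (int i = 0; i < L; i++)
                for (int j = jj; j < min_int(jj + b, M); j++) {
                    double r = 0.0;    /* partial dot product for this tile */
                    for (int k = kk; k < min_int(kk + b, N); k++)
                        r += A[i * N + k] * B[k * M + j];
                    C[i * M + j] += r; /* one C access per b multiplications */
                }
}
```

C must be zeroed by the caller if a plain product is wanted. The `min_int` bounds let the sizes be arbitrary, at the cost of partial tiles at the edges.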

  42. Loop Blocking
        for (k = 0; k < N; k += 4)    /* assumes N is a multiple of 4 */
          for (i = 0; i < L; i++)
            for (j = 0; j < M; j++)
              C[i][j] += A[i][k]   * B[k][j]
                       + A[i][k+1] * B[k+1][j]
                       + A[i][k+2] * B[k+2][j]
                       + A[i][k+3] * B[k+3][j];

                  C       A      B
      accesses    LMN/4   LN     LMN
      misses      LMN/16  LN/4   LMN/4
      Total misses = LN(5M/4+1)/4

  43. Reducing Miss Penalty * Miss Rate
      • Non-blocking cache
      • Write allocate with no fetch
      • Hardware prefetching
      • Compiler controlled prefetching

  44. Non-blocking Cache
      In OOO (Out of Order) processor
      • Hit under a miss
        • complexity of cache controller increases
      • Hit under multiple misses or miss under a miss
        • memory should be able to handle multiple misses

  45. Write allocate with no fetch
      Write miss in a Write Through cache:
      • Allocate a block in cache
      • Fetch contents of block from memory
      • Write into cache
      • Write into memory
      • reduces miss rate (WTWA)
      • increases miss penalty (avoid for clustered writes)

  46. Hardware Prefetching
      • Prefetch items before they are requested
        • both data and instructions
      • What and when to prefetch?
        • fetch two blocks on a miss (requested + next)
      • Where to keep prefetched information?
        • in cache
        • in a separate buffer (most common case)

  47. Prefetch Buffer/Stream Buffer
      [Figure: cache and prefetch buffer, with paths labelled "to proc" and "from mem"]

  48. Hardware prefetching: Stream buffers
      Jouppi's experiment [1990]:
      • A single instruction stream buffer catches 15% to 25% of the misses from a 4 KB direct mapped instruction cache with 16 byte blocks
      • 4-block buffer: 50%, 16-block buffer: 72%
      • A single data stream buffer catches 25% of the misses from a 4 KB direct mapped cache
      • 4 data stream buffers (each prefetching at a different address): 43%

  49. HW prefetching: UltraSPARC III example
      • 64 KB data cache, 36.9 misses per 1000 instructions
      • 22% of instructions make a data reference
      • hit time = 1, miss penalty = 15
      • prefetch hit rate = 20%
      • 1 cycle to get data from prefetch buffer
      What size of cache will give the same performance?
      • miss rate = 36.9/220 = 16.8%
      • av mem access time = 1 + (.1677 * .2 * 1) + (.1677 * .8 * 15) = 3.046
      • effective miss rate = (3.046 - 1)/15 = 13.6% => 256 KB cache
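
The arithmetic of this example can be replayed with a minimal C sketch. The function names are illustrative; the model charges one extra cycle for a miss served by the prefetch buffer and the full miss penalty otherwise.

```c
/* Average memory access time when a fraction of the cache misses
   is caught by a prefetch buffer. */
double amat_with_prefetch(double hit_time, double miss_rate,
                          double prefetch_hit_rate,
                          double prefetch_time, double miss_penalty)
{
    return hit_time
         + miss_rate * prefetch_hit_rate * prefetch_time
         + miss_rate * (1.0 - prefetch_hit_rate) * miss_penalty;
}

/* Miss rate a cache without prefetching would need
   to reach the same average access time. */
double equivalent_miss_rate(double amat, double hit_time, double miss_penalty)
{
    return (amat - hit_time) / miss_penalty;
}
```

With a miss rate of 36.9/220, a 20% prefetch hit rate and a 15-cycle penalty, the access time comes to about 3.046 cycles and the equivalent miss rate to about 13.6%.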

  50. Compiler Controlled Prefetching
      • Register prefetch / Cache prefetch
      • Faulting / non-faulting (non-binding)
      • Semantically invisible (no change in registers or memory contents)
      • Makes sense if processor doesn't stall while prefetching (non-blocking cache)
      • Overhead of prefetch instruction should not exceed the benefit
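
A non-faulting cache prefetch can be sketched in C with the `__builtin_prefetch` builtin. This relies on a GCC/Clang extension rather than standard C, and the function name and the prefetch distance of 16 elements are guesses made for illustration, not tuned values.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* elements ahead; a guess, not a measured value */

/* Sum an array while asking for data a few iterations ahead.
   The prefetch changes no register or memory contents that the
   program can observe, so the result is the same as a plain sum. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)                 /* stay inside the array */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);  /* 0 = read */
        s += a[i];
    }
    return s;
}
```

The bounds check and the prefetch itself cost instructions on every iteration, which is the overhead the last bullet warns about; issuing one prefetch per cache block rather than per element would reduce it.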
