ECE729 : Advanced Computer Architecture Lectures 20-23 : Memory Hierarchy -Cache Performance 2nd – 12th March, 2010 Anshul Kumar, CSE IITD
Lecture 20 2nd March, 2010 Anshul Kumar, CSE IITD
Performance
Time to access cache = t1 (usually 1 CPU cycle)
Time to access main memory = t2 (one order of magnitude higher than t1)
Hit probability (hit ratio or hit rate) = h
Miss probability (miss ratio or miss rate) = m = 1 - h
Time spent when hit occurs = t1 (Hit time)
Time spent when miss occurs = t1 + t2 (t2 = Miss penalty)
Teff = h * t1 + m * (t1 + t2) = t1 + m * t2
Anshul Kumar, CSE IITD
Performance contd.
Teff = t1 + m * t2, where m * t2 = Mem stalls / access
Average memory access time = Hit time + Miss rate * Miss penalty
Program execution time = IC * Cycle time * (CPI_exec + Mem stalls / instr)
Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr
Miss penalty in an OOO processor = Total miss latency - Overlapped miss latency
Anshul Kumar, CSE IITD
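As a quick illustration, the formulas above can be combined in a few lines of C. This is only a sketch; the numeric values are assumptions chosen for illustration, not values from the lecture.

#include <stdio.h>

int main(void) {
    /* Illustrative values (assumptions, not from the slides) */
    double t1 = 1.0;          /* hit time: cache access, in cycles    */
    double t2 = 100.0;        /* miss penalty: main memory, in cycles */
    double miss_rate = 0.02;  /* m = 1 - h                            */

    /* Teff = t1 + m * t2  (average memory access time) */
    double teff = t1 + miss_rate * t2;

    /* Memory stalls per instruction and effective CPI */
    double accesses_per_instr = 1.3;   /* assumption */
    double cpi_exec = 1.0;             /* assumption */
    double stalls_per_instr = miss_rate * t2 * accesses_per_instr;
    double cpi_eff = cpi_exec + stalls_per_instr;

    printf("Teff = %.2f cycles, CPI_eff = %.2f\n", teff, cpi_eff);
    return 0;
}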
Transferring blocks to/from memory (diagram): (a) one-word-wide memory, (b) four-word-wide memory, (c) interleaved memory with four banks; in each case CPU, cache and memory are connected over a bus. Anshul Kumar, CSE IITD
Miss penalty example
• 1 clock cycle to send address
• 15 cycles for RAM access
• 1 cycle for sending data
• block size = 4 words
Miss penalty:
case (a): 4 * (1 + 15 + 1) = 68, or 1 + 4 * (15 + 1) = 65 if the address is sent only once
case (b): 1 + (15 + 1) = 17
case (c): 1 + 15 + 4 * 1 = 20
Anshul Kumar, CSE IITD
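The three cases can be reproduced from the stated parameters with a short C sketch (illustrative only; the variable names are mine, not from the slides):

#include <stdio.h>

int main(void) {
    int addr = 1, ram = 15, xfer = 1;   /* cycles: send address, RAM access, send one word */
    int block = 4;                      /* block size in words */
    int banks = 4;                      /* banks in the interleaved organization */

    /* (a) one-word-wide memory: one full access per word
       (68 if the address is resent each time, 65 if it is sent only once) */
    int a1 = block * (addr + ram + xfer);
    int a2 = addr + block * (ram + xfer);

    /* (b) four-word-wide memory: one access fetches the whole block */
    int b = addr + ram + xfer;

    /* (c) interleaved memory: banks overlap their RAM access,
       then transfer one word per cycle over the one-word bus */
    int c = addr + ram + banks * xfer;

    printf("(a) %d or %d  (b) %d  (c) %d\n", a1, a2, b, c);  /* 68 or 65, 17, 20 */
    return 0;
}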
DRAM with page mode • Memory cells are organized as a 2-D structure • Entire row is accessed at a time internally and kept in a buffer • Reading multiple bits from a row can be done very fast • sequentially, without giving address again • randomly, giving only the column addresses Anshul Kumar, CSE IITD
Performance analysis example
CPI_eff = CPI + Miss rate * Miss penalty * Mem accesses / instr
CPI = 1.2
Miss rate = 0.5%
Block size = 16 words
Miss penalty = ?
Mem accesses / instr = 1 (assumption)
Anshul Kumar, CSE IITD
Miss penalty calculation
Data / address transfer time = 1 cycle
Memory latency = 10 cycles
a) one-word-wide memory: Miss penalty = 16 * (1 + 10 + 1) = 192
b) four-word-wide memory: Miss penalty = 4 * (1 + 10 + 1) = 48
c) four-way interleaved memory: Miss penalty = 4 * (1 + 10 + 4 * 1) = 60
Anshul Kumar, CSE IITD
Back to CPI calculation
CPI_eff = 1.2 + .005 * miss penalty * 1.0
a) 1.2 + .005 * 192 * 1.0 = 1.2 + .96 = 2.16
b) 1.2 + .005 * 48 * 1.0 = 1.2 + .24 = 1.44
c) 1.2 + .005 * 60 * 1.0 = 1.2 + .30 = 1.50
Anshul Kumar, CSE IITD
Performance Improvement • Reducing miss penalty • Reducing miss rate • Reducing miss penalty * miss rate • Reducing hit time Anshul Kumar, CSE IITD
Reducing Miss Penalty • Multi level caches • Critical word first and early restart • Write Through policy • Giving priority to read misses over write • Merging write buffer • Victim caches Anshul Kumar, CSE IITD
Multi Level Caches
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
Multi level inclusion and multi level exclusion
Anshul Kumar, CSE IITD
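A minimal C sketch of the two-level formula; the parameter values are assumed for illustration and are not taken from this slide:

#include <stdio.h>

int main(void) {
    /* Illustrative parameters (assumptions) */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.40;   /* local L2 miss rate */
    double mem_penalty = 100.0;

    /* L1 miss penalty is the average time to service a request in L2 and below */
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("L1 miss penalty = %.1f cycles, AMAT = %.1f cycles\n",
           miss_penalty_l1, amat);
    return 0;
}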
Misses in Multilevel Cache • Local Miss rate • no. of misses / no. of requests, as seen at a level • Global Miss rate • no. of misses / no. of requests, on the whole • Solo Miss rate • miss rate if only this cache was present Anshul Kumar, CSE IITD
Two level cache miss example
A: hit in L1, would hit in L2 (L1, L2)
B: miss in L1, hit in L2 (~L1, L2)
C: hit in L1, would miss in L2 (L1, ~L2)
D: miss in L1, miss in L2 (~L1, ~L2)
Local miss (L1) = (B+D)/(A+B+C+D)
Local miss (L2) = D/(B+D)
Global miss = D/(A+B+C+D)
Solo miss (L2) = (C+D)/(A+B+C+D)
Anshul Kumar, CSE IITD
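The four ratios follow directly from the counts A, B, C, D; here is a small C sketch with arbitrary example counts (the counts are assumptions, chosen only to exercise the formulas):

#include <stdio.h>

int main(void) {
    /* Reference counts per category (arbitrary example values) */
    double A = 900.0;  /* hit in L1, would hit in L2  */
    double B = 60.0;   /* miss in L1, hit in L2       */
    double C = 30.0;   /* hit in L1, would miss in L2 */
    double D = 10.0;   /* miss in L1 and in L2        */
    double total = A + B + C + D;

    printf("Local  miss (L1) = %.3f\n", (B + D) / total);
    printf("Local  miss (L2) = %.3f\n", D / (B + D));
    printf("Global miss      = %.3f\n", D / total);
    printf("Solo   miss (L2) = %.3f\n", (C + D) / total);
    return 0;
}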
Multi-level cache example
CPI with no miss = 1.0
Clock = 500 MHz (cycle time = 2 ns)
Main mem access time = 200 ns
Miss rate = 5%
Adding an L2 cache (20 ns access) reduces the miss rate to 2%. Find the performance improvement.
Miss penalty (mem) = 200 ns / 2 ns = 100 cycles
Effective CPI with L1 only = 1 + 5% * 100 = 6
Anshul Kumar, CSE IITD
Example continued
Miss penalty (L2) = 20 ns / 2 ns = 10 cycles
Total CPI = Base CPI + stalls due to L1 miss + stalls due to L2 miss
= 1.0 + 5% * 10 + 2% * 100 = 1.0 + 0.5 + 2.0 = 3.5
Performance ratio = 6.0 / 3.5 = 1.7
Anshul Kumar, CSE IITD
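The same calculation, written out as a C sketch using the numbers from the example:

#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double l1_miss = 0.05, l2_miss = 0.02;       /* L1 local rate, L2 global rate */
    double l2_penalty = 10.0, mem_penalty = 100.0;

    double cpi_l1_only = base_cpi + l1_miss * mem_penalty;              /* 6.0 */
    double cpi_l1_l2   = base_cpi + l1_miss * l2_penalty
                                  + l2_miss * mem_penalty;               /* 3.5 */

    printf("CPI (L1 only) = %.1f, CPI (L1+L2) = %.1f, speedup = %.2f\n",
           cpi_l1_only, cpi_l1_l2, cpi_l1_only / cpi_l1_l2);             /* ~1.7 */
    return 0;
}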
Lecture 21 3rd March, 2010 Anshul Kumar, CSE IITD
Critical Word First and Early Restart • Read policy • initiate memory access along with cache access in anticipation of a miss • forward data to CPU as it gets filled in cache • Load policy • wrap around load More effective when block size is large Anshul Kumar, CSE IITD
Write Through Policy • Write Through Policy reduces block traffic at the cost of • increased word traffic • increased miss rate Anshul Kumar, CSE IITD
Read Miss Priority Over Write • Provide write buffers • Processor writes into buffer and proceeds (for write through as well as write back) On read miss • wait for buffer to be empty, or • check addresses in buffer for conflict Anshul Kumar, CSE IITD
Merging Write Buffer Merge writes belonging to same block in case of write through Anshul Kumar, CSE IITD
Victim Cache (proposed by Jouppi)
• Evicted blocks are recycled
• Much faster than getting a block from the next level
• Size = a few blocks only
• A significant fraction of misses may be found in the victim cache
(Diagram: the victim cache sits next to the cache, between memory and the processor, holding recently evicted blocks.)
Anshul Kumar, CSE IITD
Reducing Miss Rate • Large block size • Larger cache • LRU replacement • WB, WTWA write policies • Higher associativity • Way prediction, pseudo-associative cache • Warm start in multi-tasking • Compiler optimizations Anshul Kumar, CSE IITD
Large Block Size • Reduces compulsory misses • Too large block size - misses increase • Miss Penalty increases Anshul Kumar, CSE IITD
Large Cache • Reduces capacity misses • Hit time increases • Keep small L1 cache and large L2 cache Anshul Kumar, CSE IITD
LRU Replacement Policy • Choice of replacement policy • optimal: replace the block whose next reference is farthest away in the future • practical (LRU): replace the block whose last reference is farthest away in the past Anshul Kumar, CSE IITD
WB, WTWA Policies • WB and WTWA write policies tend to reduce the miss rate as compared to WTNWA at the cost of • increased block traffic Anshul Kumar, CSE IITD
Higher Associativity • Reduces conflict misses • 8-way is almost like fully associative • Hit time increases Anshul Kumar, CSE IITD
Associative cache example
Cache  Mapping     Block size  I-miss  D-miss  CPI
1      direct      1 word      4%      8%      2.0
2      direct      4 words     2%      5%      ??
3      2-way s.a.  4 words     2%      4%      ??
Miss penalty = 6 + block size (in words)
50% of instructions have a data reference
Stall cycles:
cache 1: 7 * (.04 + .08 * .5) = .56
cache 2: 10 * (.02 + .05 * .5) = .45
cache 3: 10 * (.02 + .04 * .5) = .40
Anshul Kumar, CSE IITD
Example continued
Cache  CPI                       Clock period  Time / instr
1      2.0                       2.0           4.0
2      2.0 - .56 + .45 = 1.89    2.0           3.78
3      2.0 - .56 + .40 = 1.84    2.4           4.416
Anshul Kumar, CSE IITD
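The stall cycles and time-per-instruction figures can be reproduced with a short C sketch (a verification aid, not part of the original example):

#include <stdio.h>

int main(void) {
    /* Parameters from the example */
    double data_ref_per_instr = 0.5;
    double i_miss[3]  = {0.04, 0.02, 0.02};
    double d_miss[3]  = {0.08, 0.05, 0.04};
    double penalty[3] = {7.0, 10.0, 10.0};     /* 6 + block size in words */
    double clock[3]   = {2.0, 2.0, 2.4};       /* clock period per design */
    double base_cpi   = 2.0 - 0.56;            /* CPI of cache 1 minus its memory stalls */

    for (int i = 0; i < 3; i++) {
        double stalls = penalty[i] * (i_miss[i] + d_miss[i] * data_ref_per_instr);
        double cpi    = base_cpi + stalls;
        printf("cache %d: stalls=%.2f CPI=%.2f time/instr=%.3f\n",
               i + 1, stalls, cpi, cpi * clock[i]);
    }
    return 0;
}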
Way Prediction and Pseudo-associative Cache Way prediction: low miss rate of SA cache with hit time of DM cache • Only one tag is compared initially • Extra bits are kept for prediction • Hit time in case of mis-prediction is high Pseudo-assoc. or column assoc. cache: get advantage of SA cache in a DM cache • Check sequentially in a pseudo-set • Fast hit and slow hit Anshul Kumar, CSE IITD
Warm Start in Multi-tasking • Cold start • process starts with empty cache • blocks of previous process invalidated • Warm start • some blocks from previous activation are still available Anshul Kumar, CSE IITD
Lecture 22 6th March, 2010 Anshul Kumar, CSE IITD
Compiler optimizations Loop interchange • Improve spatial locality by scanning arrays row-wise Blocking • Improve temporal and spatial locality Anshul Kumar, CSE IITD
Improving Locality Matrix Multiplication example Anshul Kumar, CSE IITD
Cache Organization for the example • Cache line (or block) = 4 matrix elements. • Matrices are stored row wise. • Cache can’t accommodate a full row/column. (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.) • Ignore misses due to conflict between matrices. (as if there was a separate cache for each matrix.) Anshul Kumar, CSE IITD
Matrix Multiplication : Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LM     LMN    LMN
misses     LM/4   LMN/4  LMN
Total misses = LM(5N+1)/4
Anshul Kumar, CSE IITD
Matrix Multiplication : Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LMN    LN     LMN
misses     LMN/4  LN     LMN/4
Total misses = LN(2M+4)/4
Anshul Kumar, CSE IITD
Matrix Multiplication : Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LMN    LN     LMN
misses     LMN/4  LN/4   LMN/4
Total misses = LN(2M+1)/4
Anshul Kumar, CSE IITD
Blocking
5 nested loops (blocking over jj and kk), blocking factor = b
           C        A        B
accesses   LMN/b    LMN      LMN
misses     LMN/4b   LMN/4b   MN/4
Total misses = MN(2L/b+1)/4
(Diagram: C, A and B traversed in b-wide blocks along the j and k dimensions.)
Anshul Kumar, CSE IITD
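One possible C rendering of the blocked (tiled) multiplication with the five nested loops and blocking factor b sketched above; this is an illustrative sketch that assumes M and N are multiples of b:

/* C = C + A*B, blocked over j and k with blocking factor b.
   Assumes M and N are multiples of b (simplifying assumption). */
void matmul_blocked(int L, int M, int N, int b,
                    double A[L][N], double B[N][M], double C[L][M]) {
    for (int jj = 0; jj < M; jj += b)            /* block of columns of C and B */
        for (int kk = 0; kk < N; kk += b)        /* block of the k dimension    */
            for (int i = 0; i < L; i++)
                for (int k = kk; k < kk + b; k++)
                    for (int j = jj; j < jj + b; j++)
                        C[i][j] += A[i][k] * B[k][j];
}

Within one (jj, kk) block, the b-by-b tile of B stays resident in the cache while all L rows are processed, which is what reduces the B misses from LMN to MN/4 in the table above.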
Loop Blocking
for (k = 0; k < N; k += 4)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k]   * B[k][j]
               + A[i][k+1] * B[k+1][j]
               + A[i][k+2] * B[k+2][j]
               + A[i][k+3] * B[k+3][j];
           C        A      B
accesses   LMN/4    LN     LMN
misses     LMN/16   LN/4   LMN/4
Total misses = LN(5M/4+1)/4
Anshul Kumar, CSE IITD
Reducing Miss Penalty * Miss Rate • Non-blocking cache • Write allocate with no fetch • Hardware prefetching • Compiler controlled prefetching Anshul Kumar, CSE IITD
Non-blocking Cache In OOO (Out of Order) processor • Hit under a miss • complexity of cache controller increases • Hit under multiple misses or miss under a miss • memory should be able to handle multiple misses Anshul Kumar, CSE IITD
Write allocate with no fetch
Write miss in a Write Through Write Allocate (WTWA) cache:
• Allocate a block in cache
• Fetch contents of block from memory
• Write into cache
• Write into memory
Write allocate reduces miss rate (WTWA) but the fetch increases miss penalty; the fetch can be skipped ("no fetch"), which helps when writes to a block are clustered.
Anshul Kumar, CSE IITD
Hardware Prefetching • Prefetch items before they are requested • both data and instructions • What and when to prefetch? • fetch two blocks on a miss (requested+next) • Where to keep prefetched information? • in cache • in a separate buffer (most common case) Anshul Kumar, CSE IITD
Prefetch Buffer / Stream Buffer
(Diagram: the prefetch buffer sits alongside the cache, between memory and the processor; prefetched blocks are placed in the buffer rather than in the cache.)
Anshul Kumar, CSE IITD
Hardware prefetching: Stream buffers
Jouppi's experiment [1990]:
• A single instruction stream buffer catches 15% to 25% of misses from a 4 KB direct-mapped instruction cache with 16-byte blocks
• 4-block buffer: 50%; 16-block buffer: 72%
• A single data stream buffer catches 25% of misses from a 4 KB direct-mapped cache
• 4 data stream buffers (each prefetching at a different address): 43%
Anshul Kumar, CSE IITD
HW prefetching: UltraSPARC III example
64 KB data cache, 36.9 misses per 1000 instructions
22% of instructions make a data reference
hit time = 1, miss penalty = 15
prefetch hit rate = 20%
1 cycle to get data from the prefetch buffer
What size of cache will give the same performance?
miss rate = 36.9 / 220 = 16.7%
av. mem access time = 1 + (.167 * .2 * 1) + (.167 * .8 * 15) = 3.046
effective miss rate = (3.046 - 1) / 15 = 13.6%  =>  256 KB cache
Anshul Kumar, CSE IITD
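The arithmetic of this example as a C sketch:

#include <stdio.h>

int main(void) {
    double misses_per_1000    = 36.9;
    double data_refs_per_1000 = 220.0;   /* 22% of instructions */
    double hit_time = 1.0, miss_penalty = 15.0;
    double prefetch_hit = 0.20, prefetch_time = 1.0;

    double miss_rate = misses_per_1000 / data_refs_per_1000;            /* ~0.167 */
    double amat = hit_time
                + miss_rate * prefetch_hit * prefetch_time
                + miss_rate * (1.0 - prefetch_hit) * miss_penalty;      /* ~3.05  */
    /* Miss rate a cache without prefetching would need for the same AMAT */
    double equiv_miss_rate = (amat - hit_time) / miss_penalty;          /* ~0.136 */

    printf("miss rate=%.3f AMAT=%.3f equivalent miss rate=%.3f\n",
           miss_rate, amat, equiv_miss_rate);
    return 0;
}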
Compiler Controlled Prefetching • Register prefetch / Cache prefetch • Faulting / non-faulting (non-binding) • Semantically invisible (no change in registers or memory contents) • Makes sense if processor doesn’t stall while prefetching (non-blocking cache) • Overhead of prefetch instruction should not exceed the benefit Anshul Kumar, CSE IITD