ECE729 : Advanced Computer Architecture Lectures 20-23 : Memory Hierarchy -Cache Performance 2nd – 12th March, 2010 Anshul Kumar, CSE IITD
Lecture 20 2nd March, 2010 Anshul Kumar, CSE IITD
Performance
Time to access cache = t1 (usually 1 CPU cycle)
Time to access main memory = t2 (one order of magnitude higher than t1)
Hit probability (hit ratio or hit rate) = h
Miss probability (miss ratio or miss rate) = m = 1 - h
Time spent when hit occurs = t1 (Hit time)
Time spent when miss occurs = t1 + t2 (t2 = Miss penalty)
Teff = h * t1 + m * (t1 + t2) = t1 + m * t2
Anshul Kumar, CSE IITD
Performance contd.
Teff = t1 + m * t2, where m * t2 = Mem stalls / access
Average memory access time = Hit time + Miss rate * Miss penalty
Program execution time = IC * Cycle time * (CPI_exec + Mem stalls / instr)
Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr
Miss penalty in an OOO processor = Total miss latency - Overlapped miss latency
Anshul Kumar, CSE IITD
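As a quick illustration, the formulas above can be combined in a few lines of C. This is only a sketch; the numeric values are assumptions chosen for illustration, not values from the lecture.

#include <stdio.h>

int main(void) {
    /* Illustrative values (assumptions, not from the slides) */
    double t1 = 1.0;          /* hit time: cache access, in cycles    */
    double t2 = 100.0;        /* miss penalty: main memory, in cycles */
    double miss_rate = 0.02;  /* m = 1 - h                            */

    /* Teff = t1 + m * t2  (average memory access time) */
    double teff = t1 + miss_rate * t2;

    /* Memory stalls per instruction and effective CPI */
    double accesses_per_instr = 1.3;   /* assumption */
    double cpi_exec = 1.0;             /* assumption */
    double stalls_per_instr = miss_rate * t2 * accesses_per_instr;
    double cpi_eff = cpi_exec + stalls_per_instr;

    printf("Teff = %.2f cycles, CPI_eff = %.2f\n", teff, cpi_eff);
    return 0;
}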
Transferring blocks to/from memory (diagram): (a) one-word-wide memory, (b) four-word-wide memory, (c) interleaved memory with four banks; in each case CPU, cache and memory are connected over a bus. Anshul Kumar, CSE IITD
Miss penalty example
• 1 clock cycle to send address
• 15 cycles for RAM access
• 1 cycle for sending data
• block size = 4 words
Miss penalty:
case (a): 4 * (1 + 15 + 1) = 68, or 1 + 4 * (15 + 1) = 65 if the address is sent only once
case (b): 1 + (15 + 1) = 17
case (c): 1 + 15 + 4 * 1 = 20
Anshul Kumar, CSE IITD
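The three cases can be reproduced from the stated parameters with a short C sketch (illustrative only; the variable names are mine, not from the slides):

#include <stdio.h>

int main(void) {
    int addr = 1, ram = 15, xfer = 1;   /* cycles: send address, RAM access, send one word */
    int block = 4;                      /* block size in words */
    int banks = 4;                      /* banks in the interleaved organization */

    /* (a) one-word-wide memory: one full access per word
       (68 if the address is resent each time, 65 if it is sent only once) */
    int a1 = block * (addr + ram + xfer);
    int a2 = addr + block * (ram + xfer);

    /* (b) four-word-wide memory: one access fetches the whole block */
    int b = addr + ram + xfer;

    /* (c) interleaved memory: banks overlap their RAM access,
       then transfer one word per cycle over the one-word bus */
    int c = addr + ram + banks * xfer;

    printf("(a) %d or %d  (b) %d  (c) %d\n", a1, a2, b, c);  /* 68 or 65, 17, 20 */
    return 0;
}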
DRAM with page mode • Memory cells are organized as a 2-D structure • Entire row is accessed at a time internally and kept in a buffer • Reading multiple bits from a row can be done very fast • sequentially, without giving address again • randomly, giving only the column addresses Anshul Kumar, CSE IITD
Performance analysis example
CPI_eff = CPI + Miss rate * Miss penalty * Mem accesses / instr
CPI = 1.2
Miss rate = 0.5%
Block size = 16 words
Miss penalty = ?
Mem accesses / instr = 1 (assumption)
Anshul Kumar, CSE IITD
Miss penalty calculation
Data / address transfer time = 1 cycle
Memory latency = 10 cycles
a) one-word-wide memory: Miss penalty = 16 * (1 + 10 + 1) = 192
b) four-word-wide memory: Miss penalty = 4 * (1 + 10 + 1) = 48
c) four-way interleaved memory: Miss penalty = 4 * (1 + 10 + 4 * 1) = 60
Anshul Kumar, CSE IITD
Back to CPI calculation
CPI_eff = 1.2 + .005 * miss penalty * 1.0
a) 1.2 + .005 * 192 * 1.0 = 1.2 + .96 = 2.16
b) 1.2 + .005 * 48 * 1.0 = 1.2 + .24 = 1.44
c) 1.2 + .005 * 60 * 1.0 = 1.2 + .30 = 1.50
Anshul Kumar, CSE IITD
Performance Improvement • Reducing miss penalty • Reducing miss rate • Reducing miss penalty * miss rate • Reducing hit time Anshul Kumar, CSE IITD
Reducing Miss Penalty • Multi level caches • Critical word first and early restart • Write Through policy • Giving priority to read misses over write • Merging write buffer • Victim caches Anshul Kumar, CSE IITD
Multi Level Caches
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
Multi level inclusion and multi level exclusion
Anshul Kumar, CSE IITD
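A minimal C sketch of the two-level formula; the parameter values are assumed for illustration and are not taken from this slide:

#include <stdio.h>

int main(void) {
    /* Illustrative parameters (assumptions) */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.40;   /* local L2 miss rate */
    double mem_penalty = 100.0;

    /* L1 miss penalty is the average time to service a request in L2 and below */
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("L1 miss penalty = %.1f cycles, AMAT = %.1f cycles\n",
           miss_penalty_l1, amat);
    return 0;
}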
Misses in Multilevel Cache • Local Miss rate • no. of misses / no. of requests, as seen at a level • Global Miss rate • no. of misses / no. of requests, on the whole • Solo Miss rate • miss rate if only this cache was present Anshul Kumar, CSE IITD
Two level cache miss example
A: hit in L1, would hit in L2 (L1, L2)
B: miss in L1, hit in L2 (~L1, L2)
C: hit in L1, would miss in L2 (L1, ~L2)
D: miss in L1, miss in L2 (~L1, ~L2)
Local miss (L1) = (B+D)/(A+B+C+D)
Local miss (L2) = D/(B+D)
Global miss = D/(A+B+C+D)
Solo miss (L2) = (C+D)/(A+B+C+D)
Anshul Kumar, CSE IITD
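The four ratios follow directly from the counts A, B, C, D; here is a small C sketch with arbitrary example counts (the counts are assumptions, chosen only to exercise the formulas):

#include <stdio.h>

int main(void) {
    /* Reference counts per category (arbitrary example values) */
    double A = 900.0;  /* hit in L1, would hit in L2  */
    double B = 60.0;   /* miss in L1, hit in L2       */
    double C = 30.0;   /* hit in L1, would miss in L2 */
    double D = 10.0;   /* miss in L1 and in L2        */
    double total = A + B + C + D;

    printf("Local  miss (L1) = %.3f\n", (B + D) / total);
    printf("Local  miss (L2) = %.3f\n", D / (B + D));
    printf("Global miss      = %.3f\n", D / total);
    printf("Solo   miss (L2) = %.3f\n", (C + D) / total);
    return 0;
}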
Multi-level cache example
CPI with no miss = 1.0
Clock = 500 MHz (cycle time = 2 ns)
Main mem access time = 200 ns
Miss rate = 5%
Adding an L2 cache (20 ns access) reduces the miss rate to 2%. Find the performance improvement.
Miss penalty (mem) = 200 ns / 2 ns = 100 cycles
Effective CPI with L1 only = 1 + 5% * 100 = 6
Anshul Kumar, CSE IITD
Example continued
Miss penalty (L2) = 20 ns / 2 ns = 10 cycles
Total CPI = Base CPI + stalls due to L1 miss + stalls due to L2 miss
= 1.0 + 5% * 10 + 2% * 100 = 1.0 + 0.5 + 2.0 = 3.5
Performance ratio = 6.0 / 3.5 = 1.7
Anshul Kumar, CSE IITD
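The same calculation, written out as a C sketch using the numbers from the example:

#include <stdio.h>

int main(void) {
    double base_cpi = 1.0;
    double l1_miss = 0.05, l2_miss = 0.02;       /* L1 local rate, L2 global rate */
    double l2_penalty = 10.0, mem_penalty = 100.0;

    double cpi_l1_only = base_cpi + l1_miss * mem_penalty;              /* 6.0 */
    double cpi_l1_l2   = base_cpi + l1_miss * l2_penalty
                                  + l2_miss * mem_penalty;               /* 3.5 */

    printf("CPI (L1 only) = %.1f, CPI (L1+L2) = %.1f, speedup = %.2f\n",
           cpi_l1_only, cpi_l1_l2, cpi_l1_only / cpi_l1_l2);             /* ~1.7 */
    return 0;
}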
Lecture 21 3rd March, 2010 Anshul Kumar, CSE IITD
Critical Word First and Early Restart • Read policy • initiate memory access along with cache access in anticipation of a miss • forward data to CPU as it gets filled in cache • Load policy • wrap around load More effective when block size is large Anshul Kumar, CSE IITD
Write Through Policy • Write Through Policy reduces block traffic at the cost of • increased word traffic • increased miss rate Anshul Kumar, CSE IITD
Read Miss Priority Over Write • Provide write buffers • Processor writes into buffer and proceeds (for write through as well as write back) On read miss • wait for buffer to be empty, or • check addresses in buffer for conflict Anshul Kumar, CSE IITD
Merging Write Buffer Merge writes belonging to same block in case of write through Anshul Kumar, CSE IITD
Victim Cache (proposed by Jouppi)
• Evicted blocks are recycled
• Much faster than getting a block from the next level
• Size = a few blocks only
• A significant fraction of misses may be found in the victim cache
(Diagram: the victim cache sits next to the cache, between memory and the processor, holding recently evicted blocks.)
Anshul Kumar, CSE IITD
Reducing Miss Rate • Large block size • Larger cache • LRU replacement • WB, WTWA write policies • Higher associativity • Way prediction, pseudo-associative cache • Warm start in multi-tasking • Compiler optimizations Anshul Kumar, CSE IITD
Large Block Size • Reduces compulsory misses • Too large block size - misses increase • Miss Penalty increases Anshul Kumar, CSE IITD
Large Cache • Reduces capacity misses • Hit time increases • Keep small L1 cache and large L2 cache Anshul Kumar, CSE IITD
LRU Replacement Policy • Choice of replacement policy • optimal: replace the block whose next reference is farthest away in the future • practical (LRU): replace the block whose last reference is farthest away in the past Anshul Kumar, CSE IITD
WB, WTWA Policies • WB and WTWA write policies tend to reduce the miss rate as compared to WTNWA at the cost of • increased block traffic Anshul Kumar, CSE IITD
Higher Associativity • Reduces conflict misses • 8-way is almost like fully associative • Hit time increases Anshul Kumar, CSE IITD
Associative cache example
Cache  Mapping     Block size  I-miss  D-miss  CPI
1      direct      1 word      4%      8%      2.0
2      direct      4 words     2%      5%      ??
3      2-way s.a.  4 words     2%      4%      ??
Miss penalty = 6 + block size (in words)
50% of instructions have a data reference
Stall cycles:
cache 1: 7 * (.04 + .08 * .5) = .56
cache 2: 10 * (.02 + .05 * .5) = .45
cache 3: 10 * (.02 + .04 * .5) = .40
Anshul Kumar, CSE IITD
Example continued
Cache  CPI                       Clock period  Time / instr
1      2.0                       2.0           4.0
2      2.0 - .56 + .45 = 1.89    2.0           3.78
3      2.0 - .56 + .40 = 1.84    2.4           4.416
Anshul Kumar, CSE IITD
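The stall cycles and time-per-instruction figures can be reproduced with a short C sketch (a verification aid, not part of the original example):

#include <stdio.h>

int main(void) {
    /* Parameters from the example */
    double data_ref_per_instr = 0.5;
    double i_miss[3]  = {0.04, 0.02, 0.02};
    double d_miss[3]  = {0.08, 0.05, 0.04};
    double penalty[3] = {7.0, 10.0, 10.0};     /* 6 + block size in words */
    double clock[3]   = {2.0, 2.0, 2.4};       /* clock period per design */
    double base_cpi   = 2.0 - 0.56;            /* CPI of cache 1 minus its memory stalls */

    for (int i = 0; i < 3; i++) {
        double stalls = penalty[i] * (i_miss[i] + d_miss[i] * data_ref_per_instr);
        double cpi    = base_cpi + stalls;
        printf("cache %d: stalls=%.2f CPI=%.2f time/instr=%.3f\n",
               i + 1, stalls, cpi, cpi * clock[i]);
    }
    return 0;
}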
Way Prediction and Pseudo-associative Cache Way prediction: low miss rate of SA cache with hit time of DM cache • Only one tag is compared initially • Extra bits are kept for prediction • Hit time in case of mis-prediction is high Pseudo-assoc. or column assoc. cache: get advantage of SA cache in a DM cache • Check sequentially in a pseudo-set • Fast hit and slow hit Anshul Kumar, CSE IITD
Warm Start in Multi-tasking • Cold start • process starts with empty cache • blocks of previous process invalidated • Warm start • some blocks from previous activation are still available Anshul Kumar, CSE IITD
Lecture 22 6th March, 2010 Anshul Kumar, CSE IITD
Compiler optimizations Loop interchange • Improve spatial locality by scanning arrays row-wise Blocking • Improve temporal and spatial locality Anshul Kumar, CSE IITD
Improving Locality Matrix Multiplication example Anshul Kumar, CSE IITD
Cache Organization for the example • Cache line (or block) = 4 matrix elements. • Matrices are stored row wise. • Cache can’t accommodate a full row/column. (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.) • Ignore misses due to conflict between matrices. (as if there was a separate cache for each matrix.) Anshul Kumar, CSE IITD
Matrix Multiplication : Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LM     LMN    LMN
misses     LM/4   LMN/4  LMN
Total misses = LM(5N+1)/4
Anshul Kumar, CSE IITD
Matrix Multiplication : Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LMN    LN     LMN
misses     LMN/4  LN     LMN/4
Total misses = LN(2M+4)/4
Anshul Kumar, CSE IITD
Matrix Multiplication : Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];
           C      A      B
accesses   LMN    LN     LMN
misses     LMN/4  LN/4   LMN/4
Total misses = LN(2M+1)/4
Anshul Kumar, CSE IITD
Blocking
5 nested loops (blocking over jj and kk), blocking factor = b
           C        A        B
accesses   LMN/b    LMN      LMN
misses     LMN/4b   LMN/4b   MN/4
Total misses = MN(2L/b+1)/4
(Diagram: C, A and B traversed in b-wide blocks along the j and k dimensions.)
Anshul Kumar, CSE IITD
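One possible C rendering of the blocked (tiled) multiplication with the five nested loops and blocking factor b sketched above; this is an illustrative sketch that assumes M and N are multiples of b:

/* C = C + A*B, blocked over j and k with blocking factor b.
   Assumes M and N are multiples of b (simplifying assumption). */
void matmul_blocked(int L, int M, int N, int b,
                    double A[L][N], double B[N][M], double C[L][M]) {
    for (int jj = 0; jj < M; jj += b)            /* block of columns of C and B */
        for (int kk = 0; kk < N; kk += b)        /* block of the k dimension    */
            for (int i = 0; i < L; i++)
                for (int k = kk; k < kk + b; k++)
                    for (int j = jj; j < jj + b; j++)
                        C[i][j] += A[i][k] * B[k][j];
}

Within one (jj, kk) block, the b-by-b tile of B stays resident in the cache while all L rows are processed, which is what reduces the B misses from LMN to MN/4 in the table above.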
Loop Blocking
for (k = 0; k < N; k += 4)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k]   * B[k][j]
               + A[i][k+1] * B[k+1][j]
               + A[i][k+2] * B[k+2][j]
               + A[i][k+3] * B[k+3][j];
           C        A      B
accesses   LMN/4    LN     LMN
misses     LMN/16   LN/4   LMN/4
Total misses = LN(5M/4+1)/4
Anshul Kumar, CSE IITD
Reducing Miss Penalty * Miss Rate • Non-blocking cache • Write allocate with no fetch • Hardware prefetching • Compiler controlled prefetching Anshul Kumar, CSE IITD
Non-blocking Cache In OOO (Out of Order) processor • Hit under a miss • complexity of cache controller increases • Hit under multiple misses or miss under a miss • memory should be able to handle multiple misses Anshul Kumar, CSE IITD
Write allocate with no fetch
Write miss in a Write Through Write Allocate (WTWA) cache:
• Allocate a block in cache
• Fetch contents of block from memory
• Write into cache
• Write into memory
Write allocate reduces miss rate (WTWA) but the fetch increases miss penalty; the fetch can be skipped ("no fetch"), which helps when writes to a block are clustered.
Anshul Kumar, CSE IITD
Hardware Prefetching • Prefetch items before they are requested • both data and instructions • What and when to prefetch? • fetch two blocks on a miss (requested+next) • Where to keep prefetched information? • in cache • in a separate buffer (most common case) Anshul Kumar, CSE IITD
Prefetch Buffer / Stream Buffer
(Diagram: the prefetch buffer sits alongside the cache, between memory and the processor; prefetched blocks are placed in the buffer rather than in the cache.)
Anshul Kumar, CSE IITD
Hardware prefetching: Stream buffers
Jouppi's experiment [1990]:
• A single instruction stream buffer catches 15% to 25% of misses from a 4 KB direct-mapped instruction cache with 16-byte blocks
• 4-block buffer: 50%; 16-block buffer: 72%
• A single data stream buffer catches 25% of misses from a 4 KB direct-mapped cache
• 4 data stream buffers (each prefetching at a different address): 43%
Anshul Kumar, CSE IITD
HW prefetching: UltraSPARC III example
64 KB data cache, 36.9 misses per 1000 instructions
22% of instructions make a data reference
hit time = 1, miss penalty = 15
prefetch hit rate = 20%
1 cycle to get data from the prefetch buffer
What size of cache will give the same performance?
miss rate = 36.9 / 220 = 16.7%
av. mem access time = 1 + (.167 * .2 * 1) + (.167 * .8 * 15) = 3.046
effective miss rate = (3.046 - 1) / 15 = 13.6%  =>  256 KB cache
Anshul Kumar, CSE IITD
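The arithmetic of this example as a C sketch:

#include <stdio.h>

int main(void) {
    double misses_per_1000    = 36.9;
    double data_refs_per_1000 = 220.0;   /* 22% of instructions */
    double hit_time = 1.0, miss_penalty = 15.0;
    double prefetch_hit = 0.20, prefetch_time = 1.0;

    double miss_rate = misses_per_1000 / data_refs_per_1000;            /* ~0.167 */
    double amat = hit_time
                + miss_rate * prefetch_hit * prefetch_time
                + miss_rate * (1.0 - prefetch_hit) * miss_penalty;      /* ~3.05  */
    /* Miss rate a cache without prefetching would need for the same AMAT */
    double equiv_miss_rate = (amat - hit_time) / miss_penalty;          /* ~0.136 */

    printf("miss rate=%.3f AMAT=%.3f equivalent miss rate=%.3f\n",
           miss_rate, amat, equiv_miss_rate);
    return 0;
}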
Compiler Controlled Prefetching • Register prefetch / Cache prefetch • Faulting / non-faulting (non-binding) • Semantically invisible (no change in registers or memory contents) • Makes sense if processor doesn’t stall while prefetching (non-blocking cache) • Overhead of prefetch instruction should not exceed the benefit Anshul Kumar, CSE IITD