CMP 301A Computer Architecture 1, Lecture 2
Outline
• Direct mapped caches: reading and writing policies
• Measuring cache performance
• Improving cache performance
• Enhancing main memory performance
• Flexible placement of blocks: associativity
• Multilevel caches
Read and Write Policies
• A cache read is much easier to handle than a cache write:
  • An instruction cache is much easier to design than a data cache
• Cache write:
  • How do we keep the data in the cache and in memory consistent?
• Two write options:
  • Write Through: write to the cache and to memory at the same time.
    • Isn't memory too slow for this?
  • Write Back: write to the cache only; write the cache block back to memory when that block is replaced on a cache miss.
    • Needs a "dirty" bit for each cache block
    • Greatly reduces the memory bandwidth requirement
    • Control can be complex
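The bandwidth difference between the two policies can be seen in a toy simulation. This is an illustrative sketch, not from the lecture: a tiny direct-mapped cache that only counts writes reaching main memory (all names are made up for the example).

```python
# Toy direct-mapped cache contrasting write-through and write-back.
# Only memory *write* traffic is modeled; reads are omitted for brevity.
class ToyCache:
    def __init__(self, num_blocks, policy):
        self.policy = policy                # "write-through" or "write-back"
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks
        self.dirty = [False] * num_blocks   # the "dirty" bit (write-back only)
        self.mem_writes = 0                 # writes that reach main memory

    def write(self, block_addr):
        idx = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        if self.tags[idx] != tag:           # write miss: replace the block
            if self.policy == "write-back" and self.dirty[idx]:
                self.mem_writes += 1        # evicted dirty block written back
            self.tags[idx] = tag
            self.dirty[idx] = False
        if self.policy == "write-through":
            self.mem_writes += 1            # every store also goes to memory
        else:
            self.dirty[idx] = True          # defer the memory write

wt = ToyCache(4, "write-through")
wb = ToyCache(4, "write-back")
for addr in [0, 0, 0, 0]:                   # four stores to the same block
    wt.write(addr)
    wb.write(addr)
print(wt.mem_writes)  # 4: one memory write per store
print(wb.mem_writes)  # 0: the block just sits dirty in the cache
```

Repeated stores to the same block cost the write-back cache nothing in memory traffic until the block is eventually evicted, which is exactly the bandwidth saving the slide refers to.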
Write Buffer for a Write-Through Cache
• A write buffer is needed between the cache and memory (Processor → Cache/Write Buffer → DRAM)
  • Processor: writes data into the cache and the write buffer
  • Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
  • Typical number of entries: 4
  • Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
• Memory system designer's nightmare:
  • Store frequency (w.r.t. time) → 1 / DRAM write cycle
  • Write buffer saturation
• Problem: the write buffer may hold the updated value of a location needed by a read miss!
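Saturation is easy to demonstrate with a small simulation. A sketch under assumed parameters (the 4-entry depth matches the slide; the cycle counts are invented for illustration):

```python
from collections import deque

def simulate(store_interval, dram_cycle, n_stores, depth=4):
    """Count processor stall cycles caused by a full write buffer.
    store_interval: cycles between stores; dram_cycle: cycles per DRAM write."""
    buf = deque()           # each entry: the cycle its DRAM write completes
    next_free = 0           # cycle at which DRAM is free again
    stalls = 0
    time = 0
    for _ in range(n_stores):
        time += store_interval
        while buf and buf[0] <= time:   # retire writes DRAM has finished
            buf.popleft()
        if len(buf) == depth:           # buffer full: stall until a slot frees
            stalls += buf[0] - time
            time = buf[0]
            buf.popleft()
        start = max(time, next_free)    # DRAM serves buffered writes in order
        next_free = start + dram_cycle
        buf.append(next_free)
    return stalls

# Stores arrive much slower than DRAM can retire them: no stalls.
print(simulate(store_interval=10, dram_cycle=5, n_stores=100))  # 0
# Stores arrive faster than the DRAM write cycle: the buffer saturates.
print(simulate(store_interval=2, dram_cycle=5, n_stores=100) > 0)  # True
```

This mirrors the slide's condition: as the store frequency approaches 1 / DRAM write cycle, the FIFO fills and the processor must stall behind it.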
Write Allocate versus Not Allocate
• Assume a 16-bit write to memory location 0x0 causes a cache miss
• Do we read in the rest of the block (bytes 2, 3, ..., 31)?
  • Yes: Write Allocate
  • No: Write Not Allocate
[Figure: a direct-mapped cache with 32-byte blocks; the address is split into a cache tag (example: 0x00), a cache index (example: 0x00), and a byte-select field (example: 0x00), with a valid bit, tag, and 32 data bytes per cache entry]
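The difference only shows up on later accesses to the same block. A minimal sketch (a fully associative toy cache modeled as a set of resident block addresses; names are illustrative):

```python
def write_miss(cache, block_addr, allocate):
    """Handle a write miss. cache: set of resident block addresses."""
    if allocate:
        cache.add(block_addr)   # write allocate: fetch bytes 0..31, then write
    # write not allocate: the write goes straight to memory; cache unchanged

wa, wna = set(), set()
block = 0x0 >> 5                 # 32-byte blocks: address 0x0 is in block 0
write_miss(wa, block, allocate=True)
write_miss(wna, block, allocate=False)
print(block in wa)   # True:  a later read of byte 2 of the block hits
print(block in wna)  # False: a later read of the same block misses again
```

Write allocate pays the cost of fetching the rest of the block up front, betting that nearby bytes will be accessed soon (spatial locality).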
Measuring Cache Performance: Impact of Cache Misses on Performance
• Suppose a processor executes at
  • Clock rate = 1 GHz (1 ns per cycle), ideal (no misses) CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations (involving data) incur a 100-cycle miss penalty
• Suppose that 1% of instructions incur the same miss penalty
• CPI = 1.1 + (0.30 × 0.10 × 100) + (0.01 × 100) = 1.1 + 3 + 1 = 5.1
• 4 of every 5.1 cycles go to memory stalls: 78% of the time the processor is stalled waiting for memory!
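The slide's arithmetic can be checked directly with the numbers given above:

```python
# Reproducing the slide's CPI calculation.
ideal_cpi  = 1.1
ld_st_frac = 0.30    # fraction of instructions that are loads/stores
data_mr    = 0.10    # 10% of data accesses miss
inst_mr    = 0.01    # 1% of instruction fetches miss
penalty    = 100     # miss penalty in cycles

stall_cpi = ld_st_frac * data_mr * penalty + inst_mr * penalty  # 3 + 1 = 4
total_cpi = ideal_cpi + stall_cpi

print(round(total_cpi, 2))              # 5.1
print(round(stall_cpi / total_cpi, 2))  # 0.78 -> 78% of cycles are stalls
```

Note that the memory stalls dominate: the effective CPI of 5.1 is almost five times the ideal CPI of 1.1.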
Improving Cache Performance
• Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
• To improve performance:
  • reduce the hit time
  • reduce the miss rate
  • reduce the miss penalty
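The AMAT formula above is a one-liner; the numbers below are made-up example values, not from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles
```

Each of the three improvement strategies attacks one term of this sum, which is why the formula is the organizing principle for the rest of the lecture.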
Enhancing Main Memory Performance
• Increasing the memory and bus width
  • Transfer more words every clock cycle
  • Isn't this too much wiring?
• Using an interleaved memory organization
  • Reduces access time with less wiring
• Double Data Rate (DDR) DRAMs
Flexible Placement of Blocks: Associativity
• Consider where memory block 12 can be placed in a cache with 8 blocks:
  • Fully associative: anywhere
  • 2-way set associative (4 sets): anywhere in set 0 (12 mod 4)
  • Direct mapped: only into block 4 (12 mod 8)
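The three placement schemes are really one formula with a different number of ways. A sketch reproducing the slide's block-12 example (the helper name is made up):

```python
def candidate_slots(block, num_blocks=8, ways=1):
    """Cache slots where a memory block may be placed.
    ways=1: direct mapped; ways=num_blocks: fully associative."""
    num_sets = num_blocks // ways
    s = block % num_sets               # which set the block maps to
    return list(range(s * ways, s * ways + ways))

print(candidate_slots(12, ways=1))  # [4]: direct mapped, 12 mod 8
print(candidate_slots(12, ways=2))  # [0, 1]: 2-way, set 0 = 12 mod 4
print(candidate_slots(12, ways=8))  # [0, 1, ..., 7]: fully associative
```

Direct mapped and fully associative are just the two endpoints of the set-associative spectrum (1 way and `num_blocks` ways).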
A Two-Way Set-Associative Cache
• N-way set associative: N entries for each cache index
  • N direct-mapped caches operate in parallel
• Example: a two-way set-associative cache
  • The cache index selects a "set" from the cache
  • The two tags in the set are compared in parallel
  • Data is selected based on the tag comparison result
[Figure: two banks of (valid bit, cache tag, cache data) entries indexed by the cache index; both tags are compared against the address tag in parallel, the hit signals are ORed, and a mux (Sel1/Sel0) selects the matching cache block]
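The lookup path in the figure can be sketched in a few lines. The sizes below are illustrative assumptions, and the hardware's parallel comparators become a short loop in software:

```python
NUM_SETS, WAYS, BLOCK_BITS = 128, 2, 5   # assumed sizes: 32-byte blocks

def lookup(cache, addr):
    """cache: list of NUM_SETS sets, each a list of (valid, tag) pairs."""
    block = addr >> BLOCK_BITS           # drop the byte-select bits
    index = block % NUM_SETS             # cache index selects the set
    tag = block // NUM_SETS              # remaining bits form the tag
    for way, (valid, way_tag) in enumerate(cache[index]):
        if valid and way_tag == tag:     # the two comparisons, serialized
            return ("hit", way)
    return ("miss", None)

cache = [[(False, 0)] * WAYS for _ in range(NUM_SETS)]
cache[3][1] = (True, 7)                  # plant a block: tag 7, set 3, way 1
addr = ((7 * NUM_SETS + 3) << BLOCK_BITS) | 0x4  # byte 4 of that block
print(lookup(cache, addr))               # ('hit', 1)
```

In hardware both tag comparisons happen at once and an OR of the per-way hit signals drives the mux; the loop here is only a sequential stand-in for that parallel logic.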
And Yet Another Extreme Example: Fully Associative
• Fully associative cache: push the set-associative idea to its limit!
  • Forget about the cache index
  • Compare the cache tags of all cache entries in parallel
  • Example: with 32-byte blocks (and 32-bit addresses), we need N 27-bit comparators
• By definition: conflict misses = 0 for a fully associative cache
[Figure: the address splits into a 27-bit cache tag and a 5-bit byte select (example: 0x01); every (valid bit, cache tag, cache data) entry is compared against the address tag in parallel]
Replacement Policy
• In an associative cache, which block of a set should be evicted when the set is full?
  • Random
  • Least Recently Used (LRU)
    • LRU cache state must be updated on every access
    • A true implementation is only feasible for small sets (e.g. 2-way)
  • First In, First Out (FIFO), a.k.a. Round-Robin
    • Used in highly associative caches
  • Not Most Recently Used (NMRU)
    • FIFO with an exception for the most recently used block or blocks
• Replacement only happens on misses
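LRU for a single set can be sketched with an ordered list, position 0 being the least recently used block. The update on every access, even on hits, is why true LRU is only practical at low associativity:

```python
def access(lru_order, tag, ways):
    """Access a block in one set; return the evicted tag, or None."""
    if tag in lru_order:              # hit: move to the MRU position
        lru_order.remove(tag)
        lru_order.append(tag)
        return None
    victim = None
    if len(lru_order) == ways:        # set full: evict the LRU block
        victim = lru_order.pop(0)
    lru_order.append(tag)             # replacement only happens on a miss
    return victim

order = []                            # one set of a 2-way cache
for t in [1, 2, 1, 3]:
    victim = access(order, t, ways=2)
print(order)   # [1, 3]
print(victim)  # 2: the hit on block 1 made block 2 the LRU victim
```

Note how the earlier hit on block 1 changes the outcome: without the LRU update, FIFO would have evicted block 1 instead of block 2.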