CPE 626 CPU Resources: Introduction to Cache Memories Aleksandar Milenkovic E-mail: milenka@ece.uah.edu Web: http://www.ece.uah.edu/~milenka
Outline • Processor-Memory Speed Gap • Memory Hierarchy • Spatial and Temporal Locality • What is Cache? • Four questions about caches • Example: Alpha AXP 21064 Data Cache • Fully Associative Cache • N-Way Set Associative Cache • Block Replacement Policies • Cache Performance
Review: Processor-Memory Performance Gap • Processor performance: improves 2x every 1.5 years • Memory performance: improves 2x every 10 years • As a result, the processor-memory performance gap grows about 50% per year
The Memory Hierarchy (MH) • Goal: the user sees as much memory as is available in the cheapest technology, and accesses it at the speed offered by the fastest technology • Moving from the upper levels (nearest the processor's control and datapath) to the lower levels: speed goes from fastest to slowest, capacity from smallest to biggest, and cost/bit from highest to lowest
Why hierarchy works? • Principle of locality • Temporal locality: recently accessed items are likely to be accessed in the near future Keep them close to the processor • Spatial locality: items whose addresses are near one another tend to be referenced close together in time Move blocks consisting of contiguous words to the upper level • Rule of thumb: programs spend 90% of their execution time in only 10% of their code
What is a Cache? • Small, fast storage used to improve average access time to slow memory • Exploits spatial and temporal locality • In computer architecture, almost everything is a cache! • Registers: a cache on variables • First-level cache: a cache on the second-level cache • Second-level cache: a cache on memory • Memory: a cache on disk (virtual memory) • TLB: a cache on the Page Map Table • Branch prediction: a cache on prediction information
Four questions about cache • Where can a block be placed in the upper level? Block placement • direct-mapped, fully associative, set-associative • How is a block found if it is in the upper level? Block identification • Which block should be replaced on a miss? Block replacement • Random, LRU (Least Recently Used) • What happens on a write? Write strategy • Write-through vs. write-back • Write allocate vs. No-write allocate
Direct-Mapped Cache (1/3) • In a direct-mapped cache, each memory address is associated with one possible block within the cache • Therefore, we only need to look in a single location in the cache for the data if it exists in the cache • Block is the unit of transfer between cache and memory
Direct-Mapped Cache (2/3) • Example: a 4-byte cache indexed by the low bits of the memory address • Cache location 0 can be occupied by data from memory location 0, 4, 8, ... • In general: any memory location whose address is a multiple of 4 [Figure: a 4-entry cache (indices 0-3) beside memory locations 0-F, with each memory address mapping to cache index address mod 4]
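The mapping above can be sketched as a one-line modulo (a hypothetical snippet using the slide's 4-byte, 1-byte-block example):

```python
# Sketch of the slide's 4-byte direct-mapped cache with 1-byte blocks:
# the cache index is simply the memory address mod the cache size.
CACHE_SIZE = 4

def cache_index(addr):
    return addr % CACHE_SIZE

# Memory locations 0, 4, 8, C all land in cache location 0:
print([cache_index(a) for a in (0x0, 0x4, 0x8, 0xC)])   # [0, 0, 0, 0]
```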
Direct-Mapped Cache (3/3) • Since multiple memory addresses map to the same cache index, how do we tell which one is in there? • What if we have a block size > 1 byte? • Result: divide the memory address into three fields: ttttttttttttttttttiiiiiiiiiioooo • TAG: to check if we have the correct block • INDEX: to select the block • OFFSET: to select the byte within the block • Block Address = TAG + INDEX
Direct-Mapped Cache Terminology • INDEX: specifies the cache index (which “row” of the cache we should look in) • OFFSET: once we have found correct block, specifies which byte within the block we want • TAG: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location • BLOCK ADDRESS: TAG + INDEX
Direct-Mapped Cache Example • Conditions • 32-bit architecture (word = 32 bits), address unit is byte • 8KB direct-mapped cache with 4-word blocks • Determine the size of the Tag, Index, and Offset fields • OFFSET (specifies the correct byte within the block): a cache block contains 4 words = 16 (2^4) bytes => 4 bits • INDEX (specifies the correct row in the cache): cache size is 8KB = 2^13 bytes, a cache block is 2^4 bytes; #rows in cache (1 block = 1 row): 2^13/2^4 = 2^9 => 9 bits • TAG: memory address length - offset - index = 32 - 4 - 9 = 19 => tag is the leftmost 19 bits
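The bit-width arithmetic above can be checked with a tiny helper (a sketch; the function name is my own):

```python
import math

def field_sizes(addr_bits, cache_bytes, block_bytes):
    """Tag/index/offset widths for a byte-addressed direct-mapped cache."""
    offset = int(math.log2(block_bytes))                 # byte within block
    index = int(math.log2(cache_bytes // block_bytes))   # which row
    tag = addr_bits - index - offset                     # whatever remains
    return tag, index, offset

# The slide's 8KB cache with 4-word (16-byte) blocks on a 32-bit machine:
print(field_sizes(32, 8 * 1024, 16))   # (19, 9, 4)
```

The same helper reproduces the Alpha 21064 split discussed later: `field_sizes(34, 8 * 1024, 32)` gives a 21-bit tag, 8-bit index, and 5-bit offset.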
What to do on a write hit? • Write through (or store through) • update the word in the cache block and the corresponding word in memory • Write back (or copy back) • update the word in the cache block only • allow the memory word to be “stale” • the modified cache block is written to main memory only when it is replaced => add a ‘dirty’ bit to each block indicating that memory needs to be updated when the block is replaced; if the block is clean, it is not written back on a miss, since memory already has identical information => the OS must flush the cache before I/O !!!
What to do on a write-miss? • Write allocate (or fetch on write): the block is loaded on a write miss, followed by the write-hit actions • No-write allocate (or write around): the block is modified in memory and not loaded into the cache • Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate, and write-through caches often use no-write allocate
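The contrast between the two write-hit policies can be sketched as follows (a minimal, hypothetical single-block model, not any real controller's logic):

```python
# `memory` stands in for the next level of the hierarchy.
memory = {0x100: 1}

class WriteThroughBlock:
    def write(self, addr, val):
        self.addr, self.data = addr, val
        memory[addr] = val            # memory is updated on every write

class WriteBackBlock:
    dirty = False
    def write(self, addr, val):
        self.addr, self.data, self.dirty = addr, val, True  # memory now stale
    def evict(self):
        if self.dirty:                # written back only on replacement
            memory[self.addr] = self.data
            self.dirty = False

wb = WriteBackBlock()
wb.write(0x100, 42)
print(memory[0x100])   # 1 -- memory is still stale
wb.evict()
print(memory[0x100])   # 42 -- updated at replacement time
```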
Alpha AXP 21064 Data Cache • 8KB direct-mapped cache with 32B blocks; the CPU presents a 34-bit address to the cache • => offset is 5 bits, index is 8 bits, tag is 21 bits • Write through, with a four-block write buffer • No-write allocate on a write miss
Alpha AXP 21064 Data Cache [Figure: the CPU's 34-bit address is split into Tag<21>, Index<8>, and Offset<5>; each cache line holds Valid<1>, Tag<21>, and Data<256>; a comparator (=?) checks the tag, a 4:1 mux selects the word for Data out, and writes go through a write buffer to lower-level memory]
Alpha AXP 21064 Data Cache: Read Hit • 1. CPU presents 34 bits address to the cache • 2. The address is divided <tag, index, offset> • 3. Index selects tag to be tested to see if the desired block is in the cache • 4. After reading the tag from the cache, it is compared to the tag portion of the address • data can be read and sent to CPU in parallel with the tag being read and checked • 5. If the tag does match and valid bit is set,cache signals the CPU to load data
Alpha AXP 21064 Data Cache: Write Hit • First four steps are the same as for Read • 5. Corresponding word in the cache is written and also sent to write buffer • 6a. If write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU perspective • 6b. If the buffer contains any other modified blocks, the addresses are checked to see if the address of this new data matches; if so, new data are combined with that entry (write-merging) • 6c. If the buffer is full and there is no address match, cache and CPU must wait
Alpha AXP 21064 Data Cache: Read Miss • First four steps are the same as for Read • 5. The cache sends a stall signal to CPU telling it to wait, and 32B are read from the next level of the hierarchy • in the DEC 3000 model 800 the path to the lower level is 16 bytes wide; it takes 5 clock cycles per transfer, or 10 clock cycles for all 32 bytes • 6. Replace block (update data, tag, and valid bit)
Alpha AXP 21064 Data Cache: Write Miss • First four steps are the same as for Read • 5. Since tag does not match (or valid bit is not set) the CPU writes around the cache to lower level memory and does not affect cache (no-write allocate)
Fully Associative Cache • Memory address fields: • Tag: same as before • Offset: same as before • Index: non-existent • What does this mean? • any block can be placed anywhere in the cache • must compare with all tags in entire cache to see if data is there
Fully Associative Cache • 8KB with 4W blocks => 512 cache blocks [Figure: a 32-bit address split into a 28-bit Cache Tag (bits 31-4) and a Byte Offset (bits 3-0); each entry has a Valid bit, Cache Tag, and Cache Data, with one comparator per entry so all tags are compared in parallel]
Fully Associative Cache • Benefit of fully associative caches • Data can go anywhere in the cache - no conflict misses • Drawbacks • expensive in hardware, since every single entry needs a comparator (if we have 64KB of data in a cache with 4B entries, we need 16K comparators: infeasible) • may slow the processor clock rate
N-Way set-associative cache (1/3) • Memory address fields: • Tag: same as before • Offset: same as before • Index: points us to the correct “row” (called a set in this case) • So what is the difference? • each set contains multiple blocks • once we have found correct set, must compare with all tags in that set to find our data
N-Way set-associative cache (2/3) • Summary • Cache is direct-mapped with respect to sets • Each set is fully associative • Basically N direct-mapped caches working in parallel: each has its own valid bit and data • Given memory address: • Find correct set using Index value • Compare Tag with all Tag values in the determined set • If a match occurs, it is a hit, otherwise a miss • Finally, use the offset field as usual to find the desired data within the desired block
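The lookup sequence above can be sketched in a few lines (names and data layout are illustrative, not any particular hardware's):

```python
# Each set is a list of "ways"; each way holds a valid bit, tag, and block data.
def lookup(cache, addr, block_bytes, num_sets):
    block_addr = addr // block_bytes
    index = block_addr % num_sets          # 1. find the correct set
    tag = block_addr // num_sets
    offset = addr % block_bytes
    for way in cache[index]:               # 2. compare tag with every way
        if way['valid'] and way['tag'] == tag:
            return way['data'][offset]     # 3. hit: offset picks the byte
    return None                            # miss

# 2-way cache with 2 sets and 4-byte blocks; one valid block (tag 5, set 0):
empty = {'valid': False, 'tag': 0, 'data': bytes(4)}
cache = [[{'valid': True, 'tag': 5, 'data': b'ABCD'}, empty],
         [empty, empty]]
print(lookup(cache, 42, 4, 2))   # 67 (ord('C')): block addr 10 = tag 5, set 0
print(lookup(cache, 4, 4, 2))    # None: set 1 holds nothing valid
```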
N-Way set-associative cache (3/3) • What is so great about this? • even a 2-way set associative cache avoids a lot of conflict misses • hardware cost is not so bad (N comparators) • Disadvantages • Extra MUX delay for the data • Data comes AFTER Hit/Miss decision and set selection;In a direct mapped cache, Cache Block is available BEFORE Hit/Miss • In fact, for a cache with M blocks • it is Direct Mapped if it is 1-way set associative • it is Fully Associative if it is M-way set associative • so these two are just special cases of the more general set associative
2-Way Set Associative Cache [Figure: 8KB cache with 4-word blocks (W = 64b); the CPU's 34-bit address is split into Tag<22>, Index<7>, and Offset<5>; each of the two ways has its own Valid<1>, Tag<22>, and Data<256> arrays with its own comparator (=?) and 4:1 word mux, and a 2:1 mux selects the hitting way; writes go through a write buffer to lower-level memory]
Block Replacement Policy (1/2) • Direct mapped cache: the index completely specifies the position a block can go in on a miss • N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss • Fully associative: the block can be written into any position • Question: if we have the choice, where should we write an incoming block?
Block Replacement Policy (2/2) • If there are any locations with valid bit off (empty), then usually write the new block into the first one • If all possible locations already have a valid block, we must pick a replacement policy: rule by which we determine which block gets “cached out” on a miss • Replacement policies • LRU (Least Recently Used) • Random • FIFO (First In First Out)
Block Replacement Policy: LRU (1/2) • Reduce the chance of throwing out data that will be needed soon • Accesses to blocks are recorded • Idea: throw out block which has been accessed least recently
Block Replacement Policy: LRU (2/2) • Pro: exploit temporal locality => recent past use implies likely future use: in fact, this is a very effective policy • Con: implementation • simple for 2-way set associative - one LRU bit to keep track • complex for 4-way or greater (requires complicated hardware and much time to keep track of this) => frequently it is only approximated
LRU Example • Assume a 2-way set associative cache with 4 blocks total capacity; we perform the following block accesses: 0, 4, 1, 2, 3, 1, 5 • How many misses and how many hits will there be with the LRU block replacement policy?
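One way to answer is to simulate the sequence. The sketch below (my own, not from the slides) models 2 sets of 2 ways each (4 blocks total, set index = block address mod 2) and keeps each set's list ordered from least to most recently used:

```python
def simulate_lru(accesses, num_sets=2, ways=2):
    sets = [[] for _ in range(num_sets)]   # each list ordered LRU -> MRU
    hits = misses = 0
    for block in accesses:
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            s.remove(block)                # refresh its recency below
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used
        s.append(block)                    # now the most recently used
    return hits, misses

print(simulate_lru([0, 4, 1, 2, 3, 1, 5]))   # (1, 6): 1 hit, 6 misses
```

Only the second access to block 1 hits; the access to 2 evicts 0 from set 0, and the access to 5 evicts 3 from set 1.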
Instruction and Data caches • When a load or store instruction is executed, the pipelined processor will simultaneously request both a data word and an instruction word => structural hazard for loads and stores • Divide and conquer • Instruction cache (dedicated to instructions) • Data cache (dedicated to data) • Opportunity to optimize each cache separately: different capacities, block sizes, and associativities may lead to better performance
Example: Harvard Architecture • Unified vs. separate I&D (Harvard) caches [Figure: a processor with split L1 I-cache and D-cache over a unified L2, vs. a processor with a unified L1 over a unified L2] • Table on page 384: • 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47% • 32KB unified: aggregate miss rate = 1.99% • Which is better (ignoring the L2 cache)? • Assume 33% of instructions are data ops => 75% of accesses are instruction fetches (1.0/1.33) • hit time = 1, miss penalty = 50 • Note that a data hit incurs 1 extra stall cycle in the unified cache (only one port) • AMATHarvard = 75%x(1 + 0.64%x50) + 25%x(1 + 6.47%x50) = 2.05 • AMATUnified = 75%x(1 + 1.99%x50) + 25%x(1 + 1 + 1.99%x50) = 2.24
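The AMAT arithmetic above can be reproduced directly (a sketch; the function and variable names are my own):

```python
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

f_inst, f_data = 0.75, 0.25   # 75% instruction fetches, 25% data accesses

amat_harvard = f_inst * amat(1, 0.0064, 50) + f_data * amat(1, 0.0647, 50)
# The unified cache has one port, so a data access stalls one extra cycle:
amat_unified = f_inst * amat(2, 0.0199, 50) - f_inst + f_data * amat(2, 0.0199, 50)
amat_unified = f_inst * amat(1, 0.0199, 50) + f_data * amat(2, 0.0199, 50)

print(amat_harvard, amat_unified)   # Harvard ~2.05, unified ~2.24
```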
Things to Remember • Where can a block be placed in the upper level? Block placement • direct-mapped, fully associative, set-associative • How is a block found if it is in the upper level? Block identification: Tag, Index, Offset • Which block should be replaced on a miss? Block replacement • Random, LRU (Least Recently Used) • What happens on a write? Write strategy • Write-through (with write buffer) vs. write-back • Write allocate vs. No-write allocate • AMAT = Average Memory Access Time