Chapter Seven, Part I: Cache Memory
Memories: Review • SRAM: • value is stored on a pair of inverting gates • very fast, but takes more space than DRAM (4 to 6 transistors per cell) • DRAM: • value is stored as a charge on a capacitor (must be refreshed periodically) • much smaller, but slower than SRAM (by a factor of 5 to 10)
Exploiting Memory Hierarchy • Users want large and fast memories! • SRAM access times are 0.5–5 ns, at a cost of $4,000 to $10,000 per GB • DRAM access times are 50–70 ns, at a cost of $100 to $200 per GB • Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB (2004 figures) • Try to give it to them anyway: • build a memory hierarchy
Locality • A principle that makes having a memory hierarchy a good idea • If an item is referenced: • temporal locality: it will tend to be referenced again soon • spatial locality: nearby items will tend to be referenced soon • Why does code have locality? (see the sketch below) • Library books analogy • Our initial focus: two levels (upper, lower) • block: minimum unit of data • hit: data requested is in the upper level • miss: data requested is not in the upper level
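Concretely, here is a minimal C sketch of both kinds of locality in ordinary code (the array and its size are illustrative):

```c
#include <stdio.h>

int main(void) {
    int a[1024] = {0};
    int sum = 0;                      /* "sum" is reused every iteration: temporal locality */
    for (int i = 0; i < 1024; i++) {  /* "i" is also reused: temporal locality */
        sum += a[i];                  /* consecutive elements of a[]: spatial locality */
    }
    printf("%d\n", sum);
    return 0;
}
```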
Cache • Two issues: • How do we know if a data item is in the cache? • If it is, how do we find it? • Our first example: • block size is one word of data • "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be; i.e., many items at the lower level share a single location in the upper level
Direct Mapped Cache • Mapping: cache index = (block address) modulo (number of blocks in the cache)
Direct Mapped Cache • Add a set of tags and a valid bit • The tag contains the upper portion of the address • The valid bit indicates whether an entry contains a valid address • Example: 1 word/block, memory = 32 words = 32 blocks, cache = 8 blocks • Note that the address in this example is the word address. Q: How do you obtain the byte address from a word address? (A: multiply by 4, i.e., shift left by 2 bits)
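A minimal C sketch of the hit check for this 8-block, one-word-per-block example; the struct and function names are illustrative, not the hardware:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 8   /* cache of 8 one-word blocks, as in the example */

typedef struct {
    bool     valid;    /* does this entry hold valid data?     */
    unsigned tag;      /* upper portion of the block's address */
} CacheEntry;

static CacheEntry cache[NUM_BLOCKS];

/* Direct-mapped lookup: index = (word address) modulo (number of blocks). */
bool is_hit(unsigned word_addr) {
    unsigned index = word_addr % NUM_BLOCKS;   /* which entry to check */
    unsigned tag   = word_addr / NUM_BLOCKS;   /* remaining upper bits */
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) {
    cache[21 % NUM_BLOCKS] = (CacheEntry){ .valid = true, .tag = 21 / NUM_BLOCKS };
    printf("word 21: %s\n", is_hit(21) ? "hit" : "miss");  /* hit  */
    printf("word 29: %s\n", is_hit(29) ? "hit" : "miss");  /* miss: same index, different tag */
    return 0;
}
```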
Direct Mapped Cache • For MIPS (4 bytes/word): What kind of locality are we taking advantage of?
Direct Mapped Cache • What is the address format? • How many bits are required to build a cache? • Example (page 479): 32-bit address, byte-addressable, 4 bytes/word, 4 words/block, cache size = 16 KB
16 KB / 4 bytes/word = 4K words; 4K words / 4 words/block = 1K blocks
10 bits for the block index, 2 bits for the block offset, 2 bits for the byte offset
32 − 10 − 2 − 2 = 18 bits for the tag
Address format: [ tag (bits 31–14) | index (bits 13–4) | block offset (bits 3–2) | byte offset (bits 1–0) ]
For each block: 1 valid bit + 18 tag bits + 4 × 32 data bits = 1 + 18 + 128 = 147 bits/block
Total cache size = 147 bits/block × 1K blocks = 147 Kbits ≈ 18.4 KB, more than the 16 KB of data it stores
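A sketch of how software might split a 32-bit byte address into these fields for the 16 KB configuration above (the field widths follow the example; the function name and test address are illustrative):

```c
#include <stdio.h>

/* Field widths for the 16 KB, 4-words-per-block example above. */
#define BYTE_OFFSET_BITS  2    /* 4 bytes per word  */
#define BLOCK_OFFSET_BITS 2    /* 4 words per block */
#define INDEX_BITS        10   /* 1K blocks         */

void split_address(unsigned addr) {
    unsigned byte_off  = addr & 0x3;          /* bits 1:0   */
    unsigned block_off = (addr >> 2) & 0x3;   /* bits 3:2   */
    unsigned index     = (addr >> 4) & 0x3FF; /* bits 13:4  */
    unsigned tag       = addr >> 14;          /* bits 31:14 */
    printf("tag=%u index=%u block_off=%u byte_off=%u\n",
           tag, index, block_off, byte_off);
}

int main(void) {
    split_address(0x12345678);   /* arbitrary example address */
    return 0;
}
```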
Direct Mapped Cache • Taking advantage of spatial locality (> 1 word per block) • Example: FastMATH embedded microprocessor, 16 KB cache, 16 words/block
Direct Mapped Cache • Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1204 map to?
Decimal solution: the word address is 1204 / 4 = 301₁₀. With 4 words per block, the block address is 301 / 4 = 75. The byte offset is 0₁₀ = 00₂ and the block offset is 1₁₀ = 01₂. The index is 75 modulo 64 = 11₁₀ = 001011₂. The tag is 1₁₀ = 00 0000 0000 0000 0000 0001₂.
Binary solution: 64 blocks → 6 bits for the index; 16 bytes/block → 2 bits for the block offset and 2 bits for the byte offset.
1204 = 0000 0000 0000 0000 0000 01 | 00 1011 | 01 | 00
[ tag | index | block offset | byte offset ]
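The same computation as a self-contained C check, using this example's parameters:

```c
#include <stdio.h>

int main(void) {
    unsigned addr        = 1204;   /* byte address       */
    unsigned block_bytes = 16;     /* 16 bytes per block */
    unsigned num_blocks  = 64;     /* 64 blocks in cache */

    unsigned block_addr = addr / block_bytes;       /* 1204 / 16 = 75 */
    unsigned index      = block_addr % num_blocks;  /* 75 mod 64 = 11 */
    unsigned tag        = block_addr / num_blocks;  /* 75 / 64   = 1  */

    printf("block address = %u, index = %u, tag = %u\n",
           block_addr, index, tag);
    return 0;
}
```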
Hits vs. Misses • Read hits • this is what we want! • Read misses • stall the CPU, fetch the block from memory, deliver it to the cache, restart • Write hits: • replace the data in both cache and memory (write-through) • write the data only into the cache, and write it back to memory later (write-back) • Write misses: • read the entire block into the cache, then write the word
Direct Mapped Cache – handling read miss • The CPU fetches instructions from the cache; if an instruction access results in a miss: • Send the value of "current PC − 4" (the original PC) to the memory • Instruct the memory to perform a read and wait for it to complete • Write the cache entry: put the data from memory into the entry's data field, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on • Restart the instruction execution at the first step; this time the instruction is found in the cache
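A simulator-style C sketch of the fill-and-retry steps; the one-word memory array is a stand-in for the real memory system, and all names are illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 8
#define MEM_WORDS  32

typedef struct {
    bool     valid;
    unsigned tag;
    unsigned data;                    /* one word per block in this sketch */
} CacheEntry;

static CacheEntry cache[NUM_BLOCKS];
static unsigned   memory[MEM_WORDS]; /* stand-in for main memory */

/* On a miss: read from memory, write the data and tag into the entry,
 * turn the valid bit on, then "restart" -- the access now hits.       */
unsigned read_word(unsigned word_addr) {
    unsigned index = word_addr % NUM_BLOCKS;
    unsigned tag   = word_addr / NUM_BLOCKS;
    CacheEntry *e  = &cache[index];
    if (!(e->valid && e->tag == tag)) {   /* miss */
        e->data  = memory[word_addr];     /* fetch the block from memory */
        e->tag   = tag;
        e->valid = true;
    }
    return e->data;
}

int main(void) {
    memory[9] = 42;
    printf("%u\n", read_word(9));   /* miss: fills the entry, returns 42 */
    printf("%u\n", read_word(9));   /* hit */
    return 0;
}
```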
Direct Mapped Cache – handling write • Data inconsistency: on a store instruction, after the data is written into the cache, memory would hold a different value from the cache • Solutions: • Write through: always write the data into both the memory and the cache • Write buffer: store the data in a buffer while it is being written to memory, so the processor can continue execution • Write back: the modified block in the cache is written to memory only when it needs to be replaced in the cache
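A hedged sketch contrasting the two write policies; the dirty bit is the usual bookkeeping for write-back (it is not shown on the slide above), and both functions assume the store hits:

```c
#include <stdbool.h>

#define NUM_BLOCKS 8
#define MEM_WORDS  32

typedef struct {
    bool     valid;
    bool     dirty;   /* write-back only: modified since it was filled? */
    unsigned tag;
    unsigned data;
} CacheEntry;

static CacheEntry cache[NUM_BLOCKS];
static unsigned   memory[MEM_WORDS];   /* stand-in for main memory */

/* Write-through: update the cache AND memory on every store,
 * so memory never disagrees with the cache.                   */
void store_write_through(unsigned word_addr, unsigned value) {
    cache[word_addr % NUM_BLOCKS].data = value;   /* assumes the store hits */
    memory[word_addr] = value;
}

/* Write-back: update only the cache and mark the block dirty;
 * memory is updated only when the block is later replaced.    */
void store_write_back(unsigned word_addr, unsigned value) {
    CacheEntry *e = &cache[word_addr % NUM_BLOCKS];  /* assumes the store hits */
    e->data  = value;
    e->dirty = true;
}
```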
Hardware Issues • Make reading multiple words easier by using banks of memory
Increasing Memory Bandwidth • [Figure b (wide memory organization): CPU, multiplexor, cache, bus, memory] • Widen the memory and the bus between the memory and the processor • Reduce the miss penalty by minimizing the number of times we must start a new memory access • Assumptions: 4 words per block, 1 clock cycle to send the address, 15 clock cycles to initiate a DRAM access, and 1 clock cycle to send a word of data • With a main memory width of one word, the miss penalty is 1 + 4×15 + 4×1 = 65 cycles • With a main memory width of four words, the miss penalty is 1 + 15 + 1 = 17 cycles
Increasing Memory Bandwidth • [Figure c (interleaved memory organization): CPU, cache, bus, memory banks 0–3] • Widen only the memory, by using multiple memory banks • One address is sent to all 4 banks; the reads happen simultaneously • Reduce the miss penalty by minimizing the number of times we must start a new memory access • With the same assumptions and 4 memory banks of one-word width, the miss penalty is 1 + 15 + 4×1 = 20 cycles
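A quick C check of the miss penalties for all three organizations under the stated timing assumptions:

```c
#include <stdio.h>

int main(void) {
    int words_per_block = 4;
    int addr_cycles     = 1;   /* send the address          */
    int dram_cycles     = 15;  /* initiate each DRAM access */
    int xfer_cycles     = 1;   /* send one word of data     */

    /* One-word-wide memory: one access and one transfer per word. */
    int narrow = addr_cycles + words_per_block * (dram_cycles + xfer_cycles);

    /* Four-word-wide memory and bus: one access, one transfer. */
    int wide = addr_cycles + dram_cycles + xfer_cycles;

    /* Four interleaved banks: the accesses overlap, the transfers do not. */
    int interleaved = addr_cycles + dram_cycles + words_per_block * xfer_cycles;

    printf("narrow=%d wide=%d interleaved=%d\n", narrow, wide, interleaved);
    /* Prints: narrow=65 wide=17 interleaved=20 */
    return 0;
}
```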
AMAT • Average memory access time (AMAT) • Average time to access memory considering both hits and misses and the frequency of different accesses • Used to capture the fact that the time to access data for both hits and misses affects performance • Useful as a figure of merit for different cache systems • AMAT = Time for a hit + Miss rate × Miss penalty • 7.17 [5] <§7.2> Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls. • 7.18 [5] <§7.2> Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 1.2 clock cycles. Using the AMAT as a metric, determine if this is a good trade-off.
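A quick check of exercises 7.17 and 7.18 under their stated assumptions; the answers follow directly from the AMAT formula:

```c
#include <stdio.h>

int main(void) {
    double clock_ns = 2.0;   /* 2 ns clock */

    /* 7.17: hit time 1 cycle, miss rate 0.05, miss penalty 20 cycles. */
    double amat1 = 1.0 + 0.05 * 20.0;              /* = 2.0 cycles */
    printf("7.17: %.1f cycles = %.1f ns\n", amat1, amat1 * clock_ns);

    /* 7.18: after doubling the cache: hit time 1.2 cycles, miss rate 0.03. */
    double amat2 = 1.2 + 0.03 * 20.0;              /* = 1.8 cycles */
    printf("7.18: %.1f cycles = %.1f ns -> %s\n", amat2, amat2 * clock_ns,
           amat2 < amat1 ? "good trade-off" : "bad trade-off");
    return 0;
}
```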
Performance • Increasing the block size tends to decrease the miss rate • But if the block size becomes a significant fraction of the cache size, the number of cache blocks becomes small • A block will then be replaced in the cache before many of its words are accessed, which increases the miss rate • How do we solve this?
Performance • Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
• Two ways of improving performance: • decreasing the miss ratio • decreasing the miss penalty • What happens if we increase the block size?
Performance • Assume the instruction cache miss rate for gcc is 2% and the data cache miss rate is 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster the machine would run with a perfect cache that never missed. The frequency of loads and stores in gcc is 36%.
To run a program, the CPU fetches both instructions and data from memory.
Instruction miss cycles = IC × 2% × 100 = 2.00 × IC
Data miss cycles = IC × 36% × 4% × 100 = 1.44 × IC
Total memory-stall cycles = 2.00 × IC + 1.44 × IC = 3.44 × IC
CPI with memory stalls = 2 + 3.44 = 5.44
(CPU time with stalls) / (CPU time perfect) = (IC × CPIstall × clock cycle) / (IC × CPIperfect × clock cycle) = 5.44 / 2 = 2.72
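The same arithmetic as a short C sketch; stall cycles are expressed per instruction, so IC cancels out of the final ratio:

```c
#include <stdio.h>

int main(void) {
    double base_cpi      = 2.0;
    double miss_penalty  = 100.0;
    double i_miss_rate   = 0.02;   /* instruction-cache miss rate */
    double d_miss_rate   = 0.04;   /* data-cache miss rate        */
    double mem_inst_freq = 0.36;   /* loads and stores in gcc     */

    double i_stalls  = i_miss_rate * miss_penalty;                 /* 2.00 */
    double d_stalls  = mem_inst_freq * d_miss_rate * miss_penalty; /* 1.44 */
    double stall_cpi = base_cpi + i_stalls + d_stalls;             /* 5.44 */

    printf("CPI with stalls = %.2f, speedup with perfect cache = %.2f\n",
           stall_cpi, stall_cpi / base_cpi);   /* 5.44 and 2.72 */
    return 0;
}
```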
Decreasing Miss Ratio with Associativity • [Figure: one-way (direct mapped), two-way, four-way, and eight-way (fully associative) organizations of an 8-block cache] • An n-way set-associative cache consists of a number of sets; each set contains n blocks • Each memory block maps to a unique set in the cache, given by the index field: set index = (block number) modulo (number of sets in the cache) • A memory block can be placed in any element of that set
Decreasing Miss Ratio with Associativity • If the block address is 13 and the cache has 8 blocks: set index = (block number) modulo (number of sets in the cache) • one-way (direct mapped, 8 sets): 13 modulo 8 = 5 • two-way (4 sets): 13 modulo 4 = 1 • four-way (2 sets): 13 modulo 2 = 1 • eight-way (fully associative, 1 set): the block can go anywhere; search all 8 entries
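A quick C check of these mappings (number of sets = blocks / associativity; 8-way on an 8-block cache is fully associative):

```c
#include <stdio.h>

int main(void) {
    int block_addr = 13;
    int num_blocks = 8;
    int ways[] = {1, 2, 4, 8};   /* 8-way here means fully associative */

    for (int i = 0; i < 4; i++) {
        int num_sets = num_blocks / ways[i];
        printf("%d-way: %d sets, block 13 -> set %d\n",
               ways[i], num_sets, block_addr % num_sets);
    }
    /* Prints sets 5, 1, 1, 0 for 1-, 2-, 4-, and 8-way. */
    return 0;
}
```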
Set-Associative Cache • Compared to direct mapped, give a series of references that: • result in a lower miss ratio using a 2-way set-associative cache • result in a higher miss ratio using a 2-way set-associative cache, assuming we use the "least recently used" replacement strategy
Set-Associative Cache Issues • Locating a block in the cache • Address format [ Tag | Set Index | Block Offset | Byte Offset ] • Block replacement • LRU (Least Recently Used) • Random
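A minimal sketch of LRU replacement for a 2-way set-associative cache; at two ways a single LRU index per set suffices, and the structure and names are illustrative:

```c
#include <stdbool.h>

#define NUM_SETS 4
#define WAYS     2

typedef struct {
    bool     valid;
    unsigned tag;
} Way;

typedef struct {
    Way way[WAYS];
    int lru;                 /* index of the least recently used way */
} Set;

static Set sets[NUM_SETS];

/* Returns true on a hit; on a miss, the LRU way is refilled with the
 * new block's tag. Touching a way makes the other way the LRU one.   */
bool access_block(unsigned block_addr) {
    Set *s       = &sets[block_addr % NUM_SETS];
    unsigned tag = block_addr / NUM_SETS;
    for (int w = 0; w < WAYS; w++) {
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = 1 - w;                /* the other way is now LRU */
            return true;                   /* hit */
        }
    }
    int victim = s->lru;                   /* replace the LRU way */
    s->way[victim].tag   = tag;
    s->way[victim].valid = true;
    s->lru = 1 - victim;
    return false;                          /* miss */
}
```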
Set-Associative Cache Example • Assume a cache of 4K blocks, a four-word block size, and a 32-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative. (Page 504)
A four-word block means 4 offset bits (2 block offset + 2 byte offset).
The direct-mapped cache has as many sets as blocks: 4K = 2^12 sets, so the total number of tag bits is (32 − 12 − 4) × 4K = 64 Kbits.
For a two-way set-associative cache, there are 2K = 2^11 sets; the total number of tag bits is (32 − 11 − 4) × 2 × 2K = 68 Kbits.
For a four-way set-associative cache, there are 1K = 2^10 sets; the total number of tag bits is (32 − 10 − 4) × 4 × 1K = 72 Kbits.
For a fully associative cache, there is only 1 set; the total number of tag bits is (32 − 0 − 4) × 4K × 1 = 112 Kbits.
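These totals as a small C check (the index width is log2 of the number of sets; the 4 offset bits come from the four-word blocks):

```c
#include <stdio.h>

int main(void) {
    int addr_bits   = 32;
    int offset_bits = 4;              /* 4 words/block: 2 + 2 offset bits  */
    int num_blocks  = 4 * 1024;
    int ways[] = {1, 2, 4, 4 * 1024}; /* direct, 2-way, 4-way, fully assoc. */

    for (int i = 0; i < 4; i++) {
        int num_sets   = num_blocks / ways[i];
        int index_bits = 0;
        for (int s = num_sets; s > 1; s >>= 1) index_bits++;  /* log2(sets) */
        int tag_bits = addr_bits - index_bits - offset_bits;
        printf("%4d-way: %4d sets, %2d tag bits, total %d Kbits\n",
               ways[i], num_sets, tag_bits, tag_bits * num_blocks / 1024);
    }
    /* Prints 64, 68, 72, and 112 Kbits, matching the example. */
    return 0;
}
```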
Decreasing Miss Penalty with Multilevel Caches • Add a second-level cache: • often the primary cache is on the same chip as the processor • use SRAMs to add another cache between the primary cache and main memory (DRAM) • the miss penalty goes down if the data is found in the 2nd-level cache • Example (pages 505–506): • CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access • adding a 2nd-level cache with a 5 ns access time decreases the miss rate to main memory to 0.5% • Using multilevel caches: • try to optimize the hit time on the 1st-level cache • try to optimize the miss rate on the 2nd-level cache
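Working the example through with the figures given above: at 5 GHz a clock cycle is 0.2 ns, so the 100 ns DRAM access costs 500 cycles and the 5 ns second-level access costs 25 cycles:

```c
#include <stdio.h>

int main(void) {
    double base_cpi     = 1.0;
    double cycle_ns     = 1.0 / 5.0;           /* 5 GHz -> 0.2 ns/cycle  */
    double dram_cycles  = 100.0 / cycle_ns;    /* 500 cycles             */
    double l2_cycles    = 5.0 / cycle_ns;      /* 25 cycles              */
    double l1_miss_rate = 0.05;                /* misses per instruction */
    double l2_miss_rate = 0.005;               /* misses to main memory  */

    /* Without L2: every primary miss goes all the way to DRAM. */
    double cpi_no_l2 = base_cpi + l1_miss_rate * dram_cycles;     /* 26.0 */

    /* With L2: L1 misses pay the L2 access; L2 misses also pay DRAM. */
    double cpi_l2 = base_cpi + l1_miss_rate * l2_cycles
                             + l2_miss_rate * dram_cycles;        /* 4.75 */

    printf("CPI without L2 = %.2f, with L2 = %.2f, speedup = %.2f\n",
           cpi_no_l2, cpi_l2, cpi_no_l2 / cpi_l2);                /* 5.47 */
    return 0;
}
```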
Cache Complexities • Not always easy to understand the implications of caches: [Figures: theoretical behavior of radix sort vs. quicksort; observed behavior of radix sort vs. quicksort]
Cache Complexities • Here is why: • Memory system performance is often the critical factor • Multilevel caches and pipelined processors make it harder to predict outcomes • Compiler optimizations to increase locality sometimes hurt ILP • Difficult to predict the best algorithm: need experimental data