55:035 Computer Architecture and Organization. Lecture 7
Outline
• Cache Memory Introduction
• Memory Hierarchy
• Direct-Mapped Cache
• Set-Associative Cache
• Cache Sizes
• Cache Performance
Introduction
• Memory access time is important to performance!
• Users want large memories with fast access times: ideally, unlimited fast memory.
• To use an analogy, think of a bookshelf containing many books:
  • Suppose you are writing a paper on birds. You go to the bookshelf, pull out some of the books on birds, and place them on the desk. As you start to look through them, you realize that you need more references, so you go back to the bookshelf, get more books on birds, and put them on the desk. Now, as you begin to write your paper, you have many of the references you need on the desk in front of you.
• This is an example of the principle of locality: programs access a relatively small portion of their address space at any instant of time.
Levels of the Memory Hierarchy
• Registers: part of the on-chip CPU datapath; ISA: 16-128 registers
• Cache level(s): one or more levels (static RAM)
  • Level 1: on-chip, 16-64K
  • Level 2: on-chip, 256K-2M
  • Level 3: on- or off-chip, 1M-16M
• Main memory: dynamic RAM (DRAM), 256M-16G
• Magnetic disk: 80G-300G; interface: SCSI, RAID, IDE, 1394
• Optical disk or magnetic tape
• Farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput/bandwidth
Memory Hierarchy Comparisons
Level | Capacity | Access time | Cost
CPU registers | 100s bytes | <10s ns |
Cache | K bytes | 10-100 ns | 1-0.1 cents/bit
Main memory | M bytes | 200-500 ns | $.0001-.00001 cents/bit
Disk | G bytes | 10 ms (10,000,000 ns) | 10^-5 - 10^-6 cents/bit
Tape | infinite | sec-min | 10^-8
Staging/transfer units between adjacent levels:
• Registers and cache: instruction operands, 1-8 bytes, managed by the program/compiler
• Cache and main memory: blocks, 8-128 bytes, managed by the cache controller
• Main memory and disk: pages, 4K-16K bytes, managed by the OS
• Disk and tape: files, Mbytes, managed by the user/operator
Upper levels are faster; lower levels are larger.
Memory Hierarchy
• We can exploit the natural locality in programs by implementing the memory of a computer as a memory hierarchy.
  • Multiple levels of memory with different speeds and sizes.
  • The fastest memories are more expensive, and usually much smaller in size (see figure).
• The user has the illusion of a memory that is both large and fast.
  • This is accomplished by using efficient methods for memory structure and organization.
Inventor of Cache
M. V. Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 2, pp. 270-271, April 1965.
Cache
• The processor does all memory operations with the cache.
• Miss: if the requested word is not in the cache, a block of words containing the requested word is brought into the cache, and then the processor request is completed.
• Hit: if the requested word is in the cache, the read or write operation is performed directly in the cache, without accessing main memory.
• Block: the minimum amount of data transferred between cache and main memory.
[Figure: the processor exchanges words with the cache (small, fast memory); the cache exchanges blocks with main memory (large, inexpensive, slow).]
The Locality Principle
• A program tends to access data that form a physical cluster in memory: multiple accesses may be made within the same block.
• Physical localities are temporal and may shift over longer periods of time: data not used for some time is less likely to be used in the future. Upon a miss, the least recently used (LRU) block can be overwritten by a new block.
• P. J. Denning, “The Locality Principle,” Communications of the ACM, vol. 48, no. 7, pp. 19-24, July 2005.
Temporal & Spatial Locality
• There are two types of locality:
  • Temporal locality (locality in time): if an item is referenced, it will likely be referenced again soon. Data is reused.
  • Spatial locality (locality in space): if an item is referenced, items at neighboring addresses will likely be referenced soon.
• Most programs contain natural locality in structure. For example, most programs contain loops in which the instructions and data are accessed repeatedly. This is an example of temporal locality.
• Instructions are usually accessed sequentially, so they exhibit a high degree of spatial locality.
• Accesses to the elements of an array are another example of spatial locality.
Data Locality, Cache, Blocks
• Increase block size to match locality size.
• Increase cache size to include most blocks.
[Figure: memory and cache; the data needed by a program lies in Block 1 and Block 2.]
Basic Caching Concepts
• The memory system is organized as a hierarchy in which the level closest to the processor is a subset of any level further away, and all of the data is stored at the lowest level (see figure).
• Data is copied between only two adjacent levels at any given time. The minimum unit of information transferred within such a two-level hierarchy is called a block or line. See the highlighted square shown in the figure.
• If the requested data appears in some block in the upper level, it is known as a hit. If the data is not found in the upper level, it is known as a miss.
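Basic Cache Organization
[Figure: the full byte address is divided into tag, index (Idx), and offset (Off) fields; tag and index together form the block address. The figure shows a tag array and a data array with decode and row select, a tag comparison that produces the hit signal, and a mux select that delivers the data word.]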
Direct-Mapped Cache
[Figure: memory and cache, with labels LRU, swap-out, swap-in, data needed by a program, Block 1, and Block 2.]
Set-Associative Cache
[Figure: memory and cache, with labels LRU, swap-out, swap-in (twice), data needed by a program, Block 1, and Block 2.]
Three Major Placement Schemes
Direct-Mapped Placement
• A block can go into only one place in the cache.
  • Determined by the block’s address (in memory space).
  • The index number for block placement is usually given by some low-order bits of the block’s address.
  • This can also be expressed as: (Index) = (Block address) mod (Number of blocks in cache)
• Note that in a direct-mapped cache, block placement and replacement choices are both completely determined by the address of the new block that is to be accessed.
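The placement rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the bit-mask version assumes the number of cache blocks is a power of two.

```python
def direct_mapped_index(block_address: int, num_blocks: int) -> int:
    """Cache index for a block: (block address) mod (number of blocks)."""
    return block_address % num_blocks


def low_order_bits(block_address: int, num_blocks: int) -> int:
    """Same index via bit masking; valid only when num_blocks is a power of two."""
    assert num_blocks > 0 and num_blocks & (num_blocks - 1) == 0
    return block_address & (num_blocks - 1)


# Word addresses 0, 8 and 16 all land on index 0 of an 8-block cache,
# so they compete for the same cache location.
conflicting = [direct_mapped_index(a, 8) for a in (0, 8, 16)]
```

For an 8-block cache, block address 11101 (binary) maps to index 101, its three low-order bits.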
Direct-Mapped Cache
[Figure: a 32-word word-addressable main memory (addresses 00000 to 11111) mapped onto a cache of 8 blocks, block size = 1 word. The cache address consists of a tag and an index (local address): memory address 11101 has tag 11 and index 101. Cache indexes 000 to 111 are shown holding tags 00, 10, 11, 01, 01, 00, 10, 11.]
Direct-Mapped Cache
[Figure: a 32-word word-addressable main memory (addresses 00000 to 11111) mapped onto a cache of 4 blocks, block size = 2 words. The cache address consists of a tag, an index (local address), and a block offset: memory address 11101 has tag 11, index 10, and block offset 1. Cache indexes 00, 01, 10, 11 are shown holding tags 00, 11, 00, 10.]
Direct-Mapped Cache (Byte Address)
[Figure: a 32-word byte-addressable main memory (byte addresses 00000 00 to 11111 00) mapped onto a cache of 8 blocks, block size = 1 word. The memory address consists of a tag, an index, and a byte offset: memory address 11101 00 has tag 11, index 101, and byte offset 00. Cache indexes 000 to 111 are shown holding tags 00, 10, 11, 01, 01, 00, 10, 11.]
Finding a Word in Cache
[Figure: 32-word memory with a 7-bit byte address b6 b5 b4 b3 b2 b1 b0: b6 b5 = tag, b4 b3 b2 = index, b1 b0 = byte offset. Cache size 8 words, block size = 1 word. Each cache entry (index 000 to 111) holds a valid bit, a 2-bit tag, and data. The stored tag is compared with the address tag: 1 = hit, 0 = miss.]
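As an illustration of how the address fields in the figure can be extracted, here is a minimal Python sketch. The function name and parameters are hypothetical; field widths are passed in explicitly, with a 2-bit byte offset (4-byte words) as the default.

```python
def split_address(byte_address, index_bits, byte_offset_bits=2, block_offset_bits=0):
    """Split a byte address into (tag, index, block offset, byte offset).

    Fields are laid out from the least significant bit upward:
    byte offset, block offset, index, tag.
    """
    byte_offset = byte_address & ((1 << byte_offset_bits) - 1)
    rest = byte_address >> byte_offset_bits
    block_offset = rest & ((1 << block_offset_bits) - 1)
    rest >>= block_offset_bits
    index = rest & ((1 << index_bits) - 1)
    tag = rest >> index_bits
    return tag, index, block_offset, byte_offset


# 7-bit byte address 11 101 00 with a 3-bit index (8-block cache, 1-word blocks).
fields = split_address(0b1110100, index_bits=3)
```

A lookup would then read the entry at `index` and report a hit only if that entry is valid and its stored tag equals `tag`.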
Miss Rate of Direct-Mapped Cache
[Figure: a 32-word main memory (byte addresses 00000 00 to 11111 00) and a direct-mapped cache of 8 blocks, block size = 1 word, with indexes 000 to 111 holding tags 00, 10, 11, 01, 01, 00, 10, 11. One memory block is marked “This block is needed” and one cache block is marked as the least recently used (LRU) block. Address format: tag, index, byte offset, e.g. 11 101 00.]
Miss Rate of Direct-Mapped Cache
• Memory references to addresses: 0, 8, 0, 6, 8, 16
• Cache of 8 blocks, block size = 1 word, 32-word memory
• Result: 1. miss, 2. miss, 3. miss, 4. miss, 5. miss, 6. miss
[Figure: addresses 0, 8, and 16 all map to index 000, whose tag is overwritten in turn (00 / 01 / 00 / 10); address 6 maps to index 110 with tag 00; all other entries are unused (xx). Address format: tag, index, byte offset, e.g. 11 101 00.]
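The reference sequence above can be replayed with a small Python sketch. It is a simplified model under stated assumptions: one-word blocks, word addresses, an initially empty cache, and a hypothetical function name.

```python
def simulate_direct_mapped(word_addresses, num_blocks):
    """Replay word-address references on a direct-mapped cache with 1-word blocks."""
    stored_tags = [None] * num_blocks  # None marks an invalid (empty) entry
    results = []
    for address in word_addresses:
        index = address % num_blocks   # low-order bits select the entry
        tag = address // num_blocks    # remaining high-order bits
        if stored_tags[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            stored_tags[index] = tag   # the new block overwrites the old one
    return results


trace = simulate_direct_mapped([0, 8, 0, 6, 8, 16], num_blocks=8)
```

Addresses 0, 8, and 16 keep evicting one another from index 000, so every reference in this sequence misses even though most of the cache stays empty.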
Fully-Associative Cache (8-Way Set Associative)
[Figure: a 32-word main memory (byte addresses 00000 00 to 11111 00) and a fully-associative cache of 8 blocks, block size = 1 word. One memory block is marked “This block is needed” and one cache block is marked as the LRU block. The cache address is the tag alone: memory address 11101 00 has the 5-bit tag 11101 and byte offset 00.]
Miss Rate: Fully-Associative Cache
• Memory references to addresses: 0, 8, 0, 6, 8, 16
• Cache of 8 blocks, block size = 1 word, 32-word memory
• Result: 1. miss, 2. miss, 3. hit, 4. miss, 5. hit, 6. miss
[Figure: the cache ends up holding tags 00000, 01000, 00110, and 10000; the remaining four entries are unused (xxxxx). Address format: 5-bit tag and byte offset, e.g. 11101 00.]
Finding a Word in Associative Cache
[Figure: 32-word memory with a 7-bit byte address b6 b5 b4 b3 b2 b1 b0: b6 to b2 = 5-bit tag, b1 b0 = byte offset, no index. Cache size 8 words, block size = 1 word. Each cache entry holds a valid bit, a 5-bit tag, and data. The address tag must be compared with all tags in the cache: 1 = hit, 0 = miss.]
Eight-Way Set-Associative Cache
[Figure: cache size 8 words, block size = 1 word, 32-word byte-addressed memory. The memory address has a 5-bit tag (drawn as bits b31 b30 b29 b28 b27) and a byte offset (b1 b0). Eight entries, each with V | tag | data, feed eight comparators and an 8-to-1 multiplexer that delivers the data: 1 = hit, 0 = miss.]
Two-Way Set-Associative Cache
[Figure: a 32-word main memory (byte addresses 00000 00 to 11111 00) and a two-way set-associative cache of 8 blocks, block size = 1 word. Sets 00, 01, 10, 11 are shown holding tag pairs 000 | 011, 100 | 001, 110 | 101, 010 | 111. One memory block is marked “This block is needed” and one cache block is marked as the LRU block. Address format: tag, index, byte offset, e.g. 111 01 00.]
Miss Rate: Two-Way Set-Associative Cache
• Memory references to addresses: 0, 8, 0, 6, 8, 16
• Cache of 8 blocks, block size = 1 word, 32-word memory
• Result: 1. miss, 2. miss, 3. hit, 4. miss, 5. hit, 6. miss
[Figure: set 00 holds tags 000 | 010, set 10 holds tag 001 | xxx, and sets 01 and 11 are unused (xxx | xxx). Address format: tag, index, byte offset, e.g. 111 01 00.]
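To check the hit/miss sequences for different associativities, the following Python sketch models an n-way set-associative cache with LRU replacement. It is a simplified model, not a description of real hardware: one-word blocks, word addresses, an initially empty cache, and hypothetical names throughout.

```python
from collections import OrderedDict


def simulate_set_associative(word_addresses, num_blocks, ways):
    """Replay references on an n-way set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    # Each set maps tag -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for address in word_addresses:
        current = sets[address % num_sets]
        tag = address // num_sets
        if tag in current:
            current.move_to_end(tag)         # mark as most recently used
            results.append("hit")
        else:
            if len(current) == ways:
                current.popitem(last=False)  # evict the least recently used block
            current[tag] = True
            results.append("miss")
    return results


refs = [0, 8, 0, 6, 8, 16]
two_way = simulate_set_associative(refs, num_blocks=8, ways=2)
fully_associative = simulate_set_associative(refs, num_blocks=8, ways=8)
direct_mapped = simulate_set_associative(refs, num_blocks=8, ways=1)
```

With `ways=1` the model behaves as a direct-mapped cache, and with `ways` equal to the number of blocks it behaves as a fully-associative cache, so one function covers all three placement schemes.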
Two-Way Set-Associative Cache
[Figure: cache size 8 words, block size = 1 word, 32-word byte-addressed memory. Memory address b6 b5 b4 b3 b2 b1 b0: b6 b5 b4 = 3-bit tag, b3 b2 = 2-bit index, b1 b0 = byte offset. Four sets (index 00, 01, 10, 11), each with two entries of V | tag | data, feed two comparators and a 2-to-1 MUX that delivers the data: 1 = hit, 0 = miss.]
Using Larger Cache Block (4 Words)
[Figure: 4 GB = 1G words, byte-addressed memory. Memory address: b31…b16 = 16-bit tag, b15…b4 = 12-bit index, b3 b2 = 2-bit block offset, b1 b0 = byte offset. Cache size 16K words, 4K indexes (0000 0000 0000 to 1111 1111 1111), block size = 4 words. Each entry holds a valid bit, a 16-bit tag, and data (4 words = 128 bits). A comparator produces 1 = hit, 0 = miss, and a MUX selects the data word.]
Number of Tag and Index Bits
• Main memory size = W words; cache size = w words.
• Each word in the cache has a unique index (local address).
• Number of index bits = log2(w)
• Index bits are shared with the block offset when a block contains more than one word.
• Assume partitions of w words each in the main memory: there are W/w such partitions, each identified by a tag.
• Number of tag bits = log2(W/w)
How Many Bits Does Cache Have?
• Consider a main memory:
  • 32 words; byte address is 7 bits wide: b6 b5 b4 b3 b2 b1 b0
  • Each word is 32 bits wide
• Assume that the cache block size is 1 word (32 bits of data) and the cache contains 8 blocks.
• The cache requires, for each word: a 2-bit tag and one valid bit.
• Total storage needed in cache = #blocks in cache × (data bits/block + tag bits + valid bit) = 8 × (32 + 2 + 1) = 280 bits
• Physical storage/Data storage = 280/256 = 1.094
A More Realistic Cache
• Consider a 4 GB, byte-addressable main memory:
  • 1G words; byte address is 32 bits wide: b31…b16 b15…b2 b1 b0
  • Each word is 32 bits wide
• Assume that the cache block size is 1 word (32 bits of data) and the cache contains 64 KB of data, or 16K words, i.e., 16K blocks.
• Number of cache index bits = 14, because 16K = 2^14
• Tag size = 32 − byte offset − #index bits = 32 − 2 − 14 = 16 bits
• The cache requires, for each word: a 16-bit tag and one valid bit.
• Total storage needed in cache = #blocks in cache × (data bits/block + tag size + valid bits) = 2^14 × (32 + 16 + 1) = 16 × 2^10 × 49 = 784 × 2^10 bits = 784 Kb = 98 KB
• Physical storage/Data storage = 98/64 = 1.53
• But we need to increase the block size to match the size of locality.
Cache Bits for 4-Word Block
• Consider a 4 GB, byte-addressable main memory:
  • 1G words; byte address is 32 bits wide: b31…b16 b15…b4 b3 b2 b1 b0
  • Each word is 32 bits wide
• Assume that the cache block size is 4 words (128 bits of data) and the cache contains 64 KB of data, or 16K words, i.e., 4K blocks.
• Number of cache index bits = 12, because 4K = 2^12
• Tag size = 32 − byte offset − #block offset bits − #index bits = 32 − 2 − 2 − 12 = 16 bits
• The cache requires, for each block: a 16-bit tag and one valid bit.
• Total storage needed in cache = #blocks in cache × (data bits/block + tag size + valid bit) = 2^12 × (4 × 32 + 16 + 1) = 4 × 2^10 × 145 = 580 × 2^10 bits = 580 Kb = 72.5 KB
• Physical storage/Data storage = 72.5/64 = 1.13
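The storage arithmetic for a direct-mapped cache can be reproduced with a short Python sketch. The function name and parameters are hypothetical, and the sketch assumes power-of-two sizes, 32-bit words, and one valid bit per block.

```python
def cache_storage_bits(address_bits, data_bytes, words_per_block,
                       word_bits=32, valid_bits=1):
    """Return (tag bits, total storage bits) for a direct-mapped cache.

    Assumes all sizes are powers of two, so log2 can be taken with bit_length().
    """
    bytes_per_word = word_bits // 8
    num_blocks = data_bytes // (bytes_per_word * words_per_block)
    index_bits = num_blocks.bit_length() - 1
    byte_offset_bits = bytes_per_word.bit_length() - 1
    block_offset_bits = words_per_block.bit_length() - 1
    tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits
    bits_per_block = words_per_block * word_bits + tag_bits + valid_bits
    return tag_bits, num_blocks * bits_per_block


tiny = cache_storage_bits(address_bits=7, data_bytes=32, words_per_block=1)
one_word = cache_storage_bits(address_bits=32, data_bytes=64 * 1024, words_per_block=1)
four_word = cache_storage_bits(address_bits=32, data_bytes=64 * 1024, words_per_block=4)
```

The three calls correspond to the 8-word cache (280 bits), the 64 KB cache with 1-word blocks (784 × 2^10 bits), and the 64 KB cache with 4-word blocks (580 × 2^10 bits). Larger blocks reduce the overhead because one tag and one valid bit are shared by four words.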
Cache size equation
• Simple equation for the size of a cache: (Cache size) = (Block size) × (Number of sets) × (Set associativity)
• Can relate to the sizes of the memory address fields:
  • (Block size) = 2^(# of offset bits)
  • (Number of sets) = 2^(# of index bits)
  • (# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits)
Interleaved Memory
• Reduces miss penalty.
• Memory is designed to read the words of a block simultaneously in one read operation.
• Example:
  • Cache block size = 4 words
  • Interleaved memory with 4 banks
  • Suppose a memory access takes ~15 cycles
  • Miss penalty = 1 cycle to send address + 15 cycles to read a block + 4 cycles to send data to cache = 20 cycles
  • Without interleaving, the four words are read one after another: miss penalty = 1 + 4 × 15 + 4 = 65 cycles
[Figure: the processor exchanges words with the cache (small, fast memory); the cache exchanges blocks with main memory, which is divided into memory banks 0, 1, 2, and 3.]
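The miss-penalty arithmetic in this example can be sketched in Python. The function and parameter names are hypothetical, and the non-interleaved breakdown (one memory access per word) is an assumption chosen because it reproduces the 65-cycle figure.

```python
def miss_penalty(words_per_block, memory_access_cycles, interleaved,
                 address_cycles=1, transfer_cycles_per_word=1):
    """Cycles to service a miss: send address, read memory, send words to cache."""
    # With one bank per word, all words of the block are read in one access;
    # otherwise each word needs its own memory access (assumed model).
    memory_reads = 1 if interleaved else words_per_block
    return (address_cycles
            + memory_reads * memory_access_cycles
            + words_per_block * transfer_cycles_per_word)


with_banks = miss_penalty(4, 15, interleaved=True)
without_banks = miss_penalty(4, 15, interleaved=False)
```

Under these assumptions, interleaving removes three of the four 15-cycle memory accesses, which accounts for the 45-cycle difference.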
Cache Design
• The level’s design is described by four behaviors:
  • Block Placement: where could a new block be placed in the given level?
  • Block Identification: how is an existing block found, if it is in the level?
  • Block Replacement: which existing block should be replaced, if necessary?
  • Write Strategy: how are writes to the block handled?
Handling a Miss
• A miss occurs when data at the required memory address is not found in the cache.
• Controller actions:
  • Stall the pipeline
  • Freeze the contents of all registers
  • Activate a separate cache controller
  • If the cache is full, select the least recently used (LRU) block in the cache for overwriting
  • If the selected block has inconsistent data, take proper action
  • Copy the block containing the requested address from memory
  • Restart the instruction
Miss During Instruction Fetch
• Send the original PC value (PC − 4) to the memory.
• Instruct main memory to perform a read and wait for the memory to complete the access.
• Write the cache entry.
• Restart the instruction whose fetch failed.
Writing to Memory
• Cache and memory become inconsistent when data is written into the cache but not to memory: the cache coherence problem.
• Strategies to handle inconsistent data:
  • Write-through
    • Always write to memory and cache simultaneously.
    • A write to memory is ~100 times slower than a write to the (L1) cache.
  • Write-back
    • Write to the cache and mark the block as “dirty”.
    • The write to memory occurs later, when the dirty block is cast out of the cache to make room for another block.
Writing to Memory: Write-Back
• Write-back (or copy-back) writes only to the cache but sets a “dirty bit” in the block where the write is performed.
• When a block with its dirty bit “on” is to be overwritten in the cache, it is first written to memory.
• “Unnecessary” writes may occur for both write-through and write-back:
  • Write-through has extra writes because each store instruction causes a transaction to memory (e.g. eight 32-bit transactions versus one 32-byte burst transaction for a cache line).
  • Write-back has extra writes because unmodified words in a cache line get written back along with the modified ones.
  • The penalty for write-through is much greater, thus write-back is far more popular.
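To make the write-traffic comparison concrete, here is a minimal Python sketch that counts memory write transactions for a sequence of stores. It is a simplified model with hypothetical names: a direct-mapped cache, write-allocate on a store miss, and a final flush of all dirty blocks are assumptions, not details taken from the slides.

```python
def count_memory_writes(store_word_addresses, num_blocks, words_per_block):
    """Return (write-through transactions, write-back transactions)."""
    stores = list(store_word_addresses)
    write_through = len(stores)         # every store goes to memory

    tags = [None] * num_blocks          # None marks an empty entry
    dirty = [False] * num_blocks
    write_back = 0
    for address in stores:
        block = address // words_per_block
        index = block % num_blocks
        tag = block // num_blocks
        if tags[index] != tag:
            if dirty[index]:
                write_back += 1         # dirty block is cast out: one block write
            tags[index] = tag           # allocate the new block (assumed policy)
        dirty[index] = True             # the store modifies the cached block
    write_back += sum(dirty)            # assumed final flush of dirty blocks
    return write_through, write_back


# Eight stores to consecutive words of one 8-word (32-byte) cache line.
traffic = count_memory_writes(range(8), num_blocks=4, words_per_block=8)
```

For this sequence the model gives eight word-sized write-through transactions versus a single block write for write-back, matching the 32-byte cache-line example above.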
Cache Hierarchy
• Average access time = T1 + (1 − h1) [ T2 + (1 − h2) Tm ]
• Where
  • T1 = L1 cache access time (smallest)
  • T2 = L2 cache access time (small)
  • Tm = memory access time (large)
  • h1, h2 = hit rates (0 ≤ h1, h2 ≤ 1)
• Adding a cache reduces the average access time.
[Figure: processor, then L1 cache (SRAM, access time = T1), then L2 cache (DRAM, access time = T2), then main memory (large, inexpensive, slow; access time = Tm).]
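Average Access Time
• Average access time = T1 + (1 − h1) [ T2 + (1 − h2) Tm ], with T1 < T2 < Tm
[Figure: access time plotted against the L1 miss rate 1 − h1, from 0 (h1 = 1) to 1 (h1 = 0), with one line for each of h2 = 0, h2 = 0.5, and h2 = 1. All lines start at T1 for miss rate 0. At miss rate 1 they reach T1 + T2 + Tm (h2 = 0), T1 + T2 + Tm/2 (h2 = 0.5), and T1 + T2 (h2 = 1).]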
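The average access time formula can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the timing values (1, 10, and 100 time units) are made-up numbers chosen to show the end points of the plot.

```python
def average_access_time(t1, t2, tm, h1, h2):
    """T1 + (1 - h1) * [T2 + (1 - h2) * Tm] for a two-level cache hierarchy."""
    return t1 + (1 - h1) * (t2 + (1 - h2) * tm)


# Hypothetical timings with T1 < T2 < Tm, in arbitrary time units.
T1, T2, TM = 1, 10, 100

always_l1_hit = average_access_time(T1, T2, TM, h1=1, h2=0)      # T1
no_hits_at_all = average_access_time(T1, T2, TM, h1=0, h2=0)     # T1 + T2 + Tm
half_l2_hits = average_access_time(T1, T2, TM, h1=0, h2=0.5)     # T1 + T2 + Tm/2
always_l2_hit = average_access_time(T1, T2, TM, h1=0, h2=1)      # T1 + T2
```

The access time grows linearly with the L1 miss rate, and the slope of each line is T2 + (1 − h2) Tm, so a better L2 hit rate flattens the line.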
Processor Performance Without Cache
• 5 GHz processor, cycle time = 0.2 ns
• Memory access time = 100 ns = 500 cycles
• Ignoring memory access, Clocks Per Instruction (CPI) = 1
• Assuming no memory data access, so that each instruction stalls only for its own fetch from memory: CPI = 1 + # stall cycles = 1 + 500 = 501
Performance with Level 1 Cache
• Assume hit rate, h1 = 0.95
• L1 access time = 0.2 ns = 1 cycle
• CPI = 1 + # stall cycles = 1 + 0.05 × 500 = 26
• Processor speed increase due to cache = 501/26 = 19.3
Performance with L1 and L2 Caches
• Assume:
  • L1 hit rate, h1 = 0.95
  • L2 hit rate, h2 = 0.90 (this is very optimistic!)
  • L2 access time = 5 ns = 25 cycles
• CPI = 1 + # stall cycles = 1 + 0.05 × (25 + 0.10 × 500) = 1 + 3.75 = 4.75
• Processor speed increase due to both caches = 501/4.75 = 105.5
• Speed increase due to L2 cache = 26/4.75 = 5.47
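The three CPI figures (no cache, L1 only, L1 plus L2) can be recomputed with a short Python sketch. Variable names are hypothetical; miss rates are written directly as 0.05 and 0.10, i.e. 1 − h1 and 1 − h2.

```python
CLOCK_GHZ = 5
MEMORY_NS = 100
L2_NS = 5

memory_cycles = MEMORY_NS * CLOCK_GHZ   # 100 ns at 5 cycles per ns
l2_cycles = L2_NS * CLOCK_GHZ           # 5 ns at 5 cycles per ns

BASE_CPI = 1
L1_MISS_RATE = 0.05                     # 1 - h1, with h1 = 0.95
L2_MISS_RATE = 0.10                     # 1 - h2, with h2 = 0.90

# Every instruction fetch goes to memory.
cpi_no_cache = BASE_CPI + memory_cycles
# Only L1 misses go to memory.
cpi_l1 = BASE_CPI + L1_MISS_RATE * memory_cycles
# L1 misses go to L2; L2 misses go on to memory.
cpi_l1_l2 = BASE_CPI + L1_MISS_RATE * (l2_cycles + L2_MISS_RATE * memory_cycles)

speedup_l1 = cpi_no_cache / cpi_l1
speedup_both = cpi_no_cache / cpi_l1_l2
speedup_from_l2 = cpi_l1 / cpi_l1_l2
```

The speedups are ratios of CPI values, which is valid here because the clock rate and instruction count are the same in all three cases.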
Cache Miss Behavior
• If the tag bits do not match, then a miss occurs.
• Upon a cache miss:
  • The CPU is stalled.
  • The desired block of data is fetched from memory and placed in the cache.
  • Execution is restarted at the cycle that caused the cache miss.
• Recall that we have two different types of memory accesses: reads (loads) and writes (stores).
• Thus, overall we can have 4 kinds of cache events: read hits, read misses, write hits, and write misses.
Fully-Associative Placement
• One alternative to direct-mapped placement is to allow a block to fill any empty place in the cache.
• How do we then locate the block later?
  • We can associate each stored block with a tag that identifies the block’s home address in main memory.
  • When the block is needed, we can use the cache as an associative memory, using the tag to match all locations in parallel, to pull out the appropriate block.
Set-Associative Placement
• The block address determines not a single location, but a set.
  • A set is several locations, grouped together.
  • (set #) = (Block address) mod (# of sets)
• The block can be placed associatively anywhere within that set.
  • Where? This is part of the placement strategy.
• If there are n locations in each set, the scheme is called “n-way set-associative”.
  • Direct mapped = 1-way set-associative.
  • Fully associative = there is only 1 set.
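As a sketch of how the set number narrows down where a block may be placed, the Python function below lists the candidate cache locations for a block. The function name and the location numbering (set number × ways + way) are assumptions made for illustration.

```python
def candidate_locations(block_address, num_blocks, ways):
    """Cache locations where a block may be placed in an n-way set-associative cache."""
    num_sets = num_blocks // ways
    set_number = block_address % num_sets
    # Assumed numbering: the locations of a set are stored contiguously.
    return [set_number * ways + way for way in range(ways)]


direct = candidate_locations(13, num_blocks=8, ways=1)   # one possible location
two_way = candidate_locations(13, num_blocks=8, ways=2)  # either location of set 1
fully = candidate_locations(13, num_blocks=8, ways=8)    # anywhere in the cache
```

Raising the associativity from 1 to the number of blocks moves the same 8-block cache from direct-mapped to fully-associative placement, with the number of candidate locations equal to the number of ways.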