Recap: Memory Hierarchy

Recap: Memory Hierarchy

Processor Secondary Storage (Disk) Control Main Memory (DRAM) L2 Off-Chip Cache L1 On-Chip Cache Datapath Registers Speed: Size: Cost: Memory Hierarchy - the Big Picture • Problem: memory is too slow and or too small • Solution: memory hierarchy Slowest Fastest Biggest Smallest Lowest Highest

Probability of reference Address Space 0 2n - 1 Why Hierarchy Works • The principle of locality • Programs access a relatively small portion of the address space at any instant of time. • Temporal locality: recently accessed instruction/data is likely to be used again • Spatial locality: instruction/data near recently accessed /instruction data is likely to be used soon • Result: the illusion of large, fast memory

D C[99] C[98] C[97] C[96] . . . . . . . . . . . . . . C[7] C[6] C[5] C[4] C[3] C[2] C[1] C[0] . . . . . . . . . . . . . . B[11] B[10] B[9] B[8] B[7] B[6] B[5] B[4] B[3] B[2] B[1] B[0] A[99] A[98] A[97] A[96] . . . . . . . . . . . . . . A[7] A[6] A[5] A[4] A[3] A[2] A[1] A[0] Example of Locality int A[100], B[100],C[100],D; for (i=0; i<100; i++) { C[i] = A[i] * B[i] + D; }

1. Where can block be placed in cache? (block placement) 2. How can block be found in cache? …using a tag (block identification) 3. Which block should be replaced on a miss? (block replacement) 4. What happens on a write? (write strategy) Four Key Cache Questions:

Q1: Block Placement • Where can block be placed in cache? • In one predetermined place - direct-mapped • Use fragment of address to calculate block location in cache • Compare cache block with tag to test if block present • Anywhere in cache - fully associative • Compare tag to every block in cache • In a limited set of places - set-associative • Use address fragment to calculate set • Place in any block in the set • Compare tag to every block in set • Hybrid of direct mapped and fully associative

Cache *0 *4 *8 *C 00 04 08 0C 10 14 18 1C 20 24 28 2C 30 34 38 3C 40 44 48 4C Memory Direct Mapped Block Placement address maps to block: location = (block address MOD # blocks in cache)

0x0F 00000 00000 0x55 11111 0xAA 0xF0 11111 Direct Mapping Index Tag Data 0 00000 0 0x55 0x0F 1 00000 1 00001 0 • Direct mapping: • A memory value can only be placed at a single corresponding location in the cache 11111 0 0xAA 0xF0 11111 1

Fully Associative Block Placement Cache arbitrary block mapping location = any 00 04 08 0C 10 14 18 1C 20 24 28 2C 30 34 38 3C 40 44 48 4C Memory

0x0F 0x0F 0000 0000 0000 0000 0x55 0x55 1111 1111 0xAA 0xAA 0xF0 0xF0 1111 1111 000000 0x55 0x0F 000001 000110 111110 0xAA 0xF0 111111 Fully Associative Mapping Tag Data 000000 0x55 0x0F 000001 000110 • Fully-associative mapping: • A memory value can be anywhere in the cache 111110 0xAA 0xF0 111111

Set-Associative Block Placement Cache *0 *0 *4 *4 *8 *8 *C *C address maps to set: location = (block address MOD # sets in cache)(arbitrary location in set) Set 3 Set 2 Set 1 Set 0 00 04 08 0C 10 14 18 1C 20 24 28 2C 30 34 38 3C 40 44 48 4C Memory

0x0F 0000 0000 0x55 1111 0xAA 0xF0 1111 Set Associative Mapping (2-Way) Way Way 1 Way 0 Index Data Tag 0 0000 00 0x55 0x0F 1 0000 01 0001 10 • Set-associative mapping: • A memory value can be placed in any of a set of corresponding locations in the cache 1111 10 0xAA 0xF0 1111 11

Q2: Block Identification • Every cache block has an address tag and index that identifies its location in memory • Hit when tag and index of desired word match(comparison by hardware) • Q: What happens when a cache block is empty?A: Mark this condition with avalid bit Valid Tag/index Data 1 0x00001C0 0xff083c2d

Byte Offset V Tag Data 1 0x00001C0 0xff083c2d 0 1 0x0000000 0x00000021 1 0x0000000 0x00000103 0 0 1 0 0x23F0210 0x00000009 Direct-Mapped Cache Design Cache Index DATA HIT =1 ADDRESS Tag 0x0000000 3 0 ADDR CACHE SRAM DATA[59] DATA[58:32] DATA[31:0] =

Set Associative Cache Design • Key idea: • Divide cache into sets • Allow block anywhere in a set • Advantages: • Better hit rate • Disadvantage: • More tag bits • More hardware • Higher access time A Four-Way Set-Associative Cache (Fig. 7.17)

tag 11110111 data 1111000011110000101011 Fully Associative Cache Design • Key idea: set size of one block • 1 comparator required for each block • No address decoding • Practical only for small caches due to hardware demands tag in 11110111 data out 1111000011110000101011 = tag 00011100 data 0000111100001111111101 = = tag 11110111 data 1111000011110000101011 = tag 11111110 data 0000000000001111111100 = tag 00000011 data 1110111100001110000001 = tag 11100110 data 1111111111111111111111

Cache Replacement Policy • Random • Replace a randomly chosen line • LRU (Least Recently Used) • Replace the least recently used line

E G D C C A C E D C C B A D E A B D A D LRU Policy MRU-1 LRU+1 LRU MRU A B C D Access C Access D Access E MISS, replacement needed Access C MISS, replacement needed Access G

Cache Write Strategies • Need to keep cache consistent with the main memory • Reads are easy - require no modification • Writes- when does the update occur • Write Though: Data is written to both the cache block and to a block of main memory. • The lower level always has the most updated data; an important feature for I/O and multiprocessing. • Easier to implement than write back. • Write back: Data is written or updated only to the cache block. The modified or dirty cache block is written to main memory when it’s being replaced from cache. • Writes occur at the speed of cache • Uses less memory bandwidth than write through.

Write-through Policy 0x1234 0x1234 0x1234 0x5678 0x5678 0x1234 Processor Cache Memory

Write-back Policy 0x1234 0x1234 0x1234 0x5678 0x9ABC 0x5678 0x1234 0x5678 Processor Cache Memory

Cache Processor DRAM Write Buffer Write Buffer for Write Through • A Write Buffer is needed between the Cache and Memory • Processor: writes data into the cache and the write buffer • Memory controller: write contents of the buffer to memory • Write buffer is just a FIFO: • Typical number of entries: 4 • Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

Processor Control Processor Unified Level One Cache L1 Control Datapath Registers L1 I-cache Datapath Registers L1 D-cache Unified vs.Separate Level 1 Cache • Unified Level 1 Cache (Princeton Memory Architecture). A single level 1 cache is used for both instructions and data. • Separate instruction/data Level 1 caches (Harvard Memory Architecture): The level 1 (L1) cache is split into two caches, one for instructions (instruction cache, L1 I-cache) and the other for data (data cache, L1 D-cache). Separate Level 1 Caches (Harvard Memory Architecture) Unified Level 1 Cache (Princeton Memory Architecture)

Recap: Memory Hierarchy