CSCI 2510 Computer Organization
Memory System II: Cache In Action
Cache-Main Memory Mapping • A way to record which part of the Main Memory is currently in the cache • Design concerns: • Be Efficient: fast determination of cache hits/misses • Be Effective: make full use of the cache; increase the probability of cache hits • In the following discussion, we assume: • 16-bit address, i.e. 2^16 = 64K bytes in the Main Memory • One Word = One Byte • Synonym: Cache line === Cache block
[Figure: CPU (fastest) ↔ Cache 64KB (fast) ↔ Main Memory 64KB (slow)]
Imagine: Trivial Conceptual Case • Cache size == Main Memory size == 64KB • Trivial one-to-one mapping • Do we need the Main Memory any more?
[Figure: 128-block cache with a tag per block; 4096-block Main Memory, with many memory blocks mapping onto each cache block]
Reality: Cache Block/Cache Line • Cache size is much smaller than the Main Memory size • A block in the Main Memory maps to a block in the Cache • Many-to-One Mapping
[Figure: direct mapping of 4096 Main Memory blocks onto 128 cache blocks; 16-bit Main Memory address = 5-bit tag | 7-bit cache block number | 4-bit word/byte address within block; the upper 12 bits form the Main Memory block number]
Direct Mapping • Direct mapped cache • Block j of Main Memory maps to block (j mod 128) of the Cache [same colour in figure] • Cache hit occurs if the tag matches the desired address • There are 2^4 = 16 words (bytes) in a block • 2^7 = 128 Cache blocks • 2^12 = 2^7 x 2^5 = 4096 Main Memory blocks
[Figure: same direct-mapping layout; 16-bit address = 5-bit tag | 7-bit cache block number | 4-bit word/byte offset]
E.g. CPU is looking for [A7B4]: MAR = 1010011110110100 • Go to cache block 1111011 • See if the tag is 10100 • If YES, cache hit! Get the word/byte 0100 in cache block 1111011
Direct Mapping • Memory address divided into 3 fields • Main Memory block number determines the position of the block in the cache • Tag used to keep track of which block is in the cache (as many MM blocks can map to the same position in the cache) • The last 4 bits of the address select the target word in the block • Given an address t,b,w (16-bit) • See if it is already in the cache by comparing t with the tag in block b • If not, cache miss! Replace the current block at b with a new one from memory block t,b (12-bit)
Associative Mapping • In direct mapping, a Main Memory block is restricted to reside in a given position in the Cache (determined by mod) • Associative mapping allows a MM block to reside in an arbitrary Cache block location • In this example, all 128 tag entries must be compared with the address Tag in parallel (by hardware) • Tag width is 12 bits
[Figure: fully associative cache; 16-bit Main Memory address = 12-bit tag | 4-bit word/byte offset]
E.g. CPU is looking for [A7B4]: MAR = 1010011110110100 • See if the tag 101001111011 matches one of the 128 cache tags • If YES, cache hit! Get the word/byte 0100 in the BINGO cache block
Set Associative Mapping • Combination of direct and associative • [Same colour] Main Memory blocks 0, 64, 128, …, 4032 map to cache set 0 and can occupy either of the 2 positions within that set • A cache with k blocks per set is called a k-way set associative cache • (j mod 64) derives the Set Number • Within the target Set, compare the 2 tag entries with the address Tag in parallel (by hardware) • Tag width is 6 bits
[Figure: 64 sets of 2 blocks each; 16-bit Main Memory address = 6-bit tag | 6-bit set number | 4-bit word/byte offset]
[Figure: same 2-way set-associative layout; 16-bit address = 6-bit tag | 6-bit set number | 4-bit word/byte offset]
Set Associative Mapping • E.g. 2-Way Set Associative: CPU is looking for [A7B4]: MAR = 1010011110110100 • Go to cache Set 111011 (59 decimal), i.e. Block 1110110 (118 decimal) and Block 1110111 (119 decimal) • See if ONE of the TWO tags in Set 111011 is 101001 • If YES, cache hit! Get the word/byte 0100 in the BINGO cache block • Tag width is 6 bits
Replacement Algorithms • Direct mapped cache • Position of each block is fixed • Whenever replacement is needed (i.e. cache miss, a new block to load), the choice is obvious and thus no "replacement algorithm" is needed • Associative and Set Associative • Need to decide which block to replace (thus keep/retain blocks likely to be used again in the near future) • One strategy is least recently used (LRU) • e.g. for a 4-block/set cache, use a log2(4) = 2-bit counter for each block • Reset the counter to 0 whenever the block is accessed; counters of the other blocks in the same set are incremented • On a cache miss, replace/uncache the block whose counter has reached 3 • Another is random replacement (choose a random block) • Advantage is that it is easier to implement at high speed
Cache Example • Assume separate instruction and data caches; we consider only the data cache • Cache has space for 8 blocks • A block contains one 16-bit word (i.e. 2 bytes) • A[10][4] is an array of words located at 7A00-7A27, stored in row-major order • The program normalizes the first column of the array (a vertical vector) by its average

short A[10][4];
int sum = 0;
int j, i;
double mean;
// forward loop
for (j = 0; j <= 9; j++)
    sum += A[j][0];
mean = sum / 10.0;
// backward loop
for (i = 9; i >= 0; i--)
    A[i][0] = A[i][0] / mean;
Cache Example • Memory word addresses (hex and binary) of the array contents (40 elements):
A[0][0] (7A00)  0111 1010 0000 0000
A[0][1] (7A01)  0111 1010 0000 0001
A[0][2] (7A02)  0111 1010 0000 0010
A[0][3] (7A03)  0111 1010 0000 0011
A[1][0] (7A04)  0111 1010 0000 0100
…
A[9][0] (7A24)  0111 1010 0010 0100
A[9][1] (7A25)  0111 1010 0010 0101
A[9][2] (7A26)  0111 1010 0010 0110
A[9][3] (7A27)  0111 1010 0010 0111
To simplify the discussion in this example: 16-bit word addresses (byte addresses are not shown) • One item == One block == One word == Two bytes
Tag fields: • Direct Mapped: 8 blocks in cache, so the low 3 bits encode the cache block number • Set-Associative: 4 blocks/set, 2 cache sets, so 1 bit encodes the cache set number • Associative: the whole word address serves as the tag
Direct Mapping (tags not shown but are needed) • The least significant 3 bits of the address determine the location in the cache • No replacement algorithm is needed in Direct Mapping • When i = 9 and i = 8, we get cache hits (2 hits in total) • Only 2 of the 8 cache positions are used • Very inefficient cache utilization
Associative Mapping (tags and LRU counters not shown but are needed) • LRU replacement policy: we get cache hits for i = 9, 8, …, 2 • If the i loop were a forward one, we would get no hits!
Set Associative Mapping (tags and LRU counters not shown but are needed) • Since all accessed blocks have even addresses (7A00, 7A04, 7A08, …), they all map to set 0, i.e. only half of the cache is used • LRU replacement policy: we get hits for i = 9, 8, 7 and 6 • Random replacement would have better average performance • If the i loop were a forward one, we would get no hits!
Comments on the Example • In this example, Associative is best, then Set-Associative, and lastly Direct Mapping • What are the advantages and disadvantages of each scheme? • In practice: • low hit rates like those in the example are very rare • Set-Associative with an LRU replacement scheme is the usual choice • Larger blocks and more blocks greatly improve the cache hit rate, i.e. more cache memory
Real-life Example: Intel Core 2 Duo
[Figure: two cores, each with its own processing unit, 32KB L1 instruction cache and 32KB L1 data cache; both cores share a 4MB L2 cache, connected via the system bus to main memory and input/output]
Real-life Example: Intel Core 2 Duo
Number of processors: 1
Number of cores: 2 per processor
Number of threads: 2 per processor
Name: Intel Core 2 Duo E6600
Code Name: Conroe
Specification: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Technology: 65 nm
Core Speed: 2400 MHz
Multiplier x Bus speed: 9.0 x 266.0 MHz = 2400 MHz
Front-Side-Bus speed: 4 x 266.0 MHz = 1066 MHz
Instruction sets: MMX, SSE, SSE2, SSE3, SSSE3, EM64T
L1 Data cache: 2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache: 2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache: 4096 KBytes, 16-way set associative, 64-byte line size
[Figure: (a) consecutive words in a module — MM address = k-bit module number | m-bit address in module; (b) consecutive words in consecutive modules — MM address = m-bit address in module | k-bit module number; each of the 2^k modules has its own ABR and DBR]
Memory Module Interleaving • Processor and cache are fast; main memory is slow • Try to hide access latency by interleaving memory accesses across several memory modules • Each memory module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR) • Which scheme above can be better interleaved?
Memory Module Interleaving • Memory interleaving can be realized in technology such as "Dual Channel Memory Architecture" • Two or more compatible (identical is best) memory modules are used in matching banks • Within a memory module, in fact, 8 chips are used in "parallel" to achieve an 8 x 8 = 64-bit memory bus; this is also a kind of memory interleaving
Memory Interleaving Example • Suppose we have a cache read miss and need to load from main memory • Assume a cache with an 8-word block (cache line size = 8 bytes) • Assume it takes one clock to send an address to the DRAM memory and one clock to send data back • In addition, the DRAM has a 6-cycle latency for the first word; each subsequent word in the same row takes only 4 cycles • Read 8 bytes without interleaving: 1 + (1x6) + (7x4) + 1 = 36 cycles • For the 4-module interleaved scheme: 1 + (1x6) + (1x8) = 15 cycles • Transfer of the first 4 words is overlapped with access of the second 4 words
[Timing diagram: single read = 1 cycle address + 6 cycles latency + 1 cycle data; each further word in the same row adds 4 + 1 cycles]
Memory Without Interleaving • Single Memory Read: 1 + 6 + 1 = 8 cycles • Read 8 bytes (non-interleaving): 1 + 1x6 + 7x4 + 1 = 36 cycles
[Timing diagram: the four modules' 6-cycle (then 4-cycle) accesses overlap, so data words return back-to-back on the bus]
Four Memory Modules Interleaved • Read 8 bytes (interleaving): 1 + 6 + 1x8 = 15 cycles
Cache Hit Rate and Miss Penalty • Goal is to have a big memory system with the speed of the cache • High cache hit rates (> 90%) are essential • Miss penalty must also be reduced • Example: • h is the hit rate • M is the miss penalty • C is the cache access time • Average memory access time = h x C + (1 - h) x M
Cache Hit Rate and Miss Penalty • Optimistically assume 8 cycles to read a single memory item; 15 cycles to load an 8-byte block from main memory (previous example); cache access time = 1 cycle • For every 100 instructions, statistically 30 instructions are data reads/writes • Instruction fetch: 100 memory accesses, assume hit rate = 0.95 • Data read/write: 30 memory accesses, assume hit rate = 0.90 • Execution cycles without cache = (100 + 30) x 8 = 1040 • Execution cycles with cache = 100(0.95x1 + 0.05x15) + 30(0.9x1 + 0.1x15) = 242 • Ratio = 1040/242 ≈ 4.30, i.e. the cache speeds up execution more than 4 times!
Summary • Cache Organizations: • Direct, Associative, Set-Associative • Cache Replacement Algorithms: • Random, Least Recently Used • Memory Interleaving • Cache Hit Rate and Miss Penalty