
Memory & Cache


williamv



  1. Memory & Cache

  2. Memories: Review • Memory is required for storing • Data • Instructions • Different memory types • Dynamic RAM • Static RAM • Read-only memory (ROM) • Characteristics • Access time • Price • Volatility

  3. Principle of Locality • Users want • indefinitely large memory • fast access to data items in the memory. • Principle of locality • Temporal locality: If an item is referenced, it will tend to be referenced again soon. • Spatial locality: If an item is referenced, items whose addresses are close by will tend to be referenced soon. • To take advantage of the principle of locality • the memory of a computer is implemented as a memory hierarchy.

  4. Comparing Memories

  5. Memory Hierarchy [Figure: the CPU at the top of a hierarchy of memory levels. Moving away from the CPU, speed goes from fastest to slowest, cost ($/bit) from highest to lowest, and size from smallest to largest.]

  6. Organization of the Hierarchy • Data in a memory level closer to the processor is a subset of data in any level further away. • All the data is stored in the lowest level.

  7. Access to the Data • Data transfer takes place between two adjacent layers. • The minimum unit of information is called a block. • If the data requested by the processor appears in some block in the upper level, this is called a hit. Otherwise a miss occurs. • Hit rate or hit ratio is the fraction of memory accesses found in the upper level. • It is used to measure the performance of the memory hierarchy. • Miss rate is the fraction of memory accesses not found in the upper memory level ( = 1 – hit rate).

  8. Hit & Miss • Hit time is the time to access the upper level of the memory hierarchy, • which includes the time needed to determine whether the access is a hit or a miss. • Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, • plus the time to deliver this block to the processor. • Hit time is much smaller than the miss penalty. • Read from register: one cycle • Read from 1st-level cache: one-two cycles • Read from 2nd-level cache: four-five cycles • Read from main memory: 20-100 cycles

  9. Memory Pyramid [Figure: a pyramid of levels in the memory hierarchy below the CPU, Level 1 through Level n; distance from the CPU in terms of access time increases, and the size of the memory at each level grows, toward the bottom.]

  10. Taking Advantage of Locality • Temporal Locality: keeping the recently accessed items closer to the processor. • Usually in a fast memory called cache. • Spatial Locality: Moving blocks consisting of multiple contiguous words in memory to upper levels of the hierarchy.

  11. The Basics of Cache [Figure: CPU – cache – main memory] • Cache is a term used to refer to any storage taking advantage of locality of access. • In general, it is the fast memory between the CPU and main memory. • First appeared in machines in the early 1960s. • Virtually every general-purpose machine built today, from the fastest to the slowest, includes a cache.

  12. Cache Example [Figure: cache contents before and after the reference to Xn] • The cache holds X1, X2, …, Xn-1 • Access to word Xn • It is a miss • Xn is brought from memory into the cache

  13. Direct-Mapped Cache • Two issues involved: • How do we know if a data item is in the cache? • If it is, how do we find it? • Direct-mapped cache • Each memory location is mapped to exactly one location in the cache. • Many items at the lower level share locations in the cache • The mapping is simple: (Block address) mod (number of blocks in the cache)

  14. Direct-Mapped Cache [Figure: main-memory words at addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 mapping into an 8-entry cache (indices 000–111): addresses ending in 001 map to cache entry 001, addresses ending in 101 to entry 101.]

  15. Fields in the Cache • If the number of blocks in the cache is a power of two, then • the lower log2(cache size in blocks) bits of the address are used as the cache address. • The remaining upper bits are used as the tag to identify whether the requested block is in the cache • Memory address = tag || cache address • A valid bit is used to indicate whether a location in the cache contains a valid entry (e.g., at startup it does not).
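The tag/index split described above can be sketched as code (a minimal illustration, not part of the slides; the function name is made up):

```python
# Minimal sketch (not from the slides): splitting an address into
# tag and cache index for a direct-mapped cache with one-word blocks
# and a power-of-two number of blocks.

def split_address(addr: int, cache_blocks: int):
    """Return (tag, cache index) for a direct-mapped cache."""
    index_bits = cache_blocks.bit_length() - 1  # log2(cache size in blocks)
    index = addr % cache_blocks                 # lower index_bits of the address
    tag = addr >> index_bits                    # remaining upper bits
    return tag, index

# 5-bit address 10110 in an 8-block cache: tag = 10, index = 110
print(split_address(0b10110, 8))   # → (2, 6)
```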

  16. Ex: 8-word Direct-Mapped Cache

  17. Ex: 8-word Direct-Mapped Cache The initial state of the cache (all entries invalid) Address of the memory reference: 10110 => MISS After handling the miss

  18. Ex: 8-word Direct-Mapped Cache Address of the memory reference: 11010 => ?

  19. Ex: 8-word Direct-Mapped Cache Address of the memory reference: 10110 => ? Address of the memory reference: 11010 => ? Address of the memory reference: 10000 => ?

  20. Ex: 8-word Direct-Mapped Cache Address of the memory reference: 00011 => ?

  21. Ex: 8-word Direct-Mapped Cache Address of the memory reference: 10000 => ? Address of the memory reference: 10010 => ?
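The hit/miss outcomes asked for on slides 17–21 can be checked with a short simulation of the 8-word cache (a sketch under the slides' assumptions: 5-bit addresses, 1-word blocks, an initially empty cache):

```python
# Simulation (not from the slides) of the 8-word direct-mapped cache:
# index = address mod 8 (low 3 bits), tag = remaining upper bits.

def simulate(addresses, num_blocks=8):
    cache = {}                       # index -> tag; missing index means invalid
    results = []
    for addr in addresses:
        index = addr % num_blocks
        tag = addr // num_blocks
        if cache.get(index) == tag:
            results.append("HIT")
        else:
            results.append("MISS")
            cache[index] = tag       # handle the miss: load the block
    return results

refs = [0b10110, 0b11010, 0b10110, 0b11010, 0b10000, 0b00011, 0b10000, 0b10010]
print(simulate(refs))
# → ['MISS', 'MISS', 'HIT', 'HIT', 'MISS', 'MISS', 'HIT', 'MISS']
```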

  22. A More Realistic Cache • 32-bit data, 32-bit address • Cache size is 1 K (=1024) words • Block size is 1 word • 1 word is 32 bits. • cache index size = ? • tag size = ? • 2-bit byte offset • A valid bit

  23. A More Realistic Cache [Figure: the 1 K-word direct-mapped cache. Address bits 31–12 form the 20-bit tag, bits 11–2 the 10-bit cache index, bits 1–0 the byte offset. Each of the 1024 entries holds a valid bit, a 20-bit tag, and 32 bits of data; a comparator checks the stored tag against the address tag to generate the hit signal.]

  24. Cache Size • A formula for computing cache size: 2^n × (block size + tag size + 1), where 2^n is the number of blocks in the cache. • Example: Size of a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address? • 64 KB = 2^14 words = 2^14 blocks • Tag size is 32 – 14 – 2 = 16 bits • Valid bit: 1 bit • Total bits in the cache: 2^14 × (32 + 16 + 1) = 802,816 bits
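The formula can be turned into a small helper (a sketch; the function name and the fixed 2-bit byte offset are assumptions taken from this slide):

```python
# Sketch of the slide's formula: total bits = 2^n * (block size + tag size + 1),
# where n is the number of index bits and the +1 is the valid bit.

def cache_total_bits(index_bits, block_bits=32, addr_bits=32, byte_offset_bits=2):
    tag_bits = addr_bits - index_bits - byte_offset_bits
    return 2**index_bits * (block_bits + tag_bits + 1)

# 64 KB of data in one-word blocks = 2^14 blocks => 14 index bits
print(cache_total_bits(14))   # → 802816
```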

  25. Handling Cache Misses • When the requested data is found in the cache, the processor continues its normal execution. • Cache misses are handled by the CPU control unit together with a separate cache controller. • When a cache miss occurs: • Stall the processor • Activate the memory controller • Get the requested data item from memory • Load it into the cache • Continue as if it were a hit.

  26. Read & Write • Read misses • stall the CPU, • fetch the requested block from memory, • deliver it to the cache, and • restart execution • Write hits & misses: • Inconsistency between cache and memory • replace data in both cache and memory (write-through), or • write the data only into the cache (write the memory back later)

  27. Write-Through Scheme • A memory write takes an additional 100 cycles • In the SPEC2000Int benchmark • 10% of all instructions are stores, and the CPI without cache misses is about 1.17. • With cache misses, CPI = 1.17 + 0.1 × 100 = 11.17 • A write buffer can store the data while it is waiting to be written to memory. • Meanwhile, the processor can continue execution. • If the rate at which the processor generates writes is higher than the rate at which the memory system can accept them, then buffering is not a solution.

  28. Write-Back Scheme • When a write occurs, the new value is written only to the block in the cache. • The modified block in the cache is written to memory when it is replaced. • The write-back scheme is especially useful when the processor generates writes faster than they can be handled by the main memory. • Write-back schemes are more complicated to implement.

  29. Unified vs. Split Cache • For instruction and data caches, there are two approaches: • Split caches: • Higher miss rate due to their smaller sizes • Higher bandwidth due to separate datapaths • No conflict when accessing instructions and data at the same time • Unified cache: • Lower miss rate thanks to larger size • Lower bandwidth due to a single datapath • Possible stalls due to simultaneous accesses to data and instructions.

  30. Taking Advantage of Spatial Locality • The cache we described so far takes advantage of temporal locality, but not spatial locality. • Basic idea: whenever we have a miss, load a group of adjacent memory cells into the cache (i.e. have blocks longer than one word and transfer an entire block from memory to cache on a cache miss). • Block mapping: cache index = (block address) % (# of blocks in cache)

  31. An Example Cache • The Intrinsity FastMATH processor • Embedded processor • Uses the MIPS architecture • 12-stage pipeline • Separate instruction and data caches • Each cache is 16 KB (4 K words) • 16-word blocks • Tag size = ?

  32. Intrinsity FastMATH processor [Figure: the cache datapath. Address bits 31–14 form the 18-bit tag, bits 13–6 the 8-bit cache index, bits 5–2 the 4-bit block offset, bits 1–0 the byte offset. Each of the 256 entries holds a valid bit, an 18-bit tag, and a 16-word data block; a tag comparator generates the hit signal and a multiplexor selects the requested word from the block.]

  33. 16-Word Cache Blocks • Tag: [31–14]  Index: [13–6]  Block offset: [5–2]  Byte offset: [1–0] • Example: What is the block address that byte address 1800 corresponds to? • Block address = (byte address) / (bytes per block) = 1800 / 64 = 28 (integer division; 16 words × 4 bytes = 64 bytes per block)
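The worked example above can be checked directly (plain arithmetic, assuming only the 16-word block and the 256-block FastMATH cache from the previous slide):

```python
# Byte address 1800 with 16-word (64-byte) blocks:
bytes_per_block = 16 * 4                 # 16 words of 4 bytes each
block_address = 1800 // bytes_per_block  # integer division
cache_index = block_address % 256        # the FastMATH cache has 256 blocks
print(block_address, cache_index)        # → 28 28
```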

  34. Reads & Writes in Multi-Word Caches • Read misses: always bring the entire block • Write hits & misses: more complicated • Compare the tag in the cache and the upper address bits • If they match, it is a hit. Continue with write-back or write-through • If the tags are not identical, then this is a miss • Read the entire block from memory into the cache, then rewrite the cache with the word that caused the write miss. • Unlike the case with one-word blocks, write misses with multi-word blocks require reading from memory.

  35. Performance of the Caches • Intrinsity FastMATH for SPEC2000 • Instruction cache: 16 KB • Data cache: 16 KB • Effective combined miss rate for the unified cache • 3.18%

  36. Block Size • Small block size • High miss rate • Does not take full advantage of spatial locality • Short block loading time • Large block size • Low miss rate • Long time for loading the entire block • Higher miss penalty • Early restart: resume execution as soon as the requested word arrives in the cache • Critical word first: the requested word is returned first, the rest is transferred later.

  37. Miss Rate vs. Block Size [Figure: miss rate plotted against block size]

  38. Memory System to Support Cache • DRAM (Dynamic Random Access Memory) • Access time: the time between when a read is requested and when the desired word arrives in the CPU. • A hypothetical memory access time: • 1 clock cycle to send the address • 15 clock cycles to initiate a DRAM access (for each word) • 1 clock cycle to send a word of data

  39. One-Word-Wide Memory [Figure: CPU – cache – one-word-wide bus – memory] • Given a cache block of four words, the miss penalty for the one-word-wide memory organization: 1 + 4 × 15 + 4 × 1 = 1 + 60 + 4 = 65 cycles • Bandwidth (# of bytes transferred per clock cycle): (4 × 4)/65 ≈ 0.25

  40. Wide Memory Organization [Figure: CPU – multiplexor – cache – 4-word-wide bus and memory] • With main memory and bus 4 words wide, the miss penalty for a 4-word block: 1 + 15 + 1 = 17 cycles • Bandwidth: (4 × 4)/17 ≈ 0.94

  41. Interleaved Memory Organization [Figure: CPU – cache – bus – memory banks 0 to 3] • With main memory organized as 4 banks, the miss penalty for a 4-word block: 1 + 15 + 4 × 1 = 20 cycles • Bandwidth: (4 × 4)/20 = 0.80
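The three miss penalties and bandwidths above follow one pattern (1 cycle to send the address, 15 cycles per round of DRAM accesses, 1 cycle per bus transfer); a sketch, with a made-up helper name:

```python
# Sketch (not from the slides) of the miss-penalty pattern for a 4-word
# block: banks and a wide bus both cut the number of serialized 15-cycle
# DRAM access rounds; the bus width sets the number of 1-cycle transfers.

def miss_penalty(block_words, banks, bus_words):
    access_rounds = -(-block_words // (banks * bus_words))  # ceiling division
    bus_transfers = block_words // bus_words
    return 1 + access_rounds * 15 + bus_transfers * 1

for name, banks, bus in [("one-word-wide", 1, 1),
                         ("wide (4-word bus)", 1, 4),
                         ("4-way interleaved", 4, 1)]:
    p = miss_penalty(4, banks, bus)
    print(f"{name}: penalty {p} cycles, bandwidth {4 * 4 / p:.2f} bytes/cycle")
# → one-word-wide: penalty 65 cycles, bandwidth 0.25 bytes/cycle
# → wide (4-word bus): penalty 17 cycles, bandwidth 0.94 bytes/cycle
# → 4-way interleaved: penalty 20 cycles, bandwidth 0.80 bytes/cycle
```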

  42. Example 1/2 • Block size: 1 word • Memory bus width: 1 word • Miss rate: 3% • Memory accesses per instruction: 1.2, and CPI = 2 • Block size = 2 words => miss rate is 2% • Block size = 4 words => miss rate is 1% • What is the improvement in performance of interleaving two ways and four ways versus doubling the memory width and the bus, assuming access times of 1, 15, and 1 clock cycles?

  43. Example 2/2 • CPI for the one-word-wide machine • CPI = 2 + (1.2 × 3% × 17) = 2.612 • Two-word block • one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 × 2 + 1 × 2)) = 2.792 • one-word bus & memory, interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 2 × 1)) = 2.432 • two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 1)) = 2.408 • Four-word block • one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 4 + 1 × 4)) = 2.780 • one-word bus & memory, interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 + 4 × 1)) = 2.24 • two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 2 + 2 × 1)) = 2.396
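The CPI figures above can be reproduced from one formula, CPI = base CPI + (accesses per instruction × miss rate × miss penalty); a short check (the function name is made up):

```python
# Each configuration differs only in its miss rate and miss penalty.
def cpi(miss_rate, miss_penalty, base=2.0, accesses_per_instr=1.2):
    return base + accesses_per_instr * miss_rate * miss_penalty

print(round(cpi(0.03, 17), 3))   # one-word block               → 2.612
print(round(cpi(0.02, 33), 3))   # 2-word, narrow, no interl.   → 2.792
print(round(cpi(0.02, 18), 3))   # 2-word, interleaved          → 2.432
print(round(cpi(0.02, 17), 3))   # 2-word, wide bus             → 2.408
print(round(cpi(0.01, 65), 3))   # 4-word, narrow, no interl.   → 2.78
print(round(cpi(0.01, 20), 3))   # 4-word, interleaved          → 2.24
print(round(cpi(0.01, 33), 3))   # 4-word, wide bus             → 2.396
```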

  44. Improving Cache Performance • Reduce the miss rate • by reducing the probability of contention • Multilevel caching • Second- and third-level caches • also good for reducing the miss penalty

  45. Flexible Placement of Cache Blocks • Direct-mapped cache: • A memory block goes to exactly one location in the cache • Easy to find • (Block no.) % (# of blocks in the cache) • Compare the tags • Many blocks contend for the same location • Fully associative cache: • A memory block can go in any cache line • Difficult to find • Search all the tags to see if the requested block is in the cache

  46. Flexible Placement of Cache Blocks • Set-associative cache: • There is a fixed number of cache locations (at least two) where each memory block can be placed. • A set-associative cache with n locations for a block is called an n-way set-associative cache. • The minimum set size is 2. • Finding the block in the cache is easier than in a fully associative cache. • (Block no.) % (# of sets in the cache) • Tags are compared within the set.

  47. Locating Memory Blocks in the Cache [Figure: placement of the block with address 12 in an 8-block cache under three schemes. Direct mapped (blocks 0–7): the block can go only in block 12 mod 8 = 4, so one tag is searched. 2-way set-associative (sets 0–3): the block can go in either way of set 12 mod 4 = 0, so two tags are searched. Fully associative: the block can go anywhere, so all tags are searched.]

  48. Example • Consider the following successive memory accesses for direct-mapped, two-way and four-way caches of four blocks. Block length is one word. Access pattern: 0, 8, 0, 6, 8

Direct-mapped cache:
Address | Hit or Miss | Block 0   | Block 1 | Block 2   | Block 3
0       | Miss        | Memory[0] |         |           |
8       | Miss        | Memory[8] |         |           |
0       | Miss        | Memory[0] |         |           |
6       | Miss        | Memory[0] |         | Memory[6] |
8       | Miss        | Memory[8] |         | Memory[6] |

  49. Example • Memory access: 0, 8, 0, 6, 8

Two-way set-associative cache:
Address | Hit or Miss | Set 0     | Set 0     | Set 1 | Set 1
0       | Miss        | Memory[0] |           |       |
8       | Miss        | Memory[0] | Memory[8] |       |
0       | Hit         | Memory[0] | Memory[8] |       |
6       | Miss        | Memory[0] | Memory[6] |       |
8       | Miss        | Memory[8] | Memory[6] |       |

  50. Example • Memory access: 0, 8, 0, 6, 8

Fully associative cache:
Address | Hit or Miss | Block 0   | Block 1   | Block 2   | Block 3
0       | Miss        | Memory[0] |           |           |
8       | Miss        | Memory[0] | Memory[8] |           |
0       | Hit         | Memory[0] | Memory[8] |           |
6       | Miss        | Memory[0] | Memory[8] | Memory[6] |
8       | Hit         | Memory[0] | Memory[8] | Memory[6] |
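The three tables can be reproduced by a small simulator with LRU replacement within each set (a sketch, not from the slides; the function name is made up):

```python
# Sketch: run the reference stream 0, 8, 0, 6, 8 through a 4-block
# cache at three associativities, with LRU replacement per set.

def misses(refs, num_blocks, ways):
    sets = num_blocks // ways
    cache = [[] for _ in range(sets)]   # each set: blocks ordered LRU-first
    miss_count = 0
    for block in refs:
        s = cache[block % sets]
        if block in s:
            s.remove(block)             # hit: move to most-recently-used end
        else:
            miss_count += 1
            if len(s) == ways:
                s.pop(0)                # evict the least-recently-used block
        s.append(block)
    return miss_count

refs = [0, 8, 0, 6, 8]
print(misses(refs, 4, 1))   # direct-mapped       → 5 misses
print(misses(refs, 4, 2))   # 2-way set-assoc.    → 4 misses
print(misses(refs, 4, 4))   # fully associative   → 3 misses
```

Note how increasing associativity removes the conflict misses between blocks 0 and 8, which map to the same location in the direct-mapped cache.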
