
CS1104 Help Session I Memory Semester II 2001/02

CS1104 Help Session I Memory Semester II 2001/02. Colin Tan, S15-04-05, Ctank@comp.nus.edu.sg . Memory. Memory can be visualized as a stack of pigeon holes. Current computers have about 128,000,000 pigeon holes.


Presentation Transcript


  1. CS1104 Help Session I: Memory, Semester II 2001/02 • Colin Tan, S15-04-05, Ctank@comp.nus.edu.sg

  2. Memory • Memory can be visualized as a stack of pigeon holes. Current computers have about 128,000,000 pigeon holes. • Each pigeon hole is given a number, starting from 0. This number is called an “address”. • Each pigeon hole will contain either data (e.g. numbers you want to add together) or an instruction (e.g. add two numbers).

  3. Memory • Memory locations 0 to 3 contain instructions, locations 4 to 6 contain data. • Note: In reality, instructions are also encoded into numbers!

  4. Addresses • As mentioned, each pigeon hole has a number identifying it called an “address”. • When the CPU requires an instruction, it will send the instruction’s “address” to memory, and the memory will return the instruction at that address. • E.g. At IF CPU will send “0” to memory, and the memory returns li t1, 5 • At MEM CPU will send “6” to memory, and memory returns “10”. • At WB, CPU writes “10” back to t1.

  5. Addressing Bits • Computers work only in binary. • Hence the addresses generated in the previous example are also in binary! • In general, to address a maximum of n memory locations, you will need m = log2 n bits in your address (rounded up to a whole number of bits). • Conversely, if you had m bits in your address, you can access a maximum of 2^m memory locations.
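The relationship between the number of locations and the number of address bits can be checked with a short sketch. The helper names `address_bits` and `max_locations` are hypothetical, and rounding up to a whole bit is an assumption for values of n that are not powers of two.

```python
def address_bits(n_locations):
    """Smallest m such that 2**m >= n_locations (i.e. log2 n, rounded up)."""
    if n_locations < 1:
        raise ValueError("need at least one memory location")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using only integers.
    return (n_locations - 1).bit_length()


def max_locations(m_bits):
    """Largest number of locations addressable with m_bits address bits."""
    return 2 ** m_bits


# The 128,000,000 pigeon holes mentioned earlier need 27 address bits.
bits_for_pigeon_holes = address_bits(128_000_000)
```

Using integer arithmetic instead of a floating-point logarithm avoids rounding surprises near exact powers of two.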

  6. Memory Hierarchy • Motivation • Not all memory is created equal • Cheap Memory => Slow • Fast Memory => Expensive • DRAM, 70 ns access time, $1/MByte • SRAM, 8 ns access time, $50/MByte • So, you can choose either: • Have fast but very small memory, OR • Large but very slow memory.

  7. Memory Hierarchy • Memory hierarchy gives you a third option: • Large, but very fast memory • Though slower than the expensive memory mentioned earlier.

  8. Locality • “Locality” is a particular type of behavior exhibited by running programs: • Spatial locality: If a memory location has been accessed, it is very likely its neighbor will also be accessed. • Temporal locality: If a memory location has been accessed, it is very likely that it will be accessed again sometime soon.

  9. Locality - Example • Consider the following program: for(i=0; i<10; i++) a[i] = b[i] + c[i];

  10. Locality - Example • In memory it will look like this:

  11. Locality - Example • Tracing the execution of the program:

  12. Locality - Example • Focusing only on the addresses of the fetched instructions, we see that the addresses the instructions are fetched from are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, … • Here we see both: • Spatial locality (e.g. after location 0 is accessed, location 1 is accessed, then 2, etc.) • Temporal locality (e.g. location 2 is accessed 10 times!)
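The fetch sequence above can be reproduced and counted with a small sketch. It assumes, based only on the address sequence shown, that locations 0 and 1 run once as loop setup and that locations 2 to 10 form the loop body; the function name `fetch_trace` is hypothetical.

```python
from collections import Counter


def fetch_trace(iterations=10):
    """Instruction-fetch addresses for the loop, as listed on the slide."""
    setup = [0, 1]                  # assumed to execute once before the loop
    loop_body = list(range(2, 11))  # locations 2..10, repeated every iteration
    return setup + loop_body * iterations


trace = fetch_trace()
counts = Counter(trace)

# Temporal locality: how often each location is fetched again.
fetches_of_location_2 = counts[2]

# Spatial locality: how many fetches go to the neighbor of the previous fetch.
sequential_steps = sum(1 for prev, cur in zip(trace, trace[1:]) if cur == prev + 1)

# Working set: number of distinct locations touched.
distinct_locations = len(counts)
```

Out of 92 fetches, only 11 distinct locations are ever touched, which is the observation the next slide builds on.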

  13. Effect of Locality • Locality means that in the short run out of all the memory you have (perhaps up to 128,000,000 pigeon holes!), only a very small number of locations are actually being accessed! • In our example for ten iterations only memory locations 2 to 10 are being accessed out of 128,000,000 possible locations! • What if we had a tiny amount of very fast (but expensive!) memory and kept these locations in that fast memory? • We can speed up access times dramatically!! • This is the idea behind caches.

  14. How Do Caches Help? • The average time to access memory (AMAT) is given by: • AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache) • Tcache = Time to read the cache (8 ns for SRAM cache) • Tmemory = Time to read main memory (70 ns for DRAM) • miss_rate = Probability of not finding what we want in the cache (hit_rate = 1 - miss_rate). • Because of locality, miss_rate is very small • Typically about 3% to 5%. • Here, our AMAT = 0.95 * 8 ns + 0.05 * (70 + 8) ns = 11.5 ns • Our AMAT is about 43% slower than pure SRAM cache memory (11.5 ns vs. 8 ns)
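The AMAT arithmetic above can be reproduced with a minimal sketch. The function name `amat` and its parameter names are hypothetical; the formula and the 8 ns, 70 ns and 5% figures are the ones given on the slide.

```python
def amat(hit_rate, t_cache_ns, t_memory_ns):
    """Average memory access time: a miss pays for the cache probe plus main memory."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * t_cache_ns + miss_rate * (t_memory_ns + t_cache_ns)


average_ns = amat(hit_rate=0.95, t_cache_ns=8.0, t_memory_ns=70.0)

# Relative slowdown compared with a memory built purely from 8 ns SRAM.
slowdown_vs_sram = average_ns / 8.0 - 1.0

print(f"AMAT = {average_ns:.1f} ns, {slowdown_vs_sram:.0%} slower than pure SRAM")
```

Trying a 3% miss rate instead (hit_rate=0.97) shows how sensitive the average is to the miss rate.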

  15. How Do Caches Help? • What about cost? • Let’s consider: • A system with 32 MB of DRAM memory, 512 KB of SRAM cache. • Cost is $1/MB for DRAM, and $50/MB for SRAM. • If we had 32 MB of SRAM, access time is 8 ns, but cost will be $1,600 • With 32 MB of DRAM, cost is $32, but access time is 70 ns! • But with 32 MB of DRAM and 512 KB (1/2 MB) of SRAM, cost will be: $32 + (512/1024) * $50 = $57!

  16. How Do Caches Help? • So with pure SRAM, we can have 8 ns average access time at $1,600. • With pure DRAM, our memory will cost $32, but all accesses will take 70 ns! • With DRAM memory and SRAM cache, we can have 11.5 ns access time at $57. • So for a performance drop of 43%, the cost falls from $1,600 to $57, i.e. about 28 times cheaper (pure SRAM costs over 2,700% more)! • Hence caches give us large memory size (32 MB), at close to the cost of the DRAM technology ($57 vs. $32), but at close to the speed of expensive SRAM technology (11.5 ns vs. 8 ns)
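The cost comparison can be worked through in a few lines. This is a sketch using the prices quoted in the slides ($1/MB for DRAM, $50/MB for SRAM); the function name `memory_cost` is hypothetical.

```python
def memory_cost(dram_mb, sram_mb, dram_per_mb=1.0, sram_per_mb=50.0):
    """Total cost in dollars of a memory system built from DRAM and SRAM."""
    return dram_mb * dram_per_mb + sram_mb * sram_per_mb


pure_sram = memory_cost(dram_mb=0, sram_mb=32)          # 32 MB of SRAM only
pure_dram = memory_cost(dram_mb=32, sram_mb=0)          # 32 MB of DRAM only
hybrid = memory_cost(dram_mb=32, sram_mb=512 / 1024)    # DRAM plus 512 KB SRAM cache

# How many times more expensive pure SRAM is than the DRAM + cache system.
cost_ratio = pure_sram / hybrid

print(f"pure SRAM ${pure_sram:.0f}, pure DRAM ${pure_dram:.0f}, hybrid ${hybrid:.0f}")
print(f"pure SRAM costs about {cost_ratio:.0f}x the hybrid system")
```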

  17. Block Cache Architecture • Caches consist of blocks (or lines). Each block stores data from memory: • Block allocation problem: • Given data from an address A, how do we decide which block of cache its data should go to?

  18. The Block Allocation Problem • 3 possible solutions: • Data from each address A will go to a fixed block. • Direct mapped cache • Data from each address A will go to any block. • Fully associative cache • Data from address A will go to a fixed set of blocks. • Data may be put into any block within a set. • Set associative cache.

  19. Direct Mapped Caches • The value of a portion of the memory address is used to decide which block to send the data to: • Address A: Tag | Block Index | Block Offset | Byte Offset • The Block Index portion is used to decide which block data from this address should go to.

  20. Example • The number of bits in the block index is log2 N, where N is the total number of blocks. • For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the value of 00, 01, 10 or 11. • The exact value of these 2 bits will determine which block the data for that address will go to.

  21. Direct Mapped Addressing E.g. • Show how an address generated by the MIPS CPU will be divided into byte offset, block offset, block index and tag portions for the following cases: i) Block size: 1 word, 128 blocks ii) Block size: 4 words, 64 blocks • All MIPS addresses are 32-bit byte addresses (i.e. they address individual bytes in a word).

  22. Case I

  23. Case II
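The field widths for the two cases can be derived mechanically. The sketch below is not taken from the Case I and Case II slides, whose figures are missing from this transcript; it applies the log2 rule for each field (4 bytes per word, M words per block, N blocks) to a 32-bit address and assumes all sizes are powers of two. The function name `field_widths` is hypothetical.

```python
def field_widths(words_per_block, num_blocks, address_bits=32, bytes_per_word=4):
    """Split an address into tag / block index / block offset / byte offset widths.

    Assumes every size is a power of two, so (x - 1).bit_length() == log2(x).
    """
    byte_offset = (bytes_per_word - 1).bit_length()
    block_offset = (words_per_block - 1).bit_length()
    block_index = (num_blocks - 1).bit_length()
    tag = address_bits - block_index - block_offset - byte_offset
    return {
        "tag": tag,
        "block_index": block_index,
        "block_offset": block_offset,
        "byte_offset": byte_offset,
    }


case_1 = field_widths(words_per_block=1, num_blocks=128)
case_2 = field_widths(words_per_block=4, num_blocks=64)

print("Case I :", case_1)
print("Case II:", case_2)
```

Whatever bits are left over after the index and the two offsets form the tag, so the four widths always sum to the address width.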

  24. Example • The value of the two block index bits will determine which block the data will go to, following the scheme shown below: (Figure: a cache with four blocks labelled 00, 01, 10 and 11.)

  25. Solving Direct-Mapped Cache Problems • Question 7.7 • Basic formula: • Blk_Addr = floor(word_address/words_per_block) mod N • N here is the total number of blocks in the cache • This is the mathematical version of taking the value of the Block Index bits from the address.
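The basic formula translates directly into code. This is a minimal sketch; the function names are hypothetical, and the byte-address variant assumes 4-byte MIPS words.

```python
def block_address(word_address, words_per_block, num_blocks):
    """Blk_Addr = floor(word_address / words_per_block) mod N."""
    return (word_address // words_per_block) % num_blocks


def block_address_from_bytes(byte_address, words_per_block, num_blocks, bytes_per_word=4):
    """Same formula, starting from a byte address (assumes 4-byte words)."""
    return block_address(byte_address // bytes_per_word, words_per_block, num_blocks)


# Word address 22 in a cache of 8 single-word blocks lands in block 6.
single_word_example = block_address(22, words_per_block=1, num_blocks=8)

# Word address 22 in a cache of 4 four-word blocks lands in block 1.
multi_word_example = block_address(22, words_per_block=4, num_blocks=4)
```

Integer division and modulo do the same job as selecting the Block Index bits when the sizes are powers of two.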

  26. A Complication: Multiple Word Blocks • Single word blocks do not support spatial locality • Spatial locality: Likelihood of accessing neighbor of a piece of data that was just accessed is high. • But with single word blocks, none of the neighbors are in cache! • All accesses to neighbors that were not accessed before will miss!

  27. An Example: Question 7.8

  28. Accessing Individual Words • In our example, each block has 4 words. • But we always access memory 1 word at a time! (e.g. lw) • Use the Block Offset to specify which of the 4 words in a block we want to read: • Address A: Tag | Block Index | Block Offset | Byte Offset

  29. The Block Offset • Number of block offset bits = log2 M, where M is the number of words per block. • For our example, M=4. So number of block offset bits is 2. • These two bits can take on the values of 00, 01, 10 and 11. • These values determine exactly which word within a block address A is referring to. • Note that for single word blocks, the number of block offset bits is log2 1, which is 0. I.e. there are no block offset bits for single-word blocks.

  30. Who am I? Purpose of the Tag • Many different addresses may map to the same block, e.g. (Block Index portions shown highlighted): 01000 00010010 00000000 00, 01010 00010010 00000000 00, 11011 00010010 00000000 00 • All 3 addresses are different, but all map to block 00010010

  31. Disambiguation • We need a way to disambiguate the situation • Otherwise how do we know that the data in block x actually comes from address A and not from another address A’ that has the same block index bit value? • The portion of the address A to the left of the Block Index can be used for disambiguation. • This portion is called the tag, and the tag for address A is stored in the cache together with the data from address A.

  32. The Tag • When we access the cache, the Tag portion and Block Index portions of address A are extracted. • The Block Index portion will tell the cache controller which block of cache to look at. • The Tag portion is compared against the tag stored in the block. If the tags match, we have a cache hit. The data is read from the cache. (Figure: a cache with blocks 00, 01, 10 and 11, each holding a Tag and Words 00, 01, 10 and 11.)

  33. Accessing Individual Bytes • Address A: Tag | Block Index | Block Offset | Byte Offset • MIPS addresses are byte addresses, and actually index individual bytes rather than words. • Each MIPS word consists of 4 bytes. • The byte offset tells us exactly which byte within a word we are referring to.
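The tag comparison described in the last few slides can be sketched as a tiny simulator. This is an illustrative model under stated assumptions, not a hardware description: the class name `DirectMappedCache` is hypothetical, only tags are tracked (no data), and the default geometry (4 blocks of 4 words, 4 bytes per word) is borrowed from the running example.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache addressed by byte addresses."""

    def __init__(self, num_blocks=4, words_per_block=4, bytes_per_word=4):
        self.num_blocks = num_blocks
        self.block_bytes = words_per_block * bytes_per_word
        self.tags = [None] * num_blocks  # None marks an empty block

    def split(self, byte_address):
        """Return (tag, block_index); the two offsets only select data within the block."""
        block_number = byte_address // self.block_bytes  # drops block and byte offsets
        index = block_number % self.num_blocks           # Block Index bits
        tag = block_number // self.num_blocks            # everything left of the index
        return tag, index

    def access(self, byte_address):
        """Return True on a hit; on a miss, load the block and return False."""
        tag, index = self.split(byte_address)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag  # replace whatever was in this block
        return False


cache = DirectMappedCache()
results = [cache.access(a) for a in (0, 4, 64, 0)]
```

In this trace address 4 hits because it shares a block with address 0, while address 64 has the same block index but a different tag, so it evicts that block and the final access to address 0 misses again.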

  34. Advantages & Disadvantages of Direct Mapped Caches • Advantages: • Simple to implement • Fast performance • Less time to detect a cache hit => less time to get data from the cache => faster performance • Disadvantages: • Poor temporal locality. • Many addresses may map to the same block. • The next time address A is accessed, it may have been replaced by the contents of address A’.

  35. Improving Temporal Locality: The Fully Associative Cache • In the fully associative cache, data from an address A can go to any block in cache. • In practice, data will go into the first available cache block. • When the cache is full, a replacement policy is invoked to choose which block of cache to throw out.

  36. Advantages and Disadvantages: Fully Associative Cache • Advantages • Good temporal locality properties • Flexible block placement allows smart replacement policies such that blocks that are likely to be referenced again will not be replaced. E.g. LRU, LFU. • Disadvantages • Complex and too expensive for large caches • Each block needs a comparator to check the tag. • With 8192 blocks, we need 8192 comparators!

  37. A Compromise: Set Associative Caches • Represents a compromise between direct-mapped and fully associative caches. • Cache is divided into sets of blocks. • An address A is mapped directly to a set using a similar scheme as for direct mapped caches. • Once the set has been determined, the data from A may be stored in any block within a set - Fully associative within a set!

  38. Set Associative Cache • An n-way set associative cache will have n blocks per set. • For example, for a 16-block cache that is implemented as a 2-way set associative cache, each set has 2 blocks, and we have a total of 8 sets.
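The set arithmetic can be sketched as follows. The function names are hypothetical, and choosing the set with a modulo over the number of sets is an assumption that mirrors the direct-mapped formula, in line with the "similar scheme" mentioned on the previous slide.

```python
def num_sets(num_blocks, ways):
    """Number of sets in an n-way set associative cache."""
    if ways < 1 or num_blocks % ways != 0:
        raise ValueError("number of blocks must be a multiple of the associativity")
    return num_blocks // ways


def set_index(block_number, num_blocks, ways):
    """Set that a memory block maps to (assumed: block number mod number of sets)."""
    return block_number % num_sets(num_blocks, ways)


sets_in_example = num_sets(num_blocks=16, ways=2)   # the 16-block, 2-way example
direct_mapped_sets = num_sets(16, ways=1)           # 1-way behaves like direct mapped
fully_associative_sets = num_sets(16, ways=16)      # a single set holding every block
```

Seen this way, direct mapped and fully associative caches are the two extremes of the same design space.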

  39. Advantages and Disadvantages: Set Associative Cache • Advantages • Almost as simple to build as a direct-mapped cache. • Only n comparators are needed for an n-way set associative cache. For 2-way set-associative, only 2 comparators are needed to compare tags. • Supports temporal locality by having full associativity within a set.

  40. Advantages and Disadvantages: Set Associative Cache • Disadvantages • Not as good as fully-associative cache in supporting temporal locality. • For LRU schemes, because of small associativity, it is actually possible to have a 0% hit rate for temporally local data. • E.g. If our accesses are A1 A2 A3 A1 A2 A3, and if A1, A2 and A3 map to the same 2-way set, then the hit rate is 0%, as each new access replaces the least recently used block just before that block is needed again.
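The 0% hit rate in the example can be confirmed by simulating a single set with LRU replacement. This is a minimal sketch; the function name `lru_hit_rate` is hypothetical and the addresses are treated as opaque labels that all map to the same set.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, ways):
    """Hit rate of one set with `ways` blocks under LRU replacement."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for address in accesses:
        if address in resident:
            hits += 1
            resident.move_to_end(address)     # mark as most recently used
        else:
            if len(resident) >= ways:
                resident.popitem(last=False)  # evict the least recently used block
            resident[address] = True
    return hits / len(accesses)


trace = ["A1", "A2", "A3", "A1", "A2", "A3"]
two_way = lru_hit_rate(trace, ways=2)    # every access misses
four_way = lru_hit_rate(trace, ways=4)   # the second round of accesses all hit
```

With two blocks, each address is evicted just before it is reused; with four blocks in the set (an assumed larger associativity for comparison), the same trace hits half the time.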

  41. Multi-level Cache • Let the first level of cache (closest to CPU) be called “L1”, and the next level “L2”. • Let Phit_L1 be the hit rate of L1, Tcache_L1 be the cache access time of L1, Tmiss_L1 be the miss penalty of L1. • AMAT of L1 = Phit_L1 * Tcache_L1 + (1-Phit_L1) * Tmiss_L1 • What is Tmiss_L1? • If L1 misses, then we will attempt to get data from L2. Hence Tmiss_L1 is actually just the AMAT of L2! • Let Phit_L2 be the hit rate of L2, Tcache_L2 be the cache access time of L2, Tmiss_L2 be the miss penalty of L2.

  42. Multi-level Cache • Tmiss_L1 = AMAT of L2 = Phit_L2 * Tcache_L2 + (1-Phit_L2) * Tmiss_L2 • Substitute this back and we get: AMAT of L1 = Phit_L1 * Tcache_L1 + (1-Phit_L1) * (Phit_L2 * Tcache_L2 + (1-Phit_L2) * Tmiss_L2) • Tmiss_L2 is of course the time taken to access the slow DRAM memory. • What if we had an L3 cache?
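The substitution pattern generalizes to any number of levels: the miss penalty of each level is the AMAT of the level below it. The sketch below applies the slide's formula from the outermost cache inward, which also suggests one way to approach the L3 question. The function name `multilevel_amat` is hypothetical, and all hit rates and access times in the example are made-up illustrative values, not figures from the slides (except the 70 ns DRAM time).

```python
def multilevel_amat(levels, t_memory_ns):
    """AMAT seen by the CPU.

    levels: list of (hit_rate, access_time_ns) tuples ordered from L1 outward.
    Each level uses AMAT = Phit * Tcache + (1 - Phit) * Tmiss, where Tmiss is
    the AMAT of the next level (or main memory for the last cache).
    """
    miss_penalty = t_memory_ns
    for hit_rate, t_cache_ns in reversed(levels):
        miss_penalty = hit_rate * t_cache_ns + (1.0 - hit_rate) * miss_penalty
    return miss_penalty


# Assumed example values: L1 = 90% hits at 2 ns, L2 = 95% hits at 8 ns.
two_level = multilevel_amat([(0.90, 2.0), (0.95, 8.0)], t_memory_ns=70.0)

# Adding an assumed L3 (98% hits at 20 ns) between L2 and DRAM.
three_level = multilevel_amat([(0.90, 2.0), (0.95, 8.0), (0.98, 20.0)], t_memory_ns=70.0)
```

With an L3, Tmiss_L2 simply becomes the AMAT of L3 instead of the DRAM access time.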

  43. Other Problems • Question 7.9

  44. Virtual Memory Motivation • Drive space is very, very cheap • Typically about 2 cents per megabyte. • It would be ideal if we could set aside a portion of drive space to be used as memory. • Unfortunately disk drives are very slow • Fastest access time is about 10 ms, which is roughly a million times slower than SRAM (8 ns) and over a hundred thousand times slower than DRAM (70 ns). • Idea: Use drive space as memory, and main memory to cache the drive space! • This is the idea behind virtual memory.

  45. Main Idea • Virtual memory (residing on disk) is cached by main memory • Main memory is cached by system cache • All memory transfers are only between consecutive levels (e.g. VM to main memory, main memory to cache). (Figure: Virtual Memory, which is cached by Main Memory, which is cached by System Cache.)

  46. Cache vs. VM • Concept behind VM is almost identical to concept behind cache. • But different terminology! • Cache: Block VM: Page • Cache: Cache Miss VM: Page Fault • Caches implemented completely in hardware. VM implemented in software, with hardware support from CPU. • Cache speeds up main memory access, while main memory speeds up VM access.

  47. Technical Issues of VM • Relatively cheap to remedy cache misses • Miss penalty is essentially the time taken to access the main memory (around 60-80 ns). • Pipeline freezes for about 60-80 cycles. • Page Faults are EXPENSIVE! • Page fault penalty is the time taken to access the disk. • May take 50 ms or more, depending on the speed of the disk and I/O bus. • Wastes millions of processor cycles!
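The gap between the two penalties is easier to appreciate in cycles. The sketch below assumes a 1 GHz clock (one cycle per nanosecond), which is what the 60-80 ns and 60-80 cycles figures above imply but which the slides do not state explicitly; the function name `stall_cycles` is hypothetical.

```python
def stall_cycles(penalty_ns, clock_ghz=1.0):
    """Processor cycles lost while waiting penalty_ns nanoseconds (assumed 1 GHz clock)."""
    return penalty_ns * clock_ghz


cache_miss_cycles = stall_cycles(70)           # main memory access of about 70 ns
page_fault_cycles = stall_cycles(50_000_000)   # disk access of 50 ms = 50,000,000 ns

# How many cache misses one page fault is worth.
fault_to_miss_ratio = page_fault_cycles / cache_miss_cycles

print(f"cache miss: {cache_miss_cycles:.0f} cycles")
print(f"page fault: {page_fault_cycles:,.0f} cycles")
```

Under these assumptions a single page fault costs as much as several hundred thousand cache misses.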

  48. Virtual Memory Design • Because page-miss penalties are so heavy, it is not practical to implement direct-mapped or set-associative architectures • These have poorer hit rates. • Main memory caching of VM is always fully associative. • 1% or 2% improvement in hit rate over direct-mapped or set associative designs. • But with heavy page-miss penalties, 1% improvement is A LOT! • Also relatively cheap to implement full associativity in software

  49. Summary • Memory can be thought of as pigeon holes where CPU stores instructions and data. • Each pigeon hole (memory location) is given a number called its address. • Memory technology can be cheap and slow (DRAM) or fast and expensive (SRAM) • Locality allows us to use a small amount of fast expensive memory to store parts of the cheap and slow memory to improve performance. • Caches are organized into blocks.

  50. Summary • Mapping between memory addresses and blocks can be accomplished by: • Directly mapping a memory location to a cache block (direct map) • Slotting a memory location to any block (fully associative) • Mapping a memory location to a set of blocks, then slotting it into any block within the set (set associative) • Virtual memory attempts to use disk space as “main memory”, DRAM main memory as cache to the disk memory, and SRAM as cache to the DRAM.
