Memory Hierarchy Part 1

Refreshing Memory. Memory Hierarchy Part 1. Reading Assignment. Required: Sections 8.4 and 12.4 of the Clements textbook. Optional: Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003 (B&H), Chapter 6: The Memory Hierarchy.

Presentation Transcript


  1. Refreshing Memory: Memory Hierarchy Part 1. CMPUT 229

  2. Reading Assignment. Required: Sections 8.4 and 12.4 of the Clements textbook. Optional: Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003 (B&H), Chapter 6: The Memory Hierarchy.

  3. Types of Memories. Read/Write Memory (RWM): we can store and retrieve data. Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location. Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again. Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.

  4. [Figure: 8x4 SRAM array. A 3-to-8 decoder driven by address inputs A2, A1, A0 selects one of eight rows (0 to 7); each row has four cells with IN, OUT, SEL, and WR pins. Data inputs DIN3 to DIN0, data outputs DOUT3 to DOUT0, control signals WR_L, WE_L, CS_L, IOE_L, OE_L.]

  5. [Figure: the same 8x4 SRAM array diagram as slide 4.]

  6. [Figure: the same 8x4 SRAM array diagram as slide 4.]

  7. [Figure: the same 8x4 SRAM array diagram as slide 4.]

  8. Refreshing the Memory. [Figure: capacitor voltage Vcap versus time, between 0V and VCC with HIGH and LOW thresholds; annotations: 1 written, refreshes, 0 stored.] The solution is to periodically refresh the memory cells by reading each one of them and writing it back.

  9. SRAM with Bi-directional Data Bus. [Figure: a microprocessor connected to a row of four SRAM cells (IN, OUT, SEL, WR) over bidirectional data lines DIO3 to DIO0, with control signals WR_L, WE_L, CS_L, IOE_L, OE_L.]

  10. DRAM High Level View. [Figure: a DRAM chip organized as a 4x4 array of supercells (rows 0 to 3, cols 0 to 3) with an internal row buffer. The memory controller (connected to the CPU) drives 2 addr lines and exchanges data over 8 data lines.] Bryant/O’Hallaron, p. 459

  11. DRAM RAS Request. RAS = Row Address Strobe. [Figure: the memory controller sends RAS = 2 on the addr lines; the DRAM chip copies row 2 into the internal row buffer.] Bryant/O’Hallaron, p. 460

  12. DRAM CAS Request. CAS = Column Address Strobe. [Figure: the memory controller sends CAS = 1 on the addr lines; the DRAM chip returns supercell (2,1) from the internal row buffer on the data lines.] Bryant/O’Hallaron, p. 460
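
The RAS/CAS sequence of slides 10 to 12 can be sketched as a toy model in C. This is only a minimal sketch, assuming the 4x4 array of 8-bit supercells shown in the figures; the names dram_t, dram_ras, dram_cas, and dram_read are hypothetical, and no timing or refresh behavior is modeled.

```c
#include <stdint.h>
#include <string.h>

#define ROWS 4
#define COLS 4

/* Toy model of the DRAM chip in the figures (all names are hypothetical). */
typedef struct {
    uint8_t cells[ROWS][COLS];   /* the supercell array */
    uint8_t row_buffer[COLS];    /* internal row buffer */
} dram_t;

/* RAS request: copy the entire selected row into the internal row buffer. */
static void dram_ras(dram_t *d, unsigned row)
{
    memcpy(d->row_buffer, d->cells[row % ROWS], COLS);
}

/* CAS request: return one supercell from the row buffer. */
static uint8_t dram_cas(const dram_t *d, unsigned col)
{
    return d->row_buffer[col % COLS];
}

/* Read supercell (row, col) as the memory controller would:
   a RAS request followed by a CAS request. */
static uint8_t dram_read(dram_t *d, unsigned row, unsigned col)
{
    dram_ras(d, row);
    return dram_cas(d, col);
}
```

Calling dram_read(&d, 2, 1) mirrors the slides: RAS = 2 loads row 2 into the row buffer, then CAS = 1 selects supercell (2,1). A further dram_cas call on the same row needs no new RAS request.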

  13. Memory Modules. [Figure: a 64 MB memory module consisting of eight 8Mx8 DRAMs (DRAM 0 to DRAM 7). The memory controller sends addr (row = i, col = j) to the DRAMs, and each returns the 8 bits of its supercell (i,j): DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, and so on up to DRAM 7 with bits 56-63. The memory controller combines them into the 64-bit doubleword at main memory address A and sends it to the CPU chip.] Bryant/O’Hallaron, p. 461
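
How the memory controller might combine the eight bytes can be sketched in C. This is a minimal sketch, assuming that DRAM k supplies bits 8k to 8k+7 as in the figure; the name assemble_doubleword is hypothetical.

```c
#include <stdint.h>

#define NUM_DRAMS 8

/* Combine the bytes delivered by DRAM 0..7 for the same supercell (i, j)
   into one 64-bit doubleword. DRAM k supplies bits 8k..8k+7. */
static uint64_t assemble_doubleword(const uint8_t from_dram[NUM_DRAMS])
{
    uint64_t word = 0;
    for (int k = 0; k < NUM_DRAMS; k++)
        word |= (uint64_t)from_dram[k] << (8 * k);
    return word;
}
```

The module capacity follows the same arithmetic: 8 DRAMs x 8M supercells x 1 byte per supercell = 64 MB.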

  14. Read Cycle on an Asynchronous DRAM. Step 1: Apply row address. Step 2: RAS goes from high to low and remains low. Step 3: Apply column address. Step 4: WE must be high. Step 5: CAS goes from high to low and remains low. Step 6: OE goes low. Step 7: Data appears. Step 8: RAS and CAS return to high. [Figure: timing diagram with steps 1 to 8 marked.]

  15. Improved DRAMs. Central idea: each read of a DRAM actually reads a complete row of bits (a word line) from the DRAM core into an array of sense amps. A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor. All the other bits already extracted from the DRAM cells into the sense amps are wasted.

  16. Fast Page Mode DRAMs. In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address. To read in fast page mode, steps 1 to 7 of a standard read cycle are performed. Then OE and CAS are switched high, but RAS remains low. Steps 3 to 7 (providing a new column address, asserting CAS and OE) are then repeated for each new memory location to be read.

  17. A Fast Page Mode Read Cycle on an Asynchronous DRAM

  18. Enhanced Data Output RAMs (EDO-RAM). The process to read multiple locations in an EDO-RAM is very similar to Fast Page Mode. The difference is that the output drivers are not disabled when CAS goes high. This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins. As a result, faster read cycle times are allowed.

  19. An Enhanced Data Output Read Cycle on an Asynchronous DRAM

  20. Synchronous DRAMs (SDRAM). A Synchronous DRAM (SDRAM) has a clock input. It operates in a fashion similar to fast page mode and EDO DRAM. However, consecutive data is output synchronously on the falling/rising edge of the clock, instead of on command by CAS. The number of data elements output (the length of the burst) is programmable up to the maximum size of the row. The clock period of an SDRAM is typically about one order of magnitude shorter than the access time of an individual access.

  21. DDR SDRAM. A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers on both the rising and the falling edge of the clock. Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.
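
The factor of two can be checked with a small calculation. This is a minimal sketch; the function name peak_bytes_per_second is hypothetical, and the 100 MHz clock and 64-bit bus used in the note after the code are illustrative values that do not come from the slides.

```c
#include <stdint.h>

/* Peak transfer rate in bytes per second.
   transfers_per_clock is 1 for a standard SDRAM (one clock edge)
   and 2 for a DDR SDRAM (both clock edges). */
static uint64_t peak_bytes_per_second(uint64_t clock_hz,
                                      unsigned bus_width_bits,
                                      unsigned transfers_per_clock)
{
    return clock_hz * transfers_per_clock * (bus_width_bits / 8);
}
```

With an assumed 100 MHz clock and a 64-bit data bus, the standard SDRAM case gives 800,000,000 bytes per second and the DDR case gives twice that, 1,600,000,000 bytes per second.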

  22. The Rambus DRAM (RDRAM). Multiple memory arrays (banks). Rambus DRAMs are synchronous and transfer data on both edges of the clock.

  23. SDRAM Memory Systems. Complex circuits for RAS/CAS/OE. Each DIMM (Dual In-line Memory Module) is connected in parallel with the memory controller. Often requires buffering. Needs the whole clock cycle to establish valid data. Making the bus wider is mechanically complicated.

  24. RDRAM Memory Systems

  25. Locality. We say that a computer program exhibits good locality if the program tends to reference data that is nearby or data that has been referenced recently. Because a program might do one of these things but not the other, the principle of locality is separated into two flavors. Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future. Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future. Bryant/O’Hallaron, p. 478

  26. Examples. In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

      /* Assumed helper: returns a random integer in [lo, hi]. */
      int RandInt(int lo, int hi);

      /* Averages K randomly chosen elements of v[0..N-1]. */
      int Sampler(int v[], int N, int K)
      {
          int i, j;
          int sum = 0;

          for (i = 0; i < K; i = i + 1)
          {
              j = RandInt(0, N - 1);
              sum += v[j];
          }
          return sum / K;
      }

      /* Sums the elements of v[0..N-1] in order. */
      int SumVec(int v[], int N)
      {
          int i;
          int sum = 0;

          for (i = 0; i < N; i = i + 1)
              sum += v[i];
          return sum;
      }

      Bryant/O’Hallaron, p. 479

  27. Memory Hierarchy. Levels run from smaller, faster, and costlier (per byte) storage devices at the top to larger, slower, and cheaper (per byte) storage devices at the bottom.
      L0: Registers. CPU registers hold words retrieved from cache memory.
      L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
      L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.
      L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
      L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
      L5: Remote secondary storage (distributed file systems, Web servers).
      Bryant/O’Hallaron, p. 483

  28. Caching Principle. The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks. Data is copied between levels in block-sized transfer units. [Figure: level k holds blocks 4, 9, 14, and 3; level k+1 holds blocks 0 to 15.] Bryant/O’Hallaron, p. 484

  29. Cache Misses. Cold misses, or compulsory misses, occur the first time that a data item is referenced. Conflict misses occur when two memory references have to occupy the same cache line; they can occur even when the remainder of the cache is not in use. Capacity misses occur when there are no more free lines in the cache.

  30. Simplest Cache: Direct Mapped. [Figure: a memory with addresses 0 to F next to a 4-byte direct mapped cache with cache index 0 to 3.] Location 0 can be occupied by data from memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s. Address<1:0> => cache index. Which one should we place in the cache? How can we tell which one is in the cache?
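
The mapping on this slide, and the conflict misses it leads to, can be sketched with a tiny simulation. This is a minimal sketch, assuming four one-byte lines as in the 4-byte cache of the figure and assuming that the upper address bits are kept as a tag to tell which location is cached; the names dm_cache_t and dm_access are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 4   /* 4-byte cache with one-byte lines, as in the figure */

/* Hypothetical bookkeeping for the cache; zero-initialize before use. */
typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} dm_cache_t;

/* Access one address. Returns true on a hit and false on a miss.
   On a miss the line is filled, evicting whatever was there. */
static bool dm_access(dm_cache_t *c, uint32_t addr)
{
    uint32_t index = addr & (NUM_LINES - 1);  /* Address<1:0> */
    uint32_t tag   = addr >> 2;               /* remaining upper bits */

    if (c->valid[index] && c->tag[index] == tag)
        return true;
    c->valid[index] = true;
    c->tag[index]   = tag;
    return false;
}
```

Accessing address 0, then 4, then 0 again misses every time in this model: both addresses map to index 0 and keep evicting each other, even though the lines at indexes 1 to 3 are unused.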

  31. 1 KB Direct Mapped Cache, 32B blocks. For a 2 ** N byte cache: the uppermost (32 - N) bits are always the Cache Tag; the lowest M bits are the Byte Select (Block Size = 2 ** M). Address fields: Cache Tag, bits 31 to 10 (example: 0x50); Cache Index, bits 9 to 5 (example: 0x01); Byte Select, bits 4 to 0 (example: 0x00). The Cache Tag is stored as part of the cache “state”. [Figure: cache array with entries 0 to 31, each holding a Valid Bit, a Cache Tag, and 32 bytes of Cache Data: entry 0 holds Byte 0 to Byte 31, entry 1 (Cache Tag 0x50) holds Byte 32 to Byte 63, and entry 31 holds Byte 992 to Byte 1023.]
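
The address split for this cache can be sketched in C. This is a minimal sketch, assuming 32-bit addresses with N = 10 and M = 5 as on the slide; the names addr_fields_t and split_address are hypothetical.

```c
#include <stdint.h>

#define BLOCK_BITS 5                          /* M: 32-byte blocks, 2 ** 5 */
#define CACHE_BITS 10                         /* N: 1 KB cache, 2 ** 10 */
#define INDEX_BITS (CACHE_BITS - BLOCK_BITS)  /* 5 bits, 32 cache entries */

typedef struct {
    uint32_t tag;          /* address bits 31..10 */
    uint32_t index;        /* address bits 9..5 */
    uint32_t byte_select;  /* address bits 4..0 */
} addr_fields_t;

/* Split a 32-bit address into Cache Tag, Cache Index, and Byte Select. */
static addr_fields_t split_address(uint32_t addr)
{
    addr_fields_t f;
    f.byte_select = addr & ((1u << BLOCK_BITS) - 1);
    f.index       = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    f.tag         = addr >> CACHE_BITS;
    return f;
}
```

The slide's example fields (tag 0x50, index 0x01, byte select 0x00) correspond to address 0x14020, since (0x50 << 10) | (0x01 << 5) | 0x00 = 0x14020.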

  32. Direct-mapped Cache. Clements, p. 346

  33. Identifying sets in Direct-mapped Caches. Clements, p. 347

  34. Operation of a Direct-mapped Cache. Clements, p. 348

  35. Fully Associative Cache. Clements, p. 348

  36. Two-way Set Associative Cache. N-way set associative: N entries for each Cache Index. N direct mapped caches operate in parallel (N typically 2 to 4). Example: two-way set associative cache. The Cache Index selects a “set” from the cache. The two tags in the set are compared in parallel. Data is selected based on the tag result. [Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...). The Cache Index selects one entry in each bank, each stored tag is compared with the address tag (Adr Tag), the compare outputs Sel1 and Sel0 drive a mux that selects the Cache Block, and their OR produces Hit.]
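
The lookup described on this slide can be sketched in C. This is a minimal sketch: the number of sets (16) is a hypothetical value that does not come from the slide, the names way_t, sa_cache_t, and sa_lookup are hypothetical, and the loop checks the two tags one after the other where the hardware compares them in parallel.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 16  /* hypothetical size, not from the slide */
#define NUM_WAYS 2   /* two-way set associative */

typedef struct {
    bool     valid;
    uint32_t tag;
} way_t;

/* Zero-initialize before use so that every entry starts out invalid. */
typedef struct {
    way_t sets[NUM_SETS][NUM_WAYS];
} sa_cache_t;

/* Returns the way that hits (0 or 1), or -1 on a miss. */
static int sa_lookup(const sa_cache_t *c, uint32_t index, uint32_t tag)
{
    const way_t *set = c->sets[index % NUM_SETS];  /* Cache Index selects a set */
    for (int w = 0; w < NUM_WAYS; w++)
        if (set[w].valid && set[w].tag == tag)
            return w;   /* plays the role of Sel0/Sel1 driving the mux */
    return -1;          /* neither compare matched, so Hit is 0 */
}
```

The returned way number stands in for the mux select, and a return value of -1 corresponds to the OR gate reporting no hit.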

  37. Set associative-mapped cache. Clements, p. 349

  38. L1 and L2 Bus System. [Figure: a CPU chip containing the register file, ALU, L1 cache, and bus interface; an L2 cache reached over the cache bus; an I/O bridge reached over the system bus; main memory reached over the memory bus.] Bryant/O’Hallaron, p. 488

  39. Cache Organization. A cache is an array of S = 2^s sets (Set 0 to Set S-1). Each set contains E lines. Each line holds 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0 to B-1). Cache size: C = B x E x S data bytes. Bryant/O’Hallaron, p. 488

  40. Address Partition. An m-bit address (bits m-1 down to 0) is split into three fields. Tag (t bits): compared with tags in the cache to find a match. Set index (s bits): used to find the set where the data might be found in the cache. Block offset (b bits): selects which word, inside the block, is referenced. Bryant/O’Hallaron, p. 488
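
The general partition, together with the cache size formula C = B x E x S, can be sketched in C. This is a minimal sketch, assuming addresses that fit in 64 bits and s + b < 64; the names cache_geom_t, cache_size, block_offset, set_index, and tag_of are hypothetical.

```c
#include <stdint.h>

/* Cache geometry in the slides' notation:
   S = 2^s sets, E lines per set, B = 2^b bytes per block. */
typedef struct {
    unsigned s;  /* number of set index bits */
    unsigned b;  /* number of block offset bits */
    unsigned E;  /* lines per set */
} cache_geom_t;

/* C = B x E x S data bytes. */
static uint64_t cache_size(cache_geom_t g)
{
    return ((uint64_t)1 << g.b) * g.E * ((uint64_t)1 << g.s);
}

/* Lowest b bits: position inside the block. */
static uint64_t block_offset(cache_geom_t g, uint64_t addr)
{
    return addr & (((uint64_t)1 << g.b) - 1);
}

/* Next s bits: which set to search. */
static uint64_t set_index(cache_geom_t g, uint64_t addr)
{
    return (addr >> g.b) & (((uint64_t)1 << g.s) - 1);
}

/* Remaining upper t = m - s - b bits: compared with the stored tags. */
static uint64_t tag_of(cache_geom_t g, uint64_t addr)
{
    return addr >> (g.s + g.b);
}
```

A direct-mapped cache is the special case E = 1. With s = 5, b = 5, and E = 1, cache_size gives 1024 bytes, and address 0x14020 splits into tag 0x50, set index 1, and block offset 0.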

  41. Multi-Level Cache Organization. [Figure: CPU with Regs, L1 d-cache, and L1 i-cache, followed by the L2 unified cache, main memory, and disk.] Bryant/O’Hallaron, p. 504
