
Main Memory and Virtual Memory

  1. ENGS 116 Lecture 14: Main Memory and Virtual Memory
Vincent H. Berk
October 26, 2005
Reading for today: Sections 5.1–5.4 (Jouppi article)
Reading for Friday: Sections 5.5–5.8
Reading for Monday: Sections 5.8–5.12 and 5.16

  2. Main Memory Background
• Performance of main memory:
  • Latency: cache miss penalty
    • Access time: time between the request and the word's arrival
    • Cycle time: time between requests
  • Bandwidth: I/O & large block miss penalty (L2)
• Main memory is DRAM: dynamic random access memory
  • "Dynamic" because it must be refreshed periodically (≈ 1% of the time)
  • Addresses are divided into 2 halves (memory as a 2-D matrix):
    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe
• Cache uses SRAM: static random access memory
  • No refresh; 6 transistors/bit vs. 1 transistor/bit
  • Size: DRAM/SRAM ≈ 4–8; cost & cycle time: SRAM/DRAM ≈ 8–16
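
To make the RAS/CAS split concrete, here is a minimal C sketch that divides an address into its row and column halves; the 24-bit address and the 4096 × 4096 cell matrix are assumed for illustration, not taken from the slide.

    #include <stdio.h>

    /* Hypothetical DRAM organized as a 4096 x 4096 matrix: the high 12
       address bits select the row (sent with RAS), the low 12 bits select
       the column (sent with CAS), halving the number of address pins. */
    int main(void) {
        unsigned addr = 0xABCDEF;               /* example 24-bit cell address */
        unsigned row  = (addr >> 12) & 0xFFF;   /* row half -> RAS */
        unsigned col  = addr & 0xFFF;           /* column half -> CAS */
        printf("row = %u, col = %u\n", row, col);
        return 0;
    }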

  3. 4 Key DRAM Timing Parameters
• tRAC: minimum time from the RAS line falling to valid data output
  • Quoted as the speed of a DRAM when buying
  • A typical 512-Mbit DRAM has a tRAC of 40–60 ns
• tRC: minimum time from the start of one row access to the start of the next
  • tRC = 80 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns
• tCAC: minimum time from the CAS line falling to valid data output
  • 5 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns
• tPC: minimum time from the start of one column access to the start of the next
  • 15 ns for a 512-Mbit DRAM with a tRAC of 40–60 ns

  4. DRAM Performance
• A 40 ns (tRAC) DRAM can:
  • perform a row access only every 80 ns (tRC)
  • perform a column access (tCAC) in 5 ns, but the time between column accesses is at least 15 ns (tPC)
    • In practice, external address delays and bus turnaround push this to 20 to 25 ns
• These times do not include the time to drive the addresses off the microprocessor, or the memory controller overhead!
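
A quick back-of-envelope check of what these timings imply for sustained bandwidth; the 4-byte word size is an assumption, while the 15 ns and 80 ns figures are the slide's tPC and tRC.

    #include <stdio.h>

    int main(void) {
        double tPC_ns = 15.0;   /* min time between column accesses (open row) */
        double tRC_ns = 80.0;   /* min time between row accesses */
        double word_bytes = 4.0;
        /* bytes per ns equals GB/s; multiply by 1000 for MB/s */
        printf("page-mode peak:  %.0f MB/s\n", word_bytes / tPC_ns * 1000);  /* ~267 */
        printf("random-row peak: %.0f MB/s\n", word_bytes / tRC_ns * 1000);  /* ~50  */
        return 0;
    }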

  5. DRAM History
• DRAMs: capacity +60%/yr, cost –30%/yr
  • 2.5× cells/area, 1.5× die size in ≈ 3 years
• '98 DRAM fab line costs $2B
  • Relies on increasing numbers of computers & memory per computer (60% market)
• SIMM or DIMM is the replaceable unit ⇒ computers can use any generation of DRAM
• Commodity, second-source industry ⇒ high volume, low profit, conservative
  • Little organizational innovation in 20 years
• Order of importance: 1) cost/bit, 2) capacity
  • First RAMBUS: 10× BW, +30% cost ⇒ little impact
• Current SDRAM yield very high: > 80%

  6. Main Memory Performance
• Simple:
  • CPU, cache, bus, memory all the same width (32 or 64 bits)
• Wide:
  • CPU/mux 1 word; mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  • CPU, cache, bus 1 word; memory N modules (e.g., 4 modules); example is word-interleaved

  7. Main Memory Performance
• Word-interleaved banks, four banks:
  • Bank 0 holds words 0, 4, 8, 12
  • Bank 1 holds words 1, 5, 9, 13
  • Bank 2 holds words 2, 6, 10, 14
  • Bank 3 holds words 3, 7, 11, 15
• Timing model (word size is 32 bits):
  • 1 cycle to send the address, 6 for access time, 1 to send the data
  • Cache block is 4 words
• Simple memory: 4 × (1 + 6 + 1) = 32 cycles
• Wide memory: 1 + 6 + 1 = 8 cycles
• Interleaved memory: 1 + 6 + 4 × 1 = 11 cycles (checked in the sketch below)
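
The three miss penalties can be verified with a few lines of C; the cycle counts are exactly the slide's timing model.

    #include <stdio.h>

    int main(void) {
        int words = 4;                          /* cache block size */
        int addr = 1, access = 6, xfer = 1;     /* cycles per phase */
        int simple      = words * (addr + access + xfer);  /* word at a time: 32 */
        int wide        = addr + access + xfer;            /* whole block at once: 8 */
        int interleaved = addr + access + words * xfer;    /* overlapped accesses: 11 */
        printf("simple = %d, wide = %d, interleaved = %d\n",
               simple, wide, interleaved);
        return 0;
    }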

  8. Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses:
  • Multiprocessor
  • I/O (DMA)
  • CPU with hit under n misses, non-blocking cache
• Superbank: all memory active on one block transfer (or bank)
• Bank: portion within a superbank that is word-interleaved (or subbank)
• Address fields: superbank number, then superbank offset; within the superbank offset, a bank number and a bank offset

  9. Independent Memory Banks
• How many banks? Number of banks ≥ number of clocks to access a word in a bank
  • Needed for sequential accesses; otherwise the CPU returns to the original bank before it has the next word ready
  • (as in the vector case)
• Increasing DRAM capacity ⇒ fewer chips ⇒ harder to have many banks

  10. Avoiding Bank Conflicts
• Lots of banks:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, every access in the inner loop conflicts on the same bank
• SW: loop interchange or declaring the array dimension not a power of 2 ("array padding"); both are sketched below
• HW: prime number of banks
  • bank number = address mod number of banks
  • address within bank = address / number of banks
  • modulo & divide per memory access with a prime number of banks?
  • address within bank = address mod number of words in bank
  • bank number? Easy if 2^N words per bank
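
Both software fixes named above, sketched on the same array; the 128-bank system is the slide's hypothetical.

    /* Loop interchange: the inner loop now walks consecutive words, so
       successive accesses fall in successive banks. */
    int x[256][512];

    void fixed_interchange(void) {
        for (int i = 0; i < 256; i = i + 1)
            for (int j = 0; j < 512; j = j + 1)
                x[i][j] = 2 * x[i][j];
    }

    /* Array padding: a row of 513 words is not a multiple of 128 banks,
       so walking down a column no longer revisits a single bank. */
    int y[256][513];

    void fixed_padding(void) {
        for (int j = 0; j < 512; j = j + 1)
            for (int i = 0; i < 256; i = i + 1)
                y[i][j] = 2 * y[i][j];
    }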

  11. Fast Memory Systems: DRAM-Specific
• Multiple CAS accesses: several names (page mode)
  • Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
  • RAMBUS: startup company; reinvented the DRAM interface
    • Each chip a module vs. a slice of memory
    • Short bus between CPU and chips
    • Does own refresh
    • Variable amount of data returned
    • 1 byte / 2 ns (500 MB/s per chip)
  • Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66–150 MHz)
  • Intel claims RAMBUS Direct is the future of PC memory
• Niche memory or main memory?
  • e.g., Video RAM for frame buffers: DRAM + fast serial output

  12. Virtual Memory
• Virtual address (2^32, 2^64) to physical address (2^28) mapping
• Virtual memory in terms of cache:
  • Cache block?
  • Cache miss?
• How is virtual memory different from caches?
  • What controls replacement
  • Size (transfer unit, mapping mechanisms)
  • Lower-level use

  13. Figure 5.36: The logical program in its contiguous virtual address space is shown on the left; it consists of four pages, A, B, C, and D. [Diagram: virtual pages A–D, at 4 KB intervals from 0 to 16 KB, map individually either to page frames in physical main memory (0–28 KB) or to disk.]

  14. Figure 5.37: Typical ranges of parameters for caches and virtual memory.

  15. Virtual Memory
• 4 questions for virtual memory (VM):
  • Q1: Where can a block be placed in the upper level? Fully associative, set associative, or direct mapped?
  • Q2: How is a block found if it is in the upper level?
  • Q3: Which block should be replaced on a miss? Random or LRU?
  • Q4: What happens on a write? Write back or write through?
• Other issues: size; pages or segments or hybrid

  16. Figure 5.40: The mapping of a virtual address to a physical address via a page table. [Diagram: the virtual address splits into a virtual page number and a page offset; the page number indexes the page table in main memory, and the resulting physical page frame is concatenated with the unchanged offset.]
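
A minimal C sketch of the one-level translation in Figure 5.40; the 8 KB page size matches the Alpha slides below, and the flat in-memory page_table array is a simplification (no valid or protection bits).

    #include <stdint.h>

    #define PAGE_BITS   13                        /* 8 KB pages */
    #define OFFSET_MASK ((1u << PAGE_BITS) - 1)

    extern uint32_t page_table[];                 /* physical page-frame numbers */

    uint32_t translate(uint32_t va) {
        uint32_t vpn    = va >> PAGE_BITS;        /* virtual page number */
        uint32_t offset = va & OFFSET_MASK;       /* page offset, unchanged */
        uint32_t pfn    = page_table[vpn];        /* page-table lookup */
        return (pfn << PAGE_BITS) | offset;       /* physical address */
    }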

  17. Fast Translation: Translation Buffer (TLB)
• Cache of translated addresses
• Data portion usually includes the physical page frame number, protection field, valid bit, use bit, and dirty bit
• Alpha 21064 data TLB: 32-entry fully associative
[Diagram: the page-frame address (high-order bits of the virtual address) is compared against each entry's 21-bit tag; V (valid), R (read), and W (write) bits qualify the match, a 32:1 mux selects the hit entry's 21-bit physical page number, and concatenation with the 13-bit page offset (low-order bits) yields the 34-bit physical address.]
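
A software model of the fully associative lookup: hardware compares all 32 tags in parallel, modeled here as a loop. Field names and the permission-check logic are illustrative, not the 21064's exact layout.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 32

    struct tlb_entry {
        uint64_t tag;             /* virtual page-frame address */
        uint64_t pfn;             /* physical page number */
        bool valid, read, write;  /* V, R, W bits */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Returns true on a hit and writes the frame number to *pfn_out;
       a miss means walking the page table and refilling an entry. */
    bool tlb_lookup(uint64_t vpn, bool is_write, uint64_t *pfn_out) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].tag == vpn &&
                (is_write ? tlb[i].write : tlb[i].read)) {
                *pfn_out = tlb[i].pfn;
                return true;
            }
        }
        return false;
    }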

  18. Selecting a Page Size
• Reasons for a larger page size:
  • Page table size is inversely proportional to the page size, so memory is saved (worked example below)
  • Fast cache hit time is easy when cache ≤ page size (virtually addressed caches); a bigger page keeps this feasible as the cache grows
  • Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  • The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size:
  • Fragmentation: don't waste storage; data must be contiguous within a page
  • Quicker process start-up for small processes
• Hybrid solution: multiple page sizes
  • Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)
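
A worked check of the first bullet, assuming a 32-bit virtual address space and 4-byte page-table entries (both assumed for the example):

    #include <stdio.h>

    int main(void) {
        unsigned long long va_space  = 1ULL << 32;  /* assumed 32-bit VA space */
        unsigned long long pte_bytes = 4;           /* assumed PTE size */
        for (unsigned long long page = 8192; page <= 65536; page *= 2)
            printf("%3llu KB pages -> %4llu KB flat page table\n",
                   page / 1024, va_space / page * pte_bytes / 1024);
        return 0;
    }
    /* Doubling the page size halves the table: 8 KB pages -> 2048 KB,
       16 KB -> 1024 KB, 32 KB -> 512 KB, 64 KB -> 256 KB. */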

  19. Alpha VM Mapping
• "64-bit" address divided into 3 segments:
  • seg0 (bit 63 = 0): user code/heap
  • seg1 (bit 63 = 1, bit 62 = 1): user stack
  • kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS
• Virtual address fields: seg0/seg1 selector (000…0 or 111…1), level1 (10 bits), level2 (10 bits), level3 (10 bits), page offset (13 bits)
• Three-level page table, each level one page of 8-byte page table entries
  • The Page Table Base Register locates the L1 table; each level's index selects a page table entry pointing at the next level's page, and the L3 entry supplies the physical page-frame number, which is joined with the page offset (sketch below)
• Alpha uses only 43 bits of VA
  • (a future minimum page size of up to 64 KB ⇒ 55 bits of VA)
• PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)
  • What do you do?
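
A C sketch of the three-level walk just described (10 + 10 + 10 index bits plus the 13-bit offset gives the 43-bit VA); validity and protection checks are omitted, and the PTE layout is reduced to a bare frame number.

    #include <stdint.h>

    #define OFFSET_BITS 13                       /* 8 KB pages */
    #define LEVEL_BITS  10                       /* 1024 PTEs per level */
    #define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)
    #define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

    typedef struct { uint64_t pfn; } pte_t;      /* valid/protection bits omitted */

    /* ptbr plays the role of the Page Table Base Register. */
    uint64_t walk(const pte_t *ptbr, uint64_t va) {
        unsigned l1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & LEVEL_MASK;
        unsigned l2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & LEVEL_MASK;
        unsigned l3 = (va >> OFFSET_BITS) & LEVEL_MASK;

        const pte_t *t2 = (const pte_t *)(uintptr_t)(ptbr[l1].pfn << OFFSET_BITS);
        const pte_t *t3 = (const pte_t *)(uintptr_t)(t2[l2].pfn << OFFSET_BITS);

        return (t3[l3].pfn << OFFSET_BITS) | (va & OFFSET_MASK);
    }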

  20. Protection
• Prevent separate processes from accessing each other's memory
  • Violations cause a segmentation fault: SIGSEGV
  • Essential for multitasking systems
• An operating system issue
• At least two levels of protection:
  • Supervisor (kernel) mode (privileged)
    • Creates page tables, sets process bounds, handles exceptions
  • User mode (non-privileged)
    • Can only make requests to the kernel, called SYSCALLs
• Shared memory
  • SYSCALL parameter passing

  21. Protection 2
• Each page needs:
  • PID bit
  • Read/Write/Execute bits
• Each process needs:
  • Stack frame page(s)
  • Text or code pages
  • Data or heap pages
• State table keeping:
  • PC and other CPU status registers
  • State of all registers

  22. Alpha 21064
• Separate instruction & data TLBs & caches
  • TLBs fully associative
  • TLB updates in SW (PALcode, the "Privileged Architecture Library")
• Caches: 8 KB, direct mapped, write through
  • Critical 8 bytes first
  • Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, write back (off-chip)
  • 256-bit path to main memory, 4 × 64-bit modules
• Victim buffer: gives reads priority over writes
• 4-entry write buffer between D$ & L2$
[Diagram: instruction and data paths through the stream buffer, write buffer, and victim buffer.]

  23. Alpha CPI Components
• Instruction stall: branch mispredict (green); data cache (blue); instruction cache (yellow); L2$ (pink)
• Other: compute + register conflicts, structural conflicts

  24. Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)
• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
  • Why 2× Alpha vs. MIPS?
[Chart: instruction- and data-cache miss rates (0–35%) vs. cache size (1–128 KB) for gcc, espresso, and tomcatv.]

  25. Pitfall: Simulating Too Small an Address Trace
• I$ = 4 KB, B = 16 B
• D$ = 4 KB, B = 16 B
• L2 = 512 KB, B = 128 B
• MP = 12, 200 (miss penalties)
[Chart: cumulative average memory access time (1–4.5) vs. instructions executed (0–12 billion).]

  26. Additional Pitfalls
• Having too small an address space
• Ignoring the impact of the operating system on the performance of the memory hierarchy

  27. Figure 5.53: Summary of the memory-hierarchy examples in Chapter 5.
