Virtual Memory An Address Remapping Scheme • Permits applications to grow bigger than main memory size • Helps with multiple process management • Each process gets its own chunk of memory • Permits protection of one process’ chunk from another • Mapping of multiple chunks onto shared physical memory • Mapping also facilitates relocation • Think of it as another level of cache in the hierarchy: • Caches disk pages into DRAM • Miss becomes a page fault • Block becomes a page (or segment)
Virtual Memory Basics • Programs reference “virtual” addresses in a non-existent memory • These are then translated into real “physical” addresses • Virtual address space may be bigger than physical address space • Divide physical memory into blocks, called pages • Anywhere from 512 bytes to 16 MB (4 KB typical) • Virtual-to-physical translation by indexed table lookup • Add another cache for recent translations (the TLB) • Invisible to the programmer • Looks to your application like you have a lot of memory! • Anyone remember overlays?
VM: Page Mapping • [Figure: Process 1’s and Process 2’s virtual address spaces are mapped onto page frames in shared physical memory, with non-resident pages kept on disk]
VM: Address Translation • [Figure: the virtual address splits into a 20-bit virtual page number and a 12-bit page offset (log2 of the page size); the per-process page table, located via a page table base register, holds a valid bit, protection bits, a dirty bit, a reference bit, and the physical page number, which is concatenated with the page offset and sent to physical memory]
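A minimal sketch of the translation path just described, assuming 4 KB pages (12-bit offset), a 20-bit virtual page number, and a flat per-process table; the entry fields mirror the bits named above, but the struct layout is illustrative, not any real machine's format:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET_BITS 12              /* log2 of the 4 KB page size */
#define PAGE_SIZE        (1u << PAGE_OFFSET_BITS)
#define NUM_VPAGES       (1u << 20)      /* 20-bit virtual page number */

typedef struct {
    uint32_t ppn;        /* physical page number */
    bool     valid;      /* page is resident in physical memory */
    bool     dirty;      /* page has been written since it was loaded */
    bool     referenced; /* used by the OS's replacement policy */
    uint8_t  prot;       /* protection bits (read/write/execute) */
} pte_t;

static pte_t page_table[NUM_VPAGES];     /* one entry per virtual page */

/* Translate a virtual address; returns false on a page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    pte_t   *pte    = &page_table[vpn];

    if (!pte->valid)
        return false;                    /* page fault: OS must bring the page in */

    pte->referenced = true;              /* record the access for replacement */
    *paddr = (pte->ppn << PAGE_OFFSET_BITS) | offset;
    return true;
}
```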
Typical Page Parameters • It’s a lot like what happens in a cache • But everything (except miss rate) is a LOT worse
Cache vs. VM Differences • Replacement • cache miss handled by hardware • page fault usually handled by the OS • This is OK - since fault penalty is so horrific • hence some strategy of what to replace makes sense • Addresses • VM space is determined by the address size of the CPU • cache size is independent of the CPU address size • Lower level memory • For caches, the main memory is not shared by something else (well, there is I/O…) • For VM, most of the disk contains the file system • File system addressed differently, usually in I/O space • The VM lower level is usually called Swap Space
Paging vs. Segmentation • Pages are fixed sized blocks • Segments vary from 1 byte to 2^32 bytes (for 32-bit addresses)
Pages are Cached in a Virtual Memory System Can Ask the Same Four Questions we did about caches • Q1: Block Placement • choice: lower miss rates and complex placement or vice versa • miss penalty is huge • so choose low miss rate ==> place page anywhere in physical memory • similar to fully associative cache model • Q2: Block Addressing - use additional data structure • fixed size pages - use a page table • virtual page number ==> physical page number and concatenate offset • tag bit to indicate presence in main memory
Normal Page Tables • Size is number of virtual pages • Purpose is to hold the translation of VPN to PPN • Permits ease of page relocation • Make sure to keep tags to indicate page is mapped • Potential problem: • Consider a 32-bit virtual address and 4 KB pages • 4 GB/4 KB = 1M entries (a megaword) required just for the page table! • Might have to page in the page table… • Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces! • Alpha has a 43-bit virtual address with 8 KB pages… • Might have multi-level page tables
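The size problem above is just arithmetic; a quick sketch that reproduces the slide's numbers, assuming a 4-byte entry purely for illustration:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 32-bit virtual address space, 4 KB pages (the slide's example) */
    uint64_t vaddr_space = 1ULL << 32;                /* 4 GB */
    uint64_t page_size   = 1ULL << 12;                /* 4 KB */
    uint64_t entries     = vaddr_space / page_size;   /* 1M entries */

    /* Assume a 4-byte entry just to show the order of magnitude */
    printf("entries: %llu, table size: %llu MB\n",
           (unsigned long long)entries,
           (unsigned long long)(entries * 4 >> 20));

    /* Same calculation for the Alpha's 43-bit space with 8 KB pages */
    entries = (1ULL << 43) / (1ULL << 13);            /* 1G entries */
    printf("Alpha flat table entries: %llu\n", (unsigned long long)entries);
    return 0;
}
```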
Inverted Page Tables Similar to a set-associative mechanism • Make the page table reflect the # of physical pages (not virtual) • Use a hash mechanism • virtual page number ==> hashed index into the inverted page table • Compare virtual page number with the tag to make sure it is the one you want • if yes • check to see that it is in memory - OK if yes - if not page fault • If not - miss • go to full page table on disk to get new entry • implies 2 disk accesses in the worst case • trades an increased worst-case penalty for a decrease in capacity-induced miss rate, since the smaller page table leaves more room for real pages
Inverted Page Table • [Figure: the virtual page number is hashed to index the inverted page table, which only stores entries for pages in physical memory; on a valid tag match (V = OK) the entry’s page frame number is concatenated with the page offset]
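A behavioural sketch of the inverted-table lookup, assuming a toy hash of (process id, virtual page number) and no collision handling; a real design would chain or rehash on collisions:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_FRAMES 4096                  /* table size tracks physical frames, not virtual pages */

typedef struct {
    uint32_t vpn_tag;                    /* which virtual page this frame holds */
    uint32_t pid;                        /* owning process */
    bool     valid;
} ipte_t;

static ipte_t inverted_table[NUM_FRAMES];

/* Toy hash of (pid, vpn) into the table; chosen only for illustration. */
static uint32_t hash(uint32_t pid, uint32_t vpn)
{
    return (vpn * 2654435761u ^ pid) % NUM_FRAMES;
}

/* Returns true and the frame number on a hit; false means "go to the
 * full page table on disk", as described above. */
bool ipt_lookup(uint32_t pid, uint32_t vpn, uint32_t *frame)
{
    uint32_t idx = hash(pid, vpn);
    ipte_t  *e   = &inverted_table[idx];

    if (e->valid && e->vpn_tag == vpn && e->pid == pid) {
        *frame = idx;                    /* the entry index doubles as the frame number */
        return true;
    }
    return false;                        /* miss: consult the full table / fault */
}
```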
Address Translation Reality • The translation process using page tables takes too long! • Use a cache to hold recent translations • Translation Lookaside Buffer • Typically 8-1024 entries • Block size same as a page table entry (1 or 2 words) • Only holds translations for pages in memory • 1 cycle hit time • Highly or fully associative • Miss rate < 1% • Miss goes to main memory (where the whole page table lives) • Must be purged on a process switch
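A behavioural sketch of the TLB check that precedes the page-table walk; the 64-entry size is just one point in the 8-1024 range quoted above, and the linear loop is only a software stand-in for the parallel tag compare done in hardware:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64                   /* illustrative size */

typedef struct {
    uint32_t vpn;
    uint32_t ppn;
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fully associative lookup: hardware compares all tags at once;
 * this loop only models that behaviour. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;                 /* 1-cycle hit in real hardware */
        }
    }
    return false;                        /* miss: walk the page table in memory */
}

/* On a process switch the cached translations no longer apply, so purge them. */
void tlb_flush(void)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        tlb[i].valid = false;
}
```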
Back to the 4 Questions • Q3: Block Replacement (pages in physical memory) • LRU is best • So use it to minimize the horrible miss penalty • However, real LRU is expensive • Page table contains a use (reference) bit • On access the use bit is set • OS checks them every so often, records what it sees, and resets them all • On a miss, the OS decides who has been used the least • Basic strategy: Miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate
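One common way to implement the periodic check-and-reset just described is the classic aging approximation of LRU; a sketch, where the 8-bit age counter is an assumption of this example rather than anything the slide specifies:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_FRAMES 1024

static bool    use_bit[NUM_FRAMES];      /* set by hardware on each access */
static uint8_t age[NUM_FRAMES];          /* OS-maintained aging counter */

/* Run every so often by the OS: record the use bits and reset them all. */
void age_sweep(void)
{
    for (int f = 0; f < NUM_FRAMES; f++) {
        /* shift in the current use bit at the top; older history decays */
        age[f] = (uint8_t)((age[f] >> 1) | (use_bit[f] ? 0x80 : 0));
        use_bit[f] = false;
    }
}

/* On a page fault, evict the frame that looks least recently used. */
int pick_victim(void)
{
    int victim = 0;
    for (int f = 1; f < NUM_FRAMES; f++)
        if (age[f] < age[victim])
            victim = f;
    return victim;
}
```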
Last Question • Q4: Write Policy • Always write-back • Due to the access time of the disk • So, you need to keep tags to show when pages are dirty and need to be written back to disk when they’re swapped out. • Anything else is pretty silly • Remember – the disk is SLOW!
Page Sizes An architectural choice • Large pages are good: • reduces page table size • amortizes the long disk access • if spatial locality is good then hit rate will improve • Large pages are bad: • more internal fragmentation • if everything is random each structure’s last page is only half full • Half of bigger is still bigger • if there are 3 structures per process: text, heap, and control stack • then 1.5 pages are wasted for each process • process start up time takes longer • since at least 1 page of each type is required prior to start • the transfer time penalty per page is higher
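The internal-fragmentation point is easy to quantify: with three structures per process and the last page of each on average half full, the expected waste is 1.5 pages per process. A small sketch of that arithmetic for a few page sizes:

```c
#include <stdio.h>

int main(void)
{
    /* Expected internal fragmentation: each structure's last page is,
     * on average, half empty, so waste = structures * page_size / 2. */
    long page_sizes[] = {4096, 65536, 4L * 1024 * 1024};
    int  structures   = 3;               /* text, heap, control stack */

    for (int i = 0; i < 3; i++)
        printf("page %8ld B -> expected waste per process: %8ld B\n",
               page_sizes[i], structures * page_sizes[i] / 2);
    return 0;
}
```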
More on TLBs • The TLB must be on chip • otherwise it is worthless • small TLBs are worthless anyway • large TLBs are expensive • high associativity is likely • ==> Price of CPUs is going up! • OK as long as performance goes up faster
Alpha AXP 21064 TLB • [Figure: the 43-bit virtual address splits into a 30-bit virtual page number and a 13-bit page offset; the fully associative TLB holds 32 entries of 56 bits each – valid, read, and write protection bits, a 30-bit VPN tag, and a 21-bit physical page number; the hit location drives a 32:1 mux that selects the matching physical page number, which is concatenated with the page offset to form the physical address]
Protection • Multiprogramming forces us to worry about it • think about what happens on your workstation • it would be annoying if your program clobbered your email files • There are lots of processes • Implies lots of task switch overhead • HW must provide savable state • OS must promise to save and restore properly • most machines task switch every few milliseconds • a task switch typically takes several microseconds • also implies inter-process communication • which implies OS intervention • which implies a task switch • which implies less of the duty cycle gets spent on the application
Protection Options • Simplest - base and bound • 2 registers - check each address falls between the values • these registers must be changed by the OS but not the app • need for 2 modes: regular & privileged • hence need to privilege-trap and provide mode switch ability • VM provides another option • check as part of the VA --> PA translation process • The protection bits reside in the page table & TLB • Other options • rings - à la MULTICS and now the Pentium • inner is most privileged - outer is least • capabilities (i432) - similar to a key or password model • OS hands them out - so they’re difficult to forge • in some cases they can be passed between apps
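A sketch of the simplest option, base and bound, with illustrative register values; the two registers are loaded only by the OS in privileged mode, and every reference is checked before it reaches memory:

```c
#include <stdint.h>
#include <stdbool.h>

/* Privileged registers, set by the OS on each task switch;
 * the values below are purely illustrative. */
static uint32_t base_reg  = 0x40000000;  /* start of the allowed region */
static uint32_t bound_reg = 0x00100000;  /* size of the allowed region  */

/* Check that the address falls between base and base + bound. */
bool access_ok(uint32_t addr)
{
    return addr >= base_reg && addr < base_reg + bound_reg;
}

/* Every memory reference is checked; a failure traps to the OS. */
uint32_t checked_access(uint32_t addr)
{
    if (!access_ok(addr)) {
        /* protection fault: trap to privileged mode (modelled as a sentinel) */
        return (uint32_t)-1;
    }
    return addr;
}
```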
VAX 11/780 Example • Hybrid - segmented + paged • segments used for system/process separation • paging used for VM, relocation, and protection • reasonably common again (was common in the late 60’s too) • Segments - 1 for system and 1 for processes • high order address bit 31: 1 - system, 0 - process • all processes have their own space but share the process segment • bit 30 divides the process segment into 2 regions • = 1: P1 grows downward (stack), = 0: P0 grows upward (text, heap) • Protection • pair of base and bound registers for S, P0, and P1 • saves page table space since the page size was 512 bytes
VAX-11/780 Address Mapping • [Figure: bits 31:30 of the virtual address select one of 3 separate page tables (SPT, P0PT, P1PT), bits 29:9 hold the 21-bit virtual page number, and bits 8:0 hold the 9-bit page offset; the page index is checked against the table’s bound (fault if it is out of range), and the selected entry supplies a 21-bit page frame number that is concatenated with the 9-bit page offset and sent to the TLB]
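A small sketch of the address decomposition the figure shows: bits 31 and 30 select among the three page tables, and the 512-byte pages give a 9-bit offset and a 21-bit virtual page number:

```c
#include <stdint.h>

typedef enum { VAX_P0, VAX_P1, VAX_SYS } vax_region_t;

/* Bit 31 selects system vs. process space; bit 30 splits the process
 * space into P0 (grows up) and P1 (grows down), as described above. */
vax_region_t vax_region(uint32_t vaddr)
{
    if (vaddr & (1u << 31))
        return VAX_SYS;                  /* system segment: shared page table */
    return (vaddr & (1u << 30)) ? VAX_P1 : VAX_P0;
}

/* 512-byte pages: 9-bit offset (bits 8:0), 21-bit VPN (bits 29:9). */
static inline uint32_t vax_vpn(uint32_t vaddr)    { return (vaddr >> 9) & 0x1FFFFF; }
static inline uint32_t vax_offset(uint32_t vaddr) { return vaddr & 0x1FF; }
```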
More VAX-11/780 • System page tables are pinned (frozen) in physical memory • virtual and physical page numbers are the same • OS handles replacement so it never moves these pages • P0 and P1 tables are in virtual memory • hence they can be missed as well • double page fault is therefore possible • Page Table Entry • M - modify indicating dirty page • V - valid indicating a real entry • PROT - 4 protection bits • 21 bit physical page number • no reference or use bit - hence hokey LRU accounting - use M
VAX PROT bits • 4 levels of use - each with its own stack and stack pointer (R14) • kernel - most trusted • executive • supervisor • user - least trusted • 3 types of access • no access, read, or write • Bizarre encoding • all 16 4-bit patterns meant something • if there was a model to their encoding method it eludes everybody I know • 1001 - R/W for kernel and exec, R for supervisor, zip for user
VAX 11/780 TLB Parameters • 2-way set associative but partitioned • two 64-entry banks - top 32 of each bank is reserved for system • on a task switch only need to invalidate half the TLB • high order address bit selects the bank (corresponds to P0, P1 distinction) • split increases miss rate but under high task switch rate may still be a win
VAX 11/780 TLB Action • Steps 1&2 • bit 31 ## Index passed to both banks (both set members) • V must be 1 for anything to continue - TLB miss trap if V=0 • PROT bits checked against R/W access type and which kind of user (from the PSW) - TLB protection trap if violation • Step 3 • page table tag compared with TLB tag • both banks done in parallel • both cannot match - if no match then TLB-miss trap • Step 4 • matching side’s 21 bit physical page address passed thru MUX • Step 5 • 21 bit physical page addr ## 9 bit page offset = physical address • if P=1 then cache hit and address goes to the cache - note cache hit time stretch • if P=0 then page fault trap
The Alpha AXP 21064 Also a page/segment combo • Segmented 64 bit address space • Seg0 (addr[63] = 0) & Seg1 (addr[63:62] = 11) • Mapped into pages for user code • Seg0 for text and heap sections grows upwards • Seg1 for stack which grows downward • Kseg (addr[63:62] = 10) • Reserved for the O.S. • Uniformly protected space, no memory management • Advantages of the split model • Segmentation conserves page table space • Paging provides VM, protection, and relocation • O.S. tends to need to be resident anyway
Page Table Problems… 64-bit address is the first-order cause • Reduction approach - go hierarchical • 3 levels - each page table limited to a single page in size • current page size is 8KB but claim to support 16, 32, and 64KB in the future • Superpage model extends TLB REACH – used in MIPS R10k • Uses 34-bit physical addresses (max could be 41) • Virtual address = [seg-selector | Lvl1 | Lvl2 | Lvl3 | Offset] • Mapping • Page table base register points to base of LVL1-TBL • LVL1-TBL[Lvl1] + Lvl2 points to LVL2-TBL entry • LVL2-TBL entry + Lvl3 points to LVL3-TBL entry • LVL3-TBL entry provides physical page number finally • PPN##Offset => physical address for main memory
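A sketch of the three-level walk, assuming 8 KB pages, 8-byte entries (so 1K entries and 10 index bits per level) and a hypothetical PTE layout with the valid bit in bit 0 and the frame number above the offset bits; the real Alpha encoding differs, and physical memory here is a toy in-process array:

```c
#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 13                     /* 8 KB pages */
#define LVL_BITS    10                     /* 1K eight-byte entries per table */
#define LVL_MASK    ((1u << LVL_BITS) - 1)
#define PTE_VALID   1ULL
#define NUM_FRAMES  256

/* Toy physical memory: NUM_FRAMES frames, each one page of 64-bit words.
 * This sketch assumes all frame numbers stored in PTEs are < NUM_FRAMES. */
static uint64_t mem[NUM_FRAMES][1u << LVL_BITS];

/* Hypothetical PTE layout: frame number sits above the offset bits. */
static uint64_t pte_pfn(uint64_t pte) { return pte >> OFFSET_BITS; }

bool walk(uint64_t ptbr_pfn, uint64_t vaddr, uint64_t *paddr)
{
    uint32_t idx[3] = {
        (uint32_t)(vaddr >> (OFFSET_BITS + 2 * LVL_BITS)) & LVL_MASK,  /* Lvl1 */
        (uint32_t)(vaddr >> (OFFSET_BITS + LVL_BITS)) & LVL_MASK,      /* Lvl2 */
        (uint32_t)(vaddr >> OFFSET_BITS) & LVL_MASK,                   /* Lvl3 */
    };
    uint64_t pfn = ptbr_pfn;               /* page table base register */

    for (int level = 0; level < 3; level++) {   /* three sequential lookups */
        uint64_t pte = mem[pfn][idx[level]];
        if (!(pte & PTE_VALID))
            return false;                        /* page (or table) fault */
        pfn = pte_pfn(pte);
    }
    *paddr = (pfn << OFFSET_BITS) | (vaddr & ((1ULL << OFFSET_BITS) - 1));
    return true;
}
```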
Alpha VPN => PPN Mapping • [Figure: the virtual address (seg0/seg1 selector, level 1, level 2, and level 3 fields, plus the page offset) is translated by indexing the L1 table from the page table base register, then indexing the L2 and L3 tables from successive entries; the final entry supplies the physical page frame number, which is concatenated with the page offset to form the physical address. Note the potential delay problem – three sequential lookups]
Page Table Entries 64-bit (8-byte) entries => 1K entries per one-page table • low order 32 bits = page frame number • 5 Protection fields • valid - e.g. OK to do address translation • user Read Enable • kernel Read Enable • user Write Enable • kernel Write Enable • Other fields • system accounting - like a USE field • high order bits basically unused - real Virtual Addr = 43 bits • common hack to save chip area in the early implementations - different now • interesting to see OS problems as VM goes to bigger pages
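A sketch of decoding such an entry; the fields are the ones listed above, but the bit positions chosen here are hypothetical for illustration, not the real Alpha layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical bit assignments: PFN in the low 32 bits as stated above,
 * protection/valid bits placed arbitrarily in the upper half. */
#define PTE_V    (1ULL << 32)   /* valid */
#define PTE_KRE  (1ULL << 33)   /* kernel Read Enable */
#define PTE_URE  (1ULL << 34)   /* user Read Enable */
#define PTE_KWE  (1ULL << 35)   /* kernel Write Enable */
#define PTE_UWE  (1ULL << 36)   /* user Write Enable */

static inline uint32_t pte_frame(uint64_t pte) { return (uint32_t)pte; }
static inline bool     pte_valid(uint64_t pte) { return (pte & PTE_V) != 0; }

/* Check an access: is_write comes from the instruction, is_kernel from
 * the current processor mode. */
bool pte_allows(uint64_t pte, bool is_write, bool is_kernel)
{
    if (!pte_valid(pte))
        return false;
    if (is_write)
        return (pte & (is_kernel ? PTE_KWE : PTE_UWE)) != 0;
    return (pte & (is_kernel ? PTE_KRE : PTE_URE)) != 0;
}
```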
Alpha TLB Stats • Contiguous pages can be mapped as 1 TLB entry • Options of 8, 64, or 512 pages – also extends TLB reach • Complicates the TLB somewhat… • Separate ITLB and DTLB
The Whole Alpha Memory System • Boot • initial Icache fill from boot PROM • hence no Valid bit needed - Vbits in Dcache are cleared • PC set to kseg so no address translation and the TLB can be bypassed • then start loading the TLB for real from the kseg OS code • subsequent misses call PAL (Priv. Arch. Library) to remap and fill the miss • Once ready to run user code • OS sets PC to appropriate seg0 address • then the memory hierarchy is ready to go
Things to Note • Prefetch stream buffer • holds next Ifetch access so an Icache miss checks there first • L1 is write through with a write buffer • avoids CPU stall on a write miss • 4 block capacity • write merge - if the block is the same then the writes are merged • L2 is write back with a victim buffer • 1 block - so a dirty L2 miss delays its stall until the 2nd miss • normally this is safe • Full datapath shown in Figure 5.47 • worth the time to follow the numbered balls
Alpha 21064 I-fetch • I-Cache Hit (1 cycle) • Step 1: Send address to TLB and Cache • Step 2: Icache lookup (8KB, DM, 32-byte blocks) • Step 3: ITLB Lookup (12 entries, FA) • Step 4: Cache hit and valid PTE • Step 5: Send 8 bytes to CPU • I-Cache miss, Prefetch buffer (PFB) hit (1 cycle) • Step 6: Start L2 access, just in case • Step 7: Check prefetch buffer • Step 8: Prefetch buffer hit, send 8 bytes to CPU • Step 9: Refill I-Cache from PFB, and cancel L2 access • I-Cache Miss, Prefetch buffer miss, L2 Cache Hit • Step 10: Check L2 cache tag • Step 11: Return critical 16B (5 cycles) • Step 12: Return other 16B (5 cycles) • Step 13: Prefetch next sequential block to PFB (10 cycles)
Alpha 21064 L2 Cache Miss • L2 Cache Miss • Step 14: Send new request to main memory • Step 15: Put dirty victim block in victim buffer • Step 16: Load new block in L2 cache, 16B at a time • Step 17: Write Victim buffer to memory • Data loads are like instruction fetches, except use DTLB and D-Cache instead of the ITLB and I-Cache • Allows hits under miss • On a read miss, the write buffer is flushed first to avoid RAW hazards
Alpha 21064 Data Store • Data Store • Step 18: DTLB Lookup and protection violation check • Step 19: D-Cache lookup (8KB, DM, write through) • Step 22: Check D-Cache tag • Step 24: Send data to write buffer • Step 25: Send data to delayed write buffer in front of D-Cache • Write hits are pipelined • Step 26: Write previous delayed write buffer to D-Cache • Step 27: Merge data into write buffer • Step 28: Write data at the head of write buffer to L2 cache (15 cycles)
AXP Memory is Quite Complex How well does it work? • Table 5.48 in the book tells… • Summary – interesting to note: • Alpha 21064 is a dual-issue machine, but nowhere is the CPI < 1 • Worst case are the TPC benchmarks • Memory intensive, so more stalls • CPI bloats to 4.3 (1.8 of which is from the caches) • 1.67 due to other stalls (i.e. load dependencies) • Average SPECint CPI is 1.86 • 0.77 contributed by caches, 0.74 by other stalls • Average SPECfp CPI is 2.14 • 0.45 by caches, 0.98 by other stalls • Superscalar issue is sorely limited!
Summary 1 • CPU performance is outpacing main memory performance • Principle of locality saves us: thus the Memory Hierarchy • The memory hierarchy should be designed as a system • Key old ideas: • Bigger caches • Higher set-associativity • Key new ideas: • Non-blocking caches • Ancillary caches • Multi-port caches • Prefetching (software and/or hardware controlled)