Understanding Virtual Memory and Address Translation Basics

Virtual Memory and Address Translation

text data BSS user stack args/env kernel Virtual Addressing virtual memory (big) physical memory (small) User processes address memory through virtual addresses. The kernel controls the virtual-physical translations in effect for each space. data The kernel and the machine collude to translate virtual addresses to physical addresses. The machine does not allow a user process to access memory unless the kernel “says it’s OK”. virtual-to-physical translations The specific mechanisms for memory management and address translation are machine-dependent.

header text data idata wdata symbol table relocation records What’s in an Object File or Executable? Header “magic number” indicates type of image. program instructions p Section table an array of (offset, len, startVA) immutable data (constants) “hello\n” program sections writable global/static data j, s j, s ,p,sbuf Used by linker; may be removed after final link step and strip. int j = 327; char* s = “hello\n”; char sbuf[512]; int p() { int k = 0; j = write(1, s, 6); return(j); }

text data header BSS text data idata user stack wdata args/env symbol table kernel relocation records The Program and the Process VAS BSS “Block Started by Symbol” (uninitialized global data) e.g., heap and sbuf go here. Process text segment is initialized directly from program text section. data segments sections Process BSS segment may be expanded at runtime with a system call (e.g., Unix sbrk) called by the heap manager routines. Process data segment(s) are initialized from idata and wdata sections. process VAS program Process stack and BSS (e.g., heap) segment(s) are zero-filled. Text and idata segments may be write-protected. Args/env strings copied in by kernel when the process is created.

Virtual Address Translation virtual address 29 0 13 Example: typical 32-bit architecture with 8KB pages. 00 VPN offset Virtual address translation maps a virtual page number (VPN) to a physical page frame number (PFN): the rest is easy. address translation Deliver exception to OS if translation is not valid and accessible in requested mode. { + PFN physical address offset

Role of MMU Hardware and OS VM address translation must be very cheap (on average). • Every instruction includes one or two memory references. (including the reference to the instruction itself) VM translation is supported in hardware by a Memory Management Unit or MMU. • The addressing model is defined by the CPU architecture. • The MMU itself is an integral part of the CPU. The role of the OS is to install the virtual-physical mapping and intervene if the MMU reports that it cannot complete the translation.

The OS Directs the MMU The OS controls the operation of the MMU to select: (1) the subset of possible virtual addresses that are valid for each process (the process virtual address space); (2) the physical translations for those virtual addresses; (3) the modes of permissible access to those virtual addresses; read/write/execute (4) the specific set of translations in effect at any instant. need rapid context switch from one address space to another MMU completes a reference only if the OS “says it’s OK”. MMU raises an exception if the reference is “not OK”.

The Translation Lookaside Buffer (TLB) An on-chip hardware translation buffer (TB or TLB) caches recently used virtual-physical translations (ptes). Alpha 21164: 48-entry fully associative TLB. A CPU pipeline stage probes the TLB to complete over 99% of address translations in a single cycle. Like other memory system caches, replacement of TLB entries is simple and controlled by hardware. e.g., Not Last Used If a translation misses in the TLB, the entry must be fetched by accessing the page table(s) in memory. cost: 10-500 cycles

Care and Feeding of TLBs The OS kernel carries out its memory management functions by issuing privileged operations on the MMU. Choice 1: OS maintains page tables examined by the MMU. • MMU loads TLB autonomously on each TLB miss • page table format is defined by the architecture • OS loads page table bases and lengths into privileged memory management registers on each context switch. Choice 2: OS controls the TLB directly. • MMU raises exception if the needed pte is not in the TLB. • Exception handler loads the missing pte by reading data structures in memory (software-loaded TLB).

A Simple Page Table Each process/VAS has its own page table. Virtual addresses are translated relative to the current page table. process page table PFN 0 PFN 1 PFN i In this example, each VPN j maps to PFN j, but in practice any physical frame may be used for any virtual page. PFN i + offset page #i offset The page tables are themselves stored in memory; a protected register holds a pointer to the current page table. user virtual address physical memory page frames

Page Tables (2) Second-level page tables 32 bit address with 2 page table fields Two-level page tables Top-level page table [from Tanenbaum]

PFN A Page Table Entry (PTE) This is (roughly) what a MIPS/Nachos page table entry (pte) looks like. valid bit: OS uses this bit to tell the MMU if the translation is valid. write-enable: OS touches this to enable or disable write access for this mapping. dirty bit: MMU sets this when a store is completed to the page (page is modified). reference bit: MMU sets this when a reference is made through the mapping.

Page Tables (3) Typical page table entry [from Tanenbaum]

Completing a VM Reference probe page table MMU access physical memory load TLB start here probe TLB access valid? raise exception load TLB zero-fill OS page on disk? page fault? fetch from disk allocate frame signal process

Demand Paging and Page Faults OS may leave some virtual-physical translations unspecified. mark the pte for a virtual page as invalid If an unmapped page is referenced, the machine passes control to the kernel exception handler (page fault). passes faulting virtual address and attempted access mode Handler initializes a page frame, updates pte, and restarts. If a disk access is required, the OS may switch to another process after initiating the I/O. Page faults are delivered at IPL 0, just like a system call trap. Fault handler executes in context of faulted process, blocks on a semaphore or condition variable awaiting I/O completion.

text data BSS user stack args/env kernel Where Pages Come From Modified (dirty) pages are pushed to backing store (swap) on eviction. file volume with executable programs data Fetches for clean text or data are typically fill-from-file. Paged-out pages are fetched from backing store when needed. Initial references to user stack and BSS are satisfied by zero-fill on demand.

Processes and the Kernel The kernel sets up process execution contexts to “virtualize” the machine. processes in private virtual address spaces data data ...and upcalls (e.g., signals) system call traps Threads or processes enter the kernel for services. shared kernel code and data in shared address space CPU and devices force entry to the kernel to handle exceptional events.

The Kernel • Today, all “real” operating systems have protected kernels. The kernel resides in a well-known file: the “machine” automatically loads it into memory (boots) on power-on/reset. Our “kernel” is called the executive in some systems (e.g., XP). • The kernel is (mostly) a library of service procedures shared by all user programs, butthekernel is protected: User code cannot access internal kernel data structures directly, and it can invoke the kernel only at well-defined entry points (system calls). • Kernel code is like user code, but the kernel is privileged: The kernel has direct access to all hardware functions, and defines the machine entry points for interrupts and exceptions.

Protecting Entry to the Kernel Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, XP, etc). • The machine defines a small set of exceptional event types. • The machine defines what conditions raise each event. • The kernel installs handlers for each event at boot time. e.g., a table in kernel memory read by the machine user The machine transitions to kernel mode only on an exceptional event. The kernel defines the event handlers. Therefore the kernel chooses what code will execute in kernel mode, and when. interrupt or exception trap/return kernel

Example: System Call Traps User code invokes kernel services by initiating system call traps. • Programs in C, C++, etc. invoke system calls by linking to a standard library of procedures written in assembly language. the library defines a stub or wrapper routine for each syscall stub executes a special trap instruction (e.g., chmk or callsys or int) syscall arguments/results passed in registers or user stack Alpha CPU architecture • read() in Unix libc.a library (executes in user mode): • #define SYSCALL_READ 27 # code for a read system call • move arg0…argn, a0…an # syscall args in registers A0..AN • move SYSCALL_READ, v0 # syscall dispatch code in V0 • callsys # kernel trap • move r1, _errno # errno = return status • return

Faults Faults are similar to system calls in some respects: • Faults occur as a result of a process executing an instruction. Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel. • The completed fault handler may return to the faulted context. But faults are different from syscall traps in other respects: • Syscalls are deliberate, but faults are “accidents”. divide-by-zero, dereference invalid pointer, memory page fault • Not every execution of the faulting instruction results in a fault. may depend on memory state or register contents

The Role of Events A CPU event is an “unnatural” change in control flow. Like a procedure call, an event changes the PC. Also changes mode or context (current stack), or both. Events do not change the current space! The kernel defines a handler routine for each event type. Event handlers always execute in kernel mode. The specific types of events are defined by the machine. Once the system is booted, every entry to the kernel occurs as a result of an event. In some sense, the whole kernel is a big event handler.

CPU Events: Interrupts and Exceptions An interrupt is caused by an external event. device requests attention, timer expires, etc. An exception is caused by an executing instruction. CPU requires software intervention to handle a fault or trap. control flow exception.cc AST: Asynchronous System Trap Also called a software interrupt or an Asynchronous or Deferred Procedure Call (APC or DPC) event handler (e.g., ISR: Interrupt Service Routine) • Note: different “cultures” may use some of these terms (e.g., trap, fault, exception, event, interrupt) slightly differently.

Mode, Space, and Context At any time, the state of each processor is defined by: 1. mode: given by the mode bit Is the CPU executing in the protected kernel or a user program? 2. space: defined by V->P translations currently in effect What address space is the CPU running in? Once the system is booted, it always runs in some virtual address space. 3. context: given by register state and execution stream Is the CPU executing a thread/process, or an interrupt handler? Where is the stack? These are important because the mode/space/context determines the meaning and validity of key operations.

0 text data data BSS user stack args/env The Virtual Address Space 0x0 A typical process VAS space includes: • user regions in the lower half V->P mappings specific to each process accessible to user or kernel code • kernel regions in upper half shared by all processes, but accessible only to kernel code • NT (XP?) on x86 subdivides kernel region into an unpaged half and a (mostly) paged upper half at 0xC0000000 for page tables and I/O cache. • Win95/98 uses the lower half of system space as a system-wide shared region. sbrk() jsr 2n-1 kernel text and kernel data 2n-1 0xffffffff A VAS for a private address space system (e.g., Unix, NT/XP) executing on a typical 32-bit system (e.g., x86).

Process and Kernel Address Spaces 0x0 0 n-bit virtual address space 32-bit virtual address space data data 0x7FFFFFFF 2n-1-1 2n-1 0x80000000 0xFFFFFFFF 2n-1

Faults Faults are similar to system calls in some respects: • Faults occur as a result of a process executing an instruction. Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel. • The completed fault handler may return to the faulted context. But faults are different from syscall traps in other respects: • Syscalls are deliberate, but faults are “accidents”. divide-by-zero, dereference invalid pointer, memory page fault • Not every execution of the faulting instruction results in a fault. may depend on memory state or register contents

Options for Handling a Fault (1) 1. Some faults are handled by “patching things up” and returning to the faulted context. Example: the kernel may resolve an address fault (virtual memory fault) by installing a new virtual-physical translation. The fault handler may adjust the saved PC to re-execute the faulting instruction after returning from the fault. 2. Some faults are handled by notifying the process that the fault occurred, so it may recover in its own way. Fault handler munges the saved user context (PC, SP) to transfer control to a registered user-mode handler on return from the fault. Example: Unix signals or Microsoft NT user-mode Asynchronous Procedure Calls (APCs).

Options for Handling a Fault (2) 3. The kernel may handle unrecoverable faults by killing the user process. Program fault with no registered user-mode handler? Destroy the process, release its resources, maybe write the memory image to a file, and find another ready process/thread to run. In Unix this is the default action for many signals (e.g., SEGV). 4. How to handle faults generated by the kernel itself? Kernel follows a bogus pointer? Divides by zero? Executes an instruction that is undefined or reserved to user mode? These are generally fatal operating system errors resulting in a system crash, e.g., panic()!

Issues for Paged Memory Management The OS tries to minimize page fault costs incurred by all processes, balancing fairness, system throughput, etc. (1) fetch policy: When are pages brought into memory? prepaging: reduce page faults by bring pages in before needed clustering: reduce seeks on backing storage (2) replacement policy: How and when does the system select victim pages to be evicted/discarded from memory? (3) backing storage policy: Where does the system store evicted pages? When is the backing storage allocated? When does the system write modified pages to backing store?

text data BSS user stack args/env kernel Virtual Memory as a Cache virtual memory (big) physical memory (small) backing storage executable file header pageout/eviction text data idata data wdata symbol table, etc. page fetch program sections physical page frames process segments virtual-to-physical translations

Rationale for I/O Cache Structure Goal: maintain Kslots in memory as a cache over a collection of m items on secondary storage (K << m). 1. What happens on the first access to each item? Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later. 2. How to determine if an item is resident in the cache? Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative). 3. How to find a slot for an item fetched into the cache? Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.

Mechanism for Cache Eviction/Replacement Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse. • Busy items in active use are not on the list. E.g., some in-memory data structure holds a pointer to the item. E.g., an I/O operation is in progress on the item. • The best candidates are slots that do not contain valid items. Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed). • Other slots are listed in order of value of the items they contain. These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.

Replacement Policy The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list. defines the replacement policy A typical cache replacement policy is Least Recently Used. • Assume hot items used recently are likely to be used again. • Move the item to the tail of the free list on every release. • The item at the front of the list is the coldest inactive item. Other alternatives: • FIFO: replace the oldest item. • MRU/LIFO: replace the most recently used item.

The Page Caching Problem Each thread/process/job utters a stream of page references. • reference string: e.g., abcabcdabce.. The OS tries to minimize the number of faults incurred. • The set of pages (the working set) actively used by each job changes relatively slowly. • Try to arrange for the resident set of pages for each active job to closely approximate its working set. Replacement policy is the key. • On each page fault, select a victim page to evict from memory; read the new page into the victim’s frame. • Most systems try to approximate an LRU policy.

VM Page Cache Internals HASH(memory object/segment, logical block) 1. Pages in active use are mapped through the page table of one or more processes. 2. On a fault, the global object/offset hash table in kernel finds pages brought into memory by other processes. 3. Several page queues wind through the set of active frames, keeping track of usage. 4. Pages selected for eviction are removed from all page tables first.

Managing the VM Page Cache Managing a VM page cache is similar to a file block cache, but with some new twists. 1. Pages are typically referenced by page table (pmap) entries. Must invalidate mappings before reusing the frame. 2. Reads and writes are implicit; the TLB hides them from the OS. How can we tell if a page is dirty? How can we tell if a page is referenced? 3. Cache manager must run policies periodically, sampling page state. Continuously push dirty pages to disk to “launder” them. Continuouslycheck references to judge how “hot” each page is. Balance accuracy with sampling overhead.

The Paging Daemon Most OS have one or more system processes responsible for implementing the VM page cache replacement policy. • A daemon is an autonomous system process that periodically performs some housekeeping task. The paging daemon prepares for page eviction before the need arises. • Wake up when free memory becomes low. • Clean dirty pages by pushing to backing store. prewrite or pageout • Maintain ordered lists of eviction candidates. • Decide how much memory to allocate to file cache, VM, etc.

LRU Approximations for Paging Pure LRU and LFU are prohibitively expensive to implement. • most references are hidden by the TLB • OS typically sees less than 10% of all references • can’t tweak your ordered page list on every reference Most systems rely on an approximation to LRU for paging. • periodically sample the reference bit on each page visit page and set reference bit to zero run the process for a while (the reference window) come back and check the bit again • reorder the list of eviction candidates based on sampling

FIFO with Second Chance Idea: do simple FIFO replacement, but give pages a “second chance” to prove their value before they are replaced. • Every frame is on one of three FIFO lists: active, inactive and free • Page fault handler installs new pages on tail of active list. • “Old” pages are moved to the tail of the inactive list. Paging daemon moves pages from head of active list to tail of inactive list when demands for free frames is high. Clear the refbit and protect the inactive page to “monitor” it. • Pages on the inactive list get a “second chance”. If referenced while inactive, reactivate to the tail of active list.

Illustrating FIFO-2C I.Restock inactive list by pulling pages from the head of the active list: clear the ref bit and place on inactive list (deactivation). active list II.Inactive list scan from head: 1. Page has been referenced? Return to tail of active list (reactivation). inactive list 2. Page has not been referenced? pmap_page_protect and place on tail of free list. 3. Page is dirty? Push to backing store and return it to inactive list tail (clean). free list Consume frames from the head of the free list (free). If free shrinks below threshhold, kick the paging daemon to start a scan (I, II). Paging daemon typically scans a few times per second, even if not needed to restock free list.

FIFO-2C in Action (FreeBSD)

What Do the Pretty Colors Mean? This is a plot of selected internal kernel events during a run of a process that randomly reads/writes its virtual memory. • x-axis: time in milliseconds (total run is about 3 seconds) • y-axis: each event pertains to a physical page frame, whose PFN is given on the y-axis The machine is an Alpha with 8000 8KB pages (64MB total) The process uses 48MB of virtual memory: force the paging daemon to do FIFO-2C bookkeeping, but little actual paging. • events: page allocate (yellow-green), page free (red), deactivation (duke blue), reactivation (lime green), page clean (carolina blue).

What to Look For • Some low physical memory ranges are reserved to the kernel. • Process starts and soaks up memory that was initially free. • Paging daemon frees pages allocated to other processes, and the system reallocates them to the test process. • After an initial flurry of demand-loading activity, things settle down after most of the process memory is resident. • Paging daemon begins to run more frequently as memory becomes overcommitted (dark blue deactivation stripes). • Test process touches pages deactivated by the paging daemon, causing them to be reactivated. • Test process exits (heavy red bar).

page alloc

deactivate

clean

free

activate

Questions for Paged Virtual Memory 1. How do we prevent users from accessing protected data? 2. If a page is in memory, how do we find it? Address translation must be fast. 3. If a page is not in memory, how do we find it? 4. When is a page brought into memory? 5. If a page is brought into memory, where do we put it? 6. If a page is evicted from memory, where do we put it? 7. How do we decide which pages to evict from memory? Page replacement policy should minimize overall I/O.

Understanding Virtual Memory and Address Translation Basics