The Machine and the Kernel Jeff Chase Duke University
Today • The operating system kernel! • What is it? • Where is it? • How do we get there? • How is it protected? • How does it control resources? • How does it control access to data? • How does it keep control? [Figure: user processes/segments in user space; the kernel in kernel space.]
Precap: the kernel • Today, all “real” operating systems have protected kernels. • The kernel resides in a well-known file: the machine automatically loads it into memory and starts it (boot) on power-on/reset. • The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected: • User code cannot access internal kernel data structures directly. • User code can invoke the kernel only at well-defined entry points (system calls). • Kernel code is “just like” user code, but the kernel is privileged: • The kernel has direct access to all machine functions, and defines the handler entry points for CPU events: trap, fault, interrupt. • Once booted, the kernel acts as one big event handler.
The kernel VAS • The kernel is just a program: a collection of modules and their state. • E.g., it may be written in C and compiled/linked a little differently. • E.g., linked with the -static option: no dynamic libs • At runtime, kernel code and data reside in a protected range of virtual addresses. • The range (kernel space) is “part of” every VAS. • VPN->PFN translations for kernel space are global. • (Details vary by machine and OS configuration) • Access to kernel space is denied for user programs. • Portions of kernel space may be non-pageable and/or direct-mapped to machine memory. [Figure: a VAS running from address 0x0 to high addresses: user space first, then kernel space (kernel code, kernel data) at the high end.]
Example: Windows/IA32 • User spaces: one per VAS; each occupies the “low half” of the VAS (2GB). Kernel space: high-order bit set in virtual address. • Alternative configuration allows user spaces larger than 2GB. Kernel space: two highest bits are set (0xc00..).
Windows IA-32 (kernel space) • The point is: there are lots of different regions within kernel space to meet internal OS needs: page tables for various VAS, page table for kernel space itself, file block cache, internal data structures. • The details aren’t important.
CPU mode: user and kernel • The current mode of a CPU core is represented by a field in a protected register. • We consider only two possible values: user mode or kernel mode (also called protected mode or supervisor mode). • If the core is in protected mode then it can: - access kernel space - access certain control registers - execute certain special instructions • If software attempts to do any of these things when the core is in user mode, then the core raises a CPU exception (a fault). [Figure: a CPU core with a U/K mode field, registers R0..Rn, and the PC.]
x86 control registers The details aren’t important. See [en.wikipedia.org/wiki/Control_register]
Entering the kernel • Suppose a CPU core is running user code in user mode: • The user program controls the core. • The core goes where the program code takes it… • …as determined by its register state (context) and the values encountered in memory. • How does the OS get control back? How does the core switch to kernel mode? • CPU exception: trap, fault, interrupt • On an exception, the CPU transitions to kernel mode and resets the PC and SP registers. • Set the PC to execute a pre-designated handler routine for that exception type. • Set the SP to a pre-designated kernel stack. [Figure: safe control transfer from user space into kernel space (kernel code, kernel data).]
Exceptions: trap, fault, interrupt • Synchronous (caused by an instruction), intentional (happens every time): trap. System call: open, close, read, write, fork, exec, exit, wait, kill, etc. • Synchronous, unintentional (contributing factors): fault. Invalid or protected address or opcode, page fault, overflow, etc. • Asynchronous (caused by some other event), intentional: “software interrupt”. Software requests an interrupt to be delivered at a later time. • Asynchronous, unintentional: interrupt. Caused by an external event: I/O op completed, clock tick, power fail, etc.
Entry to the kernel • Every entry to the kernel is the result of a trap, fault, or interrupt. The core switches to kernel mode and transfers control to a handler routine. • The handler accesses the core register context to read the details of the exception (trap, fault, or interrupt). It may call other kernel routines. [Figure: syscall trap/return and fault/return enter the OS kernel code and data for system calls (files, process fork/exit/wait, pipes, binder IPC, low-level thread support, etc.) and virtual memory management (page faults, etc.); I/O completions and timer ticks enter by interrupt/return.]
“Limited direct execution” • User code runs on a CPU core in user mode in a user space. If it tries to do anything weird, the core transitions to the kernel, which takes over. • The kernel executes a special instruction to transition to user mode (labeled as “u-return”), with selected values in CPU registers. [Figure: timeline starting at boot. User mode runs from each u-start/u-return until the next syscall trap or fault, which enters the kernel “top half” in kernel mode; interrupts enter the kernel “bottom half” (interrupt handlers) and end with an interrupt return.]
The kernel must be bulletproof • Secure kernels handle system calls verrry carefully. • Syscalls indirect through syscall dispatch table by syscall number. No direct calls to kernel routines from user space! • Kernel copies all arguments into kernel space and validates them. • Kernel interprets pointer arguments in context of the user VAS, and copies the data in/out of kernel space (e.g., for read and write syscalls). • What about references to kernel data objects passed as syscall arguments (e.g., file to read or write)? Use an integer index into a kernel table that points at the data object. The value is called a handle or descriptor. No direct pointers to kernel data from user space! [Figure: a user program in user space traps into the kernel; copyin and copyout move data between user buffers and the kernel’s read() {…} and write() {…} routines.]
Safe copy primitives This slide clarifies the safe copy primitives used by kernel syscall handlers. The names may be confusing to some of us (;-). Copyin to the kernel, copyout from the kernel. This is an example from BSD Unix systems: the details aren’t important, but note the safety features. [From http://www.unix.com/man-page/FreeBSD/9/copyout/] copyin() Copies len bytes of data from the user-space address uaddr to the kernel-space address kaddr. copyout() <copyout copies out of the kernel and in to the user-space buffer. > copyinstr() Copies a NUL-terminated string, at most len bytes long, from user-space address uaddr to kernel-space address kaddr. The number of bytes actually copied, including the terminating NUL, is returned in *done… RETURN VALUES The copy functions return 0 on success or EFAULT if a bad address is encountered. In addition, the copystr(), and copyinstr() functions return ENAMETOOLONG if the string is longer than len bytes.
Example: Unix file I/O
An open file is represented by an integer file descriptor value returned by the kernel.
    /* Needs <fcntl.h>, <stdio.h>, <stdlib.h>, <unistd.h>. */
    char buf[BUFSIZE];
    int fd;
    ssize_t n;
    if ((fd = open("../zot", O_TRUNC | O_RDWR)) == -1) {
        perror("open failed");
        exit(1);
    }
    /* Copy standard input (fd 0) to the file, writing only the bytes read. */
    while ((n = read(0, buf, BUFSIZE)) > 0) {
        if (write(fd, buf, n) != n) {
            perror("write failed");
            exit(1);
        }
    }
Pass the file descriptor value back to the kernel to reference the open file on subsequent syscalls. Read/write syscalls pass the virtual address of a user-space buffer. For a write, the kernel retrieves data from the buffer and copies it in to kernel space. For a read, the kernel copies the data out of kernel space and places it into the buffer.
Unix “file descriptors” illustrated • Processes often reference OS kernel objects with integers that index into a table of pointers in the kernel. Windows calls them handles. • Example: a Unix file descriptor is a value stored in an ordinary integer variable in a user program. The kernel chooses the value for the descriptor: when the program opens a file, the kernel selects a free entry in the descriptor table and returns its index as the value. The program remembers the number and uses it to name the open file for subsequent system calls. [Figure: int fd in user space; in kernel space, the per-process descriptor table holds a pointer into the system-wide open file table, whose entries refer to a file, pipe, socket, or tty.] Disclaimer: this drawing is oversimplified (and we will talk about pipes, sockets, and tty later).
Syscall traps • Programs in C, C++, etc. invoke system calls by linking to a standard library (libc) whose syscall stubs are written in assembly. • The library defines a stub or wrapper routine for each syscall. • Stub executes a special trap instruction (e.g., chmk or callsys or syscall/sysenter instruction) to change mode to kernel. • Syscall arguments/results are passed in registers (or user stack). • OS+machine defines Application Binary Interface (ABI).
read() in Unix libc.a Alpha library (executes in user mode):
    #define SYSCALL_READ 27        # op ID for a read system call
    move arg0…argn, a0…an          # syscall args in registers A0..AN
    move SYSCALL_READ, v0          # syscall dispatch index in V0
    callsys                        # kernel trap
    move r1, _errno                # errno = return status
    return
Example read syscall stub for Alpha CPU ISA (defunct)
MacOS x86-64 syscall example
    section .data
    hello_world db "Hello World!", 0x0a
    section .text
    global start
    start:
    mov rax, 0x2000004    ; System call write = 4
    mov rdi, 1            ; Write to standard out = 1
    mov rsi, hello_world  ; The address of hello_world string
    mov rdx, 13           ; The size to write: 12 characters plus newline
    syscall               ; Invoke the kernel
    mov rax, 0x2000001    ; System call number for exit = 1
    mov rdi, 0            ; Exit success = 0
    syscall               ; Invoke the kernel
Illustration only: this program writes “Hello World!” to standard output (fd == 1), ignores the syscall error return, and exits. http://thexploit.com/secdev/mac-os-x-64-bit-assembly-system-calls/
Linux x64 syscall conventions (ABI) (user buffer addresses) Illustration only: the details aren’t important.
Anatomy of a read syscall 1. Compute (user mode). 2. Enter kernel for read syscall. 3. Figure out what disk blocks to fetch, and fetch them into kernel buffers. 4. Sleep for I/O (stall). 5. Copy data from kernel buffer to user buffer in read (kernel mode). 6. Return to user mode. [Figure: timeline of CPU and disk activity; the disk performs a seek and then a transfer (DMA).]
Hear the fans blow int main() { while(1); } How does the OS regain control of the core from this program? No system calls! No faults! How to give someone else a chance to run? How to “make” processes share machine resources fairly?
Timer interrupts • The system clock (timer) interrupts periodically, giving control back to the kernel. The kernel can do whatever it wants, e.g., switch threads. • Enables timeslicing. [Figure: timeline starting at boot. After u-start, user mode runs while(1); until a clock interrupt enters the kernel “bottom half” (interrupt handlers) in kernel mode; an interrupt return resumes the user code. The kernel “top half” is also shown.]
Virtual resource sharing • Understand that the OS kernel implements resource allocation (memory, CPU,…) by manipulating name spaces and contexts visible to user code. • The kernel retains control of user contexts and address spaces via the machine’s limited direct execution model, based on protected mode and exceptions. [Figure: sharing in space and in time.]
Memory Allocation How should an OS allocate its memory resources among contending demands? • Virtual address spaces: fork, exec, sbrk, page fault. • The kernel controls how many machine memory frames back the pages of each virtual address space. • The kernel can take memory away from a VAS at any time. • The kernel always gets control if a VAS (or rather a thread running within a VAS) asks for more. • The kernel controls how much machine memory to use as a cache for data blocks whose home is on slow storage. • Policy choices: which pages or blocks to keep in memory? And which ones to evict from memory to make room for others?
VM and files: the story so far [Figure: a process (running program) with a thread (register context) and a virtual address space divided into segments (regions). Text and globals are memory-mapped sections of the program file; heap and stack are anonymous segments (zero-fill). Files on “disk” are reached through file system calls (e.g., open/read/write). Per-file inodes are indexed with logical blockID #: read the disk block address from the map entry.]
Operating Systems: The Classical View • Programs run as independent processes. • Each process has a private virtual address space and one or more threads. • Protected OS kernel mediates access to shared resources. • Threads enter the kernel for OS services. • Protected system calls ...and upcalls (e.g., signals). • The kernel code and data are protected from untrusted processes. [Figure: processes, each with its own data, above the protected kernel.]
What is a Virtual Address Space? • Protection domain • A “sandbox” for threads that limits what memory they can access for read/write/execute. • A “lockbox” that limits which threads can access any given segment of virtual memory. • Uniform name space • Threads access their code and data items without caring where they are in machine memory, or even if they are resident in memory at all. • A set of V→P (virtual-to-physical) translations • A level of indirection mapping virtual pages to page frames. • The OS kernel controls the translations in effect at any time.
Virtual memory faults (1) • Machine memory is “just a cache” over files and segments: a page fault is “just a cache miss”. • Machine passes faulting address to kernel (e.g., x86 control register CR2) with fault type and faulting PC. • Kernel knows which virtual space is active on the core (e.g., x86 control register CR3). • Kernel consults other data structures related to virtual memory to figure out how to resolve the fault. • If the fault indicates an error, then signal/kill the process. • Else construct (or obtain) a frame containing the missing page, install the missing translation in the page table, and resume the user code, restarting the faulting instruction.
Virtual Addressing: Under the Hood [Figure: flowchart. MMU: start here; probe TLB. On a hit, check whether the access is valid: if yes, access physical memory; if no, raise exception. On a miss, probe page table and load TLB, or take a page fault. OS: on a page fault, a legal reference to a page on disk is fetched from disk; a first reference is zero-filled; the OS (looks up and/or) allocates a frame and loads the TLB. An illegal reference: kill.] How to monitor page reference events/frequency along the fast path?
Inside the VAS • The kernel keeps a vm_map for each VAS. The vm_map is a linked list of map entries, one for each segment, sorted by starting virtual address. • Each map entry points to a descriptor for the segment (a vm_object). • This data structure is used in Mach-derived kernels, including BSD Unix and Mac OS X. • “Vnode” refers to the inode for the underlying (backing) file. [Figure: the triangles represent VM objects (segments), e.g., the heap. The dots represent pages within segments. A segment may have any number of pages resident.] [http://manrix.sourceforge.net/microkernelservice.htm]
Inside the VAS • Text and initialized static data are mapped from the executable file. Missing pages may be fetched from the (backing) file on demand. • The stack and heap are zero-filled virtual memory: called anonymous because the backing file has no name (i.e., no links: it is destroyed if the process dies). • Pages from anonymous segments are initialized to zero, but the process may write to them. If they are evicted from memory the contents must be stored somewhere on disk. [http://manrix.sourceforge.net/microkernelservice.htm]
Virtual memory faults (2) • Kernel searches maps for the object mapped at that address. No object? Then it’s an error, e.g., segmentation fault. • Kernel checks intended mode of access for the object (rwx). Access not allowed? Then it’s an error, e.g., protection fault. • Steps: 1. Run down the vm_map for the VAS to find the segment/region containing the faulting address. 2. If we find the segment, check the protection to see if the access is legal. 3. If the access is legal, identify the backing object (vm_object) containing the page.
Virtual memory faults (3) • Is the missing page (object/offset) in a memory frame somewhere, but just missing from the page table? • Index page cache (object/offset hash table) to find out. (The page could be resident if the segment or backing object is shared, and the page is resident in memory on behalf of some other process.) • If not, then find a free frame of memory to hold the missing page. • Is the missing page in an object on backing storage? Figure out where: index the inode block map. Fetch page into the frame. • Or: is it the first reference to a page in a zero-fill object (e.g., stack or heap)? Then fill the frame with zeros. • So far so good? Install a translation in the page table entry (pte) mapping the faulted virtual page to its frame. • Adjust PC to restart faulted instruction, and return to user mode.
Recap: timers, interrupts, faults, etc. • When a processor core is running a user program, the user program/thread controls (“drives”) the core. • The hardware has a timer device that interrupts the core after a given interval of time. • Interrupt transfers control back to the OS kernel, which may switch the core to another thread, or resume. • Other events also return control to the kernel. • Wild pointers • Divide by zero • Other program actions • Page faults
Recap: OS protection Know how a classical OS uses the hardware to protect itself and implement a limited direct execution model for untrusted user code. • Virtual addressing. Applications run in sandboxes that prevent them from calling procedures in the kernel or accessing kernel data directly (unless the kernel chooses to allow it). • Events. The OS kernel installs handlers for various machine events when it boots (starts up). These events include machine exceptions (faults), which may be caused by errant code, interrupts from the clock or external devices (e.g., network packet arrives), and deliberate kernel calls (traps) caused by programs requesting service from the kernel through its API. • Designated handlers. All of these machine events make safe control transfers into the kernel handler for the named event. In fact, once the system is booted, these events are the only ways to ever enter the kernel, i.e., to run code in the kernel.
I hope we get to here Extra slides
Concept: isolation Butler Lampson’s definition: “I am isolated if anything that goes wrong is my fault (or my program’s fault).” Three dimensions of isolation for protected contexts (e.g., processes): • Fault isolation. One app or app instance (process) can fail independently of others. If it runs amok, the OS can kill it and reclaim its memory, etc. • Performance isolation. The OS manages resources (“metal and glass”: computing power, memory, disk space, I/O bandwidth, etc.). Each instance needs the “right amount” of resources to run properly. The OS prevents apps from impacting the performance of other apps. • Security. An app may contain malware that tries to corrupt the system, steal data, or otherwise compromise the integrity of the system. The OS uses protected contexts and a reference monitor to check and authorize all accesses to data or objects.
Architectural foundations • A CPU event (an interrupt or exception, i.e., a trap or fault) is an “unnatural” change in control flow. • Like a procedure call, an event changes the PC register. • Also changes mode or context (current stack), or both. • Events do not change the current space! • On boot, the kernel defines a handler routine for each event type. • The machine defines the event types. • Event handlers execute in kernel mode. • Every kernel entry results from an event. • Enter at the handler for the event. • In some sense, the whole kernel is a “big event handler.” [Figure labels: control flow; exception.cc; event handler (e.g., ISR: Interrupt Service Routine).]
Protecting Entry to the Kernel • Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, Windows, etc). • The machine defines a small set of exceptional event types. • The machine defines what conditions raise each event. • The kernel installs handlers for each event at boot time, e.g., a table in kernel memory read by the machine. • The machine transitions to kernel mode only on an exceptional event. The kernel defines the event handlers. Therefore the kernel chooses what code will execute in kernel mode, and when. [Figure: user code above the kernel; control crosses the boundary by interrupt or fault, and by trap/return.]
Example handlers • Illegal operation • Reserved opcode, divide-by-zero, illegal access • That’s a fault! Kernel generates a signal to user program, e.g., to kill it or invoke an application’s exception handler. • Page fault • Case 1: Fetch page (or zero it), map it in PTE, restart instruction • Case 2: Signal error (e.g., “segmentation fault”) • Interrupts • I/O completion, e.g., disk read complete: resume a program • Arriving network packet, etc.: kick the network stack • Clock ticks (timer interrupt): maybe do a context switch • Power fail etc.: save state
Caches in Linux
The columns are cache name, active objects, total number of objects, object size, number of full or partial pages, total allocated pages, and pages per slab.
slabinfo - version: 1.1
cache name         active   total   size  pages  alloc  pg/slab
kmem_cache             59      78    100      2      2      1
ip_fib_hash            10     113     32      1      1      1
ip_conntrack            0       0    384      0      0      1
urb_priv                0       0     64      0      0      1
clip_arp_cache          0       0    128      0      0      1
ip_mrt_cache            0       0     96      0      0      1
tcp_tw_bucket           0      30    128      0      1      1
tcp_bind_bucket         5     113     32      1      1      1
tcp_open_request        0       0     96      0      0      1
inet_peer_cache         0       0     64      0      0      1
ip_dst_cache           23     100    192      5      5      1
arp_cache               2      30    128      1      1      1
blkdev_requests       256     520     96      7     13      1
dnotify cache           0       0     20      0      0      1
file lock cache         2      42     92      1      1      1
fasync cache            1     202     16      1      1      1
uid_cache               4     113     32      1      1      1
skbuff_head_cache      93      96    160      4      4      1
sock                  115     126   1280     40     42      1
sigqueue                0      29    132      0      1      1
cdev_cache            156     177     64      3      3      1
bdev_cache             69     118     64      2      2      1
mnt_cache              13      40     96      1      1      1
inode_cache          5561    5580    416    619    620      1
dentry_cache         7599    7620    128    254    254      1
dquot                   0       0    128      0      0      1
filp                 1249    1280     96     32     32      1
names_cache             0       8   4096      0      8      1
buffer_head         15303   16920     96    422    423      1
mm_struct              47      72    160      2      3      1
vm_area_struct       1954    2183     64     34     37      1
fs_cache               46      59     64      1      1      1
files_cache            46      54    416      6      6      1
Page/block cache internals • This is what a software-based cache looks like. • Each frame/buffer of memory is described by a meta-object (header). • Resident pages/blocks are accessible for lookup in a global hash table. Lookup: HASH(blockID). • An ordered list of eviction candidates (the free/eviction list) winds through the hash chains. • Policy choices: which pages or blocks to keep in memory? Which to evict from memory to make room for others? How to handle writes? [Figure: hash table bucket array with bucket lists, and the free/eviction list.]