Understanding Linux Kernel - Booting, Syscalls, Interrupts & Context Switching. By – Jayant Upadhyay 2003CS50214, Pankaj K. Sharma 2003CS50219, Sohit Bansal 2003CS50224, Akshay Gaur 2003CS50209.
Overview of Booting • The process can be divided into the following six logical stages: • BIOS selects the boot device • BIOS loads the boot sector from the boot device • Boot sector loads setup, decompression routines and compressed kernel image • Kernel is uncompressed in protected mode • Low-level initialization is performed by the asm code • High-level C initialization
BIOS POST • POST – Power On Self Test • Power supply starts the clock generator and asserts #POWERGOOD signal on the bus • CPU #RESET line is asserted • POST checks are performed with interrupts disabled • IVT initialized at address zero • BIOS bootstrap function is invoked via INT 0x19. This loads track 0, sector 1 at physical address 0x7C00(0x07C0:0000)
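The notation 0x7C00 (0x07C0:0000) on this slide is a real-mode segment:offset pair. A minimal sketch of the translation, assuming the standard 8086 rule that the physical address is the segment shifted left by four bits plus the offset; the function name real_mode_phys is illustrative and not taken from the kernel source:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Real-mode address translation: physical = segment * 16 + offset.
 * The name real_mode_phys is a hypothetical helper for illustration.
 */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Addresses that appear in the boot slides, expressed as segment:offset. */
void boot_address_examples(void)
{
    assert(real_mode_phys(0x07C0, 0x0000) == 0x7C00);   /* BIOS load address */
    assert(real_mode_phys(0x9000, 0x0000) == 0x90000);  /* relocated boot sector */
    assert(real_mode_phys(0x9020, 0x0000) == 0x90200);  /* setup sectors */
}
```

Several segment:offset pairs can name the same byte; 0x0000:0x7C00 and 0x07C0:0x0000 both resolve to 0x7C00.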
Boot-sector & Setup • The boot sector used to boot the Linux kernel can be either: • Linux boot-sector (arch/i386/boot/bootsect.S) • LILO (or another bootloader’s) boot-sector
Linux Boot-sector • bootsect.S • First moves the boot-sector code from 0x7C00 to 0x90000 • Then jumps to the newly made copy of the boot sector, i.e. in segment 0x9000 (physical address 0x90000) • Prepares the stack at $INITSEG:0x4000-0xC • This is where the limit on setup size comes from • Setup sectors are loaded immediately after the boot sector, i.e. at physical address 0x90200, using BIOS service INT 0x13
• If loading fails for some reason, the error code is dumped and the load is retried in an endless loop • If loading setup_sects sectors of setup code succeeds, we jump to label ok_load_setup • The kernel image is then loaded at 0x10000. This is done to preserve the firmware data in low memory (0-64K) • After the kernel is loaded we jump to $SETUPSEG:0 (arch/i386/boot/setup.S)
setup.S • Once the firmware data is no longer needed (e.g. no more calls to BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000. • setup.S then sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c} • This sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it.
How to load a big kernel? • The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls BIOS to move data from low to high memory. • This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the boot sector can use the lcall instruction to reach it (inter-segment call). • This routine uses BIOS service int 0x15 (ax=0x8700) to move data to high memory and resets %es to always point to 0x10000. This ensures that the code in bootsect.S doesn't run out of low memory when copying data from disk.
Using LILO as bootloader • There are several advantages in using a specialised bootloader (LILO) over a bare-bones Linux boot sector: • Ability to choose between multiple Linux kernels or even multiple OSes. • Ability to pass kernel command line parameters • Ability to load much larger bzImage kernels: up to 2.5M vs 1M. • Old versions of LILO (v17 and earlier) could not load bzImage kernels. The newer versions (as of a couple of years ago or earlier) use the same technique as bootsect+setup of moving data from low into high memory by means of BIOS services.
High Level Initialization • By "high-level initialisation" we consider anything which is not directly related to bootstrap, even though parts of the code to perform this are written in asm, namely arch/i386/kernel/head.S which is the head of the uncompressed kernel. The following steps are performed: • Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18). • Initialise page tables. • Enable paging by setting PG bit in %cr0. • Zero-clean BSS (on SMP, only first CPU does this). • Copy the first 2k of bootup parameters (kernel commandline). • Check CPU type using EFLAGS and, if possible, cpuid, able to detect 386 and higher. • The first CPU calls start_kernel(), all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.
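The value 0x18 loaded into the data segment registers is a segment selector, not an address. A small sketch of how a selector splits into its fields, assuming the standard IA-32 layout (bits 15..3 index, bit 2 table indicator, bits 1..0 requested privilege level). The struct and function names are hypothetical, and the value 0x23 for the user code selector is an assumption about 2.4-era kernels that does not come from the slides:

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an IA-32 segment selector (illustrative struct, not kernel code). */
struct selector_fields {
    unsigned index;  /* descriptor index within the GDT or LDT */
    unsigned ti;     /* table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* requested privilege level, 0..3 */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = sel >> 3;
    f.ti = (sel >> 2) & 1u;
    f.rpl = sel & 3u;
    return f;
}

void selector_examples(void)
{
    /* __KERNEL_DS = 0x18 from the slide: GDT entry 3, ring 0. */
    assert(decode_selector(0x18).index == 3);
    assert(decode_selector(0x18).rpl == 0);
    /* Assumed user code selector 0x23: GDT entry 4, ring 3. */
    assert(decode_selector(0x23).index == 4);
    assert(decode_selector(0x23).rpl == 3);
}
```

The low two bits are what distinguish kernel selectors (ring 0) from user selectors (ring 3).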
The init/main.c:start_kernel() is written in C and does the following: • Perform arch-specific setup (memory layout analysis, copying boot command line again, etc.). • Print Linux kernel "banner" containing the version, compiler used to build it etc. to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by cat /proc/version. • Initialise traps, irqs, data required for scheduler. • Parse boot commandline options & initialise console. • If module support was compiled into the kernel, initialise dynamic module loading facility.
• If "profile=" command line was supplied, initialise profiling buffers. • kmem_cache_init(), initialise most of slab allocator. • Enable interrupts. • Calculate BogoMips value for this CPU. • Call mem_init() which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line. • kmem_cache_sizes_init(), finish slab allocator initialisation. • Initialise data structures used by procfs. • fork_init(), create uid_cache, initialise max_threads based on the amount of memory available and configure RLIMIT_NPROC for init_task to be max_threads/2.
• Create various slab caches needed for VFS, VM, buffer cache, etc. • If System V IPC support is compiled in, initialise the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of shmfs filesystem. • If quota support is compiled into the kernel, create and initialise a special slab cache for it. • Perform arch-specific "check for bugs" and, whenever possible, activate workaround for processor/bus/etc bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly. • Set a flag to indicate that a schedule should be invoked at "next opportunity" and create a kernel thread init() which execs execute_command if supplied via the "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panic with a "suggestion" to use the "init=" parameter. • Go into the idle loop; this is an idle thread with pid=0.
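The fallback order used by the init() kernel thread can be sketched in user space. This is a simulation under stated assumptions, not the kernel code: pick_init and the exec_fn callback are hypothetical names, and the callback stands in for the exec attempt (returning 0 on success). A NULL result corresponds to the point where the kernel would panic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an exec attempt; returns 0 on success. */
typedef int (*exec_fn)(const char *path);

/*
 * Try the "init=" command first (if given), then the fixed fallbacks in the
 * order listed on the slide. Returns the path that succeeded, or NULL where
 * the real kernel would panic.
 */
const char *pick_init(const char *execute_command, exec_fn try_exec)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
    };
    size_t i;

    assert(try_exec != NULL);
    if (execute_command != NULL && try_exec(execute_command) == 0)
        return execute_command;
    for (i = 0; i < sizeof fallbacks / sizeof fallbacks[0]; i++)
        if (try_exec(fallbacks[i]) == 0)
            return fallbacks[i];
    return NULL;
}

/* Test stubs: a system where only /bin/sh exists, and one with nothing. */
int only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

int nothing_exists(const char *path)
{
    (void)path;
    return -1;
}
```

With only_bin_sh as the callback, the first three fallbacks fail and the shell is chosen, which matches the "last resort" role of /bin/sh in the list.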
Interrupts and Exceptions • Hardware support for getting the CPU's attention • Often transfers from user to kernel mode • Nested interrupts are possible; an interrupt can occur while an interrupt handler is already executing (in kernel mode) • Asynchronous: device or timer generated • Unrelated to currently executing process • Synchronous: immediate result of last instruction • Often represents a hardware error condition • Intel terminology and hardware • Irqs, vectors, IDT, gates, PIC, APIC • Interrupt handling: data structures, flow of control • Handlers: softirqs, tasklets, bottom halves
Basic Ideas • Similar to context switch (but lighter weight) • Hardware saves a small amount of context on stack • Includes interrupted instruction if restart needed • Execution resumes with special “iret” instruction • Structure: top and bottom halves • Top-half: do minimum work and return • Bottom-half: deferred processing • Handler code executed in response • Possible to temporarily mask interrupts • Handlers need not be reentrant • But other interrupts can occur, causing nesting
Interrupts vs Exceptions • Varying terminology but for Intel: • Interrupts (asynchronous, device generated) • Maskable: device-generated, associated with IRQs (interrupt request lines); may be temporarily disabled (still pending) • Nonmaskable: some critical hardware failures • Exceptions (synchronous) • Processor-detected • Faults – correctable (restartable); e.g. page fault • Traps – no reexecution needed; e.g. breakpoint • Aborts – severe error; process usually terminated (by signal) • Programmed exceptions (software interrupts) • int (system call), int3 (breakpoint) • into (overflow), bound (address check)
Vectors, IDT • Vector: index (0-255) into descriptor table (IDT) • Special register: idtr points to table (use lidt to load) • IDT: table of “gate descriptors” • Segment selector + offset for handler • Descriptor Privilege Level (DPL) • Gates (slightly different ways of entering kernel) • Task gate: includes TSS to transfer to (not used by Linux) • Interrupt gate: disables further interrupts • Trap gate: further interrupts still allowed • Vector assignments • Exceptions, NMI are fixed • Maskable interrupts can be assigned as needed
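To make the gate descriptor fields concrete, here is a sketch of how a 32-bit IA-32 gate could be packed. The layout (handler offset split into two 16-bit halves, a segment selector, and a byte holding the present bit, DPL and gate type) follows the Intel architecture manuals; the struct, enum and function names are hypothetical, and the selector 0x10 and handler address in the examples are assumed values for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Gate type codes for 32-bit gates (Intel encoding). */
enum gate_type { GATE_TASK = 0x5, GATE_INTERRUPT = 0xE, GATE_TRAP = 0xF };

/* One 8-byte IDT entry; field names are illustrative. */
struct gate_desc {
    uint16_t offset_low;   /* handler address, bits 15..0 */
    uint16_t selector;     /* code segment selector for the handler */
    uint8_t  reserved;     /* must be zero */
    uint8_t  type_attr;    /* P (bit 7) | DPL (bits 6..5) | 0 | type (bits 3..0) */
    uint16_t offset_high;  /* handler address, bits 31..16 */
};

struct gate_desc make_gate(uint32_t handler, uint16_t selector,
                           enum gate_type type, unsigned dpl)
{
    struct gate_desc g;

    assert(dpl <= 3);
    g.offset_low = (uint16_t)(handler & 0xFFFFu);
    g.selector = selector;
    g.reserved = 0;
    g.type_attr = (uint8_t)(0x80u | (dpl << 5) | (unsigned)type);
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

A trap gate with DPL 3 yields the attribute byte 0xEF, while an interrupt gate with DPL 0 yields 0x8E. The two types differ only in the low bit of the type field, which is what decides whether the CPU clears the interrupt flag on entry.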
PIC • Programmable Interrupt Controller (PIC) • chip between devices and cpu • Fixed number of wires in from devices • IRQs: Interrupt ReQuest lines • Single wire to CPU + some registers • PIC translates IRQ to vector • Raises interrupt to CPU • Vector available in register • Waits for ack from CPU • Other interrupts may be pending • Possible to “mask” interrupts at PIC or CPU • Early systems cascaded two 8 input chips (8259A)
Interrupt Handling Components • [Figure: block diagram linking IRQs 0-15, the PIC (with mask), the INTR line to the CPU, the vector, idtr, the IDT (entries 0-255) in memory, and the handler.]
IO-APIC, LAPIC • Advanced PIC for SMP systems • Used in all modern systems • Interrupts “routed” to CPU over system bus • IPI: inter-processor interrupt • Local APIC versus “frontend” IO-APIC • Devices connect to front-end IO-APIC • IO-APIC communicates (over bus) with Local APIC • Interrupt routing • Allows broadcast or selective routing of interrupts • Need to distribute interrupt handling load • Routes to the CPU running the lowest-priority process • Special register: Task Priority Register (TPR) • Arbitrates (round-robin) if equal priority
Intel Exceptions • Architecture (processor) dependent • Intel has about 20 (out of 32 possible) • Most exceptions send signal to current process • Default action often just kills process • Page fault is the one exception; very complex handler • Some examples: • 0 SIGFPE Divide by zero error • 3 SIGTRAP Breakpoint • 6 SIGILL Invalid op-code • 11 SIGBUS Segment not present • 12 SIGBUS Stack segment fault • 13 SIGSEGV General protection fault (DPL violation) • 14 SIGSEGV Page fault
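The vector-to-signal examples on this slide amount to a small lookup. A sketch covering only the seven vectors listed above; exception_signal is a hypothetical name, and returning 0 for vectors the slide does not list is a choice made here, not kernel behaviour:

```c
#include <assert.h>
#include <signal.h>

/*
 * Map an exception vector to the signal listed on the slide.
 * Returns 0 for vectors this sketch does not cover.
 */
int exception_signal(int vector)
{
    assert(vector >= 0 && vector < 32);
    switch (vector) {
    case 0:  return SIGFPE;   /* divide error */
    case 3:  return SIGTRAP;  /* breakpoint */
    case 6:  return SIGILL;   /* invalid opcode */
    case 11: return SIGBUS;   /* segment not present */
    case 12: return SIGBUS;   /* stack fault */
    case 13: return SIGSEGV;  /* general protection fault */
    case 14: return SIGSEGV;  /* page fault */
    default: return 0;
    }
}
```

For the page fault the signal is only the last resort: the handler first tries to resolve the fault (demand paging, copy-on-write) and raises SIGSEGV only when the access is genuinely invalid.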
Hardware Handling • On entry: • Which vector? • Get corresponding descriptor in IDT • Find specified descriptor in GDT (for handler) • Check privilege levels (CPL, DPL) • If entering kernel mode, set kernel stack • Save eflags, cs, (original) eip on stack • -> Jump to appropriate handler • Assembly code prepares C stack, calls handler • On return (i.e. iret): • Restore registers from stack • If returning to user mode, restore user stack • Clear segment registers (if privileged selectors)
Nested Execution • Interrupts can be interrupted • By different interrupts; handlers need not be reentrant • No notion of priority in Linux • Small portions execute with interrupts disabled • Interrupts remain pending until acked by CPU • Exceptions can be interrupted • By interrupts (devices needing service) • Exceptions can nest two levels deep • Exceptions indicate coding error • Exception code (kernel code) shouldn’t have bugs • Page fault is possible (trying to touch user data)
IDT Initialization • Initialized once by BIOS in real mode • Linux re-initializes during kernel init • Must not expose kernel to user mode access • Start by zeroing all descriptors • Linux lingo: • Interrupt gate (same as Intel; no user access) • Not accessible from user mode • System gate (Intel trap gate; user access) • Used for int, int3, into, bound • Trap gate (same as Intel; no user access) • Used for exceptions
Exception Handling • Some exceptions push error code on stack • IDT points to small individual handlers (assembly) • handler_name: pushl $0 /* placeholder if no error code */ ; pushl $do_handler_name ; jmp error_code • Common code sets up for C call • Pops handler address from stack, calls • All handlers check if kernel mode • Exceptions caused by touching bad syscall params • Return to userland with error code • Other exceptions -> die() // kernel Oops • Most handlers just generate signal for current • current->tss.error_code = error_code; • current->tss.trap_no = vector; • force_sig(sig_number, current);
Interrupt Handling • More complex than exceptions • Requires registry, deferred processing, etc. • Some issues: • IRQs are often shared; all handlers (ISRs) are executed so they must query device • IRQs are dynamically allocated to reduce contention • Example: floppy allocates when accessed • Three types of actions: • Critical: Top-half (interrupts disabled – briefly!) • Example: acknowledge interrupt • Non-critical: Top-half (interrupts enabled) • Example: read key scan code, add to buffer • Non-critical deferrable: Do it “later” (interrupts enabled) • Example: copy keyboard buffer to terminal handler process • Softirqs, tasklets, bottom halves (deprecated)
IRQ, Vector Assignment • PCI bus usually assigns IRQs at boot • Vectors usually IRQ# + 32 • Below 32 reserved for non-maskable interrupts and exceptions • Vector 128 used for syscall • Vectors 251-255 used for IPI • Some IRQs are fixed by architecture • IRQ0: interval timer • IRQ2: cascade pin for 8259A • See /proc/interrupts for assignments
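The vector ranges on this slide can be written down as a small classifier. This is a sketch using only the numbers stated above (IRQ# + 32, vector 128 for syscalls, 251-255 for IPIs); the enum and function names are hypothetical, and irq_to_vector is restricted to the 16 legacy PIC lines as a simplifying assumption.

```c
#include <assert.h>

enum vector_class { VEC_EXCEPTION, VEC_SYSCALL, VEC_IPI, VEC_DEVICE };

/* Usual mapping for the 16 legacy PIC lines: vector = IRQ + 32. */
int irq_to_vector(int irq)
{
    assert(irq >= 0 && irq < 16);
    return irq + 32;
}

/* Classify a vector according to the ranges given on the slide. */
enum vector_class classify_vector(int vector)
{
    assert(vector >= 0 && vector <= 255);
    if (vector < 32)
        return VEC_EXCEPTION;   /* exceptions and NMI */
    if (vector == 128)
        return VEC_SYSCALL;     /* int $0x80 */
    if (vector >= 251)
        return VEC_IPI;         /* inter-processor interrupts */
    return VEC_DEVICE;          /* maskable device interrupts */
}
```

For example, the interval timer on IRQ0 arrives at vector 32 and is classified as a device interrupt.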
IRQ Data Structures • irq_desc: array of IRQ descriptors • status (flags), lock, depth (for nested disables) • handler: PIC device driver! • action: linked list of irqaction structs (containing ISRs) • irqaction: ISR info • handler: actual ISR! • flags: • SA_INTERRUPT: interrupts disabled if set • SA_SHIRQ: sharing allowed • SA_SAMPLE_RANDOM: input for /dev/random entropy pool • name: for /proc/interrupts • dev_id, next • irq_stat: per-cpu counters (for /proc/interrupts)
Interrupt Processing • BUILD_IRQ macro generates: • IRQn_interrupt: • pushl $n-256 // negative to distinguish from syscall numbers • jmp common_interrupt • Common code: • common_interrupt: • SAVE_ALL // save a few more registers than hardware • call do_IRQ • jmp ret_from_intr • do_IRQ() is C code that handles all interrupts
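The n-256 trick can be checked with a few lines of arithmetic. The sketch below assumes the pushed value ends up in the same stack slot that holds the syscall number on a system call entry, so a negative value marks an interrupt, and that the IRQ number is recovered by masking the low byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Value pushed by the IRQn_interrupt stub: always negative for n in 0..255. */
int32_t irq_stack_marker(int irq)
{
    assert(irq >= 0 && irq < 256);
    return (int32_t)(irq - 256);
}

/* Recover the IRQ number from the marker by keeping the low 8 bits. */
int marker_to_irq(int32_t marker)
{
    return (int)((uint32_t)marker & 0xFFu);
}

/* Syscall numbers are non-negative, so the sign tells the two cases apart. */
int is_syscall_number(int32_t slot_value)
{
    return slot_value >= 0;
}
```

For IRQ 5 the stub pushes -251 (0xFFFFFF05 as a 32-bit value), and masking with 0xFF gives 5 back.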
Low-level IRQ Processing • do_IRQ(): • get vector, index into irq_desc for appropriate struct • grab per-vector spinlock, ack (to PIC) and mask line • set flags (IRQ_PENDING) • really process IRQ? (may be disabled, etc.) • call handle_IRQ_event() • some logic for handling lost IRQs on SMP systems • handle_IRQ_event(): • enable interrupts if needed (SA_INTERRUPT clear) • execute all ISRs for this vector: • action->handler(irq, action->dev_id, regs);
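The "execute all ISRs for this vector" step is a walk over a linked list. Here is a user-space simulation of that walk with a simplified irqaction; it is a sketch, not the kernel structures. The real handler also receives a regs argument, and locking, flags and acknowledgement are omitted. All names ending in _sketch, plus count_hit and demo_shared_irq, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS_SKETCH 16

/* Simplified irqaction: one registered ISR on a (possibly shared) IRQ line. */
struct irqaction_sketch {
    void (*handler)(int irq, void *dev_id);
    void *dev_id;
    struct irqaction_sketch *next;
};

static struct irqaction_sketch *irq_actions[NR_IRQS_SKETCH];

/* Append an action to the tail of the list for this IRQ. */
void register_action_sketch(int irq, struct irqaction_sketch *act)
{
    struct irqaction_sketch **p;

    assert(irq >= 0 && irq < NR_IRQS_SKETCH);
    act->next = NULL;
    for (p = &irq_actions[irq]; *p != NULL; p = &(*p)->next)
        ;
    *p = act;
}

/* Call every ISR registered on the line; each must check its own device. */
int handle_irq_event_sketch(int irq)
{
    struct irqaction_sketch *a;
    int called = 0;

    assert(irq >= 0 && irq < NR_IRQS_SKETCH);
    for (a = irq_actions[irq]; a != NULL; a = a->next) {
        a->handler(irq, a->dev_id);
        called++;
    }
    return called;
}

/* Example ISR: counts how often it ran, using dev_id as its private data. */
void count_hit(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
}

/* Two devices share IRQ 5; one interrupt runs both handlers once. */
int demo_shared_irq(void)
{
    int disk_hits = 0, nic_hits = 0;
    struct irqaction_sketch disk = { count_hit, &disk_hits, NULL };
    struct irqaction_sketch nic = { count_hit, &nic_hits, NULL };

    irq_actions[5] = NULL;
    register_action_sketch(5, &disk);
    register_action_sketch(5, &nic);
    handle_irq_event_sketch(5);
    irq_actions[5] = NULL;  /* the actions live on this stack frame */
    return disk_hits + nic_hits;
}
```

The dev_id pointer is what lets each handler on a shared line find its own device state.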
Deferrable Functions • Bottom-halves (deprecated): • Old static array of function pointers that are marked for execution (can be masked temporarily) • Executed on kernel to user transition • Executed serially (globally) on SMP system • Mostly for networking code: • Tasklets: Different tasklets can execute concurrently • Softirqs: The same softirq can execute concurrently • Layered implementation: • Bottom-halves implemented using tasklets • Tasklets implemented using softirqs • When executed? (pretty frequently) • When last (nested) interrupt handler terminates • When a network packet is received • When idle: per-cpu ksoftirqd kernel thread • Lots of detail in book; a bit complex …
Return Code Path • Interleaved assembly entry points: • ret_from_exception() • ret_from_intr() • ret_from_sys_call() • ret_from_fork() • See flowchart in text (Fig 4-5 page 158) • Things that happen: • Run scheduler if necessary • Return to user mode if no nested handlers • Restore context, user-stack, switch mode • Re-enable interrupts if necessary • Deliver pending signals • (Some DOS emulation stuff – VM86 Mode)
System Calls • Interface between user-level processes and hardware devices. • CPU, memory, disks etc. • Make programming easier: • Let kernel take care of hardware-specific issues. • Increase system security: • Let kernel check requested service via syscall. • Provide portability: • Maintain interface but change functional implementation.
Mode, Space, Context • Mode: hardware restricted execution state • restricted access, privileged instructions • user mode vs. kernel mode • “dual-mode architecture”, “protected mode” • Intel supports 4 protection “rings”: 0 kernel, 1 unused, 2 unused, 3 user • Space: kernel (system) vs. user (process) address space • requires MMU support (virtual memory) • “userland”: any process address space; there are many user address spaces • reality: kernel is often mapped into user process space • Context: kernel activity on “behalf” of ??? • process: on behalf of current process • system: unrelated to current process (maybe no process!) • example “interrupt context” • blocking not allowed!
POSIX APIs • API = Application Programmer Interface. • Function definition specifying how to obtain a service. • By contrast, a system call is an explicit request to the kernel made via a software interrupt. • Standard C library (libc) contains wrapper routines that make system calls. • e.g., malloc, free are libc routines that use the brk system call. • POSIX-compliant = having a standard set of APIs. • Non-UNIX systems can be POSIX-compliant if they offer the required set of APIs.
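The difference between a libc wrapper and the underlying system call can be seen from user space. A minimal sketch, assuming a Linux system with glibc: syscall() and the SYS_getpid constant are real glibc interfaces, while getpid_via_syscall is a name made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Issue the getpid system call by number through the generic syscall()
 * entry point instead of the dedicated getpid() wrapper.
 */
pid_t getpid_via_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both paths end in the same kernel service routine, so getpid_via_syscall() and getpid() return the same value; the wrapper only adds a typed, portable function signature on top.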
Interrupts and Exceptions • Interrupts - async device to cpu communication • example: service request, completion notification • aside: IPI – interprocessor interrupt (another cpu!) • system may be interrupted in either kernel or user mode • interrupts are logically unrelated to current processing • Exceptions - sync hardware error notification • example: divide-by-zero (ALU), illegal address (MMU) • exceptions are caused by current processing • Software interrupts (traps) • synchronous “simulated” interrupt • allows controlled “entry” into the kernel from userland
Linux System Calls • Invoked by executing int $0x80. • Programmed exception vector number 128. • CPU switches to kernel mode & executes a kernel function. • Calling process passes syscall number identifying system call in eax register (on Intel processors). • Syscall handler responsible for: • Saving registers on kernel mode stack. • Invoking syscall service routine. • Exiting by calling ret_from_sys_call().
Linux System Calls • System call dispatch table: • Associates syscall number with corresponding service routine. • Stored in sys_call_table array having up to NR_syscalls entries (usually 256 maximum). • nth entry contains service routine address of syscall n.
Kernel Entry and Exit • [Figure: diagram of the ways into the kernel: library code through the system call interface (trap 80h), exceptions (error traps), page faults, device interrupts (device dialog), IPIs (inter-processor interrupts), and boot; inside the kernel: system call table, trap / interrupt table, scheduler, devices.]
Initializing System Calls • trap_init() called during kernel initialization sets up the IDT (interrupt descriptor table) entry corresponding to vector 128: • set_system_gate(0x80, &system_call); • A system gate descriptor is placed in the IDT, identifying address of system_call routine. • Does not disable maskable interrupts. • Sets the descriptor privilege level (DPL) to 3: • Allows User Mode processes to invoke exception handlers (i.e. syscall routines).
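Why the DPL of the gate has to be 3 follows from the check the CPU makes on a software interrupt: an int n instruction is allowed only if the current privilege level is numerically less than or equal to the gate's DPL, otherwise a general protection fault is raised. A sketch of that rule; the function name is hypothetical, and hardware-generated interrupts, which skip this check, are not modelled.

```c
#include <assert.h>

/*
 * Privilege check for a software interrupt (int n):
 * allowed when CPL <= gate DPL, with 0 the most privileged ring.
 * Returns 1 if the gate may be entered, 0 if the CPU would fault.
 */
int soft_int_allowed(unsigned cpl, unsigned gate_dpl)
{
    assert(cpl <= 3 && gate_dpl <= 3);
    return cpl <= gate_dpl;
}
```

With this rule, user code (CPL 3) can enter the DPL 3 system gate at vector 0x80 but cannot issue int for a DPL 0 interrupt gate, which keeps user programs from faking device interrupts.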
The system_call() Function • Saves syscall number & CPU registers used by exception handler on the stack, except those automatically saved by control unit. • Checks for valid system call. • Invokes specific service routine associated with syscall number (contained in eax): • call *sys_call_table(0, %eax, 4) • Return code of system call is stored in eax.
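The validity check and the indexed call through sys_call_table can be mimicked in plain C with an array of function pointers. This is a user-space sketch, not kernel code: the table size, the three-argument signature and the sys_add_sketch routine are assumptions made for the example. Returning -ENOSYS for an out-of-range or empty slot mirrors the usual kernel convention for unimplemented calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_SYSCALLS_SKETCH 8

typedef long (*syscall_fn_sketch)(long a, long b, long c);

/* Hypothetical service routine used only for this example. */
static long sys_add_sketch(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Entry n holds the service routine for syscall n; unset entries are NULL. */
static const syscall_fn_sketch sys_call_table_sketch[NR_SYSCALLS_SKETCH] = {
    [1] = sys_add_sketch,
};

/* Validate the number, then call through the table. */
long dispatch_syscall_sketch(unsigned long nr, long a, long b, long c)
{
    if (nr >= NR_SYSCALLS_SKETCH || sys_call_table_sketch[nr] == NULL)
        return -ENOSYS;
    return sys_call_table_sketch[nr](a, b, c);
}
```

The scale factor 4 in the assembly form is the size of one table entry on a 32-bit system; the C array indexing above performs the same multiplication implicitly.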
Parameter Passing • As with the syscall number, user-space must relay the parameters to the kernel during the exception trap • The parameters are stored in registers: on x86, the registers ebx, ecx, edx, esi, and edi contain, in order, the first five arguments. • In the unlikely case of six or more arguments, a single register is used to hold a pointer to user-space where all the parameters reside • The return value is sent to user-space via a register, eax on x86
Writing a system call for Linux • Define its purpose, i.e., exactly one purpose • Decide arguments, return value, and error codes • Design the interface with forward compatibility in mind • Return appropriate error codes • Verifying the parameters: before following a user-supplied pointer, the kernel must check that • The pointer points to a region of memory in user-space • The pointer points to a region of memory in the process’s address space • If reading, the memory is marked readable; if writing, the memory is marked writable
• copy_to_user(usr_dst, krnl_src, len); • copy_from_user(krnl_dst, usr_src, len); • Example:
asmlinkage long sys_scopy(unsigned long *src, unsigned long *dst, unsigned long len)
{
    unsigned long buf;

    /* fail if the kernel wordsize and user wordsize do not match */
    if (len != sizeof(buf))
        return -EINVAL;

    /* both helpers return the number of bytes that could NOT be copied */
    if (copy_from_user(&buf, src, len))
        return -EFAULT;
    if (copy_to_user(dst, &buf, len))
        return -EFAULT;

    return len; /* return amount of data copied */
}
System Call Context • A system call runs in process context, so the kernel may sleep (e.g., block on a call or call schedule()); this lets system calls use the majority of the kernel’s functionality and simplifies kernel programming • In process context, the kernel is preemptible: system calls must be reentrant (the current task may be preempted by another task that may then execute the same system call).
Blocking System Calls • system calls may block “in the kernel” • “slow” system calls may block indefinitely • reads, writes of pipes, terminals, net devices • some ipc calls, pause, some opens and ioctls • disk io is NOT slow (it will eventually complete) • blocking slow calls may be “interrupted” by a signal • returns EINTR • problem: slow calls must be wrapped in a loop • BSD introduced “automatic restart” of slow interrupted calls • POSIX didn’t specify semantics • Linux • no automatic restart by default • specify restart when setting signal handler (SA_RESTART)
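The "wrapped in a loop" pattern for slow calls looks like this in user space. A minimal sketch using the POSIX read(), pipe() and errno interfaces; read_retry and demo_pipe_read are hypothetical helper names, and the pipe demo only shows that the wrapper behaves like read() when no signal arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Retry read() for as long as it fails only because a signal interrupted it. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Write five bytes into a pipe and read them back through the wrapper. */
int demo_pipe_read(void)
{
    int fds[2];
    char buf[8];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "hello", 5) != 5) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    n = read_retry(fds[0], buf, sizeof buf);
    close(fds[0]);
    close(fds[1]);
    return (int)n;
}
```

The alternative named on the slide is to install the signal handler with SA_RESTART in sigaction(), which asks the kernel to restart the interrupted call instead of returning EINTR; the loop remains useful for handlers installed without that flag.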
Linux Files Relating to Syscalls • Main files: • arch/i386/kernel/entry.S • System call and low-level fault handling routines. • include/asm-i386/unistd.h • System call numbers and macros. • kernel/sys.c • System call service routines.
arch/i386/kernel/entry.S • Add system calls by appending entry to sys_call_table: .long SYMBOL_NAME(sys_my_system_call)
include/asm-i386/unistd.h • Each system call needs a number in the system call table: • e.g., #define __NR_write 4 • #define __NR_my_system_call nnn, where nnn is next free entry in system call table.