Introduction to Linux Kernel Internals for OpenVMS Experts • Keith Parris, HP
History • Unix heritage • Richard M. Stallman and GNU project • Berkeley Software Distribution (BSD) • Andrew Tanenbaum and Minix • Linus Torvalds and Linux • 1991 Unix-like kernel for Intel 386
Philosophy • Open Source • Source can be: • Studied • Modified • Distributed without restrictions (except those of the GPL itself) • Do-it-yourself attitude • Rivalry with Microsoft Windows • Simplicity as Elegance
History • August 1991: Version 0.01 • October 1991: Version 0.02 • November 1993: First Slackware distribution, with kernel 0.99 • March 1994: Version 1.0 • June 1995: Ported to Alpha architecture • January 1999: Version 2.2 • January 2001: Version 2.4
Release Numbering • Major.Minor-Step format, e.g. 2.4-7 • Major: • 0 to 1: First stable 386 release • 1 to 2: SMP support • Minor: • Odd-numbered “dot” releases for new functionality, e.g. 0.9, 2.3, 2.5 • Even-numbered releases for stability, e.g. 1.0, 2.2, 2.4 • Step: Indicates roll-up of patches
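The odd/even rule above can be sketched in a few lines of Python. This is only an illustration of the convention as stated on the slide; the function name `classify_release` is a hypothetical helper, not part of any kernel tooling.

```python
def classify_release(version: str) -> str:
    """Classify a version string such as '2.4-7' or '2.5' by its minor number."""
    # Treat the '-' before the step number like a '.', then split into fields.
    fields = version.replace("-", ".").split(".")
    minor = int(fields[1])
    # Odd minor numbers mark new-functionality releases, even ones stability.
    return "development" if minor % 2 else "stable"


if __name__ == "__main__":
    for v in ("0.9", "1.0", "2.3", "2.4-7", "2.5"):
        print(v, classify_release(v))
```

For example, `classify_release("2.4-7")` yields `"stable"` and `classify_release("2.5")` yields `"development"`.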
Some Major Linux Distributions • Red Hat – leader in overall market share • SuSE – leader in the European market, making in-roads in the USA • Debian – favorite of Open Source purists • Slackware – first commercial Linux distribution; favorite of macho do-it-yourselfers • Mandrake – easy to install for novices; available at Wal-Mart • TurboLinux – family of distributions, including server, workstation, and cluster
Source Language • Linux is written in GNU C (gcc) • GNU C is a dialect of ANSI C (not K&R) • ANSI C has better type checking • So Linux is portable to any machine to which gcc has been ported
CPU Platform Support • 32-bit: • Intel x86, Crusoe, MIPS, ARM, SPARC, PowerPC, Motorola 68000 • 64-bit: • Alpha, MIPS, Itanium, PA-RISC, SPARC • Original version was on Intel 80386, and many artifacts remain from that legacy
Byte Ordering • Both types of platforms supported by Linux: • Little-Endian: IA32, Alpha, Itanium • Big-Endian: PA-RISC, PowerPC, SPARC64
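To make the two byte orders concrete, here is a small Python sketch that lays out the same 32-bit value both ways. The value `0x12345678` and the helper name `hex_bytes` are chosen purely for illustration.

```python
import struct
import sys


def hex_bytes(data: bytes) -> str:
    """Render bytes in memory order as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


VALUE = 0x12345678
little = struct.pack("<I", VALUE)  # least significant byte first
big = struct.pack(">I", VALUE)     # most significant byte first

if __name__ == "__main__":
    print("little-endian:", hex_bytes(little))
    print("big-endian:   ", hex_bytes(big))
    print("this host is: ", sys.byteorder)
```

The little-endian layout is `78 56 34 12`, while the big-endian layout is `12 34 56 78`.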
Windowing Software • XFree86 • KDE – K Desktop Environment • GNOME – GNU Network Object Model Environment
Kernel Component Interaction (diagram) • Task • Signals • System Calls • Processes & Scheduler • File Systems • Network protocols • Virtual Memory • Block device drivers • Char device drivers • Network device drivers • Traps & Faults • Physical Memory • Interrupts • CPU • System Memory • Terminal • Disk • Network Interface
Modes • User and Kernel Modes only • No Executive Mode • No Supervisor Mode • So no protection between different portions of the kernel
Linux Process Model • Basic execution unit on Linux is called a “task”, and is a thread of execution with an associated address space • Multiple tasks can share the same address space (for multi-threading) in a task group
Task data • Task data structure is about 1 KB in size • This is the most complex data structure in the Linux kernel • task_struct is placed at the end of the kernel stack area, so kernel stack pointer can be used to locate it efficiently • But if kernel stack overflows, it clobbers task structure
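The stack-pointer trick can be illustrated with plain address arithmetic. This sketch assumes an 8 KB, 8 KB-aligned combined area with the task structure at its low-address end (the stack grows down toward it), which is the arrangement used on IA32 in the 2.4 era. The addresses are made up and the function name is hypothetical.

```python
STACK_AREA_SIZE = 8192  # assumption: 8 KB area shared by task structure and kernel stack


def task_struct_address(stack_pointer: int) -> int:
    """Round the kernel stack pointer down to the start of its 8 KB area."""
    return stack_pointer & ~(STACK_AREA_SIZE - 1)


if __name__ == "__main__":
    sp = 0xC12E5F40  # made-up kernel stack pointer value
    print(hex(task_struct_address(sp)))
```

Every stack pointer value inside the same area masks to the same base address, which is why no separate pointer to the task structure has to be stored. It also shows the hazard noted above: a stack that grows too far down runs straight into the structure.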
Task Creation • init task is Process ID (PID) #1 • fork() creates duplicate task • except Child is given different PID and stack • Address space is same (but with copy-on-write) • Child is independent of Parent • Typically exec() is then called in Child to load and run an executable image • clone() is used to create more threads • Address space, file descriptors, signal handlers can be shared • PID is the same for all threads in the same process
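The fork-then-exec pattern can be seen from user space with Python's thin wrappers over the same system calls. This is a minimal sketch for Unix-like systems only; the helper name `spawn_and_wait` and the choice to exec the Python interpreter itself are assumptions made for the demonstration.

```python
import os
import sys


def spawn_and_wait(exit_code: int) -> int:
    """fork() a child, exec() a new image in it, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: same address space contents (copy-on-write), different PID.
        try:
            # Replace the child's image with a fresh interpreter that exits at once.
            os.execv(sys.executable,
                     [sys.executable, "-c", f"import sys; sys.exit({exit_code})"])
        finally:
            # Reached only if exec failed; never fall back into the parent's code.
            os._exit(127)
    # Parent: wait for the child to finish and decode its status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print("child exited with", spawn_and_wait(7))
```

VMS readers may note the contrast with $CREPRC: process creation and image activation are two separate steps here.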
Symmetrical Multi-Processing • SMP support was introduced in 2.0 • But scalability was limited to about 4 CPUs • Versions 2.4 and 2.5 scale better • All CPUs can process interrupts and execute kernel functions • Kernel can be built with or without MP support (for efficiency on Uni-Processor machines)
SMP Synchronization • 2 types of spinlocks: • Adaptive: if lock holder is running, spin-wait; otherwise, block (sleep) • Spin: spin-wait until lock becomes free
Spinlocks • Conflicting priorities in SMP design: • Safety is easier to ensure with few spinlocks • Linux originally had just one, the Big Kernel Lock (BKL) • Performance is better with lots of individual spinlocks for more parallelism • Hierarchy is needed to avoid deadlocks with multiple spinlocks (Linux has only a few hierarchies) • Keeping per-CPU data separate avoids some spinlocks • Linux kernel presently uses > 100 different locks
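Why a hierarchy avoids deadlock can be shown with a user-space sketch. Python mutexes stand in for kernel spinlocks here, and the `RankedLock` class, its rank numbers, and the helper names are all hypothetical. The point is only that every thread takes locks in ascending rank, whatever order it names them in.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class RankedLock:
    rank: int  # position in the (hypothetical) lock hierarchy
    lock: threading.Lock = field(default_factory=threading.Lock)


def with_locks(locks, action):
    """Acquire locks in ascending rank, run action, release in reverse order."""
    ordered = sorted(locks, key=lambda item: item.rank)
    for item in ordered:
        item.lock.acquire()
    try:
        return action()
    finally:
        for item in reversed(ordered):
            item.lock.release()


def run_demo(iterations: int = 1000) -> int:
    """Two threads request the same two locks in opposite orders."""
    a, b = RankedLock(1), RankedLock(2)
    counter = {"value": 0}

    def bump():
        counter["value"] += 1  # protected: both locks are held here

    def worker(first, second):
        for _ in range(iterations):
            with_locks([first, second], bump)

    threads = [threading.Thread(target=worker, args=(a, b)),
               threading.Thread(target=worker, args=(b, a))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(run_demo())
```

Without the sort in `with_locks`, the two workers could each hold one lock while waiting forever for the other.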
Time keeping • 10 millisecond clock ticks on IA32 • 1024 per second on Alpha • Timers are available for processes (similar to VMS TQEs) • syscall gettimeofday() analogous to $GETTIM system service • except down to microsecond resolution instead of 10-millisecond
Interrupts • Interrupts are divided into two classes: • “top-half” (hardware) and • “bottom-half” (software)
Interrupts • Interrupt modes: • Critical: all interrupts masked, & uses kernel stack of current task, for lowest interrupt latency • Noncritical: only interrupt of same IRQ is masked; higher IRQs might pre-empt • Deferred: uses software interrupt to defer low-priority work
Interrupts • Interrupts can be directed to different CPUs in an SMP system via IRQ affinity • Kernel can be profiled, either by timer IRQ or (using gcc options) by function call
Virtual Memory • Linux kernel itself is not pageable • Despite terminology like “swap partition”, Linux actually does only paging, not swapping • Page replacement is LRU (for v2.4) • A linked list of all pages in memory is kept, with most-recently used pages at the front and least-recently used pages at the end
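The list discipline described above can be sketched as a strict LRU list. This is an idealized model, not the kernel's code: the real implementation is more involved, and the class name `LruPageList` and its methods are hypothetical.

```python
from collections import OrderedDict


class LruPageList:
    """Pages ordered from most recently used (front) to least recently used (end)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Record an access; return the evicted page, or None if nothing was evicted."""
        evicted = None
        if page not in self.pages:
            if len(self.pages) >= self.capacity:
                # Memory is full: reclaim the page at the end of the list.
                evicted, _ = self.pages.popitem(last=True)
            self.pages[page] = True
        # Move the accessed page to the front of the list.
        self.pages.move_to_end(page, last=False)
        return evicted

    def order(self):
        return list(self.pages)


if __name__ == "__main__":
    lru = LruPageList(capacity=3)
    for p in (1, 2, 3, 1, 4):
        print("touch", p, "evicted", lru.touch(p), "order", lru.order())
```

In this run, touching page 4 with memory full evicts page 2, the page that has gone longest without an access.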
Virtual Memory • Linux has a 3-tier virtual address translation table model • IA32 has only 2 tiers, so 2nd level is mapped 1-to-1 • Linux page size is typically 4 KB • Virtual address format for IA32: • Page Directory Entry (PGD): 10 bits • Page Table Entry (PTE): 10 bits • Byte within page: 12 bits (4 KB pages)
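The 10/10/12 split for IA32 amounts to simple shifts and masks. The following sketch decomposes a 32-bit virtual address into the three fields listed above; the function name and the sample address are chosen only for illustration.

```python
def split_ia32_address(vaddr: int):
    """Split a 32-bit virtual address into (PGD index, PTE index, byte offset)."""
    pgd_index = (vaddr >> 22) & 0x3FF  # top 10 bits
    pte_index = (vaddr >> 12) & 0x3FF  # middle 10 bits
    offset = vaddr & 0xFFF             # low 12 bits: byte within the 4 KB page
    return pgd_index, pte_index, offset


if __name__ == "__main__":
    pgd, pte, off = split_ia32_address(0xC0101234)
    print(f"PGD index {pgd}, PTE index {pte}, offset {off:#x}")
```

For the sample address `0xC0101234` the fields are PGD index 768, PTE index 257, and offset `0x234`.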
Virtual Memory • Memory reclamation algorithms must also factor in the presence of: • Page cache: File system data presently in memory • Buffer cache: File system meta-data • Swap cache: Pages being written to swap space
Virtual Memory • The swap daemon kswapd tries to free memory when it is short, in this order: • Tries to free “clean” pages (page cache, buffer cache) • Shrinks the dentry cache • Shrinks the inode cache • Tries to page shared memory out • Tries to free “dirty” pages
Virtual Memory • Linux keeps Accessed and Dirty bits for pages in memory • VMS forgoes an Accessed bit and uses the Free Page List and Modified Page List as temporary caches, to rescue frequently-accessed pages before they are freed or written to disk
Kernel • Linux kernel is not pre-emptible (yet) • Linux kernel is monolithic • i.e., it is not a micro-kernel based OS • although most kernel components (such as drivers) can be built as Dynamically Loadable Kernel Modules (DLKMs), which are loaded on demand • DLKMs can be upgraded incrementally, so it is theoretically possible to improve the kernel without rebooting
Scheduler • Process priorities (nice values) can be in the range of -20 to +19, with -20 being the “highest” priority • Idle process is PID #0, and is scheduled when nothing else is runnable • Requires context switch; no equivalent of VMS loop in scheduler, or code to clear demand-zero pages in advance of need
Scheduler • Compute-bound tasks tend to be given lower priority than I/O bound tasks • Real-time priorities are static, ranging from 1 to 99. Two scheduling policies available: • FIFO: Threads run to completion in order • Round-robin: Thread runs for time slice, then next thread runs, and so forth
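The difference between the two real-time policies shows up clearly in a toy simulation. This sketch assumes all threads share the same real-time priority and never block; the task names, tick counts, and function names are hypothetical.

```python
from collections import deque


def fifo(tasks):
    """FIFO: each task runs to completion before the next one starts."""
    return [name for name, ticks in tasks for _ in range(ticks)]


def round_robin(tasks, time_slice: int):
    """Round-robin: each task runs for at most one time slice, then requeues."""
    queue = deque(tasks)
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        timeline.extend([name] * run)
        if remaining > run:
            queue.append((name, remaining - run))  # back of the queue
    return timeline


if __name__ == "__main__":
    tasks = [("A", 3), ("B", 2)]
    print("FIFO:       ", " ".join(fifo(tasks)))
    print("Round-robin:", " ".join(round_robin(tasks, time_slice=1)))
```

With these inputs FIFO produces `A A A B B`, while round-robin with a one-tick slice produces `A B A B A`.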
syscalls and Signals • syscalls are used by a program to request something of the operating system • Signals are used by the operating system to inform a task of errors or asynchronous events
syscalls • Each syscall has a unique number • There are presently about 200 • Parameters are passed in registers • Implication from 80386: limit of 6 parameters
Signals • Each signal has a default action associated with it: • Exit – forces the process to exit • Core – forces the process to exit and create a core file • Stop – stops the process (which then awaits a signal to continue) • Ignore – ignores the signal; no action taken • A process can define a signal handler for most signals, to override the default action • Behavior of System V and BSD differed when a task received a signal while performing a syscall, so Linux provides a choice
Signals • Examples: • SIGINT signal is sent on a keyboard interrupt (i.e. Control-C) • Default action: Terminate the process • SIGSEGV signal is sent when a segmentation violation (an attempt to access memory that one has no right to access) occurs • Default action: Write core dump, then terminate the process
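Overriding a default action can be demonstrated from user space. In this sketch the process installs its own SIGINT handler and then sends itself the signal, so it survives what would otherwise terminate it. The function names are hypothetical, and the code assumes it runs in the main thread of a Unix-like system.

```python
import os
import signal
import time


def catch_sigint_once():
    """Install a SIGINT handler, signal ourselves, and report what was caught."""
    received = []

    def on_sigint(signum, frame):
        # Runs instead of the default action (terminate the process).
        received.append(signal.Signals(signum).name)

    previous = signal.signal(signal.SIGINT, on_sigint)
    try:
        os.kill(os.getpid(), signal.SIGINT)  # same signal Control-C would send
        # Give the interpreter a moment to run the handler.
        for _ in range(1000):
            if received:
                break
            time.sleep(0.001)
    finally:
        signal.signal(signal.SIGINT, previous)  # restore the earlier disposition
    return received


if __name__ == "__main__":
    print("caught:", catch_sigint_once())
```

The same approach works for most other signals, though a few cannot be caught at all.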
File Systems • Linux Virtual File System (VFS) layer allows different file systems underneath • 32-bit interface causes some limitations • Using a different file system may require a kernel re-compile or even patches
File Systems • ext2 – standard out-of-the-box, traditional Unix file system • ReiserFS – first journaled file system publicly available for Linux • JFS from IBM– Journaled File System from OS/2 and AIX • XFS from SGI – journaled file system taken from Irix • ext3 – journaled file system compatible with ext2
File Systems with Remote Mirroring • enbd – Enhanced Network Block Device can be used with Software RAID to mirror over TCP/IP to a disk on a remote system • drbd – Distributed Replicated Block Device integrates network driver with RAID 1 and preserves write ordering as needed by journaled file systems
Why Journaling File Systems? • ext2: • Meta-data writes are asynchronous, with no journaling • must run fsck upon reboot • damage may be unrepairable • Slow linear search of unordered directory entries • 32-bit meta-data design • Limits file system and file sizes • JFS, reiserfs, ext3, etc.: • A journal is used to log all changes to file system meta-data • much faster restart times • improved reliability for file system (but not data) • B-tree data structures for faster lookup & access • 64-bit meta-data designs • But still some limits due to 32-bit VFS interface
Storage Management • LVM – Logical Volume Manager • RAID
Logical Volume Manager • Adds layer between block I/O interface in the kernel and the physical disks • Volume Groups can consist of multiple Physical Volumes • Data can be spread across disks • Space can be added or removed dynamically • Data can be migrated between physical disks
Logical Volume Manager • “Snapshots” can be taken of data at a point in time and accessed as a read-only copy • But in practice, file system metadata may be inconsistent unless file system is unmounted or quiesced before the snapshot is taken
Software RAID • Supports RAID levels 0, 1, 4, 5, 10 • “Linear mode” (basically disk concatenation) is another option
Software RAID Levels • RAID 0 – Disk Striping • RAID 1 – Mirroring (shadowing) of 2 or more disks, with optional spare disk(s) • RAID 4 – Striping across 2 or more disks, with Parity all on one disk • RAID 5 – Striping with Parity distributed across sets of 3 or more disks, with optional spare disk(s) • RAID 10 – RAID 1 array of two or more RAID 0 arrays (mirrored stripesets)
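The parity used by RAID 4 and RAID 5 is a bytewise XOR across the data blocks of a stripe, which is what lets a single missing block be rebuilt. The sketch below works on in-memory byte strings rather than disks, and the block contents and function name are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe, plus the parity block computed from them.
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: XOR the survivors with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])

if __name__ == "__main__":
    print("parity: ", parity.hex())
    print("rebuilt:", rebuilt)
```

Here `rebuilt` equals the lost block `b"EFGH"`. RAID 4 keeps every parity block on one disk, whereas RAID 5 rotates the parity across the members.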
RAID 1 Recovery • After a crash or power loss, a utility needs to be run, preferably at boot time: • ckraid --fix • By default, chooses first working member as master copy and copies it to the other(s) • Then run fsck on the file system(s)
RAID 4 or 5 Recovery • After a crash or power loss, a utility needs to be run manually: • ckraid • To determine what changes need to be made, then: • ckraid --fix --suggest-failed-disk-mask • Where the mask is a binary bit mask with one bit set, or • ckraid --fix --suggest-fix-parity • To recalculate the parity from the data disks • Then run fsck on the file system(s)
File System Implications • Prudent kernel development typically requires two separate Linux systems: • One to keep source code on, and do compiles • One to test new kernel code (in case code corrupts pages in buffer cache and makes file system unrepairable)
Future Development Directions • Lots of projects underway: • (but no guarantee they will all reach mainstream Linux): • Pre-emptible kernel • Hot plugging for USB and PCI • Finer privilege granularity than ‘user or root’, using POSIX Capabilities • User-Mode Linux for safer kernel debugging • Lots of cluster projects
Speaker Contact Info Keith Parris E-mail: keith.parris@hp.com or parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/
ext2 File System Weaknesses • Meta-data writes are asynchronous • Good for performance, but… • fsck required after a crash or power loss • Can take hours to complete • Linear search of unordered directory entries is slow • 32-bit meta-data design • Limits file system and file sizes