LKCD – Linux Kernel Crash Dump
Harish K
Motorola Inc.
What is LKCD? Why LKCD?
The Journey
• Introduction
• LKCD – Process
• Design Considerations
• Kernel Implementation
• User Level Analysis – (Lcrash)
Introduction
LKCD is a set of kernel and application code to configure, implement, and analyze system crash dumps.
Objectives:
• Post-failure kernel analysis
• Kernel problems are resolved more quickly
• As the Linux kernel becomes more complex, the need for LKCD increases
LKCD – Kernel Design Considerations
The biggest design considerations were:
• Dump Save Mechanism
• Raw I/O vs. Buffer Cache I/O
• Kernel Code Location
• Dump Storage
LKCD – Kernel Design Considerations
1. Dump Save Mechanism
• PROM Save Method: crash, reset the system, and have the hardware's PROM save the memory image to disk.
• Kernel Save Method: crash, save the memory image to disk, and then reset the system.
LKCD – Kernel Design Considerations
1. Dump Save Mechanism
The kernel save method was chosen because:
• PROM/BIOS code is too architecture-specific
• A reset or power-off may clear memory
• Kernel disk driver restrictions
• Kernel code can be modified easily; PROM code is difficult to change
LKCD – Kernel Design Considerations
2. Raw I/O vs. Buffer Cache I/O
• Buffer cache locking cannot be worked around for dumping without a major performance hit on basic I/O
• Raw I/O was not fully supported in the Linux kernel
• IDE, RAID, and other drivers need raw I/O hooks (the current plan is to create a driver layer above them to avoid the locking that would otherwise be necessary)
LKCD – Kernel Design Considerations
3. Kernel Code Location
• Code changes are separated into generic and architecture-specific files:
  • kernel/vmdump.c
  • arch/<arch>/kernel/vmdump.c
• Additional modifications were made to include/linux/sysctl.h, kernel/sysctl.c, and the kernel crash hook functions
LKCD – Kernel Design Considerations
4. Dump Storage
• Memory dumps are saved to swap space
• Swapping during boot-up is an issue
• Disk partition tables are held in memory; could this cause a data corruption problem?
• Cannot assume the filesystem layer will be available during a crash
LKCD - Kernel Implementation
Dump Process Activation
• Kernel hooks for executing the dump process:
  • The kernel directly calls panic()
  • A kernel exception occurs due to a system fault and calls die_if_kernel()
• In both instances dump_execute() is called, which in turn calls the architecture-specific __dump_execute() to save the dump to disk
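The activation flow above can be modelled as a small call chain. This is only a toy sketch: the Python functions borrow the kernel function names from the slide but are hypothetical stand-ins, and the `trace` list and the returned string are assumptions added for illustration.

```python
# Toy model of the two crash entry points converging on one dump routine.
# These are NOT the kernel implementations; they only mirror the call order.

def arch_dump_execute(trace, reason):
    # Stand-in for the architecture-specific __dump_execute().
    trace.append("__dump_execute")
    return "dump saved to disk: " + reason

def dump_execute(trace, reason):
    # Generic entry point shared by both crash paths.
    trace.append("dump_execute")
    return arch_dump_execute(trace, reason)

def panic(trace, message):
    # Path 1: the kernel calls panic() directly.
    trace.append("panic")
    return dump_execute(trace, message)

def die_if_kernel(trace, fault):
    # Path 2: a kernel exception caused by a system fault.
    trace.append("die_if_kernel")
    return dump_execute(trace, fault)

for entry, reason in ((panic, "explicit panic"), (die_if_kernel, "kernel fault")):
    trace = []
    entry(trace, reason)
    print(" -> ".join(trace))
```

Both paths end in the same two calls, which is why the architecture-specific code only needs to be written once per platform.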
LKCD - Kernel Implementation
Storing Crash Dumps
• Dump Header
• Dump Page Headers
• Dump Pages
LKCD - Kernel Implementation
Storing Crash Dumps
• The first 64K of the crash dump contains the dump header, which shows the system state at the time of the kernel failure
• Memory pages are written next, each with a page header containing:
  • virtual address of the page in memory
  • size of the page (important if compressed)
  • page flags (compressed, raw, dump end)
• A page header with a special end marker is written last, and the dump process completes
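The layout described above can be sketched as a toy writer and reader. Only the 64K header region and the three page-header fields come from the slide; the field widths, the flag values, and the use of zlib as the compressor are assumptions made for illustration and do not describe the real LKCD on-disk format.

```python
import io
import struct
import zlib

HEADER_SIZE = 64 * 1024  # from the slide: the dump header fills the first 64K

# Assumed page-header layout: 64-bit address, 32-bit size, 32-bit flags.
PAGE_HEADER = struct.Struct("<QII")

# Assumed flag encoding; the slide names the flags but not their values.
FLAG_RAW, FLAG_COMPRESSED, FLAG_END = 0x1, 0x2, 0x4

def write_dump(out, header_bytes, pages):
    """Write the header, each (address, data, flags) page, then the end marker."""
    if len(header_bytes) > HEADER_SIZE:
        raise ValueError("dump header does not fit in 64K")
    out.write(header_bytes.ljust(HEADER_SIZE, b"\0"))
    for address, data, flags in pages:
        # The size field is what lets a reader skip a compressed page.
        out.write(PAGE_HEADER.pack(address, len(data), flags))
        out.write(data)
    out.write(PAGE_HEADER.pack(0, 0, FLAG_END))

def read_pages(src):
    """Yield (address, flags, data) until the end-marker page header."""
    src.seek(HEADER_SIZE)
    while True:
        raw = src.read(PAGE_HEADER.size)
        if len(raw) < PAGE_HEADER.size:
            raise ValueError("dump truncated: end marker not found")
        address, size, flags = PAGE_HEADER.unpack(raw)
        if flags & FLAG_END:
            return
        yield address, flags, src.read(size)

buf = io.BytesIO()
write_dump(buf, b"system state", [
    (0xC0000000, b"A" * 4096, FLAG_RAW),
    (0xC0001000, zlib.compress(b"B" * 4096), FLAG_COMPRESSED),
])
for address, flags, data in read_pages(buf):
    print(hex(address), flags, len(data))
```

Because compressed pages vary in length, the reader cannot assume a fixed stride; it has to trust the size field in each page header, which is why the slide calls that field important.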
Kernel Dump Tunables
• The kernel dump tunables are listed in /etc/sysconfig/vmdump, which configures the behavior of LKCD
• The tunables are:
  • DUMP_ACTIVE
  • DUMPDEV
  • DUMPDIR
  • DUMP_LEVEL
  • DUMP_COMPRESS_PAGES
  • PANIC_TIMEOUT
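As a rough illustration of how such a file could be read, the sketch below parses NAME=value lines for the six tunables named above. The NAME=value syntax and every value in `SAMPLE` are assumptions chosen for the example; they are not taken from an LKCD installation.

```python
TUNABLES = (
    "DUMP_ACTIVE", "DUMPDEV", "DUMPDIR",
    "DUMP_LEVEL", "DUMP_COMPRESS_PAGES", "PANIC_TIMEOUT",
)

def parse_tunables(text):
    """Return a dict of the known tunables found in NAME=value lines."""
    settings = {}
    for line in text.splitlines():
        # Drop comments; this simple split assumes values never contain '#'.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        name = name.strip()
        if name in TUNABLES:
            settings[name] = value.strip().strip('"')
    return settings

# Hypothetical file contents, for illustration only.
SAMPLE = """\
# crash dump settings
DUMP_ACTIVE=1
DUMPDEV=/dev/vmdump
DUMPDIR="/var/log/dump"
DUMP_LEVEL=8
DUMP_COMPRESS_PAGES=1
PANIC_TIMEOUT=5
"""

settings = parse_tunables(SAMPLE)
missing = [name for name in TUNABLES if name not in settings]
print(settings)
print("missing:", missing)
```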
User Level Analysis - LCrash
lcrash is a utility that generates detailed kernel information about crash dumps. It contains many features for displaying information about the events leading up to a system crash in a clear, easy-to-read manner.
It operates in two modes:
• Crash Dump Report Generation
• Interactive Crash Dump Analysis
User Level Analysis - LCrash
Crash Dump Report Generation: This report contains selected pieces of information from the kernel considered most useful when trying to identify the cause of a crash.
The LCRASH report includes the following information:
• General system information
• Type of crash
• Dump of system log_buf
• CPU summary
• Kernel stack trace leading up to the system PANIC
• Disassembly of instructions before and after the instructions that caused the crash
User Level Analysis - LCrash
LCRASH Interactive Commands
• For a more detailed examination of the elements of a crash
• Kernel data displayed in a clear, easy-to-read manner
• Invoked via an ASCII command line user interface featuring command line editing and command history
• Command output can be piped to utilities such as more and grep
User Level Analysis - LCrash
LCRASH Interactive Commands example:
• stat: Displays pertinent system information and the contents of the log_buf array.
• vtop: Displays virtual to physical address mappings for both kernel and application virtual addresses.
• symbol: Maps kernel symbols to virtual addresses.
User Level Analysis - LCrash
LCRASH Interactive Commands example:
• dump: Dumps the contents of system memory in a variety of bases (hexadecimal, decimal, or octal) and data sizes (byte, short, int, or long).
• task: Displays relevant information for selected tasks or all tasks running at the time of the crash.
• trace: Displays a kernel stack backtrace for selected tasks, or for all tasks running on the system.
• dis: Disassembles one or more machine instructions.
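To give a feel for what a virtual-to-physical mapping looks like, the sketch below handles only the simplest case: a kernel address in the direct-mapped region of a 32-bit x86 system. It assumes the default 3G/1G split (PAGE_OFFSET of 0xC0000000) and 4 KB pages, and it is not how lcrash implements its address translation, which also has to cover application addresses by walking page tables. The function names are hypothetical.

```python
PAGE_OFFSET = 0xC0000000  # assumed: default i386 kernel/user split
PAGE_SIZE = 4096          # assumed: 4 KB pages

def kernel_virt_to_phys(vaddr):
    """Translate a direct-mapped kernel virtual address to a physical one."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("not a direct-mapped kernel address")
    return vaddr - PAGE_OFFSET

def split_phys(paddr):
    """Return (page frame number, offset within the page)."""
    return paddr // PAGE_SIZE, paddr % PAGE_SIZE

vaddr = 0xC0332C60  # address of log_buf in the example output
paddr = kernel_virt_to_phys(vaddr)
pfn, offset = split_phys(paddr)
print(hex(paddr), hex(pfn), hex(offset))  # 0x332c60 0x332 0xc60
```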
lcrash Example Output
>> stat | head
sysname : Linux
nodename : crashme.atmyhouse.com
release : 2.4.8
version : #9 SMP Mon Dec 10 00:05:19 PST 2001
machine : i686
domainname : (none)
LOG_BUF:
>> dump log_buf 10
0xc0332c60: 4c3e343c 78756e69 72657620 6e6f6973 : <4>Linux version
0xc0332c70: 342e3220 2820382e 746f6f72 74617740 : 2.4.8 (root@cra
0xc0332c80: 79657265 70612e65 : shme.atm
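The text column in the dump output can be reproduced from the hexadecimal words. The sketch below decodes the first row, assuming each word is a 32-bit value printed on a little-endian machine (consistent with the i686 system shown); `words_to_ascii` is a hypothetical helper name.

```python
def words_to_ascii(hex_words):
    """Decode 32-bit little-endian words into printable ASCII text."""
    raw = b"".join(int(word, 16).to_bytes(4, "little") for word in hex_words)
    # Replace non-printable bytes with '.', as hex dump tools commonly do.
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

row = "4c3e343c 78756e69 72657620 6e6f6973".split()
print(words_to_ascii(row))  # <4>Linux version
```

Each word reads backwards relative to the text because the least significant byte is stored first: 0x4c3e343c holds the bytes 3c 34 3e 4c, which spell "<4>L".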
lcrash Example Output
>> task
ADDR        UID   PID  PPID  STATE  FLAGS  CPU  NAME
======================================================================
0xc02e4000    0     0     0      0      0    -  swapper
0xdfffc000    0     1     0      0  0x100    -  init
0xdfff2000    0     2     1      1   0x40    -  keventd
0xdffee000    0     3     0      0   0x40    -  ksoftirqd_CPU0
[ . . . ]
0xde47a000    0   867     1      1  0x100    -  mingetty
0xda0fe000    0  1017   660      0  0x140    -  sshd
0xd9c06000    0  1018  1017      1  0x100    -  bash
0xde4b4000    0  1101  1018      0  0x100    0  insmod
======================================================================
31 active task structs found
lcrash Example Output
>> t 0xda0fe000
=========================================================
STACK TRACE FOR TASK: 0xda0fe000(sshd)
0 schedule+1040 [0xc0111250]
1 schedule_timeout+121 [0xc0110d89]
2 do_select+506 [0xc014251a]
3 sys_select+820 [0xc01428c4]
4 system_call+44 [0xc0106ed4]
=========================================================
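Each frame in the trace has the form symbol+offset [address]. A small sketch of how such frames might be post-processed follows; it assumes the offset is printed in decimal and recovers the start address of each function by subtracting it from the frame address. The regular expression and the function name are hypothetical, written against the sample output only.

```python
import re

# frame number, symbol, decimal offset (assumed), hexadecimal address
FRAME = re.compile(r"(\d+)\s+(\w+)\+(\d+)\s+\[(0x[0-9a-fA-F]+)\]")

def parse_frames(text):
    """Return (frame, symbol, address, function_start) for each trace line."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if not match:
            continue  # skip separators and the header line
        number, symbol, offset, address = match.groups()
        address = int(address, 16)
        frames.append((int(number), symbol, address, address - int(offset)))
    return frames

TRACE = """\
STACK TRACE FOR TASK: 0xda0fe000(sshd)
0 schedule+1040 [0xc0111250]
1 schedule_timeout+121 [0xc0110d89]
2 do_select+506 [0xc014251a]
3 sys_select+820 [0xc01428c4]
4 system_call+44 [0xc0106ed4]
"""

for number, symbol, address, start in parse_frames(TRACE):
    print(number, symbol, hex(address), "starts at", hex(start))
```

Reading the frames from the bottom up gives the call order: the task entered the kernel through system_call, went through sys_select and do_select, and was sleeping in schedule when the dump was taken.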
Reference: http://lkcd.sourceforge.net
Contact: harish@motorola.com