Operating System Design - Linux
Instructor: Ching-Chi Hsu
TA: Yung-Yu Chuang
Introduction to Linux (Nov. 1991, Linus Torvalds)
• Multi-tasking
• Demand loading & Copy On Write
• Paging (not swapping)
• Shared Libraries
• POSIX 1003.1
• Protected Mode
• Support for different file systems and executable formats
Multitasking
[Figure: task timelines alternating between "require service" and "CPU idle" periods; with time-sharing, a timer interrupt switches tasks when a task requires service or its time slice expires.]
Based on i386 and Linux 2.0.33
• Topics
  • initialization
  • memory management (free space management, virtual memory management)
  • process management (context switching, scheduling)
  • system call
Resources for Tracing Linux
• http://odie.csie.ntu.edu.tw/~osd
• TLK (The Linux Kernel), KHG (The Linux Kernel Hackers' Guide), Linux Kernel Internals
• Source code browser
• Intel Programmer's manual
Source Tree for Linux
/usr/src/linux
  arch      i386 (kernel, boot, mm), ????
  drivers   char, block, net, scsi, …...
  fs        nfs, ext2, proc
  include   linux, asm-i386, asm-????
  init
  ipc
  kernel
  lib
  modules
  net
How to compile Linux Kernel
1. make config (or make menuconfig)
2. make depend
3. make boot (generate a compressed bootable Linux kernel, arch/i386/boot/zImage)
   make zdisk (generate the kernel and write it to a floppy: dd if=zImage of=/dev/fd0)
   make zlilo (generate the kernel and copy it to /vmlinuz)
lilo: Linux Loader
i386
• Segmented Addressing (segment:offset)
• Paging (Virtual Memory)
• Call Gate (Protection)
• TSS (Context Switching)
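[Figure: segment translation. The SELECTOR (INDEX, TI) picks a descriptor from the GDT (located by GDTR) or the LDT (located by LDTR); the descriptor base plus the OFFSET gives the linear address.]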
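Segment descriptor (64 bits)
  bits 63-32: BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16
  bits 31-0:  BASE 15:0 | LIMIT 15:0
[Figure: a descriptor table spans BASE to BASE+LIMIT; its 8-byte entries (at BASE, BASE+8, ...) hold segment descriptors, call gates, or TSS descriptors.]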
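[Figure: two-level paging. The linear address is split into a directory index (ddd), a table index (ttt), and an offset (ooo). CR3 locates the page directory; PDE[ddd] gives the page table address yyyyy000; PTE[ttt] gives the 4K page address zzzzz000 (plus the P bit); the physical address is zzzzz000 + ooo = zzzzzooo.]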
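The address split in the paging figure can be sketched in C. This is a minimal illustration, assuming standard i386 4K paging (10-bit directory index, 10-bit table index, 12-bit offset); the names `split_linear` and `physical_from_pte` are made up for this sketch and do not come from the kernel.

```c
#include <stdint.h>

/* The three fields of a 32-bit linear address under i386 4K paging. */
struct linear_parts {
    uint32_t dir;    /* bits 31-22: index into the page directory (ddd) */
    uint32_t table;  /* bits 21-12: index into the page table (ttt) */
    uint32_t offset; /* bits 11-0: byte offset inside the 4K page (ooo) */
};

/* Split a linear address into directory index, table index and offset. */
struct linear_parts split_linear(uint32_t addr)
{
    struct linear_parts p;
    p.dir = addr >> 22;
    p.table = (addr >> 12) & 0x3FF;
    p.offset = addr & 0xFFF;
    return p;
}

/* Combine the page frame address held in a PTE with the page offset. */
uint32_t physical_from_pte(uint32_t pte, uint32_t offset)
{
    return (pte & 0xFFFFF000u) | (offset & 0xFFF);
}
```

For example, linear address 0xC0101234 splits into directory index 768, table index 0x101, and offset 0x234.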
[Figure: the 4GB linear address space (including the OS) mapped onto physical memory and disk.]
[Figure: privilege levels 3, 2, 1, 0, with a call gate as the entry point between levels.]
Calling a TSS gate causes a context switch.
The TSS holds CS, DS, ES, …, IP, SP0, SP1, SP2, SP3, CR3, …..
[Figure: a TSS gate refers to a TSS descriptor in the GDT, which locates the TSS that the CPU loads.]
i386 Initialization
• #RESET
• real-address mode
• self-test
• EAX contains error code
• EDX contains CPU id
• CR0 layout: PG (bit 31) | RESERVED | TS | EM | MP | PE (bit 0)
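The CR0 bits named on this slide can be written as masks. The sketch below is illustrative only: the bit positions are the i386 ones, while `cpu_mode` and its return codes are hypothetical helpers invented here.

```c
#include <stdint.h>

/* CR0 bit masks on the i386. */
#define CR0_PE 0x00000001u /* bit 0: protection enable */
#define CR0_MP 0x00000002u /* bit 1: math present */
#define CR0_EM 0x00000004u /* bit 2: emulate coprocessor */
#define CR0_TS 0x00000008u /* bit 3: task switched */
#define CR0_PG 0x80000000u /* bit 31: paging */

/* Hypothetical helper: 0 = real-address mode,
 * 1 = protected mode without paging, 2 = protected mode with paging.
 * PG is only meaningful when PE is set, so PE is checked first. */
int cpu_mode(uint32_t cr0)
{
    if (!(cr0 & CR0_PE))
        return 0;
    return (cr0 & CR0_PG) ? 2 : 1;
}
```

After reset PE is clear, so `cpu_mode` reports real-address mode; setting PE and then PG corresponds to the two transitions the boot code performs later.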
Register State
  EFLAGS        0XXXX0002H
  EIP           0000FFF0H
  CS*           0F000H
  DS**          0000H
  SS            0000H
  ES**          0000H
  FS            0000H
  GS            0000H
  IDTR(base)    00000000H
  IDTR(limit)   03FFH
  DR7           0000H
* invisible part: 0FFFF0000(base) 0FFFF(limit)
** invisible part: 0(base) 0FFFF(limit)
FFFF0H : ROM-BIOS address
* do some tests
* initialize the interrupt vectors at physical address 0
* load the first sector of a bootable device to 0x7C00 (boot/bootsect.S)
* jump to 0x7C00 and run
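The addresses on this slide follow the real-address-mode rule physical = segment * 16 + offset. A minimal sketch, with `real_mode_addr` as a made-up helper name:

```c
#include <stdint.h>

/* Real-address mode translation: shift the segment left by 4 bits
 * (multiply by 16) and add the 16-bit offset. */
uint32_t real_mode_addr(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With CS = 0F000H and IP = FFF0H this gives FFFF0H, and segment 0x07C0 with offset 0 gives 0x7C00, the boot sector load address.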
Linux Kernel on Disk
/usr/src/linux/arch/i386/boot/zImage consists of:
  bootsect.S   1 sector
  Setup.S      4 sectors
  Self-extracted Kernel Image = Decompression module + Compressed Kernel Image (vmlinux.out, 455,321 bytes)
The uncompressed kernel is vmlinux (executable, 1,133,665 bytes).
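[Figure: memory map at boot. The CPU (with IP and the A20 line) and the boot disk; memory marks at 0x7C00, 0x90000, 0xA0000 (I/O & BIOS, up to 1M), and 64K.]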
bootsect.S
[Figure: the BIOS loads the 0.5K-byte boot sector to 0x7C00; bootsect.S then copies itself (0.5K bytes) to 0x90000, and IP continues in the copy.]
[Figure: Setup.S (2K bytes) is loaded at 0x90200, just above the 0.5K-byte bootsect.S copy at 0x90000; the original boot sector stays at 0x7C00. vmlinux (508K bytes) is then loaded at 0x10000, and IP moves on to Setup.S.]
SETUPSECS = 4                   ! nr of setup-sectors
BOOTSEG   = 0x07C0              ! original address of boot-sector
INITSEG   = DEF_INITSEG         ! we move boot here - out of the way (0x9000)
SETUPSEG  = DEF_SETUPSEG        ! setup starts here (0x9020)
SYSSEG    = DEF_SYSSEG          ! system loaded at 0x10000 (65536)
<omitted>
        mov     ax,#BOOTSEG
        mov     ds,ax
        mov     ax,#INITSEG
        mov     es,ax
        mov     cx,#256
        sub     si,si
        sub     di,di
        cld
        rep
        movsw
        jmpi    go,INITSEG      ! Execute moved bootsect
go:
Copy bootsect.S to 0x90000
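The copy loop above moves 256 words, i.e. the whole 512-byte boot sector. A rough C model of `rep movsw` with the direction flag cleared is sketched below; the function name and the `BOOT_SECTOR_WORDS` constant are illustrative and not part of bootsect.S.

```c
#include <stddef.h>
#include <stdint.h>

enum { BOOT_SECTOR_WORDS = 256 }; /* cx = 256 words = 512 bytes */

/* Model of "rep movsw" after "cld": copy `count` 16-bit words from
 * src (ds:si) to dst (es:di), lowest address first. */
void rep_movsw(uint16_t *dst, const uint16_t *src, size_t count)
{
    while (count--)
        *dst++ = *src++;
}
```

In the boot sector, ds:si starts at BOOTSEG:0 (0x7C00) and es:di at INITSEG:0 (0x90000); here the two buffers are ordinary arrays supplied by the caller.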
<omit>
load_setup:
        xor     dx, dx          ! drive 0, head 0
        mov     cl,#0x02        ! sector 2, track 0
        mov     bx,#0x0200      ! address = 512, in INITSEG
        mov     ah,#0x02        ! service 2, nr of sectors
        mov     al,setup_sects  ! (assume all on head 0, track 0)
                                ! setup_sects=4
        int     0x13            ! read it (BIOS routine)
        jnc     ok_load_setup   ! ok - continue
        push    ax              ! dump error code
        call    print_nl
        mov     bp, sp
        call    print_hex
        pop     ax
        jmp     load_setup
ok_load_setup:
Try to load setup.S from (drive 0, head 0, sector 2, track 0) to memory 0x90200
<omit>
! Print some inane message
        mov     ah,#0x03        ! read cursor pos
        xor     bh,bh
        int     0x10
        mov     cx,#9
        mov     bx,#0x0007      ! page 0, attribute 7 (normal)
        mov     bp,#msg1        ! .byte 13,10 .ascii "Loading"
        mov     ax,#0x1301      ! write string, move cursor
        int     0x10            ! BIOS routine
! ok, we've written the message, now
! we want to load the system (at 0x10000)
        mov     ax,#SYSSEG
        mov     es,ax           ! segment of 0x010000
        call    read_it         ! Read 508K to 0x10000 (64K), one . per track
        call    kill_motor      ! Stop floppy motor
        call    print_nl
<omit>
        jmpi    0,SETUPSEG      ! Jump to 0x90200 (setup.S)
Print "\nLoading"
setup.S
• Check memory size
• set keyboard, video adapter, get HD data
• switch to protected mode
  • set GDT
  • set IDT
  • set PE bit (flush pipe)
start:
        jmp     start_of_setup
! ------------------------ start of header --------------------------------
!
! SETUP-header, must start at CS:2 (old 0x9020:2)
!
        .ascii  "HdrS"          ! Signature for SETUP-header
        .word   0x0201          ! Version number of header format
                                ! (must be >= 0x0105
                                ! else old loadlin-1.5 will fail)
<omit>
start_of_setup:
        …………… (check signature)
good_sig:
        mov     ax,cs           ! aka #SETUPSEG
        sub     ax,#DELTA_INITSEG ! aka #INITSEG
        mov     ds,ax           ! DS=9000
loader_ok:
! Get memory size (extended mem, kB)
        mov     ah,#0x88
        int     0x15
        mov     [2],ax          ! Store memory size in 0x90002 (bootsect.S)
<omit>
(disable interrupts)
(move kernel image to 0x1000)
end_move_self:
        lidt    idt_48          ! load idt with 0,0
        lgdt    gdt_48          ! load gdt with whatever appropriate

idt_48:
        .word   0
        .word   0,0
gdt_48:
        .word   0x800
        .word   512+gdt,0x9
          BASE                               LIMIT
idt_48    0,0                                0
gdt_48    0x9, 512+gdt (= 0x90000+512+gdt)   0x800 (2048)

gdt:
        .word   0,0,0,0         ! dummy
        .word   0,0,0,0         ! unused

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9A00          ! code read/exec
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)

        .word   0xFFFF          ! 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0x0000          ! base address=0
        .word   0x9200          ! data read/write
        .word   0x00CF          ! granularity=4096, 386 (+5th nibble of limit)
Segment descriptor (64 bits)
  bits 63-32: BASE 31:24 | G | D | 0 | AVL | LIMIT 19:16 | P | DPL | S | TYPE | BASE 23:16
  bits 31-0:  BASE 15:0 | LIMIT 15:0
Entries of the setup.S gdt:
  null
  not used
  code: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=1010 (code, non-conforming, r/x, not accessed)
  data: BASE=0x00000000, LIMIT=0xFFFFF, G=1 (4G), DPL=0, type=0010 (data, expand-up, r/w, not accessed)
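To see how the descriptor fields turn into the four `.word` values of the setup.S gdt, here is a small packing sketch in C. The function name and parameter split (an access byte for P/DPL/S/TYPE and a flags nibble for G/D/0/AVL) are choices made for this illustration.

```c
#include <stdint.h>

/* Pack a segment descriptor into four 16-bit words, lowest word first.
 * access: P, DPL, S, TYPE as one byte (0x9A = present, DPL 0, code r/x).
 * flags:  G, D, 0, AVL as one nibble (0xC = 4K granularity, 32-bit). */
void encode_desc(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags,
                 uint16_t out[4])
{
    out[0] = (uint16_t)(limit & 0xFFFF);                  /* LIMIT 15:0 */
    out[1] = (uint16_t)(base & 0xFFFF);                   /* BASE 15:0 */
    out[2] = (uint16_t)(((base >> 16) & 0xFF) |           /* BASE 23:16 */
                        ((uint32_t)access << 8));         /* P, DPL, S, TYPE */
    out[3] = (uint16_t)(((limit >> 16) & 0xF) |           /* LIMIT 19:16 */
                        ((uint32_t)(flags & 0xF) << 4) |  /* G, D, 0, AVL */
                        ((base >> 24) << 8));             /* BASE 31:24 */
}
```

Packing base 0, limit 0xFFFFF, access 0x9A, flags 0xC gives 0xFFFF, 0x0000, 0x9A00, 0x00CF, matching the code entry; access 0x92 gives the data entry.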
! that was painless, now we enable A20, no wrapped
        call    empty_8042
        mov     al,#0xD1        ! command write
        out     #0x64,al
        call    empty_8042
        mov     al,#0xDF        ! A20 on
        out     #0x60,al
        call    empty_8042
<omit>
        mov     ax,#1           ! protected mode (PE) bit
        lmsw    ax              ! This is it! Load into CR0
        jmp     flush_instr     ! Flush pipe
flush_instr:
        xor     bx,bx           ! Flag to indicate a boot
! NOTE: For high loaded big kernels we need a
!       jmpi    0x100000,KERNEL_CS
!
! but we yet haven't reloaded the CS register, so the default size
! of the target offset still is 16 bit.
! However, using an operant prefix (0x66), the CPU will properly
! take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
! Manual, Mixing 16-bit and 32-bit code, page 16-6)
        db      0x66,0xea       ! prefix + jmpi-opcode
code32: dd      0x1000          ! will be set to 0x100000 for big kernels
        dw      KERNEL_CS       ! KERNEL_CS=0x10

Selector format (KERNEL_CS = 0x10 = 0000 0000 0001 0000 binary)
  bits 15-3: INDEX
  bit 2:     TI (0: GDT, 1: LDT)
  bits 1-0:  RPL
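The selector fields can be pulled apart with a few shifts. The sketch below uses illustrative names (`struct selector`, `decode_selector`) that are not kernel identifiers.

```c
#include <stdint.h>

/* Fields of a 16-bit segment selector. */
struct selector {
    unsigned index; /* bits 15-3: descriptor index in the table */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1-0: requested privilege level */
};

/* Split a selector into INDEX, TI and RPL. */
struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.index = sel >> 3;
    s.ti = (sel >> 2) & 1;
    s.rpl = sel & 3;
    return s;
}
```

KERNEL_CS = 0x10 decodes to index 2 in the GDT with RPL 0, which is the code entry that follows the dummy and unused entries.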
Decompress Kernel
startup_32:                     (gcc entry point)
        cld
        cli
        movl $(KERNEL_DS),%eax  # KERNEL_DS=0x18
        mov %ax,%ds
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs
<omit>
        lss SYMBOL_NAME(stack_start),%esp
        xorl %eax,%eax
1:      incl %eax               # check that A20 really IS enabled
        movl %eax,0x000000      # loop forever if it isn't
        cmpl %eax,0x100000
        je 1b
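The A20 test relies on address wrap-around: with A20 disabled, address bit 20 is forced to 0, so 0x100000 aliases 0x000000. The following C model is a sketch under that assumption; the two-cell memory, the helper names, and the `max_tries` bound (the real code loops forever) are all invented for illustration.

```c
#include <stdint.h>

/* With the A20 gate off, address bit 20 is forced to 0 on the bus. */
uint32_t bus_address(uint32_t addr, int a20_on)
{
    return a20_on ? addr : (addr & ~(UINT32_C(1) << 20));
}

/* Two memory cells: cell[0] models 0x000000, cell[1] models 0x100000. */
static uint32_t cell[2];

static uint32_t *locate(uint32_t addr, int a20_on)
{
    return &cell[(bus_address(addr, a20_on) >> 20) & 1];
}

/* Returns 1 once the two addresses are seen to differ (A20 enabled),
 * or 0 after max_tries attempts where they always matched. */
int a20_check(int a20_on, int max_tries)
{
    uint32_t eax = 0;                           /* xorl %eax,%eax */
    for (int i = 0; i < max_tries; i++) {
        eax++;                                  /* incl %eax */
        *locate(0x000000, a20_on) = eax;        /* movl %eax,0x000000 */
        if (*locate(0x100000, a20_on) != eax)   /* cmpl %eax,0x100000 */
            return 1;
    }
    return 0;
}
```

Incrementing the value on every pass guards against the case where 0x100000 happens to hold the same value by coincidence.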
( clear BSS )
/*
 * Do the decompression, and jump to the new kernel..
 */
        subl $16,%esp           # place for structure on the stack
        pushl %esp              # address of structure as first arg
        call SYMBOL_NAME(decompress_kernel)  # decompress kernel to 100000
        orl  %eax,%eax          # gunzip 1.0.3
        jnz  3f
        xorl %ebx,%ebx
        ljmp $(KERNEL_CS), $0x100000  # jump to decompressed kernel
head.S
Memory layout of the decompressed kernel:
  106000   stack, idt, gdt
  105000   empty_zero_page (copy parameters from 0x90000)
  104000   empty_bad_page_table
  103000   empty_bad_page
  102000   pg0
  101000   swapper_pg_dir
  100000   start of head.S (EIP)
Setup Paging Table & Enable Paging
[Figure: same memory layout as on the previous slide (PG_DIR at 101000, PG0 at 102000). CR3 points to PG_DIR; entries 0 and 768 of PG_DIR both point to PG0, which maps physical memory 0 to 4M.]
Setup GDT
[Figure: same memory layout as on the previous slide; GDTR points to the gdt.]
  selector   base       size   DPL   type
  0          NULL
  0x10       C0000000   1G     0     code
  0x18       C0000000   1G     0     data
  0x23       00000000   3G     3     code
  0x2b       00000000   3G     3     data
             0
             0
             2*NR_TASKS entries
Setup IDT
[Figure: same memory layout as on the previous slide; IDTR points to the idt, and every entry from 0 to 255 points to ignore_int.]
        call setup_paging

setup_paging:
        movl $1024*2,%ecx       /* 2 pages - swapper_pg_dir+1 page table */
        xorl %eax,%eax
        movl $ SYMBOL_NAME(swapper_pg_dir),%edi  /* swapper_pg_dir is at 0x1000 */
        cld;rep;stosl
/* Identity-map the kernel in low 4MB memory for ease of transition */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)
/* But the real place is at 0xC0000000 */
/* set present bit/user r/w */
        movl $ SYMBOL_NAME(pg0)+7,SYMBOL_NAME(swapper_pg_dir)+3072
        movl $ SYMBOL_NAME(pg0)+4092,%edi
        movl $0x03ff007,%eax    /* 4Mb - 4096 + 7 (r/w user,p) */
        std
1:      stosl                   /* fill the page backwards - more efficient :-) */
        subl $0x1000,%eax
        jge 1b
        cld
        movl $ SYMBOL_NAME(swapper_pg_dir),%eax
        movl %eax,%cr3          /* cr3 - page directory start */
        movl %cr0,%eax
        orl $0x80000000,%eax
        movl %eax,%cr0          /* set paging (PG) bit */
        ret                     /* this also flushes the prefetch-queue */

Format of PDE & PTE
  bits 31-12: Page Address
  bit 6:      D
  bit 5:      A
  bit 2:      U/S
  bit 1:      R/W
  bit 0:      P
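The table fill performed by setup_paging can be modelled in C. This is only a sketch of the arithmetic: the arrays stand in for pg0 and swapper_pg_dir, and the names `fill_pg0`, `link_pg0`, `ENTRIES`, and `PAGE_FLAGS` are assumptions made for this example.

```c
#include <stdint.h>

#define PAGE_SIZE  4096u
#define ENTRIES    1024u
#define PAGE_FLAGS 7u    /* present | r/w | user */

/* Identity-map the first 4M: entry i maps physical page i.
 * Filled from the last entry backwards, as the assembly does. */
void fill_pg0(uint32_t pg0[ENTRIES])
{
    uint32_t entry = 0x003FF007u;           /* 4Mb - 4096 + 7 */
    for (int i = (int)ENTRIES - 1; i >= 0; i--) {
        pg0[i] = entry;
        entry -= PAGE_SIZE;
    }
}

/* Point directory slot 0 (linear 0) and slot 768 (byte offset 3072,
 * linear 0xC0000000) at the same page table. */
void link_pg0(uint32_t pg_dir[ENTRIES], uint32_t pg0_phys)
{
    pg_dir[0] = pg0_phys + PAGE_FLAGS;
    pg_dir[3072 / 4] = pg0_phys + PAGE_FLAGS;
}
```

Because both directory slots share pg0, the first 4M of physical memory is visible at linear 0 and at linear 0xC0000000 once the PG bit is set.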
        lgdt gdt_descr

gdt_descr:
        .word (8+2*NR_TASKS)*8-1
        .long 0xc0000000+SYMBOL_NAME(gdt)

ENTRY(gdt)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0xc0c39a000000ffff        /* 0x10 kernel 1GB code at 0xC0000000 */
        .quad 0xc0c392000000ffff        /* 0x18 kernel 1GB data at 0xC0000000 */
        .quad 0x00cbfa000000ffff        /* 0x23 user   3GB code at 0x00000000 */
        .quad 0x00cbf2000000ffff        /* 0x2b user   3GB data at 0x00000000 */
        .quad 0x0000000000000000        /* not used */
        .quad 0x0000000000000000        /* not used */
        .fill 2*NR_TASKS,8,0            /* space for LDT's and TSS's etc */
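The `.quad` values above can be checked by unpacking them. The decoder below is a sketch; `struct seg_desc`, `decode_desc`, and `desc_size` are names chosen for this illustration, not kernel code.

```c
#include <stdint.h>

/* Fields of an i386 segment descriptor, unpacked from its 64-bit form. */
struct seg_desc {
    uint32_t base;     /* BASE 31:0 */
    uint32_t limit;    /* LIMIT 19:0, in bytes or 4K units depending on G */
    unsigned type;     /* TYPE, 4 bits */
    unsigned dpl;      /* DPL, 2 bits */
    unsigned present;  /* P */
    unsigned granular; /* G: 1 = limit counts 4K units */
};

struct seg_desc decode_desc(uint64_t raw)
{
    uint32_t lo = (uint32_t)raw;
    uint32_t hi = (uint32_t)(raw >> 32);
    struct seg_desc d;
    d.base = (lo >> 16) | ((hi & 0xFF) << 16) | (hi & 0xFF000000u);
    d.limit = (lo & 0xFFFF) | (hi & 0x000F0000u);
    d.type = (hi >> 8) & 0xF;
    d.dpl = (hi >> 13) & 3;
    d.present = (hi >> 15) & 1;
    d.granular = (hi >> 23) & 1;
    return d;
}

/* Segment size in bytes: (limit + 1) units of 1 byte or 4K. */
uint64_t desc_size(struct seg_desc d)
{
    uint64_t units = (uint64_t)d.limit + 1;
    return d.granular ? units << 12 : units;
}
```

Decoding 0xc0c39a000000ffff yields base 0xC0000000, DPL 0 and a 1GB size; 0x00cbfa000000ffff yields base 0, DPL 3 and a 3GB size, in line with the comments in the table.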
(setup data segments and clear BSS)
        call setup_idt

setup_idt:
        lea ignore_int,%edx
        movl $(KERNEL_CS << 16),%eax
        movw %dx,%ax            /* selector = 0x0010 = cs */
        movw $0x8E00,%dx        /* interrupt gate - dpl=0, present */
        lea SYMBOL_NAME(idt),%edi
        mov $256,%ecx
rp_sidt:
        movl %eax,(%edi)
        movl %edx,4(%edi)
        addl $8,%edi
        dec %ecx
        jne rp_sidt
        ret
Interrupt gate
  bits 63-32: OFFSET 31:16 | 8E00
  bits 31-0:  SELECTOR | OFFSET 15:0
ignore_int: just prints "Unknown Interrupt"
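The register shuffling in setup_idt builds the two halves of an interrupt gate. A C sketch of the same packing follows; `make_intr_gate` and `gate_handler` are hypothetical helpers, and the handler address used in any call is up to the caller.

```c
#include <stdint.h>

/* Build the two 32-bit halves of an i386 interrupt gate, the way
 * setup_idt prepares %eax (low half) and %edx (high half). */
void make_intr_gate(uint32_t handler, uint16_t selector, uint32_t gate[2])
{
    gate[0] = ((uint32_t)selector << 16) | (handler & 0xFFFF); /* SELECTOR | OFFSET 15:0 */
    gate[1] = (handler & 0xFFFF0000u) | 0x8E00u;               /* OFFSET 31:16 | present, dpl=0, interrupt gate */
}

/* Recover the handler offset from a gate. */
uint32_t gate_handler(const uint32_t gate[2])
{
    return (gate[1] & 0xFFFF0000u) | (gate[0] & 0xFFFF);
}
```

setup_idt stores the same pair in all 256 entries, so every vector initially leads to ignore_int.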
        lidt idt_descr
        ljmp $(KERNEL_CS),$1f
1:      movl $(KERNEL_DS),%eax  # reload all the segment registers
        mov %ax,%ds             # after changing gdt.
        mov %ax,%es
        mov %ax,%fs
        mov %ax,%gs
        call SYMBOL_NAME(start_kernel)  # jump to C main routine
start_kernel

asmlinkage void start_kernel(void)
{
    setup_arch(&command_line, &memory_start, &memory_end);
    memory_start = paging_init(memory_start,memory_end);
    trap_init();
    init_IRQ();
    <-------------- omit ---------------->
    memory_start = console_init(memory_start,memory_end);
    memory_start = kmalloc_init(memory_start,memory_end);
    sti();      /* enable interrupts */
    memory_start = inode_init(memory_start,memory_end);
    memory_start = file_table_init(memory_start,memory_end);
    memory_start = name_cache_init(memory_start,memory_end);
    mem_init(memory_start,memory_end);
    <---------- omit ------------->
    printk(linux_banner);
    sysctl_init();
    kernel_thread(init, NULL, 0);
    cpu_idle(NULL);
}
setup_arch

memory_end:
    #define PARAM empty_zero_page
    #define EXT_MEM_K (*(unsigned short *) (PARAM+2))
    memory_end = (1<<20) + (EXT_MEM_K<<10);
    memory_end &= PAGE_MASK;

memory_start:
    memory_start = (unsigned long) &_end;

[Figure: the kernel is loaded at 1M; memory_start is the first address past its end.]
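The memory_end computation adds the BIOS-reported extended memory (in KB) to the first megabyte and rounds down to a page boundary. A standalone sketch, assuming 4K pages; `compute_memory_end` and `PAGE_MASK_SKETCH` are names invented here so they do not clash with the kernel's own macros.

```c
/* Mask that clears the low 12 bits, assuming 4K pages. */
#define PAGE_MASK_SKETCH (~0xFFFUL)

/* ext_mem_k: extended memory above 1M in KB, as stored at PARAM+2. */
unsigned long compute_memory_end(unsigned short ext_mem_k)
{
    unsigned long memory_end = (1UL << 20) + ((unsigned long)ext_mem_k << 10);
    return memory_end & PAGE_MASK_SKETCH;
}
```

A machine reporting 15360 KB of extended memory ends up with memory_end = 0x1000000 (16M). Since the value is an unsigned short, this path cannot describe much more than 64M.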
init_task.mm->start_code = TASK_SIZE;   /* 0xC0000000 */
init_task.mm->end_code = TASK_SIZE + (unsigned long) &_etext;
init_task.mm->end_data = TASK_SIZE + (unsigned long) &_edata;
init_task.mm->brk = TASK_SIZE + (unsigned long) &_end;

/* "mem=XXX[kKmM]" overrides the BIOS-reported memory size */
if (c == ' ' && *(const unsigned long *)from == *(const unsigned long *)"mem=") {
    memory_end = simple_strtoul(from+4, &from, 0);
    if ( *from == 'K' || *from == 'k' ) {
        memory_end = memory_end << 10;
        from++;
    } else if ( *from == 'M' || *from == 'm' ) {
        memory_end = memory_end << 20;
        from++;
    }
}
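The "mem=" handling can be tried outside the kernel with the C library. In this sketch the standard `strtoul` stands in for the kernel's `simple_strtoul`, `strncmp` replaces the unsigned long comparison trick, and `parse_mem_option` is a made-up function name.

```c
#include <stdlib.h>
#include <string.h>

/* Parse one "mem=XXX[kKmM]" option. Returns 1 and stores the size in
 * bytes on success, or 0 if the option does not start with "mem=". */
int parse_mem_option(const char *opt, unsigned long *memory_end)
{
    char *end;
    unsigned long value;

    if (strncmp(opt, "mem=", 4) != 0)
        return 0;
    value = strtoul(opt + 4, &end, 0);   /* base 0: decimal, 0x hex, or octal */
    if (*end == 'K' || *end == 'k')
        value <<= 10;
    else if (*end == 'M' || *end == 'm')
        value <<= 20;
    *memory_end = value;
    return 1;
}
```

For instance, "mem=16M" and "mem=0x1000000" both produce 16777216 bytes.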
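paging_init
[Figure: pg_dir entries 0, 1, ..., n and 768, 769, ... point to the page tables pg0, pg1, pg2, ..., pgn, each mapping 4M of physical memory. The new page tables pg1 ... pgn are allocated starting at memory_start, above the kernel (loaded at 1M), which contains pg_dir and pg0.]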
start_mem = PAGE_ALIGN(start_mem);
address = 0;
pg_dir = swapper_pg_dir;
while (address < end_mem) {
    /* map the memory at virtual addr 0xC0000000 */
    pg_table = (pte_t *) (PAGE_MASK & pgd_val(pg_dir[768]));
    if (!pg_table) {
        pg_table = (pte_t *) start_mem;
        start_mem += PAGE_SIZE;
    }
    /* also map it temporarily at 0x0000000 for init */
    pgd_val(pg_dir[0]) = _PAGE_TABLE | (unsigned long) pg_table;
    pgd_val(pg_dir[768]) = _PAGE_TABLE | (unsigned long) pg_table;
    pg_dir++;
    for (tmp = 0 ; tmp < PTRS_PER_PTE ; tmp++,pg_table++) {
        if (address < end_mem)
            set_pte(pg_table, mk_pte(address, PAGE_SHARED));
        else
            pte_clear(pg_table);
        address += PAGE_SIZE;
    }
}
local_flush_tlb();      /* move cr3, r?; mov r?, cr3; */
return free_area_init(start_mem, end_mem);
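Each pass of the outer loop fills one page table, which covers 1024 * 4K = 4M. The sketch below estimates how many tables the loop touches and which directory slot maps a given physical address at 0xC0000000; the function and macro names are illustrative and assume 4K pages with 1024 entries per table.

```c
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_PTRS_PER_PTE 1024UL

/* Number of page tables needed to map physical memory [0, end_mem). */
unsigned long page_tables_needed(unsigned long end_mem)
{
    unsigned long span = SKETCH_PAGE_SIZE * SKETCH_PTRS_PER_PTE; /* 4M */
    return (end_mem + span - 1) / span;
}

/* Directory slot that maps physical address phys at 0xC0000000 + phys. */
unsigned long kernel_dir_slot(unsigned long phys)
{
    return 768 + phys / (SKETCH_PAGE_SIZE * SKETCH_PTRS_PER_PTE);
}
```

With 16M of memory the loop runs four times. The first pass appears to reuse pg0, which head.S already installed in slot 768, so only the remaining tables would be taken from start_mem.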