OS Technologies by Linux

OS Technologies by Linux • Linux Processes • Linux Memory Management • Linux File System • Linux Device • Linux Source Code Highlights

Linux Processes lKernel Level(Privilege level 0, for Intel CPU) lUser Level(Privilege level 3, for Intel CPU) Intel microprocessors (both IA-32 and Itanium® architecture) provide protection based on the concept of a 2–bit privilege level, using 0 for most-privileged software and 3 for least-privileged. The privilege level determines whether privileged instructions, which control basic CPU functionality, can execute without fault. It also controls address-space accessibility based on the configuration of the processor's page tables and, for IA-32, segment registers. Most IA software uses only privilege levels 0 and 3. • SystemCalls and Interrupts are channels by which Linux kernel services are provided

Linux Processes (Cont) • Process Control Block • Process States • Process Scheduling • Interrupt • System Call • Process Creation & Termination • Program Loading & Execution • Process Communication

Process Control Block • task_struct in “include/linux/sched.h” • Scheduling Data • Signal Processing • Pointers for Process Queues • Process IDs • Clock & Timers • Process Context • File System Managing • Memory Managing

TASK_RUNNING TASK_UNINTERRUPTIBLE TASK_INTERRUPTIBLE TASK_RUNNING & Holding CPU TASK_STOPPED TASK_ZOMBIE Process State

Policy Scheduling Algorithm SCHED_RR Real-Time, Priority Based Round Robin SCHED_FIFO Real-Time, First in First Out SCHED_OTHER Not Real-Time, Priority Based Round Robin Process Scheduling • Scheduler in “/usr/src/linux/kernel/sched.c” • Scheduling Policy

Interrupt • Interrupt Handling Flow by Intel • Clock, jiffies & HZ • Timer Interrupt Handler in “arch/i386/kernel/time.c” • bottom half & top half • Wait Queues

System Call • Data Structure & Function Prefix of function name for System Call Handler, sys_* System Call No(include/asm-i386/unistd.h), __NR_* System Call Table (arch/i386/kernel/entry.S) • From Syscall function to INT 0X80 request • Initialize to System Call processing • How Linux kernel serves System Calls system_call • ret_from_sys_call(arch/i386/kernel/entry.S)

Linux Memory Management • i386 MMU • Virtual Memory Organization(3 ways) • Memory Protection, “mm/mprotect.c”“mm/mlock.c” • Physical Space Management • Free Physical Space Management • Kernel-mode Real Memory, Allocate & Free • Kernel-mode Virtual Memory, Allocate & Free • Page Replacement Process & Swap out • Page Fault & Swap in • Cache

i386 MMU

The Selector • TI=0, the segment desc pointed by the Selector is in GDT • TI=1, the segment desc pointed by the Selector is in LDT • RPL: Privilege level that the applicant holds, 0 is the highest, • while 3 is the lowest • INDEX: the index to the segment desc within GDT/LDT 15 1 0 3 2 INDEX TI RPL

Location Description bit 15 – bit 00 Segment limit, bits 19 : 16 bit 31 – bit 16 Segment addr., bits 15 : 00 bit 39 – bit 32 Segment addr., bits 23 : 16 bit 47 – bit 40 access rights bit 51 – bit 48 Segment limit, bits 19 : 16 bit 52 Defined by user bit 53 0 (reserved) bit 54 D bit 55 G bit 63 – bit 56 Segment addr., bits 31 : 24 Segment Desc format

G, granularity • G=0, byte as a unit • G=1, 4K as a unit

0 NULL descriptor 1 Not used 2 Code in Kernel Mode Virtual addr. Starts at 0XC0000000 1GB Privilege level 0 3 Data in Kernel Mode Virtual addr. Starts at 0XC0000000 1GB Privilege level 0 4 Code in User Mode Virtual addr. Starts at 0X00000000 3GB Privilege level 3 5 Data in User Mode Virtual addr. Starts at 0X00000000 3GB Privilege level 3 6 Not used 7 Not used 2i+6 LDT of Process i 2i+7 TSS of Process i Linux GDT Arrangement

0 NULL descriptor 1 Code in User Mode Virtual addr. Starts at 0X00000000 3GB Privilege level 3 2 Data in User Mode Virtual addr. Starts at 0X00000000 3GB Privilege level 3 Linux LDT Arrangement

控制寄存器 • CR3指示页目录表的起始地址 • CR0寄存器的PE位(位0)用于控制段机制。PE=1则处理器工作于保护模式下；PE=0则处理器工作于实模式下，等同于8086 • CR0的PG位(位31)用于控制分页机制。PG=1则启用分页机制，32位线性地址通过页表转换为物理地址；PG=0则禁止分页机制，32位线性地址直接寻址物理地址 • 当PE=1且PG=0时，处理器工作于保护模式下，但禁止分页机制。此时没有内存和磁盘之间的页面交换，也就不存在虚拟内存 • 当PE=1且PG=1时，处理器工作于保护模式下，但启用分页机制。此时有内存和磁盘之间的页面交换，磁盘起到虚拟内存的作用 • CR2指示引起缺页中断的地址

32位线性地址 31 12 11 0 22 21 Page Directory Page offset

Linux分页管理

页目录项和页表项 31 12 6 5 2 1 0 页表或页帧的物理地址第31位至第12位 D A U/S R/W P P=1 则地址转换有效；P=0则地址转换无效 R/W=1则该页可写，可读，且可执行；R/W=0则该页可读，可执行，但不可写 U/S=1则该页可在任何特权级下访问； U/S=0则该页只能在特权级0、1和2下访问 A：访问位 D：已写标志位

PCB对存储空间的管理 struct mm_struct { int count; pgd_t * pgd; /* 进程页目录的起始地址 */ unsigned long context; unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack, start_mmap; unsigned long arg_start, arg_end, env_start, env_end; unsigned long rss, total_vm, locked_vm; unsigned long def_flags; struct vm_area_struct * mmap; /* 指向vma双向链表的指针 */ struct vm_area_struct * mmap_avl; /* 指向vma AVL树的指针 */ struct semaphore mmap_sem; }

虚存段vma

一个进程的虚拟空间

AVL树（Adelson-Velskii and Landis） (1000,2000) vm_avl_left (2500,3000) vm_avl_right (500,670) vm_avl_left vm_avl_left vm_avl_right vm_avl_right (350,450) (800,900) vm_avl_left vm_avl_left vm_avl_right vm_avl_right

初始化后物理存储分布 0X000000(0K) Empty_Zero_Page 由mem_init初始化 0X001000(4K) swapper_pg_dir 核心态访问空间的页目录 0X002000(8K) pg0 0X003000(12K) bad_pages 0X004000(16K) bad_pg_table 0X005000(20K) floppy_track_buffer 0X006000(24K) kernel_code+text FREE 0X0A0000(640K RESERVED 0X100000(1M) pg_tables(4K) swap_cache_mem mem_map bitmap FREE

物理页面 typedef struct page { struct page *next, *prev; /* 由于搜索算法的约定，这两项必须首先定义 */ struct inode *inode; /* 若该页帧的内容是文件，则inode和offset unsigned long offset; /* 指出文件的inode和偏移位置 */ struct page *next_hash; atomic_t count; /* 访问此页帧的进程记数，大于1表示由多个进程共享 */ unsigned flags; /* atomic flags, some possibly updated asynchronously */ unsigned dirty:16, /* 页帧修改标志 */ age:8; /* 页帧的年龄，越小越先换出 */ struct wait_queue *wait; struct page *prev_hash; struct buffer_head * buffers; /* 若该页帧作为缓冲区，则指示地址 */ unsigned long swap_unlock_entry; unsigned long map_nr; /* 页帧在mem_map表中的下标， page->map_nr == page - mem_map */ } mem_map_t; mem_map_t * mem_map = NULL; /* 页帧描述表的首地址 */

空闲物理内存管理 • bitmap表在物理内存低端，紧跟mem_map表的bitmap表以位示图方式记录了所有物理内存的空闲状况。与mem_map一样，bitmap在系统初始化时由free_area_init()函数创建(mm/page_alloc.c)。与一般性位图不同，bitmap表分割成NR_MEM_LISTS组(缺省值6)。 • buddy算法 buddy算法分配空闲块，由_get_free_pages()和free_pages()函数执行。change_bit()函数根据bitmap的对应组，判断回收块的前后邻居是否也为空。

空闲物理内存管理 • LINUX用free_area数组记录空闲的物理页帧 struct free_area_struct { struct page *next; /* 此结构的next,prev指针与struct page匹配 */ struct page *prev; unsigned int * map; /* 指向bitmap */ }; static struct free_area_struct free_area[NR_MEM_LISTS];

空闲物理内存管理 free_area next 每1 bit对应20页 prev map next prev 每1 bit对应21页 map next prev 每1 bit对应22页 map ...... ............ page page 2 pages 2 pages 4 pages 4 pages

内核态实存的申请与释放 • kmalloc()和kfree() • kmalloc_cache • 在新版本里已经被SLAB替换 • SLAB可供用户使用的函数有： kmem_cache_create(), kmem_cache_alloc(), kmalloc(), kmem_cache_free, kfree()/kfree_s(), kmem_cache_shrink(), kmem_cache_reap()

32 64 ...... 4096-16 ...... ...... 131072-16 bh_flags nfree order firstfree bh_flags 老版本算法(已经被SLAB替换) sizes表 blocksize表

内核态虚存的申请与释放 • 可分配的虚拟空间在3G+high_memory+HOLE_8M以上高端，由vmlist链表管理 • 用户态内存的申请与释放(mm/vmalloc.c) vmalloc()和vfree() • vmlist链表的节点类型(include/linux/vmalloc.h) struct vm_struct { unsigned long flags; /* 虚拟内存块的占用标志 */ void * addr; /* 虚拟内存块的起始地址 */ unsigned long size; /* 虚拟内存块的长度 */ struct vm_struct * next; /* 下一个虚拟内存块 */ }; static struct vm_struct * vmlist = NULL;

3G+high_memory+HOLE_8M以上高端空间 vmlist flags size flags size 3G+high_memory+HOLE_8M

页交换进程和页面换出 • 内核态交换进程kswapd 一个是内核态线程（kernel thread）：没有虚拟存储空间，运行在内核态，直接使用物理地址空间。它不仅能将页面换出到交换空间（交换区或交换文件），它也保证系统中有足够的空闲页面以保持存储系统高效地运行。在系统初启时由核心态线程init创建，并等待系统交换定时器swap_tick周期性地唤醒。 • 系统的空闲页面不够时，kswapd依次从三条途径缩减系统使用的物理页面缩减Page Cache和Buffer Cache；换出System V共享系统的内存页面；换出或丢弃页面。

缺页中断和页面换入 • 产生缺页的虚存地址（CR2）传递给内核的缺页中断服务程序 • 如果没有找到与缺页相对应的vm_area_struct结构，那么说明进程访问了一个非法存储区，LINUX向进程发送信号SIGSEGV • 接着检测缺页时，该访问是否合法 • 经过以上两步检查，可以确定的确是缺页中断

Cache • Swap Cache Swap Cache是一个页表项的列表，每一项与内存中的一帧相对应。被换出的页面在Swap Cache中占有一表项，该表项描述页面在交换空间中的位置。当该页面修改时，该表项将被清除。 • Page Cache 文件被映射到内存中，每次读取一页。而这些页就保存于Page Cache中。 • Buffer Cache

Linux File System • 文件、目录、文件系统 • VFS 为什么需要VFS VFS的体系结构 VFS的files_struct, p583, sched.h VFS的file, p576, fs.h VFS的dentry, p579, dcache.h VFS的super block, p569, fs.h VFS的inode, p571, fs.h • 文件系统类型 • 文件系统安装mount

Linux File System (Cont) • 路径定位和查找 • ext2文件系统 ext2体系结构 ext2的超级块, ext2_super_block, p602 ext2的group descriptor, ext2_group_desc,p603 ext2的inode ext2的操作 • 典型文件操作

etc home mnt dev usr lsp include lib dos proc Linux文件系统

为什么需要VFS

VFS的体系结构

文件系统类型 static struct file_system_type *file_systems = (struct file_system_type *) NULL; struct file_system_type { struct super_block *(*read_super)(); /* 读出该文件系统在外存的super_block */ const char *name; /* 文件系统的类型名 */ int requires_dev; /* 支持文件系统的设备 */ struct file_system_type * next; /* 文件系统类型链表的后续指针 */ };

文件系统类型 (Cont) • 文件系统类型的注册和注销函数 int register_filesystem(struct file_system_type * fs) int unregister_filesystem(struct file_system_type * fs) file_systems file_system_type file_system_type file_system_type

root 下挂文件系统安装点vfsmount i_sb mnt_mountpoint mnt_root d_mounted!=0 Mount文件系统

已安装文件系统的描述 static LIST_HEAD(vfsmntlist); struct vfsmount { struct list_head mnt_hash; struct vfsmount *mnt_parent; /* fs we are mounted on */ struct dentry *mnt_mountpoint; /* dentry of mountpoint */ struct dentry *mnt_root; /* root of the mounted tree */ struct super_block *mnt_sb; /* pointer to superblock */ struct list_head mnt_mounts; /* list of children, anchored here */ struct list_head mnt_child; /* and going through their mnt_child */ atomic_t mnt_count; int mnt_flags; char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */ struct list_head mnt_list; };

mnt_sb mnt_sb s_type 已安装文件系统的描述 file_system_type vfsmount super_block vfsmntlist file_systems

文件系统安装mount • sys_mount(), p586 • do_mount(), p587

路径定位和查找 (p592) • open_namei() path_init() path_walk(); 以及 link_path_walk(); • struct nameidata总是反映当前目录

ext2文件系统 • 支持UNIX所有标准的文件系统特征，包括正文（regular files）、目录、设备文件和连接文件等，这使得它很容易被UNIX程序员接受。事实上，ext2的绝大多数的数据结构和系统调用与经典的UNIX一致 • 能够管理海量存储介质。支持多达4TB的数据，即一个分区的容量最大可达4TB • 支持长文件名，最多可达255个字符，并且可扩展到1012个字符 • 允许通过文件属性改变内核的行为；目录下的文件继承目录的属性 • 支持文件系统数据“即时同步”特性，即内存中的数据一旦改变，立即更新硬盘上的数据使之一致 • 实现了“快速连接”（fast symbolic links）的方式，使得连接文件只需要存放inode的空间 • 允许用户定制文件系统的数据单元（block）的大小，可以是 1024、2048 或 4096 个字节，使之适应不同环境的要求 • 使用专用文件记录文件系统的状态和错误信息，供下一次系统启动时决定是否需要检查文件系统

ext2体系结构

内存中的ext2 inode • ext2_inode_info, include/linux/ext2_fs_i.h struct ext2_inode_info { __u32 i_data[15]; __u32 i_flags; __u32 i_faddr; __u8 i_frag_no; __u8 i_frag_size; __u16 i_osync; __u32 i_file_acl; __u32 i_dir_acl; __u32 i_dtime; __u32 i_block_group; __u32 i_next_alloc_block; __u32 i_next_alloc_goal; __u32 i_prealloc_block; __u32 i_prealloc_count; __u32 i_dir_start_lookup; int i_new_inode:1; /* Is a freshly allocated inode */ };

OS Technologies by Linux

OS Technologies by Linux

Presentation Transcript

Linux OS Concepts

LINUX System : Lecture 2 OS and UNIX summary

By Burapha Linux

10 Reasons To Install Zorin Linux OS 9

OS Concepts Using Linux

Embedded Computer Linux OS porting

Essential OS Commands for Linux

Case Study – The Linux OS Part II

LINUX OS ON ARM PROCESSOR

RUNNING RECONFIGME OS OVER PETA LINUX OS

Drupal Installation on Debian OS using LINUX commands

Android OS by DallasTechnologies

Android OS by Lara Technologies

Laptop with Linux Os

Linux OS Installation Guide

Cooperative Linux… “A treaty between two OS giants”

best linux os for cyber security