Learning the Data Management in Linux Kernel v2.6 Yonggang Liu University of Florida
Picture of Today’s Topics
• Virtual File System (VFS): Provides a uniform file system interface to the processes.
• Disk Caches: Keep the most recently accessed data in RAM. Include the page cache, dentry cache and inode cache.
• Mapping Layer (Ext3, FAT, UFS): Specific file systems determine the physical location of the data on disk.
• Generic Block Layer: Offers an abstract view of the block devices. I/O operation is “block I/O”.
• I/O Scheduler Layer: Groups requests for data that lie near each other on the physical medium.
• Block Device Driver: Takes care of the actual data transfer by sending suitable commands to the hardware.
• Hard Disk
Uniform System Calls - VFS Calls
[Figure: Process 1, Process 2 and Process 3 sit above the Virtual File System, which sits above the specific file systems.]
VFS defines the uniform system calls: mount(), umount(), sysfs(), statfs(), chroot(), chdir(), fchdir(), getcwd(), mkdir(), rmdir(), getdents(), link(), rename(), readlink(), chown(), chmod(), stat(), open(), close(), creat(), dup(), fcntl(), select(), poll(), truncate(), lseek(), read(), write() …
• Disk-based file systems: Ext3, NTFS, ReiserFS, UDF DVD FS …
• Network file systems: NFS, Coda, AFS, CIFS, NCP …
• Special file systems: root FS, sysfs, tmpfs, usbfs, sockfs …
Common File Model - VFS Objects
• File Object: Describes how a process interacts with a file it has opened. Created when the file is opened. Has no image on disk. Some fields: f_dentry, f_op, f_pos, f_version, f_mapping
• Dentry Object: A directory entry object associates a pathname with its inode. Copied to memory during pathname lookups. Some fields: d_inode, d_parent, d_name, d_subdirs, d_sb
• Inode Object: Includes all information needed by the file system to handle a file. Copied to memory when the file's attributes are accessed. Some fields: i_ino, i_size, i_atime, i_sb, i_mapping
• Superblock Object: Each file system has a superblock recording the information of the file system; it is copied to memory when used. Some fields: s_blocksize, s_type, s_root, s_inodes, s_bdev
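The chain of pointers between the four objects can be pictured with a small user-space model. This is only a sketch in Python, not kernel code: the class names are illustrative, and only a handful of the fields listed above are kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Superblock:
    s_blocksize: int   # block size of the file system, in bytes
    s_type: str        # file system type name, e.g. "ext3"


@dataclass
class Inode:
    i_ino: int         # inode number
    i_size: int        # file length in bytes
    i_sb: Superblock   # superblock of the owning file system


@dataclass
class Dentry:
    d_name: str                        # one pathname component
    d_inode: Inode                     # inode this name refers to
    d_parent: Optional["Dentry"] = None


@dataclass
class File:
    f_dentry: Dentry   # dentry used to open the file
    f_pos: int = 0     # current offset; lives only in memory


def size_of_open_file(f: File) -> int:
    """Follow file -> dentry -> inode to reach the file size."""
    return f.f_dentry.d_inode.i_size


def block_size_of_open_file(f: File) -> int:
    """Follow file -> dentry -> inode -> superblock."""
    return f.f_dentry.d_inode.i_sb.s_blocksize
```

The file object carries only per-open state (here f_pos), which is why it has no image on disk, while everything persistent is reached through the dentry and inode.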
Interaction between Processes and VFS Objects
In the example, 3 processes have opened the same file, 2 of them using the same hard link.
[Figure: each process's fd points to its own file object. The f_dentry fields of the three file objects point to two dentry objects in the dentry cache (two file objects share one dentry). The d_inode field of both dentries points to the same inode object, whose i_sb field points to the superblock object. The inode object corresponds to the disk file.]
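The same situation can be reproduced from user space. The sketch below (Python on a Unix-like system; the file names and the single-process setup are assumptions made for brevity) opens one disk file three times, twice through one name and once through a hard link, then checks that all three descriptors reach the same inode while keeping separate offsets.

```python
import os
import tempfile


def open_three_ways(directory):
    """Open one disk file three times: twice by one name, once by a hard link."""
    original = os.path.join(directory, "data.txt")
    alias = os.path.join(directory, "alias.txt")
    with open(original, "wb") as f:
        f.write(b"hello, vfs")
    os.link(original, alias)  # second directory entry for the same inode
    return [
        os.open(original, os.O_RDONLY),
        os.open(original, os.O_RDONLY),
        os.open(alias, os.O_RDONLY),
    ]


def inode_numbers(fds):
    """Inode number behind each descriptor."""
    return [os.fstat(fd).st_ino for fd in fds]


def offsets(fds):
    """Current file offset of each descriptor."""
    return [os.lseek(fd, 0, os.SEEK_CUR) for fd in fds]


def demo():
    with tempfile.TemporaryDirectory() as d:
        fds = open_three_ways(d)
        try:
            inodes = inode_numbers(fds)
            os.read(fds[0], 5)  # advances only the first file object's offset
            return inodes, offsets(fds)
        finally:
            for fd in fds:
                os.close(fd)
```

Calling demo() returns the three inode numbers and the three offsets. The inode numbers are equal because both names lead to one inode; the offsets differ because every open() creates its own file object.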
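Files Associated with a Process
• fs_struct (reached through the fs field of the process descriptor): Stores the current working directory (pwd) and the process's own root directory (root), etc.
• files_struct (reached through the files field of the process descriptor): Stores which files are currently opened by the process. Its fd_array is indexed by file descriptor (0 = stdin, 1 = stdout, 2 = stderr, 3, …), and each used entry points to a file object.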
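One observable consequence of fd_array holding pointers to file objects is that two descriptors can share one file object. A minimal sketch, assuming a Unix-like system and Python's os module: dup() fills a new descriptor slot with the same file object, so the offset is shared, while a second open() creates a fresh file object with its own offset.

```python
import os
import tempfile


def dup_shares_offset():
    """Return (offset seen via dup'ed fd, offset seen via second open, fds differ)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.txt")
        with open(path, "wb") as f:
            f.write(b"0123456789")
        fd = os.open(path, os.O_RDONLY)
        fd_dup = os.dup(fd)                  # new slot, same file object
        fd_new = os.open(path, os.O_RDONLY)  # new slot, new file object
        try:
            os.read(fd, 4)  # moves the offset of the shared file object
            return (
                os.lseek(fd_dup, 0, os.SEEK_CUR),
                os.lseek(fd_new, 0, os.SEEK_CUR),
                fd_dup != fd,
            )
        finally:
            for x in (fd, fd_dup, fd_new):
                os.close(x)
```

After reading 4 bytes through fd, the duplicated descriptor reports offset 4 and the independently opened one still reports 0.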
Three Kinds of Disk Cache
• Page Cache: The main disk cache used by the Linux kernel. Stores the pages containing:
  • Data of regular files
  • Directories
  • Data directly read from block device files
  • Data of User Mode processes swapped out on disk
  • Special file systems (e.g., tmpfs)
• Dentry Cache: Stores dentry objects representing file system pathnames.
• Inode Cache: Stores inode objects representing disk inodes.
Page descriptors
• Page descriptors are used by the kernel to keep track of the status of each page frame.
• Size: 32 bytes.
• All page descriptors are stored in mem_map, which takes about 1% of RAM.
[Figure: A View of Memory Address Space (Abbreviated): Reserved (HD), Reserved (kernel), mem_map array, Pages]
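The "about 1%" figure follows from simple arithmetic. The sketch below assumes 4 KB page frames (the slide gives only the 32-byte descriptor size), so the function and constant names are illustrative.

```python
PAGE_SIZE = 4096       # assumed page frame size in bytes
DESCRIPTOR_SIZE = 32   # bytes per page descriptor, as stated on the slide


def mem_map_bytes(ram_bytes, page_size=PAGE_SIZE, descriptor_size=DESCRIPTOR_SIZE):
    """Size of the mem_map array: one descriptor per page frame."""
    frames = ram_bytes // page_size
    return frames * descriptor_size


def mem_map_fraction(page_size=PAGE_SIZE, descriptor_size=DESCRIPTOR_SIZE):
    """Fraction of RAM consumed by page descriptors."""
    return descriptor_size / page_size
```

With these assumptions, 1 GiB of RAM has 262144 page frames, so mem_map occupies 8 MiB, which is 32/4096, or roughly 0.78% of RAM.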
Find a Page in Page Cache
• Each inode object owns an address_space object (reached through i_mapping), which has a pointer (page_tree) to the root of a radix tree.
• The radix tree is used to look up a page in the page cache.
• An offset in the file determines a path through the radix tree nodes that ends at a page descriptor.
[Figure: inode object → (i_mapping) → address_space object → (page_tree) → radix_tree root → nodes → page descriptors]
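A simplified model of the lookup may help. This sketch is not the kernel's radix tree: it assumes 4 KB pages, 64 slots per node (6 index bits per level) and a fixed height of 3, and it stores arbitrary Python objects where the kernel would store page descriptor pointers.

```python
PAGE_SHIFT = 12            # assumed 4 KB pages
MAP_SHIFT = 6              # assumed 64 slots per tree node
MAP_SIZE = 1 << MAP_SHIFT
MAP_MASK = MAP_SIZE - 1


class RadixTree:
    def __init__(self, height=3):
        self.height = height
        self.root = [None] * MAP_SIZE

    def _slots(self, index):
        """Slot to follow at each level, most significant bits first."""
        return [
            (index >> (MAP_SHIFT * level)) & MAP_MASK
            for level in range(self.height - 1, -1, -1)
        ]

    def _in_range(self, index):
        return 0 <= index < (1 << (MAP_SHIFT * self.height))

    def insert(self, index, page):
        if not self._in_range(index):
            raise IndexError("page index does not fit in this tree")
        node = self.root
        *inner, last = self._slots(index)
        for slot in inner:
            if node[slot] is None:
                node[slot] = [None] * MAP_SIZE  # allocate intermediate node
            node = node[slot]
        node[last] = page

    def lookup(self, index):
        if not self._in_range(index):
            return None
        node = self.root
        *inner, last = self._slots(index)
        for slot in inner:
            node = node[slot]
            if node is None:
                return None  # no page cached on this path
        return node[last]


def find_page(tree, file_offset):
    """Turn a byte offset into a page index, then walk the tree."""
    return tree.lookup(file_offset >> PAGE_SHIFT)
```

For example, after tree.insert(5, page), any offset from 5 × 4096 up to 6 × 4096 − 1 resolves to that page, and an offset in an uncached page returns None.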
Typical Layout of a Page
• Sector (typically 512 B): The smallest unit of data when accessing the block device.
• Block (a multiple of the sector size, a power of 2, no larger than a page frame): The smallest unit of data transfer for the VFS and the file systems. It corresponds to one or more adjacent sectors.
• Segment (a multiple of the block size): If several blocks in a page hold data that is adjacent on disk, they belong to one segment. Segments are used because each block I/O operation transfers a group of adjacent blocks on disk.
[Figure: a page made of four blocks, each block made of two sectors; adjacent blocks are grouped into a segment.]
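The three units can be made concrete with a few helper functions. This is a sketch under the sizes quoted above (512-byte sectors) plus an assumed 4 KB page; the function names are illustrative.

```python
SECTOR_SIZE = 512   # typical sector size, as on the slide
PAGE_SIZE = 4096    # assumed page frame size


def is_valid_block_size(block_size, sector_size=SECTOR_SIZE, page_size=PAGE_SIZE):
    """A block is a power of 2, a multiple of the sector size, and fits in a page."""
    power_of_two = block_size > 0 and block_size & (block_size - 1) == 0
    return power_of_two and block_size % sector_size == 0 and block_size <= page_size


def block_to_sectors(block_nr, block_size, sector_size=SECTOR_SIZE):
    """Adjacent sectors that make up one block."""
    per_block = block_size // sector_size
    first = block_nr * per_block
    return range(first, first + per_block)


def split_into_segments(disk_blocks):
    """Group the on-disk block numbers of a page into runs of adjacent blocks."""
    segments = []
    for block in disk_blocks:
        if segments and block == segments[-1][-1] + 1:
            segments[-1].append(block)   # adjacent on disk: same segment
        else:
            segments.append([block])     # gap on disk: new segment
    return segments
```

A page whose four 1 KB blocks live at disk blocks 100, 101, 102 and 200 therefore needs two segments: one for blocks 100 to 102 and one for block 200.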
Buffer Pages
• Buffer pages are used to address individual disk blocks within a page.
• Buffer page = a regular page + several buffer heads
• Buffer pages are created only when necessary. Two common cases:
  • When reading/writing pages of a file that are not stored in contiguous disk blocks.
  • When accessing a single disk block (e.g., a superblock or an inode).
[Figure: a page descriptor refers to a page holding four buffers (blocks), each described by its own buffer head.]
I/O Modes
• Canonical Mode: O_SYNC and O_DIRECT are cleared. read() is blocking; write() terminates as soon as the data is copied to the page cache.
• Synchronous Mode: O_SYNC is set. The flag affects only the write operation, which blocks the calling process until the data is effectively written to disk.
• Memory Mapping Mode: The application issues an mmap() system call to map the file into memory, so the file appears as an array of bytes in RAM.
• Direct I/O Mode: O_DIRECT is set. Any read or write operation transfers data directly from the User Mode address space to disk, or vice versa, bypassing the page cache.
• Asynchronous Mode: The requests for data never block the calling process; rather, they are carried out “in the background” while the application continues its normal execution.
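Two of these modes are easy to try from user space. The sketch below uses Python's os and mmap modules on a Unix-like system; the helper names and file name are illustrative. It writes with O_SYNC set and then reads the file back through a memory mapping. Direct I/O is left out because O_DIRECT imposes buffer alignment rules and is not supported by every file system.

```python
import mmap
import os
import tempfile


def write_synchronously(path, data):
    """Synchronous mode: write() returns only after the data has reached the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def read_through_mapping(path):
    """Memory mapping mode: the file is accessed like an array of bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[:])


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sync.txt")
        written = write_synchronously(path, b"hello, disk")
        return written, read_through_mapping(path)
```

Dropping os.O_SYNC from the flags gives canonical mode, where the same write() returns as soon as the data sits in the page cache.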
Read-ahead
Read-ahead consists of reading several adjacent pages of data before they are actually requested. In most cases, read-ahead significantly enhances disk performance.
Tune the read-ahead size for an opened file: modify the ra_pages field of the file->f_ra object.
• POSIX_FADV_NORMAL: 32 pages (default)
• POSIX_FADV_SEQUENTIAL: 2 × NORMAL
• POSIX_FADV_RANDOM: 0 pages
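From user space the read-ahead window is selected with posix_fadvise(). The sketch below, for Linux with Python's os.posix_fadvise, passes a hint for a whole file and reports the window size that the table above associates with it. The page counts are taken from the slide, not read back from the kernel, and the helper names are illustrative.

```python
import os
import tempfile

NORMAL_PAGES = 32  # default window from the slide


def readahead_pages(advice_name, normal=NORMAL_PAGES):
    """Window size, in pages, that the slide associates with each hint."""
    table = {
        "POSIX_FADV_NORMAL": normal,
        "POSIX_FADV_SEQUENTIAL": 2 * normal,
        "POSIX_FADV_RANDOM": 0,
    }
    return table[advice_name]


def advise(fd, advice_name):
    """Send the hint for the whole file (offset 0, length 0 means 'to the end')."""
    os.posix_fadvise(fd, 0, 0, getattr(os, advice_name))
    return readahead_pages(advice_name)


def demo():
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * 8192)
        f.flush()
        return advise(f.fileno(), "POSIX_FADV_SEQUENTIAL")
```

A program that scans a large file front to back would issue the SEQUENTIAL hint right after open(); one that performs scattered lookups would issue RANDOM to switch read-ahead off.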
When to Flush Dirty Pages
The pdflush kernel threads are responsible for writing out dirty pages in the background. Each time, pdflush tries to flush 1024 dirty pages.
A pdflush thread is woken when:
• A process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_background_ratio (typically 10%).
• A User Mode process issues a sync() system call.
• The kernel fails to allocate a new buffer page or memory pool element.
• The page reclaiming algorithm (LRU) wants to free more memory.
A process itself may invoke a system call to write back a few tens of pages when:
• A process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_ratio (typically 40%).
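The two thresholds can be read as a simple decision rule. The following is a rough model only: it uses the typical 10% and 40% values quoted above, ignores the other wake-up conditions, and its function names and return labels are made up for illustration.

```python
def writeback_action(dirty_pages, total_pages, background_ratio=10, dirty_ratio=40):
    """Classify the situation after a process has dirtied a page."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    percent_dirty = 100 * dirty_pages / total_pages
    if percent_dirty > dirty_ratio:
        return "writer_flushes"   # the writing process writes back pages itself
    if percent_dirty > background_ratio:
        return "wake_pdflush"     # background write-back is started
    return "none"


def pdflush_rounds(dirty_pages, batch=1024):
    """Rounds needed if each round flushes at most `batch` dirty pages."""
    return -(-dirty_pages // batch)  # ceiling division
```

With 100000 page frames, 15000 dirty pages would only wake pdflush, while 45000 dirty pages would also make the writing process flush pages itself.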
Ext2 Block Groups
Disk layout: Boot Block | Block group 0 | … | Block group n
Layout of one block group: Super Block (1 block) | Group Descriptors (n blocks) | Data block Bitmap (1 block) | inode Bitmap (1 block) | inode Table (n blocks) | Data blocks (n blocks)
• The Ext2 file system partitions the disk blocks into block groups of the same size.
• A block group contains at most 8 × b blocks, where b is the block size in bytes, because the data block bitmap must fit in a single block (one bit per block).
• The kernel tries to keep the data blocks belonging to a file in the same block group, if possible.
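The 8 × b limit translates directly into a maximum group size and a minimum number of groups for a partition. A short worked sketch, with illustrative function names:

```python
def max_blocks_per_group(block_size):
    """One bitmap block holds 8 bits per byte, one bit per data block."""
    return 8 * block_size


def max_group_bytes(block_size):
    """Largest possible block group, in bytes."""
    return max_blocks_per_group(block_size) * block_size


def min_group_count(partition_bytes, block_size):
    """Fewest block groups needed to cover a partition."""
    total_blocks = partition_bytes // block_size
    per_group = max_blocks_per_group(block_size)
    return -(-total_blocks // per_group)  # ceiling division
```

With 4 KB blocks a group holds at most 32768 blocks, that is 128 MiB, so a 32 GiB partition needs at least 256 block groups. With 1 KB blocks the limit drops to 8192 blocks, or 8 MiB per group.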
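Data Blocks Addressing
Given an offset f inside a file, how do we derive the logical block number of the corresponding block on disk?
• Get the file block number by dividing f by the block size.
• Translate the file block number to the corresponding logical block number by “Data Blocks Addressing”.
“Address Mapping Table”: the i_block array in the inode has 15 entries (0 to 14). With block size b in bytes, each block holds b/4 block numbers, and the file block numbers reachable through each entry are:
• i_block[0] to i_block[11] (direct): 0 to 11
• i_block[12] (single indirect): 12 to $(b/4)+11$
• i_block[13] (double indirect): $(b/4)+12$ to $(b/4)^2+(b/4)+11$
• i_block[14] (triple indirect): $(b/4)^2+(b/4)+12$ to $(b/4)^3+(b/4)^2+(b/4)+11$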
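The translation can be sketched as a function that, given a file block number, reports which i_block entry is used and which slot is followed at each level of indirection. It assumes 4-byte block numbers (hence b/4 entries per block); the function names and the returned tuple format are choices made for this illustration.

```python
def file_block_number(offset, block_size):
    """Step 1: which block of the file contains byte `offset`."""
    return offset // block_size


def locate(file_block, block_size):
    """Step 2: path through i_block for a file block number.

    Returns (kind, path). path[0] is the index into i_block; any further
    items are slot numbers inside successive indirect blocks.
    """
    n = block_size // 4  # block numbers stored in one indirect block
    if file_block < 0:
        raise ValueError("negative file block number")
    if file_block < 12:
        return ("direct", (file_block,))
    file_block -= 12
    if file_block < n:
        return ("indirect", (12, file_block))
    file_block -= n
    if file_block < n * n:
        return ("double", (13, file_block // n, file_block % n))
    file_block -= n * n
    if file_block < n ** 3:
        return (
            "triple",
            (14, file_block // (n * n), (file_block // n) % n, file_block % n),
        )
    raise ValueError("file block number beyond the maximum file size")
```

With 1 KB blocks (n = 256), byte offset 5000 lies in file block 4, which is a direct block; file block 268 is the first one that needs double indirection.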
Allocating a Data Block
To reduce file fragmentation, when allocating a block for a file, Ext2 tries the following, in this order:
• Get a new block near the block already allocated for the file.
• Get a new block in the block group that includes the file’s inode.
• Get a new block from one of the other block groups.
Preallocation of data blocks: to reduce file fragmentation, the file does not get only the requested block each time, but rather 8 adjacent blocks. When the file is closed, all the unused preallocated blocks are freed.
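The search order can be modeled with plain lists standing in for the per-group block bitmaps. This is a deliberately simplified sketch, not the Ext2 allocator: "near" is interpreted as "the next free block after the last allocated one in the same group", preallocation is left out, and all names are illustrative.

```python
def find_block(groups, last_block, inode_group):
    """Pick a free block following the three-step order.

    groups:      list of block groups, each a list of booleans (True = free)
    last_block:  (group, index) of the file's last allocated block, or None
    inode_group: index of the group holding the file's inode
    Returns (group, index) of the chosen block, or None if the disk is full.
    """
    # 1. Near the block already allocated for the file.
    if last_block is not None:
        group, index = last_block
        for candidate in range(index + 1, len(groups[group])):
            if groups[group][candidate]:
                return (group, candidate)
    # 2. Anywhere in the block group that holds the file's inode.
    for candidate, free in enumerate(groups[inode_group]):
        if free:
            return (inode_group, candidate)
    # 3. Any other block group.
    for group, bitmap in enumerate(groups):
        if group == inode_group:
            continue
        for candidate, free in enumerate(bitmap):
            if free:
                return (group, candidate)
    return None
```

The caller would mark the returned block as used (set it to False) before asking for the next one.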
The Ext3 Journaling File System
Goal of journaling file systems: when doing a consistency check, the file system only needs to look in the journal area of the disk, which contains the most recent disk write operations, instead of checking the whole file system. This saves a large amount of time after a system failure.
Two steps in the Ext3 journaling process:
• Step 1: A copy of the blocks to be written is stored in the journal. If the system fails during this step, the changes are discarded and the file system is still consistent.
• Step 2: When the I/O data transfer to the journal is completed, the blocks are written to the file system. When this finishes, the copies in the journal are discarded. If the system fails during this step, the changes are applied from the journal and the file system is consistent.
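The two steps and the two failure cases can be illustrated with a toy model. It is only a sketch: dictionaries stand in for disk areas, a boolean stands in for the journal commit record, and the crash_at parameter is a hypothetical hook that simulates a failure after the first block of a step.

```python
class JournaledDisk:
    def __init__(self):
        self.blocks = {}        # the file system proper: block number -> data
        self.journal = {}       # copies of blocks waiting to be applied
        self.committed = False  # True once the journal copy is complete

    def write(self, updates, crash_at=None):
        """Apply `updates`; crash_at may be None, "journal" or "filesystem"."""
        # Step 1: copy the blocks to be written into the journal.
        for count, (number, data) in enumerate(updates.items()):
            if crash_at == "journal" and count == 1:
                return  # simulated failure with an incomplete journal
            self.journal[number] = data
        if crash_at == "journal":
            return
        self.committed = True
        # Step 2: write the blocks to the file system.
        for count, (number, data) in enumerate(updates.items()):
            if crash_at == "filesystem" and count == 1:
                return  # simulated failure with a half-updated file system
            self.blocks[number] = data
        if crash_at == "filesystem":
            return
        self._discard_journal()

    def recover(self):
        """Consistency check after a failure: only the journal is examined."""
        if self.committed:
            self.blocks.update(self.journal)  # apply the complete copy
        self._discard_journal()               # otherwise just drop it

    def _discard_journal(self):
        self.journal = {}
        self.committed = False
```

A failure during step 1 leaves the file system untouched, so recovery simply discards the partial journal. A failure during step 2 leaves a complete journal, so recovery replays it and the update takes effect as a whole.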