1 / 34

Advanced Filesystem Implementation - Ext2 Disk Structures and Performance Optimization

This article explains the design and implementation of the Ext2 filesystem, covering disk structures such as block groups, inode tables, and data block addressing. It also discusses disk performance optimization techniques and the trade-off between latency and throughput.

Download Presentation

Advanced Filesystem Implementation - Ext2 Disk Structures and Performance Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSC 660: Advanced OS Filesystem Implementation CSC 660: Advanced Operating Systems

  2. Topics • Disks • Ext2 Filesystem Layout • Inode Allocation • Block Addressing • Block Allocation • e2fsck • Journaling • Stackable Filesystems CSC 660: Advanced Operating Systems

  3. Filesystem Layering CSC 660: Advanced Operating Systems

  4. Hard Drive Components CSC 660: Advanced Operating Systems

  5. Hard Drive Components Actuator Moves arm across disk to read/write data. Arm has multiple read/write heads (often 2/platter.) Platters Rigid substrate material + magnetic coating. Divided into many concentric tracks. Spindle Motor Spins platters from 3600-15,000 rpm. Speed determines disk latency. Cache 2-16MB of cache memory Reliability: write-back vs. write-through CSC 660: Advanced Operating Systems

  6. Disk Information: hdparm # hdparm -i /dev/hde /dev/hde: Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq } RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40 BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: * signifies the current active mode CSC 660: Advanced Operating Systems

  7. Disk Performance Seek Time Time to move head to desired track (3-8 ms) Rotational Delay Time until head over desired block (8ms for 7200) Latency Seek Time + Rotational Delay Throughput Data transfer rate (20-80 MB/s) CSC 660: Advanced Operating Systems

  8. Latency vs. Throughput Which is more important? Depends on the type of load. Sequential access – Throughput Multimedia on a single user PC Random access – Latency Most servers How to improve performance Faster disks Caching More spindles (disks). More disk controllers. CSC 660: Advanced Operating Systems

  9. Ext2 Disk Data Structures CSC 660: Advanced Operating Systems

  10. Block Groups • Each block group contains • Block bitmap • Inode bitmap • Inode blocks • Data blocks • How many blocks in a group? • Bitmaps are only 1 block in size. • Block bitmap can map 8 x blocksize blocks. • 4Kbyte blocks => # data blocks = 32K (128M) CSC 660: Advanced Operating Systems

  11. Inode Table • Consecutive set of inode blocks. • Inodes are 128 bytes in size. • A 4K block contains 32 inodes. • Extended Inode Attributes • Problem: inodes are fixed size, must be 2n • Solution: i_file_acl attribute points to non-inode block containing extended attributes (immutable bit, ACLs) • System calls: setxattr(), getxattr() • ACLs are most common application and have their own system calls: setfacl(), getfacl() CSC 660: Advanced Operating Systems

  12. Disk Block Usage • Regular files • Zero-size files use no blocks. • Symlinks • Pathnames < 60 chars stored in i_block field of inode. • Directories • Format: • Dev/IPC Files • No data blocks CSC 660: Advanced Operating Systems

  13. Creating an ext2 Filesystem • Initializes superblock + group descriptors. • Checks for bad blocks and creates list. • For each block group • Reserves space: super,desc,inodes,bitmaps. • Initializes inode and block bitmaps to zero. • Initializes inode table. • Creates root (/) directory. • Creates lost+found directory for e2fsck. • Updates bitmaps with two directories. • Groups bad blocks in lost+found. CSC 660: Advanced Operating Systems

  14. Managing Disk Space • How to avoid file fragmentation? • If file blocks aren’t contiguous on disk, expensive seeks are required • How to access blocks quickly? • The kernel should be able to quickly convert file offset into a logical block number on disk with few disk accesses. CSC 660: Advanced Operating Systems

  15. Creating Inodes • Allocate VFS inode with new_inode() • If inode is a dir, find a suitable block group • If subdirectory of /, find block group with above average free inodes + free blocks. • If not subdir of /, use block group of parent dir if • Group does not have too many directories. • Group has enough free inodes. • Group has small “debt” value (+dirs, -files) • Else use 1st block group with free inodes > avg. CSC 660: Advanced Operating Systems

  16. Creating Inodes • If new inode not a directory • Logarithmic search for free inodes starting with block group of parent directory. • Ex: searches i % n, (i+1)%n, (i+1+2)%n • If log search fails, perform linear search. • Reads bitmap of selected block group. • Searches for 1st unused bit to get inode #. • Allocates disk inode. • Sets bit, marks inode bitmap block dirty. • Sets inode fields and writes to disk. CSC 660: Advanced Operating Systems

  17. Data Block Addressing • Block Numbers • File: relative position of block within file. • File Offset -> File Block: (int) (offset / blocksize). • Logical: position within disk partition. • File Block -> Logical Block: use inode to translate. • Inodes • Direct blocks: 12 logical block numbers (48K) • Indirect: points to block of block #s (4M) • Double-indirect (4G) • Triple-indirect (4T) CSC 660: Advanced Operating Systems

  18. Inode Block Addressing CSC 660: Advanced Operating Systems

  19. File Holes • Portion of file not stored on disk. • Can contain only null bytes. • echo –n “x” | dd of=/tmp/hole bs=1024 seek=6 • Used by databases and similar hashing apps. • How big is a file with a hole? • i_size includes null bytes in hole. • i_blocks stores data blocks actually used. CSC 660: Advanced Operating Systems

  20. File Holes CSC 660: Advanced Operating Systems

  21. Allocating Data Blocks • Goal parameter • Preferred logical block number. • If prev 2 blocks consecutive, goal = prev+1 • Else if at least 1 block alloc, goal = prev • Else goal = first logical block of group • ext2_alloc_block() • If goal block pre-allocated to file, allocates. • Else, discards remaining pre-allocated and invokes ext2_new_block(). CSC 660: Advanced Operating Systems

  22. ext2_new_block() • If goal is free, allocates. • Otherwise checks for nearby free blocks. • If no nearby free blocks, checks all groups • Starts with block group of goal block. • Searches for group of 8+ adjacent free blocks. • If no such group, looks for single free block. • Will allocate up to 8 adjacent free blocks. CSC 660: Advanced Operating Systems

  23. e2fsck • Checks validity of all inodes. Is file mode valid? Are blocks valid? Are blocks used by multiple inodes? • Checks validity of all directories. Valid format? Do all entries refer to inodes from 1? • Checks directory connectivity. Is there a path from / to each directory? • Checks inode reference counts. Compares link counts with values calculated in 1+2. Moves undeleted 0 link count files to /lost+found. • Checks filesystem summary validity. Do on-disk inode/block bitmaps match e2fsck ones? CSC 660: Advanced Operating Systems

  24. ext3 = ext2 + journaling • ext3 adds a journal to the filesystem. • Journal (log) does sequential writes. • Just blocks, no inodes, bitmaps, etc. • Kernel thread writes log blocks to ext2 format. • Why? Eliminate need for e2fsck after crash. CSC 660: Advanced Operating Systems

  25. Journaling Perform system call-level changes by: • Write blocks to journal. • Wait for write to be committed to journal. • Write blocks to filesystem. • Discard blocks from journal. CSC 660: Advanced Operating Systems

  26. System Failure Resolution • Failure before journal commit • Ignore missing or incomplete journal blocks. • Change is lost, but filesystem is consistent. • Failure after commit • Journal blocks are written to filesystem. CSC 660: Advanced Operating Systems

  27. Journal Types • Journal • All data and metadata logged to journal. • Safest and slowest ext3 mode. • Ordered (default) • Only metadata changes are logged. • Ensures data blocks written before metadata. • Guarantees writes that enlarge are safe. • Writeback • Only metadata logged, no re-ordered. • Fastest and least safe ext3 mode. CSC 660: Advanced Operating Systems

  28. Stackable Filesystems • Filesystems useful for enhancing OS. • File encryption. • Secure deletion. • Virus detection. • File versioning. • UnionFS. • But, filesystems are difficult to develop. • 10,000+ lines of C code is typical. • Most of which you don’t want to change. CSC 660: Advanced Operating Systems

  29. Stackable Filesystems • Solution #1 • Copy ext3fs + add your code. • Problem: maintenance, keeping up with ext3. • Solution #2 • Add a layer of indirection: stackable filesystems. CSC 660: Advanced Operating Systems

  30. Stackable Filesystems CSC 660: Advanced Operating Systems

  31. File Data API • encode_data • Called by write calls before data sent to lower-level filesystem. • decode_data • Called by read calls after data received from lower-level filesystem. • Arguments • I/O blocks: cannot change size. • File attributes, user credentials. CSC 660: Advanced Operating Systems

  32. Filename API • encode_filename • Modifies filename from user system call that is sent to lower-level filesystem. • decode_filename • Modifies filename from filesystem before returning to user. • Arguments • Filenames: can change length, but no invalid chars • File’s vnode, user credentials. CSC 660: Advanced Operating Systems

  33. File Attributes API • No specific API. • Must modify wrapfs calls directly. CSC 660: Advanced Operating Systems

  34. References • Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel, 3rd edition, O’Reilly, 2005. • Remy Card, Theodore T’so, Stephen Tweedie, “Design and Impementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994. • Robert Love, Linux Kernel Development, 2nd edition, Prentice-Hall, 2005. • Claudia Rodriguez et al, The Linux Kernel Primer, Prentice-Hall, 2005. • Mendel Rosenblum and John K. Osterhout, “The Design and Implementation of a Log-structured Filesystem,” 13th ACM SOSP, 1991. • Andrew S. Tanenbaum, Modern Operating Systems, 3rd edition, Prentice-Hall, 2005. • Erek Zadok and Jason Nieh, “FIST: A Language for Stackable Filesystems,” http://www.filesystems.org/docs/fist-lang/fist.pdf, 2000. CSC 660: Advanced Operating Systems

More Related