This article explains the design and implementation of the Ext2 filesystem, covering disk structures such as block groups, inode tables, and data block addressing. It also discusses disk performance optimization techniques and the trade-off between latency and throughput.
CSC 660: Advanced OS Filesystem Implementation CSC 660: Advanced Operating Systems
Topics
• Disks
• Ext2 Filesystem Layout
• Inode Allocation
• Block Addressing
• Block Allocation
• e2fsck
• Journaling
• Stackable Filesystems
Filesystem Layering
Hard Drive Components
• Actuator
  • Moves arm across disk to read/write data.
  • Arm has multiple read/write heads (often 2 per platter).
• Platters
  • Rigid substrate material + magnetic coating.
  • Divided into many concentric tracks.
• Spindle Motor
  • Spins platters at 3,600-15,000 rpm.
  • Speed determines rotational latency.
• Cache
  • 2-16 MB of cache memory.
  • Reliability: write-back vs. write-through.
Disk Information: hdparm

  # hdparm -i /dev/hde

  /dev/hde:

   Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667
   Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
   RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
   BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
   CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
   IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
   PIO modes:  pio0 pio1 pio2 pio3 pio4
   DMA modes:  mdma0 mdma1 mdma2
   UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
   AdvancedPM=no WriteCache=enabled
   Drive conforms to: device does not report version:

   * signifies the current active mode
Disk Performance
• Seek Time
  • Time to move head to desired track (3-8 ms).
• Rotational Delay
  • Time until desired block rotates under the head (up to 8.3 ms, about 4.2 ms on average, at 7200 rpm).
• Latency
  • Seek Time + Rotational Delay
• Throughput
  • Data transfer rate (20-80 MB/s).
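The latency figures above follow from simple arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes the average rotational delay is half of one full rotation.

```python
def rotation_time_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_delay_ms(rpm):
    """On average the desired block is half a rotation away (assumption)."""
    return rotation_time_ms(rpm) / 2


def avg_latency_ms(seek_ms, rpm):
    """Latency = seek time + rotational delay, as defined on the slide."""
    return seek_ms + avg_rotational_delay_ms(rpm)


if __name__ == "__main__":
    for rpm in (3600, 7200, 15000):
        print(rpm, "rpm:",
              round(rotation_time_ms(rpm), 2), "ms per rotation,",
              round(avg_latency_ms(5.0, rpm), 2), "ms latency with a 5 ms seek")
```

With a mid-range 5 ms seek, a 7200 rpm drive spends roughly as long waiting on rotation as it does seeking, which is why random-access loads are latency-bound.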
Latency vs. Throughput
• Which is more important?
  • Depends on the type of load.
  • Sequential access: throughput
    • Multimedia on a single-user PC
  • Random access: latency
    • Most servers
• How to improve performance
  • Faster disks
  • Caching
  • More spindles (disks)
  • More disk controllers
Ext2 Disk Data Structures
Block Groups
• Each block group contains
  • Block bitmap
  • Inode bitmap
  • Inode blocks
  • Data blocks
• How many blocks in a group?
  • Bitmaps are only 1 block in size.
  • Block bitmap can map 8 x blocksize blocks.
  • 4 KB blocks => 8 x 4096 = 32K blocks per group (128 MB)
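The group-size arithmetic can be checked for other block sizes with a few lines. This is a sketch with hypothetical helper names; it assumes only what the slide states, namely that the block bitmap occupies exactly one block and each bit maps one block.

```python
def blocks_per_group(block_size):
    """One bitmap block holds 8 bits per byte, one bit per block."""
    return 8 * block_size


def group_size_bytes(block_size):
    """Maximum bytes covered by one block group."""
    return blocks_per_group(block_size) * block_size


if __name__ == "__main__":
    for bs in (1024, 2048, 4096):
        mib = group_size_bytes(bs) // (1024 * 1024)
        print(f"{bs}-byte blocks: {blocks_per_group(bs)} blocks/group, {mib} MiB/group")
```

Because both the bit count and the block size scale with the block size, halving the block size shrinks the group by a factor of four.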
Inode Table
• Consecutive set of inode blocks.
• Inodes are 128 bytes in size.
• A 4K block contains 32 inodes.
• Extended Inode Attributes
  • Problem: inodes are fixed size, which must be a power of 2 (2^n bytes).
  • Solution: i_file_acl attribute points to non-inode block containing extended attributes (immutable bit, ACLs).
  • System calls: setxattr(), getxattr()
  • ACLs are the most common application and have their own utilities: setfacl, getfacl
Disk Block Usage
• Regular files
  • Zero-size files use no blocks.
• Symlinks
  • Pathnames < 60 chars stored in i_block field of inode.
• Directories
  • Format:
• Dev/IPC Files
  • No data blocks
Creating an ext2 Filesystem
• Initializes superblock + group descriptors.
• Checks for bad blocks and creates list.
• For each block group
  • Reserves space for superblock, group descriptors, inode table, and bitmaps.
  • Initializes inode and block bitmaps to zero.
  • Initializes inode table.
• Creates root (/) directory.
• Creates lost+found directory for e2fsck.
• Updates bitmaps with two directories.
• Groups bad blocks in lost+found.
Managing Disk Space
• How to avoid file fragmentation?
  • If file blocks aren’t contiguous on disk, expensive seeks are required.
• How to access blocks quickly?
  • The kernel should be able to quickly convert a file offset into a logical block number on disk with few disk accesses.
Creating Inodes
• Allocate VFS inode with new_inode()
• If inode is a dir, find a suitable block group
  • If subdirectory of /, find block group with above-average free inodes + free blocks.
  • If not subdir of /, use block group of parent dir if
    • Group does not have too many directories.
    • Group has enough free inodes.
    • Group has small “debt” value (+dirs, -files)
  • Else use 1st block group with free inodes > avg.
Creating Inodes
• If new inode not a directory
  • Logarithmic search for free inodes starting with block group of parent directory.
  • Ex: with parent group i and n groups, searches i mod n, (i+1) mod n, (i+1+2) mod n, (i+1+2+4) mod n, ...
  • If log search fails, perform linear search.
• Reads bitmap of selected block group.
• Searches for 1st unused bit to get inode #.
• Allocates disk inode.
• Sets bit, marks inode bitmap block dirty.
• Sets inode fields and writes to disk.
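The search order for a non-directory inode can be sketched as follows. This is a simplified model, not the kernel code: `probe_order` and `find_group` are hypothetical names, a group is treated as usable whenever it has at least one free inode, and per-group free counts are passed in as a plain list.

```python
def probe_order(parent_group, n_groups):
    """Groups visited by the logarithmic search: i, i+1, i+1+2, i+1+2+4, ... (mod n)."""
    order = [parent_group % n_groups]
    group = parent_group
    step = 1
    while step < n_groups:
        group = (group + step) % n_groups
        order.append(group)
        step *= 2
    return order


def find_group(free_inodes, parent_group):
    """Return a group with a free inode, or None if the filesystem is full.

    free_inodes[g] is the number of free inodes in group g (assumed input).
    """
    n = len(free_inodes)
    # Logarithmic search starting at the parent directory's group.
    for g in probe_order(parent_group, n):
        if free_inodes[g] > 0:
            return g
    # Fallback: linear search over every group after the parent's.
    for offset in range(1, n + 1):
        g = (parent_group + offset) % n
        if free_inodes[g] > 0:
            return g
    return None


if __name__ == "__main__":
    print(probe_order(0, 16))
    print(find_group([0, 0, 5, 0, 0, 0, 0, 0], parent_group=5))
```

The doubling step keeps files near their parent directory when there is room, while spreading out quickly when the neighborhood is full.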
Data Block Addressing
• Block Numbers
  • File: relative position of block within file.
    • File Offset -> File Block: (int) (offset / blocksize).
  • Logical: position within disk partition.
    • File Block -> Logical Block: use inode to translate.
• Inodes (sizes below assume 4 KB blocks)
  • Direct blocks: 12 logical block numbers (48K)
  • Indirect: points to block of block #s (4M)
  • Double-indirect (4G)
  • Triple-indirect (4T)
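To make the translation concrete, the sketch below maps a byte offset to the addressing level and the index path the kernel would have to follow. It is an illustration under stated assumptions: 4 KB blocks, 4-byte block numbers, and 12 direct pointers; `classify_block` is a made-up name.

```python
def classify_block(offset, block_size=4096, ptr_size=4, n_direct=12):
    """Return (level, indexes) for the file block containing byte `offset`.

    `indexes` lists the slot to follow at each level of indirection.
    """
    ptrs = block_size // ptr_size          # block numbers per indirect block
    fb = offset // block_size              # file block number
    if fb < n_direct:
        return ("direct", [fb])
    fb -= n_direct
    if fb < ptrs:
        return ("indirect", [fb])
    fb -= ptrs
    if fb < ptrs ** 2:
        return ("double", [fb // ptrs, fb % ptrs])
    fb -= ptrs ** 2
    if fb < ptrs ** 3:
        return ("triple", [fb // ptrs ** 2, (fb // ptrs) % ptrs, fb % ptrs])
    raise ValueError("offset beyond the maximum file size for this layout")


if __name__ == "__main__":
    for off in (0, 48 * 1024, 5 * 1024 * 1024, 5 * 1024 ** 3):
        print(off, classify_block(off))
```

Small files need no extra disk reads beyond the inode; each further level of indirection costs one more block read per lookup.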
Inode Block Addressing
File Holes
• Portion of file not stored on disk.
• Can contain only null bytes.
• echo -n "x" | dd of=/tmp/hole bs=1024 seek=6
• Used by databases and similar hashing apps.
• How big is a file with a hole?
  • i_size includes null bytes in hole.
  • i_blocks stores data blocks actually used.
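The same experiment can be run from Python by seeking past the end of an empty file before writing. This is a minimal sketch: `make_file_with_hole` is a hypothetical helper, and whether the skipped range is really left unallocated depends on the underlying filesystem, so the allocated-block count is reported but not assumed.

```python
import os
import tempfile


def make_file_with_hole(path, hole_bytes=6 * 1024):
    """Seek past `hole_bytes` of nothing, write one byte, and report sizes.

    Returns (st_size, st_blocks); st_blocks is None where the platform
    does not expose it.
    """
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # nothing is written for the skipped range
        f.write(b"x")
    st = os.stat(path)
    return st.st_size, getattr(st, "st_blocks", None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, blocks = make_file_with_hole(os.path.join(d, "hole"))
        print("logical size:", size, "bytes; allocated 512-byte units:", blocks)
```

The logical size corresponds to i_size and counts the hole; the allocated count corresponds to i_blocks and, on a filesystem that supports holes, covers only the data actually written.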
Allocating Data Blocks
• Goal parameter
  • Preferred logical block number.
  • If prev 2 blocks consecutive, goal = prev+1
  • Else if at least 1 block alloc, goal = prev
  • Else goal = first logical block of group
• ext2_alloc_block()
  • If goal block pre-allocated to file, allocates it.
  • Else, discards remaining pre-allocated blocks and invokes ext2_new_block().
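The three goal rules can be written down directly. The sketch follows the slide's wording rather than the kernel source: `choose_goal` is a hypothetical function, and the file's previously allocated logical block numbers are assumed to be available as a list in allocation order.

```python
def choose_goal(allocated, group_first_block):
    """Pick the preferred logical block number for the next allocation.

    allocated: logical block numbers already given to the file, oldest first.
    group_first_block: first logical block of the file's block group.
    """
    # Rule 1: the previous two blocks are consecutive, so try to extend the run.
    if len(allocated) >= 2 and allocated[-1] == allocated[-2] + 1:
        return allocated[-1] + 1
    # Rule 2: at least one block exists, so aim near the most recent one.
    if allocated:
        return allocated[-1]
    # Rule 3: nothing allocated yet, so start at the beginning of the group.
    return group_first_block


if __name__ == "__main__":
    print(choose_goal([100, 101], 8193))
    print(choose_goal([100, 200], 8193))
    print(choose_goal([], 8193))
```

The goal is only a hint: the allocator is free to return a different block when the goal is taken.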
ext2_new_block()
• If goal is free, allocates.
• Otherwise checks for nearby free blocks.
• If no nearby free blocks, checks all groups
  • Starts with block group of goal block.
  • Searches for group of 8+ adjacent free blocks.
  • If no such group, looks for single free block.
• Will allocate up to 8 adjacent free blocks.
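The search order above can be modeled on a single in-memory bitmap. This is a loose sketch, not the kernel routine: it collapses all block groups into one list of booleans, the `nearby` window size is an arbitrary assumption, and the function names are invented for the example.

```python
def find_free_run(bitmap, run):
    """Index of the first stretch of `run` adjacent free blocks, or None."""
    count = 0
    for i, used in enumerate(bitmap):
        count = 0 if used else count + 1
        if count == run:
            return i - run + 1
    return None


def new_block(bitmap, goal, nearby=64, run=8):
    """Allocate one block, preferring `goal`; returns its index or None.

    bitmap[i] is True when block i is in use.
    """
    pick = None
    if not bitmap[goal]:
        pick = goal                                  # goal itself is free
    else:
        # Look just after the goal (window size is an assumption).
        for b in range(goal + 1, min(goal + nearby, len(bitmap))):
            if not bitmap[b]:
                pick = b
                break
        if pick is None:
            pick = find_free_run(bitmap, run)        # 8+ adjacent free blocks
        if pick is None:
            pick = find_free_run(bitmap, 1)          # any single free block
    if pick is not None:
        bitmap[pick] = True                          # mark allocated
    return pick


if __name__ == "__main__":
    bm = [True] * 64
    bm[40:48] = [False] * 8
    print(new_block(bm, goal=20, nearby=4))
```

Preferring a run of eight free blocks leaves room for the pre-allocation mentioned in the last bullet, so a growing file tends to stay contiguous.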
e2fsck
• Pass 1: Checks validity of all inodes.
  • Is file mode valid? Are blocks valid? Are blocks used by multiple inodes?
• Pass 2: Checks validity of all directories.
  • Valid format? Do all entries refer to inodes found in pass 1?
• Pass 3: Checks directory connectivity.
  • Is there a path from / to each directory?
• Pass 4: Checks inode reference counts.
  • Compares link counts with values calculated in passes 1 and 2.
  • Moves undeleted files with a link count of 0 to /lost+found.
• Pass 5: Checks filesystem summary validity.
  • Do on-disk inode/block bitmaps match the ones e2fsck computed?
ext3 = ext2 + journaling
• ext3 adds a journal to the filesystem.
• Journal (log) is written sequentially.
• Journal holds just blocks, with no inodes, bitmaps, etc. of its own.
• A kernel thread later writes the logged blocks to their ext2 locations.
• Why? Eliminates the need for a full e2fsck after a crash.
Journaling
Perform system call-level changes by:
• Write blocks to journal.
• Wait for write to be committed to journal.
• Write blocks to filesystem.
• Discard blocks from journal.
System Failure Resolution
• Failure before journal commit
  • Ignore missing or incomplete journal blocks.
  • Change is lost, but filesystem is consistent.
• Failure after commit
  • Journal blocks are written to filesystem.
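The write-ahead steps and the two recovery cases can be simulated with a toy in-memory model. Everything here is hypothetical (the `ToyJournal` class, its method names, and the dictionary standing in for the disk); it sketches the protocol only and does not reflect the ext3 on-disk journal format.

```python
class ToyJournal:
    """In-memory model: `disk` is the filesystem, `journal` is the log."""

    def __init__(self):
        self.disk = {}       # block number -> data
        self.journal = []    # pending transactions, oldest first

    def log(self, blocks):
        """Step 1: write the changed blocks to the journal."""
        txn = {"blocks": dict(blocks), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Step 2: the journal write is complete and durable."""
        txn["committed"] = True

    def checkpoint(self, txn):
        """Steps 3 and 4: copy blocks to the filesystem, then discard them."""
        self.disk.update(txn["blocks"])
        self.journal = [t for t in self.journal if t is not txn]

    def recover(self):
        """After a crash: replay committed transactions, ignore the rest."""
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["blocks"])
        self.journal = []


if __name__ == "__main__":
    j = ToyJournal()
    j.log({7: b"lost"})                    # crash before commit
    j.commit(j.log({9: b"kept"}))          # crash after commit
    j.recover()
    print(j.disk)
```

In both cases the filesystem ends up consistent: the uncommitted change simply never happened, and the committed one is replayed in full.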
Journal Types
• Journal
  • All data and metadata logged to journal.
  • Safest and slowest ext3 mode.
• Ordered (default)
  • Only metadata changes are logged.
  • Ensures data blocks are written before metadata.
  • Guarantees that writes that enlarge a file are safe.
• Writeback
  • Only metadata logged; no ordering of data writes is enforced.
  • Fastest and least safe ext3 mode.
Stackable Filesystems
• Filesystems useful for enhancing OS.
  • File encryption.
  • Secure deletion.
  • Virus detection.
  • File versioning.
  • UnionFS.
• But, filesystems are difficult to develop.
  • 10,000+ lines of C code is typical.
  • Most of which you don’t want to change.
Stackable Filesystems
• Solution #1
  • Copy ext3fs + add your code.
  • Problem: maintenance, keeping up with ext3.
• Solution #2
  • Add a layer of indirection: stackable filesystems.
File Data API
• encode_data
  • Called by write calls before data sent to lower-level filesystem.
• decode_data
  • Called by read calls after data received from lower-level filesystem.
• Arguments
  • I/O blocks: cannot change size.
  • File attributes, user credentials.
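The constraint that I/O blocks cannot change size is easiest to see with a toy transform. The sketch below is a Python stand-in, not the real stackable-filesystem API, which is written in C: the signatures are simplified assumptions that omit the attribute and credential arguments, and a single-byte XOR is used only because it preserves length. It is not real encryption.

```python
KEY = 0x5A  # hypothetical per-filesystem key byte, for illustration only


def encode_data(block: bytes) -> bytes:
    """Applied on the write path before data reaches the lower filesystem."""
    return bytes(b ^ KEY for b in block)


def decode_data(block: bytes) -> bytes:
    """Applied on the read path after data comes back from the lower filesystem."""
    return bytes(b ^ KEY for b in block)


if __name__ == "__main__":
    plain = b"hello, lower filesystem"
    stored = encode_data(plain)
    print(len(plain) == len(stored), decode_data(stored) == plain)
```

Any transform placed in these hooks has to round-trip exactly and keep the block length, which rules out compression unless the layer handles size changes some other way.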
Filename API
• encode_filename
  • Modifies filename from user system call that is sent to lower-level filesystem.
• decode_filename
  • Modifies filename from filesystem before returning to user.
• Arguments
  • Filenames: can change length, but no invalid chars.
  • File’s vnode, user credentials.
File Attributes API
• No specific API.
• Must modify wrapfs calls directly.