340 likes | 500 Views
CSC 660: Advanced OS. Filesystem Implementation. Topics. Disks Ext2 Filesystem Layout Inode Allocation Block Addressing Block Allocation e2fsck Journaling Stackable Filesystems. Filesystem Layering. Hard Drive Components. Hard Drive Components. Actuator
E N D
CSC 660: Advanced OS Filesystem Implementation CSC 660: Advanced Operating Systems
Topics • Disks • Ext2 Filesystem Layout • Inode Allocation • Block Addressing • Block Allocation • e2fsck • Journaling • Stackable Filesystems CSC 660: Advanced Operating Systems
Filesystem Layering CSC 660: Advanced Operating Systems
Hard Drive Components CSC 660: Advanced Operating Systems
Hard Drive Components Actuator Moves arm across disk to read/write data. Arm has multiple read/write heads (often 2/platter.) Platters Rigid substrate material + magnetic coating. Divided into many concentric tracks. Spindle Motor Spins platters from 3600-15,000 rpm. Speed determines disk latency. Cache 2-16MB of cache memory Reliability: write-back vs. write-through CSC 660: Advanced Operating Systems
Disk Information: hdparm # hdparm -i /dev/hde /dev/hde: Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq } RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40 BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=no WriteCache=enabled Drive conforms to: device does not report version: * signifies the current active mode CSC 660: Advanced Operating Systems
Disk Performance Seek Time Time to move head to desired track (3-8 ms) Rotational Delay Time until head over desired block (8ms for 7200) Latency Seek Time + Rotational Delay Throughput Data transfer rate (20-80 MB/s) CSC 660: Advanced Operating Systems
Latency vs. Throughput Which is more important? Depends on the type of load. Sequential access – Throughput Multimedia on a single user PC Random access – Latency Most servers How to improve performance Faster disks Caching More spindles (disks). More disk controllers. CSC 660: Advanced Operating Systems
Ext2 Disk Data Structures CSC 660: Advanced Operating Systems
Block Groups • Each block group contains • Block bitmap • Inode bitmap • Inode blocks • Data blocks • How many blocks in a group? • Bitmaps are only 1 block in size. • Block bitmap can map 8 x blocksize blocks. • 4Kbyte blocks => # data blocks = 32K (128M) CSC 660: Advanced Operating Systems
Inode Table • Consecutive set of inode blocks. • Inodes are 128 bytes in size. • A 4K block contains 32 inodes. • Extended Inode Attributes • Problem: inodes are fixed size, must be 2n • Solution: i_file_acl attribute points to non-inode block containing extended attributes (immutable bit, ACLs) • System calls: setxattr(), getxattr() • ACLs are most common application and have their own system calls: setfacl(), getfacl() CSC 660: Advanced Operating Systems
Disk Block Usage • Regular files • Zero-size files use no blocks. • Symlinks • Pathnames < 60 chars stored in i_block field of inode. • Directories • Format: • Dev/IPC Files • No data blocks CSC 660: Advanced Operating Systems
Creating an ext2 Filesystem • Initializes superblock + group descriptors. • Checks for bad blocks and creates list. • For each block group • Reserves space: super,desc,inodes,bitmaps. • Initializes inode and block bitmaps to zero. • Initializes inode table. • Creates root (/) directory. • Creates lost+found directory for e2fsck. • Updates bitmaps with two directories. • Groups bad blocks in lost+found. CSC 660: Advanced Operating Systems
Managing Disk Space • How to avoid file fragmentation? • If file blocks aren’t contiguous on disk, expensive seeks are required • How to access blocks quickly? • The kernel should be able to quickly convert file offset into a logical block number on disk with few disk accesses. CSC 660: Advanced Operating Systems
Creating Inodes • Allocate VFS inode with new_inode() • If inode is a dir, find a suitable block group • If subdirectory of /, find block group with above average free inodes + free blocks. • If not subdir of /, use block group of parent dir if • Group does not have too many directories. • Group has enough free inodes. • Group has small “debt” value (+dirs, -files) • Else use 1st block group with free inodes > avg. CSC 660: Advanced Operating Systems
Creating Inodes • If new inode not a directory • Logarithmic search for free inodes starting with block group of parent directory. • Ex: searches i % n, (i+1)%n, (i+1+2)%n • If log search fails, perform linear search. • Reads bitmap of selected block group. • Searches for 1st unused bit to get inode #. • Allocates disk inode. • Sets bit, marks inode bitmap block dirty. • Sets inode fields and writes to disk. CSC 660: Advanced Operating Systems
Data Block Addressing • Block Numbers • File: relative position of block within file. • File Offset -> File Block: (int) (offset / blocksize). • Logical: position within disk partition. • File Block -> Logical Block: use inode to translate. • Inodes • Direct blocks: 12 logical block numbers (48K) • Indirect: points to block of block #s (4M) • Double-indirect (4G) • Triple-indirect (4T) CSC 660: Advanced Operating Systems
Inode Block Addressing CSC 660: Advanced Operating Systems
File Holes • Portion of file not stored on disk. • Can contain only null bytes. • echo –n “x” | dd of=/tmp/hole bs=1024 seek=6 • Used by databases and similar hashing apps. • How big is a file with a hole? • i_size includes null bytes in hole. • i_blocks stores data blocks actually used. CSC 660: Advanced Operating Systems
File Holes CSC 660: Advanced Operating Systems
Allocating Data Blocks • Goal parameter • Preferred logical block number. • If prev 2 blocks consecutive, goal = prev+1 • Else if at least 1 block alloc, goal = prev • Else goal = first logical block of group • ext2_alloc_block() • If goal block pre-allocated to file, allocates. • Else, discards remaining pre-allocated and invokes ext2_new_block(). CSC 660: Advanced Operating Systems
ext2_new_block() • If goal is free, allocates. • Otherwise checks for nearby free blocks. • If no nearby free blocks, checks all groups • Starts with block group of goal block. • Searches for group of 8+ adjacent free blocks. • If no such group, looks for single free block. • Will allocate up to 8 adjacent free blocks. CSC 660: Advanced Operating Systems
e2fsck • Checks validity of all inodes. Is file mode valid? Are blocks valid? Are blocks used by multiple inodes? • Checks validity of all directories. Valid format? Do all entries refer to inodes from 1? • Checks directory connectivity. Is there a path from / to each directory? • Checks inode reference counts. Compares link counts with values calculated in 1+2. Moves undeleted 0 link count files to /lost+found. • Checks filesystem summary validity. Do on-disk inode/block bitmaps match e2fsck ones? CSC 660: Advanced Operating Systems
ext3 = ext2 + journaling • ext3 adds a journal to the filesystem. • Journal (log) does sequential writes. • Just blocks, no inodes, bitmaps, etc. • Kernel thread writes log blocks to ext2 format. • Why? Eliminate need for e2fsck after crash. CSC 660: Advanced Operating Systems
Journaling Perform system call-level changes by: • Write blocks to journal. • Wait for write to be committed to journal. • Write blocks to filesystem. • Discard blocks from journal. CSC 660: Advanced Operating Systems
System Failure Resolution • Failure before journal commit • Ignore missing or incomplete journal blocks. • Change is lost, but filesystem is consistent. • Failure after commit • Journal blocks are written to filesystem. CSC 660: Advanced Operating Systems
Journal Types • Journal • All data and metadata logged to journal. • Safest and slowest ext3 mode. • Ordered (default) • Only metadata changes are logged. • Ensures data blocks written before metadata. • Guarantees writes that enlarge are safe. • Writeback • Only metadata logged, no re-ordered. • Fastest and least safe ext3 mode. CSC 660: Advanced Operating Systems
Stackable Filesystems • Filesystems useful for enhancing OS. • File encryption. • Secure deletion. • Virus detection. • File versioning. • UnionFS. • But, filesystems are difficult to develop. • 10,000+ lines of C code is typical. • Most of which you don’t want to change. CSC 660: Advanced Operating Systems
Stackable Filesystems • Solution #1 • Copy ext3fs + add your code. • Problem: maintenance, keeping up with ext3. • Solution #2 • Add a layer of indirection: stackable filesystems. CSC 660: Advanced Operating Systems
Stackable Filesystems CSC 660: Advanced Operating Systems
File Data API • encode_data • Called by write calls before data sent to lower-level filesystem. • decode_data • Called by read calls after data received from lower-level filesystem. • Arguments • I/O blocks: cannot change size. • File attributes, user credentials. CSC 660: Advanced Operating Systems
Filename API • encode_filename • Modifies filename from user system call that is sent to lower-level filesystem. • decode_filename • Modifies filename from filesystem before returning to user. • Arguments • Filenames: can change length, but no invalid chars • File’s vnode, user credentials. CSC 660: Advanced Operating Systems
File Attributes API • No specific API. • Must modify wrapfs calls directly. CSC 660: Advanced Operating Systems
References • Daniel P. Bovet and Marco Cesati, Understanding the Linux Kernel, 3rd edition, O’Reilly, 2005. • Remy Card, Theodore T’so, Stephen Tweedie, “Design and Impementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994. • Robert Love, Linux Kernel Development, 2nd edition, Prentice-Hall, 2005. • Claudia Rodriguez et al, The Linux Kernel Primer, Prentice-Hall, 2005. • Mendel Rosenblum and John K. Osterhout, “The Design and Implementation of a Log-structured Filesystem,” 13th ACM SOSP, 1991. • Andrew S. Tanenbaum, Modern Operating Systems, 3rd edition, Prentice-Hall, 2005. • Erek Zadok and Jason Nieh, “FIST: A Language for Stackable Filesystems,” http://www.filesystems.org/docs/fist-lang/fist.pdf, 2000. CSC 660: Advanced Operating Systems