CSC 660: Advanced Operating Systems
Filesystem Case Studies
Topics
• Early Filesystems (FS, FFS)
• Journaling Filesystems
• B-Tree Filesystems
• Network Filesystems
• GoogleFS
• Common Problems
Filesystem History
• FS (1974)
• Fast Filesystem (FFS) / UFS (1984)
• Log-structured Filesystem (1991)
• ext2 (1993)
• ext3 (2001)
• WAFL (1994)
• XFS (1994)
• Reiserfs (1998)
• ZFS (2004)
FS
• First UNIX filesystem (1974).
• Simple.
• Layout: superblock, then inodes, then data blocks.
• Unused blocks tracked on a free-block linked list, not a bitmap (contrasted in the sketch below).
• 512-byte blocks, no fragments.
• Short filenames.
• Slow: 2% of raw disk bandwidth.
  • Disk seeks consume most file access time due to the small block size and high fragmentation.
  • Later doubled performance by moving to 1 KB blocks.
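To make the free-list versus bitmap contrast concrete, here is a minimal C sketch, with invented block counts and no real on-disk format: the old FS free list hands out whichever block sits at the head, while an FFS-style bitmap lets the allocator search for a block near a desired location.

/* Minimal sketch (not the historical implementation) contrasting a free-block
 * linked list with a bitmap allocator.  Block counts are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 64                    /* hypothetical tiny disk */

/* Old FS style: free blocks chained together; the superblock caches the head. */
static uint32_t next_free[NBLOCKS];   /* next_free[b] = next free block after b */
static uint32_t free_head = 1;        /* hypothetical head of the free list */

uint32_t alloc_from_list(void)
{
    uint32_t b = free_head;
    if (b != 0)
        free_head = next_free[b];     /* pop the head; order is essentially random */
    return b;                         /* 0 means "no free block" */
}

/* FFS style: one bit per block lets the allocator look for a block
 * near a desired position (e.g., in the same cylinder group). */
static uint8_t bitmap[NBLOCKS / 8];

uint32_t alloc_near(uint32_t want)
{
    for (uint32_t i = 0; i < NBLOCKS; i++) {
        uint32_t b = (want + i) % NBLOCKS;
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            bitmap[b / 8] |= 1u << (b % 8);
            return b;
        }
    }
    return 0;                         /* full */
}

int main(void)
{
    bitmap[0] |= 1;                   /* reserve block 0 so 0 can mean "none" */
    for (uint32_t b = 1; b < NBLOCKS - 1; b++)
        next_free[b] = b + 1;         /* chain all blocks into the free list */
    printf("list alloc: %u, bitmap alloc near 10: %u\n",
           alloc_from_list(), alloc_near(10));
    return 0;
}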
FFS
• BSD (1984); basis for the System V UFS.
• More complex.
• Cylinder groups: inodes, bitmaps, and data blocks (sketched below).
• Larger blocks (4 KB) with 1 KB fragments.
• Block layout based on physical disk parameters.
• Long filenames, symlinks, file locks, quotas.
• 10% of space reserved by default.
• Faster: 14-47% of raw disk bandwidth.
• Creating a new file requires 5 seeks: 2 inode seeks, 1 for file data, 1 for directory data, 1 for the directory inode.
• User/kernel memory copies take 40% of disk operation time.
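An illustrative C sketch of the cylinder-group idea follows; the field names and sizes are invented (not the real BSD struct cg), but they show why keeping an inode, its bitmaps, and nearby data blocks in one group limits seek distance.

/* Illustrative sketch of FFS-style cylinder groups; structures are invented
 * for clarity and do not match the real BSD on-disk format. */
#include <stdint.h>
#include <stdio.h>

#define INODES_PER_GROUP 1024u   /* hypothetical */
#define BLOCKS_PER_GROUP 8192u   /* hypothetical */

struct cyl_group {
    uint32_t cg_index;           /* which cylinder group this is */
    uint32_t free_inodes;        /* summary counts kept per group */
    uint32_t free_blocks;
    uint8_t  inode_bitmap[INODES_PER_GROUP / 8];
    uint8_t  block_bitmap[BLOCKS_PER_GROUP / 8];
    /* ...followed on disk by the inode table and data blocks for this group */
};

/* Placing a file's inode and data in the same group keeps related blocks
 * physically close; directories typically go to groups with many free inodes. */
static uint32_t group_of_inode(uint32_t ino)
{
    return ino / INODES_PER_GROUP;
}

int main(void)
{
    printf("inode 5000 lives in cylinder group %u\n", group_of_inode(5000));
    return 0;
}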
Log-structured Filesystem (LFS)
• All data stored as sequential log entries.
• Disk divided into large log segments.
• A cleaner defragments old segments and produces new ones.
• Fast recovery: checkpoint + roll forward.
• Performance: 70% of raw disk bandwidth.
  • Large sequential writes instead of many small writes and seeks.
• Inode map tracks the dynamic locations of inodes (see the sketch below).
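A minimal C sketch of the log-append path, with toy segment sizes and a one-word stand-in for an inode: every update goes to the tail of the log, and the inode map records where the latest copy of each inode landed.

/* Minimal sketch of the LFS idea; segment sizes and structures are toys. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEG_SIZE 4096u                   /* hypothetical segment size */
#define MAX_INODES 128u

static uint8_t  segment[SEG_SIZE];       /* current log segment in memory */
static uint32_t seg_tail;                /* next free byte in the segment */
static uint32_t inode_map[MAX_INODES];   /* inode number -> offset of latest inode */

/* Append a blob (data block or inode) to the log; return its offset. */
static uint32_t log_append(const void *buf, uint32_t len)
{
    if (seg_tail + len > SEG_SIZE)
        return UINT32_MAX;               /* a real LFS would start a new segment */
    memcpy(segment + seg_tail, buf, len);
    uint32_t off = seg_tail;
    seg_tail += len;
    return off;
}

/* Writing a file block appends the data, then appends a new inode copy,
 * then updates the inode map -- nothing is overwritten in place. */
static void write_block(uint32_t ino, const char *data, uint32_t len)
{
    uint32_t data_off  = log_append(data, len);
    uint32_t inode_off = log_append(&data_off, sizeof data_off); /* toy "inode" */
    inode_map[ino] = inode_off;
}

int main(void)
{
    write_block(7, "hello", 5);
    write_block(7, "hello again", 11);   /* old copies become garbage for the cleaner */
    printf("inode 7 now lives at log offset %u\n", inode_map[7]);
    return 0;
}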
ext2 and ext3
• FFS + performance features.
• Variable block size (1 KB-4 KB), no fragments.
• Partitions the disk into block groups.
• Data block preallocation + read-ahead.
• Fast symlinks (target stored in the inode; sketched below).
• 5% of space reserved by default.
• Very fast.
• ext3 adds journaling capabilities.
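The fast-symlink trick is easy to show in a sketch: ext2 reuses the space of the inode's 15 block pointers (60 bytes) to hold a short target string inline, so no data block is needed. The struct below is a simplified stand-in, not the real ext2_inode.

/* Sketch of the ext2 "fast symlink" idea; toy inode, not the real layout. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EXT2_N_BLOCKS 15

struct toy_inode {
    uint32_t i_size;
    uint32_t i_block[EXT2_N_BLOCKS];  /* block pointers OR inline symlink target */
};

static int make_symlink(struct toy_inode *ino, const char *target)
{
    size_t len = strlen(target);
    if (len < sizeof ino->i_block) {              /* fast symlink: fits inline */
        memcpy(ino->i_block, target, len + 1);
        ino->i_size = (uint32_t)len;
        return 0;
    }
    /* Otherwise a real filesystem would allocate a data block for the target. */
    return -1;
}

int main(void)
{
    struct toy_inode ino = {0};
    if (make_symlink(&ino, "/usr/local/bin/python") == 0)
        printf("fast symlink target: %s\n", (char *)ino.i_block);
    return 0;
}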
WAFL
• Network Appliance (1994).
• Metadata kept in files.
• Root inode points to the inode file.
• The filesystem is a tree of blocks reached through the inode file.
• Freedom to write metadata anywhere makes writes faster with RAID.
• Allows the filesystem to be expanded on the fly.
WAFL Copy-on-Write Snapshots
• Hourly (4/day, kept 2 days) and daily (kept 7 days).
• Users can recover deleted files from .snapshot directories.
• Snapshots are created by just copying the root inode (sketched below).
• Creates a consistency-point snapshot every few seconds.
• Writes go only to unused blocks between consistency points.
• Recovery = last consistency point + replay of the NVRAM log.
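A minimal copy-on-write sketch of the snapshot-by-copying-the-root idea, with invented structures and a one-level tree: the snapshot keeps seeing the old blocks because updates never overwrite them.

/* Minimal COW sketch: the filesystem is a tree of blocks reached from a root,
 * a snapshot is just another copy of that root, and updates write new blocks
 * instead of modifying old ones.  All structures are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct blk {
    char data[32];
};

struct root {
    struct blk *child;     /* a real tree has many levels; one is enough here */
};

/* Updating a block under COW: allocate a new block, leave the old one alone,
 * and repoint the (active) root at the new copy. */
static void cow_write(struct root *active, const char *newdata)
{
    struct blk *nb = malloc(sizeof *nb);
    snprintf(nb->data, sizeof nb->data, "%s", newdata);
    active->child = nb;    /* old block stays reachable from any snapshot root */
}

int main(void)
{
    struct blk *orig = malloc(sizeof *orig);
    strcpy(orig->data, "version 1");

    struct root active   = { orig };
    struct root snapshot = active;     /* "taking a snapshot" = copying the root */

    cow_write(&active, "version 2");

    printf("active:   %s\n", active.child->data);    /* version 2 */
    printf("snapshot: %s\n", snapshot.child->data);  /* still version 1 */
    return 0;
}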
XFS
• SGI (1994).
• Complex.
• Uses B+ trees to track free space, index directories, and locate file blocks and inodes (extent lookup sketched below).
• Dynamic inode allocation, metadata journaling, volume manager, multithreaded, allocate-on-flush.
• 64-bit filesystem (filesystems up to 2^63 bytes).
• Fast: 90-95% of raw disk bandwidth.
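The extent mapping that XFS keeps in its B+ trees can be sketched with a sorted array standing in for one tree leaf; the extent values below are hypothetical.

/* Sketch of extent-based block mapping: (file offset, disk block, length)
 * records translate a file block to a disk block.  A sorted array stands in
 * for a B+ tree leaf here. */
#include <stdint.h>
#include <stdio.h>

struct extent {
    uint64_t file_off;   /* starting file block */
    uint64_t disk_blk;   /* starting disk block */
    uint32_t len;        /* number of contiguous blocks */
};

/* Hypothetical mapping: three extents covering file blocks 0..99. */
static const struct extent map[] = {
    { 0,  1000, 20 },
    { 20, 5000, 50 },
    { 70, 9000, 30 },
};

/* Translate a file block to a disk block, or return 0 for a hole. */
static uint64_t bmap(uint64_t fblk)
{
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++)
        if (fblk >= map[i].file_off && fblk < map[i].file_off + map[i].len)
            return map[i].disk_blk + (fblk - map[i].file_off);
    return 0;
}

int main(void)
{
    printf("file block 25 -> disk block %llu\n",
           (unsigned long long)bmap(25));   /* 5000 + 5 = 5005 */
    return 0;
}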
Reiserfs
• Multiple versions (v1-v4).
• Complex.
• Uses B+ trees (v3) or dancing trees (v4).
• Journaling, allocate-on-flush, COW, tail-packing.
• High performance with small files and large directories.
• Second only to ext2 in performance (v3).
ZFS
• Sun (2004).
• Complex.
• Variable block size + compression.
• Built-in volume manager (striping, pooling).
• Self-healing with 64-bit checksums + mirroring (sketched below).
• COW transactional model (live data is never overwritten).
• Fast snapshots (just don't release old blocks).
• 128-bit filesystem.
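A toy C sketch of the self-healing idea: the block pointer carries a checksum of the block it references, so a corrupted mirror copy is detected on read and rewritten from the good one. The checksum and structures here are stand-ins, not ZFS's actual on-disk format or checksum algorithms.

/* Toy sketch: a block pointer stores a checksum of the block it points to,
 * so corruption is detected on read and repaired from a mirror copy. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct block { uint8_t data[16]; };

struct blkptr {
    struct block *copy[2];   /* two mirrored copies of the block */
    uint64_t      cksum;     /* checksum of the block contents */
};

static uint64_t checksum(const struct block *b)
{
    uint64_t s = 0;
    for (size_t i = 0; i < sizeof b->data; i++)
        s = s * 31 + b->data[i];       /* toy checksum, not ZFS's real one */
    return s;
}

/* Read through the pointer: verify each copy, heal a bad copy from a good one. */
static const struct block *read_block(struct blkptr *bp)
{
    for (int i = 0; i < 2; i++) {
        if (checksum(bp->copy[i]) == bp->cksum) {
            *bp->copy[1 - i] = *bp->copy[i];   /* self-heal the other mirror */
            return bp->copy[i];
        }
    }
    return NULL;                               /* both copies bad */
}

int main(void)
{
    struct block a = { "good data" }, b = a;
    struct blkptr bp = { { &a, &b }, 0 };
    bp.cksum = checksum(&a);

    b.data[0] = 'X';                           /* silently corrupt one mirror */
    const struct block *ok = read_block(&bp);
    printf("read: %s, mirrors agree: %d\n",
           ok ? (const char *)ok->data : "(lost)",
           memcmp(a.data, b.data, sizeof a.data) == 0);
    return 0;
}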
Network Filesystems
• Idea: use a filesystem interface to transparently share files between computers.
• Solution:
  • Client mounts the network filesystem like any other filesystem.
  • Client filesystem code sends requests to the server(s).
  • Server responds with data stored on a regular on-disk filesystem.
NFS
• Sun.
  • v2 (1984).
  • v3 (1992): TCP + 64-bit sizes.
• Implementation
  • File operations are mapped onto Sun RPC calls.
  • Stateless: the client obtains a root filehandle at mount time, then passes filehandles in subsequent requests (request pattern sketched below).
  • UNIX-centric (UIDs, GIDs, permissions).
  • Server authenticates clients by IP address.
  • Client UIDs are mapped to the server with root squashing.
  • Danger: a root user on the client can su to any desired UID.
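A sketch of the stateless request pattern, with made-up structs and function names rather than the real NFS protocol or any library API: every call carries an opaque filehandle plus all the arguments the server needs, so the server keeps no per-client open-file state.

/* Stand-in for the NFS request pattern; not a real RPC implementation. */
#include <stdint.h>
#include <stdio.h>

struct fhandle {                 /* opaque to the client */
    uint8_t data[32];
};

/* Pretend RPCs -- a real client would marshal these over UDP/TCP with XDR. */
static struct fhandle nfs_lookup(struct fhandle dir, const char *name)
{
    struct fhandle fh = {0};
    snprintf((char *)fh.data, sizeof fh.data, "%s", name);  /* toy handle */
    (void)dir;
    return fh;
}

static int nfs_read(struct fhandle fh, uint64_t offset, uint32_t count,
                    char *buf, uint32_t buflen)
{
    /* A real server resolves fh to an inode, reads, and returns the data.
     * Because offset and count ride in every request, a crashed-and-rebooted
     * server can keep serving without remembering anything about this client. */
    return snprintf(buf, buflen, "[%u bytes of %s at %llu]",
                    count, (const char *)fh.data, (unsigned long long)offset);
}

int main(void)
{
    struct fhandle root = {{ 'r', 'o', 'o', 't' }};
    struct fhandle fh = nfs_lookup(root, "passwd");
    char buf[64];
    nfs_read(fh, 0, 128, buf, sizeof buf);
    printf("%s\n", buf);
    return 0;
}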
CIFS
• Microsoft (1998).
• Derived from IBM's 1980s SMB network filesystem.
• Implementation
  • Originally ran over NetBIOS, not TCP/IP.
  • Universal Naming Convention paths: \\svr\share\path.
  • Authentication: NTLM (insecure), NTLMv2, Kerberos.
  • MS Windows-centric (filenames, ACLs, line endings).
AFS
• CMU (1988).
• Implementation
  • Distributed filesystem: merges the filesystems of multiple servers.
  • Cells are administrative domains within AFS.
  • Cells contain multiple servers.
  • Each server provides multiple volumes.
• Global namespace: /afs/abc.com.
• Security: Kerberos + ACLs.
• Better caching via callbacks from the server.
• Volume replication with read-only copies on other servers.
NFSv4
• IETF (2000), based on a 1998 Sun draft.
• New features:
  • Only one protocol.
  • Global namespace.
  • Security (ACLs, Kerberos, encryption).
  • Cross-platform + internationalized.
  • Better caching via delegation of files to clients.
GoogleFS Assumptions
• High rate of commodity hardware failures.
• Small number of huge files (multi-GB and larger).
• Reads: large streaming reads + small random reads.
• Most modifications are appends.
• High bandwidth matters more than low latency.
• Applications and filesystem are co-designed.
GoogleFS Architecture
• Master server
  • Metadata: namespace, ACLs, file-to-chunk mapping (sketched below).
  • Chunk lease management, garbage collection, chunk migration.
• Chunk servers
  • Serve chunks (64 MB + checksum) of files.
  • Chunks replicated on multiple (typically 3) servers.
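A sketch of the master's metadata in C, with hypothetical chunkserver names and handles: a byte offset maps to a chunk index, the index maps to a chunk handle, and the handle maps to the replicas currently holding that chunk; the client then talks to the chunkservers directly.

/* Sketch of GFS-style master metadata for one file; values are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (64ull * 1024 * 1024)   /* 64 MB chunks */
#define NREPLICAS  3

struct chunk_info {
    uint64_t handle;                   /* globally unique chunk id */
    const char *replica[NREPLICAS];    /* chunkservers holding this chunk */
};

/* Hypothetical metadata for one file, "/logs/web.0". */
static const struct chunk_info file_chunks[] = {
    { 0x1001, { "cs12", "cs47", "cs88" } },
    { 0x1002, { "cs03", "cs47", "cs59" } },
};

/* A client turns a byte offset into a chunk index and asks the master
 * for the chunk handle and replica locations. */
static const struct chunk_info *lookup(uint64_t offset)
{
    uint64_t idx = offset / CHUNK_SIZE;
    if (idx >= sizeof file_chunks / sizeof file_chunks[0])
        return NULL;
    return &file_chunks[idx];
}

int main(void)
{
    const struct chunk_info *c = lookup(100ull * 1024 * 1024); /* offset 100 MB */
    if (c)
        printf("chunk %#llx on %s, %s, %s\n", (unsigned long long)c->handle,
               c->replica[0], c->replica[1], c->replica[2]);
    return 0;
}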
GoogleFS Writing
• Client asks the master which chunkserver holds the lease.
• Master responds with the leaseholder (primary) + replica locations.
• Client pushes data to all replicas.
• Client sends the write request to the primary replica.
• Primary forwards the request to the secondaries.
• Secondaries reply to the primary on completion.
• Primary replies to the client.
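The write sequence above can be traced step by step; the sketch below reduces each step to a print statement, with hypothetical server names. One detail from the GFS paper worth calling out: the primary assigns a serial order to concurrent writes before forwarding them.

/* Step-by-step trace of the write path described above; servers are invented. */
#include <stdio.h>

int main(void)
{
    const char *primary = "cs47";                  /* hypothetical leaseholder */
    const char *secondaries[] = { "cs12", "cs88" };

    printf("1. client -> master: who holds the lease for chunk 0x1001?\n");
    printf("2. master -> client: primary=%s, replicas=%s,%s\n",
           primary, secondaries[0], secondaries[1]);
    printf("3. client pushes data to %s, %s, %s (data flow)\n",
           primary, secondaries[0], secondaries[1]);
    printf("4. client -> %s: write request (control flow)\n", primary);
    printf("5. %s assigns a serial order and forwards the write to %s and %s\n",
           primary, secondaries[0], secondaries[1]);
    printf("6. %s, %s -> %s: done\n", secondaries[0], secondaries[1], primary);
    printf("7. %s -> client: success (or which replicas failed)\n", primary);
    return 0;
}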
Common Problems
• Consistency after a crash.
• Large contiguous allocations.
• Metadata allocation.
Consistency
• Detect + repair
  • Use fsck to repair.
  • Journal replay (sketched below).
• Always consistent
  • Copy-on-write.
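A minimal write-ahead journaling sketch, with invented record formats: metadata updates are logged together with a commit record before touching their home locations, so recovery simply replays committed transactions and discards the rest.

/* Toy write-ahead journal and replay; record formats are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define JOURNAL_SLOTS 16

struct jrec {
    enum { J_EMPTY, J_UPDATE, J_COMMIT } type;
    uint32_t block;       /* which metadata block to update */
    uint32_t value;       /* new contents (one word, for the toy) */
    uint32_t txid;
};

static struct jrec journal[JOURNAL_SLOTS];
static uint32_t    metadata[8];          /* the "real" on-disk metadata */

/* Recovery: apply updates only for transactions that have a commit record. */
static void replay(void)
{
    for (int i = 0; i < JOURNAL_SLOTS; i++) {
        if (journal[i].type != J_UPDATE)
            continue;
        int committed = 0;
        for (int j = 0; j < JOURNAL_SLOTS; j++)
            if (journal[j].type == J_COMMIT && journal[j].txid == journal[i].txid)
                committed = 1;
        if (committed)
            metadata[journal[i].block] = journal[i].value;
    }
}

int main(void)
{
    /* Transaction 1 committed before the crash; transaction 2 did not. */
    journal[0] = (struct jrec){ J_UPDATE, 3, 42, 1 };
    journal[1] = (struct jrec){ J_COMMIT, 0, 0, 1 };
    journal[2] = (struct jrec){ J_UPDATE, 5, 99, 2 };   /* no commit record */

    replay();
    printf("block 3 = %u (replayed), block 5 = %u (discarded)\n",
           metadata[3], metadata[5]);
    return 0;
}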
Large Contiguous Allocations
• Pre-allocation.
• Block groups.
• Multiple block sizes.
Metadata Allocation
• Fixed number in one location.
• Fixed number spread across the disk.
• Dynamically allocated in files.
References
• Florian Buchholz, "The structure of the Reiser file system," http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php, 2006.
• Remy Card, Theodore Ts'o, Stephen Tweedie, "Design and Implementation of the Second Extended Filesystem," http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
• Sanjay Ghemawat et al., "The Google File System," SOSP, 2003.
• Christopher Hertel, Implementing CIFS, Prentice Hall, 2003.
• Val Henson, "A Brief History of UNIX Filesystems," http://infohost.nmt.edu/~val/fs_slides.pdf.
• Dave Hitz, James Lau, Michael Malcolm, "File System Design for an NFS File Server Appliance," Proceedings of the USENIX Winter 1994 Technical Conference, http://www.netapp.com/library/tr/3002.pdf.
• John Howard et al., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems 6(1), 1988.
• Marshall K. McKusick, "A Fast File System for UNIX," ACM Transactions on Computer Systems 2(3), 1984.
• Brian Pawlowski et al., "The NFS Version 4 Protocol," SANE, 2000.
• Daniel Robbins, "Advanced File System Implementor's Guide," IBM developerWorks, http://www-128.ibm.com/developerworks/linux/library/l-fs9.html, 2002.
• Claudia Rodriguez et al., The Linux Kernel Primer, Prentice Hall, 2005.
• Mendel Rosenblum and John K. Ousterhout, "The Design and Implementation of a Log-Structured File System," 13th ACM SOSP, 1991.
• R. Sandberg et al., "Design and Implementation of the Sun Network Filesystem," Proceedings of the USENIX 1985 Summer Conference, 1985.
• Adam Sweeney et al., "Scalability in the XFS File System," Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
• Wikipedia, "Comparison of file systems," http://en.wikipedia.org/wiki/Comparison_of_file_systems.