CSC 660: Advanced OS
Distributed Filesystems
Topics
• Filesystem History
• Distributed Filesystems
• AFS
• GoogleFS
• Common filesystem issues
Filesystem History
• FS (1974)
• Fast Filesystem (FFS) / UFS (1984)
• Log-structured Filesystem (1991)
• ext2 (1993)
• ext3 (2001)
• WAFL (1994)
• XFS (1994)
• Reiserfs (1998)
• ZFS (2004)
FS
• First UNIX filesystem (1974)
• Simple
• Layout: superblock, inodes, then data blocks.
• Unused blocks stored in free linked list, not bitmap.
• 512 byte blocks, no fragments.
• Short filenames.
• Slow: 2% of raw disk bandwidth.
• Disk seeks consume most file access time due to small block size and high fragmentation.
• Later doubled perf by using 1KB blocks.
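A minimal C sketch of the layout idea above (field names and sizes are invented for illustration, not the historical on-disk format): a superblock, a fixed inode table, and free blocks chained on a linked list instead of a bitmap, so allocation simply pops the head of the list.

```c
/* Toy model of the early UNIX FS: superblock, inode table, data blocks,
 * and a linked free list (no free-block bitmap).  Field names and sizes
 * are invented for illustration, not the historical on-disk format. */
#include <stdio.h>

#define NBLOCKS  64
#define NO_BLOCK (-1)

struct superblock {
    int ninodes;     /* size of the fixed inode table   */
    int nblocks;     /* number of 512-byte data blocks  */
    int free_head;   /* first block on the free list    */
};

static int next_free[NBLOCKS];   /* each free block names the next one */
static struct superblock sb = { 32, NBLOCKS, 0 };

static void init_free_list(void) {
    for (int b = 0; b < NBLOCKS - 1; b++)
        next_free[b] = b + 1;
    next_free[NBLOCKS - 1] = NO_BLOCK;
}

/* Allocation pops the head of the free list: whichever block happens to
 * be there is handed out, so files fragment over time -- one reason the
 * original FS delivered only ~2% of raw disk bandwidth. */
static int alloc_block(void) {
    int b = sb.free_head;
    if (b != NO_BLOCK)
        sb.free_head = next_free[b];
    return b;
}

int main(void) {
    init_free_list();
    for (int i = 0; i < 4; i++)
        printf("allocated block %d\n", alloc_block());
    return 0;
}
```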
FFS
• BSD (1984), basis for SYSV UFS
• More complex
• Cylinder groups: inodes, bitmaps, data blocks.
• Larger blocks (4K) with 1K fragments.
• Block layout based on physical disk parameters.
• Long filenames, symlinks, file locks, quotas.
• 10% space reserved by default.
• Faster: 14-47% of raw disk bandwidth.
• Creating a new file requires 5 seeks: 2 inode seeks, 1 file data, 1 dir data, 1 dir inode.
• User/kernel memory copies take 40% of disk op time.
Log-structured Filesystem (LFS)
• All data stored as sequential log entries.
• Divided into large log segments.
• Cleaner defragments, produces new segments.
• Fast recovery: checkpoint + roll forward.
• Performance: 70% of raw disk bandwidth.
• Large sequential writes vs multiple writes/seeks.
• Inode map tracks dynamic locations of inodes.
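A toy C sketch of the log-structured idea (structures and sizes are invented): every write, including an inode update, is appended at the tail of the log, and an inode map records where the newest copy of each inode landed so it can still be found.

```c
/* Toy log-structured filesystem: data and inodes are appended to one
 * sequential log; an inode map records where the newest copy of each
 * inode landed.  Structures and sizes are invented for illustration. */
#include <stdio.h>
#include <string.h>

#define LOG_SIZE  1024
#define MAX_INODE 128
#define UNSET     (-1L)

static char log_space[LOG_SIZE];    /* the append-only log          */
static long log_tail = 0;           /* next free offset in the log  */
static long inode_map[MAX_INODE];   /* inode number -> log offset   */

/* Appending never seeks back to a fixed home location. */
static long log_append(const char *data, size_t len) {
    long at = log_tail;
    memcpy(log_space + at, data, len);
    log_tail += (long)len;
    return at;
}

/* Writing an inode appends its new version and updates the map;
 * the previous copy becomes garbage for the cleaner to reclaim. */
static void write_inode(int ino, const char *bytes, size_t len) {
    inode_map[ino] = log_append(bytes, len);
}

int main(void) {
    for (int i = 0; i < MAX_INODE; i++)
        inode_map[i] = UNSET;

    write_inode(7, "inode-7-v1", 10);
    write_inode(7, "inode-7-v2", 10);    /* v1 at offset 0 is now stale */
    printf("inode 7 now lives at log offset %ld\n", inode_map[7]);
    printf("the segment cleaner would reclaim the stale copy at offset 0\n");
    return 0;
}
```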
ext2 and ext3
FFS + performance features.
• Variable block size (1K-4K), no fragments.
• Partitions disk into block groups.
• Data block preallocation + read ahead.
• Fast symlinks (stored in inode).
• 5% space reserved by default.
• Very fast.
ext3 adds journaling capabilities.
WAFL
Network Appliance (1994)
Metadata in files
• Root inode points to inode file.
• Filesystem is a tree of blocks rooted at the root inode.
• Metadata can be written anywhere on disk, which speeds writes on RAID.
• Allows filesystem to be expanded on the fly.
WAFL
Copy-on-write snapshots
• Hourly (4/day, keep 2d), Daily (keep 7d).
• Users can get deleted files from .snapshot dirs.
• Snapshots created by just copying the root inode.
• Creates consistency point snapshot every few seconds.
• Writes only to unused blocks between consistency points.
• Recovery = last consistency point + replay NVRAM log.
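A toy copy-on-write snapshot in C, in the spirit of the description above (the one-level "tree" and all names are invented): taking a snapshot copies only the root pointer, and later writes go to fresh blocks, so the snapshot keeps seeing the old data.

```c
/* Toy copy-on-write snapshot: a snapshot duplicates only the root pointer;
 * updates write new blocks and repoint the live root, never overwriting
 * blocks the snapshot still references.  Structures are invented. */
#include <stdio.h>
#include <stdlib.h>

struct block { char data[32]; };

struct root { struct block *tree; };   /* stand-in for the block tree */

/* Snapshot = copy the root; the old blocks stay referenced by it. */
static struct root snapshot(const struct root *live) {
    return *live;
}

/* Copy-on-write update: write a fresh block, then swing the live root. */
static void cow_write(struct root *live, const char *newdata) {
    struct block *nb = malloc(sizeof *nb);
    snprintf(nb->data, sizeof nb->data, "%s", newdata);
    live->tree = nb;
}

int main(void) {
    struct block v1 = { "version 1" };
    struct root live = { &v1 };

    struct root snap = snapshot(&live);   /* the ".snapshot" view      */
    cow_write(&live, "version 2");        /* live filesystem moves on  */

    printf("live view    : %s\n", live.tree->data);
    printf("snapshot view: %s\n", snap.tree->data);
    return 0;
}
```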
XFS
SGI (1994)
Complex journaling filesystem
• Uses B+ trees to track free space, index dirs, locate file blocks and inodes.
• Dynamic inode allocation, metadata journaling, volume manager, multithreaded, allocate on flush.
• 64-bit filesystem (filesystems up to 2^63 bytes).
• Fast: 90-95% of raw disk bandwidth.
Reiserfs
Multiple different versions (v1-4)
Complex tree-based filesystem
• Uses B+ trees (v3) or dancing trees (v4).
• Journaling, allocate on flush, COW, tail-packing.
• High perf with small files, large directories.
• Second to ext2 in perf (v3).
ZFS
Sun (2004)
Copy-on-write + volume management
• Variable block size + compression.
• Built-in volume manager (striping, pooling).
• Self-healing with 256-bit checksums + mirroring.
• COW transactional model (live data never overwritten).
• Fast snapshots (just don't release old blocks).
• 128-bit filesystem.
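A rough C illustration of the self-healing idea (not ZFS code; the 32-bit sum below is only a stand-in for the real, much larger checksums): each block's checksum is kept with the pointer to it, and a failed verification triggers a read of the mirror copy plus a repair of the bad one.

```c
/* Toy self-healing read: each block's checksum is stored with its parent
 * pointer; a mismatch triggers a read of the mirror copy and a repair of
 * the bad copy.  The 32-bit sum is only a stand-in for real checksums. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BLKSZ 16

struct mirrored_block {
    char copy[2][BLKSZ];   /* two mirrored copies of the data       */
    uint32_t checksum;     /* expected checksum, kept by the parent */
};

static uint32_t checksum(const char *d) {
    uint32_t s = 0;
    for (int i = 0; i < BLKSZ; i++)
        s = s * 31u + (unsigned char)d[i];
    return s;
}

/* Return the first copy that verifies, repairing the other from it. */
static int healing_read(struct mirrored_block *b, char *out) {
    for (int c = 0; c < 2; c++) {
        if (checksum(b->copy[c]) == b->checksum) {
            memcpy(out, b->copy[c], BLKSZ);
            memcpy(b->copy[1 - c], b->copy[c], BLKSZ);   /* self-heal */
            return 0;
        }
    }
    return -1;   /* both copies bad: unrecoverable */
}

int main(void) {
    struct mirrored_block b;
    memcpy(b.copy[0], "self-healing fs", BLKSZ);
    memcpy(b.copy[1], b.copy[0], BLKSZ);
    b.checksum = checksum(b.copy[0]);

    b.copy[0][0] = 'X';                  /* silent corruption of copy 0 */
    char out[BLKSZ + 1] = {0};
    if (healing_read(&b, out) == 0)
        printf("read ok: \"%s\" (bad copy repaired)\n", out);
    return 0;
}
```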
Distributed Filesystems
Use filesystem to transparently share data between computers.
Accessing files via a distributed filesystem:
• Client mounts network filesystem.
• Client makes a request for file access.
• Client kernel sends network request to server.
• Server performs file ops on physical disk.
• Server sends response across network to client.
Naming
Mapping between logical and physical objects.
UNIX filenames mapped to inodes.
Network filenames map to <hostname, vnode> pairs.
Location-independent names
Filename is a dynamic one-to-many mapping.
Files can migrate to other servers w/o renaming.
Files can be replicated across multiple servers.
Naming Implementation
Location-dependent (non-transparent):
filename -> <system, disk, inode>
Location-independent (transparent):
filename -> file_identifier -> <system, disk, inode>
Identifiers must be unique.
Identifiers must be updated to point to a new physical location when a file is moved.
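A small C sketch of the two-level, location-independent mapping (all names, paths, and structures are invented): the name resolves to a stable file identifier, and a separate table maps that identifier to its current <system, disk, inode> location, so migrating a file only touches the second table.

```c
/* Toy location-independent naming: a path maps to a stable file id, and a
 * separate table maps that id to the current <system, disk, inode>; moving
 * the file only changes the second table.  All names are invented. */
#include <stdio.h>
#include <string.h>

struct location { char server[16]; int disk; int inode; };

static struct { const char *path; int fid; } names[] = {
    { "/dfs/report.txt", 42 },
};
static struct { int fid; struct location where; } locs[] = {
    { 42, { "serverA", 0, 1187 } },
};

static struct location *resolve(const char *path) {
    for (size_t i = 0; i < sizeof names / sizeof *names; i++)
        if (strcmp(names[i].path, path) == 0)
            for (size_t j = 0; j < sizeof locs / sizeof *locs; j++)
                if (locs[j].fid == names[i].fid)
                    return &locs[j].where;
    return NULL;
}

/* Migration rewrites only the id -> location entry. */
static void migrate(int fid, struct location to) {
    for (size_t j = 0; j < sizeof locs / sizeof *locs; j++)
        if (locs[j].fid == fid)
            locs[j].where = to;
}

int main(void) {
    struct location *l = resolve("/dfs/report.txt");
    printf("before: %s disk %d inode %d\n", l->server, l->disk, l->inode);

    migrate(42, (struct location){ "serverB", 2, 970 });
    l = resolve("/dfs/report.txt");      /* same name, new location */
    printf("after : %s disk %d inode %d\n", l->server, l->disk, l->inode);
    return 0;
}
```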
Caching
Problem: Every file access uses network.
Solution: Store remote data on local system.
Cache can be memory or disk based.
Read-ahead can reduce accesses further.
Cache Update Policies
Write Through
Write data to server and cache at once.
Return to program when server write complete.
High reliability, poor performance.
Delayed Write
Write data to cache, then return to program.
Modifications written through to server later.
High performance, poor reliability.
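The contrast between the two policies as a small C sketch (remote_write() stands in for the network round trip; all names are invented): write-through waits for the server before returning, while delayed write returns immediately and flushes later, so unflushed data is lost if the client crashes.

```c
/* Toy contrast of the two cache update policies.  remote_write() stands in
 * for the network round trip to the file server; names are invented. */
#include <stdio.h>

static void remote_write(const char *data) {
    printf("  [network] server stores \"%s\"\n", data);
}

static const char *dirty[16];    /* delayed-write cache: not yet on server */
static int ndirty = 0;

/* Write-through: do not return until the server has the data, so a client
 * crash loses nothing, but every write pays the network latency. */
static void write_through(const char *data) {
    remote_write(data);
    printf("write-through done: %s\n", data);
}

/* Delayed write: return as soon as the local cache is updated; data only
 * reaches the server when the cache is flushed (or the file is closed). */
static void delayed_write(const char *data) {
    dirty[ndirty++] = data;
    printf("delayed write done: %s (still only in local cache)\n", data);
}

static void flush_cache(void) {
    for (int i = 0; i < ndirty; i++)
        remote_write(dirty[i]);
    ndirty = 0;
}

int main(void) {
    write_through("block A");
    delayed_write("block B");
    delayed_write("block C");
    flush_cache();              /* a crash before this point loses B and C */
    return 0;
}
```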
NFS with Cachefs
Cache Consistency Problem
Keeping cached copies consistent with the server.
Consistency overhead can decrease performance if too many writes are done on the same set of files.
Client-initiated consistency
Client asks server if data is consistent.
When: every file access, or periodically.
Server-initiated consistency
Server detects conflicts and invalidates client caches.
Server has to maintain state of what is cached where.
Stateful File Access
Stateful process:
• Client sends open request to server.
• Server opens file, inserts into open file table.
• Server returns file identifier to client.
• Client uses identifier to read/write file.
• Client closes file.
• Server removes file from open file table.
Features
High performance, because fewer disk accesses.
Problem of clients that crash without closing files.
Stateless File Service
Every request is self-contained.
Must specify filename and position in every request.
Server doesn't know which files are open.
Server crashes have minimal effect.
Stateful servers must poll clients to recover state.
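A compact C sketch contrasting the two designs from the last two slides (the "server" is just two functions over an in-memory file; all names are invented): the stateful server keeps per-handle offsets in an open-file table, while the stateless server expects every request to carry the filename and position.

```c
/* Toy contrast of stateful vs stateless file service; the "server" is two
 * functions over an in-memory file, and all names are invented. */
#include <stdio.h>

static const char FILE_DATA[] = "abcdefghijklmnop";

/* Stateful: the server remembers a per-handle offset in an open-file table,
 * so it must clean up handles left behind by crashed clients. */
struct open_entry { int in_use; long offset; };
static struct open_entry open_table[8];

static int srv_open(void) {
    for (int h = 0; h < 8; h++)
        if (!open_table[h].in_use) {
            open_table[h].in_use = 1;
            open_table[h].offset = 0;
            return h;
        }
    return -1;
}

static char srv_read_next(int h) {          /* server advances the offset */
    return FILE_DATA[open_table[h].offset++];
}

/* Stateless: every request names the file and the position, so a server
 * reboot between requests is invisible to the client. */
static char srv_read_at(const char *name, long offset) {
    (void)name;                             /* lookup elided for brevity  */
    return FILE_DATA[offset];
}

int main(void) {
    int h = srv_open();
    char a = srv_read_next(h);
    char b = srv_read_next(h);
    printf("stateful : %c %c\n", a, b);

    long pos = 0;                           /* client tracks the position */
    printf("stateless: %c %c\n",
           srv_read_at("/export/f", pos), srv_read_at("/export/f", pos + 1));
    return 0;
}
```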
NFS
Sun: v2 (1984); v3 (1992) adds TCP + 64-bit.
Implementation
• System calls translated into Sun RPC calls.
• Stateless: client obtains a filesystem ID (like a file handle) on mount, then uses it in subsequent requests.
• UNIX-centric (UIDs, GIDs, permissions).
• Server authenticates clients by IP address.
• Client UIDs mapped to server UIDs with root squashing.
• Danger: a client root user can su to any desired UID.
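A tiny illustration of the root-squashing rule above (the UID values are conventional choices, not mandated): requests claiming UID 0 are remapped to an unprivileged "nobody" UID before permission checks, which is exactly why the su-to-any-other-UID danger remains.

```c
/* Toy root squashing: requests that arrive claiming UID 0 are remapped to
 * an unprivileged "nobody" UID before the server checks permissions.  The
 * UID values are conventional choices, not mandated by the protocol. */
#include <stdio.h>

#define ROOT_UID   0u
#define NOBODY_UID 65534u    /* a common "nobody" uid */

static unsigned squash(unsigned client_uid) {
    return client_uid == ROOT_UID ? NOBODY_UID : client_uid;
}

int main(void) {
    unsigned requests[] = { 0u, 1000u, 0u, 4242u };
    for (int i = 0; i < 4; i++)
        printf("client uid %5u -> checked as uid %5u\n",
               requests[i], squash(requests[i]));
    /* The slide's remaining danger: root on the client can still su to
     * uid 1000 and be trusted by the server as uid 1000. */
    return 0;
}
```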
CIFS
Microsoft (1998)
Derived from IBM's 1980s SMB network filesystem.
Implementation
• Originally ran over NetBIOS, not TCP/IP.
• Universal Naming Convention (UNC) paths: \\svr\share\path
• Auth: NTLM (insecure), NTLMv2, Kerberos.
• MS Windows-centric (filenames, ACLs, EOLs).
AFS
CMU (1983)
• Sold by Transarc/IBM, then free as OpenAFS.
Features
• Uniform /afs name space.
• Location-independent file sharing.
• Whole file caching on client.
• Secure authentication via Kerberos.
AFS
Global namespace divided into cells
• Cells separate authorization domains.
• Cells included in pathname: /afs/CELL/
• Ex: cmu.edu, intel.com
Cells contain multiple servers
• Location independence managed via volume db.
• Files are located on volumes.
• Volumes can migrate between servers.
• Volumes can be replicated in read-only fashion.
NFSv4
IETF (2000)
Based on 1998 Sun draft.
New Features
• Only one protocol.
• Global namespace.
• Security (ACLs, Kerberos, encryption).
• Cross platform + internationalized.
• Better caching via delegation of files to clients.
GoogleFS
Assumptions
• High rate of commodity hardware failures.
• Small number of huge files (multi-GB +).
• Reads: large streaming + small random.
• Most modifications are appends.
• High bandwidth >> low latency.
• Applications / filesystem co-designed.
GoogleFS Architecture
• Master server
• Metadata: namespace, ACL, chunk mapping.
• Chunk lease management, garbage collection, chunk migration.
• Chunk servers
• Serve chunks (64MB + checksum) of files.
• Chunks replicated on multiple (3) servers.
GoogleFS Writing
• Client asks master which chunkserver holds the lease.
• Master responds: leaseholder + replicas.
• Client pushes data to all replicas.
• Client sends write to primary replica.
• Primary forwards the request to the secondaries.
• Secondaries reply to primary on completion.
• Primary replies to client.
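The same write flow as a C walk-through, with print statements standing in for the RPCs (server names and the fixed 3-replica set are invented for the demo).

```c
/* Toy walk-through of the GoogleFS write control flow from the slide,
 * with print statements standing in for RPCs.  Server names and the
 * fixed 3-way replica set are invented for the demo. */
#include <stdio.h>

#define NREPLICAS 3

static const char *replicas[NREPLICAS] = { "cs1", "cs7", "cs9" };
static const int PRIMARY = 0;          /* cs1 currently holds the lease */

static void write_chunk(const char *data) {
    /* 1-2. Client asks the master; master names the primary and replicas. */
    printf("master: primary=%s, replicas=%s,%s\n",
           replicas[PRIMARY], replicas[1], replicas[2]);

    /* 3. Client pushes the data to every replica (data flow). */
    for (int r = 0; r < NREPLICAS; r++)
        printf("client -> %s: data \"%s\" buffered\n", replicas[r], data);

    /* 4-5. Client sends the write to the primary, which forwards the
     *      request to the secondaries (control flow). */
    printf("client -> %s: commit write\n", replicas[PRIMARY]);
    for (int r = 0; r < NREPLICAS; r++)
        if (r != PRIMARY)
            printf("%s -> %s: apply write\n", replicas[PRIMARY], replicas[r]);

    /* 6-7. Secondaries acknowledge; primary replies to the client. */
    for (int r = 0; r < NREPLICAS; r++)
        if (r != PRIMARY)
            printf("%s -> %s: done\n", replicas[r], replicas[PRIMARY]);
    printf("%s -> client: write complete\n", replicas[PRIMARY]);
}

int main(void) {
    write_chunk("record-1");
    return 0;
}
```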
GoogleFS Consistency
File regions can be
Consistent: all clients see the same data.
Defined: consistent + clients will see entire write.
Inconsistent: different clients see different data.
Files can be modified by
Random write: data written at specified offset.
Record append: data is appended atomically at least once. Padding or record duplicates may be inserted as part of an append operation.
GoogleFS Consistency
Writers deal with consistency issues by
• Preferring appends to random writes.
• Application-level checkpoints.
• Self-identifying records with checksums.
Readers deal with consistency issues by
• Processing file only up until checkpoint.
• Ignoring padding.
• Discarding records with duplicate checksums.
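A small C sketch of the reader-side strategy (the record format is invented for the demo): records are self-identifying (id + checksum), so the reader can skip padding and drop the duplicate left behind by a retried record append.

```c
/* Toy reader-side coping strategy from the slide: records carry an id and
 * a checksum, so a reader can skip padding and drop duplicates left behind
 * by retried record appends.  The record format is invented. */
#include <stdio.h>

struct record { int id; unsigned checksum; const char *payload; };

/* A file region after several appends: one retry produced a duplicate of
 * record 2, and id 0 marks padding inserted by the filesystem. */
static struct record region[] = {
    { 1, 0xaaaa, "first"  },
    { 2, 0xbbbb, "second" },
    { 0, 0,      ""       },     /* padding   */
    { 2, 0xbbbb, "second" },     /* duplicate */
    { 3, 0xcccc, "third"  },
};

int main(void) {
    unsigned seen[16] = {0};
    int nseen = 0;

    for (size_t i = 0; i < sizeof region / sizeof *region; i++) {
        if (region[i].id == 0)
            continue;                       /* ignore padding        */
        int dup = 0;
        for (int j = 0; j < nseen; j++)
            if (seen[j] == region[i].checksum)
                dup = 1;
        if (dup)
            continue;                       /* drop duplicate record */
        seen[nseen++] = region[i].checksum;
        printf("record %d: %s\n", region[i].id, region[i].payload);
    }
    return 0;
}
```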
Chunk Replication
New Chunks
• Replicate new chunks on servers with below-average disk utilization.
• Limit the number of recent chunk creations on each server, due to imminent writes.
Re-replication
• Prioritize chunks based on how far chunk is away from replication goal.
• Master clones chunk by choosing a server and telling it to replicate chunk from closest replica.
• Master re-balances chunk distribution periodically.
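Both policies as a short C sketch (all utilization numbers, thresholds, and server names are invented): new chunks go to a server with below-average utilization and few recent creations, and the chunk furthest from its replication goal is re-replicated first.

```c
/* Toy versions of the two policies on the slide: place new chunks on
 * chunkservers with below-average disk utilization, and re-replicate first
 * whichever chunk is furthest from its replication goal.  All numbers and
 * names are invented for the demo. */
#include <stdio.h>

struct server { const char *name; double util; int recent_creations; };
struct chunk  { int id; int replicas; int goal; };

static struct server servers[] = {
    { "cs1", 0.82, 1 }, { "cs2", 0.41, 0 }, { "cs3", 0.55, 4 },
};
static struct chunk chunks[] = {
    { 101, 2, 3 }, { 102, 1, 3 }, { 103, 3, 3 },
};

int main(void) {
    /* Placement: below-average utilization and few recent creations. */
    double avg = 0;
    for (int i = 0; i < 3; i++) avg += servers[i].util;
    avg /= 3;
    const char *pick = "none";
    for (int i = 0; i < 3; i++)
        if (servers[i].util < avg && servers[i].recent_creations < 3) {
            pick = servers[i].name;
            break;
        }
    printf("new chunk placed on %s (avg util %.2f)\n", pick, avg);

    /* Re-replication priority: biggest gap from the replication goal first. */
    struct chunk *urgent = &chunks[0];
    for (int i = 1; i < 3; i++)
        if (chunks[i].goal - chunks[i].replicas >
            urgent->goal - urgent->replicas)
            urgent = &chunks[i];
    printf("re-replicate chunk %d first (%d of %d replicas)\n",
           urgent->id, urgent->replicas, urgent->goal);
    return 0;
}
```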
GoogleFS Reliability
Chunk-level reliability
Incremental checksums on each chunk.
Chunks replicated by default across 3 servers.
Single master server
Metadata stored in memory and an operation log.
Metadata recovered by polling chunk servers.
Shadow masters provide read-only access if the primary is down.
Common Problems
• Consistency after crash.
• Large contiguous allocations.
• Metadata allocation.
Consistency
• Detect + Repair
• Use fsck to repair.
• Journal replay.
• Always Consistent
• Copy on write.
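A toy C sketch of the journal-replay variant of "detect + repair" (the journal format is invented): at mount time, committed journal entries are re-applied to the metadata and uncommitted ones are discarded.

```c
/* Toy journal replay: metadata updates are first written to a journal and
 * marked committed; mount-time recovery re-applies committed entries and
 * discards the rest.  The journal format is invented for the demo. */
#include <stdio.h>

struct journal_entry {
    int committed;            /* 1 once the whole transaction hit the log */
    int block;                /* which metadata block to update           */
    int value;                /* new contents (simplified to one int)     */
};

static int metadata[8];       /* the "on-disk" metadata blocks */

/* The crash happened after entries 0 and 1 committed, before entry 2 did. */
static struct journal_entry journal[] = {
    { 1, 3, 111 },
    { 1, 5, 222 },
    { 0, 6, 333 },            /* uncommitted: must be ignored on replay */
};

static void replay_journal(void) {
    for (size_t i = 0; i < sizeof journal / sizeof *journal; i++)
        if (journal[i].committed)
            metadata[journal[i].block] = journal[i].value;
}

int main(void) {
    replay_journal();         /* what journaling recovery does at mount */
    for (int b = 0; b < 8; b++)
        printf("block %d = %d\n", b, metadata[b]);
    return 0;
}
```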
Large Contiguous Allocations
• Pre-allocation.
• Block groups.
• Multiple block sizes.
Metadata Allocation
• Fixed number in one location.
• Fixed number spread across disk.
• Dynamically allocated in files.
References
• Jerry Breecher, “Distributed Filesystems,” http://cs.clarku.edu/~jbreecher/os/lectures/Section17-Dist_File_Sys.ppt
• Florian Buchholz, “The structure of the Reiser file system,” http://homes.cerias.purdue.edu/~florian/reiser/reiserfs.php, 2006.
• Remy Card, Theodore Ts'o, Stephen Tweedie, “Design and Implementation of the Second Extended Filesystem,” http://web.mit.edu/tytso/www/linux/ext2intro.html, 1994.
• Sanjay Ghemawat et al., “The Google File System,” SOSP, 2003.
• Christopher Hertel, Implementing CIFS, Prentice Hall, 2003.
• Val Henson, “A Brief History of UNIX Filesystems,” http://infohost.nmt.edu/~val/fs_slides.pdf
• Dave Hitz, James Lau, Michael Malcolm, “File System Design for an NFS File Server Appliance,” Proceedings of the USENIX Winter 1994 Technical Conference, http://www.netapp.com/library/tr/3002.pdf
• John Howard et al., “Scale and Performance in a Distributed File System,” ACM Transactions on Computer Systems 6(1), 1988.
• Marshall K. McKusick et al., “A Fast File System for UNIX,” ACM Transactions on Computer Systems 2(3), 1984.
• Brian Pawlowski et al., “The NFS Version 4 Protocol,” SANE, 2000.
• Daniel Robbins, “Advanced File System Implementor's Guide,” IBM developerWorks, http://www-128.ibm.com/developerworks/linux/library/l-fs9.html, 2002.
• Claudia Rodriguez et al., The Linux Kernel Primer, Prentice Hall, 2005.
• Mendel Rosenblum and John K. Ousterhout, “The Design and Implementation of a Log-structured Filesystem,” 13th ACM SOSP, 1991.
• R. Sandberg, “Design and Implementation of the Sun Network Filesystem,” Proceedings of the USENIX 1985 Summer Conference, 1985.
• Adam Sweeney et al., “Scalability in the XFS File System,” Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
• Wikipedia, “Comparison of file systems,” http://en.wikipedia.org/wiki/Comparison_of_file_systems