File System Topics Lei Xu
Agenda • Introduction • VFS • Optimizations • Examples • FAQ
Introduction • “A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device(s) which contain it.” – from Wikipedia • Store data • Organize data • Access data • Manage storage resources (e.g. hard drive)
Relationship to Architecture Course Acknowledgement: slides from the 830 course
Relationship to Architecture Course • A file system sits between memory and secondary storage (or remote servers) • One of the most complex parts of an operating system • Main R&D focuses: • Performance: throughput, latency, scalability • Reliability and availability • Management: snapshots, etc. Acknowledgement: slides from the 830 course
Different types of file systems • Local file systems • Store data on local hard drives, SSDs, floppy drives, optical disks, etc. • Examples: NTFS, EXT4, HFS+, ZFS • Network/distributed file systems • Store data on remote file server(s) • Examples: NFS, CIFS/Samba, AFP, Hadoop DFS, Ceph • Pseudo file systems • Examples: procfs, devfs, tmpfs • “List of file systems” • http://en.wikipedia.org/wiki/List_of_file_systems
Agenda • Introduction • VFS • Optimizations • Examples • FAQ
Overall Architecture of Linux file system components Acknowledgement: “Anatomy of the Linux file system”, IBM developerWorks.
Virtual File System (VFS) • VFS is the essential concept in UNIX-like file systems • Specifies an interface between the kernel and a concrete file system • Introduced by Sun in 1985 • Passes system calls to the underlying file system • E.g. passes sys_write() to Ext4's write routine (written here as ext4_write()) • Three major metadata structures in VFS • Metadata: the data about data (Wikipedia) • Super block, dentry and inode • OO design • Each component defines a set of data members and the functions to access them
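The dispatch idea on this slide can be sketched in a few lines of Python. This is only a toy model, not kernel code: `FileSystem`, `ToyExt4`, `ToyVFS` and the mount-table logic are hypothetical names invented for illustration. Only `sys_write` echoes a name from the slide.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Interface every concrete file system implements (hypothetical)."""

    @abstractmethod
    def write(self, path, data):
        """Write data to path and return the number of bytes written."""


class ToyExt4(FileSystem):
    """Stand-in for a disk file system; keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = self.files.get(path, b"") + data
        return len(data)


class ToyVFS:
    """Routes a generic write call to whichever file system owns the path."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip("/") or "/"] = fs

    def _resolve(self, path):
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for mp in self.mounts:
            prefix = mp if mp.endswith("/") else mp + "/"
            if path == mp or path.startswith(prefix):
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        relative = "/" + path[len(best):].lstrip("/")
        return self.mounts[best], relative

    def sys_write(self, path, data):
        # The caller never names a file system; the VFS layer picks it.
        fs, relative = self._resolve(path)
        return fs.write(relative, data)
```

The caller only ever talks to `sys_write`. Adding another file system means adding another `FileSystem` subclass and mounting it, which is the "OO design" point above.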
Super block • A segment of metadata that describes a file system • Constructed when a file system is mounted • Usually, a persistent copy of the super block is stored at the beginning of the storage device • Describes: • File system type, size, status (e.g. dirty bit, read-only bit) • Block size, maximum file size, device size, etc. • How to find other metadata and data • How to manipulate this data (i.e. sb_ops)
Inode • “Index-node” in Unix-style file systems • All information about one file (or directory) • Except its name • In UNIX-like systems, file names are stored in the directory file: its content is an “array” of file names • E.g. owner, access rights, mode, size, timestamps, etc. • Pointers to data
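A minimal sketch of why the name lives in the directory rather than in the inode. The `Inode` fields mirror the bullets above. The `Directory` class, the `link_count` field and the inode table are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Per-file metadata; note that there is no name field."""
    number: int
    mode: int
    owner: str
    size: int = 0
    link_count: int = 0                          # assumed: names pointing here
    blocks: list = field(default_factory=list)   # "pointers" to data blocks


class Directory:
    """A directory's content: a table of file name -> inode number."""

    def __init__(self):
        self.entries = {}

    def link(self, name, inode):
        if name in self.entries:
            raise FileExistsError(name)
        self.entries[name] = inode.number
        inode.link_count += 1

    def unlink(self, name, inode_table):
        inode = inode_table[self.entries.pop(name)]
        inode.link_count -= 1
        if inode.link_count == 0:
            del inode_table[inode.number]   # last name gone: free the inode
```

Because names are just directory entries, the same inode can appear under two names (a hard link). In this sketch the file goes away only when the last name is removed.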
Directory Entry (dentry) • A dentry conceptually maps a file name to its corresponding inode • Each file/directory has a dentry representing it • File systems use dentries to look up a file in the hierarchical namespace • Each dentry has a pointer to the dentry of its parent directory • Each dentry of a directory has a list of the dentries of its sub-directories and files
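The parent pointer and child list can be sketched as a small tree. The class below is a simplified illustration with made-up inode numbers, not the kernel's dentry structure.

```python
class Dentry:
    """Name -> inode binding, linked into the namespace tree."""

    def __init__(self, name, inode_number, parent=None):
        self.name = name
        self.inode_number = inode_number
        self.parent = parent        # pointer to the parent directory's dentry
        self.children = {}          # name -> child dentry (directories only)
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Rebuild the full path by following parent pointers to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def lookup(root, path):
    """Walk down from the root, one path component at a time."""
    node = root
    for component in path.strip("/").split("/"):
        if component:
            node = node.children[component]
    return node
```

Child links serve lookups going down the tree. Parent links serve the reverse direction, for example turning a dentry back into a path.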
Agenda • Introduction • VFS • Optimizations • Examples • FAQ
Optimizations • Most file system optimizations are based on the characteristics of the memory hierarchy and storage devices. • Recall: • RAM: 50-100 ns • Disks: 5-10 ms • 4-5 orders of magnitude difference • Almost all widely used local file systems are designed for hard disk drives, which have their own unique characteristics
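Working the quoted latencies through gives the size of the gap directly. The snippet uses only the RAM and disk figures from this slide.

```python
import math

RAM_NS = (50, 100)        # RAM access time range from the slide, in ns
DISK_NS = (5e6, 10e6)     # 5-10 ms expressed in ns

smallest_ratio = DISK_NS[0] / RAM_NS[1]   # fastest disk vs slowest RAM
largest_ratio = DISK_NS[1] / RAM_NS[0]    # slowest disk vs fastest RAM

print(f"disk/RAM ratio: {smallest_ratio:,.0f}x to {largest_ratio:,.0f}x")
print(f"orders of magnitude: {math.log10(smallest_ratio):.1f} "
      f"to {math.log10(largest_ratio):.1f}")
```

With these numbers a disk access costs between 50,000 and 200,000 RAM accesses, which is why nearly every optimization that follows trades RAM and CPU for fewer disk IOs.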
Hard Disk Drive (HDD) • Stores data on one or more rotating disks coated with magnetic material • Introduced by IBM in 1956 • Uses magnetic heads to read and write data
The very early HDD….. Acknowledge to:
HDD (Cont’d) • The essential structure of the HDD has not changed much… • Consists of several disks (platters) • Each disk is divided into tracks, each of which is divided into sectors • The single most significant factor: • Seek time
Why seek time matters • To access a piece of data (a sector), the HDD head must first move to the right track (seek time), then wait for the disk to rotate until the sector is under the head (rotational latency) • Seek time: 3 ms on high-end server disks, 12 ms on desktop-level disks [1] • Average rotational latency: 5.56 ms on a 5400 RPM HDD, 4.17 ms on a 7200 RPM HDD [1] • As a result, sequential IO is much faster than random IO, because it pays the seek and rotational delays only once instead of on every access [1] http://en.wikipedia.org/wiki/Disk-drive_performance_characteristics
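The rotational figures follow from the spindle speed, since on average the head waits half a revolution. The sketch below reproduces them and gives a rough random-IO ceiling. Pairing the 12 ms desktop seek time with a 7200 RPM spindle is an assumption made for the example, and transfer time is ignored.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def random_iops(seek_ms, rpm):
    """Rough upper bound on random IOs per second, ignoring transfer time."""
    return 1000 / (seek_ms + avg_rotational_latency_ms(rpm))


for rpm in (5400, 7200):
    print(f"{rpm} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")

# Assumed pairing: desktop disk with 12 ms seek time and 7200 RPM.
print(f"random IOs per second: {random_iops(12, 7200):.0f}")
```

Under these assumptions a desktop disk manages only a few dozen random IOs per second, no matter how fast it can stream sequential data.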
General Optimizations • Based on two principles: • RAM access is much faster than disk access • Sequential IO is much faster than random IO on disk • So we design file systems that • Make heavy use of CPU/RAM to reduce IO to disk (various caches/write buffers) • Prefer sequential IO • Compute the disk layout so that related data is located sequentially on disk
Dcache • Dentry cache (dcache) • Directories are stored as files on disk. • For each file lookup, we want to obtain the inode from the given full file path • The OS looks up the dentries from the root through all parent directories in the path. • E.g. to look up the file “/Users/john/Documents/course.pdf”, the OS needs to traverse the dentries that represent “/”, “Users”, “john”, “Documents”, and “course.pdf” • To accelerate this: • We use a global hash table (dcache) to map “file path” -> dentry • A two-list solution: one list for active dentries, and one for recently unused dentries (LRU).
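A toy version of the lookup path with a cache in front of it. Several simplifications are assumptions of this sketch: the hash key is (parent inode, name) per component rather than a whole path string, a single LRU-ordered table stands in for the two lists, and `read_directory` is a hypothetical callback representing the slow on-disk directory read. The root inode number 2 is an arbitrary choice.

```python
from collections import OrderedDict


class ToyDcache:
    """Maps (parent inode, name) to an inode number, with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, parent, name, read_directory):
        key = (parent, name)
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)       # mark as most recently used
            return self.entries[key]
        self.misses += 1
        inode = read_directory(parent, name)    # slow path: would hit the disk
        self.entries[key] = inode
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return inode


def resolve(path, dcache, read_directory, root_inode=2):
    """Resolve a full path one component at a time through the cache."""
    inode = root_inode
    for name in path.strip("/").split("/"):
        if name:
            inode = dcache.get(inode, name, read_directory)
    return inode
```

The first lookup of a four-component path costs four directory reads. Repeating it costs none, as long as the entries have not been evicted.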
Inode cache • Similar to the dcache, the OS maintains a cache of inode objects. • Each cached dentry refers to one inode (an inode with hard links can have several dentries) • When the dentries that refer to an inode are evicted, the inode can be evicted too
Page Cache • …a “transparent” buffer for disk-backed pages kept in RAM for fast access… [Wikipedia] • A write-back cache • Main purpose: reducing the number of IOs to disk • Accessed in units of pages (usually 4 KB). • The page cache is organized per file. • A radix tree in the inode object. • Prefetches pages to serve future reads • Absorbs writes to reduce the number of IOs • Dirty (modified) pages are flushed to disk when: 1) a periodic timer fires (every 30 s or 5 s), or 2) the OS wants to reclaim RAM • A flush can also be forced by calling the “fsync()” system call
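How a write-back cache absorbs writes can be shown with a small simulation. Everything here is a toy: a plain dict stands in for the radix tree, another dict stands in for the disk, and the `fsync` method only imitates the effect of the real system call. Periodic flushing and memory reclaim are left out.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Per-file write-back cache: page index -> bytes, plus a dirty set."""

    def __init__(self, disk):
        self.disk = disk          # dict standing in for the device
        self.pages = {}           # cached pages, keyed by page index
        self.dirty = set()        # indexes of pages modified since last flush
        self.disk_reads = 0
        self.disk_writes = 0

    def _load(self, index):
        if index not in self.pages:
            self.disk_reads += 1
            on_disk = self.disk.get(index, bytes(PAGE_SIZE))
            self.pages[index] = bytearray(on_disk)
        return self.pages[index]

    def write(self, offset, data):
        # Modify cached pages only; nothing reaches the disk yet.
        for i, byte in enumerate(data):
            index, within = divmod(offset + i, PAGE_SIZE)
            self._load(index)[within] = byte
            self.dirty.add(index)

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            index, within = divmod(pos, PAGE_SIZE)
            out.append(self._load(index)[within])
        return bytes(out)

    def fsync(self):
        # Write each dirty page back once, however often it was modified.
        for index in sorted(self.dirty):
            self.disk[index] = bytes(self.pages[index])
            self.disk_writes += 1
        self.dirty.clear()
```

One hundred 10-byte writes that all land in the same page cost a single disk write at `fsync()` time. The price is that unflushed data is lost if the machine crashes first.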
Agenda • Introduction • VFS • Optimizations • Examples • FAQ
Examples • Several concrete file system designs • Ext4, classic UNIX-like file system concepts • NTFS, advanced Windows file system • ZFS, “the last word in file systems” • NFS, a standard network file system • Google File System, a specialized distributed file system for special requirements
Ext4 • The latest version of the “extended file system” (Ext2/3/4) • The standard Linux file system for a long time • Inspired by UFS from BSD/Solaris • Groups files into block groups • Keeps file data near its inode Ack: http://bit.ly/tjipWY
NTFS • “New Technology File System” (NTFS) • The standard file system in the Windows world. • A Master File Table (MFT) contains all metadata. • A directory is also a file
ZFS • ZFS: “the last word in file systems” • The most advanced local file system in production • 128-bit address space (2^128 bytes in theory) • Larger than the number of grains of sand on Earth… • Many advanced features: • E.g. transactional commits, end-to-end data integrity, snapshots, volume management and much more… • Designed to never lose data and to always be consistent on disk. • Every OS community wants to clone or copy its features… • Btrfs on Linux, ReFS on Windows, ZFS on FreeBSD
NFS • “Network File System (NFS)” • A protocol developed by Sun in 1984 • A set of RPC calls • IETF standard • Supported by all major OSs • Simple and efficient
Google File System (GFS) • A large distributed file system specially designed for large-scale data processing such as the MapReduce framework • High throughput • High availability • Specially designed; not compatible with the VFS/POSIX API. • Requires clients to be linked against the GFS library. • Hadoop DFS clones the concepts of GFS
More File Systems • Interesting file systems that are worth exploring • Btrfs (B-tree FS) from Oracle, expected to be the next standard Linux file system. Many concepts are shared with ZFS. • ReFS: the file system for Windows 8 (from Microsoft). Many concepts are shared with ZFS (too!). • WAFL (Write Anywhere File Layout) file system from NetApp. • FUSE (Filesystem in Userspace): a cross-platform library that allows developers to write file systems that run in user mode
FAQ? Thanks