File Systems: Design and Implementation
Operating Systems, Fall 2002
What is it all about?
• A file system is a service that supports an abstract representation of the secondary storage
• It is provided by the OS
• Why is a file system needed?
• What is so special about secondary storage (as opposed to main memory)?
Memory Hierarchy (figure)
Main memory vs. Secondary storage
• Main memory: small (MB/GB), expensive, fast (10^-6 to 10^-7 sec), volatile, directly accessible by the CPU; interface: (virtual) memory address
• Secondary storage: large (GB/TB), cheap, slow (10^-2 to 10^-3 sec), persistent, cannot be directly accessed by the CPU; data must first be brought into the main memory
Some numbers…
• 1 GB = 2^30 ≈ 10^9 bytes
• 1 TB = 2^40 ≈ 10^12 bytes (terabyte)
• 1 PB = 2^50 ≈ 10^15 bytes (petabyte)
• 1 EB = 2^60 ≈ 10^18 bytes (exabyte)
• 2^32 ≈ 4 × 10^9: genome base pairs
• 2^64 ≈ 16 × 10^18: brain electrons
• 2^256 ≈ 65,536 × 10^72: particles in the Universe
Secondary storage structure
• A number of disks directly attached to the computer
• Network-attached disks accessible through a fast network
  • Storage Area Network (SAN)
• Simple disks
• Smart disks
Internal disk structure (figure)
Data Access
• The sector is the minimum read/write unit of data (usually 1KB)
• Physical access: (#surface, #track, #sector)
• Smart disk drives hide the internal disk layout
  • Logical access: (#sector), a linearization of the physical address (sketched below)
• Moving the arm assembly (seeking) is expensive
• Sequential access is ~100 times faster than random access
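For illustration, a minimal sketch of how a smart drive could linearize the physical triple into a single logical sector number; the parameter names and the ordering are assumptions, not from the slides. Keeping consecutive logical sectors within one cylinder means sequential access rarely moves the arm.

    /* Hypothetical (#surface, #track, #sector) -> logical sector number.
     * Consecutive logical sectors fill all surfaces of one cylinder
     * before the arm has to seek to the next track. */
    long physical_to_logical(int surface, int track, int sector,
                             int num_surfaces, int sectors_per_track) {
        return ((long)track * num_surfaces + surface)
                   * (long)sectors_per_track + sector;
    }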
Overview
• File system services
• File system interface
• File system implementation
  • Finding files and their data
  • Reading and writing
  • Other issues
• Performance is the paramount issue for file system implementation
File System services
• The file system is a layer between the secondary storage and the application
• It presents the secondary storage as a collection of persistent objects with unique names, called files
• It provides mechanisms for mapping data between the secondary storage and the main memory
What is a file?
• A file is a named, persistent collection of data
• Unstructured, sequential (UNIX)
  • Data is accessed by specifying an offset
• Collection of records (database systems)
  • Supports associative access: "give me all records with Name=Yossi"
• Attributes: owner, permissions, modification time, size, etc.
File system interface
• File data access:
  • READ: bring a specified chunk of data from the file into the process virtual address space
  • WRITE: write a specified chunk of data from the process virtual address space to the file
• CREATE, DELETE, SEEK, TRUNCATE
• open, close, set_attributes
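On UNIX these operations map directly onto the POSIX system calls. A minimal sketch (the file name is just an example, and error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* CREATE + open: obtain a file handle (descriptor) */
        int fd = open("notes.txt", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* WRITE: copy bytes from the process address space to the file */
        const char msg[] = "hello, file system\n";
        if (write(fd, msg, sizeof msg - 1) < 0) perror("write");

        /* SEEK: reposition the current file offset */
        lseek(fd, 0, SEEK_SET);

        /* READ: bring a chunk of file data into a user buffer */
        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) fwrite(buf, 1, (size_t)n, stdout);

        close(fd);   /* release the in-memory file state */
        return 0;
    }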
Accessing File Data: File Control Block
• A control structure, the File Control Block (FCB), is associated with each file in the file system
• Each FCB has a unique identifier (FCB ID)
  • UNIX: the i-node, identified by an i-node number
• FCB structure:
  • File attributes
  • A data structure for accessing the file's data
Accessing File Data
• Given the file name:
  • Get to the file's FCB using the file system catalog
  • Use the FCB to get to the desired offset within the file's data
Accessing File Data: Catalog
• The catalog maps a file name to its FCB and checks permissions
• Doing the lookup on every file data access would be inefficient; instead, do it once when the file is first referenced:
  • file_handle = open(file_name): search the catalog and bring the FCB into memory
    • UNIX in-memory FCB: the in-core i-node
  • close(file_handle): release the FCB from memory
The Catalog Organization
• FCBs are stored in predefined locations on the disk
  • UNIX: the i-node list
• Hierarchical structure:
  • Some FCBs are just lists of pointers to other FCBs: directories
  • UNIX: a directory is a file whose data is an array of (file_name, i-node#) pairs (see the struct below)
  • Recursive mapping
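For illustration, the classic Version 7-era UNIX on-disk directory entry is literally such a pair; the 14-character name limit is the historical value:

    #define DIRSIZ 14                    /* max file-name length in early UNIX */

    struct direct {                      /* one entry in a directory's data */
        unsigned short d_ino;            /* i-node number; 0 marks an unused slot */
        char           d_name[DIRSIZ];   /* name, not necessarily NUL-terminated */
    };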
Searching the UNIX catalog
• /a/b/c => the i-node of /a/b/c:
  • Get the root i-node: the i-node number of '/' is predefined (2)
  • Use the root i-node to get to the '/' data; search it for (a, i-node#)
  • Get a's i-node; get to a's data and search it for (b, i-node#)
  • Get b's i-node, etc.
• Permissions are checked all along the way
  • Each directory in the path must be (at least) executable
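A user-space analogue of this walk, as a minimal sketch using POSIX opendir/readdir (d_ino is the entry's i-node number on UNIX systems). The real kernel routine works on raw i-nodes and needs only execute permission on each directory, whereas opendir also needs read permission:

    #include <dirent.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Resolve an absolute path like "/a/b/c" to an i-node number. */
    long lookup(const char *path) {
        struct stat st;
        if (stat("/", &st) != 0) return -1;
        long ino = (long)st.st_ino;              /* root i-node (classically 2) */

        char buf[4096], dir[4096] = "/";
        strncpy(buf, path, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';

        for (char *comp = strtok(buf, "/"); comp; comp = strtok(NULL, "/")) {
            DIR *d = opendir(dir);               /* permission checked here */
            if (!d) return -1;
            long found = -1;
            struct dirent *e;
            while ((e = readdir(d)) != NULL)     /* scan (name, i-node#) pairs */
                if (strcmp(e->d_name, comp) == 0) { found = (long)e->d_ino; break; }
            closedir(d);
            if (found < 0) return -1;            /* component not found */
            ino = found;
            strncat(dir, comp, sizeof dir - strlen(dir) - 2);
            strncat(dir, "/", sizeof dir - strlen(dir) - 1);
        }
        return ino;                              /* i-node # of the last component */
    }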
Allocating disk blocks to file data
• Assume unstructured files: an array of bytes
• Goals:
  • Efficient offset -> disk block mapping
  • Efficient disk access for both sequential and random patterns: minimizing the number of seeks
  • Efficient space utilization: minimizing external/internal fragmentation
Static and Contiguous Allocation
• Allocate each file a fixed number of contiguous blocks at creation time
• Efficient offset lookup: only the block # of offset 0 is needed (see below)
• Efficient disk access
• Inefficient space utilization: internal and external fragmentation
• No support for dynamic extension
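Why the lookup is trivial here, as a one-line sketch (the 1K block size matches the figures later on):

    #define BLOCK_SIZE 1024   /* bytes per disk block, as in the later figures */

    /* Contiguous allocation: the catalog stores just the first block. */
    long offset_to_block(long first_block, long offset) {
        return first_block + offset / BLOCK_SIZE;
    }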
Static and Contiguous Allocation (figure: catalog)
Extent-based allocation
• Files get blocks in contiguous chunks called extents
  • Multiple contiguous allocations
• For large files, a B-tree is used for efficient offset lookup
Extent-based allocation (figure)
Extent-based allocation
• Efficient offset lookup (sketched below) and disk access
• Support for dynamic growth/shrinking
  • Dynamic memory allocation techniques are used (e.g., first-fit)
• Suffers from external fragmentation
  • Remedy: compaction
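A minimal sketch of offset lookup over an extent list; the linear scan is illustrative, while real file systems keep the extents in a B-tree so large files stay fast:

    #include <stddef.h>

    #define BLOCK_SIZE 1024

    struct extent { long start; long nblocks; };   /* one contiguous run */

    /* Map a byte offset to a disk block by walking the extent list. */
    long offset_to_block(const struct extent *ext, size_t next, long offset) {
        long lblock = offset / BLOCK_SIZE;         /* logical block number */
        for (size_t i = 0; i < next; i++) {
            if (lblock < ext[i].nblocks)
                return ext[i].start + lblock;      /* covering extent found */
            lblock -= ext[i].nblocks;
        }
        return -1;                                 /* offset past end of file */
    }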
Single-block allocation
• Extent-based allocation with a fixed extent size of one disk block
• File blocks may be scattered anywhere on the disk
  • Inefficient sequential access
• Examples:
  • UNIX block allocation
  • Linked allocation
  • MS-DOS File Allocation Table (FAT)
Block Allocation in UNIX
• 10 direct pointers
• 1 single indirect pointer: points to a block of N pointers to blocks
• 1 double indirect pointer: points to a block of N pointers, each of which points to a block of N pointers to blocks
• 1 triple indirect pointer…
• Overall addresses 10 + N + N^2 + N^3 disk blocks
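A sketch of the resulting offset lookup ("bmap"); read_block is a hypothetical helper that returns the pointer array stored in a given disk block, and N = 256 is an assumed pointer count:

    #define NDIRECT 10
    #define N       256                  /* pointers per indirect block (assumed) */

    struct inode {
        long direct[NDIRECT];            /* 10 direct pointers */
        long single_ind, double_ind, triple_ind;
    };

    long *read_block(long blkno);        /* hypothetical: fetch a pointer block */

    /* Map a file-relative logical block number to a disk block number. */
    long bmap(const struct inode *ip, long lb) {
        if (lb < NDIRECT)
            return ip->direct[lb];                        /* direct */
        lb -= NDIRECT;
        if (lb < N)
            return read_block(ip->single_ind)[lb];        /* single indirect */
        lb -= N;
        if (lb < (long)N * N)                             /* double indirect */
            return read_block(read_block(ip->double_ind)[lb / N])[lb % N];
        lb -= (long)N * N;
        if (lb < (long)N * N * N)                         /* triple indirect */
            return read_block(read_block(read_block(ip->triple_ind)
                       [lb / (N * N)])[(lb / N) % N])[lb % N];
        return -1;                                        /* beyond max file size */
    }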
Block Allocation in UNIX (figure)
Block Allocation in UNIX
• Optimized for small files
  • Dated empirical studies indicate that 98% of all files are under 80 KB
• Poor performance for random access to large files
• No external fragmentation
• Wasted space in pointer blocks for large sparse files
• Modern UNIX implementations use extent-based allocation
Linked Allocation
• Each file is a linked list of disk blocks
• Offset lookup is efficient for sequential access, but inefficient for random access
• Access to large files may be inefficient as the blocks are scattered
  • Solution: block clustering
• No external fragmentation, but space is wasted on a pointer in each block
Linked Allocation (figure: catalog)
File Allocation Table (FAT)
• A section at the beginning of the disk is set aside to contain the table
• The table is indexed by disk block number, with an entry for each disk block (or for a cluster thereof)
• Entries of blocks belonging to the same file are chained
• The last block of a file, unused blocks, and bad blocks have special markings
FAT (figure: catalog entry)
FAT Pros and Cons
• Improved random access: just search a small table instead of the whole disk
• Inefficient sequential access: a seek back to the table and forth to the block for each file block!
• Block allocation is easy: just find the first block marked 0 (both sketched below)
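A minimal sketch of the chaining and allocation just described; the special markings are simplified to two constants (real FAT variants use reserved bit patterns):

    #define FAT_FREE  0     /* unused block, per the "first 0-marked" rule */
    #define FAT_EOF  -1     /* simplified end-of-file marking */

    /* Follow the chain to the disk block holding logical block n. */
    int fat_lookup(const int *fat, int first_block, int n) {
        int b = first_block;            /* chain head comes from the catalog */
        while (n-- > 0 && b != FAT_EOF)
            b = fat[b];                 /* one table hop per file block */
        return b;                       /* FAT_EOF if the file is too short */
    }

    /* Allocate a block: scan the table for the first free entry. */
    int fat_alloc(int *fat, int nblocks) {
        for (int b = 0; b < nblocks; b++)
            if (fat[b] == FAT_FREE) { fat[b] = FAT_EOF; return b; }
        return -1;                      /* disk is full */
    }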
Free space management
• Disk bitmap: represents the disk block allocation as an array of bits
  • One bit per disk block: 1 = free block, 0 = allocated block (sketched below)
  • Simple and efficient for finding free blocks
  • Wastes space on disk
• Linked list of free blocks (UNIX)
  • Efficient for finding a single free block
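A sketch of the bitmap operations under the slide's convention (1 = free, 0 = allocated); real implementations scan a word at a time rather than bit by bit:

    #include <stdint.h>

    /* Return the number of the first free block, or -1 if none. */
    int find_free_block(const uint8_t *bitmap, int nblocks) {
        for (int b = 0; b < nblocks; b++)
            if (bitmap[b / 8] & (1u << (b % 8)))   /* 1 = free */
                return b;
        return -1;
    }

    /* Clear the block's bit: it is now allocated. */
    void mark_allocated(uint8_t *bitmap, int b) {
        bitmap[b / 8] &= (uint8_t)~(1u << (b % 8));
    }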
Next: File System continued
• File I/O
• Organization, performance
• Atomicity and consistency
• Etc.
File I/O
• The CPU cannot access the file data directly: it must first be brought into the main memory
• How is this done efficiently?
  • Read/write mapping using the buffer cache
  • Memory-mapped files
Read/Write Mapping
• File data is made available to applications via a pre-allocated main memory region: the buffer cache
• The file system transfers data between the buffer cache and the disk in the granularity of disk blocks
• The data is explicitly copied from/to the buffer cache to/from the application address space
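A sketch of a read that goes through the cache; getblk is a hypothetical helper that returns the cached copy of a block, reading it from disk on a miss. Note the extra copy into the user buffer, revisited under the disadvantages below:

    #include <string.h>

    #define BLOCK_SIZE 1024

    char *getblk(int fcb_id, long lblock);   /* hypothetical cache lookup */

    /* Copy len bytes of the file, starting at offset, into user_buf. */
    size_t cached_read(int fcb_id, long offset, char *user_buf, size_t len) {
        size_t done = 0;
        while (done < len) {
            long   lblock = (offset + done) / BLOCK_SIZE;
            size_t within = (size_t)((offset + done) % BLOCK_SIZE);
            size_t chunk  = BLOCK_SIZE - within;
            if (chunk > len - done)
                chunk = len - done;
            /* The whole 1K block sits in the cache; only the requested
             * bytes are copied into the application address space. */
            memcpy(user_buf + done, getblk(fcb_id, lblock) + within, chunk);
            done += chunk;
        }
        return done;
    }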
Read/Write Mapping (figure)
Reading data (figure; disk block = 1K)
Writing data (figure; disk block = 1K)
Buffer Cache management
• All disk I/O goes through the buffer cache
• Both user data and control data (e.g., i-nodes) are cached
• LRU replacement
• A dirty (modified) marker indicates whether write-back is needed
Advantages
• Strict separation of concerns
• Hides disk access peculiarities from the user
  • Block size, memory alignment, memory allocation in multiples of the block size, etc.
• Disk blocks are cached
  • Aggregation of small transfers (locality)
  • Block re-use across processes
  • Transient data might never be written to disk
Disadvantages
• Extra copying: disk -> buffer cache -> user space
• Vulnerability to failures
  • The cache takes no special care of the user data blocks
  • The control data blocks (metadata) are the real problem: i-nodes and pointer blocks can be in the cache when a failure occurs
  • As a result, the file system's internal state might become corrupted
A complete UNIX example (figure)
Memory mapped files
• A file (or a portion thereof) is mapped into a contiguous region of the process virtual memory
  • UNIX: the mmap system call
• The mapping operation itself is very efficient: it just marks the region, moving no data up front
• Access to the file is then governed by the virtual memory subsystem
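A minimal sketch using the POSIX mmap call (the file name is just an example, and error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Map the whole file into the process virtual address space. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);                      /* the mapping remains valid */

        /* File bytes are now plain memory accesses; the VM subsystem
         * pages them in from disk on demand. */
        if (st.st_size > 0)
            printf("first byte: 0x%02x\n", (unsigned char)p[0]);

        munmap(p, st.st_size);
        return 0;
    }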
Mmapped files: Pros and Cons
• Advantages:
  • Reduced copying
  • No need for a pre-allocated buffer cache in the main memory
• Disadvantages:
  • Less (or no) control over when the data is actually written to disk: the file data becomes volatile
  • A mapped area must fit into the virtual address space
Reliability and Recovery
• File system data consists of control data (metadata) and user data
• Failures can cause data loss and corruption
  • Cached data is lost on a crash
  • A power failure during a sector write may physically corrupt the data stored in the sector
Metadata vs. User data
• Loss or corruption of the metadata might lead to massive user data loss
  • File systems must take care of the metadata
• File systems usually do not care much about the user data
  • Operation semantics?
  • Users must take care of their data themselves (e.g., backups)
Reliability and caching
• Caching affects the WRITE semantics
  • When the write operation returns, is it guaranteed that the requested data is indeed written on disk?
  • What if some of the dirty blocks in the cache are metadata blocks?
• Solutions:
  • Write-through: writes are propagated to disk immediately
  • Write-back: dirty blocks are written to disk asynchronously
User data reliability in UNIX
• Based on the write-back policy
  • User data is written back to disk periodically
• POSIX-compatible semantics
• The sync command and the fsync call force a write of the dirty blocks (sketched below)
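A minimal sketch of forcing durability with the POSIX calls the slide names; write alone may only dirty the cache, while fsync blocks until the file's dirty blocks reach the disk:

    #include <unistd.h>

    /* Write a buffer and force it (and the file's metadata) to disk. */
    int durable_write(int fd, const void *buf, size_t len) {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;                  /* short write or error */
        return fsync(fd);               /* forced write-back for this file */
    }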
Metadata reliability
• Based on the write-through policy
  • Updates are written to disk immediately
• Some data is not written in place
  • The file system can go back to the last consistent version
• Some data is replicated
  • UNIX: the superblock
• The file system goes through a consistency check/repair cycle at boot time
  • fsck, ScanDisk