File Systems Reliability • File systems are meant to store data persistently • Meaning they are particularly sensitive to errors that screw things up • Other elements can sometimes just reset and restart • But if a file is corrupted, that’s really bad • How can we ensure our file system’s integrity is not compromised?
Causes of System Data Loss • OS or computer stops with writes still pending • 0.1-100/year per system • Defects in media render data unreadable • 0.1-10/year per system • Operator/system management error • 0.01-0.1/year per system • Bugs in file system and system utilities • 0.01-0.05/year per system • Catastrophic device failure • 0.001-0.01/year per system
Dealing With Media Failures • Most media failures are for a small section of the device, not huge extents of it • Don't use known bad sectors • Identify all known bad sectors (factory list, testing) • Assign them to a “never use” list in file system • Since they aren't free, they won't be used by files • Deal promptly with newly discovered bad blocks • Most failures start with repeated “recoverable” errors • Copy the data to another block ASAP • Assign new block to file in place of failing block • Assign failing block to the “never use” list
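The bad-block steps above can be sketched as a toy allocator. This is only an illustration: `BlockAllocator`, `retire`, and the in-memory `data` dictionary are hypothetical names invented for the example, not part of any real file system.

```python
class BlockAllocator:
    """Toy allocator that tracks free blocks and a "never use" list."""

    def __init__(self, num_blocks, known_bad=()):
        self.never_use = set(known_bad)
        # Known bad blocks are never placed on the free list.
        self.free = [b for b in range(num_blocks) if b not in self.never_use]
        self.data = {}  # block number -> contents (stands in for the disk)

    def allocate(self):
        return self.free.pop(0)

    def retire(self, file_blocks, index):
        """Replace the failing block at file_blocks[index] with a fresh one."""
        failing = file_blocks[index]
        replacement = self.allocate()
        # Copy the data while the failing block is still readable.
        self.data[replacement] = self.data.get(failing)
        file_blocks[index] = replacement      # new block takes its place in the file
        self.data.pop(failing, None)
        self.never_use.add(failing)           # the failing block is never reused
        return replacement


alloc = BlockAllocator(num_blocks=8, known_bad=[2])
file_blocks = [alloc.allocate(), alloc.allocate()]   # blocks 0 and 1
alloc.data[file_blocks[1]] = b"payload"
new_block = alloc.retire(file_blocks, 1)             # block 1 starts failing
print(file_blocks, sorted(alloc.never_use))
```

Because the retired block goes to the "never use" set rather than back to the free list, no later allocation can hand it out again.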
Problems Involving System Failure • Delayed writes lead to many problems when the system crashes • Other kinds of corruption can also damage file systems • We can combat some of these problems using ordered writes • But we may also need mechanisms to check file system integrity • And fix obvious problems
Deferred Writes – Promise and Dangers • Deferring disk writes can be a big performance win • When user updates files in small increments • When user repeatedly updates the same data • It may also make sense for meta-data • Writing to a file may update an indirect block many times • Unpacking a zip creates many files in same directory • It also allocates many consecutive inodes • But deferring writes can also create big problems • If the system crashes before the writes are done • Some user data may be lost • Or even some meta-data updates may be lost
Performance and Integrity • It is very important that file systems be fast • File system performance drives system performance • It is absolutely vital that they be robust • Files are used to store important data • E.g., student projects, grades, video games, … • We must know that our files are safe • That the files will not disappear after they are written • That the data will not be corrupted
Deferred Writes – A Worst Case Scenario • Process allocates a new block for file A • We get a new block (x) from the free list • We write the updated inode for file A • Including a pointer to x • We defer free-list write-back (which happens all the time) • The system crashes, and after it reboots • A new process wants a new block for file B • We get block x from the (stale) free list • Two different files now contain the same block • When file A is written, file B gets corrupted • When file B is written, file A gets corrupted
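The scenario above can be replayed in a few lines. This is a minimal simulation, assuming a dictionary `disk` for state that survives a crash and a dictionary `memory` for cached state that does not; the block numbers are arbitrary.

```python
import copy

# On-disk state (survives a crash) and in-memory state (lost in a crash).
disk = {"free_list": [7, 8, 9], "inodes": {"A": [], "B": []}}
memory = copy.deepcopy(disk)

# Allocate block x for file A from the in-memory free list.
x = memory["free_list"].pop(0)
memory["inodes"]["A"].append(x)

# The inode for A is written back, but the free-list write is deferred.
disk["inodes"]["A"] = list(memory["inodes"]["A"])

# Crash: memory is lost, and the reboot reloads whatever is on disk.
memory = copy.deepcopy(disk)

# A new process allocates a block for file B from the stale free list.
y = memory["free_list"].pop(0)
memory["inodes"]["B"].append(y)

shared = set(memory["inodes"]["A"]) & set(memory["inodes"]["B"])
print("blocks shared by A and B:", shared)
```

Both files end up pointing at the same block because the on-disk free list never recorded the first allocation.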
File System Corruption • Several common types • Missing data (user data not found in file) • Missing space (neither allocated to file, nor free) • Un-referenced files (not found in any directory) • Same space allocated to multiple files • Usually result from writes that don't complete • Assign block to file, but block not written out to disk • Assign block to file, but free-list not updated on disk • All of these are aggravated by deferred writes • Question: Why isn’t this problem solved by simply making the write an atomic action?
Ordering Writes • Many file system corruption problems can be solved by carefully ordering related writes • Write out data before writing pointers to it • Unreferenced objects can be garbage collected • Pointers to incorrect data/meta-data are much more serious • Write out deallocations before allocations • Disassociate resources from old files ASAP • Free list can be corrected by garbage collection • Improperly shared blocks more serious than unlinked ones • But it may reduce disk I/O efficiency • Creating more head motion than elevator scheduling
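One way to see why ordering helps is to crash a block allocation after each individual write and inspect what is left on disk. The sketch below is illustrative only: `ordered_append`, `dangerous`, and `leaked` are hypothetical helpers, and the order used (data, then free list, then inode pointer) is one ordering consistent with the rules above.

```python
import copy

def ordered_append(disk, name, block, payload, crash_after=None):
    """Append `block` to file `name`; stop after `crash_after` disk writes."""
    def write_data():
        disk["blocks"][block] = payload
    def write_free_list():
        disk["free"].remove(block)
    def write_inode():
        disk["inodes"][name].append(block)

    # Data first, then the free list, and the pointer to the data last.
    for done, write in enumerate([write_data, write_free_list, write_inode]):
        if crash_after is not None and done >= crash_after:
            break
        write()
    return disk

def dangerous(disk):
    """True if any inode points at unwritten data or at a block still free."""
    return any(
        b not in disk["blocks"] or b in disk["free"]
        for ptrs in disk["inodes"].values() for b in ptrs
    )

def leaked(disk, num_blocks):
    """Blocks that are neither referenced nor free (fixable by garbage collection)."""
    referenced = {b for ptrs in disk["inodes"].values() for b in ptrs}
    return set(range(num_blocks)) - referenced - set(disk["free"])

EMPTY = {"blocks": {}, "free": [0, 1, 2], "inodes": {"A": []}}
for crash_after in range(4):
    disk = ordered_append(copy.deepcopy(EMPTY), "A", 0, b"data", crash_after)
    print(crash_after, "dangerous:", dangerous(disk), "leaked:", leaked(disk, 3))
```

With this ordering, the worst outcome of a crash is a leaked block, which an audit can reclaim; no crash point leaves a pointer to bad data or a block that is both allocated and free.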
Auditing and File System Checks • Design file system structures to allow for audit and repair • Keep redundant information in multiple distinct places • Maintain reference counts in each object • Children have pointers back to their parents • Transaction logs of all updates • All resources can be garbage collected • Discover and recover unreferenced objects • Audit file system for correctness (prior to mount) • All objects well formatted • All references and free-lists correct and consistent • Use redundant info to enable automatic repair • Note: Modern machines may go months without being shut down. Thus, file systems might not get mounted again for a long time, allowing much corruption between audits. How about auditing while the file system is on-line?
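A consistency audit of the kind described above can be sketched as a cross-check between block pointers and the free list. This is a simplified illustration, assuming inodes are just lists of block numbers; `audit` and `repair_free_list` are hypothetical names, and real checkers examine many more structures.

```python
from collections import Counter

def audit(num_blocks, inodes, free_list):
    """Report multiply-allocated, allocated-and-free, and missing blocks."""
    refs = Counter(b for blocks in inodes.values() for b in blocks)
    free = set(free_list)
    return {
        "shared": sorted(b for b, n in refs.items() if n > 1),
        "allocated_and_free": sorted(set(refs) & free),
        "missing": sorted(set(range(num_blocks)) - set(refs) - free),
    }

def repair_free_list(num_blocks, inodes):
    """Rebuild the free list by garbage collection: free = not referenced."""
    referenced = {b for blocks in inodes.values() for b in blocks}
    return [b for b in range(num_blocks) if b not in referenced]

inodes = {"A": [0, 1], "B": [1, 2]}
report = audit(6, inodes, free_list=[2, 3])
print(report)
print("rebuilt free list:", repair_free_list(6, inodes))
```

The free list can be repaired automatically because it is redundant with the inode pointers. A shared block cannot be fixed this way, since the audit cannot tell which file owns the data.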
Backup – The Ultimate Solution • All files should be regularly backed up • Permits recovery from catastrophic failures • Complete vs. incremental back-ups • Desirable features • Ability to back-up a running file system • Ability to restore individual files • Ability to back-up w/o human assistance • Should be considered as part of FS design • I.e., make file system backup-friendly
Miscellaneous File System Issues • RAID • Journaling file systems
RAID • Disks are the weak point of any computer system • Reliability: disk drives are subject to mechanical wear • Mis-seeks: resulting in corrupted or unreadable data • Head crashes: resulting in catastrophic data loss • Performance: limited by seek and transfer speeds • These limitations are inherent in the technology • Moving heads and rotating media • Don’t try to build super-fast or reliable disks • Instead, build RAID! • Redundant Array of Independent Disks • Combine multiple cheap disks for better performance
Basics of RAID • Buy several cheap commodity disks • Treat them as one integrated storage resource • Use the multiplicity of disks to hide reliability and performance problems • Since they follow a common block I/O interface, no need to alter higher level OS code • Several different RAID “levels” • With different benefits and costs
RAID-0 (Striping) • Combine disks to get a larger virtual drive • Striping: alternate tracks are on alternate physical drives • Concatenation: 1st 500 cylinders on drive 1, 2nd 500 on drive 2 • Benefits • Increased capacity (file systems larger than a physical disk) • Read/write throughput (spread traffic out over multiple drives) • Cost • Increased susceptibility to failure • Failure of either drive likely to cause effective loss of most or all files stored on both drives • Especially for striping
So Why Not Use the Concatenation Approach? • Lower effective I/O throughput than striping • Why? • File systems try to pack related data and metadata into a single cylinder or a few nearby cylinders • Striping puts each track on a different drive • So different drives are likely to receive requests • And we will see an improvement in throughput • With concatenated volumes, all activity in a given file/set of files will go to one drive • So parallelism is less likely
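The difference between the two layouts comes down to how a logical address maps to a drive. The sketch below is an assumption-laden simplification: it uses a generic "unit" (a track, in the slides' terms) and the hypothetical helpers `striped` and `concatenated`.

```python
def striped(unit, num_drives):
    """Map a logical unit to (drive, unit on that drive), alternating drives."""
    return unit % num_drives, unit // num_drives

def concatenated(unit, units_per_drive):
    """Map a logical unit to (drive, unit on that drive), filling drives in order."""
    return unit // units_per_drive, unit % units_per_drive

# Four adjacent units, as a file system would use for related data.
nearby = range(100, 104)
striped_drives = {striped(u, 2)[0] for u in nearby}
concat_drives = {concatenated(u, 500)[0] for u in nearby}
print("striping touches drives:", striped_drives)
print("concatenation touches drives:", concat_drives)
```

Adjacent units land on both drives under striping but on a single drive under concatenation, which is why striping offers more opportunity for parallel I/O.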
RAID-1 (Mirroring) • Two copies of everything • All writes are sent to both disks • Reads can be satisfied from either disk • Benefits • Redundancy (data survives failure of one disk) • Read throughput (can be doubled) • Cost • Requires twice as much disk
Mirroring and Throughput • Mirroring can improve read throughput • Any desired block can be found on either volume • So we can distribute our reads evenly over both drives • Giving us two DMA channels and two head assemblies • But not write throughput • Writes must be simultaneously done to both drives • So both are tied up and can’t handle other requests • Question: Why should the writes be simultaneous?
RAID-5 (Blockwise Striping With Parity)
disk 1   disk 2       disk 3       disk 4
1A       1B           1C           XOR 1A-1C
2A       2B           XOR 2A-2C    2C
3A       XOR 3A-3C    3B           3C
• Dedicate 1/Nth of the space to parity • Write data on N-1 corresponding blocks • Nth block contains XOR of the N-1 data blocks • Benefits • Data can survive loss of any one drive • Much more space efficient than mirroring • Cost • Slower and more complex write performance
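The parity idea can be illustrated with a bytewise XOR. This is a minimal sketch using tiny two-byte "blocks"; `xor_blocks` is a hypothetical helper, not any RAID implementation's API.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (N-1 = 3 in a four-drive group).
a, b, c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks(a, b, c)

# If the drive holding b is lost, XOR of the survivors and parity recovers it.
rebuilt_b = xor_blocks(a, c, parity)
print(rebuilt_b == b)
```

Any single missing block can be rebuilt the same way, because XOR-ing a value twice cancels it out.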
RAID-5 Operations • How many writes does it take to do a one block write in a four-drive RAID-5 group? • Two • One to the primary drive for that block • Another to the parity drive for that block. • We might have to do two other reads to recompute the parity block • How many reads does it take to do a one block read in a (working) four-drive RAID-5 group? • Only one, from the primary drive for that block • How many reads does it take to do a one block read in a four-drive RAID-5 group with one drive down? • One if the primary copy is on a working drive • Otherwise three (the two corresponding blocks and the parity block)
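The I/O counts above can be checked with a small model of one stripe. The sketch assumes the read-modify-write approach to updating parity (new parity = old parity XOR old data XOR new data); `small_write` and `degraded_read` are hypothetical names for this example.

```python
def xor(x, y):
    return bytes(p ^ q for p, q in zip(x, y))

def small_write(stripe, parity_idx, target_idx, new_data):
    """Update one data block and its parity; return (reads, writes) performed."""
    old_data = stripe[target_idx]        # read 1: old contents of the block
    old_parity = stripe[parity_idx]      # read 2: old parity
    stripe[target_idx] = new_data        # write 1: the data block
    stripe[parity_idx] = xor(xor(old_parity, old_data), new_data)  # write 2: parity
    return 2, 2

def degraded_read(stripe, failed_idx):
    """Rebuild the block on the failed drive; return (block, reads performed)."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_idx]
    result = survivors[0]
    for blk in survivors[1:]:
        result = xor(result, blk)
    return result, len(survivors)

# Four-drive stripe: three data blocks plus parity in the last slot.
data = [b"\x01", b"\x02", b"\x04"]
stripe = data + [xor(xor(data[0], data[1]), data[2])]

ios = small_write(stripe, parity_idx=3, target_idx=1, new_data=b"\x0f")
rebuilt, reads = degraded_read(stripe, failed_idx=1)
print(ios, rebuilt, reads)
```

The two extra reads in `small_write` are the cost behind "slower and more complex write performance": a one-block update turns into four disk operations.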
RAID Implementation • RAID is implemented in many different ways • As part of the disk driver • These were the original implementations • Between block I/O and the disk drivers (e.g., Veritas) • Making it independent of disks and controllers • Absorbed into the file system (e.g., ZFS) • Permitting smarter implementation • Built into disk controllers • Potentially more reliable • Significantly off-loads the host OS • Exploit powerful Storage Area Networking (SAN) fabric
Advantages of RAID in the Disk Controller • As opposed to in the operating system or device driver • A disk controller doesn’t necessarily crash when the OS does • It can continue completing its I/O as long as it is powered • A disk controller might store pending disk writes in NVRAM • So that they are not lost, even after a power failure • But likely to be extra hardware costs • And you’re limited to what manufacturers offer
Journaling File Systems • Crashes can cause loss of file system data and metadata • Updates that were scheduled but not completed • Each update typically involves changes to multiple objects • Incomplete updates result in corrupted file systems • What if we knew what updates hadn’t happened? • Journaling file systems keep that information
Augmenting a File System With a Journal • Maintain a circular intent journal in each file system • Log each operation to this journal before performing it • Don’t acknowledge an operation until it has been journaled • If the system crashes, replay the journal upon restart • Has potential to eliminate lost data, corruption
Why Does This Help? • We still need to end up writing everything • The journal is yet another write, so why does it help? • Many operations involve multiple disk updates • Directory entries, inodes, free-list, data blocks • Each of those is written to a different place • All information describing those writes can be combined • So the journal write can be a single block to one place
What Happens in Case of a Failure? • If before the journal write, nothing is there • And the application wasn’t told the write succeeded • If after all operations complete, no problem • In between, • If only the journal was written, it describes everything else you need to do • So do those things • If some other operations completed, determine which ones and take care of the others
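The three failure cases above can be modeled with a toy journaled store. Everything here is an assumption made for illustration: `JournaledStore`, its `crash_after` parameter (the number of disk writes that complete before the simulated crash), and the choice to recover by re-applying every update in a pending record, which is safe only because these updates are idempotent.

```python
class JournaledStore:
    """Toy store: `journal` and `disk` survive crashes; nothing else does."""

    def __init__(self):
        self.journal = []   # pending records, each a dict of key -> new value
        self.disk = {}

    def update(self, changes, crash_after=None):
        """Journal `changes`, then apply them. Returns True if acknowledged."""
        def crashed(writes_done):
            return crash_after is not None and writes_done >= crash_after

        if crashed(0):
            return False                 # crash before the journal write
        record = dict(changes)
        self.journal.append(record)      # one write describes every update
        writes_done = 1
        for key, value in record.items():
            if crashed(writes_done):
                return True              # journaled, so it was acknowledged
            self.disk[key] = value
            writes_done += 1
        self.journal.remove(record)      # all done: discard the journal entry
        return True

    def recover(self):
        """Replay pending records after a restart."""
        for record in list(self.journal):
            for key, value in record.items():
                self.disk[key] = value   # re-applying a completed write is harmless
            self.journal.remove(record)


changes = {"inode": "A->7", "free_list": "8,9", "dir": "A"}

store = JournaledStore()
acked = store.update(changes, crash_after=2)   # journal + one update, then crash
partial = dict(store.disk)
store.recover()

lost = JournaledStore()
lost_acked = lost.update(changes, crash_after=0)   # crash before journaling
print(acked, partial, store.disk, lost_acked, lost.disk)
```

A crash before the journal write leaves nothing behind and nothing acknowledged, while a crash afterward leaves a record that is enough to finish the operation.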
Journaling and Persistent RAM • Journaling systems often write the journal to persistent RAM • Such as flash memory • Quicker than writing to disk • Most journaled operations will be performed successfully • After which you can get rid of the journal entries • So you don’t need that much NVRAM • Which is good, since it’s more expensive than disk
Conclusion • All file systems provide the same basic functions • Allocating space to files, maintaining free space • Associating names with files, managing name space • Support for multiple independent volumes • Different file systems offer different abstractions • Access methods (sequential, random, stream) • They tend to be optimized for different applications • They are all judged on the same basic criteria • Performance, robustness, ease of use • These implementations can teach us many tricks