Why panic()? Improving Reliability through Restartable File Systems
Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift
Data Availability • Applications require data • Use FS to reliably store data • Both hardware and software can fail • Typical solution: large clusters for availability, reliability through replication [Diagram: replicated GFS masters with slave nodes]
User Desktop Environment • Replication infeasible for desktop environments • Wouldn't RAID work? Can only tolerate H/W failures • FS crashes are more severe: services/applications are killed, requiring OS reboot and recovery • Need: better reliability in the event of file system failures [Diagram: applications running on an OS with a file system, backed by a RAID controller and disks]
Outline Motivation Background Restartable file systems Advantages and limitations Conclusions
Failure Handling in File Systems • Exception paths not tested thoroughly • Exceptions: failed I/O, bad arguments, null pointer • On errors: call panic, BUG, BUG_ON • After failure: data becomes inaccessible • Reason for no recovery code: hard to apply corrective measures, not straightforward to add recovery
Real-world Example: Linux 2.6.15 ReiserFS

    int journal_mark_dirty(....) {
        struct reiserfs_journal_cnode *cn = NULL;
        if (!cn) {
            cn = get_cnode(p_s_sb);
            if (!cn) {
                reiserfs_panic(p_s_sb, "get_cnode failed!\n");
            }
        }
    }

File systems already detect failures

    void reiserfs_panic(struct super_block *sb, ...) {
        BUG(); /* this is not actually called, but makes reiserfs_panic() "noreturn" */
        panic("REISERFS: panic %s\n", error_buf);
    }

Recovery: simplified by a generic recovery mechanism
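A minimal user-space sketch of the idea behind that last point: instead of calling BUG()/panic() and halting the machine, a failed file-system assertion hands control to a generic restart routine. The FS_ASSERT macro and the membrane_restart_fs() hook are hypothetical names for illustration, not Linux or ReiserFS code.

```c
/* Illustrative sketch only: a panic-style check redirected to a restart path. */
#include <stdio.h>

/* Hypothetical entry point into the generic recovery mechanism. */
static void membrane_restart_fs(const char *fs_name, const char *reason)
{
    fprintf(stderr, "restarting %s: %s\n", fs_name, reason);
    /* unwind in-flight requests, revert to the last checkpoint, replay the log ... */
}

/* A panic-style macro rewritten to trigger a restart instead of BUG(). */
#define FS_ASSERT(fs, cond, msg)                  \
    do {                                          \
        if (!(cond))                              \
            membrane_restart_fs((fs), (msg));     \
    } while (0)

int main(void)
{
    void *cn = NULL;   /* stands in for get_cnode() failing */
    FS_ASSERT("reiserfs", cn != NULL, "get_cnode failed");
    return 0;
}
```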
Possible Solutions • Code to recover from all failures: not feasible in reality • Restart on failure: previous work has taken this approach • Heavyweight & stateful: CuriOS, EROS • Heavyweight & stateless: Xen, Minix, L4, Nexus • Lightweight & stateless: Nooks/Shadow, SafeDrive, Singularity • FSes need: stateful & lightweight recovery
Restartable File Systems • Goal: build a lightweight & stateful solution to tolerate file-system failures • Solution: a single, generic recovery mechanism for any file-system failure • Detect failures through assertions • Clean up resources used by the file system • Restore file-system state before the crash • Continue to service new file-system requests • FS failures: completely transparent to applications
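The four steps above can be read as one recovery sequence. The sketch below models it in user space; all structure and function names (fs_instance, unwind_in_flight, and so on) are illustrative stand-ins, not the system's actual interfaces.

```c
/* Sketch of the recovery sequence triggered by a fail-stop file-system fault. */
#include <stdio.h>

struct fs_instance { int last_checkpoint; int in_flight; };

static void unwind_in_flight(struct fs_instance *fs)    { fs->in_flight = 0; }
static void release_fs_resources(struct fs_instance *fs){ (void)fs; }
static void restore_checkpoint(struct fs_instance *fs)  { printf("back to epoch %d\n", fs->last_checkpoint); }
static void replay_completed_ops(struct fs_instance *fs){ (void)fs; }

/* Called when an assertion inside the file system fires. */
static void recover_fs(struct fs_instance *fs)
{
    unwind_in_flight(fs);        /* 1. park threads executing inside the FS     */
    release_fs_resources(fs);    /* 2. clean up locks, memory, in-core state    */
    restore_checkpoint(fs);      /* 3. roll state back to the last checkpoint   */
    replay_completed_ops(fs);    /* 4. replay logged ops, re-execute unwound ops */
}

int main(void)
{
    struct fs_instance fs = { .last_checkpoint = 3, .in_flight = 2 };
    recover_fs(&fs);
    return 0;
}
```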
Challenges • Transparency • Multiple applications using FS upon crash • Intertwined execution • Fault-tolerance • Handle a gamut of failures • Transform to fail-stop failures • Consistency • OS and FS could be left in an inconsistent state
Guaranteeing FS Consistency • Not all FSes support crash consistency • FS state constantly modified by applications • Periodically checkpoint FS state • Mark dirty blocks as copy-on-write • Ensure each checkpoint is atomically written • On crash: revert back to the last checkpoint • FS consistency required to prevent data loss
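One way to make "each checkpoint is atomically written" concrete is the classic trick of writing new block copies first and flipping a single checkpoint record last. The sketch below assumes a two-slot checkpoint record; this layout is invented for illustration and is not the paper's actual on-disk format.

```c
/* Sketch: copy-on-write blocks are written first, then one record update
 * commits the checkpoint, so a crash always leaves a complete checkpoint. */
#include <stdio.h>

#define NBLOCKS 4

struct checkpoint_record {
    int epoch;
    int block_map[NBLOCKS];    /* logical block -> on-disk location */
};

/* Two slots; the one with the higher epoch is the valid checkpoint. */
static struct checkpoint_record slots[2];

static void commit_checkpoint(int epoch, const int new_map[NBLOCKS])
{
    int target = (slots[0].epoch <= slots[1].epoch) ? 0 : 1;  /* overwrite the older slot */
    struct checkpoint_record rec = { .epoch = epoch };
    for (int i = 0; i < NBLOCKS; i++)
        rec.block_map[i] = new_map[i];
    /* ...all copy-on-write block writes must be durable before this point... */
    slots[target] = rec;       /* single record update = atomic commit point */
}

static const struct checkpoint_record *latest_checkpoint(void)
{
    return (slots[0].epoch >= slots[1].epoch) ? &slots[0] : &slots[1];
}

int main(void)
{
    int map[NBLOCKS] = { 10, 11, 12, 13 };
    commit_checkpoint(1, map);
    printf("recover from epoch %d\n", latest_checkpoint()->epoch);
    return 0;
}
```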
Overview of Our Approach
[Figure: an application issues open("file"), write(), read(), write(), write(), close() through the VFS to the file system; the timeline shows Epoch 0 and Epoch 1 with completed and in-progress operations at the moment of the crash]
1. Periodically create checkpoints
2. Crash
3. Unwind in-flight processes
4. Move to the most recent checkpoint
5. Replay completed operations
6. Re-execute unwound processes
Checkpoint Mechanism • File systems constantly modified • Hard to identify a consistent recovery point • Naïve Solution: Prevent any new FS operation and call sync • Inefficient and unacceptable overhead
Key Insight • All requests go through the VFS layer • File systems write to disk through the page cache • Control requests to the FS and dirty pages to disk [Diagram: applications → VFS → file systems (ext3, VFAT) → page cache → disk]
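Because the VFS is the single entry point, a checkpoint can briefly gate new requests there. A user-space sketch of that gating using a pthread condition variable; the function names are illustrative, not kernel interfaces.

```c
/* Sketch: every request enters through one VFS wrapper, so checkpointing can
 * briefly hold new requests at that chokepoint. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_open = PTHREAD_COND_INITIALIZER;
static int checkpoint_in_progress = 0;

/* Every VFS entry point (open, read, write, ...) funnels through here. */
static void vfs_enter(void)
{
    pthread_mutex_lock(&gate_lock);
    while (checkpoint_in_progress)
        pthread_cond_wait(&gate_open, &gate_lock);
    pthread_mutex_unlock(&gate_lock);
}

static void checkpoint_begin(void)
{
    pthread_mutex_lock(&gate_lock);
    checkpoint_in_progress = 1;          /* hold new requests at the VFS gate */
    pthread_mutex_unlock(&gate_lock);
}

static void checkpoint_end(void)
{
    pthread_mutex_lock(&gate_lock);
    checkpoint_in_progress = 0;
    pthread_cond_broadcast(&gate_open);  /* let queued requests proceed */
    pthread_mutex_unlock(&gate_lock);
}

int main(void)
{
    checkpoint_begin();
    /* ... mark dirty pages copy-on-write, record the new epoch ... */
    checkpoint_end();
    vfs_enter();
    printf("request admitted after checkpoint\n");
    return 0;
}
```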
Generic COW-based Checkpoint [Diagram: the stack (applications, VFS, file system, page cache, disk) at checkpoint and after checkpoint, comparing regular operation with Membrane; requests are briefly stopped at the VFS while the checkpoint is taken]
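A sketch of the copy-on-write step itself, modelled on in-memory pages: at a checkpoint the previous epoch's dirty pages are frozen, and a later write to a frozen page copies it first. The page structure and epoch fields below are illustrative, not kernel code.

```c
/* Sketch: freeze dirty pages at a checkpoint and copy them on the next write. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
    char data[PAGE_SIZE];
    int  epoch;     /* epoch that last dirtied this page          */
    int  frozen;    /* belongs to a checkpoint: copy before write */
};

static int current_epoch = 0;

static void checkpoint(struct page **pages, int n)
{
    for (int i = 0; i < n; i++)
        pages[i]->frozen = 1;    /* freeze the old epoch's dirty pages      */
    current_epoch++;             /* new writes belong to the next epoch     */
}

static struct page *page_write(struct page *p, const char *buf, size_t len)
{
    if (p->frozen) {             /* copy-on-write */
        struct page *copy = malloc(sizeof(*copy));
        memcpy(copy, p, sizeof(*copy));
        copy->frozen = 0;
        p = copy;
    }
    memcpy(p->data, buf, len < PAGE_SIZE ? len : PAGE_SIZE);
    p->epoch = current_epoch;
    return p;
}

int main(void)
{
    struct page *p = calloc(1, sizeof(*p));
    struct page *set[] = { p };
    checkpoint(set, 1);
    struct page *q = page_write(p, "new data", 8);
    printf("old epoch %d, new epoch %d\n", p->epoch, q->epoch);
    if (q != p)
        free(q);
    free(p);
    return 0;
}
```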
Interaction with Modern FSes • Modern FSes have a built-in crash-consistency mechanism: journaling or snapshotting • Seamlessly integrate with these mechanisms • Need FSes to indicate the beginning and end of a transaction • Works for data and ordered journaling modes • Need to combine writeback mode with COW
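A sketch of the transaction-boundary hooks suggested above: the file system brackets each journal transaction, and a requested checkpoint is deferred until no transaction is open. The hook names are invented for illustration; this is not a real kernel API.

```c
/* Sketch: checkpoints are only taken on journal-transaction boundaries. */
#include <stdio.h>

static int open_transactions = 0;
static int checkpoint_requested = 0;

static void maybe_checkpoint(void)
{
    if (checkpoint_requested && open_transactions == 0) {
        printf("checkpoint taken on a transaction boundary\n");
        checkpoint_requested = 0;
    }
}

/* Called by the file system around each journal transaction. */
void fs_transaction_begin(void) { open_transactions++; }
void fs_transaction_end(void)   { open_transactions--; maybe_checkpoint(); }

/* Called by the recovery layer when it wants a checkpoint. */
void request_checkpoint(void)   { checkpoint_requested = 1; maybe_checkpoint(); }

int main(void)
{
    fs_transaction_begin();
    request_checkpoint();        /* deferred: a transaction is still open */
    fs_transaction_end();        /* boundary reached, checkpoint happens  */
    return 0;
}
```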
Light-weight Logging • Log operations at the VFS level • Need not modify existing file systems • Operations: open, close, read, write, symlink, unlink, seek, etc. • Read: • Logs are thrown away after each checkpoint • What about logging writes?
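A sketch of what a VFS-level operation log could look like: each completed call is appended as a small record, and the whole log is discarded at the next checkpoint. The record layout below is an assumption for illustration, not the system's actual format.

```c
/* Sketch: a per-checkpoint operation log kept at the VFS layer. */
#include <stdio.h>

enum op_type { OP_OPEN, OP_CLOSE, OP_READ, OP_WRITE, OP_UNLINK };

struct op_record {
    enum op_type type;
    int          fd;
    long         offset;
    long         count;   /* for writes, the data itself stays in the page cache */
};

#define LOG_CAP 1024
static struct op_record op_log[LOG_CAP];
static int op_log_len = 0;

static void log_op(struct op_record rec)
{
    if (op_log_len < LOG_CAP)
        op_log[op_log_len++] = rec;
}

static void checkpoint_done(void)
{
    op_log_len = 0;   /* the log only needs to cover ops since the last checkpoint */
}

int main(void)
{
    log_op((struct op_record){ .type = OP_WRITE, .fd = 3, .offset = 0, .count = 4096 });
    printf("%d op(s) logged since last checkpoint\n", op_log_len);
    checkpoint_done();
    return 0;
}
```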
Page Stealing Mechanism • Mainly used for replaying writes • Goal: reduce the overhead of logging writes • Solution: grab data from the page cache during recovery [Diagram: write(fd, buf, offset, count) passing through the VFS, file system, and page cache, shown before the crash, during recovery, and after recovery]
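A sketch of page stealing during replay: write records carry only offsets and lengths, and recovery copies the payload out of the crashed instance's page cache before that instance is torn down. All names and structures below are illustrative.

```c
/* Sketch: recover write payloads from the old page cache instead of the log. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for the page cache of the crashed file-system instance. */
static char old_page_cache[4][PAGE_SIZE];

/* A logged write: offsets only, no data. */
struct write_record { int page_index; long offset; long count; };

/* During recovery, "steal" the payload from the old page cache... */
static void steal_page(const struct write_record *w, char *out)
{
    memcpy(out, old_page_cache[w->page_index], PAGE_SIZE);
}

/* ...then replay the write against the freshly restarted file system. */
static void replay_write(const struct write_record *w, const char *data)
{
    printf("replaying %ld bytes at offset %ld\n", w->count, w->offset);
    (void)data;
}

int main(void)
{
    strcpy(old_page_cache[1], "payload written before the crash");
    struct write_record w = { .page_index = 1, .offset = 4096, .count = 32 };

    char buf[PAGE_SIZE];
    steal_page(&w, buf);
    replay_write(&w, buf);
    return 0;
}
```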
Evaluation Setup
Recovery Time • Restart ext2 during a random-read microbenchmark
Advantages • Improves tolerance to file system failures • Builds trust in new file systems (e.g., ext4, btrfs) • Quick-fix bug patching: developers can transform corruptions into restarts, restarting instead of extensive code restructuring • Encourages more integrity checks in FS code: assertions can be seamlessly transformed into restarts • File systems become more robust to failures/crashes
Limitations • Only tolerates fail-stop failures • Not address-space based: faults could corrupt other kernel components • FS restart may be visible to applications, e.g., inode numbers could change after restart [Example: an application issues create("file1"), write("file1", 4k), stat("file1"); file1 is assigned inode# 12 before the crash and inode# 15 after crash recovery, so stat() sees a different inode number than before]
Conclusions • Failures are inevitable in file systems; learn to cope with them rather than hope to avoid them • Generic recovery mechanism for FS failures improves FS reliability and availability of data • Users: install new FSes with confidence • Developers: ship FSes faster, as not all exception cases are show-stoppers anymore
Thank You! Advanced Systems Lab (ADSL) University of Wisconsin-Madison http://www.cs.wisc.edu/adsl Questions and Comments