Membrane: Operating System Support for Restartable File Systems

Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift
Computer Sciences Department, University of Wisconsin, Madison
FAST '10
Speaker: T.C. Huang, 15/Mar/10

Membrane
Recovers file system state in a way that is:
• Lightweight
• Transparent
  • Users are not aware of the FS fault
  • The FS restarts and continues handling requests

Steps
• The file system detects a fault
• Membrane rolls back the FS state to the last checkpoint (the last trusted state)
• Replay the FS writes that occurred after the checkpoint
• Release kernel locks and free allocated memory
• Continue serving later requests as if no fault had occurred
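The steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyFS` and `Restartable` are hypothetical names, the "checkpoint" is a plain copy of a dictionary, and none of this reflects how Membrane is implemented inside the kernel.

```python
import copy


class ToyFS:
    """Hypothetical in-memory stand-in for a file system."""

    def __init__(self):
        self.files = {}  # file name -> bytes content

    def write(self, name, data):
        # Append data to the named file, creating it if needed.
        self.files[name] = self.files.get(name, b"") + data


class Restartable:
    """Wraps ToyFS with a checkpoint and a log of writes since that checkpoint."""

    def __init__(self):
        self.fs = ToyFS()
        self.checkpoint = copy.deepcopy(self.fs.files)  # last trusted state
        self.op_log = []  # writes issued after the checkpoint

    def write(self, name, data):
        self.op_log.append((name, data))  # log first, then apply
        self.fs.write(name, data)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.fs.files)
        self.op_log.clear()

    def recover(self):
        # Roll back to the checkpoint, then replay the logged writes in order.
        self.fs = ToyFS()
        self.fs.files = copy.deepcopy(self.checkpoint)
        for name, data in self.op_log:
            self.fs.write(name, data)
```

A caller would write through `Restartable.write`, and after a detected fault call `recover()` to rebuild the state that existed just before the fault.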

Fault Model
• Membrane does NOT handle all fault types
• Handled well
  • Transient: race condition / bit flip
  • Fail-stop: BUG()
• Handled poorly
  • Wild writes

The major drawback of Membrane is that the boundary it uses is soft: some file system bugs can still corrupt kernel state outside the file system, in which case recovery will not succeed.

Fault Detection
• Hardware: hardware-detectable faults
  • e.g. divide-by-zero, null pointer…
  • The fault handler checks the faulting instruction
• Software 1: use the existing checks in the FS code
  • Redefine panic() / BUG() / assert() in the FS code to forward to Membrane's recovery code
• Software 2: add parameter checks on every call from the FS into the kernel
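The three detection paths can be mimicked in user space as a rough sketch. Everything here is an assumption made for illustration: `FSFault`, `fs_bug`, `checked`, `kernel_alloc`, and `run_fs_op` are hypothetical names, and a Python `ZeroDivisionError` merely stands in for a hardware-detected fault.

```python
class FSFault(Exception):
    """Raised when a check inside the (toy) file system fails."""


def fs_bug(message):
    # Stand-in for a redefined BUG()/panic(): signal a fault instead of halting.
    raise FSFault(message)


def checked(validate):
    """Decorator: validate arguments on a call from the FS into the 'kernel'."""
    def decorate(kernel_func):
        def wrapper(*args):
            if not validate(*args):
                raise FSFault(f"bad arguments to {kernel_func.__name__}: {args!r}")
            return kernel_func(*args)
        return wrapper
    return decorate


@checked(lambda size: isinstance(size, int) and size > 0)
def kernel_alloc(size):
    # Hypothetical kernel service called by the file system.
    return bytearray(size)


def run_fs_op(op, recover):
    """Run an FS operation; route detected faults to the recovery callback."""
    try:
        return op()
    except (FSFault, ZeroDivisionError) as fault:
        # ZeroDivisionError stands in for a hardware-detected fault.
        return recover(fault)
```

The point of the sketch is that all three detectors funnel into one recovery entry point rather than crashing the whole system.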

Fault Anticipation
• Checkpoint
  • Divide FS operations into epochs (or transactions)
  • Ensure the on-disk checkpoint image is consistent
• State tracking
  • Track all updates, locks, and other state
  • Log them to in-memory logs and parallel stacks

Checkpointing
• A checkpoint is a consistent FS state
  • No operations are in flight
• For journaling / shadow-paging FSs
  • Use the built-in checkpoint mechanism
• For others, build a generic checkpoint mechanism at the VFS layer
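The "no operations are in flight" condition can be illustrated with a small gate that counts running operations and only checkpoints once the count reaches zero. This is a sketch under assumptions, not the generic VFS-layer mechanism itself: `EpochGate` and its methods are hypothetical names, and `flush` is a caller-supplied callback standing in for writing the checkpoint image.

```python
import threading


class EpochGate:
    """Admits FS operations and checkpoints only when none are in flight."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._checkpointing = False
        self.epoch = 0

    def enter(self):
        """Called at the start of an FS operation; returns its epoch number."""
        with self._cond:
            while self._checkpointing:  # new operations wait during a checkpoint
                self._cond.wait()
            self._in_flight += 1
            return self.epoch

    def leave(self):
        """Called when an FS operation completes."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def checkpoint(self, flush):
        """Wait for in-flight operations to drain, flush, then start a new epoch."""
        with self._cond:
            self._checkpointing = True
            while self._in_flight > 0:
                self._cond.wait()
            flush()  # stand-in for writing a consistent image
            self.epoch += 1
            self._checkpointing = False
            self._cond.notify_all()
```

Each operation would be bracketed by `enter()` and `leave()`, so the epoch number returned by `enter()` identifies which checkpoint interval the operation belongs to.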

Checkpoint for VFAT, ext2

Tracking State with Log/Stack
Track all changes after the checkpoint
• FS operations log
  • write, read, unlink, attribute change…
• Application-visible session log
  • file ID, file position, open epoch number, …

Low-cost Op-Logging

Tracking State with Log/Stack
• Mallocs table
  • Add a new GFP_RESTARTABLE flag that the memory-allocation layer recognizes
• Lock stack
  • Lock functions are modified to track acquisitions
  • Only global locks are saved
• Execution state stack
  • Also called the unwind stack
  • Function calls, registers…
  • Wrap all calls from the kernel into the FS to save this information
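The mallocs table and lock stack can be pictured with the following user-space sketch. It is an analogy only: `StateTracker` is a hypothetical name, the `restartable` argument loosely plays the role of the GFP_RESTARTABLE flag, and Python `threading.Lock` objects stand in for kernel locks.

```python
import threading


class StateTracker:
    """Tracks tagged allocations and held locks so recovery can undo them."""

    def __init__(self):
        self.allocs = {}      # id(buffer) -> buffer, for tagged allocations only
        self.lock_stack = []  # locks currently held, in acquisition order

    def alloc(self, size, restartable=True):
        buf = bytearray(size)
        if restartable:       # only tagged allocations are tracked
            self.allocs[id(buf)] = buf
        return buf

    def free(self, buf):
        self.allocs.pop(id(buf), None)

    def acquire(self, lock):
        lock.acquire()
        self.lock_stack.append(lock)

    def release(self, lock):
        self.lock_stack.remove(lock)
        lock.release()

    def cleanup(self):
        """On recovery: release held locks (newest first), drop tracked memory."""
        while self.lock_stack:
            self.lock_stack.pop().release()
        freed = len(self.allocs)
        self.allocs.clear()
        return freed
```

Releasing locks in reverse acquisition order is a design choice of this sketch; it mirrors how a stack would naturally be unwound.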

Fault Recovery
• Halt in-flight threads and park incoming / in-flight threads
  • Mark all code pages of the FS as non-executable
  • The FS's code pages are recorded at FS registration
  • All threads executing in the FS trap with a fault
• Unwind in-flight threads
  • Use the execution state stack
  • Skip/trust unwind protocol
  • Also release the related locks by consulting the lock stack
• Commit dirty pages of the previous epoch to storage
  • Only for ext2/VFAT or other simple FSs

Fault Recovery
• Unmount the file system
  • Free memory by consulting the mallocs table
• Remount the file system
• Roll forward
  • Use the VFS interface
  • Restore active session information from the session log
  • Replay the FS operations recorded in the FS operations log
• Restart execution
  • Wake up all parked threads
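Restoring sessions from the session log can be sketched with ordinary files: remember each open file's path and position, then reopen and seek after the "remount". The names `SessionLog` and `demo` are hypothetical, the handle number 3 is arbitrary, and closing the file merely stands in for the unmount.

```python
import os
import tempfile


class SessionLog:
    """Records application-visible open-file state so it can be restored."""

    def __init__(self):
        self.sessions = {}  # application-visible handle -> [path, position]

    def record_open(self, handle, path):
        self.sessions[handle] = [path, 0]

    def record_position(self, handle, position):
        self.sessions[handle][1] = position

    def restore(self):
        """Reopen every logged file and seek to its logged position."""
        restored = {}
        for handle, (path, position) in self.sessions.items():
            f = open(path, "rb")
            f.seek(position)
            restored[handle] = f
        return restored


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data.bin")
        with open(path, "wb") as f:
            f.write(b"hello world")
        log = SessionLog()
        f = open(path, "rb")
        log.record_open(3, path)
        f.read(6)                         # the application reads part of the file
        log.record_position(3, f.tell())
        f.close()                         # stand-in for the unmount
        files = log.restore()             # stand-in for remount + session restore
        rest = files[3].read()            # reading resumes at the logged position
        files[3].close()
        return rest
```

After `restore()`, the application's handle maps to a fresh file object positioned where the old one left off, which is what keeps the restart invisible to the application in this toy setting.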

Evaluation
• AMD 2.2 GHz CPU
• 2 x 80 GB disks
• 2 GB memory
• Linux 2.6.15
• ext2, VFAT
• ext3 with full data journaling

How Detected?
• o: kernel oops
• G: general protection fault
• i: invalid opcode
• d: fault detected
Application?
• √: keeps working
• X: killed by OS
• s: operation fault only
• e: application returns fault
Footnotes
• a: FS usable, but can't be unmounted
• b: late oops or fault

Fault Studies
• 91% of faults incur a kernel oops
• 1/3 of cases need a reboot and fsck
• Parameter (boundary) checks catch faults, but the FS cannot handle the returned error codes properly
• With Membrane, faults are detected and applications do not notice them

Benchmarks

Recovery Time
• With data = 0, open sessions = 0, and log records = 0, recovery time = 8.6 ms

Recovery Overhead under Rand-read

Generality

Conclusions
• File systems fail frequently (?)
• Membrane transforms a file system failure from a show-stopping event into a small performance issue.
