Improving File System Reliability with I/O Shepherding
Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison

Storage Reality
• Complex storage subsystem
  • Mechanical/electrical failures, buggy drivers
• Complex failures
  • Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.
• FS reliability is important
  • Managing disk and individual block failures
[Figure: the storage stack beneath the File System: Device Driver, Transport, Firmware, Media, Mechanical, Electrical]

File System Reality
• Good news:
  • Rich literature
    • Checksum, parity, mirroring
    • Versioning, physical/logical identity
  • Important in both single-disk and multiple-disk settings
• Bad news:
  • File system reliability is broken [SOSP’05]
    • Unlike other components (performance, consistency)
  • Reliability approaches are hard to understand and evolve

Broken FS Reliability
• Lack of good reliability strategy
  • No remapping, checksumming, redundancy
• Existing strategy is coarse-grained
  • Mount read-only, panic, retry
• Inconsistent policies
  • Different techniques in similar failure scenarios
• Bugs
  • Ignored write failures
Let’s fix them! With the current framework? Not so easy …

No Reliability Framework
• Diffused
  • Handle each fault in each I/O location
  • Different developers might increase diffusion
• Inflexible
  • Fixed policies, hard to change
  • But no policy fits all diverse settings
    • Less reliable vs. more reliable drives
    • Desktop workload vs. web-server apps
• The need for a new framework
  • Reliability is a first-class file system concern
[Figure: reliability policy scattered throughout the File System, which sits above the Disk Subsystem]

Localized
• I/O Shepherd
  • Localized policies, …
  • More correct, fewer bugs, simpler reliability management
[Figure: the Shepherd layer sits between the File System and the Disk Subsystem]

Flexible
• I/O Shepherd
  • Localized, flexible policies, …
[Figure: the Shepherd, between the File System and the Disk Subsystem, moves between less protection and more protection (More Retry, Checksum, Add Mirror) depending on the environment: less reliable vs. more reliable drive, ATA, SCSI, archival, scientific data, networked storage]

Powerful
• I/O Shepherd
  • Localized, flexible, and powerful policies
  • Composable policies
[Figure: as on the previous slide, with policies (More Retry, Checksum, Add Mirror) composed together; environments now include a custom drive alongside ATA, SCSI, archival, scientific data and networked storage]

Outline
• Introduction
• I/O Shepherd Architecture
• Implementation
• Evaluation
• Conclusion

Architecture
• Building reliability framework
  • How to specify reliability policies?
  • How to make powerful policies?
  • How to simplify reliability management?
• I/O Shepherd layer
  • Four important components
    • Policy table
    • Policy code
    • Policy primitives
    • Policy metadata
[Figure: the I/O Shepherd sits between the File System and the Disk Subsystem. It holds a policy table (Write, Read, …), policy code, primitives (SanityCheck, Lookup Location, OnlineFsck, Checksum, …) and policy metadata (Mirror-Map, Remap-Map, Checksum-Map)]
Policy code:
    DynMirrorWrite(DiskAddr D, MemAddr A) {
        DiskAddr copyAddr;
        IOS_MapLookup(MMap, D, &copyAddr);
        if (copyAddr == NULL) {
            PickMirrorLoc(MMap, D, &copyAddr);
            IOS_MapAllocate(MMap, D, copyAddr);
        }
        return (IOS_Write(D, A, copyAddr, A));
    }

Policy Table
• How to specify reliability policies?
  • Different block types, different levels of importance
  • Different volumes, different reliability levels
  • Need fine-grained policy
• Policy table
  • Different policies across different block types
  • Different policy tables across different volumes
[Figure: the Shepherd, below the File System, applies a separate policy table to each volume (/tmp, /boot, /lib, /archive); labels: High-level reliability, No protection]

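A minimal sketch of how such a policy table might be represented, assuming one table per volume keyed by block type. The function names, the block-type names, and the choice of "no protection" for /tmp and mirroring for /archive are illustrative assumptions, not the actual CrookFS interface.

```python
from typing import Callable, Dict

# Hypothetical policies; a real policy would issue disk I/O instead of
# returning a description of what it would do.
def no_protection(block: int) -> str:
    return f"write {block}"

def mirror(block: int) -> str:
    return f"write {block} + mirror copy"

Policy = Callable[[int], str]

# One policy table per volume, keyed by block type (names are illustrative).
POLICY_TABLES: Dict[str, Dict[str, Policy]] = {
    "/tmp": {"data": no_protection, "inode": no_protection},
    "/archive": {"data": mirror, "inode": mirror},
}

def shepherd_write(volume: str, block_type: str, block: int) -> str:
    """Dispatch a write to the policy registered for this volume and block type."""
    policy = POLICY_TABLES[volume][block_type]
    return policy(block)
```

Changing the reliability level of a volume then amounts to swapping entries in its table, without touching the file system code that issues the write.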
Policy Metadata
• What support is needed to make powerful policies?
  • Remapping: track bad block remapping
  • Mirroring: allocate new block
  • Sanity check: need on-disk structure specification
• Integration with file system
  • Runtime allocation
  • Detailed knowledge of on-disk structures
• I/O Shepherd maps
  • Managed by the shepherd
  • Commonly used maps: Mirror-map, Checksum-map, Remap-map
[Figure: the I/O Shepherd, below the File System, holding example Remap, Mirror-Map and Csum-Map tables with entries for blocks 1001, 1002, 1003, …]

Policy Primitives and Code
• How to make reliability management simple?
• I/O Shepherd primitives
  • Rich set and reusable
  • Complexities are hidden
• Policy writer simply composes primitives into policy code
Policy primitives:
  • Maps: Map Update, Map Lookup
  • Computation: Checksum, Parity
  • FS-Level: Sanity Check, Stop FS
  • Layout: Allocate Near, Allocate Far
Policy code:
    MirrorData(Addr D) {
        Addr M;
        MapLookup(MMap, D, M);
        if (M == NULL) {
            M = PickMirrorLoc(D);
            MapAllocate(MMap, D, M);
        }
        Copy(D, M);
        Write(D, M);
    }

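[Figure: the File System writes block D. The I/O Shepherd finds no Mirror-Map entry for D (NULL), picks a replica location R, records D → R in the Mirror-Map, and writes both D and R to the Disk Subsystem]
Policy code:
    MirrorData(Addr D) {
        Addr R;
        R = MapLookup(MMap, D);
        if (R == NULL) {
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        }
        Copy(D, R);
        Write(D, R);
    }
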
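The MirrorData policy can be sketched as runnable code against a simulated disk. This is an illustration only: the in-memory `disk` dictionary, the `MIRROR_BASE` constant and the allocation strategy in `pick_mirror_loc` are assumptions standing in for the shepherd primitives named on the slide.

```python
from typing import Dict, Optional

disk: Dict[int, bytes] = {}       # simulated disk: address -> contents
mirror_map: Dict[int, int] = {}   # Mirror-Map: address D -> replica address R
MIRROR_BASE = 1000                # hypothetical start of the replica region

def pick_mirror_loc(d: int) -> int:
    """Pick a free replica address (stand-in for PickMirrorLoc)."""
    r = MIRROR_BASE + d
    while r in disk:
        r += 1
    return r

def mirror_data(d: int, data: bytes) -> int:
    """Write data to D and to its replica R, allocating R on first use."""
    r: Optional[int] = mirror_map.get(d)   # MapLookup
    if r is None:
        r = pick_mirror_loc(d)
        mirror_map[d] = r                   # MapAllocate
    disk[d] = data                          # write the primary copy
    disk[r] = data                          # write the replica
    return r
```

A second write to the same block reuses the replica already recorded in the map, so the map stays at one entry per mirrored block.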
Summary
• Interposition simplifies reliability management
  • Localized policies
  • Simple and extensible policies
• Challenge: keeping new data and metadata consistent

Outline
• Introduction
• I/O Shepherd Architecture
• Implementation
  • Consistency Management
• Evaluation
• Conclusion

Implementation
• CrookFS
  • (named for the hooked staff of a shepherd)
  • An ext3 variant with I/O shepherding capabilities
• Implementation
  • Changes in core OS
    • Semantic information, layout and allocation interface, allocation during recovery
    • Consistency management (data journaling mode)
    • ~900 LOC (non-intrusive)
  • Shepherd infrastructure
    • Shepherd primitives, thread support, maps management, etc.
    • ~3500 LOC (reusable for other file systems)
• Well-integrated with the file system
  • Small overhead

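Data Journaling Mode
[Figure: three regions: Memory, Journal, Fixed Location]
• Memory: blocks Bm, I, D
• Sync (intent is logged): the transaction TB D I TC is written to the journal
• Checkpoint (intent is realized): D and I are written to their fixed locations
• Tx Release: the transaction is released from the journal
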
Reliability Policy + Journaling
• When to run policies?
  • Policies (e.g. mirroring) are executed during checkpoint
• Is the current journaling approach adequate to support reliability policy?
  • Could we run remapping/mirroring during checkpoint?
  • No: the problem of failed intentions
  • Cannot react to checkpoint failures

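Failed Intentions
Example policy: remapping
[Figure: the transaction TB D I TC is in the journal. The checkpoint of D fails (failed intent), so D is remapped to R and the Remap-Map entry changes in memory from RMD0 to RMDR. The checkpoint completes and the transaction is released while the fixed location still holds RMD0. After a crash, recovery is impossible]
Inconsistencies:
1) Pointer I → D invalid
2) No reference to R
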
Journaling Flaw
• Journal: log intent to the journal
  • If a journal write failure occurs? Simply abort the transaction
• Checkpoint: intent is realized to final location
  • If a checkpoint failure occurs? No solution!
  • Ext3, IBM JFS: ignore
  • ReiserFS: stop the FS (coarse-grained recovery)
• Flaw in current journaling approach
  • No consistency for any checkpoint recovery that changes state
    • Too late, transaction has been committed
    • Crash could occur anytime
  • Hopes checkpoint writes always succeed (wrong!)
• Consistent reliability + current journal = impossible

Chained Transactions
• Contains all recent changes (e.g. modified shepherd’s metadata)
• “Chained” with previous transaction
• Rule: only after the chained transaction commits can we release the previous transaction

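Chained Transactions
Example policy: remapping
[Figure: memory holds I, D and the updated Remap-Map entry RMDR. The journal holds TB D I TC followed by a chained transaction TB RMDR TC. The fixed location holds D, I, R and RMD0 when the checkpoint completes]
• Old: Tx Release
• New: Tx Release after CTx commits
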
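The release rule can be sketched as a small in-memory model. The `Tx` and `Journal` classes and their method names are hypothetical and exist only to illustrate the ordering constraint; they do not reflect the ext3 journaling code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tx:
    name: str
    committed: bool = False
    released: bool = False
    chained: Optional["Tx"] = None   # holds changes made during checkpoint

class Journal:
    def __init__(self) -> None:
        self.log: List[Tx] = []

    def commit(self, tx: Tx) -> None:
        """Mark tx committed and append it to the journal."""
        tx.committed = True
        self.log.append(tx)

    def checkpoint(self, tx: Tx, metadata_changed: bool) -> None:
        """If a policy changed shepherd metadata, chain a new transaction to tx."""
        if metadata_changed:
            tx.chained = Tx(name=tx.name + "-chained")

    def try_release(self, tx: Tx) -> bool:
        """Release tx only once its chained transaction (if any) has committed."""
        if tx.chained is not None and not tx.chained.committed:
            return False
        tx.released = True
        self.log.remove(tx)
        return True
```

If a crash happens before the chained transaction commits, the original transaction is still in the journal and can be replayed, which is why the policy run during checkpoint needs to be repeatable.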
Summary
• Chained transactions
  • Handles failed intentions
  • Works for all policies
  • Minimal changes in the journaling layer
• Repeatable across crashes
  • Idempotent policy
  • An important property for consistency in multiple crashes

Outline
• Introduction
• I/O Shepherd Architecture
• Implementation
• Evaluation
• Conclusion

Evaluation
• Flexible
  • Change ext3 to all-stop or more-retry policies
• Fine-grained
  • Implement gracefully degrading RAID [TOS’05]
• Composable
  • Perform multiple lines of defense
• Simple
  • Craft 8 policies in a simple manner

Flexibility
• Modify ext3’s inconsistent read recovery policies
[Figure: ext3 read recovery policies, workload vs. failed block type. Legend: No Recovery (ignore failure), Retry, Stop, Propagate, Not applicable]
• Example cell
  • Failed block: indirect block
  • Workload: path traversal (cd /mnt/fs2/test/a/b/)
  • Policy observed: detect failure and propagate failure to app

Flexibility
• Modify ext3 policies to all-stop policies
[Figure: policy matrices for ext3 and All-Stop. Legend: No Recovery, Retry, Stop, Propagate]
Policy code:
    AllStopRead(Block B) {
        if (Read(B) == OK)
            return OK;
        else
            Stop();
    }

Flexibility
• Modify ext3 policies to retry-more policies
[Figure: policy matrices for ext3 and Retry-More. Legend: No Recovery, Retry, Stop, Propagate]
Policy code:
    RetryMoreRead(Block B) {
        for (int i = 0; i < RETRY_MAX; i++)
            if (Read(B) == SUCCESS)
                return SUCCESS;
        return FAILURE;
    }

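To see the retry-more policy behave under an intermittent fault, here is a small simulation. The `RETRY_MAX` value and the `make_flaky_read` fault injector are assumptions for illustration; the slide does not state the retry budget.

```python
from typing import Callable, Optional

RETRY_MAX = 3  # hypothetical retry budget

def retry_more_read(read: Callable[[int], Optional[bytes]],
                    block: int) -> Optional[bytes]:
    """Retry a failing read up to RETRY_MAX times; None signals failure."""
    for _ in range(RETRY_MAX):
        data = read(block)
        if data is not None:
            return data
    return None

def make_flaky_read(failures: int) -> Callable[[int], Optional[bytes]]:
    """Build a simulated read that fails `failures` times, then succeeds."""
    remaining = [failures]

    def read(block: int) -> Optional[bytes]:
        if remaining[0] > 0:
            remaining[0] -= 1
            return None
        return b"block-%d" % block

    return read
```

A fault that clears within the retry budget is hidden from the caller, while a persistent fault is still reported as a failure.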
Fine-Granularity
• RAID problem
  • Extreme unavailability
    • Partially available data
    • Unavailable root directory
• DGRAID [TOS’05]
  • Degrade gracefully
    • Fault isolate a file to a disk
    • Highly replicate metadata
[Figure: left, a File System on RAID-0 with file1.pdf and /root spread across disks; right, a File System with Shepherd + DGRAID on RAID-0, with f1.pdf and f2.pdf each isolated to a disk]

Fine-Granularity
[Figure: plot with series 10-way, Linear, and X = 1, 5, 10; annotated points F: 1, A: 90%; F: 2, A: 80%; F: 3, A: ~40%]

Composability
• Multiple lines of defense
• Assemble both low-level and high-level recovery mechanisms
Policy code:
    ReadInode(Block B) {
        C = Lookup(Ch-Map, B);
        Read(B, C);
        if (CompareChecksum(B, C) == OK)
            return OK;
        M = Lookup(M-Map, B);
        Read(M);
        if (CompareChecksum(M, C) == OK) {
            B = M;
            return OK;
        }
        if (SanityCheck(B) == OK)
            return OK;
        if (SanityCheck(M) == OK) {
            B = M;
            return OK;
        }
        RunOnlineFsck();
        return ReadInode(B);
    }
[Figure: plot with axis Time (ms)]

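The layered read can be sketched as runnable code over a simulated disk. Several things here are assumptions: CRC32 stands in for whatever checksum the shepherd uses, the maps are plain dictionaries, the sanity check is a caller-supplied predicate, and the final online fsck step is replaced by returning `None`.

```python
import zlib
from typing import Callable, Dict, Optional

def read_inode(block: int,
               disk: Dict[int, bytes],
               checksum_map: Dict[int, int],
               mirror_map: Dict[int, int],
               sanity_check: Callable[[bytes], bool]) -> Optional[bytes]:
    """Return trusted contents for `block`: checksum first, then sanity check."""
    expected = checksum_map.get(block)
    primary = disk.get(block)
    mirror = disk.get(mirror_map.get(block, -1))
    # Lines of defense 1 and 2: checksum on the primary copy, then the mirror.
    for copy in (primary, mirror):
        if copy is not None and zlib.crc32(copy) == expected:
            return copy
    # Lines of defense 3 and 4: structural sanity check on either copy.
    for copy in (primary, mirror):
        if copy is not None and sanity_check(copy):
            return copy
    # The slide's policy would run online fsck and retry at this point.
    return None
```

For example, with a corrupted primary copy and an intact mirror, the checksum comparison rejects the primary and the mirror contents are returned.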
Simplicity
• Writing reliability policy is simple
  • Implement 8 policies
    • Using reusable primitives
  • Complex one < 80 LOC

Conclusion
• Modern storage failures are complex
  • Not only fail-stop, but also exhibit individual block failures
• FS reliability framework does not exist
  • Scattered policy code: can’t expect much reliability
  • Journaling + block failures → failed intentions (flaw)
• I/O Shepherding
  • Powerful
    • Deploy disk-level, RAID-level, FS-level policies
  • Flexible
    • Reliability as a function of workload and environment
  • Consistent
    • Chained transactions

ADvanced Systems Laboratory
www.cs.wisc.edu/adsl
Thanks to: I/O Shepherd’s shepherd, Frans Kaashoek
Scholarship sponsor:
Research sponsor:

[Figure: block D is written with no Mirror-Map entry (NULL); replica R is allocated, the write to R fails, and the Mirror-Map entry for D is updated to a new replica Q. The Disk Subsystem holds D, R and Q]
Policy code:
    RemapMirrorData(Addr D) {
        Addr R, Q;
        MapLookup(MMap, D, R);
        if (R == NULL) {
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        }
        Copy(D, R);
        Write(D, R);
        if (Fail(R)) {
            Deallocate(R);
            Q = PickMirrorLoc(D);
            MapAllocate(MMap, D, Q);
            Write(Q);
        }
    }

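A runnable sketch of the remap-on-failure idea, under stated assumptions: the disk is a dictionary, write failures are simulated with a `bad` set of addresses, and replica locations are taken from a hypothetical region starting at `region_start`. Unlike the slide, which remaps once, this version keeps trying new locations until a replica write succeeds.

```python
from typing import Dict, Set

def remap_mirror_data(d: int, data: bytes, disk: Dict[int, bytes],
                      mirror_map: Dict[int, int], bad: Set[int],
                      region_start: int = 1000) -> int:
    """Write data to D and a replica; remap the replica if its write fails."""
    def write(addr: int) -> bool:
        if addr in bad:              # simulated write failure
            return False
        disk[addr] = data
        return True

    def pick_mirror_loc(skip: Set[int]) -> int:
        addr = region_start
        while addr in disk or addr in skip:
            addr += 1
        return addr

    write(d)                         # primary copy (failure not handled here)
    tried: Set[int] = set()
    r = mirror_map.get(d)
    if r is None:
        r = pick_mirror_loc(tried)
    while not write(r):
        tried.add(r)                 # give up on the failed replica location
        r = pick_mirror_loc(tried)
    mirror_map[d] = r                # record the location that finally worked
    return r
```

Because the map is only updated with the location that succeeded, rerunning the policy after a crash leads to the same final mapping, in the spirit of the idempotence property noted earlier in the deck.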
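Chained Transactions (2)
Example policy: RemapMirrorData
[Figure: memory holds I, D and the mirror-map entry, which changes from MDR1 to MDR2. The journal holds TB D I TC followed by a chained transaction TB MDR2 TC. The fixed location holds MD0, D, I, R1 and R2 when the checkpoint completes]
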
Existing Solution Enough?
• Is machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)?
  • Not pervasive in home environments (storing photos, tax returns)
  • New trend: commodity storage clusters (Google, EMC Centera)
• Is RAID enough?
  • Requires more than one disk
  • Does not protect against faults above the disk system
  • Focuses on whole-disk failure
  • Does not enable fine-grained policies