
Improving File System Reliability with I/O Shepherding

Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. University of Wisconsin - Madison.


Presentation Transcript


  1. Improving File System Reliability with I/O Shepherding • Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau • University of Wisconsin - Madison

  2. Storage Reality • Complex storage subsystem • Mechanical/electrical failures, buggy drivers • Complex failures • Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc. • FS reliability is important • Managing disk and individual block failures [Figure: File System above the storage stack: Device Driver, Transport, Firmware, Media, Mechanical, Electrical]

  3. File System Reality • Good news: • Rich literature • Checksum, parity, mirroring • Versioning, physical/logical identity • Important for both single-disk and multiple-disk settings • Bad news: • File system reliability is broken [SOSP’05] • Unlike other components (performance, consistency) • Reliability approaches are hard to understand and evolve

  4. Broken FS Reliability • Lack of a good reliability strategy • No remapping, checksumming, redundancy • Existing strategy is coarse-grained • Mount read-only, panic, retry • Inconsistent policies • Different techniques in similar failure scenarios • Bugs • Ignored write failures • Let’s fix them! With the current framework? Not so easy …

  5. No Reliability Framework • Diffused • Handle each fault in each I/O location • Different developers might increase diffusion • Inflexible • Fixed policies, hard to change • But no single policy fits all diverse settings • Less reliable vs. more reliable drives • Desktop workload vs. web-server apps • The need for a new framework • Reliability is a first-class file system concern [Figure labels: Reliability Policy, File System, Disk Subsystem]

  6. Localized • I/O Shepherd • Localized policies, … • More correct, fewer bugs, simpler reliability management [Figure labels: File System, Shepherd, Disk Subsystem]

  7. Flexible • I/O Shepherd • Localized, flexible policies, … [Figure: File System, Shepherd, Disk Subsystem; policies (More Retry, Checksum, Add Mirror) ranging from less protection to more protection; settings: ATA, SCSI, Archival, Scientific Data, Networked Storage, Less Reliable Drive, More Reliable Drive]

  8. Powerful • I/O Shepherd • Localized, flexible, and powerful policies [Figure: File System, Shepherd, Disk Subsystem; composable policies (More Retry, Checksum, Add Mirror) ranging from less protection to more protection; settings: ATA, SCSI, Archival, Scientific Data, Networked Storage, Less Reliable Drive, More Reliable Drive, Custom Drive]

  9. Outline • Introduction • I/O Shepherd Architecture • Implementation • Evaluation • Conclusion

  10. Architecture • Building a reliability framework • How to specify reliability policies? • How to make powerful policies? • How to simplify reliability management? • I/O Shepherd layer • Four important components • Policy table • Policy code • Policy primitives • Policy metadata [Figure: File System, I/O Shepherd, Disk Subsystem; Primitives: SanityCheck, Lookup Location, OnlineFsck, Checksum, …; Write, Read, …; Policy Metadata: Mirror-Map, Remap-Map, Checksum-Map] Policy Code: DynMirrorWrite(DiskAddr D, MemAddr A) { DiskAddr copyAddr; IOS_MapLookup(MMap, D, &copyAddr); if (copyAddr == NULL) { PickMirrorLoc(MMap, D, &copyAddr); IOS_MapAllocate(MMap, D, copyAddr); } return (IOS_Write(D, A, copyAddr, A)); }

  11. Policy Table • How to specify reliability policies? • Different block types, different levels of importance • Different volumes, different reliability levels • Need fine-grained policy • Policy table • Different policies across different block types • Different policy tables across different volumes [Figure: File System and Shepherd; volumes /tmp, /boot, /lib, /archive; labels: High-level reliability, No protection]
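The policy-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the volume names come from the slide, while the block-type names, the three policy functions, and the `lookup_policy` helper are hypothetical and are not CrookFS's actual interface.

```python
from typing import Callable, Dict

# Hypothetical policies: each takes a block number and returns a label
# describing what the shepherd would do for that block.
def no_protection(block: int) -> str:
    return f"write {block}"

def checksum(block: int) -> str:
    return f"write {block} + checksum"

def mirror(block: int) -> str:
    return f"write {block} + mirror copy"

PolicyTable = Dict[str, Callable[[int], str]]

# One policy table per volume, one entry per block type (names assumed).
POLICY_TABLES: Dict[str, PolicyTable] = {
    "/tmp": {"data": no_protection, "inode": no_protection},
    "/archive": {"data": checksum, "inode": mirror},
}

def lookup_policy(volume: str, block_type: str) -> Callable[[int], str]:
    """Pick the policy for this volume and block type."""
    return POLICY_TABLES[volume][block_type]
```

With this layout, changing the protection level of one block type on one volume is a one-line change to the table rather than an edit to every I/O call site.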

  12. Policy Metadata • What support is needed to make powerful policies? • Remapping: track bad block remapping • Mirroring: allocate new block • Sanity check: need on-disk structure specification • Integration with file system • Runtime allocation • Detailed knowledge of on-disk structures • I/O Shepherd Maps • Managed by the shepherd • Commonly used maps: • Mirror-map • Checksum-map • Remap-map [Figure: File System and I/O Shepherd with example map contents under the headings Remap, Mirror-Map, Csum-Map: 1001 1001 1001 null 1010 2001 1002 1002 1002 1010 2002 null 1003 1003 1003 1010 null 3003 …]
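A minimal Python sketch of shepherd-managed maps, assuming each map is a simple address-to-value table. The `ShepherdMap` class and the `resolve` helper are hypothetical names used for illustration, and the block numbers in the usage note are made up rather than read off the slide's figure.

```python
from typing import Dict, Optional

class ShepherdMap:
    """Toy stand-in for a shepherd-managed map (mirror, checksum, remap)."""

    def __init__(self) -> None:
        self._entries: Dict[int, int] = {}

    def lookup(self, addr: int) -> Optional[int]:
        # None plays the role of a "null" entry: nothing recorded yet.
        return self._entries.get(addr)

    def allocate(self, addr: int, value: int) -> None:
        self._entries[addr] = value

mirror_map, checksum_map, remap_map = ShepherdMap(), ShepherdMap(), ShepherdMap()

def resolve(addr: int) -> int:
    """Follow the remap-map: a remapped block lives at its new address."""
    target = remap_map.lookup(addr)
    return addr if target is None else target
```

For example, after `remap_map.allocate(7, 90)`, `resolve(7)` yields 90 while an untouched address such as `resolve(8)` yields 8.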

  13. Policy Primitives and Code • How to make reliability management simple? • I/O Shepherd Primitives • Rich set and reusable • Complexities are hidden • Policy writer simply composes primitives into Policy Code • Policy Primitives: Maps (Map Update, Map Lookup); Computation (Checksum, Parity); FS-Level (Sanity Check, Stop FS); Layout (Allocate Near, Allocate Far) • Policy Code: MirrorData(Addr D) { Addr M; MapLookup(MMap, D, M); if (M == NULL) { M = PickMirrorLoc(D); MapAllocate(MMap, D, M); } Copy(D, M); Write(D, M); }

  14. [Figure: walk-through of MirrorData; the Mirror-Map entry for D goes from NULL to R; the File System issues a write of D, and the I/O Shepherd sends D and R to the Disk Subsystem] Policy Code: MirrorData(Addr D) { Addr R; R = MapLookup(MMap, D); if (R == NULL) { R = PickMirrorLoc(D); MapAllocate(MMap, D, R); } Copy(D, R); Write(D, R); }
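The MirrorData policy can be simulated with an in-memory "disk". This is a sketch, not the shepherd's implementation: the dictionaries, the `MIRROR_OFFSET` placement rule, and the Python function names are assumptions chosen only to make the control flow runnable.

```python
from typing import Dict, Optional

disk: Dict[int, bytes] = {}        # address -> block contents
mirror_map: Dict[int, int] = {}    # block address -> mirror address
MIRROR_OFFSET = 1000               # assumed placement rule, illustration only

def pick_mirror_loc(d: int) -> int:
    """Hypothetical allocator: place the copy a fixed distance away."""
    return d + MIRROR_OFFSET

def mirror_data(d: int, data: bytes) -> None:
    r: Optional[int] = mirror_map.get(d)   # MapLookup
    if r is None:
        r = pick_mirror_loc(d)             # PickMirrorLoc
        mirror_map[d] = r                  # MapAllocate
    disk[d] = data                         # Write(D, R): write both copies
    disk[r] = data
```

The first write to a block allocates its mirror location; later writes to the same block find the existing entry and reuse it, so the map does not grow on rewrites.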

  15. Summary • Interposition simplifies reliability management • Localized policies • Simple and extensible policies • Challenge: Keeping new data and metadata consistent

  16. Outline • Introduction • I/O Shepherd Architecture • Implementation • Consistency Management • Evaluation • Conclusion

  17. Implementation • CrookFS • (named for the hooked staff of a shepherd) • An ext3 variant with I/O shepherding capabilities • Implementation • Changes in Core OS • Semantic information, layout and allocation interface, allocation during recovery • Consistency management (data journaling mode) • ~900 LOC (non-intrusive) • Shepherd Infrastructure • Shepherd primitives, thread support, maps management, etc. • ~3500 LOC (reusable for other file systems) • Well-integrated with the file system • Small overhead

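  18. Data Journaling Mode [Figure: blocks Bm, I, and D in memory; on sync they are written to the journal between TB and TC (intent is logged); the checkpoint writes D and I to their fixed locations (intent is realized); then the transaction is released]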

  19. Reliability Policy + Journaling • When to run policies? • Policies (e.g. mirroring) are executed during checkpoint • Is the current journaling approach adequate to support reliability policies? • Could we run remapping/mirroring during checkpoint? • No: the problem of failed intentions • Cannot react to checkpoint failures

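  20. Failed Intentions • Example policy: remapping • Inconsistencies: 1) Pointer I→D invalid 2) No reference to R [Figure: memory holds I, D, and the remap-map entry RMDR; the journal holds TB D I TC; the fixed location holds D, I, R, and RMD0. The checkpoint of D fails (failed intent) and D is remapped to R; the checkpoint completes and the transaction is released; after a crash, persisting RMDR is impossible]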

  21. Journaling Flaw • Journal: log intent to the journal • If a journal write failure occurs? Simply abort the transaction • Checkpoint: intent is realized to final location • If a checkpoint failure occurs? No solution! • Ext3, IBM JFS: ignore • ReiserFS: stop the FS (coarse-grained recovery) • Flaw in current journaling approach • No consistency for any checkpoint recovery that changes state • Too late, transaction has been committed • Crash could occur anytime • Hopes checkpoint writes always succeed (wrong!) • Consistent reliability + current journal = impossible

  22. Chained Transactions • Contains all recent changes (e.g. modified shepherd’s metadata) • “Chained” with previous transaction • Rule: Only after the chained transaction commits, can we release the previous transaction

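  23. Chained Transactions • Example policy: remapping • Old: Tx release once the checkpoint completes • New: Tx release after CTx commits [Figure: memory holds I, D, and RMDR; the journal holds TB D I TC followed by a chained transaction (TB … TC) carrying RMDR; the fixed location holds D, I, R, and RMD0; checkpoint completes]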
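The difference between the old and new release rules can be shown with a toy simulation. Everything here is an assumption made for illustration: the journal is a Python list, the remap-map is a dictionary, the block numbers are arbitrary, and recovery is reduced to replaying whatever transactions are still in the journal.

```python
from typing import Dict, List

def run(chained: bool) -> bool:
    """Commit, fail a checkpoint write, remap, release, crash, recover.

    Returns True if the remapped block is still reachable after recovery.
    """
    journal: List[dict] = []           # committed, not yet released
    disk: Dict[int, str] = {}          # blocks at their final locations
    disk_remap: Dict[int, int] = {}    # remap-map as recovered from disk
    mem_remap: Dict[int, int] = {}     # in-memory remap-map
    D, R = 10, 99                      # D's home is bad; R is a spare

    journal.append({"blocks": {D: "data"}})   # Tx commits: intent is logged

    # Checkpoint: the write to D fails, so the policy remaps D to R.
    disk[R] = "data"
    mem_remap[D] = R

    if chained:
        # The chained transaction logs the shepherd's metadata change.
        # Only after it commits is the previous transaction released.
        journal.append({"remap": dict(mem_remap)})
    journal.pop(0)                     # release the original transaction

    # Crash: memory is lost. Recovery replays what the journal still holds.
    for tx in journal:
        disk_remap.update(tx.get("remap", {}))
    return disk_remap.get(D) == R
```

Under the old rule the only record of the remapping lives in memory and vanishes at the crash; with the chained transaction the remap entry is in the journal before the original transaction is released, so replay restores it.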

  24. Summary • Chained Transactions • Handles failed-intentions • Works for all policies • Minimal changes in the journaling layer • Repeatable across crashes • Idempotent policy • An important property for consistency in multiple crashes
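One way to read the idempotence requirement: if a crash forces a policy to run again during replay, the second run must reach the same state as the first. A minimal sketch, assuming a dictionary remap-map and a list of spare blocks (both hypothetical):

```python
from typing import Dict, List

def remap_policy(remap: Dict[int, int], free: List[int], d: int) -> int:
    """Idempotent remap: reuse an existing entry, allocate only once."""
    if d in remap:             # already remapped by an earlier (replayed) run
        return remap[d]
    remap[d] = free.pop(0)     # first run: take a spare block
    return remap[d]
```

Running the policy twice for the same block returns the same target and consumes only one spare, which is the property that keeps repeated recovery passes from leaking blocks or diverging.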

  25. Outline • Introduction • I/O Shepherd Architecture • Implementation • Evaluation • Conclusion

  26. Evaluation • Flexible • Change ext3 to all-stop or more-retry policies • Fine-Grained • Implement gracefully degrading RAID [TOS’05] • Composable • Perform multiple lines of defense • Simple • Craft 8 policies in a simple manner

  27. Flexibility • Modify ext3's inconsistent read recovery policies • Example: Failed block: indirect block • Workload: path traversal (cd /mnt/fs2/test/a/b/) • Policy observed: detect failure and propagate failure to app [Figure: ext3 read recovery policies; axes: Workload, Failed Block Type; legend: No Recovery (ignore failure), Retry, Stop, Propagate, Not applicable]

  28. Flexibility • Modify ext3 policies to all-stop policies [Figure: ext3 vs. All-Stop; legend: No Recovery, Retry, Stop, Propagate] Policy Code: AllStopRead(Block B) { if (Read(B) == OK) return OK; else Stop(); }
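A runnable analogue of AllStopRead, assuming the underlying read is a Python callable that raises `IOError` on failure. The `FileSystemStopped` exception is a hypothetical stand-in for the slide's `Stop()` primitive.

```python
class FileSystemStopped(Exception):
    """Raised in place of Stop(): the file system halts."""

def all_stop_read(read, block: int) -> bytes:
    """Return the block on success; stop the file system on any failure."""
    try:
        return read(block)
    except IOError:
        raise FileSystemStopped(f"read of block {block} failed") from None
```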

  29. Flexibility • Modify ext3 policies to retry-more policies [Figure: ext3 vs. Retry-More; legend: No Recovery, Retry, Stop, Propagate] Policy Code: RetryMoreRead(Block B) { for (int i = 0; i < RETRY_MAX; i++) if (Read(B) == SUCCESS) return SUCCESS; return FAILURE; }
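The same retry loop in Python, as a sketch. The value of `RETRY_MAX` is an assumption (the slide does not give one), and the read is again modeled as a callable that raises `IOError` on failure; `None` stands in for `FAILURE`.

```python
RETRY_MAX = 3  # assumed value; the slide does not specify one

def retry_more_read(read, block: int, retry_max: int = RETRY_MAX):
    """Try the read up to retry_max times; return None if all attempts fail."""
    for _ in range(retry_max):
        try:
            return read(block)
        except IOError:
            continue
    return None
```

A transient fault that clears within the retry budget is hidden from the caller; a persistent fault still surfaces as a failure after `retry_max` attempts.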

  30. Fine-Granularity • RAID problem • Extreme unavailability • Partially available data • Unavailable root directory • DGRAID [TOS’05] • Degrade gracefully • Fault-isolate a file to a disk • Highly replicate metadata [Figure: File System over RAID-0 (file1.pdf, /root/…, /root) compared with File System over Shepherd + DGRAID over RAID-0 (f1.pdf, f2.pdf)]
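A back-of-the-envelope sketch of why fault-isolating each file to one disk degrades more gracefully than striping. The numbers are hypothetical (100 files, 10 disks, round-robin placement), metadata is assumed to be replicated well enough that the directory tree stays reachable, and the model is not taken from the DGRAID evaluation.

```python
from typing import Dict, Set

def striped_availability(failed: Set[int]) -> float:
    """RAID-0 style: every file has a piece on every disk."""
    return 0.0 if failed else 1.0

def isolated_availability(placement: Dict[str, int], failed: Set[int]) -> float:
    """Each file lives entirely on one disk; metadata assumed replicated."""
    alive = [f for f, disk in placement.items() if disk not in failed]
    return len(alive) / len(placement)

# 100 hypothetical files spread round-robin over 10 disks.
placement = {f"f{i}.pdf": i % 10 for i in range(100)}
```

In this model one failed disk leaves nine tenths of the files readable under isolation, whereas every striped file loses a piece.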

  31. Fine-Granularity [Figure: plot annotations: F: 1, A: 90%; F: 2, A: 80%; F: 3, A: ~40%; labels: 10-way, Linear, X = 1, 5, 10]

  32. Composability • Multiple lines of defense • Assemble both low-level and high-level recovery mechanisms [Figure: plot with axis Time (ms)] Policy Code: ReadInode(Block B) { C = Lookup(Ch-Map, B); Read(B, C); if (CompareChecksum(B, C) == OK) return OK; M = Lookup(M-Map, B); Read(M); if (CompareChecksum(M, C) == OK) { B = M; return OK; } if (SanityCheck(B) == OK) return OK; if (SanityCheck(M) == OK) { B = M; return OK; } RunOnlineFsck(); return ReadInode(B); }
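The layered ReadInode policy can be sketched in Python as follows. Several things are assumptions for illustration: CRC32 from `zlib` stands in for whatever checksum the shepherd uses, the disk and maps are dictionaries, the sanity check is passed in as a callable, and returning `None` stands in for the final "run online fsck and retry" step.

```python
import zlib
from typing import Callable, Dict, Optional

def read_inode(addr: int, disk: Dict[int, bytes], csum: Dict[int, int],
               mirror: Dict[int, int],
               sane: Callable[[bytes], bool]) -> Optional[bytes]:
    """Try checksum, then mirror, then sanity checks; None means 'run fsck'."""
    primary = disk.get(addr)
    copy = disk.get(mirror.get(addr, -1))
    expected = csum.get(addr)
    # Lines 1 and 2: checksum on the primary, then on the mirror.
    for block in (primary, copy):
        if block is not None and zlib.crc32(block) == expected:
            return block
    # Lines 3 and 4: structural sanity check when checksums cannot decide.
    for block in (primary, copy):
        if block is not None and sane(block):
            return block
    return None  # last resort on the slide: RunOnlineFsck(), then retry
```

Cheap low-level checks run first, and the more expensive, file-system-aware mechanisms are reached only when the earlier ones fail to produce a trustworthy block.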

  33. Simplicity • Writing reliability policy is simple • Implement 8 policies • Using reusable primitives • Complex one < 80 LOC

  34. Conclusion • Modern storage failures are complex • Not only fail-stop, but also exhibit individual block failures • FS reliability framework does not exist • Scattered policy code: can’t expect much reliability • Journaling + Block Failures → Failed intentions (Flaw) • I/O Shepherding • Powerful • Deploy disk-level, RAID-level, FS-level policies • Flexible • Reliability as a function of workload and environment • Consistent • Chained transactions

  35. ADvanced Systems Laboratory • www.cs.wisc.edu/adsl • Thanks to: I/O Shepherd’s shepherd – Frans Kaashoek • Scholarship Sponsor: • Research Sponsor:

  36. Extra Slides

  37. [Figure: walk-through of RemapMirrorData; the Mirror-Map entry for D goes from NULL to R and then to Q; the Disk Subsystem holds D, R, and Q] Policy Code: RemapMirrorData(Addr D) { Addr R, Q; MapLookup(MMap, D, R); if (R == NULL) { R = PickMirrorLoc(D); MapAllocate(MMap, D, R); } Copy(D, R); Write(D, R); if (Fail(R)) { Deallocate(R); Q = PickMirrorLoc(D); MapAllocate(MMap, D, Q); Write(Q); } }
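A Python sketch of the remap-on-mirror-failure idea. The data structures are assumptions: a dictionary disk, a list of spare addresses, and a set of addresses whose writes fail. Unlike the slide's code, which tries one replacement location, this version keeps trying spares until a write succeeds, and it raises `IndexError` if the spares run out.

```python
from typing import Dict, List, Set

def remap_mirror_data(d: int, data: bytes, disk: Dict[int, bytes],
                      mirror_map: Dict[int, int], spares: List[int],
                      bad: Set[int]) -> int:
    """Mirror block d; if the mirror write fails, pick another location."""
    def write(addr: int) -> bool:
        if addr in bad:          # simulated write failure
            return False
        disk[addr] = data
        return True

    write(d)                     # primary copy
    r = mirror_map.get(d)        # MapLookup
    if r is None:
        r = spares.pop(0)        # PickMirrorLoc
        mirror_map[d] = r        # MapAllocate
    while not write(r):          # Fail(R)
        r = spares.pop(0)        # Deallocate(R), then pick Q
        mirror_map[d] = r        # MapAllocate(MMap, D, Q)
    return r
```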

  38. Chained Transactions (2) • Example policy: RemapMirrorData [Figure: memory holds I, D, MDR1, and MDR2; the journal holds TB D I TC followed by a chained transaction (TB … TC) carrying MDR2; the fixed location holds MD0, D, I, R1, and R2; checkpoint completes]

  39. Existing Solution Enough? • Is machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)? • Not pervasive in home environments (storing photos, tax returns) • New trend: commodity storage clusters (Google, EMC Centera) • Is RAID enough? • Requires more than one disk • Does not protect against faults above the disk system • Focuses on whole-disk failure • Does not enable fine-grained policies
