Improving File System Reliability with I/O Shepherding
Haryadi S. Gunawi, Vijayan Prabhakaran+, Swetha Krishnan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison
Storage Reality
• Complex storage subsystem: mechanical/electrical failures, buggy drivers
• Complex failures: intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.
• FS reliability is important: managing disk and individual block failures
[Figure: the storage stack beneath the file system – device driver, transport, firmware, media (mechanical, electrical)]
File System Reality
• Good news:
  • Rich literature: checksums, parity, mirroring, versioning, physical/logical identity
  • Important for both single-disk and multiple-disk settings
• Bad news:
  • File system reliability is broken [SOSP '05]
  • Unlike other components (performance, consistency), reliability approaches are hard to understand and evolve
Broken FS Reliability
• Lack of a good reliability strategy
  • No remapping, checksumming, or redundancy
• Existing strategies are coarse-grained
  • Mount read-only, panic, retry
• Inconsistent policies
  • Different techniques in similar failure scenarios
• Bugs
  • Ignored write failures
Let's fix them! With the current framework? Not so easy …
No Reliability Framework
• Diffused
  • Each fault is handled at each I/O location
  • Different developers might increase diffusion
• Inflexible
  • Fixed policies, hard to change
  • But no single policy fits all diverse settings
    • Less reliable vs. more reliable drives
    • Desktop workloads vs. web-server apps
• The need for a new framework
  • Reliability is a first-class file system concern
[Figure: reliability policy code scattered throughout the file system, above the disk subsystem]
Localized
• I/O Shepherd
  • Localized policies, …
  • More correct, fewer bugs, simpler reliability management
[Figure: the shepherd layer interposed between the file system and the disk subsystem]
Flexible
• I/O Shepherd
  • Localized, flexible policies, …
[Figure: different policies (add mirror, checksum, more retries, more or less protection) matched to different environments – ATA vs. SCSI, less vs. more reliable drives, archival data, scientific data, networked storage]
Powerful
• I/O Shepherd
  • Localized, flexible, and powerful policies
[Figure: composable policies (e.g., more retries plus checksums plus added mirrors) deployed across diverse environments – ATA, SCSI, custom drives, less and more reliable drives, archival data, scientific data, networked storage]
Outline • Introduction • I/O Shepherd Architecture • Implementation • Evaluation • Conclusion
Architecture
• Building a reliability framework
  • How to specify reliability policies?
  • How to make powerful policies?
  • How to simplify reliability management?
• I/O Shepherd layer: four important components
  • Policy table
  • Policy code
  • Policy primitives
  • Policy metadata
[Figure: the I/O Shepherd sits between the file system and the disk subsystem; its primitives include SanityCheck, Lookup Location, OnlineFsck, Checksum, Write, Read, …; its policy metadata includes the Mirror-Map, Remap-Map, and Checksum-Map]
Example policy code:
    DynMirrorWrite(DiskAddr D, MemAddr A)
        DiskAddr copyAddr;
        IOS_MapLookup(MMap, D, &copyAddr);
        if (copyAddr == NULL)
            PickMirrorLoc(MMap, D, &copyAddr);
            IOS_MapAllocate(MMap, D, copyAddr);
        return (IOS_Write(D, A, copyAddr, A));
Policy Table
• How to specify reliability policies?
  • Different block types have different levels of importance
  • Different volumes need different reliability levels
  • Need fine-grained policies
• Policy table
  • Different policies across different block types
  • Different policy tables across different volumes
[Figure: the shepherd keeps one policy table per volume (/tmp, /boot, /lib, /archive), ranging from no protection to high-level reliability]
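As an illustration (not the paper's actual interface), here is a minimal C sketch of a per-volume policy table that dispatches each block type to its own policy handler; the type names, signatures, and the toy policies are all assumptions.

    #include <stdio.h>

    /* Hypothetical block types and policy-handler signature (assumed names). */
    typedef enum { BT_SUPER, BT_INODE, BT_DIR, BT_DATA, BT_NUM_TYPES } BlockType;
    typedef int (*PolicyFn)(unsigned long diskAddr, void *buf);

    /* Stand-in for the raw disk read; a real shepherd would issue block I/O here. */
    static int raw_read(unsigned long d, void *buf) { (void)d; (void)buf; return 0; }

    /* Two toy policies: propagate the result, or retry a few times. */
    static int PolicyPropagate(unsigned long d, void *buf) { return raw_read(d, buf); }
    static int PolicyRetryMore(unsigned long d, void *buf) {
        for (int i = 0; i < 3; i++)
            if (raw_read(d, buf) == 0) return 0;
        return -1;
    }

    /* One policy table per volume: block type -> read policy. */
    typedef struct { PolicyFn read_policy[BT_NUM_TYPES]; } PolicyTable;

    /* This volume retries harder on metadata than on data. */
    static PolicyTable volume_table = {
        .read_policy = { [BT_SUPER] = PolicyRetryMore, [BT_INODE] = PolicyRetryMore,
                         [BT_DIR]   = PolicyRetryMore, [BT_DATA]  = PolicyPropagate },
    };

    /* The shepherd's interposition point: route each read through the policy
     * selected by (volume, block type). */
    static int shepherd_read(PolicyTable *t, BlockType bt, unsigned long d, void *buf) {
        return t->read_policy[bt](d, buf);
    }

    int main(void) {
        char buf[4096];
        printf("inode read -> %d\n", shepherd_read(&volume_table, BT_INODE, 1001, buf));
        return 0;
    }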
Policy Metadata
• What support is needed to make powerful policies?
  • Remapping: track bad-block remapping
  • Mirroring: allocate new blocks
  • Sanity check: need an on-disk structure specification
• Integration with the file system
  • Runtime allocation
  • Detailed knowledge of on-disk structures
• I/O Shepherd maps
  • Managed by the shepherd
  • Commonly used maps: mirror-map, checksum-map, remap-map
[Figure: example shepherd maps, keyed by block address]
    Block   Remap-Map   Mirror-Map   Csum-Map
    1001    null        1010         2001
    1002    1010        2002         null
    1003    1010        null         3003
    …       …           …            …
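A minimal sketch of what one of these maps might look like in memory: a small table keyed by block address with lookup and allocate helpers. The names echo the slides' pseudocode (IOS_MapLookup, IOS_MapAllocate), but the layout and signatures here are assumptions; a real shepherd map is also persisted on disk and kept consistent with the journal (see the chained-transactions slides).

    #include <stdio.h>

    #define IOS_MAP_SIZE 4096            /* toy capacity; real maps live on disk */
    #define IOS_NO_BLOCK 0UL             /* stand-in for a "null" entry          */

    typedef unsigned long DiskAddr;

    /* A shepherd map: source block address -> associated block address
     * (a mirror location, a remapped location, or a checksum block). */
    typedef struct {
        DiskAddr key[IOS_MAP_SIZE];
        DiskAddr val[IOS_MAP_SIZE];
        int      used;
    } IOSMap;

    /* Look up the mapping for D; writes IOS_NO_BLOCK if none exists. */
    static void IOS_MapLookup(IOSMap *m, DiskAddr d, DiskAddr *out) {
        *out = IOS_NO_BLOCK;
        for (int i = 0; i < m->used; i++)
            if (m->key[i] == d) { *out = m->val[i]; return; }
    }

    /* Record (or update) the mapping D -> v; returns 0 on success. */
    static int IOS_MapAllocate(IOSMap *m, DiskAddr d, DiskAddr v) {
        for (int i = 0; i < m->used; i++)
            if (m->key[i] == d) { m->val[i] = v; return 0; }
        if (m->used == IOS_MAP_SIZE) return -1;
        m->key[m->used] = d;
        m->val[m->used] = v;
        m->used++;
        return 0;
    }

    int main(void) {
        static IOSMap mirror_map;         /* zero-initialized: empty map */
        DiskAddr r;
        IOS_MapAllocate(&mirror_map, 1001, 1010);
        IOS_MapLookup(&mirror_map, 1001, &r);
        printf("block 1001 mirrored at %lu\n", r);
        return 0;
    }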
Policy Primitives and Code
• How to make reliability management simple?
• I/O Shepherd primitives
  • Rich set and reusable
  • Complexities are hidden
• The policy writer simply composes primitives into policy code
[Figure: policy primitives – map update, map lookup, checksum, parity, sanity check, FS-level layout (allocate near, allocate far), stop FS, … – grouped into map, computation, and FS-level operations]
Example policy code:
    MirrorData(Addr D)
        Addr M;
        MapLookup(MMap, D, M);
        if (M == NULL)
            M = PickMirrorLoc(D);
            MapAllocate(MMap, D, M);
        Copy(D, M);
        Write(D, M);
[Figure: walkthrough of MirrorData – the file system issues a write to block D; the shepherd looks up D in the mirror-map (initially D → NULL), picks and allocates a mirror location R (mirror-map now D → R), and writes both D and R to the disk subsystem]
Policy Code:
    MirrorData(Addr D)
        Addr R;
        R = MapLookup(MMap, D);
        if (R == NULL)
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        Copy(D, R);
        Write(D, R);
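Putting the pieces together, a hedged, self-contained C sketch in the spirit of the MirrorData policy above; the toy disk, the allocator, and the decision to write both copies directly (rather than modelling the slides' Copy primitive) are assumptions, not the paper's implementation.

    #include <stdio.h>
    #include <string.h>

    typedef unsigned long DiskAddr;
    #define NO_BLOCK 0UL

    /* Toy mirror-map and disk, just enough to run the policy end to end. */
    static DiskAddr mirror_of[4096];                    /* D -> mirror location */
    static char     disk[4096][512];                    /* toy block store      */
    static DiskAddr next_free = 2000;                   /* toy allocator cursor */

    static DiskAddr MapLookup(DiskAddr d)               { return mirror_of[d]; }
    static void     MapAllocate(DiskAddr d, DiskAddr r) { mirror_of[d] = r; }
    static DiskAddr PickMirrorLoc(DiskAddr d)           { (void)d; return next_free++; }
    static int      Write(DiskAddr d, const void *buf)  { memcpy(disk[d], buf, 512); return 0; }

    /* MirrorData-style policy: keep every write to D mirrored at R. */
    static int MirrorData(DiskAddr D, const void *buf) {
        DiskAddr R = MapLookup(D);
        if (R == NO_BLOCK) {                            /* first write: pick a mirror */
            R = PickMirrorLoc(D);
            MapAllocate(D, R);
        }
        if (Write(D, buf) != 0) return -1;              /* primary copy  */
        return Write(R, buf);                           /* mirrored copy */
    }

    int main(void) {
        char block[512] = "hello";
        MirrorData(1001, block);
        printf("block 1001 mirrored at %lu\n", MapLookup(1001));
        return 0;
    }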
Summary • Interposition simplifies reliability management • Localized policies • Simple and extensible policies • Challenge: Keeping new data and metadata consistent
Outline • Introduction • I/O Shepherd Architecture • Implementation • Consistency Management • Evaluation • Conclusion
Implementation
• CrookFS (named for the hooked staff of a shepherd)
  • An ext3 variant with I/O shepherding capabilities
• Implementation
  • Changes in the core OS
    • Semantic information, layout and allocation interface, allocation during recovery
    • Consistency management (data journaling mode)
    • ~900 LOC (non-intrusive)
  • Shepherd infrastructure
    • Shepherd primitives, thread support, map management, etc.
    • ~3500 LOC (reusable for other file systems)
• Well integrated with the file system
• Small overhead
Data Journaling Mode
[Figure: in-memory blocks (bitmap Bm, inode I, data D) are first logged to the journal as a transaction – TB (begin), D, I, TC (commit) – on sync, so the intent is logged; checkpointing then writes D and I to their fixed locations, realizing the intent, after which the transaction is released]
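For orientation, a hedged sketch of the ordering that data journaling imposes; journal_write, checkpoint_write, and release_transaction are hypothetical stand-ins, not ext3's actual interfaces.

    #include <stdio.h>

    /* Hypothetical stand-ins for the journaling layer (assumed names). */
    static void journal_write(const char *rec)    { printf("journal:    %s\n", rec); }
    static void checkpoint_write(const char *blk) { printf("checkpoint: %s\n", blk); }
    static void release_transaction(void)         { printf("release transaction\n"); }

    /* Ordering for one transaction touching data block D and inode I. */
    int main(void) {
        /* 1. Sync: the intent is logged as one committed transaction. */
        journal_write("TB (begin)");
        journal_write("D");
        journal_write("I");
        journal_write("TC (commit)");

        /* 2. Checkpoint: the intent is realized at the fixed locations.
              The next slides show why a failure here is the hard case. */
        checkpoint_write("D");
        checkpoint_write("I");

        /* 3. Only then is the transaction released (journal space reclaimed). */
        release_transaction();
        return 0;
    }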
Reliability Policy + Journaling
• When to run policies?
  • Policies (e.g., mirroring) are executed during checkpoint
• Is the current journaling approach adequate to support reliability policies?
  • Could we run remapping/mirroring during checkpoint?
  • No – the problem of failed intentions: journaling cannot react to checkpoint failures
Failed Intentions
Example policy: remapping
[Figure: the transaction (TB, D, I, TC) is committed to the journal; during checkpoint the write of D fails, so the policy remaps D to R and updates the remap-map (RMD0 becomes RMDR) in memory; if a crash occurs before the updated remap-map reaches disk, the remap cannot be recovered]
Inconsistencies after the crash:
1) The pointer from I to D is invalid
2) There is no reference to R
Journaling Flaw
• Journal: log intent to the journal
  • If a journal write failure occurs? Simply abort the transaction
• Checkpoint: intent is realized at the final location
  • If a checkpoint failure occurs? No solution!
    • Ext3, IBM JFS: ignore it
    • ReiserFS: stop the FS (coarse-grained recovery)
• Flaw in the current journaling approach
  • No consistency for any checkpoint recovery that changes state
    • Too late: the transaction has already committed, and a crash could occur at any time
  • It simply hopes checkpoint writes always succeed (wrong!)
• Consistent reliability + current journaling = impossible
Chained Transactions
• A new transaction that contains all recent changes made during checkpoint (e.g., the shepherd's modified metadata)
• "Chained" with the previous transaction
• Rule: only after the chained transaction commits can we release the previous transaction
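A hedged sketch of the chained-transaction rule in C; the transaction structure, the simulated checkpoint failure, and all helper names are assumptions made for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical transaction handle and toy journaling helpers (assumed names). */
    typedef struct { bool committed; } Tx;

    static Tx  txs[16];
    static int ntx = 0;

    static Tx *journal_commit(const char *record) {     /* log and commit a new tx */
        printf("journal: %s (committed)\n", record);
        txs[ntx].committed = true;
        return &txs[ntx++];
    }
    static int checkpoint_block(const char *b) {          /* pretend D's write fails */
        return (b[0] == 'D') ? -1 : 0;
    }
    static void remap_map_update(const char *bad, const char *to) {
        printf("remap-map: %s -> %s\n", bad, to);
    }
    static void release(const char *which) {               /* reclaim journal space */
        printf("release %s transaction\n", which);
    }

    /* Checkpoint a committed transaction; if a policy (here: remapping) changes
     * state, log the change in a chained transaction and release the previous
     * transaction only after the chained one commits. */
    static void checkpoint_with_chaining(void) {
        Tx *chain = NULL;

        if (checkpoint_block("D") != 0) {    /* failed intention: D cannot be written */
            remap_map_update("D", "R");      /* policy remaps D to R                  */
            checkpoint_block("R");           /* realize the data at R instead         */
            chain = journal_commit("chained tx: remap-map D -> R");
        }
        checkpoint_block("I");

        /* Rule: only after the chained transaction commits can the previous
         * transaction be released; a crash before that point still replays both. */
        if (chain == NULL || chain->committed)
            release("previous");
    }

    int main(void) { checkpoint_with_chaining(); return 0; }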
Chained Transactions Example
Example policy: remapping
[Figure: the original transaction (TB, D, I, TC) is checkpointed; when D's write fails and is remapped, the remap-map update (RMD0 becomes RMDR) is committed in a chained transaction; old rule: release the transaction once checkpoint completes; new rule: release it only after the chained transaction commits]
Summary
• Chained transactions
  • Handle failed intentions
  • Work for all policies
  • Minimal changes in the journaling layer
  • Repeatable across crashes
• Idempotent policies
  • An important property for consistency across multiple crashes
Outline • Introduction • I/O Shepherd Architecture • Implementation • Evaluation • Conclusion
Evaluation
• Flexible
  • Change ext3 to all-stop or retry-more policies
• Fine-grained
  • Implement gracefully-degrading RAID [TOS '05]
• Composable
  • Perform multiple lines of defense
• Simple
  • Craft 8 policies in a simple manner
Flexibility
• Modify ext3's inconsistent read recovery policies
[Figure: ext3's observed recovery policy for each (workload, failed block type) pair – a mix of no recovery, retry, stop, propagate, ignore failure, and not applicable]
• Example – failed block: indirect block; workload: path traversal (cd /mnt/fs2/test/a/b/)
  • Policy observed: detect the failure and propagate it to the application
Flexibility
• Modify ext3's policies to all-stop policies
[Figure: ext3's mixed policy matrix (no recovery, retry, stop, propagate) becomes uniformly "stop" under the all-stop policy]
    AllStopRead(Block B)
        if (Read(B) == OK)
            return OK;
        else
            Stop();
Flexibility
• Modify ext3's policies to retry-more policies
[Figure: ext3's mixed policy matrix becomes uniformly "retry" under the retry-more policy]
    RetryMoreRead(Block B)
        for (int i = 0; i < RETRY_MAX; i++)
            if (Read(B) == SUCCESS)
                return SUCCESS;
        return FAILURE;
Fine-Granularity
• RAID problem: extreme unavailability
  • Partially available data
  • Unavailable root directory
• DGRAID [TOS '05]
  • Degrades gracefully
  • Fault-isolates a file to a disk
  • Highly replicates metadata
[Figure: a file system directly on RAID-0 (file1.pdf, /root striped across disks) vs. a file system with the shepherd implementing DGRAID on RAID-0, where files (f1.pdf, f2.pdf) are each isolated to a single disk]
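To make the fault-isolation idea concrete, here is a hedged sketch of a DGRAID-style placement decision (not the DGRAID implementation): each file's blocks go to one home disk chosen from its inode number, while metadata blocks are replicated on several disks so the namespace survives a disk failure.

    #include <stdio.h>

    #define NDISKS        10
    #define META_REPLICAS 3        /* assumed degree of metadata replication */

    /* Pick a home disk for a file: all of its data blocks land on one disk,
     * so losing a disk loses whole files rather than pieces of every file. */
    static int home_disk(unsigned long inode_no) {
        return (int)(inode_no % NDISKS);
    }

    /* Metadata (superblock, inodes, directories) is placed on several disks,
     * so the namespace stays available after a disk failure. */
    static void metadata_disks(unsigned long block_no, int out[META_REPLICAS]) {
        for (int i = 0; i < META_REPLICAS; i++)
            out[i] = (int)((block_no + i) % NDISKS);
    }

    int main(void) {
        int replicas[META_REPLICAS];
        printf("file with inode 1234 -> data on disk %d\n", home_disk(1234));
        metadata_disks(77, replicas);
        printf("metadata block 77 -> disks %d, %d, %d\n",
               replicas[0], replicas[1], replicas[2]);
        return 0;
    }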
Fine-Granularity
[Graph: availability vs. number of failed disks for a 10-way layout, comparing linear striping with metadata replication levels X = 1, 5, 10; annotated points include ~90% availability after 1 failure, ~80% after 2, and ~40% after 3]
Composability
• Multiple lines of defense
  • Assemble both low-level and high-level recovery mechanisms
    ReadInode(Block B) {
        C = Lookup(Ch-Map, B);
        Read(B, C);
        if (CompareChecksum(B, C) == OK)
            return OK;
        M = Lookup(M-Map, B);
        Read(M);
        if (CompareChecksum(M, C) == OK)
            { B = M; return OK; }
        if (SanityCheck(B) == OK)
            return OK;
        if (SanityCheck(M) == OK)
            { B = M; return OK; }
        RunOnlineFsck();
        return ReadInode(B);
    }
[Graph: time (ms) as successive lines of defense are invoked]
Simplicity
• Writing a reliability policy is simple
  • Implemented 8 policies
  • Using reusable primitives
  • The most complex is < 80 LOC
Conclusion
• Modern storage failures are complex
  • Not only fail-stop; they also exhibit individual block failures
• An FS reliability framework does not exist
  • Scattered policy code – can't expect much reliability
  • Journaling + block failures → failed intentions (a flaw)
• I/O Shepherding
  • Powerful: deploy disk-level, RAID-level, and FS-level policies
  • Flexible: reliability as a function of workload and environment
  • Consistent: chained transactions
ADvanced Systems Laboratory – www.cs.wisc.edu/adsl
Thanks to: I/O Shepherd's shepherd – Frans Kaashoek
Scholarship sponsor: [logo]   Research sponsor: [logo]
[Figure: mirror-map states during RemapMirrorData – initially D → NULL, then D → R, and after R's write fails, D → Q; the disk subsystem holds D, R, and Q]
Policy Code:
    RemapMirrorData(Addr D)
        Addr R, Q;
        MapLookup(MMap, D, R);
        if (R == NULL)
            R = PickMirrorLoc(D);
            MapAllocate(MMap, D, R);
        Copy(D, R);
        Write(D, R);
        if (Fail(R))
            Deallocate(R);
            Q = PickMirrorLoc(D);
            MapAllocate(MMap, D, Q);
            Write(Q);
Chained Transactions (2)
Example policy: RemapMirrorData
[Figure: the original transaction (TB, D, I, TC) is checkpointed to the fixed locations (D, I, and mirrors R1, R2); the mirror-map updates (MD0 becomes MDR1, then MDR2) are committed in a chained transaction before the original transaction is released, once checkpoint completes]
Existing Solutions Enough?
• Is the machinery in high-end systems enough (e.g., disk scrubbing, redundancy, end-to-end checksums)?
  • Not pervasive in home environments (storing photos, tax returns)
  • New trend: commodity storage clusters (Google, EMC Centera)
• Is RAID enough?
  • Requires more than one disk
  • Does not protect against faults above the disk system
  • Focuses on whole-disk failure
  • Does not enable fine-grained policies