Automatic Data Structure Repair for Self-Healing Systems
Brian Demsky, Martin Rinard
Laboratory for Computer Science, Massachusetts Institute of Technology
Motivation
Errors in a broken data structure:
• Missing elements
• Inappropriate sharing
• Dangling references
• Out of bounds array indices
• Inconsistent values
[Figure: Broken Data Structure, with nodes labeled F = 20, G = 10; F = 20, G = 5; I = 5; J = 2]
Goal
[Figure: Broken Data Structure → Repair Algorithm → Consistent Data Structure. The Repair Algorithm also takes Consistency Properties From Developer as input.]
What Does Repair Algorithm Produce?
• Data structure that
  • Satisfies consistency properties, and
  • Is heuristically close to broken data structure
• Not necessarily the same data structure as (hypothetical) correct program would produce
• But enough to keep program operating successfully
Precursors
• Data structure repair has historically appeared in systems with extreme reliability goals
  • 5ESS switch – hand-coded audit routines
  • IBM MVS operating system – hand-coded failure recovery routines
• Key component of these systems
Where Is This Likely To Be Useful?
• Not for systems with slack - can just reboot
  • Cause of error must go away after reboot
  • Must be OK to lose volatile state
  • Must be OK to wait for reboot
• Persistent data structures (file systems, application files)
• Autonomous and/or safety critical systems
• Monitor/control unstable physical phenomena
• Largely independent subcomputations
• Moving time window
Architecture
[Figure: Broken Bits → Model Definition & Translation → Broken Abstract Model → Internal Consistency Properties → Repaired Abstract Model → External Consistency Properties → Repaired Bits]
Architecture Rationale
Why go through the abstract model?
• Simple, uniform structure
  • Sets of objects
  • Relations between objects
• Simplifies both
  • Expression of consistency properties
  • Repair algorithm
• Enables system to support full range of efficient, heavily encoded data structures
File System Example
    struct Entry {
        byte name[Length];
        int firstBlock;
    }
    struct Block {
        int nextBlock;
        byte data[BlockSize];
    }
    struct Disk {
        Entry dir[NumEntries];
        Block block[NumBlocks];
    }
    Disk D;
Directory Entries (name, firstBlock): abst 0; intro 2
Disk Blocks (index: nextBlock): 0: 1; 1: -5; 2: 1; 3: -1
Model Definition
• Sets of objects
    set blocks of integer : partition used | free;
• Relations between objects – values of object fields, referencing relationships between objects
    relation next : used, used;
[Figure: set blocks partitioned into used and free, with relation next from used to used]
Model Translation
Bits are translated to sets and relations in the abstract model using statements of the form:
    Quantifiers, Condition ⇒ Inclusion Constraint
    for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used
    for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ <b, D.block[b].nextBlock> in next
    for <b,n> in next, true ⇒ n in used
    for b in 0..NumBlocks, not (b in used) ⇒ b in free
Model in Example
Directory Entries (name, firstBlock): abst 0; intro 2
Disk Blocks (index: nextBlock): 0: 1; 1: -5; 2: 1; 3: -1
Abstract model: used = {0, 1, 2}; free = {3}; next = {<0,1>, <2,1>}
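The translation rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `FIRST_BLOCK`, `NEXT_BLOCK`, and `build_model` are hypothetical, and the disk is modeled as plain Python containers holding the values from the example.

```python
# Hypothetical stand-in for the example disk (values taken from the example slide).
FIRST_BLOCK = {"abst": 0, "intro": 2}   # directory entry name -> firstBlock
NEXT_BLOCK = [1, -5, 1, -1]             # block index -> nextBlock field

def build_model(first_block, next_block):
    """Apply the four model translation rules until no new facts appear."""
    n = len(next_block)
    used, nxt = set(), set()
    # Rule 1: every in-range firstBlock is in used.
    for fb in first_block.values():
        if 0 <= fb < n:
            used.add(fb)
    # Rules 2 and 3: in-range nextBlock values of used blocks go into next,
    # and the target of every pair in next is itself used.
    work = list(used)
    while work:
        b = work.pop()
        nb = next_block[b]
        if 0 <= nb < n:
            nxt.add((b, nb))
            if nb not in used:
                used.add(nb)
                work.append(nb)
    # Rule 4: every block that is not used is free.
    free = {b for b in range(n) if b not in used}
    return used, free, nxt

used, free, nxt = build_model(FIRST_BLOCK, NEXT_BLOCK)
```

Note that the out-of-range value -5 in block 1 simply produces no pair in `next`; the model never contains illegal references.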
Internal Consistency Properties
Quantifiers, Body
• Body is first-order property of basic propositions
  • Inequality constraints on values of numeric fields
    • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Presence of required number of objects
    • size(S) = C, size(S) ≤ C, size(S) ≥ C
  • Topology of region surrounding each object
    • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
    • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
  • Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R
• Example: for b in used, size(next.b) ≤ 1
Internal Consistency Violations
Evaluate consistency properties, find violations
for b in used, size(next.b) ≤ 1 is false for b = 1
Abstract model: used = {0, 1, 2}; free = {3}; next = {<0,1>, <2,1>}
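A minimal Python sketch of this check, assuming that `next.b` denotes the set of blocks whose `next` pair points at `b` (the reading that makes the property false for b = 1 in the example). The function names are hypothetical.

```python
def predecessors(nxt, b):
    """Blocks a such that <a, b> is in the next relation (assumed meaning of next.b)."""
    return {a for (a, c) in nxt if c == b}

def violations(used, nxt, limit=1):
    """Bindings of b for which size(next.b) <= limit is false."""
    return sorted(b for b in used if len(predecessors(nxt, b)) > limit)

# Abstract model from the example.
USED = {0, 1, 2}
NEXT = {(0, 1), (2, 1)}
bad = violations(USED, NEXT)
```

Each returned value is a binding for the quantified variable, which is exactly what the repair step needs.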
Repairing Violations of Internal Consistency Properties
• Violation provides binding for quantified variables
• Convert Body to disjunctive normal form
    (p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm)
    p1 … pn, q1 … qm are basic propositions
• Choose a conjunction to satisfy
• Repair violated basic propositions in conjunction
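The selection step might look like the following Python sketch. The slide only says that a conjunction is chosen; picking the conjunction with the fewest violated propositions is an assumption made here for illustration, and `choose_conjunction` is a hypothetical name.

```python
def choose_conjunction(dnf, holds):
    """Pick a conjunction to satisfy and list the propositions needing repair.

    dnf   : list of conjunctions, each a list of proposition names
    holds : dict mapping proposition name -> truth value under the
            binding supplied by the violation
    Assumed heuristic: prefer the conjunction with the fewest false propositions.
    """
    def cost(conj):
        return sum(1 for p in conj if not holds[p])

    best = min(dnf, key=cost)
    return best, [p for p in best if not holds[p]]
```

Satisfying every proposition in the chosen conjunction makes the whole disjunction, and therefore the Body, true for that binding.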
Repairing Violations of Basic Propositions
• Inequality constraints on values of numeric fields
  • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Compute value of expression, assign field
• Presence of required number of objects
  • size(S) = C, size(S) ≤ C, size(S) ≥ C
  • Remove or insert objects from/to set
• Topology of region surrounding each object
  • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
  • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
  • Remove or insert pairs from/to relation
• Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R
  • Remove or add the object or pair from/to set or relation
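Two of these repair actions are sketched below in Python, under simplifying assumptions: objects are dictionaries, sets are Python sets of integers, and which element to remove is an arbitrary choice. The function names are hypothetical.

```python
def repair_field_le(obj, field, bound):
    """Repair V.R <= E: compute the bound and assign it if the field exceeds it."""
    if obj[field] > bound:
        obj[field] = bound

def repair_size_eq(s, c, fresh):
    """Repair size(S) = C by removing or inserting objects.

    fresh is a sequence of candidate objects to insert; removing the
    largest element first is an arbitrary policy assumed for this sketch.
    """
    while len(s) > c:
        s.remove(max(s))
    for x in fresh:
        if len(s) >= c:
            break
        s.add(x)
```

A fuller version would also have to choose which objects to insert so that other constraints are not disturbed.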
Repair in Example
for b in used, size(next.b) ≤ 1 is false for b = 1
Must repair size(next.1) ≤ 1
Can remove either <0,1> or <2,1> from next
Before repair: used = {0, 1, 2}; free = {3}; next = {<0,1>, <2,1>}
After repair (removing <2,1>): used = {0, 1, 2}; free = {3}; next = {<0,1>}
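A small Python sketch of this repair. Either pair could be removed; keeping the pair with the smallest source block is a policy assumed here only so that the result is deterministic, and `repair_in_degree` is a hypothetical name.

```python
def repair_in_degree(nxt, b, limit=1):
    """Remove pairs <a, b> from next until size(next.b) <= limit holds."""
    preds = sorted(a for (a, c) in nxt if c == b)
    for a in preds[limit:]:        # keep the first `limit` predecessors
        nxt.discard((a, b))
    return nxt

NEXT = {(0, 1), (2, 1)}
repaired = repair_in_degree(set(NEXT), 1)
```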
Acyclic Repair Dependences
• Questions
  • Isn’t it possible for the repair of one constraint to invalidate another constraint?
  • What about infinite repair loops?
  • What about unsatisfiable specifications?
• Answer
  • We require specifications to have no cyclic repair dependences between constraints
  • So all repair sequences terminate
  • Repair can fail only because of resource limitations
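One way to picture the acyclicity requirement is as a cycle check on a dependence graph, sketched below in Python. The graph encoding (an edge from A to B when repairing A may invalidate B) and the name `has_cycle` are assumptions for illustration; the slides do not say how the dependences are computed.

```python
def has_cycle(deps):
    """Return True if the dependence graph contains a cycle.

    deps maps each constraint name to the constraints its repair may invalidate.
    """
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in deps}

    def visit(n):
        color[n] = GRAY
        for m in deps.get(n, ()):
            state = color.get(m, WHITE)
            if state == GRAY:          # back edge: cycle found
                return True
            if state == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(deps))
```

A specification whose graph passes this check admits only finite repair sequences, since each repair can only trigger repairs further along the graph.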
External Consistency Constraints
Quantifiers, Condition ⇒ Body
• Body of form V = E, V.F = E, V.F[I] = E
• Example
    for b in free, true ⇒ D.block[b].nextBlock = -2
    for <i,j> in next, true ⇒ D.block[i].nextBlock = j
    for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1
• Repair simply performs assignments
• Translates model repairs to bit repairs
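Repair in Example
Inconsistent File System
  Directory Entries (name, firstBlock): abst 0; intro 2
  Disk Blocks (index: nextBlock): 0: 1; 1: -5; 2: 1; 3: -1
Repaired File System
  Directory Entries (name, firstBlock): abst 0; intro 2
  Disk Blocks (index: nextBlock): 0: 1; 1: -1; 2: -1; 3: -2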
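The three example external consistency constraints can be sketched as plain assignments in Python. The function name `write_back` and the list representation of the `nextBlock` fields are assumptions for illustration.

```python
def write_back(used, free, nxt, next_block):
    """Translate the repaired model back into nextBlock field values."""
    # for b in free, true => D.block[b].nextBlock = -2
    for b in free:
        next_block[b] = -2
    # for <i,j> in next, true => D.block[i].nextBlock = j
    for (i, j) in nxt:
        next_block[i] = j
    # for b in used, size(b.next) = 0 => D.block[b].nextBlock = -1
    for b in used:
        if not any(i == b for (i, _) in nxt):
            next_block[b] = -1
    return next_block

# Repaired model from the example, applied to the inconsistent disk blocks.
blocks = write_back({0, 1, 2}, {3}, {(0, 1)}, [1, -5, 1, -1])
```

With the repaired model from the example, the resulting fields are 1, -1, -1, -2, which matches the repaired file system.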
When to Test for Consistency and Repair
• Persistent data structures
  • Repair can be independent activity, or
  • Repair when data written out or read in
• Volatile data structures in running program
  • Under programmer control
  • Transaction-based approach
    • Identify transaction start and end
    • Repair at start, end, or both
  • Failure-based approach
    • Wait until program fails
    • Repair and restart from latest safe point
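The transaction-based approach could be sketched in Python as a context manager that checks and repairs at both the start and the end of a transaction. Everything here (`transaction`, `check_count`, `repair_count`, the toy state) is hypothetical and only illustrates where the repair calls sit.

```python
from contextlib import contextmanager

@contextmanager
def transaction(state, check, repair):
    """Run a transaction with consistency check and repair at start and end."""
    if not check(state):
        repair(state)
    try:
        yield state
    finally:
        if not check(state):
            repair(state)

# Toy consistency property: the cached count equals the number of items.
def check_count(s):
    return s["count"] == len(s["items"])

def repair_count(s):
    s["count"] = len(s["items"])

state = {"items": [1, 2], "count": 2}
with transaction(state, check_count, repair_count) as s:
    s["items"].append(3)   # buggy update: forgets to bump the count
```

After the block exits, the stale count has been repaired, so later transactions start from a consistent state.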
Experience
• We acquired four benchmarks (written in C/C++)
  • CTAS (air-traffic control tool)
  • Simplified Linux file system
  • Freeciv interactive game
  • Microsoft Word files
• We developed specifications for all four
  • Very little development time (days, not weeks)
  • Most of time spent figuring out Freeciv and CTAS
• Each benchmark has
  • Workload
  • Fault insertion methodology
• Ran benchmarks with and without repair
CTAS
• Set of air-traffic control tools
  • Traffic management
  • Arrival planning
  • Flow visualization
  • Shortcut planning
• Deployed in centers around country (Dallas/Ft. Worth, Los Angeles, Denver, Miami, Minneapolis/St. Paul, Atlanta, Oakland)
• Approximately 1 million lines of C/C++ code
Results
• Workload – recorded radar feed from DFW
• Fault insertion
  • Simulate error in flight plan processing
  • Bad airport index in flight plan data structure
• Without repair
  • System crashes – segmentation fault
• With repair
  • Aircraft has different origin or destination
  • System continues to execute
  • Anomaly eventually flushed from system
Aspects of CTAS
• Lots of independent subcomputations
  • System processes hundreds of aircraft – problem with one should not affect others
  • Multipurpose system (visualization, arrival planning, shortcuts, …) – problem in one purpose should not affect others
• Sliding time window: anomalies eventually flushed
• Rebooting ineffective – system will crash again as soon as it sees the problematic flight plan
Simplified Linux File System
[Figure: disk layout showing super block, group block, inode bitmap block, block bitmap block, inode block (inode … inode), directory block, and disk blocks]
Some Consistency Properties
• inode bitmap consistent with inode usage
• block bitmap consistent with block usage
• directory entries refer to valid inodes
• files contain valid blocks only
• files do not share blocks
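Two of these properties (block bitmap consistent with block usage, files do not share blocks) can be checked with a short Python sketch. The representation is an assumption for illustration: `inodes` maps an inode number to its list of block numbers, and `block_bitmap` is a list of booleans. `check_fs` is a hypothetical name.

```python
def check_fs(inodes, block_bitmap):
    """Return (blocks shared between files, blocks whose bitmap bit is wrong)."""
    owner = {}
    shared = set()
    for ino, blocks in inodes.items():
        for b in blocks:
            if b in owner and owner[b] != ino:
                shared.add(b)          # block claimed by two different files
            owner.setdefault(b, ino)
    # A bitmap bit is wrong when it disagrees with actual block usage.
    bitmap_errors = [b for b, bit in enumerate(block_bitmap)
                     if bit != (b in owner)]
    return sorted(shared), bitmap_errors
```

This sketch assumes every block number is in range; validating block numbers corresponds to the separate property that files contain valid blocks only.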
Results
• Workload – write and verify several files
• Fault insertion – crash file system
  • Inode and block bitmap errors
  • Partially initialized directory and inode entries
• Without repair
  • Incorrect file contents because of inode and disk block sharing
• With repair
  • Bitmaps repaired, preventing illegal sharing; correct file contents
Freeciv
[Figure: Terrain Grid with legend O = Ocean, P = Plain, M = Mountain; rows O P M M / O O P M / O P M M / P P P M; City Structures with loc: 3,0 and loc: 2,3]
Consistency Properties
• Tiles have valid terrain values
• Cities are not in the ocean
• Each city has exactly one reference from city location grid
• City locations are consistent in
  • City structures and
  • tile grid
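The first two properties can be sketched in Python as below. The grid and city locations are read off the slide figure; interpreting a location as (column, row) is an assumption, and the function names are hypothetical.

```python
# Terrain grid and city locations as they appear in the slide figure.
GRID = ["OPMM", "OOPM", "OPMM", "PPPM"]
CITIES = [(3, 0), (2, 3)]            # assumed to be (column, row)
VALID_TERRAIN = {"O", "P", "M"}

def invalid_tiles(grid):
    """Coordinates of tiles whose terrain value is not a valid one."""
    return [(x, y) for y, row in enumerate(grid)
            for x, t in enumerate(row) if t not in VALID_TERRAIN]

def cities_in_ocean(grid, cities):
    """Cities located on an ocean tile."""
    return [(x, y) for (x, y) in cities if grid[y][x] == "O"]
```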
Results
• Workload – Freeciv software plays against itself
• Fault insertion – randomly corrupt terrain values
• Without repair – program fails (seg fault)
• With repair
  • Game runs just fine
  • But game plays out differently because of the different terrain values
Microsoft Word Files
• Files consist of a sequence of streams
• Streams stored using FAT-based data structure
• Consistency Properties
  • FAT blocks exist and contain valid entries
  • FAT streams are properly terminated
  • Free blocks properly marked
  • Streams contain valid blocks
  • No sharing of blocks between streams
[Figure: Directory Entries (abst, intro), FAT, and Disk Blocks]
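A minimal Python sketch of checking that a stream is properly terminated and contains only valid, non-free blocks. The marker values -1 (end of stream) and -2 (free) are borrowed from the earlier file system example and are assumptions, not the actual Word file format; `stream_blocks` is a hypothetical name.

```python
END, FREE = -1, -2     # assumed markers, mirroring the file system example

def stream_blocks(fat, start):
    """Follow a FAT chain from start; return its blocks, or None if it is broken."""
    seen = []
    b = start
    while b != END:
        out_of_range = not (0 <= b < len(fat))
        if out_of_range or b in seen or fat[b] == FREE:
            return None        # invalid entry, cycle, or block marked free
        seen.append(b)
        b = fat[b]
    return seen
```

Checking that no block appears in the result for two different streams would cover the no-sharing property.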
Results
• Workload – several Microsoft Word files
• Fault insertion – scramble FAT
• Without repair
  • If blocks containing the FAT were incorrectly marked as free, Word successfully loads file
  • Otherwise, “The document name or path is not valid”
• With repair
  • Word loads all files
Extensions
• Elimination of external consistency constraints
  • Eliminates problems with translating repairs on the abstract model to the actual data structure
  • Repair algorithm analyzes model definition rules to generate repair actions for the actual data structure
Extensions
• Support for doubly linked data structures
  • Enables the repair algorithm to regenerate back links
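Regenerating back links amounts to recomputing the inverse of the forward links, as in this Python sketch. The dictionary representation and the name `regenerate_prev` are assumptions for illustration; the slides do not describe how the extension is implemented.

```python
def regenerate_prev(next_of):
    """Rebuild prev links of a doubly linked list from its next links.

    next_of maps each node to its successor, or None at the tail.
    """
    prev_of = {n: None for n in next_of}
    for n, m in next_of.items():
        if m is not None:
            prev_of[m] = n
    return prev_of
```

The sketch assumes the forward links are already consistent, so each node has at most one predecessor.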
Extensions • Compilation and optimization of consistency checking • Achieved significant speedups (n x) by compiling the specification • Achieved further speedups () by partially optimizing away the construction of the abstract model
Related Work
• Hand-coded repair
  • Lucent 5ESS switch
  • IBM MVS operating system
• Self-stabilizing algorithms
• Log-based recovery for database systems
• Recovery-oriented computing
  • Recursive restartability
  • Undo framework
Conclusion
• Data structure repair is an interesting way to (potentially) improve reliability
• Specification-based approach promises to make the technique more widely applicable
• Moving towards a more robust, probabilistic, continuous concept of system behavior