1 / 34

Towards Formal Approaches to System Resilience

Towards Formal Approaches to System Resilience. Vishal Chandra Sharma * , Arvind Haran * , Zvonimir Rakamaric * , Ganesh Gopalakrishnan *§ { vcsharma , haran , zvonimir , ganesh }@cs.utah.edu School of Computing, University of Utah.

soleil
Download Presentation

Towards Formal Approaches to System Resilience

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards Formal Approaches to System Resilience Vishal Chandra Sharma*, Arvind Haran*, ZvonimirRakamaric*, Ganesh Gopalakrishnan*§ {vcsharma, haran, zvonimir, ganesh}@cs.utah.edu School of Computing, University of Utah *Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426. §Faculty Associate, SUPER (http://super-scidac.org/)

  2. Overview • Introduction • Fault Injector • Case Study • Fault Detector • Concluding Remarks

  3. Motivation • Recent studiesshow resiliency as a growing area of concern [arg13] [lanl05] • MTBF decreasing at a faster rate in exascale computing • Dynamic voltage/frequency scaling in low power computing • Our goal is to improve application-level resiliency • Primary focus is to detect transient faults in a software • Silent data corruption (SDC)

  4. Motivating Example int x = 2;int y = 11; printf(“x=%d, y=%d” ,x ,y) if (x < 3&& y > 10) y++; else x++;

  5. Motivating Example int x = 2;int y = 11; printf(“x=%d, y=%d” ,x ,y) if (x < 3&& y > 10) y++; else x++; Program output:x=2, y=12

  6. Motivating Example LSB position of x flipped int x = 3;int y = 11; printf(“x=%d, y=%d” ,x ,y) if (x < 3&& y > 10) y++; else x++;

  7. Motivating Example LSB position of x flipped int x = 3;int y = 11; printf(“x=%d, y=%d” ,x ,y) if (x < 3&& y > 10) y++; else x++; SDC in the output value of x Program output:x=4, y=11

  8. Our Contribution • A LLVM-level fault Injector for evaluation purpose [llvm04] • A simple case study on sorting algorithms • Demonstrates effectiveness of our solution • Highlights importance of design space exploration w.r.t. resiliency • A software-level fault detector based on idea of predicate abstraction • Applying it in resiliency research is a novel direction! • Introduced by Ballto define a novel program coverage metrics [pct05]

  9. Closely Related Work • Low-cost software level detectors • iSWAT by Sahoo et. al.uses likely program invariants [iswat08] • Derives likely invariants by monitoring program properties • Hardware-assisted framework to detect false positives • Error detector by Sloan et.al. [sloan13] • Algorithm based error detector applied to linear solvers • Utilizes algorithmic properties of linear solvers to detect and isolate errors • Software-level fault injectors • LLVM-level fault injector developed by Kuijif et. al. [relax10] • Publicly unavailable • A recent study done by a user suggests our fault injector has better fine-grained options [schen13] • LLFI fault injector by Thomas et. al. [thomas13] • Developed around same time as our fault injector, shares many similar features

  10. Overview • Introduction • Fault Injector • Case Study • Fault Detector • Concluding Remarks

  11. Fault Injector KontrollableUtah’s LLVM based Fault Injector  KULFI KULFI Indian dessert 

  12. KULFI: Fault Injection Logic Start Forall dynamic instructions Feasible? No Yes Inject Fault with user provided probability Stop

  13. KULFI: Fault Injection Process Program Clang Execution Outcome LLVM bitcode Program Input Vectors LLVM LLVM KULFI KULFI Dynamic Instruction Count Fault Injecting LLVM bitcode SDCSegFaultBenign

  14. Overview • Introduction • Fault Injector • Case Study • Fault Detector • Concluding Remarks

  15. Case Study Sorting routines - Bubblesort, Quicksort, Mergesort, Radixsort, Heapsort 1 Experiment 200 Fault Injection Campaigns 100 Program Runs 1 Fault Injection Campaign Inject one fault during each program run Total number of fault injections = 200*100 = 20,000!

  16. Case Study • A dynamic instruction is chosen at random for fault injection • A single-bit fault in a random bit position of the dynamic instruction’s register • For each fault injection campaign, categorize outcome into SDC, Benign, or Segmentation fault categories • Benign: 41, Segmentation: 29, SDC: 30

  17. Case Study • Plot fault count from a fault category corresponding to a fault injection campaign • X axis: Fault count corresponding to a fault injection campaign • Y axis: Sorting routines • Result shows strong clustering pattern with statistically significant distribution for each fault category

  18. Results

  19. Results

  20. Results

  21. Overview • Introduction • Fault Injector • Case Study • Fault Detector • Concluding Remarks

  22. A Software-Level Approach to Fault Detection • Predicates: Boolean program conditionals • Predicate State: <PP,BV> • PP: Program point between two successive program statements • BV:Bit-vector representing concrete boolean values of program conditionals at a given program point • Predicate State Transition: Current State  Next State

  23. A Software-Level Approach to Fault Detection Foo(int x, int y){PP0:If ( x<3 && y>10 ){PP1: y++; PP2:}else{ PP3: x++;PP4: }PP5:printf(“x=%d, y=%d”,x , y) } Predicates: x<3, y>10 Program Points: PP0, PP1, PP2, PP3, PP4, PP5 Input Vectors: x = 2, y = 11 Predicate State at PP0: <PP0, TT> Predicate State at PP1:<PP1, TT> Predicate State Transition: <PP0, TT>  <PP1, TT> <PP0,TT> <PP1,TT>

  24. A Software-Level Approach to Fault Detection Start Start Program P Program P Instrumented Program P2 Instrumented Program P1 Get Predicate Transition No Extract predicate transitions last transition? Check if Valid ? Yes Stop No Yes Fault Detected Profile valid predicate transitions Stop Detect transient faults

  25. Predicate Transition Diagram (PTD) Start Program Inject Fault Track Predicate Transitions Track Predicate Transitions Merge Predicate Transition Diagram Stop

  26. PTD of Foo()

  27. PTD of dgstrf() in SuperLU [slu99,05,11] • SuperLU is a direct linear solver for sparse and nonsymmetric systems of linear equations • Available at: http://crd-legacy.lbl.gov/~xiaoye/SuperLU/

  28. PTD of BlkSchlsEqEuroNoDiv() in Blackscholes • Financial analysis using blackscholesPDE • Part of Parsec 3.0 benchmark suite [parsec08]

  29. Overview • Introduction • Fault Injector • Case Study • Fault Detector • Concluding Remarks

  30. Concluding Remarks • A novel software-level fault detector • Enabling infrastructure for resiliency analysis and evaluation through KULFI • Recommended use during design space exploration • Try out KULFI: https://github.com/soar-lab/KULFI

  31. References [arg13] Snir, M., et al. Addressing Failures in Exascale Computing. No. ANL/MCS-TM-33. Argonne National Laboratory (ANL), 2013 [lanl05] Michalak, Sarah E., et al. "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer."  IEEE Transactions on Device and Materials Reliability, 2005 [llvm04] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization (CGO), 2004 [pct05] T. Ball, “A theory of predicate-complete test coverage and generation,” in International Conference on Formal Methods for Components and Objects (FMCO), 2005 [iswat08] S. K. Sahoo, M. lap Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, “Using likely program invariants to detect hardware errors,” in IEEE International Conference on Dependable Systems and Networks (DSN), 2008 [sloan13] Sloan, Joseph, Rakesh Kumar, and Greg Bronevetsky. "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance.“,  in IEEE International Conference on Dependable Systems and Networks (DSN), 2013

  32. References [slu99] Demmel, James W., et al. "A supernodal approach to sparse partial pivoting.“ SIAM Journal on Matrix Analysis and Applications, 1999 [slu05] Li, Xiaoye S. "An overview of SuperLU: Algorithms, implementation, and user interface." ACM Transactions on Mathematical Software (TOMS), 2005[slu11] Li, X. S., Demmel, J. W., Gilbert, J. R., Grigori, L., Shao, M., & Yamazaki, I. (2011). SuperLU Users’ Guide. url: http://crd. lbl. gov/~ xiaoye/SuperLU/superlu_ug. Pdf. [sprs11] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS), 2011 [parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” ser. PACT, 2008 [relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An ar- chitectural framework for software recovery of hardware faults,” in International Symposium on Computer Architecture (ISCA), 2010 [thomas13] Thomas, Anna, and KarthikPattabiraman. "Error Detector Placement for Soft Computation." in International Conference on Dependable Systems and Networks (DSN), 2013. [schen13] S. Chen, personal communication, 2013.

  33. Acknowledgements • Pedro Diniz • PrabhakarKudva • ShuvenduLahiri • KarthikPattabiraman • Sui Chen • Anonymous reviewers of PRDC conference who reviewed our paper

  34. Thank you!

More Related