Application-level Techniques to Improve System Resilience
Vishal Chandra Sharma*, Arvind Haran*, Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§
{vcsharma, haran, zvonimir, ganesh}@cs.utah.edu
School of Computing, University of Utah
*Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426.
§Faculty Associate, SUPER (http://super-scidac.org/)
Research Goals
• Robust Evaluation Infrastructure
  • Released KULFI, an open source instruction-level fault injector
  • Evaluated sorting routines using KULFI; results shared at PRDC’13
• Lightweight Application-level Detectors
  • Developed FUSED, a soft-error detection framework
  • Preliminary results to be presented at SELSE’14
  • Work in progress on heuristics to optimize detector placement
• Identifying Vulnerable Code-Regions
  • Applications include detector placement optimization
  • Work in progress
KULFI: A Soft-Error Injector
• Flexible evaluation infrastructure using KULFI
• Active collaborations to promote usage of KULFI in other resilience studies
• Current collaborators: Greg Bronevetsky (LLNL), Sui Chen, Lu Peng (LSU)
Motivating Example
LSB position of x flipped
    int x = 3;
    int y = 11;
    if (x < 3 && y > 10)
        y++;
    else
        x++;
    printf("x=%d, y=%d", x, y);
SDC in the output value of x
Program output: x=4, y=12
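The single-bit flip in this example can be reproduced in software. The following is a minimal sketch, not KULFI's implementation: `flip_bit`, `run_example`, and `struct result` are hypothetical names, and the sketch assumes the flip hits `x` just before the comparison and that the values are read after the conditional.

```c
#include <assert.h>
#include <stdint.h>

/* Flip a single bit of a 32-bit value, emulating a transient soft error.
   Bit 0 is the least significant bit (LSB). */
static int32_t flip_bit(int32_t value, unsigned bit) {
    assert(bit < 32);
    return (int32_t)((uint32_t)value ^ (UINT32_C(1) << bit));
}

/* Final values of x and y after one run of the example. */
struct result { int x; int y; };

/* Runs the example's conditional. When inject is nonzero, the LSB of x
   is flipped just before the comparison (an assumed injection point). */
static struct result run_example(int inject) {
    int x = 3;
    int y = 11;
    if (inject) {
        x = flip_bit(x, 0);
    }
    if (x < 3 && y > 10) {
        y++;
    } else {
        x++;
    }
    struct result r = { x, y };
    return r;
}
```

Calling `run_example(0)` and `run_example(1)` and comparing the two results shows whether the flip changed the branch taken and therefore the final values.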
A Software-Level Approach to Fault Detection
    int x = 2;
    int y = 11;
    PP0:
    if (x < 3 && y > 10) {
    PP1:
        y++;
    PP2:
    } else {
    PP3:
        x++;
    PP4:
    }
    PP5:
    printf("x=%d, y=%d", x, y);
Program Conditionals: x<3, y>10
Program Points: PP0, PP1, PP2, PP3, PP4, PP5
Predicate State at PP0: <PP0, TT>
Predicate State at PP1: <PP1, TT>
Example Predicate State Transition: <PP0, TT> → <PP1, TT>
FUSED Soft-Error Detection Framework
• Automatically synthesizes and inserts detectors
• Uses profilers to generate likely invariants
• Likely invariants are used for soft error detection
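To illustrate the general idea of a likely invariant, the sketch below learns a value range during profiling and flags values outside it at detection time. This is a generic illustration using range invariants, which is an assumption: the slides do not say FUSED uses this invariant form, and `range_inv`, `inv_init`, `inv_observe`, and `inv_violated` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical range invariant: lo <= v <= hi for all profiled values. */
struct range_inv { long lo; long hi; };

/* Start with an empty range: no value has been observed yet. */
static void inv_init(struct range_inv *inv) {
    inv->lo = LONG_MAX;
    inv->hi = LONG_MIN;
}

/* Profiling phase: widen the range to include an observed value. */
static void inv_observe(struct range_inv *inv, long v) {
    if (v < inv->lo) inv->lo = v;
    if (v > inv->hi) inv->hi = v;
}

/* Detection phase: returns 1 if v violates the learned range. */
static int inv_violated(const struct range_inv *inv, long v) {
    assert(inv->lo <= inv->hi);  /* at least one profiled value required */
    return v < inv->lo || v > inv->hi;
}
```

Because the invariant is only "likely", a fault-free value that never appeared during profiling would also be flagged, so a detector of this kind can report false positives.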
Preliminary Experimental Results
• FUSED is evaluated using the SuperLU scientific library
• Up to 90% of soft errors are detected
• Detectors are inserted only into the top-level LU factorization routine
• Average execution overhead of 19% due to the detectors
• Future work: optimize detector placement to reduce overhead
Identifying Vulnerable Code-Regions
• Identify highly active code-regions w.r.t. data-flow
• Compute activity cost for data-flow edges
• Highly active code-regions are the most likely to be hit by soft errors
• Applications include detector placement optimization
Work in Progress
Concluding Remarks & Future Work
• KULFI, an open source fault injector for evaluation infrastructure
  • Try out KULFI: https://github.com/soar-lab/KULFI
• FUSED error detection framework
  • Continue developing heuristics for detector placement optimization
  • Plan for open source release
• Identifying Vulnerable Code-Regions
  • Applications include detector placement optimization
  • Characterizing resilience properties of a program
References
[arg13] Snir, M., et al. "Addressing Failures in Exascale Computing." No. ANL/MCS-TM-33, Argonne National Laboratory (ANL), 2013.
[lanl05] Michalak, S. E., et al. "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer." IEEE Transactions on Device and Materials Reliability, 2005.
[llvm04] Lattner, C., and Adve, V. "LLVM: A compilation framework for lifelong program analysis & transformation." International Symposium on Code Generation and Optimization (CGO), 2004.
[pct05] Ball, T. "A theory of predicate-complete test coverage and generation." International Conference on Formal Methods for Components and Objects (FMCO), 2005.
[iswat08] Sahoo, S. K., Li, M.-L., Ramachandran, P., Adve, S. V., Adve, V. S., and Zhou, Y. "Using likely program invariants to detect hardware errors." IEEE International Conference on Dependable Systems and Networks (DSN), 2008.
[sloan13] Sloan, J., Kumar, R., and Bronevetsky, G. "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance." IEEE International Conference on Dependable Systems and Networks (DSN), 2013.
[slu99] Demmel, J. W., et al. "A supernodal approach to sparse partial pivoting." SIAM Journal on Matrix Analysis and Applications, 1999.
[slu05] Li, X. S. "An overview of SuperLU: Algorithms, implementation, and user interface." ACM Transactions on Mathematical Software (TOMS), 2005.
[slu11] Li, X. S., Demmel, J. W., Gilbert, J. R., Grigori, L., Shao, M., and Yamazaki, I. SuperLU Users’ Guide, 2011. URL: http://crd.lbl.gov/~xiaoye/SuperLU/superlu_ug.pdf
[sprs11] Davis, T. A., and Hu, Y. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS), 2011.
[parsec08] Bienia, C., Kumar, S., Singh, J., and Li, K. "The PARSEC benchmark suite: Characterization and architectural implications." PACT, 2008.
[relax10] de Kruijf, M., Nomura, S., and Sankaralingam, K. "Relax: An architectural framework for software recovery of hardware faults." International Symposium on Computer Architecture (ISCA), 2010.
[schen13] Chen, S. Personal communication, 2013.
Closely Related Work
• Low-cost software-level detectors are needed
  • iSWAT by Sahoo et al. [iswat08] uses likely program invariants
    • Derives likely invariants by monitoring program properties
    • Uses a hardware-assisted framework to detect false positives
    • Not based on predicate abstraction
  • Error localization by Sloan et al. [sloan13] uses an algorithm-based approach
• A fault injector is needed as part of the evaluation infrastructure
  • LLVM-level fault injector developed by de Kruijf et al. [relax10]
    • Not publicly available
    • A recent study by a user [schen13] suggests KULFI has better fine-grained options
  • LLFI fault injector by Thomas et al.
    • Developed around the same time as KULFI; shares many similar features
KULFI: Fault Injection Logic
[Flowchart] Start → For all dynamic instructions → Feasible?
  • Yes: Inject fault with user-provided probability
  • No: no injection at this instruction
→ Stop
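The injection loop in this flowchart can be sketched over a recorded trace. This is a simplified model under stated assumptions, not KULFI's code: KULFI works at the LLVM instruction level, whereas the hypothetical `struct dyn_inst`, `next_random`, and `inject_faults` below operate on a plain array, and "feasible" is reduced to a flag supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One dynamic instruction in a hypothetical execution trace. */
struct dyn_inst {
    uint32_t result;  /* value produced by the instruction */
    int feasible;     /* nonzero if this is a legal injection site */
};

/* Small xorshift32 generator so runs are reproducible from a seed. */
static uint32_t next_random(uint32_t *state) {
    uint32_t s = *state;
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    *state = s;
    return s;
}

/* Walk all dynamic instructions; at each feasible one, flip one randomly
   chosen bit with probability prob. Returns the number of injected faults. */
static size_t inject_faults(struct dyn_inst *trace, size_t n,
                            double prob, uint32_t seed) {
    assert(prob >= 0.0 && prob <= 1.0);
    assert(seed != 0);  /* xorshift state must be nonzero */
    uint32_t state = seed;
    size_t injected = 0;
    for (size_t i = 0; i < n; i++) {
        if (!trace[i].feasible) {
            continue;  /* not a legal site: skip without drawing */
        }
        /* Scale a 32-bit draw into the unit interval and compare. */
        double draw = next_random(&state) / 4294967296.0;
        if (draw < prob) {
            unsigned bit = next_random(&state) % 32u;
            trace[i].result ^= UINT32_C(1) << bit;
            injected++;
        }
    }
    return injected;
}
```

A fixed seed makes an injection campaign repeatable, which helps when a particular faulty run needs to be replayed for debugging.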
Case Study
• Empirically study the resiliency of sorting algorithms: Bubblesort, Quicksort, Mergesort, Radixsort, Heapsort
• Inject exactly one fault into a randomly chosen dynamic instruction of a sorting routine
• 1 fault injection experiment = 100 runs, each with exactly one fault injected
• Categorize each outcome as SDC, Benign, or Segmentation fault
• Benign: 41, Segmentation: 29, SDC: 30
Case Study
• Executed 200 fault injection experiments per sorting routine
• Total number of fault injections = 5 * 200 * 100 = 100,000
• Plotted fault counts from each outcome category for each fault injection experiment
• Results show a strong clustering pattern with a statistically significant distribution for each outcome category
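The outcome categories used in the case study can be expressed as a small classifier. The sketch below is one plausible way to do it, assuming a run counts as Benign when its output matches the fault-free (golden) output and as SDC when it completes with a different output; `classify`, `tally_add`, and the `crashed` flag are hypothetical, and detecting the crash itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum outcome { BENIGN, SDC, SEGFAULT };

/* Classify one run. crashed is nonzero if the run ended in a segmentation
   fault; otherwise the output is compared with the golden output. */
static enum outcome classify(int crashed, const int *out,
                             const int *golden, size_t n) {
    if (crashed) {
        return SEGFAULT;
    }
    assert(out != NULL && golden != NULL);
    return memcmp(out, golden, n * sizeof *out) == 0 ? BENIGN : SDC;
}

/* Per-experiment tally, indexed by enum outcome. */
struct tally { unsigned count[3]; };

static void tally_add(struct tally *t, enum outcome o) {
    assert((int)o >= 0 && (int)o < 3);
    t->count[o]++;
}
```

With one `struct tally` per experiment of 100 runs, the three counters sum to 100, matching the per-experiment breakdown shown on the slide.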
Overview
• Introduction
• KULFI: An LLVM-Level Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
A Software-Level Approach to Fault Detection
• Predicates: pure boolean program conditionals
• Predicate State: <PP, BV>
  • PP: program point between two successive program statements
  • BV: bit-vector representing the concrete boolean values of the program conditionals at a given program point
• Predicate State Transition: <PP, BV> → <PP’, BV’>
  • PP’ is a program point that is an immediate successor of PP
  • BV’ is the bit-vector representing the concrete boolean values of the program conditionals at PP’
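These definitions map naturally onto a small data structure. The sketch below encodes a predicate state for the deck's two-conditional example (x<3 and y>10, program points PP0 to PP5); the bit assignment, `struct pred_state`, `make_state`, and `transition_equal` are illustrative assumptions rather than FUSED's actual representation.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PP 6u  /* the example has program points PP0..PP5 */

/* Predicate state <PP, BV>: a program point plus a bit-vector holding the
   current truth value of each program conditional. */
struct pred_state {
    unsigned pp;   /* program point index, e.g. 0 for PP0 */
    uint32_t bv;   /* bit i is 1 iff conditional i currently holds */
};

/* Build the state for the conditionals x < 3 (bit 0) and y > 10 (bit 1). */
static struct pred_state make_state(unsigned pp, int x, int y) {
    assert(pp < NUM_PP);
    struct pred_state s;
    s.pp = pp;
    s.bv = (uint32_t)(x < 3) | ((uint32_t)(y > 10) << 1);
    return s;
}

/* A predicate state transition <PP, BV> -> <PP', BV'>. */
struct pred_transition { struct pred_state from, to; };

/* Two transitions are equal when both endpoints agree. */
static int transition_equal(struct pred_transition a,
                            struct pred_transition b) {
    return a.from.pp == b.from.pp && a.from.bv == b.from.bv &&
           a.to.pp == b.to.pp && a.to.bv == b.to.bv;
}
```

With x = 2 and y = 11, `make_state(0, 2, 11)` sets both bits, which corresponds to the state written as <PP0, TT> in the example.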
A Software-Level Approach to Fault Detection
[Flowchart 1] Start → Program → Execute Program → Extract all valid predicate transitions → Stop
[Flowchart 2] Start → Program → Execute Program → Get Predicate Transition → Check if valid?
  • No: Fault Detected → Stop
  • Yes: Is last transition? If no, get the next predicate transition; if yes, Stop
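The two flows on this slide, first extracting the valid predicate transitions and then checking each observed transition against them, can be sketched as a set-membership test. This is a minimal sketch under assumptions: the fixed-capacity array, the linear search, and the names `valid_set`, `record_valid`, `is_valid`, and `first_invalid` are all hypothetical and chosen for brevity, not taken from FUSED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRANSITIONS 64  /* hypothetical capacity for this sketch */

/* Transition <from_pp, from_bv> -> <to_pp, to_bv>. */
struct transition {
    unsigned from_pp; uint32_t from_bv;
    unsigned to_pp;   uint32_t to_bv;
};

struct valid_set {
    struct transition items[MAX_TRANSITIONS];
    size_t count;
};

static int same_transition(const struct transition *a,
                           const struct transition *b) {
    return a->from_pp == b->from_pp && a->from_bv == b->from_bv &&
           a->to_pp == b->to_pp && a->to_bv == b->to_bv;
}

/* Returns 1 if t was recorded as valid, 0 otherwise. */
static int is_valid(const struct valid_set *set, const struct transition *t) {
    for (size_t i = 0; i < set->count; i++) {
        if (same_transition(&set->items[i], t)) return 1;
    }
    return 0;
}

/* First flow: remember a transition observed in a fault-free run. */
static void record_valid(struct valid_set *set, const struct transition *t) {
    if (is_valid(set, t)) return;  /* already known */
    assert(set->count < MAX_TRANSITIONS);
    set->items[set->count++] = *t;
}

/* Second flow: check the transitions of a run in order. Returns the index
   of the first invalid transition (fault detected), or n if none. */
static size_t first_invalid(const struct valid_set *set,
                            const struct transition *run, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!is_valid(set, &run[i])) return i;
    }
    return n;
}
```

A linear search keeps the sketch short; a real detector would likely need a cheaper lookup, since the check runs at every instrumented program point.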
Predicate Transition Diagram (PTD)
[Flowchart] Start → Program → two executions:
  • Execute Program → Track Predicate Transitions
  • Execute Program → Inject Fault → Track Predicate Transitions
→ Merge → Predicate Transition Diagram → Stop
Acknowledgements
• Pedro Diniz
• Prabhakar Kudva
• Shuvendu Lahiri
• Karthik Pattabiraman
• Sui Chen
• Anonymous reviewers of the PRDC conference who reviewed our paper