Explore recovery-oriented computing focusing on reducing recovery time and improving fault detection through structural behavior monitoring and application-level detection. Learn about Microreboot and System-Level Undo/Redo techniques.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org
Outline Recovery-oriented computing Overview Application-level fault detection Structural behavior monitoring Path shape analysis Microreboot and System-Level Undo/Redo
Recovery-Oriented Computing On availability of soft realtime systems Availability = MTTF/(MTTF+MTTR) MTTF: mean time to failure MTTR: mean time to recover Availability can be improved by increasing MTTF as well as reducing MTTR Recovery-oriented computing: focusing on reducing MTTR Making fault detection faster and more accurate Making recovery faster
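The availability formula above can be checked numerically. The sketch below is illustrative only: the MTTF and MTTR values are hypothetical and the function name `availability` is an assumption, not something from the lecture. It shows that halving MTTR gives the same availability gain as doubling MTTF.

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    return mttf / (mttf + mttr)

# Hypothetical baseline: fails every 1000 hours on average, takes 10 hours to recover.
baseline = availability(1000.0, 10.0)
doubled_mttf = availability(2000.0, 10.0)  # make failures half as frequent
halved_mttr = availability(1000.0, 5.0)    # make recovery twice as fast

print(f"baseline:     {baseline:.6f}")
print(f"doubled MTTF: {doubled_mttf:.6f}")
print(f"halved MTTR:  {halved_mttr:.6f}")
```

Both improvements yield the same availability, which is why recovery-oriented computing can focus on shortening recovery when raising MTTF is hard.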
Fault Detection and Localization Fault detection: determine if some component in the system has failed Fault localization: pinpoint the particular component that failed Low-level fault detection mechanism Based on timeout, probing each component periodically with a heartbeat message Cannot detect many application-level faults Recovery-oriented computing: focusing on application-level fault detection and localization 75% of the recovery time is spent on application-level fault detection
Microreboot and System-Level Undo/Redo Microreboot: many problems can be fixed by simply restarting the faulty component Works best with component-based systems For problems that cannot be fixed by microreboot: perform a system-level undo, fix the problem, then carry out a system-level redo Based on checkpointing and logging
System Model for Recovery-Oriented Computing Three-tier architecture Separating application logic and data management Middle tier is stateless or maintains only session state Component-based middleware Java Platform, Enterprise Edition (Java EE, often referred to as J2EE) Key component: Enterprise JavaBean (EJB)
Application-Level Fault Detection Fail-stop faults can be detected using timeouts Application-level faults can only be detected at the application level One plausible fault detection method: acceptance test Developers would have to write effective and efficient acceptance test routines Not practical for Internet apps due to their scale, complexity, and rapid rate of change ROC-based approach: measure and monitor structural behaviors of an app May detect app-level faults without a priori knowledge of the app details
Structural Behavior Monitoring Interaction patterns between different components reflect the app-level functionality Each component implements a specific app function, e.g., Stateful session bean to manage a user’s shopping cart A set of singleton session beans to keep track of inventory The internal structural behavior can be monitored to infer whether or not the app is functioning normally To monitor Log runtime path for each end-user request, including all incoming msgs, outgoing msgs, method invocations, etc.
Structural Behavior: Runtime Path Example Runtime path for a single end-user request Spans 5 components Consists of 10 events
Structural Behavior: Machine Learning Train reference models using machine learning Historical reference model: training with aggregated runtime path data Objective: anomaly detection based on historical behavior May use real workload as well as synthetic workload that resembles real workload Peer reference model: train with most recent runtime path data Objective: anomaly detection with respect to peer components Must train with real workload Fault (anomaly) detection: comparing observed patterns with those in the reference models
Component Interactions Modeling Focus on interactions between a component instance and all other component classes More scalable: can cope with cases when there are many instances of each class Suitable for using the Chi-square test for anomaly detection
Component Interactions Modeling Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and all the other n-1 component classes We assume instances of the same class do not interact with each other We assume that interactions are symmetric (i.e., request and reply) Weight assigned to each link is the probability of the component instance interacting with the linked component class The sum of the weight on all links is 1, i.e., the component instance has probability of 1 to interact with other component classes
Component Interaction Model: Example Class A: web component, handles end-user requests Class B: app logic, handles conversations with end-users, 3 instances Class C and Class D: also app logic, representing shared state Class E: database server, persistent state
Component Interaction Model: Example Machine learning: determine link weight based on training data Training data A issued 400 remote invocations on b1 b1 issued 300 local method invocations on C, and 300 invocations on D Not important what happened between C & E, D & E Link weight calculation Total number interactions occurred at b1 instance: 1000 P(b1-A) = 400/1000 = 0.4 P(b1-C) = 300/1000 = 0.3 P(b1-D) = 300/1000 = 0.3
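The link weight calculation above can be sketched in a few lines. The function name `link_weights` and the dictionary layout are assumptions made for illustration; the counts are the training data from the slide.

```python
def link_weights(interaction_counts):
    """Turn per-class interaction counts for one instance into link probabilities."""
    total = sum(interaction_counts.values())
    return {cls: count / total for cls, count in interaction_counts.items()}

# Training data for instance b1: interactions with classes A, C, and D.
b1_training = {"A": 400, "C": 300, "D": 300}
b1_weights = link_weights(b1_training)

for cls, weight in b1_weights.items():
    print(f"P(b1-{cls}) = {weight:.1f}")
```

The weights sum to 1 by construction, matching the assumption that the instance always interacts with some other component class.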
Anomaly Detection Comparison of current behavior with the trained behavior: use the Chi-Square test Prepare the observed data as a histogram Compare distributions using the formula: $D = \sum_{i=1}^{n} (o_i - e_i)^2 / e_i$ n: number of cells in the histogram e_i: expected frequency in cell i o_i: observed frequency in cell i If e_i is 0, the cell should be pruned off Each link is regarded as a cell For an observation period of m requests, the expected frequency for link i is e_i = m * p_i No anomaly: D = 0 ideally. In practice, D is not 0 because of randomness; it follows a chi-square distribution
Anomaly Detection: Chi-Square Test Anomaly detected: D > the (1-α) quantile of the chi-square distribution with k = n-1 degrees of freedom, at level of significance α Higher α => more sensitive => more false positives Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true http://www.merriam-webster.com/dictionary/level%20of%20significance The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena. Rejecting the null hypothesis means concluding that there are grounds for believing that there is a relationship between the two phenomena
Anomaly Detection: Chi-Square Test: Example Observation period: 100 requests A issued 45 requests on b1 b1 issued 35 invocations on C, and 20 invocations on D Link(A-b1): expected: 100*0.4=40, observed 45 Link(C-b1): expected: 100*0.3=30, observed 35 Link(D-b1): expected: 100*0.3=30, observed 20 D = (45-40)^2/40 + (35-30)^2/30 + (20-30)^2/30 = 4.79 Chi-square test: degree of freedom is 2 (only 3 cells), for α=0.1, the 90% quantile is 4.6 => anomaly detected
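The worked example above can be reproduced with a short sketch. The function names are assumptions for illustration. To stay within the standard library, the threshold uses the closed-form quantile that holds only for 2 degrees of freedom, where the chi-square CDF is 1 - exp(-x/2); other degrees of freedom would need a statistics library or a table.

```python
import math

def chi_square_statistic(observed, weights):
    """D = sum over links of (o_i - e_i)^2 / e_i, with e_i = m * p_i."""
    m = sum(observed.values())  # number of requests in the observation period
    d = 0.0
    for link, p in weights.items():
        expected = m * p
        if expected == 0:
            continue  # cells with zero expected frequency are pruned
        d += (observed.get(link, 0) - expected) ** 2 / expected
    return d

def chi_square_quantile_df2(alpha):
    """(1 - alpha) quantile of the chi-square distribution, valid for df = 2 only."""
    return -2.0 * math.log(alpha)

weights = {"A": 0.4, "C": 0.3, "D": 0.3}  # trained link weights for b1
observed = {"A": 45, "C": 35, "D": 20}    # 100 observed requests

D = chi_square_statistic(observed, weights)
threshold = chi_square_quantile_df2(0.1)
anomaly = D > threshold

print(f"D = {D:.2f}, threshold = {threshold:.2f}, anomaly = {anomaly}")
```

With these inputs D is about 4.79 and the threshold about 4.61, so the observation is flagged, in agreement with the slide.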
Path Shapes Modeling The shape of a runtime path is defined to be the ordered set of component classes A path shape is represented as a tree in which a node represents a component class The directional edge represents the causal relationship between two adjacent nodes
Path Shapes Modeling A probabilistic context-free grammar (PCFG) in Chomsky Normal Form (CNF) is used for path shape modeling A list of terminal symbols, Tk: the component classes in a path shape A list of nonterminal symbols, Ni, denoting the stages of the production rules N1: start symbol, often denoted as S $: the end of a rule All other nonterminal symbols are to be replaced by production rules A list of production rules, Ni -> zj, where zj is a list of terminals and nonterminals A list of probabilities Rij = P(Ni -> zj)
Path Shape Modeling: Example Path shape for 4 end-user requests 100% probability for the call to transit from A to B R1j: S -> A, p=1.0 R2j: A -> B, p=1.0
Path Shape Modeling: Example For B, 3 possible transitions: to C with 25% probability, to D with 25%, and to both C and D with 50% R3j: B -> C, p=0.25 | B -> D, p=0.25 | B -> CD, p=0.5 Once a call reaches C or D, it must transit to E, hence: R4j: C -> E, p=1.0 R5j: D -> E, p=1.0 E is the last stop for all R6j: E -> $, p=1.0
Path Shape Modeling: Anomaly Detection The path shape of a new request can be checked to see whether it conforms to the grammar An anomaly is detected if a path shape does not conform to the grammar The PCFG itself only detects faults; it does not pinpoint the root cause (fault localization) Another method, such as a decision tree, is needed for localization
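A conformance check against the example grammar might look like the sketch below. The tree encoding (a pair of class name and child list), the dictionary layout, and the function name `shape_probability` are assumptions for illustration; the rules and probabilities are those of the example. A path shape with probability 0 does not conform and would be flagged as an anomaly.

```python
# Production rules of the example grammar: nonterminal -> {right-hand side: probability}.
GRAMMAR = {
    "S": {("A",): 1.0},
    "A": {("B",): 1.0},
    "B": {("C",): 0.25, ("D",): 0.25, ("C", "D"): 0.5},
    "C": {("E",): 1.0},
    "D": {("E",): 1.0},
    "E": {("$",): 1.0},  # "$" marks the end of a rule
}

def shape_probability(node, grammar=GRAMMAR):
    """Return the probability of a path-shape tree, or 0.0 if it violates the grammar."""
    label, children = node
    # A node without children must match the end-of-rule production.
    rhs = tuple(child[0] for child in children) or ("$",)
    prob = grammar.get(label, {}).get(rhs, 0.0)
    for child in children:
        if prob == 0.0:
            break
        prob *= shape_probability(child, grammar)
    return prob

# Request that visits both C and D after B.
normal = ("S", [("A", [("B", [("C", [("E", [])]), ("D", [("E", [])])])])])
# Hypothetical faulty request in which A calls C directly, skipping B.
skipped_b = ("S", [("A", [("C", [("E", [])])])])

print(shape_probability(normal))     # conforms to the grammar
print(shape_probability(skipped_b))  # no rule A -> C, so it does not conform
```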
Microreboot Microreboot: many problems can be fixed by simply restarting the faulty component Works best with component-based systems System design guidelines Component-based: such as Java EE, with EJB Separating application logic execution and state management Reboot should not cause state loss Loose coupling: to enable localized microreboot Reduce dependency among components: either self-contained, or interaction with other components should be mediated (e.g., via the Java EE container) Key: any instance of the referenced component should be able to get the job done => when one instance undergoes a microreboot, another instance can provide the same service Resilient inter-component interactions Lease-based resource management
Microreboot Automatic recovery with microreboot The system is equipped with a fault monitor and a recovery manager The fault monitor implements some of the fault detection and localization algorithms described previously The recovery manager is responsible for recovering the system from the fault recursively: it first microreboots the identified faulty component; if the symptom does not disappear, it microreboots a group of components according to a fault-dependency graph. If microrebooting does not work, the entire system is rebooted. The last resort is to notify a human operator
Microreboot Fault-dependency graph (f-map): consists of components as nodes and the fault-propagation paths as edges Can be obtained using automatic failure-path inference (AFPI) AFPI Constructed by observing the system's behavior when faults are injected The f-map is then refined during normal operation Cycles in the f-map: nodes in a cycle are grouped as a single node; the entire group will be microrebooted as a single unit; f-map => r-map
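Grouping the nodes of each cycle into a single node amounts to collapsing strongly connected components. The sketch below is one simple way to do that, not necessarily how AFPI does it; the adjacency-list layout, the function names, and the example f-map are all hypothetical. It uses plain reachability, which is quadratic but easy to follow.

```python
def reachable(graph, start):
    """Return every node reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def collapse_cycles(fmap):
    """Build an r-map whose nodes are groups of mutually reachable f-map nodes."""
    nodes = set(fmap) | {d for targets in fmap.values() for d in targets}
    reach = {n: reachable(fmap, n) for n in nodes}
    # Two nodes belong to the same group when each can reach the other.
    group_of = {
        n: frozenset({n} | {m for m in reach[n] if n in reach[m]}) for n in nodes
    }
    rmap = {}
    for n in nodes:
        src = group_of[n]
        rmap.setdefault(src, set())
        for d in fmap.get(n, ()):
            if group_of[d] != src:
                rmap[src].add(group_of[d])  # keep only edges between groups
    return rmap

# Hypothetical f-map in which B and C propagate faults to each other.
fmap = {"A": ["B"], "B": ["C"], "C": ["B", "D"]}
rmap = collapse_cycles(fmap)
for group, targets in rmap.items():
    print(sorted(group), "->", [sorted(t) for t in targets])
```

In the resulting r-map, B and C form one node, so a fault reported in either would microreboot both together.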
Microreboot Automatic recovery with microreboot Reboot both reported faulty component and all components that are immediately downstream from the component If faulty symptom persists, the upstream component in the r-map is also microrebooted Recovery is carried out recursively until entire system is rebooted
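The escalation policy above can be sketched as a generator that yields successively larger sets of components to microreboot. This is a simplified reading of the slide, not the actual recovery manager: the r-map layout (component mapped to its downstream components), the function name `recovery_steps`, and the example graph are assumptions. A caller would reboot each set in turn and stop as soon as the fault monitor reports that the symptom is gone.

```python
def recovery_steps(rmap, faulty):
    """Yield growing sets of components to microreboot, ending with the whole system."""
    all_nodes = set(rmap) | {d for targets in rmap.values() for d in targets}
    # Step 1: the reported component plus everything immediately downstream of it.
    group = {faulty} | set(rmap.get(faulty, ()))
    yield set(group)
    frontier = {faulty}
    while group != all_nodes:
        # Next step: components immediately upstream of the previous frontier.
        upstream = {u for u, targets in rmap.items() if frontier & set(targets)} - group
        if not upstream:
            break
        group |= upstream
        frontier = upstream
        yield set(group)
    if group != all_nodes:
        yield set(all_nodes)  # last automatic step: reboot the entire system

# Hypothetical r-map: each component maps to its immediate downstream components.
rmap = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
steps = list(recovery_steps(rmap, "B"))
for i, step in enumerate(steps, start=1):
    print(f"step {i}: microreboot {sorted(step)}")
```

If every step fails to clear the symptom, the remaining option described in the lecture is to notify a human operator.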
Microreboot Implications of microreboot Microreboot faulty components before node-level failure Tolerating more false positives Proactive microreboot for software rejuvenation Enhance fault transparency for end-users
Overcoming Operator Errors System dependability is significantly reduced because of human errors Checkpointing and logging useful but not sufficient Operating system level State repair and selective replay System-level undo (rewind), repair, system-level redo (replay)
Exercise 1. Identify the set of most recent checkpoints that can be used to recover the system shown here after the crash of P1 10/23/2019 EEC693: Secure and Dependable Computing EEC688: Secure & Dependable Computing Wenbing Zhao
Exercise 2. The Chandy and Lamport distributed snapshot protocol is used to produce a consistent global state of the system shown below. Draw all control msgs sent in the CL protocol, the checkpoints taken at P1 and P2, and specify the channel state for the P0 to/from P1 channels, the P1 to/from P2 channels, and the P2 to/from P0 channels
Exercise 3: Prove that the Chandy and Lamport Distributed Snapshot Protocol produces consistent checkpoints of the system.
Exercise 4: The following interactions occurred at instance b1 during a training period, for a total of 1200 interactions. The remote invocations on b1 by A and the local method invocations on C, D, E, and F number 300, 200, 300, 200, and 200, respectively. If the observed remote invocations on b1 by A and local method invocations on C, D, E, and F are 35, 25, 20, 15, and 25, determine whether an anomaly is present in the system.