240 likes | 254 Views
Learn about recovery-oriented computing, application-level fault detection, structural behavior monitoring, and system recovery models to improve system reliability and availability.
E N D
EEC 688/788Secure and Dependable Computing Lecture 8 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org
Outline Recovery oriented computing Overview Application level fault detection Structural behavior monitoring Path shape analysis EEC688/788: Secure & Dependable Computing
Chandy and Lamport Distributed Snapshot Protocol • Prove that the Chandy and Lamport Distributed Snapshot Protocol produces consistent checkpoints of the system.
Recovery-Oriented Computing On availability of soft realtime systems Availability = MTTF/(MTTF+MTTR) MTTF: mean time to failure MTTR: mean time to recover Availability can be improved by increasing MTTF as well as reducing MTTR Recovery-oriented computing: focusing on reducing MTTR Making fault detection faster and more accurate Making recovery faster
Fault Detection and Localization Fault detection: determine if some component in the system has failed Fault localization: pinpoint the particular component that failed Low-level fault detection mechanism Based on timeout, probing each component periodically with a heartbeat message Cannot detect many application-level faults Recovery-oriented computing: focusing on application-level fault detection and localization 75% of the recovery time is spent on application-level fault detection
Microreboot and System-Level Undo/Redo Microreboot: many problems can be fixed by simply restarting the faulty component Works best with component-based systems For problems cannot be fixed by microreboot, performs system-level undo, fixed the problem, then carries out system-level redo Based on checkpointing and logging
System Model for Recovery-Oriented Computing Three-tier architecture Separating application logic and data management Middle-tier is stateless or maintains only session state Component-based middleware Java Platform, Enterprise Edition (Java EE often referred to as J2EE) Key component: Enterprise Java Bean (EJB)
Application-Level Fault Detection Fail-stop faults can be detected using timeouts Application-level faults can only be detected in the application level One plausible fault detection method: acceptance test Developer would have to develop effective and efficient acceptance test routines Not practical for Internet apps due to their scale, complexity and rapid rate of changes ROC-based approach: measure and monitor structural behaviors of an app May detect app-level faults without a priori knowledge of the app details
Structural Behavior Monitoring Interaction patterns between different components reflect the app-level functionality Each component implements a specific app function, e.g., Stateful session bean to manage a user’s shopping cart A set of singleton session beans to keep track of inventory The internal structural behavior can be monitored to infer whether or not the app is functioning normally To monitor Log runtime path for each end-user request, including all incoming msgs, outgoing msgs, method invocations, etc.
Structural Behavior: Runtime Path Example Runtime path for a single end-user request Span 5 components Consist of 10 events
Structural Behavior: Machine Learning Train reference models using machine learning Historical reference model: training with aggregated runtime path data Objective: anomaly detection based on historical behavior May use real workload as well as synthetic workload that resembles real workload Peer reference model: train with most recent runtime path data Objective: anomaly detection with respect to the peer components Must train with real workload Fault (anomaly) detection: comparing observed patterns with those in the reference models
Component Interactions Modeling Focus on interactions between a component instance and all other component classes More scalable: can cope with cases when there are many instances of each class Suitable for using the Chi-square test for anomaly detection
Component Interactions Modeling Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and all the other n-1 component classes We assume instances of the same class do not interact with each other We assume that interactions are symmetric (i.e., request and reply) Weight assigned to each link is the probability of the component instance interacting with the linked component class The sum of the weight on all links is 1, i.e., the component instance has probability of 1 to interact with other component classes
Component Interaction Model: Example Class A: web component, handles end-user requests Class B: app logic, handles conversations with end-users, 3 instances Class C and Class D: also app logic, representing shared state Class E: database server, persistent state
Component Interaction Model: Example Machine learning: determine link weight based on training data Training data A issued 400 remote invocations on b1 b1 issued 300 local method invocations on C, and 300 invocations on D Not important what happened between C & E, D & E Link weight calculation Total number interactions occurred at b1 instance: 1000 P(b1-A) = 400/1000 = 0.4 P(b1-C) = 300/1000 = 0.3 P(b1-D) = 300/1000 = 0.3
Anomaly Detection Comparison of current behavior with the trained behavior: use Chi-Square test Prepare the observed data as a histogram Compare distribution using formula: n: number of cells in the histogram ei: expected frequency in cell i oi: observed frequency in cell i If ei is 0, the cell should be pruned off Each link is regarded as a cell For observation period of m requests, expected frequency for link i: ei = m * pi No anomaly: D = 0 ideally. In practice, D is not 0 due to randomness, it follows a chi-square distribution
Anomaly Detection: Chi-Square Test Anomaly detected: D > the 1-a quantile of the chi-square distribution with freedom of degree of k=n-1 at a level of significance a Higher level of a => more sensitive => more false positive Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true http://www.merriam-webster.com/dictionary/level%20of%20significance
Anomaly Detection: Chi-Square Test: Example Observation period: 100 requests A issued 45 requests on b1 b1 issued 35 invocations on C, and 20 invocations on D Link(A-b1): expected value is 100*0.4=40, observed 45 Link(C-b1): expected: 100*0.3=30, observed 35 Link(D-b1): expected: 100*0.3, observed 20 D=(45-40)2/40 + (35-30)2/30+(20-30)2/30 = 4.79 Chi-square test: degree of freedom is 2 (only 3 cells), for a=0.1, 90% quantile is 4.6 => anomaly detected
Path Shapes Modeling The shape of a runtime path is defined to be the ordered set of component classes A path shape is represented as a tree in which a node represents a component class The directional edge represents the causal relationship between two adjacent nodes
Path Shapes Modeling The probabilistic context-free grammar (PCFG) is used for path shape modeling (in Chomsky Normal Form, CNF) A list of terminal symbols, Tk, component classes in a path shape form Tk A list of nonterminal symbols, Ni Denote the stages of the production rules N1: start symbol, often denoted as S $: the end of a rule All other nonterminal symbols are to be replaced by production rules (see below) A list of production rules, Ni -> zj (a list of terminals and nonterminals) A list of probabilities Rij = P(Ni -> zj )
Path Shape Modeling: Example Path shape for 4 end-user requests 100% probability for the call to transit from A to B R1j: SA, p=1.0 R2j: AB, p=1.0
Path Shape Modeling: Example For B, 3 possible transitions: to C with 25%, to D with 25%, and to both C&D with 50 probability R3j: BC, p=0.25 | BD, p=0.25 | BCD, p=0.5 Once a call reaches C or D, it must transit to E, hence: R4j: CE, p=1.0 R5j: DE, p=1.0 E is the last stop for all R6j: E$, p=1.0
Path Shape Modeling: Anomaly Detection The path shape of new requests can be judged to see if they confirm to the grammar An anomaly is detected if a path shape does not conform to the grammar PCFG itself only detect fault, but not pinpoint root cause (localization of fault) Need to use other method, such as decision tree
Exercise 1: The following are the interactions that occurred in a system at instance b1 during a period, the total invocations on b1 at an instance are 1200. The remote invocation on b1 by A, the local method invocation by C, D, E and F are 300,200,300,200 and 200. If remote invocations on A by b1, the local method invocations on C, D, E and F observed are 35, 25,20,15, and 25 then find if anomalies are present in the system?