Some thoughts for the industry session. Prof. Kishor S. Trivedi, Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0291. Phone: (919) 660-5269. E-mail: kst@ee.duke.edu. At present: Visiting Professor, CSE Dept., IIT Kanpur. Cochin Conference, Dec 18, 2002.
What does industry want? • Well-trained students • Short-term research problems solved • Short courses on timely topics
What do faculty want? • Funding for "their" research • Placement of their students in good company labs • Hope of getting their research results transferred to industry • To get to know important and difficult problems that can drive their research
Some lessons learned • Student placement should be guided by the advisor • Start early with a summer internship • Patience is needed in listening to problems from industry • Patience is needed in getting the IP problems resolved • Expect to do at least 50% more work than the funding provides • Tech transfer is a double-edged sword • Practical problems can give rise to respectable research papers • Short courses are ideal entry points
Characteristics of the Systems being Studied Dependability (Reliability, Availability, Safety): • Redundancy: Hardware (Static, Dynamic), Information, Time • Fault Types: Permanent, Intermittent, Transient, Design • Fault Detection, Automated Reconfiguration • Imperfect Coverage • Maintenance: Scheduled, Unscheduled
Characteristics of the Systems being Studied • Performance: • Resource Contention, Concurrency and Synchronization • Timeliness (Have to Meet Deadlines) • Composite Performance and Dependability: • Degradable Levels of Performance • Need Techniques and Tools that can Evaluate: • Systems with All the Characteristics Above • Explicitly Address Complexity
MEASURES TO BE EVALUATED • Dependability • Reliability: R(t), System MTTF • Availability: Steady-state, Transient, Interval • Safety "Does it work, and for how long?" • Performance • Throughput, Loss Probability, Response Time "Given that it works, how well does it work?"
MEASURES TO BE EVALUATED • Composite Performance and Dependability "How much work will be done (lost) in a given interval, including the effects of failure/repair/contention?" • Need Techniques and Tools That Can Evaluate Performance, Dependability and Their Combinations
PURPOSE OF EVALUATION • Understanding a System • Observation: in the Operational Environment or in a Controlled Environment • Reasoning: A Model is a Convenient Abstraction
PURPOSE OF EVALUATION • Predicting the Behavior of a System: Need a Model; Accuracy Depends on the Degree of Extrapolation • All Models are Wrong; Some Models are Useful • Prediction is fine as long as it is not about the future
Methods of Quantitative Evaluation • Measurement-Based: Most believable, most expensive; Not always possible or cost-effective during system design
Methods of Quantitative Evaluation (Continued) • Model-Based: Less believable, less expensive 1. Discrete-Event Simulation vs. Analytic 2. State-Space Methods vs. Non-State-Space Methods 3. Hybrid: Simulation + Analytic (SPNP) 4. State-Space + Non-State-Space (SHARPE)
Why MODEL? • Provides a framework for gathering, organizing, understanding and evaluating information about a system, e.g., Zitel, US&S, HP • A cost-effective means to evaluate a system, e.g., Boeing, US&S, HP, IBM, Motorola, Cisco, SUN
Why MODEL? (continued) • Provides a means of evaluating a set of alternatives in a structured and quantitative manner, e.g., Zitel, DEC, HP • Sometimes needed due to legal and contractual obligations, e.g., FAA • Sometimes needed for business reasons: Motorola, SUN, Cisco
Compare two CLIENT-SERVER Architectures [Figure: diagrams of Architecture 1 and Architecture 2]
Compare Connection Reliabilities • Connection reliability R(t) is the probability that throughout the interval [0,t) at least one path exists from the client to the server on which all components are operational. • From R(t), the system mean time to failure can be computed: MTTF = ∫₀^∞ R(t) dt
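The relation MTTF = ∫₀^∞ R(t) dt can be checked with a short numerical sketch. The scenario below assumes a single client-to-server path whose components have exponential lifetimes; the component failure rates are made-up numbers, not taken from any case study:

```python
import math

def connection_reliability(t, rates):
    """R(t) for a single path: the connection survives to time t only if
    every component on the path does (exponential lifetimes assumed)."""
    return math.exp(-sum(rates) * t)

def mttf(reliability, horizon=10_000.0, steps=100_000):
    """System MTTF = integral of R(t) over [0, infinity), truncated at
    `horizon` and evaluated with the trapezoid rule."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

# Hypothetical per-hour failure rates for client NIC, link, server NIC:
rates = [1e-3, 5e-4, 1e-3]
est = mttf(lambda t: connection_reliability(t, rates))
# For a series system the exact value is 1 / sum(rates) = 400 hours.
```

For a series path the integral has the closed form 1/Σλᵢ, which the numerical estimate reproduces; for redundant paths R(t) is no longer a single exponential and the numerical integral is the practical route.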
Compare Connection Availabilities • Connection (instantaneous, transient or point) availability A(t) is the probability that at time t at least one path exists from the client to the server on which all components are operational. • A(t) ≥ R(t), and the limiting or steady-state availability is A = lim t→∞ A(t).
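For a single repairable component the point and steady-state availabilities have well-known closed forms, which make the relations above concrete. The failure and repair rates below are illustrative:

```python
import math

def point_availability(t, lam, mu):
    """A(t) for one repairable component (failure rate lam, repair rate mu):
    the standard two-state CTMC closed form."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def steady_state_availability(lam, mu):
    """Limiting availability A = lim A(t) = MTTF / (MTTF + MTTR)."""
    return mu / (lam + mu)

# Illustrative numbers: MTTF = 1000 h, MTTR = 4 h.
lam, mu = 1 / 1000, 1 / 4
A_inf = steady_state_availability(lam, mu)   # 1000/1004, about 0.99602
# A(t) >= R(t) = exp(-lam * t): repair can only help.
```

Note that A(t) starts at 1, decays toward A, and always dominates R(t), since a repaired component counts as available even after its first failure.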
MODELING THROUGHOUT SYSTEM LIFECYCLE • System Specification/Design Phase: Answer "What-if" Questions • Compare design alternatives (Zitel, HP, Motorola) • Performance-Dependability Trade-offs (DEC) • Design Optimization (wireless handoff)
MODELING THROUGHOUT SYSTEM LIFECYCLE • Design Verification Phase: Use Measurements + Models, e.g., Fault Injection + Reliability Model (Union Switch & Signal, Boeing, Draper) • Configuration Selection Phase: DEC • System Operational Phase: Lucent • It is fun!
CASE STUDY: ZITEL • Comparison of two different fault-tolerant RAM disks. • The Stochastic Petri Net Package (SPNP) was used to model the reliability of the two systems.
CASE STUDY: ZITEL • Trivedi worked with the designers directly: • Model validation was done using face validation and sanity checks. • Parameterization was easy due to the experience of the designers. • One difficult research problem originated from the study; it was subsequently solved and published in the Microelectronics and Reliability journal.
CASE STUDY: VAXCLUSTER • Developed three models of the Processor Subsystem: • Two-Level Decomposition (IEEE-TR, Apr 89): Inner Level: 9-state Markov chain; Outer Level: n parallel diodes • A Detailed SPN Model (PNPM 89) • A Detailed SPN Model for a Heterogeneous Cluster (Avresky book)
CASE STUDY: VAXCLUSTER • Storage Subsystem Model: A fixed-point iteration over a set of Markov submodels. (IEEE-TR, to appear) • Observed that availability is maximized with 2 processors (HCSS 90) • Many interesting reliability, availability, performability measures computed
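A fixed-point iteration over submodels, of the kind used for the storage subsystem, can be sketched in a few lines. The coupling below (a shared repair facility that slows down as more disks are expected to be down) and all rates are illustrative, not the published VAXcluster model:

```python
def two_state_unavailability(lam, mu_eff):
    """Submodel: P(disk down) for failure rate lam and effective repair
    rate mu_eff, from the two-state repairable-component model."""
    return lam / (lam + mu_eff)

def storage_fixed_point(lam=0.001, mu=0.25, n_disks=8, tol=1e-12):
    """Iterate between submodels: each disk's effective repair rate
    depends on the expected number of other failed disks contending for
    the single repair facility (an assumed, illustrative coupling)."""
    u = 0.0  # initial guess for per-disk unavailability
    while True:
        mu_eff = mu / (1.0 + (n_disks - 1) * u)  # repair shared among failures
        u_new = two_state_unavailability(lam, mu_eff)
        if abs(u_new - u) < tol:
            return u_new
        u = u_new

u_star = storage_fixed_point()
```

The iteration converges in a handful of steps here; in general, existence and uniqueness of the fixed point have to be argued for each model.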
Case Study: HP • Cluster Availability Modeling • Server Availability • Mass Storage Arrays Availability Modeling • Started with Markov chains via SHARPE • Progressed toward Stochastic Petri Nets and Stochastic Reward nets via SPNP
CASE STUDY: LUCENT • A Validated Model of Hardware-Software Availability. • Worked with V. Mendiratta of Naperville. • The model is semi-Markov; solved using SHARPE. • Parameters collected from field data. • Model results validated against actual measurements.
CASE STUDY: LUCENT, IBM, MOTOROLA, SUN • Software Rejuvenation: • A technique to counter software "aging" and increase its availability to clients. • Evaluated the optimum rejuvenation interval which maximizes steady-state availability (minimizes expected cost) for the IBM cluster and the Motorola CMTS cluster. • Collected data from real systems to show aging and to determine proactive fault management strategies; work done in our lab and with SUN Microsystems.
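The trade-off behind an optimum rejuvenation interval can be sketched numerically: rejuvenating too often wastes planned downtime, too rarely risks the longer downtime of an aging-induced crash. A Weibull lifetime with shape > 1 stands in for aging; every parameter value here is made up for illustration, none comes from the IBM or Motorola studies:

```python
import math

def availability(T, shape=2.0, scale=1000.0, d_fail=4.0, d_rejuv=0.25):
    """Steady-state availability when the software is rejuvenated every
    T hours. A renewal cycle ends either in a crash (downtime d_fail)
    or in a planned rejuvenation (downtime d_rejuv)."""
    R = lambda t: math.exp(-((t / scale) ** shape))  # Weibull survival
    steps = 2000
    h = T / steps
    # Expected uptime per cycle = integral of R(t) over [0, T] (trapezoid rule).
    uptime = h * (0.5 * (R(0.0) + R(T)) + sum(R(k * h) for k in range(1, steps)))
    downtime = (1.0 - R(T)) * d_fail + R(T) * d_rejuv
    return uptime / (uptime + downtime)

# Grid search for the interval that maximizes availability:
grid = [50.0 * k for k in range(1, 41)]   # candidate intervals, 50 h .. 2000 h
T_opt = max(grid, key=availability)
```

For these numbers the optimum is interior: both very frequent and very rare rejuvenation yield lower availability than the best interval.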
CASE STUDY: MOTOROLA • Availability & Performability Modeling: • Modeled several configurations of the Communication Enterprise Common Platform. • Practical approaches for approximating steady-state measures in large, repairable, and highly dependable systems: model decomposition, state-space truncation, etc. • Both SHARPE and SPNP were used.
CASE STUDY: MOTOROLA • Recovery strategies in wireless handoff: • Proposed and modeled several strategies • A patent is being filed by Motorola • SPNP was used • A hierarchy of two-level models was used • Fixed-point iteration was used
CASE STUDY: BELLCORE • Architecture-based software reliability: • Proposed a methodology • Applied the methodology to SHARPE • Used Bellcore's test coverage tool, ATAC, to parameterize the model • Bellcore is currently enhancing ATAC to incorporate our methodology
CASE STUDY: DRAPER LAB • The overall aim was verification of systems with very high reliability/availability specifications. The prototype under consideration was the FTPP Cluster 3. • Hybrid approach proposed: • Fault-injection-based measurements. • Statistical analysis of measured data to enable parameterization of analytical models.
CASE STUDY: DRAPER LAB • Reliability modeling of the prototype was done; parameterization used existing reliability databases. • Analytical solution provided exact closed-form expressions • Markov model solved using SHARPE • Petri net model solved using SPNP • Reliability bottlenecks were found
CASE STUDY: AT&T • GSHARPE: • A preprocessor to SHARPE, developed at Bell Labs by a Duke student. • The user can specify Weibull failure times and lognormal and other repair-time distributions. • GSHARPE fits these to phase-type distributions and generates a Markov model for processing by SHARPE.
CASE STUDY: BOEING • An Integrated Reliability Environment • A working prototype • Developed a high-level modeling language (SDM) • Designed and implemented an intelligent interpreter
CASE STUDY: BOEING (Continued) • The interpreter determines which solution method is applicable • Five different modeling engines are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP.
QUANTITATIVE EVALUATION TAXONOMY • Closed-form solution • Numerical solution using a tool
ANALYTIC MODELING TAXONOMY • Non-State-Space Modeling Techniques: • Product-form queueing models • Series-parallel (SP) reliability block diagrams • Non-SP reliability block diagrams
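Series-parallel reliability block diagrams are solved by composing two one-line formulas, which is why they avoid state-space generation altogether. A minimal sketch; the client-server topology and component reliabilities are made-up numbers:

```python
def series(*blocks):
    """Series structure: the system works only if every block works."""
    r = 1.0
    for b in blocks:
        r *= b
    return r

def parallel(*blocks):
    """Parallel structure: the system works if at least one block works
    (complement of all blocks failing, assuming independence)."""
    q = 1.0
    for b in blocks:
        q *= 1.0 - b
    return 1.0 - q

# Client NIC in series with two redundant links in parallel, then the
# server NIC (illustrative reliabilities):
r_sys = series(0.99, parallel(0.95, 0.95), 0.98)
```

The independence assumption is what keeps this a non-state-space method; shared repair or correlated failures push the model into the state-space column of the taxonomy.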
State-Space Modeling Taxonomy • Markovian modeling: discrete-time Markov chains, continuous-time Markov chains, Markov reward models • Non-Markovian modeling: semi-Markov models, Markov regenerative models, non-homogeneous Markov models
State-Space Based Models • Transition label: • Probability: (homogeneous) discrete-time Markov chain (DTMC) • Time-independent Rate: homogeneous continuous-time Markov chain (CTMC) • Time-dependent Rate: non-homogeneous continuous-time Markov chain • Distribution function: semi-Markov process • Two Distribution Functions: Markov regenerative process
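For the homogeneous CTMC case, the steady-state probabilities solve πQ = 0 with Σπᵢ = 1. A minimal pure-Python sketch; the 2-processor availability model and its rates are assumed for illustration:

```python
def ctmc_steady_state(Q):
    """Solve pi Q = 0, sum(pi) = 1, by transposing Q, replacing one
    (redundant) balance equation with the normalization constraint, and
    running Gaussian elimination with partial pivoting."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]  # Q transposed
    A[-1] = [1.0] * n                                    # normalization row
    b = [0.0] * (n - 1) + [1.0]
    for c in range(n):                                   # forward elimination
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    x = [0.0] * n                                        # back substitution
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / A[r][r]
    return x

# States = number of processors up (2, 1, 0); per-processor failure rate
# lam, single shared repair facility with rate mu (illustrative numbers):
lam, mu = 0.001, 0.1
Q = [
    [-2 * lam, 2 * lam, 0.0],
    [mu, -(mu + lam), lam],
    [0.0, mu, -mu],
]
pi = ctmc_steady_state(Q)
system_availability = pi[0] + pi[1]  # at least one processor up
```

For this birth-death structure the answer also follows by hand (π₁ = 2λ/μ·π₀, π₂ = λ/μ·π₁), which makes a convenient sanity check on the solver.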
IN ORDER TO FULFILL OUR GOALS OF • Modeling Performance, Dependability and Performability • Modeling Complex Systems We Need • Automatic Generation and Solution of Large Markov Reward Models
IN ORDER TO FULFILL OUR GOALS OF • Facility for State Truncation, Hierarchical Composition of Non-State-Space and State-Space Models, Fixed-Point Iteration • There are Two Tools that Potentially Meet these Goals: • Stochastic Petri Net Package (SPNP) • Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE)
MODELING SOFTWARE PACKAGES • HARP - Hybrid Automated Reliability Predictor (Duke Univ, funded by NASA Langley) • SAVE - System Availability Estimator (Duke Univ. funded by IBM) • SHARPE - Symbolic Hierarchical Automated Reliability and Performance Evaluator; installed at nearly 280 locations (GUI available) • SPNP - Stochastic Petri Net Package installed at nearly 120 locations (iSPN - GUI available) • D_RAMP for Union Switch and Signals by Duke, UVA and CMU • SDM - Boeing Integrated Reliability Modeling Environment (Jointly developed by Duke Univ., Univ. of Wash. and Boeing) • SDDS - Developed by Sohar with the help from K. Trivedi • SREPT - Software Reliability Estimation and Prediction Tool
Challenges in Modeling
COMPLEXITIES OF MODELS • Large State Space: model construction problem, model solution problem • Model Stiffness: fast and slow rates acting together, e.g., failure and recovery/repair, performance and failure
COMPLEXITIES OF MODELS • Modeling Non-Exponential Distributions • Combining performance and reliability • Believability/Understandability/Usability • Incorporation in the design process • Connection between measurements & models: • Parameterization • Validation
LARGENESS TOLERANCE • Automated Model Construction • Stochastic Petri nets (GreatSPN, SPNP, SHARPE, DSPNexpress, ULTRASAN) • High level languages (SAVE, QNAP, ASSIST, SDM) • Fault-Tree + Recovery Info (HARP) • Object-Oriented Approaches (TANGRAM) • Loops in the specification of CTMC (SHARPE)
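Automated model construction from a stochastic Petri net amounts to exploring the reachable markings and recording the rate-labelled arcs between them, which yields the underlying CTMC. A minimal sketch; the 2-processor net and its rates are illustrative, and for brevity the firing rates here are constant rather than marking-dependent as real SPN tools allow:

```python
from collections import deque

def reachability_graph(m0, transitions):
    """Breadth-first generation of the reachability graph of a stochastic
    Petri net. A marking is a tuple of token counts per place; each
    transition is (inputs, outputs, rate) mapping place index -> tokens."""
    states = {m0: 0}          # marking -> state index
    arcs = []                 # (from_state, to_state, rate)
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for inputs, outputs, rate in transitions:
            if all(m[p] >= k for p, k in inputs.items()):   # enabled?
                m2 = list(m)
                for p, k in inputs.items():
                    m2[p] -= k
                for p, k in outputs.items():
                    m2[p] += k
                m2 = tuple(m2)
                if m2 not in states:
                    states[m2] = len(states)
                    queue.append(m2)
                arcs.append((states[m], states[m2], rate))
    return states, arcs

# Places: (up, down); two processors, a fail transition and a single
# shared repair transition (made-up rates):
states, arcs = reachability_graph((2, 0), [
    ({0: 1}, {1: 1}, 0.001),   # fail: move one token up -> down
    ({1: 1}, {0: 1}, 0.1),     # repair: move one token down -> up
])
```

The three generated markings and four arcs are exactly the 3-state repairable CTMC of the previous example; on realistic nets this automation is what makes million-state models constructible at all.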
LARGENESS TOLERANCE • Efficient numerical solution techniques • Sparse Storage • Accurate and Efficient Solution Methods: We have generated and solved models with 1,000,000 states (this number has gone up considerably recently) • Steady-State: near-optimal SOR • Transient: modified Jensen's method
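Jensen's method (uniformization) for transient CTMC solution can be sketched compactly: π(t) = Σₖ Pois(k; qt) · π(0)Pᵏ with P = I + Q/q and q at least the largest exit rate. The dense-matrix version below is a teaching sketch (real solvers use sparse storage); the 2-state example rates are made up:

```python
import math

def uniformize(Q, pi0, t, eps=1e-10):
    """Transient probabilities pi(t) by uniformization (Jensen's method).
    Truncates the Poisson series once its weights sum to 1 - eps."""
    n = len(Q)
    q = max(abs(Q[i][i]) for i in range(n)) * 1.02   # uniformization rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]                          # DTMC: P = I + Q/q
    v = list(pi0)                                    # holds pi0 * P^k
    w = math.exp(-q * t)                             # Poisson weight, k = 0
    result = [w * x for x in v]
    k, acc = 0, w
    while acc < 1.0 - eps:
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        k += 1
        w *= q * t / k
        acc += w
        result = [r + w * x for r, x in zip(result, v)]
    return result

# Two-state repairable component, starting up (illustrative rates):
lam, mu = 0.5, 0.2
Q = [[-lam, lam], [mu, -mu]]
p3 = uniformize(Q, [1.0, 0.0], t=3.0)
# p3[0] matches the closed form mu/(lam+mu) + lam/(lam+mu)*exp(-(lam+mu)*3).
```

Unlike matrix exponentials computed via Taylor series, every term here is non-negative, which is why the method is numerically stable and why the truncation error is bounded by the missing Poisson mass eps.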