Advanced Data Processing Methods for Dependability Benchmarking and Log Analysis
BEHIND THE SCENE: A COLLECTION OF OBSERVATIONS NOT DESCRIBED IN OUR PAPERS
András Pataricza, pataric@mit.bme.hu
Trends in IT
• Evolution
• Environment
• Specification
• Technology
• Adaptivity
• Drivers:
• Run-time task allocation (virtualization)
• Extends to cyber-physical systems
• Functional: adaptive task ecosystem
• Context-sensitive, on-demand
• Optimization of resource use
• Computational
• Communication
• Energy
• Traditional pre-operation phase design moves to run-time
• Off-line assessment -> run-time control configuration parametrization + run-time assessment
• Assessment criteria
• Generalizable
• Reusable
• Parametrizable
• Covers all main aspects needed for operation decisions and control
• Information sources
• In vitro: benchmarking
• In vivo:
• field measurement
• Log analysis
• Reusability constraints
• Different levels of detail
• Anytime algorithms
• Incremental algorithms
• Complexity
• Go continuous?
Typical use case: control of infrastructure
Source: AMBER teaching material
Self-* Computing
• Controlled computing
• Autonomic
• Virtualization
• Cloud
• Self-* properties
• Emphasizes control loop
• Relation to
• control theory
• signal processing
Obstacle: we deal with networks of (practically) black boxes!
System Management as a Control Problem
Control theory applied to IT infrastructures
[Diagram: control loop. Controlled plant: a Service provided by a Software Component deployed on a Supervised Node. Sensors: Monitoring (collect and store data about the state of the infrastructure). Controller: Decision Making, guided by a Control Objective and a Control Policy (based on human expertise or automation). Actuator: Provisioning (effectuate changes in the infrastructure). Monitoring / Control Node.]
Performability Management
[Diagram: feedback loop. The QoS requirement sets an objective on a service metric (e.g. response time < 3 sec). Decision Making compares the measured metric with a reference value (e.g. 2.5 sec: „have some margin but do not overperform”). Provisioning reconfigures the service provider, i.e. the Software Component that provides the Service and what it is deployed on.]
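The loop on this slide can be sketched as a simple rule, assuming the only actuator is the node count of a load-balanced cluster. Only the 3 sec objective and the 2.5 sec reference come from the slide; the class name, `lower_band_s`, and the node limits are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PerformabilityController:
    objective_s: float = 3.0    # QoS objective from the slide: response time < 3 sec
    reference_s: float = 2.5    # reference value from the slide (keeps some margin)
    lower_band_s: float = 1.0   # hypothetical: below this the service "overperforms"
    min_nodes: int = 1          # hypothetical cluster limits
    max_nodes: int = 8

    def decide(self, measured_s: float, nodes: int) -> int:
        """Return the node count to provision for the next control period."""
        # Act at the reference value, before the objective itself is violated.
        if measured_s > self.reference_s and nodes < self.max_nodes:
            return nodes + 1
        # Release a node when there is far more margin than needed.
        if measured_s < self.lower_band_s and nodes > self.min_nodes:
            return nodes - 1
        return nodes


controller = PerformabilityController()
print(controller.decide(measured_s=2.8, nodes=2))  # above the reference: scale out
print(controller.decide(measured_s=2.0, nodes=2))  # inside the band: no change
```

Acting on the reference rather than on the objective is what gives the margin; the lower band is one possible reading of „do not overperform”.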
A Simple Performance Management Pattern
• A very common pattern
• Simplicity
• Platform support
• Control/rule design?
• that is practical
[Diagram: Load -> Service (load-balanced cluster) -> Response; spare pool or other service]
A Simple Performance Management Pattern: for the IT system management expert
[Diagram: Service on a load-balanced cluster; spare pool or other service]
A Simple Performance Management Pattern: for the control expert
[Diagram: Service on a load-balanced cluster; spare pool or other service]
Objective: Proactive Qualitative Performance Control
Predict state; decide on action
[Diagram: Service on a load-balanced cluster; spare pool or other service]
Empirical dependability characterization
[Diagram: Design time: modeling, analysis, testing; design decisions; validation. Operation: RT data acquisition and monitoring; op. decisions; validation. Services: Service (human), Service (J2EE), Service (Web), Service (e-mail), IMS transaction.]
Empirical dependability characterization
Design time: modeling, analysis, testing
• Challenges:
• Incompleteness
• Environment sensitivity
• Change tolerance
• CORE ISSUE:
• from instance assessment to prediction
• Assessment criteria
• Generalizable
• Reusable
• Parametrizable
• KNOWLEDGE EXTRACTION
Empirical dependability characterization
Operation: RT data acquisition and monitoring
• Some challenges:
• Threshold configuration
• Embedding diagnosis
• Embedding forecasting
• Over/under-monitoring
Examples – pilot components
Apache ~ Load balancer – UA (task)
Tomcat (application specific platform independent + implementation dep.)
Linux OS Agent (platform + task)
MySQL ~ VI Agent – add-on (platform + task)
Faults taken into account - side note
SRDS WS 2008. 5. 10.
• Source: qualitative dynamic modelling of Apache
Separate work: representativeness
HOW TO GENERALIZE MEASUREMENTS?
TPC-W Workload
• A standard benchmark for multi-tier systems
• Models an e-bookshop
• Customer behavioral models:
• 14 different web pages
• Varying load on the system
• 3 standard workload mixes
• Highly non-deterministic
• ABSOLUTELY INAPPROPRIATE AS A PLATFORM BENCHMARK
Representativeness
Synthetic/natural benchmarks
The Problem of Over-Instrumentation
• Overly complex rule set/model
• V&V?
• Maintenance?
• Control design?
• Only a few variables are significant w.r.t. a management goal
• „Control theory for IT” works do not tackle this
[Diagram: Services, the Software Components that provide them, and what those are deployed on, each annotated with many metrics]
WHAT TO MEASURE: Measurable ≠ To be measured
Variable Selection Problem
Design phase - measurement
• Objectives:
• Design time: all candidate control variables
• Runtime: few (selection)
• Stress the system (scalability) to reveal operation domains and dynamics
EDCC-5: Pintér G., Madeira H., Vieira M., Pataricza A. and Majzik I., "A Data Mining Approach to Identify Key Factors in Dependability Experiments"
Component Metrics Gathered
Database + IBM Tivoli Monitoring Agent
• Phenomenological service metrics:
• Average response time
• Failed SQL statements %
• Number of active sessions
• …
• „Causal” metrics:
• DB2 status
• Buffer pool hit ratio
• Average pool read/write time
• Average locks held
• Rows read/write rate
• …
• Phenomenological resource metrics:
• Average CPU usage
• Average disk I/O usage
• …
No. of database metrics: MySQL: 12, Oracle: 640, DB2: 880.
Qualitative State Definition for Prediction
• Coarse control: intuitively, interval aggregate defines state
• High freq. „jitter”: noise; lower level means
• Aggregation interval: match prediction horizon!
• Alternatively: explicitly filter out „noise”
• The same intent
• Presented:
• Amplitude filter
• Median filtering
[Figure: throughput data -> filtering -> classification, with band boundaries 0% - 25% - 45% - 100%]
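A minimal sketch of the filter-then-classify idea, assuming throughput samples expressed as percentages. The 25% and 45% boundaries are taken from the slide; the state labels, window size, aggregation interval, and function names are hypothetical, and the amplitude filter is not shown.

```python
from bisect import bisect_right
from statistics import median


def median_filter(samples, window=5):
    """Sliding median that suppresses short 'jitter'; the window shrinks at the edges."""
    half = window // 2
    return [median(samples[max(0, i - half): i + half + 1])
            for i in range(len(samples))]


def classify(value_pct, bounds=(25.0, 45.0), labels=("low", "medium", "high")):
    """Map a percentage to a qualitative state using the band boundaries."""
    return labels[bisect_right(bounds, value_pct)]


def qualitative_states(samples, interval=3, window=5):
    """Filter the samples, then derive one state per aggregation interval (mean)."""
    filtered = median_filter(samples, window)
    states = []
    for start in range(0, len(filtered) - interval + 1, interval):
        chunk = filtered[start:start + interval]
        states.append(classify(sum(chunk) / len(chunk)))
    return states


# A single spike (90) is removed before classification.
print(median_filter([10, 10, 90, 10, 10], window=3))
print(qualitative_states([10, 12, 11, 30, 32, 31, 80, 82, 81], interval=3, window=3))
```

In practice `interval` would be chosen to match the prediction horizon, as the slide recommends.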
Design phase – variable selection
• Objective:
• Control variables
• As few as possible, as much as needed
• mRMR (minimum Redundancy Maximum Relevance) feature selection
• Cf. AUTONOMICS 2009 paper
Variable Selection
[Diagram: full dataset (160+ metrics) -> filtering -> variable selection against the goal metric (algorithm: mRMR) -> 12 metrics chosen]
Simple statistics is insufficient – signal processing
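To illustrate the mRMR selection step, here is a small greedy sketch using the mutual-information difference criterion (relevance to the goal metric minus mean redundancy with already selected metrics). It assumes the metrics have already been discretized into qualitative values; the metric names and data are made up, and this is not the implementation used in the experiments.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def mrmr(features, target, k):
    """Greedily pick k feature names: maximize relevance minus mean redundancy."""
    relevance = {name: mutual_information(vals, target)
                 for name, vals in features.items()}
    selected, candidates = [], set(features)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical discretized metrics: "cpu_copy" duplicates "cpu" and is skipped.
metrics = {
    "cpu":      [0, 0, 1, 1, 0, 0, 1, 1],
    "cpu_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "db_locks": [0, 1, 0, 1, 0, 1, 0, 1],
}
goal = [0, 1, 2, 3, 0, 1, 2, 3]  # hypothetical goal-metric states
print(mrmr(metrics, goal, k=2))
```

The duplicate metric is as relevant as the original but fully redundant with it, which is exactly the case that relevance-only ranking cannot handle.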
Example Selected Metrics: Median Filtering
• First 7 of 12 („value” decreases)
• Tomcat load/CPU always in top 3 – bottleneck
• Cluster characterization: ongoing work
Operation phase – measurement
Decide on the system state based on the samples
1 Minute Prediction for Median Filtering
• Qualitative prediction accuracy: >90%
• (multiple runs; 4 hour validation set)
Operational domains?
• Normal operational state
• Internal relationships tend to be linear (with some „noise”)
• Saturation (over-loaded)
• Objective metrics behave linearly again
• Physical limits of the system
• Degrading state
• The point of interest!
• Seemingly non-linear behaviour
• mRMR metric selection better
• For the specific case
Minimum Mean Square Error
Should have been monotonically decreasing
Concluding remarks
• Assessment for predictive system management needs SIGNAL PROCESSING (at the moment more than control theory)
• Shannon’s law is in there?
• Asynchronous sampling problem
• Our experiment: design flaws
• TPC-W: closed loop
• Result: coupling of workload and transfer characteristics
• Too strong autocorrelation of client behavior
• Methodology still valid
• Introducing dependability: „easy”