The State-Space Approach to Self-Management of Enterprise Systems. Vibhore Kumar, Karsten Schwan, Subu Iyer*, Yuan Chen*, Akhil Sahai*. Georgia Institute of Technology; *Hewlett-Packard Labs
Outline • Motivation: Enterprise Complexity • Issues • Solution Overview • Policy-Driven Self-Management • Dynamic SLA Decomposition • Results • Future Work
Enterprise Complexity: Some Facts • From a survey conducted by Forrester Research • Enterprises now devote 80% of their overall IT budget to maintenance and ongoing operations • More than half of the 347 participating companies used at least 3 database vendors • A major banking-industry client had 18 different travel and expense systems in the organization • The existence of a title like “VP of IT Governance” says a lot about the state of enterprise IT infrastructure
The Complexity Wall “If we don’t get a handle on complexity, it will stop the expansion” - Paul Horn, Senior Vice President, IBM Research “Our enterprise customers are working with enormous complexity” - Dick Lampman, Former Director, HP Labs
The Complexity Wall @ Worldspan • Worldspan, one of our industry collaborators, provides services to the travel industry • One of their airline ticket pricing/availability services is hosted on a farm of 1400 servers • In 2006 alone, they processed around 9.6 billion messages • Highly varying request rates and request type mix • Several behaviors of their system are not well understood • Effects of Ticket Geography • Effects of Cache Refresh Time • Effects of Time of Day …
To Handle The Complexity… • One must enable self-management of complex enterprise infrastructures driven by high-level goals
Enterprise Self-Management: The Hurdles • Enterprise systems are too big: the problem of Scale • It is tough to relate high-level goals to low-level actions: the problem of Complex System Modeling • The operating environment is very dynamic: the problem of Dynamism • Administrators find it hard to trust black-box solutions: the problem of Trust & Tractability
Solution Overview: System State-Space • System state space V = (v1, v2, …, vn), built from the monitored system variables and monitored component variables of the enterprise system • Variables of Interest Vø ⊂ V, e.g. Response-Time, QoI • Controllable Variables Vα ⊂ V, e.g. Allocated-Servers, Memory • The aim is to establish a relation between Vø and Vα under current operating conditions
Simple Automated Operation • SLO: “Response Time < 10 msec” • Event: SLO Violation • Condition: Bandwidth = 90 Mbps, Request Rate = 30 • Action: set Allocated Servers to 3 • The relation used is a mapping Vα → Vø given V – (Vα ∪ Vø) • [Figure: example system states plotted over Request Rate, Response Time, Allocated Servers, and Bandwidth]
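The event-condition-action rule on this slide can be written down as a small data structure. The following is a minimal sketch, not the authors' implementation: the `Policy` class, its field names, and the state keys (`response_time_ms`, `bandwidth_mbps`, and so on) are hypothetical names chosen for illustration, and only the numbers come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, float]


@dataclass
class Policy:
    """Event-condition-action rule tied to one SLO (hypothetical structure)."""
    slo: Callable[[State], bool]        # returns True while the SLO holds
    condition: Callable[[State], bool]  # operating conditions the rule covers
    action: Dict[str, float]            # values assigned to controllable variables

    def apply(self, state: State) -> State:
        """Return the state with the action applied, or the state unchanged."""
        violated = not self.slo(state)  # event: SLO violation
        if violated and self.condition(state):
            return {**state, **self.action}
        return state


# Values taken from the slide; key names are assumptions.
policy = Policy(
    slo=lambda s: s["response_time_ms"] < 10,
    condition=lambda s: s["bandwidth_mbps"] == 90 and s["request_rate"] == 30,
    action={"allocated_servers": 3},
)

current = {"bandwidth_mbps": 90, "request_rate": 30,
           "response_time_ms": 12, "allocated_servers": 1}
after = policy.apply(current)
```

The rule only assigns the controllable variable; the resulting response time would have to be observed from the running system.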
Solution Overview: The Function • Learn from observed system states • But there are problems • Different behavior in different sub-spaces • Large state space, |V| ≈ 10² to 10³ • [Figure: observed system states (v1, v2, …, vn) fed to machine learning, with regions labeled CPU Bottleneck and Network Bottleneck]
Solution Overview: The Function • We decided to model the system using a set of multiple µ-models • We intelligently partition the set of observed system states • partitions exhibit homogeneous behavior • partitions have a reduced number of relevant variables • Partitioning & µ-Modeling solve two problems! • The problem of Scale • The problem of Complex System Modeling • [Figure: observed system states (v1, v2, …, vn) partitioned, showing the reduced number of relevant variables in a µ-model]
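The slide does not say how the partitions are chosen, so the sketch below only illustrates the idea under loud assumptions: states are grouped by a caller-supplied labeling function (here a hypothetical "which resource is the bottleneck" rule), and a variable counts as relevant to a partition if it actually varies inside it. Both choices, and all variable names, are illustrative and not taken from the original work.

```python
from collections import defaultdict
from statistics import pvariance


def partition_states(states, key):
    """Group observed states by the label that `key` assigns to each state."""
    parts = defaultdict(list)
    for state in states:
        parts[key(state)].append(state)
    return dict(parts)


def relevant_variables(partition, min_variance=1e-9):
    """Assumed relevance test: keep variables that vary within the partition."""
    names = partition[0].keys()
    return sorted(
        name for name in names
        if pvariance([state[name] for state in partition]) > min_variance
    )


# Made-up observations for illustration only.
states = [
    {"cpu_util": 0.95, "net_util": 0.20, "servers": 1, "response_ms": 14},
    {"cpu_util": 0.95, "net_util": 0.20, "servers": 2, "response_ms": 9},
    {"cpu_util": 0.30, "net_util": 0.90, "servers": 2, "response_ms": 15},
    {"cpu_util": 0.30, "net_util": 0.90, "servers": 3, "response_ms": 15},
]

parts = partition_states(
    states, key=lambda s: "cpu" if s["cpu_util"] > s["net_util"] else "network"
)
cpu_vars = relevant_variables(parts["cpu"])
net_vars = relevant_variables(parts["network"])
```

In this toy data the CPU-bound partition keeps `servers` and `response_ms`, while in the network-bound partition adding servers does not change the response time, so a model built for that partition needs fewer variables.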
Solution Overview: µ-Models • We use a Tree-Augmented Naïve Bayes (TAN) classifier to build µ-models • The model returns the probability γ = Pr(Vα | Vdesired) • Find the assignment of values to the variables in Vα that maximizes the probability of moving the system to the desired state
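To make the maximization step concrete, the sketch below replaces the TAN classifier with a much simpler stand-in: it estimates Pr(Vα | Vdesired) as an empirical frequency over observed states that satisfy the goal, then returns the most frequent assignment. This is an assumption made to keep the example short, not the model used in the talk; `best_assignment` and the variable names are hypothetical.

```python
from collections import Counter


def best_assignment(observations, is_desired, controllable):
    """Return (assignment, gamma), where gamma is an empirical estimate of
    Pr(V_alpha = assignment | desired state). Stand-in for a TAN model."""
    desired = [obs for obs in observations if is_desired(obs)]
    if not desired:
        return None, 0.0
    counts = Counter(tuple(obs[v] for v in controllable) for obs in desired)
    values, hits = counts.most_common(1)[0]
    return dict(zip(controllable, values)), hits / len(desired)


# Made-up observations for illustration only.
observations = [
    {"servers": 1, "response_ms": 12},
    {"servers": 2, "response_ms": 11},
    {"servers": 3, "response_ms": 9},
    {"servers": 3, "response_ms": 8},
    {"servers": 2, "response_ms": 9},
    {"servers": 1, "response_ms": 13},
]

assignment, gamma = best_assignment(
    observations,
    is_desired=lambda obs: obs["response_ms"] < 10,
    controllable=["servers"],
)
```

Three observations meet the goal, two of them with three servers, so the sketch selects `servers = 3` with γ = 2/3.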
Solution Approach: Dynamism • As the system keeps running, it generates more system states, which can be incorporated into the µ-models • µ-models are easier to update than monolithic system models • A µ-model update can result in • Policy Invalidation • Policy Adaptation • New Policies • This addresses the problem of Dynamism
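One way to picture the update cycle is a model that keeps simple counts and a policy that is rechecked after each batch of new observations. The sketch below assumes a count-based model rather than a TAN, and the `MicroModel` class, `refresh_policy` function, and status strings are hypothetical names for illustration.

```python
from collections import Counter


class MicroModel:
    """Toy µ-model: counts how often each action led to the goal state."""

    def __init__(self):
        self.successes = Counter()

    def update(self, action, reached_goal):
        """Fold one newly observed system state into the model."""
        if reached_goal:
            self.successes[action] += 1

    def best_action(self):
        """Most frequently successful action, or None with no evidence."""
        if not self.successes:
            return None
        return self.successes.most_common(1)[0][0]


def refresh_policy(policy_action, model):
    """Re-derive the policy after a model update and report what happened."""
    best = model.best_action()
    if best is None:
        return policy_action, "unchanged"
    if policy_action is None:
        return best, "new"
    if best != policy_action:
        return best, "adapted"
    return policy_action, "unchanged"


model = MicroModel()
first = refresh_policy(None, model)                 # no evidence yet
model.update("servers=3", reached_goal=True)
second = refresh_policy(first[0], model)            # a new policy appears
for _ in range(2):
    model.update("servers=2", reached_goal=True)
third = refresh_policy(second[0], model)            # the policy is adapted
```

Because each update touches only one small model, only the policies derived from that model need to be rechecked.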
Solution Approach: Tractability & Trust • Each self-management action that assigns values to variables in Vα is associated with a probability γ = Pr(Vα | V – Vø) • An action is taken only when γ > γthreshold • This can be used to fine-tune self-management • TANs can be easily understood by administrators
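The effect of the threshold can be sketched in a few lines. The candidate actions and their γ values below are made up for illustration, and `automated_actions` is a hypothetical helper; the only rule taken from the slide is that an action is applied when γ exceeds the threshold.

```python
def automated_actions(candidates, gamma_threshold):
    """Split (action, gamma) pairs into actions applied automatically and
    actions deferred (for example, left to an administrator)."""
    applied = [name for name, gamma in candidates if gamma > gamma_threshold]
    deferred = [name for name, gamma in candidates if gamma <= gamma_threshold]
    return applied, deferred


# Hypothetical candidates with made-up probabilities.
candidates = [("add_server", 0.92), ("grow_cache", 0.75), ("throttle", 0.55)]

cautious = automated_actions(candidates, gamma_threshold=0.9)
permissive = automated_actions(candidates, gamma_threshold=0.7)
```

Raising the threshold makes the system act on its own less often, which is the tuning knob the slide refers to.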
Policy-Driven Self-Management • SLO: “Response Time < 10 msec” • Event: SLO Violation • Condition: Bandwidth = 90 Mbps, Request Rate = 30 • Current State: (90, 30, 12) • Goal State: (90, 30, 9) • Given the goal state (90, 30, 9), find the µ-model to use • Action: set Allocated Servers to 3 • [Figure: current and goal states plotted over Request Rate, Response Time, Allocated Servers, and Bandwidth]
Dynamic SLA Decomposition • Problem: To determine sub-SLAs for components that lead to SLA conformance • Sub-SLAs can be thought of as per-component ranges of values for controllable variables • If each component adheres to its sub-SLA, then the system-level SLA is not violated: conformance(SLA1, SLA2, …, SLAn) ⇒ conformance(System SLA) • Our techniques can handle SLA decomposition • [Figure: System-Level SLA decomposed into sub-SLAs SLA1 through SLA5]
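Reading a sub-SLA as a per-component range of values for controllable variables, a conformance check reduces to a set of range tests. The sketch below assumes that representation; the component names, variables, and ranges are made up, and it does not show how the ranges themselves would be derived.

```python
from typing import Dict, List, Tuple

# A sub-SLA maps each controllable variable to an allowed (low, high) range.
SubSLA = Dict[str, Tuple[float, float]]


def conforms(values: Dict[str, float], sub_sla: SubSLA) -> bool:
    """True if every controllable variable lies inside its allowed range."""
    return all(low <= values[name] <= high
               for name, (low, high) in sub_sla.items())


def violating_components(system: Dict[str, Dict[str, float]],
                         sub_slas: Dict[str, SubSLA]) -> List[str]:
    """Components whose current settings fall outside their sub-SLA."""
    return sorted(component for component, sla in sub_slas.items()
                  if not conforms(system[component], sla))


# Hypothetical components and ranges for illustration only.
sub_slas = {
    "web": {"servers": (2, 4)},
    "db": {"memory_gb": (8, 16)},
}
system = {
    "web": {"servers": 3},
    "db": {"memory_gb": 4},
}

violations = violating_components(system, sub_slas)
```

An empty result means every component adheres to its sub-SLA, which by the implication on the slide means the system-level SLA is not violated.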
Experimental Results: SOA Simulator • [Figure: two panels, Without Self-Management and With Self-Management]
Experimental Results: RUBiS over VMs • [Figure: two panels, Without Self-Management and With Self-Management, with annotations Database Perturbation and Partition Change]
Conclusions & Future Work • Our techniques are applicable to a variety of enterprise systems • In our experiments, the techniques proved scalable and accurate • Monitoring overheads can be reduced by using the state-space partitions to identify the relevant variables • Future work: design and implement techniques that proactively avoid SLA violations
Thank You! References [1] V. Kumar, K. Schwan, S. Iyer, Y. Chen, A. Sahai. The State-Space Approach to SLA-Based Management. In submission to NOMS 2008. [2] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. iManage: Policy-Driven Self-Management for Enterprise-Scale Systems. Middleware 2007. [3] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. Enabling Policy-Driven Self-Management for Enterprise Systems. PBAC 2007, in conjunction with ICAC 2007. [4] V. Kumar, et al. Implementing Diverse Messaging Models with Self-Managing Properties using IFLOW. ICAC 2006.