
The State-Space Approach to Self-Management of Enterprise Systems

Presentation Transcript


  1. The State-Space Approach to Self-Management of Enterprise Systems • Vibhore Kumar, Karsten Schwan (Georgia Institute of Technology) • Subu Iyer*, Yuan Chen*, Akhil Sahai* (*Hewlett-Packard Labs)

  2. Outline • Motivation: Enterprise Complexity • Issues • Solution Overview • Policy-Driven Self-Management • Dynamic SLA Decomposition • Results • Future Work

  3. Enterprise Complexity: Some Facts • From a survey conducted by Forrester Research: • Enterprises now devote 80% of their overall IT budget to maintenance and ongoing operations • More than half of the 347 participating companies used at least 3 database vendors • A major banking-industry client had 18 different travel and expense systems in the organization • The existence of a “VP of IT Governance” role says a lot about the state of enterprise IT infrastructure

  4. The Complexity Wall “If we don’t get a handle on complexity, it will stop the expansion” - Paul Horn, Senior Vice President, IBM Research “Our enterprise customers are working with enormous complexity” - Dick Lampman, Former Director, HP Labs

  5. The Complexity Wall @ Worldspan • Worldspan, one of our industry collaborators, provides services to the travel industry • One of their airline ticket pricing/availability services is hosted on a farm of 1400 servers • In 2006 alone, they processed around 9.6 billion messages • Request rates and the mix of request types vary widely • Several behaviors of their system are not well understood: • Effects of Ticket Geography • Effects of Cache Refresh Time • Effects of Time of Day …

  6. To Handle The Complexity… • One must enable self-management of complex enterprise infrastructures driven by high-level goals

  7. Enterprise Self-Management: The Hurdles • Enterprise systems are too big: the problem of Scale • It is tough to relate high-level goals to low-level actions: the problem of Complex System Modeling • The operating environment is very dynamic: the problem of Dynamism • Administrators find it hard to trust black-box solutions: the problem of Trust & Tractability

  8. Solution Overview: System State-Space • The monitored system variables and monitored component variables of the enterprise system together form the system state space V = (v1, v2, …, vn) • Variables of Interest Vø ⊆ V, e.g., Response-Time, QoI • Controllable Variables Vα ⊆ V, e.g., Allocated-Servers, Memory • The aim is to establish a relation between Vø and Vα under current operating conditions

  9. Simple Automated Operation • SLO: “Response Time < 10msec” • Event: SLO Violation • Condition: Bandwidth=90Mbps, Request Rate=30 • Action: set Allocated Servers to 3 • Vα and Vø are related given the remaining variables V – (Vα ∪ Vø) • [Figure: example system states over Request Rate, Bandwidth, Allocated Servers (Vα) and Response Time (Vø)]
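The event-condition-action rule on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a state is a dict of named variables, and the names `Policy`, `evaluate`, and the variable keys are hypothetical. The starting allocation of 1 server is illustrative. The rule only returns the new Vα values; it does not model how response time reacts to them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

State = Dict[str, float]  # one observed point in the system state space V


@dataclass
class Policy:
    """Hypothetical event-condition-action rule over the state space."""
    slo: Callable[[State], bool]        # check on Vø: True while the SLO holds
    condition: Callable[[State], bool]  # predicate on V - (Vα ∪ Vø)
    action: Dict[str, float]            # new values for controllable variables Vα


def evaluate(policy: Policy, state: State) -> Optional[Dict[str, float]]:
    """Return the Vα assignment to apply, or None if the rule does not fire."""
    slo_violated = not policy.slo(state)  # event: SLO violation
    if slo_violated and policy.condition(state):
        return dict(policy.action)  # copy so callers cannot mutate the rule
    return None


# The rule from the slide: SLO "Response Time < 10msec",
# condition Bandwidth=90Mbps and Request Rate=30, action Allocated Servers=3.
rule = Policy(
    slo=lambda s: s["response_time_ms"] < 10,
    condition=lambda s: s["bandwidth_mbps"] == 90 and s["request_rate"] == 30,
    action={"allocated_servers": 3},
)

observed = {"bandwidth_mbps": 90, "request_rate": 30,
            "allocated_servers": 1, "response_time_ms": 12}
print(evaluate(rule, observed))  # {'allocated_servers': 3}
```

If the response time is already under 10 msec, or the operating conditions differ, `evaluate` returns `None` and no action is taken.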

  10. Solution Overview: The Function • Learn from observed system states • But there are problems • Different behavior in different sub-spaces • Large state space, |V| ≈ 10² to 10³ • [Figure: observed system states over v1 … vn, fed to machine learning, with sub-spaces labeled CPU Bottleneck and Network Bottleneck]

  11. Solution Overview: The Function • We decided to model the system using multiple µ-models • We intelligently partition the set of observed system states so that • partitions exhibit homogeneous behavior • partitions have a reduced number of relevant variables • Partitioning & µ-Modeling solve two problems! • The problem of Scale • The problem of Complex System Modeling • [Figure: observed states over v1 … vn partitioned into µ-models, each with a reduced number of relevant variables]
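The partitioning idea can be illustrated with a toy sketch. The slides do not say how partitions or relevant variables are chosen, so both criteria here are assumptions: states are grouped by whichever utilization variable is highest (a stand-in for "CPU bottleneck" vs. "network bottleneck"), and a variable counts as relevant within a partition if its absolute Pearson correlation with response time is at least 0.5. All function names, variable names, and data values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Sequence

State = Dict[str, float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def partition_by_bottleneck(states: List[State], util_vars: List[str]) -> Dict[str, List[State]]:
    """Group states by whichever utilization variable is highest (assumed criterion)."""
    parts: Dict[str, List[State]] = defaultdict(list)
    for s in states:
        parts[max(util_vars, key=lambda v: s[v])].append(s)
    return dict(parts)


def relevant_variables(part: List[State], target: str, candidates: List[str],
                       min_abs_corr: float = 0.5) -> List[str]:
    """Keep candidates whose |correlation| with the target meets the threshold."""
    ys = [s[target] for s in part]
    return [v for v in candidates
            if abs(pearson([s[v] for s in part], ys)) >= min_abs_corr]


observed = [  # made-up states: response time tracks CPU in one region, network in the other
    {"cpu_util": 0.90, "net_util": 0.16, "resp_ms": 20},
    {"cpu_util": 0.80, "net_util": 0.29, "resp_ms": 15},
    {"cpu_util": 0.95, "net_util": 0.30, "resp_ms": 24},
    {"cpu_util": 0.32, "net_util": 0.90, "resp_ms": 30},
    {"cpu_util": 0.30, "net_util": 0.70, "resp_ms": 18},
    {"cpu_util": 0.13, "net_util": 0.80, "resp_ms": 25},
]
util = ["cpu_util", "net_util"]
for name, part in partition_by_bottleneck(observed, util).items():
    print(name, relevant_variables(part, "resp_ms", util))
```

With this made-up data, each partition keeps only the variable that drives response time in that region (`cpu_util` in one, `net_util` in the other), which is the kind of reduction in relevant variables the slide describes.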

  12. Solution Overview: µ-Models • We use a Tree Augmented Naïve Bayes (TAN) classifier to build µ-models • The model returns the probability γ = Pr(Vα | Vdesired) • We find the assignment of values to the variables in Vα that maximizes the probability of moving the system to the desired state
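The selection step, choosing the Vα assignment with the highest probability given the desired state, can be illustrated without a full TAN. As a deliberately simpler stand-in, the sketch below estimates Pr(Vα | evidence) from Laplace-smoothed frequency counts over discretized past states; it is not a TAN classifier, and the evidence encoding, `best_action`, and the history are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Evidence = Tuple[Hashable, ...]  # e.g. (bandwidth, request rate, response-time bucket)


def best_action(history: List[Tuple[Evidence, Hashable]], evidence: Evidence,
                actions: List[Hashable], alpha: float = 1.0) -> Tuple[Hashable, float]:
    """Return (argmax_a Pr(a | evidence), that probability), Laplace-smoothed."""
    joint = Counter(history)  # counts of (evidence, action) pairs
    n_evidence = sum(c for (e, _), c in joint.items() if e == evidence)
    denom = n_evidence + alpha * len(actions)
    probs: Dict[Hashable, float] = {a: (joint[(evidence, a)] + alpha) / denom for a in actions}
    best = max(actions, key=lambda a: probs[a])
    return best, probs[best]


# Made-up history: each entry pairs an observed state with its number of allocated servers.
history = [
    ((90, 30, "<10ms"), 3), ((90, 30, "<10ms"), 3), ((90, 30, "<10ms"), 4),
    ((90, 30, ">=10ms"), 1), ((90, 30, ">=10ms"), 1), ((90, 30, ">=10ms"), 1),
]
action, gamma = best_action(history, (90, 30, "<10ms"), actions=[1, 2, 3, 4])
print(action, round(gamma, 3))  # 3 0.429
```

The returned probability plays the role of γ; a TAN would additionally model dependencies between variables instead of treating the whole evidence tuple as one key.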

  13. Solution Approach: Dynamism • As the system keeps running, more system states are generated, which can be incorporated into the µ-models • µ-models are easier to update than monolithic system models • A µ-model update can result in • Policy Invalidation • Policy Adaptation • New Policies • This addresses the problem of Dynamism
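One way a µ-model can stay current cheaply is if each new state is absorbed with a constant-time update. The sketch below is a hypothetical count-based illustration of that and of policy adaptation: the best Vα setting for an evidence key is recomputed from counts, so new observations can change it. `CountingMicroModel` and the data are assumptions, not the authors' design.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Optional


class CountingMicroModel:
    """Hypothetical count-based µ-model: each new state is an O(1) update."""

    def __init__(self) -> None:
        # evidence (operating conditions + desired Vø) -> counts of Vα settings
        self.counts: Dict[Hashable, Counter] = defaultdict(Counter)

    def update(self, evidence: Hashable, action: Hashable) -> None:
        self.counts[evidence][action] += 1

    def policy(self, evidence: Hashable) -> Optional[Hashable]:
        """Most frequently observed Vα setting for this evidence, if any."""
        seen = self.counts.get(evidence)  # .get does not insert into a defaultdict
        return seen.most_common(1)[0][0] if seen else None


model = CountingMicroModel()
key = (90, 30, "<10ms")
for servers in (3, 3):
    model.update(key, servers)
before = model.policy(key)
for servers in (4, 4, 4):  # e.g. states observed after a workload change
    model.update(key, servers)
after = model.policy(key)
print(before, after, "adapted" if before != after else "unchanged")  # 3 4 adapted
```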

  14. Solution Approach: Tractability & Trust • Each self-management action that assigns values to variables in Vα is associated with a probability γ = Pr(Vα | V – Vø) • An action is taken only when γ > γthreshold • This can be used to fine-tune self-management • TANs can be easily understood by administrators
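The γ threshold acts as a gate between automatic action and human review. A minimal sketch, assuming candidate actions arrive as (assignment, γ) pairs from some µ-model; `select_action` and the γ values are hypothetical. The comparison is strict, matching γ > γthreshold on the slide.

```python
from typing import Dict, List, Optional, Tuple

Assignment = Dict[str, int]  # values for the controllable variables Vα


def select_action(candidates: List[Tuple[Assignment, float]],
                  gamma_threshold: float) -> Optional[Assignment]:
    """Pick the most probable candidate; act only if its γ exceeds the threshold."""
    if not candidates:
        return None
    assignment, gamma = max(candidates, key=lambda c: c[1])
    return assignment if gamma > gamma_threshold else None  # None: defer to an administrator


candidates = [({"allocated_servers": 3}, 0.86), ({"allocated_servers": 4}, 0.10)]
print(select_action(candidates, gamma_threshold=0.8))  # {'allocated_servers': 3}
print(select_action(candidates, gamma_threshold=0.9))  # None
```

Raising the threshold makes self-management more conservative, which is one way to read "fine-tune self-management".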

  15. Outline • Motivation: Enterprise Complexity • Issues • Solution Overview • Policy-Driven Self-Management • Dynamic SLA Decomposition • Results • Future Work

  16. Policy-Driven Self-Management • SLO: “Response Time < 10msec” • Event: SLO Violation • Condition: Bandwidth=90Mbps, Request Rate=30 • Current state (90, 30, 12) → goal state (90, 30, 9), as (Bandwidth, Request Rate, Response Time) • Given the goal state (90, 30, 9), find the µ-model to use • Action: set Allocated Servers to 3 • [Figure: current and goal states over Request Rate, Bandwidth, Allocated Servers and Response Time]
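The step "find the µ-model to use" needs some way of mapping a goal state to a partition. One simple assumption, used below, is that each µ-model covers an axis-aligned box in (Bandwidth, Request Rate, Response Time) space. The region bounds, names, and `find_micro_model` are hypothetical; the slides do not specify how partitions are represented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (bandwidth Mbps, request rate, response time ms)


@dataclass
class MicroModelRegion:
    """Hypothetical axis-aligned region of the state space covered by one µ-model."""
    name: str
    lower: Point
    upper: Point

    def contains(self, p: Point) -> bool:
        return all(lo <= x <= hi for lo, x, hi in zip(self.lower, p, self.upper))


def find_micro_model(regions: List[MicroModelRegion], goal: Point) -> Optional[MicroModelRegion]:
    """Return the first region containing the goal state, or None if none does."""
    return next((r for r in regions if r.contains(goal)), None)


regions = [
    MicroModelRegion("low-load", (0, 0, 0), (100, 20, 50)),
    MicroModelRegion("high-load", (0, 20, 0), (100, 100, 50)),
]
chosen = find_micro_model(regions, (90, 30, 9))
print(chosen.name if chosen else "no µ-model")  # high-load
```

Once a region is found, its µ-model would be queried for the Vα assignment (here, Allocated Servers) that best reaches the goal state.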

  17. Dynamic SLA Decomposition • Problem: to determine sub-SLAs for components that lead to SLA conformance • Sub-SLAs can be thought of as per-component ranges of values for controllable variables • If each component adheres to its sub-SLA, then the system-level SLA is not violated: conformance(SLA1, SLA2, …, SLAn) ⇒ conformance(System SLA) • Our techniques can also be applied to SLA decomposition • [Figure: a system-level SLA decomposed into component SLAs SLA1 to SLA5]
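As a simplified illustration of why sub-SLA conformance can guarantee system conformance, the sketch below decomposes an end-to-end response-time SLO into per-component latency budgets. It assumes latencies add along a serial path and splits the budget in proportion to observed latency; both are assumptions, and unlike the slide it budgets latency rather than ranges of controllable variables. Component names, numbers, and function names are made up.

```python
from typing import Dict


def decompose_latency_slo(slo_ms: float, observed_ms: Dict[str, float]) -> Dict[str, float]:
    """Split an end-to-end latency SLO into per-component budgets, proportional to
    each component's observed share (assumes latencies add along a serial path)."""
    total = sum(observed_ms.values())
    return {c: slo_ms * t / total for c, t in observed_ms.items()}


def components_conform(measured_ms: Dict[str, float], budgets: Dict[str, float]) -> bool:
    return all(measured_ms[c] <= b for c, b in budgets.items())


def system_conforms(measured_ms: Dict[str, float], slo_ms: float) -> bool:
    return sum(measured_ms.values()) <= slo_ms


budgets = decompose_latency_slo(10.0, {"web": 1.0, "app": 3.0, "db": 4.0})
print(budgets)  # {'web': 1.25, 'app': 3.75, 'db': 5.0}

measured = {"web": 1.2, "app": 3.5, "db": 4.9}
# The budgets sum to the SLO (up to floating-point rounding), so per-component
# conformance implies system conformance under the additive assumption.
print(components_conform(measured, budgets), system_conforms(measured, 10.0))  # True True
```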

  18. Experimental Results: SOA Simulator • [Figure panels: Without Self-Management; With Self-Management]

  19. Experimental Results: RUBiS over VMs • [Figure panels: Without Self-Management; With Self-Management; annotated events: Database Perturbation, Partition Change]

  20. Conclusions & Future Work • Our techniques are applicable to a variety of enterprise systems • In our experiments, the techniques have proven to be very scalable and accurate • Monitoring overheads can be reduced by monitoring only the relevant variables identified by the state-space partitions • Future work: design & implement techniques that can proactively avoid SLA violations

  21. Thank You! References • [1] V. Kumar, K. Schwan, S. Iyer, Y. Chen, A. Sahai. The State-Space Approach to SLA-Based Management. In submission to NOMS 2008. • [2] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. iManage: Policy-Driven Self-Management for Enterprise-Scale Systems. Middleware 2007. • [3] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. Enabling Policy-Driven Self-Management for Enterprise Systems. PBAC 2007, in conjunction with ICAC 2007. • [4] V. Kumar et al. Implementing Diverse Messaging Models with Self-Managing Properties using IFLOW. ICAC 2006.
