Research Challenges in Autonomic Computing

Research Challenges in Autonomic Computing. Jeff Kephart IBM Research. kephart@us.ibm.com www.research.ibm.com/autonomic. Outline. Background and Motivation Autonomic Computing Research at IBM Architecture Overview of Research Program Autonomic Computing Research Challenges Conclusions.

Presentation Transcript


  1. Research Challenges in Autonomic Computing • Jeff Kephart, IBM Research • kephart@us.ibm.com • www.research.ibm.com/autonomic

  2. Outline • Background and Motivation • Autonomic Computing Research at IBM • Architecture • Overview of Research Program • Autonomic Computing Research Challenges • Conclusions

  3. Background and Motivation (Kephart) • My role in autonomic computing • My group does research on agents and multi-agent systems • Architecture, Communication, Negotiation, Machine learning • AC Research strategy; joint program manager • University relations; faculty awards, equipment grants • Chair, Autonomic Computing Advisory Board • What I hope to achieve here • Stir up interest in autonomic computing research • Explore collaborations with IBM Research • Learn from you: new viewpoints, new approaches

  4. Complex heterogeneous infrastructures are a reality!

  5. Autonomic Computing: Motivation • Individual system elements increasingly difficult to maintain and operate • 100s of config, tuning parameters for commercial databases, servers, storage • Heterogeneous systems are becoming increasingly connected • Integration becoming ever more difficult • Architects can't intricately plan component interactions • Increasingly dynamic; more frequently with unanticipated components • This places greater burden on system administrators, but • they are already overtaxed • they are already a major source of cost (6:1 for storage) and error • We need self-managing computing systems • Behavior specified by sys admins via high-level policies • System and its components figure out how to carry out policies

  6. Facets of Self-Management • Business case: increased resiliency, responsiveness, efficiency, ROI; reduced down-time, risk, time-to-value, cost

  7. Evolving towards Autonomic Computing Systems (from Manual to Autonomic) • Level 1 • Characteristics: multiple sources of system generated data • Skills: extensive, highly skilled IT staff • Benefits: basic requirements met • Level 2 • Characteristics: data & actions consolidated through mgt tools • Skills: IT staff analyzes & takes actions • Benefits: greater system awareness; improved productivity • Level 3 • Characteristics: sys monitors, correlates & recommends actions • Skills: IT staff approves & initiates actions • Benefits: less need for deep skills; faster/better decision making • Level 4 • Characteristics: sys monitors, correlates & takes action • Skills: IT staff manages performance against SLAs • Benefits: human/system interaction; IT agility & resiliency • Level 5 • Characteristics: components dynamically respond to business policies • Skills: IT staff focuses on enabling business needs • Benefits: business policy drives IT mgt; business agility and resiliency

  8. Outline • Background and Motivation • Autonomic Computing Research at IBM • Architecture • Overview of Research Program • AI Research Challenges • Conclusions

  9. Autonomic Computing Architecture: The Autonomic Element • AEs are the basic atoms of autonomic systems • An AE contains • Exactly one autonomic manager • Zero or more managed element(s) • AE is responsible for • Managing own behavior in accordance with policies • Interacting with other autonomic elements to provide or consume computational services • [Figure: An Autonomic Element. The Autonomic Manager runs a Monitor, Analyze, Plan, Execute loop around shared Knowledge and is coupled to the Managed Element through sensors (S) and effectors (E). Annotations: service-oriented architecture; software agents; e.g. database, storage, server, software app, workload mgr, sentinel, arbiter, OGSA infrastructure elements]
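
The monitor, analyze, plan, execute loop of an autonomic element can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ManagedElement` model (load spread evenly over workers), the `max_load` policy threshold and all class and method names are hypothetical and do not come from IBM's architecture or toolkit.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ManagedElement:
    """Hypothetical managed resource with a sensor (S) and an effector (E)."""
    load: float = 0.9   # utilization per worker, 0..1
    workers: int = 2    # the tunable parameter

    def sense(self) -> Dict[str, float]:
        return {"load": self.load, "workers": float(self.workers)}

    def effect(self, workers: int) -> None:
        # Toy model: the same total work is spread over the new worker count.
        total_work = self.load * self.workers
        self.workers = workers
        self.load = total_work / workers


@dataclass
class AutonomicManager:
    element: ManagedElement
    # Knowledge: here just one policy threshold (an assumed example).
    knowledge: Dict[str, float] = field(default_factory=lambda: {"max_load": 0.7})
    history: List[Dict[str, float]] = field(default_factory=list)

    def monitor(self) -> Dict[str, float]:
        reading = self.element.sense()
        self.history.append(reading)
        return reading

    def analyze(self, reading: Dict[str, float]) -> bool:
        # True when the policy is violated.
        return reading["load"] > self.knowledge["max_load"]

    def plan(self, reading: Dict[str, float]) -> int:
        # Smallest worker count that brings load back under the threshold.
        total_work = reading["load"] * reading["workers"]
        return math.ceil(total_work / self.knowledge["max_load"])

    def execute(self, workers: int) -> None:
        self.element.effect(workers)

    def step(self) -> bool:
        """One pass through the loop; returns True if an action was taken."""
        reading = self.monitor()
        if not self.analyze(reading):
            return False
        self.execute(self.plan(reading))
        return True
```

Calling `step()` repeatedly on a manager acts only while the policy is violated, which is the self-management behavior the slide describes at the element level.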

  10. Autonomic Computing Architecture: Element interactions • System self-* properties, behavior arise from interactions among autonomic managers • Interactions are • Dynamic, ephemeral • Formed by (negotiated) agreement • Flexible in pattern; determined by policies • Based on OGSA and specific AC extensions • Required messages • Optional but standard • Application-specific • For advanced interactions: conversation support • “Choreography” defines structure of multi-step interactions • A multi-agent system!

  11. Overview of IBM’s Autonomic Computing Research Program • Over 150 researchers working on various aspects of Autonomic Computing • Some projects predate AC initiative; now trying to realign them with AC architecture • Technologies for specific autonomic elements • Database, storage, server, client… • Generic element technologies for autonomic elements • Autonomic Manager Toolset integrates many element-level technologies • Modeling, analysis, forecasting, optimization, planning, feedback control, etc. • Uses Open Grid Services Architecture standards for inter-element communication • Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later • Generic system-level technologies • Dependency management, problem determination and remediation, workload management, provisioning, … • System scenarios and prototypes • Small- to medium-scale autonomic systems • Demonstrate self-* arising from AC architecture + technology • Identify gaps, necessary modifications

  13. LEarning Optimizer for DB2 (LEO) (G. Lohman, Almaden) • [Figure: LEO feedback loop. During SQL compilation the Optimizer uses Statistics and Adjustments to turn a Query into a Best Plan with estimated cardinalities; Plan Execution yields actual cardinalities. Steps: 1. Monitor, 2. Analyze (estimated vs. actual cardinalities), 3. Feedback, 4. Exploit]
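
A minimal sketch of the monitor, analyze, feedback, exploit cycle shown in the LEO figure. It is not DB2's implementation: the `CardinalityFeedback` class, the per-predicate multiplicative adjustment and the predicate strings are assumptions made purely for illustration.

```python
from collections import defaultdict


class CardinalityFeedback:
    """Toy feedback store: one multiplicative correction per predicate."""

    def __init__(self):
        # Until feedback arrives, estimates are used unchanged (factor 1.0).
        self.adjustment = defaultdict(lambda: 1.0)

    def estimate(self, predicate, base_estimate):
        # Exploit: correct the statistics-based estimate with what was learned.
        return base_estimate * self.adjustment[predicate]

    def record(self, predicate, base_estimate, actual):
        # Monitor + analyze + feedback: compare the estimate with the row
        # count observed at execution time and store the ratio.
        if base_estimate > 0:
            self.adjustment[predicate] = actual / base_estimate
```

Usage: after `record("age > 30", 1000, 250)`, a later `estimate("age > 30", 1000)` is scaled by the learned factor 0.25.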

  14. IBM IceCube Server (R. Freitas, Almaden) • Lego-like collection of “Intelligent Bricks” • Fail-in-place policy: bad bricks are left in place • 7x smaller than equivalent standard systems • Fast, power-hungry components (CPU etc.) OK • Includes resource allocation software • First application: petabyte-class storage server • Intended to be managed by one person • Prototype brick: (12) 2.5” disks, 8-port switch, Linux on fast CPU • (6) 10 Gbit/s capacitive “couplers” per brick • 3D mesh @ 10 Gb/s per link • No connectors, wires, fibers, lasers or fans • [Figure: full IceCube system and a 6” brick; blue: storage bricks, yellow: compute bricks; “Thermal Bus Array”]

  15. SLEDS (SLA-based management of storage performance) (D. Chambliss, Almaden) • Storage customers establish SLAs w/ storage system • Storage system throttles optimally in accord w/ SLAs • [Figure labels: Storage Customers, Cust Policy (x2), SLA Server, SAN Fabric, Manager, Storage Server]

  16. Personal software configuration (D. Bantz & D. Frank, Watson) • Automate SW maint & migration on personal devices • “Upgrade all my applications” • “Make my new laptop work like the old one” • “Migrate most valuable Palm apps to my PC” • [Figure: requests (Clean up? New software? Make space?) feed a PPE that draws on inventory collection, a SmartCatalog, policies, analysis rules and planning rules. Analysis: characterize inventory. Plan: choose components, resolve dependencies. Outputs: selections, explanations, install/uninstall]

  18. Autonomic Manager Toolkit (W. Arnold et al., Watson) • Facilitates autonomic mgr construction • In accordance w/ AC architecture • Catcher for generic AM technologies • OGSA messaging • Policy tools • Monitoring technologies • AI tools for knowledge representation, reasoning • Math libraries for modeling, analysis, planning • Feedback control • V1.0 available as part of Emerging Technologies Toolkit v1.1 on IBM alphaWorks (www.alphaworks.ibm.com) • Considering open source • [Figure: An Autonomic Element (Autonomic Manager with Monitor, Analyze, Plan, Execute and Knowledge; Managed Element; S and E interfaces)]

  19. Policies and Autonomic Computing (D. Verma and D. Kandlur, Watson) • Policy: Set of guidelines or directives provided to autonomic element to influence its behavior. • Key Challenge: • Move away from low level controls • Move towards high level directives (policies) over autonomic decisions • Developing scenarios, standards and technologies to support policies for autonomic computing

  20. Mathematical Modeling and Optimization (M. Squillante, Watson) • Develop and implement sophisticated mathematical methods and algorithms to support AC systems • Modeling • Statistical Analysis • Stochastic Models • Forecasting • Optimization • Discrete • Stochastic • Nonlinear • Control • Control Theory • Dynamical Systems • Chaos • [Figure labels: Think Times, Router, Servers]

  21. Generic Adaptive Control (J. Hellerstein, Watson) • Feedback control to tune effectors • Based on high-level behavioral specs • Multiple goals • Multiple effectors • Time-varying demand • Various database and server applications • [Figure: feedback loop around an Apache server handling web service requests. The admin sets targets CPU* and Mem*; a controller compares them with the measured CPU and Mem (sensor S) and adjusts KeepAlive and MaxClients (effector E)]
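
The idea of tuning an effector from a high-level target can be sketched with a single-input, single-output integral controller. This is a deliberately simplified stand-in for the multi-goal, multi-effector control the slide describes: the `ToyServer` plant (CPU utilization proportional to MaxClients), its coefficient and the controller gain are all assumed values, not measurements of Apache.

```python
class ToyServer:
    """Hypothetical plant: CPU utilization grows linearly with MaxClients."""

    def __init__(self, cpu_per_client=0.002):
        self.cpu_per_client = cpu_per_client
        self.max_clients = 0.0

    def set_max_clients(self, value):  # effector (E)
        self.max_clients = value

    def cpu(self):  # sensor (S)
        return self.cpu_per_client * self.max_clients


def integral_control(server, cpu_target, u0, gain, steps):
    """Drive measured CPU toward cpu_target by adjusting MaxClients.

    Each step adds gain * error to the control input. For this linear toy
    plant the loop converges when 0 < gain * cpu_per_client < 2.
    """
    u = u0
    for _ in range(steps):
        server.set_max_clients(u)
        error = cpu_target - server.cpu()
        u += gain * error
    return u


server = ToyServer()
final_setting = integral_control(server, cpu_target=0.6, u0=100.0,
                                 gain=200.0, steps=50)
print(round(final_setting, 3), round(server.cpu(), 3))
```

With these assumed numbers the setting approaches 300 clients, the value at which the toy plant reports 60% CPU; the administrator states only the target, not the setting.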

  22. Utility Functions and Autonomic Computing (W. Walsh, Watson) • Utility functions can guide autonomic decision making within an element • Self-optimization: natural and flexible way to express optimization criteria based on business objectives • Avoids hard-coded preferences, special-purpose algorithms • Basis for translating business-level objectives into resource allocation objectives • Algorithms based on modeling and optimization • [Figure: utility function V(RT) plotted against response time RT]
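
A small sketch of how a response-time utility V(RT) can drive a resource allocation decision. Everything concrete here is an assumption for illustration: the shape of `utility` (full value up to a target, then linear decay to zero at twice the target), the M/M/1-style `response_time` model, and the two workload classes.

```python
import math


def utility(rt, target, value):
    """Assumed V(RT): full value up to target, zero at 2 * target or beyond."""
    if rt <= target:
        return value
    if math.isinf(rt):
        return 0.0
    return max(0.0, value * (2.0 - rt / target))


def response_time(servers, arrival_rate, service_rate=10.0):
    """Assumed performance model: RT = 1 / (capacity - arrival rate)."""
    capacity = servers * service_rate
    if capacity <= arrival_rate:
        return math.inf  # overloaded
    return 1.0 / (capacity - arrival_rate)


def best_split(total_servers, class_a, class_b):
    """Enumerate all splits; return (servers for A, total utility)."""
    def total(n_a):
        n_b = total_servers - n_a
        return (utility(response_time(n_a, class_a["arrival"]),
                        class_a["target"], class_a["value"])
                + utility(response_time(n_b, class_b["arrival"]),
                          class_b["target"], class_b["value"]))

    best = max(range(total_servers + 1), key=total)
    return best, total(best)


gold = {"arrival": 25.0, "target": 0.1, "value": 100.0}
silver = {"arrival": 5.0, "target": 0.5, "value": 50.0}
print(best_split(5, gold, silver))
```

The business-level statement (what each class's response time is worth) stays separate from the allocation algorithm, so changing objectives means changing the utility parameters rather than the code.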

  24. Dependency Mgt & Self-Healing (G. Kar, Watson; H. Lee & S. Ma, Watson) • Determine functional dependencies among elements • Mine design docs, system config metadata, log files • Actively probe running system • Use dependency information for system management • Localize problem (real-time active inference & learning) • [Figure: Probe Analysis & Control builds a Dependency Matrix over Router, Web Server, App Server and DB Server; other labels: HWS, HAS, HDBS]
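
One simple way to use a dependency matrix for problem localization is to intersect what the failed probes depend on and discard anything a successful probe has exercised. The sketch below assumes a single faulty component and deterministic probes; the probe names and the `DEPENDS` table are hypothetical, and the slide's "active inference & learning" is far richer than this set arithmetic.

```python
# Dependency matrix as a mapping: probe -> components the probe relies on.
DEPENDS = {
    "probe_web": {"router", "web_server"},
    "probe_app": {"router", "web_server", "app_server"},
    "probe_db": {"router", "web_server", "app_server", "db_server"},
}


def localize(depends, results):
    """Return the components that could explain the observed probe results.

    results maps probe name -> True (succeeded) or False (failed).
    Assumes exactly one faulty component when any probe fails.
    """
    failed = [depends[p] for p, ok in results.items() if not ok]
    if not failed:
        return set()
    # The fault must lie on every failed probe's path ...
    suspects = set.intersection(*failed)
    # ... and cannot be a component that a successful probe relied on.
    for probe, ok in results.items():
        if ok:
            suspects -= depends[probe]
    return suspects


print(localize(DEPENDS, {"probe_web": True, "probe_app": True, "probe_db": False}))
```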

  26. Human Interaction with Autonomic Systems (P. Maglio, Almaden) • Basic questions • What do middleware administrators do? • How can we better support the problems and practices they have? • Learn answers to these questions via ethnographic studies • Use insights to develop new ways to interact with complex computing systems • Quotes: • “We start with looking at the proxy server log files, then the web server log files, then the application server admin log files, then the application log files.” • “We had it wrong. Our assumption of how it worked was incorrect. … but we thought that was the return port!”

  27. Enterprise Workload Management (D. Dillenberger) • Large, distributed, heterogeneous system • Achieves end-to-end performance via adaptive algorithms • Administrator defines policy • Desired response times for various classes of users, apps • eWLM managers on each resource cooperate to adaptively tune parameters • OS, network, storage, virtual server knobs • JVM heap size, # garbage collection threads • Workload balancing, routing parameters

  28. Example scenario: Autonomic Data Center • [Figure: Autonomic Data Center. Components shown: Resource Arbiter, System Manager, Registry, Policy Repository, and two Application Environments, each with an Application Manager, Router, Server, Database and Storage, serving Clients 1-1, 1-2 and 2-1, 2-2. Utility is shown at two levels: service-level utility and resource-level utility]

  29. Outline • Background and Motivation • Autonomic Computing Research at IBM • Architecture • Overview of Research Program • Scenarios • Autonomic Computing Research Challenges • Systems and Software • Architecture, software engineering & tools, testing/validation • Prototyping a large-scale self-* system • Human-Computer Interaction • Policies, Interfaces • Artificial Intelligence • Learning, Negotiation, Self-healing, Emergent Behavior • Conclusions

  30. Challenge: Architecture • Define set of fundamental architectural principles from which self-* emerges • AE: How to coordinate multiple threads of activity? • AEs live in complex environments • Multiple task instances and types • Concurrent, asynchronous • Multiple interacting expert modules • AE: How to detect/resolve conflicts arising from • Internal decisions by independent expert modules • External directives (possibly asynchronous) • Internal policies vs. external directives • System-level: Enable more flexible, service-oriented patterns of interaction • As opposed to traditional top-down, hierarchical systems management • Multi-agent architecture • Communication • Representing and reasoning about needs, capabilities, dependencies • [Figure: An Autonomic Element]

  31. Challenge: Software engineering and programming tools • Develop appropriate software engineering concepts and programming tools for composing autonomic elements and systems; support for • Monitoring, analysis, planning and execution • Expressing and understanding policies • Interactions with other elements • Negotiation • Monitoring and enforcing agreements

  32. Challenge: Testing and Verification • Develop methods for testing and verifying behavior of autonomic elements • testbeds and simulation environments • in situ mechanisms that permit new versions of software to run alongside old versions until they have established their trustworthiness

  33. Challenge: Policy • Policy: “Set of guidelines or directives provided to autonomic element to influence its behavior” • Human interface • Authoring and understanding policies • Avoiding or ameliorating specification errors • Developing a universal representation and grammar • Many different application domains, disciplines • Many different flavors of policy • Covers service agreements too? • Algorithms that operate upon policies (and agreements?) • Automated derivation of actions (e.g. planning, optimization) • Automated derivation of lower-level policies from high-level policies • E.g. “Maximize profit from this set of service contracts” • Conflict resolution • Both design time and run time • Need to establish protocols, interfaces, algorithms • [Figure: An Autonomic Element]

  34. Three flavors of policy (policy = “decision-making guide”) • Action rule • If (S) then do a2 • Results implicitly in desired state s2 • Goal • Achieve a most desired state s2 • Compute a2 most likely to result in s2 • Assumes that most desired state can be determined a priori • Utility function • Achieve state s with maximal net value V(s) – C(a), where a is the action that takes S to s • Benefit and burden of being explicit about value • States have intrinsic value; value of policy is a derived quantity • [Figure: from current state S, actions a1, a2, a3 lead to possible states s1, s2, s3] • [Figure: hierarchy from higher-level specifications down to machine code. Labels: System utility functions, Element utility functions, Element goals, Rules, Actions, Workflows, Decision-theoretic Planning, Generative Planning, Modeling, Optimization, Programming, Adapters, Translators, [More levels of code hierarchy]]
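
The three flavors can be contrasted on one toy decision problem. The sketch assumes deterministic transitions (the slide speaks of the action "most likely" to reach a state), and the state values and action costs are invented numbers chosen so that the flavors disagree.

```python
# Hypothetical model: from current state "S", each action leads to one state.
TRANSITIONS = {"a1": "s1", "a2": "s2", "a3": "s3"}
VALUE = {"s1": 40, "s2": 100, "s3": 70}  # V(s), assumed
COST = {"a1": 5, "a2": 80, "a3": 10}     # C(a), assumed


def action_rule(current_state):
    """Action rule: 'If (S) then do a2'. The desired state stays implicit."""
    return "a2" if current_state == "S" else None


def goal_policy(goal_state):
    """Goal: name the desired state and compute an action that reaches it."""
    for action, state in TRANSITIONS.items():
        if state == goal_state:
            return action
    return None


def utility_policy():
    """Utility function: pick the action with maximal net value V(s) - C(a)."""
    return max(TRANSITIONS, key=lambda a: VALUE[TRANSITIONS[a]] - COST[a])


print(action_rule("S"), goal_policy("s2"), utility_policy())
```

With these numbers the rule and the goal both select a2, while the utility policy selects a3 because reaching s2 costs more than the extra value it brings. That is the "benefit and burden" of making value explicit.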

  35. Policies: Theory meets Reality • We can’t specify the full state of the world • Policy conflicts can arise from incomplete descriptions of state • E.g. different action-rule antecedents can apply to same state, but have conflicting consequents • Goal-type policies can conflict too (sets of acceptable and feasible states don’t intersect) • It’s hard to elicit a full specification of desired behavior from people • Preference elicitation is difficult when there are many attributes • But people are good at noticing when the system isn’t behaving as they like • “Complaint-based tuning” (Ganger, CMU) • Can a universal representation and calculus handle such a broad range? • Storage, network, database, server, etc. • Temporal conditions; correlations • Access control • Classification
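
The action-rule conflict mentioned above (different antecedents apply to the same state but have conflicting consequents) can be detected mechanically at run time. The rules, state attributes and thresholds below are hypothetical examples, and the check only covers the simplest case of two fired rules setting the same effector to different values.

```python
from itertools import combinations

# Each rule: (name, antecedent over the observed state, (effector, setting)).
RULES = [
    ("hot", lambda s: s["temp"] > 76, ("ac", "on")),
    ("expensive", lambda s: s["price"] > 0.30, ("ac", "off")),
    ("night", lambda s: s["hour"] >= 22, ("lights", "off")),
]


def conflicts(rules, state):
    """Return pairs of rule names that fire together but disagree."""
    fired = [(name, action) for name, cond, action in rules if cond(state)]
    return [
        (n1, n2)
        for (n1, (eff1, val1)), (n2, (eff2, val2)) in combinations(fired, 2)
        if eff1 == eff2 and val1 != val2
    ]


print(conflicts(RULES, {"temp": 80, "price": 0.35, "hour": 23}))
```

Neither rule is wrong on its own; the conflict appears only because each antecedent describes part of the state.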

  36. Challenge: Human-System Interface • Develop new languages, metaphors and translation technologies that enable humans to monitor, visualize, and control AC systems • Specify goals and objectives to AC systems, and visualize their potential effect • Techniques must be • Sufficiently expressive of preferences regarding cost vs. performance, security, risk and reliability • Sufficiently structured and/or naturally suited to human psychology and cognition to keep specification errors to an absolute minimum • Robust to specification errors

  37. Challenge: Learning • Establish theoretical foundation for understanding and performing learning and optimization in multi-agent systems • Single element level • AE needs to learn a model of itself and environment quickly; environment is noisy, and dynamic in both state and structure • On-line, so exploration of the space can be costly and/or harmful • May be several hundreds of tunable parameters! • Maybe only a few dozen are relevant, but which ones? • Some of them can only be changed upon reboot: is it worthwhile? • System level • Multi-agent system: several interacting learners • What are good learning algorithms for cooperative, competitive systems? • What are conditions for stability? • What is sensitivity to perturbations? • Opportunities for layered learning

  38. Challenge: Negotiation • Develop and analyze • Methods for expressing or computing preferences • Negotiation protocols • Negotiation algorithms • Establish theoretical foundation for negotiation • Explore conditions under which to apply • Bilateral • Multi-lateral (mediated, or not) • Supply-chain • Study how system behavior depends on mixture of negotiation algorithms in AE population

  39. Challenge: Self-Healing Systems • Develop robust, scalable approaches to monitoring/controlling health, security and performance of autonomic systems • Automated capture of human expert knowledge about problem diagnosis and recovery • Predictive, adaptive diagnosis/recovery • Data mining to learn correlated event patterns for diagnosis • Automated learning and execution of appropriate recovery plan • Construction and learning of adaptive statistical models of large networked systems • And do it all without being too invasive! • [Figure labels: Network, Router, Web Server, Database, Probe Station, Dependency Agent, Problem Determ. DB, Problem Diagnosis/Localization Mgr, Real-time Event Mgr, Probe Driver, Inference & Learning Engs., Simulator & Action Mgr, Remediator, GUI, Diagnos. State, Dep. Info, Config]

  40. Challenge: Control and Harness Emergent Behavior • Understand, control, and exploit emergent behavior in autonomic systems • How do self-*, stability, etc. depend on • Behaviors and goals of the autonomic elements • Pattern and type of interactions among AEs • External influences and demands on system • Invert relationship to attain desired global behavior • How? • Are there fundamental limits? • Develop theory of interacting feedback loops • Hierarchical • Distributed

  41. Outline • Background and Motivation • Autonomic Computing Research at IBM • Architecture • Scenarios • Overview of Research Program • AI Research Challenges • Conclusions

  42. Conclusions • Autonomic Computing is a grand challenge, requiring advances in several fields of science and technology • Policy, planning, learning, knowledge representation, multi-agent systems, negotiation, emergent behavior • Human-system interfaces • Integrating these technologies to support self-management in complex, realistic environments is a research challenge in itself • What are the best architectures and design patterns? Role of (multi-)agent systems? • Building system prototypes is key to developing and validating AC technology and architecture • What to do if you’re interested in working on these problems • Just go do it and publish your results • Find an IBM Researcher who is interested in collaborating with you (I can help) • Get them to help you pursue a faculty award or equipment grant • How can we establish a research community around autonomic computing? • International Conference on Autonomic Computing, May 17-18, 2004, New York City • Co-located with WWW 2004 • Co-chair: Manish Parashar • What about defining challenge problems? • We have developed several realistic industry scenarios that could serve as a basis

  43. Additional Information • A Vision of Autonomic Computing • IEEE Computer, January 2003 • IBM Systems Journal special issue on Autonomic Computing • http://www.research.ibm.com/journal/sj42-1.html • Web site • www.research.ibm.com/autonomic • International Conference on Autonomic Computing • www.autonomic-conference.org • May 17-18, New York City • Submission deadline: January 12, 2003

  44. Backup Slides

  45. Other Autonomic Computing Workshops and Conferences • First Workshop on Algorithms and Architectures for Self-Managing Systems (at FCRC ’03) • June 11, 2003 in San Diego, CA • 5th Annual International Conference on Active Middleware Services: Autonomic Computing Workshop • June 25, 2003 in Seattle, WA • IJCAI-03 AI and Autonomic Computing: Developing a Research Agenda for Self Managing Computer Systems • August 10, 2003 in Acapulco, Mexico • First International Workshop Autonomic Computing Systems at 14th International Conference on Database and Expert Systems Applications (DEXA'2003) • 1-5 September, 2003 in Prague, Czech Republic • 14th IFIP/IEEE International Workshop on Distributed Systems: Operations & Management (DSOM-03) • October 20-22, 2003 in Heidelberg, Germany

  46. Challenge: Putting it all together into a self-managing system: Autonomic Thermostat scenario • Controller • Locus of high-level policy optimization • Authority over thermostats in domain • Thermostat • Local knowledge of environment • Direct control of cooling mechanism • Varying degrees of sophistication • [Figure: one Controller above four Thermostats, each driving an AC Mech.]

  47. Scenario: Autonomic Thermostat • Controller policy: choose the temperature that maximizes U(Temperature) = Value(Temperature) – Cost(Temperature) • Value function: how much would you pay to get temperature T? • Cost function: how costly is it to attain temperature T? • [Figure: Value and Cost plotted against Temperature. Temperature axis ticks: 50, 72, 76, 90. Value axis labels: $80, $100. Cost axis labels: $25, $35, $100]
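
The controller policy U(T) = Value(T) – Cost(T) can be evaluated directly once both curves are available. The curves below are assumptions: they reuse the tick values that survive from the slide's plots (50, 72, 76, 90 on the temperature axis and the dollar labels), but the original curve shapes are not recoverable, so the piecewise-linear points are illustrative only.

```python
# Assumed piecewise-linear curves as (temperature, dollars) breakpoints.
VALUE_POINTS = [(50, 0.0), (72, 100.0), (76, 80.0), (90, 0.0)]
COST_POINTS = [(50, 100.0), (72, 35.0), (76, 25.0), (90, 0.0)]


def interpolate(points, x):
    """Linear interpolation between consecutive breakpoints."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("temperature outside the modeled range")


def net_utility(temperature):
    """U(T) = Value(T) - Cost(T)."""
    return interpolate(VALUE_POINTS, temperature) - interpolate(COST_POINTS, temperature)


def best_temperature(low=50, high=90):
    """Search whole-degree settings for the one with maximal net utility."""
    return max(range(low, high + 1), key=net_utility)


t_star = best_temperature()
print(t_star, net_utility(t_star))
```

Under these assumed curves the search settles on 72 degrees with a net utility of $65; steeper cooling costs would push the optimum toward warmer settings.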

  48. Scenario: Autonomic Thermostat • Controller determines the T* that maximizes V1(T) – C1(T) • [Figure labels: Controller; Policy Repos.; Power Co.; Thermostats, each with an AC Mech.; Value function (Value vs. Temperature; ticks 50, 72, 76, 90; $80, $100); cost curve C(kWh) (Cost vs. kWh; ticks 0, 5, 10); energy curve kWh(Tcurrent, Textern, T) (kWh vs. Temperature; 2.5, 3.5); exchanged quantities: “kWh1(T) ?”, “V1(T) – C1(T) ?”]

  49. Scenario: Autonomic Thermostat: Conflict Resolution • Priority Policies • 1. Abide by temp goal from entity with higher authority • 2. If (cost exceeds X) reset temp goal to affordable value • Action Policies • 1. If (in cooling mode && Tcurr < T* - d*) then turn AC off • 2. If (in cooling mode && Tcurr > T* + d*) then turn AC on • [Figure: the Controller (with its Value and Cost functions) issues Temp. goal = T* +/- d*; Man. Control issues Temp. goal = T’ +/- d’; both reach the Thermostat, which drives the AC Mech.]
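
The priority and action policies above translate almost directly into code. The sketch makes a few assumptions that the slide does not state: authority is a plain integer rank, the AC keeps its current state while the temperature is inside the band, and the function and parameter names are hypothetical.

```python
def resolve_goal(goals, cost, cost_limit, affordable_goal):
    """Apply the two priority policies.

    goals: list of (authority, t_goal, band); the highest authority wins.
    If the current cost exceeds cost_limit (the slide's X), the goal is
    reset to affordable_goal.
    """
    _, t_goal, band = max(goals, key=lambda g: g[0])
    if cost > cost_limit:
        t_goal = affordable_goal
    return t_goal, band


def action_policy(mode, t_curr, t_goal, band, ac_on):
    """Apply the two action policies; returns the new AC on/off state."""
    if mode == "cooling" and t_curr < t_goal - band:
        return False  # too cold: turn AC off
    if mode == "cooling" and t_curr > t_goal + band:
        return True   # too warm: turn AC on
    return ac_on      # inside the band: leave the AC as it is (assumed)


# Controller goal (authority 2) and a manual goal (authority 1), assumed ranks.
goals = [(2, 72.0, 1.0), (1, 68.0, 2.0)]
t_goal, band = resolve_goal(goals, cost=30.0, cost_limit=50.0, affordable_goal=76.0)
print(t_goal, band, action_policy("cooling", 74.0, t_goal, band, ac_on=False))
```

Keeping goal resolution separate from the on/off rules mirrors the slide's split between priority policies and action policies.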
