
CompSci 296.2 Self-Managing Systems

CompSci 296.2 Self-Managing Systems, by Shivnath Babu. Today: wrap up sample projects; ROC discussion. Sample projects: NIMO; Fa; combining structured & unstructured data; projects using Nagios; projects using the IBM autonomic computing toolkit.


Presentation Transcript


  1. CompSci 296.2 Self-Managing Systems Shivnath Babu

  2. Today • Wrap up sample projects • ROC discussion

  3. Sample Projects • NIMO • Fa • Combining structured & unstructured data • Projects using Nagios • Projects using IBM autonomic computing toolkit

  4. NIMO: NonInvasive Modeling for Optimization • Build performance models for scientific apps • Automatic, online, and noninvasive • Projects • Study many scientific apps (e.g., 140 bio apps in BioPortal) → characterize behavior, good models • “Steal app”, build and refine model • Incorporate NIMO in a “grid” scheduler (Condor, Globus) • Optimization problems in scheduling workflows

  5. Fa • Testbed to study: • Whether we can automate problem prediction, diagnosis • Relationship among problems, causes, data, & models • Projects • Models for predicting performance problems (online) • Models and mechanisms for root-cause queries • Others

  6. Structured and Unstructured Data • Combined querying/mining of structured and unstructured system data • Structured data: time series of CPU utilization • Unstructured data (free text): System error log • Ex: Characterize system state when a specific error occurs
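The example on this slide (characterizing system state when a specific error occurs) can be sketched as a small join between the two kinds of data. This is only an illustration under stated assumptions: the sample CPU readings, the log lines, the log format (a leading integer timestamp), and the helper names `error_times`, `cpu_at`, and `characterize` are all hypothetical and do not come from the lecture.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical structured data: (timestamp_seconds, cpu_utilization_percent)
cpu_series = [(0, 20.0), (10, 35.0), (20, 90.0), (30, 95.0), (40, 30.0), (50, 25.0)]

# Hypothetical unstructured data: free-text log lines with a leading timestamp
error_log = [
    "5 INFO service started",
    "22 ERROR disk write timeout on /dev/sda",
    "33 ERROR disk write timeout on /dev/sda",
    "45 WARN retrying connection",
]


def error_times(log_lines, pattern):
    """Return the timestamps of log lines whose message contains `pattern`."""
    times = []
    for line in log_lines:
        stamp, _, message = line.partition(" ")
        if pattern in message:
            times.append(int(stamp))
    return times


def cpu_at(series, t):
    """Return the most recent CPU sample at or before time t, or None."""
    stamps = [s for s, _ in series]
    i = bisect_right(stamps, t)
    return series[i - 1][1] if i > 0 else None


def characterize(series, log_lines, pattern):
    """Compare CPU utilization at error time with the overall average."""
    times = error_times(log_lines, pattern)
    at_error = [v for v in (cpu_at(series, t) for t in times) if v is not None]
    return {
        "occurrences": len(times),
        "mean_cpu_at_error": mean(at_error) if at_error else None,
        "mean_cpu_overall": mean(v for _, v in series),
    }


summary = characterize(cpu_series, error_log, "disk write timeout")
print(summary)
```

Substring matching stands in for real text mining here; a project would likely replace it with something richer, while the time-based join to the structured series stays the same.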

  7. Add New Features to Current Systems • Add problem-prediction capability to Nagios • Add root-cause querying to Nagios • Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework • Remember the “mechanism projects” • Undo, virtualization, active probing

  8. ROC: Recovery-Oriented Computing • Complaints about current systems • Focus only on performance → availability & maintainability are neglected • Focus on MTTF of individual components → MTTR is neglected • MTTF of system << MTTF of individual components
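The last two complaints can be made concrete with a little arithmetic. The sketch below is not from the lecture and rests on simplifying assumptions: components fail independently with constant failure rates, and the system is down whenever any one component is down (a series system). Under those assumptions the system failure rate is the sum of the component rates, and steady-state availability is MTTF / (MTTF + MTTR). The component count and the hour figures are hypothetical.

```python
def series_mttf(component_mttfs):
    """MTTF of a series system: failure rates (1/MTTF) add up.

    Assumes independent components with constant failure rates.
    """
    return 1.0 / sum(1.0 / m for m in component_mttfs)


def availability(mttf, mttr):
    """Steady-state availability from mean time to failure and to repair."""
    return mttf / (mttf + mttr)


# Hypothetical system: 100 components, each with an MTTF of 100,000 hours.
components = [100_000.0] * 100
system_mttf = series_mttf(components)  # roughly 1,000 hours

# Cutting repair time from 1 hour to 6 minutes improves availability
# as much as a tenfold increase in MTTF would.
slow_repair = availability(system_mttf, mttr=1.0)
fast_repair = availability(system_mttf, mttr=0.1)
longer_mttf = availability(10 * system_mttf, mttr=1.0)

print(f"system MTTF: {system_mttf:.0f} h")
print(f"availability, MTTR 1 h:       {slow_repair:.6f}")
print(f"availability, MTTR 0.1 h:     {fast_repair:.6f}")
print(f"availability, 10x MTTF, 1 h:  {longer_mttf:.6f}")
```

The point of the comparison is that availability depends on the ratio of MTTR to MTTF, so faster repair is a lever of the same strength as longer time between failures.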

  9. ROC Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”) • People/HW/SW failures are facts, not problems • Recovery/repair is how we cope with the above facts • ROC focus is on fast repair vs. the old focus on longer time between failures

  10. ROC Principles • Recovery experiments: benchmarking recovery • Pinpoint: Automatic problem diagnosis • Recursive restart: Innovative use of reboot • App and system undo • Defense in depth: ROC at hardware level
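One way to picture the recursive restart principle is as escalation up a tree of restartable components: restart the smallest component suspected of failing, and move to its parent only if the system is still unhealthy. The sketch below is an assumption about how that could look, not the ROC project's implementation; the `Node` class, the `recursive_restart` function, the component names, and the simulated fault are all hypothetical.

```python
class Node:
    """A restartable component; `parent` is the enclosing restart unit."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def recursive_restart(node, restart, is_healthy):
    """Restart `node`, escalating to ancestors until the system is healthy.

    Returns the names of the components restarted, in order.
    """
    attempted = []
    current = node
    while current is not None:
        restart(current)
        attempted.append(current.name)
        if is_healthy():
            break
        current = current.parent
    return attempted


# Hypothetical restart tree: system -> web_tier -> servlet
system = Node("system")
web_tier = Node("web_tier", parent=system)
servlet = Node("servlet", parent=web_tier)

# Simulated fault that is cured only by restarting the web tier.
state = {"healthy": False}


def restart(component):
    if component.name == "web_tier":
        state["healthy"] = True


path = recursive_restart(servlet, restart, lambda: state["healthy"])
print(path)
```

Starting at the leaf keeps the common case cheap, because a small restart is tried first and a full reboot is the last resort.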

  11. Discussion • Strong point: Comprehensive, relate to other fields • Margin of safety for systems • Current examples? • How to incorporate? • Negative point: Evolution Vs. revolution? • What approach is the project taking? • At what level should we support Undo? • Transaction, application, system • Pros and cons • Benchmarking availability/recovery (TOC?) • How can you claim that a system is 99.999% available? • Dealing with the automation irony • Fire drills
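For the question about claiming 99.999% availability, it helps to translate the percentage into a downtime budget and compare it with measured outages. The sketch below is an illustration only: it assumes a 365-day year, treats availability as the fraction of time the system is up, and uses hypothetical function names and outage durations.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_minutes


def measured_availability(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability observed over a period, given outage durations."""
    return 1.0 - sum(outage_minutes) / period_minutes


budget = allowed_downtime_minutes(0.99999)  # about 5.26 minutes per year

# Hypothetical year with two outages of 3 and 4 minutes.
observed = measured_availability([3.0, 4.0])

print(f"five-nines budget: {budget:.3f} min/year")
print(f"observed availability: {observed:.6f}")
print("meets five nines" if observed >= 0.99999 else "misses five nines")
```

A budget of about five minutes a year shows why the claim is hard to support by observation alone: a single slow recovery consumes it, which ties the question back to benchmarking recovery.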
