CompSci 296.2 Self-Managing Systems. Shivnath Babu. Today. Wrap up sample projects ROC discussion. Sample Projects. NIMO Fa Combining structured & unstructured data Projects using Nagios Projects using IBM autonomic computing toolkit.
CompSci 296.2 Self-Managing Systems Shivnath Babu
Today
• Wrap up sample projects
• ROC discussion
Sample Projects
• NIMO
• Fa
• Combining structured & unstructured data
• Projects using Nagios
• Projects using IBM autonomic computing toolkit
NIMO: NonInvasive Modeling for Optimization
• Build performance models for scientific apps
  • Automatic, online, and noninvasive
• Projects
  • Study many scientific apps (e.g., 140 bio apps in BioPortal): characterize behavior, identify good models
  • “Steal app”, build and refine model
  • Incorporate NIMO in a “grid” scheduler (Condor, Globus)
  • Optimization problems in scheduling workflows
Fa
• Testbed to study:
  • Whether we can automate problem prediction, diagnosis
  • Relationship among problems, causes, data, & models
• Projects
  • Models for predicting performance problems (online)
  • Models and mechanisms for root-cause queries
  • Others
Structured and Unstructured Data
• Combined querying/mining of structured and unstructured system data
  • Structured data: time series of CPU utilization
  • Unstructured data (free text): System error log
• Ex: Characterize system state when a specific error occurs
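A minimal sketch of the example on this slide: for each error-log line that mentions a given error, look up the CPU reading in effect at that moment. The sample data, the keyword, and the helper names `cpu_at` and `cpu_when` are all hypothetical and only illustrate the idea of joining the two kinds of data on time.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical structured data: (timestamp_seconds, cpu_utilization_percent)
cpu_samples = [(0, 20.0), (60, 35.0), (120, 90.0), (180, 95.0), (240, 40.0)]

# Hypothetical unstructured data: (timestamp_seconds, free-text log line)
error_log = [
    (130, "ERROR disk write timeout on /dev/sda"),
    (150, "WARN retrying connection"),
    (200, "ERROR disk write timeout on /dev/sda"),
]

def cpu_at(samples, t):
    """Return the most recent CPU reading at or before time t, or None."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None

def cpu_when(samples, log, keyword):
    """CPU readings in effect at each log line whose text contains keyword."""
    readings = [cpu_at(samples, t) for t, text in log if keyword in text]
    return [r for r in readings if r is not None]

readings = cpu_when(cpu_samples, error_log, "disk write timeout")
print("CPU at matching errors:", readings, "mean:", mean(readings))
```

A real project would likely replace the substring match with proper text mining and use more than one metric, but the time-based join stays the same.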
Add New Features to Current Systems
• Add problem-prediction capability to Nagios
• Add root-cause querying to Nagios
• Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework
• Remember the “mechanism projects”
  • Undo, virtualization, active probing
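One possible shape for the problem-prediction project, as a sketch only: Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so a predictive check could fit a linear trend to recent samples and alert when a limit is expected to be reached soon. The function name `predict_check`, its parameters, and the sample data are hypothetical; the linear trend is an assumed stand-in for whatever model a project would actually build.

```python
# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def predict_check(samples, limit, warn_horizon, crit_horizon):
    """Predict when a metric will reach limit and map that to a status.

    samples: list of (time, value) pairs, oldest first.
    warn_horizon / crit_horizon: alert if the limit is expected to be
    reached within this many time units (crit_horizon < warn_horizon).
    Returns (status_code, message).
    """
    if len(samples) < 2:
        return UNKNOWN, "UNKNOWN - need at least two samples"
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return UNKNOWN, "UNKNOWN - all samples share one timestamp"
    # Least-squares slope of value over time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    _, last_value = samples[-1]
    if last_value >= limit:
        return CRITICAL, "CRITICAL - limit already reached"
    if slope <= 0:
        return OK, "OK - no upward trend"
    # Extrapolate from the latest reading using the fitted slope.
    time_left = (limit - last_value) / slope
    if time_left <= crit_horizon:
        return CRITICAL, f"CRITICAL - limit expected in {time_left:.1f}"
    if time_left <= warn_horizon:
        return WARNING, f"WARNING - limit expected in {time_left:.1f}"
    return OK, f"OK - limit expected in {time_left:.1f}"

# Hypothetical disk usage in percent, sampled once per hour.
usage = [(0, 50.0), (1, 60.0), (2, 70.0), (3, 80.0)]
status, message = predict_check(usage, limit=100.0, warn_horizon=6, crit_horizon=3)
print(status, message)
```

To run this as an actual plugin, a small wrapper would print the message and exit with the returned status code.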
ROC: Recovery-Oriented Computing
• Complaints about current systems
  • Focus only on performance; availability & maintainability are neglected
  • Focus on mean time to failure (MTTF) of individual components; mean time to repair (MTTR) is neglected
  • MTTF of system << MTTF of individual components
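The last complaint can be made concrete with a small calculation. The sketch below assumes a system that fails as soon as any one component fails, and independent components with constant failure rates; neither assumption is stated on the slide. Under those assumptions the component failure rates add, so the system MTTF shrinks roughly in proportion to the number of components. The function name and the disk figures are hypothetical.

```python
def series_system_mttf(component_mttfs):
    """MTTF of a system that fails as soon as any one component fails.

    Assumes independent components with constant (exponential) failure
    rates, so the failure rates 1/MTTF_i simply add up.
    """
    total_failure_rate = sum(1.0 / m for m in component_mttfs)
    return 1.0 / total_failure_rate

# Hypothetical cluster: 1,000 disks, each rated at 1,000,000 hours MTTF.
disks = [1_000_000.0] * 1000
system_hours = series_system_mttf(disks)
print(f"system MTTF is about {system_hours:.0f} hours")
```

With these made-up numbers, components rated at over a century each yield a system that is expected to fail roughly every 1,000 hours, or about six weeks.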
ROC Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”)
• People/HW/SW failures are facts, not problems
• Recovery/repair is how we cope with the above facts
ROC focus is on fast repair vs. the old focus on longer time between failures
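Why fast repair can matter as much as rare failure: with the common steady-state definition availability = MTTF / (MTTF + MTTR), cutting MTTR by a factor of ten buys the same availability as raising MTTF by a factor of ten. The sketch below uses that definition with hypothetical numbers; the slide itself does not give a formula.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical baseline: fails every 1,000 hours, takes 10 hours to repair.
base = availability(1000.0, 10.0)

# Old focus: make failures ten times rarer.
longer_mttf = availability(10_000.0, 10.0)

# ROC focus: make repair ten times faster.
faster_repair = availability(1000.0, 1.0)

print(f"baseline      {base:.6f}")
print(f"10x MTTF      {longer_mttf:.6f}")
print(f"10x less MTTR {faster_repair:.6f}")
```

The two improved cases come out the same, which is the point: if failures are facts, shrinking repair time is an equally valid lever.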
ROC Principles
• Recovery experiments: benchmarking recovery
• Pinpoint: Automatic problem diagnosis
• Recursive restart: Innovative use of reboot
• App and system undo
• Defense in depth: ROC at hardware level
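A rough sketch of the recursive-restart idea: components are arranged in a tree, restarting a node restarts its whole subtree, and recovery starts with the smallest restart and escalates to the parent only if the problem persists. This is an illustration under assumptions, not the ROC project's implementation; the `Node` class, the `recover` function, and the `cured_by` oracle (a stand-in for a real health probe) are hypothetical.

```python
class Node:
    """A component in a restart tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def restart(self):
        """Restart this node and its whole subtree; return restarted names."""
        names = [self.name]
        for child in self.children:
            names.extend(child.restart())
        return names

def recover(node, cured_by):
    """Restart node; while the fault persists, escalate to the parent.

    cured_by(names) is a hypothetical oracle that says whether restarting
    the given components cleared the fault.
    Returns the list of restart attempts, smallest first.
    """
    attempts = []
    current = node
    while current is not None:
        restarted = current.restart()
        attempts.append(restarted)
        if cured_by(restarted):
            break
        current = current.parent
    return attempts

# Hypothetical system: the symptom shows up in "db", but the fault is
# corrupted state in "backend", so only a backend restart clears it.
db, cache = Node("db"), Node("cache")
backend = Node("backend", [db, cache])
system = Node("system", [Node("frontend"), backend])

attempts = recover(db, lambda names: "backend" in names)
print(attempts)
```

Here the cheap restart of `db` is tried first, and the full-system reboot is avoided because restarting the `backend` subtree is enough.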
Discussion
• Strong point: comprehensive, relates to other fields
  • Margin of safety for systems
    • Current examples?
    • How to incorporate?
• Negative point: evolution vs. revolution?
  • What approach is the project taking?
• At what level should we support undo?
  • Transaction, application, system
  • Pros and cons
• Benchmarking availability/recovery (TOC?)
  • How can you claim that a system is 99.999% available?
• Dealing with the automation irony
  • Fire drills
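To see why the 99.999% question is hard, it helps to convert an availability target into a downtime budget. The sketch below assumes a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def downtime_budget_minutes(availability):
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    budget = downtime_budget_minutes(target)
    print(f"{target:.5f} available: {budget:8.2f} minutes/year")
```

Five nines leaves a little over five minutes of downtime per year, so a single slow recovery can use up the whole budget. Backing such a claim needs measured recovery times, not just failure counts.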