Randy Schauer, Anupam Joshi
A Probabilistic Approach to Distributed System Management

Why is the management of large-scale distributed systems a problem?
• New High Performance Computing (HPC) clusters are already running over 100 TeraFLOPS (trillion floating-point operations per second) on a consistent basis, and the PetaFLOP era is near.
• Systems are becoming too large for system administrators to manage easily.

[Figure: BlueGene/L, 596 TFLOPS, LLNL, Livermore, CA]

How can this problem be solved?
• The system must be able to manage aspects of its configuration without using a central image master, relying only on the knowledge of its peers.
• The system must be able to understand and evaluate its operating environment to catch issues before they become catastrophic problems.

[Figure: LNXI ATC (MJM), 53 TFLOPS, ARL MSRC, Aberdeen, MD]

How can we determine the correct configuration in a distributed system?
• Large clusters require the various commodity components to be tied together operationally through software configurations, which makes it impossible to accurately model all possible configuration parameters.
• Because the space of possible configurations is effectively infinite and the optimal settings differ between environments, a statistical relational learning method is the preferred inference mechanism, specifically Markov Logic Networks.
• Markov Logic Networks provide a first-order predicate knowledge base with a weight applied to each formula, allowing for an initial set of conditions that capture the rules needed to make informed decisions.

File Access Permissions
• Comparisons required to ensure proper permissions for both security and access include majority rule, most restrictive, and time-based differences.
• A statistical approach to solving this issue takes known factors into account and weights them as appropriate, allowing us to minimize uncertainty and determine the most valid option.

Processor Heat Analysis
• Determine whether a processor is overheating by comparing its temperature with the temperatures reported by the neighboring nodes and by the nodes in the same rack location in neighboring racks.
• Nodes toward the middle and top tend to get hotter than nodes toward the outside and bottom.

So, what have we learned so far?
• The ability to diagnose and recover from performance and configuration issues without resorting to a centralized knowledge base is the next great stride in allowing systems to self-manage their reliability and stability.
• Preliminary results suggest that this is a good approach to using logic for probabilistic model-based diagnosis.
• The results are promising, especially for such a radical change in the approach to system management, but for production deployment, further refinement is necessary in order to obtain statistically significant results.

Research partially supported by the ARL MSRC (GSA Contract GS00T99ALD0209) & Raytheon
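
Note on Markov Logic Networks: the poster describes a knowledge base of first-order formulas, each carrying a weight. For reference, the standard definition from the Markov Logic Network literature (not stated on the poster itself) turns those weights into a probability over possible worlds. Here F_i are the formulas, w_i their weights, n_i(x) the number of true groundings of F_i in world x, and Z the normalizing constant:

```latex
\[
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_{i} w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_{i} w_i\, n_i(x')\Big)
\]
```

A world that violates a formula with a large weight becomes less probable rather than impossible, which is what lets conflicting rules coexist in one knowledge base.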
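
Note on file access permissions: the weighting idea can be illustrated with a small scoring sketch. This is not the authors' Markov Logic Network model; it is a simplified stand-in that scores each permission mode reported by peers using the three comparisons named on the poster. The `PeerReport` type, the three weight constants, and the use of "fewest permission bits set" as a proxy for "most restrictive" are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerReport:
    mode: int     # permission bits a peer holds for the file, e.g. 0o644
    mtime: float  # when that peer last saw the permissions change (epoch seconds)


# Hypothetical weights; a real system would learn or tune these.
W_MAJORITY = 2.0
W_RESTRICTIVE = 1.0
W_RECENT = 0.5


def bits_set(mode):
    """Count the permission bits that are granted (crude restrictiveness proxy)."""
    return bin(mode).count("1")


def score_modes(reports):
    """Return a score per distinct mode from majority, restrictiveness and recency."""
    counts = Counter(r.mode for r in reports)
    total = len(reports)
    newest = max(r.mtime for r in reports)
    newest_modes = {r.mode for r in reports if r.mtime == newest}
    fewest_bits = min(bits_set(m) for m in counts)

    scores = {}
    for mode, n in counts.items():
        score = W_MAJORITY * n / total      # majority rule: share of peers agreeing
        if bits_set(mode) == fewest_bits:   # most restrictive candidate
            score += W_RESTRICTIVE
        if mode in newest_modes:            # time-based: most recently changed
            score += W_RECENT
        scores[mode] = score
    return scores


def choose_mode(reports):
    """Pick the highest-scoring mode among the peer reports."""
    scores = score_modes(reports)
    return max(scores, key=scores.get)


reports = [
    PeerReport(0o644, 100.0),
    PeerReport(0o644, 110.0),
    PeerReport(0o644, 120.0),
    PeerReport(0o600, 130.0),
    PeerReport(0o666, 90.0),
]
print(oct(choose_mode(reports)))
```

With these particular weights, a mode that is both the most restrictive and the most recent can outrank a simple majority; changing the weights changes which factor dominates.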
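
Note on processor heat analysis: the neighbor comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes temperatures are available as a dictionary keyed by hypothetical `(rack, slot)` coordinates, that "neighbors" means the slots directly above and below in the same rack plus the same slot in the adjacent racks, and that a node is suspect when it exceeds the median of its neighbors by a hypothetical `margin_c` in degrees Celsius.

```python
from statistics import median


def neighbor_temps(temps, rack, slot):
    """Temperatures of the nodes above/below in the same rack and of the
    same slot in the two adjacent racks, skipping positions with no data."""
    candidates = [
        (rack, slot - 1),
        (rack, slot + 1),
        (rack - 1, slot),
        (rack + 1, slot),
    ]
    return [temps[c] for c in candidates if c in temps]


def is_overheating(temps, rack, slot, margin_c=8.0):
    """Flag a node whose temperature exceeds its neighbors' median by margin_c."""
    peers = neighbor_temps(temps, rack, slot)
    if not peers:
        return False  # nothing to compare against, so make no judgment
    return temps[(rack, slot)] - median(peers) > margin_c


# Illustrative readings in degrees Celsius, keyed by (rack, slot).
temps = {
    (0, 0): 50.0, (0, 1): 52.0, (0, 2): 55.0,
    (1, 0): 51.0, (1, 1): 70.0, (1, 2): 56.0,
    (2, 0): 50.0, (2, 1): 53.0, (2, 2): 54.0,
}
print(is_overheating(temps, 1, 1))
print(is_overheating(temps, 1, 2))
```

Comparing against nearby nodes rather than a fixed threshold is one way to respect the observation that nodes toward the middle and top run hotter: a warm node is only flagged when it is warm relative to peers in a similar position.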