Extensible Scalable Monitoring for Clusters of Computers

This presentation discusses the challenges of monitoring a cluster of cooperating computers and presents solutions for extensible and scalable monitoring. It covers handling evolving software, detecting and recovering from failures, scaling data management, and scaling visualization. It also describes the implementation of the monitoring system, including relational tables, timestamps for weak synchronization, a hierarchical data access protocol, and statistical aggregation techniques.

Presentation Transcript


  1. Extensible Scalable Monitoring for Clusters of Computers • Eric Anderson • U.C. Berkeley • Summer 1997 NOW Retreat

  2. Overall Problem • Monitoring a cluster of cooperating computers • Different from client-server, where only servers matter • Requires substantial information from all machines • 100s to 1000s of nodes • Client-server becomes a subset of this problem

  3. Problems & Solutions • Cluster software and hardware are constantly evolving • Monitoring software must be extensible and flexible • Use relational tables • Failures will occur in the cluster • Monitoring software must detect and recover from failures • Use timestamps for weak synchronization • Scalability needed to hundreds of nodes • Need to efficiently transfer data from sources to sinks • Use hierarchy & hybrid push-pull protocol • Need to display statistics and information from all nodes • Use statistical aggregation + color and shade to minimize information loss

  4. Overview • Details of solutions • Handling evolving software • Detecting and recovering from failures • Scaling data management • Scaling visualization • Implementation • Architecture • Programs • Snapshot • Experience • Conclusion & Future Work

  5. Problem: Clusters Evolve • Solution: Relational tables • Increases flexibility by decoupling data users from data providers • Increases extensibility by structuring data into independent tables • Increases extensibility by allowing additional columns in tables without breaking old programs • Retains performance through transparent use of indices • Improvement over tree structures in previous systems
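The point about adding columns without breaking old programs can be illustrated with a small sketch. It uses Python's built-in SQLite module rather than the MiniSQL database named later in the talk, and the table and column names (`node_load`, `load_avg`, `updated_at`) are assumptions made up for the example, not the system's schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node_load (node TEXT, load_avg REAL)")
db.execute("INSERT INTO node_load (node, load_avg) VALUES ('n1', 0.5)")


def old_reader(conn):
    """A pre-existing data user: it names only the columns it needs."""
    return conn.execute(
        "SELECT node, load_avg FROM node_load ORDER BY node"
    ).fetchall()


before = old_reader(db)

# A data provider later extends the table with a new column.
db.execute("ALTER TABLE node_load ADD COLUMN updated_at REAL")
db.execute(
    "INSERT INTO node_load (node, load_avg, updated_at) "
    "VALUES ('n2', 1.5, 100.0)"
)

# The old reader runs unchanged and simply never sees the new column.
after = old_reader(db)
```

Because the reader selects columns by name, the schema change is invisible to it; a reader written against positional fields in a fixed record layout would not have this property.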

  6. Problem: Failures Occur • Solution: Use timestamps • Loss of periodic updates to timestamps allows remote nodes to detect failures • Timestamps allow weak synchronization between databases • Better availability during failures, simpler recovery • Timestamps allow stale data to be eliminated • Only requires that purges run periodically rather than relying on programs to clean up after themselves • Reasons 2 & 3 are useful even in normal operation
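A minimal sketch of two of the timestamp uses listed above, failure detection and purging of stale rows. The function names, the dictionary layout, and the timeout values are assumptions chosen for illustration; the talk does not specify them.

```python
def failed_nodes(last_seen, now, timeout):
    """Return nodes whose last timestamp update is older than `timeout`.

    `last_seen` maps a node name to the time of its latest update.
    """
    return sorted(
        node for node, stamp in last_seen.items() if now - stamp > timeout
    )


def purge_stale(rows, now, max_age):
    """Keep only rows whose `updated_at` timestamp is recent enough."""
    return [row for row in rows if now - row["updated_at"] <= max_age]


# Hypothetical data: n2 stopped refreshing its timestamp 35 seconds ago.
last_seen = {"n1": 100.0, "n2": 70.0}
down = failed_nodes(last_seen, now=105.0, timeout=30.0)

rows = [
    {"node": "n1", "load_avg": 0.5, "updated_at": 100.0},
    {"node": "n2", "load_avg": 1.5, "updated_at": 70.0},
]
fresh = purge_stale(rows, now=105.0, max_age=30.0)
```

Neither function needs cooperation from the program that wrote the data, which is the property the slide highlights: a crashed writer cannot clean up after itself, but a periodic purge still removes what it left behind.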

  7. Problem: Scalable Data Access • Solution: Hierarchy + efficient protocol • Hierarchy allows • Batching of data from different nodes (all data from routers) • Specialization to particular data (all data on processes) • Efficient protocol (hybrid of push/pull) • Sink sends (SQL select command, interval, count) to source • Changed data is extracted via SQL every interval seconds and forwarded to the sink count times • Sink can cancel requests at any time • Achieves the best of pull and push protocols in terms of wasted data transfers, freshness, and network bandwidth
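The hybrid protocol above might be sketched as follows. This is not the forwarder's actual code: the `Source` class, its method names, and the simulated clock passed to `tick` are all assumptions, SQLite stands in for the real database, and "count times" is read here as a budget of `count` polls.

```python
import sqlite3


class Source:
    """Hypothetical source side of the hybrid push/pull protocol."""

    def __init__(self, conn):
        self.conn = conn
        self.requests = {}
        self.next_id = 0

    def subscribe(self, select_sql, interval, count, deliver):
        """Register a sink's (select, interval, count) request."""
        request_id = self.next_id
        self.next_id += 1
        self.requests[request_id] = {
            "sql": select_sql, "interval": interval, "remaining": count,
            "deliver": deliver, "last_rows": set(), "next_due": 0.0,
        }
        return request_id

    def cancel(self, request_id):
        """The sink can withdraw a request at any time."""
        self.requests.pop(request_id, None)

    def tick(self, now):
        """Poll the database for every request that is due at `now`."""
        for request_id, req in list(self.requests.items()):
            if request_id not in self.requests or now < req["next_due"]:
                continue
            rows = set(self.conn.execute(req["sql"]).fetchall())
            changed = rows - req["last_rows"]  # forward only new or changed rows
            req["last_rows"] = rows
            req["next_due"] = now + req["interval"]
            req["remaining"] -= 1
            if changed:
                req["deliver"](sorted(changed))
            if req["remaining"] <= 0:
                self.requests.pop(request_id, None)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_load (node TEXT, load_avg REAL)")
conn.execute("INSERT INTO node_load VALUES ('n1', 0.5)")

received = []
source = Source(conn)
source.subscribe("SELECT node, load_avg FROM node_load", 5.0, 3, received.append)
source.tick(0.0)   # first poll: the row is new, so it is pushed
source.tick(5.0)   # nothing changed, so nothing is sent
conn.execute("UPDATE node_load SET load_avg = 0.9 WHERE node = 'n1'")
source.tick(10.0)  # the changed row is pushed; the request then expires
```

The sink pays for one request message and then receives data only when something changed, which is where the savings over pure polling come from; the count bound and `cancel` keep a forgotten request from pushing data indefinitely.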

  8. Problem: Scalable Visualization • Solution: Statistical aggregation + use of shade & color to minimize information loss • Aggregate across similar variables (average load of 10 machines); show dispersion (std. dev.) as shade • Aggregate across variables from one node (utilization = max{disk,network,cpu}) • Both forms of aggregation at the same time — hierarchical aggregation • Use color to draw attention to special things (nodes down) to limit visual overload
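The two forms of aggregation, and their combination, can be sketched in a few lines. The function names, the sample utilization values, and the mapping from standard deviation to a shade in [0, 1] are assumptions for illustration; the talk does not say how shade is computed.

```python
from statistics import mean, pstdev


def node_utilization(disk, network, cpu):
    """Aggregate across variables from one node: take the busiest resource."""
    return max(disk, network, cpu)


def aggregate_group(values):
    """Aggregate across similar variables: mean plus dispersion."""
    return mean(values), pstdev(values)


def shade(std_dev, max_std):
    """Map dispersion to a shade intensity between 0.0 and 1.0 (assumed scale)."""
    if max_std <= 0:
        return 0.0
    return min(std_dev / max_std, 1.0)


# Hypothetical per-node utilization fractions.
nodes = [
    {"disk": 0.25, "network": 0.125, "cpu": 0.0},
    {"disk": 0.5, "network": 0.75, "cpu": 0.25},
]

# Hierarchical aggregation: per-node max first, then statistics over the group.
utilizations = [node_utilization(**node) for node in nodes]
group_mean, group_std = aggregate_group(utilizations)
group_shade = shade(group_std, max_std=0.5)
```

Showing the standard deviation alongside the mean is what limits information loss: two groups with the same average load look different on screen if one of them hides a single overloaded machine.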

  9. Implementation Architecture • [Diagram; labeled components: gather, node-level DB, forwarder, joinpush, mid-level DB, top-level DB, javaserver, Java applet]

  10. Implementation Details • Databases are MiniSQL • Freely available with source code • Implements a subset of SQL • Forwarder implements the source part of the hybrid protocol • Uses polling to get data from the database • Joinpush implements the merging part of the hierarchy • Control of merge sources is external to the program • Both forwarder & joinpush implemented in threaded C • Simpler implementation for blocking operations • Could be merged in with the database

  11. Implementation Details, cont. • Gather implemented in Perl • Simpler to add new data sources, but would like threading • Somewhat inefficient, might re-implement in C • Javaserver implemented in Perl • Easier to extend with additional aggregation forms • Application-level proxy because Java can’t access the network • Javaclient implemented in Java • Allows clients to run in a browser anywhere in the world • Weak feedback to javaserver to control information displayed

  12. Implementation Snapshot

  13. Experience • Configuration information should be in the database • Had it in random files; database collects it together • Reset-world operation very important • Puts system in known state • Useful for default destination of statistics of remote database • Minimizes load on monitored nodes • Potentially reduces fault tolerance • Browser user interface very useful • Limitations of Java very obnoxious

  14. Conclusion • Four problems & solutions important for any cluster monitoring system • Evolution inherent in uses of clusters • Independent failures occur in all clusters • Scalability of data management needed for large clusters • Scalability of visualization also needed for large clusters • Implementation works and is initially useful; further deployment needed • Experience identified problems and places for improvement

  15. Future Work • Automatic identification of statistics relevant to problems • Expect to be able to use Boolean disjunction learning algorithms • Tracking of long-term trends and statistical measures • Self-tuning of specialized databases based on usage • Addition of notification, repair components • Gathering of more statistics (via SNMP, for example) • Distribution of system to external sites