Clusters, Fault Tolerance, and Other Thoughts
Daniel S. Katz
JPL/Caltech
SOS7 Meeting
4 March 2003

Cluster 2002
http://www.mcs.anl.gov/cluster2002/
• 2002 IEEE International Conference on Cluster Computing, Chicago, 23-26 Sep. 2002
• Next 2 meetings are:
  • December 2003 in Hong Kong
  • September 2004 in San Diego
• Of the 284 attendees at Cluster 2002 and 120 at SOS7, 23 are common to both meetings
• Motivation:
  • The series of conferences and their sponsor, the Task Force for Cluster Computing (TFCC), were created to:
    • Bring together the cluster community
    • Establish best practices
    • Provide educational material
    • Cross-fertilize ideas between industry and academia

Cluster 2002 Topics
• Running a cluster and making it usable
  • Software for management, including configuration
  • Middleware software
• Building a cluster
  • Software and hardware for networking
  • Choosing node hardware
  • Packaging hardware
• Making use of a cluster
  • New and innovative applications

Cluster 2002 Results and Conclusions
• Positives:
  • Software tools are getting better - management, configuration and administration
  • Interesting and promising work ongoing in:
    • Self-tuning software
    • Component redundancy
    • Applications
  • Clusters are enabling platforms due to low entry cost
• Negatives:
  • Large (possibly heterogeneous) systems are not easy to build or maintain
  • Systems administration is normally underestimated and un(der)funded
  • Component failure in large systems can be a problem
• Other:
  • Clusters are good for work for which we know they are good
  • Minimum cost clusters can handle some jobs well
  • Should design and build cluster to suit application needs

FALSE 2002
http://false2002.vanderbilt.edu/
• Workshop on Fault-Adaptive Large-Scale Real-Time Systems
• Held at Vanderbilt, 14-15 Nov. 2002
• Sponsored by NSF ITR Project: BTeV Real Time Embedded Systems
• Of the 42 attendees at FALSE 2002 and 120 attendees at SOS7, 2 are common to both meetings (Tony Skjellum and I)
• Motivation:
  • High Energy Physics community wants to build systems to monitor experiments
  • Others (DARPA, NASA) have an interest in similar systems
  • An occasion to share knowledge and plan future research
• Topics:
  • Scaling fault tolerance up to large systems (the Fermi system will have 2-5K PEs)
  • Novel approaches to achieving fault tolerance at low cost (< 10% overhead)
  • How to make fault responses domain-specific (tools that enable the user to specify the response to different failures, and to implement these responses throughout the system)
• Results/Consensus
  • No results from this initial meeting; just information sharing (w/ complete consensus)

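The third workshop topic, letting the user specify the response to different failures and applying it throughout the system, might look roughly like the sketch below. This is only an illustration of the idea, not anything presented at FALSE 2002: the fault kinds, the `FaultPolicy` class, and the node names are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical fault kinds; the slides do not name specific failure classes.
NODE_CRASH = "node_crash"
LINK_TIMEOUT = "link_timeout"

# A handler takes the name of the affected node and returns the action taken.
Handler = Callable[[str], str]


class FaultPolicy:
    """Maps a fault kind to a user-specified response (illustrative only)."""

    def __init__(self, default: Handler) -> None:
        self._default = default
        self._handlers: Dict[str, Handler] = {}

    def on(self, kind: str, handler: Handler) -> None:
        """Register the domain-specific response for one kind of fault."""
        self._handlers[kind] = handler

    def respond(self, kind: str, node: str) -> str:
        """Apply the registered response, or the default if none is registered."""
        return self._handlers.get(kind, self._default)(node)


def build_example_policy() -> FaultPolicy:
    """Assumed example policy: reassign on crash, retry on timeout, else log."""
    policy = FaultPolicy(default=lambda node: f"log fault on {node}")
    policy.on(NODE_CRASH, lambda node: f"reassign work from {node}")
    policy.on(LINK_TIMEOUT, lambda node: f"retry link to {node}")
    return policy


def respond_everywhere(policy: FaultPolicy, kind: str, nodes: List[str]) -> List[str]:
    """Apply the same policy to every affected node in the system."""
    return [policy.respond(kind, node) for node in nodes]
```

Keeping the policy separate from the code that detects faults is the design choice being illustrated: the same detection layer could then serve different domains by swapping in a different set of handlers.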
General Thoughts
• Fault-Tolerance is becoming important to large-scale systems
  • Embedded and non-embedded systems
  • Real-time and non-real-time systems
• Is there a common solution (or partial solution) to this issue?
  • “There is no software problem an additional layer of abstraction won’t solve”

Thanks
• Questions?