MRNet: From Scalable Performance to Scalable Reliability
Dorian C. Arnold
University of Wisconsin-Madison
Paradyn/Condor Week, April 14-16, 2004, Madison, WI
More HPC Facts
• Statistics from the Top500 list:
  • 24% of systems have ≥ 512 processors
  • 10% of systems have ≥ 1,024 processors
  • 9 systems have ≥ 4,096 processors
  • The largest system has 8,192 processors
• By 2009, the 500th entry will be faster than today's #1
• Bottom line: HPC systems with many thousands of nodes will soon be the standard.
Applications Must Address Scalability!
• Challenge 1: Scalable Performance
  • Provide distributed tools with a mechanism for scalable, efficient group communications and data analyses:
    • Scalable multicast
    • Scalable reductions
    • In-network data aggregations
Applications Must Address Scalability!
• Scalability necessitates reliability!
• Challenge 2: Scalable Reliability
  • Provide mechanisms for reliability in our large-scale environment that do not degrade scalability:
    • Scalable multicast
    • Scalable reductions
    • In-network data aggregations
Target Applications
• Distributed tools and debuggers
  • Paradyn, Tau, PAPI's Perfometer, …
• Grid and distributed middleware
  • Condor, Globus
• Cluster and system monitoring applications
• Distributed shell for command-line tools
Goal: Provide a generic scaling mechanism for monitoring, control, troubleshooting, and general middleware components for Grid infrastructures.
Challenge 1: Scalable Performance
• Problem: Centralization leads to poor scalability
  • Communication overhead does not scale.
  • Data analyses are restricted to the front-end.
[Figure: a tool front-end connected directly to back-ends BE0 … BEn-1, each producing data a0 … an-1]
MRNet: Solution to Scalable Tool Performance
• Multicast/Reduction Network
  • Scalable data multicast and reduction operations.
  • In-network data aggregations.
[Figure: a tree of internal processes between the tool front-end and back-ends BE0 … BEn-1]
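To make the in-network aggregation idea concrete, here is a minimal C++ sketch of the kind of reduction an internal tree process could apply to the packets arriving from its children; the Packet type, the sum operation, and the function name are illustrative assumptions, not the actual MRNet filter API.

#include <vector>

// Hypothetical packet carrying one value per child (e.g., a CPU-utilization sample).
struct Packet { double value; };

// Reduce the packets arriving from this process's children into the single
// packet forwarded to the parent, so the front-end receives one aggregated
// value per internal node instead of one value per back-end.
Packet reduce_children(const std::vector<Packet>& from_children) {
    Packet out{0.0};
    for (const Packet& p : from_children)
        out.value += p.value;   // example aggregation: sum; min/max/avg work the same way
    return out;
}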
Paradyn/MRNet Integration
• Scalable start-up
• Broadcast metric data to daemons
• Gather daemon data at the front-end
• Front-end/daemon clock skew detection
• Performance data aggregation
• Time-based synchronization
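One of the integration items above, performance data aggregation with time-based synchronization, can be illustrated with a small hedged C++ sketch that bins per-daemon (timestamp, value) samples into fixed intervals and sums samples landing in the same bin; the types and binning scheme are assumptions for illustration, not Paradyn's actual implementation.

#include <map>
#include <vector>

struct Sample { double time_sec; double value; };

// Sum all samples that fall into the same time interval so the front-end
// sees one aggregated curve rather than one curve per daemon.
std::map<long, double> aggregate_by_interval(const std::vector<Sample>& samples,
                                             double interval_sec) {
    std::map<long, double> bins;   // interval index -> aggregated value
    for (const Sample& s : samples)
        bins[static_cast<long>(s.time_sec / interval_sec)] += s.value;
    return bins;
}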
Paradyn Data Aggregation (32 metrics)
MRNet References
• Technical papers:
  • Roth, Arnold, and Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools", SC2003 (Phoenix, AZ, November 2003).
  • Roth, Arnold, and Miller, "Benchmarking the MRNet Distributed Tool Infrastructure: Lessons Learned", High-Performance Grid Computing Workshop, held in conjunction with IPDPS 2004 (Santa Fe, New Mexico, April 2004).
Scalable Performance Achieved: What Next?
• More, and increasingly complex, components in large-scale systems.
• With independent per-node failures, the failure rate grows roughly linearly with node count: a system with 10,000 nodes is about 100 times more likely to fail than one with 100 nodes.
Challenge 2: Scalable Reliability
• Goals:
  • Design scalable reliability mechanisms for communication infrastructures with reduction operations and in-network data aggregations.
  • Develop a quantitative understanding of the scalability trade-offs among different levels of resiliency and reliability.
Challenge 2: Scalable Reliability
• Reliability vs. resiliency:
  • A reliable system executes correctly in the presence of (tolerated) failures.
  • A resilient system recovers to a mode in which it can once again execute correctly.
  • During a failure, errors are visible at the system interface level.
Challenge 2: Scalable Reliability
• Problem:
  • Scalability → decentralization, low overhead
    • Scalability wants simple systems.
  • Reliability → consensus, convergence, high overhead
    • Reliability wants complex systems.
• How can we leverage our tree-based topology to achieve scalable reliability?
Recovery Models and Semantics
• Fault model: crash-stop failures
• TCP-like reliability for tree-based multicast and reduction operations
• System should tolerate any and all internal-node failures
  • System slowly degrades to a flat topology
• Models based on operational complexity
  • E.g., are in-network filters stateful?
Recovery Models and Semantics: Challenges
• Detecting loss, duplication, and ordering
• Quick recovery from message loss
• Correct recovery from failure
• Recovery of state information from aggregation operations
• Simultaneous failures
• Validation of our scalability methodology
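A minimal C++ sketch of how per-stream detection of loss, duplication, and ordering might work with sequence numbers (the "TCP-like" reliability named earlier); the names and classification are illustrative assumptions, not MRNet's actual wire protocol.

#include <cstdint>

enum class Delivery { InOrder, Duplicate, Gap };

// Per-stream receiver state: classify each arriving message by its sequence number.
struct StreamState {
    uint32_t next_expected = 0;

    Delivery classify(uint32_t seq) {
        if (seq == next_expected) { ++next_expected; return Delivery::InOrder; }
        if (seq < next_expected)  return Delivery::Duplicate;  // already delivered
        return Delivery::Gap;  // one or more messages lost or reordered
    }
};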
Challenge 2: Scalable Reliability
Hypothesis: Aggregating control messages can effectively achieve scalable, reliable systems.
Example: Scalable Failure Detection
• Goal: A scalable failure-detection service with high rates of convergence.
• Previous work suffers from:
  • Non-scalable overhead
  • Poor convergence properties
  • Non-deterministic guarantees
  • Costly assumptions
    • E.g., fully connected meshes
Failure Detection Approaches
• Gossip-style failure detection and propagation
  • Gupta et al., van Renesse et al.
Failure Detection Approaches
• Hierarchical heartbeat detection and propagation
  • Felber et al., Overcast, Grid monitoring
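The following hedged C++ sketch illustrates hierarchical heartbeat detection at a parent node: a child is declared failed when no heartbeat has arrived within a timeout. The timeout value and bookkeeping are assumptions for illustration only.

#include <chrono>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct HeartbeatMonitor {
    std::chrono::seconds timeout{5};               // assumed detection timeout
    std::map<int, Clock::time_point> last_seen;    // child id -> last heartbeat time

    void heartbeat(int child) { last_seen[child] = Clock::now(); }

    // Children whose last heartbeat is older than the timeout are reported
    // (and, in the hierarchical scheme, propagated up toward the root).
    std::vector<int> failed_children() const {
        std::vector<int> failed;
        const auto now = Clock::now();
        for (const auto& [child, t] : last_seen)
            if (now - t > timeout) failed.push_back(child);
        return failed;
    }
};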
Scalable Failure Detection
• Tracking senders in an aggregated message:
• Naïve approaches:
  • Append a 32/64-bit source ID for each source
    • Pathological case: many senders
  • Bit-array where bits represent potential sources
    • Pathological case: many potential sources, few actual senders
• Our approach:
  • Variable-size bit-array
    • The number of bits varies with the number of descendants beneath the intermediate node (i.e., with depth in the topology)
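A hedged C++ sketch of the variable-size bit-array idea: each internal node keeps one bit per descendant leaf, ORs in the arrays reported by its children, and forwards a single array upward. The layout and offset bookkeeping are illustrative assumptions, not MRNet's actual encoding.

#include <cstddef>
#include <vector>

struct AggregatedHeartbeat {
    std::vector<bool> alive;   // one bit per descendant leaf of this node

    explicit AggregatedHeartbeat(std::size_t num_descendants)
        : alive(num_descendants, false) {}

    // Merge a child's aggregated array; 'offset' is where that child's
    // descendants begin in this node's leaf ordering.
    void merge(const AggregatedHeartbeat& child, std::size_t offset) {
        for (std::size_t i = 0; i < child.alive.size(); ++i)
            if (child.alive[i]) alive[offset + i] = true;
    }
};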
Scalable Failure Detection
• Hierarchical heartbeats/propagation (with message aggregation):
[Figure: bit-arrays of liveness bits aggregated as heartbeats move up the tree]
Scalable Failure Detection
• Study the scalability and convergence implications of our failure-detection protocol.
• In theory, for a tree of fan-out n and height h:
  • Pure hierarchical: msgs = n^h × h
  • Hierarchical with aggregation: msgs = (n^(h+1) − 1)/(n − 1) − 1
• Example, n = 8, h = 4 (4,096 leaves):
  • Pure hierarchical: 16,384 msgs
  • With aggregation: 4,680 msgs
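The two counts above can be checked with a few lines of C++; with aggregation, each tree edge carries exactly one combined message, so the count is simply the number of non-root nodes.

#include <cstdio>

int main() {
    const long n = 8, h = 4;                 // fan-out 8, height 4: 8^4 = 4096 leaves

    long pow_n_h = 1, pow_n_h1 = 1;
    for (int i = 0; i < h; ++i)     pow_n_h  *= n;    // n^h
    for (int i = 0; i < h + 1; ++i) pow_n_h1 *= n;    // n^(h+1)

    long pure       = pow_n_h * h;                    // every leaf heartbeat forwarded level by level
    long aggregated = (pow_n_h1 - 1) / (n - 1) - 1;   // one combined message per tree edge

    std::printf("pure hierarchical: %ld msgs\n", pure);        // 16384
    std::printf("with aggregation:  %ld msgs\n", aggregated);  // 4680
    return 0;
}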
Scalable Event Propagation
• Implement a generic event-propagation service
  • Encode events into 1-byte codes
  • Combine with the aggregation protocol for low-overhead control messages
  • Piggyback control messages on data messages
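A minimal C++ sketch of piggybacking 1-byte event codes on an outgoing data message so that control traffic adds no extra packets; the event codes and message layout are assumptions for illustration.

#include <cstdint>
#include <vector>

// Hypothetical 1-byte event codes.
enum EventCode : std::uint8_t { NODE_FAILED = 1, NODE_RECOVERED = 2, TOPOLOGY_CHANGED = 3 };

struct Message {
    std::vector<std::uint8_t> events;    // piggybacked control events, 1 byte each
    std::vector<std::uint8_t> payload;   // ordinary tool data

    void piggyback(EventCode e) { events.push_back(static_cast<std::uint8_t>(e)); }
};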
Summary
• MRNet provides tools and grid services with scalable communications and data analyses.
• We are studying techniques to provide high degrees of reliability at large scales.
• MRNet website: http://www.paradyn.org/mrnet
darnold@cs.wisc.edu