Challenges and Successes in MRNet Matthew LeGendre & Madhavi Krishnan
MRNet Refresher • [Diagram: a front-end (FE) at the root, a tree of communication processes (CP), each running a packet filter, and back-ends (BE) at the leaves]
MRNet Goals • Communications Network for Tools • Scalable - 212,992 nodes at LLNL’s BG/L • Multi-platform - Linux, BlueGene, Cray XT, AIX, Solaris, Windows • Reliable – Automatic fault recovery • Flexible – Programmable filters, customizable topology • Open Source
Challenges in MRNet • System Constraints • IO Node/Compute nodes on BlueGene • Shared library availability • Light-weight kernels • Scalability • Need “Whole System” scalability • Building a general tool • Paradyn is like an OEM • Some users need lightweight MRNet BE
MRNet on BlueGene • User launches FE • LaunchMON launches BEs via the control node • MRNet launches CP processes • [Diagram labels: FE, LaunchMON FE, control node, front-end nodes, CPs, IO nodes hosting BEs, compute nodes (256 …)]
MRNet on Cray XT • User launches FE • ALPS launches BEs • ALPS launches CP processes • MRNet initializes network • [Diagram labels: FE on front-end nodes, CPs, BEs on compute nodes]
BlueGene • BE runs on IO nodes • 256 cores per tool backend • Cray XT • BE runs on compute node • 12 cores per tool backend • MRNet on Cray XT has more BEs for same size job
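The comparison above comes down to a ceiling division. The following is a minimal sketch of that arithmetic, not MRNet code; the function and constant names are illustrative assumptions, and only the 256 and 12 ratios come from the slide.

```cpp
#include <cstdint>

// Cores served by one tool back-end, as stated on the slide.
constexpr std::uint64_t kBlueGeneCoresPerBE = 256;
constexpr std::uint64_t kCrayXTCoresPerBE = 12;

// Hypothetical helper: number of back-ends needed for a job,
// rounding up so that a partially filled back-end still counts.
constexpr std::uint64_t backends_needed(std::uint64_t job_cores,
                                        std::uint64_t cores_per_be) {
    return (job_cores + cores_per_be - 1) / cores_per_be;
}
```

For a hypothetical 12,288-core job, `backends_needed(12288, kBlueGeneCoresPerBE)` gives 48 back-ends while `backends_needed(12288, kCrayXTCoresPerBE)` gives 1,024, which is why the Cray XT tree has far more leaves for the same job size.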
Scalable Topology Propagation • Topology may change during execution • Need to broadcast topology information • Needed for: • Startup in “Back-end Connect” mode • Reliability system • Can lead to topology update storms • [Diagram: FE/CP/BE tree with topology update messages (T) on many links and one CP marked “!”]
Scalable Topology Propagation • Use timeout filters to propagate updates • Collect all updates from a time slice before propagating • Individual node delays are small • Reduces network traffic when many updates occur • [Diagram: FE/CP/BE tree]
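The time-slice idea can be sketched as a small buffer that holds updates until a window has elapsed and then forwards them as one batch. This is a standalone illustration under stated assumptions, not MRNet's timeout filter: `TopologyUpdate`, `UpdateCoalescer`, and the window length are all hypothetical.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical topology update record (not an MRNet type).
struct TopologyUpdate {
    int node_rank;
    std::string event;  // e.g. "added" or "failed"
};

// Buffers updates and releases them as one batch per time slice.
class UpdateCoalescer {
public:
    using Clock = std::chrono::steady_clock;

    explicit UpdateCoalescer(std::chrono::milliseconds window)
        : window_(window) {}

    // Buffer an update; the first update of a slice starts the timer.
    void add(const TopologyUpdate& update, Clock::time_point now) {
        if (pending_.empty()) slice_start_ = now;
        pending_.push_back(update);
    }

    // Return the whole batch once the slice has expired, else nothing.
    std::vector<TopologyUpdate> poll(Clock::time_point now) {
        if (pending_.empty() || now - slice_start_ < window_) return {};
        std::vector<TopologyUpdate> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    std::chrono::milliseconds window_;
    Clock::time_point slice_start_{};
    std::vector<TopologyUpdate> pending_;
};
```

Each node delays an update by at most one window, but a burst of N updates arriving within that window leaves the node as a single message instead of N.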
MRNet Backend • Traditional MRNet BE can’t run on BlueGene compute node • Multi-threading not supported • No multi-processing for dedicated tool processes • Don’t always want other MRNet BE side effects • C++ and threading introduce library overheads • Hard to embed in application processes
Lightweight MRNet • Lightweight back-end as part of application • C library • Single-threaded • No filtering at back-end • Traditional MRNet as back-end part of tool • C++ library • Multi-threaded • Dedicated thread receives data • Can run
Some MRNet Success Stories • Stack Trace Analysis Tool (STAT) • Cray Abnormal Termination Processing (ATP) • TAU over MRNet (ToM) • Open|SpeedShop, Component Based Tool Framework (CBTF) • Krell Institute • On-line detection of large scale application structure • UPC Barcelona Tech • Paradyn Performance Tool • Group File Operations, FINAL • Totalview using TBON-FS • University of Wisconsin, Madison • …
Stack Trace Analysis Tool (STAT) • Stack trace sampling and analysis for large scale applications • Reduce number of tasks to debug • Discover equivalent process behavior • Useful and powerful debugging tool • Extreme scaling • BG/L - 212,992 tasks • Jaguar - 147,456 tasks • Easy to develop • Built over MRNet, LaunchMON, StackwalkerAPI and SymtabAPI
STAT MRNet Backend • Collect stack trace from application • Encode as call prefix tree • MRNet stream send operation • stream->send(callGraph) • [Diagram: STAT front-end (FE) at the root, CPs running the tree merge filter, STAT tool daemons (BE) at the leaves attached to application processes]
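STAT MRNet Filter • [Diagram: STAT front-end (FE) at the root, CPs running the tree merge filter, STAT tool daemons (BE) at the leaves attached to application processes]
/* Pseudocode */
void merge_Stacktrace_Filter(packets_in, packets_out)
{
    /* Receive and process packets */
    for each input packet in packets_in {
        inPkt = unpack packet;
        /* Merge this packet's call prefix tree into the running result */
        mergedGraph = merge(mergedGraph, inPkt);
    }
    /* Send a single output packet carrying the merged tree */
    pkt = new Packet(mergedGraph);
    packets_out.push_back(pkt);
}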
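To make the merge step concrete, here is a minimal sketch of a call prefix tree and a merge over it. It is not STAT's implementation: the `PrefixTreeNode` layout, the per-node task sets, and the function names are assumptions chosen for illustration, and no MRNet packing is shown.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical call prefix tree node: one node per stack frame.
struct PrefixTreeNode {
    std::set<int> tasks;  // tasks whose trace passes through this frame
    std::map<std::string, std::unique_ptr<PrefixTreeNode>> children;
};

// Back-end side: add one stack trace (outermost frame first) for a task.
void add_trace(PrefixTreeNode& root,
               const std::vector<std::string>& frames, int task) {
    PrefixTreeNode* node = &root;
    node->tasks.insert(task);
    for (const std::string& frame : frames) {
        std::unique_ptr<PrefixTreeNode>& child = node->children[frame];
        if (!child) child = std::make_unique<PrefixTreeNode>();
        node = child.get();
        node->tasks.insert(task);
    }
}

// Filter side: fold the tree `src` into `dst`, sharing common prefixes.
void merge_tree(PrefixTreeNode& dst, const PrefixTreeNode& src) {
    dst.tasks.insert(src.tasks.begin(), src.tasks.end());
    for (const auto& entry : src.children) {
        std::unique_ptr<PrefixTreeNode>& slot = dst.children[entry.first];
        if (!slot) slot = std::make_unique<PrefixTreeNode>();
        merge_tree(*slot, *entry.second);
    }
}
```

Tasks that share a call path collapse onto the same nodes, so the merged tree grows with the number of distinct behaviors rather than with the number of tasks.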
STAT MRNet Frontend • Store final merged graph • stream->recv(mergedGraph) • External visualization tools • [Diagram: STAT front-end (FE) at the root, CPs running the tree merge filter, STAT tool daemons (BE) at the leaves attached to application processes]
Cray ATP Tool • Abnormal Termination Processing • One, many or all processes may crash • Reduce number of core files • STAT-like analysis to find equivalent process behavior • Request core dump on a subset of processes • Released with Cray debugging Tool 1.0 • Multiple MRNet streams • Crash stream: Notifies crash and requests ATP analysis • Stacktrace stream: Collects stack traces • Control stream: Requests core-dumps
ATP MRNet Crash Stream • Application • Triggers signal handler • Backends • Request ATP analysis • Filters • TFILTER_SUM • SFILTER_DONTWAIT • [Diagram: ATP front-end (FE) at the root, CPs running filters, ATP tool daemons (BE) at the leaves attached to application processes]
ATP MRNet Stacktrace Stream • Frontend • Sends message to all backends to collect stack traces • Backends • Collect stack traces from the application • Filters • TFILTER_Merge_Stacktrace • SFILTER_WAITFORALL • [Diagram: ATP front-end (FE) at the root, CPs running filters, ATP tool daemons (BE) at the leaves attached to application processes]
ATP MRNet Control Stream • Frontend • Requests core-dumps • Sends control messages • Shutdown • Disable ATP • Acknowledgements • Backends • Trigger core-dumps from specific processes • [Diagram: ATP front-end (FE) at the root, CPs running filters, ATP tool daemons (BE) at the leaves attached to application processes, with core-dumps written to disk]
TAU over MRNet (ToM) • Online performance monitoring • Long running applications at scale • Performance data collected and interpreted at runtime • Runtime feedback into measurement subsystem • Optimize measurement • MRNet support was added in about a week • Two types of MRNet streams • Data Stream – Collection and aggregation of data • Control stream – Monitoring, control and feedback
TAU MRNet Data Stream • Backends • Collect performance data • Built-in filters • Sum, average, max, min • User-built filters • Mean, variance, histogram • Clustering • Frontend • Stores and analyses aggregated data • Change filter parameters to tune aggregation • [Diagram: ToM front-end (FE) at the root, CPs running filters, ToM daemons (BE) at the leaves attached to application processes]
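A mean or variance filter has to combine partial results from several children without seeing the raw samples. One common way to do that, shown here as a sketch and not as ToM's actual filter, is to carry a (count, mean, M2) summary and merge summaries pairwise with the standard parallel variance update; the `Summary` type and its members are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical mergeable summary of a set of samples.
struct Summary {
    std::uint64_t n = 0;  // number of samples
    double mean = 0.0;    // running mean
    double m2 = 0.0;      // sum of squared deviations from the mean

    // Back-end side: fold in one sample (Welford's update).
    void add(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }

    // Population variance and standard deviation of the samples seen.
    double variance() const { return n ? m2 / static_cast<double>(n) : 0.0; }
    double stddev() const { return std::sqrt(variance()); }
};

// Filter side: combine two partial summaries into one.
Summary merge(const Summary& a, const Summary& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Summary out;
    out.n = a.n + b.n;
    const double delta = b.mean - a.mean;
    const double na = static_cast<double>(a.n);
    const double nb = static_cast<double>(b.n);
    out.mean = a.mean + delta * nb / (na + nb);
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / (na + nb);
    return out;
}
```

Because `merge` is associative, each communication process can apply it to whatever arrives from its children and forward a single fixed-size summary upstream.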
TAU MRNet Control Stream • Frontend broadcasts control messages to backends • Startup/finalize messages • Selection of events • Sample interval • Measurement options • Instrumentation options • [Diagram: ToM front-end (FE) at the root, CPs running filters, ToM daemons (BE) at the leaves attached to application processes]
Benefits of MRNet • Lightweight transportation fabric • Powerful and flexible data aggregation • Extremely scalable • Portable • Fault tolerant • Easy to use and integrate with other tool components
MRNet Filter Examples • Aggregating similar data for scalable presentation • Symbol table and call graphs - checksum • Stack traces – call graph prefix trees • Aggregating different data for scalable analysis • Parallel concatenation • Statistical reduction • Sum, average, max, min, mean, std deviation • Histogram • FE dynamically programs filter parameters for binning • Parallel processing to reduce workload at frontend • Hierarchical clustering • Parallel Smith-Waterman algorithm
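The histogram example can be sketched as two pieces: binning at the leaves using parameters chosen by the front-end, and element-wise addition of bin counts in the tree. The code below is an illustration under assumptions, not an MRNet filter: `BinConfig` stands in for whatever parameters the front-end would send, and out-of-range samples are simply dropped.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical binning parameters, as the front-end might set them.
struct BinConfig {
    double lo;         // inclusive lower bound
    double hi;         // exclusive upper bound
    std::size_t bins;  // number of equal-width bins
};

// Leaf side: count samples per bin; out-of-range samples are dropped.
std::vector<std::uint64_t> make_histogram(const std::vector<double>& values,
                                          const BinConfig& cfg) {
    std::vector<std::uint64_t> counts(cfg.bins, 0);
    if (cfg.bins == 0 || cfg.hi <= cfg.lo) return counts;
    const double width = (cfg.hi - cfg.lo) / static_cast<double>(cfg.bins);
    for (double v : values) {
        if (v < cfg.lo || v >= cfg.hi) continue;
        std::size_t i = static_cast<std::size_t>((v - cfg.lo) / width);
        if (i >= cfg.bins) i = cfg.bins - 1;  // guard against rounding
        ++counts[i];
    }
    return counts;
}

// Filter side: add the counts from `src` into `dst`, bin by bin.
void merge_histogram(std::vector<std::uint64_t>& dst,
                     const std::vector<std::uint64_t>& src) {
    if (dst.size() < src.size()) dst.resize(src.size(), 0);
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] += src[i];
}
```

The merged result has a fixed size set by `bins`, so the front-end receives the same amount of data regardless of how many back-ends contributed.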
MRNet Filter Capabilities • Built-in and user-defined filters • Transformation filters • Concatenation, sum, average, minimum, maximum • Synchronization Filters • Wait-for-all, wait-for-any, time-out • Runtime configurable filter parameters • Simultaneous bi-directional output packets • Heterogeneous stream based filters • Local topology information at filter • Fault tolerant filter state
MRNet Filter Types • [Diagram: a packet filter composed of packet batching/unbatching stages around a transformation filter and a synchronization filter]
MRNet Transformation Filter • Built-in filters • Concatenation • Minimum • Maximum • Sum • Average • User-built filters
/* Pseudocode */
void reduceFilter(packets_in, packets_out, packets_out_reverse, filter_state, config_params)
{
    /* Receive and process input packets */
    for (i = 0; i < packets_in.size(); i++) {
        cur_pkt = packets_in[i];
        cur_pkt->unpack("format string", &data);
        reduceData(&data, &FEdata, &BEdata);
    }
    /* Send one output packet toward the front-end */
    PacketPtr FE_pkt = new Packet(FEdata, …);
    packets_out.push_back(FE_pkt);
    /* Send one output packet back toward the back-ends */
    PacketPtr BE_pkt = new Packet(BEdata, …);
    packets_out_reverse.push_back(BE_pkt);
    return;
}
MRNet Synchronization Filter • Wait for all • Wait for any • Time out • User-defined
/* Pseudocode */
void batchFilter(packets_in, packets_out, filter_state, config_params)
{
    /* Get saved packets from filter state and append the new arrivals */
    batch_size = getBatchSize(config_params);
    packets = getPrevPackets(filter_state);
    packets.insert(packets.end(), packets_in.begin(), packets_in.end());
    /* Release the whole batch once enough packets have accumulated */
    if (packets.size() >= batch_size) {
        packets_out.insert(packets_out.end(), packets.begin(), packets.end());
        packets.clear();
    }
    /* Save whatever is still held back for the next invocation */
    updateFilterState(filter_state, packets);
    return;
}
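The wait-for-all policy listed above can be sketched in the same spirit: hold packets per child and release one packet from every child together once none of the queues is empty. This is a standalone illustration, not MRNet's built-in filter; `Pkt` and `WaitForAllSync` are hypothetical names, and real packets would carry more than an integer.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical packet: the child it arrived from and a payload.
struct Pkt {
    std::size_t child;
    int value;
};

// Holds packets until every child has contributed at least one.
class WaitForAllSync {
public:
    explicit WaitForAllSync(std::size_t num_children)
        : queues_(num_children) {}

    // Queue a packet; return a complete wave (one packet per child,
    // in child order) if one is available, otherwise an empty vector.
    std::vector<Pkt> push(const Pkt& p) {
        if (p.child >= queues_.size()) return {};  // unknown child: ignore
        queues_[p.child].push_back(p);
        for (const std::deque<Pkt>& q : queues_) {
            if (q.empty()) return {};
        }
        std::vector<Pkt> wave;
        for (std::deque<Pkt>& q : queues_) {
            wave.push_back(q.front());
            q.pop_front();
        }
        return wave;
    }

private:
    std::vector<std::deque<Pkt>> queues_;
};
```

A transformation filter placed after this stage always sees aligned waves, which is what a reduction such as a sum or a tree merge usually expects.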
[Figure: example call prefix tree of merged stack traces, rooted at _start, __libc_start_main, main; node labels include PMPI_WaitAll, do_SendorStall, PMPI_Barrier, MPID_RecvComplete, MPID_ELAN_Barrier, and Elan synchronization and wait routines (elan_tport_RxWait, elan_hgsync, elan_gsyncShm, elan_hgsyncNet, elan_gsyncNet, Elan_waitBlk, Elan_waitWord, Elan_pollWord), plus [Unknown] frames]