Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Outline • Processor virtualization • QsNet • Opportunities • Performance Evaluation of QsNet • Challenges of QsNet • Summary
Processor Virtualization • Basic idea of processor virtualization • User specifies interaction between objects (VPs) • RTS maps VPs onto physical processors • Typically, # virtual processors > # processors • Embodied in Charm++ and AMPI [Figure: user view vs. system implementation]
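The mapping step can be illustrated with a minimal sketch. It assumes a plain round-robin placement, which is only one possible policy; the slides do not say which strategy the runtime uses, and the helper name `map_vps` is hypothetical.

```python
def map_vps(num_vps, num_procs):
    """Assign virtual processors (VPs) to physical processors round-robin.

    Hypothetical helper for illustration only; this is an assumed policy,
    not the placement algorithm of the Charm++ runtime.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {p: [] for p in range(num_procs)}
    for vp in range(num_vps):
        # VP i lands on processor i mod num_procs
        placement[vp % num_procs].append(vp)
    return placement


if __name__ == "__main__":
    # 8 VPs on 2 processors: a virtualization ratio of 4
    print(map_vps(8, 2))
```

Because the user program only names VPs, the runtime is free to change this placement without touching application code.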
QsNet • Popular interconnect from Quadrics • Several parallel systems in the Top500 use QsNet • Pittsburgh’s Lemieux (6 TF) • ASCI-Q (20 TF) • Elite network • Elan adaptor
Elite Network • 320 MB/s each way after protocol • Reliable fat-tree network • Multiple routes provide fault tolerance • Adaptive wormhole routing • 35 ns per hop
Elan Network Adaptor • Features • Low latency (4.5 μs for MPI) • High bandwidth (320MB/s/node) • Components • Sparc processor • DMA Engine • 64 MB RAM • On chip cache
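As a rough illustration of what these two figures imply, the sketch below uses the common latency-plus-bandwidth cost model. The model itself, the function name `transfer_time`, and the reading of 1 MB as 10**6 bytes are assumptions, not something stated on the slides.

```python
LATENCY_S = 4.5e-6      # MPI latency quoted on the slide
BANDWIDTH_BPS = 320e6   # 320 MB/s per node, assuming 1 MB = 10**6 bytes


def transfer_time(message_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated one-way transfer time in seconds: latency + size / bandwidth.

    A simplified model; it ignores contention, protocol switches, and
    per-hop delays.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return latency + message_bytes / bandwidth


if __name__ == "__main__":
    for size in (900, 64 * 1024, 1024 * 1024):
        print(f"{size:>8} bytes: {transfer_time(size) * 1e6:9.1f} us")
```

Under this model small messages are dominated by the latency term, which is why the number of messages matters as much as their total volume.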
Low CPU Overhead • CPU overhead is small and does not change much with the message size
Traditional Message Passing • Traditional message passing does not utilize the low CPU overhead of Elan [Figure: timeline of P0 and P1 showing send overhead, receive overhead, and idle time]
Adaptive Overlap • Processor virtualization takes full advantage of the low CPU overhead of Elan [Figure: timeline of P0 and P1 with virtual processors VP0 and VP1, showing send overhead and receive overhead]
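The effect of overlap can be sketched with a toy timing model. It is an assumption made for illustration, not a model taken from the talk: each processor has a fixed amount of compute per iteration split evenly among its VPs, each VP waits one network latency for its message, and CPU send and receive overheads are ignored. The name `step_time` is hypothetical.

```python
def step_time(work, latency, ratio):
    """Time per iteration on one processor in a toy overlap model.

    work    : compute time per iteration for the whole processor
    latency : time a VP waits for its message after finishing its compute
    ratio   : number of VPs per processor (virtualization ratio)
    """
    if ratio < 1:
        raise ValueError("virtualization ratio must be at least 1")
    chunk = work / ratio
    # The first VP's message arrives at chunk + latency. Meanwhile the
    # processor stays busy with the other VPs until time `work`.
    return max(work, chunk + latency)


if __name__ == "__main__":
    for ratio in (1, 2, 8):
        print(ratio, step_time(work=8.0, latency=2.0, ratio=ratio))
```

With one VP per processor the latency is fully exposed; with several VPs the wait of one VP is hidden behind the compute of the others, as long as there is enough other work.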
Benefit of Adaptive Overlap • Problem setup: 3D stencil calculation of size 240^3 run on Lemieux • Shows AMPI with virtualization ratios of 1 and 8
Charm++ Message Driven Execution [Figure: scheduler, handler, and pump; labels: receive message, post receives, Tport send, garbage collection, send]
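Message-driven execution in general can be sketched as a scheduler loop that pulls messages off a queue and invokes the matching handler. The class and method names below are hypothetical, and the network pump is replaced by a plain `deliver` call; this is not the Charm++ implementation.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler (illustrative, not the Charm++ RTS)."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, handler):
        self.handlers[name] = handler

    def deliver(self, name, payload):
        """Enqueue a message; stands in for the network pump."""
        self.queue.append((name, payload))

    def run(self):
        """Invoke handlers until the queue is empty; return the message count."""
        processed = 0
        while self.queue:
            name, payload = self.queue.popleft()
            # A handler may deliver further messages while it runs.
            self.handlers[name](self, payload)
            processed += 1
        return processed


if __name__ == "__main__":
    log = []

    def ping(sched, n):
        log.append(("ping", n))
        if n > 0:
            sched.deliver("pong", n - 1)

    def pong(sched, n):
        log.append(("pong", n))
        if n > 0:
            sched.deliver("ping", n - 1)

    s = Scheduler()
    s.register("ping", ping)
    s.register("pong", pong)
    s.deliver("ping", 2)
    s.run()
    print(log)
```

No handler ever blocks waiting for a specific message, so the processor always works on whatever message is available.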
NAMD: A Production MD System • Written in Charm++ • Fully featured program • NIH-funded development • Distributed free of charge (5000+ downloads so far) • Binaries and source code • Installed at NSF centers • Large published simulations (e.g., aquaporin simulation featured in keynote)
Scaling NAMD • Several QsNet challenges had to be overcome to scale NAMD
QsNet Challenge: Latency • Applications need to post receives for messages of different sizes
Latency Bottlenecks • Latency • Slow NIC processor with a 100 MHz clock • Cache size only 8 KB • Traversing a large loop flushes it [Figure: cache misses vs. number of receives posted]
Managing Latency: Message Combining • Organize processors in a 2D (virtual) mesh • Phase 1: Processors send messages to row neighbors • Phase 2: Processors send messages to column neighbors • Message from (x1,y1) to (x2,y2) goes via (x1,y2) • About 2*sqrt(P) messages instead of P-1
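The two-phase route and the resulting message count can be sketched as follows. The sketch assumes a square mesh with P a perfect square, so each processor sends sqrt(P) - 1 messages per phase, for 2(sqrt(P) - 1) combined messages instead of P - 1. The function names are hypothetical.

```python
import math


def route(src, dst):
    """Two-phase route on the virtual mesh: (x1, y1) -> (x1, y2) -> (x2, y2)."""
    x1, _ = src
    x2, y2 = dst
    hops = [src]
    for point in ((x1, y2), (x2, y2)):
        # Skip a phase when it would not move the message.
        if point != hops[-1]:
            hops.append(point)
    return hops


def messages_per_processor(num_procs):
    """Return (direct, combined) message counts per processor for all-to-all."""
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes a perfect-square processor count")
    direct = num_procs - 1
    combined = 2 * (side - 1)  # side - 1 row neighbors plus side - 1 column neighbors
    return direct, combined


if __name__ == "__main__":
    print(route((0, 0), (3, 5)))
    print(messages_per_processor(1024))
```

The trade-off is that each byte travels two hops in the virtual mesh, so combining pays off for small messages, where per-message cost dominates.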
NAMD PME Performance • Performance of NAMD with the ATPase molecule • The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages
QsNet Challenge: Bandwidth • QsNet network bandwidth is 320 MB/s • PCI/DMA contention restricts bandwidth on Alpha servers
Improving Bandwidth • Sending messages from Elan memory is faster [Table: node bandwidth (MB/s) for different placements of source and destination]
QsNet Challenge: Stretched Handlers • NAMD timeline • Stretched sends • Green superscripts • Similar stretches observed in the middle of entry methods [Figure: NAMD timeline, processors vs. time, with force compute and integrate phases]
Stretching Solution • Stretched Sends • Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged • Solved the problem by closely working with Quadrics and obtaining a patch • Isend only blocks on the rendezvous of the previous message to the same destination
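The difference between the two blocking rules can be stated as a small predicate. This is a toy model of the behavior described on the slide, not Elan library code; the function name and arguments are hypothetical.

```python
def isend_blocks(unacked_dests, dest, patched):
    """Return True if a new Isend to `dest` would block in this toy model.

    unacked_dests : set of destinations whose previous rendezvous has not
                    been acknowledged yet
    dest          : destination of the new Isend
    patched       : False for the original behavior, True for the patched one
    """
    if patched:
        # Patched: only an outstanding rendezvous to the same destination blocks.
        return dest in unacked_dests
    # Original: any outstanding rendezvous blocks the next Isend.
    return len(unacked_dests) > 0


if __name__ == "__main__":
    pending = {3}
    print(isend_blocks(pending, 5, patched=False))
    print(isend_blocks(pending, 5, patched=True))
```

Under the patched rule a slow destination delays only the sends aimed at it, so sends to other destinations no longer stretch.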
Stretching Solution Contd. • Stretches in the middle of entry methods • Caused by OS daemons • Using blocking receives minimized these stretches • Daemons can be scheduled when processor is idle
NAMD With Blocking Receives [Figure: NAMD timeline with blocking receives, processors vs. time]
Summary • QsNet is an excellent network • NIC co-processor is ideal for message driven execution • Programming guidelines • Send messages from Elan memory • Post a limited number of receives, and post them before the sends • Use blocking receives to avoid stretching
Future Work • One sided communication • Barrier? • Persistent one sided communication • Reserve buffers on destination