Outline

Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet
CAC Workshop, Santa Fe, NM, 2004
Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Outline • Processor virtualization • QsNet • Opportunities • Performance Evaluation of QsNet • Challenges of QsNet • Summary

Processor Virtualization • Basic idea of processor virtualization • User specifies interaction between objects (VPs) • RTS maps VPs onto physical processors • Typically, # virtual processors > # processors • Embodied in Charm++ and AMPI [Figure: user view vs. system implementation]
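As a rough illustration of the mapping step, the sketch below places virtual processors on physical processors round-robin. The function name `map_vps` and the round-robin policy are assumptions made for this example only; the slides do not say which placement strategy the runtime system uses, and a real runtime may use load-aware strategies instead.

```python
def map_vps(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor, round-robin.

    Hypothetical helper for illustration; not the Charm++ or AMPI API.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {proc: [] for proc in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


# Virtualization ratio of 4: eight VPs on two physical processors.
placement = map_vps(num_vps=8, num_procs=2)
for proc, vps in placement.items():
    print(f"processor {proc} hosts VPs {vps}")
```

The user code only names VPs; the table built here is the kind of information a runtime would keep private, so that it can change the placement without touching the application.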

QsNet • Popular interconnect from Quadrics • Several parallel systems in top500 use QsNet • Pittsburgh’s Lemieux (6TF) • ASCI-Q (20TF) • Elite network • Elan adaptor

Elite Network • 320 MB/s each way after protocol • Reliable fat-tree network • Multiple routes provide fault tolerance • Adaptive wormhole routing • 35 ns per hop

Elan Network Adaptor • Features • Low latency (4.5 μs for MPI) • High bandwidth (320 MB/s per node) • Components • SPARC processor • DMA engine • 64 MB RAM • On-chip cache

Low CPU Overhead • CPU overhead is small and does not change much with message size

Traditional Message Passing • Traditional message passing does not exploit the low CPU overhead of Elan [Figure: timeline of P0 and P1 showing send overhead, receive overhead, and idle time]

Adaptive Overlap • Processor virtualization takes full advantage of the low CPU overhead of Elan [Figure: timeline of P0 and P1, each running VP0 and VP1, showing send overhead and receive overhead]
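The overlap argument can be made concrete with a toy steady-state model. This is a sketch under stated assumptions, not a measurement or the authors' model: each processor splits a fixed amount of work evenly among its VPs, every VP pays a fixed CPU overhead per message, and a VP's next iteration cannot start until its partner's message has crossed the network. The function name `step_time` and all numbers are made up for illustration.

```python
def step_time(work, latency, overhead, vps):
    """Toy time per iteration on one processor (arbitrary time units).

    Assumed model: each VP computes work / vps, then spends `overhead` CPU
    time on messaging. The CPU runs the VPs back to back, and the first VP
    can resume only after its reply has spent `latency` in the network.
    """
    per_vp = work / vps + overhead        # CPU time consumed by one VP
    cpu_busy = vps * per_vp               # CPU time for all VPs in one iteration
    wait_for_reply = per_vp + latency     # earliest time the first VP can resume
    return max(cpu_busy, wait_for_reply)


work, latency, overhead = 100.0, 20.0, 1.0  # made-up values
for vps in (1, 8):
    total = step_time(work, latency, overhead, vps)
    idle = total - vps * (work / vps + overhead)
    print(f"{vps} VP(s): iteration time {total}, idle time {idle}")
```

In this model a single VP leaves the CPU idle for the whole network latency, while several VPs fill that gap with useful work. The price is one extra message overhead per VP, which is why a small, size-independent CPU overhead matters for this approach.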

Benefit of Adaptive Overlap • Problem setup: 3D stencil calculation of size 240³ run on Lemieux • Shows AMPI with virtualization ratios of 1 and 8

Charm++ Message Driven Execution [Figure labels: Scheduler, Handler, Pump, Post Receives, Receive Message, Tport Send, Send, Garbage Collection]

NAMD: A Production MD System • Written in Charm++ • Fully featured program • NIH-funded development • Distributed free of charge (5000+ downloads so far) • Binaries and source code • Installed at NSF centers • Large published simulations (e.g., aquaporin simulation featured in keynote)

Scaling NAMD • Several QsNet challenges had to be overcome to scale NAMD

QsNet Challenge: Latency • Applications need to post receives for messages of different sizes

Latency Bottlenecks • Latency • Slow NIC processor with a 100 MHz clock • Cache size only 8 KB • Traversing a large loop flushes the cache [Figure: cache misses vs. number of receives posted]

Managing Latency: Message Combining • Organize processors in a 2D (virtual) mesh • Phase 1: Processors send messages to row neighbors • Phase 2: Processors send messages to column neighbors • Message from (x1,y1) to (x2,y2) goes via (x1,y2) • 2(√P − 1) messages per processor instead of P − 1 (for a square mesh)
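The two-phase scheme can be sketched as a small simulation of an all-to-all exchange on a square virtual mesh. This is an illustrative sketch, not the Charm++ implementation: the function name `all_to_all_via_mesh`, the rank layout (rank r sits at row r // side, column r % side), and the use of (source, destination) pairs as payload are assumptions made for the example.

```python
def all_to_all_via_mesh(side):
    """Simulate an all-to-all with message combining on a side x side mesh.

    Returns (delivered, sent): the items that reach each rank, and the number
    of network messages each rank sends over both phases.
    """
    P = side * side
    sent = {r: 0 for r in range(P)}

    # Phase 1: each source bundles its items by destination column and sends
    # one combined message to the row neighbor sitting in that column.
    inbox = {r: [] for r in range(P)}
    for src in range(P):
        row = src // side
        for col in range(side):
            mid = row * side + col  # intermediate (x1, y2)
            bundle = [(src, dst) for dst in range(P) if dst % side == col]
            inbox[mid].extend(bundle)
            if mid != src:          # a bundle kept locally costs no message
                sent[src] += 1

    # Phase 2: each intermediate bundles what it holds by destination row and
    # sends one combined message to the column neighbor in that row.
    delivered = {r: [] for r in range(P)}
    for mid in range(P):
        col = mid % side
        for row in range(side):
            dst = row * side + col
            delivered[dst].extend(item for item in inbox[mid] if item[1] == dst)
            if dst != mid:
                sent[mid] += 1
    return delivered, sent


delivered, sent = all_to_all_via_mesh(side=4)
print("messages sent per processor:", sorted(set(sent.values())))
print("direct all-to-all would send:", 4 * 4 - 1)
```

Each rank sends fewer, larger messages, which trades extra data movement (every item travels two hops) for a lower per-message cost. That trade is attractive when messages are small and per-message latency dominates.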

NAMD PME Performance • Performance of NAMD with the ATPase molecule • The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages

QsNet Challenge: Bandwidth • QsNet network bandwidth: 320 MB/s • PCI/DMA contention restricts bandwidth on Alpha servers

Improving Bandwidth • Sending messages from Elan memory is faster [Table: node bandwidth (MB/s) for different placements of source and destination]

QsNet Challenge: Stretched Handlers • Stretched sends • Green superscripts • Similar stretches observed in the middle of entry methods [Figure: NAMD timeline, processors vs. time, with force compute and integrate phases]

Stretching Solution • Stretched Sends • Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged • Solved the problem by closely working with Quadrics and obtaining a patch • Isend only blocks on the rendezvous of the previous message to the same destination
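The difference between the two blocking rules can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Elan or Quadrics code: sends are issued back to back with no other CPU work in between, and every rendezvous is acknowledged a fixed `ack_delay` after the send is issued. The function name `blocked_time` and the numbers are hypothetical.

```python
def blocked_time(dests, ack_delay, per_destination):
    """Total time a sender spends blocked while issuing sends to `dests`.

    per_destination=False models the original behavior described above (a
    send waits for the previous rendezvous to any destination);
    per_destination=True models the patched behavior (a send waits only for
    the previous rendezvous to the same destination).
    """
    now = 0.0
    blocked = 0.0
    pending = {}  # key -> time at which the outstanding rendezvous is acknowledged
    for dest in dests:
        key = dest if per_destination else "any"
        ready = pending.get(key, 0.0)
        if ready > now:               # must wait for the earlier rendezvous
            blocked += ready - now
            now = ready
        pending[key] = now + ack_delay
    return blocked


dests = [1, 2, 3, 1]  # made-up destination sequence
print("original rule:", blocked_time(dests, ack_delay=5.0, per_destination=False))
print("patched rule: ", blocked_time(dests, ack_delay=5.0, per_destination=True))
```

Under the original rule every send after the first waits, even when the destinations differ; under the patched rule only the repeated send to destination 1 waits.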

Stretching Solution Contd. • Stretches in the middle of entry methods • Caused by OS daemons • Using blocking receives minimized these stretches • Daemons can be scheduled when processor is idle

NAMD With Blocking Receives [Figure: NAMD timeline with blocking receives, processors vs. time]

NAMD Performance on Lemieux

Summary • QsNet is an excellent network • NIC co-processor is ideal for message-driven execution • Programming guidelines • Send messages from Elan memory • Post a limited number of receives, and post them before the sends • Use blocking receives to avoid stretching

Future Work • One sided communication • Barrier? • Persistent one sided communication • Reserve buffers on destination
