Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. CAC Workshop, Santa Fe, NM, 2004. Sameer Kumar* and Laxmikant V. Kale, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign.

Presentation Transcript


  1. Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet • CAC Workshop, Santa Fe, NM, 2004 • Sameer Kumar* and Laxmikant V. Kale • Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

  2. Outline • Processor virtualization • QsNet • Opportunities • Performance Evaluation of QsNet • Challenges of QsNet • Summary

  3. Processor Virtualization • Basic idea of processor virtualization • User specifies interaction between objects (VPs) • RTS maps VPs onto physical processors • Typically, # virtual processors > # processors • Embodied in Charm++ and AMPI (Figure: user view vs. system implementation)
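The mapping step on this slide can be illustrated with a minimal sketch. It assumes a plain round-robin placement, which is only one possible policy and is not claimed to be what the Charm++ or AMPI runtime does; the function name `map_vps_round_robin` is hypothetical.

```python
def map_vps_round_robin(num_vps, num_procs):
    """Return {physical processor: [VP ids]} using round-robin placement.

    Round-robin is an illustrative assumption, not the runtime's actual policy.
    """
    placement = {p: [] for p in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


# Eight virtual processors on two physical processors (virtualization ratio 4).
placement = map_vps_round_robin(num_vps=8, num_procs=2)
```

The user code only ever names VPs; a table like `placement` is what lets the runtime decide, and later change, where each VP runs.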

  4. QsNet • Popular interconnect from Quadrics • Several parallel systems in top500 use QsNet • Pittsburgh’s Lemieux (6TF) • ASCI-Q (20TF) • Elite network • Elan adaptor

  5. Elite Network • 320 MB/s each way after protocol • Reliable fat-tree network • Multiple routes provide fault tolerance • Adaptive wormhole routing • 35 ns per hop

  6. Elan Network Adaptor • Features • Low latency (4.5 μs for MPI) • High bandwidth (320MB/s/node) • Components • Sparc processor • DMA Engine • 64 MB RAM • On chip cache
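To get a feel for the two headline numbers on this slide, the sketch below plugs them into a simple linear cost model, time = latency + size / bandwidth. The linear model and the convention 1 MB = 10^6 bytes are assumptions made here for illustration; the slides do not state either.

```python
LATENCY_S = 4.5e-6       # MPI latency quoted on the slide
BANDWIDTH_BPS = 320e6    # 320 MB/s per node, assuming 1 MB = 10**6 bytes


def estimated_transfer_time(message_bytes):
    """Estimate one-way message time in seconds under the assumed linear model."""
    return LATENCY_S + message_bytes / BANDWIDTH_BPS


small = estimated_transfer_time(1024)       # latency still dominates
large = estimated_transfer_time(1_000_000)  # bandwidth dominates
```

Under this model a 1 KB message costs less than twice the base latency, so for short messages the fixed per-message cost is what matters.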

  7. Low CPU Overhead • CPU overhead is small and does not change much with the message size

  8. Traditional Message Passing • Traditional message passing does not utilize the low CPU overhead of Elan (Figure: timeline of P0 and P1 showing send overhead, receive overhead, and idle time)

  9. Adaptive Overlap • Processor virtualization takes full advantage of the low CPU overhead of Elan (Figure: timeline of P0 and P1, each running VP0 and VP1, showing send overhead and receive overhead)

  10. Benefit of Adaptive Overlap • Problem setup: 3D stencil calculation of size 240³ run on Lemieux • Shows AMPI with virtualization ratios of 1 and 8
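Why a higher virtualization ratio can help is easy to see in a toy model. The sketch below is not taken from the slides and does not reproduce the measured results: it assumes each processor has a fixed amount of compute per iteration split evenly across its VPs, that each VP must wait one network latency after sending before it can start its next iteration, and that CPU overhead for communication is negligible. All names and numbers are hypothetical.

```python
def iteration_time(compute, latency, num_vps):
    """Toy per-iteration time on one processor.

    A VP computes for compute / num_vps, sends, then waits `latency`.
    While it waits, the processor runs its other VPs, so the processor is
    busy for `compute` in total and the first VP is ready again after
    compute / num_vps + latency.
    """
    return max(compute, compute / num_vps + latency)


no_overlap = iteration_time(compute=10.0, latency=4.0, num_vps=1)
with_overlap = iteration_time(compute=10.0, latency=4.0, num_vps=8)
```

With one VP per processor the wait is fully exposed; with eight, the other VPs' work hides it in this model.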

  11. Charm++ Message Driven Execution (Figure labels: Scheduler, Handler, Pump, Post Receives, Receive Message, Tport Send, Send, Garbage Collection)
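The scheduler-and-handler structure named on this slide can be sketched in a few lines. This is a generic message-driven loop written for illustration, not the Charm++ runtime; the `Scheduler` class, its methods, and the ping/pong handlers are all hypothetical, and network pumping and garbage collection are left out.

```python
from collections import deque


class Scheduler:
    """Minimal message-driven scheduler: pick a message, run its handler."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, fn):
        self.handlers[name] = fn

    def deliver(self, name, payload):
        # In a real runtime this is where a network pump would enqueue arrivals.
        self.queue.append((name, payload))

    def run(self):
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)


log = []


def on_ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.deliver("pong", n - 1)


def on_pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.deliver("ping", n - 1)


sched = Scheduler()
sched.register("ping", on_ping)
sched.register("pong", on_pong)
sched.deliver("ping", 2)
sched.run()
```

No handler ever blocks waiting for a specific message; control returns to the scheduler after each one, which is what lets other work proceed while messages are in flight.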

  12. NAMD: A Production MD System • Written in Charm++ • Fully featured program • NIH-funded development • Distributed free of charge (5000+ downloads so far) • Binaries and source code • Installed at NSF centers • Large published simulations (e.g., aquaporin simulation featured in keynote)

  13. Scaling NAMD • Several QsNet challenges had to be overcome to scale NAMD

  14. QsNet Challenge: Latency • Applications need to post receives for messages of different sizes

  15. Latency Bottlenecks • Latency • Slow NIC processor with a 100 MHz clock • Cache size only 8 KB • Traversing a large loop flushes the cache (Figure: Cache Misses vs Number of Receives Posted)

  16. Managing Latency: Message Combining • Organize processors in a 2D (virtual) mesh • Message from (x1,y1) to (x2,y2) goes via (x1,y2) • Phase 1: Processors send messages to row neighbors • Phase 2: Processors send messages to column neighbors • 2*√P messages instead of P-1
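The routing rule on this slide, where a message from (x1,y1) to (x2,y2) goes via (x1,y2), can be checked with a small simulation. The sketch below models an all-to-all exchange in which every processor has one item for every processor (itself included) and counts how many combined messages each processor actually sends. It is an illustration written for this transcript, not the authors' implementation, and the function name `mesh_all_to_all` is hypothetical.

```python
def mesh_all_to_all(rows, cols):
    """Simulate a two-phase combined all-to-all on a rows x cols virtual mesh.

    Returns (delivered, sends): delivered[p] lists the sources whose item
    reached p, and sends[p] counts the combined messages p put on the network.
    """
    procs = [(x, y) for x in range(rows) for y in range(cols)]
    sends = {p: 0 for p in procs}
    held = {p: [] for p in procs}       # items parked at the intermediate hop
    delivered = {p: [] for p in procs}

    # Phase 1: (x1, y1) bundles everything bound for column y2 and hands it
    # to its row neighbor (x1, y2). The bundle for its own column stays local.
    for (x1, y1) in procs:
        for y2 in range(cols):
            bundle = [((x1, y1), (x2, y2)) for x2 in range(rows)]
            held[(x1, y2)].extend(bundle)
            if y2 != y1:
                sends[(x1, y1)] += 1

    # Phase 2: (x1, y2) bundles everything bound for (x2, y2) and sends it
    # down the column. Items for itself are delivered locally.
    for (x1, y2) in procs:
        for x2 in range(rows):
            bundle = [src for (src, dst) in held[(x1, y2)] if dst == (x2, y2)]
            delivered[(x2, y2)].extend(bundle)
            if x2 != x1:
                sends[(x1, y2)] += 1

    return delivered, sends


delivered, sends = mesh_all_to_all(4, 4)
```

On a 4 x 4 mesh each processor sends (cols - 1) + (rows - 1) = 6 combined messages rather than 15 direct ones, at the cost of each item crossing the network twice.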

  17. NAMD PME Performance • Performance of NAMD with the ATPase molecule • The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages

  18. QsNet Challenge: Bandwidth • QsNet network bandwidth: 320 MB/s • PCI/DMA contention restricts bandwidth on Alpha servers

  19. Improving Bandwidth • Sending messages from Elan memory is faster (Table: node bandwidth (MB/s) for different placements of source and destination)

  20. QsNet Challenge: Stretched Handlers • Stretched Sends • Green superscripts • Similar stretches observed in the middle of entry methods (Figure: NAMD timeline, processors vs. time, with force compute and integrate phases)

  21. Stretching Solution • Stretched Sends • Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged • Solved the problem by working closely with Quadrics and obtaining a patch • With the patch, Isend only blocks on the rendezvous of the previous message to the same destination
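The difference between the two blocking rules on this slide can be written down as a small predicate. This is a simplified model for illustration only: it tracks a set of destinations with an unacknowledged rendezvous, which is an assumption about how to represent the state and not how the Elan library is implemented. The name `isend_blocks` is hypothetical.

```python
def isend_blocks(pending_rendezvous, dest, patched):
    """Would a new Isend to `dest` block, under this simplified model?

    pending_rendezvous: set of destinations whose previous rendezvous
    has not yet been acknowledged.
    """
    if patched:
        # Patched rule: only an outstanding rendezvous to the same destination blocks.
        return dest in pending_rendezvous
    # Original rule: an outstanding rendezvous to any destination blocks.
    return len(pending_rendezvous) > 0


before_patch = isend_blocks({3}, dest=5, patched=False)
after_patch = isend_blocks({3}, dest=5, patched=True)
```

In this model a slow acknowledgement from one destination stalls sends to every destination before the patch, and only sends to that one destination after it.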

  22. Stretching Solution Contd. • Stretches in the middle of entry methods • Caused by OS daemons • Using blocking receives minimized these stretches • Daemons can be scheduled when processor is idle

  23. NAMD With Blocking Receives (Figure: NAMD timeline with blocking receives, processors vs. time)

  24. NAMD Performance on Lemieux

  25. Summary • QsNet is an excellent network • NIC co-processor ideal for message driven execution • Programming guidelines • Send messages from Elan memory • Post a limited number of receives, and post them before the sends • Use blocking receives to avoid stretching

  26. Future Work • One sided communication • Barrier? • Persistent one sided communication • Reserve buffers on destination
