
Architectural Interactions in High Performance Clusters


Presentation Transcript


  1. Architectural Interactions in High Performance Clusters RTPP 98 David E. Culler Computer Science Division University of California, Berkeley

  2. [Layered-system diagram: several Parallel Programs run over a Run-Time Framework; on each machine the RunTime layer sits on the Machine Architecture, and the machines are joined by a Network.]

  3. Two Example RunTime Layers
  • Split-C: thin global address space abstraction over Active Messages (get, put, read, write)
  • MPI: thicker message-passing abstraction over Active Messages (send, receive)

  4. Split-C over Active Messages
  • Read, Write, Get, Put built on small Active Message request / reply (RPC) pairs
  • Bulk transfer (store & get)
  [Diagram: request handler on the remote node, reply handler back on the requesting node]
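As a rough illustration of that layering, the sketch below shows how a blocking global read can ride on a single request/reply pair. The am_request_2 / am_reply_2 / am_poll calls and their handler signatures are hypothetical stand-ins, not the actual GAM interface.

```c
/* Sketch of a blocking Split-C-style read layered on a small Active Message
 * request/reply pair.  am_request_2(), am_reply_2() and am_poll() are
 * hypothetical stand-ins for the AM layer's calls, not the real GAM API. */
#include <stdint.h>

typedef struct {
    volatile int done;     /* set by the reply handler on the requester */
    uint64_t     value;    /* word fetched from the remote node         */
} read_ctx_t;

/* Hypothetical AM primitives (prototypes only). */
void am_request_2(int node, void (*handler)(int, void *, void *),
                  void *arg0, void *arg1);
void am_reply_2(int node, void (*handler)(void *, uint64_t),
                void *arg0, uint64_t arg1);
void am_poll(void);

/* Reply handler: runs back on the requesting node, deposits the value. */
static void read_reply_h(void *ctx, uint64_t value)
{
    read_ctx_t *c = ctx;
    c->value = value;
    c->done  = 1;
}

/* Request handler: runs on the owning node, loads the word and replies. */
static void read_request_h(int src_node, void *remote_addr, void *ctx)
{
    am_reply_2(src_node, read_reply_h, ctx, *(uint64_t *)remote_addr);
}

/* Blocking global read: one request, one reply, spin-polling in between. */
uint64_t global_read(int node, uint64_t *remote_addr)
{
    read_ctx_t ctx = { 0, 0 };
    am_request_2(node, read_request_h, remote_addr, &ctx);
    while (!ctx.done)
        am_poll();             /* run incoming handlers while waiting */
    return ctx.value;
}
```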

  5. Model Framework: LogP
  • L: latency in sending a (small) message between modules
  • o: overhead felt by the processor on sending or receiving a message
  • g: gap between successive sends or receives (1/rate)
  • P: number of processors
  [Diagram: P processor/memory modules attached to a limited-volume interconnection network, with at most L/g messages in flight to any one processor]
  Round-trip time: 2 × (2o + L)
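To make the cost model concrete, here is a minimal sketch that evaluates the round-trip formula and the limited-volume bound. The parameter values are placeholders, not the measured numbers from the study.

```c
/* Minimal LogP cost calculator (illustrative values only).  Times in us. */
#include <stdio.h>

int main(void)
{
    double L = 5.0;   /* network latency for a small message      */
    double o = 3.0;   /* send/receive overhead on the processor   */
    double g = 6.0;   /* gap: minimum interval between messages   */

    /* A request/reply (RPC) round trip costs 2 x (2o + L):
     * send overhead + latency + receive overhead, in each direction. */
    double round_trip = 2.0 * (2.0 * o + L);

    /* The processor can issue at most one small message per max(o, g)
     * microseconds; at most L/g messages are in flight to one processor. */
    double peak_rate = 1.0e6 / (g > o ? g : o);   /* messages per second */
    double in_flight = L / g;                     /* limited-volume bound */

    printf("round trip              = %.1f us\n", round_trip);
    printf("peak small-msg rate     = %.0f msgs/s\n", peak_rate);
    printf("msgs in flight per proc <= %.2f\n", in_flight);
    return 0;
}
```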

  6. LogP Summary of Current Machines [table of measured LogP parameters; peak bandwidths of 38, 141, and 47 MB/s]

  7. Methodology
  • Apparatus: 35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5); M2F Lanai + Myricom network in a fat-tree variant; GAM + Split-C
  • Modify the Active Message layer to inflate L, o, g, or G independently
  • Execute a diverse suite of applications and observe the effect
  • Evaluate against natural performance models

  8. Adjusting L, o, and g (and G) in situ
  • o: stall the Ultra on message write (sending host) and on message read (receiving host)
  • L: defer marking the message as valid in the Lanai until Rx time + L
  • g: delay the Lanai after message injection (after each fragment for bulk transfers)
  [Diagram: host workstations running the AM library, Lanai interface cards, Myrinet]
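One way to picture the host-side stall: burn a calibrated amount of processor time inside the AM library around each message write or read, so the processor itself pays the inflated overhead. The portable busy-wait below is only an illustration of that idea, not the modified GAM/Lanai code from the study (which presumably used a Solaris high-resolution timer).

```c
/* Illustrative host-side stall for inflating the overhead parameter o:
 * spin (do not sleep) so the processor pays the extra cost per message.
 * A portable sketch, not the study's modified AM layer. */
#include <time.h>

static void stall_host(long long delta_ns)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((long long)(now.tv_sec - start.tv_sec) * 1000000000LL +
             (now.tv_nsec - start.tv_nsec) < delta_ns);
}

int main(void)
{
    stall_host(5000);   /* e.g. add 5 us of overhead around a message op */
    return 0;
}
```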

  9. Calibration

  10. Application Characteristics
  • Message frequency
  • Write-based vs. read-based
  • Short vs. bulk messages
  • Synchronization
  • Communication balance

  11. Applications used in the Study

  12. Baseline Communication

  13. Application Sensitivity to Communication Performance

  14. Sensitivity to Overhead

  15. Sensitivity to gap (1/msg rate)

  16. Sensitivity to Latency

  17. Sensitivity to bulk BW (1/G)

  18. Modeling Effects of Overhead
  • Tpred = Torig + 2 × (max #msgs) × o
  • Request / response: each message pays the added overhead twice
  • The processor with the most messages limits overall time
  • Why does this model under-predict?
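That prediction is simple enough to run directly. The sketch below evaluates it for made-up inputs (baseline run time, busiest processor's message count, added overhead); these are not measurements from the study.

```c
/* Predicted run time under inflated overhead, per the slide's model:
 * the processor with the most messages limits overall time, and each
 * request/reply message pays the extra overhead twice (send + receive). */
#include <stdio.h>

double predict_overhead(double t_orig_s, long max_msgs_per_proc, double extra_o_s)
{
    return t_orig_s + 2.0 * (double)max_msgs_per_proc * extra_o_s;
}

int main(void)
{
    double t_orig   = 10.0;       /* baseline run time, seconds          */
    long   max_msgs = 200000;     /* messages on the busiest processor   */
    double delta_o  = 5.0e-6;     /* added overhead per message, seconds */

    printf("T_pred = %.2f s\n", predict_overhead(t_orig, max_msgs, delta_o));
    return 0;
}
```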

  19. Modeling Effects of gap
  • Uniform communication model:
    Tpred = Torig, if g < I (the average message interval)
    Tpred = Torig + m × (g − I), otherwise
  • Bursty communication:
    Tpred = Torig + m × g
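Both cases of the gap model can likewise be written down directly; m (messages per processor) and I (average message interval) follow the slide, and all input values below are invented for illustration.

```c
/* Predicted run time under an inflated gap g: uniform vs. bursty traffic. */
#include <stdio.h>

double predict_gap_uniform(double t_orig, long m, double g, double interval)
{
    /* Uniform traffic only slows down once g exceeds the natural interval. */
    return (g < interval) ? t_orig : t_orig + (double)m * (g - interval);
}

double predict_gap_bursty(double t_orig, long m, double g)
{
    /* Bursty traffic pays the full gap on every message. */
    return t_orig + (double)m * g;
}

int main(void)
{
    double t_orig = 10.0, g = 20.0e-6, interval = 50.0e-6;   /* seconds */
    long   m = 100000;                                       /* msgs/proc */

    printf("uniform: %.2f s\n", predict_gap_uniform(t_orig, m, g, interval));
    printf("bursty : %.2f s\n", predict_gap_bursty(t_orig, m, g));
    return 0;
}
```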

  20. Extrapolating to Low Overhead

  21. MPI over AM: ping-pong bandwidth
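Plots like this one normally come from a ping-pong measurement. Below is a generic sketch of that microbenchmark in plain MPI (fixed 1 MB messages, two ranks), not the code used in the study.

```c
/* Generic MPI ping-pong bandwidth microbenchmark (illustrative sketch).
 * Run with exactly two ranks, e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100, bytes = 1 << 20;   /* 1 MB messages */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank == 0)   /* 2 transfers per iteration: there and back */
        printf("bandwidth = %.1f MB/s\n",
               2.0 * iters * bytes / elapsed / 1.0e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```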

  22. MPI over AM: start-up

  23. NPB2 Speedup: NOW vs SP2

  24. NOW vs. Origin

  25. Single Processor Performance

  26. Understanding Speedup
  SpeedUp(p) = T1 / MAXp (Tcompute + Tcomm + Twait)
  Tcompute = (work/p + extra) × efficiency
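Read literally, the decomposition says the slowest processor's compute + communication + wait time sets the denominator. A tiny runnable check of that arithmetic, with invented per-node times for p = 4:

```c
/* Speedup decomposition from the slide:
 * SpeedUp(p) = T1 / max_p(Tcompute + Tcomm + Twait).
 * All per-node numbers below are invented for illustration. */
#include <stdio.h>

int main(void)
{
    double t1 = 320.0;                          /* single-processor time, s */
    /* per-processor (compute, comm, wait) times on 4 nodes, in seconds */
    double tcomp[4] = { 78.0, 80.0, 79.0, 81.0 };
    double tcomm[4] = {  4.0,  3.5,  4.2,  3.8 };
    double twait[4] = {  2.0,  6.0,  1.0,  5.0 };

    double tmax = 0.0;
    for (int p = 0; p < 4; p++) {
        double t = tcomp[p] + tcomm[p] + twait[p];
        if (t > tmax) tmax = t;                 /* slowest processor wins */
    }
    printf("SpeedUp(4) = %.2f\n", t1 / tmax);
    return 0;
}
```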

  27. Performance Tools for Clusters
  • Independent data collection on every node (timing, sampling, tracing)
  • Little perturbation of global effects
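A minimal sketch of what such per-node collection could look like: each node accumulates its compute / communication / wait time against its own local clock and only reports at the end, so no global coordination perturbs the run. All names here are illustrative, not the study's tools.

```c
/* Per-node time buckets accumulated independently on each node (sketch). */
#include <stdio.h>
#include <time.h>

enum { T_COMPUTE, T_COMM, T_WAIT, T_NBUCKETS };
static double bucket[T_NBUCKETS];     /* seconds accumulated per category */
static double t_start;

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void phase_begin(void)    { t_start = now_s(); }
static void phase_end(int which) { bucket[which] += now_s() - t_start; }

static void report(int my_node)
{
    printf("node %d: compute %.3fs  comm %.3fs  wait %.3fs\n",
           my_node, bucket[T_COMPUTE], bucket[T_COMM], bucket[T_WAIT]);
}

int main(void)
{
    phase_begin();
    /* ... this node's computation would go here ... */
    phase_end(T_COMPUTE);

    phase_begin();
    /* ... this node's communication would go here ... */
    phase_end(T_COMM);

    report(0);   /* in a real run, pass the node id */
    return 0;
}
```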

  28. Where the Time Goes: LU-a

  29. Where the Time Goes: BT-a

  30. Constant Problem Size Scaling [plot; axis values 4, 8, 16, 32, 64, 128, 256]

  31. Communication Scaling [plots: normalized messages per processor; average message size]

  32. Communication Scaling: Volume

  33. Extra Work

  34. Cache Working Sets: LU [plot annotation: 8-fold reduction in miss rate from 4 to 8 processors]

  35. Cache Working Sets: BT

  36. Cycles per Instruction

  37. MPI Internal Protocol [sender / receiver timeline diagram]

  38. Revised Protocol [sender / receiver timeline diagram]

  39. Sensitivity to Overhead

  40. Conclusions
  • Run-time systems for parallel programs must deal with a host of architectural interactions: communication, computation, and the memory system
  • Build a performance model of your RTPP: it is the only way to recognize anomalies
  • Build tools along with the run-time to reflect characteristics and sensitivity back to the parallel program
  • Much can lurk beneath a perfect speedup curve
