
Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU Department of Electrical Engineering and Computer Science



  1. Understanding Application Scaling: NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000. Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU. Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley. June 15th, 1998

  2. Introduction • The NAS Parallel Benchmarks suite 2.2 (NPB) has been used widely to evaluate modern parallel systems • 7 scientific benchmarks that represent the most common computation kernels • NPB is written on top of the Message Passing Interface (MPI) for portability • NPB is a Constant Problem Size (CPS) scaling benchmark suite • This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000

  3. Motivation • An early study of NPB shows ideal speedup on NOW! • Scaling as good as T3D and better than SP-2 • Per-node performance better than T3D, close to SP-2 • Submitted results for Origin 2000 show a spread

  4. Presentation Outline • Hardware Configuration • Time Breakdown of the Applications • Communication Performance • Computation Performance • Conclusion

  5. Hardware Configuration • SGI Origin 2000 (64 nodes) • MIPS R10000 processor, 195 MHz, 32 KB/32 KB L1 • 4 MB external L2 cache per processor • 16 GB memory total • MPI performance: 13 µsec one-way latency, 150 MB/s peak bandwidth, half-power at 8 KB message size • Network Of Workstations (NOW) • UltraSPARC I processor, 167 MHz, 16 KB/16 KB L1 • 512 KB external L2 cache per processor • 128 MB memory per processor • MPI performance: 22 µsec one-way latency, 27 MB/s peak bandwidth, half-power at 4 KB message size
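
The peak bandwidth and half-power message size quoted for each platform can be turned into a rough delivered-bandwidth estimate. The sketch below assumes a Hockney-style model, which the slides do not state: delivered bandwidth is `peak * n / (n + n_half)`, so a message of exactly the half-power size gets half the peak. It also reads the peak figures as MB/s and a KB as 1024 bytes. The function and dictionary names are illustrative only.

```python
# Hockney-style bandwidth model (an assumption, not taken from the slides):
# delivered(n) = peak * n / (n + n_half), where n_half is the half-power point.

def delivered_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Estimated delivered bandwidth in MB/s for a message of msg_bytes."""
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)

# Peak bandwidth (MB/s) and half-power message size (bytes) from slide 5.
PLATFORMS = {
    "SGI Origin 2000": (150.0, 8 * 1024),
    "NOW": (27.0, 4 * 1024),
}

if __name__ == "__main__":
    for name, (peak, n_half) in PLATFORMS.items():
        for size in (1024, n_half, 64 * 1024):
            bw = delivered_bandwidth(size, peak, n_half)
            print(f"{name}: {size:6d} B -> {bw:6.1f} MB/s")
```

Under this model small messages see only a fraction of the peak on either machine, which is one reason micro-benchmark peaks alone may not predict application-level communication performance.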

  6. Time Breakdown -- LU • Black line -- total running time summed over all processors • a job that takes one person 10 secs • ideally takes 5 secs each for 2 people • total amount of work stays at 10 secs • more total work means extra cost, e.g. communication
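
The accounting behind this slide can be written out: total work is the processor count times the wall-clock time, and anything above the one-processor time is extra work such as communication. This is a minimal sketch with illustrative function names; the 6-second two-processor run is a made-up example, not a measurement from the study.

```python
def total_work(num_procs, wall_time_secs):
    """Time summed over all processors."""
    return num_procs * wall_time_secs

def extra_work(num_procs, wall_time_secs, single_proc_secs):
    """Work beyond the single-processor baseline (e.g. communication)."""
    return total_work(num_procs, wall_time_secs) - single_proc_secs

if __name__ == "__main__":
    baseline = 10.0                       # one worker, 10 secs (slide 6)
    print(total_work(2, 5.0))             # ideal split: total stays at 10.0
    print(extra_work(2, 6.0, baseline))   # hypothetical 6 s run: 2.0 s extra
```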

  7. Time Breakdown -- LU

  8. Time Breakdown -- SP

  9. Communication Performance • Micro-benchmarks show that the SGI O2000 has better pt2pt comm. performance when compared to NOW

  10. Communication Efficiency • absolute bandwidth delivered is close on the two platforms • SP/32 on NOW -- 215s • SP/32 on SGI -- 289s • comm. on SGI achieved only 30% of potential bandwidth • protocol tradeoffs are pronounced • hand-shake vs. bulk-send in pt2pt • collective ops

  11. Computation Performance • Relative performance of the benchmarks on a single node is roughly in line with the processor performance difference • Both computational CPI and L2 misses change significantly on both platforms when scaled

  12. Recap on CPS Scaling (figure; axis ticks 4, 8, 16, 32, 64, 128, 256)

  13. LU Working Set • 4-processor • Knee starts at 256KB

  14. LU Working Set • 4-processor • Knee starts at 256KB • 8-processor • Knee starts at 128KB

  15. LU Working Set • 4-processor • Knee starts at 256KB • 8-processor • Knee starts at 128KB • 16-processor • Knee starts at 64KB

  16. LU Working Set • 4-processor • Knee starts at 256KB • 8-processor • Knee starts at 128KB • 16-processor • Knee starts at 64KB • 32-processor • Knee starts at 32KB • miss rate drops as the global cache grows from 2 MB to 4 MB
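
The knee positions above halve each time the processor count doubles, which is what constant problem size scaling predicts if the per-processor working set is a fixed total divided by the number of processors. The sketch below assumes that total is 1 MB (inferred from 4 x 256 KB, not stated on the slides) and reads "global cache" as the sum of the 512 KB L2 caches of the NOW nodes, which is also an interpretation.

```python
TOTAL_WORKING_SET_KB = 4 * 256   # inferred from the 4-processor knee (assumption)
L2_PER_PROC_KB = 512             # NOW external L2 size from slide 5

def knee_kb(num_procs):
    """Per-processor working-set knee under constant problem size scaling."""
    return TOTAL_WORKING_SET_KB // num_procs

def global_cache_mb(num_procs):
    """Aggregate L2 capacity across all processors, in MB."""
    return num_procs * L2_PER_PROC_KB / 1024

if __name__ == "__main__":
    for procs in (4, 8, 16, 32):
        print(f"{procs:2d} procs: knee {knee_kb(procs):3d} KB, "
              f"global cache {global_cache_mb(procs):4.1f} MB")
```

With these assumptions, going from 4 to 8 processors moves the aggregate cache from 2 MB to 4 MB, matching the range named in the last bullet.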

  17. SP Working Set • Cost under scaling • extra work worsens the memory system's performance • total memory references on SGI • 4-processor run makes 64.38 billion memory references • 25-processor run makes 72.35 billion memory references • 12.38% increase
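
The 12.38% figure follows directly from the two reference counts on this slide. A small check, with an illustrative helper name:

```python
def percent_increase(before, after):
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100.0

if __name__ == "__main__":
    refs_4_procs = 64.38    # billions of memory references, 4 processors
    refs_25_procs = 72.35   # billions of memory references, 25 processors
    print(f"{percent_increase(refs_4_procs, refs_25_procs):.2f}%")  # 12.38%
```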

  18. Conclusion • NPB • comm. performance is hard to predict from µ-benchmarks • global cache increases effectively reduce comp. time • sequential node arch. is a dominant factor in NPB perf. • NOW • an inexpensive way to go parallel • absolute performance is excellent • MPI on NOW has good scalability and performance • NOW vs. proprietary system -- detailed instrumentation ability • speedup cannot tell the whole story, scalability involves: • the interplay of program and machine scaling • delivered comm. performance, not µ-benchmarks • complicated memory system performance
