Understanding Application Scaling: NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000 • Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler • {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU • Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley • June 15th, 1998
Introduction • The NAS Parallel Benchmarks (NPB) suite 2.2 has been widely used to evaluate modern parallel systems • 7 scientific benchmarks that represent the most common computation kernels • NPB is written on top of the Message Passing Interface (MPI) for portability • NPB is a Constant Problem Size (CPS) scaling benchmark suite • This study focuses on understanding NPB scaling on both NOW and the SGI Origin 2000
Motivation • An early study of NPB shows ideal speedup on NOW! • Scaling as good as the T3D and better than the SP-2 • Per-node performance better than the T3D, close to the SP-2 • Submitted results for the Origin 2000 show a spread
Presentation Outline • Hardware Configuration • Time Breakdown of the Applications • Communication Performance • Computation Performance • Conclusion
Hardware Configuration • SGI Origin 2000 (64 nodes) • MIPS R10000 processor, 195 MHz, 32KB/32KB L1 • 4MB external L2 cache per processor • 16GB memory total • MPI performance: 13 µsec one-way latency, 150 MB/s peak bandwidth, half-power point at 8KB message size • Network Of Workstations (NOW) • UltraSPARC I processor, 167 MHz, 16KB/16KB L1 • 512KB external L2 cache per processor • 128MB memory per processor • MPI performance: 22 µsec one-way latency, 27 MB/s peak bandwidth, half-power point at 4KB message size
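One way to read the peak and half-power figures above is through Hockney's two-parameter model, in which the bandwidth delivered at message size n is peak × n / (n + n_half). The slides do not say which model, if any, the authors used, so this is only a sketch under that assumption. The function name `delivered_bandwidth` is hypothetical, and the peak figures are treated as MB/s. Only the peak and half-power values come from the slide.

```python
def delivered_bandwidth(msg_bytes, peak_mb_s, half_power_bytes):
    """Hockney-style estimate of delivered bandwidth (MB/s).

    By construction the result is half of peak when
    msg_bytes == half_power_bytes, and it approaches peak for large messages.
    """
    return peak_mb_s * msg_bytes / (msg_bytes + half_power_bytes)


# (peak bandwidth in MB/s, half-power message size in bytes), as quoted on the slide.
PLATFORMS = {
    "SGI Origin 2000": (150.0, 8 * 1024),
    "NOW": (27.0, 4 * 1024),
}

if __name__ == "__main__":
    for name, (peak, n_half) in PLATFORMS.items():
        for size in (1024, n_half, 64 * 1024):
            bw = delivered_bandwidth(size, peak, n_half)
            print(f"{name}: {size:>6} B -> {bw:6.1f} MB/s")
```

Under this model, a platform with a larger half-power point needs larger messages before its higher peak pays off. That is one reason peak numbers alone say little about what an application sees.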
Time Breakdown -- LU • Black line -- total running time • Analogy: a job that takes a single man 10 secs • ideally takes 2 men 5 secs each • total amount of work stays at 10 secs • In practice: more work, plus the need for communication
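The accounting behind this slide can be sketched in a few lines: total work is the number of processors times the time each one spends, and anything above the single-processor baseline is extra work (additional computation plus communication). The 10-sec baseline and the ideal 5 secs come from the slide. The 6-sec "measured" value and both function names are hypothetical, used only to illustrate the arithmetic.

```python
def total_work(processors, time_per_processor):
    """Aggregate processor-seconds spent across all processors."""
    return processors * time_per_processor


def extra_work(processors, time_per_processor, single_processor_time):
    """Processor-seconds beyond the single-processor baseline."""
    return total_work(processors, time_per_processor) - single_processor_time


if __name__ == "__main__":
    baseline = 10.0            # one worker, 10 secs (from the slide)
    ideal = baseline / 2       # two workers, 5 secs each (from the slide)
    print("ideal total work:", total_work(2, ideal))

    measured = 6.0             # hypothetical per-worker time, not from the slides
    print("extra work:", extra_work(2, measured, baseline))
```

Plotting total work rather than speedup shows directly where scaling adds cost, since any rise above the baseline is overhead.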
Communication Performance • Micro-benchmarks show that the SGI O2000 has better point-to-point (pt2pt) comm. performance than NOW
Communication Efficiency • absolute bandwidth delivered is close on the two platforms • SP/32 on NOW -- 215s • SP/32 on SGI -- 289s • SGI achieved only 30% of its potential bandwidth • protocol tradeoffs are pronounced • hand-shake vs. bulk-send in pt2pt • collective ops
Computation Performance • Relative single-node performance of the benchmarks is roughly in line with the difference in processor performance • Both computational CPI and L2 misses change significantly on both platforms as the benchmarks are scaled
Recap on CPS Scaling • [Figure not reproduced; axis ticks: 4, 8, 16, 32, 64, 128, 256]
LU Working Set • 4-processor: knee starts at 256KB • 8-processor: knee starts at 128KB • 16-processor: knee starts at 64KB • 32-processor: knee starts at 32KB • miss rate drops as the global cache grows from 2MB to 4MB
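The knee values above halve each time the processor count doubles, which is what constant-problem-size scaling would predict for a per-processor working set. A minimal sketch checks this by multiplying each knee by its processor count. It also computes the global (aggregate L2) cache size. The helper names are hypothetical. Reading the 2MB and 4MB global cache figures as 4 and 8 NOW nodes with 512KB L2 each is an assumption, not something the slide states.

```python
# Knee of the LU working-set curve per processor count, in KB (from the slide).
KNEE_KB = {4: 256, 8: 128, 16: 64, 32: 32}


def aggregate_knee_kb(knees):
    """Processors x per-processor knee; constant if the knee scales as 1/P."""
    return {p: p * kb for p, kb in knees.items()}


def global_cache_mb(processors, l2_kb_per_processor):
    """Sum of the per-processor L2 caches, in MB."""
    return processors * l2_kb_per_processor / 1024


if __name__ == "__main__":
    for p, total_kb in aggregate_knee_kb(KNEE_KB).items():
        print(f"{p:>2} processors: knee x P = {total_kb} KB")

    # Assumption: 512 KB L2 per node, as listed for NOW in the hardware slide.
    for p in (4, 8):
        print(f"{p} nodes x 512 KB L2 = {global_cache_mb(p, 512)} MB global cache")
```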
SP Working Set • Cost under scaling • extra work worsens memory system performance • total memory references on SGI • 4-processor run has 64.38 billion memory references • 25-processor run has 72.35 billion memory references • 12.38% increase • [Figure annotations: Cost, Benefit]
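The 12.38% figure follows directly from the two reference counts on the slide. As a quick check, with a hypothetical helper name:

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    refs_4_proc = 64.38   # billion memory references, 4 processors (from the slide)
    refs_25_proc = 72.35  # billion memory references, 25 processors (from the slide)
    print(f"{percent_increase(refs_4_proc, refs_25_proc):.2f}% increase")
```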
Conclusion • NPB • comm. performance is hard to predict from micro-benchmarks • increases in global cache effectively reduce comp. time • sequential node architecture is a dominant factor in NPB performance • NOW • an inexpensive way to go parallel • absolute performance is excellent • MPI on NOW has good scalability and performance • NOW vs. proprietary systems -- ability to do detailed instrumentation • Speedup cannot tell the whole story; scalability involves: • the interplay of program and machine scaling • delivered comm. performance, not micro-benchmark results • complicated memory system performance