Understanding Application Scaling NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000

Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler • {fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU • Department of Electrical Engineering and Computer Science, Computer Science Division, University of California, Berkeley • June 15th, 1998

Introduction • The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems • 7 scientific benchmarks that represent the most common computation kernels • NPB is written on top of the Message Passing Interface (MPI) for portability • NPB is a Constant Problem Size (CPS) scaling benchmark suite • This study focuses on understanding NPB scaling on both NOW and SGI Origin 2000

Motivation • An early study of NPB showed ideal speedup on NOW! • Scaling as good as T3D and better than SP-2 • Per-node performance better than T3D, close to SP-2 • Submitted results for Origin 2000 show a spread

Presentation Outline • Hardware Configuration • Time Breakdown of the Applications • Communication Performance • Computation Performance • Conclusion

Hardware Configuration • SGI Origin 2000 (64 nodes) • MIPS R10000 processor, 195 MHz, 32 KB/32 KB L1 • 4 MB external L2 cache per processor • 16 GB memory total • MPI performance: 13 µsec one-way latency, 150 MB/s peak bandwidth, half-power at 8 KB message size • Network Of Workstations (NOW) • UltraSPARC I processor, 167 MHz, 16 KB/16 KB L1 • 512 KB external L2 cache per processor • 128 MB memory per processor • MPI performance: 22 µsec one-way latency, 27 MB/s peak bandwidth, half-power at 4 KB message size
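To make the peak and half-power figures concrete, here is a minimal sketch that estimates delivered point-to-point bandwidth at a given message size. It assumes a Hockney-style curve, bw(n) = peak * n / (n + n_half), and reads the peak figures as MB/s; the curve shape and the function name `effective_bandwidth` are illustrative assumptions, not the measured curves from the talk. The one-way latency figures are not used by this model.

```python
# Hockney-style bandwidth model (an assumption, not the measured MPI curve):
#   bw(n) = peak * n / (n + n_half)
# where n_half is the message size at which half of peak bandwidth is reached.

def effective_bandwidth(n_bytes, peak_mb_s, n_half_bytes):
    """Estimated delivered bandwidth in MB/s for a message of n_bytes."""
    return peak_mb_s * n_bytes / (n_bytes + n_half_bytes)

# Peak and half-power figures taken from the hardware configuration slide.
PLATFORMS = {
    "SGI Origin 2000": {"peak_mb_s": 150.0, "n_half_bytes": 8 * 1024},
    "NOW": {"peak_mb_s": 27.0, "n_half_bytes": 4 * 1024},
}

if __name__ == "__main__":
    for name, p in PLATFORMS.items():
        for n in (1024, 4096, 8192, 65536):
            bw = effective_bandwidth(n, p["peak_mb_s"], p["n_half_bytes"])
            print(f"{name:16s} {n:6d} B -> {bw:6.1f} MB/s")
```

By construction, the model returns exactly half of peak at the half-power message size on each platform, so it is only as good as the assumed curve shape.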

Time Breakdown -- LU • Black line -- total running time • Analogy: a job that takes a single man 10 secs • ideally, 2 men need 5 secs each • total amount of work stays at 10 secs • in practice: more total work, plus the need to communicate
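The arithmetic behind this analogy can be sketched as follows. The helper `breakdown` and the 6-second per-worker figure are hypothetical, chosen only to illustrate what "more work" means; the 10-second and 5-second figures are the ones from the slide.

```python
def breakdown(t_single, per_proc_times):
    """Compare a parallel run against the ideal split of a single-processor job.

    t_single       -- running time on one processor (secs)
    per_proc_times -- time each processor spends in the parallel run (secs)
    """
    p = len(per_proc_times)
    total_work = sum(per_proc_times)
    return {
        "ideal_per_proc": t_single / p,          # perfect split of the work
        "total_work": total_work,                # summed over all processors
        "extra_work": total_work - t_single,     # overhead added by scaling
        "speedup": t_single / max(per_proc_times),
    }

if __name__ == "__main__":
    # Ideal case from the slide: 10 secs of work, 2 workers, 5 secs each.
    print(breakdown(10.0, [5.0, 5.0]))
    # Hypothetical non-ideal case: each worker needs 6 secs because of
    # communication and extra computation.
    print(breakdown(10.0, [6.0, 6.0]))
```

In the ideal case the total work stays at 10 secs; in the hypothetical case it grows to 12 secs, and the 2 extra seconds are the scaling overhead.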

Time Breakdown -- LU

Time Breakdown -- SP

Communication Performance • Micro-benchmarks show that the SGI O2000 has better pt2pt comm. performance than NOW

Communication Efficiency • absolute bandwidth delivered is close on the two platforms • SP/32 on NOW -- 215s • SP/32 on SGI -- 289s • comm. on SGI achieved only 30% of its potential bandwidth • protocol tradeoffs are pronounced • hand-shake vs. bulk-send in pt2pt • collective ops

Computation Performance • Relative single-node performance of the benchmarks is roughly in line with the processor performance difference • Both computational CPI and L2 misses change significantly on both platforms when scaled

Recap on CPS Scaling • [Figure; tick values 4, 8, 16, 32, 64, 128, 256]

LU Working Set • 4-processor • Knee starts at 256 KB • 8-processor • Knee starts at 128 KB • 16-processor • Knee starts at 64 KB • 32-processor • Knee starts at 32 KB • miss rate drops when the global cache grows from 2 MB to 4 MB
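The knee positions above halve each time the processor count doubles, which is what constant problem size scaling would suggest if the per-processor working set is the fixed total divided by the number of processors. A minimal sketch of that check, where `predicted_knee_kb` is a hypothetical helper and the inverse-proportional rule is an assumption fitted to the four data points on the slide:

```python
# Knee positions (KB) per processor count, as listed on the slide.
KNEE_KB = {4: 256, 8: 128, 16: 64, 32: 32}

def predicted_knee_kb(procs, ref_procs=4, ref_knee_kb=256):
    """Knee size if the working set shrinks in inverse proportion to procs.

    Assumption: under CPS scaling the total working set is fixed, so the
    per-processor share scales as 1 / procs from a reference measurement.
    """
    return ref_knee_kb * ref_procs / procs

if __name__ == "__main__":
    for procs, observed in KNEE_KB.items():
        print(f"{procs:2d} processors: observed {observed:3d} KB, "
              f"predicted {predicted_knee_kb(procs):5.1f} KB")
```

All four observed knees match the 1/p prediction from the 4-processor reference point.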

SP Working Set • Cost under scaling • extra work worsens the memory system's performance • total memory references on SGI • 4-processor run has 64.38 billion memory references • 25-processor run has 72.35 billion memory references • 12.38% increase • [Figure labels: Cost, Benefit]
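The 12.38% figure follows directly from the two reference counts on the slide. A short check, with `percent_increase` as an illustrative helper name:

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Total memory references on SGI for SP, in billions (from the slide).
REFS_4_PROC = 64.38
REFS_25_PROC = 72.35

if __name__ == "__main__":
    growth = percent_increase(REFS_4_PROC, REFS_25_PROC)
    print(f"extra memory references: {growth:.2f}%")
```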

Conclusion • NPB • comm. performance is hard to predict from µ-benchmarks • increases in global cache effectively reduce comp. time • sequential node arch. is a dominant factor in NPB perf. • NOW • an inexpensive way to go parallel • absolute performance is excellent • MPI on NOW has good scalability and performance • NOW vs. proprietary system -- detailed instrumentation ability • speedup cannot tell the whole story; scalability involves: • the interplay of program and machine scaling • delivered comm. performance, not µ-benchmarks • complicated memory system performance
