
Auto-tuning Sparse Matrix Kernels

Sam Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2). Affiliations: (1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory; (3) Georgia Institute of Technology. Contact: samw@cs.berkeley.edu


Presentation Transcript


  1. Auto-tuning Sparse Matrix Kernels • Sam Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2) • (1) University of California, Berkeley • (2) Lawrence Berkeley National Laboratory • (3) Georgia Institute of Technology • samw@cs.berkeley.edu

  2. Motivation • Multicore is the de facto solution for improving peak performance for the next decade • How do we ensure this applies to sustained performance as well? • Processor architectures are extremely diverse, and compilers can rarely fully exploit them • We require a HW/SW solution that guarantees performance without completely sacrificing productivity

  3. Overview • Examine Sparse Matrix Vector Multiplication (SpMV) kernel • Present and analyze two threaded & auto-tuned implementations • Benchmarked performance across 4 diverse multicore architectures • Intel Xeon (Clovertown) • AMD Opteron • Sun Niagara2 (Huron) • IBM QS20 Cell Blade • We show • Auto-tuning can significantly improve performance • Cell consistently delivers good performance and efficiency • Niagara2 delivers good performance and productivity

  4. Multicore SMPs used

  5. Multicore SMP Systems • [Block diagrams of the four systems; the recoverable labels follow] • Intel Clovertown: 8 Core2 cores; 4 x 4MB shared L2; 2 FSBs at 10.6 GB/s each; chipset with 4x64b memory controllers; 21.3 GB/s (read), 10.6 GB/s (write); 667MHz FBDIMMs • AMD Opteron: 4 Opteron cores; 4 x 1MB victim cache; SRI / crossbar; HT at 4GB/s (each direction); 2 x 128b memory controllers at 10.66 GB/s each; 667MHz DDR2 DIMMs • Sun Niagara2 (Huron): 8 MT Sparc cores with 8K L1 each; crossbar switch at 90 GB/s (writethru), 179 GB/s (fill); 4MB shared L2 (16 way, address interleaving via 8x64B banks); 4x128b memory controllers (2 banks each); 21.33 GB/s (write), 42.66 GB/s (read); 667MHz FBDIMMs • IBM QS20 Cell Blade: 2 PPEs with 512KB L2 each; 16 SPEs, each with a 256K local store and an MFC; 2 EIBs (ring networks) joined by the BIF at <20GB/s each direction; 2 x 512MB XDR DRAM at 25.6GB/s each

  6. Multicore SMP Systems (memory hierarchy) • [Block diagrams repeated from slide 5, with one overlay label] • Conventional Cache-based Memory Hierarchy

  7. Multicore SMP Systems (memory hierarchy) • [Block diagrams repeated from slide 5, with two overlay labels] • Conventional Cache-based Memory Hierarchy • Disjoint Local Store Memory Hierarchy

  8. Multicore SMP Systems (memory hierarchy) • [Block diagrams repeated from slide 5, with two overlay labels] • Cache + Pthreads implementations • Local Store + libspe implementations

  9. Multicore SMP Systems (peak flops) • [Block diagrams repeated from slide 5, annotated with peak flop rates] • Intel Clovertown: 75 Gflop/s • AMD Opteron: 17 Gflop/s • Sun Niagara2 (Huron): 11 Gflop/s • IBM QS20 Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s

  10. Multicore SMP Systems (peak DRAM bandwidth) • [Block diagrams repeated from slide 5, annotated with peak DRAM bandwidth] • Intel Clovertown: 21 GB/s (read), 10 GB/s (write) • AMD Opteron: 21 GB/s • Sun Niagara2 (Huron): 42 GB/s (read), 21 GB/s (write) • IBM QS20 Cell Blade: 51 GB/s

  11. Multicore SMP Systems • [Block diagrams repeated from slide 5, with two overlay labels] • Non-Uniform Memory Access • Uniform Memory Access

  12. Arithmetic Intensity • Arithmetic Intensity ~ Total Flops / Total DRAM Bytes • Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality) • But there are many important and interesting kernels that don't • [Figure: kernels placed along an arithmetic intensity axis marked O(1), O(log(N)), and O(N). Kernels shown: SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods; FFTs; Dense Linear Algebra (BLAS3); Particle Methods]

  13. Auto-tuning

  14. Auto-tuning • Hand optimizing each architecture/dataset combination is not feasible • Goal: a productive solution for performance portability • Our auto-tuning approach finds a good performance solution by a combination of heuristics and exhaustive search • Perl script generates many possible kernels • (Generate SIMD optimized kernels) • Auto-tuning benchmark examines kernels and reports back with the best one for the current architecture/dataset/compiler/… • Performance depends on the optimizations generated • Heuristics are often desirable when the search space isn't tractable • Proven value in Dense Linear Algebra (ATLAS), Spectral (FFTW, SPIRAL), and Sparse Methods (OSKI)
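
The exhaustive-search half of this approach can be sketched in a few lines of C. This is only an illustration under stated assumptions: the three hand-written dot-product variants stand in for the generated SpMV kernels, and the names `dot_u1`, `dot_u2`, `dot_u4`, and `autotune_dot` are hypothetical, not taken from the authors' code. `clock()` is coarse, so a real benchmark would use large inputs and a finer timer.

```c
#include <stddef.h>
#include <time.h>

/* Three hypothetical variants of one kernel (a dot product) that differ
   only in unrolling depth. They stand in for generated SpMV kernels. */
double dot_u1(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

double dot_u2(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    for (; i < n; i++) s0 += a[i] * b[i];   /* remainder */
    return s0 + s1;
}

double dot_u4(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) s0 += a[i] * b[i];   /* remainder */
    return s0 + s1 + s2 + s3;
}

typedef double (*dot_fn)(const double *, const double *, size_t);
enum { NUM_VARIANTS = 3 };
static const dot_fn variants[NUM_VARIANTS] = { dot_u1, dot_u2, dot_u4 };

/* Time every variant `trials` times on the given data, keep each
   variant's fastest run, and return the index of the overall winner. */
int autotune_dot(const double *a, const double *b, size_t n, int trials) {
    int best = 0;
    double best_time = -1.0;
    volatile double sink = 0.0;  /* stops the compiler dropping the calls */
    for (int v = 0; v < NUM_VARIANTS; v++) {
        double fastest = -1.0;
        for (int t = 0; t < trials; t++) {
            clock_t start = clock();
            sink += variants[v](a, b, n);
            double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (fastest < 0.0 || elapsed < fastest) fastest = elapsed;
        }
        if (best_time < 0.0 || fastest < best_time) {
            best_time = fastest;
            best = v;
        }
    }
    (void)sink;
    return best;
}
```

The winner is specific to the machine, compiler, and input it was measured on, which is why the search has to be rerun for each architecture/dataset combination.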

  15. Sparse Matrix-Vector Multiplication (SpMV) Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

  16. Sparse Matrix-Vector Multiplication • Sparse Matrix • Most entries are 0.0 • Performance advantage in only storing/operating on the nonzeros • Requires significant metadata • Evaluate y=Ax • A is a sparse matrix • x & y are dense vectors • Challenges • Difficult to exploit ILP (bad for superscalar) • Difficult to exploit DLP (bad for SIMD) • Irregular memory access to source vector • Difficult to load balance • Very low computational intensity (often >6 bytes/flop) = likely memory bound • [Figure: schematic of A, x, and y]
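
The ">6 bytes/flop" figure can be reproduced with a back-of-the-envelope model. The sketch below assumes CSR storage with 8-byte values and 4-byte indices, and assumes each of x and y crosses the memory bus exactly once; the function names are hypothetical. Each nonzero then costs at least 12 bytes for 2 flops, so 6 bytes/flop is a floor before row pointers and vectors are counted.

```c
#include <stddef.h>

/* Compulsory DRAM traffic in bytes for y = A*x with A in CSR.
   Assumes 8-byte values, 4-byte column indices and row pointers,
   and that x is read once and y is written once. */
double csr_spmv_bytes(size_t nrows, size_t ncols, size_t nnz) {
    double matrix = (double)nnz * (8.0 + 4.0) + (double)(nrows + 1) * 4.0;
    double vectors = (double)ncols * 8.0 + (double)nrows * 8.0;
    return matrix + vectors;
}

/* One multiply and one add per stored nonzero. */
double csr_spmv_flops(size_t nnz) {
    return 2.0 * (double)nnz;
}

double csr_spmv_bytes_per_flop(size_t nrows, size_t ncols, size_t nnz) {
    return csr_spmv_bytes(nrows, ncols, nnz) / csr_spmv_flops(nnz);
}
```

For a 1000 x 1000 matrix with 10 nonzeros per row this model gives about 7 bytes/flop, and the ratio approaches 6 as rows get denser.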

  17. Dataset (Matrices) • Pruned original SPARSITY suite down to 14 • None should fit in cache • Subdivided them into 4 categories • Rank ranges from 2K to 1M • Dense: 2K x 2K dense matrix stored in sparse format • Well structured (sorted by nonzeros/row): Protein, FEM / Spheres, FEM / Cantilever, Wind Tunnel, FEM / Harbor, QCD, FEM / Ship, Economics, Epidemiology • Poorly structured hodgepodge: FEM / Accelerator, Circuit, webbase • Extreme aspect ratio (linear programming): LP

  18. Naïve Serial Implementation • Vanilla C implementation • Matrix stored in CSR (compressed sparse row) • Explored compiler options, but only the best is presented here • x86 core delivers > 10x the performance of a Niagara2 thread • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPE)]
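
For reference, a textbook CSR kernel looks like the following. It is a minimal sketch of a "vanilla C" SpMV, not the authors' source; the array names `row_ptr`, `col_idx`, and `val` are conventional choices assumed here.

```c
#include <stddef.h>

/* y = A*x with A in CSR: row r owns entries row_ptr[r] .. row_ptr[r+1]-1
   of col_idx (column of each nonzero) and val (its value). */
void spmv_csr(size_t nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect access to x */
        y[r] = sum;
    }
}
```

The indirect load `x[col_idx[k]]` is the irregular source-vector access, and the short inner loop over one row is what limits ILP and DLP.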

  19. Naïve Parallel Implementation • SPMD style • Partition by rows • Load balance by nonzeros • N2 ~ 2.5x x86 machine • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs). Legend: Naïve Pthreads, Naïve]

  20. Naïve Parallel Implementation • [Bullets and plots repeated from slide 19, annotated with parallel speedup] • Intel Clovertown: 8x cores = 1.9x performance • AMD Opteron: 4x cores = 1.5x performance • Sun Niagara2 (Huron): 64x threads = 41x performance • IBM Cell Blade (PPEs): 4x threads = 3.4x performance

  21. Naïve Parallel Implementation • [Bullets and plots repeated from slide 19, annotated with efficiency] • Intel Clovertown: 1.4% of peak flops, 29% of bandwidth • AMD Opteron: 4% of peak flops, 20% of bandwidth • Sun Niagara2 (Huron): 25% of peak flops, 39% of bandwidth • IBM Cell Blade (PPEs): 2.7% of peak flops, 4% of bandwidth
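
Partitioning by rows while balancing by nonzeros can be done directly from the CSR row pointers. The function below is one possible sketch, with the hypothetical name `partition_by_nnz`; each Pthread would then run the serial CSR loop over its own row range.

```c
#include <stddef.h>

/* Split rows [0, nrows) into nthreads contiguous ranges holding roughly
   equal numbers of nonzeros. On return, thread t owns rows
   [first_row[t], first_row[t+1]); first_row needs nthreads + 1 slots. */
void partition_by_nnz(size_t nrows, const int *row_ptr, int nthreads,
                      size_t *first_row) {
    long long nnz = row_ptr[nrows];
    size_t r = 0;
    first_row[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        /* Thread t starts at the first row whose preceding nonzero
           count reaches t/nthreads of the total. */
        long long target = nnz * t / nthreads;
        while (r < nrows && row_ptr[r] < target) r++;
        first_row[t] = r;
    }
    first_row[nthreads] = nrows;
}
```

Because rows are never split, one very long row can still leave a thread overloaded; this sketch accepts that limitation.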

  22. Auto-tuned Performance (+NUMA & SW Prefetching) • Use first touch or libnuma to exploit NUMA • Also includes process affinity • Tag prefetches with temporal locality • Auto-tune: search for the optimal prefetch distances • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs). Legend: +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
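
A tunable prefetch distance can be exposed as a plain parameter of the kernel. The sketch below assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` takes a read/write flag and a temporal-locality hint; the function name `spmv_csr_prefetch` and the parameter `dist` (measured in nonzeros) are hypothetical. A locality hint of 0 marks the matrix stream as having no reuse.

```c
#include <stddef.h>

/* CSR SpMV that prefetches matrix data `dist` nonzeros ahead.
   dist == 0 disables prefetching. An auto-tuner would time this
   function for several values of dist and keep the fastest. */
void spmv_csr_prefetch(size_t nrows, const int *row_ptr, const int *col_idx,
                       const double *val, const double *x, double *y,
                       int dist) {
    const int nnz = row_ptr[nrows];
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
#if defined(__GNUC__)
            if (dist > 0 && k + dist < nnz) {
                /* rw = 0 (read), locality = 0 (streamed, not reused) */
                __builtin_prefetch(&val[k + dist], 0, 0);
                __builtin_prefetch(&col_idx[k + dist], 0, 0);
            }
#endif
            sum += val[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}
```

Issuing a prefetch per nonzero is wasteful, since several nonzeros share a cache line; a tuned kernel would more likely prefetch once per line. The result is identical for every `dist`, so only the timing differs.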

  23. Auto-tuned Performance (+Matrix Compression) • If memory bound, only hope is minimizing memory traffic • Heuristically compress the parallelized matrix to minimize it • Implemented with SSE • Benefit of prefetching is hidden by requirement of register blocking • Options: register blocking, index size, format, etc… • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs). Legend: +Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
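
Register blocking is one of the compression options listed above. The sketch below shows a 2x2 blocked CSR (BCSR) kernel in scalar C; the slide says the authors' version used SSE, so this is a simplified stand-in, and the names `spmv_bcsr_2x2`, `brow_ptr`, `bcol_idx`, and `bval` are assumptions.

```c
#include <stddef.h>

/* y = A*x with A stored as 2x2 BCSR. brow_ptr and bcol_idx index
   blocks rather than single nonzeros. bval holds the 4 values of each
   block in row-major order, with explicit zeros padding any block
   that is only partly filled. */
void spmv_bcsr_2x2(size_t nbrows, const int *brow_ptr, const int *bcol_idx,
                   const double *bval, const double *x, double *y) {
    for (size_t br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;          /* two rows kept in registers */
        for (int k = brow_ptr[br]; k < brow_ptr[br + 1]; k++) {
            const double *b = &bval[4 * (size_t)k];
            const double x0 = x[2 * bcol_idx[k]];
            const double x1 = x[2 * bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * br] = y0;
        y[2 * br + 1] = y1;
    }
}
```

With 8-byte values and 4-byte indices, a full 2x2 block costs 36 bytes for 8 flops, or 4.5 bytes/flop, against 6 bytes/flop for unblocked CSR. The saving shrinks as blocks fill with padding zeros, which is why the choice of block size is made per matrix.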

  24. Auto-tuned Performance (+Cache/TLB Blocking) • Reorganize matrix to maximize locality of source vector accesses • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs). Legend: +Cache/TLB Blocking, +Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]

  25. Auto-tuned Performance (+DIMMs, Firmware, Padding) • Clovertown was already fully populated with DIMMs • Gave Opteron as many DIMMs as Clovertown • Firmware update for Niagara2 • Array padding to avoid inter-thread conflict misses • PPEs use ~1/3 of Cell chip area • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs). Legend: +More DIMMs (Opteron), +FW fix, array padding (N2), etc…; +Cache/TLB Blocking; +Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

  26. Auto-tuned Performance (+DIMMs, Firmware, Padding) • [Bullets and plots repeated from slide 25, annotated with efficiency] • Intel Clovertown: 4% of peak flops, 52% of bandwidth • AMD Opteron: 20% of peak flops, 65% of bandwidth • Sun Niagara2 (Huron): 54% of peak flops, 57% of bandwidth • IBM Cell Blade (PPEs): 10% of peak flops, 10% of bandwidth

  27. Auto-tuned Performance (+Cell/SPE version) • Wrote a double precision Cell/SPE version • DMA, local store blocked, NUMA aware, etc… • Only 2x1 and larger BCOO • Only the SpMV-proper routine changed • About 12x faster (median) than using the PPEs alone • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs). Legend: +More DIMMs (Opteron), +FW fix, array padding (N2), etc…; +Cache/TLB Blocking; +Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

  28. Auto-tuned Performance (+Cell/SPE version) • [Bullets and plots repeated from slide 27, annotated with efficiency] • Intel Clovertown: 4% of peak flops, 52% of bandwidth • AMD Opteron: 20% of peak flops, 65% of bandwidth • Sun Niagara2 (Huron): 54% of peak flops, 57% of bandwidth • IBM Cell Blade (SPEs): 40% of peak flops, 92% of bandwidth

  29. Auto-tuned Performance (How much did double precision and 2x1 blocking hurt?) • Model faster cores by commenting out the inner kernel calls, but still performing all DMAs • Enabled 1x1 BCOO • ~16% improvement • [Figure panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs). Legend: +better Cell implementation; +More DIMMs (Opteron), +FW fix, array padding (N2), etc…; +Cache/TLB Blocking; +Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

  30. Speedup from Auto-tuning: Median & (max) • [Bullets and plots repeated from slide 27, annotated with speedups. Panels: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs). Annotations in transcript order, panel assignment not recoverable: 1.6x (2.7x), 3.9x (4.4x), 1.3x (2.9x), 26x (34x)]

  31. Summary

  32. Aggregate Performance (Fully optimized) • Cell consistently delivers the best full system performance • However, Niagara2 delivers nearly comparable per-socket performance • Dual-core Opteron delivers far better performance (bandwidth) than Clovertown • Clovertown has far too little effective FSB bandwidth • Huron has far more bandwidth than it can exploit • (too much latency, too few cores)

  33. Parallel Efficiency(average performance per thread, Fully optimized) • Aggregate Mflop/s / #cores • Niagara2 & Cell showed very good multicore scaling • Clovertown showed very poor multicore scaling on both applications • For SpMV, Opteron and Clovertown showed good multisocket scaling

  34. Power Efficiency(Fully Optimized) • Used a digital power meter to measure sustained power under load • Calculate power efficiency as: sustained performance / sustained power • All cache-based machines delivered similar power efficiency • FBDIMMs (~12W each) sustained power • 8 DIMMs on Clovertown (total of ~330W) • 16 DIMMs on N2 machine (total of ~450W)

  35. Productivity • Niagara2 required significantly less work to deliver good performance. • Cache based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)

  36. Summary • Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance • Niagara2 delivered both very good performance and productivity • Cell delivered very good performance and efficiency (processor and power) • Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed at SC07) • Architectural transparency is invaluable in optimizing code

  37. Acknowledgements • UC Berkeley • RADLab Cluster (Opterons) • PSI cluster(Clovertowns) • Sun Microsystems • Niagara2 donations • Forschungszentrum Jülich • Cell blade cluster access

  38. Questions?

  39. switch to pOSKI
