400 likes | 504 Views
Auto-tuning Sparse Matrix Kernels. Sam Williams 1,2 Richard Vuduc 3 , Leonid Oliker 1,2 , John Shalf 2 , Katherine Yelick 1,2 , James Demmel 1,2 1 University of California Berkeley 2 Lawrence Berkeley National Laboratory 3 Georgia Institute of Technology samw@cs.berkeley.edu. Motivation.
E N D
Auto-tuning Sparse Matrix Kernels Sam Williams1,2 Richard Vuduc3, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, James Demmel1,2 1University of California Berkeley2Lawrence Berkeley National Laboratory3Georgia Institute of Technology samw@cs.berkeley.edu
Motivation • Multicore is the de facto solution for improving peak performance for the next decade • How do we ensure this applies to sustained performance as well ? • Processor architectures are extremely diverse and compilers can rarely fully exploit them • Require a HW/SW solution that guarantees performance without completely sacrificing productivity
Overview • Examine Sparse Matrix Vector Multiplication (SpMV) kernel • Present and analyze two threaded & auto-tuned implementations • Benchmarked performance across 4 diverse multicore architectures • Intel Xeon (Clovertown) • AMD Opteron • Sun Niagara2 (Huron) • IBM QS20 Cell Blade • We show • Auto-tuning can significantly improve performance • Cell consistently delivers good performance and efficiency • Niagara2 delivers good performance and productivity
Multicore SMP Systems Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM
Multicore SMP Systems(memory hierarchy) Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM Conventional Cache-based Memory Hierarchy
Multicore SMP Systems(memory hierarchy) Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM Conventional Cache-based Memory Hierarchy Disjoint Local Store Memory Hierarchy
Multicore SMP Systems(memory hierarchy) Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM Cache + Pthreads implementationsb Local Store + libspe implementations
Multicore SMP Systems(peak flops) Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM 75 Gflop/s 17 Gflop/s 11 Gflop/s PPEs: 13 Gflop/s SPEs: 29 Gflop/s
Multicore SMP Systems(peak DRAM bandwidth) Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM 21 GB/s(read) 10 GB/s(write) 21 GB/s 42 GB/s(read) 21 GB/s(write) 51 GB/s
Multicore SMP Systems Intel Clovertown AMD Opteron Core2 Core2 Core2 Core2 Core2 Core2 Core2 Core2 Opteron Opteron Opteron Opteron 4GB/s (each direction) 4MB Shared L2 4MB Shared L2 4MB Shared L2 4MB Shared L2 1MB victim 1MB victim 1MB victim 1MB victim HT HT SRI / crossbar SRI / crossbar FSB FSB 10.6 GB/s 10.6 GB/s 128b memory controller 128b memory controller Chipset (4x64b controllers) 10.66 GB/s 10.66 GB/s 21.3 GB/s(read) 10.6 GB/s(write) 667MHz DDR2 DIMMs 667MHz DDR2 DIMMs 667MHz FBDIMMs Sun Niagara2 (Huron) IBM QS20 Cell Blade PPE SPE SPE SPE SPE SPE SPE SPE SPE PPE MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc MT Sparc 256K 256K 256K 256K 256K 256K 256K 256K 512KB L2 512KB L2 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 8K L1 MFC MFC MFC MFC MFC MFC MFC MFC Crossbar Switch EIB (Ring Network) <20GB/s each direction EIB (Ring Network) 90 GB/s (writethru) 179 GB/s (fill) BIF XDR MFC MFC MFC MFC 4MB Shared L2 (16 way) (address interleaving via 8x64B banks) MFC MFC MFC MFC XDR BIF 256K 256K 256K 256K 256K 256K 256K 256K 4x128b memory controllers (2 banks each) SPE SPE SPE SPE SPE SPE SPE SPE 21.33 GB/s (write) 42.66 GB/s (read) 25.6GB/s 25.6GB/s 667MHz FBDIMMs 512MB XDR DRAM 512MB XDR DRAM Non-Uniform Memory Access Uniform Memory Access
Arithmetic Intensity • Arithmetic Intensity ~ Total Flops / Total DRAM Bytes • Some HPC kernels have an arithmetic intensity that scales with with problem size (increasing temporal locality) • But there are many important and interesting kernels that don’t O( N ) O( 1 ) O( log(N) ) A r i t h m e t i c I n t e n s i t y FFTs SpMV, BLAS1,2 Dense Linear Algebra (BLAS3) Stencils (PDEs) Particle Methods Lattice Methods
Auto-tuning • Hand optimizing each architecture/dataset combination is not feasible • Goal: Productive Solution for Performance portability • Our auto-tuning approach finds a good performance solution by a combination of heuristics and exhaustive search • Perl script generates many possible kernels • (Generate SIMD optimized kernels) • Auto-tuning benchmark examines kernels and reports back with the best one for the current architecture/dataset/compiler/… • Performance depends on the optimizations generated • Heuristics are often desirable when the search space isn’t tractable • Proven value in Dense Linear Algebra(ATLAS), Spectral(FFTW,SPIRAL), and Sparse Methods(OSKI)
Sparse Matrix-Vector Multiplication (SpMV) Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse MatrixVector Multiplication • Sparse Matrix • Most entries are 0.0 • Performance advantage in only storing/operating on the nonzeros • Requires significant meta data • Evaluate y=Ax • A is a sparse matrix • x & y are dense vectors • Challenges • Difficult to exploit ILP(bad for superscalar), • Difficult to exploit DLP(bad for SIMD) • Irregular memory access to source vector • Difficult to load balance • Very low computational intensity (often >6 bytes/flop) = likely memory bound A x y
Dataset (Matrices) 2K x 2K Dense matrix stored in sparse format • Pruned original SPARSITY suite down to 14 • none should fit in cache • Subdivided them into 4 categories • Rank ranges from 2K to 1M Dense Well Structured (sorted by nonzeros/row) Protein FEM / Spheres FEM / Cantilever Wind Tunnel FEM / Harbor QCD FEM / Ship Economics Epidemiology Poorly Structured hodgepodge FEM / Accelerator Circuit webbase Extreme Aspect Ratio (linear programming) LP
Naïve Serial Implementation Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPE) • Vanilla C implementation • Matrix stored in CSR (compressed sparse row) • Explored compiler options, but only the best is presented here • x86 core delivers > 10x the performance of a Niagara2 thread
Naïve Parallel Implementation Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • SPMD style • Partition by rows • Load balance by nonzeros • N2 ~ 2.5x x86 machine Naïve Pthreads Naïve
Naïve Parallel Implementation Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • SPMD style • Partition by rows • Load balance by nonzeros • N2 ~ 2.5x x86 machine 8x cores = 1.9x performance 4x cores = 1.5x performance 64x threads = 41x performance 4x threads = 3.4x performance Naïve Pthreads Naïve
Naïve Parallel Implementation Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • SPMD style • Partition by rows • Load balance by nonzeros • N2 ~ 2.5x x86 machine 1.4% of peak flops 29% of bandwidth 4% of peak flops 20% of bandwidth 25% of peak flops 39% of bandwidth 2.7% of peak flops 4% of bandwidth Naïve Pthreads Naïve
Auto-tuned Performance(+NUMA & SW Prefetching) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • Use first touch, or libnuma to exploit NUMA. • Also includes process affinity. • Tag prefetches with temporal locality • Auto-tune: search for the optimal prefetch distances +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+Matrix Compression) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • If memory bound, only hope is minimizing memory traffic • Heuristically compress the parallelized matrix to minimize it • Implemented with SSE • Benefit of prefetching is hidden by requirement of register blocking • Options: register blocking, index size, format, etc… +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+Cache/TLB Blocking) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • Reorganize matrix to maximize locality of source vector accesses +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+DIMMs, Firmware, Padding) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • Clovertown was already fully populated with DIMMs • Gave Opteron as many DIMMs as Clovertown • Firmware update for Niagara2 • Array padding to avoid inter-thread conflict misses • PPE’s use ~1/3 of Cell chip area +More DIMMs(opteron), +FW fix, array padding(N2), etc… +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+DIMMs, Firmware, Padding) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (PPEs) • Clovertown was already fully populated with DIMMs • Gave Opteron as many DIMMs as Clovertown • Firmware update for Niagara2 • Array padding to avoid inter-thread conflict misses • PPE’s use ~1/3 of Cell chip area 4% of peak flops 52% of bandwidth 20% of peak flops 65% of bandwidth 54% of peak flops 57% of bandwidth +More DIMMs(opteron), +FW fix, array padding(N2), etc… +Cache/TLB Blocking +Compression 10% of peak flops 10% of bandwidth +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+Cell/SPE version) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (SPEs) • Wrote a double precision Cell/SPE version • DMA, local store blocked, NUMA aware, etc… • Only 2x1 and larger BCOO • Only the SpMV-proper routine changed • About 12x faster (median) than using the PPEs alone. +More DIMMs(opteron), +FW fix, array padding(N2), etc… +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(+Cell/SPE version) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (SPEs) • Wrote a double precision Cell/SPE version • DMA, local store blocked, NUMA aware, etc… • Only 2x1 and larger BCOO • Only the SpMV-proper routine changed • About 12x faster than using the PPEs alone. 4% of peak flops 52% of bandwidth 20% of peak flops 65% of bandwidth 54% of peak flops 57% of bandwidth +More DIMMs(opteron), +FW fix, array padding(N2), etc… 40% of peak flops 92% of bandwidth +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Auto-tuned Performance(How much did double precision and 2x1 blocking hurt) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (SPEs) • Model faster cores by commenting out the inner kernel calls, but still performing all DMAs • Enabled 1x1 BCOO • ~16% improvement +better Cell implementation +More DIMMs(opteron), +FW fix, array padding(N2), etc… +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Speedup from Auto-tuningMedian & (max) Intel Clovertown AMD Opteron Sun Niagara2 (Huron) IBM Cell Blade (SPEs) • Wrote a double precision Cell/SPE version • DMA, local store blocked, NUMA aware, etc… • Only 2x1 and larger BCOO • Only the SpMV-proper routine changed • About 12x faster than using the PPEs alone. 1.6x (2.7x) 3.9x (4.4x) 1.3x (2.9x) 26x (34x) +More DIMMs(opteron), +FW fix, array padding(N2), etc… +Cache/TLB Blocking +Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve
Aggregate Performance (Fully optimized) • Cell consistently delivers the best full system performance • Although, Niagara2 delivers near comparable per socket performance • Dual core Opteron delivers far better performance (bandwidth) than Clovertown • Clovertown has far too little effective FSB bandwidth • Huron has far more bandwidth than it can exploit • (too much latency, too few cores)
Parallel Efficiency(average performance per thread, Fully optimized) • Aggregate Mflop/s / #cores • Niagara2 & Cell showed very good multicore scaling • Clovertown showed very poor multicore scaling on both applications • For SpMV, Opteron and Clovertown showed good multisocket scaling
Power Efficiency(Fully Optimized) • Used a digital power meter to measure sustained power under load • Calculate power efficiency as: sustained performance / sustained power • All cache-based machines delivered similar power efficiency • FBDIMMs (~12W each) sustained power • 8 DIMMs on Clovertown (total of ~330W) • 16 DIMMs on N2 machine (total of ~450W)
Productivity • Niagara2 required significantly less work to deliver good performance. • Cache based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
Summary • Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance. • Niagara2 delivered both very good performance and productivity • Cell delivered very good performance and efficiency (processor and power) • Our multicore specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies including an auto-tuned MPI implementation (as discussed @SC07) • Architectural transparency is invaluable in optimizing code
Acknowledgements • UC Berkeley • RADLab Cluster (Opterons) • PSI cluster(Clovertowns) • Sun Microsystems • Niagara2 donations • Forschungszentrum Jülich • Cell blade cluster access