
Thinking Outside of the Tera-Scale Box



Presentation Transcript


  1. Thinking Outside of the Tera-Scale Box Piotr Luszczek

  2. Brief History of Tera-flop: 1997 (ASCI Red)

  3. Brief History of Tera-flop: 2007 (Intel Polaris, 2007; ASCI Red, 1997)

  4. Brief History of Tera-flop: GPGPU (GPGPU, 2008; Intel Polaris, 2007; ASCI Red, 1997)

  5. Tera-flop: a Dictionary Perspective
     Supercomputer: • Belongs to a supercomputer expert • Consumes 500 kW • Costs over $1M • Requires distributed memory programming • Memory: 1 GiB or more
     GPU: • Belongs to a gaming geek • Consumes under 500 W • Costs under $200 • Requires only a CUDA kernel • Memory: 1 GiB or more

  6. Double-Dip Recession? Productivity vs. time: a dip at multicore (2005: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …) and another at hybrid multicore (2007: HMPP, PGI Accelerator, PathScale?)

  7. Locality Space in HPC Challenge: Parallel-Transpose, STREAM, HPL, Mat-Mat-Mul, RandomAccess, FFT

  8. Locality Space in HPC Challenge: Bounds. Bandwidth-bound: Parallel-Transpose, STREAM, PCIe; compute-bound: HPL, Mat-Mat-Mul, AORSA2D; latency-bound: RandomAccess, FFT
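The bandwidth-bound vs. latency-bound distinction on this slide can be illustrated with two toy kernels, a STREAM-style contiguous sweep and a RandomAccess-style scattered update. This is a sketch of the access patterns only; the function names and shapes are illustrative, not taken from the HPC Challenge sources:

```python
def stream_triad(a, b, c, alpha):
    """STREAM-style triad: contiguous, predictable access.

    Performance is limited by how fast memory can stream data
    (bandwidth-bound), since every element is touched exactly once
    in order.
    """
    for i in range(len(a)):
        a[i] = b[i] + alpha * c[i]


def random_access_update(table, updates):
    """RandomAccess-style updates: scattered, unpredictable access.

    Each update touches an essentially random table location, so
    caches and prefetchers cannot help; performance is limited by
    memory latency (latency-bound).
    """
    n = len(table)
    for u in updates:
        table[u % n] ^= u
```

The same arithmetic intensity can thus land in very different parts of the locality space depending purely on the access pattern.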

  9. CUDA vs. OpenCL: Similar Software Stacks

  10. Matrix-Matrix Multiply: the Simple Form
      for i = 1:M
        for j = 1:N
          for k = 1:T
            C(i, j) += A(i, k) * B(k, j)
      Implementations: CUBLAS, Volkov et al., MAGMA, Nakasato
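The slide's triple loop translates directly into executable form; a minimal Python sketch (the function name is mine, and matrices are plain nested lists):

```python
def matmul_simple(A, B, C):
    """Naive triple-loop matrix multiply: C += A * B.

    A is M x T, B is T x N, C is M x N, matching the slide's
    loop bounds i = 1:M, j = 1:N, k = 1:T.
    """
    M, T, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            for k in range(T):
                C[i][j] += A[i][k] * B[k][j]
```

Every tuned GPU implementation named on the slide computes exactly this recurrence; the differences are entirely in blocking, memory staging, and scheduling.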

  11. Fermi C2050: from CUDA to OpenCL

  12. Fermi C2050: Simple OpenCL Porting

  13. ATI 5870: Simple OpenCL Porting

  14. Making Case for Autotuning: Use of Texture

  15. Making Case for Autotuning: Blocking Factors Provided by NVIDIA
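An autotuner of the kind this slide argues for can be sketched as a brute-force search over blocking factors: time a blocked multiply for each candidate and keep the fastest. All names here are assumptions for illustration, not the actual MAGMA/PLASMA tuner:

```python
import time


def matmul_blocked(A, B, C, nb):
    """Blocked triple loop for C += A * B; nb is the blocking factor."""
    n = len(A)
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s


def autotune(n, candidates):
    """Time each candidate blocking factor on an n x n problem
    and return the fastest one found."""
    A = [[1.0] * n for _ in range(n)]
    B = [[1.0] * n for _ in range(n)]
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        C = [[0.0] * n for _ in range(n)]
        t0 = time.perf_counter()
        matmul_blocked(A, B, C, nb)
        t = time.perf_counter() - t0
        if t < best_t:
            best_nb, best_t = nb, t
    return best_nb
```

Real autotuners search a much larger space (texture use, register blocking, thread-block shape) and run on the target device, but the structure, enumerate, time, keep the best, is the same.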

  16. Mat-Mul Portability Summary

  17. Now, let’s think outside of the box

  18. Recipe for Performance Before Multicore • Load/Store • Instr. scheduling • Wider superscalar instr. issue • Increase clock frequency • Increase number of transistors

  19. Recipe for Performance After Multicore • More cores • Parallel software

  20. Recipe for Future Performance • More cores • Better sequential software • DAG Runtime

  21. From Assembly to DAG: PLASMA
      Sequential instruction stream (typical loop in assembly, handled by instruction scheduling):
      • load R0, #1000
      • divf F1, F2, F3
      • load R1, #1000
      • divf F2, F4, F5
      • dec R1
      • jump …
      Sequential task stream (LU factorization, handled by DAG scheduling):
      • for i = 1:N
      •   GETRF2(A[i, i])
      •   for j = i+1:N
      •     TRSM(A[i, i], A[i, j])
      •   for k = i+1:N
      •     GEMM(A[k, i], A[i, j], A[k, j])
      DAG nodes: GETRF2, TRSM, GEMM, GEMM, …
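The sequential task stream on this slide can be sketched by unrolling the loop nest into a list of tasks whose tile arguments let a runtime infer the DAG edges from read/write dependences. This is an illustrative Python sketch; the nesting of the j/k loops is assumed to follow the standard tiled LU update, which the slide's flattened text does not fully spell out:

```python
def lu_task_stream(N):
    """Unroll the tiled-LU loop nest into a sequential task stream.

    Each entry names the operation and the (row, col) tiles it
    touches; a DAG runtime (as in PLASMA) derives edges from the
    tiles that one task writes and a later task reads.
    """
    tasks = []
    for i in range(N):
        # Panel factorization of the diagonal tile.
        tasks.append(("GETRF2", (i, i)))
        # Triangular solves against the row of tiles to the right.
        for j in range(i + 1, N):
            tasks.append(("TRSM", (i, i), (i, j)))
        # Trailing-matrix updates (assumed standard nesting).
        for j in range(i + 1, N):
            for k in range(i + 1, N):
                tasks.append(("GEMM", (k, i), (i, j), (k, j)))
    return tasks
```

The point of the slide is that this stream, unlike the assembly stream above it, exposes dependences at the granularity of whole tiles, so out-of-order execution can be done in software across cores rather than in hardware across instructions.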

  22. Scalability of LU Factorization on Multicore (PLASMA vs. MKL 11.0 vs. LAPACK): 6 cores = 1 socket, 24 cores = 4 sockets, 48 cores = 8 sockets; AMD Istanbul 2.8 GHz

  23. Combining Multicore and GPU for QR Fact. AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX 480

  24. Performance on Distributed Memory: QR, LU, Cholesky (peak, DPLASMA, Cray LibSci). ORNL Kraken Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core; interconnect: SeaStar, 3D torus

  25. People and Projects • Jack Dongarra • Mathieu Faverge • Jakub Kurzak • Hatem Ltaief • Stan Tomov • Asim YarKhan • Peng Du • Rick Weber • MAGMA • PLASMA • DPLASMA and DAGuE • George Bosilca • Aurelien Bouteiller • Anthony Danalis • Thomas Herault • Pierre Lemarinier
