Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors Amik Singh ParLab, EECS CS 252 Project Presentation 05/04/2012 May the 4th be with you
Outline • Introduction • Experimental Setup • Challenges • Optimizations • Results & Conclusions • Future Work
Multigrid Method • A multilevel technique to accelerate the convergence of iterative solvers • Conventional iterative solvers operate on the grid at full resolution and require many iterations to converge • Multigrid iterates toward convergence via a hierarchy of grid resolutions
Multigrid Method (figure: grid hierarchy, finer level to coarser level) • Coarsened grids damp out long-wavelength (low spatial frequency) errors • Fine grids damp out high-frequency errors
Multigrid Method • The multigrid method operates in what is called a V-cycle, consisting of three main phases • Smooth: relaxation such as Jacobi or Gauss-Seidel Red-Black (GSRB, used in our study)
Multigrid Method • Smooth • Restrict: copy information from the finest grid to progressively coarsened grids • Interpolate: the reverse of restrict; copy the correction from a coarse grid back to a finer grid
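For reference, the three phases compose recursively into a V-cycle. The C sketch below is a minimal structural illustration only; the helper functions (smooth, restrict_residual, interpolate_and_correct, bottom_solve) are hypothetical stand-ins for the solver's actual GSRB smoother, restriction, interpolation, and coarse-grid bottom solve.

```c
#include <string.h>

/* Hypothetical helpers standing in for the solver's real routines. */
void smooth(double *u, const double *f, int n);
void restrict_residual(double *f_coarse, const double *u, const double *f, int n);
void interpolate_and_correct(double *u, const double *u_coarse, int n);
void bottom_solve(double *u, const double *f, int n);

/* One V-cycle on a hierarchy of n^3 grids; u[l] and f[l] are the solution and
 * right-hand side on level l (level 0 is the finest grid). */
void vcycle(double **u, double **f, int level, int nlevels, int n) {
    if (level == nlevels - 1) {                  /* coarsest grid */
        bottom_solve(u[level], f[level], n);
        return;
    }
    smooth(u[level], f[level], n);               /* pre-smooth: damp high-frequency error */
    restrict_residual(f[level + 1], u[level], f[level], n);  /* residual to coarser grid */
    memset(u[level + 1], 0,                      /* zero initial coarse correction */
           (size_t)(n / 2) * (n / 2) * (n / 2) * sizeof(double));
    vcycle(u, f, level + 1, nlevels, n / 2);     /* recurse down the V */
    interpolate_and_correct(u[level], u[level + 1], n);      /* apply coarse correction */
    smooth(u[level], f[level], n);               /* post-smooth */
}
```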
Team • Samuel Williams, Brian Van Straalen, Ann Almgren, John Shalf, Leonid Oliker (Computational Research Division, Lawrence Berkeley National Laboratory) • Dhiraj D. Kalamkar, Anand M. Deshpande, Mikhail Smelyanskiy, Pradeep Dubey (Intel Corporation) • Amik Singh (ParLab, EECS)
Different Architectures Used
Problem Specification Au = f • Variable-coefficient, finite volume discretization of the canonical Helmholtz (Laplacian minus identity) operator • Right-hand side for our benchmarking is sin(πx)sin(πy)sin(πz) on the cubical domain [0,1]³ • Problem size fixed at a 256³ discretization for the time-to-solution comparison across architectures
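For concreteness, a minimal sketch (not the study's code) of how the benchmark right-hand side could be sampled at cell centers of an n³ grid on [0,1]³; ghost zones and the variable coefficients are omitted here.

```c
#include <math.h>

/* Fill f with sin(pi x) sin(pi y) sin(pi z) at cell centers of an n^3 grid on [0,1]^3. */
void init_rhs(double *f, int n) {
    const double pi = acos(-1.0);
    const double h = 1.0 / n;                         /* cell width */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++) {
                double x = (i + 0.5) * h, y = (j + 0.5) * h, z = (k + 0.5) * h;
                f[((size_t)k * n + j) * n + i] = sin(pi * x) * sin(pi * y) * sin(pi * z);
            }
}
```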
Smooth Pseudo-code • Reads 7 arrays, writes 1 array • 25 flops per update • Flops/byte ≈ 0.2, far below the ~3.6 machine balance of the GPUs (strongly memory-bound) • A sketch of one GSRB sweep follows below
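The slide's pseudo-code figure is not reproduced here. The C sketch below of one colored GSRB sweep over a variable-coefficient operator shows where the 7 reads (u, f, alpha, betaX, betaY, betaZ, lambda) and 1 write come from; the array names and exact stencil weights are illustrative assumptions, not the study's code.

```c
/* One colored (red or black) GSRB sweep for a*alpha*u - b*div(beta grad u) = f.
 * lambda holds the precomputed inverse diagonal; n is the grid dimension including
 * a one-cell ghost layer on each face; color is 0 (red) or 1 (black).
 * Per updated point: 7 arrays read, 1 written, on the order of 25 flops. */
static inline long idx(int i, int j, int k, int n) {
    return ((long)k * n + j) * n + i;
}

void gsrb_sweep(double *u, const double *f,
                const double *alpha, const double *betaX,
                const double *betaY, const double *betaZ,
                const double *lambda,
                double a, double b, double h2inv,
                int n, int color)
{
    for (int k = 1; k < n - 1; k++)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++) {
                if (((i + j + k) & 1) != color) continue;   /* red-black coloring */
                long c = idx(i, j, k, n);
                double Au = a * alpha[c] * u[c]
                    - b * h2inv * (
                          betaX[idx(i + 1, j, k, n)] * (u[idx(i + 1, j, k, n)] - u[c])
                        - betaX[c]                   * (u[c] - u[idx(i - 1, j, k, n)])
                        + betaY[idx(i, j + 1, k, n)] * (u[idx(i, j + 1, k, n)] - u[c])
                        - betaY[c]                   * (u[c] - u[idx(i, j - 1, k, n)])
                        + betaZ[idx(i, j, k + 1, n)] * (u[idx(i, j, k + 1, n)] - u[c])
                        - betaZ[c]                   * (u[c] - u[idx(i, j, k - 1, n)]));
                u[c] -= lambda[c] * (Au - f[c]);            /* in-place relaxation */
            }
}
```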
Challenges on GPU • No SIMDization due to red-black updates • Very small shared memory (48 KB) • Expensive inter-thread-block communication (figure: red-black update pattern)
Baseline Implementation • Only 1 ghost zone • Communicate among the different sub-domains after each smoothing operation
More Ghost Zones • A smooth consists of 2 red/black iterations, i.e. 4 colored sub-sweeps • With 4 ghost zones there is no need to communicate after each sub-sweep • Communication avoiding! (a sketch follows below)
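A sketch of the communication-avoiding smooth under these assumptions, reusing the gsrb_sweep sketch above; exchange_ghost_zones is a hypothetical halo-exchange routine, and here n is assumed to include the 4-deep ghost layers. The idea: fill a deep halo once, run the four colored sub-sweeps back-to-back, and accept redundant work in the halo. The region of valid values shrinks by one cell per sub-sweep, so after four sub-sweeps only halo cells are stale and all interior cells remain correct.

```c
void exchange_ghost_zones(double *u, int depth, int n);   /* hypothetical halo exchange */

/* One smooth = 2 red/black iterations = 4 colored sub-sweeps, with a single
 * ghost-zone exchange up front instead of one exchange per sub-sweep. */
void smooth_comm_avoiding(double *u, const double *f,
                          const double *alpha, const double *betaX,
                          const double *betaY, const double *betaZ,
                          const double *lambda,
                          double a, double b, double h2inv, int n)
{
    exchange_ghost_zones(u, 4, n);                 /* one exchange per smooth */
    for (int sweep = 0; sweep < 4; sweep++)        /* red, black, red, black */
        gsrb_sweep(u, f, alpha, betaX, betaY, betaZ, lambda,
                   a, b, h2inv, n, sweep & 1);
}
```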
Wavefront Approach
GPU Baseline vs. GPU Optimized
Different Architectures
Conclusions • CPUs: hardware prefetchers decouple memory access through speculative loads; sufficient on-chip memory for communication-avoiding implementations • GPUs: parallelism achieved through a multi-threaded paradigm; limited on-chip memory hampers realization of the communication-avoiding benefits
Future Work • Build a multi-GPU, MPI-enabled implementation of the solver • Explore the use of communication-avoiding techniques in matrix-free Krylov subspace methods such as BiCGStab for fast bottom solves • Submit to SC'12 tonight!
Thank You Questions?