Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs

Wangqun Lin, Fengshun Lu
College of Computer, National University of Defense Technology
CACHES 2011, Tucson, Arizona, June 4th, 2011

Outline
• Motivation
• Spherical Harmonic Transforms (SHT)
• Methods
  • Direct Method
  • Efficiency of Thread Utilization (ETU)
  • Reshaped Method
  • Concurrent Kernel Execution
• Experiments

Motivation
• Computing the SHT with GPUs
  • The SHT is widely used
  • But its complexity is O(N^3)
  • GPUs are powerful
• Performance metrics at the streaming multiprocessor (SM) level
  • Existing practice emphasizes only OCCUPANCY
  • Goal: find another metric that measures how efficiently the launched threads are used

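Spherical Harmonic Transforms (1/2)
• $\xi$: state variable
• $\xi_n^m$: spectral coefficients of the state variable $\xi$
• $\mu$: Gaussian latitude
• $\lambda$: longitude
• $M$: model truncation wavenumber
• $N(m)$: highest degree of the associated Legendre function for wavenumber $m$
• $P_n^m(\mu)\,e^{im\lambda}$: spherical harmonic basis function (associated Legendre function $P_n^m(\mu)$ times the Fourier factor $e^{im\lambda}$)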
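The symbols listed on this slide fit the standard truncated spectral expansion used in spectral atmospheric models. As a sketch, assuming that usual convention (normalization factors differ between codes and are not specified here):

```latex
\xi(\lambda, \mu) \;=\; \sum_{m=-M}^{M} \; \sum_{n=|m|}^{N(m)} \xi_n^m \, P_n^m(\mu) \, e^{im\lambda}
```

For a triangular truncation such as T213, $N(m) = M$ for every $m$, which is why the non-zero coefficients occupy only the triangle $m \le n$.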

Spherical Harmonic Transforms (2/2)
• Forward Fourier
• Forward Legendre
• Inverse Legendre
• Inverse Fourier
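The four stages named on this slide can be written out as follows. This is a sketch under assumed notation, not a transcription of the slide: $I$ longitudes $\lambda_i$, $J$ Gaussian latitudes $\mu_j$ and Gaussian weights $w_j$ are names introduced here, and the placement of the $1/I$ factor is one common convention among several.

```latex
\begin{aligned}
\text{Forward Fourier:}  \quad & \xi^m(\mu_j) = \frac{1}{I} \sum_{i=1}^{I} \xi(\lambda_i, \mu_j)\, e^{-im\lambda_i} \\
\text{Forward Legendre:} \quad & \xi_n^m = \sum_{j=1}^{J} w_j \, \xi^m(\mu_j)\, P_n^m(\mu_j) \\
\text{Inverse Legendre:} \quad & \xi^m(\mu_j) = \sum_{n=|m|}^{N(m)} \xi_n^m \, P_n^m(\mu_j) \\
\text{Inverse Fourier:}  \quad & \xi(\lambda_i, \mu_j) = \sum_{m=-M}^{M} \xi^m(\mu_j)\, e^{im\lambda_i}
\end{aligned}
```

The Fourier stages can use an FFT; the two Legendre stages are plain sums, and they are the part the Methods slides map onto CUDA threads.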

Methods – Direct (1/9)
• Forward Legendre
  • m ≤ n
[Figure labels: CUDA Thread, Thread Block]
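A direct mapping of the forward Legendre transform can be emulated on the host to see which launched threads have work. The sketch below assumes one thread block per wavenumber m and one thread per degree n; that assignment, the function name `forward_legendre_direct` and the array layouts are illustrative choices, not taken from the slide.

```python
def forward_legendre_direct(f_m, P, w, M):
    """Emulate an (M+1) x (M+1) thread grid, one 'thread' per (m, n) pair.

    f_m[m][j]  : Fourier coefficient for wavenumber m at latitude j
    P[m][n][j] : associated Legendre value P_n^m at latitude j
    w[j]       : Gaussian weight at latitude j
    Returns the coefficient table and the number of threads that did work.
    """
    J = len(w)
    coeffs = [[0.0] * (M + 1) for _ in range(M + 1)]  # coeffs[m][n]
    useful = 0
    for m in range(M + 1):        # assumed role: thread block index
        for n in range(M + 1):    # assumed role: thread index in the block
            if m > n:
                continue          # thread is launched but has nothing to compute
            coeffs[m][n] = sum(w[j] * f_m[m][j] * P[m][n][j] for j in range(J))
            useful += 1
    return coeffs, useful


# Tiny made-up input: M = 2, two latitudes, all Legendre values set to 1.
M = 2
w = [0.5, 0.5]
f_m = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
P = [[[1.0, 1.0] for _ in range(M + 1)] for _ in range(M + 1)]
coeffs, useful = forward_legendre_direct(f_m, P, w, M)
launched = (M + 1) ** 2
```

With this layout only the threads satisfying m ≤ n contribute, so `useful` stays well below `launched` as M grows.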

Methods – Direct (2/9)
• Inverse Legendre
  • m ≤ n
[Figure label: CUDA threads of block j]

Methods – ETU Metric (3/9)
• Efficiency of Thread Utilization (ETU)
  • Measures the proportion of launched threads doing useful work over the entire execution interval
  • Mainly used as an algorithm design guideline
• Assumption
  • Algorithms consist of many micro steps
• tu(t, s) function
  • t: thread
  • s: micro step
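One reading consistent with this description, offered as an assumption rather than the authors' exact formula: let $tu(t, s) = 1$ when thread $t$ does useful work in micro step $s$ and $0$ otherwise, with $T$ launched threads and $S$ micro steps ($T$ and $S$ are symbols introduced here).

```latex
\mathrm{ETU} \;=\; \frac{1}{T \, S} \sum_{t=1}^{T} \sum_{s=1}^{S} tu(t, s), \qquad tu(t, s) \in \{0, 1\}
```

Under this reading ETU lies between 0 and 1, and it equals 1 only when every launched thread is busy in every micro step.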

Methods – ETU (4/9)
• ETU Metric
• Example
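A small worked example may help here. The sketch below takes ETU to be the share of all (thread, micro step) slots in which useful work happens, which is an assumed reading of the metric; the function name `etu` and the triangular workload are made up for illustration.

```python
def etu(tu, num_threads, num_steps):
    """Share of (thread, micro step) slots in which tu(t, s) is true."""
    useful = sum(
        1
        for t in range(num_threads)
        for s in range(num_steps)
        if tu(t, s)
    )
    return useful / (num_threads * num_steps)


# Hypothetical workload: thread t is busy only in micro steps 0..t,
# so the busy slots form a triangle (1 + 2 + 3 + 4 = 10 of 16 slots).
triangular = etu(lambda t, s: s <= t, 4, 4)

# Hypothetical balanced workload: every thread is busy in every step.
balanced = etu(lambda t, s: True, 4, 4)
```

In this toy case the triangular workload leaves 6 of the 16 slots idle, while the balanced one uses all of them.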

Methods – Reshaped (5/9)
• Forward Legendre
[Figure: reshape, ETU ≈ 1/2 before and ETU ≈ 1 after]
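The jump from ETU ≈ 1/2 to ETU ≈ 1 can be reproduced with a simple folding of the coefficient triangle. The sketch below is one possible reshape, assumed for illustration: row M - m is appended to row m so that every row holds M + 2 cells. The slide's actual layout is not preserved, and each cell is assumed to cost one equal unit of work.

```python
def direct_layout(M):
    """(M+1) x (M+1) grid; a slot holds (m, n) only when m <= n."""
    return [[(m, n) if m <= n else None for n in range(M + 1)]
            for m in range(M + 1)]


def reshaped_layout(M):
    """Fold row M - m onto row m, giving rows of M + 2 slots each."""
    rows = []
    for m in range((M + 2) // 2):          # ceil((M + 1) / 2) rows
        row = [(m, n) for n in range(m, M + 1)]
        partner = M - m
        if partner != m:                   # middle row has no partner when M is even
            row += [(partner, n) for n in range(partner, M + 1)]
        row += [None] * (M + 2 - len(row))  # pad the unpaired middle row
        rows.append(row)
    return rows


def layout_etu(layout):
    """Fraction of slots that carry a real (m, n) cell."""
    slots = sum(len(row) for row in layout)
    used = sum(1 for row in layout for cell in row if cell is not None)
    return used / slots


M = 213  # triangular truncation T213
etu_direct = layout_etu(direct_layout(M))      # 215/428, a little above 1/2
etu_reshaped = layout_etu(reshaped_layout(M))  # every slot is occupied
```

For odd M the fold is exact: row m contributes M + 1 - m cells and row M - m contributes m + 1, so no padding is needed. For even M only the unpaired middle row is padded.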

Methods – Reshaped (6/9)
• Inverse Legendre
  • T213 model
[Figure label: reshape]

Methods – Reshaped (7/9)
• Inverse Legendre
  • T213 model
[Figure label: reconstruct]

Methods – Reshaped (8/9)
• Inverse Legendre
  • T213 model
  • Computation for trapezium α and β

Methods – Concurrent Kernel (9/9)
• Concurrent Kernel Execution
  • Supported by Fermi and later architectures
  • Programs with many small kernels can be executed efficiently on GPUs
  • Takes future software scalability into consideration
• T213 model

Experiments (1/4)
• Validation of the ETU metric
  • T341 model
  • Variable block size
• Observations
  • In general, a larger ETU indicates better performance
  • No direct relationship appears between OCCUPANCY and performance
  • The same OCCUPANCY does not mean equal performance
  • At the same OCCUPANCY, a larger ETU gives better performance

Experiments (2/4)
• Performance
[Figure panels: Forward Legendre, Inverse Legendre]

Experiments (3/4)
• Case Study: STSWM
  • A global shallow water model based on the SHT
  • Exhibits many mathematical and computational properties of more complete models
  • Used to investigate and compare numerical methods for simulating atmospheric models
• T213 truncation
  • Forward Legendre: ftrnve, ftrndi and ftrnpi
  • Inverse Legendre: shtrns

Experiments (4/4)
• Case Study: STSWM

Review
• Motivation
• Spherical Harmonic Transforms (SHT)
• Methods
  • Direct Method
  • Efficiency of Thread Utilization (ETU)
  • Reshaped Method
  • Concurrent Kernel Execution
• Experiments

Thanks for your attention! Any questions?
Email: lufengshun@nudt.edu.cn, linwangqun2005@gmail.com
