
CLFFT: An FFT code generator for heterogeneous systems


Presentation Transcript


  1. CLFFT: An FFT code generator for heterogeneous systems • Krishna G Pai • Rejith George Joseph • Girish Ravunnikutty

  2. Agenda • FFT • Intro to CLFFT • Brief intro to OpenCL • Comparisons of FFT Algorithms • Comparison with CUFFT • Future work

  3. Discrete Fourier Transform • A naive implementation takes O(n^2) time. • Fast Fourier Transforms (FFTs) are O(n log n) implementations of the DFT. (Image from Intel.com)
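As a point of reference for the O(n^2) cost, here is a minimal sketch of a direct DFT. The function name `naive_dft` and the forward sign convention exp(-2πi·jk/n) are assumptions made for illustration; the slides do not show the authors' code.

```python
import cmath

def naive_dft(x):
    """Direct DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n).

    Each of the n outputs sums over n inputs, hence O(n^2) work.
    """
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

# A constant signal has all of its energy in the k = 0 bin.
spectrum = naive_dft([1, 1, 1, 1])
```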

  4. Why FFTs? • Lots of ongoing research • FFTW (http://fftw.org/) • Spiral (http://spiral.net/) • Spectral methods are one of the 13 Dwarfs of Parallel Computing • Rich set of algorithms, each optimal for certain values of N • And of course, wide applicability

  5. Our Approach • FFTW generates code that adapts to a particular architecture (CPUs) • Spiral does the same but optimizes at compile time (also CPUs) • Other research targets GPUs, most notably Govindaraju et al. • Use all the available computing resources to make FFTs really fast!

  6. Heterogeneous Computing • Intel Core 2 Duo • Nvidia Tesla • Use both these resources simultaneously

  7. Intro to CLFFT • Future systems are going to be heterogeneous in nature (multi-core CPUs with GPGPUs as co-processors). • Study various FFT algorithms and implement them on a GPGPU and on multi-core CPUs. • Explore how FFTs can be scheduled across both of these computing resources and the performance thus obtained. • OpenCL to program the GPGPU and OpenMP to parallelize on CPUs.

  8. FFTs Studied • SlowFFT (naive implementation) • Cooley-Tukey (radix 2, for N = 2^k) • Stockham (radix 2, for N = 2^k) • Sande-Tukey (radix 2, decimation in frequency, for N = 2^k) • Bluestein's (radix 2, for any N) • Cooley-Tukey and SlowFFT also parallelized with OpenMP.
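To make the radix-2 idea concrete, the following is a minimal recursive decimation-in-time Cooley-Tukey sketch for N = 2^k. It is an illustration only, not the CLFFT kernel; the name `fft_radix2` is hypothetical.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

spectrum = fft_radix2([1, 2, 3, 4])
```

The recursion halves the problem at each level and does O(n) butterfly work per level, which is where the O(n log n) bound comes from.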

  9. Computational Parity • Intel Xeon has about 70 GFlops at peak performance • Nvidia Tesla has about 933 GFlops • So there is not much computational parity on the HPC Tesla machines • Better parity on laptops with GPGPUs • Thus more work can be shared between CPU and GPU there. (Source: Intel and Nvidia)
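To put the parity numbers in perspective: if work were divided purely in proportion to peak throughput, the CPU would receive about 70 / (70 + 933), roughly 7% of it. The sketch below computes such a split. Proportional splitting by peak GFlops is an assumption made here for illustration; the slides do not say how the actual bias was chosen, and the calculation ignores transfer costs. The function name is hypothetical.

```python
def proportional_split(n_items, cpu_gflops, gpu_gflops):
    """Divide n_items between CPU and GPU in proportion to peak GFlops.

    Assumption: peak throughput is a fair proxy for achievable speed,
    and host-device transfer time is ignored.
    """
    cpu_share = cpu_gflops / (cpu_gflops + gpu_gflops)
    cpu_items = round(n_items * cpu_share)
    return cpu_items, n_items - cpu_items

# Peak figures quoted on the slide: Xeon ~70 GFlops, Tesla ~933 GFlops.
cpu_items, gpu_items = proportional_split(1000, 70, 933)
```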

  10. OpenCL • Standard for parallel programming of heterogeneous systems involving CPU, GPU(s), CPU + GPU, IBM Cell blade, etc. • So we can have portability across various architectures without a very great performance penalty* (*More on this when we compare matrix multiplication…)

  11. Differences w.r.t. CUDA • No standalone compiler to produce binaries. • We compile at run time. • Command queues for launching kernels and memory operations. • Device memory is managed via buffer objects, which provide richer functionality than in CUDA. • Allows a host memory region to be used by the device directly. • OpenCL requires memcpy between device and host to be explicitly synchronized.

  12. OpenCL Implementations • On August 5, 2009, AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program. • On August 28, 2009, Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL. • On September 28, 2009, NVIDIA released OpenCL drivers and an SDK implementation.

  13. Limitations • Nvidia's OpenCL supports only the GPU as an OpenCL device. • The driver doesn't consider the CPU an OpenCL device. • Hence we cannot invoke an OpenCL kernel on the CPU. • Had to use OpenMP for the CPU. • AMD Stream OpenCL has support for OpenCL on the CPU.

  14. Work Flow • Currently, we split work between CPU and GPUs only for Cooley-Tukey. • Cooley-Tukey here is radix-2. • Results are merged. • On Tesla, the bias is highly in favor of GPU computation. • On the host, one thread invokes the OpenMP kernel, and as many other threads as there are GPUs invoke the OpenCL kernels.
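One way to picture the split-and-merge flow is the first level of a radix-2 decomposition: the even-indexed and odd-indexed halves are transformed independently and then merged with one butterfly pass. The sketch below is only an analogy under stated assumptions: both workers are plain Python threads standing in for the OpenMP and OpenCL kernels, the split is an even 50/50 rather than GPU-biased, and the names `fft_radix2` and `split_fft` are hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def split_fft(x):
    """Transform the two halves on separate workers, then merge on the host."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples to split")
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_job = pool.submit(fft_radix2, x[0::2])  # stand-in for the OpenMP path
        gpu_job = pool.submit(fft_radix2, x[1::2])  # stand-in for the OpenCL path
        even, odd = cpu_job.result(), gpu_job.result()
    # Merge step: one butterfly pass combines the two partial results.
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

merged = split_fft([1, 2, 3, 4, 5, 6, 7, 8])
```

A GPU-biased variant would hand the GPU more than one of the sub-transforms at a deeper level of the decomposition; that refinement is left out here.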

  15. Comparison of all Radix 2

  16. Comparison of Cooley-Tukey

  17. CLFFT vs CUFFT for Power of 2

  18. Performance Comparison

  19. CUFFT vs CLFFT for any n

  20. Future Work • Split-radix and mixed-radix algorithms • 2^p 3^q 5^r point FFTs • Winograd's Prime Number FFT • Optimize CPU implementations • Create a plan for a given n and implement it across multiple compute devices.
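A planning step of the kind listed above could start by classifying n. The sketch below is purely hypothetical (the slides do not describe how a plan would be built): it picks radix-2 for powers of two, a mixed-radix path when n has the form 2^p 3^q 5^r, and falls back to Bluestein's algorithm otherwise. The function name and the returned labels are made up for illustration.

```python
def choose_algorithm(n):
    """Pick an FFT strategy label from the factorization of n (illustrative only)."""
    if n < 1:
        raise ValueError("n must be positive")
    if n & (n - 1) == 0:
        return "radix-2"
    m = n
    for p in (2, 3, 5):
        while m % p == 0:
            m //= p  # strip every factor of 2, 3 and 5
    # If nothing is left, n = 2^p * 3^q * 5^r and mixed radix applies.
    return "mixed-radix" if m == 1 else "bluestein"

plan = [choose_algorithm(n) for n in (1024, 360, 1009)]
```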

  21. Thank You
