Christopher Mitchell, CDA 6938, Spring 2009. The Discrete Cosine Transform (DCT).
The Discrete Cosine Transform
• In the same family as the Fourier Transform.
  • Converts data to the frequency domain.
• Represents data as a sum of cosine waves of varying frequency.
• Because it is a discrete transform, it suits problems formatted for computer analysis.
• Captures only the real (even) components of the function.
  • The Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful.
  • The Discrete Fourier Transform (DFT) captures both odd and even components → computationally intensive.
Significance / Where is this used?
• Image Processing
  • Compression, e.g. JPEG
  • Scientific Analysis, e.g. radio telescope data
• Audio Processing
  • Compression, e.g. MPEG Layer 3 (a.k.a. MP3)
• Scientific Computing / High Performance Computing (HPC)
  • Partial Differential Equation Solvers
Significance, Cont.
• Image Processing Example
  • Exhibits energy compaction: most of the signal energy lands in a few coefficients.
  • Small-amplitude coefficients can be dropped.
[Figure: original image and DCT-transformed image]
Implementation Platform
• NVIDIA CUDA, Version 2.0
Implementation Platform, Cont.
• What happened to the Cell/BE?
  • Too many technical challenges to solve before the deadline.
• The algorithm is embarrassingly parallel.
  • Conducive to launching hundreds of threads → GPU
• The algorithm requires too much data per pass compared to the local store size.
  • Would have to be creative with DMA, with no guarantee of mitigating the bottleneck.
Algorithm Walk Through • Mathematical Basis • 1D Version: • Where: • 2D Version: • Where α(u) and α(v) are defined as shown in the 1D case.
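The definitions can be written out as follows. This is a sketch that assumes the standard orthonormal DCT-II, which uses the same α(u) notation as the slide; the symbols f, C, N, x, y, u and v are assumptions, not taken from the slide text.

```latex
% 1D version, for u = 0, 1, ..., N-1
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)
       \cos\!\left[ \frac{\pi (2x+1) u}{2N} \right]

% Where:
\alpha(u) =
\begin{cases}
  \sqrt{1/N} & \text{if } u = 0 \\
  \sqrt{2/N} & \text{if } u \neq 0
\end{cases}

% 2D version, for u, v = 0, 1, ..., N-1
C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)
         \cos\!\left[ \frac{\pi (2x+1) u}{2N} \right]
         \cos\!\left[ \frac{\pi (2y+1) v}{2N} \right]
```

With this scaling the transform is orthonormal, so the sum of squared coefficients equals the sum of squared input samples.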
Algorithm Walk Through • CPU Version – 1D DCT
Algorithm Walk Through • CPU Version – 2D DCT
Algorithm Walk Through
• Problem
  • 1D DCT is O(n²)
  • 2D DCT is O(n³)
  • Additionally, the algorithm calls the cosine and square-root functions.
    • These are long-latency ALU operations.
Algorithm Walk Through • CUDA Version – 1D DCT
Algorithm Walk Through • CUDA Version – 2D DCT
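Algorithm Walk Through
• Solution
  • 1D DCT is now O(n)
  • 2D DCT is now O(n²)
  • These are running times with the work spread across GPU threads; the total operation count is unchanged.
  • Parallelization is key to success with this algorithm.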
Testing
• Platform
  • Intel Core 2 Duo E6700 @ 2.66 GHz
  • Gigabyte GA-P35-DQ6 Motherboard
  • 2 GB RAM
  • 2 NVIDIA GeForce 8600 GTS Superclocked GPUs
    • 720 MHz Core Clock
    • 256 MB GDDR3 Memory
    • 4 Multiprocessors → 32 Streaming Processors
  • Windows XP Professional (32-bit) with SP3 and NVIDIA ForceWare 178.24 Drivers
Future Work
• Multiple-GPU version
  • Have a dual-card setup to test this with.
  • Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
• Still interested in trying a Cell/BE version of the algorithm.
  • Need to improve at CBEA programming.
  • DMA and local store size are the limiting factors for this particular problem.
References
• NVIDIA CUDA Programming Guide, Version 2.1
  http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
• The Discrete Cosine Transform (DCT): Theory and Application
  http://www.egr.msu.edu/waves/people/Ali_files/DCT_TR802.pdf
• CDA 6938 Lecture Notes and Slides