JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems
Ang Li, University of Wisconsin-Madison
Outline
• Brief Introduction of Background
• Implementation
• Evaluation
• Conclusion
NVIDIA GTC 2013
Background
• JPEG Encoding
• Seeking Parallelism
  • Pre-processing: Color Conversion
  • Block Encoding/Decoding
Implementation
• Step 1 – Find target functions
  • Encode: encode_mcu_huff, encode_one_block, emit_bits_s
  • Decode: decode_mcu_DC_first, decode_mcu_DC_refine
• Profiling with gprof to find other functions
  • Encode: rgb_ycc_convert
  • Decode: ycc_rgb_convert
  • Both take a little under half of the total execution time of encoding/decoding
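Color conversion is a natural target because every output pixel depends only on the matching input pixel. The sketch below illustrates that independence with a simple OpenMP loop. It is not libjpeg's rgb_ycc_convert (which uses fixed-point lookup tables); the function name rgb_to_ycc_sketch, the interleaved buffer layout, and the floating-point arithmetic are assumptions made for illustration, using the standard JFIF conversion equations.

```c
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp to the 0..255 sample range. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/*
 * Hypothetical helper, not libjpeg code: converts n interleaved RGB
 * pixels to interleaved YCbCr using the JFIF equations.
 * Each iteration reads and writes only its own pixel, so iterations
 * can run in any order or in parallel.
 */
void rgb_to_ycc_sketch(const uint8_t *rgb, uint8_t *ycc, size_t n)
{
    long i;
    /* Ignored when compiled without OpenMP; the loop then runs serially. */
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        double r = rgb[3 * i + 0];
        double g = rgb[3 * i + 1];
        double b = rgb[3 * i + 2];

        ycc[3 * i + 0] = clamp_u8( 0.299    * r + 0.587    * g + 0.114    * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
        ycc[3 * i + 2] = clamp_u8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
    }
}
```

Compiled with an OpenMP-enabled compiler (for example with -fopenmp on GCC), the loop is split across threads; without it the same code runs serially and gives the same result.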
Implementation – Cont’d
• Step 2 – Parallelize with CUDA
• First, implement in OpenMP to verify that our understanding of the code is correct
  • E.g., in encode_one_block, emit_bits_s changes the state of the system => running the loop in parallel with multiple threads leads to incorrect results!
      for (k = 1; k <= Se; k++) {
        …
        if (! emit_bits_s(…)) return FALSE;
        …
        if (! emit_bits_s(…)) return FALSE;
        …
        if (! emit_bits_s(…)) return FALSE;
        …
      }
• Second, write a baseline GPGPU implementation of all critical functions
• Third, optimize the GPGPU implementations
  • Using constant memory
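The problem with emit_bits_s can be illustrated with a minimal bit writer. This is a sketch under stated assumptions, not libjpeg's code: the bit_writer struct and emit_bits_sketch are hypothetical names, and details such as 0xFF byte stuffing are left out. The point is that each call appends to a shared accumulator, so the output depends on the order of the calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared encoder state (illustration only). */
typedef struct {
    uint32_t acc;    /* bits collected so far, newest bits at the bottom */
    int      nbits;  /* number of bits in acc not yet written out */
    uint8_t *out;    /* output buffer */
    size_t   len;    /* bytes written so far */
    size_t   cap;    /* capacity of out */
} bit_writer;

/*
 * Append the low `size` bits of `code` (1 <= size <= 16) to the stream.
 * Returns 0 if the output buffer is full, 1 otherwise, mirroring the
 * FALSE/TRUE convention in the loop above.
 */
int emit_bits_sketch(bit_writer *w, unsigned code, int size)
{
    w->acc = (w->acc << size) | (code & ((1u << size) - 1u));
    w->nbits += size;

    /* Flush every complete byte, most significant bits first. */
    while (w->nbits >= 8) {
        if (w->len == w->cap)
            return 0;
        w->out[w->len++] = (uint8_t)(w->acc >> (w->nbits - 8));
        w->nbits -= 8;
    }
    return 1;
}
```

Emitting the 3-bit code 101 and then the 5-bit code 11111 yields the byte 0xBF, while the reverse order yields 0xFD. If loop iterations ran concurrently, the call order (and the updates to acc and nbits) would no longer be deterministic, which is why the loop cannot simply be marked parallel.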
Evaluation
• Evaluation Environment
  • CPU: Intel Nehalem Xeon E5520, 2.26 GHz
  • GPU: Tesla K20c
• Picture used
  • My favorite picture
  • Compressing: 1280 x 768 pixels
  • Decompressing: the output of the compression step
• Correctness checked with diff
Evaluation – Cont’d
• Timings are in milliseconds, averaged over 10 runs
• Four threads are forked for the OpenMP implementation
• For both GPU implementations, configurations are tuned for best performance
• Results discussion
  • OpenMP is fastest. The baseline GPGPU version degrades performance, and the "optimized" version degrades it further (due to serialized constant memory accesses).
• Observations after working through the code:
  • Each kernel launch deals with at most 250 elements, which is too fine-grained.
  • Kernel launch is expensive (allocating & copying the data)
  • OpenMP will always be the better choice as long as there is enough parallelism & the loop iterations are not extremely trivial.
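The granularity observation can be made concrete with a simple cost model: offloading pays off only when the per-element time saved, multiplied by the number of elements, exceeds the fixed cost of a launch (allocation and copying included). The sketch below is an assumption-laden illustration, not a measurement from this work; the function names and any numbers plugged into them are hypothetical.

```c
#include <stddef.h>

/* Estimated CPU time: n elements at a fixed per-element cost (microseconds). */
double cpu_time_us(double cpu_per_elem_us, size_t n)
{
    return cpu_per_elem_us * (double)n;
}

/*
 * Estimated offload time: a fixed per-launch overhead (allocation,
 * copying, launch) plus n elements at the GPU's per-element cost.
 */
double offload_time_us(double overhead_us, double gpu_per_elem_us, size_t n)
{
    return overhead_us + gpu_per_elem_us * (double)n;
}

/* Returns 1 if the model predicts the offload is faster than the CPU. */
int offload_pays_off(double overhead_us, double cpu_per_elem_us,
                     double gpu_per_elem_us, size_t n)
{
    return offload_time_us(overhead_us, gpu_per_elem_us, n)
         < cpu_time_us(cpu_per_elem_us, n);
}
```

With made-up costs of 20 µs overhead, 0.01 µs per element on the CPU and 0.001 µs per element on the GPU, a 250-element launch is dominated by the overhead and loses to the CPU, whereas a launch over a million elements wins. That is the sense in which 250 elements per kernel launch is too fine-grained.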
Conclusion
• For the JPEG core encoding/decoding system, GPGPU degrades performance.
• Coarser-grained parallelism is required.
• OpenMP acceleration can be applied easily to gain some performance.
Thank you.
Ang Li <ali28@wisc.edu>