
Dongyue Mou and Zeng Xing

Presentation Transcript


  1. Dongyue Mou and Zeng Xing. cujpeg: A Simple JPEG Encoder With CUDA Technology

  2. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  3. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  4. JPEG Algorithm JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps: • Divide image into 8x8 blocks • [R,G,B] to [Y,Cb,Cr] conversion • Downsampling (optional) • FDCT (Forward Discrete Cosine Transform) • Quantization • Serialization in zig-zag order • Entropy encoding (Run Length Coding & Huffman coding)

  5. JPEG Algorithm -- Example This is an example

  6. Divide into 8x8 blocks This is an example

  7. Divide into 8x8 blocks This is an example

  8. RGB vs. YCC To the human eye, a loss of color precision is less noticeable than a loss of contour precision (contours are based on luminance). Color space conversion makes use of this. Simple color space model: [R,G,B] per pixel. JPEG uses the [Y, Cb, Cr] model: Y = brightness, Cb = blueness, Cr = redness.

  9. Convert RGB to YCC 8x8 pixels; 1 pixel = 3 components. MCU with sampling factor (1, 1, 1)
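The per-pixel conversion can be sketched in host-side C++ as follows. This is a minimal sketch, not the cujpeg kernel: it assumes the JFIF coefficients and the level shift of Y by 128 that slide 26 also uses, and the names `YCbCr` and `rgbToYCbCr` are hypothetical.

```cpp
#include <cassert>

// One converted pixel. Y is level-shifted to [-128, 127]; Cb and Cr are centered on 0.
struct YCbCr { float y, cb, cr; };

// JFIF-style RGB -> YCbCr conversion (assumed coefficients; inputs in [0, 255]).
YCbCr rgbToYCbCr(float r, float g, float b) {
    assert(r >= 0.0f && r <= 255.0f);
    assert(g >= 0.0f && g <= 255.0f);
    assert(b >= 0.0f && b <= 255.0f);
    YCbCr out;
    out.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b - 128.0f;
    out.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b;
    out.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b;
    return out;
}
```

A pure white pixel maps to the top of the shifted luminance range with both chroma components near zero, which is the reason gray-scale content costs almost nothing in the Cb and Cr planes.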

  10. Downsampling Y is sampled at every pixel; Cb and Cr are sampled once per block of 2x2 pixels. 4 blocks, 16x16 pixels. MCU (minimum coded unit): the smallest group of data units that is coded. The data size is immediately reduced to half. MCU with sampling factor (2, 1, 1)
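A small sketch of the chroma reduction, assuming the 2x2 block is collapsed by averaging. The slide only says one sample is taken per 2x2 block, so averaging is an assumption, and `downsample2x2` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduce a chroma plane to half resolution in both directions by
// averaging each 2x2 block (averaging is an assumed choice).
std::vector<float> downsample2x2(const std::vector<float>& plane, int width, int height) {
    assert(width % 2 == 0 && height % 2 == 0);
    assert(plane.size() == static_cast<std::size_t>(width) * height);
    const int outWidth = width / 2;
    std::vector<float> out(static_cast<std::size_t>(outWidth) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            float sum = plane[y * width + x]       + plane[y * width + x + 1]
                      + plane[(y + 1) * width + x] + plane[(y + 1) * width + x + 1];
            out[(y / 2) * outWidth + x / 2] = sum * 0.25f;
        }
    }
    return out;
}
```

With Y kept at full resolution, a 16x16 MCU carries 4 Y blocks plus 1 Cb and 1 Cr block, that is 6 blocks instead of 12.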

  11. Apply FDCT 2D DCT: the bottleneck; the complexity of the algorithm is O(n^4). 1D DCT: the 2-D transform is equivalent to a 1-D transform applied in each direction. The kernel uses 1-D transforms.
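The row-column idea can be illustrated with a deliberately naive reference implementation. This is a sketch for checking results, not the fast butterfly version shown later in the deck; it assumes the orthonormal DCT-II scaling, and `dct1d8` and `dct2d8x8` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Naive 1-D orthonormal DCT-II over 8 samples spaced `step` apart, in place.
void dct1d8(double* v, int step) {
    assert(step > 0);
    const double pi = 3.14159265358979323846;
    double out[8];
    for (int k = 0; k < 8; ++k) {
        double sum = 0.0;
        for (int n = 0; n < 8; ++n)
            sum += v[n * step] * std::cos((2 * n + 1) * k * pi / 16.0);
        out[k] = sum * (k == 0 ? std::sqrt(1.0 / 8.0) : 0.5);
    }
    for (int k = 0; k < 8; ++k) v[k * step] = out[k];
}

// 2-D DCT of a row-major 8x8 block: 8 row passes (stride 1), then 8 column passes (stride 8).
void dct2d8x8(double* block) {
    for (int row = 0; row < 8; ++row) dct1d8(block + row * 8, 1);
    for (int col = 0; col < 8; ++col) dct1d8(block + col, 8);
}
```

A block filled with a constant value ends up with all its energy in the top-left (DC) coefficient, which is a quick sanity check for any faster implementation.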

  12. Apply FDCT Shift operation: from [0, 255] to [-128, 127]. Meaning of each position in the DCT result matrix. DCT result

  13. Quantization Quantization matrix (adjustable according to quality). DCT result. Quantization result
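How a quality setting can adjust the matrix is not spelled out on the slide. The sketch below assumes the scaling convention used by the IJG libjpeg library, which may or may not be what cujpeg does; `scaleQuantEntry` and `quantize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Scale one base quantization-table entry for a quality in [1, 100]
// (IJG/libjpeg-style convention, assumed here).
int scaleQuantEntry(int base, int quality) {
    quality = std::min(std::max(quality, 1), 100);
    int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    int q = (base * scale + 50) / 100;
    return std::min(std::max(q, 1), 255);   // keep the divisor in [1, 255]
}

// Quantize one DCT coefficient: divide by the table entry and round to nearest.
int quantize(float coeff, int q) {
    assert(q > 0);
    return static_cast<int>(std::lround(coeff / q));
}
```

Under this convention a base entry of 16 stays 16 at quality 50 and shrinks to 6 at quality 80, so higher quality means smaller divisors and fewer coefficients rounded to zero.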

  14. Zigzag reordering / Run Length Coding Quantization result: [number of zeros before this value, value]
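Both steps can be sketched together: build the zig-zag visiting order, then emit [zeros before, value] pairs for the AC coefficients. This is a simplified sketch; it represents the end-of-block marker as the pair (0, 0) and omits the special symbol JPEG uses for runs of 16 zeros. The function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// zz[i] = row-major index of the i-th coefficient in zig-zag order.
std::array<int, 64> zigzagOrder() {
    std::array<int, 64> zz{};
    int i = 0;
    for (int s = 0; s < 15; ++s) {            // s = row + col (anti-diagonal)
        int lo = s < 8 ? 0 : s - 7;           // smallest valid row on this diagonal
        int hi = s < 8 ? s : 7;               // largest valid row on this diagonal
        for (int t = lo; t <= hi; ++t) {
            // Even diagonals run upward (row decreasing), odd ones downward.
            int row = (s % 2 == 0) ? (hi + lo - t) : t;
            zz[i++] = row * 8 + (s - row);
        }
    }
    assert(i == 64);
    return zz;
}

// Run-length code the 63 AC coefficients as (zeros before, value) pairs.
std::vector<std::pair<int, int>> runLengthAC(const std::array<int, 64>& block) {
    const std::array<int, 64> zz = zigzagOrder();
    std::vector<std::pair<int, int>> out;
    int zeros = 0;
    for (int i = 1; i < 64; ++i) {
        int v = block[zz[i]];
        if (v == 0) { ++zeros; continue; }
        out.emplace_back(zeros, v);
        zeros = 0;
    }
    if (zeros > 0) out.emplace_back(0, 0);    // trailing zeros collapse into EOB
    return out;
}
```

Because quantization zeroes most high-frequency coefficients, the zig-zag order pushes them to the end of the sequence, where a single end-of-block marker replaces them all.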

  15. Huffman encoding RLC result: [0, -3] [0, 12] [0, 3] ... EOB. After group number added: [0,2,00b] [0,4,1100b] [0,2,11b] ... EOB. First Huffman coding (e.g. for [0,2,00b]): [0, 2, 00b] => [100b, 00b] (looked up in a Huffman table, e.g. the AC chrominance table). Total input: 512 bits; output: 113 bits.
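The "group number" step can be sketched as two small functions: the group (bit-length category) of a value, and the extra bits that identify the value within its group. The Huffman table lookup itself is left out, and `category` and `extraBits` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Group number: how many bits are needed for |v| (0 for v == 0).
int category(int v) {
    int a = std::abs(v);
    int n = 0;
    while (a != 0) { ++n; a >>= 1; }
    return n;
}

// Extra bits appended after the Huffman code: v itself for positive values,
// v + 2^n - 1 for negative ones, written as n binary digits.
std::string extraBits(int v) {
    const int n = category(v);
    assert(n < 31);
    const int bits = v >= 0 ? v : v + (1 << n) - 1;
    std::string s(n, '0');
    for (int i = 0; i < n; ++i)
        if (bits & (1 << (n - 1 - i))) s[i] = '1';
    return s;
}
```

For the slide's first two pairs, -3 falls in group 2 with bits `00` and 12 falls in group 4 with bits `1100`.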

  16. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  17. Traditional Encoder CPU Image Load image Color conversion DCT Quantization Zigzag Reorder Encoding .jpg

  18. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  19. Algorithm Analysis 1x full 2D DCT scan: O(N^4). 8x row 1D DCT scan plus 8x column 1D DCT scan: O(N^3). 8 threads can work in parallel.
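The two complexity figures follow from counting multiply-add terms for an N x N block with a straightforward (non-fast) 1-D transform. The counts below are a worked illustration for N = 8, not measurements from cujpeg.

```latex
\begin{aligned}
\text{direct 2-D DCT:}\quad & N^2 \text{ outputs} \times N^2 \text{ terms each} = N^4 = 8^4 = 4096 \\
\text{separable DCT:}\quad & 2N \text{ 1-D transforms} \times N^2 \text{ terms each} = 2N^3 = 2 \cdot 8^3 = 1024
\end{aligned}
```

The 8 row transforms are independent of each other, as are the 8 column transforms, which is why 8 threads can share one block.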

  20. Algorithm Analysis

  21. DCT In Place

      // C_a .. C_f and C_norm are precomputed DCT constants, assumed to be
      // defined elsewhere (their definitions are not shown on the slide).

      // 1-D 8-point DCT, in place, over 8 values spaced Step elements apart.
      __device__ void vectorDCTInPlace(float *Vect0, int Step)
      {
          float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
          float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
          float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
          float *Vect7 = Vect6 + Step;

          float X07P = (*Vect0) + (*Vect7);
          float X16P = (*Vect1) + (*Vect6);
          float X25P = (*Vect2) + (*Vect5);
          float X34P = (*Vect3) + (*Vect4);

          float X07M = (*Vect0) - (*Vect7);
          float X61M = (*Vect6) - (*Vect1);
          float X25M = (*Vect2) - (*Vect5);
          float X43M = (*Vect4) - (*Vect3);

          float X07P34PP = X07P + X34P;
          float X07P34PM = X07P - X34P;
          float X16P25PP = X16P + X25P;
          float X16P25PM = X16P - X25P;

          (*Vect0) = C_norm * (X07P34PP + X16P25PP);
          (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
          (*Vect4) = C_norm * (X07P34PP - X16P25PP);
          (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

          (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
          (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
          (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
          (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
      }

      // Sequential version: one thread transforms the whole 8x8 block.
      __device__ void blockDCTInPlace(float *block)
      {
          for (int row = 0; row < 64; row += 8)
              vectorDCTInPlace(block + row, 1);   // rows: stride 1
          for (int col = 0; col < 8; col++)
              vectorDCTInPlace(block + col, 8);   // columns: stride 8
      }

      // Parallel version: 8 threads per block, each doing one row, then one column.
      __device__ void parallelDCTInPlace(float *block)
      {
          int col = threadIdx.x % 8;
          int row = col * 8;
          __syncthreads();
          vectorDCTInPlace(block + row, 1);       // this thread's row
          __syncthreads();                        // all rows done before columns start
          vectorDCTInPlace(block + col, 8);       // this thread's column
          __syncthreads();
      }

  22. Allocation Desktop PC • CPU: 1 P4 core, 3.0 GHz • RAM: 2 GB. Graphics card • GPU: 16 cores at 575 MHz, 8 SPs per core at 1.35 GHz • RAM: 768 MB

  23. Binding Huffman encoding • many conditions/branches • intensive bit operations • little computation. Color conversion, DCT, quantization • intensive computation • few conditions/branches

  24. Binding Hardware: 16 KB shared memory. Problem: 1 MCU contains 702 bytes of data. Result: at most 21 MCUs per CUDA block. Hardware: 512 threads. Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads. Result: 1 MCU needs 24 threads. 1 CUDA block = 21 MCUs x 24 threads = 504 threads
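The sizing arithmetic can be replayed with a few integer divisions. This is a sketch using only the figures on the slide; `BlockLayout` and `planBlock` are hypothetical names. On these figures the thread limit is the tighter one: 16 KB / 702 bytes would admit 23 MCUs, while 512 / 24 admits 21.

```cpp
#include <algorithm>
#include <cassert>

struct BlockLayout {
    int mcusPerBlock;     // MCUs handled by one CUDA block
    int threadsPerBlock;  // threads actually used
    int sharedBytes;      // shared memory actually used
};

// Pick the largest MCU count that fits both the shared-memory and thread limits.
BlockLayout planBlock(int sharedLimit, int bytesPerMcu, int threadLimit, int threadsPerMcu) {
    assert(bytesPerMcu > 0 && threadsPerMcu > 0);
    const int byMemory  = sharedLimit / bytesPerMcu;
    const int byThreads = threadLimit / threadsPerMcu;
    const int mcus = std::min(byMemory, byThreads);
    return {mcus, mcus * threadsPerMcu, mcus * bytesPerMcu};
}
```

Calling `planBlock(16 * 1024, 702, 512, 24)` yields 21 MCUs, 504 threads and 14742 bytes of shared memory per CUDA block.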

  25. cujpeg Encoder CPU GPU Image Load image Color conversion DCT Quantization Zigzag Reorder Encoding .jpg

  26. cujpeg Encoder
      Diagram labels: CPU, GPU; Load image, Color Conversion, In Place DCT, Quantize, Reorder, Encoding, .jpg; Image, Host Memory, Texture Memory, Shared Memory, Global Memory, Result.

      // Upload the image into a texture and allocate the result buffer.
      cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
      cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride,
                          imageWidth, imageHeight, cudaMemcpyHostToDevice);
      cudaBindTextureToArray(TexSrc, textureCache, channel);
      cudaMalloc((void **)(&ResultDevice), ResultSize);

      // Color conversion: fetch B, G, R from the texture, write Y, Cb, Cr.
      int b = tex2D(TexSrc, TexPosX++, TexPosY);
      int g = tex2D(TexSrc, TexPosX++, TexPosY);
      int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
      float y  =  0.299*r    + 0.587*g    + 0.114*b - 128.0 + 0.5;
      float cb = -0.168736*r - 0.331264*g + 0.500*b + 0.5;
      float cr =  0.500*r    - 0.418688*g - 0.081312*b + 0.5;
      myDCTLine[Offset + i]       = y;
      myDCTLine[Offset + 64 + i]  = cb;
      myDCTLine[Offset + 128 + i] = cr;

      // Quantization and zig-zag reorder in one pass.
      for (int i = 0; i < BLOCK_WIDTH; i++)
          myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

      // Copy the result back to host memory.
      cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);

  27. Scheduling Diagram labels: RGB Data, YCC Block, DCT Block, Quantized/Reordered Data; Y, Cb, Cr; x24 threads per stage. For each MCU: • 24 threads: convert 2 pixels • 8 threads: convert the remaining 2 pixels • 24 threads: do 1x row vector DCT, do 1x column vector DCT, quantize 8x scalar values

  28. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  29. GPU Occupancy

  30. Benchmark (Q = 80, sampling = 1:1:1)

  31. Benchmark

  32. Benchmark Each thread performs 240 operations. 24 threads process 1 MCU. A 4096x4096 image contains 262144 MCUs. Total ops: 262144 * 24 * 240 = 1509949440 flops. Speed: (total ops) / 0.043 s = 35.12 Gflops
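The throughput estimate can be reproduced step by step. The sketch below uses 240 operations per thread, the figure that yields the stated total of 1509949440, and treats 0.043 as a time in seconds; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of 8x8 MCUs in an image at 1:1:1 sampling (dimensions divisible by 8).
std::int64_t mcuCount(int width, int height) {
    assert(width % 8 == 0 && height % 8 == 0);
    return static_cast<std::int64_t>(width / 8) * (height / 8);
}

// Total floating-point operations for the whole image.
std::int64_t totalOps(std::int64_t mcus, int threadsPerMcu, int opsPerThread) {
    return mcus * threadsPerMcu * opsPerThread;
}

// Throughput in Gflops for a given run time in seconds.
double gflops(std::int64_t ops, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(ops) / seconds / 1e9;
}
```

For a 4096x4096 image, `mcuCount` gives 262144, `totalOps(262144, 24, 240)` gives 1509949440, and dividing by 0.043 s gives roughly 35.12 Gflops.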

  33. Outline • JPEG Algorithm • Traditional Encoder • What's new in cujpeg • Benchmark • Conclusion

  34. Conclusion CUDA can clearly accelerate JPEG compression. The overall performance • Depends on the system speed • More bandwidth • Better encoding routine • Support for downsampling
