Dongyue Mou and Zeng Xing

cujpeg: A Simple JPEG Encoder with CUDA Technology

Outline
• JPEG Algorithm
• Traditional Encoder
• What's new in cujpeg
• Benchmark
• Conclusion

JPEG Algorithm
JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:
• Divide the image into 8x8 blocks
• [R,G,B] to [Y,Cb,Cr] conversion
• Downsampling (optional)
• FDCT (Forward Discrete Cosine Transform)
• Quantization
• Serialization in zig-zag order
• Entropy encoding (run-length coding and Huffman coding)

JPEG Algorithm -- Example This is an example

Divide into 8x8 blocks This is an example

RGB vs. YCC
The human eye is less sensitive to a loss of color precision than to a loss of contour precision (contours are carried by the luminance). Color space conversion makes use of this.
Simple color space model: [R,G,B] per pixel
JPEG uses the [Y, Cb, Cr] model:
• Y = brightness
• Cb = blueness
• Cr = redness

Convert RGB to YCC
8x8 pixels, 1 pixel = 3 components
MCU with sampling factor (1, 1, 1)
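A minimal host-side sketch of the per-pixel conversion, assuming the standard JFIF coefficients. The type `Ycc` and the function `rgb_to_ycc` are illustrative names, not part of cujpeg. Y is level-shifted by -128 and Cb/Cr are left centred on 0, so that all three components are ready for the DCT.

```c
/* Illustrative type, not part of cujpeg. */
typedef struct { float y, cb, cr; } Ycc;

/* Convert one 8-bit RGB pixel to level-shifted YCbCr (JFIF coefficients).
   Y is shifted from [0, 255] to [-128, 127]; Cb and Cr are centred on 0. */
Ycc rgb_to_ycc(unsigned char r, unsigned char g, unsigned char b)
{
    Ycc out;
    out.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b - 128.0f;
    out.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b;
    out.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b;
    return out;
}
```

A white pixel (255, 255, 255) maps to roughly Y = 127, Cb = 0, Cr = 0: full brightness and no color.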

Downsampling
Y is sampled for every pixel; Cb and Cr are sampled once per block of 2x2 pixels.
MCU (minimum coded unit): the smallest group of data units that is coded. Here: 4 blocks, 16x16 pixels.
The data size is halved immediately.
MCU with sampling factor (2, 1, 1)
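A small sketch of 2x2 chroma downsampling. Averaging the four samples is an assumption; the slide only says that one Cb/Cr value is taken per 2x2 block. `downsample_2x2` is a hypothetical helper name. For a 16x16 MCU this leaves 4 Y blocks + 1 Cb block + 1 Cr block = 6 blocks instead of 12, which is where the factor of one half comes from.

```c
/* Downsample one chroma plane by 2 in each direction by averaging 2x2 pixels.
   src is width x height (both assumed even); dst is (width/2) x (height/2). */
void downsample_2x2(const float *src, int width, int height, float *dst)
{
    int out_w = width / 2;
    for (int y = 0; y < height / 2; y++) {
        for (int x = 0; x < out_w; x++) {
            /* p points at the top-left sample of the 2x2 group */
            const float *p = src + (2 * y) * width + 2 * x;
            dst[y * out_w + x] = 0.25f * (p[0] + p[1] + p[width] + p[width + 1]);
        }
    }
}
```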

Apply FDCT
2-D DCT: the bottleneck; the complexity of the algorithm is O(n^4)
1-D DCT: the 2-D transform is equivalent to a 1-D transform applied in each direction
The kernel uses 1-D transforms
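The separability argument can be sketched on the host as follows. This is a plain reference version, not the cujpeg kernel: it applies a direct 8-point orthonormal DCT-II to every row and then to every column. The helper names (`cos_pi16`, `dct_1d`, `dct_2d`) are illustrative. A direct 2-D transform needs 64 x 64 = 4096 multiply-adds per block, while 16 one-dimensional transforms need 16 x 64 = 1024.

```c
/* cos(m * pi / 16) for m = 0..8; other angles follow by symmetry. */
static const double COS16[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

static double cos_pi16(int m)
{
    m %= 32;                                    /* period 2*pi = 32 steps */
    if (m > 16) m = 32 - m;                     /* cos(2*pi - x) = cos(x) */
    return m <= 8 ? COS16[m] : -COS16[16 - m];  /* cos(pi - x) = -cos(x)  */
}

/* Orthonormal 8-point DCT-II of v, read and written with the given stride. */
static void dct_1d(double *v, int stride)
{
    double out[8];
    for (int k = 0; k < 8; k++) {
        double sum = 0.0;
        for (int n = 0; n < 8; n++)
            sum += v[n * stride] * cos_pi16((2 * n + 1) * k);
        out[k] = sum * (k == 0 ? 0.3535533905932738 : 0.5);
    }
    for (int k = 0; k < 8; k++)
        v[k * stride] = out[k];
}

/* 2-D DCT of a row-major 8x8 block: 8 row passes, then 8 column passes. */
void dct_2d(double block[64])
{
    for (int row = 0; row < 8; row++) dct_1d(block + 8 * row, 1);
    for (int col = 0; col < 8; col++) dct_1d(block + col, 8);
}
```

For a block in which every sample is 1.0, all the energy ends up in the DC coefficient `block[0]`, and the other 63 coefficients are zero up to rounding error.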

Apply FDCT
Shift operation: from [0, 255] to [-128, 127]
(Figure: DCT result matrix and the meaning of each position in it.)

Quantization
Quantization matrix (adjustable according to quality)
DCT result -> Quantization result
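A hedged sketch of the two ideas on this slide. `scale_quant` shows one common way to make the matrix "adjustable according to quality": it is the scaling rule used by the IJG libjpeg library, and whether cujpeg uses the same rule is an assumption. `quantize` divides a DCT coefficient by its step and rounds to the nearest integer. Both function names are illustrative.

```c
/* Scale a base quantization value for a quality setting in 1..100.
   This is the IJG libjpeg rule; its use in cujpeg is an assumption. */
int scale_quant(int base, int quality)
{
    int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    int q = (base * scale + 50) / 100;
    if (q < 1) q = 1;      /* a step of 0 would divide by zero */
    if (q > 255) q = 255;  /* baseline JPEG stores 8-bit steps */
    return q;
}

/* Quantize one DCT coefficient: divide by the step, round to nearest. */
int quantize(float coeff, int step)
{
    float scaled = coeff / (float)step;
    return (int)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```

With a base step of 16, quality 50 keeps the step at 16 and quality 80 reduces it to 6, so fewer coefficients are rounded to zero.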

Zigzag reordering / Run Length Coding
Quantization result -> pairs of the form [number of zeros before me, my value]
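The two steps on this slide can be sketched as follows. The zig-zag order is generated by walking the anti-diagonals of the 8x8 block, and the run-length coder emits [zeros before me, my value] pairs in the simplified form shown on the slide. All names are illustrative. Real JPEG additionally codes the DC coefficient separately and splits runs longer than 15, which this sketch leaves out.

```c
typedef struct { int run; int value; } RunValue;

/* Fill order[i] with the row-major index of the i-th zig-zag position. */
void build_zigzag(int order[64])
{
    int i = 0;
    for (int s = 0; s < 15; s++) {            /* s = row + col */
        int lo = s < 8 ? 0 : s - 7;
        int hi = s < 8 ? s : 7;
        for (int k = lo; k <= hi; k++) {
            /* even diagonals run upwards, odd diagonals downwards */
            int row = (s % 2 == 0) ? s - k : k;
            int col = s - row;
            order[i++] = row * 8 + col;
        }
    }
}

/* Reorder a row-major block into zig-zag sequence. */
void zigzag_reorder(const int block[64], const int order[64], int zz[64])
{
    for (int i = 0; i < 64; i++)
        zz[i] = block[order[i]];
}

/* Emit (run, value) pairs; trailing zeros are dropped and stand for EOB.
   Returns the number of pairs written to out. */
int run_length_encode(const int zz[64], RunValue out[64])
{
    int n = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        if (zz[i] == 0) { run++; continue; }
        out[n].run = run;
        out[n].value = zz[i];
        n++;
        run = 0;
    }
    return n;
}
```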

Huffman encoding
RLC result: [0, -3] [0, 12] [0, 3] ... EOB
After the group number is added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ... EOB
First Huffman coding (e.g. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (looked up in, e.g., the AC chrominance table)
Total input: 512 bits, output: 113 bits
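The "group number" and the extra bits that follow it can be computed as sketched below, assuming the usual JPEG convention: the group is the number of bits needed for the magnitude, and a negative value is stored as value + 2^group - 1. The function names are illustrative. This reproduces the slide's -3 -> group 2, bits 00b and 12 -> group 4, bits 1100b. The Huffman table lookup itself is not shown.

```c
/* Group (size category) of a coefficient: number of bits in |v|. */
int category(int v)
{
    int mag = v < 0 ? -v : v;
    int bits = 0;
    while (mag) { bits++; mag >>= 1; }
    return bits;
}

/* The `category(v)` low-order bits that are appended after the Huffman code.
   Positive values are stored as-is; negative values as v + 2^category - 1. */
unsigned amplitude_bits(int v)
{
    int cat = category(v);
    return (unsigned)(v >= 0 ? v : v + (1 << cat) - 1);
}
```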

Traditional Encoder
CPU: Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg

Algorithm Analysis
• 1x full 2-D DCT scan: O(N^4)
• 8x row 1-D DCT scan + 8x column 1-D DCT scan: O(N^3)
• 8 threads can work in parallel

Algorithm Analysis

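DCT In Place

    // Constants of the 8-point DCT. Their values are not shown on the slide;
    // the values below are an assumption that follows from the orthonormal
    // DCT-II: C_x = sqrt(2) * cos(k * pi / 16), C_norm = 1 / sqrt(8).
    #define C_a    1.387039845f  // k = 1
    #define C_b    1.306562965f  // k = 2
    #define C_c    1.175875602f  // k = 3
    #define C_d    0.785694958f  // k = 5
    #define C_e    0.541196100f  // k = 6
    #define C_f    0.275899379f  // k = 7
    #define C_norm 0.353553391f

    // In-place 8-point DCT of the elements Vect0[0], Vect0[Step], ..., Vect0[7 * Step].
    __device__ void vectorDCTInPlace(float *Vect0, int Step)
    {
        float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
        float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
        float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
        float *Vect7 = Vect6 + Step;

        // Butterfly stage: sums and differences of mirrored elements.
        float X07P = (*Vect0) + (*Vect7);
        float X16P = (*Vect1) + (*Vect6);
        float X25P = (*Vect2) + (*Vect5);
        float X34P = (*Vect3) + (*Vect4);
        float X07M = (*Vect0) - (*Vect7);
        float X61M = (*Vect6) - (*Vect1);
        float X25M = (*Vect2) - (*Vect5);
        float X43M = (*Vect4) - (*Vect3);

        float X07P34PP = X07P + X34P;
        float X07P34PM = X07P - X34P;
        float X16P25PP = X16P + X25P;
        float X16P25PM = X16P - X25P;

        // Even-indexed outputs.
        (*Vect0) = C_norm * (X07P34PP + X16P25PP);
        (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
        (*Vect4) = C_norm * (X07P34PP - X16P25PP);
        (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

        // Odd-indexed outputs.
        (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
        (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
        (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
        (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
    }

    // Serial version: one thread transforms a whole row-major 8x8 block.
    __device__ void blockDCTInPlace(float *block)
    {
        for (int row = 0; row < 64; row += 8)
            vectorDCTInPlace(block + row, 1);  // rows: neighbouring elements
        for (int col = 0; col < 8; col++)
            vectorDCTInPlace(block + col, 8);  // columns: stride of one row
    }

    // Parallel version: 8 threads per block, each handles one row, then one column.
    __device__ void parallelDCTInPlace(float *block)
    {
        int col = threadIdx.x % 8;
        int row = col * 8;

        __syncthreads();
        vectorDCTInPlace(block + row, 1);
        __syncthreads();                       // all rows done before any column starts
        vectorDCTInPlace(block + col, 8);
        __syncthreads();
    }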

Allocation
Desktop PC
• CPU: 1 P4 core, 3.0 GHz
• RAM: 2 GB
Graphics card
• GPU: 16 cores at 575 MHz, 8 SPs per core at 1.35 GHz
• RAM: 768 MB

Binding
Huffman encoding
• many conditions/branches
• intensive bit operations
• little computation
Color conversion, DCT, quantization
• intensive computation
• few conditions/branches

Binding
Hardware: 16 KB shared memory
Problem: 1 MCU contains 702 bytes of data
Result: at most 21 MCUs per CUDA block
Hardware: 512 threads
Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads
Result: 1 MCU needs 24 threads, so 1 CUDA block = 504 threads
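The sizing argument on this slide can be checked with a few lines of arithmetic. The sketch assumes that the MCU count per CUDA block is the smaller of the shared-memory limit and the thread limit; `mcus_per_block` is a hypothetical helper, not a cujpeg function.

```c
/* Number of MCUs one CUDA block can hold, limited by both shared memory
   and the maximum thread count. */
int mcus_per_block(int shared_bytes, int bytes_per_mcu,
                   int max_threads, int threads_per_mcu)
{
    int by_memory  = shared_bytes / bytes_per_mcu;
    int by_threads = max_threads / threads_per_mcu;
    return by_memory < by_threads ? by_memory : by_threads;
}
```

With the slide's figures, `mcus_per_block(16 * 1024, 702, 512, 24)` gives 21, and 21 MCUs x 24 threads = 504 threads per CUDA block.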

cujpeg Encoder
Image -> Load image (CPU) -> Color conversion, DCT, Quantization, Zigzag reorder (GPU) -> Encoding (CPU) -> .jpg

cujpeg Encoder
Pipeline: Load image -> Color conversion -> In-place DCT -> Quantize -> Reorder -> Encoding -> .jpg
Memory path: image in host memory -> texture memory -> shared memory -> global memory (quantized, reordered result) -> host memory

CPU: upload the image to texture memory and allocate the result buffer

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride,
                        imageWidth, imageHeight, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);
    cudaMalloc((void **)(&ResultDevice), ResultSize);

GPU: color conversion (texture memory -> shared memory)

    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
    float y  =  0.299*r    + 0.587*g     + 0.114*b - 128.0 + 0.5;
    float cb = -0.168736*r - 0.331264*g  + 0.500*b + 0.5;
    float cr =  0.500*r    - 0.418688f*g - 0.081312*b + 0.5;
    myDCTLine[Offset + i]       = y;
    myDCTLine[Offset + 64 + i]  = cb;
    myDCTLine[Offset + 128 + i] = cr;

GPU: quantization and reorder (shared memory -> global memory)

    for (int i = 0; i < BLOCK_WIDTH; i++)
        myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

CPU: copy the result back to host memory

    cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);

Scheduling
For each MCU:
• 24 threads: convert 2 pixels each
• 8 threads: convert the remaining 2 pixels each
• 24 threads: do 1x row vector DCT, do 1x column vector DCT, quantize 8 scalar values each
Data flow: RGB data -> YCC block -> DCT block -> Quantized/reordered data

GPU Occupancy

Benchmark ( Q = 80, Sample = 1:1:1 )

Benchmark

Benchmark
Each thread performs 240 operations.
24 threads process 1 MCU.
A 4096x4096 image contains 262144 MCUs.
Total ops: 262144 * 24 * 240 = 1509949440 flops
Speed: (total ops) / 0.043 s = 35.12 Gflops
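The throughput estimate can be reproduced with the short sketch below. It takes the slide's figures as given: 8x8-pixel MCUs (1:1:1 sampling), 24 threads per MCU, 240 operations per thread and a run time of 0.043 s. The function names are illustrative.

```c
/* Total operation count for an image split into 8x8-pixel MCUs. */
long long total_ops(int width, int height, int threads_per_mcu, int ops_per_thread)
{
    long long mcus = (long long)(width / 8) * (height / 8);
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in Gflops for a given operation count and run time. */
double gflops(long long ops, double seconds)
{
    return (double)ops / seconds / 1e9;
}
```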

Conclusion
CUDA can clearly accelerate JPEG compression.
The overall performance
• Depends on the system speed
• More bandwidth
• Better encoding routine
• Support for downsampling
