A near real time decoding for LDPC based distributed video coding using CUDA

A near real time decoding for LDPC based distributed video coding using CUDA CUDA架構下針對低密度奇偶校驗碼為基礎之分散式編碼的近即時解碼設計 Su, Tse-Chung 蘇則仲Advisor: Prof. Wu, Ja-Ling 吳家麟教授 2011/6/9 CMLab, CSIE, NTU

Outline Motivation and Introduction LDPCdecoding & LDPCA in DVC Parallel LDPCA Decoding In CUDA Early Stop Detection Mechanism Using CUDA Evaluation of Decoding speed Conclusions and future work CMLab, CSIE, NTU

Conventional Video Codec Heavyweight ENCODER Lightweight DECODER MPEG-2, H.264, HEVC(H.265) CMLab, CSIE, NTU

Distributed Video Coding(DVC) Heavyweight Lightweight ENCODER DECODER A new paradigm for video compression CMLab, CSIE, NTU

Application of DVC Cloud Computational Resource H.264 encoded bitstream Realtime system DVC to H.264 Transcoder DVC encoded bitstream H.264 decoder (Low Complexity) DVC encoder (Low Complexity) Video conferencing with mobile devices CMLab, CSIE, NTU

Distributed Video Coding LDPC Encoder Channel Decoder LDPC Decoder Channel Encoder D. Varodayan, A. Aaron, and B. Girod, “Rate-Adaptive Codes for Distributed Source Coding,”EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding,,November 2006.

Decoding Complexity of DVC Heavyweight DECODER • Our DVC codec (state-of-the-art) • Parallelized with OpenMPand CUDA • 12 core + GPGPU(Fermi) • ~1FPS CMLab, CSIE, NTU

Amdahl's law QCIF 29%~36% 15.39 fps 86%~94% • Maximum speedup can be reached by improving the most critical part of the system • LDPC decoding in the DVC decoder. CMLab, CSIE, NTU

LDPC decodingSum-Product Algorithm(Message Passing) Side Information (real number) +  0 -  1 a b c d e f g decode output hard decision 1 2 3 4 5 6 7 甲乙丙 a25 b25 c25 d25 e25 f25 g25 0 1 1 a1 b1 c1 d1 e1 f1 g1 a b c d e f g 4 6 7 1 2 3 5 Horizontal processing Vertical processing 甲乙丙 0 1 1 From DVC encoder (syndrome bits) Kschischang, F.R., Frey, B.J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm.IEEE Trans. Inform. Theory

Sum-Product AlgorithmVertical Processing b c d e f g a K + F + a P = 0 1 1 A B C D E P F + P + a Z = F G H I J K L M N O Z CMLab, CSIE, NTU

Sum-Product AlgorithmHorizontal Processing a b c d e f g 0 1 1 P Q R S T H K U V W X Y Z A B C D CMLab, CSIE, NTU

LDPC Accumulate (LDPCA) codes Rate adaptivity D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Section on Distributed Source Coding, 2006

65 LDPC codes 3

Outline Motivation and Introduction LDPCdecoding & LDPCA in DVC Parallel LDPCA Decoding In CUDA (Kernel Design) Early Stop Detection Mechanism Using CUDA Evaluation of Decoding speed Conclusions and future work CMLab, CSIE, NTU

Previous CUDA implementation • Vertical Processing Kernel (VPK) • Column degree is constant  3 • regular LDPC • Shared memory CUDA thread 3 Shared Memory Shared Memory CUDA thread Block(0) CUDA thread Block(1) Message data 0 1 2 3 4 5 6 7 8 9 10 11 A B C D E Index data F G H 0 1 2 3 4 5 6 7 8 9 10 11 1 4 0 6 7 5 9 10 11 8 2 3 5 8 I J K L Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for ldpc based distributed video coding. In Proc. of ACM international conference on Multimedia • Horizontal Processing Kernel (HPK) • Each message can be update by one thread (SIMD) • Variable row degree in each LDPC code • Data structure: Circular link list

CUDA ImplementationStrategy 1 Reduction of Φ Function in HPK Texture memory in VPK

Texture Binding in VPK t0 t0 Global Memory A D E F G H I J K L B C Global Memory Non-coalescing read Texture Binding 5 8 29 Speedup on both 1.x and 2.x compute capability CMLab, CSIE, NTU

LDPCA decoding time in previous CUDA implementation … … … CMLab, CSIE, NTU

Reduction of Φ Function in HPK A D E F G H B C 1 4 0 6 7 5 2 3 t3 t5 t6 t2 t4 t7 t1 t0 Shared Memory CUDA thread Block(0) Copy to shared memory A D E F G H I J K L B C Global Memory 1 4 0 6 7 5 9 10 11 8 2 3 CMLab, CSIE, NTU

Reduction of Φ Function in HPK 1 4 0 6 7 5 2 3 t0 t1 t2 t3 t5 t6 t7 t4 Shared Memory CUDA thread Block(0) Calculate functions before copying to shared memory A D E F G H I J K L B C Number of φ(x): row degree Global Memory 1 4 0 6 7 5 9 10 11 8 2 3 2 CMLab, CSIE, NTU

LDPCA Performance -- foreman sequence (QCIF) StepSpeedup CumulativeSpeedup LDPCA Time

CUDA ImplementationStrategy 2 ParallelPartial Reduction in HPK

Parallel Reduction Values (shared memory) Step 1 Stride 8 Thread IDs 0 1 2 3 4 5 6 7 Values Step 2 Stride 4 Thread IDs 0 1 2 3 Values Step 3 Stride 2 Thread IDs 0 1 Values Step 4 Stride 1 Thread IDs 0 Values Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Sequential addressing is conflict free CMLab, CSIE, NTU

Computation Overlapping in HPK t2 t1 t3 t4 t0 Parallel partial reduction = Magnitude CMLab, CSIE, NTU

Parallel Partial Reduction Global Memory rowDeg = 8 rowDeg = 4 message index t3 t9 t0 t1 t4 t8 t10 t2 t0 t1 t2 t3 t0 t9 t8 t1 t2 t3 t8 t9 Log(rowDeg) = 3 t0 t1 t0 t1 t8 t8 Shared Memory CUDA thread Block(0) t0 t0 idle threads Mag0 Mag1 CMLab, CSIE, NTU

CUDA ImplementationStrategy 3 Check Node Re-ordering • Completely Unrolling

Check Node Re-ordering Shared Memory 3 3 CUDA thread Block(1) CUDA thread Block(0) rowDeg =8 rowDeg = 4 rowDeg = 8 3 CUDA thread Block(0) CUDA thread Block(1) 2 Variable node Variable node 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 2 1 0 Check node Check node CMLab, CSIE, NTU

Completely unrolling • Redundant if else& __syncthreads() inti = threadIdx.x; Int half = rowDeg >> 1; float myMag = s_mag[i] ; char mySign = s_sign[i] ; do{ if(rowPos < half){ s_mag[i] += s_mag[i+half]; s_sign[i] ^= s_sign[i+half]; } half >>= 1; __syncthreads(); }while(half); Int base = i - rowPos; myMag = s_mag[base] - myMag; mySign = s_sign[base] ^ mySign; inti = threadIdx.x; float myMag = s_mag[i] ; char mySign = s_sign[i] ; If(rowDeg==16){ s_mag[i] += s_mag[i+8]; s_sign[i] ^= s_sign[i+8]; s_mag[i] += s_mag[i+4]; s_sign[i] ^= s_sign[i+4]; s_mag[i] += s_mag[i+2]; s_sign[i] ^= s_sign[i+2]; s_mag[i] += s_mag[i+1]; s_sign[i] ^= s_sign[i+1]; } else if ( rowDeg == 8 ){ ….. } int base = i - rowPos; myMag = s_mag[base] - myMag; mySign = s_sign[base] ^ mySign; Branch divergence Nobranch divergence harm performance Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology

CUDA ImplementationStrategy 4 • Combination of VPK and HPK

Kernel Launch Overhead VPK VPK VPK HPK HPK HPK 1. Parallelism is broken (Implicit Inter-Block Synchronization) 2. Extra global memory traffic VPK+HPK =UMK VPK+HPK =UMK VPK+HPK =UMK NVIDIA CUDA PROGRAMMING GUIDE(3.2)  5.2.1 CMLab, CSIE, NTU

LDPCA Performance -- foreman sequence (QCIF) StepSpeedup CumulativeSpeedup Time

Outline Motivation and Introduction LDPCdecoding & LDPCA in DVC Parallel LDPCA Decoding In CUDA Early Stop Detection Mechanism Using CUDA (CUDAAPI) Evaluation of Decoding speed Conclusions and future work CMLab, CSIE, NTU

CUDA ImplementationStrategy 5 Early Stop Detection

Early Stop Detectionin Sum-Product Algorithm time . . . CPU SPA iteration 1 E SPA iteration 2 SPA iteration 100 Horizontal Processing + Vertical Processing Early stop detection time Terminated at iteration 30 Successfully decoded Converge to wrong codeword iter.1 Check iter. 1 iter.2 Check iter. 2 . . . GPU UMK UMK UMK UMK UMK UMK UMK UMK UMK UMK EDK PCI-E transfer EDK PCI-E transfer CPU Transmit codeword&decoded info Early stop Detection Kernel Check decode info CMLab, CSIE, NTU

Combination of EDK and UMK a b c d e f g 0 1 1 time . . . GPU UMK UMK UMK UMK UMK UMK UMK UMK EDK PCI-E transfer EDK PCI-E transfer CPU • The SPA algorithm is memory intensive in CUDA • The index data of UMK is also used by early stop detection (EDK) • EDK+UMK = EDUMK • 14% additional complexity in terms of execution time CMLab, CSIE, NTU

Concurrent Kernel Execution and Data Transfer Early Stop Detection for iter.1 Run UMK for iter.2 Early Stop Detection for iter.5 Run UMK for iter.6 iter.9 iter.3 UMK EDUMK UMK UMK UMK EDUMK UMK UMK UMK GPU iter.1 PCI-E transfer PCI-E transfer Receive decode info &codeword for iter.5 Receive decode info & codeword for iter.1 time CPU Ideal Timeline CMLab, CSIE, NTU

Practical CUDA Implementation for Early Stopping Detection time #overlap = 3 #overlap = 3 UMK EDUMK Stream 0 EDUMK EDUMK . . . UMK UMK UMK UMK UMK UMK Stream 1 PCI-E transfer PCI-E transfer Stream 2 ~~~~~~~~~ ~~~~~~~~~ host Explicitsynchronization • Use 1 CPU thread, 1 GPU • Use CUDA Driver API instead of Runtime API • Nearly no Stream Management instructions • cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent() CMLab, CSIE, NTU

Speed-up ratio of early stop detection Total number of LDPCA iterations Theoretical speedup Actual speedup Early stop detection Fix iteration overhead 2.0x 1.8x 10% 20000 10000 Overhead on GPU Using Runtime API Overhead on GPU Using driver API Overhead on CPU 7% 5% 20% CMLab, CSIE, NTU

LDPCA Performance -- foreman sequence (QCIF) StepSpeedup CumulativeSpeedup Time 449.63x faster than sequential program!

Test condition • 12 CPU, 24 processor • Intel(R) Xeon(R) CPU X5650 @ 2.67GHz • GPU: Tesla M2050 • 14 (MP) x 32 (Cores/MP) = 448 (Cores) • CUDA capability 2.0 • Shared memory: 48K • Maximum threads in block: 1024 • Concurrent copy and execution • Concurrent kernel execution

Test condition Soccer Foreman Coastguard Hall Monitor Low High Motion Test sequences: QCIF, 15Hz, all frames GOP size: 8 Qindex: 8 Bitrate and PSNR: only luminance component CMLab, CSIE, NTU

Speedup Ratio of LDPCA decoder Using CUDA 0.2% bit rate↑ 0.96 fps 1.05 fps 15.35 ↑ LDPCA 22.51 ↑ LDPCA 7.43 ↑ 9.8 ↑ 7.14 fps 10.29 fps 1.14 fps 0.79 fps 36.91 ↑ LDPCA 12.88 ↑ LDPCA 13.5 ↑ 6.32 ↑ 15.39 fps 4.99 fps

LDPCA decoding time comparison Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of wyner-ziv video codec using gpgpu. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.

Realtime Decoding Quality Original Sequence 29.21db, 93.17kbps Original Sequence 27.44db, 76kbps Original Sequence Original Sequence 39.46db, 147.64kbps 35.34db, 263.52kbps

Conclusion Fully parallelized LDPCA decoder using CUDA with various features The proposed early stop detection mechanism reduces the latency between the CPU and the GPU Videos in surveillance sequence (e.g. hall monitor) can be decoded in real-time with negligible RD performance loss CMLab, CSIE, NTU

Future Work Soft input a1 b1 c1 d1 e1 f1 g1 a2 b2 c2 d2 e2 f2 g2 a3 b3 c3 d3 e3 f3 g3 4 6 7 1 2 3 5 Vitor Silva Horizontal processing Vertical processing 1 2 3 03 13 03 02 12 02 01 11 01 syndrome • Bitplane level parallelization for LDPCA • UV component • Frame level parallelization

Thank You CMLab, CSIE, NTU

A near real time decoding for LDPC based distributed video coding using CUDA

A near real time decoding for LDPC based distributed video coding using CUDA

Presentation Transcript

Error-Resilient Coding and Decoding Strategies for Video Communication

Distributed Real-Time Embedded Video Processing

Block-LDPC: A Practical LDPC Coding System Design Approach

Transform Domain Distributed Video Coding

Distributed Video Coding

Massively Parallel LDPC Decoding on GPU

Massively LDPC Decoding on Multicore Architectures

Real-Time Foveation Techniques for Low Bit Rate Video Coding

Erasure Coding for Real-Time Streaming

FPGA BASED REAL TIME VIDEO PROCESSING

A video-based real-time vehicle counting system using adaptive background method

LDPC Coding for 45GHz

Semi-Parallel Reconfigurable Architecture for Real-time LDPC decoding

Error Correction and LDPC decoding

A Scalable Architecture for LDPC Decoding

Distributed Video Coding

FPGA BASED REAL TIME VIDEO PROCESSING

Distributed Video Coding

Real Time Coding Video Streaming

Coding & Decoding Shortcuts

Error Correction and LDPC decoding

Distributed Real-Time Embedded Video Processing