Video Coding Technology Proposal
Dake He, Gergely Korodi, Gaelle Martin-Cocher, En-hui Yang, Xiang Yu, and Jinwen Zan
Research In Motion Limited
1st JCT-VC meeting, Dresden, Germany, April 16, 2010
New Technical Tools
• Complexity Reduction
  • Entropy Coding
    • Parallel Framework
    • Variable-Length-to-Variable-Length (V2V) codes
  • In-loop Filtering
    • Encoder only deblocking
• Rate Distortion Performance
  • Soft-Decision Quantization (SDQ)
  • Iterative Coding
• Implemented and Tested in JM11.0 KTA2.6r1
CABAC
• CABAC = Binarization* + Context Modeling + Binary Arithmetic Coding
• After binarization, for each bit b_i to be encoded or decoded:
  • context modeling generates a probability p_i;
  • BAC encodes the bit b_i with the probability p_i.
[Figure: illustration of the CABAC encoding process — the bins b_0 b_1 b_2 b_3 b_4 ... of x^n, with probabilities p_0 p_1 p_2 p_1 p_0 ... supplied by context modeling, are coded by BAC into the encoded bitstream.]
*Thanks to Detlev.
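As a minimal illustration of this two-stage flow (the `context_model` and `bac` objects below are hypothetical stand-ins for the two stages, not the JM/KTA interfaces):

```python
# Hypothetical interfaces, for illustration only: the context model supplies a
# probability estimate for each bin, and the binary arithmetic coder consumes
# the (bin, probability) pair, after which the model adapts.
def cabac_style_encode(bins, context_model, bac):
    for i, b in enumerate(bins):
        p = context_model.probability(i)   # context modeling -> p_i
        bac.encode(b, p)                   # BAC codes b_i using p_i
        context_model.update(i, b)         # model adaptation
    return bac.flush()                     # the encoded bitstream
```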
Parallel Framework
• Decouple context modeling from BAC.
  • Any entropy coding method (engine) might be used with any context models.
• Encoding may be parallelized.
[Figure: illustration of the encoding process in the proposed framework — context modeling and a DEMUX route the bins of x^n to entropy encoders with probabilities p_0 .. p_N (bins b_k,0 b_k,1 b_k,2 ... go to the encoder with p_k); a MUX combines the encoded sub-streams 0..N into the encoded bitstream.]
Parallel Framework
• Decoding may be parallelized as well.
[Figure: illustration of the decoding process in the proposed framework — a DEMUX splits the encoded bitstream into sub-streams 0..N; the entropy decoder with p_k recovers bins b_k,0 b_k,1 b_k,2 ...; context modeling and the MUX reassemble them into x^n.]
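A minimal sketch of the data flow in both directions, assuming bins are routed by a probability class derived from the context model's estimate. The bucketing, the trivial "store the raw bins" engines, and the externally supplied probability sequence are all simplifications for illustration; in the real scheme the decoder's own context model regenerates the probabilities as it decodes.

```python
# Sketch of the proposed decoupling: context modeling assigns each bin a
# probability class k; bins of class k go to their own entropy coder, which
# could run in parallel with the others. The "engines" here just store raw
# bins -- they are stand-ins for V2V or BAC engines.

N_CLASSES = 4

def prob_class(p):
    """Map a probability estimate to one of N_CLASSES buckets."""
    return min(int(p * N_CLASSES), N_CLASSES - 1)

def parallel_encode(bins, probs):
    """DEMUX bins into per-class streams; each stream can be coded
    independently, then MUXed into one bitstream in practice."""
    streams = [[] for _ in range(N_CLASSES)]
    for b, p in zip(bins, probs):
        streams[prob_class(p)].append(b)
    return streams

def parallel_decode(streams, probs):
    """Rebuild the original bin order: the decoder's context model produces
    the same probability sequence, so it knows which stream to read next."""
    cursors = [0] * N_CLASSES
    bins = []
    for p in probs:
        k = prob_class(p)
        bins.append(streams[k][cursors[k]])
        cursors[k] += 1
    return bins

bins  = [1, 0, 0, 1, 1, 0, 1, 0]
probs = [0.9, 0.2, 0.2, 0.6, 0.9, 0.4, 0.6, 0.2]
assert parallel_decode(parallel_encode(bins, probs), probs) == bins
```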
V2V Coding
• VLC (Huffman coding)
  • Fixed-length input -> variable-length output
• V2V
  • Variable-length input -> variable-length output
  • Includes VLC as a special case
  • Input is a prefix set of finite source strings
  • Output is a prefix set of binary codewords
• Five components
  • Code generator, encoder tree, decoder tree
  • Encoder buffer, code selector (applied to serial mode only)
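To make the definition concrete, here is a toy V2V codebook (made up for illustration, not one of the proposal's designed codes) whose input side is a prefix-free, complete set of source strings and whose output side is a prefix-free set of binary codewords:

```python
# Toy V2V code: variable-length source strings -> variable-length codewords.
# Both sides are prefix-free, so parsing and decoding are unambiguous.
V2V_TABLE = {
    "000": "0",     # a run of three MPS -> 1 bit
    "001": "100",
    "01":  "101",
    "1":   "11",    # a single LPS -> 2 bits
}

def v2v_encode(source_bits):
    """Greedily parse the source into table entries and emit their codewords.
    Assumes the source parses completely (a secondary code would handle an
    incomplete tail in practice -- see the encoder-tree slides)."""
    out, i = [], 0
    while i < len(source_bits):
        for s, cw in V2V_TABLE.items():
            if source_bits.startswith(s, i):
                out.append(cw)
                i += len(s)
                break
        else:
            raise ValueError("tail not parsable without a secondary code")
    return "".join(out)

print(v2v_encode("0000001101"))  # parses as 000|000|1|1|01 -> "001111101"
```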
A Sample V2V Code
• p = 0.20
• 14 leaves
• Compression rate = 0.7250 bits/symbol [0.42% higher than the entropy, 0.7219]
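For reference, the quoted figures follow from the binary entropy function:

```latex
H(p) = -p\log_2 p - (1-p)\log_2(1-p), \qquad
H(0.20) \approx 0.7219\ \text{bits/symbol}, \qquad
\frac{0.7250 - 0.7219}{0.7219} \approx 0.42\%.
```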
Encoder Tree
• Assumptions
  • The source alphabet is binary.
  • The probability of the LPS = p.
• A binary V2V code corresponds to a full binary tree T.
  • Each path in T corresponds to a source string.
  • Every node on the path has a probability given by p^u (1-p)^v, where u is the number of LPS symbols and v is the number of MPS symbols.
• Design constraints on T
  • No leaf node has a probability < 2^-16.
  • The number of leaves in T is < 4096.
  • There exists an efficient Huffman code for the leaves in T.
  • T and the associated Huffman code allow for a 30-bit representation.
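One illustrative way to grow a candidate encoder tree under these constraints is a Tunstall-style expansion: repeatedly split the most probable leaf until a constraint would be violated. This is only a sketch of the idea, not necessarily the proposal's code generator.

```python
import heapq

# Illustrative, Tunstall-style growth of a full binary parse tree for a
# binary source with LPS probability P, respecting the slide's constraints
# (every leaf probability >= 2**-16, fewer than 4096 leaves).
P = 0.20
MIN_LEAF_PROB = 2.0 ** -16
MAX_LEAVES = 4095

def grow_encoder_tree(max_leaves=MAX_LEAVES):
    leaves = [(-1.0, "")]                  # (negative probability, source string)
    while len(leaves) < max_leaves:
        neg_p, s = heapq.heappop(leaves)   # most probable leaf
        p = -neg_p
        if p * P < MIN_LEAF_PROB:          # its LPS child would fall below the floor
            heapq.heappush(leaves, (neg_p, s))
            break
        heapq.heappush(leaves, (-p * P, s + "L"))          # extend with an LPS
        heapq.heappush(leaves, (-p * (1.0 - P), s + "M"))  # extend with an MPS
    return [(s, -neg_p) for neg_p, s in leaves]

leaves = grow_encoder_tree(max_leaves=14)  # e.g. a 14-leaf tree, as in the sample code
print(len(leaves))
```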
Two Huffman codes
• Let T be the encoder tree of a V2V code.
• The primary Huffman code encodes the leaves in T (labeled blue in the original figure).
• The secondary Huffman code encodes the internal nodes in T (labeled red in the original figure).
  • Only used when the encoding process is terminated at an internal node (incomplete parsing), possibly because
    • the source sequence ends, or
    • the buffer overflows.
  • Can be very simple if not used often.
  • May not be necessary in the parallel framework.
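Both codes are ordinary Huffman codes over node probabilities. A minimal sketch of the primary-code construction, using a toy 4-leaf parse tree for p = 0.2 (not one of the proposal's actual trees):

```python
import heapq
from itertools import count

def huffman_code(symbols_with_probs):
    """Standard Huffman construction: returns {symbol: binary codeword}.
    Used here for the 'primary' code over the encoder tree's leaves; the
    'secondary' code over internal nodes could be built the same way."""
    tie = count()  # tie-breaker so the heap never compares dict payloads
    heap = [(p, next(tie), {sym: ""}) for sym, p in symbols_with_probs]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol case
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

# Toy 4-leaf parse tree for p = 0.2 ("L" = LPS, "M" = MPS); probabilities sum to 1.
example_leaves = [("MMM", 0.512), ("L", 0.2), ("ML", 0.16), ("MML", 0.128)]
primary = huffman_code(example_leaves)
avg_code = sum(p * len(primary[s]) for s, p in example_leaves)
avg_src  = sum(p * len(s) for s, p in example_leaves)
print(f"~{avg_code / avg_src:.4f} bits per source symbol")  # ~0.728 for this tiny tree
```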
Decoder Tree
• For a V2V code with encoder tree T, the decoding process uses two binary decoder trees, Tp and Ts.
  • Each path from root to leaf in Tp corresponds to a primary codeword.
  • Each path from root to leaf in Ts corresponds to a secondary codeword.
  • Each leaf in Tp and Ts stores the source string that its codeword represents.
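Decoding is then a bit-by-bit tree walk. A minimal sketch, using nested dicts for the decoder tree of the toy codebook above (an illustrative data structure, not the compact representation used in the proposal):

```python
# Minimal decoder-tree walk. Internal nodes are {"0": child, "1": child};
# a leaf is just the decoded source string. This toy tree matches the toy
# codebook from the V2V coding slide above.
Tp = {
    "0": "000",                        # codeword "0"   -> source "000"
    "1": {
        "0": {"0": "001", "1": "01"},  # "100" -> "001", "101" -> "01"
        "1": "1",                      # "11"  -> "1"
    },
}

def v2v_decode(bitstream, tree):
    out, node = [], tree
    for bit in bitstream:
        node = node[bit]               # walk one compressed bit at a time
        if isinstance(node, str):      # reached a leaf: emit its source string
            out.append(node)
            node = tree
    return "".join(out)

print(v2v_decode("001111101", Tp))     # -> "0000001101"
```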
V2V: Compression performance and computational complexity
• Standalone Bernoulli tests on binary i.i.d. sources with known probabilities between 0 and 0.5.
• In all cases, V2V is well within 1% of BAC and the entropy.
• Encoding in the software implementation is 3-5 times as fast as BAC.
• Decoding in the software implementation is at least twice as fast as BAC.
Some discussions on hardware implementation
• V2V is predictable (based on table lookups) and can decode multiple bits (bins) at a time.
  • We have designed trees that can sustain a high throughput of more than 6 bits per clock cycle,
  • reaching more than 1 Gbps at a 200 MHz clock rate.
• V2V is estimated to be more power efficient than BAC.
  • Longer battery life in wireless communications.
• V2V is well suited to the parallel framework.
  • It is estimated that the throughput can easily be doubled by instantiating two V2V decoder blocks with a small increase in area cost (a conservative estimate shows about 20%).
  • One set of tables (trees) can be shared between the two decoders.
  • Unbalanced load? (see next slide)
Load Balancing
• d = 2 entropy decoders are available at the decoder.
• Each compressed bit takes a unit time to decode.
• Problem: the split point between the two decoders' shares (the red line in the original figure) might fall in the middle of a codeword.
• Solution: use additional bits to move the split point to the start of the next codeword (the blue line).
• This simple method is generalized to work in the following cases:
  • d is an arbitrary finite number;
  • d is not even known to the encoder;
  • BAC, V2V, VLC.
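A toy sketch of the boundary-alignment idea. It sidesteps the actual padding mechanism and simply cuts at the codeword boundary nearest each even-split target; `split_for_decoders` is illustrative, not the proposal's rule.

```python
# Illustrative split of an encoded stream between d decoders, always cutting
# at codeword boundaries. Simplification of the slide's idea: instead of
# inserting padding bits to push a mid-codeword split point (the red line) to
# the next boundary (the blue line), this sketch chooses the boundary nearest
# each even-split target.
def split_for_decoders(codewords, d):
    total = sum(len(cw) for cw in codewords)
    segments, current, acc = [], [], 0
    for cw in codewords:
        current.append(cw)
        acc += len(cw)
        if len(segments) < d - 1 and acc >= total * (len(segments) + 1) / d:
            segments.append(current)   # boundary lands after a whole codeword
            current = []
    segments.append(current)
    return segments

cws = ["0", "11", "101", "0", "100", "11", "0", "101"]
for k, seg in enumerate(split_for_decoders(cws, d=2)):
    print(f"decoder {k}: {''.join(seg)} ({sum(map(len, seg))} bits)")
```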
V2V in serial mode: Encoder Buffer
• The serial mode of operation interleaves codewords and uses the same interface as CABAC.
  • Binarization and context modeling are the same as in CABAC.
  • BAC is replaced by V2V.
• Problem:
  • Each bit may use a different V2V code from the last bit.
  • For correct decoding, codewords must enter the output stream in the order their first bit is referenced, which may differ from the order of code completion (i.e., when their last bit is referenced).
• Solution: a cyclic buffer at the encoder.
  • When parsing of a code tree starts, the next free entry in the cyclic buffer is allocated to the incomplete codeword.
  • Once a codeword is complete, it is written into the buffer at its allocated position.
  • The oldest entry in the buffer is written out when it is complete and the code selector (next slide) decides that writing is possible.
  • If the buffer is full, a request for a new entry forces a flush of the oldest code tree, using its secondary code.
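A simplified model of the reordering behaviour this buffer provides: slot allocation when parsing starts, possibly out-of-order completion, strictly in-order output. The capacity limit and the secondary-code flush are omitted.

```python
from collections import OrderedDict

class ReorderBuffer:
    """Simplified model of the encoder's cyclic buffer: slots are allocated
    in the order the first bin of each codeword is seen, codewords may
    complete out of order, and output is emitted strictly in slot order."""
    def __init__(self):
        self._slots = OrderedDict()   # slot id -> codeword, or None if incomplete
        self._next_id = 0

    def allocate(self):
        """Called when parsing of a code tree starts."""
        slot = self._next_id
        self._next_id += 1
        self._slots[slot] = None
        return slot

    def complete(self, slot, codeword):
        """Called when the codeword's last bin has been consumed; returns any
        codewords that can now leave the buffer, oldest first."""
        self._slots[slot] = codeword
        out = []
        while self._slots and next(iter(self._slots.values())) is not None:
            _, cw = self._slots.popitem(last=False)
            out.append(cw)
        return "".join(out)

buf = ReorderBuffer()
a, b = buf.allocate(), buf.allocate()
print(repr(buf.complete(b, "101")))  # '' -- slot a is older and still open
print(repr(buf.complete(a, "0")))    # '0101' -- both flushed in slot order
```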
V2V in serial mode: Code selector
• The code selector communicates to the decoder which codewords are primary and which are secondary.
• The secondary codewords are used much less frequently than the primary codewords.
• This frequency difference may be used to reduce the associated overhead.
Some comments
• Previous attempts at combining context modeling with VLC involved a master's student, a Shannon Award winner, and a Turing Award winner:
  • N. Faller, "An adaptive system for data compression," in Proc. 7th Asilomar Conf. Circuits, Systems, and Computers, 1973.
  • R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. IT-24, 1978 (25th anniversary of Huffman coding).
  • D. E. Knuth, "Dynamic Huffman coding," J. Algorithms, vol. 6, 1985.
• Unfortunately, adaptive Huffman coding has found little practical use, largely due to its implementation complexity.
• CABAC, which led directly to this work, is remarkable in its own right.
Soft-decision quantization
• Problem: fix the decoder; find the reconstruction sequence (encoder output) that minimizes the rate-distortion cost.
• Solution: suppose that a CABAC decoder is used.
  • Fix the context states Ω and optimize the RD cost over the quantization outputs u, i.e., solve min_u [ ||c - u·q||^2 + r(u | Ω) ]. Let u* denote the solution.
  • Update the context states Ω according to the obtained quantization output u*.
RIM, "Rate distortion optimization for interframe coding in hybrid video compression," COM16-C305, Oct. 2009.
Trellis and the Viterbi algorithm
• Within one DCT block, the search for the optimal u is in a 1-D vector space: dynamic programming is possible (the Viterbi algorithm).
• The trellis structure is determined by the context model; the trellis in the original figure (not reproduced here) is for the context model in CABAC.
• Complexity can easily be controlled by trimming the trellis, for example, keeping only the paths close to the hard-decision quantization output.
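A heavily simplified Viterbi sketch of the SDQ search: the trellis state here is just the previously chosen level, and `rate_bits` and `lam` are toy stand-ins for the context-conditioned rate model and the RD Lagrange multiplier. The real trellis states come from the CABAC context model, so this is an illustration of the search, not the proposal's actual trellis.

```python
def rate_bits(level, prev_level):
    # Toy rate model: zero levels are cheap, larger levels and level changes
    # cost more. A real model would come from the CABAC contexts.
    return (0.5 if level == 0 else 2.0 + abs(level)) + (0.5 if level != prev_level else 0.0)

def sdq_viterbi(coeffs, q, lam):
    # paths maps the trellis state (here: the last chosen level) to the best
    # (accumulated RD cost, levels chosen so far) reaching that state.
    paths = {0: (0.0, [])}
    for c in coeffs:
        hard = int(round(c / q))                  # hard-decision level
        candidates = {hard, 0}                    # hard decision, and zero
        if hard > 0:
            candidates.add(hard - 1)              # neighbor toward zero
        elif hard < 0:
            candidates.add(hard + 1)
        new_paths = {}
        for prev, (cost, levels) in paths.items():
            for u in candidates:
                d = (c - u * q) ** 2              # distortion term
                j = cost + d + lam * rate_bits(u, prev)
                if u not in new_paths or j < new_paths[u][0]:
                    new_paths[u] = (j, levels + [u])
        paths = new_paths
    return min(paths.values())[1]                 # levels on the cheapest path

print(sdq_viterbi([13.0, 2.4, -0.8, 7.1], q=5.0, lam=10.0))
```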
Iterative coding
• Notation
  • Original frame: X
  • Prediction: P
  • Reconstructed residuals: U
  • Reconstruction = P + U
  • Modes: m; reference frames: f; motion vectors: V
• Step 1 [Motion estimation / find P]: for a given residual reconstruction U, compute (m, f, V) by solving min_{m,f,V} [ d(X - P(m, f, V), U) + (r(m) + r(f) + r(V)) ].
• Step 2 [Residual coding / find U]: for the given P(m, f, V), apply the SDQ described above to optimize the quantization outputs, which in turn determine the reconstructed residuals U.
• Repeat Steps 1 and 2 until the change in the actual RD cost is less than a given threshold.
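The alternation can be written as a small fixed-point loop. A sketch with the encoder stages abstracted as callables; `motion_estimation`, `residual_coding`, and `rd_cost` are placeholders for the stages described on this slide, and the exact stopping rule is illustrative.

```python
# Sketch of the iterative encoder: alternate motion estimation (find P)
# and SDQ residual coding (find U) until the RD cost stops improving.
def iterative_encode(X, motion_estimation, residual_coding, rd_cost,
                     threshold=1e-3, max_iters=10):
    U = 0                                   # start from all-zero reconstructed residuals
    P = motion_estimation(X, U)             # Step 1: find (m, f, V) and hence P
    U = residual_coding(X, P)               # Step 2: SDQ -> reconstructed residuals
    cost = rd_cost(X, P, U)
    for _ in range(max_iters):
        P = motion_estimation(X, U)
        U = residual_coding(X, P)
        new_cost = rd_cost(X, P, U)
        if cost - new_cost < threshold:     # stop when the RD cost change is small
            break
        cost = new_cost
    return P, U
```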
Deblocking
• An example of how to apply the idea of iterative coding to deblocking/loop filtering.
• Problem:
  • In-loop filtering takes about 1/3 of the total decoding time [List et al. 2003].
  • Original frames are not referenced in in-loop filtering: for the current frame, deblocking is part of post-processing.
  • Deblocking does improve PSNR in most cases, and thus cannot simply be skipped.
Encoder only deblocking: Inter-coding Case
• Notation
  • Original frame: X
  • Prediction from motion estimation: P
  • Reconstructed residuals: U
  • Reconstruction: Y = P + U
  • Reconstruction after deblocking: Z
• Not integrated to generate the submitted streams.
• Algorithm [based on the concept of iterative coding]
  1. Generate F such that F_i,j = Z_i,j if |X_i,j - Z_i,j| < |X_i,j - Y_i,j|, and F_i,j = Y_i,j otherwise.
  2. Calculate X' = X - (F - P) (F - P can be regarded as residuals), and use X' as the original frame to find P' through motion estimation.
  3. Calculate X - P', then transform, quantize, and reconstruct the new residuals U'.
  4. Transmit P' and U' to the decoder as if they were P and U.
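A per-pixel sketch of Steps 1-2, using numpy; `motion_estimation` is a placeholder for the encoder's search, and the selection rule is the per-pixel "whichever of Z or Y is closer to X" choice from Step 1.

```python
import numpy as np

def encoder_only_deblocking_pass(X, P, Y, Z, motion_estimation):
    """One pass of the encoder-only deblocking idea.
    X: original frame, P: prediction, Y = P + U: reconstruction,
    Z: deblocked reconstruction. `motion_estimation(frame)` is a placeholder
    returning a new prediction for the given 'original' frame."""
    # Step 1: per pixel, keep whichever of Z or Y is closer to the original X.
    F = np.where(np.abs(X - Z) < np.abs(X - Y), Z, Y)
    # Step 2: treat F - P as residuals and re-run motion estimation on X'.
    X_prime = X - (F - P)
    P_prime = motion_estimation(X_prime)
    # Step 3 (transform/quantization of X - P_prime to get U') would follow here.
    return X_prime, P_prime
```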