Scalable Media Coding Applications and New Directions David Taubman School of Electrical Engineering & Telecommunications, UNSW Australia
Overview of Talk • Backdrop and Perspective • interactive remote browsing of media • JPEG2000, JPIP and emerging applications • Approaches to scalable video • fully embedded approaches based on wavelet lifting • partially embedded approaches based on inter-layer prediction • important things to appreciate • Key challenges for scalable video and other media • focus on motion, depth and natural models • Early work on motion for scalable coding • focus: reduce artificial boundaries • Current focus: everything is imagery • depth, motion, boundary geometry • scalable compression of breakpoints (sparse innovations) • estimation algorithms for joint discovery of depth, motion and geometry • temporal flows based on breakpoints • Summary Taubman; PCS’13, San Jose
Emerging Trends • Video formats • QCIF (25 Kpel), CIF (100 Kpel), 4CIF/SDTV (½ Mpel), HDTV (2 Mpel), UHDTV 4K (10 Mpel) and 8K (32 Mpel) • Cinema: 24/48/60 fps; UHDTV: potentially up to 120 fps • HFR: helmet cams & professional cameras now do 240 fps • Displays • “retina” resolutions (200 to 400 pixels/inch) • what resolution video do I need for an iPad? (2048x1536?) • Internet and mobile devices • vast majority of internet traffic is video • 100s of separate non-scalable encodings of a video • http://gigaom.com/2012/12/18/netflix-encoding • New media: multi-view video, depth, … Taubman; PCS’13, San Jose
Embedded media • One code-stream, many subsets of interest [figure: a compressed bit-stream with embedded subsets for low/medium/high resolution, low/medium/high quality and low/medium/high frame rate] • Heterogeneous clients (classic application) • match client bandwidth, display, computation, … • Graceful degradation (another classic) • more important subsets protected more heavily, … • Robust offloading of sensitive media • backup most important elements first Taubman; PCS’13, San Jose
Our focus: Interactive Browsing • Image quality progressively improves over time • Video quality improves each time we go back • Region (window) of interest accessibility • in space, in time, across views, … • can yield enormous effective compression gains • prioritized streaming based on “degree of relevance” • some elements contribute only partially to the window of interest • Related application: Early retrieval of surveillance • can’t send everything over the air • can’t wait until the plane lands Taubman: PCS’13, San Jose
Scalable images – things that work well • Multi-resolution transforms • 2D wavelet transforms work well • Embedded coding • Successive refinement through bit-plane coding • Multiple coding passes per bit-plane improve embedding [figure: embedded coding with dead-zone quantization; bit-plane coding (truncation) and step-size modulation traverse the R-D curve; 2 coding passes per bit-plane] • Accessibility through partitioned coding of subbands • Region of interest access without any blocking artefacts Taubman; PCS’13, San Jose
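To make the successive-refinement idea above concrete, here is a minimal sketch (not the actual JPEG2000/EBCOT coding passes) of dead-zone quantization followed by bit-plane truncation: keeping any prefix of the bit-planes yields a coarser version of the same quantized data, so distortion falls monotonically as planes are added. The coefficient statistics and step size are arbitrary assumptions for illustration.

```python
import numpy as np

def deadzone_quantize(coeffs, step):
    """Dead-zone scalar quantizer: sign and magnitude index."""
    signs = np.sign(coeffs)
    mags = (np.abs(coeffs) // step).astype(np.int64)
    return signs, mags

def reconstruct_from_bitplanes(signs, mags, step, num_planes_kept, total_planes):
    """Keep only the most significant `num_planes_kept` bit-planes of the
    magnitude indices, mimicking truncation of an embedded bit-stream."""
    dropped = total_planes - num_planes_kept
    coarse = (mags >> dropped) << dropped          # zero the discarded planes
    # mid-point reconstruction within the retained quantization bin
    recon = np.where(coarse > 0, (coarse + (1 << dropped) * 0.5) * step, 0.0)
    return signs * recon

rng = np.random.default_rng(0)
coeffs = rng.laplace(scale=10.0, size=1000)        # stand-in for subband samples
step = 0.5
signs, mags = deadzone_quantize(coeffs, step)
total_planes = int(mags.max()).bit_length()

for kept in range(1, total_planes + 1):
    approx = reconstruct_from_bitplanes(signs, mags, step, kept, total_planes)
    mse = np.mean((coeffs - approx) ** 2)
    print(f"bit-planes kept: {kept:2d}   MSE: {mse:8.4f}")   # distortion falls as planes are added
```

A real embedded coder further splits each bit-plane into multiple coding passes, giving finer truncation points along the R-D curve.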
JPEG2000 – more than compression: decoupling and embedding [figure: subbands LL2, HL2, LH2, HH2, HL1, LH1, HH1, each partitioned into embedded code-block bit-streams] Taubman; PCS’13, San Jose
JPEG2000 – more than compression Spatial random access Taubman; PCS’13, San Jose
JPEG2000 – more than compression: quality and resolution scalability [figure: subbands LL2, HL2, LH2, HH2, HL1, LH1, HH1 distributed across quality layers 1, 2 and 3] Taubman; PCS’13, San Jose
JPEG2000 – JPIP interactivity (IS 15444-9) • Client sends “window requests” • spatial region, resolution, components, … • Server sends “JPIP stream” messages • self-describing, arbitrarily ordered • pre-emptable, server-optimized data stream • Server typically models client cache • avoids redundant transmission • JPIP also does metadata • scalability & accessibility for text, regions, XML, … [figure: application issues a window request to the JPIP client; the client forwards it to the JPIP server, which serves window imagery from the target (file or code-stream) using a cache model; JPIP stream + response headers and status flow back into the client cache for decompression/rendering] Taubman; PCS’13, San Jose
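As a concrete illustration of a “window request”, the sketch below builds a JPIP view-window query using the fsiz/roff/rsiz/layers request fields; the server path, target name and use of a bare HTTP GET are assumptions for illustration, not a tested client against any particular server.

```python
from urllib.parse import urlencode

def jpip_window_request(host, target, full_width, full_height,
                        region_x, region_y, region_w, region_h, max_layers=None):
    """Build a (hypothetical) JPIP view-window request URL.

    fsiz  : desired full frame size (selects the resolution level)
    roff  : offset of the window of interest within that frame
    rsiz  : size of the window of interest
    layers: optional cap on the number of quality layers
    """
    params = {
        "target": target,
        "fsiz": f"{full_width},{full_height}",
        "roff": f"{region_x},{region_y}",
        "rsiz": f"{region_w},{region_h}",
    }
    if max_layers is not None:
        params["layers"] = str(max_layers)
    return f"http://{host}/jpip?{urlencode(params, safe=',')}"

# Example: ask for a 512x384 window at offset (1024, 768) of a 4096x3072 rendition
print(jpip_window_request("example.com", "city.jp2",
                          4096, 3072, 1024, 768, 512, 384, max_layers=5))
```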
What can you do with JPIP? • Highly efficient interactive navigation within • large images (giga-pixel, even tera-pixel) • medical volumes • virtual microscopy • window of interest access, progressive to lossless • interactive metadata • Interactive video • frame of interest • region of interest • frame rate and resolution of interest • quality improves each time we go back over content (Demos: aerial imagery, CAT scan, album, campus, panoramic video) Taubman; PCS’13, San Jose
Overview of Talk • Backdrop and Perspective • interactive remote browsing of media • JPEG2000, JPIP and emerging applications • Approaches to scalable video • fully embedded approaches based on wavelet lifting • partially embedded approaches based on inter-layer prediction • important things to appreciate • Key challenges for scalable video and other media • focus on motion, depth and natural models • Early work on motion for scalable coding • focus: reduce artificial boundaries • Current focus: everything is imagery • depth, motion, boundary geometry • scalable compression of breakpoints (sparse innovations) • estimation algorithms for joint discovery of depth, motion and geometry • temporal flows based on breakpoints • Summary Taubman; PCS’13, San Jose
Wavelet-like approaches: Motion Compensated Temporal Lifting • Lifting factorization exists for any FIR temporal wavelet transform • Warp frames prior to each lifting filter ℓj • motion compensation aligns frame features • invertibility unaffected (any motion model) [figure: even frames and odd frames are warped and passed through lifting filters ℓ1, ℓ2, …, ℓL to produce low-pass and high-pass frames] Taubman; PCS’13, San Jose
Motion Compensated Temporal Lifting • Equiv. to filtering along motion trajectories • subject to good MC interpolation filters • subject to existence of 2D motion trajectories • Handles expansive/contractive motion flows • motion trajectory filtering interpretation still valid for all but highest spatial freqs • subject to disciplined warping to minimize aliasing effects • Need good motion! • smooth, bijective mappings wherever possible • avoid unnecessary motion discontinuities (e.g., blocks) Taubman; PCS’13, San Jose
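A minimal sketch of where the warps sit in motion-compensated temporal lifting, using a Haar-style predict/update pair rather than the longer 5/3 filters discussed here; the warp operators are caller-supplied callables (identity by default), and the point of the final assertion is that synthesis undoes analysis exactly regardless of what the warps do, which is the invertibility property claimed on these slides.

```python
import numpy as np

def mc_haar_analysis(even, odd, warp_e2o, warp_o2e):
    """One motion-compensated Haar lifting level.

    even, odd : lists of frames (2D numpy arrays)
    warp_e2o(f, k): warp even frame f onto the grid of odd frame k (predict step)
    warp_o2e(h, k): warp high-pass frame h onto the grid of even frame k (update step)
    """
    high = [odd[k] - warp_e2o(even[k], k) for k in range(len(odd))]          # predict
    low = [even[k] + 0.5 * warp_o2e(high[k], k) for k in range(len(even))]   # update
    return low, high

def mc_haar_synthesis(low, high, warp_e2o, warp_o2e):
    even = [low[k] - 0.5 * warp_o2e(high[k], k) for k in range(len(low))]    # undo update
    odd = [high[k] + warp_e2o(even[k], k) for k in range(len(high))]         # undo predict
    return even, odd

# Toy check with identity "warps": perfect reconstruction regardless of the warp model
identity = lambda frame, k: frame
rng = np.random.default_rng(1)
frames = [rng.random((4, 4)) for _ in range(8)]
even, odd = frames[0::2], frames[1::2]
low, high = mc_haar_analysis(even, odd, identity, identity)
even2, odd2 = mc_haar_synthesis(low, high, identity, identity)
assert all(np.allclose(a, b) for a, b in zip(even + odd, even2 + odd2))
```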
t+2D Structure with MC Lifting [figure: input frames undergo MC lifting (t+2D analysis) followed by spatial DWT to produce spatio-temporal subbands H1, H2, L2; t+2D synthesis reverses the path] • Dimensions of scalability: temporal resolution, spatial resolution, quality • NB: other interesting structures exist Taubman; PCS’13, San Jose
Inter-layer prediction (SVC/SHVC) [figure: hierarchical prediction structures in a low-resolution layer and a high-resolution layer, with inter-layer prediction between them] • Multiple predictors • pick one? • blend? • Implications for embedding … Taubman: PCS’13, San Jose
Important Principles • Scalability is a multi-dimensional phenomenon • not just a sequential list of layers • Quality scalability critical for embedded schemes • the low-res subset needs much higher quality than low-res display alone would require, since it also serves the high-res reconstruction • Physical motion scales naturally • not generally true for block-based approximations • but, 2D motion discontinuous at boundaries • Prediction alone is sub-optimal • full spatio-temporal transforms preferred • improve the quality of embedded resolutions and frame rates • noise orthogonalisation • ideally orthogonal transforms • note body of work by Flierl et al. (MMSP’06 through PCS’13) Taubman: PCS’13, San Jose
Temporal transforms: why prediction alone is sub-optimal [figure: bi-directional prediction of odd frames from even frames; forward transform, quantization, reverse transform] • Redundant spanning of low-pass content by both channels • High-pass quantization noise has unnecessarily high energy gain Taubman: PCS’13, San Jose
Reduced noise power through lifting • Inject a negative fraction of the high band into the low-band synthesis path • removes low-frequency noise power from the synthesized high band • Add a compensating step in the forward transform • does not affect the energy-compacting properties of prediction [figure: even/odd frame lifting structure with the compensating step] Taubman: PCS’13, San Jose
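A small numeric check of this argument, assuming a Haar-style predict/update pair: the synthesis footprint of a single quantized high-band sample has lower energy and no DC (low-frequency) content once the compensating update step is present, which is exactly the “negative fraction of the high band injected into the low-band synthesis path”.

```python
import numpy as np

def synthesize(low, high, with_update):
    """Two-sample synthesis for a Haar-style lifting pair.
    with_update=False: prediction-only (even = low, odd = high + even)
    with_update=True : lifting with update (even = low - high/2)."""
    even = low - 0.5 * high if with_update else low
    odd = high + even
    return np.array([even, odd])

for with_update in (False, True):
    # response of the synthesis to a single unit high-band coefficient
    footprint = synthesize(low=0.0, high=1.0, with_update=with_update)
    energy = float(np.sum(footprint ** 2))
    dc = float(np.sum(footprint))          # low-frequency (DC) leakage
    print(f"update step: {with_update}  energy gain: {energy:.2f}  DC content: {dc:.2f}")

# prediction-only : energy gain 1.00, DC content 1.00
# with update step: energy gain 0.50, DC content 0.00
```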
Overview of Talk • Backdrop and Perspective • interactive remote browsing of media • JPEG2000, JPIP and emerging applications • Approaches to scalable video • fully embedded approaches based on wavelet lifting • partially embedded approaches based on inter-layer prediction • important things to appreciate • Key challenges for scalable video and other media • focus on motion, depth and natural models • Early work on motion for scalable coding • focus: reduce artificial boundaries • Current focus: everything is imagery • depth, motion, boundary geometry • scalable compression of breakpoints (sparse innovations) • estimation algorithms for joint discovery of depth, motion and geometry • temporal flows based on breakpoints • Summary Taubman; PCS’13, San Jose
Motion for Scalable Video • Fully scalable video requires scalable motion • reduce motion bit-rate as video quality reduces • reduce motion resolution as video resolution reduces • First demonstration (Taubman & Secker, 2003) • 16x16 triangular mesh motion model • Wavelet transform of mesh node vectors • EBCOT coding of mesh subbands • Model-based allocation of motion bits to quality layers • Pure t+2D motion-compensated temporal lifting Taubman; PCS’13, San Jose
Scalable motion – very early results [R-D plots: CIF Bus and CIF Bus at QCIF resolution, luminance PSNR (dB) vs bit-rate (kbit/s), comparing non-scalable, brute-force, model-based and lossless-motion variants; CIF Mobile, luminance PSNR (dB) vs bit-rate (Mbit/s), comparing 5/3 temporal lifting and 1/3 (hierarchical B-frames) against high-complexity H.264] • H.264 results • CABAC • 5 prev, 3 future ref frames • multi-hypothesis testing • (courtesy of Marcus Flierl) Taubman; PCS’13, San Jose
Motion challenges • Issues: • smooth motion fields scale well • mesh is guaranteed to be smooth and invertible everywhere • but, real motion fields have discontinuities • Hierarchical block-based schemes • produce a massive number of artificial discontinuities • not invertible – i.e., there are no motion trajectories • non-physical – hence, not easy to scale • but, easy to optimize for energy compaction • particularly effective at lower bit-rates • Depth/disparity has all the same issues as motion • both tend to be piecewise smooth media • NB: bandlimited sampling considerations may not apply • boundary discontinuity modeling more important for these media Taubman; PCS’13, San Jose
Aliasing challenges • Fundamental constraint for perfect reconstruction: a half-band filter condition (see the sketch below) [figure: analysis filter responses of the popular 9/7 wavelet transform; extracting the LL subband introduces spatial aliasing] Taubman; PCS’13, San Jose
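For reference, one standard way to state the constraint this slide alludes to, assuming a two-channel biorthogonal filter bank with alias-cancelling synthesis filters and delay normalized away: the product of the low-pass analysis and synthesis filters must be half-band.

```latex
% Perfect reconstruction (two-channel, alias-cancelling, zero-delay normalization):
% the product filter P(z) = H_0(z) G_0(z) must be half-band.
\[
  P(z) \;=\; H_0(z)\,G_0(z), \qquad P(z) + P(-z) \;=\; 2 .
\]
% Equivalently, on the unit circle:  P(e^{j\omega}) + P(e^{j(\omega+\pi)}) = 2.
```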
Spatial scalability – t+2D • Temporal transform (predict and update steps) uses full spatial resolution • At reduced spatial resolution • temporal synthesis steps are missing high-resolution info • if motion trajectories are wrong/non-physical → ghosting • if trajectories are valid → temporal synthesis reduces aliasing • less aliasing than regular spatial DWT at reduced resolution Taubman; PCS’13, San Jose
Overview of Talk • Backdrop and Perspective • interactive remote browsing of media • JPEG2000, JPIP and emerging applications • Approaches to scalable video • fully embedded approaches based on wavelet lifting • partially embedded approaches based on inter-layer prediction • important things to appreciate • Key challenges for scalable video and other media • focus on motion, depth and natural models • Early work on motion for scalable coding • focus: reduce artificial boundaries • Current focus: everything is imagery • depth, motion, boundary geometry • scalable compression of breakpoints (sparse innovations) • estimation algorithms for joint discovery of depth, motion and geometry • high level view of the coding framework we are pursuing • Summary Taubman; PCS’13, San Jose
Block-based schemes with merging (Mathew and Taubman, 2006) • Linear & affine models • encourage larger blocks • Merging of quad-tree nodes • encourages larger regions and improves efficiency • merging approach later picked up by the HEVC standard • Hierarchical coding • works very well with merging; provides resolution scalability [figure: leaf merging under pure translation, linear and affine motion models] Taubman; PCS’13, San Jose
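A rough sketch of the tree-pruning side of this idea, assuming each node carries a Lagrangian cost J = D + λR for each candidate motion model: children are collapsed whenever the parent’s best single model is no more expensive than signalling a split plus coding the children. The leaf-merging step of Mathew & Taubman (joining neighbouring leaves that share a model) is not reproduced here; the split-flag cost and toy numbers are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    model_costs: Dict[str, float]            # Lagrangian cost J = D + lam*R per candidate model
    children: List["Node"] = field(default_factory=list)
    best_model: str = ""
    pruned: bool = False

SPLIT_FLAG_COST = 1.0  # illustrative cost of signalling a split decision

def prune(node: Node) -> float:
    """Bottom-up pruning: return the cheapest cost of representing this subtree."""
    node.best_model = min(node.model_costs, key=node.model_costs.get)
    leaf_cost = node.model_costs[node.best_model]
    if not node.children:
        node.pruned = True
        return leaf_cost
    split_cost = SPLIT_FLAG_COST + sum(prune(c) for c in node.children)
    if leaf_cost <= split_cost:
        node.pruned = True        # collapse the children: one (larger) block, one model
        node.children = []
        return leaf_cost
    return split_cost

# Toy example: a parent whose affine model beats four translational children
children = [Node({"translation": 3.0, "affine": 4.0}) for _ in range(4)]
root = Node({"translation": 20.0, "affine": 9.0}, children=children)
print(prune(root), root.best_model, len(root.children))   # 9.0 affine 0 -> collapsed
```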
Boundary geometry and merging • Model motion & boundary • no merging: Hung et al. (2006); Escoda et al. (2007) • with merging: Mathew & Taubman (2007); separate quad-trees (2008) [figure panels: motion compensation with models M1 and M2; motion only; motion + boundary; separate quad-trees; combined result] Taubman; PCS’13, San Jose
Indicative Performance • Things that reduce artificial discontinuities: • modeling geometry as well as motion • separately pruned trees for geometry and motion • merging nodes from the pruned quad-trees • These schemes are practical and resolution scalable • readily optimized across the hierarchy [R-D comparison: single quad-tree with motion only; single quad-tree with motion + merging; single quad-tree with motion, geometry and merging; two quad-trees with motion, geometry and merging] Taubman; PCS’13, San Jose
Overview of Talk • Backdrop and Perspective • interactive remote browsing of media • JPEG2000, JPIP and emerging applications • Approaches to scalable video • fully embedded approaches based on wavelet lifting • partially embedded approaches based on inter-layer prediction • important things to appreciate • Key challenges for scalable video and other media • focus on motion, depth and natural models • Early work on motion for scalable coding • focus: reduce artificial boundaries • Current focus: everything is imagery • depth, motion, boundary geometry • scalable compression of breakpoints (sparse innovations) • estimation algorithms for joint discovery of depth, motion and geometry • temporal flows induced by breakpoints • Summary Taubman; PCS’13, San Jose
Geometry from arc breakpoints • Breaks can represent segmentation contours • but don’t need a segmentation • breaks are all we need for adaptive transforms • direct R-D optimisation is possible • Natural resolution scalability • finer resolution arcs embedded in coarser arcs [figure: coarse and finer grids, showing grid points and arcs] Taubman: PCS’13, San Jose
Breakpoint induction and scalability • Explicitly identify a subset of breakpoints – “vertices” • remaining breakpoints get induced • position of a break on its arc impacts induction • Representation is fully scalable • resolution improves as we add vertices at finer scales • quality improves as we add precision to breakpoint positions • Breakpoint representation is image-like • vertex density closely related to image resolution • vertex accuracy like sample accuracy in an image hierarchy [figure: coarse and finer grids; direct induction onto sub-arcs; spatial induction onto root-arcs] Taubman: PCS’13, San Jose
Breakpoint Adaptive Transforms (BPA-DWT) – sequence of non-separable 2D lifting steps (P1, U1, P2, U2) • Breakpoints drive an adaptive DWT • Basis functions do not cross a discontinuity along an arc • Max of one breakpoint per arc • Adaptive transform well defined [figure: original field samples transformed into an arc breakpoint pyramid and a field sample pyramid] Taubman: PCS’13, San Jose
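A 1D caricature of the breakpoint-adaptive principle rather than the non-separable 2D P1/U1/P2/U2 steps themselves: each odd sample is predicted only from even neighbours not separated from it by a breakpoint, so no basis function straddles a discontinuity, and the update step reuses the same connectivity so the transform remains invertible. The filter weights below are illustrative assumptions.

```python
import numpy as np

def bpa_lifting_1d(x, breaks):
    """One 5/3-style lifting level where `breaks[i]` is True if a breakpoint
    lies on the arc between samples i and i+1 (so prediction must not cross it)."""
    even, odd = x[0::2].astype(float).copy(), x[1::2].astype(float).copy()
    n = len(odd)
    weights = np.zeros((n, 2))                    # predict weights from left/right even neighbours
    for k in range(n):
        left_ok = not breaks[2 * k]               # arc between even[k] and odd[k]
        right_ok = (2 * k + 1 < len(breaks)) and not breaks[2 * k + 1] and (k + 1 < len(even))
        if left_ok and right_ok:
            weights[k] = (0.5, 0.5)
        elif left_ok:
            weights[k] = (1.0, 0.0)
        elif right_ok:
            weights[k] = (0.0, 1.0)               # else: no usable neighbour, predict 0
    high = odd.copy()
    low = even.copy()
    for k in range(n):                            # predict
        pred = weights[k, 0] * even[k] + (weights[k, 1] * even[k + 1] if weights[k, 1] else 0.0)
        high[k] -= pred
    for k in range(n):                            # update, reusing the same connectivity
        low[k] += 0.5 * weights[k, 0] * high[k]
        if weights[k, 1] and k + 1 < len(low):
            low[k + 1] += 0.5 * weights[k, 1] * high[k]
    return low, high

# Piecewise-constant signal with one discontinuity; a breakpoint on the arc that crosses it
x = np.array([3, 3, 3, 3, 9, 9, 9, 9], dtype=float)
breaks = [False] * 7
breaks[3] = True                                  # discontinuity between samples 3 and 4
low, high = bpa_lifting_1d(x, breaks)
print(high)                                       # all zeros: no basis function crosses the break
```

On this piecewise-constant test signal the detail band comes out exactly zero, because the adapted prediction never crosses the break.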
Arc-bands and sub-bands [figure: sub-bands (BPA-transformed data) and arc-bands (vertices) at levels 0, 1 and 2; code-blocks; root and non-root arcs] Taubman: PCS’13, San Jose
Embedded Block Coding – for scalability and ROI accessibility • Sub-band stream (field samples) • Sub-bands divided into code blocks • Coded using EBCOT (JPEG2000) • Bit-planes assigned to quality layers • Vertex stream • Arc-bands divided into code blocks • Coding scheme similar to EBCOT • Bit-planes refine vertex locations • Bit-planes assigned to quality layers [figure: bit-planes of the sub-band stream and vertex stream assigned to quality layers λ1, λ2, λ3 down to the LSB] Taubman: PCS’13, San Jose
Fully Scalable Depth Coding – field samples are depth; breaks are depth discontinuities • JPEG 2000, 50 kbits • resolution scalable • quality scalable • no blocks • but poorly suited to discontinuities in depth/motion fields • Proposed, 50 kbits • resolution scalable • quality scalable • no blocks • well suited to discontinuous depth/motion fields Taubman: PCS’13, San Jose
R-D Results for Depth Coding • JPEG2000: 5 levels of decomposition with 5/3 DWT • Breakpoint-adaptive: vertex stream is truncated at a quality level; sub-band stream decoded progressively Taubman: PCS’13, San Jose
Model-based quality layering • Scaled by discarding sub-band and arc-band quality layers • fully automatic model-based quality layer formation • model-based interleaving of all quality layers for optimal embedding Taubman: PCS’13, San Jose
Model-based quality layering • Compared with segmentation based approach (Zanuttigh & Cortelazzo, 2009) • not scalable; sensitive to initial choice of segmentation complexity Taubman: PCS’13, San Jose
Fully scalable motion coding – preliminary • Field samples are motion vectors • motion coded using EBCOT after BPA-DWT • Breakpoints and motion jointly estimated • compression-regularised optical flow (objective sketched below) • Coded length L provides the prior (regularisation) • reflects cost of scalably coding arc breakpoints • plus cost of coding motion wavelet coefficients after BPA-DWT • Distortion D provides the observation model Taubman: PCS’13, San Jose
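The trailing colon on the “compression-regularised optical flow” bullet presumably introduced an equation on the original slide; a generic statement of the idea, with notation assumed rather than taken from the paper, is:

```latex
% Compression-regularised optical flow (generic form, notation assumed):
% M : motion field,  B : breakpoint field,  f_1, f_2 : the two frames
\[
  (\hat{M}, \hat{B})
  \;=\; \arg\min_{M,\,B} \;
  \underbrace{D\!\left(f_1, f_2; M\right)}_{\text{observation model}}
  \;+\; \lambda\,
  \underbrace{L\!\left(M, B\right)}_{\text{coded length (prior)}} .
\]
```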
Optimisation framework • Approach based on loopy belief propagation • Breakpoints/vertices connected via inducing rules • turns out to be very efficient • complexity grows only linearly with image size • Breakpoints/motion connected via BPA-DWT • initial simplification: motion at each pixel location drawn from a small finite set (cardinality 5) • initial motion candidates generated by block search with varying block sizes • System always converges rapidly • Use modes of the marginal beliefs at each node Taubman: PCS’13, San Jose
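A generic sum-product loopy-BP sketch on a 4-connected grid with a small set of candidate labels per node, ending with the mode of each marginal belief as the slide describes; the unary and pairwise potentials here are placeholders, not the breakpoint-induction or BPA-DWT factors of the actual system.

```python
import numpy as np

def loopy_bp_grid(unary, smooth, iters=10):
    """Sum-product loopy BP on an HxW 4-connected grid.
    unary : (H, W, K) non-negative compatibilities for K candidate labels
    smooth: (K, K) symmetric pairwise compatibility between neighbouring labels
    Returns the label index maximising the marginal belief at each node."""
    H, W, K = unary.shape
    # msgs[d, y, x] = message into node (y, x) from its neighbour at (y+dy, x+dx)
    msgs = np.ones((4, H, W, K))
    offsets = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}
    opposite = {0: 1, 1: 0, 2: 3, 3: 2}
    for _ in range(iters):
        new = np.ones_like(msgs)
        for d, (dy, dx) in offsets.items():
            # sender's belief, excluding the message that came from the receiver
            prod = unary.copy()
            for d2 in range(4):
                if d2 != opposite[d]:
                    prod *= msgs[d2]
            out = prod @ smooth                      # marginalise the sender's label
            out /= out.sum(axis=-1, keepdims=True) + 1e-12
            # shift so that out at the sender becomes the message arriving at the receiver
            shifted = np.ones_like(out)
            ys = slice(max(-dy, 0), H + min(-dy, 0))       # receiver rows
            xs = slice(max(-dx, 0), W + min(-dx, 0))       # receiver cols
            ys_src = slice(max(dy, 0), H + min(dy, 0))     # sender rows
            xs_src = slice(max(dx, 0), W + min(dx, 0))     # sender cols
            shifted[ys, xs] = out[ys_src, xs_src]
            new[d] = shifted
        msgs = new
    belief = unary * np.prod(msgs, axis=0)
    return belief.argmax(axis=-1)

# Toy problem: 5 motion candidates, noisy unaries, smoothness favouring equal labels
rng = np.random.default_rng(2)
K, H, W = 5, 8, 8
unary = rng.random((H, W, K)) + 3.0 * (np.arange(K) == 2)   # candidate 2 weakly preferred
smooth = np.full((K, K), 0.1) + 0.9 * np.eye(K)
labels = loopy_bp_grid(unary, smooth, iters=8)
print(labels)
```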
Initial scalable motion results • 2 frames, inter-predict only; occlusions removed • newer work eliminates restriction to finite motion set [R-D plot legend: H.264; JPEG2000; truncated motion bits; truncated vertex and motion bits] Taubman: PCS’13, San Jose
Visualising the breaks & motion Taubman: PCS’13, San Jose
Temporal induction of motion – preliminary • Given: • motion from f1 to f2 • plus breakpoint fields in f1 and f2 • Induce: • motion from f2 to f1 • resolve ambiguities, find occlusions [figure: frames f1 and f2] Taubman: PCS’13, San Jose
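A crude sketch of the forward-to-backward induction step, ignoring the breakpoint fields and sub-pixel motion entirely: forward vectors from f1 are splatted onto the grid of f2 with reversed sign; pixels of f2 reached by no vector are flagged as occlusions/disocclusions, and pixels reached more than once are the ambiguities the slide says must be resolved (here, naively, the last writer wins).

```python
import numpy as np

def induce_backward_motion(fwd):
    """Given an integer-valued forward motion field fwd[y, x] = (dy, dx) mapping
    frame f1 onto frame f2, induce a backward field f2 -> f1 by splatting.
    Returns (bwd, covered): un-covered pixels of f2 are occlusions/disocclusions;
    pixels hit more than once are ambiguities (here, last writer wins)."""
    H, W, _ = fwd.shape
    bwd = np.zeros_like(fwd)
    covered = np.zeros((H, W), dtype=int)          # how many source pixels land here
    for y in range(H):
        for x in range(W):
            dy, dx = fwd[y, x]
            ty, tx = y + dy, x + dx
            if 0 <= ty < H and 0 <= tx < W:
                bwd[ty, tx] = (-dy, -dx)
                covered[ty, tx] += 1
    return bwd, covered

# Toy example: a 2-pixel-wide object moving right by 2 over a static background
H, W = 6, 10
fwd = np.zeros((H, W, 2), dtype=int)
fwd[2:4, 3:5] = (0, 2)                             # the moving object
bwd, covered = induce_backward_motion(fwd)
print("occluded/disoccluded pixels of f2:", np.argwhere(covered == 0).tolist())
print("ambiguous pixels of f2 (multiple sources):", np.argwhere(covered > 1).tolist())
```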
Synthetic example [figure panels: inferred horizontal motion; inferred vertical motion; inferred occlusion; resolved ambiguities] Taubman: PCS’13, San Jose
Temporal induction of breaks – preliminary • Given: • motion from frame f1 to frame f3 • plus breakpoint fields in f1 and f3 • plus motion from f1 to f2 (real or estimated) • Induce: • breakpoint field in frame f2 • hence estimate all other motion fields • note: induced occlusions are temporal breaks [figure: frames f1, f2 and f3] Taubman: PCS’13, San Jose
Synthetic example [figure: horizontal breaks in frame 1; induced horizontal breaks in frame 2; horizontal breaks in frame 3] Taubman: PCS’13, San Jose
Things we are working on • Message approximation strategies for joint inference of breakpoint fields with motion • compression regularized optical flow • continuous motion, notionally at infinite precision • Advanced breakpoint coding techniques • including differential coding schemes • i.e., geometry transforms • including inter-block conditional coding schemes that do not break the EBCOT paradigm • Breakpoint adaptive spatio-temporal transforms • breakpoints potentially address most of the open issues surrounding motion compensated lifting schemes • Integration within our JPEG2000 SDK (Kakadu) • interactive access, JPIP for remote browsing, etc. Taubman: PCS’13, San Jose
Demo placeholder • Interactive browsing of breakpoint media over JPIP Taubman: PCS’13, San Jose
Arc breakpoints vs graph edges • arcs = possible edges on spatial graph • breaks = missing edges • main point of divergence • arc breakpoints have locations (key to induction & scalability) • graph edges may be weighted (equivalent to “soft” breaks) • Related approaches: “Graph based transforms for depth video coding” Kim, Narang and Ortega, ICASSP’12 “Video coder based on lifting transforms on graphs” Martinez-Enriquez, Diaz-de-Maria & Ortega, ICIP’11 • Edges (binary breaks) from segmentation (JBIG encoded) • Motion (block based) produces temporal edges • Spatio-temporal video transform adapts to graph • hierarchical, but not scalable in the usual sense Taubman: PCS’13, San Jose