34 “Pixels that Sound”
Find pixels that correspond (correlate!?) to sound
Kidron, Schechner, Elad, CVPR 2005
47 Audio-Visual Analysis: Applications
• Lip reading – detection of lips (or person)
  • Slaney, Covell (2000)
  • Bregler, Konig (1994)
• Analysis and synthesis of music from motion
  • Murphy, Andersen, Jensen (2003)
• Source separation based on vision
  • Li, Dimitrova, Li, Sethi (2003)
  • Smaragdis, Casey (2003)
  • Nock, Iyengar, Neti (2002)
  • Fisher, Darrell, Freeman, Viola (2001)
  • Hershey, Movellan (1999)
• Tracking
  • Vermaak, Gangnet, Blake, Pérez (2001)
• Biological systems
  • Gutfreund, Zheng, Knudsen (2002)
47 Problem: Different Modalities (audio-visual analysis with a microphone and a camera)
• Audio data: 44.1 kHz, few bands, not stereophonic
• Visual data: 25 frames/sec, each frame 576 x 720 pixels
Kidron, Schechner, Elad, Pixels that Sound
54 Previous Work
• Pointwise correlation
  • Nock, Iyengar, Neti (2002)
  • Hershey, Movellan (1999)
• Mutual Information (MI): highly complex
  • Fisher et al. (2001)
  • Cutler, Davis (2000)
  • Bregler, Konig (1994)
• Canonical Correlation Analysis (CCA): ill-posed (lack of data)
  • Smaragdis, Casey (2003)
  • Li, Dimitrova, Li, Sethi (2003)
  • Slaney, Covell (2000)
• Not typical
• Cluster of pixels: linear superposition
49 CCA
[Figure: the video (Pixel #1, Pixel #2, Pixel #3) and the audio (Band #1, Band #2) each pass through an optimal projection; the video side is labeled "optimal visual components".]
40 Visual Projection: video features v, projected onto a 1D variable
• Video features
  • Pixel intensities
  • Transform coefficients (wavelet)
  • Image differences
[Figure: a column of example feature values reduced by the projection to a single number.]
41 Audio Projection: audio features a, projected onto a 1D variable
• Audio features
  • Average energy per frame
  • Transform coefficients per frame
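A minimal sketch of the "average energy per frame" feature, assuming the 44.1 kHz audio and 25 frames/sec video rates quoted earlier in the deck. The function name `frame_energy` and the synthetic tone are illustrative assumptions, not the authors' code.

```python
import numpy as np

def frame_energy(audio, sample_rate=44100, fps=25):
    """Average energy of the audio samples that fall within each video frame."""
    samples_per_frame = sample_rate // fps            # 44100 // 25 = 1764
    n_frames = len(audio) // samples_per_frame        # drop the incomplete tail
    trimmed = np.asarray(audio[: n_frames * samples_per_frame], dtype=float)
    blocks = trimmed.reshape(n_frames, samples_per_frame)
    return (blocks ** 2).mean(axis=1)                 # one value per video frame

# Synthetic example (assumption): one second of a 440 Hz tone, silent in its first half.
t = np.arange(44100) / 44100.0
audio = np.sin(2 * np.pi * 440 * t) * (t >= 0.5)
energy = frame_energy(audio)                          # 25 values, one per frame
```

Each entry of `energy` lines up with one video frame, so the audio and the video share a common time axis.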
42 Canonical Correlation
• Representation: audio and video as random variables (time dependent)
• Projections (per time window)
• Correlation coefficient
43 CCA Formulation
• The canonical correlation is equivalent to the largest eigenvalue
• The projections are the corresponding eigenvectors
• Yields an eigenvalue problem: Knutsson, Borga, Landelius (1995)
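The eigenvalue formulation can be sketched with the textbook form of CCA. The matrix names `C_vv`, `C_aa`, `C_va` and the function `cca_first_pair` are assumptions for illustration, and the sketch only works when both covariance matrices are invertible (many more frames than features).

```python
import numpy as np

def cca_first_pair(V, A):
    """First canonical pair for V (frames x visual features) and A (frames x audio features)."""
    V = V - V.mean(axis=0)
    A = A - A.mean(axis=0)
    n = V.shape[0]
    C_vv = V.T @ V / (n - 1)
    C_aa = A.T @ A / (n - 1)
    C_va = V.T @ A / (n - 1)
    # Eigenproblem: C_vv^{-1} C_va C_aa^{-1} C_av w_v = rho^2 w_v
    M = np.linalg.solve(C_vv, C_va) @ np.linalg.solve(C_aa, C_va.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(eigvals.real))                 # largest eigenvalue = rho^2
    w_v = eigvecs[:, k].real
    w_a = np.linalg.solve(C_aa, C_va.T @ w_v)        # matching audio weights, up to scale
    rho = np.corrcoef(V @ w_v, A @ w_a)[0, 1]
    return w_v, w_a, rho

# Synthetic data (assumption): visual feature 1 and audio band 0 share a hidden signal s.
rng = np.random.default_rng(0)
n = 500
s = rng.standard_normal(n)
V = rng.standard_normal((n, 3))
V[:, 1] = s + 0.1 * rng.standard_normal(n)
A = rng.standard_normal((n, 2))
A[:, 0] = s + 0.1 * rng.standard_normal(n)
w_v, w_a, rho = cca_first_pair(V, A)
```

In this toy setup the largest weights land on the feature and the band that share the hidden signal.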
51 Visual Data
[Figure: visual data arranged as spatial location (pixel intensities) against t (frames).]
44 Rank Deficiency
[Figure: the same data matrix, spatial location (pixel intensities) against t (frames).]
45 Estimation of Covariance: rank deficient
46 Ill-Posedness: impossible to invert!
• Prior solutions:
  • Use many more frames → poor temporal resolution
  • Aggressive spatial pruning → poor spatial resolution
  • Trivial regularization
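A small numerical sketch of the rank deficiency, with assumed sizes (30 frames, 200 pixels) and random data standing in for video. The last step reads "trivial regularization" as adding a scaled identity matrix, which is an interpretation rather than something the slide spells out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 200                  # far fewer frames than pixels
V = rng.standard_normal((n_frames, n_pixels))
V -= V.mean(axis=0)                           # remove the temporal mean of each pixel

C_vv = V.T @ V / (n_frames - 1)               # 200 x 200 empirical covariance
rank = np.linalg.matrix_rank(C_vv)            # cannot exceed n_frames - 1

# Assumed reading of "trivial regularization": diagonal loading.
eps = 1e-3
C_reg = C_vv + eps * np.eye(n_pixels)
rank_reg = np.linalg.matrix_rank(C_reg)       # full rank, hence invertible

print(rank, rank_reg)
```

The regularized matrix can be inverted, but the result then depends on the hand-picked value of `eps`.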
47 A General Problem
• Large number of weights
• Small amount of data
• The problem is ILL-POSED: overfitting is likely
48 An Equivalent Problem: maximizing the correlation is equivalent to minimizing an objective function G
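A sketch of why a minimization can stand in for the maximization. It assumes notation that is not on the slide: zero-mean projections $\mathbf{v} = V\mathbf{w}_v$ and $\mathbf{a} = A\mathbf{w}_a$, each scaled to unit norm, with $\rho$ their correlation coefficient.

```latex
\|\mathbf{v}-\mathbf{a}\|_2^2
  = \|\mathbf{v}\|_2^2 + \|\mathbf{a}\|_2^2 - 2\,\mathbf{v}^{\top}\mathbf{a}
  = 2 - 2\rho ,
\qquad
\rho = \frac{\mathbf{v}^{\top}\mathbf{a}}{\|\mathbf{v}\|_2\,\|\mathbf{a}\|_2}.
```

Under that normalization the distance shrinks exactly as $\rho$ grows, and it reaches zero only at full correlation ($\rho = 1$). The objective used in the deck may be normalized differently.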
49 Single Audio Band
• A has a single column (known data)
• Minimizing (the denominator is non-zero)
52 Full correlation if $V\mathbf{w} = \mathbf{a}$, where $\mathbf{a} = [a(1), a(2), \ldots, a(t_i), \ldots, a(30)]^{\top}$ is the audio over time
Underdetermined system!
52 “Out of clutter, find simplicity. From discord, find harmony.” Albert Einstein
[Figure: detected correlated pixels]
53 Sparse Solution: minimum $\ell_0$-norm
• Non-convex
• Exponential complexity
54 The $\ell_1$-norm criterion: minimum $\ell_1$-norm
• Sparse
• Convex
• Polynomial complexity
• In common situations: Donoho, Elad (2005)
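A minimal sketch of the minimum $\ell_1$-norm solution of an underdetermined system $V\mathbf{w} = \mathbf{a}$, posed as a linear program with SciPy's `linprog`. The sizes (30 frames, 100 pixels), the random `V`, and the three "sounding" pixels are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Solve min ||w||_1 subject to V w = a, using the split w = p - q with p, q >= 0."""
    n = V.shape[1]
    c = np.ones(2 * n)                        # sum(p) + sum(q) equals ||w||_1 at the optimum
    A_eq = np.hstack([V, -V])                 # V (p - q) = a
    res = linprog(c, A_eq=A_eq, b_eq=a, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 100                  # underdetermined: 30 equations, 100 unknowns
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[7, 42, 77]] = [1.5, -2.0, 1.0]        # only three pixels drive the audio
a = V @ w_true

w_l1 = min_l1_solution(V, a)
n_active = int(np.sum(np.abs(w_l1) > 1e-6))   # number of pixels with non-negligible weight
```

In this toy case the linear program concentrates the weights on the three pixels that generated the audio.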
55 The Minimum Norm Solution: minimum $\ell_2$-norm
• Solving using the $\ell_2$-norm (pseudo-inverse, SVD, QR)
• Energy spread
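A minimal sketch of the pseudo-inverse route, on assumed synthetic data (30 frames, 100 pixels, three pixels driving the audio). It illustrates the energy spread: the minimum $\ell_2$-norm solution fits the audio exactly but distributes weight over nearly all pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 100
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[7, 42, 77]] = [1.5, -2.0, 1.0]        # sparse ground truth (assumption)
a = V @ w_true

w_l2 = np.linalg.pinv(V) @ a                  # minimum l2-norm solution of V w = a
n_significant = int(np.sum(np.abs(w_l2) > 1e-3))

print(n_significant, np.linalg.norm(w_l2), np.linalg.norm(w_true))
```

The solution has a smaller Euclidean norm than the sparse ground truth, which is why the $\ell_2$ criterion prefers it even though it does not localize anything.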
56 Audio-visual events
• Maximum correlation: eigenproblem
• Minimum objective function G
• Linear programming
• Fully correlated
• Sparse
• Polynomial
• No parameters to tweak
57 Multiple Audio Bands - Solution
• The optimization problem: non-convex constraint
• $\ell_1$-ball
  • Convex
  • Linear
58 Multiple Audio Bands
• Optimization over each face (S1, S2, S3, S4)
• Each face: linear programming
• No parameters to tweak
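A hedged sketch of the face-by-face idea. It assumes the problem is: minimize $\|\mathbf{w}_v\|_1$ subject to $V\mathbf{w}_v = A\mathbf{w}_a$ and $\|\mathbf{w}_a\|_1 = 1$; the deck's exact objective is not preserved, so this formulation is an assumption. Fixing the signs of $\mathbf{w}_a$ selects one face of the $\ell_1$-ball, where the constraint becomes linear, so each face is a linear program. The function name `solve_over_faces` and the data are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_over_faces(V, A):
    """Solve one LP per sign pattern (face) of w_a and keep the cheapest solution."""
    n_frames, n_v = V.shape
    n_a = A.shape[1]
    # Variables: p, q >= 0 with w_v = p - q, and u >= 0 with w_a = s * u.
    c = np.concatenate([np.ones(2 * n_v), np.zeros(n_a)])
    b_eq = np.concatenate([np.zeros(n_frames), [1.0]])
    best = None
    for signs in itertools.product([1.0, -1.0], repeat=n_a):   # 2**n_a faces
        s = np.array(signs)
        top = np.hstack([V, -V, -A * s])                       # V (p - q) - A (s * u) = 0
        bottom = np.concatenate([np.zeros(2 * n_v), np.ones(n_a)])[None, :]  # sum(u) = 1
        res = linprog(c, A_eq=np.vstack([top, bottom]), b_eq=b_eq,
                      bounds=(0, None), method="highs")
        if res.success and (best is None or res.fun < best[0]):
            w_v = res.x[:n_v] - res.x[n_v:2 * n_v]
            w_a = s * res.x[2 * n_v:]
            best = (res.fun, w_v, w_a)
    if best is None:
        raise RuntimeError("no face produced a feasible solution")
    return best

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 100
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[7, 42, 77]] = [1.5, -2.0, 1.0]
# Two audio bands (assumption): one driven by three pixels, one unrelated.
A = np.column_stack([V @ w_true, rng.standard_normal(n_frames)])
cost, w_v, w_a = solve_over_faces(V, A)
```

With two audio bands there are four sign patterns, matching the four faces S1 to S4. The number of faces doubles with every added band, so this enumeration only suits a few bands.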
Sharp & Dynamic, Despite Distraction
[Figure: results at Frame 9, Frame 42, Frame 68, Frame 115, Frame 146, Frame 169.]
Performing in Audio Noise
• Sparse
• Localization on the proper elements
• False alarm – temporally inconsistent
• Handling dynamics
[Figure: results at Frame 51, Frame 106, Frame 83, Frame 177.]
56 $\ell_2$-norm: Energy Spread
[Figure: Movie #1, Frame 146; Movie #2, Frame 83.]
57 $\ell_1$-norm: Localization
[Figure: Movie #1, Frame 146; Movie #2, Frame 83.]
The “Chorus Ambiguity”
Synchronized talk: who’s talking?
• Possible solutions:
  • Left
  • Right
  • Both
Not unique (ambiguous)
The “Chorus Ambiguity”
[Figure: two plots over the axes feature 1 and feature 2, one per norm; the solution “Both” is marked.]