Audiovisual Event Detection & Recognition
• Audiovisual speech recognition
• Manifold Discriminant Features
• Fusion using boosted combination of DBNs
• TRECVid and PASCAL Competitions, 2009
• GMM supervector normalizes inter-session variability
• Sparse coding to model manifold of low-level features
• Non-speech audiovisual event detection
• Over-generate features, select, tandem NN+HMM, and compensate variability using a GMM supervector
Lip Rectangle Dimensionality Reduction using Local Discriminant Graph
• Maximize the local inter-manifold interpolation error, subject to a constant same-class interpolation error:
Find $P$ to maximize $D_D = \sum_i \| P^T (x_i - \sum_k c_k y_k) \|^2$, with $y_k \in \mathrm{KNN}(x_i)$ drawn from other classes,
subject to $D_S = \text{constant}$, where $D_S = \sum_i \| P^T (x_i - \sum_j c_j x_j) \|^2$, with $x_j \in \mathrm{KNN}(x_i)$ drawn from the same class.
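The objective above can be sketched in NumPy as follows. This is a minimal illustration under assumptions the slide does not state: the interpolation coefficients are computed LLE-style (least squares, constrained to sum to one), the constrained maximization is relaxed to a generalized eigenproblem, and the names `fit_ldg`, `knn`, and `interpolation_residual` are hypothetical.

```python
import numpy as np

def interpolation_residual(x, neighbors, reg=1e-3):
    """Return x - sum_k c_k n_k, with least-squares weights c that sum to one."""
    diffs = neighbors - x
    gram = diffs @ diffs.T
    # Small ridge term keeps the local Gram matrix invertible (assumption).
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(len(neighbors))
    c = np.linalg.solve(gram, np.ones(len(neighbors)))
    c = c / c.sum()
    return x - c @ neighbors

def knn(x, candidates, k):
    """Return the k candidates closest to x in Euclidean distance."""
    dist = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argsort(dist)[:k]]

def fit_ldg(X, y, n_components, k=3, eps=1e-6):
    """Find P (d x n_components) that favors a large other-class interpolation
    error D_D relative to the same-class interpolation error D_S."""
    d = X.shape[1]
    S_same = np.zeros((d, d))
    S_diff = np.zeros((d, d))
    for i in range(len(X)):
        same = np.delete(X, i, axis=0)[np.delete(y, i) == y[i]]
        other = X[y != y[i]]
        r_s = interpolation_residual(X[i], knn(X[i], same, k))
        r_d = interpolation_residual(X[i], knn(X[i], other, k))
        S_same += np.outer(r_s, r_s)   # D_S = trace(P^T S_same P)
        S_diff += np.outer(r_d, r_d)   # D_D = trace(P^T S_diff P)
    # Solve S_diff p = lambda (S_same + eps I) p by whitening with a Cholesky factor.
    L = np.linalg.cholesky(S_same + eps * np.eye(d))
    L_inv = np.linalg.inv(L)
    evals, evecs = np.linalg.eigh(L_inv @ S_diff @ L_inv.T)
    top = evecs[:, ::-1][:, :n_components]   # largest eigenvalues first
    return L_inv.T @ top
```

Features would then be projected as `X @ P`. The eigenproblem relaxation is one common way to handle a trace-ratio criterion of this kind; the original system may have solved it differently.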
Lip Reading Results (Digits)
DCT = discrete cosine transform; PCA = principal components analysis; LDA = linear discriminant analysis; LEA = local eigenvector analysis; LDG = local discriminant graph
Audiovisual Speech Recognition: Word Error Rate (Connected Digits)
TREC VIDEO RETRIEVAL EVALUATION and PASCAL VISUAL OBJECT CLASSES CHALLENGE 2009
TRECVID: NIST competition on text and video retrieval
Task: surveillance video classification
PASCAL: Pattern Analysis, Statistical Modelling and Computational Learning
Task: predict whether at least one object of a given class is present in the image. 20 classes are selected, including person, animals, vehicles, and indoor objects.
Variability Compensation using WCCN (within-class covariance normalization)
• Treat the negative log likelihoods, $Z_j = -\log p(x|j)$, as a high-dimensional pseudo feature vector, called the “supervector”
• Z-normalize the supervector to reduce the effect of irrelevant variability, using a robust regularized covariance matrix: $\hat{\Sigma} = \gamma \Sigma + (1-\gamma) I$
• Z-normalization results in better linear separability
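A minimal NumPy sketch of this normalization follows. It assumes the supervectors are already stacked as rows of a matrix `Z` with one class label per row, that the covariance being regularized is the average within-class covariance, and that normalization means whitening with the inverse of the regularized matrix. The function names and the default `gamma` are hypothetical.

```python
import numpy as np

def within_class_cov(Z, labels):
    """Average of the per-class covariance matrices of the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    d = Z.shape[1]
    W = np.zeros((d, d))
    classes = np.unique(labels)
    for c in classes:
        Zc = Z[labels == c]
        centered = Zc - Zc.mean(axis=0)
        W += centered.T @ centered / len(Zc)
    return W / len(classes)

def fit_wccn(Z, labels, gamma=0.9):
    """Return B with B @ B.T equal to the inverse of gamma*W + (1-gamma)*I."""
    W = within_class_cov(Z, labels)
    d = W.shape[0]
    W_reg = gamma * W + (1.0 - gamma) * np.eye(d)   # regularized covariance
    return np.linalg.cholesky(np.linalg.inv(W_reg))

def apply_wccn(Z, B):
    """Map each supervector z (a row of Z) to B.T @ z."""
    return np.asarray(Z, dtype=float) @ B
```

With `gamma=1.0` the transformed training data has identity within-class covariance; smaller values shrink the estimate toward the identity, which keeps it invertible when the supervector dimension is large relative to the number of training examples.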
RESULTS
Our methods:
(1) Gaussian mixture models (GMMs) model the distribution of patches in the image
(2) Local sparse coding models the manifold of image patches
(3) (1) + (2) combined at the kernel level
TRECVid: Illinois/NEC team ranked #1 out of 16 teams in the TRECVid 2009 surveillance video task
PASCAL: Illinois/NEC team ranked #1 in the classification task, out of 48 entered methods from 20 groups worldwide.
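One way to combine two systems at the kernel level is to mix their Gram matrices before training a kernel classifier. The sketch below is an assumption about what that step could look like, not the team's documented procedure: the cosine normalization, the single mixing `weight`, and the names `normalize_kernel` and `combine_kernels` are all hypothetical.

```python
import numpy as np

def normalize_kernel(K):
    """Cosine-normalize a Gram matrix so every diagonal entry equals one."""
    K = np.asarray(K, dtype=float)
    diag = np.sqrt(np.diag(K))
    return K / np.outer(diag, diag)

def combine_kernels(K_gmm, K_sparse, weight=0.5):
    """Convex combination of two normalized Gram matrices.

    A convex combination of positive semidefinite kernels is itself
    positive semidefinite, so the result is still a valid kernel.
    """
    return weight * normalize_kernel(K_gmm) + (1.0 - weight) * normalize_kernel(K_sparse)
```

The combined matrix can be passed to any classifier that accepts a precomputed kernel; in practice the weight would be tuned on held-out data.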
Acoustic Event Detection (AED): Why is it Hard?
DIFFICULTIES
- Unknown spectral structure
- Different spectral structure for each event
- Low SNR (speech as background noise)
AED: Solution
System Overview
Result: Illinois team ranked #1 out of 6 teams in CLEAR AED 2007
Current Research: Audiovisual Fusion
• (feature selection) + (ANN) + (HMM) + (supervector compensation)
• Likelihood-space fusion of audio and video features
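Likelihood-space fusion is often implemented as a weighted sum of per-class log likelihoods from the two streams. The sketch below assumes each stream's model has already produced one log likelihood per event class. The stream weight, the event labels, and the function names are hypothetical, chosen only for illustration.

```python
def fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.7):
    """Weighted sum of audio and video log likelihoods for each class."""
    video_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + video_weight * video_ll[label]
        for label in audio_ll
    }

def classify(audio_ll, video_ll, audio_weight=0.7):
    """Pick the class with the highest fused log likelihood."""
    fused = fuse_log_likelihoods(audio_ll, video_ll, audio_weight)
    return max(fused, key=fused.get)

# Illustrative scores only: audio slightly prefers one event, video strongly prefers the other.
audio = {"door_slam": -10.0, "steps": -12.0}
video = {"door_slam": -20.0, "steps": -11.0}
decision = classify(audio, video)
```

Setting `audio_weight=1.0` reduces this to an audio-only decision, which makes the weight a convenient knob for checking how much the video stream contributes.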