
Low-Level Fusion of Audio and Video Feature for Multi-modal Emotion Recognition



Presentation Transcript


  1. Low-Level Fusion of Audio and Video Feature for Multi-modal Emotion Recognition
  Chair for Image Understanding and Knowledge-based Systems, Institute for Informatics, Technische Universität München
  Sylvia Pietzsch (sylvia.pietzsch@cs.tum.edu)

  2. Overview
  • Video low-level descriptors
  • Model-based image interpretation
  • Structural features
  • Temporal features
  • Audio low-level descriptors
  • Combining video and audio descriptors
  • Experimental results
  • Conclusion and outlook

  3. Model-based Image Interpretation
  • The model: contains a parameter vector that represents the model's configuration.
  • The objective function: calculates a value that indicates how accurately a parameterized model matches an image.
  • The fitting algorithm: searches for the model parameters that describe the image best, i.e. it minimizes the objective function.
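The model/objective/fitting split on this slide can be sketched in a few lines. The objective below is a hypothetical quadratic placeholder (the `target` vector is invented), not the real image-matching cost from the talk, and the derivative-free coordinate search only stands in for whatever fitting algorithm the authors use:

```python
import numpy as np

def objective(image, params):
    # Hypothetical stand-in: in the talk, this would measure how well the
    # parameterized face model matches the image; here a quadratic with a
    # known minimum serves as a placeholder (target is made up).
    target = np.array([0.3, -1.2])
    return float(np.sum((params - target) ** 2))

def fit(image, params0, step=0.5, iters=100):
    """Derivative-free fitting: greedily move each parameter in whichever
    direction lowers the objective, halving the step once nothing helps."""
    params = np.asarray(params0, dtype=float)
    best = objective(image, params)
    for _ in range(iters):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                cand = params.copy()
                cand[i] += delta
                val = objective(image, cand)
                if val < best:
                    params, best, improved = cand, val, True
        if not improved:
            step *= 0.5  # refine the search resolution
    return params, best

params, score = fit(image=None, params0=[0.0, 0.0])
```

The point of the sketch is the division of labor: the model is just the parameter vector, the objective scores a parameterization against the image, and the fitting algorithm is any minimizer of that score.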

  4. Local Objective Functions

  5. Ideal Objective Functions
  • P1 (correctness property): the global minimum corresponds to the best fit.
  • P2 (uni-modality property): the objective function has no local extrema.
  [Example plots: objective functions violating and satisfying P1 and P2]
  • Ideal objective functions don't exist for real-world images
  • Only for annotated images: fn(I, x) = |cn - x|
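The ideal objective fn(I, x) = |cn - x| and its two properties can be checked directly in a one-dimensional toy setting (the annotation position 42 is an arbitrary example value):

```python
# Ideal objective for an annotated image: distance of a candidate
# position x from the annotated feature point c_n (slide notation).
def f_ideal(c_n, x):
    return abs(c_n - x)

c_n = 42  # hypothetical annotated position on a 100-pixel scanline
values = [f_ideal(c_n, x) for x in range(100)]

# P1 (correctness): the global minimum sits exactly at the annotation.
best_x = min(range(100), key=lambda x: f_ideal(c_n, x))

# P2 (uni-modality): the function strictly decreases up to c_n and
# strictly increases after it, so a local search cannot get stuck.
decreasing = all(values[i] > values[i + 1] for i in range(c_n))
increasing = all(values[i] < values[i + 1] for i in range(c_n, 99))
```

This is exactly why such a function only exists for annotated images: computing |cn - x| requires already knowing the true position cn.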

  6. Learning the Objective Function
  • The ideal objective function generates training data
  • A machine learning technique generates the calculation rules
  [Figure: training samples scattered around the annotated feature point]
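The two bullets above can be sketched end to end in one dimension. Everything here is a simplified stand-in: the "image" is a synthetic step edge, the features are raw pixel windows, and plain least squares replaces whatever learning technique the authors actually use. The ideal objective |cn - x| supplies the training targets, and the learned function should then locate the edge in a new image:

```python
import numpy as np

def windows(image, xs, r=3):
    # Feature vector at position x: raw pixels in a window around x, plus a bias.
    return np.array([np.r_[image[x - r:x + r + 1], 1.0] for x in xs])

# Synthetic 1-D "image" with a step edge at the annotated position c_n = 30.
train_img = np.zeros(100); train_img[30:] = 1.0
xs = np.arange(5, 95)

# The ideal objective generates the training targets: f(I, x) = |c_n - x|.
targets = np.abs(30 - xs).astype(float)

# "Machine learning technique": plain least squares stands in for the
# learned calculation rules mapping local image features to cost values.
w, *_ = np.linalg.lstsq(windows(train_img, xs), targets, rcond=None)

# The learned objective generalizes: on a new image whose edge sits at
# x = 60, its minimum should move to the new edge position.
test_img = np.zeros(100); test_img[60:] = 1.0
preds = windows(test_img, xs) @ w
x_min = int(xs[np.argmin(preds)])
```

The key idea survives the simplification: once the annotation has produced (feature, ideal-cost) pairs, the learned function can be evaluated on unannotated images.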

  7. Skin Color Extraction
  • Location of contour lines and skin-colored parts
  • Adaptive to image context conditions
  [Images: original image, fixed classifier result, adapted classifier result]
  Correctly detected pixels (three test images):
  • fixed classifier: 90.4%, 74.8%, 40.2%
  • adapted classifier: 97.5%, 87.5%, 97.0%
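The fixed-versus-adapted contrast can be illustrated with a toy chromaticity map. The specifics are assumptions: skin is modeled as a band of normalized red chromaticity, the "tinted" illumination simply shifts that band, and the seed patch for adaptation is taken to come from the fitted face model; none of this is the authors' actual classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
skin_mask = np.zeros((n, n), dtype=bool)
skin_mask[:, : n // 2] = True  # left half of the synthetic frame is skin

def make_r_map(skin_r, bg_r=0.25):
    # Normalized red chromaticity r = R/(R+G+B) per pixel, with noise.
    return np.where(skin_mask, skin_r, bg_r) + rng.normal(0, 0.02, (n, n))

def accuracy(pred):
    return float((pred == skin_mask).mean())

# Fixed classifier: thresholds tuned once, under neutral lighting (r ~ 0.45).
fixed = lambda r: (r > 0.40) & (r < 0.50)

# Adapted classifier: re-estimate the skin distribution from a seed patch
# (assumed to come from the fitted face model) in the current image.
def adapted(r, seed=(slice(0, 8), slice(0, 8))):
    mu, sd = r[seed].mean(), r[seed].std()
    return (r > mu - 3 * sd) & (r < mu + 3 * sd)

neutral = make_r_map(0.45)   # neutral illumination
tinted = make_r_map(0.60)    # reddish illumination shifts skin chromaticity

acc_fixed, acc_adapted = accuracy(fixed(tinted)), accuracy(adapted(tinted))
```

The fixed thresholds collapse once the illumination shifts the skin distribution, while re-estimating the thresholds per image keeps the accuracy high, mirroring the 40.2% vs. 97.0% gap on the slide.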

  8. Structural Features
  • Deformation parameters describe a distinctive state of the face.
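A minimal sketch of how deformation parameters describe a facial state, assuming a point-distribution-style shape model (shape = mean shape + weighted deformation modes); the four "mouth" landmarks and the single opening mode are invented for illustration:

```python
import numpy as np

# Toy shape model: 4 landmark points (x, y) around a mouth.
mean_shape = np.array([[-1.0, 0.0], [0.0, 0.5], [1.0, 0.0], [0.0, -0.5]])

# One deformation mode: moves the top/bottom points apart (mouth opening).
mode = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, -1.0]])

def synthesize(b):
    # Shape = mean + b * mode (a single deformation parameter here).
    return mean_shape + b * mode

def extract_b(observed):
    # Structural feature: project the observed deformation onto the mode.
    diff = (observed - mean_shape).ravel()
    m = mode.ravel()
    return float(diff @ m / (m @ m))

b_open = extract_b(synthesize(0.7))
```

Under this reading, the recovered parameter b is itself the structural feature fed to the classifier: a single number that summarizes "how open the mouth is" in the fitted model.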

  9. Temporal Features
  • Facial expressions emerge from muscle activity.
  • Optical flow vectors are calculated at equally distributed feature points connected to the shape model.
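The flow computation at a single feature point can be sketched with a basic Lucas-Kanade step (the slide does not say which optical flow method is used, so this is only one plausible choice). Two synthetic frames with a Gaussian blob shifted one pixel to the right stand in for consecutive video frames:

```python
import numpy as np

def gaussian(n, cx, cy, sigma=5.0):
    y, x = np.mgrid[0:n, 0:n]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

# Two consecutive frames: the blob moves one pixel to the right,
# standing in for skin displaced by muscle activity between frames.
frame0 = gaussian(40, 20.0, 20.0)
frame1 = gaussian(40, 21.0, 20.0)

# Lucas-Kanade estimate of the flow vector in a window around the
# feature point: solve [Ix Iy] v = -It in the least-squares sense.
Iy, Ix = np.gradient(frame0)          # gradients along rows (y) and cols (x)
It = frame1 - frame0
win = (slice(12, 29), slice(12, 29))  # window centered on the feature point
A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
b = -It[win].ravel()
(vx, vy), *_ = np.linalg.lstsq(A, b, rcond=None)
```

Repeating this at every model-connected feature point yields the per-frame flow vectors whose time series form the temporal features.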

  10. Audio Low-level Descriptors
  • Aiming at independence of phonetic content and speaker
  • Coverage of prosodic, articulatory, and voice quality aspects
  • 20 ms frames, 50% overlap, Hamming window function
  • Zero crossing rate (ZCR)
  • Pitch
  • 7 formants
  • Energy
  • Spectral development
  • Harmonics-to-noise ratio (HNR)
  • Durations of voiced sounds by HNR
  • Durations of silences by bi-state energy
  • SMA (simple moving average) filtering of LLDs
  • Addition of 1st- and 2nd-order LLD regression coefficients
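The framing scheme and two of the simpler descriptors (ZCR and energy) can be sketched directly; pitch, formants, and HNR need proper signal-processing machinery and are omitted. The 100 Hz test tone is an arbitrary example signal:

```python
import numpy as np

def frame_llds(signal, sr, frame_ms=20, overlap=0.5):
    """Per-frame low-level descriptors, zero crossing rate and energy,
    computed on Hamming-windowed 20 ms frames with 50% overlap."""
    n = int(sr * frame_ms / 1000)          # samples per frame
    hop = int(n * (1 - overlap))           # 50% overlap -> half-frame hop
    window = np.hamming(n)
    zcr, energy = [], []
    for start in range(0, len(signal) - n + 1, hop):
        frame = signal[start:start + n]
        # ZCR: fraction of consecutive sample pairs that change sign.
        zcr.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        # Energy of the Hamming-windowed frame.
        energy.append(np.sum((frame * window) ** 2))
    return np.array(zcr), np.array(energy)

sr = 16000
t = np.arange(sr) / sr                      # one second of audio
tone = np.sin(2 * np.pi * 100 * t)          # 100 Hz test tone
zcr, energy = frame_llds(tone, sr)
```

A 100 Hz tone crosses zero 200 times per second, so the per-sample ZCR should come out near 200/16000 = 0.0125 in every frame, which makes the descriptor easy to sanity-check.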

  11. Combining Audio and Video LLDs
  • Time series constructed for LLDs (audio and video separately)
  • Application of functionals to the combined low-level descriptors:
  • Linear moments (mean, standard deviation)
  • Quartiles
  • Durations
  • Resulting feature vector: 276 audio features, 1048 video features
  • Classification by SVM
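The functional-application step can be sketched as follows. The specific functionals (mean, standard deviation, quartiles, a simple above-mean duration) follow the slide; the delta-coefficient handling via `np.gradient` and the random stand-in series are assumptions, and the toy dimensions do not reproduce the 276/1048 counts:

```python
import numpy as np

def functionals(series):
    # Map a variable-length LLD time series to a fixed-length vector:
    # linear moments, quartiles, and a simple duration-style functional.
    q1, q2, q3 = np.percentile(series, [25, 50, 75])
    above = series > series.mean()          # fraction of time above mean
    return [series.mean(), series.std(), q1, q2, q3, float(above.mean())]

def feature_vector(audio_llds, video_llds):
    # Low-level fusion: apply the functionals over the combined audio and
    # video descriptor series and concatenate into one vector for the SVM.
    vec = []
    for series in list(audio_llds) + list(video_llds):
        vec.extend(functionals(series))
        vec.extend(functionals(np.gradient(series)))  # 1st-order deltas
    return np.array(vec)

rng = np.random.default_rng(1)
audio = rng.normal(size=(3, 50))   # e.g. pitch, energy, ZCR over time
video = rng.normal(size=(2, 50))   # e.g. two deformation parameters
vec = feature_vector(audio, video)
```

Because the functionals collapse every time series to the same fixed length, audio and video descriptors of different frame rates end up in one vector, which is what makes a single SVM over the fused features possible.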

  12. Experimental Results (1)
  • Database: Airplane Behavior Corpus
  • Guided storyline
  • 8 subjects (25 to 48 years old)
  • 11.5 hours of video in total
  • 10-fold stratified cross validation
  • Feature pre-selection by SVM-SFFS (sequential forward floating search)
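Stratified fold assignment, as used in the evaluation above, can be sketched in plain Python; the three emotion labels and the 60/30/10 imbalance are hypothetical, chosen only to make the stratification visible:

```python
from collections import Counter

def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds so that every fold
    keeps roughly the overall class proportions (stratified CV)."""
    folds = [[] for _ in range(k)]
    per_class = {}
    for idx, y in enumerate(labels):
        per_class.setdefault(y, []).append(idx)
    for members in per_class.values():
        for j, idx in enumerate(members):
            folds[j % k].append(idx)  # deal class members round-robin
    return folds

# Hypothetical label stream: 3 classes with a 60/30/10 imbalance.
labels = ["neutral"] * 60 + ["cheerful"] * 30 + ["aggressive"] * 10
folds = stratified_folds(labels, k=10)
counts = [Counter(labels[i] for i in fold) for fold in folds]
```

Stratification matters here because emotion corpora are heavily imbalanced: without it, a rare class like aggressive behavior could vanish from some test folds entirely.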

  13. Experimental Results (2)
  • Main confusions:
  • neutral vs. nervous
  • cheerful vs. intoxicated
  • Aggressive behavior recognized best

  14. Conclusion and Outlook
  • Combined feature set superior to the individual audio or video feature sets
  • Future work:
  • Investigation on further data sets
  • Comparison to late fusion approaches
  • Performance of asynchronous feature fusion
  • Application of hierarchical functionals

  15. Thank you!
