
Multimodal Integration for Meeting Action Segmentation

Explore how meetings can be structured, modeled, and segmented using audio-visual features, multi-layer HMMs, and DBNs for noise-robust recognition. Learn about action lexicons and feature extraction for accurate segmentation and classification.

Presentation Transcript


1. Overview
M. Al-Hames, A. Dielmann, D. Gatica-Perez, S. Reiter, S. Renals, G. Rigoll, and D. Zhang: Multimodal Integration for Meeting Group Action Segmentation and Recognition
• Introduction: structuring of meetings
• Audio-visual features
• Models and experimental results:
  • Segmentation using semantic features (BIC-like)
  • Multi-layer HMM
  • Multi-stream DBN
  • Noise robustness with a mixed-state DBN
• Summary

2. Structuring of Meetings
Meetings can be modelled as a sequence of events, group actions, or individual actions from a set A = {A1, A2, A3, …, AN}.
[Figure: a meeting timeline segmented into a sequence of actions, e.g. A1, A2, A3, A2, …, A6.]

3. Structuring of Meetings
Meetings can be modelled as a sequence of events, group actions, or individual actions from a set A = {A1, A2, A3, …, AN}. Different sets represent different views of the same meeting timeline, e.g.:
• Person 1: Idle → Writing → Speaking → Idle
• Group interest: Neutral → High → Low
• Discussion phase: Monologue → Presentation → Discussion
• Group task: Information sharing → Decision → Information sharing
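To make this representation concrete, here is a minimal Python sketch of a meeting view as a list of labelled time segments; the Segment class and the example labels are illustrative assumptions, not part of the original system.

```python
# Illustrative sketch only: a meeting view as a list of labelled time
# segments drawn from one action set A.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds from the start of the meeting
    end: float
    action: str   # label from the chosen action set A

# The same meeting described under two different views:
discussion_phase = [
    Segment(0, 60, "monologue"),
    Segment(60, 150, "presentation"),
    Segment(150, 300, "discussion"),
]
group_task = [
    Segment(0, 120, "information sharing"),
    Segment(120, 200, "decision"),
    Segment(200, 300, "information sharing"),
]
```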

4. Meeting Views
Group:
• Discussion phase: discussion, monologue (with ID), note-taking, presentation, whiteboard, monologue (ID) + note-taking, presentation + note-taking, whiteboard + note-taking
• Task: brainstorming, decision making, information sharing
• Interest level: high, neutral, low
Individual:
• Actions: speaking, writing, idle
• Interest level: high, neutral, low
Further sets could represent the current agenda item or the topic under discussion, but these may require higher-level semantic knowledge for automatic analysis.

5. Action Lexicon
Group discussion-phase actions: discussion, monologue (with ID), note-taking, presentation, whiteboard, monologue (ID) + note-taking, presentation + note-taking, whiteboard + note-taking.
• Action lexicon I: single actions only (8 action classes).
• Action lexicon II: single actions and combinations of parallel actions (14 action classes).

6. M4 Meeting Corpus
• Recorded at IDIAP
• Three fixed cameras
• Lapel microphones
• Microphone array
• Four participants
• 59 videos, each 5 minutes long

7. System Overview
Sensor streams (cameras, lapel microphones, microphone array, …) feed multimodal feature extraction, which is followed by action segmentation and action classification.

8. Visual Features
Skin-colour blobs (GMM), for each person:
• Head: orientation and motion
• Hands: position, size, orientation, and motion
• Moving blobs from background subtraction
Global motion, for each position (see the sketch below):
• Centre of motion
• Changes in motion (dynamics)
• Mean absolute deviation
• Intensity of motion
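As a rough illustration of the global-motion features listed above, the following NumPy sketch computes a centre of motion, an intensity of motion, and a mean absolute deviation from grayscale frame differences; these definitions are plausible reconstructions from the feature names, not necessarily the system's exact ones.

```python
# Sketch of global-motion features from frame differences.
import numpy as np

def global_motion_features(frames):
    """frames: (T, H, W) array of grayscale images; returns (T-1, 4)."""
    ys, xs = np.mgrid[0:frames.shape[1], 0:frames.shape[2]]
    feats = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = np.abs(cur.astype(float) - prev.astype(float))
        total = diff.sum() + 1e-9
        # Centre of motion: intensity-weighted centroid of the difference image.
        cx, cy = (diff * xs).sum() / total, (diff * ys).sum() / total
        # Intensity of motion: mean absolute frame difference.
        intensity = diff.mean()
        # Mean absolute deviation of the motion around its centre.
        mad = (diff * (np.abs(xs - cx) + np.abs(ys - cy))).sum() / total
        feats.append((cx, cy, intensity, mad))
    # "Changes in motion" (dynamics) can then be taken as frame-to-frame deltas.
    return np.array(feats)

print(global_motion_features(np.random.rand(10, 48, 64)).shape)  # (9, 4)
```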

9. Audio Features
Lapel microphones, for each person and each speech segment (see the sketch below):
• Energy
• Pitch (SIFT algorithm)
• Speaking rate (combination of estimators)
• MFC coefficients
Microphone array, for each position (the four seats, whiteboard, and projector screen):
• SRP-PHAT measure to estimate speech activity
• Speech and silence segmentation
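A hedged sketch of two of the per-person features, assuming the librosa library; the file name is hypothetical, and the simple energy threshold only stands in for the SRP-PHAT-based speech/silence segmentation, which requires the multi-channel array geometry and is not reproduced here.

```python
# Two per-person audio features with librosa (illustrative only).
import librosa

y, sr = librosa.load("meeting_lapel_1.wav", sr=16000)  # hypothetical file

# Mel-frequency cepstral coefficients, one vector per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)     # shape (12, n_frames)

# Frame energy and a crude speech/silence mask by thresholding.
energy = librosa.feature.rms(y=y)[0]                   # shape (n_frames,)
speech = energy > 0.1 * energy.max()                   # True on speech frames
```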

10. Lexical Features
For each person:
• Speech transcription (or the output of an ASR system)
• Gesture transcription (or the output of an automatic gesture recognizer)
Gesture inventory: writing, pointing, standing up, sitting down, nodding, shaking head.

11. BIC-like Approach
• Strategy similar to the Bayesian information criterion (BIC); segmentation and classification are performed in one step (see the sketch below).
• Two windows of variable length are shifted over the time scale, with the inner border moving from left to right.
• Each window is classified: if the left and right windows yield different results, the inner border is taken as the boundary of a meeting event.
• If no boundary is detected, the whole window is enlarged.
• Repeat until the right border reaches the end of the meeting.
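The loop below is a minimal Python sketch of this two-window procedure; the classify callback, the initial window length, and the step size are assumptions for illustration, not the paper's exact settings.

```python
# Sketch of the two-window, BIC-like segmentation loop. `classify` is a
# hypothetical stand-in for any window classifier (e.g. a model
# likelihood comparison).

def segment_meeting(features, classify, init_len=50, step=10):
    """Return a list of (start, end, label) segments."""
    boundaries = []
    left = 0
    win = init_len                      # current total window length
    while left + win < len(features):
        found = False
        # Shift the inner border from left to right inside the window.
        for mid in range(left + step, left + win, step):
            label_l = classify(features[left:mid])
            label_r = classify(features[mid:left + win])
            if label_l != label_r:      # disagreement -> event boundary
                boundaries.append((left, mid, label_l))
                left, win, found = mid, init_len, True
                break
        if not found:
            win += step                 # no boundary: enlarge the window
    boundaries.append((left, len(features), classify(features[left:])))
    return boundaries

# Toy usage: label a window by the sign of its mean value.
import numpy as np
data = np.concatenate([np.full(100, -1.0), np.full(120, +1.0)])
print(segment_meeting(data, lambda w: int(np.mean(w) > 0)))
# [(0, 100, 0), (100, 220, 1)]
```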

12. BIC-like Approach: Results
Experimental setup: 57 meetings using action lexicon I.

13. Multi-layer Hidden Markov Model
Two-layer HMM:
• Individual action layer (I-HMM): speaking, writing, idle.
• Group action layer (G-HMM): discussion, monologue, …
• Each layer is trained independently.
• Simultaneous segmentation and recognition.
[Structure: each person's features feed an individual model (I-HMM 1–4); the I-HMM outputs, together with group features, feed the group HMM.]

14. Multi-layer Hidden Markov Model
• Smaller observation space per layer -> stable with limited training data.
• I-HMMs are person-independent -> much more training data is available from different persons.
• The G-HMM is less sensitive to small changes in the low-level features.
• The two layers are trained independently -> different HMM combinations can be explored (see the sketch below).
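A minimal sketch of the two-layer idea, assuming the hmmlearn package; the model sizes, the dummy features, and the use of I-HMM posteriors as G-HMM observations are illustrative assumptions rather than the paper's exact configuration.

```python
# Two-layer HMM sketch: person-independent I-HMM feeding a group G-HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
per_person_features = [rng.normal(size=(500, 6)) for _ in range(4)]  # dummy data
group_features = rng.normal(size=(500, 4))                           # dummy data

# Layer 1: one person-independent I-HMM (speaking, writing, idle),
# trained on frames pooled from all four participants.
i_hmm = GaussianHMM(n_components=3, covariance_type="diag")
i_hmm.fit(np.vstack(per_person_features), lengths=[500] * 4)

# Decode each participant and stack the per-person action posteriors,
# together with group-level features, as the G-HMM observations.
posteriors = [i_hmm.predict_proba(f) for f in per_person_features]   # 4 x (500, 3)
group_obs = np.hstack(posteriors + [group_features])                 # (500, 16)

# Layer 2: G-HMM over group actions (e.g. 8 classes for lexicon I);
# decoding yields a joint segmentation and labelling.
g_hmm = GaussianHMM(n_components=8, covariance_type="diag")
g_hmm.fit(group_obs)
group_action_sequence = g_hmm.predict(group_obs)
```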

15. Multi-layer Hidden Markov Model: Results
Experimental setup: 59 meetings using action lexicon II.

16. Multi-stream DBN Model
Counter structure:
• Improves state-duration modelling.
• The action counter C is incremented after each "action transition" (see the sketch below).
Hierarchical approach:
• State-space decomposition: two feature-related sub-action nodes (S1 and S2).
• Each sub-state corresponds to a cluster of feature vectors.
• Unsupervised training.
• Independent processing of each modality.
[DBN structure, unrolled over time t: counter nodes C, nodes E, action nodes A, sub-action nodes S1 and S2, and observation streams Y1 and Y2.]
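The counter idea can be illustrated in a few lines of Python: the generative state is an (action, counter) pair, and the counter is incremented exactly when the action node changes. The names and the transition rule shown here are assumptions for illustration only.

```python
# Illustrative counter-augmented state: (action A, counter C).

def step(state, next_action):
    """Advance the (action, counter) pair by one action-node transition."""
    action, counter = state
    if next_action != action:          # "action transition" -> count it
        counter += 1
    return (next_action, counter)

# Example: a discussion followed by a monologue and a presentation.
state = ("discussion", 0)
for a in ["discussion", "monologue", "monologue", "presentation"]:
    state = step(state, a)
print(state)                           # ('presentation', 2)
```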

17. Experimental Results
Experimental setup: cross-validation over 53 meetings using action lexicon I.

Feature setup           HMM    Multi-stream (6x6)   Multi-stream (7x7)   Mult. + Count. (6x6)
UEDIN (Turn. + Pros.)   48.4   17.1                 11.0                 18.9
IDIAP (Audio + Video)   61.6   26.7                 16.7                 24.9
TUM (Beamf. + Video)    92.9   21.4                 21.4                 21.4

18. Mixed-State DBN
• Couple a multi-stream HMM with a linear dynamical system (LDS).
• The HMM is the driving input for the LDS.
• With the HMM as driving input, disturbances can be compensated.
• The coupled system can be described as a DBN (see the sketch below).
[Coupled system, unrolled over time t: audio and visual HMM chains H^A and H^V, LDS nodes U^V and X^V on the visual stream, and observations O^A and O^V.]
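A minimal sketch of the coupling, assuming an LDS of the form x_{t+1} = A x_t + B u_t + noise whose driving input u_t is selected by the decoded HMM state; all matrices and the state-to-input mapping are illustrative assumptions, not the paper's parameterization.

```python
# LDS driven by a decoded HMM state sequence (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])            # LDS transition matrix
B = np.eye(2)                         # input matrix
drive = {0: np.zeros(2),              # HMM state 0: no driving input
         1: np.array([0.5, -0.2])}    # HMM state 1: e.g. "speaking"

def run_lds(hmm_states, x0=np.zeros(2), noise=0.05):
    """Roll the LDS forward, driven by a decoded HMM state sequence."""
    x, traj = x0, []
    for s in hmm_states:
        x = A @ x + B @ drive[s] + rng.normal(scale=noise, size=2)
        traj.append(x)
    return np.array(traj)

# The LDS smooths noisy dynamics while the HMM state switches the drive.
traj = run_lds([0, 0, 1, 1, 1, 0])
print(traj.shape)                     # (6, 2)
```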

19. Mixed-State DBN: Results
Experimental setup: 59 meetings using action lexicon I.

20. Summary
• Joint effort of TUM, IDIAP, and UEDIN.
• Meetings modelled as sequences of individual and group actions.
• Large number of audio-visual features.
• Four group-action modelling frameworks:
  • Algorithm based on BIC
  • Multi-layer HMM
  • Multi-stream DBN
  • Mixed-state DBN
• Good performance of the proposed frameworks.
• Future work: room for further improvement in both the feature domain and the model structures; investigate combinations of the proposed models.
