Dynamic Bayesian Networks for Meeting Structuring. Alfred Dielmann, Steve Renals (University of Sheffield). Introduction. GOAL. Automatic analysis of meetings through “multimodal events” recognition. Using objective measures and statistical methods.
Dynamic Bayesian Networks for Meeting Structuring
Alfred Dielmann, Steve Renals (University of Sheffield)
Introduction
GOAL: automatic analysis of meetings through “multimodal events” recognition, using objective measures and statistical methods.
Multimodal events are events which involve one or more communicative modalities and represent the behaviour of a single participant or of the whole group.
Multimodal Recognition
[Diagram, block labels: Meeting Room; Audio, Video, …; Signal Pre-processing; Feature Extraction; Specialised Recognition Systems (Speech, Video, Gestures); Models; Knowledge Database; Information Retrieval; “Multimodal Events” Recognition]
Group Actions
• The machine observes group behaviours through objective measures (“external observer”)
• Results of this analysis are “structured” into a sequence of symbols (“coding system”) that is:
  • Exhaustive (covering the entire meeting duration)
  • Mutually exclusive (non-overlapping symbols)
We used the coding system adopted by the “IDIAP framework”, composed of 5 “meeting actions” derived from different communicative modalities:
• Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
Corpus
• 60 meetings (30x2 set) collected in the “IDIAP Smart Meeting Room”:
  • 30 meetings are used for training
  • 23 meetings are used for testing
  • 7 meetings will be used for validating the results
• 4 participants per meeting
• 5 hours of multi-channel audio-visual recordings:
  • 3 fixed cameras
  • 4 lapel microphones + 8-element circular microphone array
• Meeting agendas are generated “a priori” and strictly followed, in order to have an average of 5 “meeting actions” for each meeting
• Available for public distribution: http://mmm.idiap.ch/
Features (1)
Only features derived from audio are currently used:
• Speaker Turns, from the Mic. Array through Beam-forming
• Prosody and Acoustic features, from the Lapel Mic.: Pitch baseline, Energy, Rate Of Speech, …
• Dimension reduction
Features (2): Speaker Turns
Location-based “speech activities” (SRP-PHAT beamforming), kindly provided by IDIAP:

        L1    L2    L3    L4
t-3     0.1   0.4   0.6   0.3
t-2     0.3   0.5   0.5   0.3
t-1     0.2   0.4   0.7   0.2
t       0.2   0.3   0.7   0.1

Speaker Turns Features: Li(t) * Lj(t-1) * Lk(t-2), for locations i, j, k
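The speaker turns feature can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the activity values on the slide read as one row per frame (t-3 to t) and one column per location (L1 to L4), and that one product is computed for every triple (i, j, k). The function name `speaker_turn_features` is hypothetical.

```python
import numpy as np

# Speech activity per location, one row per frame (t-3, t-2, t-1, t).
# Values are those shown on the slide; the row/column reading is an assumption.
activity = np.array([
    [0.1, 0.4, 0.6, 0.3],  # t-3
    [0.3, 0.5, 0.5, 0.3],  # t-2
    [0.2, 0.4, 0.7, 0.2],  # t-1
    [0.2, 0.3, 0.7, 0.1],  # t
])

def speaker_turn_features(activity, t):
    """Return the array F[i, j, k] = L_i(t) * L_j(t-1) * L_k(t-2)."""
    if t < 2:
        raise ValueError("need two previous frames")
    cur, prev1, prev2 = activity[t], activity[t - 1], activity[t - 2]
    # Outer product over the three frames: one value per location triple.
    return np.einsum("i,j,k->ijk", cur, prev1, prev2)

features = speaker_turn_features(activity, t=3)
# With 4 locations this gives 4 * 4 * 4 = 64 values per frame.
flat = features.reshape(-1)
```

With four locations the raw feature has 64 dimensions per frame, which may be why a dimension reduction step appears in the feature pipeline.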
Features (3)
[Diagram: Lapel Mic. signals feed a Pitch extractor followed by Filters (*) to give Pitch, together with RMS Energy and MRATE Rate Of Speech; features are masked using the “Speech activity” obtained from Mic. Array Beam-forming]
(*) Histogram, median and interpolating filter
Features (4)
We’d like to integrate other features:
• Video (image processing): participants’ motion features, blob positions, gestures and actions, …
• Audio (ASR): transcripts, …
• Other: everything that could be automatically extracted from a recorded meeting
Dynamic Bayesian Networks (1)
Bayesian Networks are a convenient graphical way to describe statistical (in)dependencies among random variables. A BN consists of a Directed Acyclic Graph and a set of Conditional Probability Tables (CPTs).
[Figure: example graph with nodes A, F, C, S, L, O]
• Given a set of examples, EM learning algorithms (e.g. Baum-Welch) can be used to train the CPTs
• Given a set of known evidence nodes, the probability of other nodes can be computed through inference
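Inference from evidence can be illustrated on a toy network. The sketch below assumes a hypothetical three-node graph A → C ← F with binary variables; the edges and all CPT values are invented for illustration and are not taken from the slide's figure. It computes P(A | C) by enumerating the hidden variable F.

```python
# Hypothetical CPTs for a toy network A -> C <- F (all variables binary).
p_a = {0: 0.7, 1: 0.3}                      # P(A)
p_f = {0: 0.4, 1: 0.6}                      # P(F)
p_c1_given = {(0, 0): 0.1, (0, 1): 0.4,     # P(C=1 | A, F)
              (1, 0): 0.7, (1, 1): 0.9}

def joint(a, f, c):
    """Joint probability P(A=a, F=f, C=c) from the DAG factorisation."""
    pc1 = p_c1_given[(a, f)]
    return p_a[a] * p_f[f] * (pc1 if c == 1 else 1.0 - pc1)

def posterior_a_given_c(c_obs):
    """P(A | C=c_obs): sum out the unobserved F, then normalise."""
    scores = {a: sum(joint(a, f, c_obs) for f in (0, 1)) for a in (0, 1)}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

posterior = posterior_a_given_c(1)
```

Enumeration is exponential in the number of hidden variables, so it is only practical for tiny graphs; toolkits typically rely on more efficient exact or approximate inference.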
Dynamic Bayesian Networks (2)
• DBNs are an extension of BNs with random variables that evolve in time:
  • Instantiating a static BN for each temporal slice t
  • Making the temporal dependencies between variables explicit
[Figure: the static BN (nodes C, S, L, O) replicated for each time slice t=0, t=+1, …, t=T]
Dynamic Bayesian Networks (3)
Hidden Markov Models, Kalman Filter Models and other state-space models are just special cases of DBNs.
[Figure: representation of an HMM as an instance of a DBN, with hidden states Q0 … Qt, Qt+1 …, observations Y0 … Yt, Yt+1 …, and parameters π, A, B, over slices t=0, t, t+1]
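Seen as a DBN, an HMM is fully described by an initial distribution, a transition table and an emission table, and inference unrolls them slice by slice. The sketch below runs the standard forward recursion for a discrete HMM; reading the figure's three parameters as π, A and B is an assumption, and the numeric values are invented.

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM (values are illustrative only).
pi = np.array([0.6, 0.4])            # P(Q0)
A = np.array([[0.7, 0.3],            # A[i, j] = P(Q_{t+1}=j | Q_t=i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # B[i, y] = P(Y_t=y | Q_t=i)
              [0.2, 0.8]])

def forward_likelihood(pi, A, B, obs):
    """P(Y_0..Y_T = obs), computed with the forward recursion."""
    alpha = pi * B[:, obs[0]]              # slice t=0
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]      # propagate one slice, then weight by emission
    return float(alpha.sum())

likelihood = forward_likelihood(pi, A, B, [0, 1, 1])
```

For long sequences the recursion should be run with per-frame scaling or in the log domain to avoid underflow.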
Dynamic Bayesian Networks (4)
Representing HMMs in terms of DBNs makes it easy to create variations on the basic theme.
[Figure: Factorial HMMs and Coupled HMMs drawn as DBNs over slices 0, t, t+1 (variables X, Z, V, Q, Y)]
Dynamic Bayesian Networks (5)
The use of DBNs and BNs presents some advantages:
• Intuitive way to represent models graphically, with a standard notation
• Unified theory for a huge number of models
• Connecting different models in a structured view
• Making it easier to study new models
• Unified set of instruments (e.g. GMTK) to work with them (training, inference, decoding)
• Maximizes resource reuse
• Minimizes “setup” time
First Model (1)
“Early integration” of features, modelled through a 2-level Hidden Markov Model
[Figure: three chains over time t = 0 … T: hidden meeting actions A0 … AT; hidden sub-states S0 … ST; observable feature vectors Y0 … YT]
First Model (2)
The main idea behind this model is to decompose each “meeting action” into a sequence of “sub-actions” or sub-states. (Note that different actions are free to share the same sub-state.)
• The structure is composed of two ergodic HMM chains:
  • The top chain links sub-states {St} with “actions” {At}
  • The lower one maps the feature vectors {Yt} directly into a sub-state {St}
First Model (3)
• The sequence of actions {At} is known a priori
• The sequence {St} is determined during the training process, and the meaning of each sub-state is unknown
• The cardinality of {St} is one of the model’s parameters
• The mapping of observable features {Yt} into hidden sub-states {St} is obtained through Gaussian Mixture Models
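The GMM mapping from a feature vector to a sub-state can be sketched as a per-sub-state log-likelihood. The code below is a minimal illustration assuming diagonal covariances (the slides do not state the covariance type); `gmm_log_likelihood` is a hypothetical helper name.

```python
import numpy as np

def gmm_log_likelihood(y, weights, means, variances):
    """log p(y | sub-state) for a diagonal-covariance Gaussian mixture.

    weights: (K,), means and variances: (K, D), y: (D,).
    """
    y = np.asarray(y, dtype=float)
    diff = y - means                                    # (K, D)
    # Log density of each diagonal Gaussian component.
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances)
                       + diff ** 2 / variances).sum(axis=1)
    log_terms = np.log(weights) + log_comp
    # Log-sum-exp over components for numerical stability.
    m = log_terms.max()
    return float(m + np.log(np.exp(log_terms - m).sum()))

# One hypothetical GMM per sub-state; the best-scoring sub-state explains the frame.
substate_gmms = [
    (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    (np.array([0.5, 0.5]), np.array([[3.0, 3.0], [4.0, 4.0]]), np.ones((2, 2))),
]
frame = [0.2, -0.1]
scores = [gmm_log_likelihood(frame, *gmm) for gmm in substate_gmms]
best_substate = int(np.argmax(scores))
```

In the full model these scores act as emission terms inside the HMM, so the sub-state is chosen jointly with its temporal context rather than frame by frame as in this sketch.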
Second Model (1)
Multistream processing of features through two parallel and independent Hidden Markov Models
[Figure: DBN over time t = 0 … T with, from top to bottom: Action Counter C; Enable Transitions E; Meeting Actions A0 … AT; Hidden Sub-states S0^1 … ST^1 and S0^2 … ST^2; observed Prosodic Features Y0^1 … YT^1 and Speaker Turns Features Y0^2 … YT^2]
Second Model (2)
Each feature group (or modality) Y^m is mapped into an independent HMM chain, so every group is evaluated independently and mapped into a hidden sub-state {St^m}.
As in the previous model, there is another HMM layer (A), which represents “meeting actions”.
The whole sub-state {St^1 x St^2 x … x St^n} is mapped into an action {At}.
Second Model (3)
It is a variable-duration HMM with an explicit enable node:
• At represents “meeting actions” as usual
• Ct counts “meeting actions”
• Et is a binary indicator variable that enables state changes inside the node At
Second Model (4)
• Training: when {At} changes, {Ct} is incremented and {Et} is set on for a single frame (At, Et and Ct are part of the training dataset). The behaviours of {Et} and {Ct} learned during the training phase are then exploited during decoding.
• Decoding: {At} is free to change only if {Et} is high, and then according to the state of {Ct}
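The training-time values of the enable and counter nodes follow deterministically from the labelled action sequence. The sketch below is one possible reading: it assumes Et fires on the first frame of each new action (the slides do not say whether it is that frame or the preceding one), and `enable_and_counter` is a hypothetical name.

```python
def enable_and_counter(actions):
    """Derive per-frame E_t (0/1) and C_t (number of action changes so far)."""
    enable, counter = [], []
    count = 0
    for t, action in enumerate(actions):
        changed = t > 0 and action != actions[t - 1]
        if changed:
            count += 1                   # C_t is incremented when A_t changes
        enable.append(1 if changed else 0)  # E_t is on for a single frame
        counter.append(count)
    return enable, counter

frames = ["monologue", "monologue", "dialogue", "dialogue", "note taking"]
enable, counter = enable_and_counter(frames)
```

At decoding time the roles are reversed: Et and Ct are inferred, and At may only change on frames where Et is on.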
Results
Results were obtained with the two models previously described, using only audio-derived features.
• The second model effectively reduces both the number of Substitutions and the number of Insertions
• The error measure is equivalent to the Word Error Rate measure used to evaluate speech recogniser performance
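A WER-style measure over action sequences can be sketched with a standard edit distance: the minimum number of substitutions, deletions and insertions, divided by the length of the reference. The function names and the example sequences below are illustrative assumptions, not the authors' scoring tool or data.

```python
def edit_distance(reference, hypothesis):
    """Minimum substitutions + deletions + insertions turning reference into hypothesis."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_sym in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_sym in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (ref_sym != hyp_sym)))  # substitution or match
        prev = cur
    return prev[-1]

def action_error_rate(reference, hypothesis):
    """Error rate in percent, normalised by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)

ref = ["monologue", "dialogue", "presentation", "note taking"]
hyp = ["monologue", "presentation", "presentation", "dialogue", "note taking"]
rate = action_error_rate(ref, hyp)
```

As with WER, insertions can push the rate above 100%, since it is normalised by the reference length only.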
Conclusions
• A new approach has been proposed
• The results achieved seem promising, and in the future we’d like to:
  • Validate them with the remaining part of the test set (or possibly an independent test set)
  • Integrate other features: video, ASR transcripts, Xtalk, …
  • Try new experiments with the existing models
  • Develop new DBN-based models
Multimodal Recognition (2)
Knowledge sources:
• Raw Audio
• Raw Video
• Acoustic Features
• Visual Features
• Automatic Speech Recognition
• Video Understanding
• Gesture Recognition
• Eye Gaze Tracking
• Emotion Detection
• …
Approaches:
• A standalone high-level recogniser operating on low-level raw data
• Fusion of different recognisers at an early stage, generating hybrid recognisers (like AVSR)
• Integration of recogniser outputs through a “high level” recogniser