Generative Modeling and Classification of Dialogs by Low-Level Features

Marco Cristani, Anna Pesarin, Alessandro Tavano, …

Presentation Transcript


  1. Generative Modeling and Classification of Dialogs by Low-Level Features • Marco Cristani, Anna Pesarin, Alessandro Tavano, Carlo Drioli, Alessandro Perina, Vittorio Murino • [Title figure: two coupled chains of states]

  2. Summary • Goal • Introduction • Our approach • Experiments • Conclusions

  3. Goal • To model and classify dyadic (two-person) conversational audio situations • The situations are characterized by: • the kind of subjects involved (adults, children) • the predominant mood (flat or argumentative discussion) • Examples [Figure: three example situations]

  4. Goal (2) • Our guidelines for the modeling are: • to exploit the conversational turn-taking • not to model the content of the conversations (too difficult) • Our contribution: • a novel kind of feature (the Steady Conversational Periods, SCPs) plus a very simple generative framework • In practice, we are able to finely characterize turn-taking, also encoding the timing of the turns

  5. Introduction – Social signalling • Our aim can be cast as a social signalling problem • Social signalling is a recent formalization • Social signals [Vinciarelli et al. 2008]: • the expression of one's attitude towards a social situation and interplay • manifested through a multiplicity of non-verbal behavioural cues (facial expressions, gestures, and vocal outbursts) • [Figure: Social Psychology, Social Signalling, Pattern Recognition]

  6. Introduction (2) – Social signals • Building blocks of social signals [Vinciarelli et al. 2008] • [Figure, with our focus highlighted]

  7. Introduction (3) – Definitions • A taxonomy of social signals: • behavioural/social cues (or thin slices of behaviour): • a set of temporal changes in neuromuscular and physiological activity that last for short intervals of time (milliseconds to minutes) • social signals (or social behaviours): • multiple behavioural cues • attitudes towards others or towards specific social situations, which can last from minutes to hours

  8. Introduction (5) – Turn-taking • Turn-taking includes the regulation of the conversation and the coordination (or the lack of it) during speaker transitions

  9. Introduction (6) – Turn-taking examples • [Figure: examples labelled "No" and "Yes"] • Turn-taking • coordination • timed coordination (more interesting)

  10. Our approach – Preliminaries • Turn-taking modelled statistically: Markov chaining • An ergodic Markov model over states • [Figure: a single chain of states S_{t−1}, S_t, … over time 1, …, T]

  11. Our approach (2) – Markov structures • Markov chaining for multiple agents: connections between the chains • The core of the model is the transition probability between states (c, d = 1, 2) [equation not recoverable] • [Figure: single-process states vs joint-process states] • Problem: computational burden • for C processes, the joint states give transition matrices of size O(N^C × N^C), where N is the number of states of each single process

  12. Our approach (3) – Markov relaxations • High-order Markov models [Meyn 2005] • each single process chooses its next state independently of the other process(es), while still conditioning on the joint previous state: reasonable! • O(N^C × N) space complexity, still hard to deal with

  13. Our approach (4) – Influence model • Mixed-memory processes, (Observed) Influence Model (OIM) [Saul et al. 99, Asavathiratham 2000] • each single process chooses its next state without considering the joint (choral) state of the system at the previous time step • instead, pairwise state dependencies plus influence factors {θ} are introduced

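  14. Our approach (5) – Influence model • The transition of chain c is a weighted convex combination of pairwise transition probabilities: $P({}^{c}S_t \mid {}^{1}S_{t-1}, {}^{2}S_{t-1}) = \sum_{d=1}^{2} \theta_{cd}\, P({}^{c}S_t \mid {}^{d}S_{t-1})$, with $\theta_{cd} \ge 0$ and $\sum_{d} \theta_{cd} = 1$ • intra-chain transition (self-influence): $P({}^{c}S_t \mid {}^{c}S_{t-1})$ • inter-chain transition (other's influence): $P({}^{c}S_t \mid {}^{d}S_{t-1})$, $d \ne c$ • Transition tables of O(CN²) plus an influence matrix θ of O(C²)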
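
The convex combination above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `influence_transition`, the dictionary `A` keyed by `(c, d)` and the 0-based state indices are assumptions made for the example.

```python
def influence_transition(theta, A, prev_states, c):
    """Distribution of chain c's next state under the influence model (sketch).

    theta[c][d]: influence of chain d on chain c (non-negative, each row sums to 1).
    A[(c, d)][i][j]: P(chain c is in state j at t | chain d was in state i at t-1),
        an N x N row-stochastic table (d == c: intra-chain, d != c: inter-chain).
    prev_states[d]: 0-based state index of chain d at t-1.
    """
    n_states = len(A[(c, c)])
    probs = [0.0] * n_states
    for d, i in enumerate(prev_states):
        # add the pairwise transition row of chain d, weighted by theta[c][d]
        for j in range(n_states):
            probs[j] += theta[c][d] * A[(c, d)][i][j]
    return probs


if __name__ == "__main__":
    theta = [[0.7, 0.3], [0.4, 0.6]]
    A = {
        (0, 0): [[0.9, 0.1], [0.2, 0.8]],
        (0, 1): [[0.5, 0.5], [0.1, 0.9]],
        (1, 1): [[0.6, 0.4], [0.3, 0.7]],
        (1, 0): [[0.2, 0.8], [0.5, 0.5]],
    }
    print(influence_transition(theta, A, prev_states=[0, 1], c=0))
```

With the toy numbers in the `__main__` block, chain 0 gets approximately [0.66, 0.34]: 0.7 × [0.9, 0.1] from its own past plus 0.3 × [0.1, 0.9] from the other chain. Because θ rows and table rows are stochastic, the result is itself a probability distribution.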

  15. Our approach (6) – Setting • We focused on two-person conversations • Each conversation produces a pair of synchronized audio signals sampled at 44100 Hz • No source-separation issues (see later) • The short-term energy of each speech signal was computed on 10 ms frames • Speech (T)/silence (S) classification via k-means • [Figure: frame-level labels, one symbol per 10 ms frame: T T T T T T T T S S S S S S S S S S S S S S T T T T T T T T T T T T S S S S S S S S S S S S T T T T T S S S S S S S S T T T T T T T T T]
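
The front end of this slide can be sketched as below. The 10 ms framing at 44100 Hz comes from the slide; clustering the log of the energy, initialising the two centroids at the minimum and maximum, and the function names are assumptions for illustration only.

```python
import math

SAMPLE_RATE = 44100             # Hz, as stated on the slide
FRAME_LEN = SAMPLE_RATE // 100  # 10 ms frames -> 441 samples


def short_term_energy(signal, frame_len=FRAME_LEN):
    """Mean squared amplitude of each non-overlapping frame (a trailing partial frame is dropped)."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[k * frame_len:(k + 1) * frame_len]) / frame_len
        for k in range(n_frames)
    ]


def speech_silence_labels(energies, n_iter=50):
    """Two-cluster k-means on log-energy; the higher-energy cluster is labelled 'T' (speech)."""
    logs = [math.log10(e + 1e-12) for e in energies]  # small floor avoids log(0)
    lo, hi = min(logs), max(logs)                      # initial centroids
    for _ in range(n_iter):
        is_hi = [abs(v - hi) < abs(v - lo) for v in logs]
        hi_vals = [v for v, h in zip(logs, is_hi) if h]
        lo_vals = [v for v, h in zip(logs, is_hi) if not h]
        new_hi = sum(hi_vals) / len(hi_vals) if hi_vals else hi
        new_lo = sum(lo_vals) / len(lo_vals) if lo_vals else lo
        if (new_hi, new_lo) == (hi, lo):
            break  # centroids stopped moving
        hi, lo = new_hi, new_lo
    return ["T" if abs(v - hi) < abs(v - lo) else "S" for v in logs]
```

Working on log-energy keeps quiet speech frames from being swamped by loud ones; each channel of the dialog would be processed independently, giving one T/S sequence per speaker.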

  16. Our approach (7) – Choosing a strategy • How to instantiate the (Observed) Influence Model? • Option: one step per frame (10 ms) [Figure: the frame-level T/S sequence of slide 15; inter-chain transitions are not depicted for clarity] • OUTPUT: • there are far more self-transitions than actual changes • the parameters of the Markov chains are not informative (highly diagonal) • the length of the speech/silence segments is lost due to the first-order dependence
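
The "highly diagonal" problem is easy to reproduce: estimating a first-order transition matrix by counting consecutive frame labels gives self-transition probabilities close to 1. The function name below is hypothetical. Under such a chain a segment's duration is geometric, with mean 1/(1 − p_stay), which is why the actual segment lengths are not represented.

```python
from collections import Counter


def transition_matrix(labels, states=("S", "T")):
    """Maximum-likelihood first-order transition probabilities, estimated by counting."""
    counts = Counter(zip(labels, labels[1:]))  # (previous, next) frame-label pairs
    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        matrix[a] = {b: (counts[(a, b)] / row_total if row_total else 0.0) for b in states}
    return matrix
```

On a made-up sequence with runs of 8, 14, 12, 12, 5, 8 and 9 frames (in the spirit of the slide's figure), both self-transition probabilities exceed 0.9 even though only six label changes occur.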

  17. Our approach (8) – Steady Conversational Periods • Whenever a change in the system occurs, a new SCP begins for each chain/process • OUTPUT: • the features address the system's changes • a synchronization between the chains is introduced • each SCP carries two pieces of information: • the speech (T)/silence (S) label • the time length, i.e. the pair <label, time length> • [Figure: the frame-level T/S sequence of slide 15 segmented into SCPs]
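
A sketch of SCP extraction from two synchronized frame-label sequences follows. It assumes, as this slide suggests, that a change in either chain closes the current SCP in both chains, so the two SCP sequences have the same length and matching durations. The function name and the `(label, length_in_frames)` tuple layout are hypothetical.

```python
def extract_scps(chain1, chain2):
    """Split two synchronized T/S frame sequences into Steady Conversational Periods.

    A new SCP starts, in both chains at once, whenever either chain changes label.
    Returns two equally long lists of (label, length_in_frames) pairs.
    """
    if len(chain1) != len(chain2):
        raise ValueError("chains must be synchronized (same number of frames)")
    scps1, scps2 = [], []
    start = 0
    for t in range(1, len(chain1) + 1):
        at_end = t == len(chain1)
        # short-circuit: at the end we close the last SCP without indexing past it
        if at_end or chain1[t] != chain1[t - 1] or chain2[t] != chain2[t - 1]:
            length = t - start
            scps1.append((chain1[start], length))
            scps2.append((chain2[start], length))
            start = t
    return scps1, scps2
```

For example, `extract_scps("TTTSSS", "SSSSTT")` yields `[('T', 3), ('S', 1), ('S', 2)]` and `[('S', 3), ('S', 1), ('T', 2)]`: the second chain's silence is split because the first chain changed at frame 3.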

  18. Our approach (9) – Steady Conversational Periods • How can SCPs be exploited for Markov modelling? • By renaming states: <1,S> → 1 | <1,T> → 2 | <2,S> → 3 | … • Training an OIM on these states → STATE-SPACE EXPLOSION, SPARSITY!!! • Example: chain 1: <8,T> <5,S> <3,T> <5,S> <9,T> <4,S> → <16> <9> <6> <9> <18> <7>; chain 2: <8,S> <4,S> <5,S> <3,S> <9,T> <5,T> → <15> <7> <9> <5> <18> <10>
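
The renaming <1,S> → 1, <1,T> → 2, <2,S> → 3, … amounts to mapping a silence SCP of length ℓ to 2ℓ − 1 and a speech SCP of length ℓ to 2ℓ, which reproduces the numbers in the example. The sketch below (hypothetical names, SCPs as `(label, length)` tuples) also makes the explosion explicit: at 10 ms per frame, a 10-second SCP already maps to state 2000.

```python
def scp_to_state(label, length):
    """Rename an SCP to one integer state: silence of length l -> 2l - 1, speech -> 2l."""
    if label not in ("S", "T") or length < 1:
        raise ValueError("expected label 'S' or 'T' and a positive length")
    return 2 * length - 1 if label == "S" else 2 * length


def rename_chain(scps):
    """Apply the renaming to a whole SCP sequence of (label, length) pairs."""
    return [scp_to_state(label, length) for label, length in scps]


def n_states_needed(max_len_frames):
    """Size of the renamed state space if SCPs can last up to max_len_frames frames."""
    return 2 * max_len_frames
```

The number of states, and hence of transition parameters, grows with the longest SCP, while most states are visited only a handful of times; this is the sparsity the slide warns about.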

  19. Our approach (9) – SCP exploitation • We consider SCP histograms • Gaussian clustering • Maximum Likelihood (ML) labeling

  20. Our approach (10) – SCP exploitation • The state space decreases in size: • chain 1: <16> <9> <6> <9> <18> <7> → <4> <1> <3> <1> <4> <1> • chain 2: <15> <7> <9> <5> <18> <10> → <2> <1> <1> <1> <4> <3>
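
One way to realise "Gaussian clustering + ML labeling" is sketched below, under explicit assumptions: a separate 1-D Gaussian mixture per label is fitted by EM on SCP durations (in frames), with two components per label, and each SCP gets the component of highest likelihood (mixture weights ignored, hence ML rather than MAP). States are numbered silence first, then speech, shortest component first, matching the ordering visible in the example above. The component count, the EM fitting and all names are illustrative choices, not the authors' settings.

```python
import math


def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k=2, n_iter=100, min_var=1e-3):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances) sorted by mean."""
    if not xs:
        raise ValueError("no durations to cluster")
    xs = sorted(xs)
    n = len(xs)
    means = [float(xs[min(int((i + 0.5) * n / k), n - 1)]) for i in range(k)]  # quantile init
    overall = sum(xs) / n
    variances = [max(sum((x - overall) ** 2 for x in xs) / n, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each duration
        resp = []
        for x in xs:
            p = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total if total > 0 else 1.0 / k for pj in p])
        # M-step: re-estimate weights, means and floored variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj, min_var)
    order = sorted(range(k), key=lambda j: means[j])
    return [weights[j] for j in order], [means[j] for j in order], [variances[j] for j in order]


def build_codebook(scps, k=2):
    """One mixture per label ('S', 'T'), fitted on the durations of that label's SCPs."""
    return {lab: fit_gmm_1d([length for l2, length in scps if l2 == lab], k) for lab in ("S", "T")}


def ml_labels(scps, codebook, k=2):
    """States 1..k: silence components; k+1..2k: speech components (shortest first)."""
    states = []
    for label, length in scps:
        _, means, variances = codebook[label]
        best = max(range(k), key=lambda j: gauss_pdf(length, means[j], variances[j]))
        states.append((0 if label == "S" else k) + best + 1)
    return states
```

The fitted parameters play the role of a codebook: at test time the same `ml_labels` call maps new SCPs onto the reduced state space.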

  21. Our approach (11) – Classification • At this point, the pair of ML-labelled SCP sequences is used to train the OIM λ, obtaining: • two intra-chain matrices (by counting state occurrences): they tell how each agent produces a sequence of SCP states • two inter-chain matrices (by counting state occurrences): they tell how each SCP state of one chain is conditioned on each state of the other chain • an influence matrix (by gradient ascent): it tells how the two chains influence each other
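
The three training steps can be sketched as follows, for two chains with states numbered 1..N. Counting with add-α smoothing (an assumption, to keep every probability positive) gives the intra- and inter-chain tables; the influence weights are then found by projected gradient ascent on the log-likelihood. Because that log-likelihood separates over the rows of θ, each chain's row can be optimised on its own, and with two chains a row is a single weight w = θ_cc with θ_cd = 1 − w. Names, step size and iteration count are hypothetical.

```python
def count_matrix(prev_seq, next_seq, n_states, alpha=1.0):
    """P(next_seq[t] = j | prev_seq[t-1] = i) by counting, with add-alpha smoothing.
    States are integers 1..n_states; the result is indexed [i-1][j-1]."""
    counts = [[alpha] * n_states for _ in range(n_states)]
    for i, j in zip(prev_seq[:-1], next_seq[1:]):
        counts[i - 1][j - 1] += 1
    return [[v / sum(row) for v in row] for row in counts]


def train_oim(seq1, seq2, n_states, lr=0.5, n_steps=200, alpha=1.0):
    """Fit a two-chain observed influence model to ML-labelled SCP sequences.

    Returns A, with A[(c, d)][i][j] = P(chain c = j+1 at t | chain d = i+1 at t-1), and theta."""
    seqs = (seq1, seq2)
    A = {(c, d): count_matrix(seqs[d], seqs[c], n_states, alpha) for c in range(2) for d in range(2)}
    theta = [[0.5, 0.5], [0.5, 0.5]]
    for c in range(2):
        d = 1 - c
        # per-step probabilities under self-influence (a) and the other's influence (b)
        a = [A[(c, c)][seqs[c][t - 1] - 1][seqs[c][t] - 1] for t in range(1, len(seq1))]
        b = [A[(c, d)][seqs[d][t - 1] - 1][seqs[c][t] - 1] for t in range(1, len(seq1))]
        w = 0.5  # theta[c][c]; theta[c][d] = 1 - w keeps the row on the simplex
        for _ in range(n_steps):
            # derivative of the mean of log(w*a + (1-w)*b) with respect to w
            grad = sum((at - bt) / (w * at + (1 - w) * bt) for at, bt in zip(a, b)) / max(len(a), 1)
            w = min(1.0, max(0.0, w + lr * grad))  # projected gradient-ascent step
        theta[c][c], theta[c][d] = w, 1.0 - w
    return A, theta
```

The objective is concave in w (a sum of logs of linear functions), so this simple projected ascent is enough for a sketch; for a chain that is well predicted by its own past, w is driven towards 1 (high self-influence).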

  22. Our approach (12) – Remarks • Given an OIM, we can evaluate the likelihood of a pair of labelled sequences • IMPORTANT: the order in which the two sequences are fed to the system matters, since it determines which agent is treated as chain 1 and which as chain 2 • [Figure: Agent 1 influences Agent 2; Agent 2 influences Agent 1]

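  23. Our approach (13) – Classification • Once a model Ψ = {ϴ, λ} and a test dialog I (an ordered pair of arrays O₁ and O₂ of {S, T} symbols) are provided, we want the likelihood P(I | Ψ) = P(O₁, O₂ | Ψ) • the SCPs are extracted • the Gaussian labels of the SCPs are estimated from ϴ (ϴ acts as a codebook), producing two label sequences ${}^{1}L$ and ${}^{2}L$ • the final OIM likelihood is estimated as $P(I \mid \Psi) \approx \prod_{c=1}^{2} \Big[ P({}^{c}L_1) \prod_{k=2}^{K} \sum_{d=1}^{2} \theta_{cd}\, P({}^{c}L_k \mid {}^{d}L_{k-1}) \Big]$, where ${}^{c}L_k$ is the label of the k-th SCP of chain c and K is the number of SCPs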
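
A sketch of the likelihood evaluation, working in log space to avoid underflow on long dialogs, together with a classifier. The classifier assumes the usual generative decision rule (assign the dialog to the class whose model scores highest), which the slide does not spell out. The table layout matches a dictionary keyed by `(c, d)`; the uniform initial distribution and all names are assumptions, and the tables must be smoothed so that no probability is zero.

```python
import math


def oim_log_likelihood(seq1, seq2, A, theta, init=None):
    """log P(1L, 2L | lambda) under a two-chain observed influence model.

    A[(c, d)][i][j] = P(chain c = j+1 at k | chain d = i+1 at k-1), states 1..N;
    theta[c][d] row-stochastic; init[c][j] = P(chain c starts in state j+1), uniform if None.
    """
    seqs = (seq1, seq2)
    n_states = len(A[(0, 0)])
    ll = 0.0
    for c in range(2):
        p0 = init[c][seqs[c][0] - 1] if init else 1.0 / n_states
        ll += math.log(p0)
        for k in range(1, len(seq1)):
            # convex combination of the intra- and inter-chain transition probabilities
            p = sum(theta[c][d] * A[(c, d)][seqs[d][k - 1] - 1][seqs[c][k] - 1] for d in range(2))
            ll += math.log(p)
    return ll


def classify(seq1, seq2, models):
    """models: {class_name: (A, theta)}; returns the class with the highest log-likelihood."""
    return max(models, key=lambda name: oim_log_likelihood(seq1, seq2, *models[name]))
```

Note that `classify(seq1, seq2, …)` and `classify(seq2, seq1, …)` can disagree, which is exactly the ordering remark of the previous slide.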

  24. Experiments – Preliminaries • Twofold aim: • to show how the statistical signature explains turn-taking • to show how effective our model is in the classification task • Analysis of the model parameters: restricted dataset • 27 healthy subjects (10 males, 17 females) • two age groups: • 14 preschool children aged 4 to 6 years (so, 14 dialogs) • 13 adults aged 22 to 40 years (13 dialogs) • semi-structured dialogs (about 10 minutes long): an adult human operator asks the subject (child or adult) to talk about predetermined topics (school/work, hobbies, friends, food, family)

  25. Experiments (2) – Influence factors • High self-influence: • different intra-chain sequences of speech/silence SCP states characterize the subjects • such sequences occur independently • [Figure: example state sequences of the two chains, 1 3 3 3 4 1 4 4 and 3 3 2 4 1 3 4 1] • Low self-influence: • different intra-chain sequences of speech/silence SCP states characterize the subjects • such sequences occur coordinated in time • [Figure: example state sequences of the two chains, 1 3 1 4 4 3 4 3 and 3 1 3 4 2 3 2 3]

  26. Experiments (3) | adult-child conversations • INTRA-CHAIN MATRICES • The child shows a high tendency to converge to a short silence state • The moderator alternates from a silence state to a speech state, either long or short, with high probability

  27. Experiments (4) | adult-child conversations • INTER-CHAIN MATRICES • The child utters a sentence when the moderator speaks for a long time (the child gets bored of the moderator…) • The moderator utters a sentence whenever the child remains silent for a long time (to encourage the child…)

  28. Experiments (5) | adult-adult conversations • INTRA-CHAIN MATRICES • The subject tends to speak continuously • The moderator alternates from a silence state to a speech state, either long or short, with high probability

  29. Experiments (6) | adult-adult conversations • INTER-CHAIN MATRICES • The moderator interacts with the subject mostly by talking to him (either to ask questions or to interrupt him)

  30. Restricted  extended dataset: • We add conversations • 5 flat non-structured conversations • 9 disputes between adults (an operator pushed for fighting, the other subject naturally reacted) Experiments (7) - Classification • We instantiate 4 classification tasks • (A) flat vs dispute - (cat:1 vs cat:3); • (B) flat vs dispute, general - ((cat:1 U cat:2) vs cat:3); • (C) with vs without child - (cat:2 vs cat:1); • (D) all vs all; • We gather three categories of dialogs • Flat dialog between adults (18 samples) • Flat dialog between a child and an adult (14 samples) • Dispute (9 samples, only between adults)

  31. Experiments (8) – Classification • Comparative strategies: • SCP histograms (SCP): • the normalized histogram of the SCPs (silence, speech) as signature • Bhattacharyya distance for classification • Turn-taking influence model (TTIM) [Basu et al. 01]: • in practice, it is as if all SCPs had the same duration • Mixture-of-Gaussians classifier on a set of acoustic cues (MOG) [Shriberg 98] [Fernandez et al. 02]: • pitch range measure (for intonation) • "enrate" speech rate (articulation velocity) • spectral flatness measure (SFM) • drop-off of spectral energy above 1000 Hz (DO1000), for emotion modelling
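
The SCP-histogram baseline can be sketched as below. The binning over (label, length) with lengths clipped at `max_len` and the nearest-neighbour decision rule are assumptions; the slide only states that a normalized SCP histogram is compared with the Bhattacharyya distance, here taken as D_B = −ln Σᵢ √(pᵢ qᵢ). The class names in any usage are illustrative.

```python
import math
from collections import Counter


def scp_histogram(scps, max_len):
    """Normalised histogram over (label, length) bins; lengths above max_len share the last bin."""
    if not scps:
        raise ValueError("empty SCP sequence")
    counts = Counter((label, min(length, max_len)) for label, length in scps)
    bins = [(label, n) for label in ("S", "T") for n in range(1, max_len + 1)]
    return [counts[b] / len(scps) for b in bins]


def bhattacharyya_distance(p, q):
    """D_B = -ln(sum_i sqrt(p_i q_i)); infinite when the histograms do not overlap."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf


def nearest_neighbour(query, training):
    """training: list of (histogram, class_name); returns the class of the closest histogram."""
    return min(training, key=lambda item: bhattacharyya_distance(query, item[0]))[1]
```

Unlike the OIM, this signature discards the order of the SCPs and the interplay between the two chains; it keeps only how often each (label, duration) pair occurs.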

  32. Experiments (9) – Classification • Results for the four tasks: (A) flat vs dispute (cat:1 vs cat:3); (B) flat vs dispute, general ((cat:1 ∪ cat:2) vs cat:3); (C) with vs without child (cat:2 vs cat:1); (D) all vs all [results table not recoverable] • Lower accuracy in task A: • some flat conversations are misclassified • sometimes the timing of a flat conversation is shaped by subjects who utter very short sentences, as in a dispute • this behaviour is captured by our model but disregarded by TTIM • SOLUTION: augment the features, not only SCPs!

  33. Conclusions • A novel way to model dialogs has been proposed • The main contributions are: • Steady Conversational Periods (SCPs), as a way to synchronize a dialog, making a first-order Markov treatment feasible • the embedding of SCPs in an Observed Influence Model, resulting in a detailed description of the turn-taking of a conversation • Future improvements: • from a methodological point of view: • inserting uncertainty in the SCP states, i.e., moving to a full Influence Model • enriching the model with different prosodic features • from a practical point of view: • enlarging the dataset • trying novel situations

  34. Publications • A. Pesarin, M. Cristani, V. Murino, C. Drioli and A. Perina, "A statistical signature for automatic dialogue classification," in Proceedings of the International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida. • M. Cristani, A. Pesarin, C. Drioli, A. Tavano, A. Perina and V. Murino, "Auditory dialog analysis and understanding by generative modelling of interactional dynamics," in Proceedings of the Second IEEE Workshop on CVPR 2009 for Human Communicative Behavior Analysis. • M. Cristani, A. Tavano, A. Pesarin, C. Drioli, A. Perina and V. Murino, "Generative modeling and classification of dialogs by low-level features," submitted to IEEE Transactions on Systems, Man, and Cybernetics, Part B (under review).

  35. References • [Vinciarelli et al. 2008] A. Vinciarelli, M. Pantic, H. Bourlard and A. Pentland, "Social signal processing: state-of-the-art and future perspectives of an emerging domain," in Proceedings of the 16th ACM International Conference on Multimedia (MM '08), 2008. • [Choudhury et al. 2004] T. Choudhury and S. Basu, "Modeling conversational dynamics as a mixed memory Markov process," in Proc. NIPS, 2004. • [Meyn 2005] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, second edition to appear, Cambridge University Press, 2008. • [Asavathiratham 2000] C. Asavathiratham, "A tractable representation for the dynamics of networked Markov chain," Ph.D. dissertation, Dept. of ECS, MIT, 2000. • [Saul et al. 99] L. Saul and M. Jordan, "Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones," Machine Learning, vol. 37, no. 1, pp. 75–87, 1999. • [Basu et al. 01] S. Basu, T. Choudhury, B. Clarkson and A. Pentland, "Learning human interaction with the influence model," MIT Media Lab, Tech. Rep. 539, 2001. • [Shriberg 98] E. Shriberg, "Can prosody aid the automatic classification of dialog acts in conversational speech?" Language and Speech, vol. 41, no. 4, pp. 439–487, 1998. • [Fernandez et al. 02] R. Fernandez and R. Picard, "Dialog act classification from prosodic features using support vector machines," in Proc. of Speech Prosody, 2002. • Thanks!
