Generative Modeling and Classification of Dialogs by Low-Level Features
Marco Cristani, Anna Pesarin, Alessandro Tavano, Carlo Drioli, Alessandro Perina, Vittorio Murino
Summary • Goal • Introduction • Our approach • Experiments • Conclusions
Goal • To model and to classify dyadic conversational audio situations • The situations are characterized by: • the kind of subjects involved (adults, children) • a predominant mood (flat or arguing discussion) • Examples
Goal (2) • Our guidelines for the modeling are: • to exploit the conversational turn-taking • not to model the content of the conversations (too difficult) • Our contribution • A novel kind of feature (the Steady Conversational Periods, SCP) + a very simple generative framework • In practice… • We are able to finely characterize the turn-taking, also encoding the timing of the turns
Introduction – Social signalling • Our aim can be cast as a social signalling problem • Social signalling • a recent formalization, bridging social psychology and pattern recognition • Social signals [Vinciarelli et al. 2008] • the expression of one's attitude towards a social situation and interplay • manifested through a multiplicity of non-verbal behavioural cues (facial expressions, gestures, and vocal outbursts)
Introduction (2) – Social signals • Bricks for social signals, [Vinciarelli et al. 2008] (figure, with our focus highlighted)
Introduction (3) – Definitions • A taxonomy for the social signals • behavioural/social cues (or thin slices of behaviour) • a set of temporal changes in neuromuscular and physiological activity that last for short intervals of time (milliseconds to minutes) • social signals (or social behaviours) • multiple behavioural cues • attitudes towards others or specific social situations that can last from minutes to hours
Introduction (5) – Turn taking • Turn taking • includes the regulation of the conversations, and the coordination (or the lack of it) during the speaker transitions
Introduction (6) – Turn taking examples • Turn-taking • coordination (example figures: no vs. yes) • timed coordination • more interesting
Our approach – preliminaries • Turn taking in a statistical way: Markov chaining • Ergodic Markov model of states (chain diagram: …, S_{t-1}, S_t, …)
Our approach (2) – Markov structures • Markov chaining for multiple agents: connections between the chains (diagram: states 1S_{t-1}, 1S_t, … and 2S_{t-1}, 2S_t, …) • The core of the model is the transition probability P(cS_t | dS_{t-1}), c,d = 1,2 • single-process states vs. joint-process states • Problem: computational burden • for C processes, the joint states give transition matrices of size O(N^C x N^C), where N is the number of states of the single processes
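To make the burden concrete, here is a minimal Python sketch (not part of the original work; names and values are ours) that enumerates the joint states of C chains with N states each and shows how fast the joint transition matrix grows.

```python
import numpy as np
from itertools import product

# Minimal sketch (not the authors' code): enumerating the joint states of C chains
# with N states each shows why the joint transition matrix grows as O(N^C x N^C).
N, C = 4, 2                                        # e.g. 4 states per chain, 2 speakers
joint_states = list(product(range(N), repeat=C))   # all (s_1, ..., s_C) tuples
K = len(joint_states)                              # N**C joint states
A_joint = np.full((K, K), 1.0 / K)                 # placeholder (uniform) joint transition matrix
print(K, A_joint.shape)                            # 16 (16, 16): 256 entries already for N=4, C=2
```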
Our approach (3) – Markov relaxations • High-order Markov models [Meyn 2005] (diagram: the two chains over states S_{t-1}, S_t, …) • each single process chooses its next state independently of the other single process(es) – reasonable! • O(N^C x N) space complexity, still hard to deal with
Our approach (4) – Influence model • Mixed Memory processes, (Observed) Influence Model (OIM) [Saul et al. 99, Asavathiratham 2000] • each single process chooses its next state without considering the choral effect of the system at the previous time step • instead, pairwise state dependencies plus influence factors {θ} are introduced
Our approach (5) – Influence model • We have a weighted convex combination of probabilities: P(cS_t | 1S_{t-1}, 2S_{t-1}) = Σ_d θ_{cd} P(cS_t | dS_{t-1}) • intra-chain transition: P(cS_t | cS_{t-1}), weighted by the self-influence θ_{cc} • inter-chain transition: P(cS_t | dS_{t-1}), d ≠ c, weighted by the other's influence θ_{cd} • Transition tables of O(C·N^2) + influence matrix θ of O(C^2)
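A short Python sketch of this convex-combination rule, assuming the standard OIM formulation; the function and variable names are ours, not the authors'.

```python
import numpy as np

# Hedged sketch of the OIM transition rule described above:
# P(cS_t | 1S_{t-1}, 2S_{t-1}) = sum_d theta[c, d] * P(cS_t | dS_{t-1}).
def oim_transition(prev_states, A, theta, c):
    """Distribution over the next state of chain c.
    prev_states: previous state index of each chain.
    A[c][d]: N x N table P(cS_t | dS_{t-1}); theta: CxC influence matrix (rows sum to 1)."""
    C = len(prev_states)
    N = A[c][c].shape[0]
    p = np.zeros(N)
    for d in range(C):
        p += theta[c, d] * A[c][d][prev_states[d], :]   # convex combination of rows
    return p

# toy usage with N = 2 states (S = 0, T = 1) and C = 2 chains
A = [[np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.6, 0.4], [0.5, 0.5]])],
     [np.array([[0.7, 0.3], [0.4, 0.6]]), np.array([[0.8, 0.2], [0.3, 0.7]])]]
theta = np.array([[0.7, 0.3], [0.4, 0.6]])
print(oim_transition([0, 1], A, theta, c=0))            # a valid distribution, sums to 1
```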
Our approach (6) – Setting • We focused on two-person conversations • The conversation originates a pair of synchronized audio signals sampled at 44100 Hz • NO source separation issues (see later) • short-term energies of the speech signals were computed on frames of 10 msec • speech (T) / silence (S) classification via k-means • (figure: the two frame-level label sequences of T and S symbols)
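An illustrative sketch of this preprocessing step in Python (our own code, using scikit-learn's KMeans; the authors' exact energy computation and clustering settings are not specified on the slide).

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch: frame the 44.1 kHz signal into 10 ms windows, compute short-term energy,
# and split frames into speech (T) / silence (S) with 2-class k-means.
def speech_silence_labels(signal, sr=44100, frame_ms=10):
    hop = int(sr * frame_ms / 1000)                       # 441 samples per 10 ms frame
    n_frames = len(signal) // hop
    frames = signal[:n_frames * hop].reshape(n_frames, hop)
    energy = np.log((frames ** 2).sum(axis=1) + 1e-10)    # log short-term energy per frame
    km = KMeans(n_clusters=2, n_init=10).fit(energy.reshape(-1, 1))
    speech_cluster = np.argmax(km.cluster_centers_)        # higher-energy cluster = speech
    return np.where(km.labels_ == speech_cluster, 'T', 'S')

# labels1 = speech_silence_labels(channel1); labels2 = speech_silence_labels(channel2)
```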
Our approach (7) – Choose a strategy • How to instantiate the (Observed) Influence Model? • at each frame (10 msec) (no inter-chain transitions are depicted for clarity) • OUTPUT • we have more self-transitions than effective changes • the parameters of the Markov chains are not informative (highly diagonal) • the length of the speech/silence segments is lost due to the first-order dependence
Our approach (8) – Steady Conversational Periods • Whenever a change in the system occurs, a novel SCP begins, for each chain/process • OUTPUT • we obtain features addressing the system's changes • we introduce a synchronization • each SCP is associated with two pieces of information: • the SPEECH (T) – SILENCE (S) label • the time length • (figure: the frame-level T/S sequences segmented into SCPs, each carrying a <label, time length> pair)
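As we read the slide, a new SCP starts on both chains whenever either chain changes its frame label; a possible Python sketch of that segmentation (names and details are ours) follows.

```python
import numpy as np

# Hedged sketch of SCP extraction: cut both chains at every joint change point and
# return, for each chain, the list of <label, time length> pairs.
def extract_scps(labels1, labels2, frame_ms=10):
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    change = np.where((labels1[1:] != labels1[:-1]) |
                      (labels2[1:] != labels2[:-1]))[0] + 1    # joint change points
    bounds = np.concatenate(([0], change, [len(labels1)]))
    scps1, scps2 = [], []
    for start, end in zip(bounds[:-1], bounds[1:]):
        dur = (end - start) * frame_ms                          # SCP length in ms
        scps1.append((labels1[start], dur))                     # <label, time length> for chain 1
        scps2.append((labels2[start], dur))                     # same boundaries for chain 2
    return scps1, scps2
```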
Our approach (9) – Steady Conversational Periods • How to exploit SCPs for a Markov modelling? • By a state renaming: <1,S> → 1 | <1,T> → 2 | <2,S> → 3 | … • Training an OIM on these states: STATE SPACE EXPLOSION, SPARSITY!!! • (figure: example SCP sequences such as <8,T> <5,S> <3,T> <5,S> <9,T> <4,S> and the resulting renamed states <16> <9> <6> <9> <18> <7>)
Our approach (9) – SCP exploitation • We consider the histograms of the SCP durations • Gaussian clustering of the histograms • Maximum Likelihood (ML) labeling: each SCP is assigned to its most likely Gaussian
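One plausible reading of this step in Python, using a Gaussian mixture over SCP durations as the codebook and ML assignment as the labeling; the number of components and the exact fitting procedure are our assumptions, not stated on the slide.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: fit Gaussians to the SCP durations and replace each duration by the index
# of its most likely Gaussian (ML labeling).
def fit_duration_codebook(durations, n_components=4):
    d = np.asarray(durations, dtype=float).reshape(-1, 1)
    return GaussianMixture(n_components=n_components).fit(d)

def ml_label(codebook, duration):
    return int(codebook.predict(np.array([[float(duration)]]))[0])

# usage: gmm = fit_duration_codebook([dur for _, dur in scps1 + scps2])
#        state = ml_label(gmm, 90)   # e.g. a 90 ms SCP -> its Gaussian cluster index
```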
Our approach (10) – SCP exploitation • The state space decreases in size • (figure: the SCP state sequences <16> <9> <6> <9> <18> <7> and <15> <7> <9> <5> <18> <10> are mapped to the Gaussian-label sequences <4> <1> <3> <1> <4> <1> and <2> <1> <1> <1> <4> <3>)
Our approach (11) – Classification • At this point the pair of labeled SCP sequences is used to train the OIM λ, obtaining: • Two intra-chain matrices (by counting state occurrences) • they tell how each agent produces a set of SCP states • Two inter-chain matrices (by counting state occurrences) • they tell how each SCP state of one chain is conditioned on each state of the other chain • An influence matrix (by gradient ascent) • it tells how the two chains influence each other
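The counting-based part of this training can be sketched as follows (our own illustration; the gradient-ascent estimation of the influence matrix is omitted).

```python
import numpy as np

# Minimal sketch of the counting-based estimates used above.
def count_transition_matrix(seq_to, seq_from, n_states):
    """Row-normalized counts of seq_from[t-1] -> seq_to[t] transitions.
    seq_to == seq_from gives an intra-chain matrix, otherwise an inter-chain one."""
    A = np.zeros((n_states, n_states))
    for prev, cur in zip(seq_from[:-1], seq_to[1:]):
        A[prev, cur] += 1
    A += 1e-6                                   # smoothing to avoid empty rows
    return A / A.sum(axis=1, keepdims=True)

# intra1  = count_transition_matrix(states1, states1, N)
# inter12 = count_transition_matrix(states1, states2, N)   # chain 2 conditioning chain 1
```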
Our approach (12) – Remarks • Given an OIM, we can evaluate the likelihood of a pair of sequences • IMPORTANT: the order in which the two sequences are presented to the system matters, because the influence of Agent 1 on Agent 2 differs from the influence of Agent 2 on Agent 1
Our approach (13) – Classification • Once a model Ψ = {ϴ, λ} and a test dialog I (an ordered pair of arrays O1 and O2 composed of {S,T} symbols) are provided, we want the likelihood P(I | Ψ) = P(O1, O2 | Ψ) • SCPs are extracted • SCP Gaussian labels are estimated from ϴ, originating the labeled sequences (ϴ acts as a codebook) • The final OIM likelihood is then estimated under λ
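A hedged sketch of how the final likelihood could be computed under the trained OIM, writing it as a product of per-step OIM transition probabilities (our reading, not the authors' exact formula; initial-state terms are omitted).

```python
import numpy as np

# Sketch: log-likelihood of two aligned SCP-state sequences under an OIM with
# transition tables A[c][d] and influence matrix theta (as in the earlier sketch).
def oim_log_likelihood(states, A, theta):
    """states: list of per-chain state sequences (equal length)."""
    C, T = len(states), len(states[0])
    ll = 0.0
    for t in range(1, T):
        prev = [states[d][t - 1] for d in range(C)]
        for c in range(C):
            p = sum(theta[c, d] * A[c][d][prev[d], states[c][t]] for d in range(C))
            ll += np.log(p + 1e-12)
    return ll
```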
Experiments – preliminaries • Twofold aim: • how the statistical signature explains turn-taking • how effective our model is in the classification task • Analysis of the model's parameters: restricted dataset • 27 healthy subjects (10 males, 17 females) • two age groups: • 14 preschool children ranging from 4 to 6 years (so, 14 dialogs) • 13 adults ranging from 22 to 40 years (13 dialogs) • semi-structured dialogs (lasting about 10 minutes): an adult human operator asks the subject (child or adult) to talk about predetermined topics: • (school time/work, hobbies, friends, food, family)
Experiments (2) – Influence factors • High self-influence: • different intra-chain sequences of speech/silence SCP states characterize the subjects • such sequences occur independently • Low self-influence: • different intra-chain sequences of speech/silence SCP states characterize the subjects • such sequences occur coordinated in time • (figure: two example pairs of SCP state sequences, with and without mutual influence)
Experiments (3) | adult-child conv. • INTRA-CHAIN MATRICES • The child shows a high tendency to converge to a short silence state • The moderator alternates from a state of silence to a speech state, either long or short, with high probability
Experiments (4) | adult-child conv. • INTER-CHAIN MATRICES • the child utters a sentence whenever the moderator speaks for a long time (he gets bored of the moderator…) • the moderator utters a sentence whenever the child remains silent for a long time (he encourages the child…)
Experiments (5) | adult-adult conv. • INTRA-CHAIN MATRICES • The subject tends to speak continuously • The moderator alternates from a state of silence to a speech state, either long or short, with high probability
Experiments (6) | adult-adult conv. • INTER-CHAIN MATRICES • the moderator interacts with the subject mostly by talking to him (either asking questions or stopping him)
Experiments (7) – Classification • Extended dataset: • We add conversations to the restricted dataset • 5 flat non-structured conversations • 9 disputes between adults (an operator pushed for fighting, the other subject naturally reacted) • We gather three categories of dialogs • Flat dialog between adults (18 samples) • Flat dialog between a child and an adult (14 samples) • Dispute (9 samples, only between adults) • We instantiate 4 classification tasks • (A) flat vs dispute (cat:1 vs cat:3) • (B) flat vs dispute, general ((cat:1 U cat:2) vs cat:3) • (C) with vs without child (cat:2 vs cat:1) • (D) all vs all
Experiments (8) – Classification • Comparative strategies • SCP histograms (SCP) • normalized histogram of the SCPs (silence, speech) as signature • Bhattacharyya distance for the classification (a sketch of this baseline follows below) • Turn-taking influence model (TTIM) [Basu et al. 01] • in practice, it is as if we had SCPs all of the same duration • Mixture of Gaussians classifier on a set of acoustic cues (MOG) [Shriberg 98] [Fernandez et al. 02]: • pitch range measure (for the intonation) • "enrate" speech rate (articulation velocity) • spectral flatness measure (SFM) • drop-off of spectral energy above 1000 Hz (DO1000), for the emotion modelling
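A small sketch of the SCP-histogram baseline comparison (our own code; histogram binning choices are not specified on the slide).

```python
import numpy as np

# Sketch: compare two normalized SCP histograms with the Bhattacharyya distance.
def bhattacharyya_distance(h1, h2):
    h1 = np.asarray(h1, dtype=float); h1 /= h1.sum()
    h2 = np.asarray(h2, dtype=float); h2 /= h2.sum()
    bc = np.sum(np.sqrt(h1 * h2))            # Bhattacharyya coefficient
    return -np.log(bc + 1e-12)               # distance: 0 for identical histograms

# print(bhattacharyya_distance([4, 1, 3, 1], [2, 1, 1, 1]))
```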
Experiments (9) – Classification • Results: • (A) flat vs dispute (cat:1 vs cat:3) • (B) flat vs dispute, general ((cat:1 U cat:2) vs cat:3) • (C) with vs without child (cat:2 vs cat:1) • (D) all vs all • lower accuracy in task A • some flat conversations are misclassified • sometimes the timing of flat conversations is built by subjects who utter very short sentences, similarly to a dispute • this behavior is captured by our model and disregarded by TTIM • SOLUTION: augment the features, not only SCPs!
Conclusions • A novel way to model dialogs has been proposed • The main contributions are • Steady Conversational Periods (SCP), as a way to synchronize a dialog, making a first-order Markov treatment feasible • The embedding of SCPs in an Observed Influence Model, resulting in a detailed way to describe the turn taking of a conversation • Future improvements • From a methodological point of view • Insert uncertainty in the SCP states, i.e., move to a full Influence Model • Enrich the model with different prosodic features • From a practical point of view • Enlarge the data set • Try novel situations
Publications • A. Pesarin, M. Cristani, V. Murino, C. Drioli and A. Perina, A statistical signature for automatic dialogue classification. In Proceedings of the International Conference on Pattern Recognition (ICPR 2008), Tampa, Florida. • M. Cristani, A. Pesarin, C. Drioli, A. Tavano, A. Perina, V. Murino, Auditory Dialog Analysis and Understanding by Generative Modelling of Interactional Dynamics. In Proceedings of the Second IEEE Workshop on CVPR 2009 for Human Communicative Behavior Analysis. • M. Cristani, A. Tavano, A. Pesarin, C. Drioli, A. Perina, V. Murino, Generative Modeling and Classification of Dialogs by Low-Level Features, submitted to Systems, Man, and Cybernetics: Part B (under review)
References • [Vinciarelli et al. 2008] Vinciarelli, A., Pantic, M., Bourlard, H., and Pentland, A. 2008. Social signal processing: state-of-the-art and future perspectives of an emerging domain. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08). • [Choudhury et al. 2004] T. Choudhury and S. Basu. Modeling conversational dynamics as a mixed memory Markov process. In Proc. NIPS, 2004. • [Meyn 2005] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Second edition, Cambridge University Press, 2008. • [Asavathiratham 2000] C. Asavathiratham, "A tractable representation for the dynamics of networked Markov chains," Ph.D. dissertation, Dept. of EECS, MIT, 2000. • [Saul et al. 99] L. Saul and M. Jordan, "Mixed memory Markov models: decomposing complex stochastic processes as mixtures of simpler ones," Machine Learning, vol. 37, no. 1, pp. 75–87, 1999. • [Basu et al. 01] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland, "Learning human interaction with the influence model," MIT Media Lab, Tech. Rep. 539, 2001. • [Shriberg 98] E. Shriberg, "Can prosody aid the automatic classification of dialog acts in conversational speech?" Language and Speech, vol. 41, no. 4, pp. 439–487, 1998. • [Fernandez et al. 02] R. Fernandez and R. Picard, "Dialog act classification from prosodic features using support vector machines," in Proc. of Speech Prosody, 2002. Thanks!!!