Learning Relationships from Conversational Patterns

Tanzeem Choudhury+ and Sumit Basu*
+ MIT Media Lab, Intel Research (Seattle)
* MIT EECS/Media Lab, Microsoft Research (Redmond)

From Conversations to Relationships
• Modeling Relationships via Conversations:
  • conversations are a key part of our interactions
  • contextual features: how often we have conversations, who we have them with
  • behavioral features: how we act during a conversation
  • can we use these features to model relationships?
• Our Approach:
  • robust, unobtrusive sensing method: detect conversations and extract conversational features as in S. Basu, Conversational Scene Analysis
  • probabilistic learning techniques that model prominence in the network and the effects of individuals’ dynamics on interaction dynamics

The Sociometer
• Measurements:
  • Face-to-face proximity (17 Hz sampling rate, IR sensor)
  • Speech information (8 kHz, microphone)
  • Motion information (50 Hz, accelerometer)
• Factors that contribute to the wearability of a wearable: shape, size, attachment, weight, movement, aesthetics
• People involved in the hardware and design: Brian Clarkson, Rich DeVaul, Vadim Gerasimov, Josh Weaver

The Experiment
• 23 subjects wore sociometers for 2 weeks, 6 hours every day
• 66 hours of data per subject: 1518 hours of interaction data in total
• 4 different groups distributed throughout the lab

Aims of the Experiment
• We want to identify:
  • When people are facing each other
  • When two people are conversing, regardless of what they are saying
• From that, analyze:
  • The communication patterns within the community
  • Various social network properties
  • Individual turn-taking style
  • How people influence each other’s turn-taking style

Auditory Features
• Features: speech segments, voicing segments, speaking rate, pitch track, energy
• Why these features: (Scherer et al., 1972)
  • These features are sufficient to recover emotional content
[Figure: spectrogram of telephone speech (8 kHz, 8 bit)]

What the Microphone Gets • Close talking, 16 kHz, quiet surroundings • Mic 6” away, 8 kHz, noisy environment

Modeling the Dynamics of Speech
• Transitions between Voiced/Unvoiced (V/UV)
• Consistent within speech despite features
• Different transitions for speech/non-speech
• The “linked” HMM (LHMM): (Saul and Jordan ’95)
[Figure: graphical models of the HMM and the LHMM. Chains: speech vs. non-speech (S_t), voiced vs. unvoiced (V_t), observations (O_t)]

LHMM: Computational Complexity
• Cliques of 3
• Cost of exact inference (per timestep), LHMM vs. HMM
• For binary states: 36 vs. 12 operations
[Figure: graphical models of the HMM (V_t, O_t) and the LHMM (S_t, V_t, O_t)]

Features: Spectral Entropy
• Spectral Entropy
• Higher values => spectrum is more “random”
[Figure panels: Voiced, Unvoiced]
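
The slide's own formula did not survive extraction. As a rough illustration only, the sketch below uses the common definition of spectral entropy: normalize the power spectrum of a frame so it sums to one, then take its Shannon entropy. The function names, the naive DFT, and the choice of bits as the unit are assumptions made for this example, not the authors' implementation.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(frame):
    """Shannon entropy (bits) of the power spectrum normalized to sum to 1."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        return 0.0  # silent frame: no spectrum to measure
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)
```

A pure tone concentrates its power in one bin and scores near zero, while a noise-like frame spreads power across bins and scores high, which matches the slide's remark that higher values mean a more "random" spectrum.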

Features: Noisy Autocorrelation
• Normalized Autocorrelation
• Reject banded energy by adding noise to s[n]
• Use max peak, number of peaks
[Figure panels: periodic noise, before and after adding noise]
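
The two features named on this slide (max peak and number of peaks of the normalized autocorrelation, computed after adding noise to s[n]) can be sketched as follows. The normalization by zero-lag energy, the Gaussian noise, and the names `noise_std`, `seed`, and `max_lag` are assumptions for illustration; the slide's exact formula is not recoverable from the text.

```python
import random


def normalized_autocorrelation(signal, max_lag):
    """Autocorrelation for lags 0..max_lag, divided by the zero-lag energy.

    Assumes max_lag < len(signal).
    """
    energy = sum(x * x for x in signal)
    if energy == 0.0:
        return [0.0] * (max_lag + 1)
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def autocorr_features(signal, max_lag, noise_std=0.0, seed=0):
    """Return (max peak value, number of peaks) of the noisy autocorrelation."""
    rng = random.Random(seed)
    # Adding noise raises the zero-lag energy and so suppresses weak peaks.
    noisy = [x + rng.gauss(0.0, noise_std) for x in signal]
    ac = normalized_autocorrelation(noisy, max_lag)
    # A peak is a local maximum at a nonzero lag.
    peaks = [ac[k] for k in range(1, max_lag)
             if ac[k] > ac[k - 1] and ac[k] >= ac[k + 1]]
    return (max(peaks) if peaks else 0.0), len(peaks)
```

A strongly periodic (voiced-like) frame gives a few tall peaks at multiples of the pitch period; a noise-like frame gives many small ones.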

Performance
• Example
• Versus HMM (-14 dB of noise, not shown)
[Figure panels: LHMM, HMM]

Performance: Noise
[Figure: speech/non-speech error and voiced/unvoiced error vs. SSNR (dB). Annotations: < 4% error @ 0 dB; < 2% error @ 10 dB]

Performance: Distance from Microphone
[Figure: speech/non-speech error and voiced/unvoiced error vs. distance from mic (feet). Annotation on both panels: < 10% error @ 20 ft]

More Features: Regularized Energy
[Figure: raw energy vs. regularized energy, at -13 dB and +20 dB]

Estimating Speaking Rate
• Productive Segments: following (Pfau and Ruske 1998)
• But: only measure within speech segments
[Figure: articulation rate (segs/sec) vs. passage length (seconds), for a passage read in 21 seconds and in 46 seconds]

Speaker Segmentation
• Regularize energy ratio over voicing segments
• 6” from mic, 2’ of separation => 4:1 mixing ratio
• Regularized log energy ratio
[Figure: segmentation performance in noise. Panels: segmentation of speakers 1 and 2 with raw energy (-15 dB, still using V/UV) and with regularized energy (-15 dB)]

Segmenting Speakers: “Real World”
• Two subjects wearing sociometers
• 4 feet of separation, 6 feet from interfering speaker
[Figure: ROC curves for one sociometer (1 mic) and two sociometers (2 mics), each comparing regularized energy (Reg) with raw energy (Raw, still using V/UV)]

Finding Conversations
• Consider two voice segment streams
• How tightly synchronized are they?
• Alignment measure based on Mutual Information
[Figure panels: 1.6 seconds, 16 seconds, 2.5 minutes, 30 minutes; k = 7500 (2 minutes)]
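
To make the idea of a mutual-information alignment measure concrete, here is a minimal sketch that computes the plain mutual information between two equal-rate binary voicing streams over one window. The slide does not give the exact measure, so this estimator and the function name `mutual_information` are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(stream_a, stream_b):
    """Mutual information (bits) between two aligned discrete streams."""
    n = min(len(stream_a), len(stream_b))
    if n == 0:
        return 0.0
    a, b = stream_a[:n], stream_b[:n]
    joint = Counter(zip(a, b))      # counts of (a_t, b_t) pairs
    count_a = Counter(a)            # marginal counts for stream A
    count_b = Counter(b)            # marginal counts for stream B
    mi = 0.0
    for (va, vb), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[va] / n) * (count_b[vb] / n)))
    return mi
```

Two people in the same conversation tend to voice in alternation, so their streams score high; streams from unrelated conversations score near zero. In practice the score would be computed over a sliding window of k frames and thresholded.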

How Well Does It Work?
• Callhome (telephone) conversations
• Data: 5 hours of conversational speech
• Performance in noise (two-minute segments)
• SSNR values: -- : 20 dB; O : -12.7 dB; V : -14.6 dB; + : -17.2 dB; * : -20.7 dB
• Top: PD = 0.992, PFA = 0.0075

Why Does It Work So Well?
• Voicing segs: pseudorandom bit sequence
• The conversational partner is a noisy complement
[Figure panels: aligned, random]

How About On Mixed Streams?
• Sociometer Data
• PERFECT!! (PD = 1.00, PFA = 0.00) with 15 seconds
• BUT…
[Figure panel: aligned]

Accuracy of Real-World Interaction Data
• Low consistency across subjects using survey data:
  • Both acknowledge having a conversation: 54%
  • Both acknowledge having the same number of conversations: 29%
• Per-conversation analysis not possible: one survey per day
• We thus evaluated the performance of our algorithms against hand-labeled data:
  • 4 subjects labeled 2 days’ worth of data each
  • Data was labeled in 5-minute chunks

Interaction Matrix (Conversations) Each row corresponds to a different person. The color value indicates, for each subject, the proportion of their total interactions that they have with each of the other subjects

Social Network Based on multi-dimensional scaling of geodesic distances

Effects of Distance
[Figure: probability of interaction vs. distance]
X-axis label: Distance
0: office mates
1: 1-2 offices away
2: 3-5 offices away
3: offices on the same floor
4: offices separated by a floor
5: offices separated by two floors

Within/Cross Group Interactions
[Figure: fraction of interaction]

Identifying Prominent People
• Betweenness centrality: based on how often one lies in between other individuals in the network
• g_jk is the number of geodesics linking actors j and k, and all geodesics are equally likely to be chosen. If individual i is involved in g_jk(n_i) of the geodesics between j and k, the betweenness for i is calculated as follows:
  C_B(n_i) = \sum_{j<k} g_{jk}(n_i) / g_{jk}
• Individuals with high betweenness play a role in keeping the community connected, and removing someone who has high betweenness can result in isolated subgroups.
• A measure to estimate how much control an individual has over the interaction of other individuals who are not directly connected
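
The betweenness definition on this slide (the fraction of geodesics between j and k that pass through i, summed over all pairs) can be computed for an unweighted, undirected network roughly as follows. This is a sketch using the standard breadth-first accumulation scheme; the adjacency-dictionary input format and the function name are assumptions for illustration, and the slides do not say how the authors computed it.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adjacency: dict mapping each node to a list of its neighbors.
    """
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                             # nodes in BFS visiting order
        preds = {v: [] for v in adjacency}     # predecessors on geodesics
        sigma = {v: 0 for v in adjacency}      # number of geodesics from source
        dist = {v: -1 for v in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate each node's share of the geodesics from this source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Each unordered pair (j, k) was counted from both endpoints.
    return {v: s / 2.0 for v, s in scores.items()}
```

For a three-person chain a, b, c, only b lies between the other two, so b scores 1 and the endpoints score 0.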

Betweenness Centrality of Participants Betweenness centrality of individuals in the interaction network

Beyond Overall Network Characteristics: Exploring the Dynamics of Interaction
Moving from who to how
[Figure: turn-taking matrix. Entries: probability of person holding turn; probability of person giving up turn to conversation partner; probability of conversation partner giving up turn; probability of conversation partner holding turn]

Turn-taking Matrix
• Person A converses with a given conversation partner. We can estimate:
  • Turn-taking matrix for A
  • And turn-taking matrix for the partner
[Figure: two turn-taking matrices, one for Person A’s turn-taking behavior and one for the partner’s. Entries of each: probability of person holding turn; probability of person giving up turn to conversation partner; probability of conversation partner giving up turn; probability of conversation partner holding turn]
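
One simple way to estimate such a turn-taking matrix is to treat the sequence of who holds the turn as a two-state Markov chain and count transitions. The sketch below assumes the input is a list of speaker labels, one per time step; that representation and the function name are assumptions made for illustration, not the authors' procedure.

```python
def turn_taking_matrix(turns, speakers=("A", "B")):
    """Estimate P(next turn holder | current turn holder) from a label sequence.

    Returns a nested dict: matrix[current][next] is a probability.
    matrix["A"]["A"] is A holding the turn, matrix["A"]["B"] is A giving it up.
    """
    counts = {s: {t: 0 for t in speakers} for s in speakers}
    for prev, cur in zip(turns, turns[1:]):
        counts[prev][cur] += 1
    matrix = {}
    for s in speakers:
        total = sum(counts[s].values())
        # Rows with no observations are left at zero rather than guessed.
        matrix[s] = {t: (counts[s][t] / total if total else 0.0)
                     for t in speakers}
    return matrix
```

For the sequence A, A, A, B, B, A, person A holds the turn in two of three transitions and gives it up in one, while B holds once and gives up once.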

Mixture of Speaker Dynamics
• When two people interact, do they affect each other’s interaction style?
• If they do, how do we model the effect?
[Figure: Person A’s dynamics as a mixture of A’s “average-self” (weight α_AA) and B’s “average partner” (weight α_AB); Person B’s dynamics as a mixture of B’s “average-self” (weight α_BB) and A’s “average partner” (weight α_BA)]
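
As a hedged illustration of the mixture idea on this slide, the sketch below blends a person's "average-self" turn-taking matrix with the other speaker's "average partner" matrix using a single mixing weight. It assumes the two weights sum to one, so one parameter (here called `alpha`, a hypothetical name) suffices, and that both matrices are row-stochastic lists of lists. How the weights are learned is not covered here.

```python
def mix_turn_taking(self_matrix, partner_matrix, alpha):
    """Return alpha * self_matrix + (1 - alpha) * partner_matrix, entry by entry.

    alpha = 1 means the person behaves exactly like their average self;
    alpha = 0 means their behavior is fully determined by the partner's average.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * s + (1.0 - alpha) * p for s, p in zip(self_row, partner_row)]
        for self_row, partner_row in zip(self_matrix, partner_matrix)
    ]
```

Because each input row sums to one and the weights are convex, each row of the result also sums to one, so the mixture is itself a valid turn-taking matrix.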

Does Mixing Speaker Dynamics Lead to a Better Model?
• Using eighty different conversations; average conversation duration 5 minutes
• Compared the KL divergence of true model vs. average speaker model with that of true model vs. mixture model
• KL divergence reduced by 32%
• The mixture model is a statistically significantly better model (F-test, p<0.0001)
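
The slide does not say how the KL divergence between turn-taking models was computed. One plausible reading, sketched below under that assumption, is to take the KL divergence between corresponding rows of the two transition matrices and average over rows. The function names and the unweighted row average are choices made for this example, and all model probabilities are assumed to be strictly positive.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions; assumes q[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def transition_kl(true_matrix, model_matrix):
    """Average row-wise KL divergence between two transition matrices."""
    rows = list(zip(true_matrix, model_matrix))
    return sum(kl_divergence(t, m) for t, m in rows) / len(rows)
```

Comparing `transition_kl(true, average_speaker)` with `transition_kl(true, mixture)` over many conversations gives the kind of relative reduction the slide reports.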

Who are the people with large α values?
Influence values calculated for a subset of users: people who interacted with at least 4 different people more than once.

Correlating Influence Values with Centrality Scores
• Correlation: 0.90, p<0.0004
• This opens the possibility that a person’s style during one-on-one conversations may be indicative of the person’s overall prominence in the network.
• Betweenness centrality indices best measure which individuals in the network are most frequently viewed as leaders (Freeman, L.C., Roeder, D., and Mulholland, R.R., Centrality in social networks: II. Experimental results. Social Networks, 1980. 2: p. 119-141.)

Future Work
• Can we find quantitative effects of other aspects of conversational behavior on conversational partners in terms of pitch, speaking rate, and dominance as well as turn taking?
• Can we make finer distinctions between individual relationships (family members vs. friends, etc.)?
• Can we infer classes of relationships in an unsupervised manner?
[Figure: conversation between two parents and their daughter, showing differences in dominance and style. Panels: daughter talking to father, daughter talking to mother]
