Turn-Yielding Cues in Task-Oriented Dialogue • Agustín Gravano (1,2), Julia Hirschberg (1) • (1) Columbia University, New York, USA • (2) Universidad de Buenos Aires, Argentina
Introduction Interactive Voice Response Systems • Quickly spreading. • “Uncomfortable”, “awkward”. • ASR+TTS account for most IVR problems. • Other problems revealed. • Coordination of system-user exchanges. • Long pauses after user turns; interruptions. • Modeling turn-taking behavior should lead to improved system-user coordination. Agustín Gravano SIGdial 2009
Introduction Goal • Learn when the speaker is likely to end her/his conversational turn. • Find turn-yielding cues. • Cues displayed by the speaker when approaching a potential turn boundary. • This should improve the coordination of IVRs: • Speech understanding: Detect the end of the user’s turn. • Speech generation: Display cues signalling the end of the system’s turn.
Talk Outline • Previous work • Material • Method • Results • Conclusions
Previous Work on Turn-Taking • Duncan 1972, 1973, 1974, inter alia. • Hypothesized 6 turn-yielding cues in face-to-face dialogue. • Conjectured a linear relation between the number of displayed cues and the likelihood of a turn-taking attempt. • Studies formalized and verified some of Duncan’s hypotheses. [For&Tho96; Wen&Sie03; Cut&Pea86; Wic&Cas01] • Implementations of turn-boundary detection. • Simulations [Fer&al.02,03; Edl&al.05; Sch06; Att&al.08; Bau08] • Actual systems: Let’s Go! [Rau&Esk08] • Exploiting turn-yielding cues improves performance.
Material Columbia Games Corpus • 12 task-oriented spontaneous dialogues. • Standard American English. • 13 subjects: 6 female, 7 male. • Series of collaborative computer games. • No eye contact. No speech restrictions. • 9 hours of dialogue. • Manual orthographic transcription, alignment. • Manual prosodic annotations (ToBI).
Material Columbia Games Corpus • [Figure: Player 1: Describer; Player 2: Follower.]
Turn-Yielding Cues • Cues displayed by the speaker when approaching a potential turn boundary.
Turn-Yielding Cues Method • [Diagram: Speaker A produces IPU1 and IPU2, separated by a Hold; Speaker B produces IPU3, which follows IPU2 after a Smooth switch.] • IPU (Inter-Pausal Unit): Maximal sequence of words from the same speaker surrounded by silence ≥ 50 ms. • Smooth switch: Speaker A finishes her utterance; speaker B takes the turn with no overlapping speech. • Trained annotators distinguished Smooth switches from Interruptions and Backchannels using a scheme based on Ferguson 1977 and Beattie 1982.
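The IPU definition above lends itself to a small sketch. The Python below is an illustration, not the authors' tooling: the `Word` and `IPU` records, the function names, and the transition labels are assumptions. It groups one speaker's time-aligned words into IPUs using the 50 ms silence threshold and gives a rough label to the transition between two consecutive IPUs. In the corpus itself, smooth switches, interruptions, and backchannels were told apart by trained annotators, so a switch without overlap is only a candidate smooth switch here.

```python
from dataclasses import dataclass

MIN_SILENCE = 0.050  # seconds; the 50 ms threshold from the IPU definition


@dataclass
class Word:  # hypothetical record for one time-aligned word
    speaker: str
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class IPU:  # hypothetical record for one inter-pausal unit
    speaker: str
    words: list
    start: float
    end: float


def segment_ipus(words, min_silence=MIN_SILENCE):
    """Group ONE speaker's words into IPUs separated by silence >= min_silence."""
    ipus = []
    for w in sorted(words, key=lambda w: w.start):
        if ipus and w.start - ipus[-1].end < min_silence:
            # Gap shorter than the threshold: the word extends the current IPU.
            ipus[-1].words.append(w.text)
            ipus[-1].end = w.end
        else:
            # First word, or a long enough silence: start a new IPU.
            ipus.append(IPU(w.speaker, [w.text], w.start, w.end))
    return ipus


def label_transition(current, following):
    """Rough label for the transition between two consecutive IPUs."""
    if following.speaker == current.speaker:
        return "hold"
    if following.start >= current.end:
        # Only a candidate smooth switch: backchannels also look like this.
        return "switch-without-overlap"
    return "overlap"


speaker_a = segment_ipus([
    Word("A", "put", 0.00, 0.20),
    Word("A", "it", 0.22, 0.30),
    Word("A", "there", 0.60, 0.90),
])
speaker_b = segment_ipus([Word("B", "okay", 1.10, 1.40)])
print([ipu.words for ipu in speaker_a])
print(label_transition(speaker_a[0], speaker_a[1]))
print(label_transition(speaker_a[1], speaker_b[0]))
```

Segmenting each speaker separately keeps the definition literal ("words from the same speaker"); the two IPU streams can then be merged by start time before labeling transitions.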
Turn-Yielding Cues Method • To find turn-yielding cues, we compare: • IPUs preceding Holds, • IPUs preceding Smooth switches. • ~200 features: acoustic, prosodic, lexical, syntactic.
Turn-Yielding Cues Individual Cues • Final intonation: • Falling (L-L%) or high-rising (H-H%). • Faster speaking rate. • Reduction of final lengthening. • Lower intensity level. • Lower pitch level. • Higher jitter, shimmer, NHR (noise-to-harmonics ratio). • Related to perception of voice quality. • Longer IPU duration (seconds and #words).
Turn-Yielding Cues Individual Cues • Textual completion (independent of intonation). (1) Manually annotated a portion of the data: labelers read up to the end of a target IPU (no right context) and judged whether it could constitute a ‘complete’ utterance. 400 tokens; κ = 0.81. (2) Trained an SVM classifier on 19 lexical and syntactic features. Accuracy: 80%; majority-class baseline: 55%; human agreement: 91%. (3) Labeled all IPUs in the corpus with the SVM model. • Result: before smooth switches, 82% of IPUs are complete and 18% incomplete; before holds, 53% are complete and 47% incomplete (χ² test, p ≈ 0).
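The inter-labeler agreement reported in step (1) is a kappa statistic. The slide does not say which variant was used or how many labelers took part, so the sketch below assumes the two-labeler case (Cohen's kappa); the function name and the toy labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who judged the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the labelers' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy judgments (hypothetical): C = complete, I = incomplete.
labeler_1 = ["C", "C", "I", "I"]
labeler_2 = ["C", "C", "I", "C"]
print(cohen_kappa(labeler_1, labeler_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because the two classes are not equally frequent.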
Turn-Yielding Cues Individual Cues • Final intonation: L-L% or H-H%. • Faster speaking rate. • Lower intensity level. • Lower pitch level. • Higher jitter, shimmer, NHR. • Longer IPU duration. • Textual completion.
Turn-Yielding Cues Defining Presence of a Cue • 2-3 representative features for each cue. • A cue counts as present when the feature value is closer to its mean before smooth switches (S), and as absent when it is closer to its mean before holds (H).
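The presence rule can be sketched for a single feature as follows. The slide does not say how the 2-3 features of one cue are combined, so this illustration stops at one feature; the function name, the tie-breaking choice, and the sample values are assumptions.

```python
from statistics import mean


def cue_present(value, values_before_switch, values_before_hold):
    """True if `value` lies closer to the mean before smooth switches (S)
    than to the mean before holds (H). Ties count as absent (an assumption)."""
    mean_s = mean(values_before_switch)
    mean_h = mean(values_before_hold)
    return abs(value - mean_s) < abs(value - mean_h)


# Hypothetical speaking-rate values (syllables per second), not corpus data.
rate_before_s = [5.0, 6.0]  # mean 5.5
rate_before_h = [3.0, 4.0]  # mean 3.5
print(cue_present(5.0, rate_before_s, rate_before_h))
print(cue_present(3.9, rate_before_s, rate_before_h))
```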
Top Frequencies of Complex Cues • Notation: a digit means the cue is present; a dot means the cue is absent. • Turn-yielding cues: 1: Final intonation; 2: Speaking rate; 3: Intensity level; 4: Pitch level; 5: IPU duration; 6: Voice quality; 7: Completion.
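The digit/dot notation can be generated mechanically. The sketch below assumes one position per cue in the numbered order given on the slide; the function names and the example IPUs are illustrative, not taken from the corpus.

```python
from collections import Counter

# Cue numbering as listed on the slide.
CUE_NAMES = {
    1: "Final intonation",
    2: "Speaking rate",
    3: "Intensity level",
    4: "Pitch level",
    5: "IPU duration",
    6: "Voice quality",
    7: "Completion",
}


def encode_cues(present):
    """Render a set of present cue numbers as a 7-character digit/dot string."""
    return "".join(str(i) if i in present else "." for i in sorted(CUE_NAMES))


def decode_cues(code):
    """Inverse of encode_cues: recover the set of present cue numbers."""
    return {i for i, ch in zip(sorted(CUE_NAMES), code) if ch != "."}


# Hypothetical IPUs, each described by the set of cues it displays.
ipus = [{1, 2, 3, 5, 7}, {1, 2, 3, 5, 7}, {2, 7}, set()]
frequencies = Counter(encode_cues(cues) for cues in ipus)
print(frequencies.most_common())
```

Counting the encoded strings gives the frequency of each complex cue, that is, of each exact combination of individual cues.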
Turn-Yielding Cues Combined Cues • [Figure: percentage of turn-taking attempts (y-axis) against number of cues conjointly displayed (x-axis); r² = 0.969.]
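An r² value like the one on this slide can be obtained from a least-squares line. The slide gives only the r² figure, so the sketch below assumes an ordinary least-squares fit of percentage against cue count; the data points are placeholders chosen to lie exactly on a line, not the values behind the plot.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder data: number of cues displayed vs. percentage of attempts.
num_cues = [0, 1, 2, 3, 4, 5, 6, 7]
pct_attempts = [5.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0]
slope, intercept, r2 = linear_fit(num_cues, pct_attempts)
print(slope, intercept, r2)
```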
Turn-Yielding Cues IVR Systems • After each IPU from the user: if the estimated likelihood that the user has ended the turn exceeds a threshold, take the turn. • To signal the end of the system’s turn: include as many cues as possible in the system’s final IPU.
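The decision rule for the user's side can be sketched as below. The slide does not specify how the likelihood is estimated; this illustration assumes a linear function of the number of cues detected in the user's last IPU, and the slope, intercept, and threshold values are placeholders.

```python
def estimate_likelihood(num_cues, slope, intercept):
    """Assumed linear estimate of the turn-end likelihood, clipped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * num_cues))


def should_take_turn(num_cues, threshold, slope, intercept):
    """Decision rule from the slide: take the turn if likelihood > threshold."""
    return estimate_likelihood(num_cues, slope, intercept) > threshold


# Placeholder parameters, not values from the study.
SLOPE, INTERCEPT, THRESHOLD = 0.1, 0.05, 0.5
for cues_detected in (2, 7):
    print(cues_detected, should_take_turn(cues_detected, THRESHOLD, SLOPE, INTERCEPT))
```

The threshold trades off the two problems named in the introduction: a high value risks long pauses after user turns, a low value risks interruptions.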
Summary • Study of turn-yielding cues. • Objective, automatically computable. • Combined cues. • Improve turn-taking decisions of IVR systems. • Results drawn from task-oriented dialogues. • Not necessarily generalizable. • Suitable for most IVR domains. • Interspeech 2009: Study of backchannel-inviting cues.
Special thanks to… • Julia Hirschberg • Thesis Committee Members • Maxine Eskenazi, Kathy McKeown, Becky Passonneau, Amanda Stent. • Speech Lab at Columbia University • Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob Coyne, Frank Enos, Martin Jansche, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg. • Collaborators • Gregory Ward and Elisa Sneed German (Northwestern U); Ani Nenkova (UPenn); Héctor Chávez, David Elson, Michel Galley, Enrique Henestroza, Hanae Koiso, Shira Mitchell, Michael Mulley, Kristen Parton, Ilia Vovsha, Lauren Wilcox.