Belief Updating in Spoken Dialog Systems
Dan Bohus (dbohus@cs.cmu.edu, www.cs.cmu.edu/~dbohus)
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
problem: spoken language interfaces lack robustness when faced with understanding errors • stems mostly from speech recognition • spans most domains and interaction types
more concretely …
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at …
non- and misunderstandings (the same dialog, annotated): the rejected "Urbana Champaign" turns are NON-understandings (the system fails to obtain any interpretation of the user's turn); the "Huntsville" → Seoul and "Birmingham" → Berlin turns are MIS-understandings (the system obtains an incorrect interpretation).
approaches for increasing robustness • fix recognition • gracefully handle errors through interaction: detect the problems, develop a set of recovery strategies, know how to choose between them (policy)
six not-so-easy pieces … • misunderstandings: detection, strategies, policy • non-understandings: detection, strategies, policy
belief updating
• construct more accurate beliefs by integrating information over multiple turns
S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief updating: problem statement
• given: an initial belief P_initial(C) over concept C, a system action SA, a user response R
• construct an updated belief: P_updated(C) ← f(P_initial(C), SA, R)
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
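To make the shape of the problem concrete, here is a minimal, hypothetical Python sketch of the update interface; the type and field names are illustrative assumptions, not taken from the actual system described in the talk.

```python
from dataclasses import dataclass
from typing import Dict

# A belief over a concept: candidate value -> probability, e.g. {"seoul": 0.65}
Belief = Dict[str, float]

@dataclass
class SystemAction:
    kind: str      # e.g. "explicit_confirm" or "implicit_confirm" (illustrative labels)
    concept: str   # e.g. "destination"
    value: str     # the hypothesized value being confirmed, e.g. "seoul"

@dataclass
class UserResponse:
    decoded: str       # recognizer output, e.g. "THE TRAVELING TO BERLIN P_M"
    confidence: float  # semantic confidence score, e.g. 0.60

def update_belief(p_initial: Belief, sa: SystemAction, r: UserResponse) -> Belief:
    """P_updated(C) <- f(P_initial(C), SA, R); the talk's goal is to learn f from data.
    Returning the belief unchanged corresponds to the 'initial baseline' used later."""
    return dict(p_initial)
```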
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
confidence annotation + heuristic updates • confidence annotation • traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar] • more recently: semantic confidence annotation [Walker, San-Segundo, Bohus] • machine learning approach • results fairly good, but not perfect • heuristic updates • explicit confirmation: no → don't trust; yes → trust • implicit confirmation: no → don't trust; otherwise → trust • suboptimal for several reasons
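The heuristic rules can be written down directly. This is a small hypothetical sketch, with "trust" standing in for setting the confidence high and "don't trust" for setting it low; the slide does not specify what the heuristic does for responses other than yes/no after an explicit confirmation, so that case is left as an error.

```python
def heuristic_trust(action: str, response_type: str) -> bool:
    """Heuristic belief update: should the system trust the confirmed value?"""
    if action == "explicit_confirm":
        if response_type == "YES":
            return True              # yes -> trust
        if response_type == "NO":
            return False             # no -> don't trust
        raise ValueError("OTHER response after explicit confirmation: not covered by the heuristic")
    if action == "implicit_confirm":
        return response_type != "NO" # no -> don't trust; otherwise -> trust
    raise ValueError(f"no heuristic defined for action {action!r}")
```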
correction detection • detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow] • machine learning approach • features from different knowledge sources in the system • results fairly good, but not perfect
integration • confidence annotation and correction detection are useful tools • but separately, neither solves the problem • bring them together in a unified approach to accurately track beliefs
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
belief updating: general form • given: an initial belief P_initial(C) over concept C, a system action SA, a user response R • construct an updated belief: P_updated(C) ← f(P_initial(C), SA, R)
restricted version: 2 simplifications • compact belief: the system is unlikely to "hear" more than 3 or 4 values; single vs. multiple recognition results (in our data: max = 3 values, only 6.9% have >1 value) → keep only the confidence score of the top hypothesis • updates only after confirmation actions • reduced problem: ConfTop_updated(C) ← f(ConfTop_initial(C), SA, R)
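A rough sketch of the first simplification, under assumed (not actual) data structures: the full belief is collapsed to the top hypothesis and its confidence score, with the leftover mass lumped into an implicit "other" outcome.

```python
from typing import Dict, Tuple

def compress(belief: Dict[str, float]) -> Tuple[str, float]:
    """Compact belief: keep only the top hypothesis and its confidence score."""
    top_value = max(belief, key=belief.get)
    return top_value, belief[top_value]

def expand(top_value: str, conf_top: float) -> Dict[str, float]:
    """Re-expand after the update: remaining mass goes to an implicit 'other' outcome."""
    return {top_value: conf_top, "<other>": 1.0 - conf_top}

# e.g. compress({"seoul": 0.65}) -> ("seoul", 0.65); the reduced problem then updates
# only this score: ConfTop_updated(C) = f(ConfTop_initial(C), SA, R)
```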
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
data • collected with RoomLine: a phone-based mixed-initiative spoken dialog system for conference room reservation (search and negotiation) • explicit and implicit confirmations • confidence threshold model (+ some exploration) • unplanned implicit confirmations: "I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?"
corpus • user study • 46 participants (naïve users) • 10 scenario-based interactions each • compensated per task success • corpus • 449 sessions, 8848 user turns • orthographically transcribed • rich annotation: correct concepts, corrections, etc.
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
user response types • following Krahmer and Swerts: study on a Dutch train timetable information system • 3 user response types: YES (yes, right, that's right, correct, etc.), NO (no, wrong, etc.), OTHER • cross-tabulated against correctness of confirmations
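A coarse version of this typing is easy to sketch. This is hypothetical code using only the markers listed on the slide; a real system would work from the parse rather than raw string matching.

```python
YES_MARKERS = ("yes", "right", "that's right", "correct")
NO_MARKERS = ("no", "wrong")

def response_type(utterance: str) -> str:
    """Type a user response to a confirmation as YES, NO, or OTHER."""
    u = utterance.lower().strip()
    if any(u == m or u.startswith(m + " ") for m in NO_MARKERS):
        return "NO"
    if any(u == m or u.startswith(m + " ") for m in YES_MARKERS):
        return "YES"
    return "OTHER"
```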
~10% user responses to explicit confirmations • from transcripts [numbers in brackets from Krahmer & Swerts] • from decoded
other responses to explicit confirmations • ~70% users repeat the correct value • ~15% users don't address the question • attempt to shift conversation focus
user responses to implicit confirmations • transcripts [numbers in brackets from Krahmer & Swerts] • decoded
ignoring errors in implicit confirmations • users correct later (40% of 118) • users interact strategically • correct only if essential
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
machine learning approach • need good probability outputs • low cross-entropy between model predictions and reality • cross-entropy = negative average log posterior • logistic regression • sample efficient • stepwise approach → feature selection • logistic model tree for each action • root splits on response-type
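For reference, a small generic sketch of the metric and model form named above; this is not the talk's implementation, and the stepwise feature selection and the logistic model tree structure are not shown.

```python
import math
from typing import List

def cross_entropy(p_correct: List[float], is_correct: List[bool]) -> float:
    """Cross-entropy = negative average log posterior assigned to the true outcome."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(p_correct, is_correct):
        total += math.log(max(p if y else 1.0 - p, eps))
    return -total / len(p_correct)

def logistic_confidence(weights: List[float], features: List[float], bias: float = 0.0) -> float:
    """Logistic regression: P(top hypothesis is correct | features)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))
```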
features and target • initial situation: initial confidence score; concept identity, dialog state, turn number • system action: other actions performed in parallel • features of the user response: acoustic/prosodic features, lexical features, grammatical features, dialog-level features • target: was the value correct?
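As a toy illustration of assembling one training example (hypothetical feature names; the actual feature set in the talk is richer and extracted from the system's knowledge sources):

```python
from typing import Any, Dict

def make_example(conf_initial: float, concept_id: str, turn_number: int,
                 action_kind: str, decoded: str, barge_in: bool,
                 response_type: str, value_was_correct: bool) -> Dict[str, Any]:
    """One (features, target) pair; names are illustrative, not the exact feature set."""
    features = {
        "initial_confidence": conf_initial,   # initial situation
        "concept": concept_id,
        "turn_number": turn_number,
        "system_action": action_kind,         # which confirmation action was taken
        "barge_in": barge_in,                 # dialog-level feature
        "num_words": len(decoded.split()),    # crude lexical feature
        "response_type": response_type,       # YES / NO / OTHER
    }
    return {"features": features, "target": value_was_correct}
```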
baselines • initial baseline • accuracy of system beliefs before the update • heuristic baseline • accuracy of heuristic rule currently used in the system • oracle baseline • accuracy if we knew exactly when the user is correcting the system
results: explicit confirmation [chart: hard error (%) and soft error]
results: implicit confirmation [chart: hard error (%) and soft error]
results: unplanned implicit confirmation [chart: hard error (%) and soft error]
informative features • initial confidence score • prosody features • barge-in • expectation match • repeated grammar slots • concept id
outline • related work • a restricted version • data • user response analysis • experiments and results • some caveats and future work
eliminate simplification 1 • current restricted version: belief = confidence score of top hypothesis; only 6.9% of cases had more than 1 hypothesis • extend to: N hypotheses + 1 (other), where N is a small integer (2 or 3) • approach: multinomial generalized linear model • use information from multiple recognition hypotheses
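A sketch of what such an extension could look like: generic softmax (multinomial logistic) regression over N hypotheses plus a final "other" outcome. This is an assumption about the form, not the talk's implementation.

```python
import math
from typing import List

def multinomial_update(linear_scores: List[float]) -> List[float]:
    """Multinomial (softmax) regression output: an updated belief over
    the N recognition hypotheses plus a final 'other' outcome.
    linear_scores[i] = w_i . x, the linear predictor for outcome i."""
    m = max(linear_scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in linear_scores]
    z = sum(exps)
    return [e / z for e in exps]
```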
eliminate simplification 2 • current restricted version • only updates following system confirmation actions • users might correct the system at any point • extend to • updates after all system actions
shameless self promotion • rejection threshold adaptation • non-understanding impact on performance [Interspeech-05] • comparative analysis of 10 recovery strategies [SIGdial-05] • wizard experiment • towards learning non-understanding recovery policies [SIGdial-05]
shameless CMU promotion • Ananlada (Moss) Chotimongkol • automatic concept and task structure acquisition • Antoine Raux • turn-taking, conversation micro-management • Jahanzeb Sherwani • multimodal personal information management • Satanjeev Banerjee • meeting understanding • Stefanie Tomko • universal speech interface • Thomas Harris • multi-participant dialog • DoD / Young Researchers’ Roundtable
a more subtle caveat • distribution of training data: confidence annotator + heuristic update rules • distribution of run-time data: confidence annotator + learned model • always a problem when interacting with the world • hopefully, distribution shift will not cause large degradation in performance • remains to validate empirically • maybe a bootstrap approach?