This talk discusses the problem of understanding errors in spoken language interfaces and proposes approaches for increasing robustness through interaction, recognition improvement, and recovery strategies.
constructing accurate beliefs in task-oriented spoken dialog systems

Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
dbohus@cs.cmu.edu | www.cs.cmu.edu/~dbohus
problem

spoken language interfaces lack robustness when faced with understanding errors
• errors stem mostly from speech recognition
• typical word error rates: 20-30%
• significant negative impact on interactions
more concretely …

S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
two types of understanding errors (illustrated in the dialog above)

• NON-understanding: the system fails to extract any meaning from the user’s turn (e.g. the first two “What city are you leaving from?” exchanges)
• MIS-understanding: the system extracts an incorrect meaning (e.g. “Huntsville” recognized as [SEOUL])
approaches for increasing robustness • gracefully handle errors through interaction • improve recognition • detect the problems • develop a set of recovery strategies • know how to choose between them (policy)
six not-so-easy pieces …
• misunderstandings: detection, strategies, policy
• non-understandings: detection, strategies, policy
today’s talk …
• construct more accurate beliefs by integrating information over multiple turns in a conversation

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief updating: problem statement
• given
• an initial belief Pinitial(C) over concept C
• a system action SA
• a user response R
• construct an updated belief
• Pupdated(C) ← f (Pinitial(C), SA, R)

destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
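The update equation above can be sketched in code. This is a minimal illustration only, not the system's actual implementation; the dict-based belief representation, the `model` callable, and all names are assumptions made for the example.

```python
# Sketch of P_updated(C) <- f(P_initial(C), SA, R).
# A belief over a concept is a dict {value: probability}; `model`
# scores how likely each candidate value is to be correct given its
# prior probability, the system action, and the user response.

def update_belief(initial_belief, system_action, user_response, model):
    updated = {
        value: model(prob, system_action, user_response, value)
        for value, prob in initial_belief.items()
    }
    # renormalize so the updated belief is again a distribution
    total = sum(updated.values())
    return {value: score / total for value, score in updated.items()}
```

For instance, a toy `model` that boosts the score of a value matching the recognized user response would push destination = {seoul/0.65} toward 1 after an apparent confirmation, or away from it after an apparent correction.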
outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work
current solutions
• most systems only track values, not beliefs
• new values overwrite old values
• use confidence scores
• explicit confirm: yes → trust hypothesis; no → delete hypothesis; “other” → non-understanding
• implicit confirm: not much

“users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al, 2002]

related work : restricted version : data : user response analysis : results : current and future work
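The heuristic just described (trust on “yes”, delete on “no”, non-understanding otherwise) can be written down directly; a sketch in Python, with hypothetical names:

```python
def heuristic_update(hypothesis, confidence, response_type):
    """Typical heuristic update after an explicit confirmation:
    yes -> trust the hypothesis, no -> delete it, anything else is
    treated as a non-understanding (belief left unchanged)."""
    if response_type == "yes":
        return hypothesis, 1.0
    if response_type == "no":
        return None, 0.0
    return hypothesis, confidence
```

Note what this rule cannot express: degrees of evidence. A hesitant “yes” and a confident “yes” produce the same posterior, which is exactly the limitation a data-driven belief update addresses.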
confidence / detecting misunderstandings
• traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
• recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
• machine learning approach: binary classification
• in-domain, labeled dataset
• features from different knowledge sources
• acoustic, language model, parsing, dialog management
• ~50% relative reduction in classification error
detecting corrections
• detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
• machine learning approach: binary classification
• in-domain, labeled dataset
• features from different knowledge sources
• acoustic, prosody, language model, parsing, dialog management
• ~50% relative reduction in classification error
integration
• confidence annotation and correction detection are useful tools
• but separately, neither solves the problem
• bring the two together in a unified approach that accurately tracks beliefs
belief updating: general form • given • an initial belief Pinitial(C) over concept C • a system action SA • a user response R • construct an updated belief • Pupdated(C) ← f (Pinitial(C), SA, R)
two simplifications
1. belief representation
• system unlikely to “hear” more than 3 or 4 values for a concept within a dialog session
• in our data [considering only the top hypothesis from recognition]: max = 3 conflicting values heard; in only 6.9% of cases was more than one value heard
• compressed beliefs: top-K concept hypotheses + other
• for now, K=1
2. updates following system confirmation actions
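The compressed representation (top-K hypotheses + other) can be sketched as a small helper; illustrative only, with hypothetical names:

```python
def compress_belief(belief, k=1):
    """Keep the k most probable hypotheses and lump the remaining
    probability mass into a catch-all 'other' entry."""
    ranked = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)
    compressed = dict(ranked[:k])
    compressed["other"] = sum(prob for _, prob in ranked[k:])
    return compressed
```

With K=1, a belief like {boston/0.65; austin/0.11; seoul/0.04} compresses to {boston/0.65; other/0.15}.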
belief updating: reduced version
• given
• an initial confidence score for the current top hypothesis Confinit(thC) for concept C
• a system confirmation action SA
• a user response R
• construct an updated confidence score for that hypothesis
• Confupd(thC) ← f (Confinit(thC), SA, R)

{boston/0.65; austin/0.11; … } + ExplicitConfirm( Boston ) + [NOW] → {boston/ ?}
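In the reduced version, the update is just a function from one confidence score (plus action and response features) to another. A sketch using a fixed logistic form; the feature names and weights here are made up for illustration:

```python
import math

def update_confidence(conf_init, features, weights, bias):
    """Conf_upd(th_C) <- f(Conf_init(th_C), SA, R): map the initial
    confidence plus response features through a logistic model to a
    new probability that the top hypothesis is correct."""
    z = bias + weights["conf"] * conf_init
    for name, value in features.items():
        z += weights.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-z))
```

A higher initial confidence or a positive response feature (e.g. an apparent “yes”) raises the updated score; the output always stays a valid probability.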
data
• collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
• explicit and implicit confirmations
• confidence threshold model (+ some exploration)
• unplanned implicit confirmations
• e.g. “I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?”
corpus • user study • 46 participants (naïve users) • 10 scenario-based interactions each • compensated per task success • corpus • 449 sessions, 8848 user turns • orthographically transcribed • manually annotated • misunderstandings • corrections • correct concept values
user response types • following [Krahmer and Swerts, 2000] • study on Dutch train-table information system • 3 user response types • YES: yes, right, that’s right, correct, etc. • NO: no, wrong, etc. • OTHER • cross-tabulated against correctness of system confirmations
~10% user responses to explicit confirmations [numbers in brackets from Krahmer&Swerts]
other responses to explicit confirmations
• ~70% of users repeat the correct value
• ~15% of users don’t address the question; attempt to shift conversation focus
• how often do users correct the system?
user responses to implicit confirmations [numbers in brackets from Krahmer&Swerts]
ignoring errors in implicit confirmations
• explanation
• users correct later (40% of 118)
• users interact strategically / correct only if essential
• how often do users correct the system?
machine learning approach • problem: Confupd(thC) ← f (Confinit(thC), SA, R) • need good probability outputs • low cross-entropy between model predictions and reality • logistic regression • sample efficient • stepwise approach → feature selection • logistic model tree for each action • root splits on response-type
features. target. • target: was the top hypothesis correct?
baselines • initial baseline • accuracy of system beliefs before the update • heuristic baseline • accuracy of heuristic update rule used by the system • oracle baseline • accuracy if we knew exactly what the user said
results: explicit confirmation

                      hard error (%)   soft error
initial                    31.15          0.51
heuristic                   8.41          0.19
logistic model tree         3.57          0.12
oracle                      2.71          n/a
results: implicit confirmation

                      hard error (%)   soft error
initial                    30.40          0.67
heuristic                  23.37          0.61
logistic model tree        16.15          0.43
oracle                     15.33          n/a
results: unplanned implicit confirmation

                      hard error (%)   soft error
initial                    15.40          0.46
heuristic                  14.36          0.43
logistic model tree        12.64          0.34
oracle                     10.37          n/a
informative features • initial confidence score • prosody features • barge-in • expectation match • repeated grammar slots • concept identity
summary
• data-driven approach for constructing accurate system beliefs
• integrates information across multiple turns
• brings together detection of misunderstandings and detection of corrections
• performs better than current heuristics
• user response analysis: users don’t correct unless the error is critical
current extensions
• belief representation: top hypothesis + other → k hypotheses + other
• model: logistic regression → multinomial GLM
• system action: confirmation actions → all actions (explicit/implicit confirmation, request, unexpected)
• features: added priors
2 hypotheses + other

[bar charts of hard error rates for the initial, heuristic, lmt(basic), lmt(basic+concept), and oracle baselines, across five system actions: explicit confirmation, implicit confirmation, unplanned implicit confirmation, unexpected update, and request]
other work (spanning both misunderstandings and non-understandings)
• detection: belief updating [ASRU-05]; costs for errors; rejection threshold adaptation; non-understanding impact on performance [Interspeech-05]; transferring confidence annotators across domains [in progress]
• strategies: comparative analysis of 10 recovery strategies [SIGdial-05]
• policy: impact of policy on performance; towards learning non-understanding recovery policies [SIGdial-05]
• RavenClaw: dialog management for task-oriented systems (RoomLine, Let’s Go Public!, Vera, LARRI, TeamTalk, Sublime) [EuroSpeech-03, HLT-05]
a more subtle caveat
• distribution of training data: confidence annotator + heuristic update rules
• distribution of run-time data: confidence annotator + learned model
• always a problem when interacting with the world!
• hopefully, the distribution shift will not cause a large degradation in performance
• remains to be validated empirically; maybe a bootstrap approach?
KL-divergence & cross-entropy
• KL divergence: D(p||q) = Σx p(x) log ( p(x) / q(x) )
• Cross-entropy: CH(p, q) = H(p) + D(p||q) = −Σx p(x) log q(x)
• with p the empirical label distribution, minimizing cross-entropy is equivalent to minimizing the negative log likelihood of the data
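The quantities on this slide, written out as code (straightforward textbook definitions, not tied to any particular toolkit):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(p, q) = H(p) + D(p||q), equivalently -sum_x p(x) * log q(x)."""
    return entropy(p) + kl_divergence(p, q)
```

When p is the empirical distribution of the labels and q the model's predicted distribution, cross_entropy(p, q) is the average negative log likelihood of the data under the model.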
logistic regression
• regression model for binomial (binary) dependent variables
• fit by maximum likelihood (average log-likelihood); any stats package will do it for you
• no R² measure; test fit using the likelihood-ratio test
• stepwise logistic regression: keep adding variables while data likelihood increases significantly
• use the Bayesian information criterion to avoid overfitting
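The stepwise procedure can be sketched as a greedy loop over candidate features, with BIC as the stopping criterion. The `fit` callable, which should return the maximized log-likelihood of a logistic model over the given feature set, is assumed rather than implemented here:

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def stepwise_select(candidates, fit, n_samples):
    """Forward stepwise selection: repeatedly add the candidate
    feature that most improves BIC; stop when none helps."""
    selected = []
    best = bic(fit(selected), 1, n_samples)  # intercept-only model
    while True:
        best_f = None
        for f in (c for c in candidates if c not in selected):
            trial = selected + [f]
            score = bic(fit(trial), len(trial) + 1, n_samples)
            if score < best:
                best, best_f = score, f
        if best_f is None:
            return selected
        selected.append(best_f)
```

The log(n_samples) penalty per parameter is what keeps the model from absorbing weakly informative features, serving the same role as the significance threshold mentioned above.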
logistic model tree
• a regression tree, but with logistic models at the leaves
• [example tree: root splits on f (f=0 vs f=1); one branch splits further on g (g<=10 vs g>10); each leaf holds a logistic model]
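A minimal rendering of that structure (illustrative Python; the actual models were trained with standard logistic-model-tree induction, which is not reproduced here):

```python
import math

class Leaf:
    """A logistic regression model sitting at a tree leaf."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def predict(self, x):
        z = self.bias + sum(w * x.get(f, 0.0) for f, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))

class Split:
    """An internal node that routes an example on one feature,
    e.g. the root splitting on the user response type."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

    def predict(self, x):
        goes_left = x.get(self.feature, 0.0) <= self.threshold
        return (self.left if goes_left else self.right).predict(x)
```

Each prediction first descends the tree on hard feature splits, then hands the example to the leaf's logistic model, mirroring the slide's picture of splits on f and g with logistic models at the leaves.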
user study • 46 participants, 1st time users • 10 scenarios, fixed order • presented graphically (explained during briefing) • participants compensated per task success