This talk discusses the problem of understanding errors in spoken language interfaces and proposes approaches for increasing robustness through interaction, recognition improvement, and recovery strategies.
constructing accurate beliefs in task-oriented spoken dialog systems

Dan Bohus
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
dbohus@cs.cmu.edu | www.cs.cmu.edu/~dbohus
problem

spoken language interfaces lack robustness when faced with understanding errors
• errors stem mostly from speech recognition
• typical word error rates: 20-30%
• significant negative impact on interactions
more concretely …

S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I’m not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I’m still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm arrives Seoul at ………
two types of understanding errors (illustrated in the dialog above)

• NON-understanding: the system fails to extract any meaning from the user’s turn (e.g. the first two “What city are you leaving from?” exchanges)
• MIS-understanding: the system extracts an incorrect meaning (e.g. “Huntsville” recognized as [SEOUL])
approaches for increasing robustness • gracefully handle errors through interaction • improve recognition • detect the problems • develop a set of recovery strategies • know how to choose between them (policy)
six not-so-easy pieces …
• misunderstandings: detection, strategies, policy
• non-understandings: detection, strategies, policy
today’s talk …
• construct more accurate beliefs by integrating information over multiple turns in a conversation

S: Where would you like to go?
U: Huntsville [SEOUL / 0.65]
destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
U: no no I’m traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
belief updating: problem statement
• given
• an initial belief Pinitial(C) over concept C
• a system action SA
• a user response R
• construct an updated belief
• Pupdated(C) ← f (Pinitial(C), SA, R)

destination = {seoul/0.65}
S: traveling to Seoul. What day did you need to travel?
[THE TRAVELING TO BERLIN P_M / 0.60]
destination = {?}
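The update equation above can be sketched in code. This is a minimal illustration only, not the system's actual implementation; the dict-based belief representation, the `model` callable, and all names are assumptions made for the example.

```python
# Sketch of P_updated(C) <- f(P_initial(C), SA, R).
# A belief over a concept is a dict {value: probability}; `model`
# scores how likely each candidate value is to be correct given its
# prior probability, the system action, and the user response.

def update_belief(initial_belief, system_action, user_response, model):
    updated = {
        value: model(prob, system_action, user_response, value)
        for value, prob in initial_belief.items()
    }
    # renormalize so the updated belief is again a distribution
    total = sum(updated.values())
    return {value: score / total for value, score in updated.items()}
```

For instance, a toy `model` that boosts the score of a value matching the recognized user response would push destination = {seoul/0.65} toward 1 after an apparent confirmation, or away from it after an apparent correction.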
outline • related work • a restricted version • data • user response analysis • experiments and results • current and future work
current solutions
• most systems only track values, not beliefs
• new values overwrite old values
• use confidence scores
• explicit confirm: yes → trust hypothesis; no → delete hypothesis; “other” → non-understanding
• implicit confirm: not much

“users who discover errors through incorrect implicit confirmations have a harder time getting back on track” [Shin et al, 2002]

related work : restricted version : data : user response analysis : results : current and future work
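The heuristic just described (trust on “yes”, delete on “no”, non-understanding otherwise) can be written down directly; a sketch in Python, with hypothetical names:

```python
def heuristic_update(hypothesis, confidence, response_type):
    """Typical heuristic update after an explicit confirmation:
    yes -> trust the hypothesis, no -> delete it, anything else is
    treated as a non-understanding (belief left unchanged)."""
    if response_type == "yes":
        return hypothesis, 1.0
    if response_type == "no":
        return None, 0.0
    return hypothesis, confidence
```

Note what this rule cannot express: degrees of evidence. A hesitant “yes” and a confident “yes” produce the same posterior, which is exactly the limitation a data-driven belief update addresses.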
confidence / detecting misunderstandings
• traditionally focused on word-level errors [Chase, Cox, Bansal, Ravishankar, and many others]
• recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
• machine learning approach: binary classification
• in-domain, labeled dataset
• features from different knowledge sources
• acoustic, language model, parsing, dialog management
• ~50% relative reduction in classification error
detecting corrections
• detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
• machine learning approach: binary classification
• in-domain, labeled dataset
• features from different knowledge sources
• acoustic, prosody, language model, parsing, dialog management
• ~50% relative reduction in classification error
integration
• confidence annotation and correction detection are useful tools
• but separately, neither solves the problem
• bring the two together in a unified approach that accurately tracks beliefs
belief updating: general form • given • an initial belief Pinitial(C) over concept C • a system action SA • a user response R • construct an updated belief • Pupdated(C) ← f (Pinitial(C), SA, R)
two simplifications
1. belief representation
• system unlikely to “hear” more than 3 or 4 values for a concept within a dialog session
• in our data [considering only the top hypothesis from recognition]: max = 3 conflicting values heard; in only 6.9% of cases was more than one value heard
• compressed beliefs: top-K concept hypotheses + other
• for now, K=1
2. updates following system confirmation actions
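The compressed representation (top-K hypotheses + other) can be sketched as a small helper; illustrative only, with hypothetical names:

```python
def compress_belief(belief, k=1):
    """Keep the k most probable hypotheses and lump the remaining
    probability mass into a catch-all 'other' entry."""
    ranked = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)
    compressed = dict(ranked[:k])
    compressed["other"] = sum(prob for _, prob in ranked[k:])
    return compressed
```

With K=1, a belief like {boston/0.65; austin/0.11; seoul/0.04} compresses to {boston/0.65; other/0.15}.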
belief updating: reduced version
• given
• an initial confidence score for the current top hypothesis Confinit(thC) for concept C
• a system confirmation action SA
• a user response R
• construct an updated confidence score for that hypothesis
• Confupd(thC) ← f (Confinit(thC), SA, R)

{boston/0.65; austin/0.11; … } + ExplicitConfirm( Boston ) + [NOW] → {boston/ ?}
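In the reduced version, the update is just a function from one confidence score (plus action and response features) to another. A sketch using a fixed logistic form; the feature names and weights here are made up for illustration:

```python
import math

def update_confidence(conf_init, features, weights, bias):
    """Conf_upd(th_C) <- f(Conf_init(th_C), SA, R): map the initial
    confidence plus response features through a logistic model to a
    new probability that the top hypothesis is correct."""
    z = bias + weights["conf"] * conf_init
    for name, value in features.items():
        z += weights.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-z))
```

A higher initial confidence or a positive response feature (e.g. an apparent “yes”) raises the updated score; the output always stays a valid probability.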
data
• collected with RoomLine, a phone-based mixed-initiative spoken dialog system for conference room reservation
• explicit and implicit confirmations
• confidence threshold model (+ some exploration)
• unplanned implicit confirmations
• e.g. “I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?”
corpus • user study • 46 participants (naïve users) • 10 scenario-based interactions each • compensated per task success • corpus • 449 sessions, 8848 user turns • orthographically transcribed • manually annotated • misunderstandings • corrections • correct concept values
user response types • following [Krahmer and Swerts, 2000] • study on Dutch train-table information system • 3 user response types • YES: yes, right, that’s right, correct, etc. • NO: no, wrong, etc. • OTHER • cross-tabulated against correctness of system confirmations
~10% user responses to explicit confirmations [numbers in brackets from Krahmer&Swerts]
other responses to explicit confirmations
• ~70% of users repeat the correct value
• ~15% of users don’t address the question; attempt to shift conversation focus
• how often do users correct the system?
user responses to implicit confirmations [numbers in brackets from Krahmer&Swerts]
ignoring errors in implicit confirmations
• explanation
• users correct later (40% of 118)
• users interact strategically / correct only if essential
• how often do users correct the system?
machine learning approach • problem: Confupd(thC) ← f (Confinit(thC), SA, R) • need good probability outputs • low cross-entropy between model predictions and reality • logistic regression • sample efficient • stepwise approach → feature selection • logistic model tree for each action • root splits on response-type
features. target. • target: was the top hypothesis correct?
baselines • initial baseline • accuracy of system beliefs before the update • heuristic baseline • accuracy of heuristic update rule used by the system • oracle baseline • accuracy if we knew exactly what the user said
results: explicit confirmation

                      hard error (%)   soft error
initial                    31.15          0.51
heuristic                   8.41          0.19
logistic model tree         3.57          0.12
oracle                      2.71          n/a
results: implicit confirmation

                      hard error (%)   soft error
initial                    30.40          0.67
heuristic                  23.37          0.61
logistic model tree        16.15          0.43
oracle                     15.33          n/a
results: unplanned implicit confirmation

                      hard error (%)   soft error
initial                    15.40          0.46
heuristic                  14.36          0.43
logistic model tree        12.64          0.34
oracle                     10.37          n/a
informative features • initial confidence score • prosody features • barge-in • expectation match • repeated grammar slots • concept identity
summary
• data-driven approach for constructing accurate system beliefs
• integrates information across multiple turns
• brings together detection of misunderstandings and detection of corrections
• performs better than current heuristics
• user response analysis: users don’t correct unless the error is critical
current extensions
• belief representation: top hypothesis + other → k hypotheses + other
• model: logistic regression → multinomial GLM
• system action: confirmation actions → all actions (explicit/implicit confirmation, request, unexpected)
• features: added priors
2 hypotheses + other

[bar charts of hard error rates for the initial, heuristic, lmt(basic), lmt(basic+concept), and oracle baselines, across five system actions: explicit confirmation, implicit confirmation, unplanned implicit confirmation, unexpected update, and request]
other work (spanning both misunderstandings and non-understandings)
• detection: belief updating [ASRU-05]; costs for errors; rejection threshold adaptation; non-understanding impact on performance [Interspeech-05]; transferring confidence annotators across domains [in progress]
• strategies: comparative analysis of 10 recovery strategies [SIGdial-05]
• policy: impact of policy on performance; towards learning non-understanding recovery policies [SIGdial-05]
• RavenClaw: dialog management for task-oriented systems (RoomLine, Let’s Go Public!, Vera, LARRI, TeamTalk, Sublime) [EuroSpeech-03, HLT-05]
a more subtle caveat
• distribution of training data: confidence annotator + heuristic update rules
• distribution of run-time data: confidence annotator + learned model
• always a problem when interacting with the world!
• hopefully, the distribution shift will not cause a large degradation in performance
• remains to be validated empirically; maybe a bootstrap approach?
KL-divergence & cross-entropy
• KL divergence: D(p||q) = Σx p(x) log ( p(x) / q(x) )
• Cross-entropy: CH(p, q) = H(p) + D(p||q) = −Σx p(x) log q(x)
• with p the empirical label distribution, minimizing cross-entropy is equivalent to minimizing the negative log likelihood of the data
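The quantities on this slide, written out as code (straightforward textbook definitions, not tied to any particular toolkit):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_x p(x) * log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(p, q) = H(p) + D(p||q), equivalently -sum_x p(x) * log q(x)."""
    return entropy(p) + kl_divergence(p, q)
```

When p is the empirical distribution of the labels and q the model's predicted distribution, cross_entropy(p, q) is the average negative log likelihood of the data under the model.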
logistic regression
• regression model for binomial (binary) dependent variables
• fit by maximum likelihood (average log-likelihood); any stats package will do it for you
• no R² measure; test fit using the likelihood-ratio test
• stepwise logistic regression: keep adding variables while data likelihood increases significantly
• use the Bayesian information criterion to avoid overfitting
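The stepwise procedure can be sketched as a greedy loop over candidate features, with BIC as the stopping criterion. The `fit` callable, which should return the maximized log-likelihood of a logistic model over the given feature set, is assumed rather than implemented here:

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def stepwise_select(candidates, fit, n_samples):
    """Forward stepwise selection: repeatedly add the candidate
    feature that most improves BIC; stop when none helps."""
    selected = []
    best = bic(fit(selected), 1, n_samples)  # intercept-only model
    while True:
        best_f = None
        for f in (c for c in candidates if c not in selected):
            trial = selected + [f]
            score = bic(fit(trial), len(trial) + 1, n_samples)
            if score < best:
                best, best_f = score, f
        if best_f is None:
            return selected
        selected.append(best_f)
```

The log(n_samples) penalty per parameter is what keeps the model from absorbing weakly informative features, serving the same role as the significance threshold mentioned above.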
logistic model tree
• a regression tree, but with logistic models at the leaves
• [example tree: root splits on f (f=0 vs f=1); one branch splits further on g (g<=10 vs g>10); each leaf holds a logistic model]
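A minimal rendering of that structure (illustrative Python; the actual models were trained with standard logistic-model-tree induction, which is not reproduced here):

```python
import math

class Leaf:
    """A logistic regression model sitting at a tree leaf."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def predict(self, x):
        z = self.bias + sum(w * x.get(f, 0.0) for f, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))

class Split:
    """An internal node that routes an example on one feature,
    e.g. the root splitting on the user response type."""
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right

    def predict(self, x):
        goes_left = x.get(self.feature, 0.0) <= self.threshold
        return (self.left if goes_left else self.right).predict(x)
```

Each prediction first descends the tree on hard feature splits, then hands the example to the leaf's logistic model, mirroring the slide's picture of splits on f and g with logistic models at the leaves.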
user study • 46 participants, 1st time users • 10 scenarios, fixed order • presented graphically (explained during briefing) • participants compensated per task success