Sorry, I didn’t catch that …
Non-understandings and recovery in spoken dialog systems
Part I: Issues, Data Collection, Rejection Tuning
Dan Bohus
Sphinx Lunch Talk, Carnegie Mellon University, March 2005
ASR Errors & Spoken Dialog
Call RoomLine! 1-412-268-1084
Call Let’s Go! 1-412-268-1185
Non-understandings and Misunderstandings
• Recognition errors can lead to two types of problems in a spoken dialog system:

NON-understanding: the system cannot extract any meaningful information from the user’s turn
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

MIS-understanding: the system extracts incorrect information from the user’s turn
S: What city are you leaving from?
U: Birmingham [BERLIN PM]
Non-understandings and Misunderstandings
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
NON-understanding: the system cannot extract any meaningful information from the user’s turn
• How can we prevent non-understandings?
• How can we recover from them?
  • Detection
  • Set of recovery strategies
  • Policy for choosing between them
Current State of Affairs
• Detection / Diagnosis
  • Systems know when a non-understanding happens: there was a detected user turn, but no meaningful information was extracted
  • Systems decide to reject because of low confidence
  • Not much exists in terms of diagnosis
• Set of recovery strategies
  • Repeat the question
  • “Can you repeat that?”
  • “Sorry, I didn’t catch that …”
• Policy for choosing between them
  • Traditionally, simple heuristics are used
Data Collection Experiment: Questions Under Investigation
• Detection / Diagnosis
  • What are the main causes (sources) of non-understandings?
  • What is their impact on global performance?
  • Can we diagnose non-understandings at run-time?
  • Can we optimize the rejection process in a more principled way?
• Set of recovery strategies
  • What is the relative performance of different recovery strategies?
  • Can we refine current strategies and find new ones?
• Policy for choosing between them
  • Can we improve performance by making smarter choices?
  • If so, can we learn how to make these smarter choices?
Data Collection: Experimental Design
• Subjects interact over the telephone with RoomLine
• Each subject performed 10 scenario-based tasks
• Between-subjects experiment, 2 groups:
  • Control: the system uses a random (uniform) policy for engaging the non-understanding recovery strategies (see the sketch after the strategy list below)
  • Wizard: the policy is determined at runtime by a human (wizard)
• 46 subjects, balanced for gender × native / non-native speakers
Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
1. MOVE-ON: Sorry, I didn’t catch that. For which day do you need the room?
2. YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
3. TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
4. FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
5. ASK REPEAT (AREP): Could you please repeat that?
6. ASK REPHRASE (ARPH): Could you please try to rephrase that?
7. NOTIFY (NTFY): Sorry, I didn’t catch that …
8. YIELD TURN (YLD): …
9. REPROMPT (RP): For when do you need the conference room?
10. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
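As a concrete illustration of the control condition, here is a minimal sketch of a uniform-random recovery policy over the ten strategies listed above; the identifiers and function name are illustrative, not the actual RoomLine implementation.

```python
import random

# Strategy identifiers from the list above (abbreviations as on the slide).
STRATEGIES = [
    "MOVE_ON", "YCS", "TYCS", "HELP", "AREP",
    "ARPH", "NTFY", "YLD", "RP", "DRP",
]

def choose_recovery_strategy(rng: random.Random = random.Random()) -> str:
    """Control condition: pick a recovery strategy uniformly at random."""
    return rng.choice(STRATEGIES)
```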
Experimental Design: Scenarios
• 10 scenarios, fixed order
• Presented graphically (explained during briefing)
Experimental Design: Evaluation
• Participants filled in a SASSI evaluation questionnaire
  • 35 questions, 1-7 Likert scale; 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed
• Overall user satisfaction score: 1-7
• What did you like best / least?
• What would you change first?
Corpus Statistics / Characteristics
• 46 users; 449 sessions; 8278 user turns
• User utterances transcribed & checked
• Annotated with:
  • Concept transfer & misunderstandings: correctly, incorrectly, deleted, and substituted concepts; correct concept values at each turn
  • Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART
  • Corrections
  • User response to non-understanding recovery: Repeat, Rephrase, Contradict, Change, Other
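To make the annotation scheme concrete, here is a hypothetical per-turn record; the field names are assumptions based on the labels above, not the actual corpus format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotatedTurn:
    """One user turn with the annotations listed above (field names are illustrative)."""
    transcript: str                   # human transcription of the utterance
    hypothesis: str                   # ASR hypothesis
    confidence: float                 # confidence score attached to the turn
    correct_concepts: int = 0         # concepts transferred correctly (contributes to CTC)
    incorrect_concepts: int = 0       # concepts transferred incorrectly (contributes to ITC)
    grammaticality: str = "OK"        # OK, OOR, OOG, OOS, OOD, VOID, PART
    correction: Optional[str] = None  # Repeat, Rephrase, Contradict, Change, Other
```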
Corpus [figure]
General corpus statistics [table]
Back to the Issues
• Data Collection
• Detection / Diagnosis
  • What are the main causes (sources) of non-understandings?
  • What is their impact on global performance?
  • Can we diagnose non-understandings at run-time?
  • Can we optimize the rejection process in a more principled way?
• Set of recovery strategies
  • What is the relative performance of different recovery strategies?
  • Can we refine current strategies and find new ones?
• Policy for choosing between them
  • Can we improve performance by making smarter choices?
  • If so, can we learn how to make these smarter choices?
Next …
• Data Collection
• Detection / Diagnosis
  • What are the main causes (sources) of non-understandings?
  • What is their impact on global performance?
  • Can we diagnose non-understandings at run-time?
  • Can we optimize the rejection process in a more principled way?
• Set of recovery strategies
  • What is the relative performance of different recovery strategies?
  • Can we refine current strategies and find new ones?
• Policy for choosing between them
  • Can we improve performance by making smarter choices?
  • If so, can we learn how to make these smarter choices?
Utterance Rejection
• Systems use confidence scores to assess the reliability of inputs
• A widely used design pattern: if confidence is very low (i.e., below a certain threshold), reject the utterance altogether
  • Observed non-understandings = genuine non-understandings + rejections
• This creates a tradeoff between non-understandings and misunderstandings
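A minimal sketch of this rejection pattern (function and parameter names are illustrative):

```python
def interpret(confidence: float, concepts: dict, threshold: float):
    """Widely used pattern: discard the parse when confidence is too low.

    Lowering the threshold lets more misunderstandings through; raising it
    turns more correctly understood turns into (false) rejections.
    """
    if confidence < threshold or not concepts:
        return None      # treated as a non-understanding (rejection)
    return concepts      # accepted; may still be a misunderstanding
```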
Non-understanding / Misunderstanding Tradeoff
• The non-understanding vs. misunderstanding tradeoff, plotted as a function of the rejection threshold [figure]
An Alternative, More Informative View
• Number of concepts transferred correctly (CTC) or incorrectly (ITC), as a function of the rejection threshold [figure]
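One way to compute these two curves from an annotated corpus, assuming each turn carries a confidence score and counts of correctly / incorrectly transferred concepts (a sketch, not the actual analysis code):

```python
import numpy as np

def tradeoff_curves(confidences, ctc_per_turn, itc_per_turn, thresholds):
    """For each candidate threshold, count concepts the system would still accept.

    A turn is rejected when its confidence falls below the threshold, so both
    CTC(th) and ITC(th) decrease monotonically as the threshold rises.
    Returns a list of (threshold, CTC, ITC) tuples.
    """
    confidences = np.asarray(confidences)
    ctc = np.asarray(ctc_per_turn)
    itc = np.asarray(itc_per_turn)
    curves = []
    for th in thresholds:
        accepted = confidences >= th
        curves.append((th, ctc[accepted].sum(), itc[accepted].sum()))
    return curves
```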
Current solutions
• Set the threshold like the ASR manual says
  • In all likelihood there is a mismatch: ASR confidence optimization is probably done for WER
  • The tradeoff between misunderstandings and rejections probably varies across domains, and even across dialog states
• Go for the break-even point
• Acknowledge the tradeoff; solve it by postulating costs (e.g., misunderstandings cost twice as much as rejections; see the sketch below)
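One way to read the fixed-cost heuristic is as picking the threshold that maximizes CTC(th) − 2·ITC(th); a sketch building on the curve computation above (the 2:1 cost ratio and names are illustrative):

```python
def fixed_cost_threshold(curves, mis_cost=2.0, rej_cost=1.0):
    """Pick the threshold maximizing rej_cost*CTC(th) - mis_cost*ITC(th).

    `curves` is the list of (threshold, CTC, ITC) tuples from the sketch above;
    the default costs encode 'misunderstandings cost twice as much as rejections'.
    """
    return max(curves, key=lambda c: rej_cost * c[1] - mis_cost * c[2])[0]
```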
Proposed Approach
• Use a data-driven approach to establish the costs, then optimize the threshold
• Identify the set of variables involved in the tradeoff: CTC(th) vs. ITC(th)
• Choose a dialog performance metric: TC – task completion (binary, kappa); TD – task duration (# turns); US – user satisfaction
• Build a regression model m:
  logit(TC) ← C_0 + C_CTC · CTC + C_ITC · ITC
• Optimize the threshold to maximize performance:
  th* = argmax_th ( C_CTC · CTC(th) + C_ITC · ITC(th) )
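A sketch of this approach using statsmodels, assuming one row per session with total CTC, total ITC, and a binary task-completion flag (the data layout and names are assumptions, not the original analysis code):

```python
import numpy as np
import statsmodels.api as sm

def fit_completion_model(ctc, itc, completed):
    """Fit logit(TC) <- C_0 + C_CTC*CTC + C_ITC*ITC over sessions."""
    X = sm.add_constant(np.column_stack([ctc, itc]))
    return sm.Logit(np.asarray(completed), X).fit(disp=0)

def optimal_threshold(model, curves):
    """th* = argmax_th ( C_CTC*CTC(th) + C_ITC*ITC(th) ).

    `curves` is the (threshold, CTC, ITC) list from the earlier sketch.
    """
    c0, c_ctc, c_itc = model.params
    return max(curves, key=lambda c: c_ctc * c[1] + c_itc * c[2])[0]
```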
State-specific costs & thresholds
• The costs are potentially different at different points in the dialog
• Count CTC and ITC at different states, with different variables:
  logit(TC) ← C_0 + C_CTC,state1 · CTC_state1 + C_ITC,state1 · ITC_state1
                  + C_CTC,state2 · CTC_state2 + C_ITC,state2 · ITC_state2
                  + C_CTC,state3 · CTC_state3 + C_ITC,state3 · ITC_state3 + …
• Optimize a separate threshold for each state:
  th*_state_x = argmax_th ( C_CTC,state_x · CTC_state_x(th) + C_ITC,state_x · ITC_state_x(th) )
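Extending the same sketch to state-specific regressors and per-state thresholds (the data layout is assumed; this is an illustration of the idea, not the original code):

```python
import numpy as np
import statsmodels.api as sm

def fit_state_model(per_state_counts, completed, states=("open", "bool", "nonbool")):
    """One CTC and one ITC regressor per dialog state.

    per_state_counts: dict mapping state name -> (ctc_array, itc_array),
    each array holding one value per session (structure is illustrative).
    """
    columns = []
    for s in states:
        ctc, itc = per_state_counts[s]
        columns.extend([ctc, itc])
    X = sm.add_constant(np.column_stack(columns))
    return sm.Logit(np.asarray(completed), X).fit(disp=0)

def optimal_threshold_for_state(model, state_index, state_curves):
    """Maximize C_CTC,state*CTC_state(th) + C_ITC,state*ITC_state(th).

    `state_curves` holds (threshold, CTC, ITC) tuples computed over that
    state's turns only.
    """
    c_ctc = model.params[1 + 2 * state_index]
    c_itc = model.params[2 + 2 * state_index]
    return max(state_curves, key=lambda c: c_ctc * c[1] + c_itc * c[2])[0]
```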
States Considered
• Open request: “How may I help you?”
• Request(bool): “Did you want a reservation for this room?”
• Request(non-bool): “Starting at what time do you need the room?”
• Finer granularity is desirable and can be achieved given more data
Model 1: Resulting fit and coefficients [table]
Model 1: Threshold optimization
• Optimization curves for Open-request, Request(non-bool), and Request(bool) [figure]
• Resulting thresholds: Open-request: 0.00; Req(bool): 0.00; Req(non-bool): 61.00
Results confirm expectations
• Anecdotal evidence from the data collection indicated too many false rejections on open requests
• Data analysis confirms this view
What would change? Remains to be seen …
Model 2: Description
• Global performance metric: task duration on successful tasks, measured in # of turns (a Poisson variable)
• Generalized linear model / Poisson:
  log(TD) ← C_0 + C_CTC · CTC + C_ITC · ITC
• But different tasks have different durations, so we would want to normalize by the mean duration of each task:
  log(TD_x / mean(TD_x)) ← C_0 + C_CTC · CTC + C_ITC · ITC
• Instead, use regression offsets:
  log(TD_x) ← 1 · log(mean(TD_x)) + C_0 + C_CTC · CTC + C_ITC · ITC
• Tradeoff variables: same as before
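A sketch of Model 2 as a Poisson GLM with a log mean-duration offset, again using statsmodels (the per-session data layout and names are assumptions):

```python
import numpy as np
import statsmodels.api as sm

def fit_duration_model(ctc, itc, turns, mean_turns_for_task):
    """Fit log(TD) <- 1*log(mean TD for the task) + C_0 + C_CTC*CTC + C_ITC*ITC.

    Poisson GLM over per-session turn counts; the offset normalizes away the
    fact that different scenarios have different intrinsic lengths.
    """
    X = sm.add_constant(np.column_stack([ctc, itc]))
    return sm.GLM(
        np.asarray(turns), X,
        family=sm.families.Poisson(),
        offset=np.log(np.asarray(mean_turns_for_task)),
    ).fit()
```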
Model 2: Resulting fit and coefficients [figure]
• R^2 = 0.56
Conclusion
• A model for tuning rejection:
  • Truly data-driven
  • Relates state-specific costs of rejection to global dialog performance
  • Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system’s domain
• More data would allow even finer-grained distinctions
• Expected performance improvements remain to be verified
Next time …
• Data Collection
• Detection / Diagnosis
  • What are the main causes (sources) of non-understandings?
  • What is their impact on global performance?
  • Can we diagnose non-understandings at run-time?
  • Can we optimize the rejection process in a more principled way?
• Set of recovery strategies
  • What is the relative performance of different recovery strategies?
  • Can we refine current strategies and find new ones?
• Policy for choosing between them
  • Can we improve performance by making smarter choices?
  • If so, can we learn how to make these smarter choices?