Sorry, I didn’t catch that … Non-understandings and recovery in spoken dialog systems. Part I: Issues, Data Collection, Rejection Tuning. Dan Bohus, Sphinx Lunch Talk, Carnegie Mellon University, March 2005.


Presentation Transcript


  1. Sorry, I didn’t catch that … Non-understandings and recovery in spoken dialog systems Part I: Issues, Data Collection, Rejection Tuning Dan Bohus Sphinx Lunch Talk Carnegie Mellon University, March 2005

  2. ASR Errors & Spoken Dialog
  • Call RoomLine! 1-412-268-1084
  • Call Let’s Go! 1-412-268-1185

  3. Non-understandings and Misunderstandings
  • Recognition errors can lead to 2 types of problems in a spoken dialog system:
  • NON-understanding: the system cannot extract any meaningful information from the user’s turn
    S: What city are you leaving from?
    U: Urbana Champaign [OKAY IN THAT SAME PAY]
  • MIS-understanding: the system extracts incorrect information from the user’s turn
    S: What city are you leaving from?
    U: Birmingham [BERLIN PM]

  4. Non-understandings and Misunderstandings
  • NON-understanding: the system cannot extract any meaningful information from the user’s turn
    S: What city are you leaving from?
    U: Urbana Champaign [OKAY IN THAT SAME PAY]
  • How can we prevent non-understandings? How can we recover from them?
    • Detection
    • Set of recovery strategies
    • Policy for choosing between them

  5. Current State of Affairs
  • Detection / Diagnosis
    • Systems know when a non-understanding happens:
      • there was a detected user turn, but no meaningful information
      • the system decided to reject because of low confidence
    • Not much exists in terms of diagnosis
  • Set of recovery strategies
    • Repeat the question
    • “Can you repeat that?”
    • “Sorry, I didn’t catch that …”
  • Policy for choosing between them
    • Traditionally, simple heuristics are used

  6. Data Collection Experiment: Questions Under Investigation
  • Detection / Diagnosis
    • What are the main causes (sources) of non-understandings?
    • What is their impact on global performance?
    • Can we diagnose non-understandings at run-time?
    • Can we optimize the rejection process in a more principled way?
  • Set of recovery strategies
    • What is the relative performance of different recovery strategies?
    • Can we refine current strategies and find new ones?
  • Policy for choosing between them
    • Can we improve performance by making smarter choices?
    • If so, can we learn how to make these smarter choices?

  7. Data Collection: Experimental Design
  • Subjects interact over the telephone with RoomLine
  • Performed 10 scenario-based tasks
  • Between-subjects experiment, 2 groups:
    • Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
    • Wizard: policy is determined at runtime by a human (wizard)
  • 46 subjects, balanced for gender × native vs. non-native speakers

  8. Non-understanding Strategies
  S: For when do you need the room?
  U: [non-understanding]
    1. MOVE-ON: Sorry, I didn’t catch that. For which day do you need the room?
    2. YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
    3. TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
    4. FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
    5. ASK REPEAT (AREP): Could you please repeat that?
    6. ASK REPHRASE (ARPH): Could you please try to rephrase that?
    7. NOTIFY (NTFY): Sorry, I didn’t catch that …
    8. YIELD TURN (YLD): …
    9. REPROMPT (RP): For when do you need the conference room?
    10. DETAILED REPROMPT (DRP): Right now I need to know the date and time for when you need the reservation …
  Strategy groups: HELP, REPEAT, NOTIFY, REPROMPT
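
As a concrete illustration of how these strategies might be engaged, here is a minimal Python sketch of the random (uniform) control policy from the experimental design above. The prompt strings are abbreviated from the slide, and the dictionary and the control_policy helper are hypothetical illustrations, not RoomLine’s actual implementation.

    import random

    # A few of the recovery strategies above, keyed by their slide names.
    # The prompt texts are abbreviated; this table is an illustrative assumption.
    RECOVERY_STRATEGIES = {
        "MOVE-ON": "Sorry, I didn't catch that. For which day do you need the room?",
        "TYCS":    "Sorry, I didn't catch that. You can say something like tomorrow at 10 am ...",
        "AREP":    "Could you please repeat that?",
        "ARPH":    "Could you please try to rephrase that?",
        "NTFY":    "Sorry, I didn't catch that ...",
        "RP":      "For when do you need the conference room?",
    }

    def control_policy():
        """Control condition: pick a recovery strategy uniformly at random."""
        name = random.choice(list(RECOVERY_STRATEGIES))
        return name, RECOVERY_STRATEGIES[name]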

  9. Experimental Design: Scenarios
  • 10 scenarios, fixed order
  • Presented graphically (explained during briefing)

  10. Experimental Design: Evaluation
  • Participants filled in a SASSI evaluation questionnaire
    • 35 questions, 1-7 Likert scale; 6 factors: response accuracy, likeability, cognitive demand, annoyance, habitability, speed
  • Overall user satisfaction score: 1-7
  • What did you like best / least?
  • What would you change first?

  11. Corpus Statistics / Characteristics
  • 46 users; 449 sessions; 8278 user turns
  • User utterances transcribed & checked
  • Annotated with:
    • Concept transfer & misunderstandings: correctly, incorrectly, deleted, substituted concepts
    • Correct concept values at each turn
    • Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID, PART
    • Corrections
    • User response to non-understanding recovery: repeat, rephrase, contradict, change, other
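
For readers who want to picture this kind of annotation as a data structure, here is a hypothetical per-turn record mirroring the labels listed above; the field names are illustrative assumptions, not the corpus’s actual schema.

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class TurnAnnotation:
        """Hypothetical per-turn annotation record; fields mirror the slide's label sets."""
        transcript: str                        # checked human transcription
        hypothesis: str                        # ASR output for the turn
        grammaticality: str                    # one of: OK, OOR, OOG, OOS, OOD, VOID, PART
        concepts_correct: int = 0              # correctly transferred concepts
        concepts_incorrect: int = 0            # incorrectly transferred concepts
        concepts_deleted: int = 0
        concepts_substituted: int = 0
        correct_values: Dict[str, str] = field(default_factory=dict)  # correct concept values at this turn
        is_correction: bool = False
        recovery_response: str = ""            # repeat / rephrase / contradict / change / other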

  12. Corpus

  13. General corpus statistics

  14. Back to the Issues
  • Data Collection
  • Detection / Diagnosis
    • What are the main causes (sources) of non-understandings?
    • What is their impact on global performance?
    • Can we diagnose non-understandings at run-time?
    • Can we optimize the rejection process in a more principled way?
  • Set of recovery strategies
    • What is the relative performance of different recovery strategies?
    • Can we refine current strategies and find new ones?
  • Policy for choosing between them
    • Can we improve performance by making smarter choices?
    • If so, can we learn how to make these smarter choices?

  15. Next …
  • Data Collection
  • Detection / Diagnosis
    • What are the main causes (sources) of non-understandings?
    • What is their impact on global performance?
    • Can we diagnose non-understandings at run-time?
    • Can we optimize the rejection process in a more principled way?
  • Set of recovery strategies
    • What is the relative performance of different recovery strategies?
    • Can we refine current strategies and find new ones?
  • Policy for choosing between them
    • Can we improve performance by making smarter choices?
    • If so, can we learn how to make these smarter choices?

  16. Utterance Rejection
  • Systems use confidence scores to assess the reliability of inputs
  • A widely used design pattern: if confidence is very low (i.e. below a certain threshold), reject the utterance altogether
    • Non-understandings are then genuine non-understandings + rejections
  • This creates a tradeoff between non-understandings and misunderstandings
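
A minimal sketch of this design pattern, assuming a single scalar confidence score per turn and a hypothetical threshold value:

    REJECTION_THRESHOLD = 0.30   # hypothetical value; tuning it is the topic of the following slides

    def accept_or_reject(decoded_concepts, confidence):
        """Widely used pattern: reject the whole utterance when confidence is very low."""
        if confidence < REJECTION_THRESHOLD:
            return None              # rejected: handled as a non-understanding
        return decoded_concepts      # accepted: concepts may still be misunderstood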

  17. Nonu- / Mis-understanding Tradeoff
  • [Plot: the non-understanding vs. misunderstanding tradeoff as a function of the rejection threshold]

  18. An Alternative, More Informative View
  • [Plot: number of concepts transferred correctly (CTC) or incorrectly (ITC) as a function of the rejection threshold]
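
Curves like these can be computed by sweeping candidate thresholds over the annotated corpus. The sketch below assumes a hypothetical per-turn layout with confidence, correct, and incorrect fields (counts of correctly and incorrectly transferred concepts); it is not the actual analysis code.

    def tradeoff_curves(turns, thresholds):
        """CTC(th) and ITC(th): concepts transferred correctly / incorrectly at each threshold."""
        curves = []
        for th in thresholds:
            ctc = sum(t["correct"]   for t in turns if t["confidence"] >= th)
            itc = sum(t["incorrect"] for t in turns if t["confidence"] >= th)
            curves.append((th, ctc, itc))    # rejected turns transfer no concepts at all
        return curves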

  19. Current solutions
  • Set the threshold like the ASR manual says
    • In all likelihood, a mismatch: the ASR confidence scores are probably optimized for WER
    • The tradeoff between misunderstandings and rejections probably varies across domains, and even across dialog states
  • Go for the break-even point
  • Acknowledge the tradeoff; solve it by postulating costs (see the sketch below)
    • e.g. misunderstandings cost twice as much as rejections
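
The "postulated costs" heuristic from the last bullet can be written down directly. The sketch below reuses the hypothetical per-turn layout from the previous sketch and simply picks the threshold with the lowest total cost under the assumed 2:1 cost ratio.

    def heuristic_threshold(turns, thresholds, mis_cost=2.0, rej_cost=1.0):
        """Pick the threshold minimizing (2 x misunderstandings + 1 x rejections)."""
        best_th, best_cost = None, float("inf")
        for th in thresholds:
            rejections = sum(1 for t in turns if t["confidence"] < th)
            misunderstandings = sum(1 for t in turns
                                    if t["confidence"] >= th and t["incorrect"] > 0)
            cost = mis_cost * misunderstandings + rej_cost * rejections
            if cost < best_cost:
                best_th, best_cost = th, cost
        return best_th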

  20. Proposed Approach
  • Use a data-driven approach to establish the costs, then optimize the threshold
  • Identify a set of variables involved in the tradeoff: CTC(th) vs. ITC(th)
  • Choose a dialog performance metric: TC – task completion (binary, kappa); TD – task duration (# turns); US – user satisfaction
  • Build a regression model m:
      logit(TC) ← C_0 + C_CTC·CTC + C_ITC·ITC
  • Optimize the threshold to maximize predicted performance (see the sketch below):
      th* = argmax_th ( C_CTC·CTC(th) + C_ITC·ITC(th) )
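
Here is a minimal sketch of the two steps above: fit a logistic regression of task completion on per-session CTC and ITC counts, then sweep the threshold to maximize the fitted contribution of the tradeoff variables. The choice of scikit-learn and the per-turn data layout are my assumptions; the original work does not name a toolkit.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_costs(ctc_counts, itc_counts, task_completed):
        """Fit logit(TC) ~ C_0 + C_CTC*CTC + C_ITC*ITC; one row per session."""
        X = np.column_stack([ctc_counts, itc_counts])
        model = LogisticRegression().fit(X, task_completed)
        c_ctc, c_itc = model.coef_[0]        # data-driven "costs" of CTC and ITC
        return c_ctc, c_itc

    def optimal_threshold(turns, thresholds, c_ctc, c_itc):
        """th* = argmax_th ( C_CTC*CTC(th) + C_ITC*ITC(th) )."""
        def score(th):
            ctc = sum(t["correct"]   for t in turns if t["confidence"] >= th)
            itc = sum(t["incorrect"] for t in turns if t["confidence"] >= th)
            return c_ctc * ctc + c_itc * itc
        return max(thresholds, key=score)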

  21. State-specific costs & thresholds
  • The costs are potentially different at different points in the dialog
  • Count CTC and ITC at different states, with different variables:
      logit(TC) ← C_0 + C_CTC,state1·CTC_state1 + C_ITC,state1·ITC_state1
                      + C_CTC,state2·CTC_state2 + C_ITC,state2·ITC_state2
                      + C_CTC,state3·CTC_state3 + C_ITC,state3·ITC_state3 + …
  • Optimize a separate threshold for each state (see the sketch below):
      th*_state_x = argmax_th ( C_CTC,state_x·CTC_state_x(th) + C_ITC,state_x·ITC_state_x(th) )
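
Extending the previous sketch to state-specific thresholds only changes the bookkeeping: the regression gets separate CTC/ITC count columns per state, and each state’s threshold is then optimized independently using that state’s coefficients and that state’s turns. The helper below reuses the hypothetical optimal_threshold function from the previous sketch.

    def optimal_thresholds_by_state(turns_by_state, thresholds, coeffs_by_state):
        """One rejection threshold per dialog state, each optimized independently."""
        best = {}
        for state, state_turns in turns_by_state.items():
            c_ctc, c_itc = coeffs_by_state[state]      # state-specific regression costs
            best[state] = optimal_threshold(state_turns, thresholds, c_ctc, c_itc)
        return best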

  22. States Considered
  • Open request: “How may I help you?”
  • Request(bool): “Did you want a reservation for this room?”
  • Request(non-bool): “Starting at what time do you need the room?”
  • Finer granularity is desirable; it can be achieved given more data

  23. Model 1: Resulting fit and coefficients

  24. Model 1: Threshold optimization
  • [Plots: optimization curves for Open-request, Request(non-bool), Request(bool)]
  • Resulting thresholds: Open-request: 0.00; Req(bool): 0.00; Req(non-bool): 61.00

  25. Results do confirm expectations
  • Anecdotal evidence from the data collection indicates too many false rejections on open requests
  • Data analysis confirms this view

  26. What would change?
  • Remains to be seen …

  27. Model 2: Description
  • Global performance metric: task duration (on successful tasks) – # turns, a Poisson variable
  • Generalized linear model / Poisson:
      log(TD) ← C_0 + C_CTC·CTC + C_ITC·ITC
  • But different tasks have different durations, so you’d want to normalize by each task’s typical duration avgTD_x:
      log(TD_x / avgTD_x) ← C_0 + C_CTC·CTC + C_ITC·ITC
  • Instead, use regression offsets (see the sketch below):
      log(TD_x) ← 1·log(avgTD_x) + C_0 + C_CTC·CTC + C_ITC·ITC
  • Tradeoff variables: same as before
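
A Poisson GLM with a regression offset can be fit, for example, with statsmodels; the choice of toolkit and the variable names are my assumptions, not the original analysis code. Each row is one successful task: the response is the observed number of turns, the offset is the log of that task’s typical duration, and the features are the per-session CTC and ITC counts.

    import numpy as np
    import statsmodels.api as sm

    def fit_duration_model(ctc_counts, itc_counts, observed_turns, expected_turns):
        """Fit log(TD_x) ~ 1*log(avgTD_x) + C_0 + C_CTC*CTC + C_ITC*ITC."""
        X = sm.add_constant(np.column_stack([ctc_counts, itc_counts]))
        model = sm.GLM(observed_turns, X,
                       family=sm.families.Poisson(),
                       offset=np.log(expected_turns))   # the fixed 1*log(avgTD_x) term
        result = model.fit()
        return result.params                 # [C_0, C_CTC, C_ITC]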

  28. Model 2: Resulting fit and coefficients

  29. Model 2: Resulting fit and coefficients
  • R^2 = 0.56

  30. Model 1: Threshold optimization
  • [Plots: optimization curves for Open-request, Request(non-bool), Request(bool)]
  • Resulting thresholds: Open-request: 0.00; Req(bool): 0.00; Req(non-bool): 61.00

  31. Conclusion
  • A model for tuning rejection
    • Truly data-driven
    • Relates state-specific costs of rejection to global dialog performance
    • Bridges the mismatch between an off-the-shelf confidence annotation scheme and the particular characteristics of the system’s domain
  • More data would allow even finer-grained distinctions
  • Expected performance improvements remain to be verified

  32. Next time …
  • Data Collection
  • Detection / Diagnosis
    • What are the main causes (sources) of non-understandings?
    • What is their impact on global performance?
    • Can we diagnose non-understandings at run-time?
    • Can we optimize the rejection process in a more principled way?
  • Set of recovery strategies
    • What is the relative performance of different recovery strategies?
    • Can we refine current strategies and find new ones?
  • Policy for choosing between them
    • Can we improve performance by making smarter choices?
    • If so, can we learn how to make these smarter choices?
