
Spoken Dialogue Systems

This study explores error avoidance, error detection, and error handling in spoken dialogue systems, focusing on cues provided by the user and the system's ability to recognize and correct misrecognitions. It also evaluates the effectiveness of different error handling strategies and discusses the impact of feedback in improving system performance. The study utilizes machine learning techniques and human performance data to identify and classify problematic dialogues and utterances. The results provide insights for designing more robust and user-friendly dialogue systems.



Presentation Transcript


  1. Spoken Dialogue Systems. Julia Hirschberg, CS 6998

  2. Issues
• Error avoidance
• Error detection
  • From the system side: how likely is it the system made an error?
  • From the user side: what cues does the user provide to indicate an error?
• Error handling: what can the system do when it thinks an error has occurred?
• Evaluation: how do you know what needs fixing most?

  3. Avoiding Misunderstandings
• By imitating human performance
• Timing and grounding (Clark '03)

  4. Recognizing Problematic Dialogues
• Hastie et al., "What's the Trouble?" ACL 2002

  5. Recognizing Problematic Utterances (Hirschberg et al. '99--)
• Collect a corpus from an interactive voice response system
• Identify speaker 'turns':
  • incorrectly recognized
  • where speakers first became aware of an error
  • that correct misrecognitions
• Identify prosodic features of turns in each category and compare to other turns
• Use machine learning techniques to train a classifier to make these distinctions automatically (a sketch follows below)
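A minimal sketch of the last step, assuming per-turn prosodic features (e.g., F0 maximum, energy, duration, speaking rate, preceding pause) have already been extracted. The feature set and toy data are illustrative, and a decision tree stands in for the rule-learning approach of the original experiments:

```python
# Classify user turns as misrecognized vs. correctly recognized from
# prosodic features. Features and data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# One row per user turn: [f0_max, rms_max, duration_sec, tempo, pause_before_sec]
X = np.array([
    [310.0, 0.82, 3.1, 4.2, 0.9],   # louder, slower, hyperarticulated turn
    [190.0, 0.40, 1.2, 5.5, 0.2],
    [295.0, 0.75, 2.8, 3.9, 1.1],
    [185.0, 0.38, 1.0, 5.8, 0.3],
])
y = np.array([1, 0, 1, 0])          # 1 = misrecognized turn, 0 = correctly recognized

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validated accuracy:", scores.mean())
```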

  6. Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [aware site; correction]

  7. Results
• Reduced error in predicting misrecognized turns to 8.64%
• Error in predicting 'awares': 12%
• Error in predicting corrections: 18-21%

  8. Evidence from Human Performance
• Users provide explicit positive and negative feedback
• Corpus-based vs. laboratory experiments: do these tell us different things?
  • Bell & Gustafson '00
  • What do we learn from this?
  • What functions does feedback serve?
• Krahmer et al.: 'go on' and 'go back' signals in grounding situations (implicit/explicit verification)

  9. • Positive cues: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
• Negative cues: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
• Hypotheses supported, but:
  • Can these cues be identified automatically? (see the sketch after this list)
  • How might they affect the design of SDS?
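As a thought experiment on automatic identification, a rough heuristic sketch that scores a user turn as positive or negative feedback from a few of the cues above; the thresholds and word lists are invented for illustration and are not from Bell & Gustafson:

```python
# Heuristic positive/negative feedback scoring from surface cues:
# turn length, (dis)confirmation words, and verbatim repetition.
def classify_feedback(turn: str, previous_turn: str = "") -> str:
    words = turn.lower().split()
    confirmations = {"yes", "yeah", "right", "okay", "correct"}
    disconfirmations = {"no", "nope", "wrong", "not"}

    score = 0
    score += 1 if len(words) <= 4 else -1              # short turns lean positive
    score += 1 if confirmations & set(words) else 0
    score -= 2 if disconfirmations & set(words) else 0
    # Verbatim repetition of the previous turn suggests a correction attempt.
    score -= 1 if previous_turn and turn.lower() == previous_turn.lower() else 0
    return "positive" if score > 0 else "negative"

print(classify_feedback("yes that's right"))                    # positive
print(classify_feedback("no I said Philadelphia to New York"))  # negative
```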

  10. Error Handling Strategies
• Goldberg et al. '03: how should systems best inform the user that they don't understand?
  • System rephrasing vs. repetition vs. statement of not understanding
  • Apologies
• What behaviors might these produce? (one possible policy is sketched after this list)
  • Hyperarticulation
  • User frustration
  • User repetition or rephrasing
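To make the strategy space concrete, a sketch of a simple escalation policy that moves from rephrasing toward an explicit apology plus statement of not understanding as consecutive failures accumulate. The ordering and prompt texts are assumptions for illustration, not Goldberg et al.'s findings:

```python
# Escalating re-prompt selection after consecutive ASR failures.
REPROMPTS = [
    "Which city are you leaving from?",                            # rephrase
    "What is your departure city?",                                # rephrase again
    "Sorry, I didn't understand. Please say the departure city.",  # apology + statement
]

def next_prompt(consecutive_failures: int) -> str:
    """Pick a re-prompt based on how many times in a row ASR has failed."""
    idx = min(consecutive_failures, len(REPROMPTS) - 1)
    return REPROMPTS[idx]

for failures in range(4):
    print(failures, "->", next_prompt(failures))
```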

  11. What lessons do we learn?
• What produces the least frustration?
• Best recognized input?

  12. Evaluating Dialogue Systems
• PARADISE framework (Walker et al. '00)
• "Performance" of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and how it gets accomplished
• Performance is decomposed into maximizing task success and minimizing costs, where costs comprise efficiency measures and qualitative measures
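Concretely, PARADISE combines these into a single performance function, Performance = alpha * N(task success) - sum_i( w_i * N(cost_i) ), where N is z-score normalization over the dialogue corpus. A minimal sketch with placeholder weights (PARADISE estimates the weights by regression against user satisfaction):

```python
# PARADISE-style performance: normalized task success minus weighted,
# normalized costs. Weights below are illustrative placeholders.
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / x.std()

task_success = np.array([1.0, 0.8, 0.4, 1.0])          # e.g. per-dialogue kappa
elapsed_time = np.array([120.0, 200.0, 310.0, 150.0])  # cost: seconds
asr_errors   = np.array([1.0, 3.0, 6.0, 2.0])          # cost: misrecognitions

alpha, w_time, w_err = 0.5, 0.3, 0.2                   # assumed weights
performance = (alpha * zscore(task_success)
               - w_time * zscore(elapsed_time)
               - w_err * zscore(asr_errors))
print(performance)                                     # one score per dialogue
```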

  13. Task Success
• Task goals seen as an Attribute-Value Matrix (AVM)
• ELVIS e-mail retrieval task (Walker et al. '97): "Find the time and place of your meeting with Kim."

    Attribute            Value
    Selection Criterion  Kim or Meeting
    Time                 10:30 a.m.
    Place                2D516

• Task success defined by the match between the AVM values at the end of the dialogue and the "true" values for the AVM
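A minimal sketch of scoring one dialogue against the "true" AVM. PARADISE actually uses the kappa statistic over a confusion matrix of attribute values to correct for chance agreement; the exact-match fraction below is a simplified stand-in:

```python
# Compare the AVM extracted at the end of the dialogue to the task key.
true_avm = {
    "selection_criterion": "Kim or Meeting",
    "time": "10:30 a.m.",
    "place": "2D516",
}
observed_avm = {
    "selection_criterion": "Kim or Meeting",
    "time": "10:30 a.m.",
    "place": "2D517",    # wrong room extracted
}

matches = sum(true_avm[k] == observed_avm.get(k) for k in true_avm)
print(f"task success: {matches}/{len(true_avm)} attributes correct")
```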

  14. Metrics
• Efficiency of the interaction: user turns, system turns, elapsed time
• Quality of the interaction: ASR rejections, time-out prompts, help requests, barge-ins, mean recognition score (concept accuracy), cancellation requests
• User satisfaction
• Task success: perceived completion, information extracted
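A sketch of one way such per-dialogue measures could be logged; the field names are assumptions chosen to mirror the slide:

```python
# Container for the efficiency and quality measures listed above.
from dataclasses import dataclass

@dataclass
class DialogueMetrics:
    user_turns: int
    system_turns: int
    elapsed_time_sec: float
    asr_rejections: int
    timeout_prompts: int
    help_requests: int
    barge_ins: int
    mean_recognition_score: float   # concept accuracy in [0, 1]
    cancellation_requests: int
    perceived_completion: bool

d = DialogueMetrics(12, 13, 185.0, 1, 0, 2, 3, 0.87, 0, True)
print(d.mean_recognition_score)
```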

  15. Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, and dialogue acts automatically logged; ASR accuracy and barge-ins hand-labeled
• Users specify the task solution via a web page
• Users complete user satisfaction surveys
• Use multiple linear regression to model user satisfaction as a function of task success and costs; test for significant predictive factors (see the sketch after this list)
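A sketch of the regression step on toy data, using ordinary least squares to relate user satisfaction survey scores to task success and cost measures:

```python
# Fit User Satisfaction ~ COMP + MRS + ET on a handful of toy dialogues.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: COMP (perceived completion), MRS (mean recognition score),
# ET (elapsed time, normalized). Values are illustrative.
X = np.array([
    [1.0, 0.95, 0.2],
    [1.0, 0.80, 0.5],
    [0.0, 0.60, 0.9],
    [1.0, 0.90, 0.3],
    [0.0, 0.50, 1.0],
])
user_sat = np.array([4.6, 3.9, 2.1, 4.4, 1.8])   # survey scores

model = LinearRegression().fit(X, user_sat)
print("coefficients (COMP, MRS, ET):", model.coef_)
```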

  16. User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)

  17. Performance Functions from Three Systems
• ELVIS: User Sat. = 0.21 * COMP + 0.47 * MRS - 0.15 * ET
• TOOT: User Sat. = 0.35 * COMP + 0.45 * MRS - 0.14 * ET
• ANNIE: User Sat. = 0.33 * COMP + 0.25 * MRS + 0.33 * Help
• COMP: user perception of task completion (task success)
• MRS: mean recognition accuracy (cost)
• ET: elapsed time (cost)
• Help: help requests (cost)
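Applying one of these published functions is then a one-liner; for example, the ELVIS function (the input values below are illustrative, and in the papers the predictors are normalized before the coefficients are applied):

```python
# Predicted user satisfaction from the ELVIS performance function.
def elvis_user_sat(comp: float, mrs: float, et: float) -> float:
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

print(elvis_user_sat(comp=1.0, mrs=0.9, et=0.4))  # higher is better
```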

  18. Performance Model
• Perceived task completion and mean recognition score are consistently significant predictors of user satisfaction
• Performance model useful for system development:
  • making predictions about system modifications
  • distinguishing 'good' dialogues from 'bad' dialogues
• But can we also tell on-line when a dialogue is 'going wrong'?

  19. Next Week
• Speech summarization and data mining
