Adapting and Learning Dialogue Models Discourse & Dialogue CMSC 35900-1 November 19, 2006
Roadmap • The Problem: Portability • Task domain: Call-routing • Porting: • Speech recognition • Call-routing • Dialogue management • Conclusions • Learning DM strategies • HMMs and POMDPs
SLS Portability • Spoken language system design • Record or simulate user interactions • Collect vocabulary, sentence style, sequencing • Transcribe/label • Expert creates vocabulary, language model, dialogue model • Problem: Costly, time-consuming, requires an expert
Call-routing • Goal: Given an utterance, identify its type • Dispatch to the right operator • Classification task: • Manual rules or data-driven methods • Feature-based classification (boosting) • Pre-defined types, e.g.: • Hello? -> hello; I have a question -> request(info) • I would like to know my balance. -> request(balance)
Dialogue Management • Flow controller • Pluggable dialogue strategy modules • ATN (augmented transition network): encodes call flow, easy to augment, manages context • Inputs: context, semantic representation of utterance • ASR • Language models • Trigrams, in a probabilistic framework
Adaptation: ASR • ASR: Language models • Usually trained from in-domain transcriptions • Here: out-of-domain transcriptions • Switchboard, spoken dialogue (telecom, insurance) • In-domain web pages • New domain: pharmaceuticals • Style differences: SLS speech favors pronouns; in-domain medical web data best for OOV coverage • Best accuracy: spoken dialogue + web • Switchboard alone: too big/slow
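The porting idea above amounts to mixing language models from different sources. A minimal sketch, assuming simple linear interpolation of maximum-likelihood trigram models; the corpora, tokens, and mixing weights are invented for illustration, not the paper's actual data or weights:

```python
# Sketch: interpolate an out-of-domain dialogue trigram LM with an
# in-domain web-text LM for a new domain. Hypothetical corpora/weights.
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams and their context bigrams over padded sentences."""
    tri, bi = Counter(), Counter()
    for toks in sentences:
        toks = ["<s>", "<s>"] + toks + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i-2], toks[i-1], toks[i])] += 1
            bi[(toks[i-2], toks[i-1])] += 1
    return tri, bi

def prob(tri, bi, w1, w2, w3, floor=1e-6):
    """MLE trigram probability, with a tiny floor for unseen contexts."""
    return tri.get((w1, w2, w3), 0) / bi[(w1, w2)] if bi.get((w1, w2)) else floor

def interpolated_prob(lms, weights, w1, w2, w3):
    """Linear interpolation of several component LMs."""
    return sum(lam * prob(tri, bi, w1, w2, w3)
               for lam, (tri, bi) in zip(weights, lms))

# Hypothetical out-of-domain dialogue transcripts + in-domain web text.
dialogue = [["i", "need", "a", "refill"], ["what", "is", "my", "copay"]]
web_text = [["refill", "your", "prescription", "online"]]
lms = [trigram_counts(dialogue), trigram_counts(web_text)]
print(interpolated_prob(lms, [0.7, 0.3], "i", "need", "a"))
```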
Adaptation: Call-routing • Manual tagging: slow, expensive • Here: existing out-of-domain labeled data • Meta call-types: a library • Generic: shared by all apps • Reusable: in-domain types that already exist elsewhere • Specific: only this app • Grouping done by experts • Bootstrap: start with generic + reusable types
Call-type Classification • Boostexter: word n-gram features; 1,100 iterations • Operates on ASR output • Telecom-based call-type library • Two classifications: reject (yes/no); call-type • In-domain: 78% (true transcripts); 62% (ASR) • Generic model on generic types: 95%; 91% • Bootstrap (generic + reusable + rules): 79%; 68%
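BoosTexter itself is not freely available, but the setup can be illustrated with a rough stand-in: scikit-learn's AdaBoost over word n-gram features. The tiny training set, labels, and round count below are illustrative assumptions, not the paper's configuration:

```python
# Not BoosTexter: a stand-in boosting classifier over word n-grams
# to illustrate the call-type classification setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

train_utts = ["hello", "i have a question",
              "i would like to know my balance", "what is my balance"]
train_types = ["hello", "request(info)", "request(balance)", "request(balance)"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # word unigram/bigram features
    AdaBoostClassifier(n_estimators=100),  # BoosTexter used ~1,100 rounds
)
clf.fit(train_utts, train_types)
print(clf.predict(["my balance please"]))  # expect request(balance)
```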
Dialogue Model • Build dialogue strategy templates • Based on call-type classification • Generic: • E.g., yes, no, hello, repeat, help • Trigger generic, context-dependent replies • Tag as vague/concrete: • Vague: “I have a question” -> clarification • Concrete: clear routing target; attributes -> sub-dialogues
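A minimal sketch of the template dispatch this slide describes: generic types trigger canned context-dependent replies, vague types trigger clarification, and concrete types route to a sub-dialogue. The function and prompt wording are illustrative assumptions, not the paper's actual module API:

```python
# Sketch: dialogue strategy template selection from call-type output.
GENERIC = {"yes", "no", "hello", "repeat", "help"}
VAGUE = {"request(info)"}

def next_move(call_type, context):
    """Pick a dialogue move from the classified call type."""
    if call_type in GENERIC:
        return ("generic_reply", context)   # e.g. re-prompt on "repeat"
    if call_type in VAGUE:
        return ("clarify", "How may I direct your call?")
    # Concrete: clear routing target; attributes handled in a sub-dialogue.
    return ("route", call_type)

print(next_move("request(balance)", {}))  # -> ('route', 'request(balance)')
```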
Dialogue Model Porting • Evaluation: • Compare to original transcribed dialogues • Task 1: DM category: 32 clusters of calls • Bootstrap covers 16 categories – 70% of instances • Using call-type classifiers: get class, confidence, concreteness • If confident, concrete, and correctly classified -> correct; • If incorrectly classified -> error • Also classify vague/generic • 67-70% accuracy on the DM and routing tasks
Conclusions • Portability: • Bootstrapping of ASR, call-type classification, and DM • Generally effective • Call-type success is high • Others show potential
Learning DM Strategies • Prior approaches: • Hand-coded: state-, frame- or agent-based • Adaptation bootstraps from existing structure • Alternative: • Capture prior interaction patterns • Learn dialogue structure and management
Training HMM DM • Construct a training corpus • E.g., record human-human interactions • Identify and label states • Train HMM dialogue management • Use tagged sequences to learn • Correspondences between utterances and states • State transition probabilities • Effective, but still requires initial tagging
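A minimal sketch of what this training amounts to: maximum-likelihood estimates of state transitions and utterance-class emissions from tagged dialogues. The state and utterance labels below are invented examples:

```python
# Sketch: estimate HMM DM parameters from tagged dialogue sequences.
from collections import Counter, defaultdict

# Each dialogue is a tagged sequence of (state, utterance_class) pairs.
dialogues = [
    [("greet", "hello"), ("ask_task", "request"), ("confirm", "yes")],
    [("greet", "hello"), ("ask_task", "request"), ("clarify", "vague")],
]

trans, emit = defaultdict(Counter), defaultdict(Counter)
for d in dialogues:
    for (s1, _), (s2, _) in zip(d, d[1:]):
        trans[s1][s2] += 1          # state -> next-state counts
    for s, utt in d:
        emit[s][utt] += 1           # state -> observed utterance class

def p_trans(s1, s2):
    return trans[s1][s2] / sum(trans[s1].values())

def p_emit(s, utt):
    return emit[s][utt] / sum(emit[s].values())

print(p_trans("greet", "ask_task"), p_emit("ask_task", "request"))
```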
Reinforcement Learning • Model dialogues with (partially observable) Markov decision processes • Users form the stochastic environment, • Actions are system utterances, • State is the dialogue so far • Goal: maximize some utility measure • Task completion/user satisfaction • Learn a policy – a mapping from states to actions – • That optimizes the utility measure
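A toy sketch of policy learning in this framing, using tabular Q-learning over an invented three-state slot-filling MDP; the states, actions, rewards, and user-confirmation probability are all assumptions for illustration, not any of the cited systems:

```python
# Sketch: Q-learning a dialogue policy against a stochastic "user".
import random

STATES = ["no_slot", "slot_unconfirmed", "done"]
ACTIONS = ["ask", "confirm", "submit"]
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.2

def step(state, action):
    """Stochastic user environment: returns (next_state, reward)."""
    if state == "no_slot" and action == "ask":
        return ("slot_unconfirmed", 0.0)
    if state == "slot_unconfirmed" and action == "confirm":
        # Assume the user confirms 80% of the time.
        return ("done", 1.0) if random.random() < 0.8 else ("no_slot", 0.0)
    return (state, -0.1)  # small penalty for useless moves

for _ in range(5000):
    s = "no_slot"
    while s != "done":
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda x: Q[("no_slot", x)]))  # learned: 'ask'
```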
Applications • TOOT – train information • Litman, Kearns, et al. • Learned different initiative/confirmation strategies • Air travel bookings (Young et al. 2006) • Problem: huge number of possible states • More airports -> dramatically more possible utterances • Approach: collapse all alternative slot fillers • Represent them with a single default value
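The state-collapse idea can be made concrete with a small sketch: replace every concrete slot filler with one placeholder, so dialogue states that differ only in the filler map to a single abstract state. The slot names and values are illustrative:

```python
# Sketch: collapse alternative slot fillers into a single default token.
def abstract_state(state):
    """Keep only filled/empty status, discarding the concrete value."""
    return {slot: ("<FILLED>" if value is not None else None)
            for slot, value in state.items()}

s1 = {"origin": "JFK", "destination": "SFO", "date": None}
s2 = {"origin": "ORD", "destination": "LHR", "date": None}
assert abstract_state(s1) == abstract_state(s2)  # one state, many fillers
```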
Turn-taking Discourse and Dialogue CS 35900-1 November 16, 2004
Agenda • Motivation • Silence in human-computer dialogue • Turn-taking in human-human dialogue • Turn-change signals • Back-channel acknowledgments • Maintaining contact • Exploiting these cues to improve human-computer communication • Automatic identification of disfluencies, jump-in points, and jump-ins
Turn-taking in HCI • Human turn end: • Detected by 250 ms of silence • System turn end: • Signaled by end of system speech • Or indicated by any human sound: barge-in • Continued attention: • No signal
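A minimal sketch of the silence-based endpointing on this slide: declare the human's turn over once 250 ms of consecutive non-speech frames accumulate. The frame size and energy threshold are illustrative assumptions:

```python
# Sketch: detect human turn end from 250 ms of consecutive silence.
FRAME_MS = 10
SILENCE_MS = 250
THRESHOLD = 0.01  # energy below this counts as silence (assumed)

def turn_ended(frame_energies):
    """Return True once 250 ms of uninterrupted silence is observed."""
    needed = SILENCE_MS // FRAME_MS
    run = 0
    for e in frame_energies:
        run = run + 1 if e < THRESHOLD else 0
        if run >= needed:
            return True
    return False

print(turn_ended([0.5] * 20 + [0.001] * 25))  # True: 250 ms of silence
```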
Yielding & Taking the Floor • Turn change signal • Offer floor to auditor/hearer • Cues: pitch fall, lengthening, “but uh”, end gesture, amplitude drop+’uh’, end clause • Likelihood of change increases with more cues • Negated by any gesticulation • Speaker-state signal: • Shift in head direction AND/OR Start of gesture
Retaining the Floor • Within-turn signal • Still speaker: look at hearer at end of clause • Continuation signal • Still speaker: look away after within-turn signal/back-channel • Back-channel: • ‘mmhm’/okay/etc.; nods; • sentence completions; clarification requests; restatements • NOT a turn: signals attention, agreement, or confusion
Improving Human-Computer Turn-taking • Identifying cues to turn change and turn start • Meeting conversations: • Recorded, natural research meetings • Multi-party • Overlapping speech • Units = “spurts”: speech bounded by ≥500 ms of silence
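A minimal sketch of spurt segmentation as defined above: split one speaker's timed words into units wherever the pause before a word reaches 500 ms. The word timings are invented for illustration:

```python
# Sketch: segment timed words into spurts at >=500 ms pauses.
def spurts(words, gap=0.5):
    """words: list of (token, start_sec, end_sec), sorted by time."""
    units, cur = [], [words[0]]
    for prev, w in zip(words, words[1:]):
        if w[1] - prev[2] >= gap:   # pause >= 500 ms closes the spurt
            units.append(cur)
            cur = []
        cur.append(w)
    units.append(cur)
    return units

timed = [("so", 0.0, 0.2), ("yeah", 0.3, 0.6), ("right", 1.4, 1.7)]
print([[w for w, _, _ in u] for u in spurts(timed)])  # [['so','yeah'], ['right']]
```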
Tasks • Sentence/disfluency/non-boundary identification • End of sentence, break-off, or continuation • Jump-in points • Times when others “jump in” • Jump-in words • Interruption vs. start from silence • Off- and on-line • Language model and/or prosodic cues
Text + Prosody • Text sequence: • Modeled with an n-gram language model • Hidden event prediction – e.g., boundary as hidden state • Implemented as an HMM • Prosody: • Duration, pitch, pause, energy • Decision trees: classification plus posterior probability • Integrate LM + DT
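One simple way to realize the LM + DT integration is to interpolate the two models' posteriors over boundary events at each inter-word position. The mixing weight and toy posteriors below are illustrative assumptions, not the published system's combination scheme:

```python
# Sketch: combine hidden-event LM and prosodic decision-tree posteriors.
EVENTS = ["sentence_end", "disfluency", "continue"]

def combine(p_lm, p_dt, lam=0.5):
    """Linearly interpolate the two posteriors, then renormalize."""
    mixed = {e: lam * p_lm[e] + (1 - lam) * p_dt[e] for e in EVENTS}
    z = sum(mixed.values())
    return {e: p / z for e, p in mixed.items()}

p_lm = {"sentence_end": 0.6, "disfluency": 0.1, "continue": 0.3}  # n-gram HMM
p_dt = {"sentence_end": 0.2, "disfluency": 0.5, "continue": 0.3}  # prosody tree
post = combine(p_lm, p_dt)
print(max(post, key=post.get))  # 'sentence_end' in this toy case
```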
Interpreting Breaks • For each inter-word position: • Is it a disfluency, sentence end, or continuation? • Key features: • Pause duration, vowel duration • 62% accuracy vs. 50% chance baseline • ~90% overall • Best results combine LM & DT
Jump-in Points • “Used” possible turn changes: points WITHIN a spurt where a new speaker actually starts • Key features: • Pause duration, low energy, pitch fall • No lexical/punctuation features used • Following-context features useless • Look like sentence boundaries but aren’t • Accuracy: 65% vs. 50% baseline • Performance depends only on preceding prosodic features
Jump-in Features • Do people speak differently when jumping in? • Do jump-ins differ from regular turn starts? • Examine only the first words of turns • No language model • Key features: • Raised pitch, raised amplitude • Accuracy: 77% vs. 50% baseline • Prosody only
Summary • Prosodic features signal conversational moves • Pause and vowel duration distinguish sentence end, disfluency, and fluent continuation • Jump-ins occur at locations that sound like sentence ends • Speakers raise their voices when jumping in