End-to-End Joint Learning of Natural Language Understanding and Dialogue Manager

End-to-End Joint Learning of Natural LanguageUnderstandingandDialogueManager Paul Crook Xiujun Li Jianfeng Gao Li Deng Xuesong Yang Yun-Nung (Vivian) Chen DilekHakkani-Tür ICASSP 2017 March 9th xyang45@illinois.edu y.v.chen@ieee.org HLT-L2: Spoken Language Understanding I

What is a Dialogue System? Apple Siri (2011) • Category: • Chit Chat • Task-Oriented Microsoft Cortana(2014) Google Home (2016) Facebook M& Bot (2015) • Adialoguesystemis acomputeragentthatinteractswithhumanvianaturallanguages. [1 ] Zue, Victor W., and James R. Glass. “Conversational interfaces: Advances and challenges.” Proceedings of theIEEE 88.8 (2000): 1166-1180. Amazon Alexa/Echo(2014)

Pipelined Task-Oriented Dialogue System Hypothesis Any action movies recommended this weekend? Speech Signal • Natural Language Understanding (NLU) • User Intent Detection • Slot Filling Automatic Speech Recognition Text Input Any action movies recommended this weekend? Semantic Frame request_movie (genre=action, date=this weekend) Text-To-Speech • Dialogue Management (DM) • Dialogue State Tracking (DST) • Dialogue Policy Decision Natural Language Generation (NLG) Text response Which theater do you prefer? System Action/Policy request_location • Motivation: The pipelined system (NLU  DM) results in error propagation issues.

Proposed Approach • End-to-endmodel • MitigatetheeffectsofnoisyoutputfromNLU • RefineNLUbysupervisedsignalsfromDM • Multi-taskjointlylearning • NLU - Userintentclassification • NLU - Userslottagging • DM - Systemactionprediction • Contextualunderstanding • Accesstotheuser history • Monitoruserbehaviorstatesoverturns

Human-Human Dialogue Interaction Hi, how may I help you? Are there anycheap rate hotels to put my bags? Do you want to have a backpack type of hotel? Yes. Just gonnaleave our things there and stay out the whole day. So you don’t mind if it is not roomy, right? Yes. Guide Agent Tourist User Okay. These hotels are available for you: … Ok, thank you, bye! Thanks, goodbye. Idea: predictingthe next systemactiongiven the current userutterance together with the aggregated observations

Natural Language Understanding(NLU) Utterance:BOS % um how much is a taxi cab there ? EOS SlotTags:O O O B-det_PRICE I-det_PRICE O O B-trsp_TYPE I-trsp_TYPE B-area_CITY O O UserIntents:QST_HOW_MUCH; QST_INFO System Actions:RES_EXPLAIN; RES_INFO; FOL_EXPLAIN; FOL_HOW_MUCH; FOL_INFO Task 1: SlotTagging Task 2: Multi-LabelUserIntentPrediction Shared weights

NLU+DM 1: Pipelined BLSTMs Utterance:BOS % um how much is a taxi cab there ? EOS SlotTags:O O O B-det_PRICE I-det_PRICE O O B-trsp_TYPE I-trsp_TYPE B-area_CITY O O UserIntents:QST_HOW_MUCH; QST_INFO System Actions:RES_EXPLAIN; RES_INFO; FOL_EXPLAIN; FOL_HOW_MUCH; FOL_INFO Task 1+2:Natural Language Understanding Task 3:Multi-Label System Action Prediction single user turn current user turn w/ contextual history

NLU+DM 2: End-to-EndModel(JointModel) Utterance:BOS % um how much is a taxi cab there ? EOS SlotTags:O O O B-det_PRICE I-det_PRICE O O B-trsp_TYPE I-trsp_TYPE B-area_CITY O O UserIntents:QST_HOW_MUCH; QST_INFO System Actions:RES_EXPLAIN; RES_INFO; FOL_EXPLAIN; FOL_HOW_MUCH; FOL_INFO Task 1+2+3:End-to-End Joint NLU+DM DM supervised learning with three tasks DM output signal refines NLU for better robustness

Data • Dialogue State Tracking Challenge 4 • Human-human dialogues: 21-hour dialoguesessionsontouristicinformationcollectedvia Skypebetweentourguidesandtourists User Intent = Speech Act + Attributes System Action = Speech Act + Attributes

DM Result –System Action Prediction (SAP) • Metric: frame-level accuracy (FrmAcc) Human-human conversations are complicated, so predicting system actions for DM is difficult Pipeline-BLSTM and JointModel outperform the baseline JointModelimprovesPipeline-BLSTMabout 10%accuracy,indicating the importance of mitigatingdownsideofpipeline

DM Result –System Action Prediction (SAP) • Metric: frame-level accuracy (FrmAcc) Fully correct NLU output Oracle modelsshow the upper-bound of the SAP performance, since it transfers the errors from NLU to SAP Contextual user turns make significant contribution to DM performance JointModel achieves the best DM performance (FrmAcc) with richer latent representations

NLU Result – Slot Filling & Intent Prediction • Metrics: frame-level accuracy (FrmAcc) CRF+SVMs baseline maintains strong NLU performance with 33.1% Pipeline-BLSTM and JointModel outperformed the baseline Extra supervised DM signal helps refine the NLU by back-propagating the associated errors DM signal (system action prediction) helps more on user intent prediction than slot filling, and NLU is significantly improved

Conclusion • First propose an end-to-end deep hierarchical model for joint NLU and DM with limitedcontextualdialoguememory • Leverage multi-tasklearningusingthreesupervisedsignals • NLU: Userintentclassification • NLU: Slottagging • DM: Systemactionprediction • Outperform the state-of-the-art pipelined NLU and DM models • Better DM due to the contextual dialogue memory • Robust NLU fine-tuned by supervised signal from DM

Thanks for Your Attention!  Code Available at https://github.com/XuesongYang/end2end_dialog

Appendix - BaselineModel Predictingsystemactionsatthenextturnasresponsestothecurrent userbehaviorsbypipeliningNLUandSAPtogether NLU:CRFforslottagging,and One-Vs-AllSVMsforintentclassification SAP:One-Vs-AllSVMs [1]Raymond, Christian, and Giuseppe Riccardi. “Generative and discriminative algorithms for spoken language understanding." In INTERSPEECH, pp. 1605-1608. 2007.

Appendix: Configuration • Optimizer:a mini-batch stochastic gradient descent method Adam • Contextualhistory:fiveuserturns • Dimensionofwordembedding:512 • Dropoutratio:0.5 • Noearlystopping,butuse300trainingepochs • Best models for three tasks are selected individually under different metrics • Token-level micro-average F1 score is used for slot filling • frame-level accuracy (it counts only when the whole frame parse is correct) is used for both user intent prediction and system action prediction • the decision thresholds are tuned on dev set

End-to-End Joint Learning of Natural Language Understanding and Dialogue Manager

End-to-End Joint Learning of Natural Language Understanding and Dialogue Manager

Presentation Transcript

Natural Language Understanding

End to End Protocols

Natural Language Understanding

Learning and Inference for Natural Language Understanding

End-to-End Issues

NATURAL LANGUAGE UNDERSTANDING

Dialogue with End-users

Understanding Natural Language

Natural Language Understanding

Understanding Natural Language

Understanding Natural Language

End-to-End and Innovation

Natural Language Understanding

Learning and Inference for Natural Language Understanding

Natural Language Understanding

End to End Quality of Service

Understanding Natural Language:

Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion

Understanding Natural Language

End-to-End and Innovation

Natural Language Understanding

End to End Protocols