480 likes | 500 Views
Explore the advancements and potential of open-domain conversational AI in intelligent assistants. Learn about task-oriented dialogue systems, language understanding, dialogue management, and more.
E N D
Towards Open-Domain Conversational AI Yun-Nung (Vivian) Chen 陳縕儂 Http://vivianchen.idv.tW
Iron Man (2008) What can machines achieve now or in the future?
Language Empowering Intelligent Assistants Google Now (2012) Apple Siri (2011) Google Assistant (2016) Microsoft Cortana(2014) Apple HomePod(2017) Facebook M& Bot (2015) Google Home (2016) Amazon Alexa/Echo(2014)
Why and When We Need? Social Chit-Chat Turing Test (talk like a human) Information consumption Task completion Decision support Task-Oriented Dialogues • What is today’s agenda? • What does SLT stand for? • Book me the flight ticket from Taipei to Athens • Reserve a table at Din Tai Fung for 5 people, 7PM tonight • Is SLT conference good to attend? “I want to chat” “I have a question” “I need to get this done” “What should I do?”
Intelligent Assistants Task-Oriented
Task-Oriented Dialogue Systems Baymax – Personal Healthcare Companion JARVIS – Iron Man’s Personal Assistant
Task-Oriented Dialogue Systems(Young, 2000) Speech Signal Hypothesis are there any action movies to see this weekend • Language Understanding (LU) • Domain Identification • User Intent Detection • Slot Filling Speech Recognition Text Input Are there any action movies to see this weekend? Semantic Frame request_movie genre=action, date=this weekend • Dialogue Management (DM) • Dialogue State Tracking (DST) • Dialogue Policy Natural Language Generation (NLG) Text response Where are you located? System Action/Policy request_location Backend Database/ Knowledge Providers
Task-Oriented Dialogue Systems(Young, 2000) Speech Signal Hypothesis are there any action movies to see this weekend • Language Understanding (LU) • Domain Identification • User Intent Detection • Slot Filling Speech Recognition Text Input Are there any action movies to see this weekend? Semantic Frame request_movie genre=action, date=this weekend • Dialogue Management (DM) • Dialogue State Tracking (DST) • Dialogue Policy Natural Language Generation (NLG) Text response Where are you located? System Action/Policy request_location Backend Database/ Knowledge Providers
Language Understanding (LU) Pipelined
Joint Semantic Frame Parsing ht please EOS taiwanese food U U U U ht+1 hT+1 ht-1 W W W W V V V V O FIND_REST B-type O Intent Prediction Slot Filling
Slot-Gated Joint SLU (Goo et al., 2018) Slot Sequence Slot Gate Slot Attention ∑ Intent Attention : matrix for output layer : bias for output layer : slot context vector : intent context vector • will be larger if slot and intent are better related ∙ : trainable matrix : trainable vector : scalar gate value tanh Slot Gate Slot Prediction + ∙ BLSTM BLSTM Word Sequence Word Sequence
Contextual Language Understanding Single Turn just sent email to bob about fishing this weekend U O O O O O S B-contact_name I-subject B-subject I-subject send_email(contact_name=“bob”, subject=“fishing this weekend”) Multi-Turn U1 send email to bob S1 B-contact_name send_email(contact_name=“bob”) U2 are we going to fish this weekend B-message I-message I-message I-message S2 I-message I-message I-message send_email(message=“are we going to fish this weekend”)
E2E MemNN for Contextual LU (Chen et al., 2016) 0.69 0.13 0.16 U: “Let’s do 5:40” U: “i d like to purchase tickets to see deepwater horizon” S: “for which theatre” U: “angelika” S: “you want them for angelika theatre?” U: “yes angelika” S: “how many tickets would you like ?” U: “3 tickets for saturday” S: “What time would you like ?” U: “Any time on saturday is fine” S: “okay , there is 4:10 pm , 5:40 pm and 9:20 pm”
Role-Based & Time-Aware Attention(Su et al., 2018) Sentence-Level Time-Decay Attention Time-Decay Attention Function ( & ) u1 u2 convex concave linear u3 u4 Tourist Guide u5 u6 u7 Current u6 u1 u3 u5 u4 u2 … … wt+1 wt wT Spoken Language Understanding DenseLayer History Summary DenseLayer Role-Level Time-Decay Attention + Time-decay attention significantly improves the understanding results
Task-Oriented Dialogue Systems(Young, 2000) Speech Signal Hypothesis are there any action movies to see this weekend • Language Understanding (LU) • Domain Identification • User Intent Detection • Slot Filling Speech Recognition Text Input Are there any action movies to see this weekend? Semantic Frame request_movie genre=action, date=this weekend • Dialogue Management (DM) • Dialogue State Tracking (DST) • Dialogue Policy Natural Language Generation (NLG) Text response Where are you located? System Action/Policy request_location Backend Action / Knowledge Providers
User Dialogue Policy Optimization Environment Language Understanding Natural Language Generation Reward R Action A Observation O Agent Dialogue Manager Select the best action that maximizes the future reward Dialogue management in a RL framework
E2E Task-Completion Bot (TC-Bot)(Li et al., 2017) w0 LU wi+1 wi+1 wi+1 wi wi wi EOS EOS EOS <intent> <intent> <intent> <slot> <slot> <slot> O O O Natural Language Are there any action movies to see this weekend? Knowledge Database DST NLG User Goal …… w2 EOS w1 Dialogue Policy Learning Dialogue Policy User Model Neural Dialogue System request_location User Simulation … Idea: SL for each component and RL for end-to-end training the neural dialogue system
Reinforcement Learning Agent (Li et al., 2017) User goal – Two tickets for “Deadpool” tomorrow 9PM at AMC Pacific Place 11 theater, Seattle. RULE BASED SYSTEM User: Do you have any ticket for movie ‘Deadpool’ at 9:00 pm? Agent: Which theater do you want? User: AMC pacific place 11 theater, please! Agent: How many tickets do you want? User: Two! Agent: What date? User: Tomorrow! Agent: Which city do you want? User: Seattle! Agent: Do you have any specification for zip code? User: No, I do not care. Agent: Do you have any distance constraint? User: No, I do not care. Agent: How about video format? 2d? Imax 3d? User: Any kind is okay. Agent: Do you have any constraint to price? User: No! Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’. User: Okay, thanks. Agent: Thanks! REINFORCEMENT LEARNING SYSTEM User: I want two tickets for movie ‘Deadpool’. Agent: What is the starttime you prefer? User: 9:00 PM Agent: Which day? User: Tomorrow Agent: Which city? User: Seattle Agent: Which theater do you prefer?User: AMC Pacific Place 11 theater. Agent: Okay, I will issue 2 tickets for you, tomorrow 9:00 pm at AMC pacific place 11 theater, Seattle, movie ‘Deadpool’. User: Okay, thanks. Agent: Thanks! no notion about what requests can be skipped Skip the requests the user may not care about to improve efficiency
RLinDialogueSystems Sampleinefficient,hardtodesignrewardfunction,localoptima… Realusersareexpensive Discrepancybetweenrealusersandsimulators
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018) Semantic Frame Human Conversational Data Real Experience Imitation Learning Supervised Learning DST NLU Controlled Planning Simulated Experience Policy Model Discriminator Discriminator Acting State Representation Discriminative Training DirectReinforcement Learning World Model User User World Model Policy Learning NLG Real Experience World Model Learning System Action (Policy) • Idea • learning with real users with planning • add a discriminator to filter out the bad experiences
D3Q: Discriminative Deep Dyna-Q (Su et al., 2018) The policy learning is more robust and shows the improvement in human evaluation S.-Y. Su, X. Li, J. Gao, J. Liu, and Y.-N. Chen, “Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning," (to appear) in Proc. of EMNLP, 2018.
Task-Oriented Dialogue Systems(Young, 2000) Speech Signal Hypothesis are there any action movies to see this weekend • Language Understanding (LU) • Domain Identification • User Intent Detection • Slot Filling Speech Recognition Text Input Are there any action movies to see this weekend? Semantic Frame request_movie genre=action, date=this weekend • Dialogue Management (DM) • Dialogue State Tracking (DST) • Dialogue Policy Natural Language Generation (NLG) Text response Where are you located? System Action/Policy request_location Backend Action / Knowledge Providers
Natural Language Generation (NLG) inform(name=Seven_Days, foodtype=Chinese) Seven Days is a nice Chinese restaurant • Mapping dialogue acts into natural language
Issues in Neural NLG • Issue • NLG tends to generate shorter sentences • NLG may generate grammatically-incorrect sentences • Solution • Generate word patterns in a order • Consider linguistic patterns
Hierarchical NLG w/ Linguistic Patterns(Su et al., 2018) GRUDecoder 1.Repeat-input2.Inner-LayerTeacherForcing 3. Inter-LayerTeacherForcing 4.CurriculumLearning NearAll Bar OneisamoderatelypricedItalianplaceitiscalledMidsummer House … … moderately a is lastoutput DECODINGLAYER4 4. Others outputfromlastlayer BidirectionalGRUEncoder All Bar OneismoderatelypricedItalianplaceitiscalledMidsummer House … … AllBarOne a is … … priceRange name Italian DECODING LAYER3 … … 3. ADJ + ADV is moderately AllBarOne Semantic 1-hot Representation […1,0,0,1,0,…] All Bar OneispricedplaceitiscalledMidsummer House Input Semantics ENCODER name[Midsummer House], food[Italian], priceRange[moderate], near[All Bar One] 26 DECODING LAYER2 2.VERB All Bar OneplaceitMidsummer House 1. NOUN + PROPN + PRON DECODING LAYER1 HierarchicalDecoder
Evolution Roadmap I feel sad… I’ve got a cold what do I do? Dialogue depth (complexity) Tell me a joke. Single domain systems Extended systems Multi-domain systems Open domain systems What is influenza? Dialogue breadth (coverage)
Dialogue Systems Understanding(NLU) State tracker input x Task-Oriented Dialogue Database Memory External knowledge DB Generation(NLG) Dialog policy output y Statistical model(e.g., neural) DB Understanding(NLU) State tracker input x Fully Data-Driven 28 Generation(NLG) Dialog policy output y
Neural Response Generation(Sordoni et al., 2015; Vinyals & Le, 2015) Source: conversation history Yeah your … because I’m game? EOS of on my encoder decoder Learns to generate dialogues from offline data (no state, action, intent, slot, etc.) way on Yeah I’m my Target: response
Issue 1: Blandness Problem ‘tis a fine brew on a day like this! Strong though, how many is sensible? I'm not sure yet, I'll let you know ! Milan apparently selling Zlatan to balance the books... Where next, Madrid? I don’t know. 32% responses are general and meaningless“I don’t know”“I don’t know what you are talking about” “I don’t think that is a good idea”“Oh my god” Wow sour starbursts really do make your mouth water... mmdrool. Can I have one? Of course! Well he was on in Bromley a while ago... still touring. I don't even know what he's talking about.
MMI for Response Diversity (Li et al., 2016) ‘tis a fine brew on a day like this! Strong though, how many is sensible? Depends on how much you drink! Milan apparently selling Zlatan to balance the books... Where next, Madrid? I think he'd be a good signing. Wow sour starbursts really do make your mouth water... mm drool. Can I have one? Of course you can! They’re delicious! Well he was on in Bromley a while ago... still touring. I’ve never seen him live.
Personalized Response Generation(Li et al., 2016) where do you live EOS in england . D_Gomes25 Jinnmeow3 u.s. london skinnyoflynny2 Rob Rob Rob Rob england TheCharlieZ great Rob_712 Dreamswalls good Tomcoatez Speaker embeddings (70k) Bob_Kelly2 Word embeddings (50k) in england . EOS Kush_322 okay monday live kierongillen5 This_Is_Artful tuesday stay DigitalDan285 The_Football_Bar
Persona Model for Speaker Consistency(Li et al., 2016) Baseline model inconsistency Persona model using speaker embedding consistency
Issue 3: Dialogue-Level Optimization via RL Collect rewards Language understanding Dialogue Manager User input (o) Language (response) generation Optimize Response
Deep RL for Response Generation(Li et al., 2016) Supervised Learning Agent Reinforcement Learning Agent Input message RL agent generates more interactive responses RL agent tends to end a sentence with a questionand hand the conversation over to the user
Issue 4: No Grounding (Sordoni et al., 2015; Li et al., 2016) • No grounding into a real calendar, but the “shape” of the conversation is fluent and plausible. • Neural model learns the general shape of conversations, and the system output is situationally appropriate and coherent. H: would thursday afternoon work sometime? M: yeah , sure . just let me know when you‘re free. H: after lunch is probably the best time M: okay, sounds good . just let me know when you‘re free. H: would 2 pm work for you? M: works for me. H: well let‘s say 2 pm then i ‘ll see you there M: sounds good. H: hi M: how are you? H: not bad M: what's wrong? H: nothing really M: wanna talk about it? i 'm here if you wanna talk H: well, i'm just tired M: me too, i'm here if you wanna talk
Chit-Chat v.s. Task-Oriented The weather is so depressing these days. I know, I dislike rain too. What about a day trip to eastern Washington? Any recommendation? Try Dry Falls, it’s spectacular! 39 Social Chat Engaging, Human-Like Interaction(Ungrounded) Task-Oriented Task Completion, Decision Support(Grounded)
Knowledge-Grounded Responses(Ghazvininejad et al., 2017) Dialogue Encoder Σ Try omakase, the best in town Going to Kusakabe tonight Decoder Response Conversation History Fact Encoder Consistently the best omakase . . Amazing sushi tasting […] . A They were out of kaisui[…] . . . World “Facts” Contextually-Relevant “Facts”
Conversational Agents Chit-Chat Task-Oriented
Evolution Roadmap I feel sad… Empathetic systems I’ve got a cold what do I do? Common sense system Dialogue depth (complexity) Tell me a joke. What is influenza? Knowledge based system Dialogue breadth (coverage)
High-Level Intention Learning(Sun et al., 2016; Sun et al., 2016) Schedule a lunch with Vivian. play music check location find restaurant contact What kind of restaurants do you prefer? The distance is … Should I send the restaurant information to Vivian? Users interact via high-level descriptions and the system learns how to plan the dialogues High-level intention may span several domains
Empathy in Dialogue System(Fung et al., 2016) text speech vision Emotion Recognizer • Embed an empathy module • Recognize emotion using multimodality • Generate emotion-aware responses
Her (2013) What can machines achieve now or in the future?
Thanks for Your Attention! Q & A Yun-Nung (Vivian) Chen Assistant Professor National Taiwan University y.v.chen@ieee.org / http://vivianchen.idv.tw