740 likes | 803 Views
Explore the evolution of spoken dialog systems from academic demos to industrial products. Learn about primitive prototypes, early approaches like ELIZA, and the distinction between open-domain and closed-domain systems. Compare generative and retrieval-based models and discover challenges in dialog control and context incorporation in dialog systems.
E N D
Università di Pisa Chatbots Human Language Technologies Giuseppe Attardi Dipartimento di Informatica Universitàdi Pisa
Evolution of Spoken Dialog Systems Primitive prototype -> Academic demo -> Industrial product Slide by CY Chen
Early Approaches • ELIZA (Weizenbaum, 1966) • Used clever hand-written templates to generate replies that resemble the user’s input utterances • Several programming frameworks available today for building dialog agents d (Marietto et al., 2013, Microsoft, 2017b), Google Assistant
Templates and Rules • hand-written rules to generate replies. • simple pattern matching or keyword retrieval techniques are employed to handle the user’s input utterances. • rules are used to transform a matching pattern or a keyword into a predefined reply. <category> <pattern>What is your name?</pattern> <template>My name is Alice</template> </category > <category> <pattern>I like *</pattern> <template>I too like <star/>.</template> </category >
Open vs Closed Domain Closed Domain (easier) Open Domain (harder) • The user can take the conversation anywhere • No necessarily a well-defined goal or intention • Conversations on social media sites like Twitter and Reddit are typically open domain • The space of possible inputs and outputs is somewhat limited • The system is trying to achieve a very specific goal • Technical Customer Support or Shopping Assistants are examples of closed domain problems
Two Paradigms Generative (harder) Retrieval based (easier) • Use a repository of predefined responses and some heuristic to pick an appropriate response based on the input and context • The heuristic could be as simple as a rule-based expression match, or as complex as an ensemble of ML classifiers. • These systems don’t generate any new text, they just pick a response from a fixed set. • Don’t rely on pre-defined responses • They generate new responses from scratch • Generative models are typically based on transducer techniques • They “transduce” an input into an output (response)
Comparison Generative Retrival-based • No grammatical mistakes. • Unable to handle unseen cases for which no appropriate predefined response exists • Can’t refer back to contextual entity information like names mentioned earlier in the conversation. • can refer back to entities in the input and give the impression of talking to a human • Hard to train • Likely to make grammatical mistakes (especially on longer sentences) • Typically require huge amounts of training data
Long vs Short Conversations The longer the conversation the more difficult to automate it. Long Conversation (harder) Short Conversation (easier) • The goal is to create a single response to a single input. • For example, answering a specific question from a user with an appropriate answer. • bot goes through multiple turns and must keep track of what has been said • Customer support conversations are typically long conversational threads with multiple questions. • Alexa Prize Challenge aims at building a Socialbot capable of engaging users for 20 minutes
Transducer Model • Train machine translation system to perform translation from utterance to response
Data-driven Dialog Response Generation • Lots of filtering, etc., to make sure that the extracted translation rules are reliable • Ritter et al. 2011 Slide by Graham Neubig
Neural Models for Dialog Response Generation • Like other translation tasks, dialog response generation can be done with encoder-decoders • Shang et al. (2015) present the simplest model, translating from previous utterance Slide by Graham Neubig
Templates • Many requests can be answered with templates • Select most relevant response from a collection of templates, determine the slots required to fill it and instantiate the template with those values • Slots filler can be extracted from user utterance or retried with a query
Retrieval-based Chat • Basic idea: given an utterance, find the most similar in the database and return it (Lee et al. 2009) • Similarity based on exact word match, plus extracted features regarding discourse Slide by Graham Neubig
Neural Response Retrieval • Idea: use neural models to soften the connection between input and output and do more flexible matching (Nio et al. 2014) • Model uses Socher et al. (2011) recursiveautoencoder+ dynamic pooling Slide by Graham Neubig
End-to-end Dialog Control • Train an LSTM that takes in text and entities and directly chooses an action to take (reply or API call) (Williams et al. 2017) • Trained using combination of supervised and reinforcement learning Slide by Graham Neubig
Incorporating Context • To produce sensible responses systems may need to incorporate both linguistic context and physical context. • In long dialogs people keep track of what has been said and what information has been exchanged. • Experiments in Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models and Attention with Intention for a Neural Network Conversation Model try to embed a conversation into vectors. Can you tell me why truth is not lie? You perceive truth to be not lie, but is that true? They are very different. Good to know! I will store that info in y database. Don’t you think so? I kind of do, but it depends on the right context
Coherence • Considering only a single previous utterance will lead to locally coherent but globally incoherent output • Need to consider more context! (Sordoni et al. 2015) • Contrast to MT, where context sometimes is helpful (Matsuzaki et al. 2015) and sometimes isn’t (Jean et al. 2017) Slide by Graham Neubig
Simple Solution: Concatenate Utterances • Sordoni et al. (2015) consider one additional previous context utterance concatenated together • Vinyals et al. (2015) just concatenate together all previous utterances and hope an RNN an learn Human: i am not sure i know that Machine: is the network connect prompting for any user certificate Machine: ? Human: yes Machine: are you at home or at work at the moment ? Machine: office ? Human: i am at home Machine: try this Machine: goto <URL> page and enter your name and password Human: done Machine: try logging in with <NAME> and check if its working fine now Human: yes , now it works ! Machine: great Machine: anything else that i can help Human: no , that was very useful Machine: you are other than <NAME> : ) Machine: have a nice night Machine: describe your problem Human: i am having problems accessing the vpn Machine: could you please let me know what are the operating systems you are using ? Human: i am using linux Machine: ok it was in the sslvpn on which you stated it was broken Machine: please check if you are logged in to corp network Machine: what is the error that you are running please Human: i am seeing an error related to vpn Machine: what is the error message that you are getting when connecting to vpn using network connect ? Human: connection refused or something like that Machine: may i know the version of network connect you connect ? Slide by Graham Neubig
Hierarchical Encoder-Decoder Model • Also have utterance-level RNN track overall dialog state (Serban et al. 2016) • 2RNN for words, RNN for utterance • No attention Slide from Graham Neubig
More Varied Responses • For translation, there is lexical variation but content remains Slide by Graham Neubig
Discourse-level VAE Model • Encode entire previous dialog context as latent variable in Variational Autoencoders (VAE) (Zhao et al. 2017) • Also meta-information such as dialog acts Slide by Graham Neubig
Diversity Objective • Basic idea: we want responses that are likely given the context, unlikely otherwise (Li et al. 2016) • Method: subtract weighted unconditioned log probability from conditioned probability (calculated only on first few words) Slide by Graham Neubig
Coherent Personality • If we train on all of our data, our agent will be a mish-mash of personalities (e.g. Li et al. 2016) • We would like our agents to be consistent! Slide by Graham Neubig
Personality Infused Dialog • Train a generation system with controllable “knobs” based on personality traits (Mairesse et al. 2007) • e.g. Extraversion: • Non-neural, but well done and perhaps applicable Slide by Graham Neubig
Speaker Embeddings Speaker Embeddings may learns general facts associated with the specific speaker (Li et al. 2017) For example to the question Where do you live? it might reply with different answers depending on the speaker embedding
Evaluation of Models • Translation uses BLEU score; while imperfect, not horrible • In dialog, BLEU shows very little correlation (Liu et al. 2016) Slide by Graham Neubig
DeltaBLEU • Retrieve good-looking responses, perform human evaluation, up-weight good ones, down-weight bad ones • Galley et al. 2015 Slide by Graham Neubig
Learning to Evaluate • Use context, true response, and actual response to learn a regressor that predicts goodness (Lowe et al. 2017) • Important: similar to model, but has access to reference! • Adversarial evaluation: try to determine whether response is true or fake (Li et al. 2017) • One caveat from MT: learnable metrics tend to overfit Slide by Graham Neubig
Integration with KB • Useful for task oriented dialogs. Not trivial to integrate.
Build Actions with Google Assistant • Experiment with Dialogflow: • https://codelabs.developers.google.com/codelabs/actions-1 • https://codelabs.developers.google.com/codelabs/actions-2 • Actions Console • https://console.actions.google.com/u/0/ • Once the action has been
Key Concepts • Action: An Action is an entry point into an interaction that you build for the Assistant. Users can request your Action by typing or speaking to the Assistant. • Intent: An underlying goal or task the user wants to do; for example, ordering coffee or finding a piece of music. In Actions on Google, this is represented as a unique identifier and the corresponding user utterances that can trigger the intent. • Fulfillment: A service, app, feed, conversation, or other logic that handles an intent and carries out the corresponding Action.
Tools Used • Google Actions • https://console.actions.google.com/ • Dialogflow • https://console.dialogflow.com/api-client/
Test the action on Google Home • Once the action has been setup, click on See how it works on Google Assistant • Now the action can be tested also on Google Home
The Ubuntu Dialog Corpus • Ubuntu Dialog Corpus (github). • One of the largest public dialog datasets available. • Based on chat logs from the Ubuntu channels on a public IRC network. • This paper goes into detail on how exactly the corpus was created. • The training data consists of 1,000,000 examples • 50% positive (label 1) and 50% negative (label 0) • Each example consists of a context, the conversation up to this point, and an utterance, a response to the context • A positive label means that an utterance was an actual response to a context, • a negative label means that the utterance wasn’t — it was picked randomly from somewhere in the corpus.
The dataset has been preprocessed— it has been tokenized, stemmed, and lemmatized using the NLTK tool. • Replaced entities like names, locations, organizations, URLs, and system paths with special tokens. • This preprocessing is likely to improve performance by a few percent. • The average context is 86 words long and the average utterance is 17 words long.
Dual Encoder LSTM • Dual Encoder has been reported to give decent performance on this data set. • Applying other models to this problem would be an interesting project.
Training • Both the context and the response text are split by words, and each word is embedded into a vector. The word embeddings are initialized with Stanford’s GloVe vectors and are fine-tuned during training. • Both the embedded context and response are fed into the same RNN word-by-word. The RNN generates a vector representation that captures the “meaning” of the context and response (c and r in the picture). We can choose how large these vectors should be, but let’s say we pick 256 dimensions. • We multiply c with a matrix M to “predict” a response r’. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training. • The similarity of the predicted response r’ and the actual response r is measured by taking the dot product of these two vectors, aka cosine similarity. We then apply a sigmoid function to convert that score into a probability.