Statistical Spoken Dialogue System – Talk 2: Belief tracking
CLARA Workshop
Presented by Blaise Thomson
Cambridge University Engineering Department
brmt2@eng.cam.ac.uk
http://mi.eng.cam.ac.uk/~brmt2
Human-machine spoken dialogue
[Diagram: typical structure of a spoken dialogue system. The user's waveform ("I want a restaurant") passes through the Recognizer (words) and the Semantic Decoder (dialogue acts, e.g. inform(type=restaurant)) into the Dialog Manager; the system's reply act (e.g. request(food)) goes through the Message Generator ("What kind of food do you want?") and the Synthesizer back to the user.]
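A minimal sketch of that pipeline as a processing loop. The component interfaces (recognize, decode, decide, generate, synthesize) are illustrative names invented here, not the actual system's API.

    # Hypothetical component interfaces mirroring the pipeline diagram above.
    def dialogue_turn(audio_in, recognizer, semantic_decoder, dialog_manager,
                      message_generator, synthesizer):
        words = recognizer.recognize(audio_in)         # waveform -> word hypotheses
        user_acts = semantic_decoder.decode(words)     # words -> N-best dialogue acts
        system_act = dialog_manager.decide(user_acts)  # e.g. request(food)
        reply_text = message_generator.generate(system_act)
        return synthesizer.synthesize(reply_text)      # text -> waveform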
Outline
• Introduction
  • An example user model (a spoken dialogue model)
  • The Partially Observable Markov Decision Process (POMDP)
  • POMDP models for dialogue systems
  • POMDP models for off-line experiments
  • POMDP models for simulation
• Inference
  • Belief propagation (fixed parameters)
  • Expectation Propagation (learning parameters)
  • Optimisations
• Results
Intro – An example user model
• Partially Observable Markov Decision Process (POMDP)
• A probabilistic model of what the user will say
• Variables:
  • Dialogue state, st (e.g. the user wants a restaurant)
  • System action, at (e.g. "What type of food?")
  • Observation of what was said, ot (e.g. an N-best semantic list)
• Assumes an Input-Output Hidden Markov structure
[Diagram: an input-output HMM over states s1, s2, ..., sT, actions a1, a2, ..., aT and observations o1, o2, ..., oT]
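As a rough illustration of the generative assumption (not the talk's actual code or parameters), the model can be written as two conditional tables, p(st | st-1, at) and p(ot | st), sampled forward through a dialogue. The states, values and probabilities below are toy assumptions, and this sketch even lets the transition ignore the action for brevity.

    import random

    states = ["wants_restaurant", "wants_bar"]
    transition = {  # p(s' | s): toy model where the goal rarely changes, whatever the action
        s: {s2: (0.9 if s2 == s else 0.1) for s2 in states} for s in states
    }
    observation = {  # p(o | s): what the semantic decoder is likely to output
        "wants_restaurant": {"inform(type=restaurant)": 0.8, "inform(type=bar)": 0.2},
        "wants_bar": {"inform(type=restaurant)": 0.2, "inform(type=bar)": 0.8},
    }

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def generate_dialogue(actions, s0="wants_restaurant"):
        s, turns = s0, []
        for a in actions:
            s = sample(transition[s])          # state evolves each turn
            turns.append((a, s, sample(observation[s])))
        return turns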
Intro – Simplifying the POMDP user model
• Typically split the dialogue state, st, into:
  • The true user goal, gt
  • The true user act, ut
[Diagram: the input-output HMM redrawn with separate hidden chains for goals g1, ..., gT and user acts u1, ..., uT, together with the actions a1, ..., aT and observations o1, ..., oT]
Intro – Simplifying the POMDP user model
• Further split the goal, gt, into sub-goals gt,c
  • e.g. the user wants a Chinese restaurant: food=Chinese, type=restaurant
[Diagram: the goal gt decomposed into sub-goals gt,type, gt,food, gt,area, gt,stars]
Intro – Simplifying the POMDP user model
[Diagram: two time slices of the factored network: goals G, G' with sub-goals gtype, gfood and gtype', gfood'; user acts U, U' with sub-acts utype, ufood and utype', ufood'; system actions a, a'; observations o, o'.]
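One way to picture the factorisation in code. This sketch keeps an independent marginal per sub-goal slot, which is a simplifying assumption of this illustration rather than a description of the actual system; slot names and values are invented.

    # Toy factored belief state: one marginal per sub-goal slot.
    belief = {
        "type": {"restaurant": 0.6, "bar": 0.3, "hotel": 0.1},
        "food": {"chinese": 0.5, "indian": 0.3, "italian": 0.2},
    }

    def update_marginal(marginal, likelihood):
        """Multiply a slot marginal by an observation likelihood and renormalise."""
        posterior = {v: p * likelihood.get(v, 1.0) for v, p in marginal.items()}
        z = sum(posterior.values())
        return {v: p / z for v, p in posterior.items()}

    # e.g. the decoder heard evidence for type=bar
    belief["type"] = update_marginal(belief["type"], {"bar": 0.8, "restaurant": 0.2, "hotel": 0.2})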
Intro – POMDP models for dialogue systems
• System: How can I help you?
• User: I'm looking for a beer [0.5] / I'm looking for a bar [0.4]
• System: Sorry, what did you say?
• User: bar [0.3] / bye [0.3]
• When decisions are based on probabilistic user goals, the model is a Partially Observable Markov Decision Process (POMDP)
[Bar charts: the belief over the goals {Beer, Bar, Bye} after each user turn]
Intro – belief model for dialogue systems
• Choose actions according to the beliefs over the goal instead of the most likely hypothesis
• More robust, for some key reasons:
  • The full hypothesis list is used
  • A user model is used
[Bar chart: belief over {Beer, Bar, Bye}, with the system action confirm(beer)]
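A hedged sketch of the difference: acting on the full belief (for example, confirming when no goal is certain enough) rather than committing to the top hypothesis. The threshold policy below is only for illustration; the talk's systems learn their policies with reinforcement learning.

    def act_from_top_hypothesis(nbest):
        # Conventional approach: trust the single best semantic hypothesis.
        return "execute(" + max(nbest, key=nbest.get) + ")"

    def act_from_belief(belief, confirm_threshold=0.8):
        # Belief-based approach: only commit when the belief is confident enough.
        goal, p = max(belief.items(), key=lambda kv: kv[1])
        if p >= confirm_threshold:
            return "execute(" + goal + ")"
        return "confirm(" + goal + ")"

    print(act_from_belief({"beer": 0.55, "bar": 0.35, "bye": 0.10}))  # -> confirm(beer)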
Intro – POMDP models for off-line experiments
• System: How can I help you?
• User: I'm looking for a beer [0.5] / I'm looking for a bar [0.4]
• System: Sorry, what did you say?
• User: bar [0.3] / bye [0.3]
[Bar charts: the belief over {Beer, Bar, Bye} after each turn; the chart values [0.2], [0.7], [0.5], [0.1] were shown on the slide]
Intro – POMDP models for simulation
• Often useful to be able to simulate how people behave:
  • For reinforcement learning
  • For testing a given system
• In theory, simply generate from the POMDP user model
[Diagram: one slice of the network with sampled values gtype = restaurant, gfood = Chinese, system action a = silence(), user act u = inform(type=restaurant)]
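A minimal sketch of that idea under a made-up user-act model: fix a goal, then sample a user act for each system action from p(u | g, a). The distributions and act names below are assumptions for illustration only.

    import random

    # Hypothetical user-act model p(u | goal, system_action); values are invented.
    def user_act_dist(goal, system_action):
        if system_action == "request(food)":
            return {"inform(food=" + goal["food"] + ")": 0.8, "null()": 0.2}
        return {"inform(type=" + goal["type"] + ")": 0.7,
                "inform(food=" + goal["food"] + ")": 0.3}

    def simulate_turn(goal, system_action):
        dist = user_act_dist(goal, system_action)
        return random.choices(list(dist), weights=list(dist.values()))[0]

    goal = {"type": "restaurant", "food": "chinese"}
    print(simulate_turn(goal, "silence()"))      # e.g. inform(type=restaurant)
    print(simulate_turn(goal, "request(food)"))  # e.g. inform(food=chinese)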
An example – voicemail
• We have a voicemail system with 2 possible user goals:
  • g = SAVE: the user wants to save
  • g = DEL: the user wants to delete
• In each turn, until we save or delete, we observe one of two things:
  • o = OSAVE: the user said "save"
  • o = ODEL: the user said "delete"
• We assume the goal can change between turns, and for the moment we only look at two turns
• We start completely unsure what the user wants
An example – exercise
• Observation probability: P(o | g)
• If we observe the user saying they want to save, what is the probability that they want to save? i.e. what is P(g1 | o1 = OSAVE)?
• Use Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
An example – exercise
• Observation probability: P(o | g)
• Transition probability: P(g' | g)
• If we observe the user saying they want to save and then saying they want to delete, what is the probability that they want to save in the second turn? i.e. what is P(g2 | o1 = OSAVE, o2 = ODEL)?
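A worked sketch of both exercises with assumed numbers; the talk does not fix the probability tables, so the 0.8/0.9 values below are purely illustrative.

    # Assumed model: these probabilities are invented for the worked example.
    P_o_given_g = {("OSAVE", "SAVE"): 0.8, ("OSAVE", "DEL"): 0.2,
                   ("ODEL", "SAVE"): 0.2, ("ODEL", "DEL"): 0.8}
    P_g2_given_g1 = {("SAVE", "SAVE"): 0.9, ("DEL", "SAVE"): 0.1,   # keyed (g2, g1)
                     ("SAVE", "DEL"): 0.1, ("DEL", "DEL"): 0.9}
    prior = {"SAVE": 0.5, "DEL": 0.5}   # completely unsure to start
    goals = ["SAVE", "DEL"]

    def normalise(d):
        z = sum(d.values())
        return {k: v / z for k, v in d.items()}

    # Exercise 1: P(g1 | o1 = OSAVE) by Bayes' theorem.
    b1 = normalise({g: P_o_given_g[("OSAVE", g)] * prior[g] for g in goals})
    print(b1)   # roughly {'SAVE': 0.8, 'DEL': 0.2} with these numbers

    # Exercise 2: predict through the transition, then correct with o2 = ODEL.
    predicted = {g2: sum(P_g2_given_g1[(g2, g1)] * b1[g1] for g1 in goals) for g2 in goals}
    b2 = normalise({g2: P_o_given_g[("ODEL", g2)] * predicted[g2] for g2 in goals})
    print(b2)   # P(g2 | o1 = OSAVE, o2 = ODEL)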
An example – expanding further
• In general we will want to compute probabilities conditional on the observations (we will call these the data, D)
• This is always a marginal of the joint distribution with the observation values fixed, e.g. P(g2 | D) ∝ Σg1 P(g1, g2, o1 = OSAVE, o2 = ODEL)
• These sums can be computed much more cleverly using dynamic programming
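The dynamic-programming (forward) version of that computation, as a sketch; it reuses the assumed voicemail tables from the earlier worked example and generalises it to any number of turns.

    # Forward recursion: alpha_t(g) is proportional to P(g_t, o_1..o_t).
    # Cost is linear in the number of turns instead of exponential in the
    # number of goal sequences.
    def forward(observations, prior, P_o_given_g, P_g2_given_g1, goals):
        alpha = {g: prior[g] * P_o_given_g[(observations[0], g)] for g in goals}
        for o in observations[1:]:
            alpha = {g: P_o_given_g[(o, g)] *
                        sum(P_g2_given_g1[(g, gp)] * alpha[gp] for gp in goals)
                     for g in goals}
        z = sum(alpha.values())
        return {g: a / z for g, a in alpha.items()}   # P(g_T | D)

    # e.g. forward(["OSAVE", "ODEL"], prior, P_o_given_g, P_g2_given_g1, ["SAVE", "DEL"])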
Belief Propagation
• Interested in the marginals p(x | D)
• Assume the network is a tree, with observations above and below x: D = {Da, Db}
• The marginal then factorises as p(x | D) ∝ p(x | Da) p(Db | x)
[Diagram: node x with the data Da above it and Db below it]
Belief Propagation
• When we split Db = {Dc, Dd}, the factorisation continues: p(Db | x) = p(Dc | x) p(Dd | x)
• These factors are called the messages into x
• We have one message for every probability factor connected to x
[Diagram: node x with Da above it and two separate branches of data, Dc and Dd, below it]
Belief Propagation – message passing
[Diagram: a chain Da – a – b – Db; the probability factor between a and b passes messages in both directions]
Belief Propagation – message passing
[Diagram: a chain of nodes a, b, c with data Da, Db, Dc attached; messages are passed along the chain from factor to factor]
Belief Propagation
• We can do the same thing repeatedly:
  • Start at one end and keep computing p(x | Da)
  • Then start at the other end and keep computing p(Db | x)
  • To get a marginal, simply multiply these (and renormalise)
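A sketch of the full procedure on the voicemail chain, again with the assumed tables from the earlier example: a forward sweep gives messages proportional to P(g_t, o_1..o_t), a backward sweep gives P(o_{t+1}..o_T | g_t), and their product gives every marginal.

    def chain_marginals(observations, prior, P_o_given_g, P_g2_given_g1, goals):
        T = len(observations)
        # Forward messages: fwd[t][g] proportional to P(g_t, o_1..o_t)
        fwd = [dict() for _ in range(T)]
        fwd[0] = {g: prior[g] * P_o_given_g[(observations[0], g)] for g in goals}
        for t in range(1, T):
            fwd[t] = {g: P_o_given_g[(observations[t], g)] *
                         sum(P_g2_given_g1[(g, gp)] * fwd[t - 1][gp] for gp in goals)
                      for g in goals}
        # Backward messages: bwd[t][g] = P(o_{t+1}..o_T | g_t)
        bwd = [dict() for _ in range(T)]
        bwd[T - 1] = {g: 1.0 for g in goals}
        for t in range(T - 2, -1, -1):
            bwd[t] = {g: sum(P_g2_given_g1[(gn, g)] *
                             P_o_given_g[(observations[t + 1], gn)] * bwd[t + 1][gn]
                             for gn in goals)
                      for g in goals}
        # Multiply and normalise to get each marginal P(g_t | D)
        marginals = []
        for t in range(T):
            m = {g: fwd[t][g] * bwd[t][g] for g in goals}
            z = sum(m.values())
            marginals.append({g: v / z for g, v in m.items()})
        return marginals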
Belief Propagation – our example
• Write the probabilities as vectors, with SAVE on top
[Worked slide: the forward and backward messages for the two-turn voicemail network g1 → g2 with observations o1, o2, written out as 2-element vectors]
Parameter Learning – The problem
[Diagram: the same two-slice factored network as before (goals G, G' with sub-goals gtype, gfood; user acts U, U' with sub-acts utype, ufood; actions a, a'; observations o, o')]
Parameter Learning – The problem
• For every (action, goal, goal) triple there is a parameter: the parameters form a probability table P(gt | gt-1, at)
  → need to tie parameters
• The goals are all hidden and factorized, and there are many of them
  → must allow for factorized hidden variables
Parameter Learning – Some options
• Hand-craft
  • Roy et al., Zhang et al., Young et al., Thomson et al., Bui et al.
• Annotate the user goal and use Maximum Likelihood
  • Williams et al., Kim et al., Henderson & Lemon
  • Isn't always possible
• Expectation Maximisation
  • Doshi & Roy (7 states), Syed et al. (no goal changes)
  • Uses an unfactorised state
  • Intractable
• Expectation Propagation (EP)
  • Allows parameter tying (details in the paper)
  • Handles factorized hidden variables
  • Handles large state spaces
  • Doesn't require any annotations (including of the user act), though it does use the semantic decoder output
Belief Propagation as message passing
[Diagram: a probability factor connecting a and b, with data Da above a and Db below b]
• Message from outside the factor, q\(a): the input message from above a
• Message from this factor to b: q*(b)
• Message from outside the factor, q\(b): the product of the input messages below b
• Message from this factor to a: q*(a)
Belief Propagation as message passing
• Think in terms of approximations contributed by each probability factor
[Diagram: the factor between a and b, annotated with the four messages q\(a), q*(a), q\(b), q*(b)]
• Message from outside the network: q\(a) = p(a | Da)
• Message from this factor: q*(b) = p(b | Da)
• Message from outside the network: q\(b) = p(Db | b)
• Message from this factor: q*(a) = p(Db | a)
Belief Propagation – Unknown parameters?
• Imagine the parameters θ can only take a discrete set of values
• Integrate the parameter factor over our estimate from the rest of the network: q*(b) = Σθ q\(θ) Σa q\(a) p(b | a, θ)
• To estimate θ, we want to sum over a and b: q*(θ) = Σa Σb q\(a) p(b | a, θ) q\(b)
Belief Propagation – Unknown parameters?
• But we actually have continuous parameters: θ is a probability table, so each row lies on a simplex
• Integrate over our estimate from the rest of the network: q*(b) = ∫ q\(θ) Σa q\(a) p(b | a, θ) dθ
• To estimate θ, we still sum over a and b: q*(θ) = Σa Σb q\(a) p(b | a, θ) q\(b)
Expectation Propagation • This doesn’t make sense – q is a probability! • Multiplying by q\(q) gives: • Choose q*(q) to minimize KL divergence with this • If we restrict ourselves to Dirichlet distributions, we need to find the Dirichlet that best matches a mixture of Dirichlets
Expectation Propagation – Example
[Diagram: the parameter node θ sits above a two-slice network with goals g, g' (sub-goals gtype, gtype'), actions a, a', user acts u, u' and observations o, o'. The slide sequence animates one pass of message passing:]
• The observation factor sends its evidence upwards: p(o | inform(type=bar)) [0.5], p(o | inform(type=hotel)) [0.2]
• The user act node passes this on: inform(type=bar) [0.5], inform(type=hotel) [0.2]
• The factor between the goal and the user act (annotated p(u=bar | g) 0.4, p(u=hotel | g) 0.1 on the slide) produces a message into the goal: type=bar [0.45], type=hotel [0.18]
• Combining this with the other incoming messages gives the updated goal belief: type=bar [0.44], type=hotel [0.17]
• The next turn's observation then arrives: p(o' | inform(type=bar)) [0.6], p(o' | inform(type=rest)) [0.3]
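For concreteness, a sketch of the kind of message computation happening at the user-act factor, with a made-up act model p(u | g); the numbers are assumptions and do not reproduce the slide's values.

    # Hypothetical act model p(u | goal): invented numbers for illustration.
    p_u_given_g = {
        ("inform(type=bar)",   "bar"):   0.8, ("inform(type=bar)",   "hotel"): 0.1,
        ("inform(type=hotel)", "bar"):   0.1, ("inform(type=hotel)", "hotel"): 0.8,
        ("null()",             "bar"):   0.1, ("null()",             "hotel"): 0.1,
    }
    # Incoming message about the user act (from the observation factor)
    q_u = {"inform(type=bar)": 0.5, "inform(type=hotel)": 0.2, "null()": 0.3}

    # Message into the goal g: sum over the user act u of p(u | g) * q(u)
    msg_to_goal = {
        g: sum(p_u_given_g[(u, g)] * q_u[u] for u in q_u)
        for g in ("bar", "hotel")
    }
    print(msg_to_goal)   # e.g. {'bar': 0.45, 'hotel': 0.24}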
Expectation Propagation – Optimisation 1
• In dialogue systems, most values are equally likely
• We can use this to reduce computation:
  • Compute the q distributions only once per group of equally likely values
  • Multiply by the group size instead of summing the same value repeatedly
[Bar chart: example 'Two stars please', belief over the number of stars (1 to 5)]
Expectation Propagation – Optimisation 2
• For each value, assume the transition probability to most other values is the same (a mostly constant factor)
  • e.g. a constant probability of the goal changing
• The reduced number of parameters means we can speed up learning too! (A sketch combining both optimisations follows.)
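A hedged sketch of how the two optimisations combine for a single slot: with a constant probability of change, and with all "unmentioned" values grouped into one shared belief, the update only touches the handful of distinct values plus the shared one. The slot, numbers and function below are invented for illustration.

    # Belief update for one slot with N possible values, where the transition model
    # is "stay with prob 1-c, otherwise move to any other value uniformly".
    def grouped_update(belief_distinct, uniform_value, n_uniform, likelihood, change_prob, N):
        """belief_distinct: beliefs of explicitly tracked values;
           uniform_value:   the shared belief of every other (grouped) value."""
        total = sum(belief_distinct.values()) + uniform_value * n_uniform   # ~1.0
        def transition(b):
            # p'(v) = (1-c) b(v) + c * (total - b(v)) / (N - 1)
            return (1 - change_prob) * b + change_prob * (total - b) / (N - 1)
        new = {v: likelihood.get(v, 1.0) * transition(b) for v, b in belief_distinct.items()}
        new_uniform = transition(uniform_value)    # no evidence for the grouped values
        z = sum(new.values()) + new_uniform * n_uniform
        return ({v: p / z for v, p in new.items()}, new_uniform / z)

    # e.g. 100 possible values, only 2 of them mentioned so far
    print(grouped_update({"3": 0.4, "4": 0.3}, 0.3 / 98, 98, {"3": 0.9}, 0.05, 100))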
Results – Computation times
[Chart: computation time with no optimisation, with grouping, with the constant-change assumption, and with both]
Results – Simulated re-ranking
• Train on 1000 simulated dialogues
• Re-rank the simulated semantics on 1000 dialogues
• Oracle accuracy is 93.5%
• TAcc – semantic accuracy of the top hypothesis
• NCE – Normalized Cross Entropy score (confidence scores)
• ICE – Item Cross Entropy score (accuracy + confidence)
Results – Data re-ranking
• Train on the Mar09 TownInfo trial data (720 dialogues)
• Test on the Feb08 TownInfo trial data (648 dialogues)
• Oracle accuracy is 79.2%
Results – Simulated dialogue management
• Use reinforcement learning (the Natural Actor Critic algorithm) to train two systems:
  • One uses hand-crafted parameters
  • One uses parameters learned from 1000 simulated dialogues
Results – Live evaluations (control)
• Tested in the Spoken Dialogue Challenge
• Provides bus timetables in Pittsburgh
• 800 road names (pairs of roads represent a stop); required to get the place from, the place to, and the time
• All parameters of the Cambridge system were hand-crafted
Results – Live evaluations (control)
[Chart: estimated success rate against word error rate (WER), with individual successes and failures marked for the Cambridge (CAM) system and for the baseline]
Summary
• POMDP models are an effective model of dialogue:
  • For use in dialogue systems
  • For re-ranking semantic hypotheses off-line
• Expectation Propagation allows parameter learning for complex models, without annotations of the dialogue state
• Experiments show:
  • EP gives improvements in re-ranked hypotheses
  • EP gives improvements in simulated dialogue management performance
  • Probabilistic beliefs give improvements in live dialogue management performance