Lecture 10: SeqGAN, Chatbot, Reinforcement Learning
Based on the following two papers:
• L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI, 2017.
• J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv:1701.06547v4, 2017.
• And H.-Y. Lee's lecture notes.
Maximizing Expected Reward
(Slide figure: an encoder–generator chatbot whose response is scored by a human, standing in for the discriminator; the score is used to update θ.)
We wish to maximize the expected reward:
θ* = arg max_θ γ_θ, where
γ_θ = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ≈ (1/N) Σ_{i=1..N} R(h^i, x^i), by sampling (h^1,x^1), …, (h^N,x^N).
But now, how do we differentiate this?
Policy gradient
γ_θ = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ≈ (1/N) Σ_{i=1..N} R(h^i, x^i)
∇γ_θ = Σ_h P(h) Σ_x R(h,x) ∇P_θ(x|h)
     = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ∇P_θ(x|h) / P_θ(x|h)
     = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ∇log P_θ(x|h)
     ≈ (1/N) Σ_{i=1..N} R(h^i, x^i) ∇log P_θ(x^i|h^i)   (by sampling)
But how do we do this?
Policy gradient
• Gradient ascent: θ_new ← θ_old + η ∇γ_{θ_old}
∇γ_θ ≈ (1/N) Σ_{i=1..N} R(h^i, x^i) ∇log P_θ(x^i|h^i)
Note:
1. Without R(h,x), this is maximum likelihood.
2. Without R(h,x), we know how to do this.
3. To approximate this, we can: if R(h^i,x^i) = k, repeat (h^i,x^i) k times; if R(h^i,x^i) = −k, repeat (h^i,x^i) k times with learning rate −η.
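A minimal sketch of this update in PyTorch, assuming a toy linear "generator" over a tiny response vocabulary; all names, shapes, and the dummy reward are illustrative, not from the papers:

```python
import torch
from torch.distributions import Categorical

# Toy "generator": a linear layer mapping a history encoding h to logits
# over a tiny response vocabulary. Purely illustrative.
policy = torch.nn.Linear(8, 5)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def reward(h, x):
    # Placeholder for R(h, x); in SeqGAN this score comes from the
    # discriminator, here it is just random.
    return torch.rand(())

h = torch.randn(16, 8)                # N = 16 sampled histories h^i
dist = Categorical(logits=policy(h))  # P_theta(x | h)
x = dist.sample()                     # sampled responses x^i
R = torch.stack([reward(h[i], x[i]) for i in range(len(x))])

# Surrogate objective (1/N) sum_i R(h^i, x^i) log P_theta(x^i | h^i);
# its gradient is the sampled policy-gradient estimate, so one optimizer
# step performs theta <- theta + eta * grad.
loss = -(R * dist.log_prob(x)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```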
If R(h^i,x^i) is always positive
Because P_θ(x|h) is a probability and must sum to one, the ideal case is fine: every sampled pair has its probability pushed up in proportion to its reward, and normalization balances them.
(Slide figure: bar charts of P_θ(x|h) over (h,x_1), (h,x_2), (h,x_3), ideal case vs. with sampling.)
But due to sampling, a pair that happens not to be sampled never has its probability increased, so normalization pushes it down even if it is a good response.
Solution: subtract a baseline
If R(h^i,x^i) is always positive, we subtract a baseline b:
(1/N) Σ_{i=1..N} R(h^i, x^i) ∇log P_θ(x^i|h^i)  →  (1/N) Σ_{i=1..N} (R(h^i, x^i) − b) ∇log P_θ(x^i|h^i)
With the baseline, sampled pairs with reward below b have their probability pushed down, so unsampled pairs are no longer penalized merely for not being sampled.
(Slide figure: bar charts of P_θ(x|h) before and after subtracting the baseline.)
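In the toy sketch above, the baseline only changes the weights on the log-probabilities; using the batch-average reward as b is one simple choice (an assumption here, not prescribed by the slides):

```python
# Continuing the sketch above: subtract a baseline b from the rewards.
b = R.mean()                                 # batch-average reward as baseline
loss = -((R - b) * dist.log_prob(x)).mean()  # (R - b) weighted log-likelihood
```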
Chatbot by SeqGAN
• Let's replace the human with a discriminator, using the reward function
R(h,x) = λ_1 r_1(h,x) + λ_2 r_2(h,x) + λ_3 r_3(h,x),
where r_1 encourages continuation, r_2 rewards saying something new, and r_3 rewards semantic coherence.
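A minimal sketch of such a combined reward; the three component scores are dummies standing in for the heuristics of Li et al., and the λ weights are made-up values:

```python
# R(h, x) = λ1·r1(h, x) + λ2·r2(h, x) + λ3·r3(h, x)
def r1(h, x):
    return 0.3   # placeholder: encourages continuation of the dialogue

def r2(h, x):
    return 0.5   # placeholder: says something new

def r3(h, x):
    return 0.7   # placeholder: semantic coherence with the history

def combined_reward(h, x, l1=0.25, l2=0.25, l3=0.5):
    return l1 * r1(h, x) + l2 * r2(h, x) + l3 * r3(h, x)

print(combined_reward("what is your name", "I am a chatbot"))  # 0.55
```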
Chatbot by conditional GAN
(Slide figure: the chatbot is an encoder–decoder (En–De) mapping an input sentence/history h to a response sentence x; the discriminator takes the pair (h, x) and decides real or fake, trained against human dialogues. Image: http://www.nipic.com/show/3/83/3936650kd7476069.html)
Can we do backpropagation?
(Slide figure: the chatbot encoder–decoder (En–De) generates a response token by token from <BOS>, e.g. "A B A"; the discriminator reads the generated tokens and outputs a scalar score used to update the chatbot.)
The problem: the generator outputs discrete tokens (ignoring the sampling process), so tuning the generator a little bit will not change the output tokens, and the discriminator's scalar score cannot be backpropagated through them.
Alternative: improved WGAN.
SeqGAN solution, using RL
(Slide figure: the discriminator scores the chatbot's response with a scalar, and that score is used to update the chatbot.)
• Use the output of the discriminator as the reward.
• Update the generator to increase the discriminator's score, i.e. to get maximum reward:
∇γ_θ ≈ (1/N) Σ_{i=1..N} R(h^i, x^i) ∇log P_θ(x^i|h^i)
• Different from typical RL: the discriminator is also updated.
g-step and d-step
g-step: with the discriminator fixed, sample (h^1,x^1), …, (h^N,x^N) and obtain rewards R(h^1,x^1), …, R(h^N,x^N) from the discriminator; the new objective is
(1/N) Σ_{i=1..N} R(h^i, x^i) log P_θ(x^i|h^i),
and the generator is updated by
θ_{t+1} ← θ_t + η ∇γ_{θ_t}, with ∇γ_{θ_t} ≈ (1/N) Σ_{i=1..N} R(h^i, x^i) ∇log P_θ(x^i|h^i).
d-step: train the discriminator to separate real (human) dialogues from fake (generated) ones.
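Putting the two steps together, here is a self-contained toy of the alternating loop in PyTorch: single-token "responses", a linear generator, and a bilinear discriminator. This sketches the training structure only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

VOCAB, HDIM, N = 5, 8, 32
gen = torch.nn.Linear(HDIM, VOCAB)        # toy generator: logits for P_theta(x|h)
disc = torch.nn.Bilinear(HDIM, VOCAB, 1)  # toy discriminator: D(h, x) -> logit
g_opt = torch.optim.SGD(gen.parameters(), lr=1e-2)
d_opt = torch.optim.SGD(disc.parameters(), lr=1e-2)

def one_hot(x):
    return F.one_hot(x, VOCAB).float()

for step in range(100):
    # ---- g-step: policy gradient with the discriminator's score as reward ----
    h = torch.randn(N, HDIM)
    dist = Categorical(logits=gen(h))
    x = dist.sample()
    with torch.no_grad():
        R = torch.sigmoid(disc(h, one_hot(x))).squeeze(1)   # reward in (0, 1)
    g_loss = -((R - R.mean()) * dist.log_prob(x)).mean()    # baseline-subtracted
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # ---- d-step: train the discriminator on real vs. generated pairs ----
    h = torch.randn(N, HDIM)
    x_real = torch.randint(0, VOCAB, (N,))                  # stand-in for human data
    x_fake = Categorical(logits=gen(h)).sample()
    logits = torch.cat([disc(h, one_hot(x_real)), disc(h, one_hot(x_fake))])
    labels = torch.cat([torch.ones(N, 1), torch.zeros(N, 1)])
    d_loss = F.binary_cross_entropy_with_logits(logits, labels)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
```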
Rewarding a sentence vs. a word
• Consider an example: h^i = "what is your name", x^i = "I don't know". Then
log P_θ(x^i|h^i) = log P_θ(x_1^i|h^i) + log P_θ(x_2^i|h^i, x_1^i) + log P_θ(x_3^i|h^i, x_{1:2}^i)
so a low sentence-level reward pushes down every word of "I don't know", including "I".
• But if x = "I am Ming Li", the word "I" should have its probability going up. If there are a lot of sentences to balance this, it is usually OK; but when there are not enough samples, we can assign rewards at the word level.
Rewarding at word level
• Reward at the sentence level was:
∇γ_θ ≈ (1/N) Σ_{i=1..N} (R(h^i, x^i) − b) ∇log P_θ(x^i|h^i)
• Change to the word level:
∇γ_θ ≈ (1/N) Σ_{i=1..N} Σ_{t=1..T} (Q(h^i, x_{1:t}^i) − b) ∇log P_θ(x_t^i | h^i, x_{1:t−1}^i)
• How to estimate Q? Monte Carlo.
Monte Carlo estimation of Q
• How to estimate Q(h^i, x_{1:t}^i)? E.g. Q("what is your name?", "I").
• Sample sentences starting with "I" using the current generator, and use the discriminator to evaluate them:
x^A = "I am Ming Li", D(h^i, x^A) = 1.0
x^B = "I am happy", D(h^i, x^B) = 0.1
x^C = "I don't know", D(h^i, x^C) = 0.1
x^D = "I am superman", D(h^i, x^D) = 0.8
Then Q(h^i, "I") = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5.
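A small sketch of the rollout idea, with dummy stand-ins for the current generator and the discriminator (the completions and scores are illustrative, not from the paper):

```python
import random

def rollout(history, prefix):
    # Stand-in for the current generator completing a partial response.
    endings = [["am", "Ming", "Li"], ["am", "happy"], ["don't", "know"], ["am", "superman"]]
    return prefix + random.choice(endings)

def d_score(history, response):
    # Stand-in for the discriminator D(h, x); a real one would score the pair.
    return random.random()

def mc_q(history, prefix, n_rollouts=4):
    """Estimate Q(h, x_{1:t}): complete the prefix n_rollouts times with the
    current generator and average the discriminator scores."""
    total = 0.0
    for _ in range(n_rollouts):
        total += d_score(history, rollout(history, prefix))
    return total / n_rollouts

print(mc_q("what is your name?", ["I"]))
```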
Experiments of the chatbot
• REINFORCE = SeqGAN with reinforcement learning at the sentence level
• REGS Monte Carlo = SeqGAN with RL at the word level
Example results from Li et al. 2016 (Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)