
Lecture 10 SeqGAN, Chatbot, Reinforcement Learning


Presentation Transcript


  1. Lecture 10 SeqGAN, Chatbot, Reinforcement Learning

  2. Based on the following two papers • L. Yu, W. Zhang, J. Wang, Y. Yu, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017. • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky, Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547v4, 2017. • And H.-Y. Lee's lecture notes.

  3. Maximizing Expected Reward
  • Setup (slide diagram): an encoder + generator produces a response and a human scores it, standing in for the discriminator; the score is used to update θ.
  • We wish to maximize the expected reward: θ* = arg max_θ γ_θ, where
    γ_θ = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ≈ (1/N) Σ_{i=1..N} R(h_i, x_i)   ---- by sampling (h_1,x_1), …, (h_N,x_N)
  • But now, how do we differentiate this?
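
A minimal sketch (not from the slides) of the sampling approximation above: estimate γ_θ by drawing N pairs (h_i, x_i) and averaging R(h_i, x_i). The functions sample_history, sample_response and reward are hypothetical stand-ins for the dialogue-history distribution, the generator, and the human/discriminator scorer.

```python
import random

def estimate_expected_reward(sample_history, sample_response, reward, N=1000):
    total = 0.0
    for _ in range(N):
        h = sample_history()       # h_i ~ P(h)
        x = sample_response(h)     # x_i ~ P_theta(x|h)
        total += reward(h, x)      # R(h_i, x_i)
    return total / N               # (1/N) * sum_i R(h_i, x_i)

# Toy usage: two possible histories, a generator that answers at random,
# and a reward of 1 when the response echoes the history.
est = estimate_expected_reward(
    sample_history=lambda: random.choice(["hi", "bye"]),
    sample_response=lambda h: random.choice(["hi", "bye"]),
    reward=lambda h, x: 1.0 if x == h else 0.0,
)
print(est)   # ~0.5 for this toy setup
```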

  4. Policy gradient
    γ_θ = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ≈ (1/N) Σ_{i=1..N} R(h_i, x_i)
    ∇γ_θ = Σ_h P(h) Σ_x R(h,x) ∇P_θ(x|h)
         = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ∇P_θ(x|h) / P_θ(x|h)
         = Σ_h P(h) Σ_x R(h,x) P_θ(x|h) ∇log P_θ(x|h)
         ≈ (1/N) Σ_{i=1..N} R(h_i, x_i) ∇log P_θ(x_i|h_i)   ---- by sampling
  But how do we do this?
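
To see that the log-derivative step is sound, here is a small numerical check (my own toy example, not from the papers): on a 3-action softmax policy with fixed rewards, the sampled estimate (1/N) Σ_i R(x_i) ∇log P_θ(x_i) converges to the exact gradient of the expected reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])       # logits of a toy categorical "policy"
R = np.array([1.0, 0.0, 2.0])            # fixed reward per action

def probs(t):
    e = np.exp(t - t.max())
    return e / e.sum()

p = probs(theta)

# Exact gradient of E[R] = sum_x R(x) P_theta(x) with respect to the logits
exact = R * p - p * (R @ p)

# Score-function (policy-gradient) estimate: (1/N) sum_i R(x_i) * grad log P_theta(x_i)
N = 200_000
xs = rng.choice(3, size=N, p=p)
grad_log = -np.tile(p, (N, 1))
grad_log[np.arange(N), xs] += 1.0        # grad log softmax = one_hot(x) - p
estimate = (R[xs, None] * grad_log).mean(axis=0)

print(exact)
print(estimate)                           # the two should closely agree
```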

  5. Policy gradient
  • Gradient ascent: θ_new ← θ_old + η ∇γ_θ_old, with
    ∇γ_θ ≈ (1/N) Σ_{i=1..N} R(h_i, x_i) ∇log P_θ(x_i|h_i)
  • Note:
    1. Without R(h,x), this is maximum likelihood.
    2. Without R(h,x), we know how to do this.
    3. To approximate this, we can: if R(h_i,x_i) = k, repeat (h_i,x_i) k times; if R(h_i,x_i) = −k, repeat (h_i,x_i) k times with learning rate −η.
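
Below is a hedged sketch of one such gradient-ascent step, written as a reward-weighted maximum-likelihood update in PyTorch (my framing of the formula above, not the authors' code). The single Linear layer stands in for the seq2seq chatbot, and the random rewards stand in for R(h_i, x_i) from the human or discriminator; with all rewards equal to 1 the step reduces to ordinary maximum likelihood.

```python
import torch

vocab, hidden, batch = 50, 32, 8
generator = torch.nn.Linear(hidden, vocab)          # stand-in for the decoder's output layer
opt = torch.optim.SGD(generator.parameters(), lr=0.01)

enc_state = torch.randn(batch, hidden)              # hypothetical encoder states for h_i
tokens = torch.randint(0, vocab, (batch,))          # sampled response tokens x_i
rewards = torch.rand(batch)                         # R(h_i, x_i)

log_p = torch.log_softmax(generator(enc_state), dim=-1)
log_p_x = log_p[torch.arange(batch), tokens]        # log P_theta(x_i | h_i)

loss = -(rewards * log_p_x).mean()                  # ascent on (1/N) * sum_i R * log P
opt.zero_grad()
loss.backward()
opt.step()                                          # theta_new <- theta_old + eta * grad
```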

  6. If R(h_i,x_i) is always positive
  • Because P_θ(x|h) is a probability and must sum to 1, every sampled pair has its probability pushed up, so responses that happen not to be sampled have their probability pushed down, even when they are good.
  • (Slide figure: bar charts of P_θ(x|h) over (h,x1), (h,x2), (h,x3), comparing the ideal case with what happens due to sampling when one response is not sampled.)

  7. Solution: subtract a baseline
  • If R(h_i,x_i) is always positive, we subtract a baseline b:
    (1/N) Σ_{i=1..N} R(h_i, x_i) ∇log P_θ(x_i|h_i)   →   (1/N) Σ_{i=1..N} (R(h_i, x_i) − b) ∇log P_θ(x_i|h_i)
  • (Slide figure: P_θ(x|h) over (h,x1), (h,x2), (h,x3); after subtracting the baseline, sampled responses with below-baseline reward are pushed down, so not-sampled responses are no longer penalized by default.)
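
A toy numerical illustration (my own, not from the slides) of why the baseline helps: with always-positive rewards, subtracting b = E[R] leaves the expected gradient unchanged, makes below-average samples push their probability down, and reduces the variance of the estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])            # current P_theta(x|h)
R = np.array([2.0, 1.5, 1.8])            # always-positive rewards
b = float(R @ p)                         # baseline: expected reward under p

def grad_estimate(baseline, N=50_000):
    xs = rng.choice(3, size=N, p=p)
    g = -np.tile(p, (N, 1))
    g[np.arange(N), xs] += 1.0           # grad log P_theta(x)
    w = (R[xs] - baseline)[:, None]      # (R - b)
    samples = w * g
    return samples.mean(axis=0), samples.std(axis=0)

mean_no_b, std_no_b = grad_estimate(0.0)
mean_b, std_b = grad_estimate(b)
print(mean_no_b, mean_b)                 # nearly identical expected gradients
print(std_no_b, std_b)                   # lower per-sample variance with the baseline
```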

  8. Chatbot by SeqGAN
  • Let's replace the human with a discriminator, using the reward function
    R(h,x) = λ1·r1(h,x) + λ2·r2(h,x) + λ3·r3(h,x)
    where r1 encourages continuation, r2 rewards saying something new, and r3 rewards semantic coherence.
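
A minimal sketch of how such a weighted reward could be wired together; the three scoring functions and the weights below are illustrative placeholders, not the values used in the paper.

```python
def combined_reward(h, x, r1, r2, r3, lam=(0.3, 0.3, 0.4)):
    # R(h,x) = lambda1*r1(h,x) + lambda2*r2(h,x) + lambda3*r3(h,x)
    return lam[0] * r1(h, x) + lam[1] * r2(h, x) + lam[2] * r3(h, x)

# Toy usage with dummy scorers:
R = combined_reward("how are you", "fine, and you?",
                    r1=lambda h, x: 0.9,    # encourages continuation
                    r2=lambda h, x: 0.7,    # says something new
                    r3=lambda h, x: 0.8)    # semantic coherence
print(R)
```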

  9. Chatbot by conditional GAN
  • The chatbot is an encoder–decoder (En–De): it takes the input sentence / dialogue history h and produces the response sentence x.
  • The discriminator takes the pair (h, x) and judges whether the response is real (from human dialogues) or fake (generated).
  • (Image source: http://www.nipic.com/show/3/83/3936650kd7476069.html)

  10. Can we do backpropagation?
  • The chatbot (En–De, decoding from <BOS>) emits discrete tokens (e.g. A, B, A), and the discriminator maps the response to a scalar score used to update the generator.
  • The problem (ignoring the sampling process): tuning the generator a little bit will not change the discrete output tokens, so no useful gradient reaches the generator.
  • Alternative: improved WGAN.
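
A tiny illustration (my own) of the problem on this slide: the emitted token comes from a discrete argmax/sampling step, so a small change to the generator's scores usually leaves the output token unchanged and provides no gradient signal to follow.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])                           # decoder scores for tokens A, B, C
before = int(np.argmax(logits))                              # token "A"
after = int(np.argmax(logits + np.array([0.0, 1e-3, 0.0])))  # slightly tuned generator
print(before, after)                                         # same token: the tiny update is invisible to D
```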

  11. SeqGAN solution, using RL
  • Use the output (score) of the discriminator as the reward, and update the generator to increase the discriminator score, i.e. to get maximum reward:
    ∇γ_θ ≈ (1/N) Σ_{i=1..N} R(h_i, x_i) ∇log P_θ(x_i|h_i)
  • Different from typical RL: the discriminator (the reward function) is itself updated during training.

  12. Alternating g-step and d-step
  • g-step: with the current generator θ_t, sample (h_1,x_1), …, (h_N,x_N) and score each pair with the discriminator to get R(h_i,x_i). New objective: (1/N) Σ_{i=1..N} R(h_i, x_i) log P_θ(x_i|h_i), updated by
    θ_{t+1} ← θ_t + η ∇γ_θt,   with ∇γ_θ ≈ (1/N) Σ_{i=1..N} R(h_i, x_i) ∇log P_θ(x_i|h_i)
  • d-step: retrain the discriminator on human responses (real) versus generated responses (fake).
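
A structural sketch of this alternating loop, assuming hypothetical generator/discriminator/data objects (their methods below are placeholders, not a real API):

```python
def train_adversarial(generator, discriminator, dialogues,
                      n_iters=1000, g_steps=1, d_steps=5, batch_size=32):
    for _ in range(n_iters):
        # g-step: sample (h_i, x_i), score with D, apply the policy-gradient update
        for _ in range(g_steps):
            hs = dialogues.sample_histories(batch_size)
            xs = [generator.sample(h) for h in hs]
            rewards = [discriminator.score(h, x) for h, x in zip(hs, xs)]
            generator.policy_gradient_step(hs, xs, rewards)   # theta_{t+1} <- theta_t + eta * grad

        # d-step: train D to label human responses as real and generated ones as fake
        for _ in range(d_steps):
            real = dialogues.sample_pairs(batch_size)
            fake = [(h, generator.sample(h)) for h in dialogues.sample_histories(batch_size)]
            discriminator.update(real_examples=real, fake_examples=fake)
```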

  13. Rewarding a sentence vs. a word
  • Consider the example: h_i = “what is your name”, x_i = “I do not know”.
  • Then log P_θ(x_i|h_i) = log P_θ(x_i^1|h_i) + log P_θ(x_i^2|h_i, x_i^1) + log P_θ(x_i^3|h_i, x_i^{1:2}) + …
  • A low sentence-level reward pushes down every word of x_i; but if the good answer is x = “I am Ming Li”, the probability of the word “I” should actually go up.
  • With a lot of sentences to balance this out, sentence-level reward is usually OK; but when there are not enough samples, we can assign the reward at the word level.
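
The decomposition above is just the chain rule for sequence probabilities; a small sketch, with a dummy per-word model as a stand-in for the decoder:

```python
import math

def sentence_log_prob(h, x_words, word_log_prob):
    """log P_theta(x|h) = sum_t log P_theta(x_t | h, x_1..t-1)."""
    total = 0.0
    for t, w in enumerate(x_words):
        total += word_log_prob(w, h, x_words[:t])
    return total

# Toy usage: a dummy model that gives every word probability 0.25
lp = sentence_log_prob("what is your name",
                       ["I", "do", "not", "know"],
                       lambda w, h, prefix: math.log(0.25))
print(lp)   # = 4 * log(0.25)
```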

  14. Rewarding at the word level
  • The sentence-level gradient was:
    ∇γ_θ ≈ (1/N) Σ_{i=1..N} (R(h_i, x_i) − b) ∇log P_θ(x_i|h_i)
  • Changed to the word level:
    ∇γ_θ ≈ (1/N) Σ_{i=1..N} Σ_{t=1..T} (Q(h_i, x_i^{1:t}) − b) ∇log P_θ(x_i^t | h_i, x_i^{1:t−1})
  • How do we estimate Q? Monte Carlo.
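
A hedged PyTorch rendering of the word-level objective for a single sampled reply (my sketch, not the authors' code): each token's log-probability is weighted by its advantage Q(h_i, x_i^{1:t}) − b.

```python
import torch

def word_level_loss(token_log_probs, q_values, baseline):
    """token_log_probs, q_values: tensors of shape (T,) for one sampled reply."""
    advantages = q_values - baseline                # Q(h_i, x_i^{1:t}) - b
    return -(advantages * token_log_probs).sum()    # minimize the negative weighted log-likelihood

# Toy usage for the 4-word reply "I do not know" (numbers are made up):
log_p = torch.log(torch.tensor([0.2, 0.5, 0.6, 0.7]))
q = torch.tensor([0.5, 0.4, 0.3, 0.3])              # word-level Q estimates (see next slide)
loss = word_level_loss(log_p, q, baseline=0.4)
print(loss)
```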

  15. Monte Carlo estimation of Q
  • How do we estimate Q(h_i, x_i^{1:t})? E.g. Q(“what is your name?”, “I”).
  • Sample sentences starting with “I” using the current generator and evaluate them with the discriminator:
    x^A = “I am Ming Li”, D(h_i, x^A) = 1.0
    x^B = “I am happy”, D(h_i, x^B) = 0.1
    x^C = “I don’t know”, D(h_i, x^C) = 0.1
    x^D = “I am superman”, D(h_i, x^D) = 0.8
  • Average the scores: Q(h_i, “I”) = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
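
A sketch of this rollout procedure, assuming hypothetical generator.rollout and discriminator.score interfaces (placeholders, not a real API): complete the prefix several times with the current generator and average the discriminator scores of the finished sentences.

```python
def monte_carlo_q(h, prefix, generator, discriminator, n_rollouts=4):
    scores = []
    for _ in range(n_rollouts):
        sentence = generator.rollout(h, prefix)      # finish the sentence from the prefix
        scores.append(discriminator.score(h, sentence))
    return sum(scores) / len(scores)

# Slide example: prefix "I" for h = "what is your name?",
# four rollouts scored 1.0, 0.1, 0.1, 0.8  ->  Q(h, "I") = 0.5
```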

  16. Experiments with the chatbot
  • “Reinforce” = SeqGAN with reinforcement learning at the sentence level.
  • “REGS Monte Carlo” = SeqGAN with RL at the word level (reward for every generation step, estimated by Monte Carlo rollouts).

  17. Example results from Li et al. 2016 (Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)
