
A reinforcement learning algorithm for sampling design in Markov random fields



Presentation Transcript


  1. A reinforcement learning algorithm for sampling design in Markov random fields. Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN, INRA-MIA Toulouse. E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr. MSTGA, Bordeaux, 7 2012

  2. Plan: Problem statement. General approach. Formulation using a dynamic model. Reinforcement learning solution. Experiments. Conclusions.

  3. PROBLEM STATEMENT: Adaptive sampling
  • Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), …, X(n))
  • c(A) -> cost of observing variables X(A)
  • B -> initial budget
  • Observations are reliable
  [figure: regular 3x3 grid of variables X(1), …, X(9)]

  4. PROBLEM STATEMENT: Adaptive sampling
  • Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), …, X(n))
  • c(A, x(A)) -> cost of observing variables X(A) in state x(A)
  • B -> initial budget
  • Observations are reliable
  Problem: Find a strategy / sampling policy to adaptively select the variables to observe in order to: optimize the quality of the reconstruction of X / respect the initial budget

  5. PROBLEM STATEMENT: Adaptive sampling
  • Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), …, X(n))
  • c(A, x(A)) -> cost of observing variables X(A) in state x(A)
  • B -> initial budget
  • Observations are reliable
  Problem: Find a strategy / sampling policy to adaptively select the variables to observe in order to: optimize the quality of the reconstructed vector X / respect the initial budget

  6. DEFINITION: Adaptive sampling policy. For any sampling plans A1, …, At and observations x(A1), …, x(At), an adaptive sampling policy 𝛿 is a function giving the next variable(s) to observe: 𝛿((A1, x(A1)), …, (At, x(At))) = At+1.

  7. DEFINITION: Adaptive sampling policy. For any sampling plans A1, …, At and observations x(A1), …, x(At), an adaptive sampling policy 𝛿 is a function giving the next variable(s) to observe: 𝛿((A1, x(A1)), …, (At, x(At))) = At+1.
  -- Example -- A1 = 𝛿1 = {8}, x(A1) = 0; A2 = 𝛿2((8,0)) = {6,7}, x(A2) = (1,3)
  [figure: 3x3 grid of variables X(1), …, X(9)]

  8. DEFINITIONS. For any sampling plans A1, …, At and observations x(A1), …, x(At), an adaptive sampling policy 𝛿 is a function giving the next variable(s) to observe: 𝛿((A1, x(A1)), …, (At, x(At))) = At+1.
  -- Example -- A1 = 𝛿1 = {8}, x(A1) = 0; A2 = 𝛿2((8,0)) = {6,7}, x(A2) = (1,3)
  • Vocabulary:
  • A history {(A1, x(A1)), …, (AH, x(AH))} is a trajectory followed when applying 𝛿
  • The set of all reachable histories of 𝛿
  • c(𝛿) ≤ B: the cost of any history respects the initial budget
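To make these definitions concrete, here is a minimal sketch (not from the slides) of rolling out an adaptive sampling policy under a budget; the names run_policy, policy, observe and cost are hypothetical placeholders for 𝛿, the observation process and c(·,·).

```python
# Minimal sketch of rolling out an adaptive sampling policy under a budget B.
# Hypothetical placeholders: `policy` implements delta(history) = A_{t+1},
# `observe` returns the (reliable) observations x(A), `cost` implements c(A, x(A)).

def run_policy(policy, observe, cost, budget):
    history = []                          # [(A1, x(A1)), ..., (At, x(At))]
    spent = 0.0
    while True:
        A_next = policy(history)          # delta((A1,x(A1)),...,(At,x(At))) = A_{t+1}
        if A_next is None:                # the policy decides to stop sampling
            break
        x_next = observe(A_next)          # observations are reliable
        spent += cost(A_next, x_next)     # c(A_{t+1}, x(A_{t+1}))
        history.append((A_next, x_next))
        if spent >= budget:               # a valid policy guarantees c(delta) <= B;
            break                         # this check is only a safeguard for the sketch
    return history
```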

  9. GENERAL APPROACH

  10. GENERAL APPROACH
  • Find a distribution ℙ that well describes the phenomenon under study.
  • Define the value of an adaptive sampling policy.
  • Define an approximate resolution method for finding a near-optimal policy.

  11. STATE OF THE ART
  • Find a distribution ℙ that well describes the phenomenon under study.
  • Define the value of an adaptive sampling policy.
  • Define an approximate resolution method for finding a near-optimal policy.

  12. OUR CONTRIBUTION
  • Find a distribution ℙ that well describes the phenomenon under study.
  • Define the value of an adaptive sampling policy.
  • Define an approximate resolution method for finding a near-optimal policy.

  13. OUR CONTRIBUTION
  • Find a distribution ℙ that well describes the phenomenon under study.
  • Define the value of an adaptive sampling policy.
  • Define an approximate resolution method for finding a near-optimal policy.

  14. Formulation using a dynamic model: an adapted framework for reinforcement learning

  15. Summarize knowledge on X in a random vector S of length n
  • Observing variables → update of our knowledge on X → evolution of S
  • Example: s = (-1, …, k, …, -1), where entry i equals k: variable X(i) was observed in state k, and unobserved variables are marked -1
  [figure: trajectory s1 --A1--> s2 --A2--> s3 --A3--> … --> sH+1, with terminal utility U(sH+1)]

  16. Summarize knowledge on X in a random vector S of length n
  • Observing variables → update of our knowledge on X → evolution of S
  • Example: s = (-1, …, k, …, -1), where entry i equals k: variable X(i) was observed in state k, and unobserved variables are marked -1
  [figure: trajectory s1 --A1--> s2 --A2--> s3 --A3--> … --> sH+1, with terminal utility U(sH+1)]
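A minimal sketch of this state encoding (the helper names are hypothetical): entry i of s stays at -1 while X(i) is unobserved and is set to k once X(i) has been observed in state k.

```python
# State s summarizes the current knowledge about X = (X(1), ..., X(n)):
# s[i] = -1 while X(i) is unobserved, s[i] = k once X(i) was observed in state k.

def initial_state(n):
    return [-1] * n

def update_state(s, A, x_A):
    """Next state after observing the variables indexed by A in states x_A."""
    s_next = list(s)
    for i, k in zip(A, x_A):
        s_next[i] = k
    return s_next

# Example (0-based indices, purely for illustration): observe X(8) in state 0,
# then X(6) and X(7) in states 1 and 3, as in the earlier example.
s1 = initial_state(9)
s2 = update_state(s1, A=[7], x_A=[0])
s3 = update_state(s2, A=[5, 6], x_A=[1, 3])
```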

  17. Reinforcement learning solution

  18. Find the optimal policy: the Q-function. Q*(st, At) = « the expected value of the history when starting in st, observing variables X(At) and then following policy 𝛿* »
  • Compute Q* → compute 𝛿*
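In symbols, and assuming (as the trajectory diagrams suggest) that the only reward is the terminal utility U(sH+1), this definition can be written as:

```latex
Q^*(s_t, A_t) \;=\; \mathbb{E}\bigl[\, U(S_{H+1}) \;\big|\; s_t,\; A_t,\ \text{then follow } \delta^* \,\bigr],
\qquad
\delta^*(s_t) \;\in\; \arg\max_{A} \; Q^*(s_t, A).
```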

  19. Find the optimal policy: the Q-function
  • How to compute Q*: classical solution (Q-learning, …)
  1. Initialize Q
  2. Simulate a history: with probability 1-𝜀 choose the greedy action, with probability 𝜀 a random action
  [figure: simulated trajectory s1 --A1--> s2 --A2--> s3 with terminal utility U(s3)]

  20. Find the optimal policy: the Q-function
  • How to compute Q*: classical solution (Q-learning, …)
  1. Initialize Q
  2. Simulate histories many times, updating Q(s1, A1), Q(s2, A2), … along the way: with probability 1-𝜀 choose the greedy action, with probability 𝜀 a random action. Q converges to Q*.
  [figure: simulated trajectory s1 --A1--> s2 --A2--> s3 with terminal utility U(s3)]
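As an illustration of this classical scheme, here is a minimal tabular ε-greedy Q-learning sketch, assuming a generic simulator; the helpers candidate_actions, simulate_step, terminal_utility and initial_state are hypothetical stand-ins for the sampling-specific parts, and states are assumed hashable (e.g. tuples).

```python
import random
from collections import defaultdict

def q_learning(candidate_actions, simulate_step, terminal_utility,
               initial_state, horizon, n_episodes, alpha=0.1, eps=0.1):
    """Generic tabular Q-learning with epsilon-greedy exploration.
    Rewards are zero except for the terminal utility U(s_{H+1})."""
    Q = defaultdict(float)                       # Q[(state, action)]
    for _ in range(n_episodes):
        s = initial_state()
        for t in range(horizon):
            actions = candidate_actions(s)
            if random.random() < eps:            # explore with probability eps
                a = random.choice(actions)
            else:                                # otherwise act greedily w.r.t. Q
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next = simulate_step(s, a)
            if t == horizon - 1:
                target = terminal_utility(s_next)        # U(s_{H+1})
            else:
                target = max(Q[(s_next, a_)] for a_ in candidate_actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])    # temporal-difference update
            s = s_next
    return Q
```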

  21. Alternative approach
  • Linear approximation of the Q-function
  • Choice of the functions Φi
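Written out in the usual way (the exact notation used here is an assumption, not taken from the slide), a linear approximation of the Q-function with features Φi and weights wi takes the form:

```latex
Q(s, A) \;\approx\; \widehat{Q}_w(s, A) \;=\; \sum_{i} w_i \, \Phi_i(s, A).
```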

  22. LSDP Algorithm
  • Define weights for each decision step t = 1, …, H
  • Compute the weights using "backward induction"
  • Linear approximation of the Q-function
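One standard way to realize this backward induction, assuming step-dependent weights w^t fitted on simulated trajectories (a reconstruction of the idea, not the slide's exact formulas), is a sequence of least-squares problems: the targets are the terminal utilities at step H, and the greedy values of the already-fitted next step otherwise:

```latex
w^{H} \in \arg\min_{w} \sum_{\text{trajectories}} \Bigl( U(s_{H+1}) - \sum_i w_i \,\Phi_i(s_H, A_H) \Bigr)^{2},
\qquad
w^{t} \in \arg\min_{w} \sum_{\text{trajectories}} \Bigl( \max_{A} \sum_i w_i^{t+1} \Phi_i(s_{t+1}, A) - \sum_i w_i \,\Phi_i(s_t, A_t) \Bigr)^{2}.
```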

  23. LSDP Algorithm: application to sampling
  • Computation of Φi(st, At)

  24. LSDP Algorithm: application to sampling
  • Computation of Φi(st, At)
  • We fix |At| = 1 and use an approximation

  25. Experiments

  26. Experiments
  • Regular grid with first-order neighbourhood
  • The X(i) are binary variables
  • ℙ is a Potts model with β = 0.5
  • Simple cost: observing each variable costs 1
  [figure: 3x3 grid of variables X(1), …, X(9)]
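As an illustration of this experimental model, here is a minimal sketch of drawing an approximate sample from a binary Potts model on a regular grid with first-order neighbourhood and β = 0.5 via Gibbs sampling; the grid size, number of sweeps and the name gibbs_potts are arbitrary choices for the example, not taken from the slides.

```python
import math
import random

def gibbs_potts(height, width, beta=0.5, n_sweeps=200, seed=0):
    """Approximate sample from a binary Potts model on a grid:
    P(x) is proportional to exp(beta * #{first-order neighbour pairs that agree})."""
    rng = random.Random(seed)
    x = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    for _ in range(n_sweeps):
        for i in range(height):
            for j in range(width):
                # local score: how many neighbours agree with each candidate state
                scores = []
                for k in (0, 1):
                    agree = 0
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < height and 0 <= nj < width:
                            agree += (x[ni][nj] == k)
                    scores.append(beta * agree)
                # resample X(i,j) from its conditional distribution given the neighbours
                p1 = math.exp(scores[1]) / (math.exp(scores[0]) + math.exp(scores[1]))
                x[i][j] = 1 if rng.random() < p1 else 0
    return x

# Example: one (approximate) realization on a 10 x 10 grid, i.e. n = 100 binary variables.
sample = gibbs_potts(10, 10, beta=0.5)
```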

  27. Experiments
  • Comparison between:
  • Random policy
  • BP-max heuristic: at each time step the variable to observe is chosen from the max marginals
  • LSPI policy, a common reinforcement learning algorithm
  • LSDP policy
  • Comparison using a score measuring the quality of the reconstruction

  28. Experiment: 100 variables (n=100)

  29. LSDP and BP-max: no observation
  [figure: action values for the LSDP policy; max marginals for BP-max]

  30. LSDP and BP-max: one observation
  [figure: action values for the LSDP policy; max marginals for BP-max]

  31. LSDP and BP-max: two observations
  [figure: action values for the LSDP policy; max marginals for BP-max]

  32. Experiment: 100 variables, constrained moves
  • Allowed to visit the second-order neighbourhood only!

  33. Experiment: 100 variables, constrained moves
  • Allowed to visit the second-order neighbourhood only!

  34. Experiment: 100 variables, constrained moves

  35. Experiment: 200 variables, different costs. Initial budget = 38

  36. Experiment: 200 variables, different costs. Initial budget = 38
  • Cost repartition: [figure comparing Random, BP-max and LSDP]

  37. Experiment: 100 variables, different costs
  • Cost of 1 when the observed variable is in state 0
  • Cost of 3 when the observed variable is in state 1
  • Initial budget = 30

  38. Conclusions
  • An adapted framework for adaptive sampling of discrete random variables
  • LSDP: a reinforcement learning approach for finding a near-optimal policy
  • Adaptation of a common reinforcement learning algorithm for solving the adaptive sampling problem
  • Computation of the near-optimal policy « off-line »
  • Design of new policies that outperform simple heuristics and a usual RL method
  • Possible application? Weed sampling in crop fields

  39. THANK YOU!

  40. Reconstruction of X(R) and trajectory value
  Example: A1 = 𝛿1 = {8}, x(A1) = 0; A2 = 𝛿2((8,0)) = {6,7}, x(A2) = (4,2)
  • Maximum Posterior Marginal (MPM) for reconstruction: the MPM estimate gives the reconstruction of X(R)

  41. Reconstruction of X(R) and trajectory value
  Example: A1 = 𝛿1 = {8}, x(A1) = 0; A2 = 𝛿2((8,0)) = {6,7}, x(A2) = (4,2)
  • Maximum Posterior Marginal (MPM) for reconstruction
  • Quality of the trajectory
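In the standard notation of the Maximum Posterior Marginal criterion, the reconstruction of each variable X(i), i ∈ R, given the observations collected along the history, is:

```latex
x^{*}(i) \;=\; \arg\max_{x(i)} \; \mathbb{P}\bigl( X(i) = x(i) \;\big|\; x(A_1), \dots, x(A_H) \bigr), \qquad i \in R.
```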

  42. LSDP Algorithm
  • Linear approximation of the Q-function
  • How to compute w1, …, wH:
  [figure: simulated trajectories s1 --a1--> s2 --a2--> … --aH--> sH+1, each ending with its terminal utility U(sH+1)]

  43. LSDP Algorithm
  • Linear approximation of the Q-function
  • How to compute w1, …, wH:
  [figure: simulated trajectories; the last transitions (sH, aH) together with the terminal utilities U(sH+1) form a LINEAR SYSTEM whose solution gives wH]

  44. LSDP Algorithm
  • Linear approximation of the Q-function
  • How to compute w1, …, wH:
  [figure: stepping backwards, the transitions (sH-1, aH-1) form a LINEAR SYSTEM whose solution gives wH-1, and so on down to w1]
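A minimal sketch of this backward computation of the weights from simulated trajectories, with one least-squares problem per decision step; the trajectory format, the feature function phi and the candidate-action generator are hypothetical placeholders, not the slides' exact implementation.

```python
import numpy as np

def fit_weights_backward(trajectories, phi, candidate_actions, horizon):
    """Backward least-squares fitting of step-dependent weights w^1, ..., w^H.
    Each trajectory is (steps, utility) with steps = [(s_1, A_1), ..., (s_H, A_H), (s_{H+1}, None)]
    and utility = U(s_{H+1}); phi(s, A) returns the feature vector (Phi_i(s, A))_i."""
    W = [None] * (horizon + 2)                  # W[t] will hold w^t for t = 1..H
    for t in range(horizon, 0, -1):
        rows, targets = [], []
        for steps, utility in trajectories:
            s_t, A_t = steps[t - 1]
            rows.append(phi(s_t, A_t))
            if t == horizon:
                targets.append(utility)          # target at the last step: U(s_{H+1})
            else:                                # otherwise: greedy value of the fitted next step
                s_next, _ = steps[t]
                targets.append(max(np.dot(W[t + 1], phi(s_next, A))
                                   for A in candidate_actions(s_next)))
        X, y = np.asarray(rows), np.asarray(targets)
        W[t], *_ = np.linalg.lstsq(X, y, rcond=None)   # linear system -> w^t
    return W[1:horizon + 1]                     # [w^1, ..., w^H]
```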
