A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN
INRA-MIA Toulouse
E-Mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
Plan
• Problem statement
• General approach
• Formulation using a dynamic model
• Reinforcement learning solution
• Experiments
• Conclusions
PROBLEM STATEMENT: Adaptive sampling
• Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), ..., X(n))
• c(A) -> cost of observing variables X(A)
• B -> initial budget
• Observations are reliable
[Illustration: 3x3 grid of variables X(1), ..., X(9)]
PROBLEM STATEMENT: Adaptive sampling
• Adaptive selection of variables to observe for reconstruction of the random vector X = (X(1), ..., X(n))
• c(A, x(A)) -> cost of observing variables X(A) in state x(A)
• B -> initial budget
• Observations are reliable
Problem: find a strategy / sampling policy to adaptively select the variables to observe so as to optimize the quality of the reconstruction of X while respecting the initial budget.
DEFINITION: Adaptive sampling policy
For any sampling plans A1, ..., At and observations x(A1), ..., x(At), an adaptive sampling policy 𝛿 is a function giving the next variable(s) to observe: 𝛿((A1, x(A1)), ..., (At, x(At))) = At+1.
-- Example -- (on the 3x3 grid X(1), ..., X(9))
A1 = 𝛿1 = {8}, x(A1) = 0
A2 = 𝛿2((8, 0)) = {6, 7}, x(A2) = (1, 3)
DEFINITIONS
Vocabulary:
• A history {(A1, x(A1)), ..., (AH, x(AH))} is a trajectory followed when applying 𝛿.
• [notation on the slide]: the set of all histories reachable under 𝛿.
• c(𝛿) ≤ B: the cost of any history respects the initial budget.
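As a rough sketch (not from the slides), a policy 𝛿 can be represented as a function from the history of (plan, observation) pairs to the next plan; the cheapest-first rule and all names below are hypothetical placeholders for the policies discussed later.

```python
from typing import Dict, List, Set, Tuple

def next_plan(history: List[Tuple[Set[int], tuple]], budget: float,
              cost: Dict[int, float], candidates: Set[int]) -> Set[int]:
    """Hypothetical adaptive policy delta: map the history ((A1,x(A1)),...,(At,x(At)))
    to the next plan A_{t+1}, here the cheapest unobserved variable fitting the budget."""
    observed = {i for plan, _ in history for i in plan}
    spent = sum(cost[i] for i in observed)
    affordable = [i for i in candidates - observed if spent + cost[i] <= budget]
    if not affordable:
        return set()                     # budget exhausted: stop sampling
    return {min(affordable, key=lambda i: cost[i])}

# Usage on the example above (all costs equal to 1, budget 5):
# next_plan([({8}, (0,))], budget=5, cost={i: 1 for i in range(1, 10)}, candidates=set(range(1, 10)))
```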
GENERAL APPROACH
1. Find a distribution ℙ that well describes the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
STATE OF THE ART
1. Find a distribution ℙ that well describes the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
OUR CONTRIBUTION
1. Find a distribution ℙ that well describes the phenomenon under study.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
Formulation using a dynamic model: an adapted framework for reinforcement learning
• Summarize knowledge on X in a random vector S of length n.
• Observing variables updates our knowledge on X (evolution of S).
• Example: s = (-1, ..., k, ..., -1): the i-th entry equals k because variable X(i) was observed in state k; -1 marks a still-unobserved variable.
[Diagram: states s1, s2, s3, ... evolving under sampling plans A1, A2, A3, ..., ending in sH+1 with terminal value U(sH+1)]
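A minimal sketch of this knowledge state (assuming, as in the example above, that -1 encodes "not yet observed"); the names are ours.

```python
import numpy as np

def update_state(s: np.ndarray, A_t, x_A_t) -> np.ndarray:
    """Fold the new observations x(A_t) into the knowledge state s:
    s[i] == -1 means X(i) is unobserved, otherwise s[i] is its observed value."""
    s_next = s.copy()
    for i, value in zip(A_t, x_A_t):
        s_next[i] = value
    return s_next

s1 = -np.ones(9, dtype=int)                       # nothing observed yet
s2 = update_state(s1, A_t=[7], x_A_t=[0])         # X(8) observed in state 0 (0-based index 7)
s3 = update_state(s2, A_t=[5, 6], x_A_t=[1, 3])   # X(6), X(7) observed next
```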
Find optimal policy: the Q-function
Q*(st, At) = "the expected value of the history when starting in st, observing variables X(At) and then following policy 𝛿*"
Computing Q* yields 𝛿*.
Find optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, ...)
1. Initialize Q.
2. Simulate a history: in each state, take the current greedy action with probability 1-𝜀 and a random action with probability 𝜀.
[Diagram: simulated trajectory s1 -A1-> s2 -A2-> s3, with terminal value U(s3)]
Find optimal policy: the Q-function
How to compute Q*: classical solution (Q-learning, ...)
1. Initialize Q.
2. Simulate a history (greedy action with probability 1-𝜀, random action with probability 𝜀) and update Q(s1, A1), Q(s2, A2), ... along the trajectory.
Repeat many times: Q converges to Q*.
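A generic tabular Q-learning / ε-greedy sketch of this loop (not the authors' code; simulate and reward are hypothetical callbacks standing in for sampling from ℙ and for the terminal reconstruction value U).

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # tabular Q-values keyed by (state, action), filled lazily

def q_learning_step(Q, s, actions, simulate, reward, epsilon=0.1, alpha=0.1):
    """One epsilon-greedy transition and Q-update along a simulated history."""
    state = tuple(s)
    if random.random() < epsilon:
        a = random.choice(actions)                          # explore
    else:
        a = max(actions, key=lambda act: Q[(state, act)])   # exploit current estimate
    s_next = simulate(s, a)                                 # draw observation, update knowledge state
    target = reward(s, a, s_next) + max(Q[(tuple(s_next), act)] for act in actions)
    Q[(state, a)] += alpha * (target - Q[(state, a)])
    return s_next
```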
Alternative approach
• Linear approximation of the Q-function: Q(s, A) ≈ Σi wi Φi(s, A)
• Choice of the functions Φi
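The linear approximation itself is one line; a small sketch (our naming) with the greedy action choice it induces:

```python
import numpy as np

def q_linear(w: np.ndarray, phi_sa: np.ndarray) -> float:
    """Q_w(s, A) = sum_i w_i * Phi_i(s, A)."""
    return float(np.dot(w, phi_sa))

def greedy_action(w, s, actions, phi):
    """Action maximising the approximate Q-value in state s; phi(s, a) returns the feature vector."""
    return max(actions, key=lambda a: q_linear(w, phi(s, a)))
```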
LSDP Algorithm
• Linear approximation of the Q-function, with one weight vector per decision step t = 1, ..., H.
• Weights are computed using "backward induction".
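A hedged sketch of that backward pass (our reconstruction, not the published algorithm): one weight vector per decision step, each fitted by least squares on simulated transitions, using the already-fitted weights of the next step (or the terminal value U at step H) as regression target.

```python
import numpy as np

def fit_weights_backward(H, sample_transition, phi, terminal_value, actions, n_samples=1000):
    """Fit w_H, ..., w_1 by backward induction with least-squares regression.

    sample_transition(t) draws a simulated (s_t, a_t, s_{t+1}); phi(s, a) returns
    the feature vector; terminal_value(s) plays the role of U(s_{H+1})."""
    weights = [None] * (H + 2)
    for t in range(H, 0, -1):
        X, y = [], []
        for _ in range(n_samples):
            s, a, s_next = sample_transition(t)
            if t == H:
                target = terminal_value(s_next)          # last step: regress on U
            else:
                target = max(np.dot(weights[t + 1], phi(s_next, a2)) for a2 in actions)
            X.append(phi(s, a))
            y.append(target)
        weights[t], *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return weights[1:H + 1]   # [w_1, ..., w_H]
```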
LSDP Algorithm: application to sampling
• Computation of Φi(st, At)
• [Further computations shown as formulas on the slide]
LSDP Algorithm: application to sampling (continued)
• We fix |At| = 1 and use the approximation: [formula shown on the slide]
Experiments
• Regular grid with first-order neighbourhood.
• The X(i) are binary variables.
• ℙ is a Potts model with β = 0.5.
• Simple cost: observing each variable costs 1.
[Illustration: 3x3 grid of variables X(1), ..., X(9)]
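For concreteness, a small sketch of the (unnormalised) Potts potential used for ℙ on a regular grid with first-order neighbourhood; the function name is ours.

```python
import numpy as np

def potts_log_potential(x: np.ndarray, beta: float = 0.5) -> float:
    """Unnormalised log-probability of a Potts model: beta times the number of
    agreeing first-order (horizontal/vertical) neighbour pairs on the grid."""
    agree = np.sum(x[:, :-1] == x[:, 1:]) + np.sum(x[:-1, :] == x[1:, :])
    return beta * float(agree)

x = np.random.randint(0, 2, size=(3, 3))   # binary labels on a 3x3 grid, as in the figure
print(potts_log_potential(x))
```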
Experiments
Comparison between:
• Random policy
• BP-max heuristic: at each time step, observe the variable [selection rule based on max marginals, truncated on the slide]
• LSPI policy (a common reinforcement learning algorithm)
• LSDP policy
using the score: [formula shown on the slide]
LSDP and BP-max: no observation
[Figure: action values for the LSDP policy and max marginals]
LSDP and BP-max: one observation
[Figure: action values for the LSDP policy and max marginals]
LSDP and BP-max: two observations
[Figure: action values for the LSDP policy and max marginals]
Experiment: 100 variables - constrained moves
• Only moves to the second-order neighbourhood are allowed!
Experiment: 200 variables - different costs
Initial budget = 38
Experiment: 200 variables - different costs
• Cost repartition: Random, BP-max, LSDP
• Initial budget = 38
[Figure: cost repartition per policy]
Experiment: 100 variables - different costs
• Cost of 1 when the observed variable is in state 0
• Cost of 3 when the observed variable is in state 1
• Initial budget = 30
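A literal encoding of this state-dependent cost (trivial sketch, naming ours):

```python
def observation_cost(x_i: int) -> int:
    """Cost of observing a variable found in state x_i: 1 if in state 0, 3 if in state 1."""
    return 1 if x_i == 0 else 3

INITIAL_BUDGET = 30
```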
Conclusions
• An adapted framework for adaptive sampling of discrete random variables
• LSDP: a reinforcement learning approach for finding a near-optimal policy
• Adaptation of a common reinforcement learning algorithm for solving the adaptive sampling problem
• Computation of the near-optimal policy "off-line"
• Design of new policies that outperform simple heuristics and the usual RL method
• Possible application: weed sampling in crop fields
Reconstruction of X(R) and trajectory value
Example trajectory: A1 = 𝛿1 = {8}, x(A1) = 0; A2 = 𝛿2((8, 0)) = {6, 7}, x(A2) = (4, 2)
• Maximum Posterior Marginal (MPM) for the reconstruction of X(R): [formula shown on the slide]
• Quality of a trajectory: [formula shown on the slide]
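A minimal sketch of MPM reconstruction from (approximate) marginals, assuming -1 still marks unobserved entries of the knowledge state; names are ours.

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Maximum Posterior Marginal estimate of X given the knowledge state s.

    marginals[i, k] approximates P(X(i) = k | observations); observed entries
    of s are kept, the others take the most probable value of their marginal."""
    x_hat = np.argmax(marginals, axis=1)
    observed = s != -1
    x_hat[observed] = s[observed]
    return x_hat
```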
LSDP Algorithm
• Linear approximation of the Q-function.
• How to compute w1, ..., wH: simulate trajectories s1, a1, s2, a2, ..., sH, aH, sH+1 with terminal value U(sH+1); a linear system over the last decision step gives wH, then a linear system over step H-1 gives wH-1, and so on backwards.
[Diagram: simulated trajectories with the linear systems yielding wH and wH-1]