Reinforcement Learning in POMDPs Without Resets
E. Even-Dar, S. M. Kakade & Y. Mansour
IJCAI, August 2005
Presented by Lihan He, Machine Learning Reading Group, Duke University, 07/29/2005
In reinforcement learning for POMDPs, the agent is usually reset to the same situation at the beginning of each attempt. This guarantees that the agent starts from the same point, so that comparisons of rewards across attempts are fair. This paper gives an approach to approximate resets for the situation where the agent cannot be exactly reset. The authors prove that this approximate reset, or homing strategy, moves the agent toward a reset within a given tolerance, in expectation.
Outline
• POMDP, policy, and horizon length
• Reinforcement learning
• Homing strategies
• Two algorithms of reinforcement learning with homing: (1) model-free, (2) model-based
• Conclusion
POMDP
POMDP = HMM + controllable actions. A POMDP model is defined by the tuple <S, A, T, R, Ω, O>.
An example: Hallway2, a navigation problem.
• 89 states: 4 orientations in each of 22 rooms, plus a goal.
• 17 observations: all combinations of walls, plus 'star'.
• 5 actions: stay in place, move forward, turn right, turn left, turn around.
The state is hidden, since the agent cannot determine its current state from the current observation alone (wall / no wall in the 4 orientations).
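As a purely illustrative data-structure view of this tuple, a minimal Python sketch is shown below; the class name, the field layout, and the uniform placeholder numbers are assumptions for illustration, not the actual Hallway2 model.

from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """Minimal container for a POMDP tuple <S, A, T, R, Omega, O>."""
    n_states: int      # |S|: hidden states
    n_actions: int     # |A|: actions
    n_obs: int         # |Omega|: observations
    T: np.ndarray      # T[a, s, s2] = P(s2 | s, a), state transitions
    R: np.ndarray      # R[s, a]     = expected immediate reward
    O: np.ndarray      # O[a, s2, o] = P(o | a, s2), observation function

# A Hallway2-sized placeholder with uniform dynamics (just a valid object,
# not the real navigation model: 89 states, 5 actions, 17 observations).
S, A, Z = 89, 5, 17
hallway2_like = POMDP(S, A, Z,
                      T=np.full((A, S, S), 1.0 / S),
                      R=np.zeros((S, A)),
                      O=np.full((A, S, Z), 1.0 / Z))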
POMDP policy
A policy π is a mapping from a belief state b to an action a; it tells the agent which action to take under the estimated belief state.
T-horizon optimal policy: the algorithm looks only T steps ahead to maximize the expected reward value V.
T = 1: consider only the immediate reward (the immediate reward is also a function of the action a).
T = infinite: consider all the discounted future reward.
[Figure: horizon length vs. reward value of the optimal policy.]
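Because a policy acts on belief states, the belief has to be maintained with the standard Bayes update b'(s') ∝ O(a,s',o) Σ_s T(s,a,s') b(s). A minimal sketch, reusing the hypothetical POMDP container from the previous slide:

import numpy as np

def belief_update(pomdp, b, a, o):
    """One Bayes-filter step: b2(s2) is proportional to O[a, s2, o] * sum_s b(s) * T[a, s, s2]."""
    predicted = b @ pomdp.T[a]                   # predict: marginalize over the current state
    unnormalized = pomdp.O[a, :, o] * predicted  # correct: weight by the observation likelihood
    norm = unnormalized.sum()
    return unnormalized / norm if norm > 0 else predicted  # guard against an impossible observation

# Example: start from a uniform belief and update after action 0, observation 3.
b0 = np.full(hallway2_like.n_states, 1.0 / hallway2_like.n_states)
b1 = belief_update(hallway2_like, b0, a=0, o=3)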
Reinforcement Learning
How can the agent obtain an optimal policy if it does not know the model parameters (the state transition probabilities T(s,a,s') and the observation function O(a,s',o)), or even the structure of the model?
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
• Model-based: first estimate the model parameters by exploring the environment, then derive a policy from this model. During exploration, the agent continuously improves both the model and the policy.
• Model-free: discard the model entirely and find a policy directly by exploring the environment. Usually, the algorithm searches the space of behaviors for the best performance.
Reinforcement Learning: reset
To compare performance during the trial-and-error process, the agent usually resets itself to the same initial situation (belief state) before each try; in this way the comparison is fair. This is usually done by "offline" simulation.
[Figure: all trials share the same starting point.]
Reinforcement Learning: without reset
Assume a realistic situation in which an agent starts in an unknown environment and must follow one continuous and uninterrupted chain of experience, with no access to 'resets' or 'offline' simulation.
• The paper presents an algorithm built on an approximate reset strategy, or homing strategy: a series of actions that achieves an approximate reset.
• A homing strategy exists in every POMDP.
• By performing the homing strategy, the agent approximately resets its current belief state to within ε of the (unknown) initial belief state.
• The algorithm balances exploration and exploitation; while homing, the agent is neither exploring nor exploiting.
Homing Strategies
Definition 1: H is an (ε, k)-approximate reset (or homing) strategy if for every two belief states b1 and b2 we have ||H_E(b1) - H_E(b2)||_1 ≤ ε, where H(b) is the (random) belief state reached from b after the k homing actions of H, and H_E(b) = E[H(b)] is its expectation.
This definition states that H approximately resets b, but the approximation quality may be poor. The next lemma shows how to amplify the accuracy.
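To make the definition concrete, the sketch below (still using the toy container above) computes H_E(b) for a uniform random-walk homing strategy and checks the L1 condition for one pair of belief states. It relies on the fact that averaging the Bayes-updated belief over observations cancels the normalization, so the expected belief evolves linearly under the action-averaged transition matrix; the function names are ours, not the paper's.

import numpy as np

def expected_belief_after_homing(pomdp, b, k):
    """H_E(b): expected belief after k steps of a uniform random walk.
    Averaging over observations cancels the Bayes normalization, so the
    expected belief just follows the action-averaged transition matrix."""
    T_avg = pomdp.T.mean(axis=0)      # average transition matrix over actions
    for _ in range(k):
        b = b @ T_avg
    return b

def satisfies_reset_condition(pomdp, b1, b2, k, eps):
    """Check ||H_E(b1) - H_E(b2)||_1 <= eps for one pair of belief states."""
    h1 = expected_belief_after_homing(pomdp, b1, k)
    h2 = expected_belief_after_homing(pomdp, b2, k)
    return np.abs(h1 - h2).sum() <= eps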
Homing Strategies
Lemma 1 (accuracy amplification): Suppose that H is an (ε, k)-approximate reset. Then H^ℓ, the strategy that consecutively implements H for ℓ times, is an (ε^ℓ, kℓ)-approximate reset.
[Figure: H applied ℓ times in sequence.]
Lemma 2 (existence of a homing strategy): For every POMDP, the random-walk strategy (including the 'stay' action) constitutes an (ε, k)-approximate reset for some k ≥ 1 and 0 < ε < 1/2.
• Assumption: the POMDP is connected, i.e., for any states s, s' there exists a strategy that reaches s' with positive probability starting from s.
• The random walk must include the 'stay' action, to avoid being trapped in a loop.
According to these two lemmas, for any POMDP we can at least use a random walk to achieve an approximate reset of any desired accuracy.
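The amplification in Lemma 1 can be checked numerically with the sketch above: the worst-case L1 distance over all pairs of beliefs is attained at point-mass beliefs, so it suffices to compare those. This is only an illustration on the toy model, not a proof.

import numpy as np

def worst_case_eps(pomdp, k):
    """Smallest eps for which the random walk is an (eps, k)-approximate reset:
    the worst-case L1 distance between expected beliefs is attained at point masses."""
    n = pomdp.n_states
    homed = np.array([expected_belief_after_homing(pomdp, np.eye(n)[i], k)
                      for i in range(n)])
    return max(np.abs(homed[i] - homed[j]).sum()
               for i in range(n) for j in range(n))

# Lemma 1: if one pass of H gives accuracy eps, then ell consecutive passes
# give accuracy at most eps**ell (with k*ell homing actions).
k, ell = 5, 3
eps_1 = worst_case_eps(hallway2_like, k)
eps_ell = worst_case_eps(hallway2_like, k * ell)
assert eps_ell <= eps_1 ** ell + 1e-12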
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm

Input: H, a (1/2, K_H)-approximate reset strategy (e.g., random walk)
for t = 1 to ∞ do                                  // t: horizon length
    foreach policy π in Π_t do                     // Π_t: the set of all t-horizon policies
        for i = 1 to k1_t do                       // k1_t: number of exploration trials
            Run π for t steps                      // exploration in phase t
            Repeatedly run H for log(1/ε_t) times  // homing
        end
        Let v_π be the average return of π from these k1_t trials
    end
    Let π_t* = argmax_π v_π                        // the empirically optimal policy
    for i = 1 to k2_t do                           // k2_t: number of exploitation trials
        Run π_t* for t steps                       // exploitation in phase t
        Repeatedly run H for log(1/ε_t) times      // homing
    end
end
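A highly simplified Python sketch of the phase structure of Algorithm 1 follows (this is not the authors' code). The helpers enumerate_policies, run_policy, and run_homing are hypothetical callables supplied by the caller, and the schedules for ε_t, k1_t, and k2_t are placeholders rather than the constants derived in the paper.

import math

def model_free_with_homing(env, H, enumerate_policies, run_policy, run_homing,
                           num_phases=10):
    """Phase t: try every t-horizon policy with homing resets in between,
    then exploit the empirically best one (also interleaved with homing)."""
    for t in range(1, num_phases + 1):
        eps_t = 2.0 ** -t                          # placeholder accuracy schedule
        k1_t, k2_t = 10 * t, 100 * t               # placeholder trial counts
        reps = math.ceil(math.log2(1.0 / eps_t))   # Lemma 1: repeat H to reach eps_t

        # Exploration in phase t
        avg_return = {}
        for pi in enumerate_policies(horizon=t):   # hypothetical: all t-horizon policies
            total = 0.0
            for _ in range(k1_t):
                total += run_policy(env, pi, steps=t)  # hypothetical helper
                run_homing(env, H, repeats=reps)       # approximate reset
            avg_return[pi] = total / k1_t

        pi_best = max(avg_return, key=avg_return.get)  # empirically best policy

        # Exploitation in phase t
        for _ in range(k2_t):
            run_policy(env, pi_best, steps=t)
            run_homing(env, H, repeats=reps)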
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm
Approximate reset: by Lemma 1, repeating the (1/2, K_H)-approximate reset H for log(1/ε_t) times gives an ε_t-approximate reset.
• What is a policy in this model-free POMDP?
Definition: A history h is a sequence of actions, rewards and observations of some finite length, i.e., h = {(a1,r1,o1), …, (at,rt,ot)}. A policy is defined as a mapping from histories to actions.
• There is no relationship between the t-th and the (t+1)-th iteration.
• Very inefficient, since it tests all possible policies.
• Impossible to implement in practice.
Reason for the choice of k1_t (number of exploration trials) and k2_t (number of exploitation trials): run enough trials to guarantee convergence of the estimated average reward.
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm

Input: H, a (1/2, K_H)-approximate reset strategy (e.g., random walk)
Let L = |A|·|O|
for t = 1 to ∞ do
    for k1_t times do
        Run RANDOM WALK for t+1 steps               // exploration in phase t
        Repeatedly run H for log(Lt/ε_t) times      // homing
    end
    foreach history h in H_t do                     // H_t: the set of all possible histories
        if [h has been visited often enough] then [update the empirical model estimates at h]   // model update in phase t
    end
    Compute the policy π_t using the estimated model
    for k2_t times do
        Run π_t for t steps                         // exploitation in phase t
        Repeatedly run H for log(Lt/ε_t) times      // homing
    end
end
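Likewise, a rough Python sketch of the phase structure of Algorithm 2 (not the authors' code). The environment interface env.step, the helpers run_homing, run_policy, and plan_history_mdp, the visit threshold, and the phase schedules are all hypothetical placeholders.

import math
import random
from collections import defaultdict

def model_based_with_homing(env, H, actions, n_obs,
                            run_homing, run_policy, plan_history_mdp,
                            num_phases=10):
    """Phase t: explore by random walk, build empirical estimates indexed by
    histories, plan on the estimated history model, then exploit with homing."""
    L = len(actions) * n_obs                         # L = |A| * |O|, as on the slide
    for t in range(1, num_phases + 1):
        eps_t = 2.0 ** -t                            # placeholder accuracy schedule
        k1_t, k2_t = 10 * t, 100 * t                 # placeholder trial counts
        reps = math.ceil(math.log2(L * t / eps_t))   # homing repetitions

        # Exploration in phase t: random walk, record (history, action) statistics
        obs_counts = defaultdict(lambda: defaultdict(int))
        reward_sums = defaultdict(float)
        for _ in range(k1_t):
            h = ()
            for _ in range(t + 1):
                a = random.choice(actions)
                o, r = env.step(a)                   # hypothetical environment interface
                obs_counts[(h, a)][o] += 1
                reward_sums[(h, a)] += r
                h = h + ((a, o),)
            run_homing(env, H, repeats=reps)         # homing

        # Model update in phase t: empirical P(o | h, a) and R(h, a) for
        # (history, action) pairs visited at least t times (placeholder threshold)
        P_hat, R_hat = {}, {}
        for key, counts in obs_counts.items():
            n = sum(counts.values())
            if n >= t:
                P_hat[key] = {o: c / n for o, c in counts.items()}
                R_hat[key] = reward_sums[key] / n
        pi_t = plan_history_mdp(P_hat, R_hat, actions, horizon=t)

        # Exploitation in phase t
        for _ in range(k2_t):
            run_policy(env, pi_t, steps=t)
            run_homing(env, H, repeats=reps)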
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
• Again, there is no relationship between the t-th and the (t+1)-th iteration.
• Instead of trying all policies as in Algorithm 1, Algorithm 2 uses the estimated sparse model parameters to compute the policy.
• The history h followed by (a,o) is itself a history.
• The POMDP is equivalent to an MDP whose states are the histories, so we can compute the policy from the estimated history-based model (see the sketch below).
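One possible sketch of the plan_history_mdp helper assumed above: finite-horizon backward induction in which histories play the role of MDP states. Skipping unmodeled (history, action) pairs is a placeholder design choice, not something specified by the paper.

from functools import lru_cache

def plan_history_mdp(P_hat, R_hat, actions, horizon):
    """Return a policy(history) -> action for the estimated history model,
    using V(h, d) = max_a [ R_hat(h,a) + sum_o P_hat(o | h,a) * V(h + ((a,o),), d-1) ]."""

    @lru_cache(maxsize=None)
    def V(h, d):
        if d == 0:
            return 0.0
        values = [q_value(h, a, d) for a in actions if (h, a) in P_hat]
        return max(values) if values else 0.0

    def q_value(h, a, d):
        return R_hat.get((h, a), 0.0) + sum(p * V(h + ((a, o),), d - 1)
                                            for o, p in P_hat[(h, a)].items())

    def policy(h):
        # Greedy action for history h with the remaining horizon; None if unmodeled.
        d = max(horizon - len(h), 1)
        scored = [(q_value(h, a, d), a) for a in actions if (h, a) in P_hat]
        return max(scored, key=lambda x: x[0])[1] if scored else None

    return policy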
Conclusion
The authors give an approach to approximate resets for the setting where the agent cannot be reset during lifelong learning.
A model-free algorithm and a model-based algorithm are proposed.
The model-free algorithm is inefficient.
References
Eyal Even-Dar, Sham M. Kakade, Yishay Mansour, "Reinforcement Learning in POMDPs without Resets", 19th IJCAI, July 31, 2005.
Mance E. Harmon, Stephanie S. Harmon, "Reinforcement Learning: A Tutorial".
Website about reinforcement learning: http://www-anw.cs.umass.edu/rlr/