
Inverse Reinforcement Learning in Partially Observable Environments


Presentation Transcript


  1. Jaedeug Choi, Kee-Eung Kim. Korea Advanced Institute of Science and Technology. JMLR, Jan 2011. Inverse Reinforcement Learning in Partially Observable Environments

  2. Basics • Reinforcement Learning (RL) • Markov Decision Process (MDP)

  3. Reinforcement Learning [Diagram: the agent, with its internal state, acts on the environment and receives observations and a reward]

  4. Inverse Reinforcement Learning [Diagram: the same agent-environment loop of internal state, actions, observations, and reward; here the reward is the quantity to be inferred]

  5. Why the reward function? • It addresses the more natural problem • It is the most transferable representation of the agent's behaviour!

  6. Example 1 Reward

  7. Example 2

  8. Agent • Name: Agent • Role: Decision making • Property: Principle of rationality

  9. Environment • Partially Observable Markov Decision Process (POMDP) • Markov Decision Process (MDP)

  10. MDP • Sequential decision making problem • States are directly perceived

  11. POMDP • Sequential decision making problem • States are perceived through noisy observations • Leads to the concept of a belief ("Seems like I am near a wall!")

  12. Policy • The expert's behaviour is given either as an explicit policy • or as sample trajectories

  13. IRL for MDP\R Apprenticeship learning

  14. Using Policies • Any policy deviating from the expert's policy should not yield a higher value. (Ng and Russell, 2000)
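
A minimal sketch of this condition in matrix form (following the state-only reward formulation of Ng and Russell, 2000; the function name, array shapes, and default values are our own illustration, not from the slides):

```python
import numpy as np

def satisfies_expert_optimality(T, R, expert_action, gamma=0.95, tol=1e-8):
    """Check that, under the candidate state-only reward R, no one-step
    deviation from the expert's stationary policy yields a higher value.

    T: (|A|, |S|, |S|) array with T[a, s, s'] = P(s' | s, a)
    R: (|S|,) candidate reward over states
    expert_action: (|S|,) integer array, the expert's action in each state
    """
    n_actions, n_states, _ = T.shape
    # Transition matrix induced by following the expert's policy.
    P_pi = T[expert_action, np.arange(n_states), :]            # (|S|, |S|)
    # V^pi = (I - gamma * P_pi)^(-1) R
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    for a in range(n_actions):
        # Deviating to action a for one step must not look better:
        # (P_pi - P_a) V^pi >= 0 componentwise (Ng & Russell's condition).
        if np.any((P_pi - T[a]) @ V_pi < -tol):
            return False
    return True
```

The full method then searches, e.g. by linear programming, for a reward that satisfies these constraints with the largest margin.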

  15. Using Sample Trajectories • Linear approximation of the reward function: R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a), where α ∈ [-1,1]^d and φ: S×A → [0,1]^d are the basis functions.
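
For concreteness, a tiny Python sketch of this parameterization (the names `phi` and `alpha` and the toy dimensions are illustrative only):

```python
import numpy as np

def linear_reward(phi, alpha):
    """R(s, a) = alpha^T phi(s, a) for basis features phi in [0, 1]^d
    and weights alpha in [-1, 1]^d.

    phi: (|S|, |A|, d) array of basis-function values
    alpha: (d,) weight vector
    Returns R as a (|S|, |A|) array."""
    return phi @ alpha

# Toy example: 5 states, 2 actions, 3 basis functions.
rng = np.random.default_rng(0)
phi = rng.uniform(0.0, 1.0, size=(5, 2, 3))
alpha = rng.uniform(-1.0, 1.0, size=3)
R = linear_reward(phi, alpha)    # R[s, a] = alpha^T phi[s, a]
```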

  16. Using Linear Programming

  17. Apprenticeship • Learn policy from expert’s demonstration. • Does not compute the exact reward function.

  18. Using QCP (a quadratically constrained program) • Approximated using the projection method!
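
A sketch of that projection approximation in the spirit of Abbeel and Ng (2004); the callback `compute_mu_for_reward`, which must solve the underlying (PO)MDP for the reward wᵀφ and return the feature expectations of the resulting policy, is assumed to be provided by an external solver:

```python
import numpy as np

def projection_method(mu_expert, compute_mu_for_reward, n_iters=20, eps=1e-6):
    """Iteratively match the expert's feature expectations mu_expert (shape (d,))
    by projecting onto the line between the current estimate and the latest
    learner's feature expectations, returning the final reward weights w."""
    # Feature expectations of an arbitrary initial policy (zero reward weights).
    mu = compute_mu_for_reward(np.zeros_like(mu_expert))
    mu_bar = mu
    w = mu_expert - mu_bar
    for _ in range(n_iters):
        w = mu_expert - mu_bar           # current reward-weight estimate
        if np.linalg.norm(w) < eps:      # margin is small enough: done
            break
        mu = compute_mu_for_reward(w)    # feature expectations of new optimal policy
        d = mu - mu_bar
        dd = d @ d
        if dd < 1e-12:                   # no progress; avoid division by zero
            break
        # Orthogonal projection of mu_expert onto the line through mu_bar and mu.
        mu_bar = mu_bar + ((d @ (mu_expert - mu_bar)) / dd) * d
    return w
```

This replaces the quadratically constrained max-margin problem with a sequence of (PO)MDP solves and cheap vector projections.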

  19. IRL in POMDPs • An ill-posed problem: existence, uniqueness, and stability of the solution (e.g., R = 0 trivially explains any behaviour) • Computationally intractable: the problem size grows exponentially!

  20. IRL for POMDP\R

  21. Comparing Q Functions • Constraint: at every node, the expert's choice must yield a Q value at least as high as any one-step deviation ⟨a, os⟩. • Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert; over all |N| nodes there are |N||A||N|^|Z| deviations – the number of constraints grows exponentially!
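
The counting argument can be made concrete with a small enumeration (the node, action, and observation names are hypothetical):

```python
from itertools import product

def one_step_deviations(actions, nodes, observations):
    """Enumerate the one-step deviations from a single FSC node: pick any
    action and, for every observation, any successor node. This yields
    |A| * |N|^|Z| deviations per node, hence |N| * |A| * |N|^|Z| in total."""
    for a in actions:
        for successors in product(nodes, repeat=len(observations)):
            yield a, dict(zip(observations, successors))

# 2 actions, 3 nodes, 2 observations -> 2 * 3^2 = 18 deviations per node.
devs = list(one_step_deviations(['listen', 'open'], [0, 1, 2], ['left', 'right']))
assert len(devs) == 2 * 3 ** 2
```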

  22. DP Update Based Approach • Comes from the generalized Howard's policy improvement theorem: "If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state." (Hansen, 1998)

  23. Comparison

  24. IRL for POMDP\R

  25. MMV (Max-Margin between Values) Method

  26. MMFE (Max-Margin between Feature Expectations) Method • Approximated using the projection (PRJ) method!

  27. Experimental Results • Tiger • 1d Maze • 5 x 5 Grid World • Heaven / Hell • Rock Sample

  28. Illustration

  29. Characteristics

  30. Results from Policy

  31. Results from Trajectories

  32. Questions ???

  33. Backup slides !

  34. Inverse Reinforcement Learning Given • measurements of an agent's behaviour over time, in a variety of circumstances, • measurements of the sensory inputs to the agent, • a model of the physical environment (including the agent's body). Determine • the reward function that the agent is optimizing. Russell (1998)

  35. Partially Observable Environment • Mathematical framework for single-agent planning under uncertainty. • Agent cannot directly observe the underlying states. • Example: Study global warming from your grandfather’s diary !

  36. Advantages of IRL • Natural way to examine animal and human behaviors. • Reward function – most transferable representation of agent’s behavior.

  37. MDP • Models a sequential decision-making problem. • A five-tuple <S, A, T, R, γ>: • S – finite set of states • A – finite set of actions • T – state transition function, T: S×A → Π(S) • R – reward function, R: S×A → ℝ • γ – discount factor in [0, 1) Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
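
A short sketch of evaluating Q^π for a deterministic policy on such an MDP (the array layout is one possible convention, not prescribed by the slides):

```python
import numpy as np

def q_for_policy(T, R, policy, gamma):
    """Q^pi(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V^pi(s').

    T: (|A|, |S|, |S|) transition probabilities
    R: (|S|, |A|) reward, policy: (|S|,) integer actions, gamma in [0, 1).
    """
    n_actions, n_states, _ = T.shape
    # Dynamics and rewards induced by following the policy.
    P_pi = T[policy, np.arange(n_states), :]       # (|S|, |S|)
    R_pi = R[np.arange(n_states), policy]          # (|S|,)
    # V^pi solves V = R_pi + gamma * P_pi V  (exact policy evaluation).
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # One Bellman backup of V under every action gives Q^pi.
    return R + gamma * (T @ V).T                   # (|S|, |A|)
```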

  38. POMDP • Partially observable environment • An eight-tuple <S, A, Z, T, O, R, b₀, γ> • Z – finite set of observations • O – observation function, O: S×A → Π(Z) • b₀ – initial state distribution, b₀(s) • Belief b – b(s) is the probability that the state is s at the current time step (introduced to summarize the history of actions and observations and reduce complexity).
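
A hedged sketch of the belief update that makes this point concrete: the belief is a sufficient statistic of the action-observation history and is updated by Bayes' rule, b'(s') ∝ O(s', a, z) ∑_s T(s, a, s') b(s). The array layout below is one possible convention:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """b: (|S|,) current belief; a, z: indices of the action taken and the
    observation received; T: (|A|, |S|, |S|) with T[a, s, s'] = P(s' | s, a);
    O: (|A|, |S|, |Z|) with O[a, s', z] = P(z | s', a)."""
    unnormalized = O[a, :, z] * (b @ T[a])     # proportional to the new belief
    prob_z = unnormalized.sum()                # P(z | b, a)
    if prob_z == 0.0:
        raise ValueError("observation has zero probability under this belief/action")
    return unnormalized / prob_z
```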

  39. Finite State Controller (FSC) • A policy in a POMDP is represented as an FSC, a directed graph <N, E>. • Each node n ∈ N is associated with an action a ∈ A. • Each node has one outgoing edge e ∈ E per observation z ∈ Z. • π = <ψ, η>, where ψ is the action strategy and η is the observation strategy. Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
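
A sketch of how such an FSC can be evaluated by solving one linear system over (node, state) pairs; the data layout (ψ as an action array, η as a successor table) is our assumption, not from the paper:

```python
import numpy as np

def evaluate_fsc(psi, eta, T, O, R, gamma):
    """Return V with V[n, s] = V^pi(<n, s>) for the FSC pi = <psi, eta>.

    psi[n]      : action taken at node n
    eta[n][z]   : successor node after observing z at node n
    T[a, s, s'] : P(s' | s, a);  O[a, s', z] : P(z | s', a);  R[s, a] : reward
    Solves V(n, s) = R(s, psi[n]) + gamma * sum_{s', z} T(s, psi[n], s')
                     * O(psi[n], s', z) * V(eta[n][z], s').
    The value at a belief is then V^pi(<n, b>) = sum_s b(s) V(n, s)."""
    n_nodes, n_states, n_obs = len(psi), T.shape[1], O.shape[2]
    dim = n_nodes * n_states
    idx = lambda n, s: n * n_states + s        # flatten (node, state) pairs
    A = np.eye(dim)                            # builds (I - gamma * M)
    c = np.zeros(dim)
    for n in range(n_nodes):
        a = psi[n]
        for s in range(n_states):
            c[idx(n, s)] = R[s, a]
            for s2 in range(n_states):
                for z in range(n_obs):
                    A[idx(n, s), idx(eta[n][z], s2)] -= gamma * T[a, s, s2] * O[a, s2, z]
    return np.linalg.solve(A, c).reshape(n_nodes, n_states)
```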

  40. Using Projection Method

  41. PRJ Method
