Jaedeug Choi, Kee-Eung Kim. Korea Advanced Institute of Science and Technology. JMLR, January 2011. Inverse Reinforcement Learning in Partially Observable Environments
Basics • Reinforcement Learning (RL) • Markov Decision Process (MDP)
Reinforcement Learning [diagram: an agent with an internal state acts on the environment and receives observations and rewards; the reward function is given]
Inverse Reinforcement Learning [diagram: the same agent-environment loop, but the reward function is unknown and must be recovered from the agent's behaviour]
Why the reward function? • It addresses the more natural problem • It is the most transferable representation of the agent's behaviour
Example 1 [figure: an example task annotated with its reward]
Agent • Name: Agent • Role: Decision making • Property: Principle of rationality
Environment Partially Observable Markov Decision Process (POMDP) Markov Decision Process (MDP)
MDP • Sequential decision-making problem • States are directly perceived
POMDP • Sequential decision-making problem • States are perceived only through noisy observations, which motivates the concept of a belief over states (e.g., "it seems like I am near a wall")
Policy [diagram: the expert's behaviour can be given either as an explicit policy or as sample trajectories]
IRL for MDP\R (an MDP with the reward function removed) • Apprenticeship learning
Using Policies • Any policy deviating from the expert's policy should not yield a higher value (Ng and Russell, 2000)
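As a concrete illustration of this condition, here is a minimal numpy sketch (an assumption-laden reconstruction, not code from the paper) that checks the Ng and Russell optimality condition for a tabular MDP with a state-only reward vector; the function name and data layout are made up for this example.

```python
import numpy as np

def expert_policy_is_optimal(P, pi, R, gamma=0.95, tol=1e-8):
    """Check the Ng & Russell (2000) IRL condition for a tabular MDP.

    P  : dict mapping each action a to its |S| x |S| transition matrix
    pi : length-|S| array, the expert's action in each state
    R  : candidate state-only reward vector of length |S|
    The expert policy is optimal under R iff deviating one step to any
    other action never increases the expected next-state value.
    """
    n_states = len(pi)
    # Transition matrix induced by following the expert policy
    P_pi = np.array([P[pi[s]][s] for s in range(n_states)])
    # V^pi = (I - gamma * P_pi)^{-1} R
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    for a in P:
        gap = P_pi @ V - P[a] @ V   # value lost by deviating to action a
        if np.any(gap < -tol):
            return False
    return True
```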
Using Sample Trajectories • Linear approximation of the reward function: R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a), where α ∈ [−1,1]^d and φ : S×A → [0,1]^d are basis functions
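A tiny sketch of this linear parameterization; the basis functions and weights below are made-up placeholders rather than the features used in the paper.

```python
import numpy as np

def linear_reward(alpha, phi, s, a):
    """R(s, a) = alpha_1*phi_1(s, a) + ... + alpha_d*phi_d(s, a) = alpha^T phi(s, a),
    with weights alpha in [-1, 1]^d and basis functions phi mapping into [0, 1]^d."""
    return float(np.dot(alpha, phi(s, a)))

# Toy example: indicator features over 5 states, scaled by the action
phi = lambda s, a: np.array([float(s == i) for i in range(5)]) * (a + 1) / 2
alpha = np.array([0.1, -0.3, 0.5, 0.0, 0.2])
print(linear_reward(alpha, phi, s=2, a=1))   # 0.5
```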
Apprenticeship • Learn a policy from the expert's demonstrations • Does not compute the exact reward function
Using QCP • Approximated using the projection method (a sketch of one projection step follows below)
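The projection method referred to here is, as far as the slides indicate, the one from Abbeel and Ng's apprenticeship learning. Below is a hedged sketch of a single projection update over feature expectations; the variable names are assumptions, and estimating the feature expectations themselves (by rolling out each candidate policy) is omitted.

```python
import numpy as np

def projection_step(mu_E, mu_bar, mu_new):
    """One projection update for apprenticeship learning.

    mu_E   : expert's (estimated) feature expectations
    mu_bar : current projected point
    mu_new : feature expectations of the latest candidate policy
    Returns the projection of mu_E onto the line through mu_bar and
    mu_new, plus the new reward weights w = mu_E - mu_bar, whose norm
    bounds the remaining feature-expectation gap.
    """
    d = mu_new - mu_bar
    step = np.dot(d, mu_E - mu_bar) / np.dot(d, d)
    mu_bar_next = mu_bar + step * d
    w = mu_E - mu_bar_next
    return mu_bar_next, w
```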
IRL in POMDP • Ill-posed problem: existence, uniqueness, and stability of the solution are all problematic (e.g., the degenerate reward R = 0 always satisfies the constraints) • Computationally intractable: the problem size grows exponentially
Comparing Q functions • Constraint: the expert's FSC should yield a Q value at least as high as any one-step deviation from it • Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert, so over all |N| nodes there are |N||A||N|^|Z| deviations, which grows exponentially (a small numeric illustration follows below)
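To make the growth concrete, a short back-of-the-envelope computation with made-up controller sizes:

```python
# One-step deviations from an FSC with |N| nodes, |A| actions, |Z| observations:
# a deviating node may pick any of |A| actions and re-route each of its |Z|
# observation edges to any of |N| nodes.
n_nodes, n_actions, n_obs = 10, 4, 3
per_node = n_actions * n_nodes ** n_obs   # |A| * |N|^|Z| = 4,000
total = n_nodes * per_node                # |N| * |A| * |N|^|Z| = 40,000
print(per_node, total)
```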
DP Update Based Approach • Based on the generalized Howard's policy improvement theorem: if an FSC policy is not optimal, the DP update transforms it into an FSC policy whose value function is as good or better for every belief state and strictly better for some belief state (Hansen, 1998)
MMFE Method (max-margin between feature expectations) • Approximated using the projection (PRJ) method
Experimental Results • Tiger • 1d Maze • 5 x 5 Grid World • Heaven / Hell • Rock Sample
Inverse Reinforcement Learning Given • measurements of an agent's behaviour over time, in a variety of circumstances, • measurements of the sensory inputs to the agent, • a model of the physical environment (including the agent's body). Determine • the reward function that the agent is optimizing. Russell (1998)
Partially Observable Environment • Mathematical framework for single-agent planning under uncertainty. • The agent cannot directly observe the underlying states. • Example: studying global warming from your grandfather's diary (only partial, noisy observations of the underlying state are available).
Advantages of IRL • Natural way to examine animal and human behaviors. • Reward function – most transferable representation of agent’s behavior.
MDP • Models a sequential decision-making problem. • Five-tuple <S, A, T, R, γ>: • S – finite set of states • A – finite set of actions • T – state transition function, T : S×A → Π(S) • R – reward function, R : S×A → ℝ • γ – discount factor in [0, 1) Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
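For a fixed policy, the Q-function above can be computed exactly by first solving the linear system for V^π. A minimal numpy sketch, assuming tabular T and R arrays and a deterministic policy (all names and layouts are illustrative):

```python
import numpy as np

def q_pi(T, R, pi, gamma=0.95):
    """Q^pi(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V^pi(s').

    T  : transition probabilities, shape (|S|, |A|, |S|)
    R  : rewards, shape (|S|, |A|)
    pi : length-|S| array giving the action chosen in each state
    """
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, pi]   # (|S|, |S|) transition matrix under the policy
    R_pi = R[idx, pi]   # (|S|,)    reward vector under the policy
    # V^pi solves (I - gamma * T_pi) V = R_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    return R + gamma * T @ V   # Q^pi, shape (|S|, |A|)
```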
POMDP • Partially observable environment • Eight-tuple <S, A, Z, T, O, R, b₀, γ>: • Z – finite set of observations • O – observation function, O : S×A → Π(Z) • b₀ – initial state distribution, b₀(s) • Belief b – b(s) is the probability that the state is s at the current time step (introduced to reduce the complexity caused by the history of action-observation sequences).
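The belief is maintained by a Bayes update after each action-observation pair: b'(s') ∝ O(s', a, z) ∑_s T(s, a, s') b(s). A minimal sketch, assuming tabular T and O arrays (names are illustrative):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update for a POMDP.

    b : current belief over states, length |S|
    a : action taken, z : observation received
    T : T[s, a, s'], shape (|S|, |A|, |S|)
    O : O[s', a, z], shape (|S|, |A|, |Z|)
    """
    predicted = b @ T[:, a, :]            # sum_s b(s) * T(s, a, s')
    unnormalized = O[:, a, z] * predicted
    return unnormalized / unnormalized.sum()
```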
Finite State Controller (FSC) • A policy in a POMDP is represented by an FSC. • It is a directed graph <N, E>: each node n ∈ N is associated with an action a ∈ A, and each node has one outgoing edge e ∈ E per observation z ∈ Z. • π = <ψ, η>, where ψ is the action strategy and η is the observation strategy. Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
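The last equation averages node-state values over the belief. The sketch below evaluates an FSC's node-state values by solving the corresponding linear system and then combines them with a belief; psi, eta, and the array layouts are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def fsc_node_state_values(T, O, R, psi, eta, gamma=0.95):
    """Solve V(n, s) = R(s, psi[n]) + gamma * sum_{s', z} T(s, psi[n], s') *
                       O(s', psi[n], z) * V(eta[n, z], s') for all (n, s).

    psi[n]    : action selected at node n
    eta[n, z] : successor node after observing z at node n
    """
    n_nodes, n_states, n_obs = len(psi), T.shape[0], O.shape[2]
    A_mat = np.eye(n_nodes * n_states)
    rhs = np.zeros(n_nodes * n_states)
    for n in range(n_nodes):
        a = psi[n]
        for s in range(n_states):
            row = n * n_states + s
            rhs[row] = R[s, a]
            for s2 in range(n_states):
                for z in range(n_obs):
                    col = eta[n, z] * n_states + s2
                    A_mat[row, col] -= gamma * T[s, a, s2] * O[s2, a, z]
    return np.linalg.solve(A_mat, rhs).reshape(n_nodes, n_states)

def fsc_value_at_belief(V_ns, n, b):
    """V^pi(<n, b>) = sum_s b(s) * V^pi(<n, s>)."""
    return float(b @ V_ns[n])
```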