Jaedeug Choi, Kee-Eung Kim. Korea Advanced Institute of Science and Technology. JMLR, January 2011. Inverse Reinforcement Learning in Partially Observable Environments
Basics • Reinforcement Learning (RL) • Markov Decision Process (MDP)
Reinforcement Learning [diagram: an agent with an internal state acts on the environment and receives observations and rewards; the reward function is given]
Inverse Reinforcement Learning [diagram: the same agent-environment loop, but the reward function is unknown and must be recovered from the agent's behaviour]
Why the reward function? • It addresses the more natural problem • It is the most transferable representation of the agent's behaviour
Example 1 [figure: an example task annotated with its reward]
Agent • Name: Agent • Role: Decision making • Property: Principle of rationality
Environment Partially Observable Markov Decision Process (POMDP) Markov Decision Process (MDP)
MDP • Sequential decision-making problem • States are directly perceived
POMDP • Sequential decision-making problem • States are perceived only through noisy observations, which motivates the concept of a belief over states (e.g., "it seems like I am near a wall")
Policy [diagram: the expert's behaviour can be given either as an explicit policy or as sample trajectories]
IRL for MDP\R (an MDP with the reward function removed) • Apprenticeship learning
Using Policies • Any policy deviating from the expert's policy should not yield a higher value (Ng and Russell, 2000)
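As a concrete illustration of this condition, here is a minimal numpy sketch (an assumption-laden reconstruction, not code from the paper) that checks the Ng and Russell optimality condition for a tabular MDP with a state-only reward vector; the function name and data layout are made up for this example.

```python
import numpy as np

def expert_policy_is_optimal(P, pi, R, gamma=0.95, tol=1e-8):
    """Check the Ng & Russell (2000) IRL condition for a tabular MDP.

    P  : dict mapping each action a to its |S| x |S| transition matrix
    pi : length-|S| array, the expert's action in each state
    R  : candidate state-only reward vector of length |S|
    The expert policy is optimal under R iff deviating one step to any
    other action never increases the expected next-state value.
    """
    n_states = len(pi)
    # Transition matrix induced by following the expert policy
    P_pi = np.array([P[pi[s]][s] for s in range(n_states)])
    # V^pi = (I - gamma * P_pi)^{-1} R
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    for a in P:
        gap = P_pi @ V - P[a] @ V   # value lost by deviating to action a
        if np.any(gap < -tol):
            return False
    return True
```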
Using Sample Trajectories • Linear approximation of the reward function: R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a), where α ∈ [−1,1]^d and φ : S×A → [0,1]^d are basis functions
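A tiny sketch of this linear parameterization; the basis functions and weights below are made-up placeholders rather than the features used in the paper.

```python
import numpy as np

def linear_reward(alpha, phi, s, a):
    """R(s, a) = alpha_1*phi_1(s, a) + ... + alpha_d*phi_d(s, a) = alpha^T phi(s, a),
    with weights alpha in [-1, 1]^d and basis functions phi mapping into [0, 1]^d."""
    return float(np.dot(alpha, phi(s, a)))

# Toy example: indicator features over 5 states, scaled by the action
phi = lambda s, a: np.array([float(s == i) for i in range(5)]) * (a + 1) / 2
alpha = np.array([0.1, -0.3, 0.5, 0.0, 0.2])
print(linear_reward(alpha, phi, s=2, a=1))   # 0.5
```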
Apprenticeship • Learn a policy from the expert's demonstrations • Does not compute the exact reward function
Using QCP • Approximated using the projection method (a sketch of one projection step follows below)
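The projection method referred to here is, as far as the slides indicate, the one from Abbeel and Ng's apprenticeship learning. Below is a hedged sketch of a single projection update over feature expectations; the variable names are assumptions, and estimating the feature expectations themselves (by rolling out each candidate policy) is omitted.

```python
import numpy as np

def projection_step(mu_E, mu_bar, mu_new):
    """One projection update for apprenticeship learning.

    mu_E   : expert's (estimated) feature expectations
    mu_bar : current projected point
    mu_new : feature expectations of the latest candidate policy
    Returns the projection of mu_E onto the line through mu_bar and
    mu_new, plus the new reward weights w = mu_E - mu_bar, whose norm
    bounds the remaining feature-expectation gap.
    """
    d = mu_new - mu_bar
    step = np.dot(d, mu_E - mu_bar) / np.dot(d, d)
    mu_bar_next = mu_bar + step * d
    w = mu_E - mu_bar_next
    return mu_bar_next, w
```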
IRL in POMDP • Ill-posed problem: existence, uniqueness, and stability of the solution are all problematic (e.g., the degenerate reward R = 0 always satisfies the constraints) • Computationally intractable: the problem size grows exponentially
Comparing Q functions • Constraint: the expert's FSC should yield a Q value at least as high as any one-step deviation from it • Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert, so over all |N| nodes there are |N||A||N|^|Z| deviations, which grows exponentially (a small numeric illustration follows below)
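To make the growth concrete, a short back-of-the-envelope computation with made-up controller sizes:

```python
# One-step deviations from an FSC with |N| nodes, |A| actions, |Z| observations:
# a deviating node may pick any of |A| actions and re-route each of its |Z|
# observation edges to any of |N| nodes.
n_nodes, n_actions, n_obs = 10, 4, 3
per_node = n_actions * n_nodes ** n_obs   # |A| * |N|^|Z| = 4,000
total = n_nodes * per_node                # |N| * |A| * |N|^|Z| = 40,000
print(per_node, total)
```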
DP Update Based Approach • Based on the generalized Howard's policy improvement theorem: if an FSC policy is not optimal, the DP update transforms it into an FSC policy whose value function is as good or better for every belief state and strictly better for some belief state (Hansen, 1998)
MMFE Method (max-margin between feature expectations) • Approximated using the projection (PRJ) method
Experimental Results • Tiger • 1d Maze • 5 x 5 Grid World • Heaven / Hell • Rock Sample
Inverse Reinforcement Learning Given • measurements of an agent's behaviour over time, in a variety of circumstances, • measurements of the sensory inputs to the agent, • a model of the physical environment (including the agent's body). Determine • the reward function that the agent is optimizing. Russell (1998)
Partially Observable Environment • Mathematical framework for single-agent planning under uncertainty. • The agent cannot directly observe the underlying states. • Example: studying global warming from your grandfather's diary (only partial, noisy observations of the underlying state are available).
Advantages of IRL • Natural way to examine animal and human behaviors. • Reward function – most transferable representation of agent’s behavior.
MDP • Models a sequential decision-making problem. • Five-tuple <S, A, T, R, γ>: • S – finite set of states • A – finite set of actions • T – state transition function, T : S×A → Π(S) • R – reward function, R : S×A → ℝ • γ – discount factor in [0, 1) Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
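For a fixed policy, the Q-function above can be computed exactly by first solving the linear system for V^π. A minimal numpy sketch, assuming tabular T and R arrays and a deterministic policy (all names and layouts are illustrative):

```python
import numpy as np

def q_pi(T, R, pi, gamma=0.95):
    """Q^pi(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V^pi(s').

    T  : transition probabilities, shape (|S|, |A|, |S|)
    R  : rewards, shape (|S|, |A|)
    pi : length-|S| array giving the action chosen in each state
    """
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, pi]   # (|S|, |S|) transition matrix under the policy
    R_pi = R[idx, pi]   # (|S|,)    reward vector under the policy
    # V^pi solves (I - gamma * T_pi) V = R_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    return R + gamma * T @ V   # Q^pi, shape (|S|, |A|)
```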
POMDP • Partially observable environment • Eight-tuple <S, A, Z, T, O, R, b₀, γ>: • Z – finite set of observations • O – observation function, O : S×A → Π(Z) • b₀ – initial state distribution, b₀(s) • Belief b – b(s) is the probability that the state is s at the current time step (introduced to reduce the complexity caused by the history of action-observation sequences).
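The belief is maintained by a Bayes update after each action-observation pair: b'(s') ∝ O(s', a, z) ∑_s T(s, a, s') b(s). A minimal sketch, assuming tabular T and O arrays (names are illustrative):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update for a POMDP.

    b : current belief over states, length |S|
    a : action taken, z : observation received
    T : T[s, a, s'], shape (|S|, |A|, |S|)
    O : O[s', a, z], shape (|S|, |A|, |Z|)
    """
    predicted = b @ T[:, a, :]            # sum_s b(s) * T(s, a, s')
    unnormalized = O[:, a, z] * predicted
    return unnormalized / unnormalized.sum()
```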
Finite State Controller (FSC) • A policy in a POMDP is represented by an FSC. • It is a directed graph <N, E>: each node n ∈ N is associated with an action a ∈ A, and each node has one outgoing edge e ∈ E per observation z ∈ Z. • π = <ψ, η>, where ψ is the action strategy and η is the observation strategy. Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
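The last equation averages node-state values over the belief. The sketch below evaluates an FSC's node-state values by solving the corresponding linear system and then combines them with a belief; psi, eta, and the array layouts are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def fsc_node_state_values(T, O, R, psi, eta, gamma=0.95):
    """Solve V(n, s) = R(s, psi[n]) + gamma * sum_{s', z} T(s, psi[n], s') *
                       O(s', psi[n], z) * V(eta[n, z], s') for all (n, s).

    psi[n]    : action selected at node n
    eta[n, z] : successor node after observing z at node n
    """
    n_nodes, n_states, n_obs = len(psi), T.shape[0], O.shape[2]
    A_mat = np.eye(n_nodes * n_states)
    rhs = np.zeros(n_nodes * n_states)
    for n in range(n_nodes):
        a = psi[n]
        for s in range(n_states):
            row = n * n_states + s
            rhs[row] = R[s, a]
            for s2 in range(n_states):
                for z in range(n_obs):
                    col = eta[n, z] * n_states + s2
                    A_mat[row, col] -= gamma * T[s, a, s2] * O[s2, a, z]
    return np.linalg.solve(A_mat, rhs).reshape(n_nodes, n_states)

def fsc_value_at_belief(V_ns, n, b):
    """V^pi(<n, b>) = sum_s b(s) * V^pi(<n, s>)."""
    return float(b @ V_ns[n])
```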