MDP (Markov Decision Process) • An MDP model contains: • A set of states S • A set of actions A • A state transition function T • Deterministic or stochastic • A reward function R(s, a)
POMDP • A POMDP model contains: • A set of states S • A set of actions A • A state transition function T • A reward function R(s, a) • A finite set of observations Ω • An observation function O: S × A → Π(Ω) • O(s', a, o) is the probability of observing o after taking action a and landing in state s'
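A POMDP is thus the tuple ⟨S, A, T, R, Ω, O⟩. As a minimal sketch (the field names below are ours, not part of any standard API), the model can be carried around as a plain container with T and O stored as probability tables:

from typing import Dict, List, NamedTuple, Tuple

class POMDP(NamedTuple):
    states: List[str]                      # S
    actions: List[str]                     # A
    T: Dict[Tuple[str, str, str], float]   # T[(s, a, s')] = Pr(s' | s, a)
    R: Dict[Tuple[str, str], float]        # R[(s, a)]     = immediate reward
    observations: List[str]                # Omega
    O: Dict[Tuple[str, str, str], float]   # O[(s', a, o)] = Pr(o | a, s')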
POMDP Problem • 1. Belief state • First approach: choose the most probable state of the world, given past experience • Information about the state arrives only through observations, so it is not explicit • Second approach: maintain a probability distribution over the states of the world (updated as sketched below)
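For the second approach, the belief is updated by Bayes' rule after each action a and observation o. Following the standard state-estimator construction (using the T and O defined above):

b'(s') = SE(b, a, o)(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}

\Pr(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)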
POMDP Problem • 2. Finding an optimal policy: • The optimal policy is the solution of a continuous-space “Belief MDP”: • The set of belief states B • The set of actions A • The state-transition function τ(b, a, b’) • The reward function on belief states ρ(b, a)
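In the usual construction of the belief MDP (a sketch in our notation, reusing the state estimator SE from the update rule above), the reward and transition functions are:

\rho(b, a) = \sum_{s \in S} b(s)\, R(s, a)

\tau(b, a, b') = \Pr(b' \mid b, a) = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid a, b),
\quad \text{where } \Pr(b' \mid b, a, o) = 1 \text{ if } SE(b, a, o) = b' \text{ and } 0 \text{ otherwise.}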
Value function • Policy Tree • Infinite Horizon • Witness Algorithm
Policy Tree • A tree of depth t that specifies a complete t-step policy. • Nodes: actions; the top node determines the first action to be taken. • Edges: labeled with the resulting observations
Policy Tree • Value Evaluation: • Vp(s) is the expected t-step value of starting in state s and executing policy tree p.
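Written out, the standard recursion is as follows (a sketch in our notation: a(p) is the action at the root of p, o(p) is the (t − 1)-step subtree followed after observing o, and γ is the discount factor):

V_p(s) = R(s, a(p)) + \gamma \sum_{s' \in S} T(s, a(p), s') \sum_{o \in \Omega} O(s', a(p), o)\, V_{o(p)}(s')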
Policy Tree • Value Evaluation: • Expected value under policy tree p, starting from belief state b: Vp(b) = Σ over s∈S of b(s) Vp(s) = b · αp • where αp = ⟨Vp(s1), …, Vp(sn)⟩ is the vector of state values induced by p • To compare the execution of different policy trees from different initial belief states: Vt(b) = max over p of b · αp
Policy Tree • Value Evaluation: • Vt with only two states:
Policy Tree • Value Evaluation: • Vt with three states:
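These plots are the upper surface of the linear functions b · αp, one per policy tree, so Vt is piecewise linear and convex over the belief simplex. A small sketch with made-up α-vectors for a two-state problem:

import numpy as np

# Hypothetical alpha-vectors: alphas[p] = (Vp(s1), Vp(s2)); the numbers are illustrative only
alphas = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.6, 0.6]])

def value(b, alphas):
    # Vt(b) = max over policy trees p of b . alpha_p (the upper surface)
    return np.max(alphas @ b)

for p1 in np.linspace(0.0, 1.0, 5):          # sweep the belief b = (p1, 1 - p1)
    b = np.array([p1, 1.0 - p1])
    print(f"b(s1)={p1:.2f}  Vt(b)={value(b, alphas):.2f}")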
Infinite Horizon • Three algorithms to compute V: • Naive approach • Improved by choosing useful policy trees • Witness algorithm
Infinite Horizon • Naive approach: • ε is a small number • A t-step policy tree contains (|Ω|^t − 1) / (|Ω| − 1) nodes • Each node can be labeled with any of the |A| possible actions • Total number of policy trees: |A|^((|Ω|^t − 1) / (|Ω| − 1))
Infinite Horizon • Improved by choosing useful policy trees: • Vt-1, the set of useful (t − 1)-step policy trees, can be used to construct a superset Vt+ of the useful t-step policy trees: choose one of the |A| root actions and attach one tree from Vt-1 to each of the |Ω| observation branches. • So there are |A| · |Vt-1|^|Ω| elements in Vt+
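A quick way to see the |A| · |Vt-1|^|Ω| count: every element of Vt+ is a choice of root action plus an assignment of one (t − 1)-step useful tree to each observation branch. A small sketch with placeholder trees (the names are ours):

import itertools

A = ['a1', 'a2']                 # actions
Omega = ['o1', 'o2']             # observations
V_prev = ['p1', 'p2', 'p3']      # stand-ins for the useful (t-1)-step trees

# Each candidate t-step tree: a root action plus one subtree per observation
V_plus = [(a, dict(zip(Omega, subtrees)))
          for a in A
          for subtrees in itertools.product(V_prev, repeat=len(Omega))]

print(len(V_plus))               # |A| * |V_prev|^|Omega| = 2 * 3**2 = 18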
Infinite Horizon • Improved by choosing useful policy trees:
Infinite Horizon • Witness algorithm:
Infinite Horizon • Witness algorithm: • Q^a_t is the set of t-step policy trees that have action a at their root • Q^a_t(b) = max over p ∈ Q^a_t of b · αp is the corresponding value function • And Vt(b) = max over a ∈ A of Q^a_t(b)
Infinite Horizon • Witness algorithm: • Finding a witness: • At each iteration we ask: is there some belief state b for which the true value Q^a_t(b), computed by one-step lookahead using Vt-1, is different from the estimated value computed using the set U? • Provided with U, that estimate is max over p ∈ U of b · αp
Infinite Horizon • Witness algorithm: • Finding a witness: • Now we can state the witness theorem [25]: the true Q-function, Q^a_t, differs from the approximate Q-function represented by U if and only if there is some p ∈ U, some o ∈ Ω, and some p' ∈ Vt-1 for which there is some belief state b at which pnew, the tree obtained from p by replacing its o-subtree with p', has a higher value than every tree in U
Infinite Horizon • Witness algorithm: • Finding a witness:
Infinite Horizon • Witness algorithm: • Finding witness: • The linear program used to find witness points:
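A sketch of that linear program, in the usual witness formulation (our notation; the decision variables are δ and b(s) for each s ∈ S, and αp̃ is the value vector of tree p̃):

\begin{aligned}
\text{maximize } & \delta \\
\text{subject to } & b \cdot \alpha_{p_{\mathrm{new}}} \ge b \cdot \alpha_{\tilde p} + \delta & \text{for all } \tilde p \in U, \\
& \textstyle\sum_{s \in S} b(s) = 1, \quad b(s) \ge 0 & \text{for all } s \in S.
\end{aligned}

If the optimum has δ > 0, the maximizing b is a witness point: a belief state where pnew is strictly better than every tree currently in U.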
Infinite Horizon • Witness algorithm: • Complete value-iteration: • An agenda is initialized with any single policy tree • A set U contains the desired (useful) policy trees found so far • Each candidate pnew taken from the agenda is tested for a witness point at which it improves over the policy trees in U (a skeleton of this loop follows below) • 1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates. • 2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
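A structural sketch of that loop in Python. The callables find_witness, best_tree_at, and variants are hypothetical stand-ins for the linear program above, the one-step lookahead, and the single-subtree neighbourhood; they are not defined here, and, per the usual presentation, the variants pushed onto the agenda are those of the tree just added to U.

from collections import deque

def witness_value_iteration_step(initial_tree, find_witness, best_tree_at, variants):
    """Skeleton of one witness step (a sketch, not a reference implementation).

    find_witness(p_new, U) -> a witness belief state, or None
    best_tree_at(b)        -> the best policy tree at belief b (one-step lookahead)
    variants(p)            -> trees differing from p in a single subtree
    """
    agenda = deque([initial_tree])    # start from any single policy tree
    U = []                            # useful policy trees found so far
    while agenda:
        p_new = agenda[0]
        b = find_witness(p_new, U)
        if b is None:
            agenda.popleft()          # no witness: discard this candidate
        else:
            p_best = best_tree_at(b)  # witness found: keep the best tree there
            U.append(p_best)
            agenda.extend(variants(p_best))
    return U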
Infinite Horizon • Witness algorithm: • Complexity: • Since we know that no more than witness points are discovered (each adds a tree to the set of useful policy trees), • only trees can ever be added to the agenda (in addition to the one tree in the initial agenda). • Each linear program solved has variables and no more than constraints. • Each of these linear programs either removes a policy tree from the agenda (this happens at most times) or discovers a witness point (this happens at most times).
Tiger Problem • Two doors: • Behind one door is a tiger • Behind the other door is a large reward • Two states: • sl is the state of the world when the tiger is behind the left door, sr when it is behind the right door • Three actions: • left, right, and listen • Rewards: • The reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is -100, and the cost of listen is -1 • Observations: • hearing the tiger on the left (Tl) or hearing it on the right (Tr) • In state sl, the listen action results in observation Tl with probability 0.85 and observation Tr with probability 0.15; conversely for world state sr.
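A small runnable illustration of the belief state in this problem: starting from the uniform belief and hearing the tiger on the left once, then twice. It assumes, as in the standard formulation, that the listen action leaves the tiger where it is, i.e. T(s, listen, s) = 1.

def update_belief(belief, obs, p_correct=0.85):
    """Bayes update of (P(sl), P(sr)) after one listen action; obs is 'Tl' or 'Tr'."""
    b_sl, b_sr = belief
    o_sl = p_correct if obs == 'Tl' else 1 - p_correct   # O(sl, listen, obs)
    o_sr = p_correct if obs == 'Tr' else 1 - p_correct   # O(sr, listen, obs)
    unnorm = (o_sl * b_sl, o_sr * b_sr)
    total = sum(unnorm)                                  # Pr(obs | listen, b)
    return (unnorm[0] / total, unnorm[1] / total)

b = (0.5, 0.5)                    # initially no information about the tiger
b = update_belief(b, 'Tl')        # hear the tiger on the left
print(b)                          # -> approx. (0.85, 0.15)
b = update_belief(b, 'Tl')        # hear it on the left again
print(b)                          # -> approx. (0.970, 0.030)

With the rewards above, it is worth paying the -1 cost of listen repeatedly until the belief is lopsided enough that opening a door has positive expected value.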
Tiger Problem • Decreasing listening reliability from 0.85 down to 0.65: