POMDP Simple Example
Andrey Andreyevich Markov
POMDP – Instance general description
Problem description: A robot is situated on a circular course with three positions, S = { S1, S2, S3 }. However, it cannot tell in which state it starts.
Instance parameters – The actions set
The actions set:
Move clockwise => am
Stay put => as
Activate colour sensor => ac
A = { am, as, ac }
Instance parameters – The score function
The score function: R(s, a) is the (immediate) reward for doing a ∈ A in state s.
R(S1, as) = +1
R(S2, as) = R(S3, as) = -1
R(S1, ac) = R(S2, ac) = R(S3, ac) = -0.1
R(S1, am) = R(S2, am) = R(S3, am) = 0
* We will use the discount factor γ = 0.5
Instance parameters – The transition function
The transition function: Tr(s, a, s') is the probability of reaching s' from s using a.
Moving clockwise (am) succeeds with probability 0.9 and leaves the robot in place with probability 0.1:
Tr(S1, am, S2) = 0.9, Tr(S1, am, S1) = 0.1, Tr(S1, am, S3) = 0
Tr(S2, am, S3) = 0.9, Tr(S2, am, S2) = 0.1, Tr(S2, am, S1) = 0
Tr(S3, am, S1) = 0.9, Tr(S3, am, S3) = 0.1, Tr(S3, am, S2) = 0
Staying put (as) and sensing (ac) never change the state:
Tr(S1, as, S1) = Tr(S2, as, S2) = Tr(S3, as, S3) = 1, all other as-transitions are 0
Tr(S1, ac, S1) = Tr(S2, ac, S2) = Tr(S3, ac, S3) = 1, all other ac-transitions are 0
Instance parameters – The observation set & function
The observations set:
Black observation => ob
Red observation => or
Ω = { ob, or }
The observation function: O(s, a, o) is the probability of observing o ∈ Ω after doing a and reaching s.
After am or as the observation is uninformative:
O(s, am, ob) = O(s, am, or) = 0.5 and O(s, as, ob) = O(s, as, or) = 0.5 for every state s
After activating the colour sensor (ac) the observation is informative:
O(S1, ac, or) = 0.8, O(S1, ac, ob) = 0.2
O(S2, ac, ob) = 0.8, O(S2, ac, or) = 0.2
O(S3, ac, ob) = 0.8, O(S3, ac, or) = 0.2
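The instance is small enough to write down explicitly. Below is a minimal Python sketch of the model as plain dictionaries; the numbers come straight from the slides above, while the variable names (S, A, OBS, R, Tr, O, GAMMA) are my own.

# POMDP instance from the slides, encoded as plain Python dictionaries.
S = ["S1", "S2", "S3"]          # states
A = ["am", "as", "ac"]          # move clockwise, stay put, activate colour sensor
OBS = ["ob", "or"]              # black / red observation
GAMMA = 0.5                     # discount factor

# Immediate reward R(s, a).
R = {("S1", "as"): +1.0, ("S2", "as"): -1.0, ("S3", "as"): -1.0,
     ("S1", "ac"): -0.1, ("S2", "ac"): -0.1, ("S3", "ac"): -0.1,
     ("S1", "am"): 0.0,  ("S2", "am"): 0.0,  ("S3", "am"): 0.0}

# Transition probabilities Tr(s, a, s'); omitted triples are 0.
Tr = {("S1", "am", "S2"): 0.9, ("S1", "am", "S1"): 0.1,
      ("S2", "am", "S3"): 0.9, ("S2", "am", "S2"): 0.1,
      ("S3", "am", "S1"): 0.9, ("S3", "am", "S3"): 0.1}
for s in S:                      # as and ac leave the state unchanged
    Tr[(s, "as", s)] = 1.0
    Tr[(s, "ac", s)] = 1.0

# Observation probabilities O(s', a, o); am and as give a 50/50 signal.
O = {}
for s in S:
    for a in ("am", "as"):
        O[(s, a, "ob")] = O[(s, a, "or")] = 0.5
O.update({("S1", "ac", "or"): 0.8, ("S1", "ac", "ob"): 0.2,
          ("S2", "ac", "ob"): 0.8, ("S2", "ac", "or"): 0.2,
          ("S3", "ac", "ob"): 0.8, ("S3", "ac", "or"): 0.2})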
Calculations – notes
Notation – the belief state probabilities: We will use the following notations:
P(s1) = p, P(s2) = q, P(s3) = 1 - p - q
b = { P(s1) = p, P(s2) = q } (short for (p, q, 1-p-q)).
Notation – the next belief state: We will write b_o^a for the resulting belief state (i.e. the next belief state) after applying action a at the belief state b and observing o.
Start state: We start at the belief state bi = { P(s1) = ⅓, P(s2) = ⅓ }, i.e. the uniform belief (⅓, ⅓, ⅓).
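A minimal Python sketch of the standard POMDP state estimator that computes b_o^a together with P(o | b, a); the function name belief_update is my own, and it assumes the S, Tr and O dictionaries defined above.

def belief_update(b, a, o):
    """Standard POMDP state estimator: returns (b_o^a, P(o | b, a)).

    b maps each state to its probability; a is the action taken and
    o the observation received.
    """
    # Unnormalised new belief: b'(s') ∝ O(s', a, o) * sum_s Tr(s, a, s') * b(s)
    unnorm = {s2: O.get((s2, a, o), 0.0) *
                  sum(Tr.get((s, a, s2), 0.0) * b[s] for s in S)
              for s2 in S}
    p_o = sum(unnorm.values())       # P(o | b, a), the normalising constant
    return {s2: unnorm[s2] / p_o for s2 in S}, p_o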
1 step policy
Schema: a 1-step policy is a single action a ∈ A applied at the current belief state.
Concrete policies: there are 3 of them, one per action: am, as, ac.
The belief state value function: V_a(b) = Σs b(s)·R(s, a).
The initial belief state value: V_a(bi), worked out below.
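Evaluating V_a(bi) at the uniform start belief with the rewards above (a worked computation I derive from the given R and bi; the slide's own figures are not reproduced here):
V_as(bi) = ⅓·(+1) + ⅓·(-1) + ⅓·(-1) = -⅓
V_ac(bi) = ⅓·(-0.1) + ⅓·(-0.1) + ⅓·(-0.1) = -0.1
V_am(bi) = 0
So among 1-step policies, moving (am) is the best choice at bi.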
2 steps policy – 1st round
Schema: a 2-step policy picks a first action a and then, for each possible observation o, a 1-step policy to follow.
Concrete policy (1 out of 27): with 3 choices of first action and 3 choices of follow-up action for each of the 2 observations, there are 3·3² = 27 concrete 2-step policies.
The belief state value function and the initial belief state value: see the sketch below.
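A standard way to write the 2-step belief value, consistent with the notation above (a reconstruction, not necessarily the slide's exact formula):
V2(b) = Σs b(s)·R(s, a) + γ·Σo P(o | b, a)·V1(b_o^a)
that is, the expected immediate reward of the first action a, plus the discounted, observation-weighted values of the 1-step policies chosen for each observation o.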
Calculations – The new belief state & observations probabilities
We will calculate:
a) the next belief state: b_o^a(s') = O(s', a, o) · Σs Tr(s, a, s') · b(s) / P(o | b, a)
b) the observation probability: P(o | b, a) = Σs' O(s', a, o) · Σs Tr(s, a, s') · b(s)
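As a worked instance with the parameters above (the action/observation pair is chosen here for illustration): sensing at the uniform start belief, a = ac, leaves the state unchanged, so
b) P(or | bi, ac) = ⅓·0.8 + ⅓·0.2 + ⅓·0.2 = 0.4
a) b_or^ac(S1) = (0.8·⅓) / 0.4 = ⅔, and b_or^ac(S2) = b_or^ac(S3) = (0.2·⅓) / 0.4 = ⅙
A red reading therefore shifts the belief strongly towards S1; the same numbers fall out of belief_update(bi, "ac", "or") from the sketch above.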