Meeting 3: POMDP (Partially Observable MDP) Presenters: 阮鶴鳴, 李運寰 (CSIE, senior year); Advisor: Prof. 李琳山
References • "Planning and Acting in Partially Observable Stochastic Domains", Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra, Artificial Intelligence, 1998 • "Spoken Dialogue Management Using Probabilistic Reasoning", Nicholas Roy, Joelle Pineau, Sebastian Thrun, ACL 2000
MDP (Markov Decision Process) • An MDP model contains: • A set of states S • A set of actions A • A state transition function T • Deterministic or stochastic • A reward function R(s, a)
MDP • For MDPs we can compute the optimal policy π and act by simply executing π(s) in the current state s. • What happens if the agent can no longer determine its current state with complete reliability?
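As a concrete sketch of the fully observable case, π can be computed by value iteration; the tiny two-state model below is invented for illustration (state names, rewards, and γ = 0.9 are assumptions, not from the slides):

```python
# Value iteration for a tiny, made-up two-state MDP.
GAMMA = 0.9

# T[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
T = {0: {'stay': [(0, 1.0)], 'go': [(1, 0.9), (0, 0.1)]},
     1: {'stay': [(1, 1.0)], 'go': [(0, 0.9), (1, 0.1)]}}
R = {0: {'stay': 0.0, 'go': -1.0},
     1: {'stay': 1.0, 'go': -1.0}}

def value_iteration(T, R, gamma=GAMMA, eps=1e-6):
    V = {s: 0.0 for s in T}
    while True:
        # One Bellman backup: Q(s, a) = R(s, a) + gamma * E[V(s')]
        Q = {s: {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                 for a in T[s]} for s in T}
        V_new = {s: max(Q[s].values()) for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < eps:
            # Greedy policy: pi(s) = argmax_a Q(s, a)
            return V_new, {s: max(Q[s], key=Q[s].get) for s in T}
        V = V_new

V, pi = value_iteration(T, R)
print(pi)  # state 1: 'stay' collects the reward; state 0: 'go' moves toward it
```

Acting is then just a table lookup, `pi[s]`, which is exactly what breaks once s is no longer known with certainty.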
POMDP • A POMDP model contains: • A set of states S • A set of actions A • A state transition function T • A reward function R(s, a) • A finite set of observations Ω • An observation function O: S × A → Π(Ω) • O(s', a, o) is the probability of observing o after taking action a and landing in state s'
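A minimal sketch of these model components as plain Python containers (the class layout and the two-state example model are illustrative, not from the paper):

```python
# Sketch: a POMDP as plain Python containers.
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # S
    actions: list       # A
    observations: list  # Omega
    T: dict             # T[(s, a, s2)] = P(s2 | s, a)
    R: dict             # R[(s, a)] = expected immediate reward
    O: dict             # O[(s2, a, o)] = P(o | action a taken, landing in s2)

    def check(self):
        # Each O(s', a, .) must be a probability distribution over observations.
        for s2 in self.states:
            for a in self.actions:
                total = sum(self.O.get((s2, a, o), 0.0) for o in self.observations)
                assert abs(total - 1.0) < 1e-9, (s2, a)

# Made-up two-state model with one action and a noisy observation channel.
m = POMDP(states=['s0', 's1'], actions=['a'], observations=['o0', 'o1'],
          T={('s0', 'a', 's1'): 1.0, ('s1', 'a', 's1'): 1.0},
          R={('s0', 'a'): 0.0, ('s1', 'a'): 1.0},
          O={('s0', 'a', 'o0'): 0.8, ('s0', 'a', 'o1'): 0.2,
             ('s1', 'a', 'o0'): 0.1, ('s1', 'a', 'o1'): 0.9})
m.check()
```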
POMDP Problem • 1. Belief state • First approach: choose the most probable state of the world, given past experience • Informational properties are captured only implicitly through observations • Second approach: maintain a probability distribution over states of the world
An example • Actions: EAST and WEST • Each succeeds with probability 0.9; when it fails, the movement is in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location • Initially: [0.33, 0.33, 0, 0.33] • After taking one EAST movement: [0.1, 0.45, 0, 0.45] • After taking another EAST movement: [0.1, 0.164, 0, 0.736]
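The numbers above can be reproduced with a standard belief update, assuming the usual reading of this hallway example (four states in a row, the third is the goal, and after each move the agent observes only that it is not at the goal, so the goal entry is zeroed and the belief renormalized):

```python
# Belief update for the 4-state hallway under action EAST.
# State order: [west end, middle-west, goal, east end].

# T_east[s][s2] = P(s2 | s, EAST): succeed east 0.9, slip west 0.1, stay if blocked.
T_east = [
    [0.1, 0.9, 0.0, 0.0],  # west end: failed westward move is blocked -> stay
    [0.1, 0.0, 0.9, 0.0],  # middle-west
    [0.0, 0.0, 1.0, 0.0],  # goal (never occupied in this belief anyway)
    [0.0, 0.0, 0.1, 0.9],  # east end: successful eastward move is blocked -> stay
]

def update_east(b):
    # Transition step: b'(s') = sum_s T(s, EAST, s') * b(s)
    b2 = [sum(T_east[s][s2] * b[s] for s in range(4)) for s2 in range(4)]
    # Observation step: "not at goal" zeroes the goal state, then renormalize.
    b2[2] = 0.0
    z = sum(b2)
    return [x / z for x in b2]

b = [1/3, 1/3, 0.0, 1/3]
b = update_east(b)
print([round(x, 3) for x in b])  # [0.1, 0.45, 0.0, 0.45]
b = update_east(b)
print([round(x, 3) for x in b])  # [0.1, 0.164, 0.0, 0.736]
```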
POMDP Problem • 2. Finding an optimal policy: • maps belief states to actions
Policy Tree • A tree of depth t that specifies a complete t-step policy. • Nodes: actions, the top node determines the first action to be taken. • Edges: the resulting observation
Policy Tree • Value Evaluation: • V_p(s) is the expected t-step value of starting in state s and executing policy tree p: V_p(s) = R(s, a(p)) + γ Σ_{s'} T(s, a(p), s') Σ_{o} O(s', a(p), o) V_{o(p)}(s'), where a(p) is the action at the root of p and o(p) is the subtree followed after observing o
Policy Tree • Value Evaluation: • Expected value under policy tree p: V_p(b) = Σ_s b(s) V_p(s) = b · α_p • where α_p = ⟨V_p(s_1), …, V_p(s_n)⟩ • Expected value when executing the best policy tree from initial belief state b: V_t(b) = max_{p ∈ P} b · α_p
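The evaluation above can be sketched recursively: each policy tree yields an α-vector of per-state values, and V_t(b) is the best dot product over the available trees. The toy model below (transition, observation, and reward numbers, and the tree encoding) is invented for illustration:

```python
# alpha(p)[s] implements V_p(s) = R(s,a) + gamma * sum_{s'} T(s,a,s')
#                                 * sum_o O(s',a,o) * V_{o(p)}(s').
# A tree is (action, {observation: subtree}); leaves have subtrees = None.
GAMMA = 0.9
S = [0, 1]; A = ['a0', 'a1']; OBS = ['o0', 'o1']

# Made-up model: a1 pays off only in state 1, a0 is neutral.
T = {a: [[0.5, 0.5], [0.5, 0.5]] for a in A}   # T[a][s][s2]
O = {a: [[0.7, 0.3], [0.4, 0.6]] for a in A}   # O[a][s2][o]
R = {'a0': [0.0, 0.0], 'a1': [-1.0, 2.0]}      # R[a][s]

def alpha(tree):
    action, subtrees = tree
    vec = []
    for s in S:
        v = R[action][s]
        for s2 in S:
            if subtrees:  # depth > 1: add discounted future value
                future = sum(O[action][s2][o_i] * alpha(subtrees[o])[s2]
                             for o_i, o in enumerate(OBS))
                v += GAMMA * T[action][s][s2] * future
        vec.append(v)
    return vec

def value(b, trees):
    # V_t(b) = max_p b . alpha_p
    return max(sum(bs * av for bs, av in zip(b, alpha(p))) for p in trees)

leaf0, leaf1 = ('a0', None), ('a1', None)
two_step = ('a0', {'o0': leaf1, 'o1': leaf0})
print(alpha(leaf1))  # [-1.0, 2.0]: a 1-step tree's alpha-vector is just R(., a)
```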
Policy Tree • Value Evaluation: • V_t with only two states: the upper surface of the lines b · α_p over the one-dimensional belief simplex; piecewise-linear and convex
Policy Tree • Value Evaluation: • V_t with three states: the upper surface of the planes b · α_p over the two-dimensional belief simplex
Infinite Horizon • Three algorithms to compute V: • Naive approach • Improvement: keep only useful policy trees • Witness algorithm
Infinite Horizon • Naive approach: • iterate until the remaining error is below ε, where ε is a small number • A t-step policy tree contains Σ_{i=0}^{t−1} |Ω|^i = (|Ω|^t − 1)/(|Ω| − 1) nodes • Each node can be labeled with |A| possible actions • Total number of policy trees: |A|^((|Ω|^t − 1)/(|Ω| − 1))
Infinite Horizon • Improved by choosing useful policy trees: • V_{t−1}, the set of useful (t−1)-step policy trees, can be used to construct a superset V_t^+ of the useful t-step policy trees • And there are |A| · |V_{t−1}|^|Ω| elements in V_t^+
Infinite Horizon • Improved by choosing useful policy trees: • Prune V_t^+ down to V_t by keeping only trees whose α-vectors are maximal at some belief state
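The construction of the superset V_t^+ can be sketched directly: pick a root action and assign some useful (t−1)-step tree to each observation edge, giving |A| · |V_{t−1}|^|Ω| candidates (placeholder strings stand in for real subtrees here):

```python
# Enumerate V_t^+ from the useful (t-1)-step trees.
from itertools import product

def step_superset(actions, observations, V_prev):
    V_plus = []
    for a in actions:
        # One candidate per assignment of a (t-1)-step tree to each observation.
        for subtrees in product(V_prev, repeat=len(observations)):
            V_plus.append((a, dict(zip(observations, subtrees))))
    return V_plus

V_prev = ['p1', 'p2', 'p3']  # stand-ins for useful (t-1)-step trees
V_plus = step_superset(['a0', 'a1'], ['o0', 'o1'], V_prev)
print(len(V_plus))  # |A| * |V_prev|**|Omega| = 2 * 3**2 = 18
```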
Infinite Horizon • Witness algorithm:
Infinite Horizon • Witness algorithm: • Q_t^a is the set of t-step policy trees that have action a at their root • Q_t^a(b) = max_{p ∈ Q_t^a} b · α_p is its value function • And V_t(b) = max_a Q_t^a(b)
Infinite Horizon • Witness algorithm: • Finding witnesses: • At each iteration we ask: is there some belief state b for which the true value Q_t^a(b), computed by one-step lookahead using V_{t−1}, differs from the estimated value Q̂_t^a(b) = max_{p ∈ U} b · α_p, computed using the set U of policy trees found so far? • Such a b is a witness that U is not yet complete
Infinite Horizon • Witness algorithm: • Finding witnesses: • Now we can state the witness theorem [25]: the true Q-function Q_t^a differs from the approximate Q-function Q̂_t^a if and only if there is some p ∈ U, some o ∈ Ω, and some p' ∈ V_{t−1} for which there is some belief state b such that V_{p_new}(b) > V_{p̃}(b) for all p̃ ∈ U, where p_new is p with its o-subtree replaced by p'
Infinite Horizon • Witness algorithm: • Finding witness:
Infinite Horizon • Witness algorithm: • Finding witnesses: • The linear program used to find witness points: maximize δ over the variables δ and b(s), subject to b · α_{p_new} ≥ b · α_{p̃} + δ for all p̃ ∈ U, Σ_s b(s) = 1, and b(s) ≥ 0 • A solution with δ > 0 is a witness point
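As an illustration of what that linear program computes, the sketch below searches for a belief b at which a new α-vector beats everything in U by some gap δ > 0. For a two-state problem the belief simplex is one-dimensional, so a simple grid search stands in for the LP (the grid resolution and all α-vector values are made up):

```python
# Grid-search stand-in for the witness LP on a two-state belief simplex.
def find_witness(alpha_new, U_alphas, grid=1001):
    best_b, best_gap = None, 0.0
    for i in range(grid):
        b = (i / (grid - 1), 1 - i / (grid - 1))
        # gap = advantage of the new vector over the best vector in U at b
        gap = min(sum(bs * (an - au) for bs, an, au in zip(b, alpha_new, a_u))
                  for a_u in U_alphas)
        if gap > best_gap:
            best_b, best_gap = b, gap
    return best_b  # None if the new vector is nowhere strictly better

U = [(10.0, 0.0), (0.0, 10.0)]      # alpha-vectors already known to be useful
print(find_witness((6.0, 6.0), U))  # witness near b = (0.5, 0.5)
print(find_witness((4.0, 4.0), U))  # dominated everywhere -> None
```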
Infinite Horizon • Witness algorithm: • Complete value iteration: • Start with an agenda containing any single policy tree • and a set U containing the useful policy trees found so far • Use the linear program on p_new to determine whether it is an improvement over the policy trees in U • 1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates. • 2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
Infinite Horizon • Witness algorithm: • Complexity: • Since no more than |V_t| witness points are discovered (each adds a tree to the set of useful policy trees), • only |V_t| · |Ω| · (|V_{t−1}| − 1) trees can ever be added to the agenda (in addition to the one tree in the initial agenda). • Each of these linear programs either removes a policy tree from the agenda (this happens at most 1 + |V_t| · |Ω| · (|V_{t−1}| − 1) times) or discovers a witness point (this happens at most |V_t| times).
Tiger Problem • Two doors: • Behind one door is a tiger • Behind the other door is a large reward • Two states: • sl when the tiger is behind the left door, sr when it is behind the right • Three actions: • left, right, and listen • Rewards: • the reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is −100, and the cost of listening is −1 • Observations: • hearing the tiger on the left (Tl) or on the right (Tr) • in state sl, the listen action yields observation Tl with probability 0.85 and Tr with probability 0.15; conversely for state sr
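The effect of the listen action on the belief can be sketched as a one-line Bayes update, tracking only the probability that the tiger is behind the left door (the 0.85/0.15 observation model is the one given above; listening does not move the tiger, so only the observation function matters):

```python
# Bayes update of b(sl) after a listen action in the tiger problem.
def listen_update(b_left, obs):
    # obs is 'Tl' or 'Tr'; P(obs | state) from the 0.85/0.15 model.
    p_obs_sl = 0.85 if obs == 'Tl' else 0.15
    p_obs_sr = 0.15 if obs == 'Tl' else 0.85
    num = p_obs_sl * b_left
    return num / (num + p_obs_sr * (1 - b_left))

b = 0.5
b = listen_update(b, 'Tl')
print(round(b, 2))  # 0.85 after one Tl
b = listen_update(b, 'Tl')
print(round(b, 3))  # 0.85^2 / (0.85^2 + 0.15^2) = 0.97 after two consistent Tl's
```

Two consistent observations already push the belief close to certainty, which is why the optimal agent listens only a few times before opening a door.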
Tiger Problem • Decreasing listening reliability from 0.85 down to 0.65: