  1. ECE-517: Reinforcement Learning in Artificial Intelligence. Lecture 15: Partially Observable Markov Decision Processes (POMDPs). October 27, 2010. Dr. Itamar Arel, College of Engineering, Electrical Engineering and Computer Science Department, The University of Tennessee, Fall 2010

  2. Outline • Why use POMDPs? • Formal definition • Belief state • Value function

  3. Partially Observable Markov Decision Problems (POMDPs) • To introduce POMDPs, let us consider an example where an agent learns to drive a car in New York City • The agent can look forward, backward, left, or right • It cannot change speed, but it can steer into the lane it is looking at • The observations available to the agent are: • the direction in which the agent's gaze is directed • the closest object in the agent's gaze • whether the object is looming or receding • the color of the object • whether a horn is sounding • To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind

  4. POMDP Example • The agent is in control of the middle car • The car behind is fast and will not slow down • The car ahead is slower • To avoid a crash, the agent must steer right • However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash • The agent basically needs to learn how the observations might aid its performance

  5. POMDP Example (cont.) • This is not easy when the agent has no explicit goals beyond “performing well” • There are no explicit training patterns such as “if there is a car ahead and left, steer right” • However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs) • The agent is penalized for colliding with other cars or the road shoulder • The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward

  6. POMDP Example (cont.) • Two significant problems make it difficult to learn under these conditions • Temporal credit assignment – • If our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? • Generally the same as in MDPs • Partial observability – • If the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right • However, when it looks to the right it has no sensory information regarding what goes on elsewhere • To solve the latter, the agent needs memory – it builds up knowledge of the state of the world around it

  7. Forms of Partial Observability • Partial observability coarsely pertains to either • Lack of important state information in observations – must be compensated for using memory • Extraneous information in observations – the agent needs to learn to ignore it • In our example: • The color of the car in its gaze is extraneous (unless red cars really do drive faster) • The agent needs to build a memory-based model of the world in order to accurately predict what will happen • This creates “belief state” information (we’ll see this later) • If the agent has access to the complete state, such as a chess-playing machine that can view the entire board: • It can choose optimal actions without memory • The Markov property holds – i.e. the future state of the world is simply a function of the current state and action

  8. Modeling the world as a POMDP • Our setting is that of an agent taking actions in a world according to its policy • The agent still receives feedback about its performance through a scalar reward received at each time step • Formally stated, a POMDP consists of … • |S| states S = {1,2,…,|S|} of the world • |U| actions (or controls) U = {1,2,…,|U|} available to the policy • |Y| observations Y = {1,2,…,|Y|} • a (possibly stochastic) reward r(i) for each state i in S
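For concreteness, here is a minimal sketch (not from the lecture) of how such a finite model might be stored in code; the class name and array layout are assumptions chosen for illustration.

```python
import numpy as np

class POMDPModel:
    """Illustrative container for a finite POMDP (assumed layout, not the lecture's notation).

    T[a, s, s2] = Pr(next state s2 | current state s, action a)
    O[a, s2, o] = Pr(observation o | action a taken, resulting state s2)
    R[s]        = (expected) reward for being in state s
    """
    def __init__(self, num_states, num_actions, num_obs):
        self.num_states = num_states
        self.num_actions = num_actions
        self.num_obs = num_obs
        self.T = np.zeros((num_actions, num_states, num_states))
        self.O = np.zeros((num_actions, num_states, num_obs))
        self.R = np.zeros(num_states)
```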

  9. Modeling the world as a POMDP (cont.)

  10. MDPs vs. POMDPs • In an MDP: one observation for each state • The concepts of observation and state are interchangeable • A memoryless policy that does not make use of internal state suffices • In POMDPs, different states may have similar probability distributions over observations • Different states may look the same to the agent • For this reason, POMDPs are said to have hidden state • Two hallways may look the same to a robot’s sensors • Optimal action for the first → take left • Optimal action for the second → take right • A memoryless policy cannot distinguish between the two

  11. MDPs vs. POMDPs (cont.) • Noise can create ambiguity in state inference • The agent’s sensors are always limited in the amount of information they can pick up • One way of overcoming this is to add sensors • Specific sensors that help it to “disambiguate” hallways • Only when this is possible, affordable, or desirable • In general, we’re now considering agents that need to be proactive (also called “anticipatory”) • They do not only react to environmental stimuli • They self-create context using memory • POMDP problems are harder to solve, but represent realistic scenarios

  12. POMDP solution techniques – model-based methods • If an exact model of the environment is available, POMDPs can (in theory) be solved • i.e. an optimal policy can be found • As with model-based MDPs, it’s not so much a learning problem • No real “learning”, or trial and error, is taking place • No exploration/exploitation dilemma • Rather, it is a probabilistic planning problem → find the optimal policy • In POMDPs this is broken into two elements • Belief state computation, and • Value function computation based on belief states

  13. The belief state • Instead of maintaining the complete action/observation history, we maintain a belief state b • The belief state is a probability distribution over the states, updated as each action is taken and each observation is received • Dim(b) = |S| – 1 (the probabilities must sum to 1) • The belief space is the entire space of probability distributions over states • We’ll use a two-state POMDP as a running example • Probability of being in state one = p; probability of being in state two = 1 – p • Therefore, the entire space of belief states can be represented as a line segment

  14. The belief space • Here is a representation of the belief space when we have two states (s0,s1)

  15. The belief space (cont.) • The belief space is continuous, but we only visit a countable number of belief points • Assumptions: • Finite action set • Finite observation set • Next belief state: b’ = f(b, a, o), where b is the current belief state, a the action, and o the observation

  16. The Tiger Problem • The agent is standing in front of two closed doors • The world is in one of two states: the tiger is behind the left door or the right door • Three actions: open the left door, open the right door, listen • Listening is not free, and it is not accurate (it may give incorrect information) • Reward: open the wrong door and get eaten by the tiger (large negative reward); open the correct door and get a prize (small positive reward)

  17. Tiger Problem: POMDP Formulation • Two states: SL and SR (the tiger really is behind the left or the right door) • Three actions: LEFT, RIGHT, LISTEN • Transition probabilities (from current state to next state): • Listening does not change the tiger’s position, so the current state carries over to the next state • Opening either door is a “Reset” that ends the episode

  18. Tiger Problem: POMDP Formulation (cont.) • Observations: TL (tiger left) or TR (tiger right) • Observation probabilities: after LISTEN, the observation usually, but not always, matches the true state (the current/next state probability table from the slide is omitted here) • Rewards: • R(SL, Listen) = R(SR, Listen) = -1 • R(SL, Left) = R(SR, Right) = -100 • R(SL, Right) = R(SR, Left) = +10
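As a concrete, hedged sketch (not from the slides), the Tiger model can be written down as arrays; the 0.85 listening accuracy and the index conventions below are assumptions, since the slides only state that listening is inaccurate.

```python
import numpy as np

# States: 0 = SL (tiger left), 1 = SR (tiger right)
# Actions: 0 = LEFT (open left), 1 = RIGHT (open right), 2 = LISTEN
# Observations: 0 = TL (hear tiger left), 1 = TR (hear tiger right)

T = np.zeros((3, 2, 2))          # T[a, s, s2] = Pr(s2 | s, a)
T[2] = np.eye(2)                 # LISTEN: the tiger does not move
T[0] = T[1] = 0.5                # opening a door resets: either state equally likely

O = np.full((3, 2, 2), 0.5)      # O[a, s2, o] = Pr(o | a, s2); uninformative after opening
O[2] = np.array([[0.85, 0.15],   # LISTEN in SL: hear TL with prob 0.85 (assumed value)
                 [0.15, 0.85]])  # LISTEN in SR: hear TR with prob 0.85 (assumed value)

R = np.array([[-100.0,  10.0, -1.0],    # tiger left:  open left = eaten, open right = prize
              [  10.0, -100.0, -1.0]])  # tiger right: open left = prize, open right = eaten
```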

  19. POMDP Policy Tree (Fake Policy) • [Figure: an example (fake) policy tree. The root is the starting belief state (tiger-left probability 0.3) with the action Listen; the observations “tiger roar left” and “tiger roar right” each lead to a new belief state, from which the next action is chosen (e.g. Listen or Open Left door), and the tree branches again in the same way.]

  20. POMDP Policy Tree (cont’d) • [Figure: a generic policy tree. The root specifies an action A1; each possible observation (o1, o2, o3, …) leads to a child node with its own action (A2, A3, A4, …), which in turn branches on the next observation, and so on.]

  21. How many POMDP policies are possible? • [Figure: the generic policy tree, annotated with the number of nodes per level: 1 at the root, |O| at depth one, |O|^2 at depth two, and so on.] • How many policy trees are there, given |A| actions, |O| observations, and horizon T? • Number of nodes in a tree: N = Σ_{i=0}^{T-1} |O|^i = (|O|^T – 1) / (|O| – 1) • Number of trees: |A|^N
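To see how quickly this grows, here is a small illustrative computation (assumed function name, not from the slides) that evaluates these formulas for the Tiger problem's sizes:

```python
def num_policy_trees(num_actions, num_obs, horizon):
    """Count the distinct horizon-T policy trees for |A| actions and |O| observations."""
    # Nodes per tree: N = sum_{i=0}^{T-1} |O|^i = (|O|^T - 1) / (|O| - 1)
    num_nodes = (num_obs ** horizon - 1) // (num_obs - 1)
    # Each node independently selects one of |A| actions, giving |A|^N trees
    return num_actions ** num_nodes

# Tiger problem (3 actions, 2 observations), horizon 4:
# N = 1 + 2 + 4 + 8 = 15 nodes, so 3^15 = 14,348,907 possible trees
print(num_policy_trees(3, 2, 4))
```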

  22. Computing Belief States • b’(s’) = Pr(s’ | o, a, b) = Pr(s’, o, a, b) / Pr(o, a, b) = [Pr(o | s’, a, b) Pr(s’ | a, b) Pr(a, b)] / [Pr(o | a, b) Pr(a, b)] = Pr(o | s’, a) Pr(s’ | a, b) / Pr(o | a, b) • We will not repeat Pr(o | a, b) on the next slide, but assume it is there! • It is treated as a normalizing factor, so that b’ sums to 1

  23. Computing Belief States: Numerator • Pr(o | s’, a) Pr(s’ | a, b) = O(s’, a, o) Pr(s’ | a, b) = O(s’, a, o) Σ_s Pr(s’ | a, b, s) Pr(s | a, b) = O(s’, a, o) Σ_s Pr(s’ | a, b, s) b(s) [since Pr(s | a, b) = Pr(s | b) = b(s)] = O(s’, a, o) Σ_s T(s, a, s’) b(s) • (Please work out some of the details at home!)

  24. Belief State • Overall formula: b’(s’) = O(s’, a, o) Σ_s T(s, a, s’) b(s) / Pr(o | a, b) • The belief state is updated proportionally to: • the probability of seeing the current observation given state s’, • and the probability of arriving at state s’ given the action and our previous belief state b • All of the above are given by the model
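A minimal sketch of this update in code (illustrative; the array conventions match the earlier sketches and are assumptions, not the lecture's notation):

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Belief update b' = f(b, a, o) for a finite POMDP.

    b: current belief over states, shape (|S|,)
    T: transition model, T[a, s, s2] = Pr(s2 | s, a)
    O: observation model, O[a, s2, o] = Pr(o | a, s2)
    """
    # Numerator: O(s', a, o) * sum_s T(s, a, s') b(s), computed for every s'
    b_next = O[a, :, o] * (b @ T[a])
    # Divide by Pr(o | a, b) so that the new belief sums to 1
    return b_next / b_next.sum()

# Usage with the Tiger arrays sketched earlier: start uninformed, LISTEN (a=2),
# and hear the tiger on the left (o=0); the belief shifts toward SL:
# update_belief(np.array([0.5, 0.5]), 2, 0, T, O)  # -> array([0.85, 0.15])
```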

  25. Belief State (cont.) • Let’s look at an example: • Consider a robot that is initially completely uncertain about its location • Seeing a door may, as specified by the model, occur in three different locations • Suppose that the robot takes an action and observes a T-junction • It may be that, given the action, only one of the three states could have led to an observation of a T-junction • The agent now knows with certainty which state it is in • The uncertainty does not disappear like this in all cases

  26. Finding an optimal policy • The policy component of a POMDP agent must map the current belief state into an action • It turns out that the belief state is a sufficient statistic (i.e. Markovian) • We cannot do better even if we remember the entire history of observations and actions • We have now transformed the POMDP into an MDP • Good news: we have ways of solving those (GPI algorithms) • Bad news: the belief state space is continuous!

  27. Value function • The belief state is the input to the second component of the method: the value function computation • The belief state is a point in a continuous space of |S| – 1 dimensions! • The value function must be defined over this infinite space • Direct application of dynamic programming techniques → infeasible

  28. Value function (cont.) • Let’s assume only two states: S1 and S2 • Belief state [0.25 0.75] indicates b(s1) = 0.25, b(s2) = 0.75 • With two states, b(s1) is sufficient to indicate the belief state: b(s2) = 1 – b(s1) • [Figure: V(b) plotted over the belief line segment from S1 [1, 0] to S2 [0, 1], with [0.5, 0.5] at the midpoint]

  29. Piecewise Linear and Convex (PWLC) • It turns out that the value function is, or can be accurately approximated by, a piecewise linear and convex function • Intuition on convexity: being certain of the state yields high value, whereas uncertainty lowers the value • [Figure: a convex, piecewise linear V(b) plotted over the two-state belief segment from S1 [1, 0] to S2 [0, 1], with [0.5, 0.5] at the midpoint]

  30. Why does PWLC help? • We can directly work with regions (intervals) of belief space! • The vectors are policies, and indicate the right action to take in each region of the space • [Figure: V(b) formed as the upper surface of three linear pieces Vp1, Vp2, Vp3 over the two-state belief segment from S1 [1, 0] to S2 [0, 1]; each piece dominates in its own region (region 1, region 2, region 3)]
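As an illustrative sketch under assumed conventions (the vector values below are made up), a PWLC value function can be stored as a set of vectors, one per region, and evaluated at any belief by taking the maximum dot product; the maximizing vector also identifies the action to take in that region.

```python
import numpy as np

def pwlc_value_and_action(b, vectors, actions):
    """Evaluate a PWLC value function at belief b.

    vectors: array of shape (num_vectors, |S|); each row is one linear piece of V
    actions: actions[i] is the action associated with vectors[i]
    Returns (V(b), action), where V(b) = max_i vectors[i] . b
    """
    values = vectors @ b              # value of each linear piece at belief b
    best = int(np.argmax(values))     # the piece that dominates in b's region
    return values[best], actions[best]

# Hypothetical two-state example with three regions (values are made up):
vectors = np.array([[ 10.0, -20.0],   # worth taking when we are quite sure of S1
                    [  2.0,   2.0],   # a cautious action, best in the middle region
                    [-20.0,  10.0]])  # worth taking when we are quite sure of S2
print(pwlc_value_and_action(np.array([0.5, 0.5]), vectors, ["left", "listen", "right"]))
```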

  31. Summary • POMDPs → better modeling of realistic scenarios • They rely on belief states that are derived from observations and actions • A POMDP can be transformed into an MDP over belief states, with PWLC value function approximation • Next class: (recurrent) neural networks come to the rescue …
