KI2 – MDP / POMDP Kunstmatige Intelligentie / RuG
Decision Processes
• The agent perceives the environment (St) flawlessly
• Chooses an action (a)
• Which alters the state of the world (St+1)
(Figure: a finite state machine with three behaviours (A1: idle around; A2: follow object; A3: keep distance), whose transitions are triggered by the observations 'see BALL', 'see obstacle', and 'no signals'.)
Stochastic Decision Processes
• The agent perceives the environment (St) flawlessly
• Chooses an action (a) according to P(a|S)
• Which alters the state of the world (St+1) according to P(St+1|St,a)
Markov Decision Processes
• The agent perceives the environment (St) flawlessly
• Chooses an action (a) according to P(a|S)
• Which alters the state of the world (St+1) according to P(St+1|St,a)
• If there are no longer-term dependencies: a 1st-order Markov process
Assumptions
• The observation of St is noise-free; all required information is observable
• Actions a are selected according to the probability P(a|S) (random generator)
• The consequences of a (the next state St+1) occur stochastically with probability P(St+1|St,a)
A policy
(Two figures: a grid world with terminal states +1 and −1 and a START cell, each illustrating a policy.)
MDP
• States
• Actions
• Transitions between states
• P(ai|sk), the "policy": which action a one decides on, given the possible circumstances s
Policy π
• "argmax_ai P(ai|sk)"
• How can an agent learn this?
• Cost minimization
• Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911)
• Reinforcements R(a,S)
• Structure of the world T = P(St+1|St,a)
Reinforcements
• Given a history of states, actions, and the resulting reinforcements, an agent can learn to estimate the value of an action.
• How: the sum of the reinforcements R? The average?
• Exponential weighting
• The first step determines all later ones (learning from the past)
• An immediate reward is more useful (reasoning over the future)
• Impatience & mortality
Assigning Utility (Value) to Sequences
• Discounted rewards: V(s0,s1,s2,…) = R(s0) + γR(s1) + γ²R(s2) + …, where 0 < γ ≤ 1
• Here R is the reinforcement value, s refers to a state, and γ is the discount factor
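Purely as an illustration (not from the slides), a minimal sketch of this discounted-value computation; the reward sequence and γ = 0.9 are made-up example values:

```python
def discounted_value(rewards, gamma=0.9):
    """V(s0, s1, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reinforcement, discounted with gamma = 0.9
print(discounted_value([1.0, 0.5, 2.0]))   # 1.0 + 0.9*0.5 + 0.81*2.0 = 3.07
```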
Assigning Utility to States
• Can we say V(s) = R(s)? NO!
• "The utility of a state is the expected utility of all the states that will follow it, when policy π is used."
• Transition probability T(s,a,s')
Assigning Utility to States
• Can we say V(s) = R(s)?
• Vπ(s) is specific to each policy π
• Vπ(s) = E(Σt γ^t R(st) | π, s0 = s)
• V(s) = Vπ*(s)
• Bellman equation: V(s) = R(s) + γ max_a Σ_s' T(s,a,s') V(s')
• If we solve V(s) for each state, we have also found the optimal policy π* for the given MDP
Value Iteration Algorithm
• We have to solve |S| simultaneous Bellman equations
• We can't solve them directly, so we use an iterative approach:
1. Begin with arbitrary initial values V0
2. For each s, calculate V(s) from R(s) and V0
3. Use these new utility values to update V0
4. Repeat steps 2-3 until V0 converges
• This equilibrium is a unique solution! (see R&N, p. 621, for the proof)
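A minimal sketch of this iteration, assuming the transition model is given as a dictionary T[s][a] of (probability, next state) pairs and the rewards as R[s]; the names and the toy two-state MDP below are made up for illustration:

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Iteratively apply the Bellman update V(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s')."""
    V = {s: 0.0 for s in states}           # step 1: arbitrary initial values V0
    while True:
        V_new = {}
        for s in states:                   # step 2: back up every state
            best = max(sum(p * V[s2] for p, s2 in T[s][a]) for a in actions)
            V_new[s] = R[s] + gamma * best
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new                          # step 3: use the new utility values
        if delta < eps:                    # step 4: repeat until convergence
            return V

# Toy two-state example (made-up numbers): 'stay' keeps the state, 'go' switches it
states, actions = ['s1', 's2'], ['stay', 'go']
T = {'s1': {'stay': [(1.0, 's1')], 'go': [(1.0, 's2')]},
     's2': {'stay': [(1.0, 's2')], 'go': [(1.0, 's1')]}}
R = {'s1': 0.0, 's2': 1.0}
print(value_iteration(states, actions, T, R))   # converges to approximately {'s1': 9.0, 's2': 10.0}
```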
Search space
• T: S × A × S
• Explicit enumeration of all combinations is often not feasible (cf. chess, Go)
• Chunking within T
• Problem: what if S is real-valued?
MDP → POMDP
• MDP: the world may well be stochastic and Markovian, but:
• the observation of that world itself is reliable; no assumptions need to be made about it.
• Most 'real' problems involve:
• noise in the observation itself
• incompleteness of information
MDP → POMDP
• Most 'real' problems involve:
• noise in the observation itself
• incompleteness of information
• In these cases the agent must be able to develop a system of "beliefs" on the basis of series of partial observations.
Partially Observable Markov Decision Processes (POMDPs)
A POMDP has:
• States S
• Actions A
• Probabilistic transitions
• Immediate rewards on actions
• A discount factor
• + Observations Z
• + Observation probabilities (reliabilities)
• + An initial belief b0
The Tiger Problem
• Description:
• 2 states: Tiger_Left, Tiger_Right
• 3 actions: Listen, Open_Left, Open_Right
• 2 observations: Hear_Left, Hear_Right
The Tiger Problem
• Rewards are:
• -1 for the Listen action
• -100 for Open(x) in the Tiger-at-x state
• +10 for Open(x) in the Tiger-not-at-x state
The Tiger Problem
• Furthermore:
• The Listen action does not change the state
• The Open(x) action reveals what is behind door x and resets the trial: the tiger is again placed behind either door with 50% probability
• The Listen action gives the correct information 85% of the time:
p(hear_left | Listen, tiger_left) = 0.85
p(hear_right | Listen, tiger_left) = 0.15
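This specification can be written down as plain data; the sketch below is one possible layout (the names and dictionary structure are my own, not from the slides):

```python
# Tiger problem specification as plain data (layout is illustrative, not canonical)
STATES = ['tiger_left', 'tiger_right']
ACTIONS = ['listen', 'open_left', 'open_right']
OBSERVATIONS = ['hear_left', 'hear_right']

def reward(state, action):
    """-1 for listening, -100 for opening the tiger door, +10 for the other door."""
    if action == 'listen':
        return -1
    opened = 'tiger_left' if action == 'open_left' else 'tiger_right'
    return -100 if state == opened else +10

# p(observation | Listen, state): listening is correct 85% of the time
P_OBS = {'tiger_left':  {'hear_left': 0.85, 'hear_right': 0.15},
         'tiger_right': {'hear_left': 0.15, 'hear_right': 0.85}}
```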
The Tiger Problem
• Question: which policy gives the highest return in rewards?
• Actions depend on beliefs!
• If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5 × (−100) + 0.5 × (+10) = −45
• Beliefs are updated with observations (which may be wrong)
The Tiger Problem, horizon t=1 • Optimal policy:
The Tiger Problem, horizon t=2 • Optimal policy:
The Tiger Problem, horizon t=∞
• Optimal policy: listen a few times, choose a door, go to the next trial
• listen1: Tiger_left (p=0.85), listen2: Tiger_left (p=0.96), listen3: ... (binomial)
• Good news: the optimal policy can be learned if actions are followed by rewards!
The Tiger Problem, belief updates on "Listen"
• Belief update after hearing the tiger on the left, with p = P(hear_left | Listen, tiger_left) = 0.85 and b_t = P(tiger_left) before the observation:
b_{t+1} = (b_t · p) / (b_t · p + (1 − b_t) · (1 − p))
• Example:
initial: Tiger_left (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_left (p=0.9945), listen4: ...
(Note: underlying binomial distribution)
The Tiger Problem, belief updates on "Listen"
• Same update rule: b_{t+1} = (b_t · p) / (b_t · p + (1 − b_t) · (1 − p)), with p = 0.85 for the heard side
• Example 2, noise in the observation:
initial: Tiger_left (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_right (p=0.8500), the belief drops... listen4: Tiger_left (p=0.9698), and recovers, listen5: ...
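A small sketch that reproduces these numbers; the function name and the loop are illustrative, with p_correct = 0.85 as in the slides:

```python
def update_belief(b_left, heard_left, p_correct=0.85):
    """Bayes update of P(tiger_left) after one Listen observation."""
    p = p_correct if heard_left else 1.0 - p_correct   # likelihood of this observation given tiger_left
    numerator = b_left * p
    return numerator / (numerator + (1.0 - b_left) * (1.0 - p))

b = 0.5
for heard_left in [True, True, False, True]:   # the noisy sequence from Example 2
    b = update_belief(b, heard_left)
    print(round(b, 4))                          # 0.85, 0.9698, 0.85, 0.9698
```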
Solving a POMDP
• To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward E[Σt γ^t R(st, at)].
The belief state • Instead of maintaining the complete action/observation history, we maintain a belief state b. • The belief is a probability distribution over the states. Dim(b) = |S|-1
The belief space Here is a representation of the belief space when we have two states (s0,s1)
The belief space Here is a representation of the belief space when we have three states (s0,s1,s2)
The belief space Here is a representation of the belief space when we have four states (s0,s1,s2,s3)
The belief space • The belief space is continuous but we only visit a countable number of belief points.
Value Function in POMDPs
• We will compute the value function over the belief space.
• Hard: the belief space is continuous!
• But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
• We can represent any finite-horizon solution by a finite set of alpha-vectors:
V(b) = max_α [Σ_s α(s) b(s)]
Alpha-Vectors
• They are a set of hyperplanes which define the value function: at each belief point the value function is equal to the hyperplane with the highest value.
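A sketch of evaluating V(b) from a set of alpha-vectors according to the formula above; the alpha-vectors and the belief point are arbitrary example values:

```python
def value(belief, alpha_vectors):
    """V(b) = max over alpha-vectors of the dot product alpha . b."""
    return max(sum(a * b for a, b in zip(alpha, belief)) for alpha in alpha_vectors)

# Two states, three example alpha-vectors (one hyperplane per conditional plan)
alphas = [[1.0, 0.0], [0.0, 1.5], [0.6, 0.6]]
print(value([0.25, 0.75], alphas))   # max(0.25, 1.125, 0.6) = 1.125
```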
Belief Transform
• Assumptions: finite actions, finite observations
• Next belief state = T(cbf, a, z), where cbf is the current belief state, a the action, and z the observation
• There is a finite number of possible next belief states
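A sketch of such a belief transform for a generic POMDP, assuming a transition model trans[s][a][s'] and an observation model obs[a][s'][z] given as nested dictionaries (these names are assumptions, not from the slides):

```python
def belief_transform(cbf, a, z, trans, obs, states):
    """Next belief: b'(s') is proportional to O(z | s', a) * sum_s T(s, a, s') * b(s)."""
    unnormalized = {s2: obs[a][s2][z] * sum(trans[s][a][s2] * cbf[s] for s in states)
                    for s2 in states}
    total = sum(unnormalized.values())      # equals P(z | cbf, a); used to normalize
    return {s2: v / total for s2, v in unnormalized.items()}
```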
PO-MDP into continuous CO-MDP
• The process is Markovian: the next belief state depends only on the current belief state, the current action, and the observation.
• A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Problem
• Using VI in a continuous state space: there is no nice tabular representation as before.
PWLC
• Restrictions on the form of the solutions of the continuous-space CO-MDP:
• The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
• The value of a belief point is simply the dot product of the two vectors (the belief and an alpha-vector).
• GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in Value Iteration (VI)
• Represent the value function for each horizon as a set of vectors.
• This overcomes the problem of representing a value function over a continuous space.
• Find the vector that has the largest dot product with the belief state.
PO-MDP Value Iteration Example
• Assumptions: two states, two actions, three observations
• Example: horizon length is 1, b = [0.25, 0.75]
• Immediate rewards over (s1, s2): a1: [1, 0], a2: [0, 1.5]
• V(a1,b) = 0.25×1 + 0.75×0 = 0.25
• V(a2,b) = 0.25×0 + 0.75×1.5 = 1.125
• (Figure: value over the belief space, showing the region where a1 is the best and the region where a2 is the best)
PO-MDP Value Iteration Example
• The value of a belief state for horizon length 2, given b, a1, z1:
• the value of the immediate action plus the value of the next action.
• Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
PO-MDP Value Iteration Example • Find the value for all the belief points given this fixed action and observation. • The Transformed value function is also PWLC.
PO-MDP Value Iteration Example
• How do we compute the value of a belief state given only the action?
• The horizon-2 value of the belief state, given the values for each observation (z1: 0.8, z2: 0.7, z3: 1.2) and the observation probabilities P(z1|b,a1)=0.6, P(z2|b,a1)=0.25, P(z3|b,a1)=0.15, is:
0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
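In code this is just the observation-probability-weighted sum, using the numbers from the slide:

```python
# Horizon-2 value of belief b for action a1: expectation over the observations
p_obs = {'z1': 0.6, 'z2': 0.25, 'z3': 0.15}    # P(z | b, a1)
v_obs = {'z1': 0.8, 'z2': 0.7, 'z3': 1.2}      # best achievable value after seeing z
print(round(sum(p_obs[z] * v_obs[z] for z in p_obs), 3))   # 0.6*0.8 + 0.25*0.7 + 0.15*1.2 = 0.835
```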
Transformed Value Functions • Each of these transformed functions partitions the belief space differently. • Best next action to perform depends upon the initial belief state and observation.
Best Value for Belief States
• The value of every single belief point is the sum of:
• the immediate reward, and
• the line segments from the S() functions for each observation's future strategy.
• Since adding lines gives you lines, the result is again linear.
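A sketch of one such backup step, building the horizon-(n+1) alpha-vectors from the horizon-n set by enumerating, for every action, every assignment of a future alpha-vector to each observation (no pruning; the data layout is an assumption, not from the slides):

```python
from itertools import product

def backup_alpha_vectors(prev_alphas, states, actions, observations,
                         R, trans, obs, gamma=0.95):
    """One exact value-iteration backup: for each action and each choice of a
    future alpha-vector per observation, add the immediate reward to the
    discounted, observation-weighted value of the chosen future strategies."""
    new_alphas = []
    for a in actions:
        # every way of assigning one previous alpha-vector to each observation
        for choice in product(prev_alphas, repeat=len(observations)):
            alpha = {}
            for s in states:
                future = sum(trans[s][a][s2] * obs[a][s2][z] * choice[i][s2]
                             for i, z in enumerate(observations)
                             for s2 in states)
                alpha[s] = R[s][a] + gamma * future   # a line over the belief space
            new_alphas.append(alpha)
    return new_alphas
```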
Best Strategy for any Belief Point
• All the useful future strategies are easy to pick out: