Advances in Point-Based POMDP Solvers Guy Shani Ronen Brafman Solomon E. Shimony
Overview • Agenda: • Introduce point-based POMDP solvers. • Survey recent advances. • Structure: • Background – MDPs, POMDPs. • Point-based solvers – • Belief set selection. • Value function computation. • Experiments.
Markov Decision Process - MDP • Model agents in a stochastic environment. • State – an encapsulation of all the relevant environment information: • Agent location • Can the agent eat monsters? • Monster locations • Gold coin locations • Actions – affect the environment: • Moving up, down, left, right • Stochastic effects – • Movement can sometimes fail • Monster movements are random • Reward – received for achieving goals: • Collecting coins • Eating a monster
MDP Formal Definition • Markov property – action effects depend only on the current state. • MDP is defined by the tuple <S,A,tr,R>. • S – state space • A – action set • tr – state transition function: tr(s,a,s’)=pr(s’|s,a) • R – reward function: R(s,a)
Policies and Value Functions • Policy – specifies an action for each state. • Optimal policy – maximizes the collected rewards under one of several criteria: • Sum: ∑_t r_t • Average: lim_{T→∞} (1/T) ∑_{t=0}^{T} r_t • Discounted sum: ∑_t γ^t r_t, with 0 ≤ γ < 1 • Value function – assigns a value to each state.
Value Iteration (Bellman 1957) • Dynamic programming method. • Value is updated from reward states backwards. • Update is known as a backup.
Value Iteration (Bellman 1957) Initialize – V_0(s) = 0, n = 0. While V has not converged: n = n + 1; for each s apply the Bellman update: V_n(s) = max_a [R(s,a) + γ ∑_{s'} tr(s,a,s') V_{n-1}(s')] • Known to converge to V* – the optimal value function. • π* – the optimal policy – corresponds to the optimal value function.
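A minimal runnable sketch of this loop, assuming a toy 2-state, 2-action MDP with tr[a,s,s'] = pr(s'|s,a) and R[s,a]; the numbers are purely illustrative:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative only).
# tr[a, s, s2] = pr(s2 | s, a);  R[s, a] = immediate reward.
tr = np.array([[[0.9, 0.1],
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 5.0]])
gamma, eps = 0.95, 1e-6

V = np.zeros(2)                       # V_0(s) = 0
while True:
    # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s2 tr(s,a,s2) V(s2)
    Q = R + gamma * np.tensordot(tr, V, axes=([2], [0])).T
    V_new = Q.max(axis=1)             # V_{n+1}(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new

policy = Q.argmax(axis=1)             # greedy policy w.r.t. the converged V
print(V, policy)
```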
Policy Iteration (Howard 1960) • Intuition – we care about policies, not about value functions. • Changes in the value function may not affect the policy. • An expectation-maximization-style alternation: • Expectation (policy evaluation) – fix the policy and compute its value. • Maximization (policy improvement) – change the policy to maximize the values.
Partial Observability • Real agents cannot directly observe the state. • Sensors – provide partial and noisy information about the world.
Partially Observable MDP - POMDP • The environment is Markovian. • The agent cannot directly view the state. • Sensors give observations over the current state. • Formal POMDP model: • <S, A, tr, R> – an MDP (the environment) • Ω – set of possible observations • O(a,s,o) – observation probability given action and state – pr(o|a,s).
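One possible (purely illustrative) way to hold the tuple <S, A, tr, R, Ω, O> in code; the later sketches pass the arrays directly, using the same conventions as here:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """Illustrative container for the POMDP tuple <S, A, tr, R, Omega, O>."""
    n_states: int        # |S|
    n_actions: int       # |A|
    n_obs: int           # |Omega|
    tr: np.ndarray       # tr[a, s, s2] = pr(s2 | s, a)
    R: np.ndarray        # R[s, a] = immediate reward
    O: np.ndarray        # O[a, s2, o] = pr(o | a, s2)
```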
Value of Information • POMDPs capture the value of information. • Example – we don’t know where the larger reward is – should we go and read the map? • Answer – it depends on: • The difference between the rewards. • The cost of reading the map. • The accuracy of the map. • POMDPs take all such considerations into account and provide an optimal policy.
Belief States • The agent does not directly observe the environment state. • Due to noisy and insufficient information, the agent maintains a belief over the current world state. • b(s) is the probability of being at state s. • τ(b,a,o) – a deterministic function computing the next belief state given action a and observation o: τ(b,a,o)(s') ∝ O(a,s',o) ∑_s tr(s,a,s') b(s). • The agent knows its initial belief state – b_0
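A minimal sketch of τ as a Bayes filter, assuming the tr[a,s,s'] and O[a,s',o] array conventions above; the toy numbers are illustrative:

```python
import numpy as np

def belief_update(b, a, o, tr, O):
    """tau(b, a, o): Bayes filter over states.

    b  : current belief, shape (|S|,)
    tr : tr[a, s, s2] = pr(s2 | s, a)
    O  : O[a, s2, o]  = pr(o | a, s2)
    """
    # Predict: pr(s2 | b, a) = sum_s b(s) tr(s, a, s2)
    predicted = b @ tr[a]
    # Correct: weight by the observation likelihood, then normalize.
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny illustrative example: 2 states, 1 action, 2 observations.
tr = np.array([[[0.7, 0.3], [0.3, 0.7]]])
O  = np.array([[[0.8, 0.2], [0.2, 0.8]]])
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, o=0, tr=tr, O=O))
```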
Value Function (Sondik 1973) • A value function V assigns a value to a belief state b. • V* – the optimal value function. • V*(b) – the expected reward if the agent behaves optimally starting from belief state b. • V is traditionally represented as a set of α-vectors. • V(b) = max_α α·b (upper envelope). • α·b = ∑_s α(s) b(s) • [Figure: two α-vectors α_0, α_1 over the belief simplex between s_0 and s_1; at b = <0.4, 0.6> the maximizing vector determines V(b).]
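The upper-envelope computation in a few lines; the α-vector values are made up for illustration:

```python
import numpy as np

# A value function represented as a set of alpha-vectors (illustrative numbers).
alphas = np.array([[10.0, 0.0],    # alpha_0: best when we are likely in s0
                   [0.0,  8.0]])   # alpha_1: best when we are likely in s1

def value(b, alphas):
    """V(b) = max_alpha alpha . b  (upper envelope of the alpha-vectors)."""
    return np.max(alphas @ b)

b = np.array([0.4, 0.6])
print(value(b, alphas))            # max(4.0, 4.8) = 4.8, achieved by alpha_1
```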
Exact Value Iteration • Creates a new set of α-vectors. • Exponential explosion of vectors. • Dominated vectors can be pruned. (Littman et al. 1997) • Pruning process is time consuming.
Point-Based Backups (Pineau et al. 2001) • Bellman update (backup): V_{n+1}(b) = max_a [r_a·b + γ ∑_o pr(o|b,a) V_n(τ(b,a,o))] • Can be written using vector notation: • backup(b) = argmax_{g_{b,a} : a ∈ A} g_{b,a}·b • g_{b,a} = r_a + γ ∑_o argmax_{g_{α,a,o} : α ∈ V} g_{α,a,o}·b • g_{α,a,o}(s) = ∑_{s'} O(a,s',o) tr(s,a,s') α(s') • Computes a new α-vector optimal for a specific input belief point b. • Known as a point-based backup.
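A sketch of these equations, assuming the same tr[a,s,s'], O[a,s',o], R[s,a] array conventions as above and a non-empty set V of α-vectors:

```python
import numpy as np

def point_based_backup(b, V, tr, O, R, gamma):
    """One point-based backup at belief b (a sketch of the formulas above).

    V  : array of alpha-vectors, shape (k, |S|), k >= 1
    tr : tr[a, s, s2], O : O[a, s2, o], R : R[s, a]
    Returns the new alpha-vector that is optimal for b.
    """
    n_actions, n_states, _ = tr.shape
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        # g_{alpha,a,o}(s) = sum_s2 O(a, s2, o) tr(s, a, s2) alpha(s2)
        # shape: (k, n_obs, n_states)
        g = np.einsum('no,sn,kn->kos', O[a], tr[a], V)
        # For each o, keep the g-vector maximizing g . b
        best_per_o = g[np.argmax(g @ b, axis=0), np.arange(n_obs)]
        g_ba = R[:, a] + gamma * best_per_o.sum(axis=0)
        if g_ba @ b > best_val:
            best_vec, best_val = g_ba, g_ba @ b
    return best_vec
```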
Point-based Solvers • Compute a value function V over a subset B of the belief space. • Usually only reachable belief points are used. • Use α-vectors to represent V. • Assumption: an optimal value function over B will generalize well to other, unobserved belief points. • Advantage – each vector must maximize some b in B. Dominated vectors are pruned implicitly.
Variations of Point-Based Solvers • A number of algorithms were suggested: • PBVI (Pineau et al. 2001) • Perseus (Spaan and Vlassis 2003) • HSVI (Smith and Simmons 2005) • PVI (Shani et al. 2006) • FSVI (Shani et al. 2007) • SCVI (Virin et al. 2007) • Differences between algorithms: • Selection of B – fixed/expanding set, traversal/distance. • Computation of V – which points are updated, what is the order of backups.
Belief Set Selection • Option 1 – expanding belief set • PBVI [Pineau et al. 2001] • B_0 = {b_0} – the initial belief state • B_{n+1} – for each b, add an immediate successor b' = τ(b,a,o) s.t. dist(B_n, b') is maximal (see the sketch after the figure below). • Assumption – in the limit, B will include all reachable belief states and therefore V will converge to an optimal value function.
[Figure: PBVI belief set expansion – initial belief b_0, current set B, candidate successors, goal.]
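A sketch of one expansion step under the same array conventions, using L1 distance and one sampled successor per action; this is a simplification for illustration, not the authors' exact procedure (rng is a numpy Generator):

```python
import numpy as np

def expand_belief_set(B, tr, O, rng):
    """One PBVI-style expansion step (illustrative sketch).

    For every b in B, sample one successor per action and keep the successor
    farthest (L1 distance) from the current set B.
    """
    new_points = []
    for b in B:
        candidates = []
        for a in range(tr.shape[0]):
            # Sample a successor belief: draw s ~ b, s2 ~ tr, o ~ O, then update.
            s = rng.choice(len(b), p=b)
            s2 = rng.choice(len(b), p=tr[a, s])
            o = rng.choice(O.shape[2], p=O[a, s2])
            succ = O[a, :, o] * (b @ tr[a])        # tau(b, a, o), unnormalized
            candidates.append(succ / succ.sum())
        # Keep the candidate that is farthest from the current set B.
        dist = lambda bp: min(np.abs(bp - bb).sum() for bb in B)
        new_points.append(max(candidates, key=dist))
    return B + new_points
```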
Belief Set Selection • Option 2 – Random walk • Perseus [Spaan & Vlassis 2004] • Run a number of trials beginning at b_0; n is the trial length: • for i = 0 to n • a_i = random action • o_i = random observation • b_{i+1} = τ(b_i, a_i, o_i) • B is the set of all observed belief states (see the sketch after the figure below). • Assumption – a sufficiently long exploration would visit all "important" belief points. • Disadvantage – may add many "irrelevant" belief points.
[Figure: belief points collected by a random walk from b_0; labels: Goal, B.]
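A sketch of the random-walk collection under the same conventions (rng is a numpy Generator; all details illustrative):

```python
import numpy as np

def collect_beliefs_random_walk(b0, tr, O, n_steps, rng):
    """Perseus-style belief collection by a random walk (illustrative sketch)."""
    B, b = [b0], b0
    for _ in range(n_steps):
        a = rng.integers(tr.shape[0])              # a_i = random action
        # Sample an observation according to pr(o | b, a).
        p_o = (b @ tr[a]) @ O[a]                   # shape (|Omega|,)
        o = rng.choice(O.shape[2], p=p_o)          # o_i = random observation
        b = O[a, :, o] * (b @ tr[a])               # b_{i+1} = tau(b_i, a_i, o_i)
        b = b / b.sum()
        B.append(b)
    return B
```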
Belief Set Selection • Option 3 – Heuristic Exploration • Run a number of trials beginning at b_0: • while the stopping criterion has not been reached • a_i = choose action • o_i = choose observation • b_{i+1} = τ(b_i, a_i, o_i) • i++ • HSVI [Smith & Simmons 2005] – • Maintains a lower bound and an upper bound over V*. • Chooses the best a according to the upper bound. • Chooses o such that b_{i+1} has the largest gap between the bounds.
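A schematic sketch of one HSVI trial. The bound objects and their methods, as well as the tau and pr_obs helpers, are hypothetical placeholders (not a real API), and the stopping rule is simplified relative to the actual algorithm:

```python
def hsvi_trial(b0, lower, upper, tau, pr_obs, n_actions, n_obs, epsilon):
    """One HSVI-style trial (simplified sketch).

    `lower`/`upper` are assumed bound objects exposing value(b), q_value(b, a)
    and update(b); `tau(b, a, o)` and `pr_obs(b, a, o)` are the belief update
    and observation probability (hypothetical helpers).
    """
    trail, b = [], b0
    while upper.value(b) - lower.value(b) > epsilon:   # simplified stopping rule
        trail.append(b)
        # Choose the best action according to the upper bound.
        a = max(range(n_actions), key=lambda act: upper.q_value(b, act))
        # Choose the observation whose successor has the largest weighted bound gap.
        def gap(o):
            b2 = tau(b, a, o)
            return pr_obs(b, a, o) * (upper.value(b2) - lower.value(b2))
        o = max(range(n_obs), key=gap)
        b = tau(b, a, o)
    # Back up the visited beliefs in reversed order (see "Value Function Update").
    for b in reversed(trail):
        lower.update(b)
        upper.update(b)
```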
Forward Search Value Iteration [Shani et al. 2007] • A POMDP agent cannot directly obtain the environment state. • In simulation we may assume that the environment state is available. • Idea – use the simulated environment state to guide exploration in belief space.
Forward Search Value Iteration [Shani, Brafman, Shimony 2007] • a_i* ← best action for s_i • s_{i+1} ← choose from tr(s_i, a_i*, ·) • o_i ← choose from O(a_i*, s_{i+1}, ·) • b_{i+1} ← τ(b_i, a_i*, o_i) • [Figure: parallel traversal of the MDP state space (s_0 … s_3) and the POMDP belief space (b_0 … b_3), linked by the chosen actions a_i and the sampled observations o_i.]
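A sketch of one FSVI trial under the same array conventions. Q_mdp and the backup operator are passed in (e.g., a wrapper around the point_based_backup sketch above); the names and the fixed trial depth are illustrative assumptions:

```python
import numpy as np

def fsvi_trial(s0, b0, Q_mdp, tr, O, backup, V, depth, rng):
    """One FSVI-style trial (sketch): the simulated MDP state guides the
    traversal, then the visited beliefs are backed up in reversed order.

    Q_mdp[s, a]  - optimal Q-function of the underlying MDP.
    backup(b, V) - point-based backup returning a new alpha-vector.
    s0 should be sampled from b0 so the state and belief stay consistent.
    """
    s, b, trail = s0, b0, []
    for _ in range(depth):
        trail.append(b)
        a = int(np.argmax(Q_mdp[s]))                # a_i* <- best MDP action for s_i
        s = rng.choice(tr.shape[1], p=tr[a, s])     # s_{i+1} ~ tr(s_i, a_i*, .)
        o = rng.choice(O.shape[2], p=O[a, s])       # o_i ~ O(a_i*, s_{i+1}, .)
        b = O[a, :, o] * (b @ tr[a])                # b_{i+1} = tau(b_i, a_i*, o_i)
        b = b / b.sum()
    for b in reversed(trail):                       # backups in reversed order
        V.append(backup(b, V))
    return V
```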
Value Function Update • PBVI – an α-vector for each belief point in B; arbitrary order of updates. • Perseus – randomly select the next point to update from the points that were not yet improved. • HSVI & FSVI – after each belief-space traversal, execute backups in reversed order.
PBVI • Many vectors may not participate in the upper envelope. • All points are updated before any point can be updated twice (synchronous update). • It is possible for a successor of a point to be updated after that point, causing slow propagation of values.
HSVI & FSVI • Advantage – backups exploit previous backups on successors.
Perseus • Advantages: • A small number of vectors in each iteration. • All points are improved, although not all are updated. • Disadvantage: • May choose points whose value improves only slightly, while avoiding points that could be improved substantially. • A sketch of one improvement stage appears below.
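A sketch of one Perseus improvement stage, assuming a backup(b, V) operator (e.g., a wrapper around the point_based_backup sketch above) and a numpy Generator rng:

```python
import numpy as np

def perseus_improvement_stage(B, V, backup, rng):
    """One Perseus-style improvement stage (illustrative sketch).

    B      - list of belief points (1-D numpy arrays)
    V      - list of alpha-vectors (1-D numpy arrays)
    backup - backup(b, V) returning a new alpha-vector optimal for b
    """
    old_value = [max(alpha @ b for alpha in V) for b in B]
    todo = list(range(len(B)))                  # indices of not-yet-improved points
    V_new = []
    while todo:
        i = todo[rng.integers(len(todo))]       # random not-yet-improved point
        alpha = backup(B[i], V)
        if alpha @ B[i] < old_value[i]:
            # Backup did not improve b: keep the old maximizing vector instead.
            alpha = max(V, key=lambda v: v @ B[i])
        V_new.append(alpha)
        # Drop every point whose value is already improved by V_new.
        todo = [j for j in todo
                if max(a2 @ B[j] for a2 in V_new) < old_value[j]]
    return V_new
```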
Backup Selection • Perseus generates good value functions. • Can we accelerate the convergence of V? • Idea – choose which points to back up so that the value function improves considerably after each backup.
Prioritizing Backups • Update a point b where the Bellman error e(b) = HV(b) - V(b) is maximal. • Well known in MDPs. • Problem – unlike MDPs, after improving b, updating the error for all other points is difficult: • The list of predecessors of b cannot be computed. • A new α-vector may improve the value of more than a single belief point. • Solution – recompute the error over a sampled subset of B and select the point with maximal error from that set.
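A sketch of one prioritized backup in the spirit of PVI, again assuming a backup(b, V) operator and a numpy Generator rng; the sampling scheme is a simplification for illustration:

```python
import numpy as np

def prioritized_backup(B, V, backup, sample_size, rng):
    """One PVI-style prioritized backup (sketch): estimate the Bellman error
    on a sampled subset of B and back up the point where it is largest."""
    idx = rng.choice(len(B), size=min(sample_size, len(B)), replace=False)

    def bellman_error(b):
        # e(b) = HV(b) - V(b): the improvement obtainable by one backup at b.
        # Note: evaluating HV requires a full backup, which is the expensive part.
        return backup(b, V) @ b - max(alpha @ b for alpha in V)

    b_star = max((B[i] for i in idx), key=bellman_error)
    V.append(backup(b_star, V))
    return V
```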
PVI [Shani, Brafman, Shimony 2006] • Advantages: • Every backup results in a value function improvement. • Backups are locally optimal. • Disadvantage: • HV(B) computations are expensive.
Clustered Value Iteration [Virin, Shani, Shimony, Brafman, 2007] • Compute a clustering of the belief space. • Iterate over the clusters and back up only belief points from the current cluster. • Clusters are built such that a belief point is usually updated after its successors.
Value Directed Clustering • Compute the MDP optimal value function. • Cluster MDP states by their MDP value. • Define a soft clustering over the belief space: pr(b ∈ c) = ∑_{s ∈ c} b(s). • Iterate over the clusters by decreasing cluster value: V(c) = (1/|c|) ∑_{s ∈ c} V(s). • Update all belief points b such that pr(b ∈ c) exceeds a threshold.
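A sketch of the clustering step. Binning states into equal-width value bins is one simple choice made here for illustration; it is not necessarily the exact scheme used by SCVI:

```python
import numpy as np

def value_directed_clusters(V_mdp, B, n_clusters, threshold):
    """SCVI-style value-directed clustering (illustrative sketch):
    bin MDP states by their MDP value, assign each belief softly via
    pr(b in c) = sum_{s in c} b(s), and keep beliefs above a threshold."""
    # Cluster states by V_MDP(s) into equal-width value bins.
    edges = np.linspace(V_mdp.min(), V_mdp.max(), n_clusters + 1)
    state_cluster = np.clip(np.digitize(V_mdp, edges) - 1, 0, n_clusters - 1)
    # Order clusters by decreasing average state value.
    cluster_value = [V_mdp[state_cluster == c].mean() if np.any(state_cluster == c)
                     else -np.inf for c in range(n_clusters)]
    order = np.argsort(cluster_value)[::-1]
    # For each cluster (high value first), collect the beliefs assigned to it.
    schedule = []
    for c in order:
        members = [b for b in B if b[state_cluster == c].sum() > threshold]
        schedule.append((c, members))
    return schedule
```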
Summary • Point-based solvers are able to scale up to POMDPs with millions of states. • Algorithms differ in the selection of belief points and in the order of backups. • A smart order of backups can be computed using prioritization and clustering. • Trial-based algorithms (HSVI, FSVI) are an alternative; FSVI is the fastest algorithm of this family.