Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)
The Problem • Standard MDP algorithms require explicit state-space enumeration • Curse of dimensionality • Need: a compact representation (intuition: STRIPS) • Need: versions of the standard dynamic programming algorithms that work directly on that representation
A Glimpse of the Future [figure: an example policy tree and value tree]
Roadmap • MDPs- Reminder • Structured Representation for MDPs: Bayesian Nets, Decision Trees • Algorithms for Structured Representation • Experimental Results • Extensions
MDPs- Reminder • An MDP is a tuple (S, A, Pr, R): states, actions, transitions, rewards • Discounted infinite-horizon criterion • Stationary policies π: S → A (an action to take at each state s) • Value functions: V^k_π is the k-stage-to-go value function for π
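For reference, the k-stage-to-go value function can be written recursively; this is the standard discounted formulation (the notation is chosen here, not copied from the slides):

```latex
V^{0}_{\pi}(s) = R(s), \qquad
V^{k}_{\pi}(s) = R(s) + \gamma \sum_{s'} \Pr\bigl(s,\pi(s),s'\bigr)\, V^{k-1}_{\pi}(s')
```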
Roadmap • MDPs- Reminder • Structured Representation for MDPs: Bayesian Nets, Decision Trees • Algorithms for Structured Representation • Experimental Results • Extensions
Representing MDPs as Bayesian Networks: Coffee World • State variables: O (robot is in office), W (robot is wet), U (has umbrella), R (it is raining), HCR (robot has coffee), HCO (owner has coffee) • Actions: Go (switch location), BuyC (buy coffee), DelC (deliver coffee), GetU (get umbrella) • The effects of the actions might be noisy, so we need to provide a distribution for each effect.
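To make the factored idea concrete, a state of the coffee world can be held as one boolean per variable rather than as an index into an enumerated state space; the dictionary encoding below is illustrative only:

```python
# A factored state for the coffee world: one boolean per variable.
# Variable names follow the slide; this encoding is illustrative only.
state = {
    "O": True,    # robot is in office
    "W": False,   # robot is wet
    "U": False,   # robot has umbrella
    "R": True,    # it is raining
    "HCR": False, # robot has coffee
    "HCO": False, # owner has coffee
}

# The flat state space has 2**6 = 64 states; the factored representation
# never enumerates them explicitly.
```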
Representing Actions: DelC [figure: two-slice Bayesian network for the DelC action, with a decision-tree CPT for each post-action variable]
Representing Actions: Interesting Points • No need to provide marginal distribution over pre-action variables • Markov Property: we need only the previous state • For now, no synchronic arcs • Frame Problem? • Single Network vs. a network for each action • Why Decision Trees?
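A decision-tree CPT of the kind these slides describe can be sketched with nested tuples. The tree below is a hypothetical CPT for Pr(HCO' = true) under DelC; the structure and probabilities are made up for illustration, not taken from the paper:

```python
# Hypothetical decision-tree CPT for Pr(HCO' = true) under the DelC action.
# Interior nodes are (variable, true_subtree, false_subtree); leaves are probabilities.
# The numbers are illustrative only.
cpt_delc_hco = (
    "HCO", 1.0,                 # owner already has coffee: she keeps it
    ("O",                       # otherwise: is the robot in the office?
     ("HCR", 0.8, 0.0),         #   yes: delivery succeeds w.p. 0.8 if the robot has coffee
     0.0),                      #   no: delivery cannot succeed
)

def prob_true(cpt, state):
    """Walk the decision-tree CPT using the pre-action state."""
    while isinstance(cpt, tuple):
        var, true_branch, false_branch = cpt
        cpt = true_branch if state[var] else false_branch
    return cpt

# Example: prob_true(cpt_delc_hco, {"HCO": False, "O": True, "HCR": True}) == 0.8
```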
Representing Reward Generally determined by a subset of features.
Policies and Value Functions [figure: a policy tree and a value tree; internal nodes test features such as HCR=T / HCR=F, policy-tree leaves are labeled with actions, value-tree leaves with values] The optimal choice may depend only on certain variables (given the values of some others).
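With the same tuple encoding as the CPT sketch above, a policy tree and a value tree can each be evaluated by a single traversal; the trees and numbers below are illustrative, not the paper's:

```python
# Illustrative policy and value trees over the coffee-world features.
# Interior nodes are (variable, true_subtree, false_subtree);
# policy-tree leaves hold actions, value-tree leaves hold values.
policy_tree = ("HCR", ("O", "DelC", "Go"), "BuyC")
value_tree = ("HCO", 10.0, ("HCR", 7.0, 5.0))

def evaluate(tree, state):
    """Return the leaf reached from `state` (an action or a value)."""
    while isinstance(tree, tuple):
        var, true_branch, false_branch = tree
        tree = true_branch if state[var] else false_branch
    return tree

# All states that agree on the tested variables get the same leaf, e.g.:
# evaluate(policy_tree, {"HCR": True, "O": True, "HCO": False}) == "DelC"
```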
Roadmap • MDPs- Reminder • Structured Representation for MDPs: Bayesian Nets, Decision Trees • Algorithms for Structured Representation • Experimental Results • Extensions
Value Iteration- Reminder • Bellman backup: V^{k+1}(s) = max_a Q^{k+1}_a(s) • Q-function: the value of performing a in s, given value function V^k: Q^{k+1}_a(s) = R(s) + γ Σ_{s'} Pr(s, a, s') V^k(s')
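For contrast with the structured version that follows, a flat (state-by-state) Bellman backup can be sketched directly from these formulas; the data layout (nested dictionaries for Pr and R) is an assumption of this sketch:

```python
# Flat Bellman backup over an explicit state space, for contrast with the
# structured version. P[a][s][s2] is Pr(s, a, s2); R[s] is the reward;
# gamma is the discount factor; V maps states to their current values.
def bellman_backup(V, P, R, gamma):
    Q = {s: {a: R[s] + gamma * sum(p * V[s2] for s2, p in P[a][s].items())
             for a in P}
         for s in R}
    V_new = {s: max(Q[s].values()) for s in R}
    return V_new, Q
```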
Structured Value Iteration- Overview Input: Tree(R). Output: Tree(V). 1. Set Tree(V^0) = Tree(R) 2. Repeat (a) Compute Tree(Q^{k+1}_a) = Regress(Tree(V^k), a) for each action a (b) Merge (via maximization) the trees Tree(Q^{k+1}_a) to obtain Tree(V^{k+1}) Until the termination criterion holds. Return Tree(V^k).
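The top-level loop can be sketched as below; `regress`, `merge_max`, and `converged` are placeholders for the tree operations described on the following slides (this is a sketch of the control flow, not the paper's implementation):

```python
# Sketch of the structured value iteration loop. The tree operations
# (regress, merge_max, converged) are supplied by the caller; sketches of
# their ingredients appear after the Regress and Maximization slides.
def structured_value_iteration(reward_tree, actions, regress, merge_max, converged):
    v_tree = reward_tree                                   # Tree(V0) = Tree(R)
    while True:
        q_trees = {a: regress(v_tree, a) for a in actions} # step 2(a)
        new_v_tree = merge_max(q_trees.values())           # step 2(b)
        if converged(v_tree, new_v_tree):                  # termination criterion
            return new_v_tree
        v_tree = new_v_tree
```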
Step 2a: Calculating Q-Functions • 1. Expected future value • 2. Discounting future value • 3. Adding immediate reward How to use the structure of the trees? Tree(Q_a) should distinguish only conditions under which a makes a branch of Tree(V) true with different odds.
Calculating Tree(Q^1_a) [figure: worked example. Starting from Tree(V0), find the conditions under which a will have distinct expected value with respect to V0, giving PTree(Q^1_a); taking expectations at its leaves (e.g., 1*10 + 0*0 = 10) gives FVTree(Q^1_a), the undiscounted expected future value for performing action a with one stage to go; discounting FVTree (by 0.9) and adding the immediate reward function gives Tree(Q^1_a).]
(A more complicated example) [figure: the same regression applied to Tree(V1), showing PartialPTree(Q^2_a), the unsimplified PTree(Q^2_a), the simplified PTree(Q^2_a), FVTree(Q^2_a), and Tree(Q^2_a)]
The Algorithm: Regress Input: Tree(V), action a. Output: Tree(Q_a) 1. PTree(Q_a) = PRegress(Tree(V), a) (simplified) 2. Construct FVTree(Q_a): for each branch b of PTree, with leaf node l(b): (a) Pr_b = the product of the individual distributions in l(b) (b) v_b = Σ over branches b' of Tree(V) of Pr_b(b') * V(b') (the expected value of Tree(V) under Pr_b) (c) Re-label leaf l(b) with v_b 3. Discount FVTree(Q_a) by γ, and append Tree(R) 4. Return FVTree(Q_a)
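Step 2 is the numerical core: a PTree leaf carries one independent distribution per variable that Tree(V) tests, and its new label is the expectation of Tree(V) under that product distribution. A brute-force sketch of that computation (enumerating assignments rather than exploiting the tree structure, purely for clarity) might look like this:

```python
from itertools import product

def evaluate(tree, state):
    """Walk a (variable, true_subtree, false_subtree) tree to its leaf value."""
    while isinstance(tree, tuple):
        var, true_branch, false_branch = tree
        tree = true_branch if state[var] else false_branch
    return tree

def expected_value(leaf_dists, value_tree):
    """Expectation of `value_tree` under the product distribution at a PTree leaf.
    `leaf_dists` maps each variable tested in the value tree to Pr(variable = True).
    Enumerates assignments for clarity; the real algorithm exploits tree structure."""
    variables = list(leaf_dists)
    total = 0.0
    for assignment in product([True, False], repeat=len(variables)):
        context = dict(zip(variables, assignment))
        pr = 1.0
        for var, val in context.items():
            pr *= leaf_dists[var] if val else 1.0 - leaf_dists[var]
        total += pr * evaluate(value_tree, context)
    return total

# Matching the slide's arithmetic: a value tree paying 10 when HCO is true and
# 0 otherwise, regressed through a leaf where Pr(HCO = True) = 1.0, gives
# 1*10 + 0*0 = 10:
# expected_value({"HCO": 1.0}, ("HCO", 10.0, 0.0)) == 10.0
```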
The Algorithm: PRegress Input: Tree(V), action a. Output: PTree(Q_a) 1. If Tree(V) is a single node, return the empty tree 2. X = the variable at the root of Tree(V); T_X = the tree for CPT_a(X) (label its leaves with X's distribution) 3. T_t, T_f = the subtrees of Tree(V) for X=t and X=f 4. PT_t, PT_f = the results of calling PRegress on T_t and T_f 5. For each leaf l in T_X, add PT_t, PT_f, or both (according to the distribution at l; use union to combine the labels) 6. Return T_X
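A much-simplified sketch of this recursion is below, reusing the tuple encoding from the earlier sketches. It omits the simplification of redundant tests, assumes leaf labels never conflict when merged, and uses `cpts[X]` for the action's decision-tree CPT for X; none of these names come from the paper.

```python
def pregress(value_tree, cpts):
    """Simplified PRegress sketch. `value_tree` uses (variable, true_sub, false_sub)
    tuples with numeric leaves; `cpts[X]` is the action's decision-tree CPT for X
    with leaves holding Pr(X' = True). Returns a tree whose leaves are dicts
    mapping each relevant variable to Pr(variable = True)."""
    if not isinstance(value_tree, tuple):       # single node: no variables matter
        return {}
    x, v_true, v_false = value_tree
    pt_true = pregress(v_true, cpts)            # conditions relevant when X comes out true
    pt_false = pregress(v_false, cpts)          # ... and when X comes out false
    return graft(cpts[x], x, pt_true, pt_false)

def graft(cpt_tree, x, pt_true, pt_false):
    """Label each CPT leaf with X's distribution and attach the subtrees needed
    for the outcomes (X true / X false) that the leaf gives positive probability."""
    if isinstance(cpt_tree, tuple):
        var, t, f = cpt_tree
        return (var, graft(t, x, pt_true, pt_false), graft(f, x, pt_true, pt_false))
    p = cpt_tree                                # Pr(X' = True) at this CPT leaf
    result = {x: p}
    if p > 0.0:
        result = merge(result, pt_true)
    if p < 1.0:
        result = merge(result, pt_false)
    return result

def merge(tree_a, tree_b):
    """Append tree_b below every leaf of tree_a, taking the union of leaf labels."""
    if isinstance(tree_a, tuple):
        var, t, f = tree_a
        return (var, merge(t, tree_b), merge(f, tree_b))
    if isinstance(tree_b, tuple):
        var, t, f = tree_b
        return (var, merge(tree_a, t), merge(tree_a, f))
    return {**tree_a, **tree_b}
```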
Step 2b: Maximization. Merge the trees Tree(Q_a) by maximization to obtain the next value tree. Value iteration is now complete.
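Merging two value trees by maximization can be sketched with the same tuple encoding; this version makes no attempt to simplify redundant tests:

```python
# Merge two value trees by maximization: the result, evaluated at any state,
# equals the larger of the two input values. Redundant tests are not simplified.
def max_trees(tree_a, tree_b):
    if isinstance(tree_a, tuple):
        var, t, f = tree_a
        return (var, max_trees(t, tree_b), max_trees(f, tree_b))
    if isinstance(tree_b, tuple):
        var, t, f = tree_b
        return (var, max_trees(tree_a, t), max_trees(tree_a, f))
    return max(tree_a, tree_b)

# Folding max_trees over all Q-trees yields the next value tree, e.g.:
# from functools import reduce
# v_tree = reduce(max_trees, q_trees)
```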
Roadmap • MDPs- Reminder • Structured Representation for MDPs: Bayesian Nets, Decision Trees • Algorithms for Structured Representation • Experimental Results • Extensions
Experimental Results [figures: worst-case and best-case comparisons]
Roadmap • MDPs- Reminder • Structured Representation for MDPs: Bayesian Nets, Decision Trees • Algorithms for Structured Representation • Experimental Results • Extensions
Extensions • Synchronic edges • POMDPs • Rewards • Approximation
Backup slides • Here be dragons.