Probabilistic Planning via Determinization in Hindsight (FF-Hindsight). Sungwook Yoon. Joint work with Alan Fern, Bob Givan and Rao Kambhampati
Probabilistic Planning Competition
• Client: the participants, who send actions
• Server: the competition host, which simulates the actions
The Winner was … FF-Replan
• A replanner that uses FF
• The probabilistic domain is determinized
• An interesting contrast:
• Many probabilistic planning techniques work in theory but not in practice
• FF-Replan has no theoretical guarantees, yet works in practice
The Paper’s Objective
• A better determinization approach (determinization in hindsight)
• Theoretical analysis of the new determinization (in hindsight)
• A new view of FF-Replan
• Experimental studies with determinization in hindsight (FF-Hindsight)
Probabilistic Planning (goal-oriented). [Figure: a two-step lookahead tree rooted at initial state I with actions A1 and A2, each with probabilistic outcomes; left outcomes are more likely; some branches reach the goal state and one reaches a dead end. Objective: maximize goal achievement. Legend: action, probabilistic outcome, dead end, goal state, state.]
All Outcome Replanning (FFRA), ICAPS-07. [Figure: an action with Effect 1 (Probability1) and Effect 2 (Probability2) is determinized into Action1, which always yields Effect 1, and Action2, which always yields Effect 2.]
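To make the figure's transformation concrete, here is a minimal Python sketch of all-outcome determinization; the class and function names are illustrative, not from FF-Replan's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ProbAction:
    name: str
    outcomes: list  # list of (probability, effect) pairs

@dataclass
class DetAction:
    name: str
    effect: str

def all_outcome_determinize(action: ProbAction) -> list:
    """Split one probabilistic action into one deterministic action
    per outcome, discarding the probabilities entirely."""
    return [DetAction(f"{action.name}-{i + 1}", effect)
            for i, (_, effect) in enumerate(action.outcomes)]

# Example: an action with two effects becomes Action-1 and Action-2.
a = ProbAction("Action", [(0.9, "Effect 1"), (0.1, "Effect 2")])
print(all_outcome_determinize(a))
```

Note that because the probabilities are dropped, a very unlikely outcome looks exactly as attractive as a likely one; this is the weakness the hindsight approach addresses.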
Probabilistic Planning, All Outcome Determinization. [Figure: the same lookahead tree, but every probabilistic action A1, A2 is replaced by deterministic variants A1-1, A1-2, A2-1, A2-2 at each step; the objective becomes finding the goal in the determinized tree.]
The Problem of FF-Replan, and a Better Alternative: Sampling
• FF-Replan's static determinizations don't respect probabilities
• We need probabilistic and dynamic determinization: sample future outcomes and determinize in hindsight
• Each sampled future becomes a known-future deterministic problem
Start Sampling. Note: sampling will reveal which action, A1 or A2, is better at state I.
Hindsight Sample 1. [Figure: the lookahead tree from state I under the first sampled future; running goal-achievement tally: A1: 1, A2: 0.]
Hindsight Sample 2. [Figure: the tree under the second sampled future; running tally: A1: 2, A2: 1.]
Hindsight Sample 3. [Figure: the tree under the third sampled future; running tally: A1: 2, A2: 1.]
Hindsight Sample 4. [Figure: the tree under the fourth sampled future; running tally: A1: 3, A2: 1, favoring A1 at state I.]
Summary of the Idea: The Decision Process (Estimating the Q-Value, Q(s,a))
S: the current state; A(S) → S'
1. For each action A, draw future samples; each sample is a deterministic planning problem
2. Solve the deterministic problems; for goal-oriented problems, the solution length scores Q(s,A)
3. Aggregate the solutions for each action
4. Select the action with the best aggregated value, max_A Q(s,A) (a code sketch of this loop follows)
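Concretely, the loop might look like the minimal Python sketch below. `sample_future` and `solve_deterministic` are hypothetical placeholders (in FF-Hindsight the deterministic solver is FF), and scoring a solved future by inverse plan length is one plausible reading of step 2, not the paper's exact implementation.

```python
import random

def hindsight_q(state, action, horizon, width, sample_future, solve_deterministic):
    """Estimate Q(s, a) from `width` sampled futures. A solved future
    scores by inverse plan length (shorter is better); an unsolvable
    future contributes 0."""
    total = 0.0
    for _ in range(width):
        future = sample_future(horizon)                    # step 1: draw a future
        plan = solve_deterministic(state, action, future)  # step 2: solve it (FF)
        total += 1.0 / len(plan) if plan else 0.0
    return total / width                                   # step 3: aggregate

def select_action(state, actions, **kw):
    """Step 4: pick the action with the best aggregated estimate,
    breaking ties at random (see the tie-breaking slide later)."""
    qs = {a: hindsight_q(state, a, **kw) for a in actions}
    best = max(qs.values())
    return random.choice([a for a, q in qs.items() if q == best])
```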
Mathematical Summary of the Algorithm
• An H-horizon future F_H for M = (S, A, T, R) is a mapping of state, action and time (h < H) to a state: S × A × h → S
• R(s, F_H, π) is the value of policy π under future F_H
• V_HS(s,H) = E_{F_H}[max_π R(s, F_H, π)]
• Compare this to the real value: V*(s,H) = max_π E_{F_H}[R(s, F_H, π)]
• V_FFRa(s) = max_F V(s,F) ≥ V_HS(s,H) ≥ V*(s,H)
• Q(s,a,H) = R(a) + E_{F_{H-1}}[max_π R(a(s), F_{H-1}, π)]
• In our proposal, max_π R(s, F_{H-1}, π) is computed approximately by FF [Hoffmann and Nebel ’01]: each future is a deterministic problem, solved by FF
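The chain V_FFRa(s) ≥ V_HS(s,H) ≥ V*(s,H) can be sanity-checked on a toy example. In the sketch below, the futures, policies and rewards are invented purely for illustration: V* must commit to one policy before the future is revealed, V_HS picks the best policy per future and then averages, and V_FFRa scores the single most optimistic future.

```python
futures = {"f1": 0.5, "f2": 0.5}   # future -> probability (made up)
reward = {                          # (policy, future) -> R(s, F, pi) (made up)
    ("a1", "f1"): 1.0, ("a1", "f2"): 0.0,
    ("a2", "f1"): 0.0, ("a2", "f2"): 0.5,
}
policies = ["a1", "a2"]

# V*: commit to one policy, then take the expectation over futures.
v_star = max(sum(p * reward[(pi, f)] for f, p in futures.items()) for pi in policies)
# V_HS: pick the best policy per future ("hindsight"), then average.
v_hs = sum(p * max(reward[(pi, f)] for pi in policies) for f, p in futures.items())
# V_FFRa: the value of the single most optimistic future.
v_ffra = max(max(reward[(pi, f)] for pi in policies) for f in futures)

print(v_ffra, v_hs, v_star)  # 1.0 >= 0.75 >= 0.5
```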
Key Technical Results
• The importance of sampling states, actions and times independently
• The necessity of random tie-breaking in decision making
• We characterize FF-Replan in terms of hindsight decision making: V_FFRa(s) = max_F V(s,F)
• Theorem 1: when some policy achieves the goal with probability 1 within the horizon, the hindsight decision-making algorithm reaches the goal with probability 1
• Theorem 2: a polynomial number of samples suffices, in terms of the horizon, the number of actions and the minimum Q-value advantage
Empirical Results: IPPC-04 Problems. Numbers are solved trials. For ZenoTravel, using importance sampling improved the solved trials to 26.
Empirical Results. These domains were developed specifically to beat FF-Replan. As expected, FF-Replan did not do well, but FF-Hindsight did very well, demonstrating probabilistic reasoning ability while retaining scalability.
Conclusion. [Figure: a map linking deterministic planning (classical planning, machine learning for planning, net-benefit optimization, temporal planning) to probabilistic planning (Markov decision processes, machine learning for MDPs, temporal MDPs) via determinization, carrying the scalability of the deterministic side over to the probabilistic side.]
Conclusion
• Devised an algorithm that exploits the significant advances in deterministic planning in the context of probabilistic planning
• Made many deterministic planning techniques available to probabilistic planning
• Most learning-to-plan techniques were developed solely for deterministic planning; now these techniques are relevant to probabilistic planning too
• Advanced net-benefit-style planners can be used for reward-maximization-style probabilistic planning problems
Discussion
• Mercier and Van Hentenryck analyzed the difference between V*(s,H) = max_π E_{F_H}[R(s, F_H, π)] and V_HS(s,H) = E_{F_H}[max_π R(s, F_H, π)]
• Ng and Jordan analyzed the difference between V*(s,H) = max_π E_{F_H}[R(s, F_H, π)] and V^(s,H) = max_π (1/m) Σ R(s, F_H, π), where m is the number of samples
IPPC-2004 Results. [Table: numbers are successful runs; FFRs was the winner of IPPC-04; the entries using human control knowledge and learned knowledge were the second-place winners.]
IPPC-2006 Results. [Table: numbers are the percentage of successful runs; FFRa was the unofficial winner of IPPC-06.]
Sampling Problem: Time Dependency Issue. [Figure: states Start, S1, S2, S3, Goal and a dead end; action A leaves Start; action B yields outcome C with probability p and outcome D with probability 1-p, while elsewhere the probabilities are reversed (C with probability 1-p, D with probability p), one outcome leading to the dead end.]
Sampling Problem: Time Dependency Issue. [Figure: the same diagram.] S3 is a worse state than S1, but under a single shared sampled future it looks as if there is always a path to the Goal. We need to sample independently across actions, as sketched below.
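One way to realize the independent sampling the slide calls for is to draw a separate outcome for every (state, action, time) triple, as in the hypothetical sketch below; `transition_sampler` stands in for sampling a successor from the domain's transition distribution.

```python
import random

def sample_independent_future(states, actions, horizon, transition_sampler):
    """A future as a table: (state, action, time) -> sampled successor.
    Independence across actions prevents one lucky shared draw from
    making a bad state (like S3 above) look as good as a good one (S1)."""
    return {(s, a, t): transition_sampler(s, a)
            for s in states
            for a in actions
            for t in range(horizon)}

# Hypothetical usage with a two-outcome coin-flip transition:
def coin_transition(s, a, p=0.3):
    return "C" if random.random() < p else "D"

future = sample_independent_future(["Start", "S1", "S3"], ["A", "B"],
                                   horizon=3, transition_sampler=coin_transition)
```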
Action Selection Problem: Random Tie-Breaking Is Essential. [Figure: action A always stays in Start; actions B and C each succeed with probability p and fail with probability 1-p on the way from Start through S1 to the Goal.] In the Start state, action C is definitely better, but A can be used to wait until C's goal-reaching effect is realized.
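The effect of tie-breaking can be seen in a toy simulation, sketched below with invented numbers: when the estimates for the wait action A and the useful action C tie, a fixed first-in-order tie-break picks A forever and never leaves Start, while random tie-breaking eventually executes C and reaches the goal.

```python
import random

def run(tie_break, steps=1000):
    """Simulate a policy in the Start state. A waits; C reaches the
    goal with probability 0.3 (an invented value)."""
    state = "Start"
    for _ in range(steps):
        if state == "Goal":
            return True
        tied = ["A", "C"]  # suppose A and C tie in estimated Q this step
        act = tied[0] if tie_break == "fixed" else random.choice(tied)
        if act == "C" and random.random() < 0.3:
            state = "Goal"
    return False

print(run("fixed"), run("random"))  # typically: False True
```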
Sampling Problem: Importance Sampling (IS). [Figure: from Start, action B reaches S1 with very high probability and reaches the Goal with extremely low probability.]
• Sampling uniformly would find the problem unsolvable
• Use importance sampling instead (a sketch follows)
• Identifying the regions that need importance sampling is left for further study
• In the benchmarks, ZenoTravel needs the IS idea
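Below is a minimal sketch of the importance-sampling idea, assuming we already know which outcome is the rare goal-reaching one: oversample it with a proposal probability q and attach the weight p/q so the aggregated estimate stays unbiased. The proposal value and names are illustrative, not from the paper.

```python
import random

def sample_outcome_is(p_rare, q=0.5):
    """Return (outcome, importance_weight). The rare outcome has true
    probability p_rare but is drawn with proposal probability q."""
    if random.random() < q:
        return "goal-reaching", p_rare / q          # weight corrects the bias
    return "common", (1.0 - p_rare) / (1.0 - q)

# When aggregating Q-estimates, multiply each sampled future's score by
# the product of the weights of the outcomes it fixed.
```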
Theoretical Results
• Theorem 1: For goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves the problem with probability 1. If there is no such policy, hindsight planning returns a success ratio below 1, since any future in which no plan achieves the goal can be sampled.
• Theorem 2: The number of future samples needed to correctly identify the best action is w > 4Δ⁻²T ln(|A|H/δ), where Δ is the minimum Q-advantage of the best action over the other actions and δ is the confidence parameter. This follows from the Chernoff bound.
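For a feel of the Theorem 2 bound, the snippet below plugs invented numbers into w > 4Δ⁻²T ln(|A|H/δ), with T taken as written in the bound: the required sample count grows quadratically in 1/Δ but only logarithmically in the number of actions, the horizon and 1/δ.

```python
import math

def samples_needed(delta_q, T, num_actions, horizon, confidence):
    """Evaluate 4 * Delta^-2 * T * ln(|A| * H / delta)."""
    return 4.0 / delta_q**2 * T * math.log(num_actions * horizon / confidence)

# All arguments here are illustrative values, not from the paper.
print(samples_needed(delta_q=0.1, T=20, num_actions=4, horizon=20, confidence=0.05))
```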
Probabilistic Planning: Expecti-max Solution. [Figure: the lookahead tree rendered as an expectimax tree, alternating max nodes over actions and expectation nodes over probabilistic outcomes across two time steps. Objective: maximize goal achievement.]