Probabilistic Planning via Determinization in Hindsight (FF-Hindsight). Sungwook Yoon. Joint work with Alan Fern, Bob Givan and Rao Kambhampati
Probabilistic Planning Competition
• Client: the participants, who send actions
• Server: the competition host, which simulates the actions
The Winner was … FF-Replan
• A replanner that uses FF
• The probabilistic domain is determinized
• An interesting contrast:
• Many probabilistic planning techniques work in theory but not in practice
• FF-Replan has no theoretical guarantees, yet works in practice
The Paper’s Objective
• A better determinization approach (determinization in hindsight)
• Theoretical analysis of the new determinization (in hindsight)
• A new view of FF-Replan
• Experimental studies with determinization in hindsight (FF-Hindsight)
Probabilistic Planning (goal-oriented). [Figure: a two-step lookahead tree rooted at initial state I with actions A1 and A2, each with probabilistic outcomes; left outcomes are more likely; some branches reach the goal state and one reaches a dead end. Objective: maximize goal achievement. Legend: action, probabilistic outcome, dead end, goal state, state.]
All Outcome Replanning (FFRA), ICAPS-07. [Figure: an action with Effect 1 (Probability1) and Effect 2 (Probability2) is determinized into Action1, which always yields Effect 1, and Action2, which always yields Effect 2.]
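To make the figure's transformation concrete, here is a minimal Python sketch of all-outcome determinization; the class and function names are illustrative, not from FF-Replan's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ProbAction:
    name: str
    outcomes: list  # list of (probability, effect) pairs

@dataclass
class DetAction:
    name: str
    effect: str

def all_outcome_determinize(action: ProbAction) -> list:
    """Split one probabilistic action into one deterministic action
    per outcome, discarding the probabilities entirely."""
    return [DetAction(f"{action.name}-{i + 1}", effect)
            for i, (_, effect) in enumerate(action.outcomes)]

# Example: an action with two effects becomes Action-1 and Action-2.
a = ProbAction("Action", [(0.9, "Effect 1"), (0.1, "Effect 2")])
print(all_outcome_determinize(a))
```

Note that because the probabilities are dropped, a very unlikely outcome looks exactly as attractive as a likely one; this is the weakness the hindsight approach addresses.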
Probabilistic Planning, All Outcome Determinization. [Figure: the same lookahead tree, but every probabilistic action A1, A2 is replaced by deterministic variants A1-1, A1-2, A2-1, A2-2 at each step; the objective becomes finding the goal in the determinized tree.]
The Problem of FF-Replan, and a Better Alternative: Sampling
• FF-Replan's static determinizations don't respect probabilities
• We need probabilistic and dynamic determinization: sample future outcomes and determinize in hindsight
• Each sampled future becomes a known-future deterministic problem
Start Sampling. Note: sampling will reveal which action, A1 or A2, is better at state I.
Hindsight Sample 1. [Figure: the lookahead tree from state I under the first sampled future; running goal-achievement tally: A1: 1, A2: 0.]
Hindsight Sample 2. [Figure: the tree under the second sampled future; running tally: A1: 2, A2: 1.]
Hindsight Sample 3. [Figure: the tree under the third sampled future; running tally: A1: 2, A2: 1.]
Hindsight Sample 4. [Figure: the tree under the fourth sampled future; running tally: A1: 3, A2: 1, favoring A1 at state I.]
Summary of the Idea: The Decision Process (Estimating the Q-Value, Q(s,a))
S: the current state; A(S) → S'
1. For each action A, draw future samples; each sample is a deterministic planning problem
2. Solve the deterministic problems; for goal-oriented problems, the solution length scores Q(s,A)
3. Aggregate the solutions for each action
4. Select the action with the best aggregated value, max_A Q(s,A) (a code sketch of this loop follows)
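Concretely, the loop might look like the minimal Python sketch below. `sample_future` and `solve_deterministic` are hypothetical placeholders (in FF-Hindsight the deterministic solver is FF), and scoring a solved future by inverse plan length is one plausible reading of step 2, not the paper's exact implementation.

```python
import random

def hindsight_q(state, action, horizon, width, sample_future, solve_deterministic):
    """Estimate Q(s, a) from `width` sampled futures. A solved future
    scores by inverse plan length (shorter is better); an unsolvable
    future contributes 0."""
    total = 0.0
    for _ in range(width):
        future = sample_future(horizon)                    # step 1: draw a future
        plan = solve_deterministic(state, action, future)  # step 2: solve it (FF)
        total += 1.0 / len(plan) if plan else 0.0
    return total / width                                   # step 3: aggregate

def select_action(state, actions, **kw):
    """Step 4: pick the action with the best aggregated estimate,
    breaking ties at random (see the tie-breaking slide later)."""
    qs = {a: hindsight_q(state, a, **kw) for a in actions}
    best = max(qs.values())
    return random.choice([a for a, q in qs.items() if q == best])
```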
Mathematical Summary of the Algorithm
• An H-horizon future F_H for M = (S, A, T, R) is a mapping of state, action and time (h < H) to a state: S × A × h → S
• R(s, F_H, π) is the value of policy π under future F_H
• V_HS(s,H) = E_{F_H}[max_π R(s, F_H, π)]
• Compare this to the real value: V*(s,H) = max_π E_{F_H}[R(s, F_H, π)]
• V_FFRa(s) = max_F V(s,F) ≥ V_HS(s,H) ≥ V*(s,H)
• Q(s,a,H) = R(a) + E_{F_{H-1}}[max_π R(a(s), F_{H-1}, π)]
• In our proposal, max_π R(s, F_{H-1}, π) is computed approximately by FF [Hoffmann and Nebel ’01]: each future is a deterministic problem, solved by FF
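The chain V_FFRa(s) ≥ V_HS(s,H) ≥ V*(s,H) can be sanity-checked on a toy example. In the sketch below, the futures, policies and rewards are invented purely for illustration: V* must commit to one policy before the future is revealed, V_HS picks the best policy per future and then averages, and V_FFRa scores the single most optimistic future.

```python
futures = {"f1": 0.5, "f2": 0.5}   # future -> probability (made up)
reward = {                          # (policy, future) -> R(s, F, pi) (made up)
    ("a1", "f1"): 1.0, ("a1", "f2"): 0.0,
    ("a2", "f1"): 0.0, ("a2", "f2"): 0.5,
}
policies = ["a1", "a2"]

# V*: commit to one policy, then take the expectation over futures.
v_star = max(sum(p * reward[(pi, f)] for f, p in futures.items()) for pi in policies)
# V_HS: pick the best policy per future ("hindsight"), then average.
v_hs = sum(p * max(reward[(pi, f)] for pi in policies) for f, p in futures.items())
# V_FFRa: the value of the single most optimistic future.
v_ffra = max(max(reward[(pi, f)] for pi in policies) for f in futures)

print(v_ffra, v_hs, v_star)  # 1.0 >= 0.75 >= 0.5
```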
Key Technical Results
• The importance of sampling states, actions and times independently
• The necessity of random tie-breaking in decision making
• We characterize FF-Replan in terms of hindsight decision making: V_FFRa(s) = max_F V(s,F)
• Theorem 1: when some policy achieves the goal with probability 1 within the horizon, the hindsight decision-making algorithm reaches the goal with probability 1
• Theorem 2: a polynomial number of samples suffices, in terms of the horizon, the number of actions and the minimum Q-value advantage
Empirical Results: IPPC-04 Problems. Numbers are solved trials. For ZenoTravel, using importance sampling improved the solved trials to 26.
Empirical Results. These domains were developed specifically to beat FF-Replan. As expected, FF-Replan did not do well, but FF-Hindsight did very well, demonstrating probabilistic reasoning ability while retaining scalability.
Conclusion. [Figure: a map linking deterministic planning (classical planning, machine learning for planning, net-benefit optimization, temporal planning) to probabilistic planning (Markov decision processes, machine learning for MDPs, temporal MDPs) via determinization, carrying the scalability of the deterministic side over to the probabilistic side.]
Conclusion
• Devised an algorithm that exploits the significant advances in deterministic planning in the context of probabilistic planning
• Made many deterministic planning techniques available to probabilistic planning
• Most learning-to-plan techniques were developed solely for deterministic planning; now these techniques are relevant to probabilistic planning too
• Advanced net-benefit-style planners can be used for reward-maximization-style probabilistic planning problems
Discussion
• Mercier and Van Hentenryck analyzed the difference between V*(s,H) = max_π E_{F_H}[R(s, F_H, π)] and V_HS(s,H) = E_{F_H}[max_π R(s, F_H, π)]
• Ng and Jordan analyzed the difference between V*(s,H) = max_π E_{F_H}[R(s, F_H, π)] and V^(s,H) = max_π (1/m) Σ R(s, F_H, π), where m is the number of samples
IPPC-2004 Results. [Table: numbers are successful runs; FFRs was the winner of IPPC-04; the entries using human control knowledge and learned knowledge were the second-place winners.]
IPPC-2006 Results. [Table: numbers are the percentage of successful runs; FFRa was the unofficial winner of IPPC-06.]
Sampling Problem: Time Dependency Issue. [Figure: states Start, S1, S2, S3, Goal and a dead end; action A leaves Start; action B yields outcome C with probability p and outcome D with probability 1-p, while elsewhere the probabilities are reversed (C with probability 1-p, D with probability p), one outcome leading to the dead end.]
Sampling Problem: Time Dependency Issue. [Figure: the same diagram.] S3 is a worse state than S1, but under a single shared sampled future it looks as if there is always a path to the Goal. We need to sample independently across actions, as sketched below.
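One way to realize the independent sampling the slide calls for is to draw a separate outcome for every (state, action, time) triple, as in the hypothetical sketch below; `transition_sampler` stands in for sampling a successor from the domain's transition distribution.

```python
import random

def sample_independent_future(states, actions, horizon, transition_sampler):
    """A future as a table: (state, action, time) -> sampled successor.
    Independence across actions prevents one lucky shared draw from
    making a bad state (like S3 above) look as good as a good one (S1)."""
    return {(s, a, t): transition_sampler(s, a)
            for s in states
            for a in actions
            for t in range(horizon)}

# Hypothetical usage with a two-outcome coin-flip transition:
def coin_transition(s, a, p=0.3):
    return "C" if random.random() < p else "D"

future = sample_independent_future(["Start", "S1", "S3"], ["A", "B"],
                                   horizon=3, transition_sampler=coin_transition)
```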
Action Selection Problem: Random Tie-Breaking Is Essential. [Figure: action A always stays in Start; actions B and C each succeed with probability p and fail with probability 1-p on the way from Start through S1 to the Goal.] In the Start state, action C is definitely better, but A can be used to wait until C's goal-reaching effect is realized.
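The effect of tie-breaking can be seen in a toy simulation, sketched below with invented numbers: when the estimates for the wait action A and the useful action C tie, a fixed first-in-order tie-break picks A forever and never leaves Start, while random tie-breaking eventually executes C and reaches the goal.

```python
import random

def run(tie_break, steps=1000):
    """Simulate a policy in the Start state. A waits; C reaches the
    goal with probability 0.3 (an invented value)."""
    state = "Start"
    for _ in range(steps):
        if state == "Goal":
            return True
        tied = ["A", "C"]  # suppose A and C tie in estimated Q this step
        act = tied[0] if tie_break == "fixed" else random.choice(tied)
        if act == "C" and random.random() < 0.3:
            state = "Goal"
    return False

print(run("fixed"), run("random"))  # typically: False True
```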
Sampling Problem: Importance Sampling (IS). [Figure: from Start, action B reaches S1 with very high probability and reaches the Goal with extremely low probability.]
• Sampling uniformly would find the problem unsolvable
• Use importance sampling instead (a sketch follows)
• Identifying the regions that need importance sampling is left for further study
• In the benchmarks, ZenoTravel needs the IS idea
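Below is a minimal sketch of the importance-sampling idea, assuming we already know which outcome is the rare goal-reaching one: oversample it with a proposal probability q and attach the weight p/q so the aggregated estimate stays unbiased. The proposal value and names are illustrative, not from the paper.

```python
import random

def sample_outcome_is(p_rare, q=0.5):
    """Return (outcome, importance_weight). The rare outcome has true
    probability p_rare but is drawn with proposal probability q."""
    if random.random() < q:
        return "goal-reaching", p_rare / q          # weight corrects the bias
    return "common", (1.0 - p_rare) / (1.0 - q)

# When aggregating Q-estimates, multiply each sampled future's score by
# the product of the weights of the outcomes it fixed.
```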
Theoretical Results
• Theorem 1: For goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves the problem with probability 1. If there is no such policy, hindsight planning returns a success ratio below 1, since any future in which no plan achieves the goal can be sampled.
• Theorem 2: The number of future samples needed to correctly identify the best action is w > 4Δ⁻²T ln(|A|H/δ), where Δ is the minimum Q-advantage of the best action over the other actions and δ is the confidence parameter. This follows from the Chernoff bound.
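For a feel of the Theorem 2 bound, the snippet below plugs invented numbers into w > 4Δ⁻²T ln(|A|H/δ), with T taken as written in the bound: the required sample count grows quadratically in 1/Δ but only logarithmically in the number of actions, the horizon and 1/δ.

```python
import math

def samples_needed(delta_q, T, num_actions, horizon, confidence):
    """Evaluate 4 * Delta^-2 * T * ln(|A| * H / delta)."""
    return 4.0 / delta_q**2 * T * math.log(num_actions * horizon / confidence)

# All arguments here are illustrative values, not from the paper.
print(samples_needed(delta_q=0.1, T=20, num_actions=4, horizon=20, confidence=0.05))
```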
Probabilistic Planning: Expecti-max Solution. [Figure: the lookahead tree rendered as an expectimax tree, alternating max nodes over actions and expectation nodes over probabilistic outcomes across two time steps. Objective: maximize goal achievement.]