Hierarchical POMDP Planning and Execution
Joelle Pineau
Machine Learning Lunch, November 20, 2000
Partially Observable MDP
[Figure: chain of hidden states S1, S2, S3 linked by actions, with an observation emitted at each step]
• POMDPs are characterized by:
  • States: s ∈ S
  • Actions: a ∈ A
  • Observations: o ∈ O
  • Transition probabilities: T(s,a,s') = Pr(s'|s,a)
  • Observation probabilities: O(s',a,o) = Pr(o|s',a)
  • Rewards: R(s,a)
  • Beliefs: b(s_t) = Pr(s_t | o_t, a_t, …, o_0, a_0)
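The belief is maintained by Bayesian filtering: after taking action a and receiving observation o, the new belief is b'(s') ∝ O(s',a,o) · Σ_s T(s,a,s') · b(s). A minimal sketch of this update for a tabular model (the dictionary-based representation is an illustrative choice, not from the talk):

```python
def belief_update(b, a, o, states, T, O):
    """Bayes-filter belief update for a tabular POMDP.

    b: dict state -> probability (current belief)
    T: dict (s, a, s') -> Pr(s' | s, a)
    O: dict (s', a, o) -> Pr(o | s', a)
    Returns the new belief b'(s') proportional to O(s',a,o) * sum_s T(s,a,s') * b(s).
    """
    new_b = {}
    for s_next in states:
        prior = sum(T[(s, a, s_next)] * b[s] for s in states)
        new_b[s_next] = O[(s_next, a, o)] * prior
    norm = sum(new_b.values())
    if norm == 0:
        raise ValueError("Observation has zero probability under the current belief")
    return {s: p / norm for s, p in new_b.items()}
```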
The problem
• How can we find good policies for complex POMDPs?
• Is there a principled way to provide near-optimal policies?
Proposed Approach
[Figure: example action hierarchy over {Act, InvestigateHealth, Move, Navigate, AskWhere, CheckPulse, CheckMeds, Left, Right, Up, Down}]
• Exploit structure in the problem domain.
• What type of structure?
• Action set partitioning
Hierarchical POMDP Planning
• What do we start with?
  • A full POMDP model: {So, Ao, Oo, Mo}.
  • An action set partitioning graph.
• Key idea:
  • Break the problem into many "related" POMDPs.
  • Each smaller POMDP has only a subset of Ao, which imposes a policy constraint.
• But why?
  • Exact POMDP value iteration has exponential run-time per iteration: O(|A|·|Vn-1|^|O|), where Vn-1 is the set of value-function components from the previous iteration.
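A minimal sketch of the decomposition step, assuming the partition is given as a mapping from each abstract action (subtask) to the actions available inside it; the data structures and names below are illustrative, not the talk's implementation:

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # S
    actions: list       # A (for a subtask, only a subset of the original Ao)
    observations: list  # O
    T: dict             # (s, a, s') -> Pr(s'|s,a)
    O: dict             # (s', a, o) -> Pr(o|s',a)
    R: dict             # (s, a) -> reward

def make_subtask_pomdps(full: POMDP, partition: dict) -> dict:
    """Create one smaller POMDP per abstract action.

    partition maps an abstract action name to the actions available in that
    subtask, e.g. {"Move": ["ClarifyTask", "GoToKitchen", "GoToBedroom"], ...}.
    Each subtask keeps the full state and observation sets but only its own
    action subset, which is what makes it cheaper to solve.
    """
    subtasks = {}
    for abstract, child_actions in partition.items():
        subtasks[abstract] = POMDP(
            states=full.states,
            actions=list(child_actions),
            observations=full.observations,
            T={k: v for k, v in full.T.items() if k[1] in child_actions},
            O={k: v for k, v in full.O.items() if k[1] in child_actions},
            R={k: v for k, v in full.R.items() if k[1] in child_actions},
        )
    return subtasks
```

Note that when a child action is itself abstract (e.g. Move appearing inside the Act subtask), its parameters are not part of the original model; they are filled in from the child controller's local policy, as described a few slides below.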
Example POMDP
[Figure: three-state transition diagram over Meds (M), Kitchen (K), and Bedroom (B), with transition probabilities of 0.8 and 0.1, and the corresponding value function over the belief space]
So = {Meds, Kitchen, Bedroom}
Ao = {ClarifyTask, CheckMeds, GoToKitchen, GoToBedroom}
Oo = {Noise, Meds, Kitchen, Bedroom}
Hierarchical POMDP Action Partitioning:
Act → {CheckMeds, ClarifyTask, Move}
Move → {ClarifyTask, GoToKitchen, GoToBedroom}
Local Value Function and Policy - Move Controller
[Figure: value function over the belief simplex (MedsState, KitchenState, BedroomState); the policy selects GoToKitchen, GoToBedroom, or ClarifyTask depending on the belief]
Modeling Abstract Actions
Problem: We need parameters for the abstract action Move.
Solution: Use the local policy of the corresponding low-level controller.
General form: Pr(sj | si, ak^abstract) = Pr(sj | si, Policy(ak^abstract, si))
Example: Pr(sj | MedsState, Move) = Pr(sj | MedsState, ClarifyTask)
Policy(Move, si):
  KitchenState → GoToKitchen
  BedroomState → GoToBedroom
  MedsState → ClarifyTask
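A minimal sketch of this construction, with a tabular transition model and the Move controller's local policy as plain dictionaries (the 0.8/0.1 probability values echo the example figure but are illustrative, not exact parameters from the talk):

```python
# Local policy of the Move controller: which primitive action it takes in each state.
# (Reconstructed from the slide's example; illustrative.)
move_policy = {
    "KitchenState": "GoToKitchen",
    "BedroomState": "GoToBedroom",
    "MedsState": "ClarifyTask",
}

def abstract_transition(T, policy, s_i, s_j):
    """Pr(s_j | s_i, abstract action) = Pr(s_j | s_i, policy(s_i)).

    T is a dict (s, a, s') -> Pr(s'|s,a) over primitive actions;
    policy maps each state to the primitive action the subtask's
    local controller would execute there.
    """
    return T[(s_i, policy[s_i], s_j)]

# Example, assuming ClarifyTask keeps the robot in MedsState with probability 0.8:
T = {("MedsState", "ClarifyTask", "MedsState"): 0.8,
     ("MedsState", "ClarifyTask", "KitchenState"): 0.1,
     ("MedsState", "ClarifyTask", "BedroomState"): 0.1}
print(abstract_transition(T, move_policy, "MedsState", "MedsState"))  # 0.8
```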
Local Value Function and Policy - Act Controller
[Figure: value function over the belief simplex (MedsState, KitchenState, BedroomState); the policy selects CheckMeds or the abstract action Move depending on the belief]
Comparing Policies
[Figure: hierarchical policy vs. optimal policy over the belief space; legend: ClarifyTask, CheckMeds, GoToKitchen, GoToBedroom]
Bounding the value of the approximation
• The value function of the top-level controller is an upper bound on the value of the approximation.
  • Why? We were optimistic when modeling the abstract action.
• Similarly, we can find a lower bound.
  • How? We can take a "worst-case" view when modeling the abstract action.
• If we partition the action set differently, we will get different bounds.
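One simple way to picture the optimistic vs. worst-case modeling (a sketch only, not the talk's exact construction): when building the abstract action's reward in a given state, the optimistic model lets the subtask pick its best child action, while the pessimistic model assumes the worst child action.

```python
def abstract_reward(R, child_actions, s, optimistic=True):
    """Illustrative reward model for an abstract action in state s.

    R: dict (state, action) -> reward over the subtask's child actions.
    The optimistic choice (best child action) contributes to an upper
    bound on the approximation's value; the worst-case choice
    contributes to a lower bound.
    """
    rewards = [R[(s, a)] for a in child_actions]
    return max(rewards) if optimistic else min(rewards)

# Hypothetical rewards for the Move subtask in MedsState:
R = {("MedsState", "GoToKitchen"): -1.0,
     ("MedsState", "GoToBedroom"): -1.0,
     ("MedsState", "ClarifyTask"): 0.5}
children = ["GoToKitchen", "GoToBedroom", "ClarifyTask"]
upper = abstract_reward(R, children, "MedsState", optimistic=True)   # 0.5
lower = abstract_reward(R, children, "MedsState", optimistic=False)  # -1.0
```

A different partition of the action set changes the child sets, and with them the bounds.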
A real dialogue management example
Act
  CheckHealth: AskHealth, OfferHelp
  CheckWeather: AskWeatherTime, SayCurrent, SayToday, SayTomorrow, SayTime
  Greet: GreetGeneral, GreetMorning, GreetNight, RespondThanks
  Move: AskGoWhere, GoToRoom, GoToKitchen, GoToFollow, VerifyRoom, VerifyKitchen, VerifyFollow
  DoMeds: StartMeds, NextMeds, ForceMeds, QuitMeds
  Phone: AskCallWho, CallHelp, CallNurse, CallRelative, VerifyHelp, VerifyNurse, VerifyRelative
Final words
• We presented:
  • a general framework to exploit structure in POMDPs.
• Future work:
  • automatic generation of good action partitionings;
  • conditions for additional observation abstraction;
  • bigger problems!