Decision Making under Uncertainty MURI Meeting, June 2001 Daphne Koller Stanford University
Research Themes • Decision making in high-dimensional spaces • Feature selection [Guestrin, Ormoneit] • Factored models [Guestrin, Parr, K.] • Decision making in multi-agent settings • Inferring preferences from behavior [Chajewska, Ormoneit, K.] • Strategic interactions [Milch, K.] • Hybrid (discrete/continuous) models [Lerner, Parr, K.] • Reasoning in complex multi-entity domains [Getoor, Segal, Taskar, K.] • Learning probabilistic models from data [Tong, K.]
Motivation This time Next time • A complex battlespace, composed of multiple entities, moving across space and time • Very large state space: • Positions of all units • Intentions of enemy units • Weather & terrain • … • Many agents (units) making parallel decisions, trying to coordinate
The MDP Framework (Figure: an actor sends actions to a complex environment and receives state and reward/cost) • State space: S • Action space: A • Actions stochastically influence next state: • Transition model: P(s’ | s,a) • States are associated with momentary rewards • Rewards accumulate over time • Task: Maximize expected, discounted reward • Find policy
Policies & Value Functions • Suppose an expert told you the “value” of each state: • V(s) is the value of acting optimally starting at s • If V is optimal, then it is optimal to act greedily wrt V • Pick action with highest expected immediate value (Figure: Action 1 and Action 2 each lead to S1 or S2, with transition probabilities 0.5, 0.5 and 0.7, 0.3; V(S1) = 10, V(S2) = 5; the expectation is taken over next-state values)
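A minimal sketch of acting greedily with respect to V, using the state values from the slide (V(S1) = 10, V(S2) = 5). Which action carries the 0.5/0.5 distribution and which the 0.7/0.3 distribution is an assumption made here for illustration, as are all function and variable names. Rewards and discounting are left out so that only the expectation over next-state values is compared.

```python
# Values of the two successor states, as given on the slide.
V = {"S1": 10.0, "S2": 5.0}

# Next-state distribution of each action. The pairing of distributions
# with actions is an assumption for illustration.
P = {
    "Action 1": {"S1": 0.5, "S2": 0.5},
    "Action 2": {"S1": 0.7, "S2": 0.3},
}


def expected_value(action):
    """Expectation of V over the next state reached by `action`."""
    return sum(prob * V[state] for state, prob in P[action].items())


def greedy_action():
    """Action with the highest expected next-state value."""
    return max(P, key=expected_value)


if __name__ == "__main__":
    for action in P:
        print(action, expected_value(action))
    print("greedy choice:", greedy_action())
```

Under these assumed numbers the action that is more likely to reach the higher-valued state wins; with a different V the same rule could prefer the other action.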
Large state spaces • There are several approaches for computing the optimal value function: • Policy iteration: an iterative bootstrap algorithm • Linear programming • Reinforcement learning for cases when process dynamics unknown or available only via simulation • Problem: In most real-world problems, state space is very large • Exponential in number of features used to describe it • Value function has a value for each state: • Impractical to represent or compute in most cases
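Policy iteration, the first approach listed above, might be sketched as follows for a tiny tabular MDP with state-based rewards. The two-state example, the discount factor 0.9, and all names are hypothetical. The explicit table over states is exactly what becomes impractical when the state space is exponential in the number of features.

```python
def evaluate(policy, P, R, gamma, sweeps=500):
    """Value of following `policy`: V(s) = R(s) + gamma * E[V(s')]."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        V = [
            R[s] + gamma * sum(p * V[t] for t, p in enumerate(P[policy[s]][s]))
            for s in range(len(R))
        ]
    return V


def greedy(V, P):
    """For each state, the action with the highest expected next value."""
    n_states = len(V)
    return [
        max(range(len(P)),
            key=lambda a: sum(p * V[t] for t, p in enumerate(P[a][s])))
        for s in range(n_states)
    ]


def policy_iteration(P, R, gamma=0.9, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(R)
    V = evaluate(policy, P, R, gamma)
    for _ in range(max_rounds):
        new_policy = greedy(V, P)
        if new_policy == policy:
            break
        policy = new_policy
        V = evaluate(policy, P, R, gamma)
    return policy, V


# Hypothetical MDP: P[a][s][t] is the probability of moving from s to t.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay in place
    [[0.2, 0.8], [0.8, 0.2]],  # action 1: try to switch state
]
R = [0.0, 1.0]  # reward is collected in state 1 only

if __name__ == "__main__":
    print(policy_iteration(P, R))
```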
Feature Selection in MDPs (Figure: variables H, H’ and features X, X’) • One approach: • Select features • Solve problem as if it were an MDP over the features • How bad is this approximation?
Theorem: Near-Optimal Policy (Figure: variables H, H’ and features X, X’) • Theorem: The loss of acting according to the greedy memory-less policy over the observable variables is bounded by a factor of the mutual information between H and H’ given X’. • For normally distributed H’ and linear features (X = W^T H), I(H,H’|X’) is minimized if W spans the first k principal components of H
Learning to Ride a Bicycle • Task proposed by [Randløv and Alstrøm ‘98]: • Learn to ride from initial state to goal • Control handlebar torque and center of mass • [Randløv and Alstrøm ‘98]: • Discretized 6 dof, used NN to represent policy • Distance to reach goal: • 7 Km on average • Best case 1.7 Km • PEGASUS [Ng and Jordan ‘00]: • Used 15 features and linear sigmoid policies • Worst case distance 1.07 Km Start 1Km Goal
Feature Selection Algorithm • Used same 15 features and policy representation • Simulate system using do-nothing policy • Run PCA on points in sample trajectories • Apply PEGASUS algorithm using only first k principal components as features
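The PCA step of this algorithm could look roughly like the sketch below. It assumes NumPy, uses random data as a stand-in for states sampled from do-nothing trajectories, and centers the samples before projecting, which is a common PCA convention rather than something stated on the slides. The policy search itself (PEGASUS) is not shown.

```python
import numpy as np


def principal_components(samples, k):
    """Return (mean, W); the columns of W are the first k principal directions."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T


def project(states, mean, W):
    """Reduced features X = W^T (H - mean) for each state row H."""
    return (states - mean) @ W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for 15-dimensional states visited under a do-nothing policy.
    latent = rng.normal(size=(500, 2))
    samples = latent @ rng.normal(size=(2, 15)) + 0.01 * rng.normal(size=(500, 15))
    mean, W = principal_components(samples, k=2)
    print(project(samples, mean, W).shape)
```

The reduced features would then replace the original 15 features as inputs to the policy representation.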
Summary • Can get excellent performance in sequential decision process using few features • Feature selection algorithm tells us which features: • of the state are most important for decision making • of recent past are worth remembering • Can allow us to deal effectively with high-dimensional spaces
Factored MDPs (Figure: variables X, Y, Z at time t and X’, Y’, Z’ at time t+1, with sub-rewards R1 and R2) • Total reward adding sub-rewards: R = R1 + R2 • Actions only make local changes to transition model
Decomposable Value Functions • Linear combination of restricted domain functions [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99,’00] • Each h_i is the status of some small part(s) of a complex system • status of a machine • inventory of a store (Figure: matrix A with K basis-function columns and 2^n state rows; row for s1: h1(s1) h2(s1) ...; row for s2: h1(s2) h2(s2) ...)
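In symbols, the decomposition can be written as below. The weights w_i and the variable subsets C_i (the restricted domain of each basis function) are notation introduced here for illustration.

```latex
V(s) \;\approx\; \sum_{i=1}^{K} w_i \, h_i\bigl(s[C_i]\bigr),
\qquad\text{i.e.}\qquad
V \approx A w, \quad A \in \mathbb{R}^{2^n \times K}, \quad A_{s,i} = h_i(s).
```

Only the K weights need to be stored and optimized, even though A nominally has one row per state.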
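Approach I: Policy Iteration • With an explicit value function: guess V0; πt = greedy(Vt); Vt+1 = value of acting on πt • With a linear approximation A w: guess w0; πt = greedy(A wt); A wt+1 ≈ value of acting on πt • Matrix dimensions shown: (2^n × 1), (2^n × 2^n), (2^n × 1) • Approximate value of acting on π: innovative approach based on linear programming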
Network Management Problem (Figure: network topologies around a server: Bidirectional Ring, Ring and Star, Star, 3 Legs, Ring of Rings) • Computers connected in a network • Each computer can fail with some probability • If a computer fails, it increases the probability the neighbors will fail • At every time step, the sys-admin must decide which computer to fix
Experimental Results (Plots: running time; error) • Error remains bounded • Runs in time O(n^2), not O((2^n)^3)
Approach II: Linear Programming Exponential number of variables! Exponential number of constraints! • Find the optimal value function by linear programming
Approximate LP formulation Small number of variables. Still exponential number of constraints. • Approximate LP solution to value function using value function approximation: • Error analysis by [De Farias & Van Roy, ‘01] Factored MDPs allow for compact, closed form representation of constraints!
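For reference, the textbook linear program for the optimal value function and its approximate counterpart can be written as below. This is the standard formulation rather than a transcription of the slide's missing formula; the state weights α(s) > 0, the discount factor γ, and the state-based reward R(s) are notation assumed here.

```latex
% Exact LP: one variable V(s) per state, one constraint per (s, a) pair
\min_{V} \; \sum_{s} \alpha(s)\, V(s)
\quad \text{s.t.} \quad
V(s) \;\ge\; R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')
\qquad \forall s,\, a

% Approximate LP: substitute V(s) = \sum_i w_i h_i(s); only K variables remain
\min_{w} \; \sum_{s} \alpha(s) \sum_{i=1}^{K} w_i h_i(s)
\quad \text{s.t.} \quad
\sum_{i=1}^{K} w_i h_i(s) \;\ge\; R(s) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{i=1}^{K} w_i h_i(s')
\qquad \forall s,\, a
```

The substitution shrinks the variable count from one per state to K, while the constraint set still ranges over every state and action, which is the remaining difficulty the slide points to.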
Approx LP vs Policy Iteration (Plots: running time; value of final policy) Note: state space sizes up to 2^32!!
Summary • Factored representation of system dynamics allows representation of very complex systems • We use value functions that approximate value as sum of values of system components • System components can overlap • Very natural approximation that people also use • Allows very efficient algorithms for sequential decision making in structured complex systems • Collaborative multi-agent decision making • State space size 4.3×10^28 (30 agents in parallel)!!
Research Themes • Decision making in high-dimensional spaces • Feature selection [Guestrin, Ormoneit] • Factored models [Guestrin, Parr, K.] • Decision making in multi-agent settings • Inferring preferences from behavior [Chajewska, Ormoneit, K.] • Strategic interactions [Milch, K.] • Hybrid (discrete/continuous) models [Lerner, Parr, K.] • Reasoning in complex multi-entity domains [Getoor, Segal, Taskar, K.] • Learning probabilistic models from data [Tong, K.]
Motivation • The enemy is also acting in the battlespace • He is also a rational agent, making decisions to optimize his goals • To act optimally in presence of other intelligent agent, need to figure out what he wants • We address two issues: • Figuring out utility functions • Both ours and the enemy’s • Acting optimally in context of strategic interaction
Example Decision Task (Figure: influence diagram with nodes Mother’s age, Down’s syndrome, Test, Test result, Abortion, Miscarriage, Loss of Fetus, Future pregnancy, Knowledge, Utility) • Chance nodes: as in a Bayesian network • Decision nodes: parents are observed prior to decision • Utility nodes: deterministic real-valued function of parents
Utility — A Random Variable Main idea: Express uncertainty over user’s utilities as a probability distribution p(U) Utility(o1) Utility(o2)
Incorporating Information • We start with a prior over utilities • As we observe behavior or ask questions, we obtain constraints • We condition distribution on constraints to obtain informed posterior • Posterior cannot be represented in closed form • Use MCMC sampling to generate “prototypical” utilities
Partial Utility Elicitation (Flowchart) • Compute optimal policy based on current p(U) • Expected regret low enough? (Yes / No) • If no: ask question with highest value of information • Condition p(U) on the answer and repeat
Experimental Results I (Plot: Predicted and Actual Regret; x-axis: number of questions, 1 to 10; y-axis: regret, 0 to 0.03; series: age 20 predicted, age 20 actual, age 40 actual)
Experimental Results II Q1 Q2 Q3 avg over 15 0.15 - 0.20 0.09 - 0.22 0.07 - 0.26 0.11 - 0.20 Target loss = 0.01 Number of questions asked Utility loss at end Distance from indifference point
Inferring Utilities [Ng & Russell, 2000] (Figure: decision tree with the agent’s decision d1 or d2, nature’s move with probabilities p1, 1-p1 and p2, 1-p2, and outcomes o1, o2, o3, o4) Given • probability distribution over events • observed decision sequence Compute • utility values for outcomes • Knowledge about another agent’s utility function • gives us insight about the agent • allows us to predict his future actions • enables us to optimize our own actions in non-cooperative situations • Assume the observed agent is rational: acts to maximize expected utility
Example: Online Bookseller Moves: • Bookseller: strategic player • Customer: oblivious player • Nature (distribution known to both players) (Figure: game tree with yes/no moves Sign up for e-mail, Offer discount, Buy, Enjoy, and leaves l1 to l12) • U(l1) = u(enjoy) + u(discount-price) + u(e-mail) + u(bargain) • U(l5) = u(hate) + u(full-price) + u(e-mail)
Partial Strategy (Figure: game tree over Sign up for e-mail, Offer discount, Buy, Enjoy, with leaf utilities U(l1) to U(l6)) • U(E1) = U(l1)*P(l1) + U(l2)*P(l2) • U(E2) = U(l4)*P(l4) + U(l5)*P(l5) • U(B1) = max(U(E1), U(l3)) • U(B2) = max(U(E2), U(l6)) • U(E1) ≥ U(l3)? • U(B1) = U(E1)
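The backup rules on this slide (expectation at chance nodes, maximum at decision nodes) can be sketched as follows. All numeric leaf utilities and probabilities are hypothetical; only the node names and formulas come from the slide.

```python
def chance_value(leaf_utils, leaf_probs):
    """Expected utility at a chance node: sum of U(l) * P(l)."""
    return sum(u * p for u, p in zip(leaf_utils, leaf_probs))


def decision_value(*options):
    """Utility at a decision node of a rational agent: the best option."""
    return max(options)


# Hypothetical numbers for leaves l1, l2 (after buying) and l3 (not buying).
U = {"l1": 1.0, "l2": -0.5, "l3": 0.0}
P = {"l1": 0.8, "l2": 0.2}

U_E1 = chance_value([U["l1"], U["l2"]], [P["l1"], P["l2"]])
U_B1 = decision_value(U_E1, U["l3"])

# Observing that the customer buys is only consistent with U(E1) >= U(l3).
buy_is_rational = U_E1 >= U["l3"]

if __name__ == "__main__":
    print(U_E1, U_B1, buy_is_rational)
```

Read in reverse, this is how an observed choice yields a constraint: if the customer is seen to buy, the unknown utilities must satisfy U(E1) ≥ U(l3).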
Behavior Implies Bounds U(O1) ≥ U(O2) U(B2)=i minc (i,c)ui,c U(B2)=i maxc(i,c)ui,c U(B1) =U(E1) U(B1) =U(E1) U(E1) = c p(c)U(c)U(E1) = c p(c)U(c) U(l1) = U(l1) = U(l1)= i i ui Sign up for e-mail U(E1) ≥ U(l3) Offer discount Buy Buy Enjoy Enjoy U(l5) U(l3) U(l4) U(l1) U(l2) U(l6)
Constraints in the Utility Space (Figure: feasible region in the plane of utilities uo and uo’) • Our linear constraints form a convex region which contains all consistent utility functions • Which one should we choose? • [Ng & Russell, 2000] propose heuristics for selecting “natural” utility functions
Feasible Region (Figures: projection onto the u1-u2 plane, after 1 observation and after 17 observations)
Predicting Using Learned Utility (Plot: distance, 0 to 1.6, against number of observations, 0 to 300; series: predicting utility function, predicting strategy)
Strategizing based on Learned Utility (Plot: utility, 0.224 to 0.24, against number of observations, 0 to 80; series: actual utility obtained, utility obtained by following the optimal strategy)
Summary • Utilities can be treated as a “random variable” • Distribution over utilities can be learned from population • Observations of behavior and/or answers to questions “narrow down” distribution • This approach can be used: • To facilitate utility elicitation in cooperative setting • To determine another agent’s utility and act accordingly
Road Example (Figure: nodes Suitability, Building, and Util for each of the plots 1W, 1E, 2W, 2E, 3W, 3E)
Compactness • Assume all variables have three values • Each decision node observes three variables • Number of information sets per agent: 3^3 = 27 • Size of MAID: 54n • n chance nodes of “size” 3 • n decision nodes of “size” 27·3 • Size of game tree: 3^(2n) • 2n splits, each over three values • Size of normal (matrix) form: (3^27)^n • n players, each with 3^27 pure strategies
Decision Making: Single Agent (Figure: influence diagram with nodes Burglary, Earthquake, Alarm, Newscast, PhoneCall, Go Home, Goods, Recovery, Meeting, Sale) • Need to choose d ∈ Val(D) for every information set u ∈ Val(Parents(D)) • home/stay for every value of PhoneCall • Compute expectation of utility nodes in distribution conditioned on d, u • For each u choose d that maximizes expected utility
Strategic Relevance (Figure: decisions D and D’) Question: What do we need to know in order to compute utility-maximizing strategy at D? • Need to compute expected utility for decisions d ∈ Val(D) given information u ∈ Val(Parents(D)) • Intuitively, D relies on D’ if we need to know the decision rule at D’ in order to optimize decision rule at D. • We define a relevance graph, with: • a node for each decision • an edge from D to D’ if D relies on D’ • We provide sound & complete procedure for determining strategic relevance using only graph structure • Can build relevance graph in quadratic time
Examples I: Information (Figure: four example structures over decisions D, D’ and utility U, labeled don’t care, simultaneous move, perfect info, perfect enough; a second row shows D and D’ for each case)
Examples II: Card Game (Figure: nodes Deal, Bet1, Bet2, U, and a second graph over Bet1 and Bet2) • Bet2 relies on Bet1 even though Bet2 observes Bet1 • Bet2 can depend on Deal • Deal influences U • Need probability model of Bet2 to derive posterior on Deal and compute expectation over U • Decision D can rely on D’ even if D’ is observed at D!
Solving Games (Figure: road example decisions 1W, 1E, 2W, 2E, 3W, 3E, shown in groups) • Nash equilibrium: • Ascribes strategy for all agents in game • Rational game-theoretic solution concept • Structured algorithm for computing equilibrium • Find minimal set of decisions that rely only on each other • Find equilibrium for subgame over these decisions • Fix their strategy to be selected equilibrium • Theorem: The result is an equilibrium for the whole game
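The ordering step of the structured algorithm might be sketched as below: given a relevance graph with an edge from D to D’ when D relies on D’, repeatedly pick a set of mutually reliant decisions whose remaining dependencies have already been fixed. The example graph and all names are hypothetical, and solving each subgame for an equilibrium is not shown.

```python
def reachable(graph, start):
    """All decisions reachable from `start` along 'relies on' edges."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def solve_order(graph):
    """Group decisions into mutually reliant sets, ordered so that each set
    relies only on itself and on sets that appear earlier in the list."""
    reach = {d: reachable(graph, d) for d in graph}
    remaining, order = set(graph), []
    while remaining:
        for d in sorted(remaining):
            # Decisions that d relies on and that also rely on d.
            component = {e for e in reach[d] if d in reach[e]}
            # Ready once everything else d relies on has been fixed.
            if (reach[d] - component).isdisjoint(remaining):
                order.append(sorted(component))
                remaining -= component
                break
    return order


# Hypothetical relevance graph: graph[D] lists the decisions D relies on.
example = {"D1": ["D2"], "D2": ["D3"], "D3": ["D2"], "D4": []}

if __name__ == "__main__":
    print(solve_order(example))
```

Each returned group corresponds to one subgame; a group of size one is an ordinary single-decision optimization once the earlier strategies are fixed.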
Experiment: “Road” Example Reminder, for n=4: Tree size: 6561 nodes Matrix size: 4.7×10^27 For n=40: Tree size: 1.47×10^38 nodes
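The tree sizes quoted here follow from the count of 2n splits over three values each. A short check, with function names chosen for illustration:

```python
def information_sets(observed_vars=3, values=3):
    """Information sets of a decision that observes three 3-valued variables."""
    return values ** observed_vars


def tree_size(n, values=3):
    """Game tree with 2n splits, each over `values` values."""
    return values ** (2 * n)


if __name__ == "__main__":
    print(information_sets())       # information sets per agent
    print(tree_size(4))             # tree size for n = 4
    print(f"{tree_size(40):.3e}")   # tree size for n = 40
```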
Summary • Multi-agent influence diagrams: • Compact intuitive language for multi-agent interactions • Like Bayesian nets for multi-agent setting • MAIDs elucidate important qualitative structure: • How different decisions interact • Can exploit structure to find strategies efficiently • Sometimes exponentially faster than existing algorithms
Conclusions & Future Directions • Goal: • Deal with complex decision problems, involving multiple agents moving across space & time • Progress: • Substantial scaling up of decision problems solved in single agent case • New ideas for dealing with multiple agents • Some current directions: • Multi-agent MDPs • Object-relational decision-making problems • Domains where things evolve at different time scales
Related Publications (2001) • “Feature Selection for Reinforcement Learning”, C. Guestrin & D. Ormoneit. Submitted. • “Max-norm Projections for Factored MDPs”, C. Guestrin, D. Koller, and R. Parr. To appear IJCAI 2001. • “Cooperative Multiagent Planning with Factored MDPs”, C. Guestrin, D. Koller, and R. Parr. Submitted. • “Learning an Agent's Utility Function by Observing Behavior”, U. Chajewska, D. Koller, & D. Ormoneit. To appear ICML 2001. • “Multi-Agent Influence Diagrams for Representing and Solving Games”, D. Koller and B. Milch. To appear IJCAI 2001. • Plus: • Two papers on hybrid models • Three papers on object-relational models • Two papers on learning Bayesian networks from data