Presentation Transcript


1. CS 440 / ECE 448 Introduction to Artificial Intelligence, Spring 2010, Lecture #25. Instructor: Eyal Amir. Grad TAs: Wen Pu, Yonatan Bisk. Undergrad TAs: Sam Johnson, Nikhil Johri

2. What We Have Done So Far [course-map figure spanning AI, Machine Learning, Knowledge Representation, Sense, Search, Reason, and Decision Making; topics covered so far: SAT, Planning, Game Search, FOL, Resolution, HMMs, NB, DT, BNs, Var. Elim., Sampling]

3. What We Have Done So Far [same course-map figure, now adding Vision and MDPs]

4. Utility-Based Agent [figure: an agent (marked "?") interacting with its environment through sensors and actuators]

5. Non-deterministic vs. Probabilistic Uncertainty
• Non-deterministic model: set of possible outcomes {a, b, c}; choose the decision that is best for the worst case (~ adversarial search)
• Probabilistic model: outcomes with probabilities {a(pa), b(pb), c(pc)}; choose the decision that maximizes expected utility value

6. Expected Utility
• Random variable X with n values x1, …, xn and distribution (p1, …, pn). E.g., X is the state reached after doing an action A under uncertainty
• Function U of X. E.g., U is the utility of a state
• The expected utility of A is EU[A] = Σ_{i=1..n} P(xi | A) U(xi)
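To make the formula concrete, here is a minimal Python sketch of the expected-utility computation; the dictionaries and the function name expected_utility are my own, with the numbers taken from the example on the next slide.

```python
# EU[A] = sum_i P(x_i | A) * U(x_i)

def expected_utility(outcome_probs, utility):
    """outcome_probs: {outcome: P(outcome | A)}, utility: {outcome: U(outcome)}."""
    return sum(p * utility[x] for x, p in outcome_probs.items())

# Outcome distribution and utilities from the one-action example (slide 7)
probs = {"s1": 0.2, "s2": 0.7, "s3": 0.1}
util = {"s1": 100, "s2": 50, "s3": 70}
print(expected_utility(probs, util))  # 62.0
```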

7. One State/One Action Example [figure: action A1 from s0 reaches s1, s2, s3 with probabilities 0.2, 0.7, 0.1 and utilities 100, 50, 70]
EU(A1) = 100 × 0.2 + 50 × 0.7 + 70 × 0.1 = 20 + 35 + 7 = 62

8. One State/Two Actions Example [figure: A1 as before; a second action A2 from s0 with outcome probabilities 0.2 and 0.8, and an additional state s4 with utility 80]
• EU(A1) = 62
• EU(A2) = 74
• EU(s0) = max{EU(A1), EU(A2)} = 74

9. Introducing Action Costs [same figure, with cost 5 for A1 and cost 25 for A2]
• EU(A1) = 62 - 5 = 57
• EU(A2) = 74 - 25 = 49
• EU(s0) = max{EU(A1), EU(A2)} = 57
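The numbers on slides 7-9 can be reproduced with a short script. One caveat: the transcript does not say explicitly which states A2 reaches, so the 0.2/0.8 split over s2 and s4 below is an assumption chosen to match the stated EU(A2) = 74.

```python
# MEU with action costs: EU(a) = sum_s' P(s' | a) * U(s') - cost(a)
actions = {
    "A1": {"probs": {"s1": 0.2, "s2": 0.7, "s3": 0.1}, "cost": 5},
    "A2": {"probs": {"s2": 0.2, "s4": 0.8}, "cost": 25},  # assumed outcome states
}
utility = {"s1": 100, "s2": 50, "s3": 70, "s4": 80}

def eu(action):
    spec = actions[action]
    return sum(p * utility[s] for s, p in spec["probs"].items()) - spec["cost"]

for a in actions:
    print(a, eu(a))            # A1 57.0, A2 49.0
print(max(actions, key=eu))    # MEU choice at s0: A1
```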

10. MEU Principle
• A rational agent should choose the action that maximizes the agent's expected utility
• This is the basis of the field of decision theory
• It is the normative criterion for rational choice of action
AI is Solved!!!

11. Not quite…
• Must have a complete model of:
  • Actions
  • Utilities
  • States
• Even if you have a complete model, it will be computationally intractable
• In fact, a truly rational agent takes into account the utility of reasoning as well (bounded rationality)
• Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision-theoretic problems than ever before

12. We'll look at
• Decision-theoretic planning
• Simple decision making (ch. 16)
• Sequential decision making (ch. 17)

13. Stochastic Systems
• Stochastic system: a triple Σ = (S, A, P)
  • S = finite set of states
  • A = finite set of actions
  • Pa(s′ | s) = probability of going to s′ if we execute a in s
  • Σ_{s′∈S} Pa(s′ | s) = 1
• Several different possible action representations, e.g.:
  • Bayes networks, probabilistic operators
  • Situation calculus with stochastic (nature) effects
  • Explicit enumeration of each Pa(s′ | s)
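As a sketch of the "explicit enumeration" option, Pa(s′ | s) can be stored as a nested dictionary keyed by state and action. The fragment below is illustrative: the 0.8/0.2 outcome of move(r1,l2,l3) is taken from slide 19, the rest is assumed.

```python
# Stochastic system Sigma = (S, A, P): explicit enumeration of P_a(s' | s)
P = {
    "s1": {"move(r1,l1,l2)": {"s2": 1.0}, "wait": {"s1": 1.0}},
    "s2": {"move(r1,l2,l3)": {"s3": 0.8, "s5": 0.2}, "wait": {"s2": 1.0}},
}

def check_distributions(P, tol=1e-9):
    """Every P_a(. | s) must sum to 1 over the successor states."""
    for s, acts in P.items():
        for a, dist in acts.items():
            assert abs(sum(dist.values()) - 1.0) < tol, (s, a)

check_distributions(P)
print(P["s2"]["move(r1,l2,l3)"]["s3"])  # 0.8
```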

14. Example [figure: robot navigation domain; Start and Goal locations marked]
• Robot r1 starts at location l1
• Objective is to get r1 to location l4

15. Example [same figure]
• Robot r1 starts at location l1
• Objective is to get r1 to location l4
• No classical sequence of actions works as a solution

16. Policies
• Policy: a function that maps states into actions
• Write it as a set of state-action pairs
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}

17. Initial States
• In general, there is no single initial state
• For every state s, we start at s with probability P(s)
• In the example, P(s1) = 1, and P(s) = 0 for all other states

18. Histories
• History: a sequence of system states h = ⟨s0, s1, s2, s3, s4, …⟩
  h0 = ⟨s1, s3, s1, s3, s1, …⟩
  h1 = ⟨s1, s2, s3, s4, s4, …⟩
  h2 = ⟨s1, s2, s5, s5, s5, …⟩
  h3 = ⟨s1, s2, s5, s4, s4, …⟩
  h4 = ⟨s1, s4, s4, s4, s4, …⟩
  h5 = ⟨s1, s1, s4, s4, s4, …⟩
  h6 = ⟨s1, s1, s1, s4, s4, …⟩
  h7 = ⟨s1, s1, s1, s1, s1, …⟩
• Each policy induces a probability distribution over histories
• If h = ⟨s0, s1, s2, …⟩ then P(h | π) = P(s0) ∏_{i ≥ 0} P_{π(si)}(si+1 | si)
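The product formula above translates directly into code over a finite prefix of a history. The sketch below reuses the nested-dictionary representation assumed earlier; the transition fragment is reconstructed from the probabilities on slide 19, so treat it as illustrative.

```python
# P(h | pi) = P(s0) * prod_{i >= 0} P_{pi(s_i)}(s_{i+1} | s_i), over a finite prefix

def history_prob(prefix, policy, P, P0):
    """Probability of a finite history prefix under the given policy."""
    prob = P0.get(prefix[0], 0.0)
    for s, s_next in zip(prefix, prefix[1:]):
        prob *= P.get(s, {}).get(policy[s], {}).get(s_next, 0.0)
    return prob

P0 = {"s1": 1.0}
P = {"s1": {"move(r1,l1,l2)": {"s2": 1.0}},
     "s2": {"move(r1,l2,l3)": {"s3": 0.8, "s5": 0.2}},
     "s3": {"move(r1,l3,l4)": {"s4": 1.0}},
     "s4": {"wait": {"s4": 1.0}},
     "s5": {"wait": {"s5": 1.0}}}
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

print(history_prob(["s1", "s2", "s3", "s4", "s4"], pi1, P, P0))  # 0.8 (h1)
print(history_prob(["s1", "s2", "s5", "s5"], pi1, P, P0))        # 0.2 (h2)
```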

19. Example
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = ⟨s1, s2, s3, s4, s4, …⟩   P(h1 | π1) = 1 × 0.8 × 1 × … = 0.8
h2 = ⟨s1, s2, s5, s5, …⟩      P(h2 | π1) = 1 × 0.2 × 1 × … = 0.2
For all other h: P(h | π1) = 0

20. Markov Decision Processes
• An MDP has four components, ⟨S, A, R, Pr⟩:
  • (finite) state set S (|S| = n)
  • (finite) action set A (|A| = m)
  • transition function Pr(s, a, t)
    • each Pr(s, a, ·) is a distribution over S
    • represented by a set of n × n stochastic matrices
  • bounded, real-valued reward function R(s)
    • represented by an n-vector
    • can be generalized to include action costs: R(s, a)
    • can be stochastic (but replaceable by its expectation)
• Model easily generalizable to countable or continuous state and action spaces
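One straightforward way to hold these four components in code is an n-vector for R and one n × n stochastic matrix per action; a minimal numpy sketch (the class name MDP and the tiny numbers are mine, not from the slides):

```python
import numpy as np

class MDP:
    """<S, A, R, Pr> with |S| = n, |A| = m: Pr as m stochastic n x n matrices,
    R as an n-vector (R(s) form; R(s, a) would need an m x n array instead)."""
    def __init__(self, Pr, R):
        self.Pr = np.asarray(Pr, dtype=float)   # shape (m, n, n)
        self.R = np.asarray(R, dtype=float)     # shape (n,)
        assert np.allclose(self.Pr.sum(axis=2), 1.0), "each row must sum to 1"
        self.m, self.n, _ = self.Pr.shape

# Tiny illustrative 2-state, 2-action MDP
mdp = MDP(Pr=[[[0.9, 0.1], [0.0, 1.0]],
              [[0.2, 0.8], [0.5, 0.5]]],
          R=[0.0, 1.0])
print(mdp.n, mdp.m)  # 2 2
```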

21. Assumptions
• Markovian dynamics (history independence)
  • Pr(S_{t+1} | A_t, S_t, A_{t-1}, S_{t-1}, …, S_0) = Pr(S_{t+1} | A_t, S_t)
• Markovian reward process
  • Pr(R_t | A_t, S_t, A_{t-1}, S_{t-1}, …, S_0) = Pr(R_t | A_t, S_t)
• Stationary dynamics and reward
  • Pr(S_{t+1} | A_t, S_t) = Pr(S_{t'+1} | A_{t'}, S_{t'}) for all t, t'
• Full observability
  • though we can't predict which state we will reach when we execute an action, once it is realized we know what it is

22. Policies
• Nonstationary policy
  • π: S × T → A
  • π(s, t) is the action to do at state s with t stages to go
• Stationary policy
  • π: S → A
  • π(s) is the action to do at state s (regardless of time)
  • analogous to a reactive or universal plan
• These assume or have these properties:
  • full observability
  • history independence
  • deterministic action choice

23. Finite Horizon Problems
• Utility (value) depends on stage-to-go
  • hence so should the policy: nonstationary π(s, k)
• V_π^k(s) is the k-stage-to-go value function for π: V_π^k(s) = E[ Σ_{t=0..k} R^t | π, s ]
• Here R^t is a random variable denoting the reward received at stage t

24. Value Iteration (Bellman 1957)
• The Markov property allows exploitation of the DP principle for optimal policy construction
  • no need to enumerate the |A|^{Tn} possible policies
• Value iteration:
  V^0(s) = R(s)
  V^k(s) = R(s) + max_a Σ_{s'} Pr(s, a, s') V^{k-1}(s')   (Bellman backup)
  π*(s, k) = argmax_a Σ_{s'} Pr(s, a, s') V^{k-1}(s')
• V^k is the optimal k-stage-to-go value function
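A compact sketch of finite-horizon value iteration under the matrix representation assumed earlier (function name and example numbers are mine); the intermediate Q array is the per-action backup made explicit on the next slide.

```python
import numpy as np

def finite_horizon_vi(Pr, R, T):
    """Pr: (m, n, n) transition matrices, R: (n,) rewards, T: horizon.
    Returns V[k] (optimal k-stage-to-go values) and the nonstationary policy."""
    V = [R.copy()]                        # V^0(s) = R(s)
    policy = []                           # policy[k-1][s]: action with k stages to go
    for k in range(1, T + 1):
        Q = R[None, :] + Pr @ V[-1]       # Q[a, s] = R(s) + sum_s' Pr(s,a,s') V^{k-1}(s')
        policy.append(Q.argmax(axis=0))   # pi*(s, k)
        V.append(Q.max(axis=0))           # V^k(s) = max_a Q[a, s]  (Bellman backup)
    return V, policy

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
V, pi = finite_horizon_vi(Pr, R, T=3)
print(V[3], pi[2])  # 3-stage-to-go values and the corresponding greedy actions
```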

25. Value Iteration
• Note how DP is used
  • the optimal solution to the (k-1)-stage problem can be used without modification as part of the optimal solution to the k-stage problem
• Because of the finite horizon, the policy is nonstationary
• In practice, the Bellman backup is computed using:
  Q^k_a(s) = R(s) + Σ_{s'} Pr(s, a, s') V^{k-1}(s')
  V^k(s) = max_a Q^k_a(s)

26. Summary So Far
• The resulting policy is optimal
  • convince yourself of this; convince yourself that non-Markovian and randomized policies are not necessary
• Note: the optimal value function is unique, but the optimal policy is not

27. Discounted Infinite Horizon MDPs
• Total reward is problematic (usually)
  • many or all policies have infinite expected reward
  • some MDPs are OK (e.g., with zero-cost absorbing states)
• "Trick": introduce a discount factor 0 ≤ β < 1
  • future rewards discounted by β per time step
• Note: with discounting, the value of an infinite history, Σ_{t≥0} β^t R^t, is finite (bounded by R_max / (1 - β))
• Motivation: economic? failure probability? convenience?

28. Some Notes
• The optimal policy maximizes value at each state
• Optimal policies are guaranteed to exist (Howard 1960)
• Can restrict attention to stationary policies
  • why change the action at state s at a new time t?
• We define V*(s) = V_π(s) for some optimal π

29. Value Equations (Howard 1960)
• Value equation for a fixed policy's value:
  V_π(s) = R(s) + β Σ_{s'} Pr(s, π(s), s') V_π(s')
• Bellman equation for the optimal value function:
  V*(s) = R(s) + β max_a Σ_{s'} Pr(s, a, s') V*(s')

30. Backup Operators
• We can think of the fixed-policy equation and the Bellman equation as operators in a vector space
  • e.g., L_a(V) = V' = R + β P_a V
  • V_π is the unique fixed point of the policy backup operator L_π
  • V* is the unique fixed point of the Bellman backup L*
• We can compute V_π easily: policy evaluation
  • a simple linear system with n variables and n constraints
  • solve V = R + β P_π V
• Cannot do this for the optimal policy
  • the max operator makes things nonlinear
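Policy evaluation as described above is an n × n linear solve; a sketch in numpy under the matrix representation assumed earlier (names and numbers are illustrative):

```python
import numpy as np

def evaluate_policy(Pr, R, policy, beta):
    """Solve V = R + beta * P_pi V, i.e. (I - beta * P_pi) V = R."""
    n = R.shape[0]
    P_pi = Pr[policy, np.arange(n), :]   # row s comes from the matrix of action policy[s]
    return np.linalg.solve(np.eye(n) - beta * P_pi, R)

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
print(evaluate_policy(Pr, R, policy=np.array([0, 1]), beta=0.9))
```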

31. Value Iteration
• Can compute the optimal policy using value iteration, just as in finite-horizon problems (just include the discount term):
  V^k(s) = R(s) + β max_a Σ_{s'} Pr(s, a, s') V^{k-1}(s')
• No need to store the argmax at each stage (the optimal policy is stationary)

32. Convergence
• L(V) is a contraction mapping in R^n
  • ||LV - LV'|| ≤ β ||V - V'||
• When to stop value iteration? When ||V^k - V^{k-1}|| ≤ ε
  • ||V^{k+1} - V^k|| ≤ β ||V^k - V^{k-1}||
  • this ensures ||V^k - V*|| ≤ εβ / (1 - β)
• Convergence is assured
  • for any guess V: ||V* - L*V|| = ||L*V* - L*V|| ≤ β ||V* - V||
  • so fixed-point theorems ensure convergence
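Putting slides 31-32 together, a minimal sketch of discounted value iteration with the ||V^{k+1} - V^k|| ≤ ε stopping test (same assumed representation and example numbers as before):

```python
import numpy as np

def value_iteration(Pr, R, beta, eps=1e-6):
    """Iterate V <- R + beta * max_a (Pr_a V) until the sup-norm change is <= eps."""
    V = R.copy()
    while True:
        V_new = R + beta * (Pr @ V).max(axis=0)   # Bellman backup at every state
        if np.max(np.abs(V_new - V)) <= eps:      # stopping test from slide 32
            return V_new
        V = V_new

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
print(value_iteration(Pr, R, beta=0.9))
```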

33. How to Act
• Given V* (or an approximation), use the greedy policy:
  π*(s) = argmax_a Σ_{s'} Pr(s, a, s') V*(s')
  • if V is within ε of V*, then the value of the greedy policy is within 2εβ/(1 - β) of V*
• There exists an ε such that the optimal policy is returned
  • even if the value estimate is off, the greedy policy is optimal
  • proving you are optimal can be difficult (methods like action elimination can be used)
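Extracting the greedy policy from a value estimate V is a one-liner under the assumed representation; adding R(s) and the factor β inside the argmax does not change it, so the sketch below matches the form above (names and numbers are mine):

```python
import numpy as np

def greedy_policy(Pr, R, V, beta):
    """pi(s) = argmax_a [ R(s) + beta * sum_s' Pr(s, a, s') V(s') ]."""
    Q = R[None, :] + beta * (Pr @ V)   # Q[a, s]
    return Q.argmax(axis=0)

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
V = np.array([5.0, 9.0])               # e.g., an approximation of V*
print(greedy_policy(Pr, R, V, beta=0.9))
```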

34. Policy Iteration
• Given a fixed policy π, we can compute its value exactly:
  V_π(s) = R(s) + β Σ_{s'} Pr(s, π(s), s') V_π(s')
• Policy iteration exploits this:
  1. Choose a random policy π
  2. Loop:
     (a) Evaluate V_π
     (b) For each s in S, set π'(s) = argmax_a Σ_{s'} Pr(s, a, s') V_π(s')
     (c) Replace π with π'
     Until no improving action is possible at any state
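A sketch of the loop above, with step 2(a) done as an exact linear solve and 2(b) as a greedy improvement at every state (same assumed matrix representation and names as in the earlier sketches):

```python
import numpy as np

def policy_iteration(Pr, R, beta):
    m, n, _ = Pr.shape
    pi = np.zeros(n, dtype=int)                        # 1. start from some policy
    while True:
        P_pi = Pr[pi, np.arange(n), :]                 # 2(a) evaluate V_pi exactly
        V = np.linalg.solve(np.eye(n) - beta * P_pi, R)
        Q = R[None, :] + beta * (Pr @ V)               # 2(b) improve at each state
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):                 # 2(c) stop: no improving action
            return pi, V
        pi = pi_new

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
print(policy_iteration(Pr, R, beta=0.9))
```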

35. Policy Iteration Notes
• Convergence assured (Howard)
  • intuitively: there are no local maxima in value space, and each policy must improve value; since there is a finite number of policies, PI converges to an optimal policy
• Very flexible algorithm
  • need only improve the policy at one state (not every state)
• Gives the exact value of the optimal policy
• Generally converges much faster than VI
  • each iteration is more complex, but fewer iterations are needed
  • quadratic rather than linear rate of convergence

36. Modified Policy Iteration
• MPI is a flexible alternative to VI and PI
• Run PI, but don't solve the linear system to evaluate the policy; instead, do several iterations of successive approximation to evaluate it
• You can run SA until near convergence
  • but in practice you often need only a few backups to get an estimate of V_π good enough to allow improvement in π
  • quite efficient in practice
  • choosing the number of SA steps is a practical issue
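Relative to the policy-iteration sketch, the only change is replacing the exact linear solve with a handful of successive-approximation backups; k_eval below is the practical knob the slide refers to. This is one common way to arrange MPI, not necessarily the slide's exact scheme.

```python
import numpy as np

def modified_policy_iteration(Pr, R, beta, k_eval=10, eps=1e-6):
    n = R.shape[0]
    V = np.full(n, R.min() / (1.0 - beta))     # start below V*, so backups only improve
    while True:
        Q = R[None, :] + beta * (Pr @ V)       # improvement step (one Bellman backup)
        V_new, pi = Q.max(axis=0), Q.argmax(axis=0)
        if np.max(np.abs(V_new - V)) <= eps:   # stop on a small Bellman residual
            return pi, V_new
        V = V_new
        P_pi = Pr[pi, np.arange(n), :]
        for _ in range(k_eval):                # partial evaluation of V_pi by SA backups
            V = R + beta * P_pi @ V

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
print(modified_policy_iteration(Pr, R, beta=0.9))
```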

37. Asynchronous Value Iteration
• Needn't do full backups of the value function when running VI
• Gauss-Seidel: start with V^k; once you compute V^{k+1}(s), replace V^k(s) before proceeding to the next state (assume some ordering of states)
  • tends to converge much more quickly
  • note: V^k is no longer the k-stage-to-go value function
• AVI: set some V^0; choose a random state s and do a Bellman backup at that state alone to produce V^1; choose another random state, and so on
  • if each state is backed up frequently enough, convergence is assured
  • useful for online algorithms (reinforcement learning)
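A sketch of the Gauss-Seidel variant: sweep the states in a fixed order and overwrite each V(s) in place as soon as its backup is computed, so later backups in the same sweep already see the new values (same assumed representation):

```python
import numpy as np

def gauss_seidel_vi(Pr, R, beta, eps=1e-6):
    n = R.shape[0]
    V = R.copy()
    while True:
        delta = 0.0
        for s in range(n):                                   # fixed ordering of states
            v_new = R[s] + beta * (Pr[:, s, :] @ V).max()    # backup sees updated entries
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                                     # replace in place
        if delta <= eps:
            return V

Pr = np.array([[[0.9, 0.1], [0.0, 1.0]],
               [[0.2, 0.8], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
print(gauss_seidel_vi(Pr, R, beta=0.9))
```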

38. Some Remarks on Search Trees
• Analogy of value iteration to decision trees
  • a decision tree (expectimax search) is really value iteration with computation focused on reachable states
• Real-Time Dynamic Programming (RTDP)
  • simply real-time search applied to MDPs
  • can exploit heuristic estimates of the value function
  • can bound search depth using the discount factor
  • can cache/learn values
  • can use pruning techniques

39. Logical or Feature-based Problems
• AI problems are most naturally viewed in terms of logical propositions, random variables, objects and relations, etc. (logical, feature-based)
• E.g., consider a "natural" specification of the robot example
  • propositional variables: robot's location, Craig wants coffee, tidiness of lab, etc.
  • could easily define things in first-order terms as well
• |S| is exponential in the number of logical variables
  • specification/representation of the problem in state form is impractical
  • explicit state-based DP is impractical
  • Bellman's curse of dimensionality

40. Solution?
• Require structured representations
  • exploit regularities in probabilities, rewards
  • exploit logical relationships among variables
• Require structured computation
  • exploit regularities in policies, value functions
  • can aid in approximation (anytime computation)
• We start with propositional representations of MDPs
  • probabilistic STRIPS
  • dynamic Bayesian networks
  • BDDs/ADDs
