
CS344 : Introduction to Artificial Intelligence


Presentation Transcript


  1. CS344 : Introduction to Artificial Intelligence Pushpak Bhattacharyya, CSE Dept., IIT Bombay Lecture 28 - PAC and Reinforcement Learning

  2. U = the universe, with a probability distribution P over it; C = the target concept, h = the hypothesis. C Δ h = error region (where C and h disagree). Requirement: P(C Δ h) ≤ ε, where ε is the accuracy parameter.

  3. Learning means the following should happen: Pr(P(C Δ h) ≤ ε) ≥ 1 - δ. This is the PAC model of learning: Probably (with probability at least 1 - δ) Approximately (to within error ε) Correct.

  4. (Figure: + and - labelled points in the x-y plane; the axis-parallel rectangle ABCD encloses the + points, with the - points outside.)

  5. Algorithm: 1. Ignore the -ve examples. 2. Output the tightest-fitting axis-parallel rectangle around the remaining (+ve) data.
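A minimal sketch of this algorithm in Python (the function names `fit_rectangle` and `predict` are illustrative, not from the lecture):

```python
# Sketch of the rectangle learner: ignore -ve examples, then take the
# tightest axis-parallel bounding box of the +ve examples.

def fit_rectangle(examples):
    """examples: list of ((x, y), label) pairs with label '+' or '-'.
    Returns (xmin, xmax, ymin, ymax), or None if there are no +ve points."""
    pos = [p for p, label in examples if label == '+']   # step 1: ignore -ve examples
    if not pos:
        return None
    xs, ys = zip(*pos)
    return (min(xs), max(xs), min(ys), max(ys))          # step 2: tightest rectangle

def predict(rect, point):
    """Classify a point as '+' iff it falls inside the learned rectangle."""
    if rect is None:
        return '-'
    xmin, xmax, ymin, ymax = rect
    x, y = point
    return '+' if xmin <= x <= xmax and ymin <= y <= ymax else '-'

# Example usage
data = [((1, 1), '+'), ((2, 3), '+'), ((0, 5), '-'), ((4, 0), '-')]
h = fit_rectangle(data)
print(h, predict(h, (1.5, 2)))   # (1, 2, 1, 3) '+'
```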

  6. We want Pr(P(C Δ h) ≤ ε) ≥ 1 - δ. (Figure: the target rectangle c = ABCD with the learned rectangle h inside it; + points inside h, - points outside c.) Case 1: if P(□ABCD) < ε, then the algorithm is PAC, since the error region is contained in ABCD.

  7. Case 2: P(□ABCD) ≥ ε. (Figure: four strips Top, Bottom, Left, Right along the inside edges of ABCD.) Choose the strips so that P(Top) = P(Bottom) = P(Right) = P(Left) = ε/4.

  8. Let the number of examples be m. • Probability that a point comes from Top = ε/4. • Probability that none of the m examples comes from Top = (1 - ε/4)^m.

  9. The probability that none of the m examples comes from a given one of Top/Bottom/Left/Right is (1 - ε/4)^m, so by the union bound the probability that some strip receives no example is at most 4(1 - ε/4)^m. Hence the probability that every strip receives at least one example is at least 1 - 4(1 - ε/4)^m. (If every strip contains a sample point, the error region c \ h is contained in the union of the four strips, so its probability is at most 4 · ε/4 = ε.)

  10. This event must have probability at least 1 - δ: 1 - 4(1 - ε/4)^m > 1 - δ, or 4(1 - ε/4)^m < δ.

  11. Since (1 - ε/4)^m < e^(-εm/4), it suffices to have 4 e^(-εm/4) < δ, which gives m > (4/ε) ln(4/δ).

  12. Example: say we want 10% error with 90% confidence (ε = 0.1, δ = 0.1). Then m > (4/0.1) ln(4/0.1) = 40 ln 40 ≈ 148, so about 150 examples suffice.
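A quick way to evaluate this bound for other settings of ε and δ (a small helper built directly on the m > (4/ε) ln(4/δ) bound above; the function name is illustrative):

```python
import math

def pac_sample_size(eps, delta):
    """Smallest integer m with m > (4/eps) * ln(4/delta)."""
    return math.floor((4.0 / eps) * math.log(4.0 / delta)) + 1

print(pac_sample_size(0.1, 0.1))    # 148: about 150 examples for 10% error, 90% confidence
print(pac_sample_size(0.05, 0.05))  # smaller eps and delta require more examples
```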

  13. VC-dimension: its finiteness gives a necessary and sufficient condition for PAC learnability.

  14. Definition: Let C be a concept class, i.e., it has members c1, c2, c3, … as concepts in it. (Figure: C drawn as a collection containing the concepts c1, c2, c3.)

  15. Let S be a subset of U (the universe). If every subset of S can be produced by intersecting S with some ci, then we say C shatters S.

  16. The cardinality of the largest set S that can be shattered gives the VC-dimension of C: VC-dim(C) = |S|. VC-dim: Vapnik-Chervonenkis dimension.

  17. Example: the 2-dimensional plane, with concept class C = {half-planes}.

  18. S1 = {a}: both subsets {a} and Ø can be picked out by a half-plane, so |S| = 1 can be shattered.

  19. S2 = {a, b}: the subsets {a, b}, {a}, {b}, Ø can all be picked out, so |S| = 2 can be shattered.

  20. S3 = {a, b, c}: for three non-collinear points, all eight subsets can be picked out, so |S| = 3 can be shattered.


  22. S4 = {a, b, c, d}: some subsets (e.g. two diagonally opposite points of A, B, C, D) cannot be picked out by any half-plane, so |S| = 4 cannot be shattered. Hence the VC-dimension of half-planes in the plane is 3.
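These shattering claims can be checked by brute force: a labelling of a finite point set is realizable by a half-plane w·x + b ≥ 0 exactly when the margin-scaled linear program below is feasible, and S is shattered iff every labelling is realizable. A sketch, assuming SciPy is available; the helper names are illustrative:

```python
# Brute-force shattering check for half-planes in the plane (illustrative sketch).
from itertools import product
from scipy.optimize import linprog

def halfplane_realizable(points, labels):
    """Feasibility LP in (w1, w2, b): require s_i * (w . x_i + b) >= 1,
    where s_i = +1 for '+' points and -1 for '-' points."""
    A_ub, b_ub = [], []
    for (x, y), lab in zip(points, labels):
        s = 1.0 if lab == '+' else -1.0
        A_ub.append([-s * x, -s * y, -s])   # -s*(w1*x + w2*y + b) <= -1
        b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered_by_halfplanes(points):
    """True iff every +/- labelling of the points is realizable by some half-plane."""
    return all(halfplane_realizable(points, labels)
               for labels in product('+-', repeat=len(points)))

print(shattered_by_halfplanes([(0, 0), (1, 0), (0, 1)]))          # True:  |S| = 3
print(shattered_by_halfplanes([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: |S| = 4
```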

  23. Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989) • A concept class C is learnable for all probability distributions and all concepts in C if and only if the VC dimension of C is finite. • If the VC dimension of C is d, then… (next slide)

  24. Fundamental theorem (contd.) (a) For 0 < ε < 1 and sample size at least max[(4/ε) log(2/δ), (8d/ε) log(13/ε)], any consistent function A: S_C → C is a learning function for C. (b) For 0 < ε < 1/2 and sample size less than max[((1 - ε)/ε) ln(1/δ), d(1 - 2(ε(1 - δ) + δ))], no function A: S_C → H, for any hypothesis space H, is a learning function for C.

  25. Papers • 1. A theory of the learnable, L. G. Valiant (1984), Communications of the ACM 27(11):1134-1142. • 2. Learnability and the VC-dimension, A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth, Journal of the ACM, 1989. Book: Computational Learning Theory, M. H. G. Anthony, N. Biggs, Cambridge Tracts in Theoretical Computer Science, 1997.

  26. Introducing Reinforcement Learning

  27. Introduction • Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

  28. Constituents • In RL, no correct/incorrect input/output pairs are given. • The feedback for the learning process is called 'reward' or 'reinforcement'. • In RL we examine how an agent can learn from success and failure, reward and punishment.

  29. The RL framework • The environment is depicted as a finite-state Markov Decision Process (MDP). • The utility U(i) of a state gives the usefulness of that state. • The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

  30. The RL problem • Rewards can be received either in intermediate states or in a terminal state. • Rewards can be a component of the actual utility (e.g. points in a TT match) or they can be hints to the actual utility (e.g. verbal reinforcement). • The agent can be a passive or an active learner.

  31. Passive Learning in a Known Environment In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below. (Figure: the grid-world environment with terminal states [4,2] and [4,3].)

  32. Passive Learning in a Known Environment • The agent can move {North, East, South, West}. • A run terminates on reaching [4,2] or [4,3].

  33. Passive Learning in a Known Environment The agent is provided with a model M, where Mij gives the probability of a transition from state i to state j.

  34. Passive Learning in a Known Environment • The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i. • Utilities can be learned using three approaches: 1) LMS (least mean squares) 2) ADP (adaptive dynamic programming) 3) TD (temporal difference learning)

  35. Passive Learning in a Known Environment LMS (Least Mean Squares) The agent makes random runs (sequences of random moves) through the environment, e.g.: [1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1 [1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1

  36. Passive Learning in a Known Environment LMS • Collect statistics on the final payoff for each state (e.g. when in [2,3], how often was +1 reached vs. -1?). • The learner computes the average for each state. • In the limit, these averages converge to the true expected values (utilities).
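A minimal sketch of this LMS (direct estimation) idea, assuming each training run is recorded as the list of states visited plus its final payoff (a hypothetical data layout, not specified in the slides):

```python
from collections import defaultdict

def lms_utilities(runs):
    """runs: list of (states_visited, final_payoff) pairs.
    Each visited state is credited with the run's final payoff;
    the utility estimate for a state is the average of those payoffs."""
    total, count = defaultdict(float), defaultdict(int)
    for states, payoff in runs:
        for s in states:
            total[s] += payoff
            count[s] += 1
    return {s: total[s] / count[s] for s in total}

# The two runs from the previous slide:
runs = [([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1),
        ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1)]
print(lms_utilities(runs))   # e.g. (1,1) -> 0.0 (seen in both runs), (3,3) -> 1.0
```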

  37. Passive Learning in a Known Environment LMS Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

  38. Passive Learning in a Known Environment ADP (Adaptive Dynamic Programming) Uses the value iteration or policy iteration algorithm to calculate the exact utilities of states given an estimated model.

  39. Passive Learning in a Known Environment ADP In general: U_{n+1}(i) = R(i) + Σj Mij · U_n(j), where - U_n(i) is the utility of state i after the nth iteration, initially set to R(i) - R(i) is the reward for being in state i (often non-zero for only a few end states) - Mij is the probability of a transition from state i to j

  40. Passive Learning in a Known Environment ADP Consider U(3,3): U(3,3) = 1/3 × U(4,3) + 1/3 × U(2,3) + 1/3 × U(3,2) = 1/3 × 1.0 + 1/3 × 0.0886 + 1/3 × (-0.4430) = 0.2152
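A sketch of one ADP sweep using the update from slide 39, with a hypothetical model format M[(i, j)] = transition probability (terminal-state handling is omitted for brevity):

```python
def adp_sweep(U, R, M, states):
    """One synchronous sweep of U_{n+1}(i) = R(i) + sum_j M_ij * U_n(j).
    M[(i, j)] is the probability of moving from i to j (0.0 if absent)."""
    return {i: R[i] + sum(M.get((i, j), 0.0) * U[j] for j in states)
            for i in states}

# Reproducing the U(3,3) computation, each neighbour reached with probability 1/3:
states = [(3, 3), (4, 3), (2, 3), (3, 2)]
U = {(3, 3): 0.0, (4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}
R = {s: 0.0 for s in states}
M = {((3, 3), (4, 3)): 1/3, ((3, 3), (2, 3)): 1/3, ((3, 3), (3, 2)): 1/3}
print(round(adp_sweep(U, R, M, states)[(3, 3)], 4))   # 0.2152
```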

  41. Passive Learning in a Known Environment ADP • makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment • somewhat intractable for large state spaces

  42. Passive Learning in a Known Environment TD (Temporal Difference Learning) The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

  43. Passive Learning in a Known Environment TD Learning • Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5. • This suggests that we should increase U(i) to make it agree better with its successor. • This can be achieved using the following update rule: U_{n+1}(i) = U_n(i) + α(R(i) + U_n(j) - U_n(i)), where α is the learning rate.
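A one-step sketch of this TD update (the learning-rate value below is arbitrary, chosen only to show the direction of the adjustment):

```python
def td_update(U, R, i, j, alpha=0.1):
    """TD update for an observed transition i -> j:
    U(i) <- U(i) + alpha * (R(i) + U(j) - U(i))."""
    U[i] += alpha * (R[i] + U[j] - U[i])
    return U

# The slide's example: U(i) = -0.5, U(j) = +0.5, no intermediate reward.
U = {'i': -0.5, 'j': +0.5}
R = {'i': 0.0, 'j': 0.0}
print(td_update(U, R, 'i', 'j', alpha=0.5))   # U(i) moves toward U(j): -0.5 -> 0.0
```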

  44. Passive Learning in a Known Environment TD Learning Performance: • Runs are “noisier” than LMS, but the error is smaller. • It deals only with states observed during sample runs (not all states, unlike ADP).

  45. Passive Learning in an Unknown Environment LMS approach and TD approach operate unchanged in an initially unknown environment. ADP approach adds a step that updates an estimated model of the environment.

  46. Passive Learning in an Unknown Environment ADP Approach • The environment model is learned by direct observation of transitions • The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours
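A sketch of learning M from observed transitions by counting, as described above (the class and method names are illustrative):

```python
from collections import defaultdict

class TransitionModel:
    """Estimate M_ij as (# observed i -> j transitions) / (# transitions out of i)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, i, j):
        self.counts[i][j] += 1

    def prob(self, i, j):
        total = sum(self.counts[i].values())
        return self.counts[i][j] / total if total else 0.0

model = TransitionModel()
for nxt in [(1, 2), (1, 2), (2, 1)]:      # three transitions observed out of (1,1)
    model.observe((1, 1), nxt)
print(model.prob((1, 1), (1, 2)))          # 0.666...: 2 of the 3 transitions went to (1,2)
```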

  47. Passive Learning in an Unknown Environment ADP & TD Approaches • The ADP approach and the TD approach are closely related • Both try to make local adjustments to the utility estimates in order to make each state “agree” with its successors

  48. Passive Learning in an Unknown Environment Minor differences : • TD adjusts a state to agree with its observed successor • ADP adjusts the state to agree with all of the successors Important differences : • TD makes a single adjustment per observed transition • ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

  49. Passive Learning in an Unknown Environment To make ADP more efficient: • directly approximate the value iteration or policy iteration algorithm • the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates Advantages of the approximate ADP: • efficient in terms of computation • eliminates the long value iterations that occur in the early stages

  50. Active Learning in an Unknown Environment An active agent must consider : • what actions to take • what their outcomes may be • how they will affect the rewards received
