CSCE 580 Artificial Intelligence Ch.18: Learning from Observations. Fall 2008 Marco Valtorta mgv@cse.sc.edu. Acknowledgment. The slides are based on the textbook [AIMA] and other sources, including other fine textbooks and the accompanying slide sets The other textbooks I considered are:
CSCE 580 Artificial Intelligence Ch.18: Learning from Observations Fall 2008 Marco Valtorta mgv@cse.sc.edu
Acknowledgment • The slides are based on the textbook [AIMA] and other sources, including other fine textbooks and the accompanying slide sets • The other textbooks I considered are: • David Poole, Alan Mackworth, and Randy Goebel. Computational Intelligence: A Logical Approach. Oxford, 1998 • A second edition (by Poole and Mackworth) is under development. Dr. Poole allowed us to use a draft of it in this course • Ivan Bratko. Prolog Programming for Artificial Intelligence, Third Edition. Addison-Wesley, 2001 • The fourth edition is under development • George F. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving, Sixth Edition. Addison-Wesley, 2009
Outline • Learning agents • Inductive learning • Decision tree learning
Learning • Learning is essential for unknown environments, • i.e., when designer lacks omniscience • Learning is useful as a system construction method, • i.e., expose the agent to reality rather than trying to write it down • Learning modifies the agent's decision mechanisms to improve performance
Learning element • Design of a learning element is affected by • Which components of the performance element are to be learned • What feedback is available to learn these components • What representation is used for the components • Type of feedback: • Supervised learning: correct answers for each example • Unsupervised learning: correct answers not given • Reinforcement learning: occasional rewards
Inductive learning • Simplest form: learn a function from examples • f is the target function • An example is a pair (x, f(x)) • Problem: find a hypothesis h such that h ≈ f, given a training set of examples • This is a highly simplified model of real learning: • Ignores prior knowledge • Assumes examples are given
Inductive learning method • Construct/adjust h to agree with f on training set • (h is consistent if it agrees with f on all examples) • E.g., curve fitting: • Ockham’s razor: prefer the simplest hypothesis consistent with data
Curve Fitting and Occam’s Razor • Data collected by Galileo in 1608: a ball rolling down an inclined plane, then continuing in free fall • Occam's razor suggests the simpler model is better; it has a higher prior probability • The simpler model may also have a greater posterior probability (the plausibility of the model): Occam’s razor is not only a good heuristic, but can be shown to follow from more fundamental principles • Jefferys, W.H. and Berger, J.O. 1992. Ockham's razor and Bayesian analysis. American Scientist 80:64-72
Learning decision trees Problem: decide whether to wait for a table at a restaurant, based on the following attributes: • Alternate: is there an alternative restaurant nearby? • Bar: is there a comfortable bar area to wait in? • Fri/Sat: is today Friday or Saturday? • Hungry: are we hungry? • Patrons: number of people in the restaurant (None, Some, Full) • Price: price range ($, $$, $$$) • Raining: is it raining outside? • Reservation: have we made a reservation? • Type: kind of restaurant (French, Italian, Thai, Burger) • WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations • Examples described by attribute values (Boolean, discrete, continuous) • E.g., situations where I will/won't wait for a table: • Classification of examples is positive (T) or negative (F)
Decision trees • One possible representation for hypotheses • E.g., here is the “true” tree for deciding whether to wait:
Expressiveness • Decision trees can express any function of the input attributes • E.g., for Boolean functions, truth table row → path to leaf • Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f nondeterministic in x) but it probably won't generalize to new examples • Prefer to find more compact decision trees
Hypothesis spaces How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with $2^n$ rows = $2^{2^n}$ (for each of the $2^n$ rows of the decision table, the function may return 0 or 1) • E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 (more than 18 quintillion) trees
Hypothesis spaces How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with $2^n$ rows = $2^{2^n}$ • E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? • Each attribute can be in (positive), in (negative), or out ⇒ $3^n$ distinct conjunctive hypotheses • More expressive hypothesis space • increases chance that target function can be expressed • increases number of hypotheses consistent with training set ⇒ may get worse predictions
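The two counts on this slide can be checked directly. The short sketch below uses hypothetical helper names (count_boolean_functions, count_conjunctions) that are not part of the slides; it only evaluates the two closed-form expressions.

```python
def count_boolean_functions(n: int) -> int:
    """Number of distinct Boolean functions of n Boolean attributes.

    A truth table has 2**n rows, and each row may map to 0 or 1.
    """
    return 2 ** (2 ** n)


def count_conjunctions(n: int) -> int:
    """Number of purely conjunctive hypotheses over n Boolean attributes.

    Each attribute is either included positively, included negatively,
    or left out, giving three choices per attribute.
    """
    return 3 ** n


if __name__ == "__main__":
    print(count_boolean_functions(6))  # size of the decision-tree hypothesis space
    print(count_conjunctions(6))       # size of the conjunctive hypothesis space
```

The gap between the two numbers for n = 6 illustrates the trade-off stated above: the more expressive space is far larger, so many more hypotheses can agree with a small training set.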
Decision tree learning • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Choosing an attribute • Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" • Patrons? is a better choice
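The recursive idea can be sketched in a few lines. This is a minimal illustration, not the DTL algorithm as given in [AIMA]: it assumes "most significant" means "fewest examples misclassified by a majority vote inside each subset", a simpler stand-in for an entropy-based score. The six examples and the names learn_tree and classify are hypothetical.

```python
from collections import Counter


def majority(examples):
    """Most common label among the examples (ties broken arbitrarily)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def split(examples, attr):
    """Group examples by their value for attr."""
    groups = {}
    for features, label in examples:
        groups.setdefault(features[attr], []).append((features, label))
    return groups


def errors_after_split(examples, attr):
    """How many examples a majority vote in each subset would get wrong."""
    total = 0
    for subset in split(examples, attr).values():
        counts = Counter(label for _, label in subset)
        total += len(subset) - max(counts.values())
    return total


def learn_tree(examples, attrs, default=None):
    """Return a label (leaf) or a tuple (attribute, {value: subtree})."""
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:          # subset is all positive or all negative
        return labels.pop()
    if not attrs:                 # no attributes left: fall back to majority
        return majority(examples)
    # ASSUMPTION: "most significant" = fewest majority-vote errors after the split
    best = min(attrs, key=lambda a: errors_after_split(examples, a))
    rest = [a for a in attrs if a != best]
    fallback = majority(examples)
    branches = {value: learn_tree(subset, rest, fallback)
                for value, subset in split(examples, best).items()}
    return (best, branches)


def classify(tree, features, default=None):
    """Follow the path selected by the example's attribute values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        if features[attr] not in branches:
            return default
        tree = branches[features[attr]]
    return tree


# Hypothetical toy data (not the 12 restaurant examples from the slides).
examples = [
    ({"patrons": "none", "hungry": True}, False),
    ({"patrons": "some", "hungry": False}, True),
    ({"patrons": "some", "hungry": True}, True),
    ({"patrons": "full", "hungry": True}, True),
    ({"patrons": "full", "hungry": False}, False),
    ({"patrons": "none", "hungry": False}, False),
]
tree = learn_tree(examples, ["hungry", "patrons"])
```

On this toy data the root test is patrons, because two of its three subsets are already pure, while splitting on hungry leaves both subsets mixed.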
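Using information theory • To implement Choose-Attribute in the DTL algorithm • Information Content (Entropy): $I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$ • For a training set containing p positive examples and n negative examples: $I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$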
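Information gain • A chosen attribute A divides the training set E into subsets $E_1, \ldots, E_v$ according to their values for A, where A has v distinct values • Information Gain (IG) or reduction in entropy from the attribute test: $IG(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$, where subset $E_i$ contains $p_i$ positive and $n_i$ negative examples • Choose the attribute with the largest IG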
Information gain • For the training set, p = n = 6, I(6/12, 6/12) = 1 bit • Consider the attributes Patrons and Type (and others too): • Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
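The comparison between Patrons and Type can be reproduced numerically. The sketch below assumes the per-value counts of positive and negative examples from the restaurant data in [AIMA]; those counts are not present in the extracted slide text, so treat them as an assumption. The function names are hypothetical.

```python
from math import log2


def entropy(probabilities):
    """Information content in bits; zero-probability terms contribute 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def information_gain(subsets):
    """Entropy reduction from splitting on an attribute.

    subsets: one (positives, negatives) count pair per attribute value.
    """
    p = sum(pos for pos, _ in subsets)
    n = sum(neg for _, neg in subsets)
    total = p + n
    # Expected entropy remaining after the split, weighted by subset size.
    remainder = sum(
        (pos + neg) / total * entropy([pos / (pos + neg), neg / (pos + neg)])
        for pos, neg in subsets
        if pos + neg > 0
    )
    return entropy([p / total, n / total]) - remainder


# ASSUMPTION: (positive, negative) counts per value, as in the AIMA restaurant data.
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

if __name__ == "__main__":
    print(f"IG(Patrons) = {information_gain(patrons):.3f} bits")
    print(f"IG(Type)    = {information_gain(rest_type):.3f} bits")
```

Under these counts every Type subset is an even split, so the test removes no uncertainty, while two of the three Patrons subsets are pure.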
Example contd. • Decision tree learned from the 12 examples: • Substantially simpler than “true” tree---a more complex hypothesis isn’t justified by small amount of data
Performance measurement • How do we know that h ≈ f ? • Use theorems of computational/statistical learning theory • Try h on a new test set of examples (use same distribution over example space as training set) Learning curve = % correct on test set as a function of training set size
Summary (so far) • Learning needed for unknown environments, lazy designers • Learning agent = performance element + learning element • For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples • Decision tree learning using information gain • Learning performance = prediction accuracy measured on test set
Outline for Ensemble Learning and Boosting • Ensemble Learning • Bagging • Boosting • Reading: [AIMA-2] Sec. 18.4 • This set of slides is based on http://www.cs.uwaterloo.ca/~ppoupart/teaching/cs486-spring05/slides/Lecture21notes.pdf • In turn, those slides follow [AIMA-2]
Ensemble Learning • Sometimes each learning technique yields a different hypothesis • But no perfect hypothesis… • Could we combine several imperfect hypotheses into a better hypothesis?
Ensemble Learning • Analogies: • Elections combine voters’ choices to pick a good candidate • Committees combine experts’ opinions to make better decisions • Intuitions: • Individuals often make mistakes, but the “majority” is less likely to make mistakes. • Individuals often have partial knowledge, but a committee can pool expertise to make better decisions
Ensemble Learning • Definition: method to select and combine an ensemble of hypotheses into a (hopefully) better hypothesis • Can enlarge hypothesis space • Perceptron (a simple kind of neural network) • linear separator • Ensemble of perceptrons • polytope
Bagging • Assumptions: • Each hypothesis $h_i$ makes an error with probability p • The hypotheses are independent • Majority voting of n hypotheses: • Exactly k hypotheses make an error: $\binom{n}{k} p^k (1-p)^{n-k}$ • Majority makes an error: $\sum_{k > n/2} \binom{n}{k} p^k (1-p)^{n-k}$ • With n=5, p=0.1: error(majority) < 0.01
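The n=5, p=0.1 figure follows from the binomial distribution under the two stated assumptions. A minimal sketch, with majority_error as a hypothetical helper name:

```python
from math import comb


def majority_error(n: int, p: float) -> float:
    """Probability that more than half of n independent hypotheses err,
    when each one errs with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)   # k > n/2
    )


if __name__ == "__main__":
    print(majority_error(5, 0.1))
```

For n=5 and p=0.1 the three terms (k = 3, 4, 5) are 0.0081, 0.00045, and 0.00001, which sum to 0.00856. The result depends heavily on the independence assumption, which rarely holds in practice.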
Weighted Majority • In practice • Hypotheses rarely independent • Some hypotheses make fewer errors than others • Let’s take a weighted majority • Intuition: • Decrease weight of correlated hypotheses • Increase weight of good hypotheses
Boosting • Most popular ensemble technique • Computes a weighted majority • Can “boost” a “weak learner” • Operates on a weighted training set
Weighted Training Set • Learning with a weighted training set • Supervised learning -> minimize training error • Bias the algorithm to correctly learn instances with high weights • Idea: when an instance is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly
Boosting Framework Read the figure left to right: the algorithm builds a hypothesis on a weighted set of four examples, one hypothesis per column
AdaBoost (Adaptive Boosting) There are N examples. There are M “columns” (hypotheses), each of which has weight $z_m$
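The slide's pseudocode did not survive extraction. The sketch below follows the weight-update rule of the AdaBoost pseudocode in [AIMA] (shrink the weights of correctly classified examples by error/(1 - error), renormalize, and give hypothesis m the weight z_m = log((1 - error)/error)). Everything else is an assumption made for illustration: labels are +1/-1, the weak learner is a one-dimensional threshold stump, the error is clamped to avoid division by zero, and the ten-point data set is hypothetical.

```python
from math import log


def train_stump(xs, ys, w):
    """Weak learner (assumed): the threshold stump with least weighted error."""
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wj for x, y, wj in zip(xs, ys, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign if x > t else -sign


def adaboost(xs, ys, M, learner=train_stump):
    """Return a weighted-majority classifier built from M weak hypotheses."""
    N = len(xs)
    w = [1.0 / N] * N                      # start with uniform example weights
    hs, zs = [], []
    for _ in range(M):
        h = learner(xs, ys, w)
        error = sum(wj for x, y, wj in zip(xs, ys, w) if h(x) != y)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard (assumed) against 0 or 1
        # Shrink weights of correctly classified examples, then renormalize,
        # so misclassified examples carry relatively more weight next round.
        w = [wj * error / (1 - error) if h(x) == y else wj
             for x, y, wj in zip(xs, ys, w)]
        total = sum(w)
        w = [wj / total for wj in w]
        hs.append(h)
        zs.append(log((1 - error) / error))         # hypothesis weight z_m

    def ensemble(x):
        vote = sum(z * h(x) for h, z in zip(hs, zs))
        return 1 if vote > 0 else -1

    return ensemble


# Hypothetical 1-D data that no single stump classifies correctly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

if __name__ == "__main__":
    for M in (1, 3):
        h = adaboost(xs, ys, M)
        mistakes = sum(h(x) != y for x, y in zip(xs, ys))
        print(f"M={M}: {mistakes} training mistakes")
```

On this data one stump leaves training mistakes, while the weighted vote of three stumps fits all ten points, which is the sense in which boosting strengthens a weak learner.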
What can we boost? • Weak learner: produces hypotheses that are at least slightly better than a random classifier. • Examples: • Rules of thumb • Decision stumps (decision trees of one node) • Perceptrons • Naïve Bayes models
Boosting Paradigm • Advantages • No need to learn a perfect hypothesis • Can boost any weak learning algorithm • Boosting is very simple to program • Good generalization • Paradigm shift • Don’t try to learn a perfect hypothesis • Just learn simple rules of thumb and boost them
Boosting Paradigm • When we already have a bunch of hypotheses, boosting provides a principled approach to combine them • Useful for • Sensor fusion • Combining experts
Boosting Applications • Any supervised learning task • Spam filtering • Speech recognition/natural language processing • Data mining • Etc.
Computational Learning Theory The slides on COLT are from ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/SEM2A4/l4.ps.gz and http://www.cs.bham.ac.uk/~mmk/teaching/SEM2A4/, which also has slides on version spaces
How many examples are needed? This is the probability that $H_{\epsilon\text{-bad}}$ contains a consistent hypothesis
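The formula this slide refers to was lost in extraction. What follows is a sketch of the standard argument as presented in [AIMA], not a recovery of the slide itself. The symbols are assumptions introduced here: $H$ is the hypothesis space, $H_{\epsilon\text{-bad}}$ the hypotheses in $H$ whose error exceeds $\epsilon$, $N$ the number of examples, and $\delta$ the tolerated probability of failure. A single bad hypothesis agrees with one random example with probability at most $1-\epsilon$, so:

```latex
P\bigl(H_{\epsilon\text{-bad}} \text{ contains a consistent hypothesis}\bigr)
  \;\le\; |H_{\epsilon\text{-bad}}|\,(1-\epsilon)^N
  \;\le\; |H|\,(1-\epsilon)^N

% Requiring the right-hand side to be at most \delta and using 1-\epsilon \le e^{-\epsilon}:
N \;\ge\; \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)
```

For the full space of Boolean functions on n attributes, $|H| = 2^{2^n}$, so the bound grows as $2^n$; this is the motivation for restricted hypothesis spaces.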
Learning Decision Lists • A decision list consists of a series of tests, each of which is a conjunction of literals. If a test succeeds, the decision list specifies the value to be returned. Otherwise, processing continues with the next test in the list • Unrestricted decision lists can represent any Boolean function, hence are not learnable (in polynomial time) • A k-DL is a decision list where each test is restricted to at most k literals • k-DL is learnable! [Rivest, 1987]
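To make the definition concrete, the sketch below evaluates a decision list and checks the k-DL restriction. It does not implement Rivest's learning algorithm. The representation (each test as a dictionary of attribute/value literals) and the two-rule restaurant list are assumptions chosen for illustration.

```python
def evaluate(decision_list, example, default):
    """Return the value of the first test that succeeds, else the default.

    decision_list: sequence of (test, value) pairs, where a test is a dict
    of {attribute: required_value} literals, read as a conjunction.
    """
    for test, value in decision_list:
        if all(example[attr] == wanted for attr, wanted in test.items()):
            return value
    return default


def is_k_dl(decision_list, k):
    """True if every test uses at most k literals."""
    return all(len(test) <= k for test, _ in decision_list)


# Hypothetical list: wait if some patrons, or if full on a Friday/Saturday.
wait_list = [
    ({"patrons": "some"}, True),
    ({"patrons": "full", "fri_sat": True}, True),
]

if __name__ == "__main__":
    print(evaluate(wait_list, {"patrons": "full", "fri_sat": False}, default=False))
    print(is_k_dl(wait_list, 2))
```

The longest test here has two literals, so this list is a 2-DL but not a 1-DL.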