Announcements • List of 5 sources for research paper • Homework 5 due Tuesday, October 30 • Book Review due Tuesday, October 30
Classification problems and Machine Learning Lecture 10
EnjoySport concept learning task • Given • Instances X: Possible days, each described by the attributes • Sky (with possible values Sunny, Cloudy, and Rainy) • AirTemp (with values Warm and Cold) • Humidity (with values Normal and High) • Wind (with values Strong and Weak) • Water (with values Warm and Cool), and • Forecast (with values Same and Change) • Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes. Each constraint may be "?" (any value), "Ø" (no value), or a specific value • Target concept c: EnjoySport : X → {0,1} • Training Examples D: Positive or negative examples of the target function • Determine • A hypothesis h in H such that h(x) = c(x) for all x in X
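To make this representation concrete, here is a minimal Python sketch (the tuple encoding, the use of the string "0" for Ø, and the function name are illustrative choices, not part of the lecture): a hypothesis is a 6-tuple of constraints, and it classifies an instance as positive only if every constraint is satisfied.

```python
def satisfies(hypothesis, instance):
    """Return 1 if the instance is classified positive by the hypothesis, else 0."""
    for constraint, value in zip(hypothesis, instance):
        if constraint == "0":                      # Ø: no value is acceptable
            return 0
        if constraint != "?" and constraint != value:
            return 0
    return 1

x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
h = ("Sunny", "?", "?", "Strong", "?", "?")
print(satisfies(h, x))  # -> 1
```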
Find-S: Finding a Maximally Specific Hypothesis (review) • Initialize h to the most specific hypothesis in H • For each positive training instance x • For each attribute constraint ai in h • If the constraint ai is satisfied by x • Then do nothing • Else replace ai in h by the next more general constraint that is satisfied by x • Output hypothesis h • Begin: h ← <Ø, Ø, Ø, Ø, Ø, Ø>
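A compact Python sketch of Find-S under the tuple encoding above (the function name and the use of "0" for Ø are my choices; this is a sketch, not the lecture's reference implementation):

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses over attribute tuples.

    `examples` is a list of (instance, label) pairs, label 1 = positive, 0 = negative.
    Starts from the most specific hypothesis <Ø,...,Ø> and minimally generalizes
    on each positive example; negative examples are ignored.
    """
    n = len(examples[0][0])
    h = ["0"] * n                      # most specific hypothesis: all Ø
    for instance, label in examples:
        if label != 1:                 # Find-S only looks at positive examples
            continue
        for i, value in enumerate(instance):
            if h[i] == "0":            # first positive example: copy its values
                h[i] = value
            elif h[i] != value:        # conflicting value: generalize to "?"
                h[i] = "?"
    return tuple(h)

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), 1),
]
print(find_s(examples))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```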
Candidate Elimination • Candidate elimination outputs the version space: the set of all hypotheses consistent with the training data (including the negative examples), represented by its specific boundary S and general boundary G. For the EnjoySport training examples worked through below: • S: {<Sunny, Warm, ?, Strong, ?, ?>} • Other members of the version space: <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, <?, Warm, ?, Strong, ?, ?> • G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Candidate-Elimination Learning Algorithm • Initialize G to the set of maximally general hypotheses in H • Initialize S to the set of maximally specific hypotheses in H • For each training example d, do • If d is a positive example • Remove from G any hypothesis inconsistent with d • For each hypothesis s in S that is not consistent with d • Remove s from S • Add to S all minimal generalizations h of s such that • h is consistent with d, and some member of G is more general than h • Remove from S any hypothesis that is more general than another hypothesis in S • If d is a negative example • Remove from S any hypothesis inconsistent with d • For each hypothesis g in G that is not consistent with d • Remove g from G • Add to G all minimal specializations h of g such that • h is consistent with d, and some member of S is more specific than h • Remove from G any hypothesis that is less general than another hypothesis in G
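The following Python sketch implements this pseudocode for the conjunctive hypothesis representation used above (helper names such as covers and more_general_or_equal are mine; a different hypothesis representation would need different minimal generalization and specialization operators):

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c != "0" and (c == "?" or c == v) for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 covers at least every instance that h2 covers."""
    return all(a == "?" or a == b or b == "0" for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """Return the boundary sets (S, G) of the version space.

    examples: list of (instance tuple, label) pairs, label 1 = positive, 0 = negative.
    domains[i]: the possible values of attribute i (needed to specialize G).
    """
    n = len(domains)
    S = {tuple("0" for _ in range(n))}        # most specific boundary: <Ø,...,Ø>
    G = {tuple("?" for _ in range(n))}        # most general boundary: <?,...,?>
    for x, label in examples:
        if label == 1:                        # positive example
            G = {g for g in G if covers(g, x)}
            for s in list(S):
                if covers(s, x):
                    continue
                S.discard(s)
                # minimal generalization: relax only the conflicting constraints
                s2 = tuple(v if c == "0" else c if c == v else "?"
                           for c, v in zip(s, x))
                if any(more_general_or_equal(g, s2) for g in G):
                    S.add(s2)
            # drop any member of S more general than another member
            S = {s for s in S
                 if not any(t != s and more_general_or_equal(s, t) for t in S)}
        else:                                 # negative example
            S = {s for s in S if not covers(s, x)}
            for g in list(G):
                if not covers(g, x):
                    continue
                G.discard(g)
                # minimal specializations: pin one "?" to any value that excludes x
                for i, c in enumerate(g):
                    if c != "?":
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            g2 = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(g2, s) for s in S):
                                G.add(g2)
            # drop any member of G less general than another member
            G = {g for g in G
                 if not any(h != g and more_general_or_equal(h, g) for h in G)}
    return S, G
```

Run on the four EnjoySport examples traced on the next slides, this returns S = {<Sunny, Warm, ?, Strong, ?, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}, matching the version space shown above.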
Example • E1 = <Sunny, Warm, Normal, Strong, Warm, Same> positive • E2 = <Sunny, Warm, High, Strong, Warm, Same> positive • S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø>} • S1 ← {<Sunny, Warm, Normal, Strong, Warm, Same>} • S2 ← {<Sunny, Warm, ?, Strong, Warm, Same>} • G0 ← {<?, ?, ?, ?, ?, ?>} • G1 ← {<?, ?, ?, ?, ?, ?>} • G2 ← {<?, ?, ?, ?, ?, ?>}
Example (cont. 2) • E3 = <Rainy, Cold, High, Strong, Warm, Change> negative • S2 ← {<Sunny, Warm, ?, Strong, Warm, Same>} • S3 ← {<Sunny, Warm, ?, Strong, Warm, Same>} (unchanged) • G2 ← {<?, ?, ?, ?, ?, ?>} • G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Example (cont. 3) • E4 = <Sunny, Warm, High, Strong, Cool, Change> positive • S3 ← {<Sunny, Warm, ?, Strong, Warm, Same>} • S4 ← {<Sunny, Warm, ?, Strong, ?, ?>} • G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>} • G4 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>} (<?, ?, ?, ?, ?, Same> is removed because it is inconsistent with the positive example E4)
Decision Tree Learning • Has two major benefits over Find-S and Candidate Elimination • Can cope with noisy data • Capable of learning disjunctive expressions • Limitation • There may be many valid decision trees for the training data; the learner prefers small trees over large ones • Applies to a broad range of learning tasks • Classify medical patients by their disease • Classify equipment malfunctions by their cause • Classify loan applicants by their likelihood of defaulting on payments
Decision Tree Example: days on which to play tennis • Outlook = Sunny → test Humidity • Humidity = High → No • Humidity = Normal → Yes • Outlook = Overcast → Yes • Outlook = Rain → test Wind • Wind = Strong → No • Wind = Weak → Yes
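One way to see how such a tree classifies a day is to encode it as nested dictionaries; this encoding and the classify function are an illustrative sketch, not part of the lecture.

```python
# Internal nodes map an attribute name to a dict of value -> subtree; leaves are labels.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, example):
    """Walk down the tree, following the branch that matches the example's value."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

print(classify(play_tennis_tree,
               {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```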
Decision Tree Induction (1) • Decision tree induction involves creating a decision tree from a set of training data that can be used to correctly classify the training data. • ID3 is an example of a decision tree learning algorithm. • ID3 builds the decision tree from the top down, selecting the features from the training data that provide the most information at each stage.
Decision Tree Induction (2) • ID3 selects attributes based on information gain. • Information gain is the reduction in entropy caused by a decision. • Entropy is defined as: H(S) = -p1 log2 p1 - p0 log2 p0 • p1 is the proportion of the training data which are positive examples • p0 is the proportion which are negative examples • Intuition about H(S) • Zero (min value) when all the examples are the same (positive or negative) • One (max value) when half are positive and half are negative.
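A short Python sketch of this definition (the function name is mine; the convention 0 log2 0 = 0 matches the calculation on the next slides):

```python
import math

def entropy(p1, p0):
    """H(S) = -p1*log2(p1) - p0*log2(p0), taking 0*log2(0) to be 0."""
    return -sum(p * math.log2(p) for p in (p1, p0) if p > 0)

print(entropy(1.0, 0.0))    # 0.0   all examples in the same class (minimum)
print(entropy(0.5, 0.5))    # 1.0   evenly mixed (maximum)
print(entropy(9/14, 5/14))  # ~0.940
```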
Example – Training Data • 14 example days, D1–D14 (9 positive, 5 negative), each described by the attributes Outlook, Temperature, Humidity, and Wind
Calculate Information Gain • Initial Entropy • All 14 examples considered as a single set: 9 positive examples, 5 negative examples • H(init) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = -.643 log2 .643 - .357 log2 .357 = 0.940 • Calculate the entropy of each subset produced by an attribute and combine them as a weighted sum • Entropy of "Outlook" • Sunny • 5 examples, 2 positive, 3 negative • H(Sunny) = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971 • Overcast • 4 examples, 4 positive, 0 negative • H(Overcast) = -1 log2 (1) - 0 log2 (0) = 0 (0 log2 0 is defined as 0) • Rain • 5 examples, 3 positive, 2 negative • H(Rain) = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971 • H(Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694 • Information Gain = H(init) - H(Outlook) = 0.940 - 0.694 = 0.246
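The same calculation as a runnable Python check (function name is mine; it works from class counts rather than proportions):

```python
import math

def entropy_from_counts(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

h_init = entropy_from_counts(9, 5)                    # whole training set
h_outlook = (5/14) * entropy_from_counts(2, 3) \
          + (4/14) * entropy_from_counts(4, 0) \
          + (5/14) * entropy_from_counts(3, 2)        # weighted sum over Outlook values
print(h_init, h_outlook, h_init - h_outlook)          # 0.940..., 0.693..., 0.246...
```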
Maximize Information Gain • Gain of each attribute • Gain(Outlook) = 0.246 • Gain(Humidity) = 0.151 • Gain(Wind) = 0.048 • Gain(Temperature) = 0.029 • Outlook therefore becomes the root: {D1, D2, …, D14} [9+,5-] splits into • Sunny: {D1, D2, D8, D9, D11} [2+,3-] → ? (split further) • Overcast: {D3, D7, D12, D13} [4+,0-] → Yes • Rain: {D4, D5, D6, D10, D14} [3+,2-] → ? (split further)
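A compact sketch of top-down induction in the style of ID3, recursively applying the gain criterion above (names are mine; ties, empty branches, and continuous attributes are not handled in this sketch):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Reduction in entropy from splitting `examples` on `attribute`."""
    labels = [label for _, label in examples]
    before = entropy(labels)
    after = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == v]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

def id3(examples, attributes):
    """examples: list of (attribute dict, label). Returns a nested-dict tree or a label."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                   # pure node: return the class
        return labels[0]
    if not attributes:                          # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best])
    return tree
```

On the 14 PlayTennis examples this picks Outlook at the root, matching the gains listed above.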
Unbiased Learner • Provide a hypothesis space capable of representing every teachable concept • Every possible subset of the instances X (the power set of X) • How large is this space? • For EnjoySport, there are 96 instances in X • The power set has 2^|X| members • EnjoySport therefore has 2^96 ≈ 10^28 distinct target concepts • Allows disjunctions, conjunctions, and negations • Can no longer generalize beyond observed examples
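A two-line check of that arithmetic (the instance count comes from multiplying the attribute domain sizes listed on the EnjoySport slide):

```python
# Sky has 3 values; the other five attributes have 2 values each.
num_instances = 3 * 2 * 2 * 2 * 2 * 2
print(num_instances)       # 96
print(2 ** num_instances)  # 79228162514264337593543950336, roughly 10**28
```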
Inductive Bias • All learning methods have an inductive bias. • The inductive bias of a learning method is the set of assumptions and restrictions it imposes on the hypotheses it will consider. • Without inductive bias, a learning method could not generalize. • A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances
Bias in Learning Algorithms • Rote-Learner: If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance • Find-S: Finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances
Candidate-Elimination Bias • Candidate-Elimination will converge to the true target concept provided it is given accurate training examples and its initial hypothesis space contains the true target concept • It only considers conjunctions of attribute values • Cannot represent "Sky = Sunny or Sky = Cloudy" • What if the target concept is not contained in the hypothesis space?
Bias of ID3 • Chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search • Favors shorter trees over longer ones • Selects trees that place the attributes with the highest information gain closest to the root • The interaction between the attribute-selection heuristic and the training examples makes it difficult to characterize its bias precisely
ID3 vs. Candidate Elimination • They differ in the type of inductive bias • ID3 searches a complete hypothesis space, but searches it incompletely • Its inductive bias is a consequence of the ordering of hypotheses by its search strategy • Candidate-Elimination searches an incomplete hypothesis space, but searches that space completely • Its inductive bias is a consequence of the expressive power of its hypothesis representation
Why Prefer Short Hypotheses? • Occam's razor • Prefer the simplest hypothesis that fits the data • Applying Occam's razor • There are fewer short hypotheses than long ones, so it is less likely that a short hypothesis coincidentally fits the training data • A 5-node tree is less likely to be a statistical coincidence, so we prefer it over a 500-node tree that fits the same data • Problems with this argument • By the same argument, any other rare property of hypotheses (not just small size) would be equally privileged. Would that be better? • Size is determined by the particular representation used internally by the learner • Don't reject Occam's razor altogether • Evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation more easily than it can alter the learning algorithm
The Problem of Overfitting • A hypothesis overfits when it fits the training data better than some alternative hypothesis but performs worse on unseen data. Black dots represent positive examples, white dots negative. The two lines represent two different hypotheses. In the first diagram, there are just a few items of training data, correctly classified by the hypothesis represented by the darker line. In the second and third diagrams we see the complete set of data, and the simpler hypothesis, which matched the training data less well, matches the rest of the data better than the more complex hypothesis, which overfits.
The Nearest Neighbor Algorithm (1) • This is an example of instance based learning. • Instance based learning involves storing training data and using it to attempt to classify new data as it arrives. • The nearest neighbor algorithm works with data that consists of vectors of numeric attributes. • Each vector represents a point in n-dimensional space.
The Nearest Neighbor Algorithm (2) • When an unseen data item is to be classified, the Euclidean distance is calculated between this item and all training data. • the distance between <x1, y1> and <x2, y2> is: sqrt((x1 - x2)^2 + (y1 - y2)^2) • The classification for the unseen data is usually the class that is most common among its k nearest neighbors. • Shepard's method allows all training data to contribute to the classification, with each contribution weighted inversely proportional to its distance from the data item to be classified.
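A small Python sketch of both variants (the function names and the 1/d^2 weighting in the distance-weighted version are illustrative choices, not fixed by the lecture):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k=3):
    """Majority vote among the k training points nearest to `query`.

    `training` is a list of (vector, label) pairs; vectors are tuples of numbers.
    """
    neighbours = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def shepard_classify(training, query):
    """Distance-weighted variant: every training point votes with weight 1/d^2."""
    weights = Counter()
    for vector, label in training:
        d = euclidean(vector, query)
        if d == 0:
            return label              # exact match: return its label directly
        weights[label] += 1.0 / (d * d)
    return weights.most_common(1)[0][0]
```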