A classification learning example: predicting when Russell will wait for a table --similar to predicting book preferences, credit card fraud, or when people are likely to respond to junk mail
Inductive Learning (Classification Learning)
• Given a set of labeled examples, and a space of hypotheses
  • Find the rule that underlies the labeling (so you can use it to predict future unlabeled examples)
  • Tabula rasa, fully supervised
• Idea:
  • Loop through all hypotheses
  • Rank each hypothesis in terms of its match to data
  • Pick the best hypothesis
• The main problem is that the space of hypotheses is too large
  • Given examples described in terms of n boolean variables, there are $2^{2^n}$ different hypotheses
  • For 6 features, there are 18,446,744,073,709,551,616 hypotheses
• Main variations:
  • Bias: what “sort” of rule are you looking for?
    • If you are looking for only conjunctive hypotheses, there are just $3^n$
  • Search:
    • Greedy search: Decision tree learner
    • Systematic search: Version space learner
    • Iterative search: Neural net learner
• It can be shown that the sample complexity of PAC learning grows linearly with 1/ε and logarithmically with 1/δ and |H|
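The counting argument on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and `pac_sample_bound` assumes the standard textbook bound for a consistent learner over a finite hypothesis space, m ≥ (1/ε)(ln |H| + ln(1/δ)), which the slide only alludes to.

```python
import math

def num_boolean_hypotheses(n: int) -> int:
    """All boolean functions over n boolean features: 2^(2^n)."""
    return 2 ** (2 ** n)

def num_conjunctive_hypotheses(n: int) -> int:
    """Conjunctions only: each feature is required true, required false, or absent: 3^n."""
    return 3 ** n

def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Assumed bound (finite H, consistent learner):
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

if __name__ == "__main__":
    n = 6
    print(num_boolean_hypotheses(n))      # size of the unrestricted space
    print(num_conjunctive_hypotheses(n))  # size under the conjunctive bias
    print(pac_sample_bound(num_conjunctive_hypotheses(n), 0.1, 0.05))
    print(pac_sample_bound(num_boolean_hypotheses(n), 0.1, 0.05))
```

Under this bound, a stronger bias (smaller |H|) needs fewer examples for the same ε and δ.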
Why Simple is Better? Bias & Learning Accuracy
• Having weak bias (large hypothesis space)
  • Allows us to capture more concepts
  • ..increases learning cost
  • May lead to over-fitting
• Also, the goal of a compression algorithm is to drive down the training error, but the goal of a learning algorithm is to drive down the test error
Learners that use different biases in predicting Russell’s waiting habits
• Decision Trees --Examples are used to learn topology (order of questions)
• K-nearest neighbors
• Association rules --Examples are used to learn support and confidence of association rules
  • If patrons=full and day=Friday then wait (0.3/0.7)
  • If wait>60 and Reservation=no then wait (0.4/0.9)
• SVMs
• Neural Nets --Examples are used to learn topology and edge weights
• Naïve Bayes (Bayes net learning) --Examples are used to learn topology and CPTs
Which one to pick?
Learning Decision Trees---How? (20 Questions: AI Style)
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are –ve, label No. Terminate
--If some are +ve and some are –ve, continue splitting recursively
(Special case: Decision Stumps. If you don’t feel like splitting any further, return the majority label)
Depending on the order we pick, we can get smaller or bigger trees Which tree is better? Why do you think so??
Would you split on Patrons or Type?
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are –ve, label No. Terminate
--If some are +ve and some are –ve, continue splitting recursively
--If no attributes are left to split on, label with the majority element
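The basic idea above can be sketched as a short recursive procedure. This is an illustrative sketch, not the lecture's implementation: the data layout (a list of `(attribute_dict, label)` pairs), the tuple encoding of internal nodes, and the default `choose` function (which just takes the first remaining attribute) are all assumptions.

```python
from collections import Counter

def majority_label(examples):
    """Most common label among (attributes, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, choose=lambda exs, attrs: attrs[0]):
    """Return a leaf label, or a node (attribute, branches, default_label).
    Assumes labels are not tuples, since tuples mark internal nodes."""
    labels = {label for _, label in examples}
    if len(labels) == 1:            # all +ve or all -ve: terminate
        return labels.pop()
    if not attributes:              # nothing left to split on: majority label
        return majority_label(examples)
    attr = choose(examples, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {ex[attr] for ex, _ in examples}:
        subset = [(ex, label) for ex, label in examples if ex[attr] == value]
        branches[value] = learn_tree(subset, remaining, choose)
    return (attr, branches, majority_label(examples))

def classify(tree, example):
    """Follow branches until a leaf; unseen values fall back to the node default."""
    while isinstance(tree, tuple):
        attr, branches, default = tree
        tree = branches.get(example[attr], default)
    return tree
```

Passing a different `choose` (for instance one based on information gain) changes the order of questions and hence the size of the tree, without changing the recursion.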
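The Information Gain Computation
• Information content of a set with $N_+$ positive and $N_-$ negative examples (# expected comparisons needed to tell whether a given example is +ve or -ve):
$$P_+ = \frac{N_+}{N_+ + N_-}, \qquad P_- = \frac{N_-}{N_+ + N_-}, \qquad I(P_+, P_-) = -P_+ \log_2 P_+ - P_- \log_2 P_-$$
• Splitting on feature $f_k$ partitions the examples into k subsets with counts $(N_{1+}, N_{1-}), (N_{2+}, N_{2-}), \ldots, (N_{k+}, N_{k-})$ and information contents $I(P_{1+}, P_{1-}), \ldots, I(P_{k+}, P_{k-})$. The residual information after the split is
$$\sum_{i=1}^{k} \frac{N_{i+} + N_{i-}}{N_+ + N_-} \, I(P_{i+}, P_{i-})$$
• The difference between the two is the information gain
• So, pick the feature with the largest info gain, i.e. the smallest residual info
• Given k mutually exclusive and exhaustive events $E_1, \ldots, E_k$ whose probabilities are $p_1, \ldots, p_k$, the “information” content (entropy) is defined as $\sum_i -p_i \log_2 p_i$
• A split is good if it reduces the entropy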
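As a rough sketch, the information gain computation can be written directly from the counts. The function names and the representation of a split as a list of `(N_i+, N_i-)` pairs are assumptions made for illustration.

```python
import math

def entropy(pos, neg):
    """I(P+, P-) in bits, computed from counts; 0*log(0) is taken as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def residual_information(splits):
    """Weighted entropy after a split; splits is a list of (N_i+, N_i-) pairs."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy(p, n) for p, n in splits)

def information_gain(splits):
    """Entropy before the split minus the residual information after it."""
    pos = sum(p for p, _ in splits)
    neg = sum(n for _, n in splits)
    return entropy(pos, neg) - residual_information(splits)

if __name__ == "__main__":
    # A feature that separates 2 +ve from 2 -ve perfectly, and one that does not.
    print(information_gain([(2, 0), (0, 2)]))
    print(information_gain([(1, 1), (1, 1)]))
```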
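A simple example
$$I(\tfrac{1}{2}, \tfrac{1}{2}) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = \tfrac{1}{2} + \tfrac{1}{2} = 1$$
$$I(1, 0) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 = 0 \quad (\text{taking } 0 \log_2 0 = 0)$$
Residual information after splitting on each attribute:
$$V(M) = \tfrac{2}{4}\, I(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{2}{4}\, I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$$
$$V(A) = \tfrac{2}{4}\, I(1, 0) + \tfrac{2}{4}\, I(0, 1) = 0$$
$$V(N) = \tfrac{2}{4}\, I(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{2}{4}\, I(\tfrac{1}{2}, \tfrac{1}{2}) = 1$$
So Anxious is the best attribute to split on. Once you split on Anxious, the problem is solved.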
Evaluating the Decision Trees
• “Majority” function (say yes if majority of attributes are yes)
• Russell Domain
• Lesson: Every bias makes some concepts easier to learn and others harder to learn…
m-fold cross-validation
  Split N examples into m equal sized parts
  for i=1..m
    train with all parts except the ith
    test with the ith part
Learning curves…
  Given N examples, partition them into Ntr, the training set, and Ntest, the test instances
  Loop for i=1 to |Ntr|
    Loop for Ns in subsets of Ntr of size i
      Train the learner over Ns
      Test the learned pattern over Ntest and compute the accuracy (%correct)
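The m-fold cross-validation loop can be sketched as follows. The `train` and `accuracy` callables are placeholders for whatever learner is being evaluated; `train_majority` and `accuracy_majority` are a hypothetical baseline included only to make the sketch runnable. Folds are formed by striding without shuffling, which is an assumption; in practice the examples would usually be shuffled first.

```python
from collections import Counter

def cross_validate(examples, m, train, accuracy):
    """Average test accuracy over m folds (folds differ in size by at most one)."""
    folds = [examples[i::m] for i in range(m)]
    scores = []
    for i in range(m):
        test_part = folds[i]
        # train with all parts except the ith
        train_part = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(train_part)
        scores.append(accuracy(model, test_part))
    return sum(scores) / m

def train_majority(examples):
    """Hypothetical baseline learner: always predict the most common label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def accuracy_majority(model, examples):
    """Fraction of examples whose label equals the baseline's prediction."""
    return sum(label == model for _, label in examples) / len(examples)
```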
Problems with Info. Gain Heuristics
• Feature correlation: We are splitting on one feature at a time
  • The Costanza party problem
  • No obvious easy solution…
• Overfitting: We may look too hard for patterns where there are none
  • E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day etc.
  • Solution: Don’t consider splitting if the information gain given by the best feature is below a minimum threshold
    • Can use the χ² test for statistical significance
    • Will also help when we have noisy samples…
• We may prefer features with very high branching
  • E.g. branch on the “universal time string” for the Russell restaurant example
  • Branch on social security number to look for patterns on who will get an A
  • Solution: “gain ratio” --ratio of the information gain with the attribute A to the information content of answering the question “What is the value of A?”
    • The denominator is smaller for attributes with smaller domains.
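A sketch of the gain ratio, under the assumption that "the information content of answering what is the value of A" means the entropy of the branch-size distribution. Function names and the `(N_i+, N_i-)` split representation are illustrative.

```python
import math

def entropy_of_counts(counts):
    """Entropy in bits of the distribution given by non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(splits):
    """splits: list of (N_i+, N_i-) pairs, one per value of attribute A."""
    parent = [sum(p for p, _ in splits), sum(n for _, n in splits)]
    total = sum(parent)
    sizes = [p + n for p, n in splits]
    residual = sum(s / total * entropy_of_counts([p, n])
                   for s, (p, n) in zip(sizes, splits))
    gain = entropy_of_counts(parent) - residual
    split_info = entropy_of_counts(sizes)   # "What is the value of A?"
    return gain / split_info if split_info > 0 else 0.0

if __name__ == "__main__":
    # Both attributes separate the classes perfectly (gain = 1 bit),
    # but the ID-like attribute with one example per branch is penalized.
    print(gain_ratio([(2, 0), (0, 2)]))
    print(gain_ratio([(1, 0), (1, 0), (0, 1), (0, 1)]))
```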
Decision Stumps
• Decision stumps are decision trees where the leaf nodes do not necessarily have all +ve or all –ve training examples
• Could happen either because examples are noisy and mis-classified, or because you want to stop before reaching pure leaves
• When you reach that node, you return the majority label as the decision
• (We can associate a confidence with that decision using P+ and P-.) E.g. for the first branch after splitting on feature fk: $P_+ = N_{1+} / (N_{1+} + N_{1-})$
• Sometimes, the best decision tree for a problem could be a decision stump (see the coin toss example)
Bayes Network Learning
• Bias: The relation between the class label and class attributes is specified by a Bayes Network.
• Approach
  • Guess Topology
  • Estimate CPTs
• Simplest case: Naïve Bayes
  • Topology of the network is “class label” causes all the attribute values independently (e.g. class label is the disease; attributes are symptoms)
  • So, all we need to do is estimate CPTs P(attrib|Class)
  • In Russell domain, P(Patrons|willwait)
  • P(Patrons=full|willwait=yes) = (#training examples where patrons=full and willwait=yes) / (#training examples where willwait=yes)
• Given a new case, we use Bayes rule to compute the class label
Naïve Bayesian Classification
• Problem: Classify a given example E into one of the classes among [C1, C2, …, Cn]
  • E has k attributes A1, A2, …, Ak and each Aj can take d different values
• Bayes Classification: Assign E to the class Ci that maximizes P(Ci | E)
$$P(C_i \mid E) = \frac{P(E \mid C_i)\, P(C_i)}{P(E)}$$
  • P(Ci) and P(E) are a priori knowledge (or can be easily extracted from the set of data)
• Estimating P(E|Ci) is harder
  • Requires P(A1=v1, A2=v2, …, Ak=vk | Ci)
  • Assuming d values per attribute, we will need $n d^k$ probabilities
• Naïve Bayes Assumption: Assume all attributes are independent given the class
$$P(E \mid C_i) = \prod_{j=1}^{k} P(A_j = v_j \mid C_i)$$
• The assumption is BOGUS, but it seems to WORK (and needs only n*d*k probabilities)
NBC in terms of Bayes networks.. [Figure: the network under the NBC assumption vs. under a more realistic assumption]
Estimating the probabilities for NBC
• Given an example E described as A1=v1, A2=v2, …, Ak=vk, we want to compute the class of E
  • Calculate P(Ci | A1=v1, A2=v2, …, Ak=vk) for all classes Ci and say that the class of E is the one for which P(.) is maximum
$$P(C_i \mid A_1=v_1, \ldots, A_k=v_k) = \frac{\prod_{j} P(A_j = v_j \mid C_i)\; P(C_i)}{P(A_1=v_1, \ldots, A_k=v_k)}$$
  • The denominator is a common factor across all classes
• Given a set of N training examples that have already been classified into n classes Ci
  • Let #(Ci) be the number of examples that are labeled as Ci
  • Let #(Ci, Aj=vj) be the number of examples labeled as Ci that have attribute Aj set to value vj
  • P(Ci) = #(Ci)/N
  • P(Aj=vj | Ci) = #(Ci, Aj=vj) / #(Ci)
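The frequency estimates above can be sketched in a few lines. This is a minimal illustration, assuming training data is a list of `(attribute_dict, class_label)` pairs; the names `train_nbc` and `posterior` are made up here, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_nbc(examples):
    """Collect #(Ci) and #(Ci, Aj=vj) from (attribute_dict, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)      # class -> Counter of (attr, value)
    for attrs, label in examples:
        for attr, value in attrs.items():
            value_counts[label][(attr, value)] += 1
    return class_counts, value_counts

def posterior(model, attrs):
    """Normalized P(Ci | attrs) under the naive independence assumption."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / n                                     # P(Ci)
        for attr, value in attrs.items():
            score *= value_counts[c][(attr, value)] / count   # P(Aj=vj | Ci)
        scores[c] = score
    total = sum(scores.values())             # the common factor P(A1=v1..Ak=vk)
    return {c: s / total for c, s in scores.items()} if total else scores
```

Dividing by `total` plays the role of the common denominator: it does not change which class is maximal, it only turns the scores into probabilities.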
Example
• P(willwait=yes) = 6/12 = 0.5
• P(Patrons=“full” | willwait=yes) = 2/6 = 0.333
• P(Patrons=“some” | willwait=yes) = 4/6 = 0.667
• Similarly we can show that P(Patrons=“full” | willwait=no) = 0.667
• P(willwait=yes | Patrons=full) = P(Patrons=full | willwait=yes) * P(willwait=yes) / P(Patrons=full) = k * 0.333 * 0.5
• P(willwait=no | Patrons=full) = k * 0.667 * 0.5
• Here k = 1/P(Patrons=full) is the same for both classes. Normalizing gives P(willwait=yes | Patrons=full) = 1/3 and P(willwait=no | Patrons=full) = 2/3, so the predicted class is “no”
Using M-estimates to improve probability estimates
• The simple frequency based estimation of P(Aj=vj | Ci) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don’t contain rare cases is quite high)
• Solution: Use M-estimate
$$P(A_j = v_j \mid C_i) = \frac{\#(C_i, A_j = v_j) + m\,p}{\#(C_i) + m}$$
  • p is the prior probability of Aj taking the value vj
    • If we don’t have any background information, assume uniform probability (that is, 1/d if Aj can take d values)
  • m is a constant called the “equivalent sample size”
    • If we believe that our sample set is large enough, we can keep m small. Otherwise, keep it large.
    • Essentially we are augmenting the #(Ci) normal samples with m more virtual samples drawn according to the prior probability on how Aj takes values
  • Popular values: p=1/|V| and m=|V|, where |V| is the size of the vocabulary
• Also, to avoid underflow errors, add logarithms of probabilities (instead of multiplying probabilities)
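A small sketch of both ideas on this slide: the M-estimate itself and scoring in log space. The function names and argument order are assumptions for illustration only.

```python
import math

def m_estimate(count_class_and_value, count_class, p, m):
    """[#(Ci, Aj=vj) + m*p] / [#(Ci) + m]."""
    return (count_class_and_value + m * p) / (count_class + m)

def log_score(prior, conditionals):
    """log P(Ci) + sum_j log P(Aj=vj | Ci); requires all probabilities > 0,
    which the M-estimate guarantees when p > 0 and m > 0."""
    return math.log(prior) + sum(math.log(q) for q in conditionals)

if __name__ == "__main__":
    # A value never seen with this class no longer gets probability zero.
    print(m_estimate(0, 6, p=1 / 3, m=3))
    # A product of many small probabilities underflows to 0.0 in floating
    # point, while the sum of logs stays finite.
    probs = [1e-5] * 1000
    print(math.prod(probs), log_score(0.5, probs))
```

Comparing classes by `log_score` gives the same ranking as comparing products, since the logarithm is monotone.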
How Well (and WHY) DOES NBC WORK?
• Naïve Bayes classifier is darned easy to implement
  • Good learning speed, classification speed
  • Modest space storage
  • Supports incrementality
• It seems to work very well in many scenarios
  • Lots of recommender systems (e.g. Amazon books recommender) use it
  • Peter Norvig, the director of Machine Learning at GOOGLE, said, when asked about what sort of technology they use: “Naïve Bayes”
• But WHY?
  • NBC’s estimate of class probability is quite bad
  • BUT classification accuracy is different from probability estimate accuracy
  • [Domingos/Pazzani; 1996] analyze this
Decision Surface Learning (aka Neural Network Learning)
• Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces?
• Mathematically, a surface is a function
  • Need a way of learning functions
  • “Threshold units”
A “Neural Net” is a collection of threshold units with interconnections
• Threshold unit: output = 1 if w1I1 + w2I2 > k, = 0 otherwise (the threshold function may also be differentiable)
• Feed Forward: uni-directional connections
  • Single Layer: any linear decision surface can be represented by a single layer neural net
  • Multi-Layer: any “continuous” decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net
• Recurrent: bi-directional connections
  • Can act as associative memory
The “Brain” Connection
• A threshold unit …is sort of like a neuron
• Threshold functions (may be differentiable)
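Perceptron Networks
What happened to the “Threshold”? --Can model it as an extra weight with static input
• A unit with inputs I1, I2, weights w1, w2 and threshold t=k == a unit with inputs I0=-1, I1, I2, weights w0=k, w1, w2 and threshold t=0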
Perceptron Learning
• Perceptron learning algorithm
  Loop through training examples
  • If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the jth input unit by a*Ij, where Ij is the jth input value and a is a learning rate
  • If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the jth input unit by a*Ij
  • Otherwise, do nothing
  Until “convergence”
• Iterative search! --node -> network weights --goodness -> error
• Actually a “gradient descent” search
A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html
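The update rule above can be sketched as follows for a single threshold unit. The threshold is folded in as an extra weight on a static input of -1; the function names, the zero initialization, and the `max_epochs` cap standing in for "until convergence" are assumptions of this sketch.

```python
def predict(weights, inputs):
    """Threshold unit: weights[0] is the threshold, paired with static input -1."""
    total = sum(w * x for w, x in zip(weights, [-1.0] + list(inputs)))
    return 1 if total > 0 else 0

def train_perceptron(examples, n_inputs, alpha=0.1, max_epochs=1000):
    """examples: list of (inputs, target) with target in {0, 1}."""
    weights = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - predict(weights, inputs)   # -1, 0, or +1
            if error:
                mistakes += 1
                full = [-1.0] + list(inputs)
                # output 1 but should be 0: reduce by alpha*Ij; the reverse: increase
                weights = [w + alpha * error * x for w, x in zip(weights, full)]
        if mistakes == 0:       # a full pass with no updates: converged
            break
    return weights

if __name__ == "__main__":
    and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = train_perceptron(and_examples, n_inputs=2)
    print([predict(w, x) for x, _ in and_examples])
```

On a function that is not linearly separable, such as XOR, the loop simply runs until `max_epochs` without ever classifying all examples correctly.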
Perceptron Learning as Gradient Descent Search in the weight-space
• Often a constant learning rate parameter is used instead
Can Perceptrons Learn All Boolean Functions? --Are all boolean functions linearly separable?
Comparing Perceptrons and Decision Trees in Majority Function and Russell Domain
[Figure: decision trees vs. perceptron; panels: majority function, Russell domain]
• Majority function is linearly separable.. Russell domain is apparently not....
• Encoding: one input unit per attribute. The unit takes as many distinct real values as the size of the attribute domain
Max-Margin Classification & Support Vector Machines
• Any line that separates the +ve & –ve examples is a solution
  • And perceptron learning finds one of them
• But could we have a preference among these?
  • We may want to get the line that provides the maximum margin (equidistant from the nearest +ve/-ve)
  • The nearest +ve and –ve examples holding up the line are called support vectors
• This changes the problem into an optimization one
  • Quadratic Programming can be used to directly find such a line
• Learning is Optimization after all!
Two ways to learn non-linear decision surfaces
• First transform the data into a higher dimensional space
  • Find a linear surface (which is guaranteed to exist)
  • Transform it back to the original space
  • TRICK is to do this without explicitly doing a transformation
• Learn non-linear surfaces directly (as multi-layer neural nets)
  • Trick is to do training efficiently
  • Back Propagation to the rescue..
Linear Separability in High Dimensions “Kernels” allow us to consider separating surfaces in high-D without first converting all points to high-D
Kernelized Support Vector Machines
• Turns out that it is not always necessary to first map the data into high-D, and then do linear separation
• The quadratic programming formulation for SVM winds up using only the pair-wise dot product of training vectors
• Dot product is a form of similarity metric between points
• If you replace that dot product by a suitable non-linear similarity function (a “kernel”), you will, in essence, be transforming data into some high-dimensional space and then finding the max-margin linear classifier in that space
  • Which will correspond to some wiggly surface in the original dimension
• The trick is to find the RIGHT similarity function
  • Which is a form of prior knowledge
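One concrete instance of the idea, offered as an illustrative example that the slides do not name: for 2-D inputs, the quadratic kernel (x·y)² gives the same value as an ordinary dot product after an explicit mapping into 3-D. The names `phi` and `poly_kernel` are made up for this sketch.

```python
import math

def dot(x, y):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Explicit feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Quadratic kernel computed in the original 2-D space."""
    return dot(x, y) ** 2

if __name__ == "__main__":
    x, y = (1.0, 2.0), (3.0, -1.0)
    # Same similarity value, with and without visiting the 3-D space.
    print(poly_kernel(x, y), dot(phi(x), phi(y)))
```

The kernel side never constructs the higher-dimensional vectors, which is what makes the approach practical when the implied space is very large.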
“Those who ignore easily available domain knowledge are doomed to re-learn it…” --Santayana’s brother
Domain-knowledge & Learning
• Classification learning is a problem addressed by both people from AI (machine learning) and Statistics
• Statistics folks tend to “distrust” domain-specific bias.
  • Let the data speak for itself…
  • ..but this is often futile. The very act of “describing” the data points introduces bias (in terms of the features you decided to use to describe them..)
• …but much human learning occurs because of strong domain-specific bias..
• Machine learning is torn by these competing influences..
  • In most current state of the art algorithms, domain knowledge is allowed to influence learning only through relatively narrow avenues/formats (e.g. through “kernels”)
  • Okay in domains where there is very little (if any) prior knowledge (e.g. what part of proteins are doing what cellular function)
  • ..restrictive in domains where there already exists human expertise..
Multi-layer Neural Nets
• How come back-prop doesn’t get stuck in local minima?
• One answer: It is actually hard for local minima to form in high-D, as the “trough” has to be closed in all dimensions
Multi-layer network learning can learn the Russell domain …but does it slowly…
[Figure: decision trees, multi-layer networks and perceptron compared on the Russell domain]
Practical Issues in Multi-layer network learning • For multi-layer networks, we need to learn both the weights and the network topology • Topology is fixed for perceptrons • If we go with too many layers and connections, we can get over-fitting as well as sloooow convergence • Optimal brain damage • Start with more than needed hidden layers as well as connections; after a network is learned, remove the nodes and connections that have very low weights; retrain
Humans make 0.2%; Neumans (postmen) make 2%
Other impressive applications: --no-hands across america --learning to speak
K-nearest-neighbor
• The test example’s class is determined by the class of the majority of its k nearest neighbors
• Need to define an appropriate distance measure
  --sort of easy for real valued vectors
  --harder for categorical attributes
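A minimal k-nearest-neighbor sketch, assuming training data is a list of `(feature_tuple, label)` pairs. Euclidean distance is used for real-valued vectors; the `hamming` function is one possible (assumed) choice for categorical attributes, not something the slides prescribe.

```python
from collections import Counter
import math

def hamming(a, b):
    """Number of attribute positions where two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(training, query, k=3, distance=math.dist):
    """Majority label among the k training examples closest to the query."""
    neighbors = sorted(training, key=lambda ex: distance(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

if __name__ == "__main__":
    points = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
              ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
    print(knn_classify(points, (0.2, 0.1), k=3))
```

The choice of distance measure is where the bias of this learner lives: swapping `distance` changes which examples count as "near" without touching the voting step.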
True hypothesis eventually dominates… the probability of indefinitely producing uncharacteristic data → 0
Bayesian prediction is optimal (Given the hypothesis prior, all other predictions are less likely)
Also, remember the Economist article that shows that humans have strong priors..
..note that the Economist article says humans are able to learn from few examples only because of priors..
So, BN learning is just probability estimation! (as long as data is complete!)