LING / C SC 439/539 Statistical Natural Language Processing • Lecture 11, Part 1 • 2/18/2013
Recommended reading • Ensemble learning, cross-validation • Marsland Chapter 7, on web page • Hastie 7.10-7.11, 8.7
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Ensemble learning • Various techniques for improving performance over the results of a single basic classification algorithm • Use multiple classifiers • Select subsets of the training set • Weight training set points • Algorithms • Cross-validation • Voting • Bagging • Boosting
Ensemble learning and model selection • Available models • Range of possible parameterizations of the model • Choice of learning algorithms to combine • How well the model fits the data • Separate points in training data • Generalize to new data • Balance simplicity of model and fit to data • Noisy data → more robustness to noisy data • Separability → more complicated decision boundary • Maximum margin • Computational issues • Not substantially more than the original algorithms
Example of ensemble learning:Netflix prize • http://en.wikipedia.org/wiki/Netflix_Prize • Collaborative filtering / recommender system: predict which movies a user will like, given ratings that the user assigned to other movies • In 2006, Netflix offered $1,000,000 to anyone who could improve on their system by 10% • Training set: • 100,480,507 ratings (1 to 5 stars) • 480,189 users • 17,700 movies
Netflix prize • Winner in 2009, 10.09% improvement • Teams with highest improvements used ensemble learning extensively, joining forces with other teams & combining together different algorithms • http://www.wired.com/epicenter/2009/09/how-the-netflix-prize-was-won/ • http://www.quora.com/Netflix-Prize/Is-there-any-summary-of-top-models-for-the-Netflix-prize
Netflix prize • Finally, let's talk about how all of these different algorithms were combined to provide a single rating that exploited the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.) • In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors. • Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data.
Effect of ensemble size (from early in the competition, in 2007)
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Decision boundaries and overfitting • Simple • Hyperplane (perceptron, SVM) • Complicated • Decision Tree • Neural network • K-nearest neighbors, especially when k is small • Algorithms that learn complicated decision boundaries have a tendency to overfit the training data
Overfitting • Overfitting: algorithm learns about noise and specific data points in the training set, instead of the general pattern • Error rate on training set is minimized, but error rate on test set increases
Validation set • Problem: easy to overfit to training data • Want to minimize error on training in order to learn statistical characteristics of data • But this may cause increased error in testing • Solution: reserve portion of training data as validation set • Apply classifier to validation set during training • Now you have an “external” measure of the quality of the model • Problem: you could overfit on the validation set if you repeatedly optimize your classifier against it
Better solution: cross-validation • K-fold cross-validation: • Split the training data into K equally-sized sets (folds) • Example: 6 folds (in the figure, blue = training folds, yellow = validation fold) • Train a classifier on K-1 folds, test on the remaining fold • This produces K different classifiers • Calculate average performance over all folds • Less likely to overfit, compared to using a single validation set
Cross-validation does not produce a new classifier • It gives a more honest measure of the performance of your classifier, by training and testing over different subsets of your data
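A minimal sketch of K-fold cross-validation, assuming scikit-learn's Perceptron as the classifier and NumPy for the fold bookkeeping; the toy data at the bottom is invented for illustration:

```python
# K-fold cross-validation sketch: train K classifiers, each evaluated on a different held-out fold.
import numpy as np
from sklearn.linear_model import Perceptron

def k_fold_cv(X, y, k=6):
    """Return the average validation accuracy over K folds."""
    n = len(y)
    indices = np.random.permutation(n)       # shuffle before splitting
    folds = np.array_split(indices, k)       # K roughly equal folds
    scores = []
    for i in range(k):
        val_idx = folds[i]                                   # one fold for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        clf = Perceptron().fit(X[train_idx], y[train_idx])   # train on K-1 folds
        scores.append(clf.score(X[val_idx], y[val_idx]))     # test on the held-out fold
    return np.mean(scores)                                   # average over all folds

# Toy usage: 200 random 2-D points with a linearly separable labeling.
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("6-fold CV accuracy:", k_fold_cv(X, y, k=6))
```

Only the averaged score is returned, in line with the point above: cross-validation measures how well the classifier generalizes; it does not itself produce a new classifier.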
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Use multiple classifiers • Different classifiers make different errors • Each classifier has its own “opinion” about the data • Obtain results from multiple classifiers trained for the same problem • When combining multiple classifiers trained on the same data set, performance is often better than for a single classifier
Source of classifiers • Could use classifiers constructed by different algorithms • SVM • Perceptron • Neural network • etc. • Or could use classifiers trained on different folds or subsets of the training data
Classifier combination through voting • Simplest way to combine results from multiple classifiers: majority rules • Example • Have 10 classifiers and an instance to classify. • 7 say “Yes” and 3 say “No”. • Choose “Yes” as your classification. • Minor improvements: • Attach a confidence score to each classification • Availability of “confidence” depends on choice of classifier • Skip cases where classifiers tend to disagree • 50% “Yes”, 50% “No”
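As an illustration of "majority rules", here is a small vote-combiner sketch; the `classifiers` argument is hypothetical and stands for any list of already-trained models with a `.predict` method:

```python
# Majority voting over several trained classifiers.
import numpy as np

def majority_vote(classifiers, X):
    """Return, for each row of X, the label predicted by the most classifiers."""
    # votes[i, j] = prediction of classifier i on example j
    votes = np.array([clf.predict(X) for clf in classifiers])
    combined = []
    for j in range(votes.shape[1]):
        labels, counts = np.unique(votes[:, j], return_counts=True)
        combined.append(labels[np.argmax(counts)])   # most common label wins
    return np.array(combined)
```

With the slide's example of 10 classifiers where 7 say "Yes" and 3 say "No", this combiner returns "Yes"; a confidence-weighted or tie-skipping variant would modify only the counting step.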
Voting works better than individual classifiers • Performance of voting is often better than for individual classifiers • Why: more complicated decision boundaries • An algorithm’s learning bias predisposes it to acquire particular types of decision boundaries • Additive effect of multiple classifiers allows one to determine which classifications are more likely to be correct, and which ones are incorrect
Learning bias: classifier is predisposed to learn certain types of decision boundaries • Perceptron: hyperplane • SVM: • Hyperplane with max margin, in kernel feature space • May be nonlinear in original feature space • Decision tree: • Hierarchical splits • Each boundary segment is a constant along one feature axis, over a limited range • Neural network: any smooth function • K-nearest neighbors: • Shape of decision boundary depends on K and the choice of distance function
Figures: Classifier A and Classifier B each split the space into YES and NO regions with a different decision boundary • Learn the correct decision boundary through majority vote of the 2 classifiers: regions receive 0, 1, or 2 YES votes
Complicated decision boundaries from classifier combination and voting
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Two components of bagging • 1. Classifier combination with majority vote • Previous section • 2. Each classifier is trained on a set of bootstrap samples from the training data • Bootstrap: sample data set with replacement • For example: data set has 1,000 training points. • Sample 1,000 (or some other quantity) points, with replacement • It’s likely that the sampled data set will not contain all of the data in the original data set • It’s likely that the sampled data set will contain multiple instances of data points of the original data set
Train on a set of bootstrap samples • Suppose there are N instances in the training set • Generate one bootstrap sample: • Randomly sample from the training set, with replacement (i.e., data may be repeated) • Sample N times to build a new training set • Then train a classifier on each bootstrap sample • Can use the same algorithm for every classifier, or different algorithms
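A minimal bagging sketch under these assumptions, using scikit-learn decision stumps (`max_depth=1`) as the base learner and NumPy's `random.choice` for the bootstrap sampling; in practice scikit-learn's `BaggingClassifier` packages the same idea:

```python
# Bagging sketch: train one weak classifier per bootstrap sample, combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_stumps(X, y, n_classifiers=25):
    """Train one decision stump per bootstrap sample of the training set."""
    n = len(y)
    ensemble = []
    for _ in range(n_classifiers):
        idx = np.random.choice(n, size=n, replace=True)   # bootstrap: sample N points with replacement
        stump = DecisionTreeClassifier(max_depth=1)       # weak learner: root-node split only
        stump.fit(X[idx], y[idx])
        ensemble.append(stump)
    return ensemble

def bagged_predict(ensemble, X):
    """Majority vote over the bagged stumps (binary 0/1 labels assumed)."""
    votes = np.mean([clf.predict(X) for clf in ensemble], axis=0)
    return (votes > 0.5).astype(int)
```

Because each stump sees a different resampled data set, the stumps make different errors, which is what lets the majority vote outperform any single stump.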
Weak classifiers • Classifier combination can perform well even if individual classifiers are “weak” (= perform relatively poorly) • Example: stump classifier • Construct the root node of a decision tree by splitting on one feature • Do not construct rest of tree • Generates a constant decision boundary on one feature axis
Last time: full decision tree (augmented with the number of training cases at each leaf: 5, 1, 2, 1, 1)
Example of a stump • In the full decision tree, the classes of data under Party = No were: Study (2), Watch (1), Pub (1), TV (1) • Stump: assign the majority class as the label for each leaf (Party = Yes → Go to Party; Party = No → Study) • This stump performs poorly: it can never predict Pub or TV
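A tiny sketch of a categorical stump along these lines; the toy data below only imitates the slide's Party example, and its exact counts are an assumption:

```python
# A decision stump on one categorical feature: each feature value becomes a leaf
# labeled with the majority class of the examples that reach it.
from collections import Counter

def train_stump(examples, feature):
    """examples: list of (feature_dict, label) pairs; returns {feature value: majority label}."""
    by_value = {}
    for feats, label in examples:
        by_value.setdefault(feats[feature], []).append(label)
    return {value: Counter(labels).most_common(1)[0][0]   # majority class per leaf
            for value, labels in by_value.items()}

# Toy data resembling the slide: under Party = No, "Study" is the majority,
# so the stump can never predict Pub or TV.
data = [({"Party": "Yes"}, "Go to Party")] * 5 + \
       [({"Party": "No"}, "Study")] * 2 + \
       [({"Party": "No"}, "Pub"), ({"Party": "No"}, "TV")]
print(train_stump(data, "Party"))   # {'Yes': 'Go to Party', 'No': 'Study'}
```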
How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer?
k: number of succeeding classifiers • T different classifiers • The ensemble gets the right answer if more than half of the individual classifiers succeed, i.e., k ≥ ⌊T/2⌋ + 1 • Calculations to be done • What is the probability that exactly k of the T classifiers succeed? • What is the probability that at least k = ⌊T/2⌋ + 1 classifiers succeed?
1. Probability that exactly k classifiers succeed • We have T total classifiers • k succeed, each with probability p • T − k fail, each with probability (1 − p) • There are C(T, k) different combinations of which classifiers succeed and which ones fail • Probability that exactly k out of T classifiers succeed: P(k) = C(T, k) p^k (1 − p)^(T−k) (this is a binomial distribution)
Combinations (http://en.wikipedia.org/wiki/Combination) • C(T, k) is the number of subsets of size k from a set of size T; it is given by the binomial coefficient C(T, k) = T! / (k! (T − k)!) • Example: the number of subsets of size 3 from a set of 5 elements is C(5, 3) = 10
2. Probability that the ensemble succeeds • The ensemble is correct when between k = ⌊T/2⌋ + 1 and k = T classifiers succeed, so sum the probabilities over all k in this range: P(ensemble correct) = Σ_{k=⌊T/2⌋+1}^{T} C(T, k) p^k (1 − p)^(T−k) • If p > 0.5, this sum approaches 1 as T → ∞ • i.e., even if the individual classifiers perform badly (as long as p > 0.5), the more classifiers we have, the more likely the ensemble is to classify correctly
How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer? • Answer: p > 0.5 !!!
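A quick numeric check of this answer, summing the binomial probabilities from the previous slides (and assuming, as above, that all T classifiers succeed independently with the same rate p):

```python
# Probability that a majority vote of T independent classifiers is correct.
from math import comb

def ensemble_success(p, T):
    """P(more than half of T classifiers, each correct with probability p, are correct)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (11, 101, 1001):
    print(T, round(ensemble_success(0.6, T), 4))
# With p = 0.6, the values are roughly 0.75, 0.98, and 1.0 as T grows.
```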
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Boosting: I • Each training example has a weight. • Boosting learns a series of classifiers over a number of iterations. • At each iteration: • Learn a classifier according to weighted examples • Compute misclassified examples • Increase weight for misclassified examples • Gives more priority to these for next iteration of training
Boosting: II • Boosting learns an ensemble of classifiers, one for each training iteration • Each classifier is assigned a weight, according to how well it does • Classification of the ensemble is determined by applying weights to classifications of individual classifiers
Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • Learns a stump classifier at each iteration
Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • εt: weighted error rate of classifier t • αt: weight assigned to classifier t
Final classifier: compute the weighted sum of the outputs of the individual classifiers, using the αt’s • Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf
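Putting the pieces together, here is a minimal AdaBoost-style sketch in the slides' notation (εt = weighted error rate, αt = classifier weight); using scikit-learn decision stumps as the weak learner and ±1 labels are assumptions for this sketch:

```python
# AdaBoost sketch: reweight misclassified examples each round, combine stumps by an alpha-weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must use +1/-1 labels. Returns the stumps and their alpha weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)              # learn a classifier on the weighted examples
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])                    # eps_t: weighted error rate (w sums to 1)
        eps = min(max(eps, 1e-10), 1 - 1e-10)         # guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t: weight assigned to this classifier
        w *= np.exp(-alpha * y * pred)                # increase weights of misclassified examples
        w /= w.sum()                                  # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final classifier: sign of the alpha-weighted sum of the stumps' outputs.
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, stumps))
    return np.sign(scores)
```

Each round, examples the current stump gets wrong have their weights multiplied by e^{αt}, so the next stump concentrates on them; accurate stumps (small εt) receive large αt and dominate the final weighted vote.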