LING / C SC 439/539 Statistical Natural Language Processing • Lecture 11, Part 1 • 2/18/2013
Recommended reading • Ensemble learning, cross-validation • Marsland Chapter 7, on web page • Hastie 7.10-7.11, 8.7
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Ensemble learning • Various techniques for improving performance over the results of a single basic classification algorithm • Use multiple classifiers • Select subsets of the training set • Weight training set points • Algorithms • Cross-validation • Voting • Bagging • Boosting
Ensemble learning and model selection • Available models • Range of possible parameterizations of the model • Choice of learning algorithms to combine • How well the model fits the data • Separate points in training data • Generalize to new data • Balance simplicity of model and fit to data • Noisy data → more robustness to noisy data • Separability → more complicated decision boundary • Maximum margin • Computational issues • Not substantially more than the original algorithms
Example of ensemble learning:Netflix prize • http://en.wikipedia.org/wiki/Netflix_Prize • Collaborative filtering / recommender system: predict which movies a user will like, given ratings that the user assigned to other movies • In 2006, Netflix offered $1,000,000 to anyone who could improve on their system by 10% • Training set: • 100,480,507 ratings (1 to 5 stars) • 480,189 users • 17,700 movies
Netflix prize • Winner in 2009, 10.09% improvement • Teams with highest improvements used ensemble learning extensively, joining forces with other teams & combining together different algorithms • http://www.wired.com/epicenter/2009/09/how-the-netflix-prize-was-won/ • http://www.quora.com/Netflix-Prize/Is-there-any-summary-of-top-models-for-the-Netflix-prize
Netflix prize • Finally, let's talk about how all of these different algorithms were combined to provide a single rating that exploited the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.) • In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors. • Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data.
Effect of ensemble size (from early in the competition, in 2007)
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Decision boundaries and overfitting • Simple • Hyperplane (perceptron, SVM) • Complicated • Decision Tree • Neural network • K-nearest neighbors, especially when k is small • Algorithms that learn complicated decision boundaries have a tendency to overfit the training data
Overfitting • Overfitting: algorithm learns about noise and specific data points in the training set, instead of the general pattern • Error rate on training set is minimized, but error rate on test set increases
Validation set • Problem: easy to overfit to training data • Want to minimize error on training in order to learn statistical characteristics of data • But this may cause increased error in testing • Solution: reserve portion of training data as validation set • Apply classifier to validation set during training • Now you have an “external” measure of the quality of the model • Problem: you could overfit on the validation set if you repeatedly optimize your classifier against it
Better solution: cross-validation • K-fold cross-validation: • Split the training data into K equally-sized sets (folds) • Example: 6 folds (in the figure, blue = training folds, yellow = validation fold) • Train a classifier on K-1 folds, test on the remaining fold • This produces K different classifiers • Calculate average performance over all folds • Less likely to overfit, compared to using a single validation set
Cross-validation does not produce a new classifier • It gives a more honest measure of the performance of your classifier, by training and testing over different subsets of your data
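A minimal sketch of K-fold cross-validation, assuming scikit-learn's Perceptron as the classifier and NumPy for the fold bookkeeping; the toy data at the bottom is invented for illustration:

```python
# K-fold cross-validation sketch: train K classifiers, each evaluated on a different held-out fold.
import numpy as np
from sklearn.linear_model import Perceptron

def k_fold_cv(X, y, k=6):
    """Return the average validation accuracy over K folds."""
    n = len(y)
    indices = np.random.permutation(n)       # shuffle before splitting
    folds = np.array_split(indices, k)       # K roughly equal folds
    scores = []
    for i in range(k):
        val_idx = folds[i]                                   # one fold for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        clf = Perceptron().fit(X[train_idx], y[train_idx])   # train on K-1 folds
        scores.append(clf.score(X[val_idx], y[val_idx]))     # test on the held-out fold
    return np.mean(scores)                                   # average over all folds

# Toy usage: 200 random 2-D points with a linearly separable labeling.
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print("6-fold CV accuracy:", k_fold_cv(X, y, k=6))
```

Only the averaged score is returned, in line with the point above: cross-validation measures how well the classifier generalizes; it does not itself produce a new classifier.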
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Use multiple classifiers • Different classifiers make different errors • Each classifier has its own “opinion” about the data • Obtain results from multiple classifiers trained for the same problem • When combining multiple classifiers trained on the same data set, performance is often better than for a single classifier
Source of classifiers • Could use classifiers constructed by different algorithms • SVM • Perceptron • Neural network • etc. • Or could use classifiers trained on different folds or subsets of the training data
Classifier combination through voting • Simplest way to combine results from multiple classifiers: majority rules • Example • Have 10 classifiers and an instance to classify. • 7 say “Yes” and 3 say “No”. • Choose “Yes” as your classification. • Minor improvements: • Attach a confidence score to each classification • Availability of “confidence” depends on choice of classifier • Skip cases where classifiers tend to disagree • 50% “Yes”, 50% “No”
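As an illustration of "majority rules", here is a small vote-combiner sketch; the `classifiers` argument is hypothetical and stands for any list of already-trained models with a `.predict` method:

```python
# Majority voting over several trained classifiers.
import numpy as np

def majority_vote(classifiers, X):
    """Return, for each row of X, the label predicted by the most classifiers."""
    # votes[i, j] = prediction of classifier i on example j
    votes = np.array([clf.predict(X) for clf in classifiers])
    combined = []
    for j in range(votes.shape[1]):
        labels, counts = np.unique(votes[:, j], return_counts=True)
        combined.append(labels[np.argmax(counts)])   # most common label wins
    return np.array(combined)
```

With the slide's example of 10 classifiers where 7 say "Yes" and 3 say "No", this combiner returns "Yes"; a confidence-weighted or tie-skipping variant would modify only the counting step.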
Voting works better than individual classifiers • Performance of voting is often better than for individual classifiers • Why: more complicated decision boundaries • An algorithm’s learning bias predisposes it to acquire particular types of decision boundaries • Additive effect of multiple classifiers allows one to determine which classifications are more likely to be correct, and which ones are incorrect
Learning bias: classifier is predisposed to learn certain types of decision boundaries • Perceptron: hyperplane • SVM: • Hyperplane with max margin, in kernel feature space • May be nonlinear in original feature space • Decision tree: • Hierarchical splits • Each boundary segment is a constant along one feature axis, over a limited range • Neural network: any smooth function • K-nearest neighbors: • Shape of decision boundary depends on K and the choice of distance function
Figures: Classifier A and Classifier B each split the space into YES and NO regions with a different decision boundary • Learn the correct decision boundary through majority vote of the 2 classifiers: regions receive 0, 1, or 2 YES votes
Complicated decision boundaries from classifier combination and voting
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Two components of bagging • 1. Classifier combination with majority vote • Previous section • 2. Each classifier is trained on a set of bootstrap samples from the training data • Bootstrap: sample data set with replacement • For example: data set has 1,000 training points. • Sample 1,000 (or some other quantity) points, with replacement • It’s likely that the sampled data set will not contain all of the data in the original data set • It’s likely that the sampled data set will contain multiple instances of data points of the original data set
Train on a set of bootstrap samples • Suppose there are N instances in the training set • Generate one bootstrap sample: • Randomly sample from the training set, with replacement (i.e., data may be repeated) • Sample N times to build a new training set • Then train a classifier on each bootstrap sample • Can use the same algorithm for every classifier, or different algorithms
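A minimal bagging sketch under these assumptions, using scikit-learn decision stumps (`max_depth=1`) as the base learner and NumPy's `random.choice` for the bootstrap sampling; in practice scikit-learn's `BaggingClassifier` packages the same idea:

```python
# Bagging sketch: train one weak classifier per bootstrap sample, combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_stumps(X, y, n_classifiers=25):
    """Train one decision stump per bootstrap sample of the training set."""
    n = len(y)
    ensemble = []
    for _ in range(n_classifiers):
        idx = np.random.choice(n, size=n, replace=True)   # bootstrap: sample N points with replacement
        stump = DecisionTreeClassifier(max_depth=1)       # weak learner: root-node split only
        stump.fit(X[idx], y[idx])
        ensemble.append(stump)
    return ensemble

def bagged_predict(ensemble, X):
    """Majority vote over the bagged stumps (binary 0/1 labels assumed)."""
    votes = np.mean([clf.predict(X) for clf in ensemble], axis=0)
    return (votes > 0.5).astype(int)
```

Because each stump sees a different resampled data set, the stumps make different errors, which is what lets the majority vote outperform any single stump.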
Weak classifiers • Classifier combination can perform well even if individual classifiers are “weak” (= perform relatively poorly) • Example: stump classifier • Construct the root node of a decision tree by splitting on one feature • Do not construct rest of tree • Generates a constant decision boundary on one feature axis
Last time: full decision tree (augmented with the number of training cases at each leaf: 5, 1, 2, 1, 1)
Example of a stump • In the full decision tree, the classes of data under Party = No were: Study (2), Watch (1), Pub (1), TV (1) • Stump: assign the majority class as the label for each leaf (Party = Yes → Go to Party; Party = No → Study) • This stump performs poorly: it can never predict Pub or TV
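A tiny sketch of a categorical stump along these lines; the toy data below only imitates the slide's Party example, and its exact counts are an assumption:

```python
# A decision stump on one categorical feature: each feature value becomes a leaf
# labeled with the majority class of the examples that reach it.
from collections import Counter

def train_stump(examples, feature):
    """examples: list of (feature_dict, label) pairs; returns {feature value: majority label}."""
    by_value = {}
    for feats, label in examples:
        by_value.setdefault(feats[feature], []).append(label)
    return {value: Counter(labels).most_common(1)[0][0]   # majority class per leaf
            for value, labels in by_value.items()}

# Toy data resembling the slide: under Party = No, "Study" is the majority,
# so the stump can never predict Pub or TV.
data = [({"Party": "Yes"}, "Go to Party")] * 5 + \
       [({"Party": "No"}, "Study")] * 2 + \
       [({"Party": "No"}, "Pub"), ({"Party": "No"}, "TV")]
print(train_stump(data, "Party"))   # {'Yes': 'Go to Party', 'No': 'Study'}
```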
How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer?
k: number of succeeding classifiers • T different classifiers • The ensemble gets the right answer if more than half of the individual classifiers succeed, i.e., k ≥ ⌊T/2⌋ + 1 • Calculations to be done • What is the probability that exactly k of the T classifiers succeed? • What is the probability that at least k = ⌊T/2⌋ + 1 classifiers succeed?
1. Probability that exactly k classifiers succeed • We have T total classifiers • k succeed, each with probability p • T − k fail, each with probability (1 − p) • There are C(T, k) different combinations of which classifiers succeed and which ones fail • Probability that exactly k out of T classifiers succeed: P(k) = C(T, k) p^k (1 − p)^(T−k) (this is a binomial distribution)
Combinations (http://en.wikipedia.org/wiki/Combination) • C(T, k) is the number of subsets of size k from a set of size T; it is given by the binomial coefficient C(T, k) = T! / (k! (T − k)!) • Example: the number of subsets of size 3 from a set of 5 elements is C(5, 3) = 10
2. Probability that the ensemble succeeds • The ensemble is correct when between k = ⌊T/2⌋ + 1 and k = T classifiers succeed, so sum the probabilities over all k in this range: P(ensemble correct) = Σ_{k=⌊T/2⌋+1}^{T} C(T, k) p^k (1 − p)^(T−k) • If p > 0.5, this sum approaches 1 as T → ∞ • i.e., even if the individual classifiers perform badly (as long as p > 0.5), the more classifiers we have, the more likely the ensemble is to classify correctly
How weak can classifiers be? • Suppose there are T classifiers. • Let p be the success rate of a classifier. • Assume each classifier has the same success rate. • What value does p need to be for the entire ensemble of T classifiers to get the correct answer? • Answer: p > 0.5 !!!
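A quick numeric check of this answer, summing the binomial probabilities from the previous slides (and assuming, as above, that all T classifiers succeed independently with the same rate p):

```python
# Probability that a majority vote of T independent classifiers is correct.
from math import comb

def ensemble_success(p, T):
    """P(more than half of T classifiers, each correct with probability p, are correct)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(T // 2 + 1, T + 1))

for T in (11, 101, 1001):
    print(T, round(ensemble_success(0.6, T), 4))
# With p = 0.6, the values are roughly 0.75, 0.98, and 1.0 as T grows.
```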
Outline • Ensemble learning • Cross-validation • Voting • Bagging • Boosting
Boosting: I • Each training example has a weight. • Boosting learns a series of classifiers over a number of iterations. • At each iteration: • Learn a classifier according to weighted examples • Compute misclassified examples • Increase weight for misclassified examples • Gives more priority to these for next iteration of training
Boosting: II • Boosting learns an ensemble of classifiers, one for each training iteration • Each classifier is assigned a weight, according to how well it does • Classification of the ensemble is determined by applying weights to classifications of individual classifiers
Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • Learns a stump classifier at each iteration
Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf • εt: weighted error rate of classifier t • αt: weight assigned to classifier t
Final classifier: compute the weighted sum of the outputs of the individual classifiers, using the αt’s • Source: cbio.mskcc.org/~aarvey/BoostingLightIntro.pdf
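Putting the pieces together, here is a minimal AdaBoost-style sketch in the slides' notation (εt = weighted error rate, αt = classifier weight); using scikit-learn decision stumps as the weak learner and ±1 labels are assumptions for this sketch:

```python
# AdaBoost sketch: reweight misclassified examples each round, combine stumps by an alpha-weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must use +1/-1 labels. Returns the stumps and their alpha weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)              # learn a classifier on the weighted examples
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])                    # eps_t: weighted error rate (w sums to 1)
        eps = min(max(eps, 1e-10), 1 - 1e-10)         # guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t: weight assigned to this classifier
        w *= np.exp(-alpha * y * pred)                # increase weights of misclassified examples
        w /= w.sum()                                  # renormalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final classifier: sign of the alpha-weighted sum of the stumps' outputs.
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, stumps))
    return np.sign(scores)
```

Each round, examples the current stump gets wrong have their weights multiplied by e^{αt}, so the next stump concentrates on them; accurate stumps (small εt) receive large αt and dominate the final weighted vote.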