
Chapter 9

Chapter 9. Given a range of classification algorithms, which is the best? Some algorithms may be preferred because of their low complexity, ability to incorporate prior knowledge,….

jerrod

Presentation Transcript


  1. Chapter 9 • Given a range of classification algorithms, which is the best? • Some algorithms may be preferred because of their low complexity, ability to incorporate prior knowledge,…. • Principle of Occam’s Razor: given two classifiers that perform equally well on the training set, it is asserted that the simpler classifier may do better on the test set • This chapter focuses on mathematical foundations that do not depend on a particular classifier or learning algorithm • Bias and variance dilemma • Ensemble of classifiers (classifier combination) • Cross validation • Resampling

  2. No Free Lunch Theorem • Suppose we make no prior assumptions about the nature of the classification task. Can we expect any classification method to be superior or inferior overall? • No Free Lunch Theorem: the answer to the above question is no • If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one algorithm over others • If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to that particular pattern recognition problem • For a new classification problem, what matters most: prior information, data distribution, size of training set, cost function

  3. No Free Lunch Theorem • It is the assumptions about the learning algorithm that are important • Even popular algorithms will perform poorly on some problems, where the learning algorithm and data distribution do not match well • In practice, experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems

  4. Ugly Duckling Theorem • In the absence of assumptions, there is no best feature representation for all problems • The similarity between patterns is fundamentally based on implicit assumptions about the problem domain • Consider the example where features x and y represent blind_in_right_eye and blind_in_left_eye, respectively. If we base similarity on shared features, person P1 = {1, 0} (blind only in the right eye) is maximally different from person P2 = {0, 1} (blind only in the left eye). In this scheme P1 is more similar to a totally blind person and to a normally sighted person than he is to P2! • There may be situations where we want P1 to be more similar to P2, since such persons may be able to drive an automobile!
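
The shared-feature comparison in the blindness example can be checked with a few lines of Python. This is only an illustrative sketch: the dictionary keys and the `shared_features` helper are names assumed here, not part of the slides, and similarity is taken to be the number of feature positions on which two people agree.

```python
# Assumed binary encoding: (blind_in_right_eye, blind_in_left_eye)
people = {
    "P1": (1, 0),        # blind only in the right eye
    "P2": (0, 1),        # blind only in the left eye
    "blind": (1, 1),     # totally blind
    "sighted": (0, 0),   # normally sighted
}

def shared_features(a, b):
    """Count the feature positions on which two binary vectors agree."""
    return sum(1 for u, v in zip(a, b) if u == v)

# Similarity of P1 to every other person under this representation.
similarity_to_p1 = {
    name: shared_features(people["P1"], vec)
    for name, vec in people.items()
    if name != "P1"
}
print(similarity_to_p1)
```

Under this encoding P1 agrees with P2 on no feature but agrees with each of the other two people on one feature. Re-encoding the same facts, for example with a single hypothetical feature "blind in exactly one eye", would make P1 and P2 identical, so the similarity ranking follows from the chosen representation rather than from the data alone.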

  5. Bias and Variance • No “best classifier” in general • Necessity for exploring a variety of methods • How to evaluate if the learning algorithm “matches” the classification problem • Bias: measures the quality of the match • High bias implies a poor match • Variance: measures the specificity of the match • High variance implies a weak match • Bias and variance are not independent of each other

  6. Bias and Variance • Given a true function F(x) and an estimated function g(x; D) obtained from a training set D • The function g depends on the training set D; each training set gives a different estimate and a different error in the fit • Taking the average over all training sets of size n, the MSE is $E_D[(g(x;D) - F(x))^2] = (E_D[g(x;D)] - F(x))^2 + E_D[(g(x;D) - E_D[g(x;D)])^2]$ • Left-hand side: the average error that g(x; D) makes in fitting F(x) • First term (bias squared): the difference between the expected value and the true value • Second term (variance): the difference between the observed value and the expected value • Low bias: on average, we will accurately estimate F from D • Low variance: the estimate of F does not change much with different D

  7. Bias-Variance Dilemma in Regression • Each column is a different model; each row is a different dataset of 6 points • Col 1: Poor fixed linear model; high bias, zero variance • Col 2: Slightly better fixed linear model; lower (but still high) bias, zero variance • Col 3: Learned cubic model; low bias, moderate variance • Col 4: Learned linear model; intermediate bias and variance • Histograms of the mean-squared error of the fit
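
The regression comparison can be reproduced numerically. The sketch below is not the figure from the slides: the target F(x) = x², the noise level, the fixed model with slope 0.5, and the choice of keeping the six input locations fixed while only the noise changes between datasets are all assumptions made here to keep the example small. For each model it estimates the squared bias, the variance, and the mean-squared error over many datasets; the error splits into squared bias plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 6)    # six training inputs (assumed fixed design)
x_test = np.linspace(-1.0, 1.0, 50)    # points at which each fit is evaluated

def true_fn(x):
    # Assumed target function F(x); the slides do not specify one.
    return x ** 2

def sample_dataset(noise=0.1):
    """One training set D: noisy observations of F at the six inputs."""
    return x_train, true_fn(x_train) + rng.normal(0.0, noise, size=x_train.size)

def fixed_model(x, y):
    # Fixed linear model: ignores the training data entirely.
    return 0.5 * x_test

def learned_poly(degree):
    def fit(x, y):
        return np.polyval(np.polyfit(x, y, degree), x_test)
    return fit

def bias_variance(fit, n_datasets=500):
    """Estimate (bias^2, variance, MSE), averaged over x_test."""
    preds = np.array([fit(*sample_dataset()) for _ in range(n_datasets)])
    mean_pred = preds.mean(axis=0)                      # E_D[g(x; D)]
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    mse = np.mean((preds - true_fn(x_test)) ** 2)
    return bias_sq, variance, mse

results = {
    "fixed linear": bias_variance(fixed_model),
    "learned linear": bias_variance(learned_poly(1)),
    "learned cubic": bias_variance(learned_poly(3)),
}
for name, (b, v, m) in results.items():
    print(f"{name:15s} bias^2={b:.4f} variance={v:.4f} mse={m:.4f}")
```

With these assumptions the fixed model has essentially zero variance but a large bias, while the cubic trades a near-zero bias for a larger variance than the learned line.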

  8. Bias-Variance Dilemma • Procedures with increased flexibility to adapt to training data have lower bias but higher variance • Large number of parameters • Fit the data well: low bias, but high variance • Inflexible procedures have higher bias but lower variance • Fewer parameters • May not fit the data well: high bias, but low variance • A large amount of training data generally improves estimation performance if the model is sufficiently general to represent the target function • Bias/variance considerations recommend gathering as much prior information about the problem as possible to find the best match for the classifier, and as large a dataset as possible to reduce the variance • We can virtually never get zero bias and zero variance

  9. Bias and Variance for Classification • Low variance is more important for accurate classification than low boundary bias • Classifiers with large flexibility to adapt to training data (more free parameters) tend to have low bias but high variance • Example: 2-class problem; 2D Gaussian distributions with diagonal covariances; a small number of training samples is used to estimate the parameters of 3 different models • For the best classification given a small training set, the model needs to match the true distributions

  10. Ensemble-based Systems in Decision Making • For many tasks, we often seek a second opinion before making a decision, sometimes many more • Consulting different doctors before a major surgery • Reading reviews before buying a product • Requesting references before hiring someone • We consider decisions of multiple experts in our daily lives • Why not follow the same strategy in automated decision making? • Multiple classifier systems, committee of classifiers, mixture of experts, ensemble based systems • Polikar R., “Ensemble Based Systems in Decision Making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006 • Polikar R., “Bootstrap Inspired Techniques in Computational Intelligence,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 56-72, 2007 • Polikar R., “Ensemble Learning,” Scholarpedia, 2008

  11. Ensemble-based Classifiers • Ensemble based systems provide favorable results compared to single-expert systems for a broad range of applications and under a variety of scenarios • Two questions: (i) how to generate the individual components of the ensemble system (base classifiers), and (ii) how to combine the outputs of the individual classifiers? • Popular ensemble based algorithms • Bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts • Commonly used combination rules • Algebraic combination of outputs, voting methods, behavior knowledge space, and decision templates

  12. Why Ensemble Based Systems? • Statistical reasons • A set of classifiers with similar training performances may have different generalization performances • Combining outputs of several classifiers reduces the risk of selecting a poorly performing classifier • Large volumes of data • If the amount of data to be analyzed is too large, a single classifier may not be able to handle it; train different classifiers on different partitions of data • Too little data • Ensemble systems can also be used when there is too little data; resampling techniques

  13. Why Ensemble Based Systems? • Divide and Conquer • Divide data space into smaller & easier-to-learn partitions; each classifier learns only one of the simpler partitions

  14. Why Ensemble Based Systems? • Data Fusion • Given several sets of data from various sources, where the nature of the features is different (heterogeneous features), training a single classifier may not be appropriate (e.g., MRI data, EEG recordings, blood tests, ...) • Applications in which data from different sources are combined are called data fusion applications • Ensembles have successfully been used for fusion • All ensemble systems must have two key components: • A method for generating the component classifiers of the ensemble • A method for combining the classifier outputs

  15. Brief History of Ensemble Systems • Dasarathy and Sheela (1979) partitioned the feature space using two or more classifiers • Schapire (1990) proved that a strong classifier can be generated by combining weak classifiers through boosting; predecessor of AdaBoost algorithm • Two types of combination: • classifier selection • Each classifier is trained to become an expert in some local area of the feature space; one or more local experts can be nominated to make the decision • classifier fusion • All classifiers are trained over the entire feature space; fusion involves merging the individual (weaker) classifiers to obtain a single (stronger) expert of superior performance

  16. Diversity of Ensemble • Objective: create many classifiers, and combine their outputs to improve on the performance of a single classifier • Intuition: if each classifier makes different errors, then their strategic combination can reduce the total error! • Need base classifiers whose decision boundaries are adequately different from those of others • Such a set of classifiers is said to be diverse • How to achieve classifier diversity? • Use different training sets to train individual classifiers • How to obtain different training sets? • Resampling techniques: bootstrapping or bagging, where training subsets are drawn randomly, usually with replacement, from the entire training set

  17. Sampling with Replacement • Random & overlapping training sets to train three classifiers; they are combined to obtain a more accurate classification
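
Drawing random, overlapping training sets with replacement can be sketched as follows. The function name `bootstrap_indices` and the sizes used are assumptions for illustration; each row of the result holds the indices of the training samples given to one classifier.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_indices(n_samples, n_replicas):
    """Return an (n_replicas, n_samples) array of indices drawn with replacement."""
    return rng.integers(0, n_samples, size=(n_replicas, n_samples))

n = 10
replicas = bootstrap_indices(n, n_replicas=3)   # one row per classifier
for k, idx in enumerate(replicas, start=1):
    distinct = np.unique(idx).size
    print(f"classifier {k}: {sorted(idx.tolist())} ({distinct} distinct of {n})")
```

Because sampling is with replacement, some samples repeat within a row and others are left out, which is what makes the three training sets differ. For large n, each replica contains about 63% of the distinct original samples on average.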

  18. Sampling without Replacement • Jackknife or k-fold data split: • The entire dataset is split into k blocks; each classifier is trained on a different subset of (k-1) blocks
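
A k-fold split of this kind might look like the sketch below. The generator name `k_fold_training_sets` and the example sizes are assumptions; the data is shuffled once, cut into k disjoint blocks, and each classifier receives all blocks except one.

```python
import numpy as np

def k_fold_training_sets(n_samples, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of the k classifiers."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(n_samples), k)   # k disjoint blocks
    for held_out in range(k):
        train = np.concatenate(
            [block for j, block in enumerate(blocks) if j != held_out]
        )
        yield np.sort(train), np.sort(blocks[held_out])

splits = list(k_fold_training_sets(n_samples=12, k=4))
for i, (train, held_out) in enumerate(splits, start=1):
    print(f"classifier {i}: trains on {train.tolist()}, omits {held_out.tolist()}")
```

No sample is repeated inside a training set, and any two training sets share k-2 of their k-1 blocks.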

  19. Other Approaches to Achieve Diversity • Use different training parameters for a classifier • A series of MLPs can be trained using different weight initializations, numbers of layers/nodes, etc. • Adjusting these parameters controls the instability of such classifiers (local minima) • A similar strategy can be used to generate different decision trees for the same problem • Different types of classifiers (MLPs, decision trees, NN classifiers, SVMs) can be combined for added diversity • Diversity can also be achieved by using random feature subsets, called the random subspace method

  20. Creating An Ensemble • Two questions: • How will the individual classifiers be generated? • How will they differ from each other? • Answer determines the diversity of classifiers & fusion performance • Seek to improve ensemble diversity by some heuristic methods

  21. Bagging • Bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms • It is also one of the most intuitive and simplest to implement, with surprisingly good performance • Use bootstrapped replicas of the training data; a large number (say 200) of training subsets are randomly drawn, with replacement, from the entire training data • Each resampled training set is used to train a different classifier of the same type • Individual classifiers are combined by taking a majority vote of their decisions • Bagging is appealing for small training sets; a relatively large portion of the samples is included in each subset
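
The bagging procedure can be sketched in a few functions. The base classifier here is a decision stump (a single threshold on a single feature), chosen only because it is short to write; the slides do not prescribe a base classifier. The function names, the two-class toy data, and the use of 25 rather than 200 replicas are likewise assumptions.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a decision stump for labels in {0, 1}: (feature, threshold, polarity)."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for polarity in (0, 1):
                pred = np.where(X[:, f] > thr, polarity, 1 - polarity)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, f, thr, polarity)
    return best[1:]

def predict_stump(stump, X):
    f, thr, polarity = stump
    return np.where(X[:, f] > thr, polarity, 1 - polarity)

def bagging_fit(X, y, n_classifiers=25, seed=0):
    """Train one stump per bootstrap replica of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)        # draw n samples with replacement
        ensemble.append(fit_stump(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Majority vote over the individual decisions (odd ensemble size avoids ties)."""
    votes = np.array([predict_stump(s, X) for s in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Assumed toy problem: two Gaussian features, nearly axis-aligned boundary.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

ensemble = bagging_fit(X, y)
accuracy = np.mean(bagging_predict(ensemble, X) == y)
print(f"training accuracy of the bagged ensemble: {accuracy:.3f}")
```

Every member uses the same learning rule; the only source of diversity is the bootstrap replica each one sees.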

  22. Bagging

  23. Variations of Bagging • Random Forests • So called because it is constructed from decision trees • A random forest is created from individual decision trees whose training parameters vary randomly • Such parameters can be bootstrapped replicas of the training data, as in bagging • But they can also be different feature subsets, as in random subspace methods

  24. Boosting • Boost the performance of a weak learner to the level of a strong one • Boosting creates an ensemble of classifiers by resampling the data; the classifiers are combined by majority voting • Resampling is strategically geared to provide the most informative training data for each consecutive classifier • Boosting creates three weak classifiers: • The first classifier C1 is trained with a random subset of the available training data • The training set for the second classifier C2 is chosen as the most informative subset, given C1; half of the training data for C2 is correctly classified by C1, the other half is misclassified by C1 • The third classifier C3 is trained on instances on which C1 and C2 disagree
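
The three-classifier scheme can be sketched as below. Several choices are assumptions made for this illustration and are not from the slides: decision stumps serve as the weak learner, C1 uses a random third of the data, the candidates for C2 and C3 are drawn from the whole training set, and the full training set is used as a fallback when a required subset happens to be empty.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a decision stump for labels in {0, 1}: (feature, threshold, polarity)."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for polarity in (0, 1):
                pred = np.where(X[:, f] > thr, polarity, 1 - polarity)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, f, thr, polarity)
    return best[1:]

def predict_stump(stump, X):
    f, thr, polarity = stump
    return np.where(X[:, f] > thr, polarity, 1 - polarity)

def boost_three(X, y, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    everything = np.arange(n)

    # C1: trained on a random subset of the available training data.
    idx1 = rng.choice(n, size=n // 3, replace=False)
    c1 = fit_stump(X[idx1], y[idx1])

    # C2: half of its training data is classified correctly by C1, half is not.
    c1_correct = predict_stump(c1, X) == y
    right, wrong = np.flatnonzero(c1_correct), np.flatnonzero(~c1_correct)
    m = min(len(right), len(wrong))
    if m > 0:
        idx2 = np.concatenate([rng.choice(right, m, replace=False),
                               rng.choice(wrong, m, replace=False)])
    else:
        idx2 = everything                       # fallback (assumption)
    c2 = fit_stump(X[idx2], y[idx2])

    # C3: trained on the instances on which C1 and C2 disagree.
    disagree = np.flatnonzero(predict_stump(c1, X) != predict_stump(c2, X))
    idx3 = disagree if len(disagree) > 0 else everything
    c3 = fit_stump(X[idx3], y[idx3])
    return c1, c2, c3

def boost_predict(classifiers, X):
    """Three-way majority vote."""
    votes = np.array([predict_stump(c, X) for c in classifiers])
    return (votes.sum(axis=0) >= 2).astype(int)

# Assumed toy problem: a diagonal boundary that no single stump can match.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

classifiers = boost_three(X, y)
accuracy = np.mean(boost_predict(classifiers, X) == y)
print(f"training accuracy of the three-classifier ensemble: {accuracy:.3f}")
```

When C1 and C2 agree, their common decision wins the vote; C3 only matters where they disagree, which is exactly the region it was trained on.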

  25. Boosting

  26. AdaBoost • AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems • AdaBoost generates a set of hypotheses (classifiers), and combines them through weighted majority voting of the classes predicted by the individual hypotheses • Hypotheses are generated by training a weak classifier; samples are drawn from an iteratively updated distribution of the training set • This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier • Consecutive classifiers are trained on increasingly hard-to-classify samples

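  27. AdaBoost • A weight distribution $D_t(i)$ is maintained on the training instances $x_i$, i = 1,…,N, from which the training data subsets $S_t$ are chosen for each consecutive classifier (hypothesis) $h_t$ • The error $\varepsilon_t$ of $h_t$ is the sum of the distribution weights of the instances that $h_t$ misclassifies • A normalized error is then obtained as $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, such that for $0 < \varepsilon_t < 1/2$ we have $0 < \beta_t < 1$ • Distribution update rule: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \beta_t$ if $h_t(x_i) = y_i$, and $D_{t+1}(i) = \frac{D_t(i)}{Z_t}$ otherwise, where $y_i$ is the true label of $x_i$ and $Z_t$ is a normalization constant that makes $D_{t+1}$ a distribution • The distribution weights of those instances that are correctly classified by the current hypothesis are reduced by a factor of $\beta_t$, whereas the weights of the misclassified instances are unchanged • After normalization, this raises the weights of instances misclassified by $h_t$ and lowers the weights of correctly classified instances, so AdaBoost focuses on increasingly difficult instances • Once training is complete, AdaBoost is ready to classify unlabeled test instances. Unlike bagging or boosting, AdaBoost uses weighted majority voting • $1/\beta_t$ is therefore a measure of the performance of the t-th hypothesis and can be used to weight the classifiers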
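
The weight update and the weighted vote can be sketched as follows. Two points are assumptions of this sketch rather than statements from the slides: the weak learner is a decision stump that minimizes the weighted error directly (reweighting) instead of being trained on a subset sampled from the distribution, and each hypothesis votes with weight log(1/β), a common choice for AdaBoost.M1. Function names and the toy data are also illustrative.

```python
import numpy as np

def fit_weighted_stump(X, y, w):
    """Decision stump minimizing the weighted error sum(w[pred != y])."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for polarity in (0, 1):
                pred = np.where(X[:, f] > thr, polarity, 1 - polarity)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, polarity)
    return best[1:]

def predict_stump(stump, X):
    f, thr, polarity = stump
    return np.where(X[:, f] > thr, polarity, 1 - polarity)

def adaboost_m1(X, y, n_rounds=30):
    n = len(y)
    D = np.full(n, 1.0 / n)                  # initial distribution: uniform
    hypotheses, vote_weights = [], []
    for _ in range(n_rounds):
        h = fit_weighted_stump(X, y, D)
        correct = predict_stump(h, X) == y
        eps = D[~correct].sum()              # weighted error of this hypothesis
        if eps == 0 or eps >= 0.5:           # update requires 0 < eps < 1/2
            break
        beta = eps / (1.0 - eps)             # normalized error, 0 < beta < 1
        D = np.where(correct, D * beta, D)   # shrink correctly classified weights
        D = D / D.sum()                      # renormalize to a distribution
        hypotheses.append(h)
        vote_weights.append(np.log(1.0 / beta))   # assumed voting weight
    return hypotheses, np.array(vote_weights)

def adaboost_predict(hypotheses, vote_weights, X, n_classes=2):
    """Weighted majority vote over the classes predicted by each hypothesis."""
    scores = np.zeros((X.shape[0], n_classes))
    rows = np.arange(X.shape[0])
    for h, w in zip(hypotheses, vote_weights):
        scores[rows, predict_stump(h, X)] += w
    return scores.argmax(axis=1)

# Assumed toy problem: a diagonal boundary that no single stump can match.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

hypotheses, vote_weights = adaboost_m1(X, y)
accuracy = np.mean(adaboost_predict(hypotheses, vote_weights, X) == y)
print(f"{len(hypotheses)} hypotheses, training accuracy {accuracy:.3f}")
```

Hypotheses with a small error get a small β and therefore a large vote, which matches the reading of 1/β as a performance measure.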

  28. AdaBoost.M1

  29. AdaBoost.M1 • The AdaBoost algorithm is sequential; classifier $C_{K-1}$ is created before classifier $C_K$

  30. Boosting

  31. AdaBoost

  32. Performance of AdaBoost • In most practical cases, the ensemble error decreases very rapidly in the first few iterations, and approaches zero or stabilizes as new classifiers are added • AdaBoost does not seem to be affected by overfitting; explained by margin theory

  33. Stacked Generalization • An ensemble of classifiers is first created, whose outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes • C1, …, CT are trained using training parameters θ1 through θT to output hypotheses h1 through hT • The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, CT+1

  34. Mixture-of-Experts • A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C1, …, CT constitute the ensemble, followed by a second-level classifier CT+1 used for assigning weights for the subsequent combiner • The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all • The weight distribution used for the combiner is determined by a second-level classifier, usually a neural network, called the gating network • The inputs to the gating network are the actual training data instances themselves (unlike stacked generalization, where they are the outputs of the first-level classifiers) • Mixture-of-experts can, therefore, be seen as a classifier selection algorithm • Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x

  35. Mixture of Experts • The pooling system may use the weights in several different ways • It may choose the single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum
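
The two pooling options can be compared on one instance. All numbers below are made up for illustration: `gate_weights` stands for the output of a gating network for a single instance x, and `expert_outputs` holds each expert's per-class support.

```python
import numpy as np

# Assumed gating weights for T=3 experts on one instance (they sum to 1).
gate_weights = np.array([0.3, 0.4, 0.3])

# Assumed per-class outputs: one row per expert, one column per class.
expert_outputs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.30, 0.60],
    [0.80, 0.10, 0.10],
])

def winner_takes_all(weights, outputs):
    """Use only the expert with the highest gating weight."""
    return int(outputs[np.argmax(weights)].argmax())

def weighted_sum(weights, outputs):
    """Weight every expert's outputs and pick the class with the largest total."""
    return int((weights @ outputs).argmax())

print("winner-takes-all picks class", winner_takes_all(gate_weights, expert_outputs))
print("weighted sum picks class", weighted_sum(gate_weights, expert_outputs))
```

Here the highest-weighted expert votes for class 2, while the weighted sum (0.55, 0.165, 0.285) favors class 0, so the two pooling rules can disagree on the same instance.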

  36. Combining Classifiers • How to combine classifiers? Combination rules are grouped as • (i) trainable vs. non-trainable • Trainable rules: parameters of the combiner, called weights, are determined through a separate training algorithm • Weights from trainable rules are usually instance specific, and hence these rules are also called dynamic combination rules • Non-trainable rules: combination parameters are available as classifiers are generated; weighted majority voting is an example • (ii) combination rules for class labels vs. class-specific continuous outputs • Combination rules that apply to class labels only need the classification decision (that is, one of $\omega_j$, j=1,…,C) • Other rules need continuous-valued outputs of individual classifiers

  37. Combining Class Labels • Assume that only class labels are available from the classifier outputs • Define the decision of the t-th classifier as $d_{t,j} \in \{0,1\}$, t=1,…,T and j=1,…,C, where T is the number of classifiers and C is the number of classes • If the t-th classifier chooses class j, then $d_{t,j} = 1$, and 0 otherwise • Majority voting: choose class J if $\sum_{t=1}^{T} d_{t,J} = \max_{j} \sum_{t=1}^{T} d_{t,j}$ • Weighted majority voting: choose class J if $\sum_{t=1}^{T} w_t d_{t,J} = \max_{j} \sum_{t=1}^{T} w_t d_{t,j}$, where $w_t$ is the weight of the t-th classifier • Behavior Knowledge Space (BKS): uses a lookup table • Borda count: each voter (classifier) rank orders the candidates (classes). If there are N candidates, the first-place candidate receives N − 1 votes, the second-place candidate receives N − 2, with the candidate in i-th place receiving N − i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision
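
The label-based rules can be written compactly. This is a sketch with made-up votes, weights, and rankings; classes are numbered from 0, and ties are broken in favor of the lowest class index, which is a choice made here and not something the slides specify.

```python
import numpy as np

def majority_vote(labels, n_classes):
    """labels[t] is the class index chosen by classifier t."""
    return int(np.bincount(labels, minlength=n_classes).argmax())

def weighted_majority_vote(labels, weights, n_classes):
    """Each classifier's vote counts with its own weight."""
    return int(np.bincount(labels, weights=weights, minlength=n_classes).argmax())

def borda_count(rankings, n_classes):
    """rankings[t] lists the classes from most to least preferred by classifier t."""
    votes = np.zeros(n_classes)
    for ranking in rankings:
        for place, cls in enumerate(ranking, start=1):
            votes[cls] += n_classes - place      # i-th place earns N - i votes
    return int(votes.argmax())

labels = np.array([0, 0, 1, 1, 1])               # five classifiers, three classes
weights = np.array([0.9, 0.8, 0.3, 0.3, 0.3])
rankings = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]     # three classifiers' rank orders

print("majority vote:", majority_vote(labels, 3))
print("weighted majority vote:", weighted_majority_vote(labels, weights, 3))
print("Borda count:", borda_count(rankings, 3))
```

With these inputs plain majority voting picks class 1 (three votes to two), the weighted vote picks class 0 (1.7 against 0.9), and the Borda count picks class 1 (4 votes against 3 and 2).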

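  38. Combining Continuous Outputs • Here $d_{t,j}(x)$ is the continuous-valued support given by the t-th classifier to class j, and $\mu_j(x)$ is the total support for class j; the ensemble chooses the class with the largest $\mu_j(x)$ • Algebraic combiners • Mean rule: $\mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)$ • Weighted average: $\mu_j(x) = \sum_{t=1}^{T} w_t d_{t,j}(x)$ • Minimum/maximum/median rule: $\mu_j(x) = \min_t \{d_{t,j}(x)\}$, $\max_t \{d_{t,j}(x)\}$, or $\mathrm{median}_t \{d_{t,j}(x)\}$ • Product rule: $\mu_j(x) = \prod_{t=1}^{T} d_{t,j}(x)$ • Generalized mean: $\mu_{j,\alpha}(x) = \left( \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)^{\alpha} \right)^{1/\alpha}$ • Many of the above rules are in fact special cases of the generalized mean • $\alpha \to -\infty$: minimum rule; $\alpha \to \infty$: maximum rule; $\alpha \to 0$: geometric mean • $\alpha = 1$: mean rule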
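
The algebraic combiners can be tried on a small matrix of supports. The values in `d` are invented for illustration, with one row per classifier and one column per class, and the generalized mean is taken in its standard power-mean form; large positive or negative exponents are used only to approximate the limiting cases.

```python
import numpy as np

# Assumed continuous outputs: d[t, j] is classifier t's support for class j.
d = np.array([
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.45, 0.40, 0.15],
])

def generalized_mean(d, alpha):
    """Power mean over classifiers; alpha must be nonzero."""
    return np.mean(d ** alpha, axis=0) ** (1.0 / alpha)

support = {
    "mean": d.mean(axis=0),
    "minimum": d.min(axis=0),
    "maximum": d.max(axis=0),
    "median": np.median(d, axis=0),
    "product": d.prod(axis=0),
}
decisions = {rule: int(mu.argmax()) for rule, mu in support.items()}
for rule, cls in decisions.items():
    print(f"{rule:8s} support={np.round(support[rule], 4)} -> class {cls}")

# Special cases of the generalized mean.
print("alpha = 1   :", np.round(generalized_mean(d, 1.0), 4))
print("alpha = 200 :", np.round(generalized_mean(d, 200.0), 4))
print("alpha = -200:", np.round(generalized_mean(d, -200.0), 4))
```

On this example the mean, maximum, and median rules choose class 0, while the minimum and product rules choose class 1, which shows that the choice of combiner can change the ensemble decision.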

  39. Combining Classifier Outputs

  40. Conclusions • Ensemble systems are useful in practice • Diversity of the base classifiers is important • Ensemble generation techniques: bagging, AdaBoost, mixture of experts • Classifier combination strategies: algebraic combiners, voting methods, and decision templates. • No single ensemble generation algorithm or combination rule is universally better than others • Effectiveness on real world data depends on the classifier diversity and characteristics of the data
