Classification: SVM
AMCS/CS 340: Data Mining
Xiangliang Zhang, King Abdullah University of Science and Technology
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Support Vector Machines
• Ensemble Methods
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines: One Possible Solution
Support Vector Machines: Another Possible Solution
Support Vector Machines: Other Possible Solutions
Support Vector Machines
• Which one is better, B1 or B2?
• How do you define "better"?
SVM: Margins and Support Vectors
• Find the hyperplane that maximizes the margin: B1 is better than B2
• Support vectors are the data points that the margin pushes up against
Support Vector Machines
• SVM finds this hyperplane (decision boundary) using support vectors ("essential" training tuples) and margins (defined by the support vectors)
• Vapnik and colleagues (1992); groundwork from Vapnik and Chervonenkis' statistical learning theory in the 1960s
• Used for both classification and prediction
• Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
• Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
Margin and Hyperplane
• Separating hyperplane: $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $\mathbf{w}$ is a weight vector and $b$ is a scalar (bias)
• Sides of the margin: $\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$ for $y_i = +1$, and $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$ for $y_i = -1$
• Classifier: $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
Support Vector Machines
• We want to maximize the margin $2 / \|\mathbf{w}\|$
• Which is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$
• Subject to the constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for all training examples $(\mathbf{x}_i, y_i)$
• This is a constrained optimization problem: a quadratic objective function with linear constraints, i.e., Quadratic Programming (QP), handled with Lagrange multipliers
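As a quick illustration of the maximum-margin classifier, here is a minimal sketch (assuming scikit-learn and NumPy are available; the toy data are invented for illustration) that fits a linear SVM and inspects its support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Invented toy data: two linearly separable classes in 2-D
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM (no training errors allowed)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0])            # weight vector of the separating hyperplane
print("b =", clf.intercept_[0])       # bias
print("support vectors:\n", clf.support_vectors_)
```

Only the points on the margin show up in `support_vectors_`; removing the remaining points and refitting leaves the hyperplane unchanged.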
SVM: Not Linearly Separable
• What if the problem is not linearly separable?
• Introduce slack variables $\xi_i \ge 0$ (soft margin, allowing training errors): $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$; if $\xi_i > 1$, $\mathbf{x}_i$ is not on the correct side
• Modify the objective function to $\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$
• $C$ is a cost parameter, which can be chosen by cross-validation
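A common way to pick the cost parameter C is a grid search with cross-validation. A minimal sketch (scikit-learn assumed; the generated dataset is only a placeholder for real training data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data; in practice use your own training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross-validation over a logarithmic grid of C values
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```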
Nonlinear SVM
• What if the decision boundary is not linear?
Nonlinear SVM
• Transform the data into a higher-dimensional space
SVM Optimization (Mapping)
• Linear case: minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
• Nonlinear separation: map $\mathbf{x}$ into a higher-dimensional feature space via $\varphi(\mathbf{x})$, e.g., $\varphi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
• Then minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \varphi(\mathbf{x}_i) + b) \ge 1$
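To make the mapping idea concrete, here is a small sketch (scikit-learn and NumPy assumed; the data and the particular feature map are illustrative, not taken from the slides): points inside a circle cannot be separated by a line in 2-D, but appending the squared radius as a third feature makes them linearly separable.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # class = "inside the circle"

# Not linearly separable in the original 2-D space
print("2-D accuracy:", LinearSVC(C=1.0, max_iter=10000).fit(X, y).score(X, y))

# Explicit mapping to 3-D: append x1^2 + x2^2 as an extra feature
X3 = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
print("3-D accuracy:", LinearSVC(C=1.0, max_iter=10000).fit(X3, y).score(X3, y))
```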
Lagrange Function
• Minimize $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \varphi(\mathbf{x}_i) + b) - 1 \ge 0$
• Lagrange function (generalized Lagrange multipliers with inequality constraints): $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \big[ y_i(\mathbf{w} \cdot \varphi(\mathbf{x}_i) + b) - 1 \big]$, with $\alpha_i \ge 0$
• Weak duality: minimizing the Lagrange function over $(\mathbf{w}, b)$ provides a lower bound on the optimization problem, $\min_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\alpha}) \le f(\mathbf{w}^*)$, where $\mathbf{w}^*$ is the optimal solution
• New optimization problem: $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\alpha})$
Dual Form
• Find the solution of $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}, b} L(\mathbf{w}, b, \boldsymbol{\alpha})$
• The optimality conditions $\partial L / \partial \mathbf{w} = 0$ and $\partial L / \partial b = 0$ yield $\mathbf{w} = \sum_i \alpha_i y_i \varphi(\mathbf{x}_i)$ and $\sum_i \alpha_i y_i = 0$
• Equivalently, maximize in $\boldsymbol{\alpha}$: $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$, subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
• (A convex quadratic optimization problem that can be solved in its dual form)
Dual Form with Soft Margin
• Find the solution of $\max_{\boldsymbol{\alpha}} \min_{\mathbf{w}, b, \boldsymbol{\xi}} L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha})$
• The optimality conditions yield the same dual objective: maximize $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$
• Subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$; the only change from the hard-margin dual is the upper bound $C$ on each $\alpha_i$
• (A convex quadratic optimization problem that can be solved in its dual form)
How to Find α? Quadratic Programming
• Maximize $\sum_i \alpha_i - \frac{1}{2}\boldsymbol{\alpha}^{T} Q \boldsymbol{\alpha}$, i.e., minimize $\frac{1}{2}\boldsymbol{\alpha}^{T} Q \boldsymbol{\alpha} - \sum_i \alpha_i$
• Subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$
• $Q$ is an $N \times N$ matrix with $Q_{ij} = y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$: it depends on the training inputs $\mathbf{x}$ and labels $y$; this kind of problem is called quadratic programming
• There exist algorithms for finding the constrained quadratic optimum:
  • Projected conjugate gradients (Burges, 1998)
  • Decomposition methods (Osuna et al., 1996)
  • Sequential minimal optimization (Platt, 1999)
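As a sanity check on the dual formulation, here is a minimal sketch (NumPy and SciPy assumed; the toy data are invented) that solves the dual QP directly with a general-purpose solver rather than one of the specialized algorithms above:

```python
import numpy as np
from scipy.optimize import minimize

# Invented toy data, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
C = 10.0
n = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i . x_j

def neg_dual(alpha):
    # negative dual objective: we minimize -W(alpha)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)   # SVs lying exactly on the margin
b = np.mean(y[margin_sv] - X[margin_sv] @ w)      # b recovered from margin SVs
print("alpha:", np.round(alpha, 3), "\nw:", w, " b:", b)
```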
After Solving for α
• Having the solution $\boldsymbol{\alpha}$, the weight vector is $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
• Support vectors are the points with $\alpha_i > 0$; they lie on one of the margin hyperplanes
• All other points have $\alpha_i = 0$
• Predict the class of a given point $X$: $f(X) = \mathrm{sign}\big(\sum_{i \in SV} \alpha_i y_i\, \mathbf{x}_i \cdot X + b\big)$
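The same quantities are exposed by common SVM implementations. A small sketch (scikit-learn assumed; the toy data are the same invented points as above) that rebuilds the decision function from the stored support vectors and their coefficients:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

sv = clf.support_vectors_        # the support vectors x_i
coef = clf.dual_coef_[0]         # alpha_i * y_i for each support vector
b = clf.intercept_[0]

# f(X) = sum_i alpha_i y_i (x_i . X) + b, summed over support vectors only
manual = (X @ sv.T) @ coef + b
print(np.allclose(manual, clf.decision_function(X)))   # True
```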
SVM: Kernel Functions
• Apply a kernel function to the original data $\mathbf{x}$ for nonlinear separation
• Computing the dot product $\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$: do we need to know the definition of $\varphi$ explicitly?
• NO: use a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$!
SVM: Kernel Functions
• Learning the classifier: maximize $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
• Making a prediction: $f(X) = \mathrm{sign}\big(\sum_{i \in SV} \alpha_i y_i K(\mathbf{x}_i, X) + b\big)$
• Typical kernel functions:
  • Polynomial: $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + 1)^d$
  • Gaussian (RBF): $K(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x} - \mathbf{z}\|^2 / 2\sigma^2)$
  • Sigmoid: $K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{z} - \delta)$
Example of Kernel Functions
• Example of the polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^d$
• If $d = 2$ and $\mathbf{x} = (x_1, x_2)$: $K(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \varphi(\mathbf{x}) \cdot \varphi(\mathbf{z})$, with $\varphi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
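A quick numeric check of this identity (NumPy assumed; the two vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for 2-D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([0.7, -1.3])
z = np.array([2.0, 0.5])

kernel_value = (x @ z) ** 2          # K(x, z) = (x . z)^2, computed in 2-D
feature_dot = phi(x) @ phi(z)        # the same value via the 3-D feature map
print(np.isclose(kernel_value, feature_dot))   # True
```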
Multi-class SVMs
• One-versus-all
  • Train k binary classifiers, one for each class against all other classes
  • The predicted class is the class of the most confident classifier
• One-versus-one
  • Train k(k-1)/2 classifiers, each discriminating between a pair of classes
  • Several strategies exist for selecting the final classification based on the outputs of the binary SVMs
• Truly multi-class SVMs
  • Generalize the SVM formulation to multiple categories
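Both decompositions are available off the shelf. A minimal sketch (scikit-learn assumed; the iris data are just a convenient three-class example):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes, 4 features

# One-versus-all: k = 3 binary SVMs, one per class
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# One-versus-one: k(k-1)/2 = 3 pairwise binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print("one-vs-all training accuracy:", ova.score(X, y))
print("one-vs-one training accuracy:", ovo.score(X, y))
```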
Why Is SVM Effective on High-Dimensional Data?
• The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
• The support vectors are the essential or critical training examples: they lie closest to the decision boundary
• If all other training examples were removed and the training repeated, the same separating hyperplane would be found
• The number of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can generalize well, even when the dimensionality of the data is high
What You Should Know
• Linear SVMs
• The definition of a maximum-margin classifier
• What QP can do for you (you do not need to know how it does it)
• How maximum margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit nonlinear boundaries
• How SVM kernel functions let us pretend we are working with ultra-high-dimensional basis-function terms
Open Issues of SVM
• Speeding up quadratic-programming training (both time complexity and storage requirements grow as the training data grow)
• The choice of the kernel function: there are no general guidelines
SVM Related Links
• SVM website: http://www.kernel-machines.org/
• Tutorial: C. J. C. Burges (1998). A Tutorial on Support Vector Machines for Pattern Recognition.
• Representative implementations
  • LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ , an efficient SVM implementation supporting multi-class classification, nu-SVM (soft margin), one-class SVM, and regression, with interfaces for Java, Python, etc.
  • SVM-light: http://www.cs.cornell.edu/People/tj/svm_light/ , simpler, supporting only binary classification and only the C language
  • More: http://www.kernel-machines.org/software
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Learning from Neighbors
• Bayesian Classification
• Neural Networks
• Support Vector Machines
• Ensemble Methods
Ensemble Methods
• How can we improve the performance of models for classification and regression?
• Combine multiple models together: make the prediction and the discrimination by combining multiple models instead of using a single model
• Two heads are better than one. (Arabic proverb: "Two are better than one; one hand cannot clap." Chinese proverb: "Three cobblers together match Zhuge Liang," i.e., several ordinary minds beat one mastermind. Spanish proverb: "Four eyes see more than two.")
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
• Advantage: often improves predictive performance
• Disadvantage: usually produces output that is very hard to analyze
  • However, there are approaches that aim to produce a single comprehensive structure
Win a Challenge
• KDD Cup 2009: classification of mobile customers, with a prize of 10,000 euros
• Difficulties:
  • Large number of training examples: 50,000
  • Large number of features: 15,000
  • Large number of missing values: 60%
  • Unbalanced class proportions: fewer than 10% of the examples belong to the positive class
• Winners:
  • IBM Research: an ensemble of a wide variety of classifiers
  • ID Analytics: boosted decision trees and bagging
  • David Slate & Peter Frey: an ensemble of decision trees
  • University of Melbourne: boosting with classification trees
  • Financial Engineering Group: gradient tree classifier boosting
  • National Taiwan University: AdaBoost with trees
Main Types of Ensemble Methods
• Combine multiple models together
• Bagging: make the classification by voting over a collection of classifiers
• Boosting: train multiple models in sequence
• Decision trees: different models are responsible for making predictions in different regions of the input space
Bagging
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
Bagging
• Sampling with replacement (bootstrap): each example has probability $1 - (1 - 1/n)^n \approx 0.632$ of appearing in a bootstrap sample of size $n$
• Build a classifier $M_i$ on each bootstrap sample
• Classification of an unknown sample $\mathbf{x}$:
  • Each classifier $M_i$ returns its class prediction
  • The bagged classifier $M^*$ counts the votes and assigns the class with the most votes to $\mathbf{x}$
• Accuracy:
  • Often significantly better than a single classifier derived from the original data $D$
  • For noisy data: not considerably worse, and more robust
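A minimal bagging sketch (scikit-learn assumed; the generated dataset is only a placeholder, and the `estimator` keyword is named `base_estimator` in older scikit-learn releases), which also checks the bootstrap inclusion probability numerically:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

n = 1000
print("P(example appears in a bootstrap sample) =", 1 - (1 - 1 / n) ** n)  # ~0.632

# Placeholder data
X, y = make_classification(n_samples=n, n_features=20, random_state=0)

# Bagging: 50 trees, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)

tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
bag_acc = cross_val_score(bag, X, y, cv=5).mean()
print("single tree:", round(tree_acc, 3), " bagged trees:", round(bag_acc, 3))
```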
• Suppose there are 25 base classifiers, each with error rate $\varepsilon = 0.35$
• Assume the classifiers are independent
• The probability that the ensemble classifier makes a wrong prediction (a majority of the base classifiers are wrong) is $\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
• The ensemble error rate is greatly reduced compared with 0.35
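This number is easy to verify (SciPy assumed):

```python
from scipy.stats import binom

eps = 0.35                          # error rate of each base classifier
n = 25                              # number of independent base classifiers

# The majority vote is wrong when 13 or more of the 25 classifiers are wrong
ensemble_error = 1 - binom.cdf(12, n, eps)
print(round(ensemble_error, 4))     # ~0.06
```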
Main Types of Ensemble Methods
• Combine multiple models together
• Bagging: make the classification by voting over a collection of classifiers
• Boosting: train multiple models in sequence
• Decision trees: different models are responsible for making predictions in different regions of the input space
Boosting
• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records
• Initially, all $n$ records are assigned equal weights $w_i = 1/n$
• For $t = 1, 2, \ldots, M$:
  • Obtain a classifier $y_t(\mathbf{x})$ under the current weights $\{w_i^{(t)}\}$
  • Calculate the error of $y_t(\mathbf{x})$ and re-weight the examples based on the errors, giving $\{w_i^{(t+1)}\}$
• Output a weighted sum of all the classifiers, where the weight of each classifier's vote is a function of its accuracy
Boosting Example
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased
• Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds
Boosting vs. Bagging
• Committees/bagging: base classifiers are trained in parallel on samples of the data set
• Boosting: base classifiers are trained in sequence using a weighted form of the data set
  • The weighting coefficient of each data point depends on the performance of the previous classifiers
  • Misclassified points are given greater weight when used to train the next classifier in the sequence
  • Boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
• AdaBoost (adaptive boosting): a popular boosting algorithm
• Given a set of N class-labeled examples, $D = \{(X_1, y_1), \ldots, (X_N, y_N)\}$
• Initially, all example weights are set to the same value, $1/N$
• Generate k classifiers in k rounds. At round i:
  • Examples from D are sampled (with replacement) to form a training set $D_i$ of the same size
  • Each example's chance of being selected is based on its weight
  • A classification model $M_i$ is derived from $D_i$
  • The error rate of $M_i$ is calculated using $D_i$ as a test set
  • Weights of training examples are adjusted depending on how they were classified: correctly classified, decrease the weight; incorrectly classified, increase the weight
Example: AdaBoost
• Base classifiers: $M_1, M_2, \ldots, M_T$
• Error rate of a classifier $M_i$: $\varepsilon_i = \sum_{j=1}^{N} w_j\, \delta\big(M_i(x_j) \ne y_j\big)$, the weighted fraction of misclassified examples
• Importance of a classifier $M_i$: $\alpha_i = \frac{1}{2}\ln\frac{1 - \varepsilon_i}{\varepsilon_i}$
Example: AdaBoost
• Update the weight $w_i$ of example $x_i$ from round $j$ to round $j+1$: $w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times e^{-\alpha_j}$ if $x_i$ is correctly classified, and $w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times e^{\alpha_j}$ if it is misclassified, where $Z_j$ is a normalization factor
• If any intermediate round produces an error rate higher than 50%, the weights are reverted to $1/N$ and the resampling procedure is repeated
• Classification: $C^*(x) = \arg\max_y \sum_j \alpha_j\, \delta\big(M_j(x) = y\big)$
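To tie the formulas together, here is a minimal AdaBoost sketch (scikit-learn and NumPy assumed, labels in {-1, +1}). It uses reweighted training via sample_weight instead of the resampling variant described above, which is a common equivalent formulation; all names and the usage data are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initial weights 1/N
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weighted training set
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w) # weighted error rate
        if err >= 0.5:                            # revert weights and retry
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # classifier importance
        w = w * np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight correct
        w = w / w.sum()                           # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # weighted vote: sign of the alpha-weighted sum of base predictions
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Usage on toy one-dimensional data (labels + + + - - - - - + +):
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5],
              [0.6], [0.7], [0.8], [0.9], [1.0]])
y = np.array([1, 1, 1, -1, -1, -1, -1, -1, 1, 1])
stumps, alphas = adaboost_fit(X, y, T=5)
print("training accuracy:", (adaboost_predict(stumps, alphas, X) == y).mean())
```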
Illustrating AdaBoost
• One-dimensional input data: ten points along the x-axis with labels + + + - - - - - + +
• Base classifiers: decision trees of height two with one split (e.g., at x = 0.5)
• Maximal attainable accuracy of a single base classifier: 80%
Illustrating AdaBoost
• (Figure: the training data points with their initial weights, followed by the per-round example weights, weighted error rates, and classifier weights α for three boosting rounds)
• Demo script: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/adaboost-illustration.m
Main Types of Ensemble Methods
• Combine multiple models together
• Bagging: make the classification by voting over a collection of classifiers
• Boosting: train multiple models in sequence
• Decision trees: different models are responsible for making predictions in different regions of the input space
Random Forests
• An ensemble of decision trees
• Input set: N tuples, M attributes
• Each tree is learned on a reduced training set:
  • Sample the training data with replacement
  • Randomly select m << M attributes and keep only those m attributes
  • The best split on these m attributes is used to split the node
  • m is held constant while the forest grows
• Bagging with decision trees is a special case of random forests with m = M
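A minimal random-forest sketch (scikit-learn assumed; the generated dataset is only a placeholder). The max_features parameter plays the role of m:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: N = 1000 tuples, M = 25 attributes
X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

# 100 trees; each split considers m = sqrt(M) randomly chosen attributes
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```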
Random Forests Algorithm
• Good accuracy without over-fitting, but interpretability decreases
• Fast algorithm (can be faster than growing and pruning a single tree); easily parallelized
• Handles high-dimensional data without much trouble
• Only one tuning parameter, mtry (typically $\sqrt{M}$ for classification); results are usually not sensitive to it