CS 277: Data Mining. Notes on Classification. Padhraic Smyth, Department of Computer Science, University of California, Irvine.
Review
• Models that are linear in the parameters b, e.g., $y = b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2$
  • With a least-squares objective function, fitting reduces to solving a set of linear equations
• Models that are non-linear in the parameters, e.g., logistic: $y = 1 / \left(1 + \exp[-(b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2)]\right)$
  • Solution requires non-linear optimization methods, e.g., iterative search
Optimization: Gradient Ascent (or Descent)
• Select an initial b heuristically or randomly
• Update:
  • Compute the local gradient of the objective function S(b): $\nabla S(b) = [\partial S(b)/\partial b_1, \ldots, \partial S(b)/\partial b_p]$
  • This gives the direction of maximum increase of S (at this point in b space)
  • Move b a small distance in this direction, i.e., uphill: $b \leftarrow b + \lambda \nabla S(b)$, where $\lambda$ = learning rate (e.g., 0.1)
• Repeat until convergence
  • e.g., b is no longer changing, $\nabla S(b) \approx 0$, i.e., we are at a (local) maximum
• Repeat from different initial conditions if local maxima exist in S(b)
• Many different versions (e.g., batch, sequential/stochastic, etc.)
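The update loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the objective, its gradient `grad_S`, the stopping tolerance, and the iteration cap are all assumptions chosen so the example has a known maximum at (1, -3).

```python
def gradient_ascent(grad, b_init, rate=0.1, tol=1e-8, max_iter=10000):
    """Maximize S(b) by repeatedly stepping along its gradient."""
    b = list(b_init)
    for _ in range(max_iter):
        g = grad(b)
        # Move b a small distance uphill: b <- b + rate * gradient.
        b = [bj + rate * gj for bj, gj in zip(b, g)]
        # Stop once the gradient is numerically zero (a local maximum).
        if sum(gj * gj for gj in g) ** 0.5 < tol:
            break
    return b


# Hypothetical concave objective S(b) = -(b0 - 1)^2 - 2 (b1 + 3)^2.
def grad_S(b):
    return [-2.0 * (b[0] - 1.0), -4.0 * (b[1] + 3.0)]


b_hat = gradient_ascent(grad_S, [0.0, 0.0])
print(b_hat)
```

For gradient descent on a function to be minimized, the same loop applies with the sign of the step reversed.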
Optimization: Newton’s Method
• Use 2nd-order derivative information in the update rule:
  • $b \leftarrow b - H^{-1}(b) \nabla S(b)$, where $H^{-1}(b)$ is the inverse of the Hessian matrix H(b), a p x p matrix of 2nd derivatives of S(b) evaluated at b
  • The minus sign applies to both maximization and minimization: near a maximum H is negative definite, so the step still points uphill
• Requires O(N p^2) computations to compute the Hessian, O(p^3) to invert
• Can approximate H with its diagonal, which yields O(Np)
• Converges in a single step if S(b) is exactly quadratic
• May be particularly helpful near a minimum (or maximum) (think Taylor series expansion)
• For more discussion see Section 8.3 in the text
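As a small illustration of the Newton update, written here in the standard form b - H^{-1}(b) ∇S(b), the sketch below handles a two-parameter problem with a closed-form 2 x 2 inverse. The quadratic objective and the helper names `grad_S` and `hess_S` are assumptions made for the example; because the objective is exactly quadratic, one step from any starting point should land on the maximum.

```python
def newton_step(b, grad, hess):
    """One Newton update b - H^{-1} grad for a 2-parameter problem."""
    g0, g1 = grad(b)
    (h00, h01), (h10, h11) = hess(b)
    det = h00 * h11 - h01 * h10
    # Apply the closed-form inverse of the 2 x 2 Hessian to the gradient.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (-h10 * g0 + h00 * g1) / det
    return [b[0] - d0, b[1] - d1]


# Hypothetical objective S(b) = -b0^2 - 2 b1^2 + b0 b1 + 3 b0.
def grad_S(b):
    return [-2.0 * b[0] + b[1] + 3.0, -4.0 * b[1] + b[0]]


def hess_S(b):
    return [[-2.0, 1.0], [1.0, -4.0]]


b_new = newton_step([0.0, 0.0], grad_S, hess_S)
print(b_new, grad_S(b_new))
```

For larger p one would solve the linear system H d = ∇S rather than form the inverse explicitly.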
Model Evaluation
• Let MSEtest be the mean-squared error of our learned predictor function, evaluated on test data
• Useful to report MSEtest / MSEbaseline
  • e.g., where $\mathrm{MSE}_{baseline} = \frac{1}{N} \sum_i [y(i) - \mu_y]^2$ (over the N test data points), where $\mu_y$ = mean of the y values on the training data
  • Ideally we would like MSEtest / MSEbaseline to be much less than 1
• Can also plot histograms of individual errors: MSE might be dominated by outliers
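A short sketch of the ratio described here, with both errors computed as means over the test points. The numbers are made up for illustration; the baseline predicts the training-set mean for every test point.

```python
def mse(y_true, y_pred):
    """Mean-squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: training targets, test targets, and model predictions.
y_train = [1.0, 2.0, 3.0, 4.0]
y_test = [2.0, 3.0, 5.0]
y_pred = [2.5, 2.5, 4.5]

mean_train = sum(y_train) / len(y_train)
mse_test = mse(y_test, y_pred)
mse_baseline = mse(y_test, [mean_train] * len(y_test))
ratio = mse_test / mse_baseline
print(mse_test, mse_baseline, ratio)
```

A ratio near 1 would mean the model does little better than always predicting the mean.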
Classification
• Predictive modeling: predict Y given X
  • Y is real-valued => regression
  • Y is categorical => classification
• Often use C rather than Y to indicate the “class variable”
• Many applications: speech recognition, document classification, OCR, loan approval, face recognition, etc.
Classification v. Regression • Similar in many ways… • both learn a mapping from X to C or Y • Both sensitive to dimensionality of X • Generalization to new data is important in both • Test error versus model complexity • Many models can be used for either classification or regression, e.g., • trees, neural networks • Most important differences • Categorical Y versus real-valued Y • Different score functions • E.g., classification error versus squared error
Probabilistic View of Classification
• Notation: K classes $c_1, \ldots, c_K$
• Class probabilities: $p(c_k)$ = probability of class k
• Class-conditional probabilities: $p(x \mid c_k)$ = probability of x given $c_k$, k = 1, ..., K
• Posterior class probabilities (by Bayes rule): $p(c_k \mid x) = p(x \mid c_k)\, p(c_k) / p(x)$, k = 1, ..., K, where $p(x) = \sum_j p(x \mid c_j)\, p(c_j)$
• In theory this is all we need; in practice this may not be the best approach.
Bayes Rule for Classification
• Consider the 2-class case: c1, c2
• Goal of classification: given x, predict c1 or c2
• Optimal decision rule: choose c1 if p(c1 | x) > 0.5, otherwise choose c2 => we would like to know p(c1 | x)
• By Bayes rule:
  p(c1 | x) = p(x | c1) p(c1) / p(x)
            = p(x | c1) p(c1) / ( p(x | c1) p(c1) + p(x | c2) p(c2) )
            = p(x, c1) / ( p(x, c1) + p(x, c2) )
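To make the two-class rule concrete, here is a minimal sketch that assumes each class-conditional density is a 1-dimensional Gaussian. The Gaussian choice, the parameter values, and the function names are illustrative assumptions; the slide does not specify a model for p(x | c).

```python
import math


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_c1(x, prior1, mean1, sd1, mean2, sd2):
    """p(c1 | x) = p(x, c1) / (p(x, c1) + p(x, c2))."""
    joint1 = gaussian_pdf(x, mean1, sd1) * prior1
    joint2 = gaussian_pdf(x, mean2, sd2) * (1.0 - prior1)
    return joint1 / (joint1 + joint2)


def classify(x, prior1, mean1, sd1, mean2, sd2):
    """Choose c1 if p(c1 | x) > 0.5, otherwise c2."""
    return "c1" if posterior_c1(x, prior1, mean1, sd1, mean2, sd2) > 0.5 else "c2"


# Hypothetical setup: equal priors, unit-variance classes centred at 0 and 2.
params = (0.5, 0.0, 1.0, 2.0, 1.0)
print(posterior_c1(0.0, *params), classify(0.0, *params))
print(posterior_c1(3.0, *params), classify(3.0, *params))
```

With equal priors and equal variances the posterior crosses 0.5 halfway between the two means, here at x = 1.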
Probabilistic Classification for 1-dimensional x
[Figure: the joint densities p(x, c1) and p(x, c2) plotted against x; below them, the posterior p(c1 | x) on a vertical axis from 0 to 1 with 0.5 marked]
Note that p(x, c) = p(x | c) p(c)
Decision Regions and Bayes Error Rate
[Figure: p(x, c1) and p(x, c2) plotted against x, with the x axis divided into alternating decision regions labeled class c2 and class c1]
• Optimal decision regions = regions where one class is more likely than the other
• Optimal decision regions => optimal decision boundaries
• Bayes error rate = fraction of examples misclassified by the optimal classifier = shaded area in the figure (see equation 10.3 in the text)
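The Bayes error rate can be computed numerically as the area under the smaller of the two joint densities at each x. The sketch below assumes two 1-dimensional Gaussian classes and a simple midpoint-rule integration over a fixed range; both are assumptions for illustration only.

```python
import math


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_error(prior1, mean1, sd1, mean2, sd2, lo=-10.0, hi=12.0, steps=20000):
    """Integrate min(p(x, c1), p(x, c2)) over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        joint1 = prior1 * gaussian_pdf(x, mean1, sd1)
        joint2 = (1.0 - prior1) * gaussian_pdf(x, mean2, sd2)
        # The optimal classifier picks the larger joint, so it errs on the smaller.
        total += min(joint1, joint2) * width
    return total


# Hypothetical setup: equal priors, unit-variance classes centred at 0 and 2.
error = bayes_error(0.5, 0.0, 1.0, 2.0, 1.0)
print(error)
```

For this symmetric setup the boundary sits at x = 1, so the result can be checked against the standard normal tail probability beyond one standard deviation.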
Procedure for optimal Bayes classifier • For each class learn a model p( x | ck ) • E.g., each class is multivariate Gaussian with its own mean and covariance • Use Bayes rule to obtain p( ck | x ) => this yields the optimal decision regions/boundaries => use these decision regions/boundaries for classification • Correct in theory…. but practical problems include: • How do we model p( x | ck ) ? • Even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100) • Alternative approach: model the decision boundaries directly
3 Types of Classifiers
• Generative (or class-conditional) classifiers:
  • Learn models for p( x | ck ), use Bayes rule to find decision boundaries
  • Examples: naïve Bayes models, Gaussian classifiers
• Regression-based classifiers:
  • Learn a model for p( ck | x ) directly
  • Examples: logistic regression, neural networks
• Discriminative classifiers:
  • No probabilities
  • Learn the decision boundaries directly
  • Examples:
    • Linear boundaries: perceptrons, linear SVMs
    • Piecewise linear boundaries: decision trees, nearest-neighbor classifiers
    • Non-linear boundaries: non-linear SVMs
• Note: one can usually “post-fit” class probability estimates p( ck | x ) to a discriminative classifier, e.g., this is often done with SVMs
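Generative Classifier
[Figure: the joint densities p(x, c1) and p(x, c2) plotted against x]
Regression-based Classifier
[Figure: p(x, c1) and p(x, c2) plotted against x, with the posterior p(c1 | x) below on an axis from 0 to 1, 0.5 marked]
Discriminative Classifier
[Figure: p(x, c1) and p(x, c2) plotted against x, with the posterior p(c1 | x) below on an axis from 0 to 1, 0.5 marked]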
What type of cost function is appropriate?
• Let’s look at the score functions
• Notation: c(i) = true class, $\hat{c}(x(i); \theta)$ = class predicted by the classifier with parameters $\theta$
• Class-mismatch loss functions:
  $S(\theta) = \frac{1}{n} \sum_i \mathrm{cost}[c(i), \hat{c}(x(i); \theta)]$
  where cost(i, j) = cost of misclassifying true class i as predicted class j
  • e.g., cost(i, j) = 0 if i = j, and 1 otherwise (misclassification error or 0-1 loss)
  • More generally, cost(i, j) is a K x K matrix of losses (e.g., surgery, spam email, etc.)
• Class-probability loss functions, for c = 0 or 1:
  $S(\theta) = \frac{1}{n} \sum_i \log p(c(i) \mid x(i); \theta)$ (log probability score)
  or
  $S(\theta) = \frac{1}{n} \sum_i [c(i) - p(c = 1 \mid x(i); \theta)]^2$ (Brier score)
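The score functions on this slide can be compared side by side on a toy binary problem. The labels, predicted probabilities, 0.5 threshold, and the asymmetric cost matrix below are all hypothetical values chosen for illustration.

```python
import math


def zero_one_loss(true, pred):
    return sum(t != p for t, p in zip(true, pred)) / len(true)


def cost_loss(true, pred, cost):
    """Average of cost[true class][predicted class]."""
    return sum(cost[t][p] for t, p in zip(true, pred)) / len(true)


def log_score(true, p1):
    """Mean log probability assigned to the true class (higher is better)."""
    return sum(math.log(p if t == 1 else 1.0 - p) for t, p in zip(true, p1)) / len(true)


def brier_score(true, p1):
    """Mean squared difference between the 0/1 label and p(c = 1 | x)."""
    return sum((t - p) ** 2 for t, p in zip(true, p1)) / len(true)


c_true = [1, 0, 1, 0]
p1 = [0.9, 0.2, 0.4, 0.1]                  # predicted p(c = 1 | x)
c_pred = [1 if p > 0.5 else 0 for p in p1]
cost = [[0, 1], [5, 0]]                    # assumed: missing a true 1 costs 5

print(zero_one_loss(c_true, c_pred))
print(cost_loss(c_true, c_pred, cost))
print(log_score(c_true, p1), brier_score(c_true, p1))
```

The same single mistake (the third example) counts once under 0-1 loss but five times under the asymmetric matrix, which is the kind of difference the cost-matrix formulation is meant to capture.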
Example: cost functions for classifying spam email • 0-1 loss function • Appropriate if we just want to maximize accuracy • Asymmetric cost matrix • Appropriate if missing non-spam emails is more “costly” than failing to detect spam emails • Probability loss • Appropriate if we wanted to rank all emails by p(spam | email features), e.g., to allow the user to look at emails via a ranked list. • In general: don’t solve a harder problem than you need to, or don’t model aspects of the problem you don’t need to
Examples of Classifiers • Generative/class-conditional/probabilistic, based on p( x | ck ), • Naïve Bayes (simple, but often effective in high dimensions) • Parametric generative models, e.g., Gaussian (can be effective in low-dimensional problems: leads to quadratic boundaries in general) • Regression-based, model p( ck | x ) directly • Logistic regression: simple, linear in “odds” space, widely used in industry • Neural network: non-linear extension of logistic, can be difficult to work with • Discriminative models, focus on locating optimal decision boundaries • Linear discriminants, perceptrons: simple, sometimes effective • Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity can be an issue • Nearest neighbor: simple, can scale poorly in high dimensions • Decision trees: often effective in high dimensions, but biased
Generative Classifiers (classifiers that estimate p(x | c) and then use Bayes rule to compute p(c | x))
A Generative Classifier: Naïve Bayes
• Generative probabilistic model with a conditional independence assumption on p( x | ck ), i.e., $p(x \mid c_k) = \prod_j p(x_j \mid c_k)$, or equivalently $\log p(x \mid c_k) = \sum_j \log p(x_j \mid c_k)$
• Useful in high-dimensional problems
• Typically used with nominal or ordinal variables
  • Real-valued variables are discretized to create ordinal versions
  • e.g., "Supervised and unsupervised discretization of continuous features", Dougherty, Kohavi, and Sahami, ICML 1995
• An alternative for real-valued x is to model each p( xj | ck ) with a parametric density model, e.g., Gaussian. This is less widely used.
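A minimal sketch of naïve Bayes for nominal features follows. The add-one (Laplace) smoothing controlled by `alpha`, the dictionary-based model layout, and the toy spam data are assumptions introduced for the example and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(X, y, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(y)
    counts = defaultdict(Counter)   # counts[(j, c)][v]: class-c rows with x_j = v
    values = defaultdict(set)       # values[j]: distinct values seen for feature j
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            counts[(j, c)][v] += 1
            values[j].add(v)
    return {"n": len(y), "class_counts": class_counts,
            "counts": counts, "values": values, "alpha": alpha}


def log_joint(model, row, c):
    """log p(c) + sum_j log p(x_j | c), with smoothed probability estimates."""
    n_c = model["class_counts"][c]
    total = math.log(n_c / model["n"])
    for j, v in enumerate(row):
        num = model["counts"][(j, c)][v] + model["alpha"]
        den = n_c + model["alpha"] * len(model["values"][j])
        total += math.log(num / den)
    return total


def predict(model, row):
    return max(model["class_counts"], key=lambda c: log_joint(model, row, c))


# Hypothetical features: (contains "offer", contains "meeting").
X = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("yes", "yes")]
y = ["spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(X, y)
print(predict(model, ("yes", "no")), predict(model, ("no", "yes")))
```

Working in log space, as in the second form of the factorization, avoids numerical underflow when the number of features is large.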
A Generative Classifier: Naïve Bayes
Comments:
• Simple to train (just estimate conditional probabilities for each feature-class pair)
• Often works surprisingly well in practice
  • e.g., a good baseline for text classification, and the basis of many widely used spam filters
• Feature selection can be helpful, e.g., select the K best individual features
• Even if the independence assumptions are not met, it may still be able to approximate the optimal decision boundaries (this seems to happen in practice)
  • See "On the optimality of the simple Bayesian classifier under zero-one loss", Domingos and Pazzani, Machine Learning, 1997
• However, on most problems it can usually be beaten by a more complex model (plus more work)
Regression-Based Classifiers (classifiers that estimate p(c | x) directly)
Regression-based Classification
• Consider regression once again, but where y now takes values 0 or 1
• Regression will try to learn a function f to approximate E[y | x] at each x
• For binary y we have
  $E[y \mid x] = \sum_y y\, p(y \mid x) = 1 \cdot p(y = 1 \mid x) + 0 \cdot p(y = 0 \mid x) = p(y = 1 \mid x)$
  => for binary classification problems a regression model will try to approximate p(y = 1 | x) (the posterior class probabilities)
• e.g., this is what logistic regression and neural networks do
Predicting an output between 0 and 1
• We often have a problem where y lies between 0 and 1
  • probability that a patient with attributes X will survive 10 years
  • proportion of people in Zip code X who will buy a product
• We could use linear regression, but...
• Instead we can use the logistic function:
  $\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = b_0 + \sum_j b_j x_j$
  Equivalently, $p(y = 1 \mid x) = 1 / [1 + \exp(-b_0 - \sum_j b_j x_j)]$
• We model the log-odds as a linear function of the input variables. This is known as logistic regression.
• (Note: neural networks can be thought of as multi-layer logistic models)
1-dimensional case
• $p(y = 1 \mid x) = 1 / [1 + \exp(-b_0 - b x)]$
• For simplicity assume the b’s are both > 0
• As x -> +infinity, p(y = 1 | x) -> 1
• As x -> -infinity, p(y = 1 | x) -> 0
• When is p(y = 1 | x) = 0.5? When $-b_0 - b x = 0$, i.e., at $x = -b_0 / b$
  • location of the logistic curve is controlled by $-b_0 / b$
  • steepness of the curve is controlled by b
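The properties of the 1-dimensional logistic curve can be checked numerically. In this sketch the coefficient values are arbitrary assumptions; the only requirement for the limiting behaviour shown is that the slope b is positive.

```python
import math


def logistic(x, b0, b):
    """p(y = 1 | x) for the 1-dimensional logistic model."""
    return 1.0 / (1.0 + math.exp(-b0 - b * x))


def log_odds(p):
    return math.log(p / (1.0 - p))


b0, b = -2.0, 0.5          # hypothetical coefficients
midpoint = -b0 / b         # where the curve crosses 0.5

print(midpoint, logistic(midpoint, b0, b))
print(logistic(-100.0, b0, b), logistic(100.0, b0, b))
# The log-odds recover the linear function b0 + b x.
print(log_odds(logistic(1.0, b0, b)), b0 + b * 1.0)
```

Increasing b while holding -b0 / b fixed makes the transition from 0 to 1 sharper without moving its location.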
Likelihood-based Objective Function
• Conditional log-likelihood
  • likelihood = probability of the observed data
  • Select parameters to maximize the (log) likelihood of the y’s given the x’s (“conditional maximum likelihood”):
  $S(b) = \sum_i \log p(y(i) \mid x(i); b) = \sum_i \{ y(i) \log p(y(i) = 1 \mid x(i); b) + [1 - y(i)] \log(1 - p(y(i) = 1 \mid x(i); b)) \}$
Fitting a Logistic Regression Model
• Iteratively Reweighted Least Squares (IRLS)
  • Can compute the 2nd derivative directly as a weighted matrix
  • Forms the basis for an iterative 2nd-order Newton scheme
  • Each iteration is equivalent to a weighted regression problem, O(p^3)
  • see (e.g.) Komarek and Moore (2005) for speedups for sparse data
  • The log-likelihood here is concave, so the method is quite stable (only one global maximum!)
• Stochastic gradient descent
  • Often faster for large data sets (large N, large p)
  • See notes by Charles Elkan for reference: http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf
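As a rough sketch of the stochastic-gradient option, the code below fits a logistic regression by stepping uphill on the conditional log-likelihood one example at a time. The fixed learning rate, epoch count, random seed, and toy data are assumptions; this is not IRLS and not the specific procedure from the referenced notes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_prob(b, x):
    """p(y = 1 | x; b), with b[0] as the intercept."""
    return sigmoid(b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)))


def log_likelihood(b, X, y):
    return sum(yi * math.log(predict_prob(b, xi))
               + (1 - yi) * math.log(1.0 - predict_prob(b, xi))
               for xi, yi in zip(X, y))


def fit_sgd(X, y, rate=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    b = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Per-example gradient of the log-likelihood is (y - p) times the input.
            err = y[i] - predict_prob(b, X[i])
            b[0] += rate * err
            for j, xj in enumerate(X[i]):
                b[j + 1] += rate * err * xj
    return b


# Hypothetical 1-dimensional data that is not linearly separable.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y = [0, 0, 1, 0, 1, 1]
b_hat = fit_sgd(X, y)
print(b_hat, log_likelihood(b_hat, X, y))
```

The data are deliberately non-separable: with perfectly separable classes the maximum-likelihood coefficients grow without bound, and a sketch like this would need regularization or early stopping.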
Link between Logistic Regression and Naïve Bayes
[Figure: two panels labeled Logistic Regression and Naïve Bayes]
Evaluating Classifiers
• Evaluate on independent test data (as with regression)
• Measures of performance on test data:
  • Classification accuracy (or error), or a cost function if the “costs” of errors are not symmetric
  • Confusion matrices:
    • K x K matrix where entry (i, j) contains the number of test examples that were predicted to be class i and truly belonged to class j
    • Diagonal elements = examples classified correctly
    • Off-diagonal elements = misclassified examples
    • Useful with more than 2 classes for figuring out which classes are most “confused”
  • Log-probability score on test data
    • Useful if we want to measure how good (well-calibrated) the p(c|x) estimates are
  • Ranking performance
    • How well does a classifier rank new examples?
    • Receiver operating characteristics
    • Lift curves
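A confusion matrix following the convention on this slide (rows index the predicted class, columns the true class) can be built as follows. The class names and label lists are made up for illustration.

```python
def confusion_matrix(true, pred, classes):
    """Entry [i][j] counts examples predicted as class i whose true class is j."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[index[p]][index[t]] += 1
    return matrix


def accuracy(matrix):
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical test labels and predictions for three classes.
true = ["a", "a", "b", "b", "c", "c"]
pred = ["a", "b", "b", "b", "c", "a"]
m = confusion_matrix(true, pred, ["a", "b", "c"])
for row in m:
    print(row)
print(accuracy(m))
```

Note that some libraries use the transposed convention (rows for the true class), so it is worth checking which orientation a given tool reports.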
Imbalanced Class Distributions • Common in data mining to have one class be much less likely than the others • e.g., 0.1% of examples are fraudulent or have a disease • If we train a standard classifier on a random sample of data it is very difficult to beat the “majority classifier” in terms of accuracy • Approaches: • Stratified sampling: artificially create training data with 50% of each class being present, and then “correct” for this in prediction • E.g., learn p(x|c) on stratified data and use true p( c ) when predicting with a probabilistic model • Use a different score function: • We are often interested in scoring/screening/ranking cases when using the model • Thus, scores such as “how many of the class of interest are ranked in the top 1% of predictions” may be more relevant than overall accuracy (e.g., in document retrieval)
Ranking and Lift Curves
• Many problems where we are interested in ranking examples in terms of how likely they are to belong to the “positive” class
  • E.g., credit scoring, fraud detection, medical screening, document retrieval
  • E.g., use a classifier to rank N test examples according to p(c|x) and then pick the top K, where K is much smaller than N
• Lift curve
  • n = number of true positives that appear in the top K% of the ranked list
  • r = number of true positives that would appear in the top K% if we ranked randomly
  • n/r is the “lift” provided by the classifier for the top K%
  • e.g., K = 10%, r = 200, n = 300, lift = 1.5, i.e., 50% more positives than random ranking
  • Random ranking gives lift = 1, i.e., no improvement
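The lift at a given cutoff can be computed directly from scores and 0/1 labels. In this sketch the function name `lift_at`, the rounding of the cutoff to a whole number of examples, and the toy data are assumptions; it also assumes at least one positive example so that r is non-zero.

```python
def lift_at(scores, labels, top_fraction):
    """Lift n / r for the top fraction of examples ranked by score."""
    n_top = max(1, int(round(top_fraction * len(scores))))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:n_top])    # n: positives in the top
    expected = sum(labels) * n_top / len(labels)          # r: expected under random ranking
    return found / expected


# Hypothetical scores p(c | x) and true labels (1 = positive) for 10 examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

print(lift_at(scores, labels, 0.2))
print(lift_at(scores, labels, 1.0))
```

Evaluating `lift_at` over a range of cutoffs and plotting the results gives the lift curve; at a cutoff of 100% the lift is always 1.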
Target variable = response/no-response from mailing campaign • Training and test sets each of size 250k • Standard model had 80 variables: variable selection reduced this to 7 • Note non-monotonicity in lower curve (undesirable)
Receiver Operating Characteristic (ROC) plots
• Rank the N test examples by p(c|x)
  • or by whatever real number our classifier produces that indicates likelihood of belonging to class 1
• Let k = number of true class 1 examples and m = number of true class 0 examples, so k + m = N
• For all possible thresholds t for this ranked list
  • count the number of true positives, kt
    • true positive rate = kt / k
  • count the number of “false alarms”, mt
    • false positive rate = mt / m
• ROC plot = plot of true positive rate kt / k versus false positive rate mt / m
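The threshold sweep can be sketched by walking down the ranked list and recording one ROC point per example. The ranking below is hypothetical (it uses N = 10, k = 6, m = 4 only as convenient sizes), the code assumes no tied scores, and the area under the curve is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    k = sum(labels)                 # number of true class 1 examples
    m = len(labels) - k             # number of true class 0 examples
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / m, true_pos / k))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ranking: scores in decreasing order with their true labels.
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

The area equals the fraction of (class 1, class 0) pairs in which the class 1 example is ranked higher, which is 17 of 24 pairs for this ranking.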
ROC Example N = 10 examples, k = 6 true class 1’s, m = 4 class 0’s The first column is a possible ranking from a classifier
ROC Plot • Area under curve (AUC) often used as a metric to summarize ROC • Online example at http://www.anaesthetist.com/mnm/stats/roc/ Diagonal line corresponds to random ranking
Calibration • In addition to ranking we may be interested in how accurate our estimates of p(c|x) are, • i.e., if the model says p(c|x) = 0.9, how accurate is this number? • Calibration: • a model is well-calibrated if its probabilistic predictions match real-world empirical frequencies • i.e., if a classifier predicts p(c|x) = 0.9 for 100 examples, then on average we would expect about 90 of these examples to belong to class c, and 10 not to. • We can estimate calibration curves by binning a classifier’s probabilistic predictions, and measuring how many
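One common way to estimate a calibration curve, sketched here under assumptions, is to group predictions into equal-width probability bins and compare the mean predicted probability in each bin with the observed fraction of class-c examples. The number of bins, the function name, and the data are hypothetical, and this is one reading of the binning procedure rather than the specific method from the lecture.

```python
def calibration_curve(probs, labels, n_bins=5):
    """Return (mean predicted probability, observed fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[k].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            curve.append((mean_p, observed, len(members)))
    return curve


# Hypothetical predictions: ten examples at 0.1 and ten at 0.9.
probs = [0.1] * 10 + [0.9] * 10
labels = [1] + [0] * 9 + [1] * 9 + [0]
curve = calibration_curve(probs, labels)
for mean_p, observed, count in curve:
    print(round(mean_p, 3), observed, count)
```

For a well-calibrated model the first two numbers in each row are close; here they match exactly because the toy labels were constructed that way.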
Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors
• k is a parameter:
  • Small k: “noisier” estimates; large k: “smoother” estimates
  • Best value of k often chosen by cross-validation
• Comments
  • Virtually assumption-free
  • Gives piecewise linear boundaries (i.e., non-linear overall)
  • Interesting theoretical properties: Bayes error <= error(kNN) <= 2 x Bayes error (asymptotically)
• Disadvantages
  • Can scale poorly with dimensionality; sensitive to the distance metric
  • Requires fast lookup at run-time to do classification with large n
  • Does not provide any interpretable “model”
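A brute-force kNN classifier takes only a few lines. This sketch assumes Euclidean distance and a simple majority vote, scans all training points for each query (so it does not address the fast-lookup issue), and uses made-up 2-dimensional data.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority class among the k training points closest to x."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

Choosing an odd k avoids tied votes in two-class problems; with more classes or even k, the tie-breaking rule becomes part of the classifier.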
Local Decision Boundaries
• Boundary? Points that are equidistant between points of class 1 and 2
• Note: locally the boundary is
  (1) linear (because of Euclidean distance)
  (2) halfway between the 2 class points
  (3) at right angles to the connector
[Figure: scatter plot of class 1 and class 2 training points with a query point marked "?"; axes: Feature 1 (horizontal), Feature 2 (vertical)]
Finding the Decision Boundaries
[Figure: the same scatter plot of class 1 and class 2 points, shown on three successive slides; axes: Feature 1, Feature 2]
Overall Boundary = Piecewise Linear
[Figure: the same scatter plot divided into a decision region for class 1 and a decision region for class 2; axes: Feature 1, Feature 2]