This article discusses discriminant functions and probabilistic generative models in linear classification. It covers topics such as hyperplanes, one-versus-the-rest and one-versus-one classifiers, learning parameters using least squares and Fisher's linear discriminant, and Fisher's discriminant for multiple classes.
Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim, SNU Biointelligence Lab (http://bi.snu.ac.kr/)
Contents
• 4.1. Discriminant Functions
• 4.2. Probabilistic Generative Models
Classification Models
• Linear classification model
  • A (D-1)-dimensional hyperplane in a D-dimensional input space
  • 1-of-K coding scheme for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T
• Discriminant function
  • Directly assigns each vector x to a specific class
  • e.g. Fisher's linear discriminant
• Approaches using conditional probability
  • Separation of the inference and decision stages
  • Two approaches
    • Direct modeling of the posterior probability
    • Generative approach: model the likelihood and prior probability to compute the posterior probability; capable of generating samples
Discriminant Functions - Two Classes
• Classification by a hyperplane: y(x) = w^T x + w0, or, with augmented vectors x~ = (1, x^T)^T and w~ = (w0, w^T)^T, y(x) = w~^T x~
• Assign x to class C1 if y(x) >= 0 and to class C2 otherwise
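As a quick illustration of this decision rule, here is a minimal NumPy sketch; the weight vector, bias, and input below are made-up values, not taken from the slides.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Two-class linear discriminant y(x) = w^T x + w0.
    Assign x to class C1 if y(x) >= 0, otherwise to class C2."""
    return w @ x + w0

# Hypothetical weights, bias, and input, purely for illustration.
w = np.array([1.0, -2.0])
w0 = 0.5
x = np.array([0.3, 0.1])

y = linear_discriminant(x, w, w0)
print("C1" if y >= 0 else "C2")
```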
Discriminant Functions - Multiple Classes
• One-versus-the-rest classifier
  • K-1 classifiers for a K-class discriminant
  • Ambiguous when more than one classifier says 'yes'
• One-versus-one classifier
  • K(K-1)/2 binary discriminant functions
  • Majority voting; still ambiguous when classes receive equal votes
[Figure: ambiguous regions of the one-versus-the-rest and one-versus-one constructions]
Discriminant Functions - Multiple Classes (Cont'd)
• A single K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_k0
• Assign x to class C_k if y_k(x) > y_j(x) for all j != k, i.e. the class with the maximum output
• The decision regions are always singly connected and convex
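A minimal sketch of this argmax rule over K linear functions; the weight matrix W (one row per class) and the biases are illustrative values, not from the text.

```python
import numpy as np

def k_class_discriminant(x, W, w0):
    """K linear functions y_k(x) = w_k^T x + w_k0, with W of shape (K, D)
    and w0 of shape (K,). Assign x to the class with the maximum output."""
    return int(np.argmax(W @ x + w0))

# Hypothetical parameters for K = 3 classes in D = 2 dimensions.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])
print(k_class_discriminant(np.array([0.5, 0.2]), W, w0))
```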
Approaches for Learning Parameters of Linear Discriminant Functions
• Least squares method
• Fisher's linear discriminant
  • Relation to least squares
  • Multiple classes
• Perceptron algorithm
Least Squares Method
• Minimization of the sum-of-squares error (SSE)
• 1-of-K binary coding scheme for the target vector t
• For a training data set {x_n, t_n}, n = 1,…,N, the sum-of-squares error function is E_D(W~) = (1/2) Tr{(X~W~ - T)^T (X~W~ - T)}
• Minimizing the SSE gives the pseudo-inverse solution W~ = (X~^T X~)^{-1} X~^T T
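A minimal NumPy sketch of this pseudo-inverse solution, assuming 1-of-K target rows in T and a bias column of ones prepended to the inputs; the toy data set is made up.

```python
import numpy as np

def fit_least_squares(X, T):
    """Least-squares classification: W_tilde = pinv(X_tilde) @ T, where
    X_tilde is the design matrix with a leading column of ones (bias)."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (N, D+1)
    return np.linalg.pinv(X_tilde) @ T                   # shape (D+1, K)

def predict(X, W_tilde):
    """Assign each row of X to the class with the largest output."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Toy data: N = 4 points, D = 2 features, K = 2 classes (1-of-K targets).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
W_tilde = fit_least_squares(X, T)
print(predict(X, W_tilde))   # should print [0 0 1 1] for this separable toy set
```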
Least Squares Method (Cont'd) - Limitations and Disadvantages
• The least-squares solution yields y(x) whose elements sum to 1, but the individual outputs are not constrained to lie in [0, 1], so they cannot be interpreted as probabilities
• Vulnerable to outliers
  • The SSE function penalizes predictions that are 'too correct', i.e. points lying a long way on the correct side of the decision boundary
• Least squares corresponds to ML under a Gaussian conditional distribution
  • A unimodal Gaussian is a poor match for the multimodal distribution of binary targets
Least Squares Method (Cont'd) - Limitations and Disadvantages
• The lack of robustness arises because the least-squares method corresponds to maximum likelihood under a Gaussian noise assumption
• Binary target vectors are far from this assumption
[Figure: decision boundaries from the least-squares solution vs. logistic regression]
Fisher's Linear Discriminant
• A linear classification model viewed as dimensionality reduction from the D-dimensional input space to one dimension: y = w^T x
• In the case of two classes, find w such that the projected data are well separated between the classes
Fisher's Linear Discriminant (Cont'd)
• Maximizing the projected mean distance?
  • Maximize the distance between the class means m1 and m2 projected onto w
  • Not appropriate when the class covariances are strongly nondiagonal: the projected distributions may still overlap considerably
Fisher's Linear Discriminant (Cont'd)
• Also take the within-class variance of the projected data into account
• Find w that maximizes the Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w)
  • S_B: between-class covariance matrix
  • S_W: within-class covariance matrix
• J(w) is maximized when w ∝ S_W^{-1}(m2 - m1): Fisher's linear discriminant
• If the within-class covariance is isotropic, w is proportional to the difference of the class means, i.e. in the direction of (m2 - m1), as in the previous case
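A minimal sketch of the two-class Fisher direction w ∝ S_W^{-1}(m2 - m1), computing the within-class scatter from two class-wise data arrays; the synthetic data are for illustration only.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant for two classes:
    w is proportional to S_W^{-1} (m2 - m1), where S_W is the
    within-class scatter summed over both classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)   # return a unit-length direction

# Synthetic two-class data, purely illustrative.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
X2 = rng.normal([2.0, 1.0], 0.3, size=(20, 2))
print(fisher_direction(X1, X2))
```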
Fisher's Linear Discriminant - Relation to Least Squares
• The Fisher criterion can be obtained as a special case of least squares
• Set the target values to N/N1 for class C1 and -N/N2 for class C2, where N1 and N2 are the class sizes and N = N1 + N2
Fisher's Discriminant for Multiple Classes
• K > 2 classes; dimensionality reduction from D to D'
• D' > 1 linear features y_k = w_k^T x, k = 1,…,D'
• Generalization of S_W and S_B
  • S_B is obtained from the decomposition of the total covariance matrix S_T = S_W + S_B (Duda and Hart, 1973)
Fisher's Discriminant for Multiple Classes (Cont'd)
• Define the covariance matrices s_W and s_B of the projected data in the D'-dimensional y-space
• Fukunaga's criterion: J(W) = Tr{s_W^{-1} s_B}
• Another criterion uses the ratio of determinants of the projected scatter matrices (Duda et al., 'Pattern Classification', Ch. 3.8.3)
  • The determinant is the product of the eigenvalues, i.e. the variances in the principal directions
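One common way to obtain a projection consistent with these criteria is to take the leading eigenvectors of S_W^{-1} S_B as the columns of W; the sketch below assumes that approach and uses synthetic data.

```python
import numpy as np

def fisher_multiclass(X, y, d_prime):
    """Multi-class Fisher projection: build S_W and S_B from the data and
    return the d_prime leading eigenvectors of S_W^{-1} S_B as W (D x d_prime)."""
    D = X.shape[1]
    m = X.mean(axis=0)
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)
        S_B += len(Xk) * np.outer(mk - m, mk - m)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:d_prime]].real

# Synthetic data: K = 3 classes in D = 3 dimensions, projected to D' = 2.
rng = np.random.default_rng(1)
centers = [[0, 0, 0], [2, 0, 1], [0, 2, 1]]
X = np.vstack([rng.normal(c, 0.2, size=(15, 3)) for c in centers])
y = np.repeat([0, 1, 2], 15)
print(fisher_multiclass(X, y, d_prime=2).shape)   # (3, 2)
```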
Perceptron Algorithm
• Classification of x by a perceptron: y(x) = f(w^T φ(x)), where f(a) = +1 for a >= 0 and -1 otherwise
• Error functions
  • The total number of misclassified patterns: piecewise constant and discontinuous; the gradient is zero almost everywhere, so it is unusable for gradient-based learning
  • The perceptron criterion: E_P(w) = -Σ_{n∈M} w^T φ(x_n) t_n, summed over the set M of misclassified patterns
Perceptron Algorithm (Cont'd)
• Stochastic gradient descent: w^(τ+1) = w^(τ) + η φ(x_n) t_n for each misclassified pattern
• The error from a misclassified pattern is reduced after each update, but this does not imply that the overall error is reduced
• Perceptron convergence theorem: if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps
• Remaining issues: convergence can be slow, there is no convergence for linearly nonseparable data, and the algorithm does not generalize readily to more than two classes
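A minimal sketch of the stochastic gradient descent update described above, assuming targets in {-1, +1} and a bias feature appended to each input; the toy data set is made up.

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """Stochastic gradient descent on the perceptron criterion.
    A pattern is misclassified when (w^T x_n) * t_n <= 0, and each such
    pattern triggers the update w <- w + eta * x_n * t_n."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias feature
    w = np.zeros(X_tilde.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X_tilde, t):
            if (w @ x_n) * t_n <= 0:     # misclassified pattern
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                  # all patterns correct: converged
            break
    return w

# Toy linearly separable data with targets in {-1, +1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
t = np.array([-1, -1, 1, 1])
print(perceptron_train(X, t))
```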
Perceptron Algorithm (Cont'd)
[Figure: successive update steps (a)-(d) of the perceptron learning algorithm]
Probabilistic Generative Models
• Compute posterior probabilities from class-conditional densities and class priors
• Two classes: p(C1|x) = σ(a), where a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))] and σ is the logistic sigmoid
• Generalization to K > 2 classes: p(Ck|x) = exp(a_k) / Σ_j exp(a_j), with a_k = ln p(x|Ck)p(Ck)
• This normalized exponential is also known as the softmax function, i.e. a smoothed version of the 'max' function
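A minimal sketch of the logistic sigmoid and the normalized exponential (softmax) applied to the log-odds quantities a and a_k defined above; the input values are arbitrary.

```python
import numpy as np

def sigmoid(a):
    """Two-class posterior p(C1|x) = sigma(a),
    with a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """K-class posterior p(Ck|x) = exp(a_k) / sum_j exp(a_j),
    with a_k = ln p(x|Ck) + ln p(Ck)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

print(sigmoid(0.0))                # 0.5
print(softmax([2.0, 1.0, 0.1]))    # entries sum to 1, largest first
```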
Probabilistic Generative Models - Continuous Inputs
• Posterior probabilities when the class-conditional densities are Gaussian and share the same covariance matrix Σ
• Two classes: p(C1|x) = σ(w^T x + w0), with w = Σ^{-1}(μ1 - μ2) and w0 = -1/2 μ1^T Σ^{-1} μ1 + 1/2 μ2^T Σ^{-1} μ2 + ln[p(C1)/p(C2)]
  • The quadratic terms in x from the exponents cancel
  • The resulting decision boundary is linear in input space
  • The priors enter only through w0, so changing them shifts the decision boundary parallel to itself
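A minimal sketch of this shared-covariance case, computing w and w0 from assumed class means, a shared covariance matrix, and a prior π; all numerical values are hypothetical.

```python
import numpy as np

def gaussian_posterior_params(mu1, mu2, Sigma, pi):
    """Shared-covariance Gaussian class-conditionals, two classes:
    p(C1|x) = sigma(w^T x + w0) with
      w  = Sigma^{-1} (mu1 - mu2)
      w0 = -0.5 mu1^T Sigma^{-1} mu1 + 0.5 mu2^T Sigma^{-1} mu2 + ln(pi / (1 - pi))."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return w, w0

# Hypothetical parameters, for illustration only.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.eye(2)
w, w0 = gaussian_posterior_params(mu1, mu2, Sigma, pi=0.5)
x = np.array([0.2, 0.3])
print(1.0 / (1.0 + np.exp(-(w @ x + w0))))   # p(C1|x), > 0.5 since x is nearer mu1
```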
Probabilistic Generative Models - Continuous Inputs (Cont'd)
• Generalization to K classes: a_k(x) = w_k^T x + w_k0, with w_k = Σ^{-1} μ_k and w_k0 = -1/2 μ_k^T Σ^{-1} μ_k + ln p(Ck)
• With a shared covariance matrix the decision boundaries are again linear
• If each class-conditional density has its own covariance matrix, the quadratic terms no longer cancel and we obtain quadratic functions of x, giving rise to a quadratic discriminant
Probabilistic Generative Models - Maximum Likelihood Solution
• Determine the parameters of the class-conditional densities and the class priors by maximum likelihood from a training data set
• Two classes: with prior p(C1) = π, shared covariance Σ, and targets t_n ∈ {0, 1}, the likelihood function is p(t | π, μ1, μ2, Σ) = Π_n [π N(x_n|μ1, Σ)]^{t_n} [(1-π) N(x_n|μ2, Σ)]^{1-t_n}
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)
• Two classes (cont'd)
  • Maximization of the likelihood with respect to π: collect the terms of the log likelihood that depend on π, Σ_n [t_n ln π + (1-t_n) ln(1-π)], and set the derivative with respect to π to zero, giving π = N1/(N1+N2), the fraction of points in class C1
  • Maximization with respect to μ1 gives μ1 = (1/N1) Σ_n t_n x_n, the sample mean of the class C1 inputs, and analogously for μ2
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)
• Two classes (cont'd)
  • Maximization of the likelihood with respect to the shared covariance matrix Σ gives Σ = (N1/N) S1 + (N2/N) S2, where S_k is the sample covariance of the data from class C_k
  • This is a weighted average of the covariance matrices associated with each class, but, being a maximum-likelihood Gaussian fit, it is not robust to outliers
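A minimal NumPy sketch of these maximum-likelihood estimates, computed from two synthetic class-wise data arrays (the data themselves are made up).

```python
import numpy as np

def fit_gaussian_generative(X1, X2):
    """Maximum-likelihood estimates for the shared-covariance model:
    pi = N1/N, class means mu1 and mu2, and
    Sigma = (N1/N) S1 + (N2/N) S2, the weighted average of the
    per-class sample covariances."""
    N1, N2 = len(X1), len(X2)
    N = N1 + N2
    pi = N1 / N
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

# Synthetic class-wise data.
rng = np.random.default_rng(2)
X1 = rng.normal([1.0, 1.0], 0.5, size=(30, 2))
X2 = rng.normal([-1.0, -1.0], 0.5, size=(30, 2))
pi, mu1, mu2, Sigma = fit_gaussian_generative(X1, X2)
print(pi, mu1, mu2, Sigma, sep="\n")
```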
Probabilistic Generative Models - Discrete Features
• Discrete (e.g. binary) feature values x_i ∈ {0, 1}
  • A general distribution over D binary features corresponds to a table of 2^D entries, so the table size grows exponentially with the number of features
• Naive Bayes assumption: treat the features as independent, conditioned on the class C_k, giving p(x|C_k) = Π_i μ_ki^{x_i} (1 - μ_ki)^{1 - x_i}
• The resulting a_k(x) are again linear with respect to the features, as in the continuous case
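A minimal sketch of the naive Bayes model with binary features; the Bernoulli parameters μ_ki and the class priors below are chosen purely for illustration.

```python
import numpy as np

def naive_bayes_log_posterior(x, mu, priors):
    """Bernoulli naive Bayes: p(x|Ck) = prod_i mu_ki^{x_i} (1 - mu_ki)^{1 - x_i}.
    Returns the unnormalized log posteriors a_k = ln p(x|Ck) + ln p(Ck),
    which are linear functions of the binary features x_i."""
    log_lik = x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T
    return log_lik + np.log(priors)

# Hypothetical parameters: K = 2 classes, D = 3 binary features.
mu = np.array([[0.8, 0.2, 0.6],
               [0.3, 0.7, 0.4]])
priors = np.array([0.5, 0.5])
x = np.array([1, 0, 1])
a = naive_bayes_log_posterior(x, mu, priors)
print(np.argmax(a))   # index of the most probable class
```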
Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42)
Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43)
Probabilistic Generative Models - Exponential Family
• For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (two classes) or softmax (K classes) activation functions
• Generalization: class-conditional densities from the exponential family, p(x|λ_k) = h(x) g(λ_k) exp{λ_k^T u(x)}
• For the subclass with u(x) = x, the argument a(x) of the sigmoid (two classes) and the a_k(x) of the softmax (K classes) are again linear with respect to x