Linear Models for Classification Berkay Topçu
Linear Models for Classification • Goal: Take an input vector and assign it to one of K classes (Ck where k=1,...,K) • Linear separation of classes
Generalized Linear Models • We wish to predict discrete class labels, or more generally class posterior probabilities, which lie in the range (0,1) • The classification model is a linear function of the parameters • Classification can be carried out directly in the original input space, or in a feature space obtained by a fixed nonlinear transformation of the input variables using a vector of basis functions φ(x)
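A minimal sketch of such a fixed nonlinear transformation, using Gaussian basis functions; the centres and width below are illustrative choices, not values from the slides:

```python
import numpy as np

# Fixed nonlinear transformation of the input: Gaussian basis functions.
# Centres and width are illustrative (hypothetical), not learned values.
centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
s = 0.5  # basis-function width

def phi(x):
    """Map a 2-D input to a bias term plus Gaussian basis-function activations."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2 * s ** 2))))

print(phi(np.array([0.5, 0.5])))
```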
Discriminant Functions • Linear discriminant: y(x) = w^T x + w0 • If y(x) ≥ 0, assign x to class C1, and to class C2 otherwise • The decision boundary is given by y(x) = 0 • w determines the orientation of the decision surface and the bias w0 determines its location • Compact notation: absorb the bias by appending a fixed input component x0 = 1, so that y(x) = w^T x
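A minimal sketch of this two-class decision rule, with illustrative values for w and w0 (not taken from the slides):

```python
import numpy as np

# Two-class linear discriminant: y(x) = w^T x + w0.
w = np.array([1.0, -2.0])   # orientation of the decision surface
w0 = 0.5                    # bias: location of the surface

def classify(x):
    """Assign x to class C1 if y(x) >= 0, otherwise to class C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([3.0, 1.0])))   # C1
print(classify(np.array([0.0, 2.0])))   # C2
```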
Multiple Classes • A K-class discriminant can be built by combining a number of two-class discriminant functions (K>2) • One-versus-the-rest: K-1 classifiers, each separating points in one particular class Ck from points not in that class • One-versus-one: K(K-1)/2 binary discriminant functions, one for every pair of classes
Multiple Classes • A single K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_k0 • Assign x to class Ck if y_k(x) > y_j(x) for all j ≠ k • How to learn the parameters of the linear discriminant functions?
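A small sketch of the K-class argmax rule, again with illustrative weight values:

```python
import numpy as np

# K-class linear discriminant: y_k(x) = w_k^T x + w_k0, pick the largest.
# W (K x D) and w0 (K,) are illustrative values, not learned here.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])

def predict(x):
    scores = W @ x + w0            # y_k(x) for k = 1..K
    return int(np.argmax(scores))  # class with the largest discriminant

print(predict(np.array([2.0, -1.0])))  # -> 0 (class C1)
```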
Least Squares for Classification • Each class Ck is described by its own linear model y_k(x) = w_k^T x + w_k0 • Training data set {x_n, t_n} for n = 1,...,N, where the targets t_n use a 1-of-K coding • Grouping the data: the target matrix T has nth row t_n^T, and the design matrix X has nth row x_n^T (with the fixed bias component included)
Least Squares for Classification • Minimizing the sum-of-squares error function E(W) = (1/2) Tr{(XW − T)^T (XW − T)} • Solution: W = (X^T X)^{-1} X^T T, i.e. the pseudo-inverse of X applied to T • Discriminant function: y(x) = W^T x, assigning x to the class with the largest output
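A sketch of least-squares classification on synthetic data, using 1-of-K target coding and the pseudo-inverse solution; the data, dimensions, and class means are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D=2 dimensions, K=3 classes (synthetic).
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(m, 0.5, (N // K, D)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)

T = np.eye(K)[labels]                        # 1-of-K target coding, N x K
X_tilde = np.hstack([np.ones((N, 1)), X])    # augmented inputs (bias column)

# Least-squares solution W = (X^T X)^-1 X^T T via the pseudo-inverse
W = np.linalg.pinv(X_tilde) @ T

pred = np.argmax(X_tilde @ W, axis=1)        # discriminant y(x) = W^T x
print("training accuracy:", np.mean(pred == labels))
```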
Fisher’s Linear Discriminant • Dimensionality reduction: take the D-dimensional input vector x and project it to one dimension using y = w^T x • Choose the projection that maximizes class separation • Two-class problem: N1 points of C1 and N2 points of C2 • Fisher’s idea: • large separation between the projected class means • small variance within each class, minimizing class overlap
Fisher’s Linear Discriminant • The Fisher criterion: J(w) = (m2 − m1)^2 / (s1^2 + s2^2), the ratio of the squared separation of the projected class means to the within-class variance; equivalently J(w) = (w^T S_B w) / (w^T S_W w), maximized by w ∝ S_W^{-1}(m2 − m1)
Fisher’s Linear Discriminant • For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis – Hastie, Buja and Tibshirani) • For multiple classes, more than one projection direction is used • The weight values are determined by the eigenvectors corresponding to the largest eigenvalues of S_W^{-1} S_B
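A sketch of the two-class Fisher direction w ∝ S_W^{-1}(m2 − m1) on synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes in 2-D (synthetic data for illustration only).
X1 = rng.normal([0, 0], 1.0, (50, 2))
X2 = rng.normal([3, 2], 1.0, (50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Project onto the 1-D Fisher direction; the class means should separate well.
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```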
The Perceptron Algorithm • The input vector x is transformed using a fixed nonlinear transformation φ(x), with targets t ∈ {−1, +1} • Perceptron criterion: we want w^T φ(x_n) t_n > 0 for all training samples • We need to minimize E_P(w) = −Σ_{n∈M} w^T φ(x_n) t_n, where M is the set of misclassified patterns
The Perceptron Algorithm – Stochastic Gradient Descent • Cycle through the training patterns in turn • If the pattern is correctly classified, the weight vector remains unchanged; otherwise update w ← w + η φ(x_n) t_n
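A sketch of the perceptron update on a synthetic, roughly linearly separable set; the data, features (identity plus bias), and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with targets t in {-1, +1}.
X = np.vstack([rng.normal([0, 0], 0.4, (30, 2)), rng.normal([3, 3], 0.4, (30, 2))])
t = np.hstack([-np.ones(30), np.ones(30)])

phi = np.hstack([np.ones((60, 1)), X])   # identity features plus bias term
w = np.zeros(3)
eta = 1.0                                # learning rate

for epoch in range(20):
    for n in range(len(t)):
        # Pattern misclassified if w^T phi_n * t_n <= 0
        if (w @ phi[n]) * t[n] <= 0:
            w = w + eta * phi[n] * t[n]  # perceptron update

print("misclassified after training:", np.sum((phi @ w) * t <= 0))
```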
Probabilistic Generative Models • Depend on simple assumptions about the distribution of the data • Logistic sigmoid function: σ(a) = 1 / (1 + exp(−a)) • Maps the whole real axis into the finite interval (0, 1)
Continuous Inputs - Gaussian • Assuming the class-conditional densities p(x|Ck) are Gaussian with a shared covariance matrix • Case of two classes: the posterior is a logistic sigmoid of a linear function of x, p(C1|x) = σ(w^T x + w0)
Maximum Likelihood Solution • Likelihood function: a product over the data points, with each point contributing π N(x_n|μ1, Σ) if it belongs to C1 and (1 − π) N(x_n|μ2, Σ) if it belongs to C2 • Maximizing the log-likelihood gives the prior π = N1/N, the class means μ1, μ2 as the sample means of each class, and Σ as the weighted average of the two class covariance matrices
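A sketch of these maximum likelihood estimates for two Gaussian classes with a shared covariance, including the resulting posterior parameters w and w0; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes, each assumed Gaussian with a shared covariance (synthetic data).
X1 = rng.normal([0, 0], 1.0, (100, 2))   # class C1
X2 = rng.normal([2, 2], 1.0, (100, 2))   # class C2

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                            # ML estimate of the prior p(C1)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # ML class means

# Shared covariance: weighted average of the per-class covariances
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)

# Posterior p(C1 | x) = sigmoid(w^T x + w0)
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1 - pi)))
print("w =", w, " w0 =", w0)
```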
Probabilistic Discriminative Models • In the probabilistic generative model, the number of parameters grows quadratically with M (the input dimension) • However, logistic regression p(C1|φ) = σ(w^T φ) has only M adjustable parameters • Maximum likelihood solution for logistic regression • Energy function: the negative log likelihood, i.e. the cross-entropy error E(w) = −Σ_n {t_n ln y_n + (1 − t_n) ln(1 − y_n)}
Iterative Reweighted Least Squares • Applying Newton-Raphson iterative optimization to the linear regression sum-of-squares error reproduces the standard least-squares solution in a single step
Iterative Reweighted Least Squares • Newton-Raphson update for the negative log likelihood: w_new = w_old − (Φ^T R Φ)^{-1} Φ^T (y − t), where R is diagonal with R_nn = y_n(1 − y_n) • Each update is a weighted least-squares problem; because R depends on w, the update must be applied iteratively
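A sketch of the IRLS / Newton-Raphson update for logistic regression on synthetic two-class data; the small ridge added to the Hessian is a defensive choice, not part of the basic algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-class toy data, targets t in {0, 1}.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])

Phi = np.hstack([np.ones((100, 1)), X])   # design matrix with bias feature
w = np.zeros(Phi.shape[1])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# IRLS: repeated Newton-Raphson (weighted least-squares) updates
for _ in range(10):
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                        # diagonal of the weighting matrix
    H = Phi.T @ (Phi * R[:, None])         # Hessian Phi^T R Phi
    H += 1e-6 * np.eye(len(w))             # small ridge for numerical stability
    grad = Phi.T @ (y - t)                 # gradient of the negative log likelihood
    w = w - np.linalg.solve(H, grad)       # Newton-Raphson / IRLS step

print("weights:", w)
```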
Maximum Margin Classifiers • Support Vector Machines for the two-class problem • Assuming a linearly separable data set • There exists at least one choice of the parameters w and b such that t_n(w^T φ(x_n) + b) > 0 for every training point • We seek the solution that gives the smallest generalization error • Margin: the smallest distance between the decision boundary and any of the samples
Support Vector Machines • Optimization of the parameters by maximizing the margin • Maximizing the margin is equivalent to minimizing (1/2)||w||^2 • subject to the constraints t_n(w^T φ(x_n) + b) ≥ 1 for all n • Introduction of Lagrange multipliers a_n ≥ 0, one per constraint
Support Vector Machines - Lagrange Multipliers • The Lagrangian is minimized with respect to w and b and maximized with respect to a • The dual form: maximize L(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0 • This is a quadratic programming problem
Support Vector Machines • Overlapping class distributions (data that are not linearly separable) • Slack variables ξ_n ≥ 0 measure how far a point lies on the wrong side of its margin boundary • To maximize the margin while penalizing such points, minimize C Σ_n ξ_n + (1/2)||w||^2
SVM - Overlapping Class Distributions • The dual is identical to the separable case, except that the multipliers now satisfy the box constraints 0 ≤ a_n ≤ C • Again this is a quadratic programming problem
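A sketch of the soft-margin SVM using scikit-learn's SVC, assuming scikit-learn is available; the data and the value of C are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Overlapping two-class data; perfect linear separation is not possible.
X = np.vstack([rng.normal([0, 0], 1.2, (60, 2)), rng.normal([2, 2], 1.2, (60, 2))])
t = np.hstack([-np.ones(60), np.ones(60)])

# C controls the trade-off between the slack-variable penalty and the margin:
# large C penalizes margin violations heavily, small C allows a wider margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, t)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, t))
```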
Support Vector Machines • Relation to logistic regression • The hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE) • Black: MCE, Blue: hinge loss, Red: logistic regression, Green: squared error
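A small sketch comparing these loss functions as a function of the margin z = t·y(x); the logistic loss is rescaled so that it passes through the point (0, 1), as is commonly done for this comparison:

```python
import numpy as np

# Compare the losses as functions of the margin z = t * y(x).
z = np.linspace(-2, 2, 9)

misclassification = (z <= 0).astype(float)         # ideal 0-1 error
hinge = np.maximum(0.0, 1.0 - z)                   # SVM hinge loss
logistic = np.log(1.0 + np.exp(-z)) / np.log(2.0)  # rescaled logistic loss
squared = (1.0 - z) ** 2                           # squared error

for row in zip(z, misclassification, hinge, logistic, squared):
    print("z=%5.2f  0-1=%.0f  hinge=%.2f  logistic=%.2f  squared=%.2f" % row)
```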