Linear Models for Classification Berkay Topçu
Linear Models for Classification • Goal: Take an input vector and assign it to one of K classes (Ck where k=1,...,K) • Linear separation of classes
Generalized Linear Models • We wish to predict discrete class labels, or more generally class posterior probabilities, which lie in the range (0,1) • The classification model is a linear function of the parameters • Classification can be carried out directly in the original input space, or in a feature space obtained by a fixed nonlinear transformation of the input variables using a vector of basis functions φ(x)
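A minimal sketch of such a fixed nonlinear transformation, using Gaussian basis functions; the centres and width below are illustrative choices, not values from the slides:

```python
import numpy as np

# Fixed nonlinear transformation of the input: Gaussian basis functions.
# Centres and width are illustrative (hypothetical), not learned values.
centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
s = 0.5  # basis-function width

def phi(x):
    """Map a 2-D input to a bias term plus Gaussian basis-function activations."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2 * s ** 2))))

print(phi(np.array([0.5, 0.5])))
```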
Discriminant Functions • Linear discriminant: y(x) = w^T x + w0 • If y(x) ≥ 0, assign x to class C1, and to class C2 otherwise • The decision boundary is given by y(x) = 0 • w determines the orientation of the decision surface and the bias w0 determines its location • Compact notation: absorb the bias by appending a fixed input component x0 = 1, so that y(x) = w^T x
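A minimal sketch of this two-class decision rule, with illustrative values for w and w0 (not taken from the slides):

```python
import numpy as np

# Two-class linear discriminant: y(x) = w^T x + w0.
w = np.array([1.0, -2.0])   # orientation of the decision surface
w0 = 0.5                    # bias: location of the surface

def classify(x):
    """Assign x to class C1 if y(x) >= 0, otherwise to class C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([3.0, 1.0])))   # C1
print(classify(np.array([0.0, 2.0])))   # C2
```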
Multiple Classes • A K-class discriminant can be built by combining a number of two-class discriminant functions (K>2) • One-versus-the-rest: K-1 classifiers, each separating points in one particular class Ck from points not in that class • One-versus-one: K(K-1)/2 binary discriminant functions, one for every pair of classes
Multiple Classes • A single K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_k0 • Assign x to class Ck if y_k(x) > y_j(x) for all j ≠ k • How to learn the parameters of the linear discriminant functions?
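A small sketch of the K-class argmax rule, again with illustrative weight values:

```python
import numpy as np

# K-class linear discriminant: y_k(x) = w_k^T x + w_k0, pick the largest.
# W (K x D) and w0 (K,) are illustrative values, not learned here.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])

def predict(x):
    scores = W @ x + w0            # y_k(x) for k = 1..K
    return int(np.argmax(scores))  # class with the largest discriminant

print(predict(np.array([2.0, -1.0])))  # -> 0 (class C1)
```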
Least Squares for Classification • Each class Ck is described by its own linear model y_k(x) = w_k^T x + w_k0 • Training data set {x_n, t_n} for n = 1,...,N, where the targets t_n use a 1-of-K coding • Grouping the data: the target matrix T has nth row t_n^T, and the design matrix X has nth row x_n^T (with the fixed bias component included)
Least Squares for Classification • Minimizing the sum-of-squares error function E(W) = (1/2) Tr{(XW − T)^T (XW − T)} • Solution: W = (X^T X)^{-1} X^T T, i.e. the pseudo-inverse of X applied to T • Discriminant function: y(x) = W^T x, assigning x to the class with the largest output
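A sketch of least-squares classification on synthetic data, using 1-of-K target coding and the pseudo-inverse solution; the data, dimensions, and class means are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in D=2 dimensions, K=3 classes (synthetic).
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(m, 0.5, (N // K, D)) for m in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)

T = np.eye(K)[labels]                        # 1-of-K target coding, N x K
X_tilde = np.hstack([np.ones((N, 1)), X])    # augmented inputs (bias column)

# Least-squares solution W = (X^T X)^-1 X^T T via the pseudo-inverse
W = np.linalg.pinv(X_tilde) @ T

pred = np.argmax(X_tilde @ W, axis=1)        # discriminant y(x) = W^T x
print("training accuracy:", np.mean(pred == labels))
```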
Fisher’s Linear Discriminant • Dimensionality reduction: take the D-dimensional input vector x and project it to one dimension using y = w^T x • Choose the projection that maximizes class separation • Two-class problem: N1 points of C1 and N2 points of C2 • Fisher’s idea: • large separation between the projected class means • small variance within each class, minimizing class overlap
Fisher’s Linear Discriminant • The Fisher criterion: J(w) = (m2 − m1)^2 / (s1^2 + s2^2), the ratio of the squared separation of the projected class means to the within-class variance; equivalently J(w) = (w^T S_B w) / (w^T S_W w), maximized by w ∝ S_W^{-1}(m2 − m1)
Fisher’s Linear Discriminant • For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis – Hastie, Buja and Tibshirani) • For multiple classes, more than one projection direction is used • The weight values are determined by the eigenvectors corresponding to the largest eigenvalues of S_W^{-1} S_B
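A sketch of the two-class Fisher direction w ∝ S_W^{-1}(m2 − m1) on synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes in 2-D (synthetic data for illustration only).
X1 = rng.normal([0, 0], 1.0, (50, 2))
X2 = rng.normal([3, 2], 1.0, (50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Project onto the 1-D Fisher direction; the class means should separate well.
print("projected class means:", (X1 @ w).mean(), (X2 @ w).mean())
```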
The Perceptron Algorithm • The input vector x is transformed using a fixed nonlinear transformation φ(x), with targets t ∈ {−1, +1} • Perceptron criterion: we want w^T φ(x_n) t_n > 0 for all training samples • We need to minimize E_P(w) = −Σ_{n∈M} w^T φ(x_n) t_n, where M is the set of misclassified patterns
The Perceptron Algorithm – Stochastic Gradient Descent • Cycle through the training patterns in turn • If the pattern is correctly classified, the weight vector remains unchanged; otherwise update w ← w + η φ(x_n) t_n
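A sketch of the perceptron update on a synthetic, roughly linearly separable set; the data, features (identity plus bias), and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data with targets t in {-1, +1}.
X = np.vstack([rng.normal([0, 0], 0.4, (30, 2)), rng.normal([3, 3], 0.4, (30, 2))])
t = np.hstack([-np.ones(30), np.ones(30)])

phi = np.hstack([np.ones((60, 1)), X])   # identity features plus bias term
w = np.zeros(3)
eta = 1.0                                # learning rate

for epoch in range(20):
    for n in range(len(t)):
        # Pattern misclassified if w^T phi_n * t_n <= 0
        if (w @ phi[n]) * t[n] <= 0:
            w = w + eta * phi[n] * t[n]  # perceptron update

print("misclassified after training:", np.sum((phi @ w) * t <= 0))
```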
Probabilistic Generative Models • Depend on simple assumptions about the distribution of the data • Logistic sigmoid function: σ(a) = 1 / (1 + exp(−a)) • Maps the whole real axis into the finite interval (0, 1)
Continuous Inputs - Gaussian • Assuming the class-conditional densities p(x|Ck) are Gaussian with a shared covariance matrix • Case of two classes: the posterior is a logistic sigmoid of a linear function of x, p(C1|x) = σ(w^T x + w0)
Maximum Likelihood Solution • Likelihood function: a product over the data points, with each point contributing π N(x_n|μ1, Σ) if it belongs to C1 and (1 − π) N(x_n|μ2, Σ) if it belongs to C2 • Maximizing the log-likelihood gives the prior π = N1/N, the class means μ1, μ2 as the sample means of each class, and Σ as the weighted average of the two class covariance matrices
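A sketch of these maximum likelihood estimates for two Gaussian classes with a shared covariance, including the resulting posterior parameters w and w0; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes, each assumed Gaussian with a shared covariance (synthetic data).
X1 = rng.normal([0, 0], 1.0, (100, 2))   # class C1
X2 = rng.normal([2, 2], 1.0, (100, 2))   # class C2

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                            # ML estimate of the prior p(C1)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # ML class means

# Shared covariance: weighted average of the per-class covariances
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)

# Posterior p(C1 | x) = sigmoid(w^T x + w0)
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1 - pi)))
print("w =", w, " w0 =", w0)
```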
Probabilistic Discriminative Models • In the probabilistic generative model, the number of parameters grows quadratically with M (the input dimension) • However, logistic regression p(C1|φ) = σ(w^T φ) has only M adjustable parameters • Maximum likelihood solution for logistic regression • Energy function: the negative log likelihood, i.e. the cross-entropy error E(w) = −Σ_n {t_n ln y_n + (1 − t_n) ln(1 − y_n)}
Iterative Reweighted Least Squares • Applying Newton-Raphson iterative optimization to the linear regression sum-of-squares error reproduces the standard least-squares solution in a single step
Iterative Reweighted Least Squares • Newton-Raphson update for the negative log likelihood: w_new = w_old − (Φ^T R Φ)^{-1} Φ^T (y − t), where R is diagonal with R_nn = y_n(1 − y_n) • Each update is a weighted least-squares problem; because R depends on w, the update must be applied iteratively
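A sketch of the IRLS / Newton-Raphson update for logistic regression on synthetic two-class data; the small ridge added to the Hessian is a defensive choice, not part of the basic algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-class toy data, targets t in {0, 1}.
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
t = np.hstack([np.zeros(50), np.ones(50)])

Phi = np.hstack([np.ones((100, 1)), X])   # design matrix with bias feature
w = np.zeros(Phi.shape[1])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# IRLS: repeated Newton-Raphson (weighted least-squares) updates
for _ in range(10):
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                        # diagonal of the weighting matrix
    H = Phi.T @ (Phi * R[:, None])         # Hessian Phi^T R Phi
    H += 1e-6 * np.eye(len(w))             # small ridge for numerical stability
    grad = Phi.T @ (y - t)                 # gradient of the negative log likelihood
    w = w - np.linalg.solve(H, grad)       # Newton-Raphson / IRLS step

print("weights:", w)
```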
Maximum Margin Classifiers • Support Vector Machines for the two-class problem • Assuming a linearly separable data set • There exists at least one choice of the parameters w and b such that t_n(w^T φ(x_n) + b) > 0 for every training point • We seek the solution that gives the smallest generalization error • Margin: the smallest distance between the decision boundary and any of the samples
Support Vector Machines • Optimization of the parameters by maximizing the margin • Maximizing the margin is equivalent to minimizing (1/2)||w||^2 • subject to the constraints t_n(w^T φ(x_n) + b) ≥ 1 for all n • Introduction of Lagrange multipliers a_n ≥ 0, one per constraint
Support Vector Machines - Lagrange Multipliers • The Lagrangian is minimized with respect to w and b and maximized with respect to a • The dual form: maximize L(a) = Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0 • This is a quadratic programming problem
Support Vector Machines • Overlapping class distributions (data that are not linearly separable) • Slack variables ξ_n ≥ 0 measure how far a point lies on the wrong side of its margin boundary • To maximize the margin while penalizing such points, minimize C Σ_n ξ_n + (1/2)||w||^2
SVM - Overlapping Class Distributions • The dual is identical to the separable case, except that the multipliers now satisfy the box constraints 0 ≤ a_n ≤ C • Again this is a quadratic programming problem
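A sketch of the soft-margin SVM using scikit-learn's SVC, assuming scikit-learn is available; the data and the value of C are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Overlapping two-class data; perfect linear separation is not possible.
X = np.vstack([rng.normal([0, 0], 1.2, (60, 2)), rng.normal([2, 2], 1.2, (60, 2))])
t = np.hstack([-np.ones(60), np.ones(60)])

# C controls the trade-off between the slack-variable penalty and the margin:
# large C penalizes margin violations heavily, small C allows a wider margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, t)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, t))
```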
Support Vector Machines • Relation to logistic regression • The hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE) • Black: MCE, Blue: hinge loss, Red: logistic regression, Green: squared error
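A small sketch comparing these loss functions as a function of the margin z = t·y(x); the logistic loss is rescaled so that it passes through the point (0, 1), as is commonly done for this comparison:

```python
import numpy as np

# Compare the losses as functions of the margin z = t * y(x).
z = np.linspace(-2, 2, 9)

misclassification = (z <= 0).astype(float)         # ideal 0-1 error
hinge = np.maximum(0.0, 1.0 - z)                   # SVM hinge loss
logistic = np.log(1.0 + np.exp(-z)) / np.log(2.0)  # rescaled logistic loss
squared = (1.0 - z) ** 2                           # squared error

for row in zip(z, misclassification, hinge, logistic, squared):
    print("z=%5.2f  0-1=%.0f  hinge=%.2f  logistic=%.2f  squared=%.2f" % row)
```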