Linear Discriminant Functions (Chapter 5, Duda et al.) CS479/679 Pattern Recognition, Dr. George Bebis
Discriminant Functions: two-categories case • Decide w1 if g(x) > 0 and w2 if g(x) < 0. • If g(x) = 0, then x lies on the decision boundary and can be assigned to either class. • Classification is viewed as finding a decision boundary that separates the data belonging to different classes.
Statistical vs Discriminant Approach • Statistical approaches find the decision boundary by first estimating the probability distribution of the patterns belonging to each class. • Discriminant approaches find the decision boundary explicitly without assuming a probability distribution.
Discriminant Approach • Specify a parametric form for the decision boundary (e.g., linear or quadratic). • Find the best decision boundary of the specified form using a set of training examples xk. • This is performed by minimizing a criterion function (e.g., the "training error" or "sample risk"): the disagreement, over the training examples, between the predicted class and the correct class.
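As a rough illustration (not part of the original slides; the function name is mine), the training error can be computed as the fraction of samples whose predicted class differs from the correct class:

```python
import numpy as np

def training_error(predicted, correct):
    """Fraction of training samples whose predicted class label
    disagrees with the correct class label (0/1 sample risk)."""
    predicted = np.asarray(predicted)
    correct = np.asarray(correct)
    return float(np.mean(predicted != correct))

# Example: one of four samples is misclassified -> 0.25
print(training_error([1, 1, 2, 2], [1, 2, 2, 2]))
```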
Linear Discriminant Functions: two-categories case • A linear discriminant function has the following form: g(x) = wtx + w0. • The decision boundary, given by g(x) = 0, is a hyperplane whose orientation is determined by w and whose location is determined by w0. • w is the normal to the hyperplane. • If w0 = 0, the hyperplane passes through the origin.
Geometric Interpretation of g(x) • g(x) provides an algebraic measure of the distance of x from the hyperplane. x can be expressed as x = xp + r (w/||w||), where xp is the projection of x onto the hyperplane, r is the signed distance, and w/||w|| gives the direction of r.
Geometric Interpretation of g(x) (cont’d) • Substituting the previous expression in g(x): g(x) = wtx + w0 = wt(xp + r w/||w||) + w0 = r ||w||, since g(xp) = 0 and wtw = ||w||².
Geometric Interpretation of g(x) (cont’d) • Therefore, the distance of x from the hyperplane is given by r = g(x)/||w||. • Setting x = 0 gives the distance of the origin from the hyperplane: w0/||w||.
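A minimal sketch of these two formulas, assuming NumPy (the function names are illustrative, not from the slides):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def signed_distance(x, w, w0):
    """Signed distance of x from the hyperplane g(x) = 0: r = g(x)/||w||."""
    return g(x, w, w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, 1.0]), -4.0
x = np.array([3.0, 1.0])
print(g(x, w, w0))                # 3.0 > 0, so decide w1
print(signed_distance(x, w, w0))  # 3/sqrt(5) ~= 1.34
```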
Linear Discriminant Functions: multi-category case • There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest (i.e., c-1 two-class problems) problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d) (2) One against another (i.e., c(c-1)/2 pairs of classes) problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont’d) • To avoid the problem of ambiguous regions: • Define c linear discriminant functions gi(x) = witx + wi0, i = 1, ..., c. • Assign x to wi if gi(x) > gj(x) for all j ≠ i. • The resulting classifier is called a linear machine (see Chapter 2 too).
Linear Discriminant Functions: multi-category case (cont’d) • A linear machine divides the feature space into c convex decision regions. • If x is in region Ri, then gi(x) is the largest. • There are c(c-1)/2 pairs of regions, but typically fewer decision boundaries.
Linear Discriminant Functions: multi-category case (cont’d) • The boundary between two contiguous regions Ri and Rj is a portion of the hyperplane Hij given by gi(x) = gj(x), i.e., (wi - wj)tx + (wi0 - wj0) = 0. • (wi - wj) is normal to Hij, and the signed distance from x to Hij is (gi(x) - gj(x)) / ||wi - wj||.
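A short sketch of a linear machine under these definitions (NumPy assumed; the weights below are made up for illustration):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class i with the largest g_i(x) = W[i]^t x + w0[i].
    W: (c, d) matrix of weight vectors, w0: (c,) vector of biases."""
    scores = W @ x + w0
    return int(np.argmax(scores))   # index of the winning class

# Three classes in 2D (made-up weights):
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 0.5]), W, w0))   # -> 0
```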
Higher Order Discriminant Functions • Can produce more complicated decision boundaries than linear discriminant functions. • Linear discriminant: g(x) = w0 + Σi wixi (hyperplane decision boundaries). • Quadratic discriminant: g(x) = w0 + Σi wixi + Σi Σj wijxixj (hyperquadric decision boundaries).
Generalized discriminants • g(x) = Σi=1..d̂ αi yi(x) = αty • α is a d̂-dimensional weight vector. • The functions yi(x) are called φ functions. • The yi(x) map points from the d-dimensional x-space to the d̂-dimensional y-space (usually d̂ >> d).
Generalized discriminants • The resulting discriminant function is not linear in x but it is linear in y. • The generalized discriminant separates points in the transformed space by a hyperplane passing through the origin.
Example • d = 1, g(x) = α1 + α2x + α3x², with φ functions y = (1, x, x²)t. • The corresponding decision regions R1, R2 in the x-space are not simply connected!
Example (cont’d) • The mapping y = (1, x, x²)t takes the line (the x-axis) to a parabola in y-space. • The plane αty = 0 divides the y-space into two decision regions.
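A sketch of this example in code (the specific weight vector α = (-1, 1, 2)t is an assumption chosen so that R1 is not simply connected; everything else follows the slides):

```python
import numpy as np

def phi(x):
    """phi functions: map scalar x to y = (1, x, x^2)^t."""
    return np.array([1.0, x, x**2])

alpha = np.array([-1.0, 1.0, 2.0])   # assumed weights; g(x) = -1 + x + 2x^2

def g(x):
    """Generalized discriminant: linear in y, quadratic in x."""
    return alpha @ phi(x)

# R1 = {x : g(x) > 0} is not simply connected: x < -1 or x > 0.5
for x in (-2.0, 0.0, 1.0):
    print(x, g(x) > 0)    # True, False, True
```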
Learning: two-category, linearly separable case • Given a linear discriminant function g(x) = wtx + w0, the goal is to learn the parameters w and w0 using a set of n labeled samples xi, where each xi has a class label ω1 or ω2.
Simplified notation: augmented feature/weight vectors • y = (1, x1, ..., xd)t, α = (w0, w1, ..., wd)t • dimensionality: d → (d+1)
Classification in augmented space • Discriminant: g(x) = αty • Classification rule: if αtyi > 0, assign yi to ω1; else if αtyi < 0, assign yi to ω2.
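A brief sketch of the augmented notation (illustrative; NumPy assumed): prepending a 1 to each feature vector makes g(x) = αty with α = (w0, w1, ..., wd)t.

```python
import numpy as np

def augment(x):
    """Augmented feature vector: y = (1, x1, ..., xd)^t."""
    return np.concatenate(([1.0], x))

w, w0 = np.array([2.0, 1.0]), -4.0
alpha = np.concatenate(([w0], w))   # augmented weight vector (w0, w1, ..., wd)

x = np.array([3.0, 1.0])
y = augment(x)
print(alpha @ y)   # same value as w^t x + w0 = 3.0; > 0, so assign to omega_1
```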
Learning: two-category, linearly separable case • Given a linear discriminant function g(x) = αty, the goal is to learn the weights (parameters) α using a set of n labeled samples yi, where each yi has a class label ω1 or ω2.
Effect of training examples on solution • Every training sample yi places a constraint on the weight vector α. • αty = 0 defines a hyperplane in parameter space having y as a normal vector. • Given n examples, the solution α must lie in the intersection of n half-spaces. (Figure: parameter space (α1, α2).)
Effect of training examples on solution (cont’d) • The solution can be visualized either in the parameter space or in the feature space. (Figures: parameter space (α1, α2) and feature space (y1, y2).)
Solution Uniqueness/Constraints • The solution vector is usually not unique; we can impose certain constraints to enforce uniqueness, e.g.: • Example 1: find the unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
Solution Uniqueness/Constraints (cont’d) • Example 2: find the minimum-length weight vector satisfying the constraint αtyi ≥ b for all i, where b is a positive constant called the margin.
Learning: two-category linearly separable case (cont’d) • The key objective is to move the solution toward the center of the feasible region in parameter space (α1, α2), as such a solution is more likely to classify new test samples correctly.
Normalized Version • If yi is in ω2, replace yi by -yi. • Then find α such that αtyi > 0 for all samples.
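A minimal sketch of this normalization step (assuming labels +1 for ω1 and -1 for ω2; the names are mine):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Multiply each augmented sample by -1 if it belongs to omega_2,
    so that a correct alpha satisfies alpha^t y_i > 0 for all i.
    labels: +1 for omega_1, -1 for omega_2."""
    return Y * np.asarray(labels, dtype=float)[:, None]
```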
Iterative Optimization • Define an error function J(α) (e.g., based on the training samples) that is minimized if α is a solution vector. • Minimize J(α) iteratively: α(k+1) = α(k) + η(k) p(k), where p(k) is the search direction and η(k) is the learning rate.
Gradient Descent Method • α(k+1) = α(k) - η(k) ∇J(α(k)), where η(k) is the learning rate.
Gradient Descent (cont’d) (Figure: descent trajectory in the solution space of α.)
Gradient Descent (cont’d) • What is the effect of the learning rate η? • Small η: slow, but it converges to the solution. • Large η: fast, but it may overshoot the solution.
Gradient Descent (cont’d) • How to choose the learning rate η(k)? • Using a second-order Taylor series approximation of J(α) around α(k), where H is the Hessian (matrix of 2nd derivatives), the optimum learning rate is η(k) = ||∇J||² / (∇Jt H ∇J). • If J(α) is quadratic, then H is constant, which implies that the learning rate is constant!
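A sketch of gradient descent with this "optimal" learning rate, applied to an assumed quadratic criterion (all names and the particular A, b are illustrative):

```python
import numpy as np

def gradient_descent(grad, hess, alpha0, n_iters=100, tol=1e-8):
    """Gradient descent with the learning rate
    eta = ||grad||^2 / (grad^t H grad) from the quadratic approximation."""
    alpha = alpha0.astype(float)
    for _ in range(n_iters):
        g = grad(alpha)
        if np.linalg.norm(g) < tol:
            break
        H = hess(alpha)
        eta = (g @ g) / (g @ H @ g)   # optimum step for a quadratic J
        alpha = alpha - eta * g
    return alpha

# Quadratic criterion J(a) = 0.5 a^t A a - b^t a (made-up A, b):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
alpha_star = gradient_descent(lambda a: A @ a - b, lambda a: A, np.zeros(2))
print(alpha_star)   # approaches the minimizer A^{-1} b = [0.2, 0.4]
```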
Newton’s Method • α(k+1) = α(k) - H⁻¹ ∇J(α(k)) • Requires inverting H.
Newton’s method (cont’d) If J(α) is quadratic, Newton’s method converges in one step!
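A sketch of a single Newton update on the same assumed quadratic criterion as above, showing the one-step convergence (illustrative names):

```python
import numpy as np

def newton_step(alpha, grad, hess):
    """One Newton update: alpha <- alpha - H^{-1} grad J(alpha)."""
    return alpha - np.linalg.solve(hess(alpha), grad(alpha))

# Same quadratic criterion as before: one step lands on A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_step(np.zeros(2), lambda a: A @ a - b, lambda a: A))  # [0.2, 0.4]
```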
Perceptron rule • Apply the gradient descent rule to the criterion Jp(α) = Σy∈Y(α) (-αty), where Y(α) is the set of samples misclassified by α (normalized version). • If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.
Perceptron rule (cont’d) • The gradient of Jp(α) is ∇Jp(α) = Σy∈Y(α) (-y). • The perceptron update rule is obtained using gradient descent: α(k+1) = α(k) + η(k) Σy∈Y(α(k)) y.
Perceptron rule (cont’d) • Batch perceptron algorithm: at each step, update α using all misclassified examples: α(k+1) = α(k) + η(k) Σy∈Yk y, where Yk = Y(α(k)) is the set of all misclassified examples.
Perceptron rule (cont’d) • Each update moves the hyperplane so that the training samples end up on its positive side. (Figures: Example 1 and Example 2 in parameter space (α1, α2).)
Perceptron rule (cont’d) • Fixed-increment, single-sample version: η(k) = 1, and α is updated using one misclassified example at a time (see the code sketch below). • Perceptron Convergence Theorem: if the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.
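A minimal sketch of the fixed-increment, single-sample perceptron (η(k) = 1), assuming the samples have already been augmented and normalized as described earlier; the function name and the toy data are mine:

```python
import numpy as np

def perceptron(Y, max_epochs=1000):
    """Fixed-increment single-sample perceptron (eta = 1).
    Y: (n, d+1) augmented, normalized samples (omega_2 samples negated).
    Returns alpha with alpha^t y > 0 for all rows if the data are separable."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                    # one example at a time
            if alpha @ y <= 0:         # misclassified
                alpha = alpha + y      # move the hyperplane toward y
                errors += 1
        if errors == 0:                # all samples on the positive side
            break
    return alpha

# Two 1D classes, omega_1 = {2, 3}, omega_2 = {-1, -2}, augmented and normalized:
Y = np.array([[1, 2], [1, 3], [-1, 1], [-1, 2]], dtype=float)
print(perceptron(Y))
```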
Perceptron rule (cont’d) (Example trace: the misclassified examples are presented in the order y2, y3, y1, y3.)
Perceptron rule (cont’d) • Generalizations: a variable learning rate η(k) and a margin b (i.e., update whenever αtyk does not exceed the margin).
Relaxation Procedures • Note that different criterion functions exist. • One possible choice is Jq(α) = Σy∈Y(α) (αty)², where Y(α) is again the set of training samples misclassified by α. • However, there are two problems with this criterion: • The function is too smooth and can converge to the degenerate solution α = 0. • Jq is dominated by the training samples with the largest magnitude.
Relaxation Procedures (cont’d) • A modified version that avoids the above two problems is Jr(α) = (1/2) Σy∈Y(α) (αty - b)² / ||y||². • Here Y(α) is the set of samples for which αty ≤ b; its gradient is given by ∇Jr(α) = Σy∈Y(α) ((αty - b) / ||y||²) y.
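A sketch of one batch relaxation step derived from this criterion and gradient (the function name, margin b, and relaxation factor η are illustrative assumptions):

```python
import numpy as np

def relaxation_update(alpha, Y, b=1.0, eta=1.0):
    """One batch relaxation step for J_r(alpha):
    for every sample with alpha^t y <= b, move alpha by
    eta * (b - alpha^t y) / ||y||^2 * y (i.e., minus eta times the gradient term)."""
    step = np.zeros_like(alpha, dtype=float)
    for y in Y:
        margin = alpha @ y
        if margin <= b:                       # sample violates the margin
            step += (b - margin) / (y @ y) * y
    return alpha + eta * step
```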