Lecture 4: Linear machine. Outline: linear discriminant functions; generalized linear discriminant functions; Fisher's linear discriminant; the perceptron; the optimal separating hyperplane.
Linear discriminant functions. The two-class linear discriminant is g(x) = w^T x + w_0. The signed distance from x to the hyperplane g(x) = 0 is r = g(x)/||w||, so if w is a unit vector, r = g(x). Decide the class by the sign of g(x).
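A minimal numpy sketch of this rule; the vectors w, w0, and x below are made-up values for illustration only.

```python
# Evaluate a linear discriminant g(x) = w^T x + w0 and the signed distance
# r = g(x)/||w|| to the hyperplane g(x) = 0; decide the class by the sign.
import numpy as np

w = np.array([3.0, 4.0])   # weight vector (not necessarily unit length)
w0 = -5.0                  # bias / threshold
x = np.array([2.0, 1.0])   # a sample point

g = w @ x + w0             # discriminant value
r = g / np.linalg.norm(w)  # signed distance from x to the hyperplane
label = 1 if g > 0 else 2  # class decided by the sign of g(x)
print(g, r, label)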
Linear discriminant functions. From the discriminant-function point of view, the decision surface is the set {x : g(x) = 0}. If x_1 and x_2 are both on the decision surface, then w^T x_1 + w_0 = w^T x_2 + w_0 = 0, so w^T(x_1 - x_2) = 0: w is normal to any vector lying in the hyperplane.
Linear discriminant functions. More than two classes (c classes in total). Dichotomize (one class versus the rest)? c linear discriminants. Pairwise (one class versus another)? c(c-1)/2 linear discriminants. Either way, some regions of the space may be left ambiguously classified.
Linear discriminant functions. Remember what we did in the Bayes decision class? Define c linear discriminant functions g_i(x) = w_i^T x + w_{i0}, i = 1, ..., c. The overall classifier maximizes the discriminant at every x: assign x to class i if g_i(x) > g_j(x) for all j ≠ i. The resulting classifier is a linear machine. The space is divided into c regions, and the boundary between neighboring regions R_i and R_j is linear, because there g_i(x) = g_j(x), i.e. (w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0.
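A small sketch of such a linear machine for c = 3 classes; the parameters W and w0 are made up for illustration.

```python
# Linear machine: g_i(x) = W[i] @ x + w0[i]; assign x to the class whose
# discriminant is largest.
import numpy as np

W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])      # one weight vector per class (c = 3)
w0 = np.array([0.0, 0.5, 1.0])    # one bias per class

def linear_machine(x):
    g = W @ x + w0                # the c discriminant values
    return np.argmax(g)           # index of the maximum g_i(x)

print(linear_machine(np.array([2.0, -1.0])))
```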
Generalized linear discriminant functions. When we transform x into a feature vector y(x) and use a discriminant g(x) = a^T y(x) that is linear in y, the result can be a non-linear separation in the original feature space.
Generalized linear discriminant functions. In the two-class case, g(x) = g_1(x) - g_2(x). Example: with y(x) = (1, x, x^2)^T and a = (-3, 2, 5)^T, g(x) = -3 + 2x + 5x^2. Then g(x) = 0 when x = 3/5 or x = -1; g(x) > 0 when x > 3/5 or x < -1, decide R_1; g(x) < 0 when -1 < x < 3/5, decide R_2.
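A quick numeric check of this example in numpy: the roots of g and the sign pattern that defines R_1 and R_2.

```python
# Verify the quadratic example g(x) = -3 + 2x + 5x^2 from the slide.
import numpy as np

a = np.array([-3.0, 2.0, 5.0])          # coefficients of 1, x, x^2
g = lambda x: a[0] + a[1]*x + a[2]*x**2

print(np.roots([5.0, 2.0, -3.0]))       # roots: x = -1 and x = 3/5
for x in (-2.0, 0.0, 1.0):              # one test point in each region
    print(x, "R1" if g(x) > 0 else "R2")
```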
Fisher linear discriminant. The goal: project the data from d dimensions onto a line, y = w^T x, and find the direction w that maximizes the class separation after projection. The magnitude of w is irrelevant, as it only scales y; the direction of w is what matters. The projected class means are w^T m_1 and w^T m_2, where m_i is the d-dimensional sample mean of class i.
Fisher linear discriminant. The distance between the projected means is |w^T(m_1 - m_2)|. Our goal is to make this distance large relative to a measure of the variation within each class. Define the scatter of the projected class-i samples, s_i^2 = Σ_{x in class i} (w^T x - w^T m_i)^2, so that (s_1^2 + s_2^2)/n is an estimate of the pooled variance. The Fisher linear discriminant maximizes J(w) = |w^T(m_1 - m_2)|^2 / (s_1^2 + s_2^2) over all w.
Fisher linear discriminant. Let S_i = Σ_{x in class i} (x - m_i)(x - m_i)^T; note this is the sample version of the class covariance matrix (up to the factor 1/n_i). Let S_W = S_1 + S_2, the within-class scatter matrix; then s_1^2 + s_2^2 = w^T S_W w. Let S_B = (m_1 - m_2)(m_1 - m_2)^T, the between-class scatter matrix; then |w^T(m_1 - m_2)|^2 = w^T S_B w. Hence J(w) = (w^T S_B w)/(w^T S_W w).
Fisher linear discriminant. Maximizing J(w) leads to the generalized eigenvalue problem S_B w = λ S_W w. Because S_B w is always in the direction of m_1 - m_2 for any w, the solution is w = S_W^{-1}(m_1 - m_2), up to scale. Notice this is the same result obtained from the Bayes decision rule when the two class densities are normal with equal covariance matrices.
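A minimal numpy sketch of the two-class Fisher direction w ∝ S_W^{-1}(m_1 - m_2); the sample matrices X1 and X2 are made-up Gaussian data for illustration.

```python
# Compute the Fisher direction from two classes of samples (rows of X1, X2).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter: sum of the two per-class scatter matrices
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(Sw, m1 - m2)        # Fisher direction (up to scale)

y1, y2 = X1 @ w, X2 @ w                 # 1-D projections of the two classes
print(w, y1.mean(), y2.mean())
```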
Multiple discriminant analysis. Now there are c classes. The goal is to project onto a (c-1)-dimensional space and maximize the between-class scatter relative to the within-class scatter. Why c-1? Because we need c-1 discriminant functions. Within-class scatter: S_W = Σ_{i=1}^{c} S_i, with S_i = Σ_{x in class i} (x - m_i)(x - m_i)^T. Total mean: m = (1/n) Σ_x x = (1/n) Σ_i n_i m_i.
Multiple discriminant analysis. Between-class scatter: S_B = Σ_{i=1}^{c} n_i (m_i - m)(m_i - m)^T. Total scatter: S_T = Σ_x (x - m)(x - m)^T = S_W + S_B. Take a d×(c-1) projection matrix W; the projected samples are y = W^T x.
Multiple discriminant analysis. The goal is to maximize J(W) = |W^T S_B W| / |W^T S_W W|, the ratio of determinants of the projected scatter matrices. The solution: every column of W is among the first c-1 generalized eigenvectors of S_B w_i = λ_i S_W w_i. Since the projected scatter is not class-specific, this is more like a dimension-reduction procedure that captures as much class information as possible.
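A sketch of this procedure as code, assuming X is an (n, d) sample matrix and labels is an array of class indices 0..c-1 (both placeholders): it builds S_W and S_B and keeps the top c-1 generalized eigenvectors.

```python
# Multiple discriminant analysis: solve S_B w = lambda S_W w and keep the
# c-1 directions with the largest eigenvalues.
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, c):
    d = X.shape[1]
    m = X.mean(axis=0)                       # total mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for i in range(c):
        Xi = X[labels == i]
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)        # within-class scatter
        Sb += len(Xi) * np.outer(mi - m, mi - m)  # between-class scatter
    vals, vecs = eigh(Sb, Sw)                # generalized eigenproblem, ascending
    W = vecs[:, ::-1][:, :c - 1]             # top c-1 generalized eigenvectors
    return W                                 # project the data with X @ W
```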
Multiple discriminant analysis. Example: eleven classes, projected onto the first two eigenvectors (figure).
Multiple discriminant analysis. As the eigenvector rank increases, the separability provided by the corresponding projections decreases.
Separating hyperplane. Let's do some data augmentation to make things easier. If we have a decision boundary g(x) = w^T x + w_0 = 0 between two classes, let y = (1, x^T)^T and a = (w_0, w^T)^T. Then g(x) = a^T y. What's the benefit? The hyperplane a^T y = 0 always goes through the origin of the augmented space.
Linearly separable case. Now we want to use the training samples to find a weight vector a that classifies all samples correctly; if such an a exists, the samples are linearly separable. We need a^T y_i > 0 for every y_i in class 1 and a^T y_i < 0 for every y_i in class 2. If every y_i in class 2 is replaced by its negative, then we are trying to find a such that a^T y_i > 0 for every sample. Such an a is a "separating vector" or "solution vector". Each equation a^T y_i = 0 defines a hyperplane through the origin of weight space with y_i as a normal vector; the solution lies on the positive side of every such hyperplane, i.e. in the intersection of n half-spaces. (A small sketch of the augmentation and sign normalization follows.)
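A small sketch of the augmentation from the previous slide together with the sign normalization described here; X1, X2, and the candidate vector a are made-up values for illustration.

```python
# Append a 1 to each sample so the hyperplane passes through the origin of
# a-space, then flip the sign of class-2 samples so the single condition
# a^T y_i > 0 covers every sample.
import numpy as np

X1 = np.array([[1.0, 2.0], [2.0, 1.5]])          # class 1 samples
X2 = np.array([[-1.0, -0.5], [-2.0, 0.0]])       # class 2 samples

Y1 = np.hstack([np.ones((len(X1), 1)), X1])      # augmented class-1 samples
Y2 = np.hstack([np.ones((len(X2), 1)), X2])      # augmented class-2 samples
Y = np.vstack([Y1, -Y2])                         # negate class 2 ("normalize")

a = np.array([0.0, 1.0, 0.5])                    # a candidate weight vector
print(np.all(Y @ a > 0))                         # True iff a is a solution vector
```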
Linearly separable case. Every vector in the grey region is a solution vector; the region is called the "solution region". A vector in the middle of the region is intuitively better, and we can impose additional conditions to select it.
Linearly separable case Maximize the minimum distance from the samples to the plane
Gradient descent procedure. How to find a solution vector? A general approach: define a criterion function J(a) that is minimized when a is a solution vector. Start with an arbitrary vector a(1). Compute the gradient ∇J(a(k)). Move from a(k) in the direction of the negative gradient to obtain a(k+1) = a(k) - η(k) ∇J(a(k)). Iterate; stop when the gain is smaller than a threshold.
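A generic gradient-descent skeleton matching these steps; J, grad_J, the starting point a0, the learning rate eta, and the stopping threshold are all placeholders to be supplied by the caller.

```python
# Minimize a criterion J(a) by repeatedly stepping against its gradient.
import numpy as np

def gradient_descent(J, grad_J, a0, eta=0.1, threshold=1e-6, max_iter=1000):
    a = np.asarray(a0, dtype=float)
    for k in range(max_iter):
        step = eta * grad_J(a)                  # direction of steepest ascent, scaled
        a = a - step                            # move against the gradient
        if np.linalg.norm(step) < threshold:    # gain below threshold: stop
            break
    return a
```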
Perceptron. Let Y(a) be the set of samples misclassified by a. Define the perceptron criterion J_p(a) = Σ_{y in Y(a)} (-a^T y), with J_p(a) = 0 when Y(a) is empty. Because a^T y < 0 when y is misclassified, J_p(a) is non-negative. The gradient is simple: ∇J_p(a) = Σ_{y in Y(a)} (-y). The update rule is a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y, where η(k) is the learning rate.
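A minimal sketch of this batch perceptron rule, assuming Y holds the augmented, sign-normalized samples (as built in the earlier sketch).

```python
# Batch perceptron: every row y_i of Y should end up with a^T y_i > 0.
import numpy as np

def perceptron(Y, eta=1.0, max_iter=1000):
    a = np.zeros(Y.shape[1])
    for k in range(max_iter):
        mis = Y[Y @ a <= 0]              # Y(a): currently misclassified samples
        if len(mis) == 0:                # Y(a) empty: a is a separating vector
            return a
        a = a + eta * mis.sum(axis=0)    # a(k+1) = a(k) + eta * sum of misclassified y
    return a                             # may not converge if not linearly separable
```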
Perceptron. The perceptron adjusts a only according to misclassified samples; correctly classified samples are ignored. The final a is a linear combination of the training points. Good test-sample performance requires a large set of training samples, yet it is almost certain that a large training set is not linearly separable. In the linearly non-separable case the iteration does not stop. We can let η(k) → 0 as k → ∞, but how should the rate of decrease be chosen?
Optimal separating hyperplane. The perceptron finds one separating plane out of infinitely many possibilities. How do we find the best among them? The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class. • Unique solution • Better test-sample performance
Optimal separating hyperplane. Notation change: here y_i denotes the class label (+1 or -1) of sample i, and x_i the augmented sample. The problem is min ||a||^2 subject to y_i a^T x_i ≥ 1, i = 1, ..., N. We shall visit the support vector machine next time.
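A sketch of this optimization solved with a general-purpose constrained optimizer, just to illustrate the formulation on the slide; the data are made up, and a dedicated QP/SVM solver (next lecture) is the proper tool.

```python
# min ||a||^2  subject to  y_i * a^T x_i >= 1, with x_i augmented by a
# leading 1 and y_i in {+1, -1}.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -0.5], [-2.0, 0.0]])  # samples
y = np.array([1, 1, -1, -1])                                        # class labels
Xa = np.hstack([np.ones((len(X), 1)), X])                           # augmented samples

res = minimize(
    fun=lambda a: a @ a,                                 # minimize ||a||^2
    x0=np.zeros(Xa.shape[1]),
    constraints={"type": "ineq",
                 "fun": lambda a: y * (Xa @ a) - 1.0},   # y_i a^T x_i - 1 >= 0
    method="SLSQP",
)
print(res.x)   # the maximum-margin (augmented) weight vector
```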