Linear Discriminant Functions Speaker: 虞台文
Contents • Introduction • Linear Discriminant Functions and Decision Surface • Linear Separability • Learning • Gradient Descent Algorithm • Newton’s Method
Linear Discriminant Functions Introduction
Decision-Making Approaches • Probabilistic Approaches • Based on the underlying probability densities of the training sets. • For example, the Bayesian decision rule assumes that the underlying probability densities are available. • Discriminating Approaches • Assume that the proper forms of the discriminant functions are known. • Use the samples to estimate the values of the parameters of the classifier.
Linear Discriminant Functions • Easy to compute, analyze, and learn. • Linear classifiers are attractive candidates for initial, trial classifiers. • Learning is done by minimizing a criterion function, e.g., the training error. • Difficulty: a small training error does not guarantee a small test error.
Linear Discriminant Functions Linear Discriminant Functions and Decision Surface
Linear Discriminant Functions The two-category classification: g(x) = w^T x + w0, where w is the weight vector and w0 is the bias or threshold. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0.
Implementation [figure: a linear unit with bias input x0 = 1, inputs x1, x2, …, xd, and weights w0, w1, w2, …, wd computing g(x) = w^T x + w0.]
Decision Surface [figure: the hyperplane g(x) = 0 in feature space, with normal vector w; the distance from the origin to the surface is w0/||w||.]
Decision Surface • 1. A linear discriminant function divides the feature space by a hyperplane. • 2. The orientation of the surface is determined by the normal vector w. • 3. The location of the surface is determined by the bias w0.
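To make the geometry above concrete, here is a minimal Python sketch (not part of the original slides; the weight values are invented for illustration) of a two-category linear discriminant and the signed distance from a point to its decision surface.

```python
import numpy as np

# Hypothetical weights, for illustration only.
w = np.array([2.0, -1.0])   # weight vector: normal to the decision hyperplane
w0 = 0.5                    # bias / threshold: shifts the hyperplane away from the origin

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Decide class 1 if g(x) > 0, class 2 if g(x) < 0."""
    return 1 if g(x) > 0 else 2

x = np.array([1.0, 3.0])
print(classify(x))                    # predicted label
print(g(x) / np.linalg.norm(w))       # signed distance from x to the surface g(x) = 0
```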
Augmented Space In the d-dimensional feature space (d = 2 here), let the decision surface be g(x) = w1x1 + w2x2 + w0 = 0. Augmented feature vector: y = (1, x1, x2, …, xd)^T. Augmented weight vector: a = (w0, w1, w2, …, wd)^T. In the (d+1)-dimensional augmented space, the decision surface a^T y = 0 passes through the origin of the augmented space.
Augmented Space • Decision surface in feature space: g(x) = w^T x + w0 = 0 passes through the origin only when w0 = 0. • Decision surface in augmented space: a^T y = 0 always passes through the origin. • By using this mapping, the problem of finding the weight vector w and the threshold w0 is reduced to finding a single vector a.
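A minimal sketch of the augmentation step (the function and variable names are mine, not from the slides): prepend a 1 to the feature vector and fold the threshold into a single weight vector a.

```python
import numpy as np

def augment(x):
    """Map x = (x1, ..., xd) to the augmented vector y = (1, x1, ..., xd)."""
    return np.concatenate(([1.0], x))

# Illustrative weights only.
w, w0 = np.array([2.0, -1.0]), 0.5
a = np.concatenate(([w0], w))         # augmented weight vector a = (w0, w1, ..., wd)

x = np.array([1.0, 3.0])
y = augment(x)
# a^T y equals w^T x + w0, so the surface a^T y = 0 passes through the origin
# of the augmented space.
print(a @ y, w @ x + w0)
```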
Linear Discriminant Functions Linear Separability
The Two-Category Case [figure: two scatter plots of samples from classes ω1 and ω2, one not linearly separable and one linearly separable.]
The Two-Category Case Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, if there exists a vector a such that a^T yi > 0 if yi is labeled ω1 and a^T yi < 0 if yi is labeled ω2, then the samples are said to be linearly separable. How do we find such an a?
Normalization Given a set of samples y1, y2, …, yn, some labeled ω1 and some labeled ω2, drop the labels and replace the samples labeled ω2 by their negatives. Finding a vector a that satisfies the two labeled conditions above is then equivalent to finding a vector a such that a^T yi > 0 for all i. How do we find such an a?
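A small sketch of the normalization trick under the assumptions above (augmented samples, labels in {1, 2}); the data and names are illustrative only.

```python
import numpy as np

def normalize(samples, labels):
    """Negate the samples labeled 2, so that linear separability becomes:
    find a vector a with a^T y > 0 for every normalized sample y."""
    return np.array([y if lab == 1 else -y for y, lab in zip(samples, labels)])

samples = np.array([[1.0, 2.0, 0.5],
                    [1.0, -1.0, -0.5]])   # already augmented with a leading 1
labels = [1, 2]
ys = normalize(samples, labels)

a = np.array([0.1, 1.0, 0.3])             # a candidate weight vector (made up)
print(np.all(ys @ a > 0))                 # True iff a separates the normalized samples
```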
Solution Region in Feature Space (normalized case) [figure: the separating plane a^T y = 0 in feature space, with the normalized samples lying on its positive side.]
Solution Region in Weight Space [figure: in weight space, each sample constrains a to one side of the plane a^T yi = 0; the intersection of these half-spaces is the solution region, and requiring a margin shrinks the solution region.]
Linear Discriminant Functions Learning
Criterion Function • To facilitate learning, we usually define a scalar criterion function. • It usually represents the penalty or cost of a solution. • Our goal is to minimize its value, i.e., learning becomes a function-optimization problem.
Learning Algorithms • To design a learning algorithm, we face the following problems: • Whether to stop? (Is the criterion satisfactory?) • In what direction to proceed? (e.g., steepest descent) • How long a step to take? (the learning rate η)
Criterion Functions: The Two-Category Case • J(a) = the number of misclassified patterns. • [figure: J(a) is piecewise constant, so away from a solution state its gradient gives no hint of where to go.] • Is this criterion appropriate?
Criterion Functions: The Two-Category Case • Perceptron Criterion Function: Jp(a) = Σ over y in Y of (−a^T y), where Y is the set of misclassified patterns. • Is this criterion better? What problems does it have?
Criterion Functions: The Two-Category Case • A relative of the Perceptron Criterion Function, again summed over Y, the set of misclassified patterns. • Is this criterion much better? What problems does it have?
Criterion Functions: The Two-Category Case • Another relative of the Perceptron Criterion Function, again involving the set Y of misclassified patterns. What is the difference from the previous one? • Is this criterion good enough? Are there others? (The standard forms of these criteria are reconstructed below.)
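The formulas for these criteria appear only as images in the original slides. As a reconstruction, the standard forms of the four criteria in this progression (following Duda, Hart, and Stork, which the slide titles match) are:

```latex
% Reconstructed standard forms; not copied verbatim from the slides.
\begin{align*}
J(\mathbf{a})   &= \text{number of patterns misclassified by } \mathbf{a} \\
J_p(\mathbf{a}) &= \sum_{\mathbf{y}\in\mathcal{Y}} \bigl(-\mathbf{a}^{T}\mathbf{y}\bigr)
  && \text{(perceptron criterion; } \mathcal{Y} = \text{misclassified patterns)} \\
J_q(\mathbf{a}) &= \sum_{\mathbf{y}\in\mathcal{Y}} \bigl(\mathbf{a}^{T}\mathbf{y}\bigr)^{2}
  && \text{(squared relative; continuous gradient)} \\
J_r(\mathbf{a}) &= \frac{1}{2}\sum_{\mathbf{y}\in\mathcal{Y}}
   \frac{\bigl(\mathbf{a}^{T}\mathbf{y}-b\bigr)^{2}}{\lVert\mathbf{y}\rVert^{2}}
  && \text{(with margin } b\text{; here } \mathcal{Y}=\{\mathbf{y}:\mathbf{a}^{T}\mathbf{y}\le b\})
\end{align*}
```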
Learning Gradient Descent Algorithm
Gradient Descent Algorithm • Our goal is to go downhill. • [figure: surface plot of J(a) over (a1, a2) and the corresponding contour map.] • Define the gradient ∇aJ = (∂J/∂a1, ∂J/∂a2, …, ∂J/∂ad)^T; ∇aJ is a vector. • [figure: at the current point (a1, a2), the direction −∇aJ points downhill ("go this way").] • How long a step shall we take?
Gradient Descent Algorithm
η(k): learning rate
Initial setting: a, θ, k = 0
do
  k ← k + 1
  a ← a − η(k) ∇aJ(a)
until |∇aJ(a)| < θ
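The pseudocode above can be instantiated with the perceptron criterion from the earlier slides. The following Python sketch is illustrative (a fixed learning rate and a tiny made-up data set), not the lecturer's own implementation.

```python
import numpy as np

def grad_J_perceptron(a, ys):
    """Gradient of the perceptron criterion Jp(a) = sum(-a^T y) over the
    misclassified (normalized, augmented) samples; the gradient is sum(-y)."""
    mis = ys[ys @ a <= 0]                 # Y: the set of misclassified samples
    return -mis.sum(axis=0) if len(mis) else np.zeros_like(a)

def gradient_descent(ys, eta=0.1, theta=1e-6, max_iter=1000):
    """The loop from the slide: a <- a - eta(k) * grad J(a), until |grad J(a)| < theta."""
    a = np.zeros(ys.shape[1])
    for k in range(1, max_iter + 1):
        g = grad_J_perceptron(a, ys)
        if np.linalg.norm(g) < theta:     # stopping criterion
            break
        a = a - eta * g                   # fixed learning rate, for simplicity
    return a

# Tiny, linearly separable, already normalized (class-2 samples negated) data set.
ys = np.array([[ 1.0,  2.0,  1.0],
               [ 1.0,  1.5,  0.5],
               [-1.0,  0.5,  1.0],
               [-1.0,  1.0,  2.0]])
a = gradient_descent(ys)
print(a, np.all(ys @ a > 0))              # second value is True if a separates the samples
```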
Trajectory of Steepest Descent • If an improper learning rate η(k) is used, the convergence rate may be poor. • Too small: slow convergence. • Too large: overshooting, possibly leaving the basin of attraction of a*. • Furthermore, the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent. • [figure: contours of J(a) with a zig-zag trajectory a0, a1, a2, … approaching a*.]
Learning Rate • Paraboloid: Q is symmetric and positive definite. • Any smooth function can be approximated by a paraboloid in a sufficiently small neighborhood of any point, so we will discuss the convergence properties using paraboloids. • Global minimum x*: set the gradient of the paraboloid to zero and solve.
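The slides' own formulas are images; assuming the usual quadratic model for the paraboloid (a common choice in this kind of analysis), the global minimum works out as follows.

```latex
% Assumed quadratic model; the exact form on the original slide is not reproduced here.
\[
f(\mathbf{x}) = \tfrac{1}{2}\,\mathbf{x}^{T} Q\,\mathbf{x} - \mathbf{b}^{T}\mathbf{x},
\qquad Q = Q^{T} \succ 0,
\qquad
\nabla f(\mathbf{x}) = Q\mathbf{x} - \mathbf{b} = \mathbf{0}
\;\Longrightarrow\;
\mathbf{x}^{*} = Q^{-1}\mathbf{b}.
\]
```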
Learning Rate • Paraboloid: define an error function that measures the gap between the current point and the global minimum x* (a standard form is sketched below). • We want to minimize the criterion, and clearly this is equivalent to minimizing the error, which is zero exactly at x*.
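Under the same assumed quadratic model, a standard choice of error function (a reconstruction, not copied from the slide) is:

```latex
\[
E(\mathbf{x}) = \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}^{*})^{T} Q\,(\mathbf{x}-\mathbf{x}^{*})
              = f(\mathbf{x}) - f(\mathbf{x}^{*}) \;\ge\; 0,
\]
```

so minimizing f and minimizing E are the same problem, and E vanishes exactly at x*.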
Learning Rate • Suppose that we are at yk. Let pk be the steepest descent direction, i.e., the negative of the gradient at yk. • Let the learning rate be ηk; that is, yk+1 = yk + ηk pk.
Learning Rate • We choose the most 'favorable' ηk: setting the derivative of the criterion along the descent direction to zero, we get the optimal step size (derived below).
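Again under the assumed quadratic model, with pk = −∇f(yk) = b − Q yk, the exact line search gives the optimal step size (a reconstruction of the standard algebra, not the slide's own):

```latex
\[
\frac{d}{d\eta}\, f(\mathbf{y}_k + \eta\,\mathbf{p}_k)
 = \mathbf{p}_k^{T}\bigl(Q(\mathbf{y}_k + \eta\,\mathbf{p}_k) - \mathbf{b}\bigr)
 = -\,\mathbf{p}_k^{T}\mathbf{p}_k + \eta\,\mathbf{p}_k^{T} Q\,\mathbf{p}_k
 = 0
\;\Longrightarrow\;
\eta_k = \frac{\mathbf{p}_k^{T}\mathbf{p}_k}{\mathbf{p}_k^{T} Q\,\mathbf{p}_k}.
\]
```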
Learning Rate • If Q = I, the optimal step size is ηk = 1, and the minimum x* is reached in a single step.
Convergence Rate • [Kantorovich Inequality] Let Q be a positive definite, symmetric, n×n matrix. For any vector y ≠ 0 there holds (y^T y)^2 / ((y^T Q y)(y^T Q^{-1} y)) ≥ 4λ1λn / (λ1 + λn)^2, where λ1 ≤ λ2 ≤ … ≤ λn are the eigenvalues of Q. • The smaller the resulting convergence factor, the better (see the bound below).
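The standard consequence for steepest descent with exact line search on a quadratic (e.g., Luenberger's analysis; a reconstruction, since the slide's formula is an image) is the error bound:

```latex
\[
E(\mathbf{x}_{k+1}) \;\le\;
\left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\right)^{2} E(\mathbf{x}_k),
\]
```

so the error contracts geometrically, and the contraction factor is small (fast convergence) when λ1 ≈ λn.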
Convergence Rate • Condition number: κ = λn / λ1. • κ close to 1: faster convergence. • κ large: slower convergence.
Convergence Rate • Paraboloid: suppose we are now at xk. • Updating rule: xk+1 = xk + ηk pk, with the optimal step size ηk derived above.
Trajectory of Steepest Descent • In this case, the condition number of Q is moderately large. • One then sees that the best descent direction is not necessarily, and in fact is quite rarely, the direction of steepest descent.
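To see the effect numerically, here is a small Python sketch (my own illustration, with a made-up ill-conditioned Q) of exact-line-search steepest descent on the quadratic model used above.

```python
import numpy as np

def steepest_descent(Q, b, x0, iters=25):
    """Steepest descent with exact line search on f(x) = 1/2 x^T Q x - b^T x."""
    x = x0.astype(float)
    for _ in range(iters):
        p = b - Q @ x                     # negative gradient (descent direction)
        if np.allclose(p, 0):
            break
        eta = (p @ p) / (p @ Q @ p)       # optimal step size for a quadratic
        x = x + eta * p
    return x

b = np.array([0.0, 0.0])
x0 = np.array([1.0, 1.0])

Q_good = np.diag([1.0, 2.0])              # condition number 2: converges quickly
Q_bad  = np.diag([1.0, 50.0])             # condition number 50: slow, zig-zag trajectory

for Q in (Q_good, Q_bad):
    x = steepest_descent(Q, b, x0)
    print(np.linalg.norm(x - np.linalg.solve(Q, b)))   # distance to the true minimum x*
```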
Learning Newton’s Method