Classification: Linear Models Oliver Schulte Machine Learning 726
Linear Classification Models • General Idea: • Learn a continuous linear function y of the continuous features x. • Classify as positive if y crosses a threshold, typically 0. • As in linear regression, we can use more complicated features defined by basis functions ϕ.
Example: Classifying Digits • Classify an input vector as “4” vs. “not 4”. • Represent the input image as a vector x with 28 x 28 = 784 numbers. • Target t = 1 for “positive”, -1 for “negative”. • Given a training set (x1, t1, ..., xN, tN), the problem is to find a good linear function y(x). • y: R^784 → R. • Classify x as positive if y(x) > 0, negative otherwise.
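As an illustration, here is a minimal NumPy sketch (not from the original slides) of this setup: a 28x28 image is flattened into a 784-dimensional vector and scored by a linear function; the weights w and bias w0 are placeholders standing in for learned parameters.

```python
import numpy as np

# Hypothetical weights and bias; in practice these are learned from the training set.
rng = np.random.default_rng(0)
w = rng.normal(size=784)          # placeholder weight vector
w0 = 0.0                          # placeholder bias

image = rng.random((28, 28))      # stand-in for a 28x28 digit image
x = image.reshape(784)            # flatten to a feature vector in R^784

y = w @ x + w0                    # continuous linear output y(x)
label = 1 if y > 0 else -1        # classify as "4" (+1) vs. "not 4" (-1)
print(y, label)
```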
Other Examples • Will the person vote conservative, given age, income, and previous votes? • Is the patient at risk of diabetes, given body mass, age, and blood test measurements? • Predict earthquake vs. nuclear explosion, given body wave magnitude and surface wave magnitude. • [Diagram: feature nodes (age, income, votes; body wave magnitude, surface wave magnitude) pointing to target nodes (votes conservative; disaster type).]
Linear Separation • x1 = surface wave magnitude, x2 = body wave magnitude. • White = earthquake, black = nuclear explosion. • Figure: Russell and Norvig 18.15.
Linear Discriminants • Simple linear model: y(x) = wᵀx + w0. • Can drop the explicit w0 if we assume a fixed dummy bias input x0 = 1. • The decision surface is a line (hyperplane), orthogonal to w. • In 2-D, just try a line between the classes!
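A small sketch of how the explicit bias w0 can be dropped by adding the fixed dummy input x0 = 1; the helper name add_dummy and the numbers are made up for illustration.

```python
import numpy as np

def add_dummy(x):
    """Prepend the fixed dummy feature x0 = 1 to a feature vector."""
    return np.concatenate(([1.0], x))

w = np.array([0.5, -1.0, 2.0])        # hypothetical weights for 3 features
w0 = -0.25                            # hypothetical bias
w_tilde = np.concatenate(([w0], w))   # bias folded into the weight vector

x = np.array([1.2, 0.7, -0.3])
# y(x) = w^T x + w0 equals the dot product of the augmented vectors.
assert np.isclose(w @ x + w0, w_tilde @ add_dummy(x))
```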
Defining an Error Function • General idea: • Encode class label using a real number t. • e.g., “positive” = 1, “negative” = 0 or “negative” = -1. • Measure error by comparing continuous linear output y and class label code t.
The Error Function for Linear Discriminants • Could use squared error, as in linear regression. • This has various problems (see book), basically because the codes 1 and -1 are not real-valued targets. • A different criterion was developed for learning perceptrons. • Perceptrons are a precursor to neural nets. • Analog implementation by Rosenblatt in the 1950s; see Bishop Figure 4.8.
The Perceptron Criterion • An example (xn, tn) is misclassified if tn wᵀϕ(xn) ≤ 0. • (Take a moment to verify this.) • Perceptron error: EP(w) = −Σn∈M tn wᵀϕ(xn), where M is the set of misclassified inputs, the mistakes. • Exercise: find the gradient of the error function with respect to w for a single input xn.
Perceptron Learning Algorithm • Use stochastic gradient descent: gradient descent for one example at a time, cycling through the training set. • Update equation: w ← w + η ϕ(xn) tn, applied whenever xn is misclassified, where we set η = 1 (without loss of generality in this case). • Excel Demo.
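For concreteness, a minimal sketch of the algorithm (assuming targets in {-1, +1} and a dummy bias feature already included in the rows of X); the toy data at the bottom is made up.

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Cycle through examples; update w <- w + eta * t_n * x_n on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:      # misclassified (or on the boundary)
                w += eta * t_n * x_n      # stochastic gradient step on E_P
                mistakes += 1
        if mistakes == 0:                 # all examples separated: stop
            break
    return w

# Tiny linearly separable example; first column is the dummy bias feature.
X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.5]])
t = np.array([1, 1, -1, -1])
print(perceptron(X, t))
```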
Perceptron Learning Analysis • Theorem If the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them. • Convergence can be slow. • Sensitive to initialization.
Nonseparability • Linear discriminants can solve a problem only if the classes can be separated by a line (hyperplane). • The canonical example of a non-separable problem is XOR. • On non-separable data, the perceptron algorithm typically does not converge.
Nonseparability: Real-World Example • x1 = surface wave magnitude, x2 = body wave magnitude. • White = earthquake, black = nuclear explosion. • Figure: Russell and Norvig 18.15(b).
Responses to Nonseparability • Problem: the classes cannot be separated by a linear discriminant. • Option 1: use a non-linear activation function. • Option 2: find an approximate solution that separates the classes not completely but “well”. • Option 3: add hidden features. • These ideas lead to: Fisher discriminant (not covered), logistic regression, support vector machine, neural network.
From Values to Probabilities • Key idea: instead of predicting a class label, predict the probability of a class label. • E.g., p+ = P(class is positive | features), p− = P(class is negative | features). • Naturally a continuous quantity. • How do we turn a real number y into a probability p+?
The Logistic Sigmoid Function • Definition: σ(y) = 1 / (1 + e^(−y)). • Squeezes the real line into (0, 1). • Differentiable, with dσ/dy = σ(y)(1 − σ(y)) (nice exercise).
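A minimal sketch of the sigmoid and its derivative (for very large |y|, a library routine such as scipy.special.expit can be used instead to avoid overflow warnings).

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid: squeezes the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_grad(y):
    """Derivative of the sigmoid: sigma'(y) = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # values in (0, 1)
print(sigmoid_grad(0.0))                     # 0.25, the maximum slope
```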
Soft Threshold Interpretation • If y > 0, σ(y) goes to 1 very quickly as y grows. • If y < 0, σ(y) goes to 0 very quickly as y decreases. • Figure: Russell and Norvig 18.17.
Probabilistic Interpretation • The sigmoid can be interpreted in terms of the class odds p+ / (1 − p+). • Exercise: show that p+ = σ(y) implies p+ / (1 − p+) = e^y. • Therefore y = ln(p+ / (1 − p+)), the log class odds.
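For reference, a short worked version of this step (a standard derivation, assuming p+ = σ(y)):

```latex
\begin{align*}
  \frac{p_+}{1 - p_+}
    &= \frac{\sigma(y)}{1 - \sigma(y)}
     = \frac{1/(1 + e^{-y})}{e^{-y}/(1 + e^{-y})}
     = e^{y}, \\
  \ln \frac{p_+}{1 - p_+} &= y .
\end{align*}
```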
Logistic Regression • In logistic regression, the log class odds are a linear function of the input features: ln(p+ / (1 − p+)) = wᵀx. • Recall that we got the same kind of expression for the naive Bayes classifier. • Learning logistic regression is conceptually similar to linear regression.
Logistic Regression: Maximum Likelihood • Notation: the probability that the n-th input example is positive is yn = σ(wᵀϕ(xn)), which depends on the weight vector w. • A positive example has tn = 1, a negative one tn = 0. • The likelihood assigned to N independent training examples is p(t | w) = Πn yn^tn (1 − yn)^(1−tn). • The cross-entropy error is the negative log-likelihood: E(w) = −Σn { tn ln yn + (1 − tn) ln(1 − yn) }. • Minimizing it is equivalent to minimizing the KL divergence between the predicted class probabilities and the observed class frequencies.
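A small NumPy sketch of the cross-entropy error under these definitions; the design matrix Phi, targets t, and weights w below are made-up placeholders.

```python
import numpy as np

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], with y_n = sigma(w . phi_n)."""
    y = 1.0 / (1.0 + np.exp(-Phi @ w))   # predicted P(positive) for each example
    y = np.clip(y, eps, 1.0 - eps)       # guard against log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

Phi = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # design matrix, bias column first
t = np.array([1.0, 0.0, 1.0])                           # class labels in {0, 1}
w = np.zeros(2)
print(cross_entropy(w, Phi, t))                         # 3 * ln 2 when w = 0
```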
Gradient Search • Exercise (on assignment): using the cross-entropy error, show that ∇E(w) = Σn (yn − tn) ϕ(xn). • Hint: recall that dσ/dy = σ(1 − σ). • No closed-form minimum, since yn is a non-linear function of the input features. • Can use gradient descent. • Better approach: use Iterative Reweighted Least Squares (IRLS). See assignment.
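As a rough illustration of the plain gradient-descent option (not the IRLS approach used in the assignment), here is a batch update based on the gradient Φᵀ(y − t), on made-up toy data.

```python
import numpy as np

def fit_logistic(Phi, t, eta=0.1, n_iters=1000):
    """Batch gradient descent on the cross-entropy error."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # predicted probabilities
        w -= eta * Phi.T @ (y - t)           # step along the negative gradient
    return w

Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(Phi, t))
```

IRLS replaces the fixed step size with a Newton step, which typically converges in far fewer iterations.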
Example • Logistic regression model learned on non-separable data. • Figure: Russell and Norvig 18.17.
Logistic Regression With Basis Functions Figure Bishop 4.12
Multi-Class Example • Logistic regression can be extended to multiple classes. • Here’s a picture of what decision boundaries can look like.
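One common way to make this extension concrete (a sketch, not necessarily the exact formulation behind the figure) is to keep one weight vector per class and take a softmax over the K linear scores; the weight matrix W below is a made-up placeholder.

```python
import numpy as np

def softmax_predict(W, x):
    """Return the predicted class and class probabilities for a K x D weight matrix W."""
    scores = W @ x                  # one linear score per class
    scores = scores - scores.max()  # shift for numerical stability
    p = np.exp(scores)
    p = p / p.sum()                 # softmax: probabilities over the K classes
    return int(np.argmax(p)), p

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # hypothetical weights, 3 classes
print(softmax_predict(W, np.array([0.5, 2.0])))
```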