Discriminative Classifiers
• Model the classification / decision surface directly (rather than modeling class membership and deriving the decision)
• LTU (linear threshold unit, also "perceptron")
• LMS (least mean square) algorithm
• Fisher discriminant
• SVMs
• And now: Logistic Regression (http://www.cs.cmu.edu/%7Etom/NewChapters.html)
Logistic Regression
• Assume binary classification Y with Pr(Y | X) monotonic in the features X
• Is the subject likely to:
  • Suffer a heart attack within 1 yr? Given: number of previous heart attacks
  • Be over six feet tall? Given: gender and heights of parents
  • Receive an A in CS446? Given: grade in CS273
• Multivariate and ordinal responses are also possible
• Characterize whether Y=0 or Y=1 is more likely given X
• The odds ratio characterizes a possible decision surface
• Assign Y=0 if P(Y=0 | X) / P(Y=1 | X) > 1; assign Y=1 otherwise
Logistic Regression
• The relative class proportion changes with X
• For one-dimensional X, Y might look like: [figure: binary labels Y plotted against X]
• How do we model the decision surface?
Logit Function
• Odds ratio: P(Y=0 | X) / P(Y=1 | X)
• Model the log of the odds ratio as a linear function of the features
• P(Y=1 | X) = 1 – P(Y=0 | X); let P denote P(Y=0 | X)
• ln(odds) = ln(P / (1 – P)) = logit(P)
• Assuming the logit is linear: ln(P / (1 – P)) = w0 + w1x1 + w2x2 + … + wnxn
• Exponentiate, multiply by (1 – P), and collect like terms: P = exp(w0 + Σi wixi) / (1 + exp(w0 + Σi wixi)) and (1 – P) = 1 / (1 + exp(w0 + Σi wixi))
• Not quite standard (usually the exponential form goes with Y=1 rather than Y=0) – remember for later…
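Spelling out that last algebra step as one display (a restatement of the slide's derivation; P is still shorthand for P(Y=0 | X)):

```latex
\ln\frac{P}{1-P} = w_0 + \sum_{i=1}^{n} w_i x_i
\;\Longrightarrow\;
\frac{P}{1-P} = e^{\,w_0 + \sum_i w_i x_i}
\;\Longrightarrow\;
P = \frac{e^{\,w_0 + \sum_i w_i x_i}}{1 + e^{\,w_0 + \sum_i w_i x_i}},
\qquad
1-P = \frac{1}{1 + e^{\,w_0 + \sum_i w_i x_i}}
```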
Form of the Probabilities
• Thus, we impute a form to P(Y=0 | X) and P(Y=1 | X)
• Consider P(Y=0 | X) = exp(w0 + Σi wixi) / (1 + exp(w0 + Σi wixi)):
  • At one extreme the exponent approaches –∞, and P(Y=0 | X) approaches 0
  • At the other it approaches +∞, and P(Y=0 | X) approaches 1
• P(Y=0 | X) transitions in the middle, as does P(Y=1 | X), which is just 1 – P(Y=0 | X)
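A minimal sketch of these probabilities in code, keeping the slides' convention that the exponential form goes with Y=0; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def class_probabilities(x, w0, w):
    """Return (P(Y=0 | x), P(Y=1 | x)) for the linear-logit model above."""
    z = w0 + np.dot(w, x)                  # exponent w0 + sum_i w_i x_i
    p_y0 = np.exp(z) / (1.0 + np.exp(z))   # -> 0 as z -> -inf, -> 1 as z -> +inf
    return p_y0, 1.0 - p_y0                # P(Y=1 | x) = 1 - P(Y=0 | x)
```

For example, with w0 = -5, w = [15] and x = [0.5] (the weights used on the next slide), this gives P(Y=0 | x) ≈ 0.92.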
Class Membership Probability Functions
• We can graph P(Y=0 | X) and P(Y=1 | X)
• [Figure: P(Y=0 | X) and P(Y=1 | X) as functions of a one-dimensional X, with w0 = –5, w1 = 15]
• Classification boundary: the odds ratio is 1 there, so the logit is 0 (ln(1) = 0)
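With those weights the boundary falls where the logit is zero, w0 + w1·x = 0, i.e. x = 5/15 ≈ 0.33. A small plotting sketch that reproduces the figure (matplotlib assumed; purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

w0, w1 = -5.0, 15.0
x = np.linspace(0.0, 1.0, 200)
z = w0 + w1 * x
p_y0 = np.exp(z) / (1.0 + np.exp(z))      # P(Y=0 | x), exp form per the slides
p_y1 = 1.0 - p_y0                          # P(Y=1 | x)

plt.plot(x, p_y0, label="P(Y=0 | X)")
plt.plot(x, p_y1, label="P(Y=1 | X)")
plt.axvline(-w0 / w1, linestyle="--", label="boundary: logit = 0")  # x = 1/3
plt.legend()
plt.show()
```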
Modeling Control / Flexibility
• The w's determine the behavior of the classifier
• wi, i = 1, …, n, independently control the steepness along each feature
• w0 repositions the classification transition
• Choose the best w's for the training data
What are the Best w's for the Training Data?
• The training data are assumed to be independent
• So we want W = argmaxW Πl P(Yl | Xl, W), where Yl is the class and Xl are the features of the l'th training example
• Equivalently (to expose the underlying linearity of example independence): W = argmaxW Σl ln P(Yl | Xl, W)
• Thus, we want the maximum likelihood estimate of W for the training data
• Now Mitchell changes representation; so will we:
Training
• Note that what we maximize is invariant under the representation change
• Consider the sum l(W) = Σl [ Yl ln P(Yl = 1 | Xl, W) + (1 – Yl) ln P(Yl = 0 | Xl, W) ]
• We wish to maximize this sum over W
• There is no closed-form solution, but we can iterate using the gradient
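A sketch of that sum in code. One caveat on orientation: this uses the common form P(Y=1 | x) = 1/(1 + exp(−z)); under the slides' reversed convention you would negate z. Names are illustrative:

```python
import numpy as np

def conditional_log_likelihood(X, Y, w0, w):
    """Sum over examples l of ln P(Y_l | X_l, W), with labels Y_l in {0, 1}."""
    z = w0 + X @ w                        # linear score per example
    p1 = 1.0 / (1.0 + np.exp(-z))         # P(Y=1 | x) under the standard orientation
    return float(np.sum(Y * np.log(p1) + (1 - Y) * np.log(1.0 - p1)))
```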
Weight Update Rule
• Want to maximize l(W)
• Let P̂l be the probability of Y=1 given Xl for the current W, so P̂l = P(Yl = 1 | Xl, W)
• Also, to treat w0 consistently, introduce X0 ≡ 1
• Then the gradient components can be written ∂l(W)/∂wi = Σl Xil (Yl – P̂l)
Gradient Interpretation
• We can view (Yl – P̂l) as the prediction error
• This is multiplied by the feature value Xil (this should look familiar)
• Weight update rule: wi ← wi + η Σl Xil (Yl – P̂l), where η is a step size / learning rate (this should also look familiar)
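A sketch of one batch ascent step built from that error-times-feature gradient. It assumes the standard orientation in which P(Y=1 | x) rises with the linear score, which is what makes (Yl – P̂l) an ascent direction; the function name and the default η are illustrative:

```python
import numpy as np

def gradient_ascent_step(X, Y, w0, w, eta=0.1):
    """One batch gradient ascent step on the conditional log likelihood.

    X: (m, n) array of feature vectors; Y: length-m array of labels in {0, 1}.
    Uses P(Y=1 | x) = 1 / (1 + exp(-(w0 + w.x))), so (Y - P) is the prediction error.
    """
    z = w0 + X @ w                        # linear score per example
    p1 = 1.0 / (1.0 + np.exp(-z))         # predicted P(Y=1 | x)
    error = Y - p1                        # prediction error, as on the slide
    w0_new = w0 + eta * np.sum(error)     # intercept term (X0 = 1 for every example)
    w_new = w + eta * (X.T @ error)       # per-weight: eta * sum_l X_il * error_l
    return w0_new, w_new
```

Iterating this step until the log likelihood stops improving is the "iterate using the gradient" idea from the Training slide.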
Least Mean Square Iterative Algorithm
• Recall the LMS gradient algorithm: minimize the squared error E(W) = ½ Σl (Yl – W·Xl)²
• Partial of the error w.r.t. weight i: ∂E/∂wi = –Σl Xil (Yl – W·Xl)
• Yielding the weight update rule: wi ← wi + η Σl Xil (Yl – W·Xl)
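Side by side, the two update rules share the same learning-rate × feature × prediction-error shape, which is why this should look familiar (a sketch; η is the learning rate):

```latex
\text{LMS:}\qquad\;\; w_i \leftarrow w_i + \eta \sum_{l} X_i^{l}\,\bigl(Y^{l} - W\cdot X^{l}\bigr) \\
\text{Logistic:}\qquad w_i \leftarrow w_i + \eta \sum_{l} X_i^{l}\,\bigl(Y^{l} - \hat{P}(Y^{l}=1 \mid X^{l}, W)\bigr)
```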
Problems: Overfitting
• What if the training data are linearly separable? (Maximizing the likelihood then pushes the weights toward infinity.)
• What if the margin shrinks, but only because of a few data points (or just one)?
• What if the data are not quite linearly separable, but only because of a few data points?
• Recall SVMs:
  • We would like to prefer a large margin
  • Prefer less steep slopes
  • Even if it means misclassifying some points
Regularization
• Penalize complexity
• What is complexity? The magnitude of W
• The optimization problem becomes: W ← argmaxW Σl ln P(Yl | Xl, W) – (λ/2) ||W||²
• The update rule becomes: wi ← wi + η Σl Xil (Yl – P̂l) – η λ wi
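The same ascent step with the penalty folded in, as a sketch; lam stands for the regularization strength λ, its value is illustrative, and whether to penalize w0 is a design choice the slides do not specify:

```python
import numpy as np

def regularized_ascent_step(X, Y, w0, w, eta=0.1, lam=0.01):
    """One batch step maximizing the log likelihood minus (lam/2) * ||w||^2."""
    z = w0 + X @ w
    p1 = 1.0 / (1.0 + np.exp(-z))             # P(Y=1 | x), same orientation as above
    error = Y - p1
    w0_new = w0 + eta * np.sum(error)          # leaving the intercept unpenalized here
    w_new = w + eta * (X.T @ error) - eta * lam * w   # the penalty shrinks the weights
    return w0_new, w_new
```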
The Bayes Optimal Classifier
Getting away from generative models: our first ensemble method!
• H is a parameterized hypothesis space
• hML is the maximum likelihood hypothesis given some data
• Can we do better (higher expected accuracy) than hML?
• Yes! We expect hMAP to outperform hML IF…
  • there is an interesting prior P(h) (i.e., not uniform)
• Can we do better (higher expected accuracy) than hMAP?
• Yes! Bayes Optimal will outperform hMAP IF… (some assumptions)
Bayes Optimal Classifier
Getting a second opinion from another doctor, a third opinion from another…
• One doctor is most confident: he is hML
• One doctor is most reliable / accurate: she is hMAP
• But she may be only a little more trustworthy than the others
• What if hMAP says "+" but *all* other h ∈ H say "–"?
  • If P(hMAP | D) < 0.5, perhaps we should prefer "–"
• Think of each hi as casting a weighted vote
• Weight each hi by how likely it is to be correct given the training data
  • Not just by P(h), which is already reflected in hMAP
  • Rather by P(h | D)
• The most reliable joint opinion may contradict hMAP
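Written out, the weighted vote the slide describes is the standard Bayes optimal rule, where P(v | h) is hypothesis h's probability for label v on the new instance:

```latex
v^{*} \;=\; \arg\max_{v \in V} \;\sum_{h \in H} P(v \mid h)\, P(h \mid D)
```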
Bayes Optimal Classifier: Example
• Assume a space of 3 hypotheses; for the numbers below, P(h1 | D) = 0.4 and P(h2 | D) + P(h3 | D) = 0.6
• Given a new instance x, assume that h1(x) = 1, h2(x) = 0, h3(x) = 0
• In this case P(f(x) = 1) = 0.4 and P(f(x) = 0) = 0.6, but hMAP(x) = 1
• We want to determine the most probable classification by combining the predictions of all hypotheses
• We can weight each by its posterior probability (there are additional lurking assumptions…)
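A minimal sketch of that vote in code. P(h1 | D) = 0.4 is forced by the numbers above; the 0.3 / 0.3 split of the remaining 0.6 is an assumption for illustration (any split gives the same classification):

```python
# Bayes optimal vote for the three-hypothesis example above.
# posteriors: P(h | D); predictions: each h's label for the new instance x.
posteriors  = [0.4, 0.3, 0.3]   # assumed P(h1|D), P(h2|D), P(h3|D)
predictions = [1, 0, 0]         # h1(x), h2(x), h3(x)

p_label_1 = sum(p for p, y in zip(posteriors, predictions) if y == 1)  # 0.4
p_label_0 = sum(p for p, y in zip(posteriors, predictions) if y == 0)  # 0.6

bayes_optimal = 1 if p_label_1 > p_label_0 else 0
print(bayes_optimal)   # 0, even though the MAP hypothesis h1 predicts 1
```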